----------------------------------------------------------------------------
        UniProt - Swiss-Prot Protein Knowledgebase
        SIB Swiss Institute of Bioinformatics; Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
        Protein Information Resource (PIR); Washington DC, USA
----------------------------------------------------------------------------

Description: Protein naming guidelines
Name:        nameprot.txt
Release:     2014_07 of 09-Jul-2014

----------------------------------------------------------------------------

Preamble
========

Consistent nomenclature is indispensable for communication, literature
searching and entry retrieval. Many species-specific communities have
established gene nomenclature committees that try to assign consistent and,
if possible, meaningful gene symbols. Other scientific communities have
established protein nomenclatures for a set of proteins based on sequence
similarity and/or function. But there is no established organization
involved in the standardization of protein names, nor are there any efforts
to establish naming rules that are valid across the largest spectrum of
species possible.

Ambiguities regarding gene/protein names are a major problem in the
literature and it is even worse in the sequence databases which tend to
propagate the confusion. As administrators of UniProt we feel that we can
play a major role in standardization of protein nomenclature.

UniProt is constantly striving to further standardize the nomenclature for
a given protein across related organisms. This is accomplished via protein
family-driven annotation, through both manual and automated pipelines. This
also involves the ongoing standardization of all the existing
UniProtKB/Swiss-Prot protein names according to our guidelines.

We try to attribute a recommended name to all the proteins of
UniProtKB/Swiss-Prot, following as far as possible the rules listed in this
document.
We would suggest that authors/laboratories take these rules into account
when naming new proteins.

Warning: this is a preliminary document; many rules still have to be
added, modified or expanded.


Glossary
========

RN : Recommended name (RecName)
AN : Alternative name (AltName)
GS : Gene symbol
OLN: Ordered locus name


General naming rules
====================

If it exists, use the approved nomenclature.

See: nomlist.txt, a list of nomenclature related references for proteins.

If no accepted unification exists, and several alternatives are of equal
frequency in the literature, we use the one with the easiest extensibility
or standardization. In addition, preference is given to names that best
reflect the common acronym or gene symbol.

The protein naming guidelines are based on the premise that a good and
stable recommended name (RN) for a protein is a name that is as neutral as
possible. An RN should be, as far as possible, unique and attributed to all
orthologs. One reason for this is that it should be possible to propagate a
protein name to all orthologous proteins, from various organisms. This is
why, ideally, the protein name should not contain a specific characteristic
of the protein, and in particular it should not reflect the function or
role of the protein, nor its subcellular location, its domain structure,
its tissue specificity, its molecular weight or its species of origin.

Therefore:

 - An RN should not contain information about the molecular weight of the
   protein.
   e.g. "unicornase subunit A" is preferred to "unicornase 52 kDa subunit".

 - An RN should not be based on the name of a disease.
   e.g. "Bloom syndrome protein" is not suitable.

 - An RN should not be based on tissue specificity.
   e.g. "testis-specific protein ..." is not suitable.

 - An RN must not be based on the species name.
   e.g. "Yeast Ku70 protein" is not suitable.

 - An RN should not be based on the gene induction.
   e.g. "androgen-induced protein 1" is not suitable.
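
The five restrictions listed above lend themselves to a rough automated
screen. The following is a minimal sketch only: the function name
flag_recommended_name, the regular expressions and the small word lists
(tissues, species) are illustrative assumptions and are not part of these
guidelines. A flagged name is a candidate for manual review, not a
definitive rejection.

```python
import re

# Hypothetical heuristics: the patterns and word lists below are assumptions
# chosen for illustration, not a vocabulary defined by the guidelines.
CHECKS = [
    ("molecular weight", re.compile(r"\b\d+(?:\.\d+)?\s*kDa\b", re.IGNORECASE)),
    ("disease name", re.compile(r"\b(?:syndrome|disease)\b", re.IGNORECASE)),
    ("tissue specificity",
     re.compile(r"\b(?:testis|liver|brain|muscle)-specific\b", re.IGNORECASE)),
    ("species name", re.compile(r"\b(?:yeast|human|mouse)\b", re.IGNORECASE)),
    ("gene induction", re.compile(r"-induced\b", re.IGNORECASE)),
]


def flag_recommended_name(name):
    """Return the label of every heuristic that the candidate name trips."""
    return [label for label, pattern in CHECKS if pattern.search(name)]


if __name__ == "__main__":
    for candidate in (
        "unicornase subunit A",
        "unicornase 52 kDa subunit",
        "Bloom syndrome protein",
        "testis-specific protein",
        "Yeast Ku70 protein",
        "androgen-induced protein 1",
    ):
        print(candidate, "->", flag_recommended_name(candidate) or "no flags")
```

Simple pattern matching cannot recognize the historical exceptions that
the guidelines allow, so a curator would still need to confirm each flag.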

The optimal RN is a word that ends with "in" and which can be easily
pronounced in English.
  e.g. "zyxin", "insulin", "hemoglobin", "caveolin", "desmoglein",
       "secretin", etc.

Names ending in "ine" should be avoided.
  e.g. "maurocalcin" instead of "maurocalcine".

Wherever appropriate, the RN should use American spelling conventions (as
opposed to British spelling).
  e.g. "hemoglobin" instead of "haemoglobin".

An RN should not contain a roman numeral.
  e.g. "caveolin-2" instead of "caveolin-II".
Exception: historical cases.
  e.g. "coagulation factor IX", "casein kinase II", "HLA class I", etc.

Abbreviations should not be built using the molecular weight.
  e.g. Abbreviations such as p123, Gp62, p34 are not suitable.
Exception: cases where historically the molecular weight has been
consistently and generally applied as part of the accepted name.
  e.g. "p53".

For proteins that belong to a multigene family, it is recommended that you
choose a coherent nomenclature with numbers to specify the different
members of the family.

When naming proteins which can be grouped into a family based on homology
or according to a notion of shared function (like the interleukins), the
different members should be enumerated with a dash "-" followed by an
Arabic number.
  e.g. "desmoglein-1", "desmoglein-2", etc.


General syntax
==============

Greek letters must be written in full.
  e.g. "alpha", "omega".

Greek letters are written entirely in lower case with the exception of
"Delta" in the context of the steroid/fatty acid metabolism nomenclature.

If a Greek letter is preceded or followed by a number or letter, then it
must be separated by a dash "-".
  e.g. "unicornase alpha-1", "Myprotease A-beta".

An RN should not use diacritics, such as accents, umlauts and so on.
  e.g. "Krüppel" is not suitable.

Eponyms should be used in the non-possessive form (a name should not be
followed by "'s").
Note: an eponym is a person, whether real or fictitious, whose name has (or
is thought to have) given rise to the name of a particular item. There used
to be a debate as to whether the possessive form (e.g. Alzheimer's disease)
or the non-possessive form (Alzheimer disease) of eponyms is preferred. As
a rule the non-possessive form is now preferred.
  e.g. "Alzheimer disease amyloid A4 protein" instead of "Alzheimer's
       disease amyloid A4 protein".

RN based on the GS should be in the form "Protein <GS>" instead of
"<GS> protein".
  e.g. "protein abcD" instead of "abcD protein".

When an RN includes a GS, the casing of the GS should be the one used for
the gene in the nomenclature for that organism. Since we are always dealing
with proteins, it will be understood that gene=protein.
  e.g. "response regulator algR", "Protein HEX23".

Whenever possible commas should be avoided in an RN.
  e.g. "acyl-CoA dehydrogenase, short-chain specific" should be
       "short-chain specific acyl-CoA dehydrogenase".

Symbols of chemical elements can be used in abbreviations.
  e.g. "magnesium/calcium co-transporter" can be abbreviated as "Mg/Ca
       co-transporter".

For ions, chemical element symbols (e.g. Cu(+), Mg(2+), etc.) are preferred
to systematic names (copper(I), magnesium ion, etc.) and common names
(cupric, ferrous, etc.).

For ions, when necessary, valence should be indicated within parentheses.
  e.g. "Fe(2+)", "Fe(3+)", "Cl(-)", etc.

Abbreviations should not appear inside an RN, with the exception of:

  Deoxyribonucleic acid:                     DNA cDNA dsDNA ssDNA
  Ribonucleic acid:                          dsRNA siRNA snRNA ssRNA tmRNA
  Mono-, di-, tri- nucleic acid phosphates:  d[ACGT][MDT]P c[AG]MP
  Cofactors:                                 FAD FMN NAD NADP
  Others:                                    hnRNP

Note: protein name abbreviations should not be used.
  e.g. "acyl carrier protein" instead of "ACP".

Charged tRNAs are indicated by "tRNA" followed by the three-letter amino
acid code, with the first letter capitalized, in brackets.
  e.g. "Glu-tRNA(Gln) amidotransferase subunit B".

Hyphens should be used to form compound modifiers (i.e. two or more words
that are acting as a single modifier for a noun).
For example before:

  activated, activating, adapting, adding, amplified, anchored, anchoring,
  antagonizing, associated, associating, attracting, binding, blocking,
  bound, branching, bridging, bundling, capping, complementing,
  concentrating, conjugating, containing, controlled, controlling,
  converting, coupled, coupling, decapping, degrading, dependent,
  depolymerizing, derepressing, derived, deriving, destabilizing, docking,
  editing, enhanced, enhancing, enriched, exposed, expressed, flanking,
  forming, gated, grabbing, harvesting, independent, induced, inducible,
  inducing, inhibited, inhibiting, insensitive, interacting, laying, like,
  linked, linking, metabolizing, modifying, modulating, polymerizing,
  potentiating, preventing, processing, promoting, recognizing, recruited,
  recruiting, regulated, regulating, related, released, releasing,
  remodeling, removing, repressing, required, requiring, resistant,
  responsive, rich, ripening, scaffolding, sensing, sensitive, signaling,
  specific, splicing, spreading, stabilized, stabilizing, stacking,
  stimulated, stimulating, structuring, sulfating, suppressing,
  trafficking, transformed, transforming, transporting
  [Note: This list is not complete]

  e.g. "secretin-binding protein", "pyrophosphate-dependent
       phosphofructokinase".

See: http://www.grammaruntied.com/


Specific rules for enzymes
==========================

Enzymes commonly have RNs ending in "ase".
  e.g. "aminoacylase", "arginase", "caspase", "elastase", etc.

Transfer enzymes are often indicated with the source and destination
substrate separated by a double dash (--).
  e.g. "formylmethanofuran--tetrahydromethanopterin formyltransferase".

For protein kinases and phosphatases, use the format:
"<amino acid>-protein <kinase/phosphatase>".
  e.g. "serine/threonine-protein kinase", "tyrosine-protein phosphatase".

In cases where the protein is possibly an inactive version of an enzyme,
avoid mentioning the activity in the name unless in expressions such as
"X domain-containing protein".
Inactive versions refer to proteins where active site residues are altered,
for example, and do not refer to pseudogenes.
  e.g. "protease domain-containing protein".

In some cases, the protein is named based on the pathway it is involved in.
In such cases the following format is suitable:
"<compound> biosynthesis protein <GS>".
  e.g. "thiamine biosynthesis protein thic".


Specific rules for multiprotein complexes
=========================================

Sometimes a protein is named after a multiprotein complex name, which is
only suitable for well-defined complexes. Keep in mind that in some cases,
the complex composition is variable and proteins can belong to multiple
different complexes (transcription, chromatin remodeling or ubiquitin
ligase E3 complex). In such a case, it may be better not to cite the
complex name in the RN field.

Proteins that belong to well-defined multi-subunit complexes can be named
according to the complex, followed by the specific subunit name.
  e.g. "26S proteasome non-ATPase regulatory subunit 1".

The word "subunit" is preferred to "chain", "component" or "polypeptide".
Chain refers to proteolytically processed polypeptides arising from a
common precursor protein.
  e.g. "unicornase heavy chain", "unicornase light chain".

If the name contains a "type" of subunit, then precede the word "subunit"
with the "type". The "type" is a controlled vocabulary:

  ATP-binding
  catalytic
  ferredoxin
  flavoprotein
  modulatory
  regulatory
  [Note: This list is not complete]

  e.g. "unicornase regulatory subunit".

Avoid the word "subunit" with a size indicator:
  e.g. "unicornase large subunit".

If the name contains a "designator" of the subunit, then the "designator"
must follow the word "subunit":

  Numbers         Unicornase subunit 2
  Letters         Unicornase subunit A
  GS              Unicornase subunit abcD
  Greek letters   Unicornase subunit alpha

The preference is to use Numbers > Letters > GS > Greek letters

An RN can include both a "Type" and a "Designator"
  e.g. "unicornase regulatory subunit 1".
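
The ordering rules for subunit names (type before the word "subunit",
designator after it, and the designator preference Numbers > Letters > GS >
Greek letters) can be sketched as a small helper. The function names
pick_designator and subunit_name and their parameters are hypothetical,
introduced here for illustration only; they do not describe how UniProt
actually builds names.

```python
def pick_designator(number=None, letter=None, gene_symbol=None, greek=None):
    """Pick one designator, preferring Numbers > Letters > GS > Greek letters.

    All four parameters are illustrative; pass only the ones that exist
    for the subunit in question.
    """
    for candidate in (number, letter, gene_symbol, greek):
        if candidate is not None:
            return str(candidate)
    return None


def subunit_name(complex_name, subunit_type=None, designator=None):
    """Assemble '<complex> [<type>] subunit [<designator>]'."""
    parts = [complex_name]
    if subunit_type:
        parts.append(subunit_type)  # the type precedes the word "subunit"
    parts.append("subunit")
    if designator:
        parts.append(designator)    # the designator follows the word "subunit"
    return " ".join(parts)


if __name__ == "__main__":
    # A number is available, so it wins over the Greek letter.
    print(subunit_name("unicornase", "regulatory",
                       pick_designator(number=1, greek="alpha")))
    # Only a letter and a Greek letter are available, so the letter wins.
    print(subunit_name("unicornase",
                       designator=pick_designator(letter="A", greek="alpha")))
```

The helper does not validate the type against the controlled vocabulary,
since the list given above is explicitly incomplete.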


Additional rules
================

Unfortunately many existing protein names are based not only on the role or
function, but sometimes on the domain structure, or on plenty of other
characteristics. In these cases we try to apply the following syntactic
rules.

Proteins which are NOT conserved or with no known or predicted function or
characteristics should be called "Uncharacterized protein <GS/OLN>".

The following words should be avoided in an RN:

  Hypothetical
  Possible
  Potential
  Precursor
  Conserved
  Unique

  e.g. "hypothetical protein Abcd" is not suitable.

Note: these words can be used IF they are 'internal' to the RN and do not
convey a 'global' meaning.
  e.g. "high-potential iron-sulfur protein", "thiamine precursor
       biosynthesis protein".

When an RN is based on the predicted activity of the protein, it is allowed
to precede the RN by 'Probable' or 'Putative'.
  e.g. "probable acetylornithine deacetylase", "putative acetylornithine
       deacetylase".

Proteins of unknown function which nevertheless contain a defined domain or
motif (that itself does not specify a particular function) have been named
sometimes according to the domain(s) or repeat(s) present. The name should
then be of the following type: "<domain/repeat>-containing protein".
  e.g. "PAS domain-containing protein 5".

If there is more than one domain/repeat, only use a dash for the last item
preceding "containing" even though this violates conventional grammar.
  e.g. "ankyrin repeat and SAM domain-containing protein 1" is correct,
       but "ankyrin repeat- and SAM domain-containing protein 1" is wrong.

Do not use plurals.
  e.g. "ankyrin repeats-containing protein 8" is wrong.

Proteins of unknown function which exhibit significant sequence similarity
to a defined protein family have been named in accordance with other
members of that family...
  e.g. "Holliday junction resolvase family endonuclease".

It is also possible to use "-like" in the name. Bear in mind that this
should only be used for cases that are outliers to a tight homomorphic
family.
  e.g. "Holliday junction resolvase-like protein".

The CD antigen nomenclature defined for surface proteins of human
leucocytes is propagated to mammalian orthologs.

Certain proteins have multiple functions. The RN could reflect this
situation.

-----------------------------------------------------------------------
Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
Distributed under the Creative Commons Attribution-NoDerivs License
-----------------------------------------------------------------------