gennameprot.txt
----------------------------------------------------------------------------
        UniProt - Swiss-Prot Protein Knowledgebase
        SIB Swiss Institute of Bioinformatics; Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
        Protein Information Resource (PIR); Washington DC, USA
----------------------------------------------------------------------------

Description: Generalised protein naming guidelines
Name:        gennameprot.txt
Release:     2016_08 of 07-Sep-2016

----------------------------------------------------------------------------

This is a subset of the UniProtKB document nameprot.txt which has been
developed with the International Nucleotide Sequence Database Collaboration
(INSDC) (www.insdc.org) to provide guidelines for their submitters.

Glossary
========

RN : Recommended name (RecName)
AN : Alternative name (AltName)
GS : Gene symbol
OLN: Ordered locus name

General naming rules
====================

If it exists, use the approved nomenclature.
See: nomlist.txt (http://www.uniprot.org/docs/nomlist), a list of
nomenclature related references for proteins.

If no accepted unification exists, and several alternatives are of equal
frequency in the literature, we use the one with the easiest extensibility
or standardization. In addition, preference is given to names that best
reflect the common acronym or gene symbol.

The protein naming guidelines are based on the premise that a good and
stable recommended name (referred to hereafter as "RN") for a protein is a
name that is as neutral as possible. An RN should be, as far as possible,
unique and attributed to all orthologs. To facilitate attribution of the RN
to all orthologs it should not include references to specific
characteristics of the protein in one particular species; in particular it
should not reflect the function or role of the protein, nor its subcellular
location, its domain structure, its tissue specificity, its molecular weight
or its species of origin.

The following examples illustrate cases where the use of such terminology
renders consistent application of a recommended name difficult, and explain
the reasons why:

- An RN should not contain information about the molecular weight of the
  protein, which may vary between orthologs.
  e.g. "unicornase subunit A" is preferred to "unicornase 52 kDa subunit".

- An RN should not be based on the name of a disease in which the protein
  may be implicated because this may apply to a single species.
  e.g. "Bloom syndrome protein" is not suitable.

- An RN should not be based on species-specific patterns of expression or
  induction.
  e.g. "testis-specific protein ..." is not suitable.
  e.g. "androgen-induced protein 1" is not suitable.

- Finally, an RN must not include mention of a particular species.
  e.g. "yeast Ku70 protein" is not suitable.

The optimal RN is a word that ends with "in" and which can be easily
pronounced in English.
  e.g. "zyxin", "insulin", "hemoglobin", "caveolin", "desmoglein",
       "secretin", etc.

Names ending in "ine" should be avoided.
  e.g. "maurocalcin" instead of "maurocalcine".

Wherever American and British spelling conventions differ, the RN should
use the American form.
  e.g. "hemoglobin" instead of "haemoglobin".

- An RN should not contain a roman numeral.
  e.g. "caveolin-2" instead of "caveolin-II".
  Exceptions are allowed for historical cases.
  e.g. "coagulation factor IX", "casein kinase II", "HLA class I", etc.
  e.g. "type III restriction enzyme", "DNA helicase I", and "type IV pilus
       assembly protein".

Abbreviations should not refer to the molecular weight of a protein.
  e.g. Abbreviations such as p123, Gp62, p34 are not suitable.
  Exceptions are allowed for cases where historically the molecular weight
  has been consistently and generally applied as part of the accepted name.
  e.g. "p53".
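
Several of the general naming rules above are mechanical enough to be
checked automatically. The following Python sketch is only an illustration
and is not part of the UniProt or INSDC tooling: the function name
check_general_rules, the regular expressions and the short exception list
are assumptions made for this example. The exception list holds only the
historical names quoted above, and single-letter roman numerals are
deliberately not detected because they cannot be told apart from letters.

```python
import re

# Assumption: only the historical exceptions quoted in the guidelines.
HISTORICAL_EXCEPTIONS = (
    "coagulation factor IX", "casein kinase II", "HLA class I",
    "type III restriction enzyme", "DNA helicase I",
    "type IV pilus assembly protein", "p53",
)

MW_UNIT = re.compile(r"\b\d+(?:\.\d+)?\s*kDa\b", re.IGNORECASE)  # "52 kDa"
MW_ABBREV = re.compile(r"\b[Gg]?p\d+\b")                         # p123, Gp62
# Runs of two or more of I, V, X standing alone as a token ("-II", " IX").
ROMAN = re.compile(r"(?:^|[\s-])[IVX]{2,}(?=$|[\s-])")


def check_general_rules(name):
    """Return a list of rule labels that the candidate name appears to break."""
    text = name
    for exception in HISTORICAL_EXCEPTIONS:
        # Blank out accepted historical names before testing the rest.
        text = text.replace(exception, " ")
    problems = []
    if MW_UNIT.search(text) or MW_ABBREV.search(text):
        problems.append("molecular weight")
    if ROMAN.search(text):
        problems.append("roman numeral")
    if " " not in name and name.lower().endswith("ine"):
        problems.append("ends in 'ine'")
    if "haem" in name.lower():
        problems.append("British spelling")
    return problems


if __name__ == "__main__":
    for candidate in ("caveolin-II", "unicornase 52 kDa subunit", "p53"):
        print(candidate, "->", check_general_rules(candidate))
```

Rules that depend on biological knowledge (disease names, species names,
tissue specificity) are left out on purpose, since a pattern match cannot
decide them reliably.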

General syntax
==============

Greek letters are written entirely in lower case with the exception of
"Delta" in the context of the steroid/fatty acid metabolism nomenclature.
Greek letters must be written in full.
  e.g. "alpha", "omega".

If a Greek letter is preceded or followed by a number or letter, then it
must be separated by a dash "-".
  e.g. "unicornase alpha-1", "myprotease A-beta".

An RN should not use diacritics, such as accents, umlauts and so on.
  e.g. "Krüppel" is not suitable.

Eponyms should be used in the non-possessive form (a name should not be
followed by "'s").
  e.g. "Alzheimer disease amyloid A4 protein" instead of "Alzheimer's
       disease amyloid A4 protein".

An RN based on the gene symbol (GS) should be in the form "protein <GS>"
instead of "<GS> protein". Where no other descriptor can be added, the word
"protein" should be added rather than leaving the symbol by itself.

Some examples where the addition of the GS is useful include:

  mismatch repair proteins
  DNA repair proteins
  DNA/RNA polymerases
  DNA/RNA helicases
  GTP-binding proteins
  transcriptional regulators
  cell division proteins
  chaperones
  outer membrane proteins
  recombination proteins
  conjugation proteins
  flagellar proteins
  sporulation proteins
  secretion proteins
  [Note: this list is not exhaustive]

Whenever possible commas should be avoided in an RN except when their usage
is obligatory in accepted chemical names.
  e.g. "acyl-CoA dehydrogenase, short-chain specific" should be
       "short-chain specific acyl-CoA dehydrogenase".

Symbols of chemical elements can be used in abbreviations.
  e.g. "magnesium/calcium co-transporter" can be abbreviated as
       "Mg/Ca co-transporter".

For ions, chemical element symbols (e.g. Cu(+), Mg(2+), etc.) are preferred
to systematic names (copper(I), magnesium ion, etc.) and common names
(cupric, ferrous, etc.). When necessary, the valence should be indicated
within parentheses.
  e.g. "Fe(2+)", "Fe(3+)", "Cl(-)", etc.
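
The syntax rules above on Greek letters, diacritics, eponyms and commas can
be sketched as a small checker. This is a hedged illustration only: the
function name check_syntax and the patterns are assumptions, not an
official validator. In particular, treating only a comma followed by a
space as suspicious is a heuristic introduced here so that chemical
locants such as "1,4-" are not reported.

```python
import re
import unicodedata

# English names of the Greek letters, lower case as the guidelines require.
GREEK_NAMES = (
    "alpha|beta|gamma|delta|epsilon|zeta|eta|theta|iota|kappa|lambda|mu|"
    "nu|xi|omicron|pi|rho|sigma|tau|upsilon|phi|chi|psi|omega"
)
GREEK_SYMBOL = re.compile(r"[\u0370-\u03ff]")  # letters not written in full
# A Greek letter name next to a number or single capital without a dash.
GREEK_UNDASHED = re.compile(
    rf"\b(?:{GREEK_NAMES})\s*(?:\d+|[A-Z])\b"
    rf"|\b(?:\d+|[A-Z])\s*(?:{GREEK_NAMES})\b"
)
POSSESSIVE = re.compile(r"\w['\u2019]s\b")


def has_diacritics(text):
    """True if any character decomposes into a base plus a combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def check_syntax(name):
    """Return a list of syntax rule labels the candidate name appears to break."""
    problems = []
    if has_diacritics(name):
        problems.append("diacritic")
    if POSSESSIVE.search(name):
        problems.append("possessive eponym")
    if GREEK_SYMBOL.search(name):
        problems.append("Greek symbol")
    if GREEK_UNDASHED.search(name):
        problems.append("Greek letter needs dash")
    if ", " in name:  # heuristic: chemical locants have no space after the comma
        problems.append("comma")
    return problems


if __name__ == "__main__":
    for candidate in ("unicornase alpha 1", "unicornase alpha-1", "Krüppel"):
        print(candidate, "->", check_syntax(candidate))
```

Capitalized "Delta" is not reported by this sketch because the patterns
match lower-case letter names only; whether a given "Delta" really belongs
to the steroid/fatty acid nomenclature still needs a human decision.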

- Abbreviations should not appear inside an RN, with the exception of:

    Deoxyribonucleic acid:                  DNA cDNA dsDNA ssDNA
    Ribonucleic acid:                       dsRNA piRNA siRNA snRNA ssRNA tmRNA
    Mono-, di-, tri-nucleoside phosphates:  d[ACGT][MDT]P c[AG]MP
    Cofactors:                              FAD FMN NAD NADP
    Coenzymes:                              CoA
    Others:                                 hnRNP SAM
    [Note: this list is not exhaustive]

  Note: protein name abbreviations should not be used.
  e.g. "acyl carrier protein" instead of "ACP".

- Charged tRNAs are indicated by "tRNA" followed by the three-letter amino
  acid code, with the first letter capitalized, in brackets.
  e.g. "Glu-tRNA(Gln) amidotransferase subunit B".

Hyphens should be used to form compound modifiers (i.e. two or more words
that are acting as a single modifier for a noun). The following terms are
commonly used in compound modifiers.

  activated, activating, adapting, adding, amplified, anchored, anchoring,
  antagonizing, associated, associating, attracting, binding, blocking,
  bound, branching, bridging, bundling, capping, complementing,
  concentrating, conjugating, containing, controlled, controlling,
  converting, coupled, coupling, decapping, degrading, dependent,
  depolymerizing, derepressing, derived, deriving, destabilizing, docking,
  editing, enhanced, enhancing, enriched, exposed, expressed, flanking,
  forming, gated, grabbing, harvesting, independent, induced, inducible,
  inducing, inhibited, inhibiting, insensitive, interacting, laying, like,
  linked, linking, metabolizing, modifying, modulating, polymerizing,
  potentiating, preventing, processing, promoting, recognizing, recruited,
  recruiting, regulated, regulating, related, released, releasing,
  remodeling, removing, repressing, required, requiring, resistant,
  responsive, rich, ripening, scaffolding, sensing, sensitive, signaling,
  specific, splicing, spreading, stabilized, stabilizing, stacking,
  stimulated, stimulating, structuring, sulfating, suppressing,
  trafficking, transformed, transforming, transporting
  [Note: This list is not exhaustive].

  e.g. "secretin-binding protein", "pyrophosphate-dependent
       phosphofructokinase".
  See: http://www.grammaruntied.com/punctuation/hyphen.html

Specific rules for enzymes
==========================

Enzymes commonly have RNs ending in "ase".
  e.g. "aminoacylase", "arginase", "caspase", "elastase", etc.

Transfer enzymes are often named in such a way as to describe the source
and target of the transfer reaction, with the two separated by a double
dash (--). This is an IUBMB recommendation.
  e.g. "formylmethanofuran--tetrahydromethanopterin formyltransferase".

For protein kinases and phosphatases, use the format:
"<modified_residues>-protein <activity>".
  e.g. "serine/threonine-protein kinase", "tyrosine-protein phosphatase".

In cases where the protein is possibly an inactive version of an enzyme,
avoid mentioning the activity in the name unless in expressions such as
"X domain-containing protein". Inactive versions refer to proteins where
active site residues are altered, for example, and do not refer to
pseudogenes.
  e.g. "protease domain-containing protein".

In some cases, the protein is named based on the pathway it is involved in.
In such cases the following format is suitable:
"<Pathway> biosynthesis protein <GS>".
  e.g. "thiamine biosynthesis protein ThiC".

Specific rules for multiprotein complexes
=========================================

Proteins that belong to well-defined multi-subunit complexes can be named
according to the complex, followed by the specific subunit name. This type
of nomenclature is only allowed for well-defined complexes of known
composition.
  e.g. "26S proteasome non-ATPase regulatory subunit 1".

The word "subunit" is preferred to "chain" or "component". Chain refers to
proteolytically processed polypeptides arising from a common precursor
protein.
  e.g. "unicornase heavy chain", "unicornase light chain".

If the name contains a "type" of subunit, then precede the word "subunit"
with the "type".
The "type" is a controlled vocabulary:

  ATP-binding
  catalytic
  ferredoxin
  flavoprotein
  modulatory
  regulatory
  [Note: This list is not exhaustive]

  e.g. "unicornase regulatory subunit".

Precede the word "subunit" with a size indicator:
  e.g. "unicornase large subunit", "ribosomal large subunit pseudouridine
       synthase", etc.

If the name contains a "designator" of the subunit, then the "designator"
must follow the word "subunit":

  Numbers        unicornase subunit 2
  Letters        unicornase subunit A
  GS             unicornase subunit AbcD
  Greek letters  unicornase subunit alpha

The preference is to use Numbers > Letters > GS > Greek letters.

An RN can include both a "type" and a "designator".
  e.g. "unicornase regulatory subunit 1".

Additional rules
================

Unfortunately there are proteins of unknown or uncertain function for which
only family/domain identification, similarity or no information at all is
available. In these cases, we recommend "Hypothetical protein" or
"Uncharacterized protein". These two are the only recommended terms for
naming proteins of unknown function.

The following words should be avoided in an RN:

  Conserved
  Novel
  Possible
  Potential
  Unique
  Protein of unknown function
  Similar to

Note: these words can be used IF they are 'internal' to the RN and do not
convey a 'global' meaning.
  e.g. "high-potential iron-sulfur protein".

When an RN is based on the predicted activity of the protein, the RN can be
preceded by 'putative'.
  e.g. "putative acetylornithine deacetylase".

Proteins of unknown function which nevertheless contain a defined domain or
motif (that itself does not specify a particular function) have been named
sometimes according to the domain(s) or repeat(s) present. The name should
then be of the following type: "<domain|repeat>-containing protein".
  e.g. "PAS domain-containing protein 5", "thioredoxin domain-containing
       protein".

If there is more than one domain/repeat, use a slash for all items
preceding "containing" in accordance with grammatical rules.
This also helps differentiate specific domains.
  e.g. "ankyrin repeat/SAM domain-containing protein 1".

Do not use plurals.
  e.g. "ankyrin repeats-containing protein 8" is wrong.

Proteins of unknown function which exhibit significant sequence similarity
to a defined protein family have been named in accordance with other
members of that family. The word protein should be added after family if no
other descriptor is possible.
  e.g. "Holliday junction resolvase family endonuclease", "LysR family
       transcriptional regulator".

It is also possible to use "-like" in the name. Bear in mind that this
should only be used for cases that are outliers to a tight homomorphic
family. Family is preferred over '-like'.
  e.g. "Holliday junction resolvase-like protein".

Certain proteins have multiple functions. The RN could reflect this
situation. For multifunctional proteins which do not yet have a single
unique name, a name can be formed by combining individual functions along
with a prefix specifying the number of functions ('bi', 'tri', etc.). Each
function should be separated by a forward slash "/".
  e.g. "bifunctional adenylyltransferase/ADP-heptose synthase cyclohydrolase"
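
The two compositional formats described above, "<domain|repeat>-containing
protein" and the multifunctional prefix form, can be sketched as simple
string builders. This is an illustrative sketch, not an official tool: the
function names domain_containing_name and multifunctional_name are
hypothetical, the plural check only looks for the words "repeats" and
"domains", and only the 'bi' and 'tri' prefixes named in the text are
supported.

```python
# Only the prefixes spelled out in the guidelines; others are not guessed.
FUNCTION_PREFIXES = {2: "bi", 3: "tri"}


def domain_containing_name(items, number=None):
    """Build '<domain|repeat>/...-containing protein [number]'."""
    if not items:
        raise ValueError("at least one domain or repeat is required")
    for item in items:
        # The guidelines forbid plurals such as "ankyrin repeats".
        if item.endswith(("repeats", "domains")):
            raise ValueError(f"plural not allowed: {item!r}")
    name = "/".join(items) + "-containing protein"
    if number is not None:
        name += f" {number}"
    return name


def multifunctional_name(functions):
    """Build '<bi|tri>functional <function>/<function>[/<function>]'."""
    prefix = FUNCTION_PREFIXES.get(len(functions))
    if prefix is None:
        raise ValueError("this sketch only covers two or three functions")
    return f"{prefix}functional " + "/".join(functions)


if __name__ == "__main__":
    print(domain_containing_name(["ankyrin repeat", "SAM domain"], 1))
    print(multifunctional_name(["unicornase", "myprotease"]))
```

The placeholder enzyme names "unicornase" and "myprotease" are borrowed
from the examples elsewhere in this document and do not refer to real
proteins.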