----------------------------------------------------------------------------

UniProt - Swiss-Prot Protein Knowledgebase
SIB Swiss Institute of Bioinformatics; Geneva, Switzerland
European Bioinformatics Institute (EBI); Hinxton, United Kingdom
Protein Information Resource (PIR); Washington DC, USA
----------------------------------------------------------------------------

Description: Generalised protein naming guidelines
Name:        gennameprot.txt
Release:     2014_07 of 09-Jul-2014

----------------------------------------------------------------------------

This is a subset of the UniProtKB document nameprot.txt which has been
developed with the International Nucleotide Sequence Database Collaboration
(INSDC) (www.insdc.org) to provide guidelines for their submitters.

Glossary
========

RN : Recommended name (RecName)
AN : Alternative name (AltName)
GS : Gene symbol
OLN: Ordered locus name

General naming rules
====================

If it exists, use the approved nomenclature.
See: nomlist.txt (http://www.uniprot.org/docs/nomlist), a list of nomenclature
related references for proteins.

If no approved nomenclature exists and several alternative names are equally
frequent in the literature, we use the one that is easiest to extend or
standardize. In addition, preference is given to names that best reflect
the common acronym or gene symbol.

The protein naming guidelines are based on the premise that a good and stable
recommended name (referred to hereafter as "RN") for a protein is a name that
is as neutral as possible.

An RN should be, as far as possible, unique and attributed to all orthologs.

To facilitate attribution of the RN to all orthologs it should not include
references to specific characteristics of the protein in one particular species;
in particular it should not reflect the function or role of the protein, nor
its subcellular location, its domain structure, its tissue specificity, its
molecular weight or its species of origin.

The following examples illustrate cases where the use of such terminology
renders consistent application of a recommended name difficult, and explain
why:

- An RN should not contain information about the molecular weight of the
  protein, which may vary between orthologs.
  e.g. "unicornase subunit A" is preferred to "unicornase 52 kDa subunit".

- An RN should not be based on the name of a disease in which the protein may be
  implicated because this may apply to a single species.
  e.g. "Bloom syndrome protein" is not suitable.

- An RN should not be based on species-specific patterns of expression or
  induction.
  e.g. "testis-specific protein ..." is not suitable.
  e.g. "androgen-induced protein 1" is not suitable.

- Finally, an RN must not include mention of a particular species.
  e.g. "yeast Ku70 protein" is not suitable.

The optimal RN is a word that ends with "in" and that can be easily
pronounced in English.
  e.g. "zyxin", "insulin", "hemoglobin", "caveolin", "desmoglein", "secretin",
  etc.

Names ending in "ine" should be avoided.
  e.g. "maurocalcin" instead of "maurocalcine".

Wherever American and British spelling conventions differ, the RN should use
the American form.
  e.g. "hemoglobin" instead of "haemoglobin".

- An RN should not contain a Roman numeral.
  e.g. "caveolin-2" instead of "caveolin-II".

  Exceptions are allowed for historical cases.
  e.g. "coagulation factor IX", "casein kinase II", "HLA class I", etc.
  e.g. "type III restriction enzyme", "DNA helicase I", and "type IV pilus
  assembly protein".

Abbreviations should not refer to the molecular weight of a protein.
  e.g. Abbreviations such as p123, Gp62, p34 are not suitable.
  Exceptions are allowed for cases where historically the molecular weight
  has been consistently and generally applied as part of the accepted name.
  e.g. "p53".
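
Two of the rules above (no molecular weight, no Roman numerals) lend themselves
to a mechanical check. The following is a minimal sketch, not part of the
guidelines: the function name, the regular expressions and the exception lists
are assumptions made for illustration, and the exception lists only contain the
historical cases quoted above. Note that a lone "I", "V" or "X" is treated as a
Roman numeral, so letter designators using those characters are flagged too.

```python
import re

# Historical exceptions quoted in the guidelines (hypothetical, not exhaustive).
ROMAN_EXCEPTIONS = (
    "coagulation factor IX", "casein kinase II", "HLA class I",
    "type III restriction enzyme", "DNA helicase I",
    "type IV pilus assembly protein",
)
WEIGHT_EXCEPTIONS = {"p53"}

ROMAN = re.compile(r"X{0,3}(?:IX|IV|V?I{0,3})")   # Roman numerals 1 to 39
KDA = re.compile(r"\d+(?:\.\d+)?\s*kDa\b", re.IGNORECASE)
P_WEIGHT = re.compile(r"\b[Gg]?p\d+\b")           # p123, Gp62, p34, ...


def lint_recommended_name(name):
    """Return the rules that a candidate RN appears to break."""
    problems = []
    weights = [m.group(0) for m in P_WEIGHT.finditer(name)]
    if KDA.search(name) or any(w not in WEIGHT_EXCEPTIONS for w in weights):
        problems.append("molecular weight")
    # Split on anything that is not a letter or digit, then test whole tokens.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", name) if t]
    has_roman = any(ROMAN.fullmatch(t) for t in tokens)
    if has_roman and not any(exc in name for exc in ROMAN_EXCEPTIONS):
        problems.append("Roman numeral")
    return problems
```

For instance, lint_recommended_name("caveolin-II") reports a Roman numeral,
whereas "casein kinase II" passes because it is listed as a historical case.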

General syntax
==============

Greek letters are written entirely in lower case with the exception of "Delta"
in the context of the steroid/fatty acid metabolism nomenclature. Greek letters
must be written in full.
  e.g. "alpha", "omega".

If a Greek letter is preceded or followed by a number or letter, then it must
be separated by a dash "-".
  e.g. "unicornase alpha-1", "myprotease A-beta".

An RN should not use diacritics, such as accents, umlauts and so on.
  e.g. "Krüppel" is not suitable.

Eponyms should be used in the non-possessive form (a name should not be
followed by "'s").

  e.g. "Alzheimer disease amyloid A4 protein" instead of "Alzheimer's
  disease amyloid A4 protein".
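
The Greek letter, diacritic and eponym rules above can be approximated by a
small normalization step. This is only a sketch under stated assumptions: the
function name is hypothetical, the Greek table covers just five lower-case
letters, and the upper-case "Delta" exception for steroid/fatty acid
nomenclature is not handled.

```python
import re
import unicodedata

# Partial, illustrative table: Greek symbol -> name written in full.
GREEK = {
    "\u03b1": "alpha", "\u03b2": "beta", "\u03b3": "gamma",
    "\u03b4": "delta", "\u03c9": "omega",
}
_GREEK_RE = re.compile("([A-Za-z0-9]?)([" + "".join(GREEK) + "])([A-Za-z0-9]?)")


def _spell_out(match):
    """Write the Greek letter in full, dash-separated from adjacent characters."""
    before, symbol, after = match.groups()
    word = GREEK[symbol]
    return (before + "-" if before else "") + word + ("-" + after if after else "")


def normalize_syntax(name):
    name = _GREEK_RE.sub(_spell_out, name)
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Non-possessive eponyms: drop a trailing "'s".
    return re.sub(r"(\w)'s\b", r"\1", name)
```

Stripping the mark from a character is only one possible treatment of
diacritics; a curator may prefer a different transliteration.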

RN based on the gene symbol (GS) should be in the form "protein <GS>" instead
of "<GS> protein". The word "protein" should be added in cases where no other
descriptor can be added, instead of merely having the gene symbol by itself.

Some examples where the addition of the GS is useful include:

mismatch repair proteins
DNA repair proteins
DNA/RNA polymerases
DNA/RNA helicases
GTP-binding proteins
transcriptional regulators
cell division proteins
chaperones
outer membrane proteins
recombination proteins
conjugation proteins
flagellar proteins
sporulation proteins
secretion proteins
[Note: this list is not exhaustive]

Whenever possible, commas should be avoided in an RN except when their usage
is obligatory in accepted chemical names.
  e.g. "acyl-CoA dehydrogenase, short-chain specific" should be "short-chain
  specific acyl-CoA dehydrogenase".

Symbols of chemical elements can be used in abbreviations.
  e.g. "magnesium/calcium co-transporter" can be abbreviated as "Mg/Ca
       co-transporter".

For ions, chemical element symbols (e.g. Cu(+), Mg(2+), etc.) are preferred to
systematic names (copper(I), magnesium ion, etc.) and common names (cupric,
ferrous, etc.). When necessary, the valence should be indicated within
parentheses.
  e.g. "Fe(2+)", "Fe(3+)", "Cl(-)", etc.

-  Abbreviations should not appear inside an RN, with the exception of:

   Deoxyribonucleic acid:
        DNA
        cDNA
        dsDNA
        ssDNA
   Ribonucleic acid:
        dsRNA
        piRNA
        siRNA
        snRNA
        ssRNA
        tmRNA
   Mono-, di-, tri- nucleoside phosphates:
        d[ACGT][MDT]P
        c[AG]MP
   Cofactors:
        FAD
        FMN
        NAD
        NADP
   Coenzymes:
        CoA
   Others:
        hnRNP
        SAM
   [Note: this list is not exhaustive]
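
The bracket notation used in the list above (d[ACGT][MDT]P, c[AG]MP) reads
directly as a regular expression. As a sketch, the listed abbreviations can be
folded into one pattern; the function name is hypothetical, and because the
list is not exhaustive a rejected token is not necessarily forbidden.

```python
import re

# Only the abbreviations listed above; the source list is not exhaustive.
ALLOWED_ABBREVIATION = re.compile(
    r"(?:c|ds|ss)?DNA"              # DNA, cDNA, dsDNA, ssDNA
    r"|(?:ds|pi|si|sn|ss|tm)RNA"    # dsRNA, piRNA, siRNA, snRNA, ssRNA, tmRNA
    r"|d[ACGT][MDT]P"               # dAMP ... dTTP
    r"|c[AG]MP"                     # cAMP, cGMP
    r"|FAD|FMN|NADP?"               # cofactors
    r"|CoA|hnRNP|SAM"               # coenzymes and others
)


def is_listed_abbreviation(token):
    """True if the whole token is one of the abbreviations listed above."""
    return ALLOWED_ABBREVIATION.fullmatch(token) is not None
```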

Note: protein name abbreviations should not be used.
  e.g. "acyl carrier protein" instead of "ACP".

- Charged tRNAs are indicated by "tRNA" followed by the three-letter amino
  acid code, with the first letter capitalized, in brackets.
  e.g. "Glu-tRNA(Gln) amidotransferase subunit B".

Hyphens should be used to form compound modifiers (i.e. two or more words that
are acting as a single modifier for a noun). The following terms are commonly
used in compound modifiers.

  activated, activating, adapting, adding, amplified, anchored, anchoring,
  antagonizing, associated, associating, attracting, binding, blocking, bound,
  branching, bridging, bundling, capping, complementing, concentrating,
  conjugating, containing, controlled, controlling, converting, coupled,
  coupling, decapping, degrading, dependent, depolymerizing, derepressing,
  derived, deriving, destabilizing, docking, editing, enhanced, enhancing,
  enriched, exposed, expressed, flanking, forming, gated, grabbing, harvesting,
  independent, induced, inducible, inducing, inhibited, inhibiting, insensitive,
  interacting, laying, like, linked, linking, metabolizing, modifying,
  modulating, polymerizing, potentiating, preventing, processing, promoting,
  recognizing, recruited, recruiting, regulated, regulating, related, released,
  releasing, remodeling, removing, repressing, required, requiring, resistant,
  responsive, rich, ripening, scaffolding, sensing, sensitive, signaling,
  specific, splicing, spreading, stabilized, stabilizing, stacking, stimulated,
  stimulating, structuring, sulfating, suppressing, trafficking, transformed,
  transforming, transporting [Note: This list is not exhaustive].

  e.g. "secretin-binding protein", "pyrophosphate-dependent
       phosphofructokinase".

  See: http://www.grammaruntied.com/punctuation/hyphen.html
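
As a rough illustration of the hyphenation rule, the sketch below joins a word
to a following modifier term when another word comes after it. It is a naive
text substitution, not a grammatical analysis: the function name is
hypothetical and only a handful of terms from the list above are included, so
any real use would need the full list and manual review.

```python
import re

# A small subset of the modifier terms listed above, for illustration only.
MODIFIER_TERMS = ("associated", "binding", "containing", "dependent", "induced")

# A word character, a space, a modifier term, and a following word.
_MODIFIER_RE = re.compile(r"(\w) (" + "|".join(MODIFIER_TERMS) + r")(?= \w)")


def hyphenate_modifiers(name):
    """Replace the space before a known modifier term with a hyphen."""
    return _MODIFIER_RE.sub(r"\1-\2", name)
```

For example, "secretin binding protein" becomes "secretin-binding protein",
and a name that is already hyphenated is left unchanged.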

Specific rules for enzymes
==========================

Enzymes commonly have RNs ending in "ase".
  e.g. "aminoacylase", "arginase", "caspase", "elastase", etc.

Transfer enzymes are often named in such a way as to describe the source and
target of the transfer reaction, with the two separated by a double dash (--).
This is an IUBMB recommendation.
  e.g. "formylmethanofuran--tetrahydromethanopterin formyltransferase".

For protein kinases and phosphatases, use the format:
"<amino acid>-protein <kinase|phosphatase>".
  e.g. "serine/threonine-protein kinase", "tyrosine-protein phosphatase".

In cases where the protein is possibly an inactive version of an enzyme, avoid
mentioning the activity in the name unless in expressions such as
"X domain-containing protein". Inactive versions refer to proteins where active
site residues are altered, for example, and do not refer to pseudogenes.
  e.g. "protease domain-containing protein".

In some cases, the protein is named based on the pathway it is involved in. In
such cases the following format is suitable: "<product> biosynthesis protein
<GS>".
  e.g. "thiamine biosynthesis protein ThiC".

Specific rules for multiprotein complexes
=========================================

Proteins that belong to well-defined multi-subunit complexes can be named
according to the complex, followed by the specific subunit name. This type of
nomenclature is only allowed for well-defined complexes of known composition.
  e.g. "26S proteasome non-ATPase regulatory subunit 1".

The word "subunit" is preferred to "chain" or "component".
Chain refers to proteolytically processed polypeptides arising from a common
precursor protein.
  e.g. "unicornase heavy chain", "unicornase light chain".

If the name contains a "type" of subunit, then precede the word "subunit" with
the "type". The "type" is a controlled vocabulary:

  ATP-binding
  catalytic
  ferredoxin
  flavoprotein
  modulatory
  regulatory
  [Note: This list is not exhaustive]

  e.g. "unicornase regulatory subunit".

Avoid the word "subunit" with a size indicator:
  e.g. "unicornase large subunit", "ribosomal large subunit pseudouridine
  synthase", etc.

If the name contains a "designator" of the subunit, then the "designator" must
follow the word "subunit":

  Numbers               unicornase subunit 2
  Letters               unicornase subunit A
  GS                    unicornase subunit AbcD
  Greek letters         unicornase subunit alpha

The preference is to use Numbers > Letters > GS > Greek letters.

An RN can include both a "type" and a "designator".
  e.g. "unicornase regulatory subunit 1".
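
The word order described above (complex, optional "type", the word "subunit",
optional "designator") can be sketched as a small builder. The function name
and signature are assumptions made for illustration. The type vocabulary is
copied from the list above, which is not exhaustive, so rejecting unknown types
is stricter than the guidelines require.

```python
# "type" vocabulary as listed above (the source list is not exhaustive).
SUBUNIT_TYPES = {
    "ATP-binding", "catalytic", "ferredoxin",
    "flavoprotein", "modulatory", "regulatory",
}


def subunit_name(complex_name, designator=None, subunit_type=None):
    """Assemble '<complex> [<type>] subunit [<designator>]'."""
    if subunit_type is not None and subunit_type not in SUBUNIT_TYPES:
        raise ValueError(f"unknown subunit type: {subunit_type!r}")
    parts = [complex_name]
    if subunit_type is not None:
        parts.append(subunit_type)       # the type precedes "subunit"
    parts.append("subunit")
    if designator is not None:
        parts.append(str(designator))    # the designator follows "subunit"
    return " ".join(parts)
```

Calling subunit_name("unicornase", 1, "regulatory") gives "unicornase
regulatory subunit 1", while a size indicator such as "large" is rejected.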

Additional rules
================

Unfortunately there are proteins of unknown or uncertain function for which
only family/domain identification, similarity or no information at all is
available. In these cases, we would recommend the following.

"Hypothetical protein" or "Uncharacterized protein".
These two are the only recommended terms for naming proteins of unknown
function.

The following words should be avoided in an RN:

  Conserved
  Novel
  Possible
  Potential
  Unique
  Protein of unknown function
  Similar to

  Note: these words can be used IF they are 'internal' to the RN and
        do not convey a 'global' meaning.
  e.g.  "high-potential iron-sulfur protein"
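
A simple screen for the discouraged terms is sketched below. The guidelines do
not define 'internal' and 'global' formally, so this sketch makes a loud
assumption: a term is treated as 'global' only when the name starts with it.
The function name is hypothetical.

```python
# Discouraged terms as listed above, lower-cased for comparison.
DISCOURAGED_TERMS = (
    "conserved", "novel", "possible", "potential", "unique",
    "protein of unknown function", "similar to",
)


def leading_discouraged_term(name):
    """Return the discouraged term that the name starts with, or None.

    Assumption: a term counts as 'global' only in leading position.
    """
    lowered = name.lower()
    for term in DISCOURAGED_TERMS:
        if lowered == term or lowered.startswith(term + " "):
            return term
    return None
```

Under this assumption "high-potential iron-sulfur protein" is accepted, since
"potential" is embedded in the name rather than qualifying it as a whole.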

When an RN is based on the predicted activity of the protein, the RN can be
preceded by 'putative' e.g. "putative acetylornithine deacetylase".

Proteins of unknown function which nevertheless contain a defined domain or
motif (that itself does not specify a particular function) have sometimes been
named according to the domain(s) or repeat(s) present. The name should then
be of the following type: "<domain|repeat>-containing protein".
  e.g. "PAS domain-containing protein 5", "thioredoxin domain-containing
  protein".

If there is more than one domain/repeat, use a slash for all items preceding
"containing" in accordance with grammatical rules. This also helps differentiate
specific domains.
  e.g. "ankyrin repeat/SAM domain-containing protein 1"

Do not use plurals.
  e.g. "ankyrin repeats-containing protein 8" is wrong.
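
The domain/repeat pattern, the slash rule and the ban on plurals can be
combined in one helper. This is a sketch with a hypothetical function name; the
plural check only looks for a trailing "repeats" or "domains" and is an
assumption about how the rule might be tested.

```python
def domain_containing_name(features, number=None):
    """Build '<feature>[/<feature>...]-containing protein [<number>]'."""
    if not features:
        raise ValueError("at least one domain or repeat is required")
    for feature in features:
        # Do not use plurals, e.g. "ankyrin repeats".
        if feature.endswith(("repeats", "domains")):
            raise ValueError(f"plural not allowed: {feature!r}")
    name = "/".join(features) + "-containing protein"
    if number is not None:
        name += f" {number}"
    return name
```

With this helper, domain_containing_name(["ankyrin repeat", "SAM domain"], 1)
yields "ankyrin repeat/SAM domain-containing protein 1".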

Proteins of unknown function which exhibit significant sequence similarity to a
defined protein family have been named in accordance with other members of that
family. The word protein should be added after family if no other descriptor is
possible.
  e.g. "Holliday junction resolvase family endonuclease", "LysR family
  transcriptional regulator".

It is also possible to use "-like" in the name. Bear in mind that this
should only be used for cases that are outliers to a tight homeomorphic
family. Family is preferred over '-like'.
  e.g. "Holliday junction resolvase-like protein".

Certain proteins have multiple functions. The RN could reflect this
situation. For multifunctional proteins which do not yet have a single unique
name, a name can be formed by combining individual functions along with a
prefix specifying the number of functions ('bi', 'tri', etc.). Each function
should be separated by a forward slash "/".
  e.g. "bifunctional adenylyltransferase/ADP-heptose synthase cyclohydrolase"
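
The construction of a multifunctional name can be sketched as follows. The
function name is hypothetical, and only the two prefixes named in the text
('bi', 'tri') are supported; other counts raise an error rather than guessing
a prefix.

```python
# Only the prefixes named in the guidelines; further ones are left out on purpose.
FUNCTION_COUNT_PREFIX = {2: "bi", 3: "tri"}


def multifunctional_name(functions):
    """Build '<prefix>functional <function>/<function>[/...]'."""
    prefix = FUNCTION_COUNT_PREFIX.get(len(functions))
    if prefix is None:
        raise ValueError(f"no prefix defined for {len(functions)} function(s)")
    return f"{prefix}functional " + "/".join(functions)
```

For example, multifunctional_name(["adenylyltransferase", "ADP-heptose
synthase"]) gives "bifunctional adenylyltransferase/ADP-heptose synthase".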