        UniProt - Swiss-Prot Protein Knowledgebase
        SIB Swiss Institute of Bioinformatics; Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
        Protein Information Resource (PIR); Washington DC, USA

Description: Protein naming guidelines
Name:        nameprot.txt
Release:     2014_04 of 16-Apr-2014



Consistent nomenclature is indispensable for communication, literature
searching and entry retrieval. Many species-specific communities have
established gene nomenclature committees that try to assign consistent
and, if possible, meaningful gene symbols. Other scientific communities
have established protein nomenclatures for a set of proteins based on
sequence similarity and/or function. But there is no established
organization involved in the standardization of protein names, nor are
there any efforts to establish naming rules that are valid across the
largest spectrum of species possible.

Ambiguities regarding gene/protein names are a major problem in the
literature, and the problem is even worse in the sequence databases, which
tend to propagate the confusion. As administrators of UniProt, we feel that
we can play a major role in the standardization of protein nomenclature.

UniProt is constantly striving to further standardize the nomenclature for
a given protein across related organisms. This is accomplished via protein
family-driven annotation, through both manual and automated pipelines.
This also involves the ongoing standardization of all the existing
UniProtKB/Swiss-Prot protein names according to our guidelines. We try to
attribute a recommended name to all the proteins of UniProtKB/Swiss-Prot,
following as far as possible the rules listed in this document. We would
suggest that authors/laboratories take these rules into account when naming
new proteins.

Warning: this is a preliminary document; many rules still have to be
added, modified or expanded.


RN : Recommended name (RecName)
AN : Alternative name (AltName)
GS : Gene symbol
OLN: Ordered locus name

General naming rules

If it exists, use the approved nomenclature.
See: nomlist.txt, a list of nomenclature related references for proteins.

If no accepted unification exists, and several alternatives are of equal
frequency in the literature, we use the one with the easiest extensibility
or standardization. In addition, preference is given to names that best
reflect the common acronym or gene symbol.

The protein naming guidelines are based on the premise that a good and
stable recommended name (RN) for a protein is a name that is as neutral as
possible.

An RN should be, as far as possible, unique and attributed to all orthologs.

One reason for this is that it should be possible to propagate a protein
name to all orthologous proteins, from various organisms. This is why,
ideally, the protein name should not contain a specific characteristic of
the protein, and in particular it should not reflect the function or role
of the protein, nor its subcellular location, its domain structure, its
tissue specificity, its molecular weight or its species of origin.


- An RN should not contain information about the molecular weight of the
  protein.
  e.g. "unicornase subunit A" is preferred to "unicornase 52 kDa subunit".
- An RN should not be based on the name of a disease.
  e.g. "Bloom syndrome protein" is not suitable.
- An RN should not be based on tissue specificity.
  e.g. "testis-specific protein ..." is not suitable.
- An RN must not be based on the species name.
  e.g. "Yeast Ku70 protein" is not suitable.
- An RN should not be based on the gene induction.
  e.g. "androgen-induced protein 1" is not suitable.

The ideal RN is a word that ends with "in" and can be easily pronounced in
English.
  e.g. "zyxin", "insulin", "hemoglobin", "caveolin", "desmoglein",
       "secretin", etc.

Names ending in "ine" should be avoided.
  e.g. "maurocalcin" instead of "maurocalcine".

Wherever appropriate, the RN should use American spelling conventions
(as opposed to British spelling).
  e.g. "hemoglobin" instead of "haemoglobin".

An RN should not contain a roman numeral.
  e.g. "caveolin-2" instead of "caveolin-II".
  Exception: historical cases.
  e.g. "coagulation factor IX", "casein kinase II", "HLA class I", etc.
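
The Roman numeral rule can be sketched in code. The following is a minimal
illustration, not a UniProt tool: the function names and the restriction to
trailing numerals built from I, V and X are assumptions made for this
example, and the exception set contains only the historical names quoted
above.

```python
import re

# Historical names quoted in the text as exceptions; they are left untouched.
HISTORICAL = {"coagulation factor IX", "casein kinase II", "HLA class I"}

# Only small numerals are handled in this sketch (an assumption).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}

# A trailing run of I/V/X preceded by a dash or a space.
TRAILING_ROMAN = re.compile(r"[- ]([IVX]+)$")


def roman_to_int(numeral):
    """Convert a numeral such as "IX" to 9, honoring subtractive notation."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        # A smaller value written before a larger one is subtracted.
        if i + 1 < len(numeral) and ROMAN_VALUES[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def arabicize(name):
    """Replace a trailing Roman numeral with "-<Arabic number>"."""
    if name in HISTORICAL:
        return name
    match = TRAILING_ROMAN.search(name)
    if match is None:
        return name
    return f"{name[:match.start()]}-{roman_to_int(match.group(1))}"
```

Note that this sketch cannot tell a Roman numeral from a single-letter
designator such as "X" or "V", so its output would still need review.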

Abbreviations should not be built using the molecular weight.
  e.g. Abbreviations such as p123, Gp62, p34 are not suitable.
  Exception: cases where historically the molecular weight has been
  consistently and generally applied as part of the accepted name.
  e.g. "p53".
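
A check for molecular-weight-based abbreviations might look as follows.
This is only a sketch: the pattern ("p" or "gp" followed by digits) is
inferred from the examples above, the function name is hypothetical, and
the exception set holds just the one case named in the text.

```python
import re

# Historically accepted exceptions; only "p53" is named in the text.
ACCEPTED = {"p53"}

# "p" or "gp" (any case) followed only by digits, e.g. p123, Gp62, p34.
MW_ABBREVIATION = re.compile(r"(?:gp|p)\d+", re.IGNORECASE)


def is_weight_based(abbreviation):
    """Return True if the abbreviation looks like it encodes a molecular weight."""
    if abbreviation in ACCEPTED:
        return False
    return MW_ABBREVIATION.fullmatch(abbreviation) is not None
```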

For proteins that belong to a multigene family, it is recommended that you
choose a coherent nomenclature with numbers to specify the different members
of the family.

When naming proteins which can be grouped into a family based on homology or
according to a notion of shared function (like the interleukins), the
different members should be enumerated with a dash "-" followed by an Arabic
number.
  e.g. "desmoglein-1", "desmoglein-2", etc.

General syntax

Greek letters must be written in full.
  e.g. "alpha", "omega".

Greek letters are written entirely in lower case with the exception of
"Delta" in the context of the steroid/fatty acid metabolism nomenclature.

If a Greek letter is preceded or followed by a number or letter, then it
must be separated by a dash "-".
  e.g. "unicornase alpha-1", "Myprotease A-beta".
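
The two Greek letter rules (write the letter in full, and separate it from
an adjacent number or letter with a dash) can be illustrated with a short
sketch. The function name and the five-letter table are assumptions for
this example; the capitalized "Delta" case of the steroid/fatty acid
nomenclature is not handled.

```python
import re

# Partial table of lower-case Greek symbols (an assumption; extend as needed).
GREEK = {
    "\u03b1": "alpha",
    "\u03b2": "beta",
    "\u03b3": "gamma",
    "\u03b4": "delta",
    "\u03c9": "omega",
}

# A Greek symbol with an optional ASCII letter or digit directly on either side.
PATTERN = re.compile("([A-Za-z0-9]?)([" + "".join(GREEK) + "])([A-Za-z0-9]?)")


def spell_out_greek(name):
    """Write Greek symbols in full, dash-separated from adjacent characters."""
    def replace(match):
        before, symbol, after = match.groups()
        word = GREEK[symbol]
        if before:
            word = f"{before}-{word}"
        if after:
            word = f"{word}-{after}"
        return word

    return PATTERN.sub(replace, name)
```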

An RN should not use diacritics, such as accents, umlauts and so on.
  e.g. "Krüppel" is not suitable.
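
One possible way to remove diacritics is Unicode decomposition, as in the
sketch below. The function name is hypothetical, and the approach simply
drops the combining marks ("ü" becomes "u"); the guidelines do not say
whether a transliteration such as "ue" would be preferred.

```python
import unicodedata


def strip_diacritics(name):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```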

Eponyms should be used in the non-possessive form (a name should not be
followed by "'s").

Note: an eponym is a person, whether real or fictitious, whose name has
(or is thought to have) given rise to the name of a particular item. There
used to be a debate as to whether the possessive form (e.g. Alzheimer's
disease) or the non-possessive form (Alzheimer disease) of eponyms is
preferred. As a rule, the non-possessive form is now preferred.

  e.g. "Alzheimer disease amyloid A4 protein" instead of "Alzheimer's
  disease amyloid A4 protein".
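
A simple way to apply the non-possessive rule is to drop a trailing "'s"
from words, as sketched below. The function name is an assumption, and the
pattern removes every possessive ending, not only those on eponyms, so it
is a rough aid and not a complete solution.

```python
import re

# A straight or typographic apostrophe followed by "s" at the end of a word.
POSSESSIVE = re.compile(r"(?<=\w)['\u2019]s\b")


def to_non_possessive(name):
    """Remove possessive endings, e.g. "Alzheimer's" becomes "Alzheimer"."""
    return POSSESSIVE.sub("", name)
```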

An RN based on the GS should be in the form "Protein <GS>" instead of
"<GS> protein".
  e.g. "protein abcD" instead of "abcD protein".

When an RN includes a GS, the casing of the GS should be the one used for
the gene in the nomenclature for that organism. Since we are always dealing
with proteins, it will be understood that gene=protein.
  e.g. "response regulator algR", "Protein HEX23".

Whenever possible, commas should be avoided in an RN.
  e.g. "acyl-CoA dehydrogenase, short-chain specific" should be
       "short-chain specific acyl-CoA dehydrogenase".

Symbols of chemical elements can be used in abbreviations.
  e.g. "magnesium/calcium co-transporter" can be abbreviated as "Mg/Ca
  co-transporter".

For ions, chemical element symbols (e.g. Cu(+), Mg(2+), etc.) are
preferred to systematic names (copper(I), magnesium ion, etc.) and common
names (cupric, ferrous, etc.).

For ions, when necessary, valence should be indicated within parentheses.
  e.g. "Fe(2+)", "Fe(3+)", "Cl(-)", etc.
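
The ion notation used in these examples can be checked with a short
pattern. This is a sketch under assumptions: the function name is
hypothetical, and the pattern accepts only a one- or two-letter element
symbol followed by a parenthesized charge, with the digit omitted for a
single charge as in "Cu(+)" and "Cl(-)".

```python
import re

# Element symbol, then "(" optional charge magnitude 2-9, sign, ")".
ION = re.compile(r"[A-Z][a-z]?\([2-9]?[+-]\)")


def is_ion_notation(text):
    """Return True for strings such as "Fe(2+)" or "Cl(-)"."""
    return ION.fullmatch(text) is not None
```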

Abbreviations should not appear inside an RN, with the exception of:

  Deoxyribonucleic acid:                       DNA
  Ribonucleic acid:                            RNA
  Mono-, di-, tri- nucleic acid phosphates:    e.g. AMP, ADP, ATP

Note: protein name abbreviations should not be used.
  e.g. "acyl carrier protein" instead of "ACP".

Charged tRNAs are indicated by "tRNA" followed by the three-letter amino
acid code, with the first letter capitalized, in brackets.
  e.g. "Glu-tRNA(Gln) amidotransferase subunit B".

Hyphens should be used to form compound modifiers (i.e. two or more words
that are acting as a single modifier for a noun). For example, use a hyphen
before:

  activated, activating, adapting, adding, amplified, anchored, anchoring, antagonizing,
  associated, associating, attracting, binding, blocking, bound, branching, bridging,
  bundling, capping, complementing, concentrating, conjugating, containing, controlled,
  controlling, converting, coupled, coupling, decapping, degrading, dependent,
  depolymerizing, derepressing, derived, deriving, destabilizing, docking,
  editing, enhanced, enhancing, enriched, exposed, expressed, flanking, forming, gated,
  grabbing, harvesting, independent, induced, inducible, inducing, inhibited,
  inhibiting, insensitive, interacting, laying, like, linked, linking, metabolizing,
  modifying, modulating, polymerizing, potentiating, preventing, processing,
  promoting, recognizing, recruited, recruiting, regulated, regulating, related, released,
  releasing, remodeling, removing, repressing, required, requiring, resistant, responsive,
  rich, ripening, scaffolding, sensing, sensitive,
  signaling, specific, splicing, spreading, stabilized, stabilizing, stacking,
  stimulated, stimulating, structuring, sulfating, suppressing, trafficking,
  transformed, transforming, transporting
  [Note: This list is not complete]

  e.g. "secretin-binding protein", "pyrophosphate-dependent
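
The hyphenation of compound modifiers can be sketched as a substitution
over a word list. The function name is an assumption, and the tuple below
is only a small subset of the list above, chosen for illustration.

```python
import re

# Small illustrative subset of the modifier list given in the text.
MODIFIERS = (
    "associated",
    "binding",
    "containing",
    "dependent",
    "interacting",
    "like",
    "related",
)

# A space between a word character and one of the modifiers.
PATTERN = re.compile(r"(?<=\w) (" + "|".join(MODIFIERS) + r")\b")


def hyphenate_modifiers(name):
    """Join each listed modifier to the preceding word with a hyphen."""
    return PATTERN.sub(r"-\1", name)
```

Names that are already hyphenated are returned unchanged, so the function
can be applied repeatedly without side effects.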


Specific rules for enzymes

Enzymes commonly have RNs ending in "ase".
  e.g. "aminoacylase", "arginase", "caspase", "elastase", etc.

Transfer enzymes are often indicated with the source and destination
substrate separated by a double dash (--).
  e.g. "formylmethanofuran--tetrahydromethanopterin formyltransferase".

For protein kinases and phosphatases, use the format:
"<amino acid>-protein <kinase/phosphatase>".
  e.g. "serine/threonine-protein kinase", "tyrosine-protein phosphatase".

In cases where the protein is possibly an inactive version of an enzyme,
avoid mentioning the activity in the name unless in expressions such as "X
domain-containing protein". Inactive versions refer to proteins where
active site residues are altered, for example, and do not refer to
  e.g. "protease domain-containing protein".

In some cases, the protein is named based on the pathway it is involved in.
In such cases the following format is suitable: "<product> biosynthesis
protein <GS>".
  e.g. "thiamine biosynthesis protein thiC".

Specific rules for multiprotein complexes

Sometimes a protein is named after a multiprotein complex, which is only
suitable for well-defined complexes. Keep in mind that in some cases the
complex composition is variable and proteins can belong to multiple
different complexes (transcription, chromatin remodeling or ubiquitin
ligase E3 complex). In such a case, it may be better not to cite the
complex name in the RN field.

Proteins that belong to well-defined multi-subunit complexes can be named
according to the complex, followed by the specific subunit name.
  e.g. "26S proteasome non-ATPase regulatory subunit 1".

The word "subunit" is preferred to "chain", "component" or "polypeptide".
Chain refers to proteolytically processed polypeptides arising from a common
precursor protein.
  e.g. "unicornase heavy chain", "unicornase light chain".

If the name contains a "type" of subunit, then precede the word "subunit"
with the "type". The "type" is a controlled vocabulary:

  [Note: This list is not complete]

  e.g. "unicornase regulatory subunit".

Avoid the word "subunit" with a size indicator:
  e.g. "unicornase large subunit".

If the name contains a "designator" of the subunit, then the "designator"
must follow the word "subunit":

  Numbers               Unicornase subunit 2
  Letters               Unicornase subunit A
  GS                    Unicornase subunit abcD
  Greek letters         Unicornase subunit alpha

The order of preference is: Numbers > Letters > GS > Greek letters.

An RN can include both a "type" and a "designator".
  e.g. "unicornase regulatory subunit 1".
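
The subunit naming pattern (complex name, optional type, the word
"subunit", optional designator) can be sketched as a small builder. The
function names and the dictionary keys are assumptions made for this
example; the preference order is the one stated above.

```python
# Designator kinds in decreasing order of preference, as stated in the text.
PREFERENCE = ("number", "letter", "gene symbol", "greek letter")


def pick_designator(candidates):
    """Pick the most preferred designator from a {kind: value} mapping."""
    for kind in PREFERENCE:
        if kind in candidates:
            return str(candidates[kind])
    raise ValueError("no usable designator supplied")


def subunit_name(complex_name, subunit_type=None, designator=None):
    """Assemble "<complex> [<type>] subunit [<designator>]"."""
    parts = [complex_name]
    if subunit_type:
        parts.append(subunit_type)  # the type precedes the word "subunit"
    parts.append("subunit")
    if designator is not None:
        parts.append(str(designator))  # the designator follows it
    return " ".join(parts)
```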

Additional rules

Unfortunately many existing protein names are based not only on the role
or function, but sometimes on the domain structure, or on plenty of other
characteristics. In these cases we try to apply the following syntactic
rules.

Proteins which are NOT conserved or with no known or predicted function or
characteristics should be called "Uncharacterized protein <GS or OLN>".

The following words should be avoided in an RN:


  e.g. "hypothetical protein Abcd" is not suitable.

  Note: these words can be used IF they are 'internal' to the RN and
        do not convey a 'global' meaning.
  e.g.  "high-potential iron-sulfur protein", "thiamine precursor
        biosynthesis protein".

When an RN is based on the predicted activity of the protein, the RN may be
preceded by "Probable" or "Putative".
  e.g. "probable acetylornithine deacetylase", "putative acetylornithine
  deacetylase".

Proteins of unknown function which nevertheless contain a defined domain or
motif (that itself does not specify a particular function) have sometimes
been named according to the domain(s) or repeat(s) present. The name should
then be of the following type: "<domain or repeat>-containing protein".
  e.g. "PAS domain-containing protein 5".

If there is more than one domain/repeat, use a dash only for the last item
preceding "containing", even though this violates conventional grammar.
  e.g. "ankyrin repeat and SAM domain-containing protein 1" is correct, but
       "ankyrin repeat- and SAM domain-containing protein 1" is wrong.

Do not use plurals.
  e.g. "ankyrin repeats-containing protein 8" is wrong.
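
The rules for domain-based names (a single dash before "containing", no
plurals) can be sketched as a builder. The function name is hypothetical,
and the plural check covers only the endings "repeats", "domains" and
"motifs", which is an assumption for this example.

```python
def domain_based_name(features, number=None):
    """Build "<feature> [and <feature>]-containing protein [<number>]"."""
    if not features:
        raise ValueError("at least one domain or repeat is required")
    for feature in features:
        # Plural forms such as "ankyrin repeats" are rejected.
        if feature.endswith(("repeats", "domains", "motifs")):
            raise ValueError(f"use the singular form: {feature!r}")
    # Only the last item is joined to "containing" with a dash.
    name = " and ".join(features) + "-containing protein"
    if number is not None:
        name += f" {number}"
    return name
```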

Proteins of unknown function which exhibit significant sequence similarity
to a defined protein family have been named in accordance with other members
of that family...
  e.g. "Holliday junction resolvase family endonuclease".

It is also possible to use "-like" in the name. Bear in mind that this
should only be used for cases that are outliers to a tight homomorphic
  e.g. "Holliday junction resolvase-like protein".

The CD antigen nomenclature defined for surface proteins of human leucocytes
is propagated to mammalian orthologs.

Certain proteins have multiple functions. The RN could reflect this

Copyrighted by the UniProt Consortium, see
Distributed under the Creative Commons Attribution-NoDerivs License