Does UniProtKB contain all protein sequences?

Last modified July 20, 2020

The two sections of UniProtKB, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, give access to most publicly available protein sequences. However, UniProtKB excludes the following classes of protein sequences:

  1. Most non-germline immunoglobulins and T-cell receptors
  2. Synthetic sequences
  3. Most patent application sequences
  4. Small fragments (<8 amino acids) encoded by nucleotide sequences
  5. Pseudogenes
  6. Sequences from redundant proteomes
  7. Sequences from proteomes that NCBI genomes/RefSeq considers to be low-quality assemblies, i.e. excluded proteomes
  8. Fusion/truncated proteins
  9. Not real proteins

The first five classes are identified automatically by the UniProtKB/TrEMBL creation program and never enter UniProtKB. However, some proteins belonging to these classes are identified only later, by the curators during the UniProtKB/Swiss-Prot annotation process, and are then removed from UniProtKB.

Protein sequences originating from proteomes that are considered redundant or low-quality are identified automatically at every release. They are either removed from UniProtKB/TrEMBL or never enter UniProtKB, but in both cases they remain in UniParc.

Fusion/truncated proteins and sequences classified as not real proteins are identified only manually, by the curators, and are then removed from UniProtKB/TrEMBL or UniProtKB/Swiss-Prot. All of the excluded sequences described here remain available in UniParc. The corresponding UniParc entries are flagged with the reason for the sequence's absence from UniProtKB.
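The exclusion classes and the stage at which each is identified can be summarised as a small lookup table. The following Python sketch is illustrative only: the names `Stage`, `EXCLUSIONS`, `classes_identified_at` and `is_small_fragment` are hypothetical and are not part of any UniProt tool or API. It simply restates the rules described above, including the 8 amino acid threshold for small fragments.

```python
from enum import Enum


class Stage(Enum):
    """Stage at which an exclusion class is identified (labels are illustrative)."""
    TREMBL_CREATION = "automatic, by the UniProtKB/TrEMBL creation program"
    EVERY_RELEASE = "automatic, at every release"
    CURATION = "manual, by the curators"


# Primary identification stage for each excluded class, as described in the text.
# Note: members of the first five classes may also be caught later by curators.
EXCLUSIONS = {
    "Most non-germline immunoglobulins and T-cell receptors": Stage.TREMBL_CREATION,
    "Synthetic sequences": Stage.TREMBL_CREATION,
    "Most patent application sequences": Stage.TREMBL_CREATION,
    "Small fragments (<8 amino acids)": Stage.TREMBL_CREATION,
    "Pseudogenes": Stage.TREMBL_CREATION,
    "Sequences from redundant proteomes": Stage.EVERY_RELEASE,
    "Sequences from excluded (low-quality) proteomes": Stage.EVERY_RELEASE,
    "Fusion/truncated proteins": Stage.CURATION,
    "Not real proteins": Stage.CURATION,
}

# Fragments shorter than this many amino acids fall into the "small fragment" class.
MIN_FRAGMENT_LENGTH = 8


def classes_identified_at(stage: Stage) -> list[str]:
    """Return the exclusion classes whose primary identification stage is `stage`."""
    return [name for name, s in EXCLUSIONS.items() if s is stage]


def is_small_fragment(sequence: str) -> bool:
    """Return True if the sequence has fewer than 8 amino acids."""
    return len(sequence) < MIN_FRAGMENT_LENGTH


if __name__ == "__main__":
    for stage in Stage:
        print(stage.value)
        for name in classes_identified_at(stage):
            print("  -", name)
```

The length check covers only the size criterion. Whether a short sequence was actually translated from a nucleotide sequence is not something this sketch can determine.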

UniProt is an ELIXIR core data resource
Main funding by: National Institutes of Health
