Does UniProtKB contain all protein sequences?

Last modified March 23, 2015

The two sections of UniProtKB (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) give access to all publicly available protein sequences, with some exceptions. UniProtKB excludes the following protein sequences:

  1. Most non-germline immunoglobulins and T-cell receptors
  2. Synthetic sequences
  3. Most patent application sequences
  4. Small fragments (<8 amino acids) translated from nucleotide sequences
  5. Pseudogenes
  6. Sequences from redundant proteomes
  7. Fusion/truncated proteins
  8. Not real proteins

The first five classes are identified automatically by the UniProtKB/TrEMBL creation program and never enter UniProtKB. However, some proteins belonging to these classes are also identified by curators during the UniProtKB/Swiss-Prot annotation process and are then removed from UniProtKB.

Protein sequences originating from proteomes that are considered redundant are identified automatically at every release. They are either removed from UniProtKB/TrEMBL or never enter UniProtKB, but stay in UniParc.

Fusion/truncated proteins and sequences classified as not real proteins are identified only manually, by curators, and are then removed from UniProtKB/TrEMBL or UniProtKB/Swiss-Prot. All of the excluded sequences remain available in UniParc, where the corresponding entries are flagged with the reason for the sequence's absence from UniProtKB.
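As a rough summary, the eight exclusion classes and the way each one is detected can be written down as a small lookup table. The Python sketch below is illustrative only: the names `Detection`, `EXCLUSION_CLASSES`, `classes_detected_by` and `is_small_fragment` are hypothetical and do not correspond to any UniProt software or API. Only the class names and the 8-amino-acid threshold are taken from this page.

```python
from enum import Enum


class Detection(Enum):
    """How an exclusion class is primarily identified (per this page)."""
    AUTOMATIC_TREMBL = "automatically, by the UniProtKB/TrEMBL creation program"
    AUTOMATIC_PER_RELEASE = "automatically, at every release"
    MANUAL = "manually, by curators"


# Hypothetical lookup table: exclusion class -> primary detection route.
# Note: some sequences in the first five classes are also caught by
# curators during UniProtKB/Swiss-Prot annotation; this table records
# only the primary route.
EXCLUSION_CLASSES = {
    "Most non-germline immunoglobulins and T-cell receptors": Detection.AUTOMATIC_TREMBL,
    "Synthetic sequences": Detection.AUTOMATIC_TREMBL,
    "Most patent application sequences": Detection.AUTOMATIC_TREMBL,
    "Small fragments (<8 amino acids)": Detection.AUTOMATIC_TREMBL,
    "Pseudogenes": Detection.AUTOMATIC_TREMBL,
    "Sequences from redundant proteomes": Detection.AUTOMATIC_PER_RELEASE,
    "Fusion/truncated proteins": Detection.MANUAL,
    "Not real proteins": Detection.MANUAL,
}


def classes_detected_by(detection):
    """Return the exclusion classes whose primary route is `detection`."""
    return [name for name, route in EXCLUSION_CLASSES.items() if route is detection]


def is_small_fragment(sequence, min_length=8):
    """True if the sequence is shorter than the 8-residue threshold."""
    return len(sequence) < min_length


if __name__ == "__main__":
    for route in Detection:
        print(f"Identified {route.value}:")
        for name in classes_detected_by(route):
            print(f"  - {name}")
```

The length check covers only the simplest of the criteria. The other classes depend on annotation or curation information that a sequence string alone does not carry.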

