Where do the UniProtKB protein sequences come from?

Last modified April 10, 2017

More than 95% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources of the International Nucleotide Sequence Database Collaboration (INSDC). These CDS are either generated by gene prediction programs or experimentally proven. A protein identifier (“protein_id”) is assigned to each translated CDS and can be found in the original EMBL-Bank/GenBank/DDBJ record and in the relevant UniProtKB entry.
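As a rough illustration of what "translation of a CDS" means, the following Python sketch converts a nucleotide coding sequence into a protein sequence using the standard genetic code. It is a simplified example and not the pipeline used by INSDC or UniProtKB: the function name `translate_cds` is hypothetical, and the sketch assumes the standard code (translation table 1), a reading frame starting at the first base, and no ambiguous nucleotides. Real INSDC records can specify other genetic codes and start offsets through the `/transl_table` and `/codon_start` qualifiers.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (translation table 1); codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS (DNA or RNA) into a protein string, stopping at the first stop codon."""
    seq = cds.upper().replace("U", "T")
    protein = []
    # Walk the sequence codon by codon; a trailing partial codon is ignored.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":  # stop codon terminates translation
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAATGA"))  # prints "MAK"
```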

The translated CDS sequences are automatically transferred to the TrEMBL section of UniProtKB. TrEMBL records can then be selected for manual annotation and integrated into the UniProtKB/Swiss-Prot section. The “protein_id” identifiers are listed in the cross-reference part of the ‘Sequence’ section of UniProtKB entries (see, for example, P13744 ‘Translation’).
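The same cross-references can be read from the text (flat-file) version of an entry. The sketch below assumes that each EMBL cross-reference appears on a `DR   EMBL;` line whose third field is the “protein_id”, with `-` used when no protein identifier is given. The function name `extract_protein_ids` is hypothetical, and the sample entry uses made-up placeholder identifiers, not data from a real UniProtKB record.

```python
def extract_protein_ids(entry_text):
    """Return the protein_ids found on the 'DR   EMBL;' lines of a flat-file entry."""
    protein_ids = []
    for line in entry_text.splitlines():
        if not line.startswith("DR   EMBL;"):
            continue
        # Assumed field order: database; nucleotide accession; protein_id; status; molecule type.
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        if len(fields) >= 3 and fields[2] != "-":
            protein_ids.append(fields[2])
    return protein_ids


# Placeholder entry fragment: all identifiers below are invented for illustration.
SAMPLE = """\
ID   EXAMPLE_ENTRY           Reviewed;         100 AA.
DR   EMBL; X00000; CAA00000.1; -; mRNA.
DR   EMBL; X00001; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   PDB; 0XXX; X-ray; 2.00 A; A=1-100.
"""

print(extract_protein_ids(SAMPLE))  # prints ['CAA00000.1']
```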

In addition to translated CDS, UniProtKB protein sequences may come from:

The FAQ “Does UniProtKB contain all protein sequences?” gives information on our UniProtKB protein sequence exclusion policies, e.g. for redundant proteomes.

(1) Complementary pipelines for import of protein sequences have been developed in collaboration with Ensembl for vertebrate species, Ensembl Genomes for non-vertebrate species, WormBase ParaSite for parasitic nematodes and VectorBase for pathogen vector genomes. In addition, a new pipeline imports selected non-redundant genomes annotated by NCBI RefSeq. These sources provide proteome sequences for a number of key genomes of special interest where the INSDC submission is lacking gene model annotation.

To date, these pipelines have been used to populate UniProtKB with additional predicted sequences for the human and mouse complete proteomes as well as a number of other important vertebrate and non-vertebrate species. See: What are proteomes?

