Where do the UniProtKB protein sequences come from?

Last modified May 15, 2015

More than 95% of the protein sequences provided by UniProtKB come from translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources, which together form the International Nucleotide Sequence Database Collaboration (INSDC). These CDS are either generated by gene prediction programs or experimentally proven. A protein identifier (“protein_id”) is assigned to each translated CDS and can be found in the original EMBL-Bank/GenBank/DDBJ record and in the relevant UniProtKB entry.
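As a rough illustration of what "translating a CDS" means, the following Python sketch converts a nucleotide coding sequence into a protein sequence. It assumes the standard genetic code (NCBI translation table 1), a complete CDS in reading frame 1, and no special cases; the function name `translate_cds` is hypothetical and this is not the code used by INSDC or UniProtKB. Real records may specify a different genetic code, partial sequences, or translation exceptions.

```python
# Standard genetic code (NCBI translation table 1), with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate_cds(cds: str) -> str:
    """Translate a CDS into a protein sequence (hypothetical helper).

    Assumptions: standard genetic code, reading frame 1, translation
    stops at the first stop codon, unknown codons become 'X'.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA or lowercase input
    protein = []
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))  # ATG GCC AAA TAA -> MAK
```

The compact 64-letter string follows the layout commonly used to publish genetic code tables, which keeps the lookup table short and easy to check against the published table.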

The translated CDS sequences are automatically transferred to the TrEMBL section of UniProtKB. TrEMBL records can then be selected for manual annotation and integrated into the UniProtKB/Swiss-Prot section. The “protein_id” values are listed in the cross-reference part of the ‘Sequence’ section of UniProtKB entries (see, for example, P13744 ‘Translation’).
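For readers working with downloaded entries rather than the website, the sketch below shows one way the “protein_id” values might be pulled out of a UniProtKB entry in text (flat-file) format. It assumes that EMBL cross-references appear on `DR   EMBL;` lines with the nucleotide accession in the second field and the protein_id in the third, and that `-` marks a missing protein_id. The function name `extract_protein_ids` is hypothetical, and the sample entry uses made-up identifiers, not the real cross-references of P13744.

```python
from typing import List


def extract_protein_ids(entry_text: str) -> List[str]:
    """Collect protein_id values from EMBL cross-reference lines.

    Assumed line layout (illustrative):
    DR   EMBL; <nucleotide accession>; <protein_id>; <status>; <molecule type>.
    """
    protein_ids = []
    for line in entry_text.splitlines():
        if not line.startswith("DR   EMBL;"):
            continue
        # Drop the 'DR   ' prefix and the trailing full stop, then split fields.
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        if len(fields) >= 3 and fields[2] != "-":
            protein_ids.append(fields[2])
    return protein_ids


# Made-up sample entry fragment; the identifiers are placeholders.
SAMPLE_ENTRY = """\
ID   EXAMPLE_ENTRY           Reviewed;          10 AA.
DR   EMBL; X00000; CAA00000.1; -; mRNA.
DR   EMBL; X00001; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   PIR; A00000; A00000.
"""

print(extract_protein_ids(SAMPLE_ENTRY))  # ['CAA00000.1']
```

Cross-reference lines without a protein_id are skipped, so the result contains only identifiers that correspond to a translated CDS.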

In addition to translated CDS, UniProtKB protein sequences may come from:

The FAQ ‘Does UniProtKB contain all protein sequences?’ describes our policies for excluding protein sequences from UniProtKB, e.g. for redundant proteomes.

(1) A complementary pipeline for importing protein sequences has been developed in collaboration with Ensembl for vertebrate species and Ensembl Genomes for non-vertebrate species. These sources provide protein sequences for a number of key genomes of special interest that may currently lack a complete INSDC submission. To date, this pipeline has been used to populate UniProtKB with additional predicted sequences for the human and mouse complete proteomes, as well as for a number of other important vertebrate and non-vertebrate species. See: What are proteomes?

Related terms: source, origin