Where do the UniProtKB protein sequences come from?
Last modified September 21, 2011
About 98% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources of the International Nucleotide Sequence Database Collaboration (INSDC). These CDS are either generated by gene prediction programs or experimentally proven. A protein identifier ("protein_id") is assigned to each translated CDS and can be found both in the original EMBL-Bank/GenBank/DDBJ record and in the relevant UniProtKB entry.
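As a rough illustration of what "translation of a CDS" means, the Python sketch below translates a nucleotide coding sequence using the standard genetic code. It is a simplification, not the INSDC or UniProtKB pipeline: the function name `translate_cds` is hypothetical, only the standard code is supported (real CDS features may specify another translation table), and any codon containing an ambiguous base is reported as 'X'.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a CDS (5'->3', frame 1) up to the first stop codon.

    Hypothetical helper for illustration only: assumes the standard
    genetic code and ignores any trailing partial codon.
    """
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA letters
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":  # stop codon ends the translation
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))  # MAK
```

The protein sequence obtained this way corresponds to the "translation" stored with the CDS feature in the nucleotide record, which is what gets a "protein_id".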
The translated CDS sequences are automatically transferred to the TrEMBL section of UniProtKB. TrEMBL records can then be selected for further manual annotation and integrated into the UniProtKB/Swiss-Prot section. The "protein_id" identifiers are listed in the 'Cross-reference' section of UniProtKB entries, under the category 'Sequence databases' (see for example P13744 'Translation').
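For readers working with the UniProtKB text (flat-file) format, the sketch below shows one way to pull the "protein_id" values out of an entry. It assumes that cross-references to the nucleotide databases appear on `DR   EMBL;` lines with the fields accession, protein_id, status and molecule type, separated by semicolons. The helper name `embl_protein_ids` is hypothetical, and the sample entry uses placeholder identifiers, not real records.

```python
def embl_protein_ids(entry_text: str) -> list[tuple[str, str]]:
    """Return (nucleotide accession, protein_id) pairs from DR EMBL lines.

    Assumed line layout (illustrative):
    DR   EMBL; <accession>; <protein_id>; <status>; <molecule type>.
    """
    pairs = []
    for line in entry_text.splitlines():
        if not line.startswith("DR   EMBL;"):
            continue
        # Drop the line code and the final full stop, then split the fields.
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        # A protein_id of "-" means no CDS translation is referenced.
        if len(fields) >= 3 and fields[2] != "-":
            pairs.append((fields[1], fields[2]))
    return pairs


# Placeholder entry fragment; accessions and protein_ids are made up.
SAMPLE = """\
ID   EXAMPLE_ENTRY           Reviewed;         100 AA.
DR   EMBL; X00000; CAA00000.1; -; Genomic_DNA.
DR   EMBL; X00001; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   PDB; 0XXX; X-ray; 2.00 A; A=1-100.
"""

print(embl_protein_ids(SAMPLE))  # [('X00000', 'CAA00000.1')]
```

Each returned protein_id can then be looked up in EMBL-Bank/GenBank/DDBJ to find the CDS from which the protein sequence was translated.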
In addition to translated CDS, UniProtKB protein sequences may come from:
- the PDB database.
- sequences obtained experimentally by direct protein sequencing (Edman degradation or MS/MS experiments) and submitted to UniProtKB/Swiss-Prot. Only about 5% of the UniProtKB/Swiss-Prot entries contain sequence data obtained by direct protein sequencing (see the list of entries with the keyword 'Direct protein sequencing').
- sequences scanned from the literature (e.g. PRF or other journal scan projects).
- sequences derived from gene prediction that were not submitted to EMBL-Bank/GenBank/DDBJ (Ensembl (1), RefSeq, CCDS, etc.). These data are restricted to some organisms, such as Homo sapiens.
- sequences derived from in-house gene prediction, in very specific cases.
The FAQ 'Does UniProtKB contain all protein sequences?' gives information on our UniProtKB protein sequence exclusion policies.
(1) A complementary pipeline for import of protein sequences has been developed in collaboration with Ensembl that provides protein sequences for a number of key genomes of special interest that currently may lack a complete INSDC submission. To date this pipeline has been used to populate UniProtKB with additional predicted sequences for the human and mouse complete proteomes and several other eukaryotes. See: What are complete proteomes?
- Why is UniProtKB composed of 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL?
- Does UniProtKB contain all protein sequences?
- What are UniProtKB's criteria for defining a CDS as a protein?
- Why do we keep dubious sequences in UniProtKB? How to discard them from a protein set?
- How do I get the nucleotide sequence that corresponds to the canonical UniProtKB sequence?
Related terms: source, origin