Where do the UniProtKB protein sequences come from?
Last modified March 23, 2015
More than 95% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources of the International Nucleotide Sequence Database Collaboration (INSDC). These CDS are either generated by gene prediction programs or are experimentally proven. A protein identifier (“protein_id”) is assigned to each translated CDS and can be found in the original EMBL-Bank/GenBank/DDBJ record and in the relevant UniProtKB entry.
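As a rough illustration of what "translation of a CDS" means, the following Python sketch translates a nucleotide CDS into a protein sequence using the standard genetic code. It is a minimal example under stated assumptions, not the pipeline used by INSDC or UniProtKB: the function name `translate_cds` and the toy sequence are hypothetical, and real CDS features may use other genetic code tables, alternative start codons, or translation exceptions that this sketch ignores.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with
# bases T, C, A, G; the first base varies slowest, the third fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a DNA coding sequence, stopping at the first stop codon.

    Hypothetical helper for illustration only: it assumes the standard
    genetic code and a sequence that starts in frame.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":  # stop codon ends the translation
            break
        protein.append(aa)
    return "".join(protein)


# Toy CDS (made up for this example): ATG GCC TGG TAA
print(translate_cds("ATGGCCTGGTAA"))
```

The codon table is built from the compact string form in which translation tables are commonly published, which keeps the example short and avoids a hand-typed 64-entry dictionary.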
The translated CDS sequences are automatically transferred to the TrEMBL section of UniProtKB. The TrEMBL records can be selected for further manual annotation and then integrated into the UniProtKB/Swiss-Prot section. The “protein_id” identifiers are listed among the cross-references in the ‘Sequence’ section of UniProtKB entries (see for example P13744 ‘Translation’).
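The “protein_id” cross-references can also be read programmatically from the text (flat-file) format of an entry, where they appear on `DR   EMBL;` lines. The sketch below assumes the usual field order on those lines (nucleotide accession, protein_id, status, molecule type) and that a missing protein_id is written as `-`. The function name `embl_protein_ids` is hypothetical, and the sample entry is invented for illustration; its accessions are placeholders, not real records.

```python
def embl_protein_ids(flat_text):
    """Return (nucleotide_accession, protein_id) pairs from DR EMBL lines.

    Hypothetical helper: assumes the field order
    'DR   EMBL; <accession>; <protein_id>; <status>; <molecule type>.'
    """
    pairs = []
    for line in flat_text.splitlines():
        if not line.startswith("DR   EMBL;"):
            continue  # skip other line types and other cross-references
        # Drop the 'DR   ' prefix and the trailing period, then split fields.
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        # A '-' in the protein_id position means no protein_id is attached.
        if len(fields) >= 3 and fields[2] != "-":
            pairs.append((fields[1], fields[2]))
    return pairs


# Invented sample entry fragment; all identifiers are placeholders.
SAMPLE = """\
ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
DR   EMBL; X00000; CAA00000.1; -; mRNA.
DR   EMBL; X00001; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   PDB; 1ABC; X-ray; 2.00 A; A=1-100.
"""

print(embl_protein_ids(SAMPLE))
```

Only the first EMBL line yields a pair here, because the second has no protein_id and the third points to a different database.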
In addition to translated CDS, UniProtKB protein sequences may come from:
- the PDB database.
- sequences obtained experimentally by direct protein sequencing (Edman degradation or MS/MS experiments) and submitted to UniProtKB/Swiss-Prot. Only about 5% of the UniProtKB/Swiss-Prot entries contain sequence data obtained by direct protein sequencing (list of entries with the keyword 'Direct protein sequencing').
- sequences scanned from the literature (e.g. PRF or other journal scan projects).
- sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ (Ensembl and Ensembl Genomes (1), RefSeq, CCDS, etc).
- sequences derived from in-house gene prediction, in very specific cases.
The FAQ ‘Does UniProtKB contain all protein sequences?’ gives information on our UniProtKB protein sequence exclusion policies, e.g. for redundant proteomes (/help/proteome_redundancy).
(1) A complementary pipeline for import of protein sequences has been developed in collaboration with Ensembl for vertebrate species and Ensembl Genomes for non-vertebrate species. These sources provide protein sequences for a number of key genomes of special interest that currently may lack a complete INSDC submission. To date, this pipeline has been used to populate UniProtKB with additional predicted sequences for the human and mouse complete proteomes as well as a number of other important vertebrate and non-vertebrate species. See: What are proteomes?
- Why is UniProtKB composed of 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL?
- Does UniProtKB contain all protein sequences?
- What are UniProtKB’s criteria for defining a CDS as a protein?
- Why do we keep dubious sequences in UniProtKB? How to discard them from a protein set?
- How do I get the nucleotide sequence that corresponds to the canonical UniProtKB sequence?
Related terms: source, origin