What is UniProt's human proteome?
Last modified December 15, 2015
In 2008, a draft of the complete human proteome was released in UniProtKB/Swiss-Prot: each of the approximately 20,000 putative human protein-coding genes was represented by one UniProtKB/Swiss-Prot entry, tagged with the keyword 'Complete proteome' and later linked to proteome identifier UP000005640. This UniProtKB/Swiss-Prot H. sapiens proteome (manually reviewed) can be considered complete in the sense that it contains one representative (canonical) sequence for each currently known human gene. Close to 40% of these 20,000 entries contain manually annotated alternative isoforms representing over 15,000 additional sequences (see What is the canonical sequence? Are all isoforms described in one entry?).
In 2011, a complementary pipeline for importing predicted human protein sequences into UniProtKB/TrEMBL was developed in collaboration with Ensembl to complete the set of human isoform sequences produced by genes present in UniProtKB/Swiss-Prot.
This pipeline works in the following way:
- Ensembl sequences are first mapped to their UniProtKB counterparts under stringent conditions, requiring 100% identity over 100% of the length of the two sequences. The UniProtKB entries identified in this way are flagged as part of the proteome (via a link to UP000005640) and receive a cross-reference to the mapped Ensembl record.
- Ensembl sequences that are absent from UniProtKB are imported into UniProtKB/TrEMBL. These entries are flagged as part of the proteome and have an Ensembl cross-reference.
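The two steps above can be sketched in a few lines of Python. This is only an illustration, not the actual UniProt/Ensembl implementation: the `Entry` class, the `run_pipeline` function, and the `NEW_` accession prefix are hypothetical names invented for this example. It assumes that "100% identity over 100% of the length" is equivalent to exact string equality of the two sequences, and it compares only one sequence per entry.

```python
from dataclasses import dataclass, field

PROTEOME_ID = "UP000005640"  # human proteome identifier quoted in the text


@dataclass
class Entry:
    """Hypothetical, simplified stand-in for a UniProtKB entry."""
    accession: str
    sequence: str
    database: str  # "Swiss-Prot" or "TrEMBL"
    proteomes: set = field(default_factory=set)
    ensembl_xrefs: set = field(default_factory=set)


def run_pipeline(ensembl, uniprot):
    """Map Ensembl sequences onto entries, importing the unmatched ones.

    ensembl: dict mapping an Ensembl protein ID to its sequence.
    uniprot: list of Entry objects; it is modified in place and returned.
    """
    # Index existing entries by sequence so exact matches are cheap to find.
    by_sequence = {}
    for entry in uniprot:
        by_sequence.setdefault(entry.sequence, []).append(entry)

    for ensembl_id, sequence in ensembl.items():
        matches = by_sequence.get(sequence)
        if matches:
            # Step 1: identical sequence found, so flag and cross-reference.
            for entry in matches:
                entry.proteomes.add(PROTEOME_ID)
                entry.ensembl_xrefs.add(ensembl_id)
        else:
            # Step 2: sequence absent, so import it as an unreviewed entry.
            new_entry = Entry(
                accession=f"NEW_{ensembl_id}",  # placeholder accession
                sequence=sequence,
                database="TrEMBL",
                proteomes={PROTEOME_ID},
                ensembl_xrefs={ensembl_id},
            )
            uniprot.append(new_entry)
            by_sequence[sequence] = [new_entry]
    return uniprot
```

Entries that match no Ensembl sequence are left unflagged in this sketch.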
The UniProtKB H. sapiens proteome thus includes the reviewed sequences from UniProtKB/Swiss-Prot (equivalent to an updated version of the H. sapiens proteome completed in 2008), supplemented by unreviewed sequences from UniProtKB/TrEMBL. The latter may represent additional predicted isoform sequences, but may also add redundancy.
Remark: The number of entries in the human proteome may vary from one release to the next, especially in the manually reviewed set. This is due to our continuous manual updates as new information becomes available. We regularly have to merge entries that were originally thought to be encoded by two separate genes but later turned out to be encoded by a single gene. An entry can also be deleted when there is increasing evidence that it is an erroneous translation derived from a pseudogene. We keep dubious sequences in UniProtKB until there is enough evidence to decide whether we should delete them (see Why do we keep dubious sequences in UniProtKB? How to discard them from a protein set?).
Access to human sequence sets
Our FTP server allows you to download expanded FASTA sets, containing both the canonical and the manually reviewed isoform sequences, for a selection of the most widely used proteomes, including human.
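Once such an expanded FASTA set has been downloaded, the canonical and isoform sequences can be told apart from the header lines. The sketch below assumes the usual UniProtKB FASTA header layout, `>db|accession|entry_name ...`, in which isoform accessions carry a hyphenated suffix (for example `-2`). The function name `split_expanded_fasta` is hypothetical, and the accessions in the usage example are made up for illustration.

```python
def split_expanded_fasta(lines):
    """Return (canonical, isoform) accession lists from FASTA lines.

    Assumes header lines of the form '>db|accession|entry_name ...' and
    that isoform accessions contain a hyphen, as in 'P00000-2'.
    """
    canonical, isoforms = [], []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split("|")
        if len(parts) < 3:
            continue  # header does not follow the assumed layout
        accession = parts[1]
        if "-" in accession:
            isoforms.append(accession)
        else:
            canonical.append(accession)
    return canonical, isoforms


# Illustrative input; these accessions and entry names are invented.
example = [
    ">sp|P00000|EXMPL_HUMAN Example protein OS=Homo sapiens",
    "MKTAYIAK",
    ">sp|P00000-2|EXMPL_HUMAN Isoform 2 of Example protein OS=Homo sapiens",
    "MKTAK",
]
canonical_ids, isoform_ids = split_expanded_fasta(example)
```

For a real file, the same function can be applied to an open file handle, since it only iterates over lines.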
Below are queries to retrieve different human sequence sets. In order to download the query results, please read How to retrieve sets of UniProtKB protein sequences?
You can retrieve:
1) approximately 20,000 human protein-coding genes represented by the canonical protein sequence in UniProtKB/Swiss-Prot:
Note: Some of the human entries in UniProtKB/Swiss-Prot are not included in the proteome because their protein sequences do not map to the reference genome:
2) additional manually reviewed isoform sequences produced by the protein-coding genes described in UniProtKB/Swiss-Prot. There are currently around 15,000 such additional isoform sequences. These are downloadable in FASTA format together with the canonical sequences:
3) additional predicted and unreviewed sequences in UniProtKB/TrEMBL, flagged to be part of the proteome, which may correspond to novel isoform sequences for genes present in UniProtKB/Swiss-Prot (derived from the Ensembl pipeline):
Note: Additional human sequences in UniProtKB/TrEMBL are not flagged to be part of the proteome. The vast majority of these additional UniProtKB/TrEMBL entries contain sequences which are not identical to any isoform sequences predicted by Ensembl. They might represent other alternative isoforms. The system is not perfect, but it is the best we can provide for the time being.
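As a rough illustration of how the three sets above could be requested programmatically, the sketch below only builds query URLs and does not contact any server. The base URL, the `proteome` and `reviewed` query fields, and the `includeIsoform` parameter are assumptions based on the UniProt website REST interface as it exists at the time of writing. They postdate this page and may change, so they should be checked against the current UniProt documentation before use. The function name `build_proteome_url` is hypothetical.

```python
from urllib.parse import urlencode

PROTEOME_ID = "UP000005640"  # human proteome identifier quoted in the text
# Assumption: streaming endpoint of the current UniProt REST interface.
BASE_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_proteome_url(reviewed, include_isoforms=False):
    """Build a FASTA download URL for part of the human proteome.

    reviewed=True selects UniProtKB/Swiss-Prot entries in the proteome;
    reviewed=False selects UniProtKB/TrEMBL entries in the proteome.
    include_isoforms=True asks for reviewed isoform sequences as well.
    """
    status = "true" if reviewed else "false"
    params = {
        "query": f"proteome:{PROTEOME_ID} AND reviewed:{status}",
        "format": "fasta",
    }
    if include_isoforms:
        params["includeIsoform"] = "true"  # assumed parameter name
    return f"{BASE_URL}?{urlencode(params)}"


canonical_url = build_proteome_url(reviewed=True)            # set 1
expanded_url = build_proteome_url(True, include_isoforms=True)  # sets 1 and 2
unreviewed_url = build_proteome_url(reviewed=False)          # set 3
```

Building the URL separately from fetching it makes the query easy to inspect before any large download is started.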