What is the human complete proteome?
Last modified March 6, 2013
In 2008, a draft of the complete human proteome was released in UniProtKB/Swiss-Prot: each of the approximately 20,000 putative human protein-coding genes was represented by one UniProtKB/Swiss-Prot
entry, tagged with the keyword 'Complete proteome'.
This manually reviewed UniProtKB/Swiss-Prot complete H. sapiens proteome can be considered complete in the sense that it contains one
representative (canonical) sequence for each currently known human gene. Close to 40% of these 20,000 entries contain manually annotated alternative isoforms,
representing over 15,000 additional sequences (see What is the canonical sequence? Are all isoforms described in one entry?).
In 2011, a complementary pipeline for importing predicted human protein sequences into UniProtKB/TrEMBL was developed in collaboration with Ensembl, in order to complete the set of human isoform sequences produced by genes present in UniProtKB/Swiss-Prot.
This pipeline works in the following way:
- Ensembl sequences are first mapped to their UniProtKB
counterparts under stringent conditions, requiring 100% identity over 100% of
the length of the two sequences. The UniProtKB entries identified in this way are
tagged with the keyword
'Complete proteome' and obtain a cross-reference to the mapped Ensembl record.
- Ensembl sequences that are absent from UniProtKB are
imported into UniProtKB/TrEMBL. These entries are tagged with the keyword
'Complete proteome' and have an Ensembl cross-reference.
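The two steps above can be sketched in a few lines of Python. This is only an illustration, not the actual UniProt/Ensembl pipeline: the function name, the dictionary inputs and the identifiers are hypothetical, and the sketch assumes that "100% identity over 100% of the length of the two sequences" amounts to exact string equality of the two amino acid sequences.

```python
def classify_ensembl_sequences(ensembl, uniprot):
    """Split Ensembl sequences into 'mapped' and 'to import' sets.

    ensembl: dict mapping a (hypothetical) Ensembl protein ID to its sequence
    uniprot: dict mapping a (hypothetical) UniProtKB accession to its sequence
    """
    # Index UniProtKB accessions by sequence so each lookup is a dict access.
    accessions_by_sequence = {}
    for accession, sequence in uniprot.items():
        accessions_by_sequence.setdefault(sequence, []).append(accession)

    mapped = {}      # Ensembl ID -> UniProtKB accessions with an identical sequence
    to_import = []   # Ensembl IDs with no identical UniProtKB sequence

    for ensembl_id, sequence in ensembl.items():
        # Assumption: full-length 100% identity is modelled as string equality.
        matches = accessions_by_sequence.get(sequence)
        if matches:
            # Step 1: the matching entries would be tagged and cross-referenced.
            mapped[ensembl_id] = matches
        else:
            # Step 2: the sequence would be imported into UniProtKB/TrEMBL.
            to_import.append(ensembl_id)

    return mapped, to_import
```

Note that a sequence differing by a single residue, or by one residue in length, falls into the second group under this criterion, which is what makes the mapping stringent.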
This UniProtKB complete H. sapiens proteome thus includes the reviewed sequences from UniProtKB/Swiss-Prot (equivalent to an updated version of the complete H. sapiens proteome released in 2008), supplemented by unreviewed sequences from UniProtKB/TrEMBL, which may represent additional predicted isoform sequences.
Remark: The number of entries in the human complete proteome may vary from one release to the next, especially in the manually reviewed set. This is due to our continuous manual updates as new information becomes available. On a regular basis, we have to merge entries that were originally thought to be encoded by two separate genes but later turned out to be encoded by a single gene. An entry can also be deleted when there is increasing evidence that it is an erroneous translation derived from a pseudogene. We keep dubious sequences in UniProtKB until there is enough evidence to decide whether we should delete them (see Why do we keep dubious sequences in UniProtKB? How to discard them from a protein set?).
Access to human sequence sets
Our FTP server allows you to download expanded FASTA sets, containing both the canonical and the manually reviewed isoform sequences, for a selection of the most widely used complete proteomes, including human.
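Once such an expanded FASTA set has been downloaded, the canonical and isoform sequences can be separated again. The following is a minimal sketch, assuming that headers follow the usual UniProtKB layout `>db|ACCESSION|ENTRY_NAME description` and that isoform accessions carry a numeric suffix such as `-2`. The function names and the example accessions are hypothetical.

```python
import re

# Assumption: isoform accessions end with a hyphen and a number, e.g. "P00001-2".
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file or file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_canonical_and_isoforms(handle):
    """Return two dicts (accession -> sequence): canonical and isoform sequences."""
    canonical, isoforms = {}, {}
    for header, sequence in read_fasta(handle):
        fields = header.split("|")
        # Fall back to the first word if the header is not pipe-delimited.
        accession = fields[1] if len(fields) >= 3 else header.split()[0]
        target = isoforms if ISOFORM_SUFFIX.search(accession) else canonical
        target[accession] = sequence
    return canonical, isoforms
```

A file opened with `open(path)` can be passed directly as `handle`; sequences are read one line at a time, but both result dictionaries are held in memory.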
Below are queries to retrieve different human sequence sets. In order to download the query results, please read How to retrieve sets of UniProtKB protein sequences?
You can retrieve:
approximately 20,000 human protein-coding genes represented by the canonical protein sequence in UniProtKB/Swiss-Prot:
Note: Some of the human entries in UniProtKB/Swiss-Prot are not tagged with the keyword
'Complete proteome' because their protein sequences do not map to the reference genome:
additional manually reviewed isoform sequences produced by the protein-coding genes described in UniProtKB/Swiss-Prot. There are currently around 15,000 such additional isoform sequences. These are downloadable in FASTA format together with the canonical sequences:
additional predicted and unreviewed sequences in UniProtKB/TrEMBL, tagged with the keyword
'Complete proteome', which may correspond to novel isoform sequences for genes present in UniProtKB/Swiss-Prot (derived from the Ensembl pipeline):
Note: Additional human sequences in UniProtKB/TrEMBL are not tagged with the keyword
'Complete proteome'. The vast majority of these additional UniProtKB/TrEMBL entries contain sequences that are not identical to any isoform sequence predicted by Ensembl. They might represent other alternative isoforms. The system is not perfect, but it is the best we can provide for the time being.