What are complete proteomes?
Last modified March 21, 2012
UniProt provides 'complete proteome' sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced.
What is a complete proteome?
A complete proteome is the entire set of proteins expressed by a specific organism. The majority of the UniProt complete proteomes are based on the translation of a completely sequenced genome, and will normally include sequences that derive from extra-chromosomal elements such as plasmids or organellar genomes in organisms where these occur. Some complete proteomes may also include protein sequences based on high-quality cDNAs that cannot be mapped to the current genome assembly because of sequencing errors or gaps. These sequences are included in the complete proteome only after manual review of the supporting evidence, including careful analysis of homologous sequences from closely related organisms.
What is the curation status of UniProt complete proteomes?
UniProt complete proteomes may include both manually reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries. The proportion of reviewed entries varies between proteomes and is greater for the proteomes of intensively curated model organisms: some complete proteomes, such as those of Saccharomyces cerevisiae 288C and Escherichia coli strain K12, consist entirely of reviewed entries. Curation is a continuing process, and complete proteomes are updated regularly as new information becomes available: pseudogenes and other dubious uncharacterized ORFs may be removed, and newly identified and characterized sequences may be added.
What is the source of the sequences for complete proteomes?
The majority of UniProt complete proteomes are based on translations of genome sequence submissions to the International Nucleotide Sequence Database Consortium (INSDC). One subset of INSDC data is the Whole Genome Shotgun (WGS) data. WGS data is currently not used to produce complete proteomes for bacteria and archaea, but it may be used to produce complete proteomes for other taxa, such as fungi and metazoa.
A complementary pipeline for importing protein sequences has been developed in collaboration with Ensembl. It provides proteome sequences for a number of key genomes of special interest where the INSDC submission lacks gene model annotation. As this pipeline covers organisms for which we already have some sequences in UniProtKB, these existing sequences have to be reconciled with the imported ones. The procedure works in the following way:
- Ensembl sequences are first mapped to their UniProtKB counterparts under stringent conditions, requiring 100% identity over 100% of the length of the two sequences. These entries are tagged with the keyword 'Complete proteome' and updated with an Ensembl cross-reference.
- Ensembl sequences that are absent from UniProtKB are imported into UniProtKB/TrEMBL. These entries are tagged with the keyword 'Complete proteome' and have an Ensembl cross-reference.
- All other UniProtKB/Swiss-Prot entries within the proteome that do not map to Ensembl are tagged with the keyword 'Complete proteome'.
Therefore, a complete proteome is formed from all UniProtKB/Swiss-Prot entries (irrespective of whether they map to Ensembl) plus those UniProtKB/TrEMBL entries mapping to Ensembl for that proteome.
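The reconciliation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual UniProt/Ensembl pipeline: the `Entry` class, the function name and all identifiers in the usage example are hypothetical, and the "100% identity over 100% of the length" criterion is modelled as plain string equality of the two sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical, minimal stand-in for a UniProtKB entry."""
    accession: str
    sequence: str
    reviewed: bool  # True: UniProtKB/Swiss-Prot, False: UniProtKB/TrEMBL


def build_complete_proteome(uniprot_entries, ensembl_sequences):
    """Return (tagged, to_import).

    tagged    -- accessions of existing entries that would receive the
                 'Complete proteome' keyword
    to_import -- Ensembl identifiers whose sequences are absent from
                 UniProtKB and would be imported into UniProtKB/TrEMBL
    """
    # 100% identity over 100% of the length means the sequences are equal,
    # so existing entries can be indexed by their exact sequence.
    by_sequence = {}
    for entry in uniprot_entries:
        by_sequence.setdefault(entry.sequence, []).append(entry)

    tagged = set()
    to_import = []
    for ensembl_id, sequence in ensembl_sequences.items():
        matches = by_sequence.get(sequence)
        if matches:
            # Step 1: exact match, tag the existing entry (or entries).
            tagged.update(e.accession for e in matches)
        else:
            # Step 2: no counterpart, import as a new TrEMBL entry.
            to_import.append(ensembl_id)

    # Step 3: reviewed entries are tagged whether or not they map to Ensembl.
    tagged.update(e.accession for e in uniprot_entries if e.reviewed)
    return tagged, sorted(to_import)


# Usage with made-up accessions and sequences:
entries = [
    Entry("P00001", "MKT", True),     # reviewed, maps to Ensembl
    Entry("Q00001", "MAAA", False),   # unreviewed, maps to Ensembl
    Entry("Q00002", "MCCC", False),   # unreviewed, no Ensembl match: excluded
    Entry("P00002", "MDDD", True),    # reviewed, no Ensembl match: included
]
ensembl = {"ENSP_A": "MKT", "ENSP_B": "MAAA", "ENSP_C": "MEEE"}
tagged, to_import = build_complete_proteome(entries, ensembl)
```

In this sketch an unreviewed entry that does not match any Ensembl sequence stays outside the proteome, which mirrors the summary given above.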
To date, this pipeline has been used to populate UniProtKB with additional sequences for the human and mouse proteomes (see headline Complete proteomes for Homo sapiens and Mus musculus) and for many other vertebrates.
See also: Where do the UniProtKB protein sequences come from?
How to retrieve complete proteomes?
Complete proteomes for specific taxa can be retrieved by searching for the taxonomic identifier in the organism field together with the keyword 'Complete proteome'. For example, to retrieve the complete proteome for Escherichia coli (strain K12), which has the taxonomic identifier 83333, the required query would be:
The taxonomic identifier can also be used to query the taxonomy field rather than the organism field. This retrieves all complete proteome sequences at or below the taxonomic rank specified by the identifier. For example, to retrieve the complete proteome for Escherichia coli (strain K12) and all complete proteomes at lower taxonomic nodes (substrains such as Escherichia coli (strain K12 / DH10B)), the required query would be:
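The query strings themselves are not reproduced in this text. As a rough sketch of the two forms described above, the following Python helper builds one query that searches the organism field and one that searches the taxonomy field. The exact syntax (`field:value` terms joined by `AND`, with the keyword quoted) and the function name `proteome_query` are assumptions made for illustration; check the query builder on the website for the authoritative form.

```python
def proteome_query(taxon_id: int, include_descendants: bool = False) -> str:
    """Build a complete-proteome query string (assumed syntax).

    include_descendants=False searches the organism field (that taxon only);
    include_descendants=True searches the taxonomy field (that taxon and
    everything below it).
    """
    field = "taxonomy" if include_descendants else "organism"
    return f'{field}:{taxon_id} AND keyword:"Complete proteome"'


# Escherichia coli (strain K12) has the taxonomic identifier 83333.
print(proteome_query(83333))
print(proteome_query(83333, include_descendants=True))
```

The only difference between the two queries is the field name, which is what determines whether substrains are included.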
How can I download complete proteomes?
Expanded FASTA sets, containing both the canonical and manually reviewed isoform sequences, can be downloaded from our FTP server for a selection of the most widely used complete proteomes.
To download the results of a query:
- Click the orange Download button
- Choose the download format
To download your favorite sets programmatically, please go to the section Downloading data at every UniProt release of our FAQ about programmatic access, where you will find a code example that illustrates how to download the complete proteome sets for all organisms below a given taxonomic node in FASTA format.
Note that the download formats that describe complete UniProtKB entries (flat text, XML, RDF/XML) include only the 'canonical' or displayed protein sequences of UniProtKB entries. These canonical sequences can also be downloaded in FASTA format (option Canonical sequence data in FASTA format), as can a set of protein sequences including both canonical and manually reviewed 'isoform sequences' from UniProtKB/Swiss-Prot (where available) using the option Canonical and isoform sequence data in FASTA format.
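When working with a FASTA file that contains both canonical and isoform sequences, it can be useful to separate the two. The sketch below assumes the usual UniProtKB FASTA header layout, `>db|ACCESSION|ENTRY_NAME description`, and assumes that isoform records carry a hyphenated accession such as `P00001-2`. The function name and the sample records are hypothetical.

```python
def split_canonical_and_isoforms(fasta_text):
    """Split FASTA text into two dicts mapping accession -> sequence.

    Records whose accession contains a hyphen (e.g. 'P00001-2') are treated
    as isoform sequences; all others are treated as canonical sequences.
    """
    canonical, isoforms = {}, {}
    accession, target = None, None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumed header layout: >db|ACCESSION|ENTRY_NAME description
            parts = line[1:].split("|")
            accession = parts[1] if len(parts) >= 3 else line[1:].split()[0]
            target = isoforms if "-" in accession else canonical
            target[accession] = ""
        elif accession is not None:
            # Sequence lines may be wrapped, so append to the current record.
            target[accession] += line
    return canonical, isoforms


# Usage with made-up records:
sample = (
    ">sp|P00001|ABC_HUMAN Example protein\n"
    "MKT\n"
    "AAA\n"
    ">sp|P00001-2|ABC_HUMAN Isoform 2 of Example protein\n"
    "MKA\n"
)
canonical, isoforms = split_canonical_and_isoforms(sample)
```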
See also:
- What is the canonical sequence? Are all isoforms described in one entry?
- What are reference proteomes?
- What is the human complete proteome?
- How to retrieve sets of UniProtKB protein sequences?
- How can I access resources on this web site programmatically?
- Sequences
- Alternative products
- Alternative sequence
