UniProt release 15.10
Published November 3, 2009
What are UniProt 'Complete proteomes'? How to retrieve them?
The need for users to access and download complete proteomes is unquestionable, and the role of a database like UniProtKB is to meet this demand. The issue looks quite simple: there are more and more fully sequenced genomes. These genomes should contain at least minimal annotation, such as gene predictions, and translation of the predicted coding regions (CDSs) should provide a global perspective of the likely proteome of a given organism. In practice, the situation is more complex. New sequencing techniques are generating a flood of data, which are often left in the state in which they were produced, and databases have to deal with this ever-growing amount of data. The aim of this headline is to provide you with some tips on how we currently approach the problem, keeping in mind that the situation is rapidly evolving.
To give our users access to the proteomes of organisms whose genomes have been fully sequenced, we have created the 'Complete proteomes' pages. Currently the proteomes of 1'428 organisms are available from these pages: 60% are bacteria, 30% viruses, 5.5% eukaryota and 4.5% archaea. Note that the term 'organism' is used in a broad sense and also includes strains or subspecies. Indeed, each completely sequenced strain is assigned a separate taxonomic identifier and is processed like an independent organism. A striking example of this approach is provided by Escherichia coli, for which no fewer than 24 strain-specific proteomes can be downloaded separately.
A minority of the UniProt proteomes have been entirely manually reviewed and are found in UniProtKB/Swiss-Prot. These include 8 microbial proteomes (Methanocaldococcus jannaschii, 3 subspecies of Buchnera aphidicola, Escherichia coli (strain K12), Haemophilus influenzae, Mycoplasma genitalium and Mycoplasma pneumoniae) and 3 eukaryotic ones (Saccharomyces cerevisiae, Schizosaccharomyces pombe and, last but not least, Homo sapiens). These proteomes are as stable as new discoveries allow: new proteins may still be identified and will then have to be annotated.
However, most proteomes comprise 2 components, i.e. a manually reviewed protein set (Swiss-Prot) and an automatically annotated one (TrEMBL), which are automatically combined to generate a non-redundant proteome. The proportion of Swiss-Prot versus TrEMBL entries varies from one organism to another. For instance, 93% of the Bacillus subtilis proteome has been manually reviewed, while the reverse is true for Bacillus cereus, for which 93% of the proteome is only automatically annotated and found in the TrEMBL section of UniProtKB. Note that the B. subtilis proteome will be fully in the Swiss-Prot section by the end of the year.
A third category of proteomes exists for organisms whose genomes have submission/annotation problems that prevent the production of a non-redundant protein set or have problems regarding the gene model predictions. These proteomes can be downloaded from Integr8 using the direct link provided on the 'Complete proteomes' pages. This concerns 38 organisms, including some important model organisms, such as Danio rerio (Zebrafish) and Chlamydomonas reinhardtii.
To be included in the 'Complete proteomes' pages, an organism must have a completely sequenced genome, i.e. fully closed and exhibiting either good gene prediction models or good quality transcriptome/proteome data. That is why, for bacterial and archaeal genomes, whole-genome shotgun (WGS) and draft sequences are not included. However, we have to adapt to data availability: for fungi, WGS sequences are taken into consideration, as they are often the only ones available.
Another requirement is that all proteins in the set are mapped to the genome. The notable exception is the human proteome, which is as yet only partially mapped. It should be noted, however, that all human protein entries have been manually reviewed, thus ensuring they meet the UniProtKB/Swiss-Prot quality standards, and are continuously updated, allowing us to progressively increase the mapping to the genome (and to add many other interesting annotations).
All complete proteomes are available from the UniProt taxonomy resource. A direct link is provided from the UniProt homepage. In addition to providing the taxonomic information about a given species, these pages offer several options, such as the retrieval of all UniProtKB entries for a taxon (a set that may contain redundant entries) or the retrieval of the non-redundant complete proteome (see for example the Dictyostelium discoideum (Slime mold) page), including the proteomes provided by the Integr8 resource. For the 1'390 complete proteomes entirely stored in UniProtKB, all entries have been tagged with the keyword 'Complete proteome', which allows their easy retrieval directly from the database, bypassing the taxonomy pages.
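As an illustration of keyword-based retrieval, here is a minimal Python sketch. It assumes that entries have already been downloaded in UniProtKB flat-file format, where each line starts with a two-letter line code (AC for accessions, KW for keywords) and each entry ends with a '//' line. The function name and the sample entries, including their accession numbers, are made up for the example and do not correspond to real records.

```python
def entries_with_keyword(flat_file_text, keyword="Complete proteome"):
    """Yield the primary accession of each entry tagged with `keyword`.

    Assumes UniProtKB flat-file text: 'AC' and 'KW' line codes,
    ';'-separated values, entries terminated by a '//' line.
    """
    for raw_entry in flat_file_text.split("\n//"):
        accessions, keywords = [], []
        for line in raw_entry.splitlines():
            code, _, value = line.partition(" ")
            value = value.strip()
            if code == "AC":
                # AC lines look like 'AC   X00001; X00002;'
                accessions += [a.strip() for a in value.split(";") if a.strip()]
            elif code == "KW":
                # KW lines look like 'KW   Keyword one; Keyword two.'
                keywords += [k.strip() for k in value.rstrip(".").split(";") if k.strip()]
        if accessions and keyword in keywords:
            yield accessions[0]


# Hypothetical sample data, for illustration only.
SAMPLE = """ID   EXAMPLE1_ECOLI          Reviewed;         100 AA.
AC   X00001;
KW   Complete proteome; Cytoplasm.
//
ID   EXAMPLE2_ECOLI          Unreviewed;       200 AA.
AC   X00002; X00003;
KW   Membrane.
//
"""

if __name__ == "__main__":
    for accession in entries_with_keyword(SAMPLE):
        print(accession)
```

The sketch compares whole keywords rather than substrings, so an entry is only selected when 'Complete proteome' appears as a keyword in its own right.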
For complementary information, see FAQ.
If you have questions on that subject - or any other - do not hesitate to contact us.
Format change in the cross-references to OMA
The format of the cross-references to the OMA project has changed: the resource identifier, which was a UniProtKB accession number, has been replaced by an OMA group fingerprint. The optional information field 1 is now a dash '-'.
Previous format:
DR OMA; P39899; YANTHIA.
New format:
DR OMA; YANTHIA; -.
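Parsers that read these lines may need to accept both layouts during the transition. The following Python sketch is one possible way to do so, not an official parser: it assumes that a dash in the optional information field marks the new format, as described above, and the function name and the shape of the returned dictionary are choices made for this example.

```python
def parse_oma_xref(line):
    """Parse a 'DR OMA' line in either the previous or the new format.

    Returns a dict with the OMA group fingerprint and, for the previous
    format only, the UniProtKB accession (None for the new format).
    """
    fields = [f.strip() for f in line.strip().rstrip(".").split(";")]
    # The first field holds the line code and the resource name.
    if len(fields) != 3 or fields[0].split() != ["DR", "OMA"]:
        raise ValueError("not a DR OMA cross-reference: %r" % line)
    identifier, info = fields[1], fields[2]
    if info == "-":
        # New format: the identifier is the fingerprint, the field is a dash.
        return {"fingerprint": identifier, "accession": None}
    # Previous format: the identifier is an accession, the field is the fingerprint.
    return {"fingerprint": info, "accession": identifier}


if __name__ == "__main__":
    print(parse_oma_xref("DR OMA; P39899; YANTHIA."))
    print(parse_oma_xref("DR OMA; YANTHIA; -."))
```

Both example lines yield the same fingerprint, which lets downstream code treat entries from before and after the change uniformly.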
Changes concerning keywords
- Host cell inner membrane
- Host cell junction
- Host cell membrane
- Host cell outer membrane
- Host cell projection
- Host cytoplasm
- Host cytoplasmic vesicle
- Host cytoskeleton
- Host endoplasmic reticulum
- Host endosome
- Host Golgi apparatus
- Host lipid droplet
- Host lysosome
- Host membrane
- Host microsome
- Host mitochondrion
- Host mitochondrion inner membrane
- Host mitochondrion outer membrane
- Host nucleus
- Host periplasm
- Host thylakoid
Changes in subcellular location controlled vocabulary
New subcellular location:
- Host thylakoid