UniProt release 15.11
Published November 24, 2009
Why do we keep dubious sequences in UniProtKB? How can they be discarded from a protein set?
More than 99% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources. These CDS are generated either by applying gene prediction programs to genomic DNA sequences or by the hypothetical translation of cloned cDNAs (see FAQ 37). These methods provide varying degrees of support for the existence of a protein, which in some cases is supplemented by other types of evidence (such as mass spectrometry data or direct protein sequencing).
In July 2007, a new topic called 'Protein existence' (PE) was introduced into UniProtKB to indicate the evidence for the existence of a given protein. Five levels of evidence have been defined: 1. evidence at protein level (e.g. clear identification by mass spectrometry), 2. evidence at transcript level (e.g. the existence of a putative coding cDNA), 3. inferred by homology (a predicted protein which has been assigned membership of a defined protein family in UniProtKB), 4. predicted (a predicted protein which has not yet been assigned membership of a defined protein family in UniProtKB) and 5. uncertain (e.g. dubious sequences, such as those derived from the erroneous translation of a pseudogene or non-coding RNA). Currently in UniProtKB/Swiss-Prot, the vast majority of entries (71%) fall in the PE3 category. PE1 and PE2 each represent approximately 13% of the total number of entries, PE4 3% and PE5 only 0.3%.
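The five levels can also be read directly from a downloaded flat-file entry. The sketch below is illustrative only: it assumes the entry carries a line of the shape `PE   5: Uncertain;`, parses only the numeric level, and uses labels that paraphrase the list above rather than the exact flat-file wording. The function names `pe_level` and `drop_uncertain` are hypothetical.

```python
import re

# Labels paraphrase the five levels described in the text; the exact wording
# used in the flat file is not relied upon, only the numeric level is parsed.
PE_LEVELS = {
    1: "evidence at protein level",
    2: "evidence at transcript level",
    3: "inferred by homology",
    4: "predicted",
    5: "uncertain",
}

# Assumed line shape: "PE   5: Uncertain;"
PE_LINE = re.compile(r"^PE\s+([1-5]):")


def pe_level(entry_text):
    """Return the PE level (1-5) of one flat-file entry, or None if no PE line is found."""
    for line in entry_text.splitlines():
        match = PE_LINE.match(line)
        if match:
            return int(match.group(1))
    return None


def drop_uncertain(entries):
    """Keep every entry whose PE level is not 5 (entries without a PE line are kept)."""
    return [entry for entry in entries if pe_level(entry) != 5]
```

Keeping entries without a PE line is a design choice made for this sketch; a stricter filter could discard them instead.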
Entries assigned an existence level of 5 (PE5) are also tagged with the term "Putative" in the 'Protein names' section (see for example the "Putative annexin A2-like protein") and, in the 'General annotation (Comments)' section, carry a 'Caution' subsection warning the user of a possible problem. The caution subsections accompanying a PE5 entry are usually of the type: "Could be the product of a pseudogene", "Product of a dubious CDS prediction" or "Product of a dubious gene prediction".
The PE section is included in the UniProtKB search engine. It is thus possible to retrieve all entries corresponding to a defined PE level, and thereby to exclude all PE5 proteins. For human proteins this can be achieved by searching for: (organism:"Homo sapiens (Human)" AND reviewed:yes) NOT existence:uncertain. This search retrieves 19,835 entries, indicating that "uncertain" proteins represent 2.4% of the total human entries, compared with only 0.3% of all UniProtKB/Swiss-Prot entries. The higher proportion of sequences identified as uncertain or dubious in Homo sapiens may be a product of the continuous manual curation and review of these sequences by groups of the CCDS consortium, such as HAVANA, as well as by UniProt curators.
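As a rough illustration, the query string and the arithmetic behind the quoted figures can be sketched in Python. The helper names are hypothetical, no particular web endpoint is assumed (the query is only URL-encoded), and the estimate is approximate because the 2.4% figure is rounded.

```python
from urllib.parse import quote


def human_reviewed_query(exclude_uncertain=True):
    """Build the search string quoted in the text, optionally excluding PE5 entries."""
    query = '(organism:"Homo sapiens (Human)" AND reviewed:yes)'
    if exclude_uncertain:
        query += " NOT existence:uncertain"
    return query


def estimate_uncertain(retained, uncertain_fraction):
    """Estimate total and uncertain entry counts from the retained count.

    retained           -- number of entries left after excluding PE5
    uncertain_fraction -- share of PE5 entries in the full set (e.g. 0.024)
    """
    total = retained / (1.0 - uncertain_fraction)
    return total, total - retained


query = human_reviewed_query()
encoded = quote(query)  # percent-encoded form, suitable for use in a URL
total, uncertain = estimate_uncertain(19835, 0.024)
```

With the figures quoted above, this gives roughly 20,300 reviewed human entries in total, of which about 490 are uncertain.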
One may ask: why not delete PE5 sequences from UniProtKB and provide only the most reliable sequences? As stated above, UniProt is continuously reviewing all protein sequences. This process can result both in the removal of some PE5 entries (for instance, when the evidence of pseudogenization is overwhelming) and in the upgrade of others. Examples of upgraded entries are the putative E. coli pseudogene ymiA, which has now been found to produce a protein product and has acquired a PE of 1, and the human mitochondrial ATP synthase subunit epsilon-like protein. However, many putative pseudogene sequences may be expected to remain in UniProtKB for some time, as it can be difficult to prove the non-existence of a protein, and for certain loci some doubts may always persist. To give our users the opportunity to work on the most complete protein set, we have chosen to keep all PE5 sequences with the appropriate 'Caution' comments, leaving to the users the final decision whether to retrieve them or not (using the exclusion mechanism described above). Note that sequences which are removed from UniProtKB can subsequently be retrieved from the UniParc archive if so desired.
Finally, please remember that the PE assignment is made at the level of the UniProtKB entry and not at the level of individual isoform sequences; hence, dubious alternative isoform sequences cannot be excluded from a protein set by the UniProtKB search engine. However, comments about the evidence supporting the existence of any given isoform can be found in the 'Note:' for that isoform in the 'Alternative products' section (which lists all protein isoforms for each entry). For instance, isoforms that have been identified only once through large-scale sequencing are tagged with the comment "No experimental confirmation available". Note that UniProt may include isoforms that contain retained introns (as these may be physiologically relevant) as well as isoforms that contain a premature stop codon and thus could be targets for nonsense-mediated mRNA decay (NMD). The mechanism of NMD involves a first round of translation before the premature stop codon is detected (often referred to as "pioneer translation"), so at least one protein molecule is synthesized from each NMD target mRNA. In addition, some of the predicted NMD targets appear to be the most abundant isoforms in certain tissues (see for instance the human GABA-B receptor 1 isoform 1E).
Cross-references to OrthoDB
Cross-references have been added to OrthoDB, a database of orthologous groups. OrthoDB presents a catalog of eukaryotic orthologous protein-coding genes. Orthology refers to the last common ancestor of the species under consideration, and thus OrthoDB explicitly delineates orthologs at each radiation along the species phylogeny.
OrthoDB is available at http://cegg.unige.ch/orthodb.
The format of the explicit links in the flat file is:
Resource identifier: OrthoDB cluster number.
Example (entry P00915): DR   OrthoDB; EOG90KBJT; -.
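A cross-reference line of this shape can be split into its resource name and fields with a few lines of Python. This is a minimal sketch, assuming the line starts with the `DR` line code, separates fields with semicolons and ends with a period; `parse_dr_line` is a hypothetical helper, not part of any UniProt tool.

```python
def parse_dr_line(line):
    """Split a line such as 'DR   OrthoDB; EOG90KBJT; -.' into (resource, fields)."""
    if not line.startswith("DR"):
        raise ValueError("not a DR (cross-reference) line: %r" % line)
    # Drop the line code, surrounding whitespace and the trailing period.
    body = line[2:].strip().rstrip(".")
    parts = [part.strip() for part in body.split(";")]
    return parts[0], parts[1:]


resource, fields = parse_dr_line("DR   OrthoDB; EOG90KBJT; -.")
cluster_number = fields[0]  # the resource identifier field
```

The same function applies to other cross-references that follow the same layout, since only the resource name and the meaning of the identifier differ.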
Cross-references to PhylomeDB
Cross-references have been added to PhylomeDB, a database for complete collections of gene phylogenies. PhylomeDB allows users to interactively explore the evolutionary history of genes through the visualization of phylogenetic trees and multiple sequence alignments.
PhylomeDB is available at http://phylomedb.org/.
The format of the explicit links in the flat file is:
Resource identifier: UniProtKB accession number.
Example (entry Q8GTR4): DR   PhylomeDB; Q8GTR4; -.
Changes concerning keywords
- Phorbol-ester binding
- Plant toxin
Changes in subcellular location controlled vocabulary
New subcellular locations:
- Host synapse
- Target cell membrane
- Target membrane