Why do we keep dubious sequences in UniProtKB? How to discard them from a protein set?
Last modified September 21, 2011
About 98% of the protein sequences provided by UniProtKB come from the translations of coding sequences (CDS) submitted to the EMBL-Bank/GenBank/DDBJ nucleotide sequence resources. These CDS are generated either by applying gene prediction programs to genomic DNA sequences or by translating cDNAs (see Where do the UniProtKB protein sequences come from?).
The 'Protein existence' subsection of the 'Protein attributes' section indicates the evidence for the existence of a given protein. Five levels of evidence have been defined:
- PE1: evidence at protein level (e.g. clear identification by mass spectrometry)
- PE2: evidence at transcript level (e.g. the existence of cDNA)
- PE3: inferred by homology (a predicted protein which has been assigned membership of a defined protein family in UniProtKB)
- PE4: predicted (a predicted protein which has not yet been assigned membership of a defined protein family in UniProtKB)
- PE5: uncertain (e.g. dubious sequences, such as those derived from the erroneous translation of a pseudogene or non-coding RNA).
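The five levels above can be read programmatically from a downloaded entry. The sketch below assumes the UniProtKB flat-file (text) format, in which the level appears on a line such as "PE   5: Uncertain;". The names PE_LEVELS, pe_level and is_dubious are hypothetical helpers introduced here for illustration and are not part of any UniProt tool.

```python
import re
from typing import Optional

# Labels follow the five levels listed above.
PE_LEVELS = {
    1: "evidence at protein level",
    2: "evidence at transcript level",
    3: "inferred by homology",
    4: "predicted",
    5: "uncertain",
}

# Assumption: flat-file PE lines look like "PE   5: Uncertain;".
_PE_LINE = re.compile(r"^PE\s+([1-5]):")


def pe_level(flat_file_entry: str) -> Optional[int]:
    """Return the PE level of a flat-file entry, or None if no PE line is found."""
    for line in flat_file_entry.splitlines():
        match = _PE_LINE.match(line)
        if match:
            return int(match.group(1))
    return None


def is_dubious(flat_file_entry: str) -> bool:
    """True when the entry is annotated as PE5 (uncertain)."""
    return pe_level(flat_file_entry) == 5
```

An entry without a PE line yields None rather than a guess, so callers can decide how to treat incomplete records.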
One may ask: why not delete PE5 sequences from UniProtKB and provide only the most reliable sequences? UniProtKB continuously reviews all protein sequences. This process can result both in the removal of some PE5 entries (for instance, when the evidence of pseudogenization is overwhelming) and in the upgrade of other PE5 entries (such as the putative E. coli pseudogene ymiA, which has since been found to produce a protein product and has now acquired a PE of 1).
However, many putative pseudogene sequences may be expected to remain in UniProtKB for some time, as it can be difficult to prove the non-existence of a protein, and for certain loci some doubts may always persist. To give our users the opportunity to work on the most complete protein set, we have chosen to keep all PE5 sequences with an appropriate 'Caution' comment in the General annotation section, leaving the final decision on whether to retrieve them to the user. The 'Caution' subsections accompanying a PE5 entry are usually of the type: "Could be the product of a pseudogene", "Product of a dubious CDS prediction" or "Product of a dubious gene prediction".
The PE subsection is included in the UniProtKB search engine. It is thus possible to retrieve all entries corresponding to a defined PE level, and thereby to exclude all PE5 proteins. For human proteins, for instance, the following query retrieves all reviewed entries except those at level PE5: (organism:"Homo sapiens (Human)" AND reviewed:yes) NOT existence:uncertain
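For other organisms, the same query can be assembled in a small script. The sketch below only builds the query string, using the field names shown in the example above (organism, reviewed, existence); it makes no network call, and the function name build_pe_query is a hypothetical helper. The field names accepted by the search engine may differ between website releases, so the output should be checked against the current query syntax before use.

```python
def build_pe_query(organism: str,
                   reviewed_only: bool = True,
                   exclude_uncertain: bool = True) -> str:
    """Build a search query that optionally drops PE5 (uncertain) entries.

    Assumption: the field names mirror the example query in the text.
    """
    clauses = [f'organism:"{organism.strip()}"']
    if reviewed_only:
        clauses.append("reviewed:yes")
    query = "(" + " AND ".join(clauses) + ")"
    if exclude_uncertain:
        query += " NOT existence:uncertain"
    return query


print(build_pe_query("Homo sapiens (Human)"))
```

Stripping whitespace from the organism name avoids a stray space inside the quotes, which could prevent an exact match.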
The PE assignment is made at the level of the UniProtKB entry and not at the level of individual isoform sequences; hence, dubious alternative isoform sequences cannot be excluded from a protein set by the UniProtKB search engine.
However, comments about the evidence supporting the existence of any given isoform can be found in the 'Note:' for that isoform in the 'Alternative products' section (which lists all protein isoforms for each entry). For instance, isoforms that have been identified only once through large-scale sequencing are tagged with the comment "No experimental confirmation available".
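Since the search engine cannot filter at the isoform level, such a filter has to be applied after retrieval. The sketch below assumes the isoform names and their notes have already been extracted from the 'Alternative products' section into simple records; the Isoform type, the example records and the function confirmed_isoforms are hypothetical and only illustrate the idea.

```python
from typing import Iterable, List, NamedTuple


class Isoform(NamedTuple):
    """Hypothetical record: an isoform name and the text of its 'Note:'."""
    name: str
    note: str


# Comment quoted in the text for isoforms seen only once in large-scale sequencing.
UNCONFIRMED = "no experimental confirmation available"


def confirmed_isoforms(isoforms: Iterable[Isoform]) -> List[Isoform]:
    """Keep isoforms whose note does not carry the 'unconfirmed' comment."""
    return [iso for iso in isoforms if UNCONFIRMED not in iso.note.lower()]


# Made-up example records, not taken from a real entry.
example = [
    Isoform("1", ""),
    Isoform("2", "No experimental confirmation available."),
]
print([iso.name for iso in confirmed_isoforms(example)])
```

Matching on the lower-cased note keeps the filter tolerant of capitalization differences, but it remains a plain text match and will miss differently worded cautions.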
Note that UniProtKB may include isoforms that contain retained introns (as these may be physiologically relevant) as well as isoforms that contain a premature stop codon and thus could be the target for nonsense-mediated mRNA decay (NMD). The mechanism of NMD involves a first round of translation before the premature stop codon is detected (often referred to as "pioneer translation"), and so at least one protein is synthesized from each NMD target mRNA. In addition, some of the predicted NMD targets appear to be the most abundant isoforms in certain tissues (see for instance the human GABA-B receptor 1 isoform 1E).
- Where do the UniProtKB protein sequences come from?
- What are UniProtKB's criteria for defining a CDS as a protein?
- Does UniProtKB contain all protein sequences?
- Why have some UniProtKB accession numbers been deleted? How can I track them?
- Protein existence
- Headlines: Why do we keep dubious sequences in UniProtKB?