UniProt release 2011_04
Published April 5, 2011
The art of defining the unknown
When the human gene C22orf28 was predicted in 1998, its existence was supported by many cDNAs identified by large-scale cDNA sequencing projects. Although nothing was known about the protein it encoded, it was conserved in many species in all 3 kingdoms. A consensus pattern that was distinctive for all related sequences, from bacteria to mammals, was created in the PROSITE database. This allowed us to classify all these proteins into a single family, named ‘Uncharacterized Protein Family (UPF) 0027 (UPF0027)’ to indicate unambiguously the lack of functional data.
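To illustrate how a consensus pattern of this kind can be matched against sequences, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression. The pattern and the two sequences are hypothetical and chosen only for illustration; they are not the actual UPF0027 signature. The function name `prosite_to_regex` is our own, and the sketch covers only the basic syntax elements (single residues, `x`, `[...]`, `{...}`, repeat counts and the `<` / `>` anchors).

```python
import re

# One pattern element: optional '<' anchor, a residue / x / [set] / {exclusion},
# an optional repeat count such as (2) or (2,4), and an optional '>' anchor.
_ELEMENT = re.compile(
    r"(<?)([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)"
)

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = m.groups()
        if core == "x":                      # x means any residue
            core_re = "."
        elif core.startswith("{"):           # {ABC} means any residue except A, B, C
            core_re = "[^" + core[1:-1] + "]"
        else:                                # single residue or [ABC] set
            core_re = core
        count = "{" + repeat + "}" if repeat else ""
        parts.append(("^" if start else "") + core_re + count + ("$" if end else ""))
    return "".join(parts)

# Hypothetical pattern and sequences, for illustration only.
HYPOTHETICAL_PATTERN = "C-x(2,4)-[DE]-{P}-G"
regex = re.compile(prosite_to_regex(HYPOTHETICAL_PATTERN))

for seq in ("MKCAADAGL", "MKCAADPGL"):
    print(seq, "matches" if regex.search(seq) else "does not match")
```

A pattern is useful as a family signature when it matches every known member and no unrelated sequence, which is what allows the grouping described above.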
For several years, the only annotation in the ‘General annotation (Comments)’ section of the C22orf28 entry was: “Belongs to the UPF0027 (rtcB) family.” Things changed at the beginning of this year with the publication of 3 articles that unraveled C22orf28 function in human, archaea and bacteria. In archaea and human, the protein was shown to be involved in the ligation step during tRNA splicing. In bacteria, the ligation activity may be used in the context of tRNA repair. As of this release, all entries belonging to this family have been updated with these new data and UPF0027 has been deleted from UniProtKB and replaced by the RtcB family.
In the course of manual annotation, we have encountered many examples of uncharacterized conserved proteins. This has led to the definition of a total of 765 UPFs, which are listed in the upflist.txt file, available from the UniProt documentation pages, along with all associated entries and their taxonomic range. Within this list, UPFs that have been characterized, and hence deleted, represent about 28%. They are tagged with a comment indicating the reason for the deletion. For UPF0027, for example, the comment states that it is “now characterized as a family of RNA-splicing ligases”. It should be mentioned that, in parallel with our efforts to create UPFs, the Pfam database has established an analogous classification system based on ‘Domains of Unknown Function’ (DUFs). It currently reports some 3,000 DUFs.
For bench scientists, UPFs provide a pool of exciting targets for future research since protein sequence conservation suggests important, yet unknown, functions. For database maintenance, the classification of uncharacterized proteins into families presents the major advantage of simplifying the update when functional information becomes available for at least one member.
Changes to keywords
Modified keyword:
- Receptor mediated endocytosis of virus by host -> Virus endocytosis by host
Changes in subcellular location controlled vocabulary
New subcellular locations:
Modified subcellular locations: