UniProt release 6.9
Published January 24, 2006
Mammalian, Xenopus and Zebrafish Gene Collections: a goldmine for high-quality sequences
High-quality nucleotide sequences derived from high-throughput sequencing projects, such as those generated by the NIH Gene Collection (GC) initiatives, are extremely valuable for a protein sequence database like UniProtKB/Swiss-Prot. More than 99.98% of the UniProtKB/Swiss-Prot sequences are derived from the translation of nucleotide sequences rather than from direct protein sequencing. In this context, high-quality nucleotide sequences provide a rapid and easy way to check the accuracy of the sequences. Differences between sequences may point to the existence of polymorphisms, and many alternative splicing isoforms have been introduced thanks to these projects.
Launched in 1999, the Mammalian Gene Collection (MGC) is an NIH multi-institutional initiative. Its goal is to identify and sequence cDNA clones containing a full-length open reading frame. Initially aimed at human and mouse sequences, it was later expanded to rat and bovine clones. Two additional projects complement the first initiative: they deal with Xenopus (XGC) and Zebrafish (ZGC). The sequences obtained by these projects are submitted to the EMBL/GenBank/DDBJ databases, and the submitted CDS are translated and automatically integrated into UniProtKB/TrEMBL. The UniProtKB/TrEMBL entries can then be manually annotated and integrated into UniProtKB/Swiss-Prot. Following the principle of non-redundancy, sequences derived from the same gene in the same species are merged into one UniProtKB/Swiss-Prot entry. This is reflected at the level of cross-references. For instance, the average number of distinct nucleotide sequence cross-references per human entry is currently close to 5. This implies that each human sequence has been confirmed, on average, by 5 independently submitted sequences, and thus the accuracy of the sequences shown in UniProtKB/Swiss-Prot entries is quite high.
Currently, close to 16'000 UniProtKB/Swiss-Prot entries contain data from GC submissions. Broken down by species, MGC data are found in more than 60% of human entries, more than 50% of mouse entries, 25% of rat entries, but only 2% of bovine entries. ZGC data can be found in close to 55% of zebrafish entries, and XGC data in close to 20% of Xenopus laevis entries and 85% of Xenopus tropicalis entries.
Cross-references to MIM
Various MIM cross-references can be present in a single UniProtKB/Swiss-Prot human entry. They were annotated in the DR lines according to a format that does not distinguish between MIM entries describing a gene and MIM entries describing a phenotype:
DR MIM; 608463; -.
We added a field to the DR MIM line to allow users and programs to distinguish between MIM "gene" and "phenotype" entries.
The new format of the DR MIM line is:
DR MIM; MIM_identifier; token.
Where token is one of the following values:
- gene: MIM entries which describe a gene
- phenotype: MIM entries which describe a phenotype
- gene+phenotype: MIM entries which describe both a gene and a phenotype
Examples:
DR MIM; 608463; gene.
DR MIM; 603813; phenotype.
DR MIM; 124080; gene+phenotype.
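A program reading the new DR MIM format could separate gene and phenotype cross-references along the following lines. This is only a minimal sketch: the names DR_MIM, parse_dr_mim and group_by_token are hypothetical and not part of any UniProt tool, and the pattern assumes that the third field is either one of the three tokens shown above or the old "-" placeholder.

```python
import re

# Assumed line shape: "DR MIM; <digits>; <token>." where <token> is one of
# gene, phenotype, gene+phenotype, or the old-format placeholder "-".
# \s+ after "DR" tolerates any amount of padding after the line code.
DR_MIM = re.compile(
    r"^DR\s+MIM;\s*(\d+);\s*(gene\+phenotype|gene|phenotype|-)\.\s*$"
)


def parse_dr_mim(line):
    """Return (mim_identifier, token) for a DR MIM line, else None."""
    match = DR_MIM.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def group_by_token(lines):
    """Collect MIM identifiers per token, ignoring non-MIM lines."""
    groups = {}
    for line in lines:
        parsed = parse_dr_mim(line)
        if parsed is not None:
            mim_id, token = parsed
            groups.setdefault(token, []).append(mim_id)
    return groups


if __name__ == "__main__":
    entry_lines = [
        "DR MIM; 608463; gene.",
        "DR MIM; 603813; phenotype.",
        "DR MIM; 124080; gene+phenotype.",
    ]
    for token, ids in group_by_token(entry_lines).items():
        print(token, ids)
```

Accepting "-" in the token position lets the same function read lines written in the previous format, so a caller can treat "-" as "not yet classified" instead of failing on older files.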
Changes concerning keywords