UniProt release 2010_07
Published June 15, 2010
UniProt and the International Nucleotide Sequence Database Collaboration
UniProt has a long-standing and very beneficial collaboration with the three members of the International Nucleotide Sequence Database Collaboration (INSDC): EMBL-Bank, GenBank and the DNA Data Bank of Japan (DDBJ). It began at the most basic level with an exchange of nucleotide and protein sequences. It then evolved through co-development of the nucleotide entry feature table definition, which ensures efficient automatic integration of appropriate protein information into UniProt, followed by reciprocal cross-references. It has recently progressed to a joint endorsement of protein naming guidelines.
This endorsement was one outcome of the third NCBI Genome Annotation Workshop in Washington, USA in April 2010, where researchers from life science organizations world-wide collaborated to establish minimal standards for prokaryotic and viral annotation. Extremely productive discussions concerning annotation and its underlying problems led to a number of resolutions that were adopted by the international microbial sequencing community. The highlight was the development and acceptance by the community of prokaryotic protein naming guidelines (see file proknameprot.txt), based on an initial proposal from the INSDC and UniProt. Following this agreement, the INSDC and UniProt also created a more generalised protein naming guideline (see file gennameprot.txt) to make it useful for taxa outside cellular prokaryotes.
The decision by the INSDC to provide these guidelines for adoption by all submitters to their databases will greatly enhance the annotation of complete genomes and proteomes and ensure that the user community can exploit these data to their full potential. This is a particularly timely development given the rapidly growing volume of sequence data. Future plans for the INSDC and UniProt involve collaboration with the NCBI’s Genome project and the Reference Sequence (RefSeq) collection groups to provide synchronized, well-annotated genomes and proteomes.
New feature key INTRAMEM in the flat file
In addition to the feature keys TOPO_DOM (which describes the topology of regions for transmembrane proteins that span membrane compartments) and TRANSMEM (which describes the extent of the region spanning a membrane), we have introduced a new feature key INTRAMEM in the flat file to describe the extent of a region located in a membrane without crossing it.
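The relationship between the three membrane-related feature keys can be illustrated with a minimal Python sketch that picks them out of flat file feature lines. It assumes that each feature line starts with the line code FT, followed by the feature key, a start position and an end position as whitespace-separated tokens. The function name membrane_features and all example lines are hypothetical and invented for illustration; they are not taken from a real entry.

```python
# Feature keys named in this release note that describe membrane regions.
MEMBRANE_FEATURE_KEYS = {"TOPO_DOM", "TRANSMEM", "INTRAMEM"}


def membrane_features(flat_file_lines):
    """Return (key, start, end) tuples for membrane-related FT lines.

    Assumption: the key, start and end are the 2nd, 3rd and 4th
    whitespace-separated tokens of an FT line. Lines that do not fit
    this shape (continuation lines, fuzzy positions) are skipped.
    """
    features = []
    for line in flat_file_lines:
        tokens = line.split()
        if len(tokens) < 4 or tokens[0] != "FT":
            continue
        key, start, end = tokens[1], tokens[2], tokens[3]
        if key not in MEMBRANE_FEATURE_KEYS:
            continue
        if not (start.isdigit() and end.isdigit()):
            continue
        features.append((key, int(start), int(end)))
    return features


# Hypothetical feature lines, invented for this example only.
example_lines = [
    "FT   TOPO_DOM      1     20       Cytoplasmic.",
    "FT   INTRAMEM     21     40       Helical.",
    "FT   TRANSMEM     41     61       Helical.",
    "FT   CHAIN         1    300       Example protein.",
]

for key, start, end in membrane_features(example_lines):
    print(key, start, end)
```

In this sketch an INTRAMEM region is handled exactly like a TRANSMEM region; only the key tells a consumer whether the region crosses the membrane or stays within it.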
Cross-references to EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists
Cross-references have been added to EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants and EnsemblProtists. These databases are part of Ensembl Genomes, which was created to complement the existing Ensembl site, which focuses on vertebrate genomes.
The format of the explicit links in the flat file is:
|Resource abbreviation||EnsemblBacteria or EnsemblFungi or EnsemblMetazoa or EnsemblPlants or EnsemblProtists|
|Resource identifier||Transcript ID|
|Optional information 1||Protein ID|
|Optional information 2||Gene ID|
DR EnsemblBacteria; EBSTAT00000032812; EBSTAP00000031682; EBSTAG00000032810.
Q07163:
DR EnsemblFungi; YDR365W-B; YDR365W-B; YDR365W-B.
Q9NDJ2:
DR EnsemblMetazoa; FBtr0071602; FBpp0071528; FBgn0020306.
DR EnsemblMetazoa; FBtr0071603; FBpp0071529; FBgn0020306.
DR EnsemblMetazoa; FBtr0071604; FBpp0071530; FBgn0020306.
P49333:
DR EnsemblPlants; AT1G66340.1-TAIR; AT1G66340.1-P; AT1G66340-TAIR-G.
Q54L85:
DR EnsemblProtists; DDB0305146; DDB0305146; DDB_G0286833.
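As an illustration of the link format described above, the following Python sketch splits one such DR line into its resource abbreviation, transcript ID, protein ID and gene ID. It assumes that the fields are separated by semicolons and that the line ends with a full stop, as in the examples shown. The names EnsemblGenomesXref and parse_dr_line are hypothetical and are not part of any UniProt tool.

```python
from typing import NamedTuple, Optional

# Resource abbreviations introduced in this release.
ENSEMBL_GENOMES_RESOURCES = {
    "EnsemblBacteria",
    "EnsemblFungi",
    "EnsemblMetazoa",
    "EnsemblPlants",
    "EnsemblProtists",
}


class EnsemblGenomesXref(NamedTuple):
    resource: str
    transcript_id: str
    protein_id: str
    gene_id: str


def parse_dr_line(line: str) -> Optional[EnsemblGenomesXref]:
    """Parse a DR line for an Ensembl Genomes resource.

    Returns None for any line that is not a DR line with one of the
    five resource abbreviations followed by exactly three identifiers.
    """
    parts = line.strip().split(None, 1)
    if len(parts) != 2 or parts[0] != "DR":
        return None
    body = parts[1]
    # Drop only the terminating full stop; identifiers such as
    # AT1G66340.1-TAIR contain dots of their own.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 4 or fields[0] not in ENSEMBL_GENOMES_RESOURCES:
        return None
    return EnsemblGenomesXref(*fields)


xref = parse_dr_line(
    "DR EnsemblPlants; AT1G66340.1-TAIR; AT1G66340.1-P; AT1G66340-TAIR-G."
)
print(xref)
```

Splitting on semicolons rather than on dots matters here because some identifiers contain dots, so only the final full stop of the line is treated as a terminator.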