UniProt release 12.3
Published October 2, 2007
Oryza sativa (rice) species separated into japonica and indica subspecies in UniProtKB/Swiss-Prot entries
Although it has been a rule in UniProtKB/Swiss-Prot to merge all protein sequences encoded by the same gene in one species into a single record to avoid redundancy, this rule sometimes has to be adapted to specific cases. For example, the rule was applied to rice entries, so sequences from various rice cultivars were merged and the entries were tagged with the single taxonomic identifier (ID) for the Oryza sativa species: 4530.
However, O. sativa comprises two subspecies: japonica and indica. A classification at subspecies level is already in effect in several databases, including UniProtKB/TrEMBL, and most scientists use it when submitting new sequences. In EMBL/DDBJ/GenBank, there are over 1.2 million japonica and almost 360,000 indica sequences, coming mainly from large-scale genome, cDNA or EST sequencing projects. The completion of both the japonica and the indica genomes and the analysis of multiple sets of subspecies-specific transcripts revealed a significant number of sequence variations and a divergence of expression patterns between the japonica and indica subspecies. In order to provide clearer information to its users, UniProtKB/Swiss-Prot had to adopt this classification and separate the indica and japonica subspecies in rice entries.
Most rice entries contained exclusively japonica sequences and were quickly updated with the appropriate taxonomic ID. But over 220 rice entries contained merged sequences of the japonica and indica subspecies and had to be "de-merged". This task was undertaken by the PPAP (Plant Proteome Annotation Program) team. Common information was kept in both the japonica and indica entries, while expression patterns and other subspecies-specific experimental evidence were transferred to the entry of the subspecies they concern. Today all rice entries are classified into either the japonica or the indica subspecies, with the exception of a very few entries where the subspecies was not specified. When available, cultivars are indicated in the reference section. Each entry also provides cross-references to either japonica (cultivar nipponbare) or indica (cultivar 93-11) genomic sequences.
The gene nomenclature system ('OS' code) defined by RAP-DB and/or TIGR for the japonica cultivar nipponbare can be found in japonica entries in the gene names subsection (Ordered Locus Names). RAP-DB locus identifiers are listed in the rice.txt file.
To get all reviewed UniProtKB Japonica entries, click here.
To get all reviewed UniProtKB Indica entries, click here.
To get all reviewed UniProtKB rice entries, click here.
The mnemonic species identification code in the entry name makes it possible to quickly identify the subspecies to which the protein belongs: ORYSJ is the code for japonica, ORYSI for indica, and the old ORYSA code indicates that the subspecies is not specified. The list of rice cultivars can be found in the strains.txt file.
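As an illustration of how the species code can be used, the following minimal sketch maps an entry name to its rice subspecies. It assumes that the species code is the part of the entry name after the last underscore; the function name `rice_subspecies` and the entry names in the usage note are hypothetical.

```python
# Species codes as listed in this release note; the labels are descriptive strings.
RICE_CODES = {
    "ORYSJ": "japonica",
    "ORYSI": "indica",
    "ORYSA": "unspecified",
}


def rice_subspecies(entry_name):
    """Return the rice subspecies label for an entry name, or None if it is not a rice code."""
    # Assumption: the species code follows the last underscore of the entry name.
    _, separator, code = entry_name.rpartition("_")
    if not separator:
        return None
    return RICE_CODES.get(code.upper())
```

For example, `rice_subspecies("ABC1_ORYSJ")` gives `"japonica"`, while a name carrying any other species code gives `None`.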
Changes concerning the comment line (CC) topic MASS SPECTROMETRY
To be consistent with other comment line topics in the flat file, we have changed the field tags of the topic MASS SPECTROMETRY. At the same time, we have extracted literature references into a new field, Source=, and replaced all molecule descriptions by isoform identifiers.
Previous format:
CC   -!- MASS SPECTROMETRY: MW=mass(; MW_ERR=error)?; METHOD=method;
CC       RANGE=ranges( (molecule))?; NOTE=(references|free_text (references)).
New format:
CC   -!- MASS SPECTROMETRY: Mass=mass(; Mass_error=error)?; Method=method;
CC       Range=ranges( (IsoformID))?(; Note=free_text)?; Source=references;
Example, previous format:
CC   -!- MASS SPECTROMETRY: MW=3979.9; METHOD=Electrospray; RANGE=1-31;
CC       NOTE=Ref.1, Ref.2.
Example, new format:
CC   -!- MASS SPECTROMETRY: Mass=3979.9; Method=Electrospray; Range=1-31;
CC       Source=Ref.1, Ref.2;
P04653, previous format:
CC   -!- MASS SPECTROMETRY: MW=23638.14; MW_ERR=3.0; METHOD=Electrospray;
CC       RANGE=16-214 (P04653-2; Allele A); NOTE=With eleven phosphate
CC       groups (Ref.2).
P04653, new format:
CC   -!- MASS SPECTROMETRY: Mass=23638.14; Mass_error=3.0; Method=Electrospray;
CC       Range=16-214 (P04653-2); Note=Allele A, with 11 phosphate groups;
CC       Source=PubMed:7601973;
Note that literature references of the form Ref.n are replaced by PubMed identifiers where this is possible.
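For users who parse the flat file, the new field layout could be read along the following lines. This is only a sketch, not an official parser: the function name `parse_mass_spectrometry` is hypothetical, and it assumes that fields are separated by semicolons and that a new field starts only at one of the six tags named above.

```python
import re

# Field tags of the new format, as described in this release note.
FIELD_TAGS = ("Mass_error", "Mass", "Method", "Range", "Note", "Source")
# Split on a semicolon only when one of the known tags follows,
# so that free text in Note= is not cut at other punctuation.
FIELD_SPLIT = re.compile(r";\s*(?=(?:%s)=)" % "|".join(FIELD_TAGS))
TOPIC_PREFIX = "-!- MASS SPECTROMETRY:"


def parse_mass_spectrometry(cc_lines):
    """Parse the CC lines of one MASS SPECTROMETRY comment into a dict of fields."""
    # Drop the "CC" line code and join continuation lines into one string.
    text = " ".join(line[2:].strip() for line in cc_lines if line.startswith("CC"))
    if not text.startswith(TOPIC_PREFIX):
        raise ValueError("not a MASS SPECTROMETRY comment")
    body = text[len(TOPIC_PREFIX):].strip().rstrip(";")
    fields = {}
    for part in FIELD_SPLIT.split(body):
        tag, _, value = part.partition("=")
        fields[tag.strip()] = value.strip()
    return fields
```

Applied to the three lines of the P04653 comment in the new format, this returns a dictionary with the keys Mass, Mass_error, Method, Range, Note and Source; all values are kept as strings, so numeric conversion is left to the caller.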
Cross-references to RefSeq
Cross-references have been added to the NCBI Reference Sequences database. The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for taxonomically diverse organisms including eukaryotes, bacteria, and viruses. RefSeq is a baseline for medical, functional, and diversity studies; it provides a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses.
The RefSeq database is available at http://www.ncbi.nlm.nih.gov/RefSeq/.
The format of the explicit links in the flat file is:
Resource identifier: RefSeq protein accession ID.
Cross-references to GeneID
Cross-references have been added to the Database of genes from NCBI RefSeq genomes. Entrez Gene is the NCBI's database for gene-specific information. It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers. The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available. Entrez Gene is a step forward from NCBI's LocusLink, with both a major increase in taxonomic scope and improved access through the many tools associated with NCBI Entrez.
The GeneID database is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene.
The format of the explicit links in the flat file is:
Resource identifier: GeneID accession ID.
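The new RefSeq and GeneID cross-references could be extracted from an entry roughly as sketched below. This release note specifies only the resource identifier for each link, so the line layout used here (a `DR` line code followed by semicolon-separated fields and a terminating period) is an assumption, the function name `parse_dr_line` is hypothetical, and the accessions in the usage note are made up for illustration.

```python
def parse_dr_line(line):
    """Return (resource, identifier) for a cross-reference line, or None if it does not fit."""
    # Assumption: cross-reference lines start with the "DR" line code.
    if not line.startswith("DR"):
        return None
    # Assumption: fields are separated by semicolons and the line ends with a period.
    fields = [field.strip() for field in line[2:].strip().rstrip(".").split(";")]
    if len(fields) < 2:
        return None
    return fields[0], fields[1]
```

With the made-up lines `DR   RefSeq; NP_000000.1; -.` and `DR   GeneID; 12345; -.`, the function returns `("RefSeq", "NP_000000.1")` and `("GeneID", "12345")` respectively; any further fields on the line are ignored.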