UniProt release 2011_05
Published May 3, 2011
Complete proteomes for Homo sapiens and Mus musculus
With the imminent closure of the International Protein Index (IPI), UniProt has pledged to provide comprehensive and non-redundant complete proteomes for all species that are currently covered by this soon to be defunct resource. With this release of UniProtKB, we provide the first version of the complete proteome for Mus musculus and an updated version of the Homo sapiens set.
We describe here how each of these complete proteomes is produced, and outline their major characteristics.
For Homo sapiens, a first pass annotation of the complete proteome was completed by UniProt in 2008 and all entries were incorporated into UniProtKB/Swiss-Prot. Within this UniProtKB/Swiss-Prot complete H. sapiens proteome, approximately 20,000 putative protein-coding genes are represented by one canonical protein sequence, with some entries describing multiple isoform sequences. Since its initial release, the UniProtKB/Swiss-Prot complete H. sapiens proteome has been extensively curated and the Ensembl cross-references – mapped based on sequence identity – are in the process of being manually verified. All predicted protein sequences from Ensembl (except fragments) that were found to be absent from the UniProtKB/Swiss-Prot complete H. sapiens proteome were imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL. These imported UniProtKB/TrEMBL entries were tagged with the keyword ‘Complete proteome’. The aim of this import was to increase the coverage of the existing complete proteome, by supplementing it with those Ensembl protein sequences that had no UniProtKB counterpart. The resulting UniProtKB complete H. sapiens proteome includes both reviewed sequences from UniProtKB/Swiss-Prot (equivalent to an updated version of the complete H. sapiens proteome completed in 2008), now supplemented by unreviewed sequences from UniProtKB/TrEMBL. This process will enable the synchronization of the UniProt set with the CCDS project. This version of the complete H. sapiens proteome provides higher sequence coverage than the preceding version, but now includes sequences that have not been manually reviewed. Users can choose to opt either for this expanded complete H. sapiens proteome or a reduced version that derives exclusively from UniProtKB/Swiss-Prot.
For Mus musculus, we added the keyword ‘Complete proteome’ to the existing UniProtKB/Swiss-Prot mouse entries. We identified those entries in UniProtKB/TrEMBL which mapped to the complete genome in Ensembl and imported the predicted protein sequences (except fragments) from Ensembl which were found not to be present in UniProtKB. The keyword ‘Complete proteome’ was also added to these entries. As for the human counterpart, the UniProtKB complete Mus musculus proteome can now be easily retrieved using this keyword. The Ensembl cross-references have been added to the UniProtKB entries on the basis of 100% sequence identity over their full length.
There has been a deliberate introduction of redundancy into the proteomes based on the complete genomes to ensure that all alternative protein variants and isoforms are presented in the set. Over time, these will be merged with the parent entry as is UniProtKB/Swiss-Prot curation policy. We are also evaluating the fragment sequences predicted in the Ensembl complete proteomes for future incorporation.
We very much welcome the feedback of the community on our efforts. We expect to make the remaining IPI species (Gallus gallus, Bos taurus, Danio rerio, Arabidopsis thaliana and Rattus norvegicus) and some additional species of interest (Sus scrofa, Canis familiaris) available soon.
Changes to the taxonomy of UniProtKB entries from the model fungal organisms Saccharomyces cerevisiae (YEAST) and Schizosaccharomyces pombe (SCHPO)
Historically, UniProt assigned species identification codes (i.e. the 5-letter mnemonic that forms the second part of the composite UniProtKB entry name) at species-level. All sequences of a given protein from a single species were merged into a single UniProtKB/Swiss-Prot entry, which could therefore contain sequences from many different strains of that species. Discrepancies between individual sequences were annotated as “conflicts” or “variants” in the feature table of the entry.
The number of complete genome submissions from different strains of individual species is now increasing at an ever accelerating rate and this data has elucidated that distinct strains of many species can exhibit considerable differences in both gene content and within individual shared genes. The sheer amount of complete proteome data and the associated variability between proteomes means that manual merging of individual strains is no longer sustainable, and this approach has been discontinued. UniProt now assigns mnemonic codes at strain-level for complete genome sequences. We are in the process of reassigning many proteomes corresponding to known strains from a species-level taxonomic identifier (defined by the NCBI_TaxID) to the appropriate strain-level taxonomic identifier. In parallel we are also resolving some of the most significant historical cases of strain-level merging.
Here we describe the changes which accompanied the reassignment of the proteomes of the model fungal organisms Saccharomyces cerevisiae (YEAST) and Schizosaccharomyces pombe (SCHPO) to a strain-level taxonomic identifier.
We have changed the taxonomy of all UniProtKB entries of the S. cerevisiae complete proteome from strain S288c (the reference genome sequence stored at SGD) from NCBI_TaxID=4932 to NCBI_TaxID=559292:
OS Saccharomyces cerevisiae (Baker's yeast). OC Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; OC Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. OX NCBI_TaxID=4932;to
OS Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast). OC Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; OC Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. OX NCBI_TaxID=559292;
To facilitate entry recognition and tracking of the entries, those of strain S288c (NCBI_TaxID=559292) have kept the mnemonic YEAST.
All S. cerevisiae entries not originating from the genome strain S288c, or one of the other completely sequenced strains (strain RM11-1a (YEAS1), strain JAY291 (YEAS2), strain AWRI1631 (YEAS6), strain YJM789 (YEAS7) and Lalvin EC1118 (YEAS8)), remain at a species level taxonomic identifier (NCBI_TaxID=4932), for which the new mnemonic YEASX was created (e.g. MAL62_YEASX for P07265).
A similar procedure was applied to the S. pombe proteome. We have changed the taxonomy of all UniProtKB entries of the S. pombe complete proteome from strain 972 (the reference genome sequence stored at S. pombe GeneDB) from NCBI_TaxID=4896 to NCBI_TaxID=284812:
OS Schizosaccharomyces pombe (Fission yeast). OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; OC Schizosaccharomycetes; Schizosaccharomycetales; OC Schizosaccharomycetaceae; Schizosaccharomyces. OX NCBI_TaxID=4896;to
OS Schizosaccharomyces pombe (strain ATCC 38366 / 972) (Fission yeast). OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; OC Schizosaccharomycetes; Schizosaccharomycetales; OC Schizosaccharomycetaceae; Schizosaccharomyces. OX NCBI_TaxID=284812;
All S. pombe entries not originating from the genome strain 972 retain a species-level taxonomic identifier (NCBI_TaxID=4896), for which the new mnemonic SCHPM was created.
Changes concerning the controlled vocabulary for PTMsNew terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
- 1-(tryptophan-3-yl)-tryptophan (Trp-Trp) (interchain)
- S-(2-aminovinyl)-L-cysteine (Cys-Cys)