UniProt release 2010_09
Published August 10, 2010
‘De-merge’ of multi-gene entries derived from a single species in UniProtKB/Swiss-Prot
UniProtKB/Swiss-Prot has historically “merged” 100% identical protein sequences from different genes in the same species into one single record. The aim of this approach was to reduce sequence redundancy within the proteome of individual species, facilitating protein identification and the functional annotation of protein sequences. These merged entries provide extensive annotation of the protein sequence itself, as well as information on each of the individual source genes, including cross-references to external gene-centric resources that provide gene models and genomic information.
As the availability and usage of genomic information has greatly increased in recent years, UniProtKB is modifying its merging policy. We have already begun to “de-merge” entries containing multiple individual genes coding for 100% identical protein sequences into individual UniProtKB/Swiss-Prot entries containing a single gene. This will give a gene-centric view of protein space, where the same protein sequence can be represented multiple times by distinct UniProtKB/Swiss-Prot entries, each of which is based on the translation of a single distinct gene. It will allow a cleaner and more logical mapping of gene and genomic resources to UniProtKB, which provide the major point of entry to the resulting proteome for many users. It will also facilitate the annotation of protein features that are uniquely associated with specific copies of duplicated genes, such as alternative splice forms that are found in genes encoded by multiple exons but not in single exon copies derived from retro-transposed cDNAs. This type of information can be most effectively captured in a gene-centric view of protein space, providing a precise description of how genome evolution and structure impact the protein complement of a cell. One consequence of this change in annotation policy is that the level of protein sequence redundancy in UniProtKB will slightly increase, as multiple identical instances of a given protein sequence may now exist within the proteome of a particular species or strain. The process of de-merging has already begun with a number of proteins from Escherichia coli and Homo sapiens and other vertebrates, and will be an ongoing process in UniProtKB. For pragmatic reasons, there are several multi-gene families which will not be targeted for de-merge in the near future, as the difficulties associated with maintaining these individual annotated sequences are significant. These include the human histone genes and the calmodulins, which will continue to be grouped into one entry for the current time. However for simpler cases, especially those in which the genomic context of the gene affects the properties of the encoded protein, de-merging will be preferred.
The de-merge procedure
In simple cases, the demerge procedure simply involves the creation of one new UniProtKB entry for each gene in the current merged UniProtKB entry. A new primary accession number is attributed to each de-merged entry, and the primary accession number of the formerly merged entry is retained as a secondary accession number in each of the resulting de-merged entries. To illustrate how the demerge procedure affects the representation of protein sequences in UniProtKB, consider the example of the human ubiquitin protein. Ubiquitin in humans is encoded by four distinct genes, RPS27A, UBA52, UBB and UBC. RPS27A and UBA52 include a single ubiquitin moiety as an N-terminal fusion to a ribosomal protein, while UBB and UBC encode distinct poly-ubiquitin chains. In UniProtKB release 2010_08, the human ubiquitin protein sequence was represented by one single UniProtKB entry (UBIQ_HUMAN, P62988), that included the ubiquitin protein sequences derived from all four of the aforementioned genes. For UniProt release 2010_09, these four genes were de-merged into 4 distinct UniProtKB entries corresponding to each of the four ubiquitin genes. Following the de-merge, ubiquitin chains from RPS27A and UBA52 were then re-merged to the entries describing their cognate ribosomal proteins, and are now represented as peptides derived from the translated ubiquitin-ribosomal protein fusion. The final result of this process is four distinct UniProtKB entries that include ubiquitin protein sequences derived from four loci: RPS27A, UBA52, UBB, and UBC. Each of these entries retains the primary accession number of the old merged entry UBIQ_HUMAN (P62988) as a secondary accession number.
Cross-references to Protein Model Portal
Cross-references have been added to Protein Model Portal, developed as a module of the PSI-Nature Structural Biology Knowledgebase (http://sbkb.org/). The Protein Model Portal provides a single interface to query simultaneously the existing precomputed models
at various sites, gives access to interactive services for template selection, target-template alignment, model building, and quality assessment. Models are provided by the PSI centers (CSMP, JCSG, MCSG, NESG, NYSGXRC, JCMM), and by independent modeling groups. The task of the portal is to unify the model data from the different sites.
Protein Model Portal is available at http://www.proteinmodelportal.org/.
The format of the explicit links in the flat file is:
|Resource identifier||UniProtKB accession number|
DR ProteinModelPortal; P84155; -.P27362:
DR ProteinModelPortal; P27362; -.
Changes to keywordsNew keywords:
Changes in subcellular location controlled vocabularyNew subcellular locations:
- Cleavage furrow
- Host early endosome
- Host early endosome membrane
- Host trans-Golgi network
- Host trans-Golgi network membrane
- PML body
Changes to the controlled vocabulary for PTMsNew term for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
- Glycyl cysteine thioester (Cys-Gly) (interchain with G-...)
New BLAST features
We have updated the BLAST results view of uniprot.org:
- All information that is visible or configurable in the results page of a text search in one of the UniProt core datasets is now also available in the results page of a BLAST search. Click the Customize display link on the BLAST results page to see which additional columns you can add to the table. For instance, select Comment, press Show, tick Function, press Show to add the column comment(FUNCTION) to see the available functional annotation of your BLAST hits.
- BLAST results can now be filtered by dataset and taxonomy. The section will show you in brackets the number of hits for each dataset or taxonomy branch. For instance, after running BLAST against the full UniProtKB dataset, you can filter your results to show only hits that are from Bacteria.
- When running a BLAST search against UniProtKB, it is possible to project the sequence annotations of the matched UniProtKB entries onto the alignments generated by BLAST. To see an alignment, click on it in the Alignments column of the table, then tick the Annotations that you would like to highlite in the alignment. This allows you to see at a glance if important positions are conserved.
Another new feature is the option to run BLAST searches against UniParc. Please use this with caution, as UniParc is an archive that also contains pseudogenes and incorrect CDS predictions.
Updated look and feel
The website received a small face lift to improve the navigation. The UniProt entry views, as well as the various tools’ results views, now have blue navigation bars at the top and bottom with links that allow you to quickly access different sections of big views. Where applicable, the top bar features a Customize display link that lets you customize the view.