UniProt release 2010_12
Published November 30, 2010
Fishing for new mutations in the human exome
Understanding the role of genetic variants in human health and disease is crucial in modern biology and medicine. The International HapMap Project and, more recently, the 1000 Genomes Project are progressively unveiling the map of human genome variation at the scale of the human population, generating a flood of interesting data. Smaller research projects focused on disease-causing mutations also contribute through the development of fruitful new approaches. One of the current trends in large- and small-scale projects is exome sequencing. The rationale is that the clear majority of allelic variants known to underlie Mendelian disorders disrupt protein-coding sequences. Restricting sequencing to exons reduces the amount of DNA to be sequenced to 2-5% of the whole genome, thus saving time and money, while still allowing the identification of missense and nonsense mutations, small insertions and deletions (indels), and splice donor and acceptor site variants. By definition, exome sequencing does not permit the discovery of mutations in non-coding, regulatory or intronic genomic regions, which are also known to affect disease.
The exome sequencing strategy is proving to be quite effective, as it has recently been used to pinpoint several genes whose mutations are associated with diseases, including DHODH involved in postaxial acrofacial dysostosis (Ng et al., 2010), WDR62 in severe cerebral cortical malformations (Bilguvar et al., 2010) and MLL2 in Kabuki syndrome (Ng et al., 2010).
The annotation of single amino acid polymorphisms (SAPs) has always been a priority in UniProtKB/Swiss-Prot, including not only ‘neutral’ polymorphisms, resulting from normal variations among individuals, but also disease-associated mutations. Thus missense SAPs identified by the exome-sequencing strategy have been quickly annotated and integrated in the ‘Sequence annotation (Features)’ section of their respective entries (Q02127, O43379 and O14686). The associated phenotypes are described in the ‘General annotation (Comments)’ section in ‘Involvement in disease’ (Q02127, O43379 and O14686).
Over the years, we have developed a defined format to describe SAPs in the ‘Sequence annotation (Features)’ section, including dbSNP accession numbers, when they exist, and links to bibliographic references. Disease-causing mutations are tagged, whenever possible, with the official abbreviation of the phenotype provided by the OMIM database. In addition to missense mutations, in-frame indels are also reported (P35453, P02730 or P33897). When it is not possible to represent the whole variation landscape for a given protein within the UniProtKB entry, we try to provide cross-references to specialized resources (see for instance the ‘Web resources’ section in the human p53 entry). Our annotation effort does not include the representation of mutations that cause major changes to a protein sequence, such as frameshift mutations or variations at splice sites, as their deleterious effects on protein function are usually obvious.
Close to 63’000 human SAPs are currently stored in UniProtKB/Swiss-Prot and about 30% of them are reported as disease-associated in the literature. SAPs selected from this pool are mapped to reference nucleotide sequences from RefSeq and LRG, following the guidelines established by the Human Genome Variation Society for sequence variant designation, and submitted to dbSNP (see for instance dbSNP/Swiss-Prot variant rs121908210). Thanks to a tight collaboration with Ensembl, all human variants stored in UniProtKB and characterized by a dbSNP accession number (or submitted to dbSNP) can also be accessed from the Ensembl database and viewed in the context of their nucleotide sequence (see variant rs1269215 stored in UniProtKB entry Q9BVK8). Our ultimate goal is to spread information about protein variations to the broadest possible audience.
Line length limit
Historically, UniProtKB flat file entries were formatted to not exceed 75 characters per line. This limitation served both to display them nicely on small screens and to allow them to be processed by programs that had memory limitations. Since then, computers have become more powerful and most programs have been adapted accordingly. UniProt has already made a few exceptions to the line length limit for data that cannot be wrapped, such as URLs or DOIs, or where wrapping does not increase readability, such as for protein names and a few cross-references to other databases. For the latter in particular, we have a growing amount of additional information to incorporate. We will continue to wrap lines at 75 characters where it helps to increase readability, but allow for more characters where necessary. The new upper limit is 255 characters per line, as some users still depend on software with this limitation.
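For users who want to check whether their own tools cope with the longer lines, the two limits described above can be sketched as a small check. This is only an illustration: the function name `classify_line_lengths` and the constant names are hypothetical and not part of any UniProt software.

```python
SOFT_LIMIT = 75    # historical wrap width, still used where it helps readability
HARD_LIMIT = 255   # new upper limit per line


def classify_line_lengths(lines):
    """Return two lists of 1-based line numbers.

    The first list holds lines longer than SOFT_LIMIT but within HARD_LIMIT
    (allowed exceptions); the second holds lines exceeding HARD_LIMIT.
    """
    long_lines = []
    invalid_lines = []
    for number, line in enumerate(lines, start=1):
        # Ignore the trailing newline when measuring the line.
        length = len(line.rstrip("\n"))
        if length > HARD_LIMIT:
            invalid_lines.append(number)
        elif length > SOFT_LIMIT:
            long_lines.append(number)
    return long_lines, invalid_lines
```

A parser that previously assumed a fixed 75-character buffer could run such a check over a sample of entries to see how many lines fall into the first list.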
Changes to cross-references to RefSeq
The format of the explicit links in the flat file is:
DR RefSeq; RefSeq protein accession number; RefSeq nucleotide accession number.
Previous format in the flat file:
DR RefSeq; AP_000992.1; -.
DR RefSeq; NP_414874.1; -.
New format in the flat file:
DR RefSeq; AP_000992.1; AC_000091.1.
DR RefSeq; NP_414874.1; NC_000913.2.
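A RefSeq cross-reference line in the layout shown above can be split into its protein and nucleotide accession numbers along the following lines. This is a minimal sketch based only on the examples given here; the function name `parse_refseq_dr` is hypothetical, and treating `-` as "no nucleotide accession" is an assumption drawn from the example lines.

```python
def parse_refseq_dr(line):
    """Parse 'DR RefSeq; <protein accession>; <nucleotide accession>.'

    Returns a dict with keys 'protein' and 'nucleotide'. A '-' in the
    nucleotide field (assumed to mean "not available") becomes None.
    """
    line = line.strip()
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    # Drop the 'DR' line code and any spacing that follows it.
    body = line[2:].strip()
    # Drop the terminating period of the line, not the version suffix.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3 or fields[0] != "RefSeq":
        raise ValueError("not a RefSeq cross-reference: %r" % line)
    protein, nucleotide = fields[1], fields[2]
    return {
        "protein": protein,
        "nucleotide": None if nucleotide == "-" else nucleotide,
    }
```

Code that reads only the first accession field continues to work unchanged; code that expects `-` in the last field needs to accept a nucleotide accession number there as well.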
Changes to keywords
New keywords:
- Actin-dependent active transport of viral material
- Cap snatching
- Caveolae-mediated endocytosis of virus by host
- Clathrin- and caveolae-independent endocytosis of virus by host
- Clathrin-mediated endocytosis of virus by host
- Cytoplasmic active transport of viral material
- Fusion of virus membrane with host cell membrane
- Fusion of virus membrane with host endosomal membrane
- Fusion of virus membrane with host membrane
- Helical capsid protein
- Host cell receptor for virus entry
- Initiation of viral infection
- Inner capsid protein
- Intermediate capsid protein
- Microtubule-dependent active transport of viral material
- Outer capsid protein
- Pilus-mediated viral adsorption onto host cell
- Pore-mediated penetration of viral genome into host cell
- Provirus integration
- Receptor mediated endocytosis of virus by host
- RNA suppression of termination
- RNA termination-reinitiation
- RNA translational shunting
- Syncytium formation induced by viral infection
- T=1 icosahedral capsid protein
- T=2* icosahedral capsid protein
- T=3 icosahedral capsid protein
- T=pseudo3 icosahedral capsid protein
- T=4 icosahedral capsid protein
- T=7 icosahedral capsid protein
- T=13 icosahedral capsid protein
- T=16 icosahedral capsid protein
- T=25 icosahedral capsid protein
- T=147 icosahedral capsid protein
- T=169 icosahedral capsid protein
- T=219 icosahedral capsid protein
- Translational shunt
- Viral attachment to host cell
- Viral genome injection through bacterial membranes
- Viral ionic channel
- Viral penetration into host cytoplasm
- Viral penetration into host nucleus
- Viral penetration via lysis of host organellar membrane
- Viral penetration via permeabilization of host organellar membrane
- Viral primary envelope fusion with host outer nuclear membrane
Changes in subcellular location controlled vocabulary
New subcellular locations:
Changes in the controlled vocabulary for PTMs
New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):