UniProt release 2017_01
Published January 18, 2017
Sheep in wolves’ clothing: human variant reannotation in UniProtKB/Swiss-Prot with ExAC
Annotation of sequence variants has always been an important part of the curation of human proteins in UniProtKB/Swiss-Prot. As of this release, about 76,500 variants are annotated in the knowledgebase. 99% of them are single amino acid polymorphisms (SAPs), the rest are small indels. 38% of the SAPs are associated with a genetic disorder. This high percentage of rare SAPs reflects our strategy to prioritize the annotation of disease-causing and/or functionally characterized variants reported in peer-reviewed scientific literature. Most are annotated as involved in diseases (as disease-causing agents, susceptibility factors or disease modifiers), but for some, the role in the phenotype is not clear, although they have been found in patients and not (yet?) in healthy individuals. These variants are called Variants of Unknown Significance (VUS). In the ‘good old days’, we were quite confident and we associated SAPs with diseases provided some criteria were met, such as cosegregation of the mutation with the phenotype, and absence of the mutation in a reasonably high number of healthy controls. At that time, 100 control individuals, ethnically matched if possible, seemed acceptable. Those days are gone. Nowadays, these simple criteria have been changed for a real roadmap, based on guidelines developed by Richards et al. The stumbling block remains the frequency of a given variant in the population in view of the occurrence of the disease. In other words, if a variant is not found in healthy individuals, is it because it is pathogenic, or simply not looked for hard enough? In this context, the high-quality sequence of almost 61,000 exomes provided by the Exome Aggregation Consortium (ExAC) is a major achievement.
ExAC aims to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The sequence of 60,706 exomes from unrelated individuals is currently available on the ExAC website. Surprisingly, each ExAC exome donor harbored on average 54 mutations reported to be disease-causing in HGMD or ClinVar. The pathogenicity of most of them (41) could be ruled out due to high allele frequency. Take for instance the gene CLN8. Mutations in this gene have been shown to cause neuronal ceroid lipofuscinosis-8, an autosomal recessive neurodegenerative disorder with an onset age of 2 to 7 years. In view of the clinical synopsis, no ‘healthy’ adult homozygous for any disease-causing mutation is expected. ExAC observed 93 individuals homozygous for the p.Pro229Ala variant, which had formerly been reported to be pathogenic. An analogous result was obtained for the variant p.Met1444Ile in GLI2. This mutation was reported to cause holoprosencephaly-9 (HPE9), an autosomal dominant disorder characterized by a wide phenotypic spectrum of brain developmental defects. Although HPE9 has variable expressivity and incomplete penetrance, the presence of this mutation in 20 homozygous individuals analyzed by ExAC lead to its reclassification as a benign polymorphism.
The ExAC publication has a fruitful impact on our annotation. First, 38 variants (in 36 gene entries) reported in UniProtKB/Swiss-Prot and thought to be pathogenic have been reclassified as either benign polymorphisms or VUS. Second, the ExAC database has become an invaluable tool for curators, helping them to tag human variants with the appropriate status ‘Disease’ (disease-associated), ‘Polymorphism’ (innocuous) or ‘Unclassified’ (i.e. VUS). Third, we are learning to be more and more cautious when annotating new variants. The result is an increased number of VUS in UniProtKB/Swiss-Prot (currently representing about 20% of the total number of variants identified in patients). Old variants will be progressively confirmed or reclassified as new knowledge becomes available.
As of this release, the variants updated thanks to ExAC data are available in UniProtKB/Swiss-Prot.
The UniProt team wishes you a Happy New Year!
Cross-references to SFLD
Cross-references have been added to the Structure Function Linkage Database (SFLD), a resource that links evolutionarily related sequences and structures from mechanistically diverse superfamilies of enzymes to their chemical reactions.
SFLD is available at http://sfld.rbvi.ucsf.edu/django/.
The format of the explicit links is:
|Resource identifier||SFLD identifier|
|Optional information 1||SFLD model name|
|Optional information 2||Number of hits|
DR SFLD; SFLDS00014; RuBisCO; 1.
<dbReference type="SFLD" id="SFLDS00014"> <property type="entry name" value="RuBisCO"/> <property type="match status" value="1"/> </dbReference>
uniprot:P00877 rdfs:seeAlso <http://purl.uniprot.org/sfld/SFLDS00014> . <http://purl.uniprot.org/sfld/SFLDS00014> rdf:type up:Resource ; up:database <http://purl.uniprot.org/database/SFLD> ; rdfs:comment "RuBisCO" ; up:signatureSequenceMatch <http://purl.uniprot.org/isoforms/P00877-1#SFLD_SFLDS00014_match_1> .
Changes to the controlled vocabulary of human diseases
- Aortic aneurysm, familial thoracic 10
- Cardiospondylocarpofacial syndrome
- Dyskinesia, seizures, and intellectual developmental disorder
- Ehlers-Danlos syndrome, periodontal type, 1
- Ehlers-Danlos syndrome, periodontal type, 2
- Epileptic encephalopathy, early infantile, 44
- Epileptic encephalopathy, early infantile, 45
- Epileptic encephalopathy, early infantile, 46
- Epileptic encephalopathy, early infantile, 47
- Frontometaphyseal dysplasia 2
- Mental retardation, X-linked, syndromic, Bain type
- Mental retardation, X-linked, syndromic, Borck type
- Short stature, rhizomelic, with microcephaly, micrognathia, and developmental delay
- Sifrim-Hitz-Weiss syndrome
- Sotos syndrome 3
- Spinocerebellar ataxia, autosomal recessive, 24
Changes to the controlled vocabulary for PTMs
New terms for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
Changes to keywords
Change of the UniRef FASTA header
>UniqueIdentifier ClusterName n=Members Tax=TaxonName TaxID=TaxonIdentifier RepID=RepresentativeMember
- UniqueIdentifier is the primary accession number of the UniRef cluster.
- ClusterName is the name of the UniRef cluster.
- Members is the number of UniRef cluster members.
- TaxonName is the scientific name of the lowest common taxon shared by all UniRef cluster members.
- TaxonIdentifier is the NCBI taxonomy identifier of the lowest common taxon shared by all UniRef cluster members.
- RepresentativeMember is the entry name of the representative member of the UniRef cluster.
>UniRef50_Q9K794 Putative AgrB-like protein n=2 Tax=Bacillus TaxID=1386 RepID=AGRB_BACHD MLERLALTLAHQVKALNAEETESVEVLTFGFTIILHYLFTLLLVLAVGLLHGEIWLFLQI ALSFTFMRVLTGGAHLDHSIGCTLLSVLFITAISWVPFANNYAWILYGISGGLLIWKYAP YYEAHQVVHTEHWERRKKRIAYILIVLFIILAMLMSTQGLVLGVLLQGVLLTPIGLKVTR QLNRFILKGGETNEENS