Automatic annotation program
UniProt’s Automatic Annotation pipeline enriches the unreviewed records in UniProtKB with automatic classification and annotation.
UniProt uses InterPro to classify sequences at superfamily, family and subfamily levels and to predict the occurrence of functional domains and important sites. InterPro integrates predictive models of protein function, so-called ‘signatures’, from a number of member databases. InterPro matches are automatically annotated to UniProtKB entries as database cross-references with every InterPro release.
UniProt has developed two prediction systems, UniRule and the Statistical Automatic Annotation System (SAAS), to automatically annotate UniProtKB/TrEMBL in an efficient and scalable manner with a high degree of accuracy:
- Based on rules
- Rules are created, tested and validated against published experimental data in UniProtKB/Swiss-Prot
- Rules are linked to InterPro member database signatures
- Rules have annotations and conditions
- Rules are reapplied to UniProtKB/TrEMBL at every four-weekly release, with both automatic and manual QA procedures ensuring they are still valid
Rules are devised and tested by experienced curators using experimental data from manually annotated entries as templates. The Unified Rule (UniRule) system is being developed by merging existing manual rule-based systems (HAMAP, PIR name and site rules, and RuleBase rules) into one system which stores, applies, and evaluates all rules. Although originally developed independently, these rule systems all share a common scientific approach: using protein family membership coupled with additional evolutionary and sequence analysis to accurately identify and annotate protein sequences. UniRule rules can annotate protein properties such as the protein name, function, catalytic activity, pathway membership, and subcellular location, along with sequence-specific information, such as the positions of post-translational modifications and active sites. All predictions are refreshed with each UniProtKB release so that they reflect the current state of knowledge.
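The idea of a rule as a set of conditions plus a set of annotations can be illustrated with a minimal Python sketch. This is not UniRule's actual data model or code: the class names, field names, signature identifiers ("SIG0001"), rule identifiers and annotation values below are all hypothetical, and the only conditions modelled are signature matches and an optional taxonomic restriction.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:
    """A simplified unreviewed entry (hypothetical structure)."""
    accession: str
    signatures: set[str]            # member database signatures matched by the sequence
    lineage: tuple[str, ...]        # taxonomic lineage, root first
    annotations: dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    """A rule: conditions that must hold, and annotations to propagate."""
    rule_id: str
    required_signatures: set[str]   # condition: all of these must match
    required_taxon: str | None      # condition: optional taxonomic scope
    annotations: dict[str, str]     # e.g. protein name, pathway, location

    def matches(self, entry: Entry) -> bool:
        if not self.required_signatures <= entry.signatures:
            return False
        return self.required_taxon is None or self.required_taxon in entry.lineage


def apply_rules(rules: list[Rule], entry: Entry) -> list[str]:
    """Copy annotations from every matching rule; return the IDs of rules that fired."""
    fired = []
    for rule in rules:
        if rule.matches(entry):
            for key, value in rule.annotations.items():
                # Assumption: the first rule to supply a value wins.
                entry.annotations.setdefault(key, value)
            fired.append(rule.rule_id)
    return fired


if __name__ == "__main__":
    # Entirely made-up example data.
    entry = Entry("X00001", {"SIG0001", "SIG0002"}, ("Bacteria", "Proteobacteria"))
    rules = [
        Rule("RULE_A", {"SIG0001"}, "Bacteria", {"protein_name": "Example hydrolase"}),
        Rule("RULE_B", {"SIG0009"}, None, {"pathway": "Example pathway"}),
    ]
    print(apply_rules(rules, entry), entry.annotations)
```

Because the rules are plain data, re-running `apply_rules` over all entries at each release is enough to refresh every prediction, which mirrors the reapplication step described in the list above.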
SAAS generates automatic rules for functional annotation from expertly annotated entries in UniProtKB/Swiss-Prot using the C4.5 decision tree algorithm. This algorithm uses machine learning to find the most concise rule for an annotation based on the properties of sequence length, InterPro group membership and taxonomy. SAAS employs a data exclusion set that censors data not suitable for computational annotation (such as specific biophysical or chemical properties) and generates human-readable rules for each release. SAAS rules can annotate protein properties such as function, catalytic activity, pathway membership, and subcellular location, but protein names and feature predictions are currently excluded. Generating rules on the fly in this way allows rules to evolve along with the content of UniProtKB with little or no manual intervention. It also provides a constant supply of potential “seed rules” which can be further developed by the curators into UniRules.
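To show how a C4.5-style learner picks the property to test first, here is a minimal Python sketch of the gain ratio, the split criterion C4.5 uses. It is not the SAAS implementation: the training rows, attribute names and labels are invented for illustration, only categorical attributes are handled, and the thresholding of numeric attributes (such as sequence length) and tree pruning are left out.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, normalised by split information."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    # Weighted entropy of the target within each branch of the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy([row[target] for row in rows]) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest gain ratio."""
    return max(attributes, key=lambda a: gain_ratio(rows, a, target))


# Hypothetical training data standing in for reviewed entries.
TRAINING = [
    {"interpro_group": "G1", "kingdom": "Bacteria", "location": "Cytoplasm"},
    {"interpro_group": "G1", "kingdom": "Eukaryota", "location": "Cytoplasm"},
    {"interpro_group": "G2", "kingdom": "Bacteria", "location": "Membrane"},
    {"interpro_group": "G2", "kingdom": "Eukaryota", "location": "Membrane"},
]

if __name__ == "__main__":
    print(best_attribute(TRAINING, ["interpro_group", "kingdom"], "location"))
```

In this toy set the group membership separates the two locations perfectly while the kingdom carries no information, so the resulting rule would test group membership alone, which is the sense in which the learner finds the most concise rule.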