UniProt release 2011_07
Published June 28, 2011
Killing myself softly – Bacterial and Archaeal Type II Toxin-Antitoxin modules (TA)
Bacteria produce many kinds of toxins that attack other organisms: cholera toxin, botulinum toxin, aerolysins, insecticidal toxins, extracellular proteases, to name just a few. In the past 5 years it has become apparent that most free-living bacteria produce another kind of internal toxin. These toxins are almost always encoded as bicistronic antitoxin-toxin operons (TA module), where the antitoxin is unstable and neutralizes the toxin. In type I and III systems the antitoxin is a small RNA, while in type II systems the antitoxin is a protein. If the antitoxin levels decrease then the toxin levels increase and toxic effects can be seen at the cellular level. In most type II cases studied, the antitoxin acts as an autoregulator, repressing transcription; frequently but not always, the toxin acts as a corepressor. First identified in bacteriophage and on plasmids, where their role is clearly in plasmid or phage maintenance, the role of the chromosomally encoded toxins in bacteria is hotly debated. Proposed functions include maintenance of mobile genetic elements, programmed cell death, induction of persistence (dormancy), stress response, virulence promotion in a host, and regulation of biofilm formation. The toxin’s role may in fact depend on the physiology of the organism in question. Although they are widespread in Archaea, no function for these toxins has been shown in vivo.
There are many toxin families. The best characterized so far are the bacterial ribonuclease toxins which belong to the MazF, RelE, MqsR, HigB, YoeB and VapC families. Most of these toxins degrade mRNA; some are sequence-specific, some work only in association with ribosomes, while for others the mode is unknown. VapC has been shown to degrade the anticodon loop of tRNAfMet. Other cellular functions are also toxin targets; DNA gyrase is targeted by the ParE toxin, HipA toxin probably acts by inappropriately phosphorylating cellular targets, RatA blocks ribosomal subunit association, PezT/zeta toxin corrupts peptidoglycan synthesis and CbtA (formerly YeeV) toxin binds FtsZ and MreB, inhibiting them, possibly simultaneously. While the toxins form distinct families, their cognate antitoxins do not, although almost all of them have a DNA-binding domain, in accordance with their probable role in operon regulation. To further complicate matters, in Mycobacterium tuberculosis H37Rv cross-talk between toxins and some non-cognate antitoxins has been seen, while in Caulobacter crescentus such cross-talk does not occur. Additionally, potential new toxins are detected quite frequently.
We recently performed a major update of many of the type II TA families in UniProtKB/Swiss-Prot, with particular attention given to the model organisms Mycobacterium tuberculosis strain H37Rv and E.coli K12 / MG1655. 65 TA modules have been annotated in M. tuberculosis and 15 in E.coli; gene names for M.tuberculosis were assigned in collaboration with the TubercuList database. Interestingly, TA modules are more abundant in pathogens than in related non-pathogenic strains. Hence Mycobacterium smegmatis, a non-pathogenic mycobacterium, is only predicted to encode 3 TA modules. Since January 2011 the mode of action of at least 4 toxin families has been elucidated (CbtA, PezT/zeta, RatA and VapC).
Although the PezT/zeta toxin and associated antitoxin module have not been predicted to exist in M.tuberculosis, there are indeed loci belonging to this TA module encoded in the genome. This is currently such a hot topic that integrating the data will keep us busy for quite a while yet.
All manually annotated type II TA module entries can be retrieved from UniProtKB/Swiss-Prot using the query toxin-antitoxin (TA) module.
Provision of complete proteome data sets for IPI species by UniProt
Complete proteome data sets are now available for download from the FTP and web sites for the species in the International Protein Index (IPI) which is scheduled for closure this year. IPI is an integrated database which clusters protein sequences from different databases to provide non-redundant complete data sets for selected higher eukaryotic organisms. Since it was launched in 2001, IPI has covered the gaps in the gene predictions between different databases, but since then the situation has improved for many of the most-studied genomes. This is due to a close collaboration between Ensembl, RefSeq and UniProt which aims to provide a standard set of gene predictions for the genomes of interest. These new complete proteomes will therefore provide high coverage complete proteomes for IPI users. The complete UniProtKB proteomes will be based on existing UniProtKB sequences supplemented by missing high quality predictions imported from Ensembl.
For Homo sapiens, a first pass annotation of the complete proteome was completed by UniProt in 2008 and all entries were incorporated into UniProtKB/Swiss-Prot. Within this UniProtKB/Swiss-Prot complete H. sapiens proteome, approximately 20,000 putative protein-coding genes are represented by one canonical protein sequence, with some entries describing multiple isoform sequences. Since its initial release, the UniProtKB/Swiss-Prot complete H. sapiens proteome has been extensively curated and the Ensembl cross-references -mapped based on sequence identity -are in the process of being manually verified. All predicted protein sequences from Ensembl (except fragments) that were found to be absent from the UniProtKB/Swiss-Prot complete H. sapiens proteome were imported into the unreviewed component of UniProtKB, UniProtKB/TrEMBL (see release 2011_05 headline). These imported UniProtKB/TrEMBL entries were tagged with the keyword ‘Complete proteome’. The aim of this import was to increase the coverage of the existing complete proteome, by supplementing it with those Ensembl protein sequences that had no UniProtKB counterpart. The resulting UniProtKB complete H. sapiens proteome includes both reviewed sequences from UniProtKB/Swiss-Prot (equivalent to an updated version of the complete H. sapiens proteome completed in 2008), now supplemented by unreviewed sequences from UniProtKB/TrEMBL. This process will enable the synchronization of the UniProt set with the CCDS project. This version of the complete H. sapiens proteome provides higher sequence coverage than the preceding version, but now includes sequences that have not been manually reviewed. Users can choose to opt either for this expanded complete H. sapiens proteome or a reduced version that derives exclusively from UniProtKB/Swiss-Prot.
For the other IPI species (mouse, rat, chicken, zebrafish, cow and dog), we added the keyword ‘Complete proteome’ to the existing UniProtKB/Swiss-Prot entries. We identified those entries in UniProtKB/TrEMBL which mapped to the complete genome in Ensembl and imported the predicted protein sequences (except fragments) from Ensembl which were found not to be present in UniProtKB. The keyword ‘Complete proteome’ was also added to these entries. As for the human counterpart, these proteomes can now be easily retrieved using this keyword. The Ensembl cross-references have been added to the UniProtKB entries on the basis of 100% sequence identity over their full length.
We will expand the coverage to other species of interest in the near future and expect this will be very useful for our users as it will eliminate the need to combine data from different databases.
Changes concerning the controlled vocabulary for PTMsNew terms for the feature key ‘Cross-link’ (‘CROSSLNK’ in the flat file):
- 2-(3-methylbutanoyl)-5-hydroxy-oxazole-4-carbothionic acid (Leu-Cys)
- Proline 5-hydroxy-oxazole-4-carbothionic acid (Pro-Cys)
- S-(15-deoxy-Delta12,14-prostaglandin J2-9-yl)cysteine
Changes to keywordsModified keyword:
- Provirus integration -> Viral genome integration