The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data. We provide UniMES clusters to give complete coverage of sequence space at different resolutions.
Clustered sets of sequences are available at two resolutions: 100% (unimes_cluster100.fasta) and 90% (unimes_cluster90.fasta). In unimes_cluster100.fasta, identical sequences and subfragments from unimes.fasta are placed into a single cluster. The unimes_cluster90.fasta file is built by clustering the representative sequences of unimes_cluster100.fasta (the longest sequence in each cluster) using the CD-HIT algorithm (Li W. and Godzik A., Bioinformatics, 22:1658-1659, 2006), such that each cluster is composed of sequences that have at least 90% sequence identity to the representative sequence. Only the representative sequences of the clusters are present in these files.
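
The 100% clustering rule described above (identical sequences and subfragments share a cluster, and the longest sequence is the representative) can be illustrated with a small Python sketch. This is a toy illustration only, not the pipeline used to build the UniMES files: the function name, the sequence identifiers and the example sequences are all hypothetical, and the naive substring test would not scale to real data. The 90% step is not shown, since it relies on CD-HIT.

```python
def cluster_identical_and_subfragments(sequences):
    """Toy 100% clustering (hypothetical helper, not the UniMES pipeline).

    `sequences` maps a sequence identifier to its residue string.
    Returns a dict mapping each representative identifier to the list of
    identifiers in its cluster (the representative itself comes first).
    """
    clusters = {}
    # Visit the longest sequences first, so that a representative is always
    # seen before any identical copy or subfragment of it.
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        for rep_id in clusters:
            # An identical sequence or a subfragment is a substring of the
            # representative; it joins the first matching cluster.
            if seq in sequences[rep_id]:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No representative contains this sequence: start a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters


# Made-up example sequences (assumptions, not taken from unimes.fasta).
example = {
    "seqA": "MKTAYIAKQRQISFVKSHFSRQ",
    "seqB": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to seqA
    "seqC": "AKQRQISF",                # subfragment of seqA
    "seqD": "MSDNELLQ",                # unrelated sequence
}
clusters = cluster_identical_and_subfragments(example)
for representative, members in clusters.items():
    print(representative, members)
```

In this sketch only seqA and seqD would be written to the output file, which mirrors the statement that only representative sequences are present in the cluster files.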