I would like to test the performance of a sequence-based prediction method: Can I use UniProt to build a negative data set?
Last modified July 11, 2012
The manual curation process of UniProtKB/Swiss-Prot includes extensive literature curation, and the annotation items with experimental evidence can be used to construct positive data sets, e.g. all human entries with experimentally determined signal sequences.
However, the absence of annotation should not be used to build negative data sets: It is only in very rare cases that negative annotation is applied, e.g. entries known not to be glycosylated.
Curating a negative data set requires about as much manual curation as building a positive data set. The absence of an annotation does not mean absence of a function (a true negative). Lack of annotation may simply be due to false negatives: incompleteness either in the state of experiment-derived knowledge of a particular protein's function, or incompleteness in representing that knowledge as annotations, i.e. an entry may not be up-to-date and therefore does not have the positive annotation (yet).
In order to obtain a reliable predictor, we recommend to be extremely conservative when trying to build your set, and in case of doubt contact us about the function or modification you are trying to predict.