Evaluation of computer tools for the prediction of transcription factor binding sites on genomic DNA

Computational molecular biology tools are becoming the method of choice for high-throughput screening of newly determined DNA sequences. Such bioinformatic methods are invaluable for the analysis of novel genomic sequences because they allow, for instance, the identification of candidate disease-responsible genes [see Rawlings and Searls, 1997, for a review]. Effective DNA sequence analysis demands not only the faithful identification of gene elements and boundaries but also reliable information on the potential function and regulation of the identified genes. Consequently, powerful software tools increasingly rely on the coupling and integration of various prediction algorithms. Such integrated systems should include devices for the recognition of DNA sequences that act as binding sites for regulatory proteins known as transcription factors. The identification of such sites is relevant not only for locating the promoter as the 5' boundary of a gene; it may also allow the prediction of a tissue-specific gene-expression pattern and of responsiveness to known biological signaling pathways. However, binding sites for sequence-specific DNA-binding transcription factors are typically short and degenerate, and their efficient prediction requires sophisticated computational tools. Databases of promoters and transcription factors have been established [Bucher, 1990; Ghosh, 1993; Wingender et al., 1997], and these compiled data were in turn used to develop algorithms and program packages for the identification of transcription factor binding sites on DNA.
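
To make the notion of a short, degenerate binding site concrete, the following minimal sketch scans the forward strand of a sequence for matches to a consensus written in IUPAC ambiguity codes. It is an illustration only and does not reproduce the method of any program discussed here; the function name `find_sites`, the lookup table `IUPAC`, and the example sequence are assumptions introduced for this example.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(sequence, consensus):
    """Return 0-based start positions of consensus matches on the forward strand.

    Hypothetical helper for illustration; the reverse strand is not scanned.
    """
    pattern = "".join(IUPAC[code] for code in consensus.upper())
    # A zero-width lookahead reports overlapping matches as well.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]


# Illustrative (made-up) sequence carrying two variants of a 7-bp site,
# TGACTCA and TGAGTCA, both covered by the degenerate consensus TGASTCA.
example = "CCTGACTCAGGTGAGTCATT"
print(find_sites(example, "TGASTCA"))  # [2, 11]
```

Assuming equal base frequencies, a 7-bp consensus with one twofold-degenerate position is expected to match random sequence about once every 4^7 / 2 = 8192 positions per strand, so simple consensus matching of this kind yields many chance hits on genomic DNA.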