Data mining and bioinformatics

Machine Learning for Analyzing Genome-Wide Expression Profiles and Proteomics Data Sets
Biological research is becoming increasingly database driven, motivated, in part, by the advent of large-scale functional genomics and proteomics experiments such as those comprehensively measuring gene expression. These provide a wealth of information on each of the thousands of proteins encoded by a genome. Consequently, a challenge in bioinformatics is integrating databases to connect this disparate information as well as performing large-scale studies to collectively analyze many different data sets. This approach represents a paradigm shift away from traditional single-gene biology, and it often involves statistical analyses focusing on the occurrence of particular features (e.g., folds, functions, interactions, pseudogenes, or localization) in a large population of proteins. Moreover, the explicit application of machine learning techniques can be used to discover trends and patterns in the underlying data. In this article, we give several examples of these techniques in a genomic context: clustering methods to organize microarray expression data, support vector machines to predict protein function, Bayesian networks to predict subcellular localization, and decision trees to optimize target selection for high-throughput proteomics.
Biological Research Is Database Oriented
Databases have defined the information structure of molecular biology for over a decade, archiving thousands of protein and nucleotide sequences and three-dimensional (3-D) structures. As large-scale genomics and proteomics move to the forefront of biological research, the role of databases has become more significant than ever. The current landscape of biological databases includes large public archives, such as GenBank, DDBJ, and EMBL for nucleic acid sequences [1]; PIR and SWISS-PROT for protein sequences [2]; and the Protein Data Bank for 3-D protein structure coordinate sets [3]. Another source of sequence data is dbEST [4], a division of GenBank storing expressed sequence tags (ESTs) from cell lines, which provide information about gene expression in various tissues. For more than a decade, databases such as these have been steadily accumulating gene sequences and protein structures, which are submitted on a per-instance basis from disparate laboratories in the biological sciences community. In addition to these general repositories of biomolecular data, specialized systems have been developed that extend the interpretation of the data by providing a context for individual sequences and structures. The SCOP, CATH, and FSSP [5] databases classify proteins based on structural similarity; Pfam and ProtoMap [6] identify families of proteins based on sequence homology; and PartsList and GeneCensus [7] give dynamic reports on the occurrence of protein families in various genomes. Databases have also been developed to provide comprehensive access to sequence, expression, and functional data for all the known genes of specific model organisms

