Biochemical and Biophysical Research Communications, Vol.422, No.1, 36-41, 2012
The elusive short gene - an ensemble method for recognition for prokaryotic genome
Accurate prediction of short protein-coding DNA from genome sequence information remains an unsolved problem in DNA sequence analysis. Popular gene-finding tools show a drastic reduction in accuracy when predicting genes shorter than 400 nt, a length we define as short. This study quantitatively evaluates a set of selected coding measures in terms of their discriminative power in recognizing short genes in prokaryotic genomes. By applying the Fast Correlation Based Feature Selection (FCBF) technique, we identified a subset of coding measures with high discriminative power. Using these measures, we present a novel approach for short-gene recognition. A short-gene predictor employing AdaBoost.M1 with random forests as the base classifier gives 92.74% accuracy, 94.77% sensitivity and 90.06% specificity on short genes. (C) 2012 Elsevier Inc. All rights reserved.
Keywords: Computational gene finding; Short gene prediction; Ensemble classifier; Feature selection; AdaBoost.M1; Random forests
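
The accuracy, sensitivity and specificity reported in the abstract are standard confusion-matrix ratios. The following is a minimal sketch of how such figures are typically computed, assuming short coding sequences are treated as the positive class. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts.

    tp: coding sequences predicted as coding (true positives)
    fn: coding sequences predicted as non-coding (false negatives)
    tn: non-coding sequences predicted as non-coding (true negatives)
    fp: non-coding sequences predicted as coding (false positives)
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "sensitivity": tp / positives,   # fraction of true genes recovered
        "specificity": tn / negatives,   # fraction of non-genes rejected
    }


# Hypothetical counts for illustration only (not data from the study).
example = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print(example)
```

Sensitivity and specificity are reported separately because a predictor can reach high accuracy on an imbalanced test set while still missing most short genes or over-calling them.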