Biochemical and Biophysical Research Communications, Vol.469, No.4, 1021-1027, 2016
Effect of k-tuple length on sample-comparison with high-throughput sequencing data
High-throughput metagenomic sequencing offers a powerful technique for comparing microbial communities. Without requiring extra reference sequences, alignment-free models with short k-tuples (k = 2-10 bp) have yielded promising results. Short k-tuples describe the overall statistical distribution, but they hardly capture the specific characteristics of an individual microbial community. Longer k-tuples carry more information. However, because the frequency vector of long k-tuples (k >= 30 bp) is sparse, the statistical measures designed for short k-tuples are not applicable. In our study, we treated each tuple as a meaningful word and each sequencing dataset as a document composed of these words. The comparison between two sequencing datasets is therefore handled as "topic analysis of documents" in text mining. We designed a pipeline that compares metagenomic samples using long k-tuple features combined with algorithms from text mining and pattern recognition. The pipeline is available at http://culotuple.codeplex.com/. Experiments show that our pipeline with long k-tuple features: (1) separates genomes with high similarity; (2) outperforms short k-tuple models in all experiments: when k >= 12, the short k-tuple measures are no longer applicable, and when k is between 20 and 40, the long k-tuple pipeline obtains much better grouping results; (3) is free from the effect of sequencing platforms/protocols; and (4) yields meaningful and supported biological results on the 40-tuples selected for comparison. (C) 2015 Elsevier Inc. All rights reserved.
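
The "tuple as word, sample as document" idea can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the toy reads, the choice of k = 4, and the use of plain cosine similarity on raw counts are all assumptions made for illustration. The abstract does not specify which text-mining or pattern-recognition algorithms the pipeline uses.

```python
from collections import Counter
import math


def ktuples(read, k):
    """Yield every overlapping k-tuple (substring of length k) in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def sample_to_document(reads, k):
    """Count k-tuples over all reads of a sample: a sparse bag of 'words'."""
    counts = Counter()
    for read in reads:
        counts.update(ktuples(read, k))
    return counts


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two sparse count vectors held as Counters."""
    shared = doc_a.keys() & doc_b.keys()
    dot = sum(doc_a[w] * doc_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in doc_a.values()))
    norm_b = math.sqrt(sum(v * v for v in doc_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy samples; real inputs would be sequencing reads and k >= 20.
K = 4
sample_a = sample_to_document(["ACGTACGTAC"], K)
sample_b = sample_to_document(["ACGTACGTAC"], K)
sample_c = sample_to_document(["TTTTGGGGCC"], K)

sim_ab = cosine_similarity(sample_a, sample_b)  # identical tuple content
sim_ac = cosine_similarity(sample_a, sample_c)  # no tuple in common
```

Storing counts in a dictionary rather than a dense vector of length 4^k reflects the sparsity the abstract mentions: for long tuples, almost all possible words never occur in a given sample.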