TY - GEN
T1 - Analyze influenza virus sequences using binary encoding approach
AU - Lam, Ham Ching
AU - Boley, Daniel
PY - 2011
Y1 - 2011
N2 - Capturing the mutation pattern of each individual influenza virus sequence is often challenging. In this paper, we demonstrate that a binary encoding scheme coupled with a dimension reduction technique can capture the intrinsic mutation pattern of the virus. Our approach looks at the variance between sequences instead of the commonly used p-distance or Hamming distance. We first convert each influenza genetic sequence to a binary string and then apply Principal Component Analysis (PCA) to the converted sequences. PCA also provides a prediction capability for detecting reassortant viruses by means of data projection. Because the binary strings are sparse, we were able to analyze a large volume of influenza sequence data in a very short time. For protein sequences, our scheme also allows the incorporation of the biophysical properties of each amino acid. We present results from analyzing influenza nucleotide, protein, and genome sequences using the proposed approach. With Next-Generation Sequencing (NGS) promising to sequence DNA at unprecedented speed and to produce massive quantities of data, it is imperative that new techniques be developed to provide quick and reliable analysis of any sequence data. We believe our approach can be used at the upstream stage of a sequence data analysis pipeline to gain insight into which direction to pursue in analyzing the available data.
KW - Binary encoding
KW - Evolution
KW - Influenza virus
KW - Principal component analysis
UR - http://www.scopus.com/inward/record.url?scp=85147398620&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147398620&partnerID=8YFLogxK
U2 - 10.1145/2003351.2003355
DO - 10.1145/2003351.2003355
M3 - Conference contribution
AN - SCOPUS:85147398620
SN - 9781450308397
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
BT - 10th International Workshop on Data Mining in Bioinformatics, BIOKDD 2011 - Held in Conjunction with SIGKDD Conference, the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-2011
PB - Association for Computing Machinery
T2 - 10th International Workshop on Data Mining in Bioinformatics, BIOKDD 2011 - Held in Conjunction with the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-2011
Y2 - 21 August 2011 through 24 August 2011
ER -