Generalized suffix trees for biological sequence data: Applications and implementation

Paul Bieganski, John Riedl, John V. Carlis, Ernest F. Retzel

Research output: Chapter in Book/Report/Conference proceedingConference contribution

58 Scopus citations

Abstract

This paper addresses applications of suffix trees and generalized suffix trees (GSTs) to biological sequence data analysis. We define a basic set of suffix tree and GST operations needed to support sequence data analysis. While those definitions are straightforward, the construction and manipulation of disk-based GST structures for large volumes of sequence data requires intricate design. GST processing is fast because the structure is content addressable, supporting efficient searches for all sequences that contain particular subsequences. Instead of laboriously searching sequences stored as arrays, we search by walking down the tree. We present a new GST-based sequence alignment algorithm, called GESTALT. GESTALT finds all exact matches in parallel, and uses best-first search to extend them to produce alignments. Our implementation experiences with applications using GST structures for sequence analysis lead us to conclude that GSTs are valuable tools for analyzing biological sequence data.

Original languageEnglish (US)
Title of host publicationProceedings of the Hawaii International Conference on System Sciences
EditorsJay F. Nunamaker, Ralph H.Jr. Sprague
PublisherPubl by IEEE
Pages35-44
Number of pages10
Volume5
ISBN (Print)0818650907
StatePublished - Jan 1 1995
EventProceedings of the 27th Hawaii International Conference on System Sciences (HICSS-27). Part 4 (of 5) - Wailea, HI, USA
Duration: Jan 4 1994Jan 7 1994

Other

OtherProceedings of the 27th Hawaii International Conference on System Sciences (HICSS-27). Part 4 (of 5)
CityWailea, HI, USA
Period1/4/941/7/94

Cite this