Abstract
This paper addresses applications of suffix trees and generalized suffix trees (GSTs) to biological sequence data analysis. We define a basic set of suffix tree and GST operations needed to support sequence data analysis. While those definitions are straightforward, the construction and manipulation of disk-based GST structures for large volumes of sequence data require intricate design. GST processing is fast because the structure is content-addressable, supporting efficient searches for all sequences that contain a particular subsequence. Instead of laboriously scanning sequences stored as arrays, we search by walking down the tree. We present a new GST-based sequence alignment algorithm, called GESTALT. GESTALT finds all exact matches in parallel and uses best-first search to extend them into alignments. Our experience implementing applications that use GST structures for sequence analysis leads us to conclude that GSTs are valuable tools for analyzing biological sequence data.
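To illustrate what "searching by walking down the tree" means, here is a minimal in-memory sketch. It is not the paper's disk-based GST or the GESTALT algorithm: it assumes a naive, uncompressed suffix trie (one node per symbol, quadratic space), and the names `SuffixTrieNode`, `build_generalized_suffix_trie`, and `find_occurrences` are hypothetical, chosen only for this example.

```python
class SuffixTrieNode:
    """One node of a naive (uncompressed) generalized suffix trie."""

    def __init__(self):
        self.children = {}        # symbol -> child node
        self.occurrences = set()  # (sequence_id, start_offset) pairs


def build_generalized_suffix_trie(sequences):
    """Insert every suffix of every sequence into one shared trie.

    `sequences` maps a sequence id to its string. This is quadratic in
    sequence length; a real suffix tree compresses paths and can be
    built in linear time.
    """
    root = SuffixTrieNode()
    for seq_id, seq in sequences.items():
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:]:
                node = node.children.setdefault(symbol, SuffixTrieNode())
                # Every prefix of this suffix occurs in seq at `start`.
                node.occurrences.add((seq_id, start))
    return root


def find_occurrences(root, pattern):
    """Walk down the trie along `pattern`; return sorted (id, offset) hits."""
    node = root
    for symbol in pattern:
        node = node.children.get(symbol)
        if node is None:
            return []  # no stored sequence contains the pattern
    return sorted(node.occurrences)


if __name__ == "__main__":
    # Illustrative sequences, not data from the paper.
    trie = build_generalized_suffix_trie({"s1": "GATTACA", "s2": "TTAC"})
    print(find_occurrences(trie, "TTAC"))
```

The lookup cost depends on the length of the pattern rather than on the total length of the stored sequences, which is the content-addressable property the abstract refers to.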
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the Hawaii International Conference on System Sciences |
Editors | Jay F. Nunamaker, Ralph H. Sprague Jr. |
Publisher | IEEE |
Pages | 35-44 |
Number of pages | 10 |
Volume | 5 |
ISBN (Print) | 0818650907 |
State | Published - Jan 1 1995 |
Event | Proceedings of the 27th Hawaii International Conference on System Sciences (HICSS-27). Part 4 (of 5) - Wailea, HI, USA. Duration: Jan 4 1994 → Jan 7 1994 |