The Capacity of Associated Subsequence Retrieval

Behrooz Tahmasebi, Mohammad Ali Maddah-Ali, Seyed Abolfazl Motahari

Research output: Contribution to journalArticlepeer-review

1 Scopus citations

Abstract

The objective of a genome-wide association study (GWAS) is to associate subsequences of individuals' genomes to the observable characteristics called phenotypes (e.g., high blood pressure). Motivated by the GWAS problem, in this paper we introduce the information-theoretic problem of associated subsequence retrieval, where a dataset of N (possibly high-dimensional) sequences of length G, and their corresponding observable (binary) characteristics is given. The sequences are chosen independently and uniformly at random from XG , where X is a finite alphabet. The observable (binary) characteristic is only related to a specific unknown subsequence of length L of the sequences, called associated subsequence. For each sequence, if the associated subsequence of it belongs to a universal finite set, then it is more likely to display the observable characteristic (i.e., it is more likely that the observable characteristic is one). The goal is to retrieve the associated subsequence using a dataset of N sequences and their observable characteristics. We demonstrate that as the parameters N, G, and L grow, a threshold effect appears in the curve of probability of error versus the rate which is defined as it Gh(LG) N , where h(ċ) is the binary entropy function. This effect allows us to define the capacity of associated subsequence retrieval. We develop an achievable scheme and a matching converse for this problem, and thus characterize its capacity in two scenarios: the zero-error-rate and the-error-rate.

Original languageEnglish (US)
Article number9246738
Pages (from-to)790-804
Number of pages15
JournalIEEE Transactions on Information Theory
Volume67
Issue number2
DOIs
StatePublished - Feb 2021
Externally publishedYes

Bibliographical note

Publisher Copyright:
© 1963-2012 IEEE.

Keywords

  • Genome-wide association study (GWAS)
  • Shannon theory
  • threshold effect

Fingerprint

Dive into the research topics of 'The Capacity of Associated Subsequence Retrieval'. Together they form a unique fingerprint.

Cite this