Penalized Fast Subset Scanning

Skyler Speakman, Sriram Somanchi, Edward McFowland, Daniel B. Neill

Research output: Contribution to journalArticle

3 Citations (Scopus)

Abstract

We present the penalized fast subset scan (PFSS), a new and general framework for scalable and accurate pattern detection. PFSS enables exact and efficient identification of the most anomalous subsets of the data, as measured by a likelihood ratio scan statistic. However, PFSS also allows incorporation of prior information about each data element’s probability of inclusion, which was not previously possible within the subset scan framework. PFSS builds on two main results: first, we prove that a large class of likelihood ratio statistics satisfy a property that allows additional, element-specific penalty terms to be included while maintaining efficient computation. Second, we prove that the penalized statistic can be maximized exactly by evaluating only O(N) subsets. As a concrete example of the PFSS framework, we incorporate “soft” constraints on spatial proximity into the spatial event detection task, enabling more accurate detection of irregularly shaped spatial clusters of varying sparsity. To do so, we develop a distance-based penalty function that rewards spatial compactness and penalizes spatially dispersed clusters. This approach was evaluated on the task of detecting simulated anthrax bio-attacks, using real-world Emergency Department data from a major U.S. city. PFSS demonstrated increased detection power and spatial accuracy as compared to competing methods while maintaining efficient computation.

Original languageEnglish (US)
Pages (from-to)382-404
Number of pages23
JournalJournal of Computational and Graphical Statistics
Volume25
Issue number2
DOIs
StatePublished - Apr 2 2016

Fingerprint

Scanning
Subset
Scan Statistic
Soft Constraints
Event Detection
Likelihood Ratio Statistic
Likelihood Ratio
Penalty Function
Prior Information
Sparsity
Reward
Emergency
Proximity
Anomalous
Statistic
Compactness
Penalty
Likelihood ratio statistic
Inclusion
Attack

Keywords

  • Disease surveillance
  • Likelihood ratio statistic
  • Pattern detection
  • Scan statistic

Cite this

Penalized Fast Subset Scanning. / Speakman, Skyler; Somanchi, Sriram; McFowland, Edward; Neill, Daniel B.

In: Journal of Computational and Graphical Statistics, Vol. 25, No. 2, 02.04.2016, p. 382-404.

Research output: Contribution to journalArticle

Speakman, Skyler ; Somanchi, Sriram ; McFowland, Edward ; Neill, Daniel B. / Penalized Fast Subset Scanning. In: Journal of Computational and Graphical Statistics. 2016 ; Vol. 25, No. 2. pp. 382-404.
@article{f937bb539355403399cbdabe9f89aeb2,
title = "Penalized Fast Subset Scanning",
abstract = "We present the penalized fast subset scan (PFSS), a new and general framework for scalable and accurate pattern detection. PFSS enables exact and efficient identification of the most anomalous subsets of the data, as measured by a likelihood ratio scan statistic. However, PFSS also allows incorporation of prior information about each data element’s probability of inclusion, which was not previously possible within the subset scan framework. PFSS builds on two main results: first, we prove that a large class of likelihood ratio statistics satisfy a property that allows additional, element-specific penalty terms to be included while maintaining efficient computation. Second, we prove that the penalized statistic can be maximized exactly by evaluating only O(N) subsets. As a concrete example of the PFSS framework, we incorporate “soft” constraints on spatial proximity into the spatial event detection task, enabling more accurate detection of irregularly shaped spatial clusters of varying sparsity. To do so, we develop a distance-based penalty function that rewards spatial compactness and penalizes spatially dispersed clusters. This approach was evaluated on the task of detecting simulated anthrax bio-attacks, using real-world Emergency Department data from a major U.S. city. PFSS demonstrated increased detection power and spatial accuracy as compared to competing methods while maintaining efficient computation.",
keywords = "Disease surveillance, Likelihood ratio statistic, Pattern detection, Scan statistic",
author = "Skyler Speakman and Sriram Somanchi and Edward McFowland and Neill, {Daniel B.}",
year = "2016",
month = "4",
day = "2",
doi = "10.1080/10618600.2015.1029578",
language = "English (US)",
volume = "25",
pages = "382--404",
journal = "Journal of Computational and Graphical Statistics",
issn = "1061-8600",
publisher = "American Statistical Association",
number = "2",

}

TY - JOUR

T1 - Penalized Fast Subset Scanning

AU - Speakman, Skyler

AU - Somanchi, Sriram

AU - McFowland, Edward

AU - Neill, Daniel B.

PY - 2016/4/2

Y1 - 2016/4/2

N2 - We present the penalized fast subset scan (PFSS), a new and general framework for scalable and accurate pattern detection. PFSS enables exact and efficient identification of the most anomalous subsets of the data, as measured by a likelihood ratio scan statistic. However, PFSS also allows incorporation of prior information about each data element’s probability of inclusion, which was not previously possible within the subset scan framework. PFSS builds on two main results: first, we prove that a large class of likelihood ratio statistics satisfy a property that allows additional, element-specific penalty terms to be included while maintaining efficient computation. Second, we prove that the penalized statistic can be maximized exactly by evaluating only O(N) subsets. As a concrete example of the PFSS framework, we incorporate “soft” constraints on spatial proximity into the spatial event detection task, enabling more accurate detection of irregularly shaped spatial clusters of varying sparsity. To do so, we develop a distance-based penalty function that rewards spatial compactness and penalizes spatially dispersed clusters. This approach was evaluated on the task of detecting simulated anthrax bio-attacks, using real-world Emergency Department data from a major U.S. city. PFSS demonstrated increased detection power and spatial accuracy as compared to competing methods while maintaining efficient computation.

AB - We present the penalized fast subset scan (PFSS), a new and general framework for scalable and accurate pattern detection. PFSS enables exact and efficient identification of the most anomalous subsets of the data, as measured by a likelihood ratio scan statistic. However, PFSS also allows incorporation of prior information about each data element’s probability of inclusion, which was not previously possible within the subset scan framework. PFSS builds on two main results: first, we prove that a large class of likelihood ratio statistics satisfy a property that allows additional, element-specific penalty terms to be included while maintaining efficient computation. Second, we prove that the penalized statistic can be maximized exactly by evaluating only O(N) subsets. As a concrete example of the PFSS framework, we incorporate “soft” constraints on spatial proximity into the spatial event detection task, enabling more accurate detection of irregularly shaped spatial clusters of varying sparsity. To do so, we develop a distance-based penalty function that rewards spatial compactness and penalizes spatially dispersed clusters. This approach was evaluated on the task of detecting simulated anthrax bio-attacks, using real-world Emergency Department data from a major U.S. city. PFSS demonstrated increased detection power and spatial accuracy as compared to competing methods while maintaining efficient computation.

KW - Disease surveillance

KW - Likelihood ratio statistic

KW - Pattern detection

KW - Scan statistic

UR - http://www.scopus.com/inward/record.url?scp=84971476718&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84971476718&partnerID=8YFLogxK

U2 - 10.1080/10618600.2015.1029578

DO - 10.1080/10618600.2015.1029578

M3 - Article

AN - SCOPUS:84971476718

VL - 25

SP - 382

EP - 404

JO - Journal of Computational and Graphical Statistics

JF - Journal of Computational and Graphical Statistics

SN - 1061-8600

IS - 2

ER -