TY - GEN
T1 - Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs
AU - Xiong, Hui
AU - Shekhar, Shashi
AU - Tan, Pang Ning
AU - Kumar, Vipin
PY - 2004
Y1 - 2004
N2 - Given a user-specified minimum correlation threshold 6 and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent or improves when the number of items is increased in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
AB - Given a user-specified minimum correlation threshold 6 and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent or improves when the number of items is increased in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
KW - Pearson's Correlation Coefficient
KW - Statistical Computing
UR - http://www.scopus.com/inward/record.url?scp=12244275963&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=12244275963&partnerID=8YFLogxK
U2 - 10.1145/1014052.1014090
DO - 10.1145/1014052.1014090
M3 - Conference contribution
AN - SCOPUS:12244275963
SN - 1581138881
SN - 9781581138887
T3 - KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 334
EP - 343
BT - KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Y2 - 22 August 2004 through 25 August 2004
ER -