TY - GEN

T1 - Divide and conquer strategies for effective information retrieval

AU - Chen, Jie

AU - Saad, Yousef

PY - 2009

Y1 - 2009

N2 - The standard application of Latent Semantic Indexing (LSI), a well-known technique for information retrieval, requires the computation of a partial Singular Value Decomposition (SVD) of the term-document matrix. This computation becomes infeasible for large document collections, since it is very demanding both in terms of arithmetic operations and in memory requirements. This paper discusses two divide and conquer strategies, with the goal of alleviating these difficulties. Both strategies divide the document collection in subsets, perform relevance analysis on each subset, and conquer the analysis results to form the query response. Since each sub-problem resulting from the recursive division process has a smaller size, the processing of large scale document collections requires much fewer resources. In addition, the computation is highly parallel and can be easily adapted to a parallel computing environment. To reduce the computational cost, we perform the analysis on the subsets by using the Lanczos vectors instead of singular vectors as in the standard LSI method. This technique is far more efficient than the computation of the truncated SVD, while its accuracy is comparable. Experimental results confirm that the proposed divide and conquer strategies are effective for information retrieval problems.

AB - The standard application of Latent Semantic Indexing (LSI), a well-known technique for information retrieval, requires the computation of a partial Singular Value Decomposition (SVD) of the term-document matrix. This computation becomes infeasible for large document collections, since it is very demanding both in terms of arithmetic operations and in memory requirements. This paper discusses two divide and conquer strategies, with the goal of alleviating these difficulties. Both strategies divide the document collection in subsets, perform relevance analysis on each subset, and conquer the analysis results to form the query response. Since each sub-problem resulting from the recursive division process has a smaller size, the processing of large scale document collections requires much fewer resources. In addition, the computation is highly parallel and can be easily adapted to a parallel computing environment. To reduce the computational cost, we perform the analysis on the subsets by using the Lanczos vectors instead of singular vectors as in the standard LSI method. This technique is far more efficient than the computation of the truncated SVD, while its accuracy is comparable. Experimental results confirm that the proposed divide and conquer strategies are effective for information retrieval problems.

UR - http://www.scopus.com/inward/record.url?scp=72849117985&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=72849117985&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:72849117985

SN - 9781615671090

T3 - Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics

SP - 445

EP - 456

BT - Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics 133

T2 - 9th SIAM International Conference on Data Mining 2009, SDM 2009

Y2 - 30 April 2009 through 2 May 2009

ER -