TY - GEN
T1 - Content-based methods for predicting web-site demographic attributes
AU - Kabbur, Santosh
AU - Han, Eui Hong
AU - Karypis, George
PY - 2010
Y1 - 2010
N2 - Demographic information plays an important role in gaining valuable insights about a web-site's user-base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using information derived from their content and their hyperlinked structure and not relying on any information directly or indirectly obtained from the web-site's users. Such methods are important because users are becoming increasingly more concerned about sharing their personal and behavioral information on the Internet. Regression-based approaches are developed and studied for predicting demographic attributes that utilize different content-derived features, different ways of building the prediction models, and different ways of aggregating web-page level predictions that take into account the web's hyperlinked structure. In addition, a matrix-approximation based approach is developed for coupling the predictions of individual regression models into a model designed to predict the probability mass function of the attribute. Extensive experiments show that these methods are able to achieve an RMSE of 8-10% and provide insights on how to best train and apply such models.
KW - Content based models
KW - Demographic attribute prediction
KW - Inlink count
KW - Probability mass function
KW - Regression
UR - http://www.scopus.com/inward/record.url?scp=79951748379&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79951748379&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2010.97
DO - 10.1109/ICDM.2010.97
M3 - Conference contribution
AN - SCOPUS:79951748379
SN - 9780769542560
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 863
EP - 868
BT - Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
T2 - 10th IEEE International Conference on Data Mining, ICDM 2010
Y2 - 14 December 2010 through 17 December 2010
ER -