TY - GEN
T1 - Identifying unproven cancer treatments on the health Web
T2 - 14th World Congress on Medical and Health Informatics, MEDINFO 2013
AU - Aphinyanaphongs, Yin
AU - Fu, Lawrence D.
AU - Aliferis, Constantin F.
PY - 2013
Y1 - 2013
AB - Building machine learning models that identify unproven cancer treatments on the Health Web is a promising approach for dealing with the dissemination of false and dangerous information to vulnerable health consumers. Aside from the obvious requirement of accuracy, two issues are of practical importance in deploying these models in real-world applications. (a) Generalizability: The models must generalize to all treatments (not just the ones used in the training of the models). (b) Scalability: The models must be efficiently applicable to billions of documents on the Health Web. First, we provide methods and related empirical data demonstrating strong accuracy and generalizability. Second, by combining the MapReduce distributed architecture and high dimensionality compression via Markov Boundary feature selection, we show how to scale the application of the models to WWW-scale corpora. The present work provides evidence that (a) a very small subset of unproven cancer treatments is sufficient to build a model to identify unproven treatments on the web; (b) unproven treatments use distinct language to market their claims and this language is learnable; (c) through distributed parallelization and state-of-the-art feature selection, it is possible to prepare the corpora and build and apply models at large scale.
KW - Artificial Intelligence
KW - Consumer Product Safety
KW - Information Storage and Retrieval
KW - Internet
KW - Neoplasms
UR - http://www.scopus.com/inward/record.url?scp=84894327491&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84894327491&partnerID=8YFLogxK
U2 - 10.3233/978-1-61499-289-9-667
DO - 10.3233/978-1-61499-289-9-667
M3 - Conference contribution
C2 - 23920640
AN - SCOPUS:84894327491
SN - 9781614992882
T3 - Studies in Health Technology and Informatics
SP - 667
EP - 671
BT - MEDINFO 2013 - Proceedings of the 14th World Congress on Medical and Health Informatics
PB - IOS Press
Y2 - 20 August 2013 through 23 August 2013
ER -