Abstract
This paper discusses the challenges and opportunities of using archived Internet data to observe a range of social science phenomena. Specifically, it introduces HistoryTracker, a new tool for accessing and extracting archived data from the Internet Archive, the largest repository of archived Web data in existence. HistoryTracker serves to create a Web observatory that allows scholars to study the history of the Web. It takes advantage of Hadoop processing capacity and allows researchers to extract large swaths of archived data into a link list format that can be easily transferred to a number of other analytical tools. A brief illustration demonstrates the use of the tool. Finally, a number of continuing research challenges are discussed, and future research opportunities are outlined.
Original language | English (US) |
---|---|
Title of host publication | WWW 2014 Companion - Proceedings of the 23rd International Conference on World Wide Web |
Publisher | Association for Computing Machinery, Inc |
Pages | 1031-1036 |
Number of pages | 6 |
ISBN (Electronic) | 9781450327459 |
State | Published - Apr 7 2014 |
Externally published | Yes |
Event | 23rd International Conference on World Wide Web, WWW 2014 - Seoul, Korea, Republic of. Duration: Apr 7 2014 → Apr 11 2014 |
Publication series
Name | WWW 2014 Companion - Proceedings of the 23rd International Conference on World Wide Web |
---|
Other
Other | 23rd International Conference on World Wide Web, WWW 2014 |
---|---|
Country/Territory | Korea, Republic of |
City | Seoul |
Period | 4/7/14 → 4/11/14 |
Bibliographical note
Funding Information: The author acknowledges support from the National Science Foundation (NSF Award 1244727), as well as the support of a number of collaborators, including Kris Carpenter, David Lazer, Katherine Ognyanova, Vinay Goel, Luan Nguyen, Hai Nguyen and Allie Kosterich.
Publisher Copyright:
© Copyright 2014 by the International World Wide Web Conferences Steering Committee.
Keywords
- Archived data
- Data extraction
- Network analysis
- Occupy Wall Street
- Social sciences
- Web observatory