TY - GEN
T1 - Counting YouTube videos via random prefix sampling
AU - Zhou, Jia
AU - Li, Yanhua
AU - Adhikari, Vijay Kumar
AU - Zhang, Zhi Li
PY - 2011
Y1 - 2011
N2 - Leveraging the characteristics of the YouTube video ID space and exploiting a unique property of the YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors. We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video IDs (namely, treating each collection as if it were the "true" collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were roughly 500 million YouTube videos in total as of May 2011. Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, significantly underestimating the number of videos with very small view counts (
AB - Leveraging the characteristics of the YouTube video ID space and exploiting a unique property of the YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors. We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video IDs (namely, treating each collection as if it were the "true" collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were roughly 500 million YouTube videos in total as of May 2011. Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, significantly underestimating the number of videos with very small view counts (
KW - YouTube
KW - online social networks
KW - sampling
UR - http://www.scopus.com/inward/record.url?scp=82955197339&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=82955197339&partnerID=8YFLogxK
U2 - 10.1145/2068816.2068851
DO - 10.1145/2068816.2068851
M3 - Conference contribution
AN - SCOPUS:82955197339
SN - 9781450310130
T3 - Proceedings of the ACM SIGCOMM Internet Measurement Conference, IMC
SP - 371
EP - 379
BT - IMC'11 - Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference
T2 - 2011 ACM SIGCOMM Internet Measurement Conference, IMC'11
Y2 - 2 November 2011 through 4 November 2011
ER -