Web crawling project at IRL

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING | DWIGHT LOOK COLLEGE OF ENGINEERING | TEXAS A&M UNIVERSITY

Scalable web crawling

Sponsor: NSF

Abstract

Web crawling remains challenging on today's Internet for several reasons: the massive amount of content available to the crawler, the existence of highly branching spam farms, the prevalence of useless information, and the need to adhere to politeness constraints at each target host. This project investigates scalable and efficient web-crawling algorithms that high-performance search engines can use to crawl hundreds of billions of pages while keeping the overhead manageable. Unlike the approach taken by commercial search engines, our focus is on enabling Internet-wide crawls and data mining without access to enormous server clusters or exotically expensive hardware.

Journal Publications

Y. Cui, C. Sparkman, H.-T. Lee, and D. Loguinov, "Unsupervised Domain Ranking in Large-Scale Web Crawls," ACM Transactions on the Web, vol. 12, no. 4, November 2018.

PDF

H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 Billion Pages and Beyond," ACM Transactions on the Web, vol. 3, no. 3, June 2009.

PDF

Conference Papers

S.T. Ahmed, C. Sparkman, H.-T. Lee, and D. Loguinov, "Around the Web in Six Weeks: Documenting a Large-Scale Crawl," IEEE INFOCOM, April 2015.

PDF, PPT

S. Sood and D. Loguinov, "Probabilistic Near-Duplicate Detection Using Simhash," ACM CIKM, October 2011.

PDF, PPT

C. Sparkman, H.-T. Lee, and D. Loguinov, "Agnostic Topology-Based Spam Avoidance in Large-Scale Web Crawls," IEEE INFOCOM, April 2011.

PDF, PPT

H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 Billion Pages and Beyond," WWW, April 2008 (best paper award).

PDF, PPT

Datasets

Pay-level domain (PLD, i.e., domain-level) out-graphs used in our ranking analysis (INFOCOM 2011, TWEB 2018) are available below. Each graph record consists of an 8-byte source hash, followed by a 4-byte out-degree and a list of that many 8-byte neighbor hashes. Each map record contains an 8-byte hash, followed by a NULL-terminated string holding the corresponding domain. All numbers are in LSB-first (little-endian) byte order. Source nodes with zero out-degree and those referring to syntactically invalid PLDs have been eliminated.

	IRLbot domain (2007): graph (14 GB, 25,943,373 sources, 1,799,516,827 edges), map (2.2 GB, 86,533,762 nodes)
	ClueWeb09 domain (2009): graph (3.4 GB, 9,767,917 sources, 415,167,456 edges), map (743 MB, 30,665,459 nodes)
	Ranking results in text format: IRLbot (4.5 GB) and ClueWeb09 (1.7 GB)
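
The record layout described above can be read with a few lines of Python. The following is only a minimal sketch, not the tooling used in the papers. It assumes that records are stored back to back with no file header or padding, that hashes are unsigned, and that domain strings are plain ASCII; none of these details is stated on this page. The function names read_graph and read_map are hypothetical.

```python
import io
import struct
from typing import BinaryIO, Iterator, List, Tuple

# Assumed record header: 8-byte hash + 4-byte out-degree, little-endian, unpadded.
GRAPH_HEADER = struct.Struct("<QI")


def read_graph(stream: BinaryIO) -> Iterator[Tuple[int, List[int]]]:
    """Yield (source_hash, neighbor_hashes) for each record in a graph file."""
    while True:
        raw = stream.read(GRAPH_HEADER.size)
        if not raw:
            return  # clean end of file
        if len(raw) < GRAPH_HEADER.size:
            raise ValueError("truncated record header")
        source, degree = GRAPH_HEADER.unpack(raw)
        body = stream.read(8 * degree)
        if len(body) < 8 * degree:
            raise ValueError("truncated neighbor list")
        yield source, list(struct.unpack(f"<{degree}Q", body))


def read_map(stream: BinaryIO) -> Iterator[Tuple[int, str]]:
    """Yield (hash, domain) for each record in a map file."""
    while True:
        raw = stream.read(8)
        if not raw:
            return  # clean end of file
        if len(raw) < 8:
            raise ValueError("truncated hash")
        (node_hash,) = struct.unpack("<Q", raw)
        name = bytearray()
        while True:
            byte = stream.read(1)  # relies on a buffered stream for speed
            if not byte:
                raise ValueError("unterminated domain string")
            if byte == b"\x00":
                break
            name += byte
        # ASCII is an assumption; undecodable bytes are replaced, not rejected.
        yield node_hash, name.decode("ascii", errors="replace")


if __name__ == "__main__":
    # Tiny synthetic example; real files would be opened with open(path, "rb").
    demo = struct.pack("<QI2Q", 1, 2, 10, 20)
    for source, neighbors in read_graph(io.BytesIO(demo)):
        print(source, neighbors)
```

Under the same assumed layout, a graph file occupies 12 bytes per source plus 8 bytes per edge. For the ClueWeb09 graph this gives 12 × 9,767,917 + 8 × 415,167,456 = 3,438,554,652 bytes, which is in line with the listed 3.4 GB and offers a quick sanity check after downloading.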