DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING | DWIGHT LOOK COLLEGE OF ENGINEERING | TEXAS A&M UNIVERSITY

 


Scalable web crawling

Sponsor: NSF

Abstract

Web crawling remains challenging on today's Internet for several reasons: the massive amount of content available to the crawler, the existence of highly branching spam farms, the prevalence of useless information, and the need to adhere to politeness constraints at each target host. This project investigates scalable and efficient web-crawling algorithms that high-performance search engines can use to crawl hundreds of billions of pages while keeping overhead manageable. Unlike commercial search engines, we focus on enabling Internet-wide crawls and data mining without access to enormous server clusters or exotically expensive hardware.

Journal Publications

 

H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 Billion Pages and Beyond," ACM Transactions on the Web, vol. 3, no. 3, June 2009.

PDF

Conference Papers

 
S. Sood and D. Loguinov, "Probabilistic Near-Duplicate Detection Using Simhash," ACM CIKM, October 2011.

PDF, PPT
 
C. Sparkman, H.-T. Lee, and D. Loguinov, "Agnostic Topology-Based Spam Avoidance in Large-Scale Web Crawls," IEEE INFOCOM, April 2011.

PDF, PPT

 

H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 Billion Pages and Beyond," WWW, April 2008 (best paper award).

PDF, PPT


     Copyright © 2002-2011 IRL at Texas A&M. All Rights Reserved.