DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING | DWIGHT LOOK COLLEGE OF ENGINEERING | TEXAS A&M UNIVERSITY

 

IRLbot crawler

Overview

IRLbot is a Texas A&M research project that investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web. The crawler downloads random web pages (text only) and follows certain links to find other websites.

Functionality

The text of downloaded web pages is not distributed to the public or used for any non-research purposes.

IRLbot complies with the robots.txt standard. To prevent it from accessing your website, add the following directives to your robots.txt file:

User-agent: IRLbot 

Disallow: /

Note: Allow/Disallow directives are evaluated in the order they appear in the robots file, and the first directive that matches the requested URL decides whether it is fetched.
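To illustrate why directive order matters under a first-match rule, here is a minimal Python sketch. It is not IRLbot's actual parser: the function name is_allowed and the plain prefix matching (no wildcards) are assumptions made only for this example.

```python
def is_allowed(rules, path):
    """Return True if `path` may be fetched under first-match semantics.

    `rules` is an ordered list of (directive, prefix) pairs, in the order
    they appear in the robots file for the matching User-agent group.
    Assumption: plain prefix matching, no wildcard support.
    """
    for directive, prefix in rules:
        if not prefix:
            continue  # an empty Allow/Disallow value restricts nothing
        if path.startswith(prefix):
            # The first matching directive decides; later ones are ignored.
            return directive.lower() == "allow"
    return True  # no directive matched: access is permitted by default


# The broad Disallow comes first, so the Allow below it is never reached.
wrong_order = [("Disallow", "/"), ("Allow", "/public/")]
right_order = [("Allow", "/public/"), ("Disallow", "/")]

print(is_allowed(wrong_order, "/public/index.html"))  # False
print(is_allowed(right_order, "/public/index.html"))  # True
print(is_allowed(right_order, "/private/data.html"))  # False
```

In other words, place the more specific directives before the more general ones.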

To signal that certain HTML pages should not be analyzed for links, you can use either of the following meta tags:

<META NAME="robots" CONTENT="nofollow">

<META NAME="IRLbot" CONTENT="nofollow">
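If you want to check whether a page of yours carries one of these tags, the following Python sketch may help. It uses the standard html.parser module; the class and function names (NofollowDetector, has_nofollow) are hypothetical, and the matching logic is an assumption about how such tags are commonly read, not a description of IRLbot's own code.

```python
from html.parser import HTMLParser


class NofollowDetector(HTMLParser):
    """Sets `nofollow` when a robots/IRLbot meta tag requests nofollow."""

    def __init__(self, agent="irlbot"):
        super().__init__()
        self.agent = agent.lower()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        name = attr.get("name", "").strip().lower()
        # CONTENT may hold several comma-separated values.
        tokens = [t.strip().lower() for t in attr.get("content", "").split(",")]
        if name in ("robots", self.agent) and "nofollow" in tokens:
            self.nofollow = True


def has_nofollow(html, agent="irlbot"):
    detector = NofollowDetector(agent)
    detector.feed(html)
    return detector.nofollow


print(has_nofollow('<META NAME="robots" CONTENT="nofollow">'))
print(has_nofollow('<meta name="robots" content="index, follow">'))
```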

By default, IRLbot is rate-limited to one HTML page per website every 40 seconds; however, this rate may change dynamically during the crawl depending on the size and popularity of each site. You can override this behavior in robots.txt by specifying the minimum delay between visits (in seconds):

User-agent: IRLbot 

Crawl-delay: 100

Note: when multiple sites are co-located on a single IP address, the physical server may receive requests at a higher rate depending on how many sites are served by the IP and their popularity.
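To confirm that your Crawl-delay line is well formed, you can read it back with Python's standard urllib.robotparser module, as sketched below. This shows how a generic parser interprets the directive, not how IRLbot itself does; the helper delay_for and the fallback to the 40-second default are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 40  # seconds; the default rate mentioned above

robots_txt = """\
User-agent: IRLbot
Crawl-delay: 100
"""


def delay_for(robots_lines, agent="IRLbot", default=DEFAULT_DELAY):
    """Return the Crawl-delay that applies to `agent`, or `default` if none."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # accepts an iterable of lines
    delay = parser.crawl_delay(agent)
    return default if delay is None else delay


print(delay_for(robots_txt.splitlines()))  # 100
print(delay_for([]))                       # 40
```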

IRLbot may cache certain robots.txt files and will not reload them until the cached copy expires. The expiration period varies dynamically between 1 and 7 days based on how frequently changes are detected in your robots file.

For websites that support compression, IRLbot requests gzip/deflate encoding for both robots.txt and HTML pages to reduce the traffic load on the target site.

FAQ

Why are you tweaking the shopping cart, posting into forums, or trying to register?

Crawlers do not generally know the effect of loading a page on the internal operation of your site. Therefore, it is very important to configure your /robots.txt to prevent crawlers from accessing sensitive parts of the site. For example, actions such as modifying shopping carts, posting into forums or blogs, registering a new user, logging into the site, or starting a thread, all need to be protected. Furthermore, to combat spam-bots that ignore robots.txt, site-modifying actions should be either authenticated or challenged with a human-verification test.
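One way to verify that your robots.txt covers the sensitive parts of your site is to test a list of paths against it, as in the Python sketch below. The paths (/cart/, /login, /register, /forum/post) are hypothetical and should be replaced with your own, and the check uses Python's generic urllib.robotparser rather than IRLbot's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt protecting site-modifying actions.
robots_txt = """\
User-agent: *
Disallow: /cart/
Disallow: /login
Disallow: /register
Disallow: /forum/post
"""


def unprotected(robots_lines, paths, agent="IRLbot"):
    """Return the subset of `paths` that `agent` is still allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [p for p in paths if parser.can_fetch(agent, p)]


sensitive = ["/cart/add", "/login", "/register", "/forum/post"]
print(unprotected(robots_txt.splitlines(), sensitive))  # []
```

An empty list means every listed path is excluded; any path that is printed still needs a Disallow line.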

Your robot is spamming my site!

It is fairly common for spam-bots and email harvesters to impersonate existing user agents (including IRLbot). Our crawler normally runs from web-crawler.irl.cs.tamu.edu (128.194.135.94). Any IP address outside 128.194.*.* that claims the IRLbot user-agent is an impersonator. The recommended action is to block such IPs.
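The address check described above can be sketched in Python with the standard ipaddress module. The function name is_impersonator is hypothetical, and the /16 network is simply 128.194.*.* written in CIDR notation.

```python
import ipaddress

# 128.194.*.* expressed as a CIDR block.
IRLBOT_NETWORK = ipaddress.ip_network("128.194.0.0/16")


def is_impersonator(remote_ip, user_agent):
    """True if the request claims to be IRLbot but comes from outside 128.194.*.*"""
    if "irlbot" not in user_agent.lower():
        return False  # does not claim to be IRLbot at all
    return ipaddress.ip_address(remote_ip) not in IRLBOT_NETWORK


print(is_impersonator("128.194.135.94", "IRLbot"))  # False
print(is_impersonator("203.0.113.7", "IRLbot"))     # True
```

Here 203.0.113.7 is an address from a range reserved for documentation, used only as a stand-in for an outside host.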

 
Why do changes to robots.txt not take effect right away?

The most likely reason is that your robots.txt file has been cached, so the crawler will not see the changes until it next reloads the file. Another possibility is that your file does not properly exclude IRLbot, which sometimes happens when the allow/disallow directives are written in the wrong order. In either case, if you'd like to be manually excluded from the crawl, please contact us at the address below.

 
Why is IRLbot generating 404 (not found) errors?

The two most common reasons are broken links that point to a non-existent page on your site and unsafe characters (such as spaces, tabs, and new lines) in your URLs. If you believe that IRLbot parses your links incorrectly, please let us know and we will fix the problem shortly.
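If your links contain unsafe characters, percent-encoding them before they are written into your pages avoids ambiguity. The sketch below uses Python's urllib.parse.quote on the path portion of a URL; the helper names and sample paths are hypothetical.

```python
from urllib.parse import quote

UNSAFE = " \t\r\n"


def has_unsafe_chars(path):
    """True if the path contains a space, tab, or line break."""
    return any(c in UNSAFE for c in path)


def encode_path(path):
    """Percent-encode unsafe characters in a URL path.

    "/" is kept as the segment separator and "%" is kept so that
    already-encoded sequences are not encoded a second time.
    """
    return quote(path, safe="/%")


print(has_unsafe_chars("/my docs/report.html"))  # True
print(encode_path("/my docs/report.html"))       # /my%20docs/report.html
```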

Contact

To report problems or make suggestions regarding IRLbot, please contact Dr. Dmitri Loguinov.

Last modified August 10, 2012 12:52:03 PM


     Copyright 2002-2014 IRL at Texas A&M. All Rights Reserved.