IRLbot crawler


IRLbot is a Texas A&M research project that investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web. The crawler downloads random web pages (text only) and follows certain links to find other websites.


  • The text of downloaded web pages is not distributed to the public or used for any non-research purposes.

  • IRLbot is compliant with the robots.txt standard. You can use the following directives to prevent it from accessing your website:

User-agent: IRLbot 

Disallow: /

Note: allow/disallow directives are parsed in the order they appear in the robots file until the first successful match.

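A quick way to check that an exclusion like the one above behaves as intended is Python's standard urllib.robotparser module. This is an illustrative sketch, not part of IRLbot itself:

```python
# Verify a robots.txt exclusion with the standard library.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: IRLbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# IRLbot is blocked from every path; agents without a matching
# User-agent record remain allowed.
print(parser.can_fetch("IRLbot", "/any/page.html"))        # False
print(parser.can_fetch("SomeOtherBot", "/any/page.html"))  # True
```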

  • To signal that certain HTML pages should not be analyzed for links, you can use either of the following meta tags:

<META NAME="robots" CONTENT="nofollow">

<META NAME="IRLbot" CONTENT="nofollow">

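A link-extracting crawler can honor these tags with a small pass over the page's meta elements. The sketch below, using only Python's standard html.parser, recognizes either of the tags shown above (it is an illustration, not IRLbot's actual parser):

```python
# Detect a nofollow meta tag aimed at all robots or at IRLbot specifically.
from html.parser import HTMLParser

class NofollowDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":          # html.parser lowercases tag names
            return
        a = dict(attrs)
        name = (a.get("name") or "").lower()
        content = (a.get("content") or "").lower()
        if name in ("robots", "irlbot") and "nofollow" in content:
            self.nofollow = True

d = NofollowDetector()
d.feed('<html><head><META NAME="robots" CONTENT="nofollow"></head></html>')
print(d.nofollow)  # True
```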

  • IRLbot is by default rate-limited to one HTML page per website per 40 seconds; however, this rate may change dynamically during the crawl depending on the size and popularity of each site. Robots.txt can be used to override this behavior by specifying the minimum delay between visits (in seconds):

User-agent: IRLbot 

Crawl-delay: 100

Note: when multiple sites are co-located on a single IP address, the physical server may receive requests at a higher rate depending on how many sites are served by the IP and their popularity.

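The interaction between the 40-second default and a site's Crawl-delay directive can be sketched as follows. This is hypothetical illustration code, not IRLbot's implementation, and it assumes the longer of the two delays wins (consistent with the example above, which raises the delay to 100 seconds):

```python
# Per-host politeness: wait out the effective delay between visits.
import time

DEFAULT_DELAY = 40.0   # seconds between visits to one site
last_visit = {}        # host -> timestamp of the previous request

def effective_delay(crawl_delay=None):
    """The stricter (larger) of the default and the site's Crawl-delay."""
    return max(DEFAULT_DELAY, crawl_delay or 0.0)

def wait_before_fetch(host, crawl_delay=None):
    delay = effective_delay(crawl_delay)
    now = time.monotonic()
    if host in last_visit and now - last_visit[host] < delay:
        time.sleep(delay - (now - last_visit[host]))
    last_visit[host] = time.monotonic()

print(effective_delay())     # 40.0
print(effective_delay(100))  # 100.0
```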

  • IRLbot may decide to cache certain robots.txt files and not re-load them unless they have expired in the cache. The expiration period varies dynamically between 1 and 7 days based on the frequency of changes detected in your robots file.
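One way to picture an adaptive expiration policy like this is to shrink the TTL toward 1 day as the observed change rate of the file grows. The heuristic below is purely illustrative; IRLbot's actual policy is not published here:

```python
# Hypothetical adaptive TTL: stable robots.txt files are cached up to
# 7 days, frequently-changing ones are re-fetched after 1 day.
MIN_TTL_DAYS, MAX_TTL_DAYS = 1, 7

def robots_ttl_days(changes_observed, checks_performed):
    """Interpolate the TTL from the fraction of checks that saw a change."""
    if checks_performed == 0:
        return MIN_TTL_DAYS  # no history yet: re-check soon
    change_rate = changes_observed / checks_performed  # in [0, 1]
    return MAX_TTL_DAYS - (MAX_TTL_DAYS - MIN_TTL_DAYS) * change_rate

print(robots_ttl_days(0, 10))   # 7.0  (never changed: cache a full week)
print(robots_ttl_days(10, 10))  # 1.0  (changed every check: re-fetch daily)
```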

  • For websites that support compression, IRLbot will request gzip/deflate to be performed on both robots.txt and HTML pages to reduce the traffic load on the target site.
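A compression-aware client of this kind can be sketched with the standard library alone: advertise gzip/deflate in the request and undo whatever Content-Encoding the server applied. This is an illustrative sketch, not IRLbot's code (note that some servers send raw deflate streams, which this simple version does not handle):

```python
# Request and decode gzip/deflate-compressed responses.
import gzip
import zlib
import urllib.request

def decode_body(body, content_encoding):
    """Undo the Content-Encoding the server applied to the response body."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    if content_encoding == "deflate":
        return zlib.decompress(body)
    return body

def fetch(url):
    req = urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip, deflate"})
    with urllib.request.urlopen(req) as resp:
        return decode_body(resp.read(),
                           resp.headers.get("Content-Encoding"))

# Round trip without touching the network:
print(decode_body(gzip.compress(b"<html></html>"), "gzip"))  # b'<html></html>'
```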


  • Why are you tweaking the shopping cart, posting into forums, or trying to register?

Crawlers do not generally know the effect of loading a page on the internal operation of your site. Therefore, it is very important to configure your /robots.txt to prevent crawlers from accessing sensitive parts of the site. For example, actions such as modifying shopping carts, posting into forums or blogs, registering a new user, logging into the site, or starting a thread, all need to be protected. Furthermore, to combat spam-bots that ignore robots.txt, site-modifying actions should be either authenticated or challenged with a human-verification test.

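A robots.txt along the following lines keeps well-behaved crawlers away from site-modifying pages (the paths here are hypothetical examples; substitute the actual URLs your site uses for these actions):

```
User-agent: *
Disallow: /cart
Disallow: /register
Disallow: /login
Disallow: /post
```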

  • Your robot is spamming my site!

It is fairly common for spam-bots and email harvesters to impersonate existing user agents (including IRLbot). Our crawler normally runs from the 128.194.*.* address block. Any IP address outside 128.194.*.* that claims the IRLbot user-agent is an impersonator. The recommended action is to block such IPs.
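Such a check is easy to script server-side. The sketch below uses Python's standard ipaddress module to test whether a client claiming to be IRLbot actually comes from the published 128.194.*.* block:

```python
# Distinguish genuine IRLbot traffic from impersonators by source IP.
import ipaddress

IRLBOT_NET = ipaddress.ip_network("128.194.0.0/16")  # 128.194.*.*

def is_genuine_irlbot(client_ip):
    return ipaddress.ip_address(client_ip) in IRLBOT_NET

print(is_genuine_irlbot("128.194.10.5"))  # True
print(is_genuine_irlbot("203.0.113.9"))   # False (impersonator: block it)
```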

  • Why do changes to robots.txt not take effect right away?

The most likely reason is that your robots.txt file has been cached and the crawler will not know about the changes until it loads the file next time. Another reason is that your file does not properly exclude IRLbot, which sometimes arises when the allow/disallow directives are written in an incorrect order. In either case, if you'd like to be manually excluded from the crawl, please contact us at the address below.
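The effect of directive order under first-match parsing can be demonstrated with Python's standard urllib.robotparser, which also applies the first matching rule (an illustrative sketch, not IRLbot's parser):

```python
# Directive order matters: the first matching Allow/Disallow rule wins.
from urllib.robotparser import RobotFileParser

def allowed(rules, path):
    p = RobotFileParser()
    p.parse(rules.splitlines())
    return p.can_fetch("IRLbot", path)

good = "User-agent: IRLbot\nAllow: /public\nDisallow: /\n"
bad  = "User-agent: IRLbot\nDisallow: /\nAllow: /public\n"

print(allowed(good, "/public/page.html"))  # True  (Allow matched first)
print(allowed(bad,  "/public/page.html"))  # False (Disallow: / matched first)
```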

  • Why is IRLbot generating 404 (not found) errors?

The two most common reasons are broken links that point to non-existent pages on your site and unsafe characters (such as spaces, tabs, and newlines) in your URLs. If you believe that IRLbot incorrectly parses your links, please let us know and we will fix the problem promptly.
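Unsafe characters can be avoided by percent-encoding link targets before publishing them. A minimal sketch with the standard library:

```python
# Percent-encode unsafe characters (spaces, tabs, newlines) in a URL path.
from urllib.parse import quote

raw_path = "/my docs/page one.html"
print(quote(raw_path))  # /my%20docs/page%20one.html
```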


To report problems or suggestions for IRLbot, please contact Dr. Dmitri Loguinov.

Last modified May 08, 2017 07:15:22 AM

     Copyright 2002-2019 IRL at Texas A&M. All Rights Reserved.