IRLbot crawler
Overview
IRLbot is a Texas A&M research project that investigates algorithms for mapping the topology of the Internet and discovering the various parts of the web. The crawler downloads random web pages (text only) and follows certain links to find other websites.
Functionality
The text of downloaded web pages is not distributed to the public or used for any non-research purposes.
IRLbot is compliant with the robots.txt standard. You can use the following directives to prevent it from accessing your website:
User-agent: IRLbot
Disallow: /
Note: allow/disallow directives are parsed in the order they appear in the robots file, and the first successful match is the one that applies.
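To see why the order of directives matters, here is a minimal Python sketch of first-match evaluation. It is not IRLbot's actual parser: the function name `first_match_allowed`, the rule representation, and the use of plain path-prefix matching are all assumptions made for illustration.

```python
def first_match_allowed(rules, path):
    """Return True if `path` may be fetched under first-match semantics.

    `rules` is a list of (directive, prefix) pairs in file order, where
    directive is "allow" or "disallow". Plain prefix matching is an
    assumption of this sketch, not a statement about IRLbot's parser.
    """
    for directive, prefix in rules:
        if prefix == "":
            # In this sketch an empty value matches nothing and is skipped.
            continue
        if path.startswith(prefix):
            # The first rule that matches decides; later rules are ignored.
            return directive == "allow"
    return True  # no rule matched: fetching is allowed by default


# The broad Disallow comes first, so the Allow line is never reached.
wrong_order = [("disallow", "/"), ("allow", "/public/")]
# The specific Allow comes first, so /public/ stays accessible.
right_order = [("allow", "/public/"), ("disallow", "/")]

print(first_match_allowed(wrong_order, "/public/index.html"))  # False
print(first_match_allowed(right_order, "/public/index.html"))  # True
print(first_match_allowed(right_order, "/private/data.html"))  # False
```

Under these assumptions, more specific allow lines need to appear before broader disallow lines in the same User-agent group.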
To signal that certain HTML pages should not be analyzed for links, you can use either of the following meta tags:
<META NAME="robots" CONTENT="nofollow">
<META NAME="IRLbot" CONTENT="nofollow">
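If you want to verify that a page actually carries one of these tags, the following Python sketch checks for it using the standard library's `html.parser`. The class and function names are hypothetical, and the case-insensitive, comma-separated matching of the CONTENT value is an assumption about how such tags are commonly interpreted, not a description of IRLbot's internals.

```python
from html.parser import HTMLParser


class NoFollowDetector(HTMLParser):
    """Sets `nofollow` when a robots/IRLbot meta tag lists "nofollow"."""

    def __init__(self, agent_names=("robots", "irlbot")):
        super().__init__()
        self.agent_names = agent_names
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        tokens = [t.strip().lower() for t in attr.get("content", "").split(",")]
        if name in self.agent_names and "nofollow" in tokens:
            self.nofollow = True


def page_is_nofollow(html):
    """Return True if the HTML text asks crawlers not to follow its links."""
    detector = NoFollowDetector()
    detector.feed(html)
    return detector.nofollow


sample = '<html><head><META NAME="IRLbot" CONTENT="nofollow"></head></html>'
print(page_is_nofollow(sample))
```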
By default, IRLbot is rate-limited to one HTML page per website every 40 seconds; however, this rate may change dynamically during the crawl depending on the size and popularity of each site. You can override this behavior in robots.txt by specifying the minimum delay between visits (in seconds):
User-agent: IRLbot
Crawl-delay: 100
Note: when multiple sites are co-located on a single IP address, the physical server may receive requests at a higher rate, depending on how many sites the IP serves and how popular they are.
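To check how a Crawl-delay line is read, the sketch below parses a robots.txt body with Python's standard `urllib.robotparser` and falls back to the 40-second default when no delay is given. This uses Python's parser rather than IRLbot's own, and the helper name `delay_for` and the constant `DEFAULT_DELAY_SECONDS` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 40  # default per-site delay stated on this page


def delay_for(robots_txt, agent="IRLbot"):
    """Return the Crawl-delay that applies to `agent`, or the default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # None if absent or malformed
    return DEFAULT_DELAY_SECONDS if delay is None else delay


with_delay = "User-agent: IRLbot\nCrawl-delay: 100\n"
without_delay = "User-agent: *\nDisallow: /tmp/\n"

print(delay_for(with_delay))     # 100
print(delay_for(without_delay))  # 40
```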
IRLbot may cache certain robots.txt files and will not reload them until the cached copy expires. The expiration period varies dynamically between 1 and 7 days based on how frequently changes are detected in your robots file.
For websites that support compression, IRLbot requests gzip/deflate encoding for both robots.txt and HTML pages to reduce the traffic load on the target site.
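For illustration, a client that asks for compressed responses sends an Accept-Encoding request header and then decodes the body according to the Content-Encoding response header. The Python sketch below shows that generic HTTP mechanism; the names `REQUEST_HEADERS` and `decode_body` are hypothetical, and nothing here is taken from IRLbot's implementation.

```python
import gzip
import zlib

# Request header a client sends to ask for compressed responses.
REQUEST_HEADERS = {"Accept-Encoding": "gzip, deflate"}


def decode_body(body, content_encoding):
    """Decompress `body` (bytes) according to the Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # "deflate" is normally zlib-wrapped data.
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream without the wrapper.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body  # no encoding, or one this sketch does not handle


page = b"<html><body>hello</body></html>"
print(decode_body(gzip.compress(page), "gzip") == page)
```

A server that does not support compression simply ignores the request header and returns the body unencoded, which the last branch passes through unchanged.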
FAQ
Why are you tweaking the shopping cart, posting into forums, or trying to register?
Crawlers generally do not know what effect loading a page has on the internal operation of your site. Therefore, it is very important to configure your /robots.txt to prevent crawlers from accessing sensitive parts of the site. For example, actions such as modifying shopping carts, posting into forums or blogs, registering a new user, logging into the site, or starting a thread all need to be protected. Furthermore, to combat spam-bots that ignore robots.txt, site-modifying actions should be either authenticated or challenged with a human-verification test.
Your robot is spamming my site!
It is fairly common for spam-bots and email harvesters to impersonate existing user agents (including IRLbot). Our crawler normally runs from web-crawler.irl.cs.tamu.edu (128.194.135.94). Any IP address outside 128.194.*.* that claims the IRLbot user-agent is an impersonator. The recommended action is to block such IPs.
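A server-side check for impersonators can be sketched in a few lines of Python. The range 128.194.*.* is written here as the CIDR block 128.194.0.0/16; the function name `is_impersonator` and the substring test on the User-Agent header are assumptions for illustration, as is the sample user-agent string.

```python
import ipaddress

# 128.194.*.* expressed as a CIDR block.
IRLBOT_NETWORK = ipaddress.ip_network("128.194.0.0/16")


def is_impersonator(client_ip, user_agent):
    """True if the request claims to be IRLbot from outside 128.194.*.*."""
    claims_irlbot = "irlbot" in user_agent.lower()
    in_range = ipaddress.ip_address(client_ip) in IRLBOT_NETWORK
    return claims_irlbot and not in_range


# "IRLbot/x" is a placeholder user-agent string, not the crawler's real one.
print(is_impersonator("128.194.135.94", "IRLbot/x"))  # False
print(is_impersonator("203.0.113.7", "IRLbot/x"))     # True
print(is_impersonator("203.0.113.7", "Mozilla/5.0"))  # False
```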
Why do changes to robots.txt not take effect right away?
The most likely reason is that your robots.txt file has been cached, so the crawler will not know about the changes until the next time it loads the file. Another possible reason is that your file does not properly exclude IRLbot, which sometimes happens when the allow/disallow directives are written in an incorrect order. In either case, if you would like to be manually excluded from the crawl, please contact us at the address below.
Why is IRLbot generating 404 (not found) errors?
The two most common reasons are broken links that point to a non-existent page on your site and unsafe characters (such as spaces, tabs, and new lines) in your URLs. If you believe that IRLbot parses your links incorrectly, please let us know and we will fix the problem shortly.
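One way to avoid unsafe characters in links is to percent-encode paths before writing them into your HTML. The Python sketch below uses the standard `urllib.parse.quote`; the helper name `escape_path` and the sample paths are hypothetical.

```python
from urllib.parse import quote


def escape_path(path):
    """Percent-encode unsafe characters in a URL path, keeping "/" intact."""
    return quote(path, safe="/")


print(escape_path("/my docs/annual report.html"))  # /my%20docs/annual%20report.html
print(escape_path("/files/a\tb.html"))             # /files/a%09b.html
print(escape_path("/files/a\nb.html"))             # /files/a%0Ab.html
```

Apply this only to paths that are not already encoded, since an existing "%" would itself be encoded as "%25".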
Contact
To report problems or make suggestions regarding IRLbot, please contact Dr. Dmitri Loguinov.
Last modified April 28, 2008 12:12:20 PM