Merit Network

Discussion Communities: Merit Network Email List Archives

North American Network Operators Group


RE: Crawler Etiquette

  • From: O'Neil,Kevin
  • Date: Thu Jan 24 18:30:05 2002

We have a research project at OCLC that does web crawling.  We created an
email account ( ) that admins could contact and posted the email address
on the web site from which the crawling was done.  That way, when admins
see our IP address in their firewall logs (from attempted HTTP connects),
they can browse to that address, see an explanation of the project, and
send email to the contact address to get more info or request removal of
the IP space being crawled.
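The "identifiable crawler" practice described above can also be applied to the requests themselves: give every fetch a User-Agent string that names the project and points admins at the explanation page. A minimal sketch using Python's standard library (the project name and URL below are hypothetical, not OCLC's actual values):

```python
# Sketch: every crawl request identifies the project and links to an
# explanation page, so admins reviewing logs can find out who is crawling.
# The crawler name and URL are made up for illustration.
import urllib.request

CONTACT_UA = "ExampleResearchCrawler/1.0 (+http://crawl.example.edu/about.html)"

def make_request(url: str) -> urllib.request.Request:
    """Build a request whose User-Agent tells admins who is crawling and why."""
    return urllib.request.Request(url, headers={"User-Agent": CONTACT_UA})

req = make_request("http://www.example.com/")
print(req.get_header("User-agent"))
```

Admins who spot the crawler in access or firewall logs then see the contact URL directly, without having to resolve the source IP first.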

See for our setup.

...Kevin O'Neil

-----Original Message-----
From: Deepak Jain []
Sent: Wednesday, January 23, 2002 2:35 PM
To: nanog@merit.edu
Subject: Crawler Etiquette

I figured this was the best forum to post this, if anyone has suggestions
where this might be better placed, please let me know.

A University in our customer base has received funding to start a reasonably
large spider project. It will crawl websites [search engine fashion] and
save certain parts of the information it receives. This information will be
made available to research institutions and other concerns.

    We have been asked for recommendations on what functions/procedures they
should put in place to be good netizens and not cause undue stress to
networks out there.
On the list of functions:

	a) Obey robots.txt files
	b) Allow network admins to have their netblocks automatically
exempted on request
	c) Allow ISPs' caches to sync with it.

	There are others, but they all revolve around a & b. C was something
that seemed like a good idea, but I don't know if there is any real demand
for it.
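Items (a) and (b) combine naturally into a single pre-fetch check: consult robots.txt, then consult the opt-out list. A minimal sketch using Python's standard library (the robots rules and the exempted netblock below are made up; a real crawler would fetch robots.txt from each host and maintain the opt-out list from admin requests):

```python
# Sketch of items (a) and (b): honor robots.txt and skip any target IP
# inside a netblock whose admin has asked to be exempted.
import ipaddress
from urllib.robotparser import RobotFileParser

# (a) robots.txt rules, supplied inline here to keep the sketch
# self-contained (normally fetched from http://host/robots.txt).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# (b) netblocks whose admins opted out of being crawled (hypothetical).
EXEMPT_NETBLOCKS = [ipaddress.ip_network("192.0.2.0/24")]

def may_crawl(url_path: str, target_ip: str) -> bool:
    """True only if robots.txt allows the path and the IP is not opted out."""
    ip = ipaddress.ip_address(target_ip)
    if any(ip in net for net in EXEMPT_NETBLOCKS):
        return False
    return rp.can_fetch("*", url_path)

print(may_crawl("/index.html", "198.51.100.7"))  # allowed
print(may_crawl("/private/x", "198.51.100.7"))   # blocked by robots.txt
print(may_crawl("/index.html", "192.0.2.9"))     # blocked by opt-out
```

Keeping the opt-out check IP-based (rather than hostname-based) matches what admins actually ask for: exclusion of a whole netblock, not individual sites.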

	Essentially, this project will have at least 1Gb/s of inbound capacity.
Average usage is expected to be around 500Mb/s for the first several months.
ISPs who cache would have an advantage if they used the cache developed by
this project to load their tables, but I do not know if there is an
internet-wide WCCP or equivalent out there or if the improvement is worth
the management overhead.

	Because the funding is there, this project is essentially a certainty.
If there are suggestions that should be added or concerns that this raises,
please let me know [privately is fine].

All input is appreciated,

