jdallen2000@yahoo.com writes:
I have a website organized as a large number (> 200,000) of pages.
It is hosted by a large Internet hosting company.
Many websites provide much more information than mine by computing
info on-the-fly with server scripts, but I have, in effect, all the
query results pre-computed. I waste a few gigabytes for the data,
but that's almost nothing these days, and don't waste the server's
time on scripts.
My users may click to 10 or 20 pages in a session. But the indexing
bots want to read all 200,000+ pages! My host has now complained
that the site is under "bot attack" and has asked me to check my own
laptop for viruses!
I'm happy anyway to reduce the bot activity. I don't mind having my
site indexed, but once or twice a year would be enough!
I see that there is a way to stop the Google Bot specifically. I'd
love it if I could do the opposite -- have *only* Google index my
site.
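In robots.txt terms, I suppose that would be something like the
following (untested on my part; note it would shut out every other
engine that honors robots.txt):

  # Googlebot may fetch everything (an empty Disallow allows all)
  User-agent: Googlebot
  Disallow:

  # Everyone else is asked to stay out
  User-agent: *
  Disallow: /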
A technician at the hosting company wrote to me:
As per the above logs and hitting IP addresses, we have blocked the
46.229.168.* IP range to prevent further abuse, and advise you to
also check incoming traffic and block such IPs in future.
We have also blocked the bots by adding the following entries
in robots.txt:
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: Linguee Bot
Disallow: /
Ivan Shmakov <ivan@siamics.net> writes:
[Cross-posting to news:comp.infosystems.www.misc as I feel that
this question has more to do with Web than HTML per se.]
jdallen2000@yahoo.com writes:
[...]
I see that there is a way to stop the Google Bot specifically. I'd
love it if I could do the opposite -- have *only* Google index my
site.
JFTR, I personally (as well as many other users who value their
privacy) refrain from using Google Search and rely on, say,
https://duckduckgo.com/ instead.
A technician at the hosting company wrote to me:
[robots.txt rules disallowing AhrefsBot, MJ12bot, SemrushBot,
YandexBot, and Linguee Bot -- quoted above]
As long as the troublesome bots honor robots.txt (there are those
that do not; but then, the above won't work on them, either),
a saner solution would be to limit the /rate/ at which the bots
request your pages for indexing, like:
### robots.txt
## Request that the bots wait at least 3 seconds between requests.
User-agent: *
Crawl-delay: 3
### robots.txt ends here
This way, the bots will still scan all your 2e5 pages, but their
accesses will be spread over about a week -- which (I hope)
will be well within "acceptable use limits" of your hosting
company.
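(The arithmetic: 2e5 pages at 3 s apiece comes to 6e5 s -- a little
under 7 days -- per bot, assuming it fetches sequentially.)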
Eli the Bearded <*@eli.users.panix.com> writes:
In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
>>>>> jdallen2000@yahoo.com writes:
I'm happy anyway to reduce the bot activity. I don't mind having
my site indexed, but once or twice a year would be enough!
Some of the better search engines will gladly consult site map files
that give hints about what needs reindexing. See:
https://www.sitemaps.org/protocol.html
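The format is plain XML; a minimal file might look something like
this (a sketch -- the URL and dates are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://www.example.com/some-page.html</loc>
      <lastmod>2019-01-01</lastmod>
      <changefreq>yearly</changefreq>
    </url>
    <!-- one <url> entry per page; a sitemap file is capped at
         50,000 URLs, so a 200,000-page site would use several,
         tied together by a sitemap index file -->
  </urlset>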
A technician at the hosting company wrote to me
As per the above logs and hitting IP addresses, we have blocked
the 46.229.168.* IP range [...]
46.229.168.0-46.229.168.255 is:
netname: ADVANCEDHOSTERS-NET
Can't say I've heard of them.
All bots can be impersonated by other bots, so you can't be sure
that the User-Agent: header reflects the real identity of the client.
You can spend a lot of time researching bots and the characteristics
of real bot usage, e.g. the hostnames or IP address ranges of legit
bot servers.
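Google, for one, documents a way to check: reverse-resolve the
client IP, then forward-resolve the name you get and confirm it
maps back to the same address. From a shell (the IP and hostname
below are Google's own published example):

  $ host 66.249.66.1
  1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
  $ host crawl-66-249-66-1.googlebot.com
  crawl-66-249-66-1.googlebot.com has address 66.249.66.1

A match under googlebot.com in both directions means the client
really is Google's crawler; a forged User-Agent fails the check.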
Except for Linguee, I think all of the bots listed above are
well-behaved and will obey robots.txt,
but I don't know if they are all advanced enough to know Crawl-delay.
Some of them explicitly state they do, however.
This way, the bots will still scan all your 2e5 pages, but their
accesses will be spread over about a week -- which (I hope) will be
well within "acceptable use limits" of your hosting company.
Only bot I've ever had to blacklist was a MSN bot that absolutely
refused to stop hitting one page over and over again a few years ago.
I used a server directive to shunt that one bot to 403 Forbidden
errors.
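For reference, in Apache 2.4 such a shunt can be spelled roughly
like this (a sketch, not necessarily the exact directive involved;
"msnbot" was that bot's User-Agent token):

  # Mark any client claiming to be msnbot, then refuse it
  SetEnvIfNoCase User-Agent "msnbot" bad_bot
  <RequireAll>
      Require all granted
      Require not env bad_bot
  </RequireAll>

Matched requests get 403 Forbidden; everything else is untouched.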
Elijah ------ stopped worrying about bots a long time ago
Doc O'Leary <droleary@2017usenet1.subsume.com> writes:
For your reference, records indicate that Ivan Shmakov wrote:
jdallen2000@yahoo.com writes:
A technician at the hosting company wrote to me
As per the above logs and hitting IP addresses, we have blocked
the 46.229.168.* IP range [...]
There is nothing about the 46.229.160.0/20 range in question that
indicates it represents a legitimate bot. Do the logs actually
indicate vanilla spidering, or something more nefarious like looking
for PHP/WordPress exploits? I see a lot of traffic like that.
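A quick pass over the raw access log will usually settle that;
e.g., with an Apache-style log (the probe paths below are just the
usual suspects):

  # How many hits came from the blocked 46.229.160.0/20 range?
  grep -Ec '^46\.229\.1(6[0-9]|7[0-5])\.' access.log

  # Were any of them poking at common PHP/WordPress holes?
  grep -E '^46\.229\.1(6[0-9]|7[0-5])\.' access.log |
      grep -Ei 'wp-login|xmlrpc\.php|phpmyadmin'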
In such cases, editing robots.txt is unlikely to solve the problem.
Generally, I don’t even bother configuring the web server to deny
a serious attacker. I’d just drop their whole range into my
firewall, because odds are good that a dedicated attacker isn’t
going to only go after port 80.
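With iptables, for instance, that is a single rule (a sketch;
nftables or a hosting panel's firewall achieves the same):

  # Drop all traffic from the offending /20, not just port 80
  iptables -I INPUT -s 46.229.160.0/20 -j DROP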
That might be beyond the scope of what a basic web hosting company
provides but, really, given that a $15/year VPS can handle most
traffic for even a 200K page static site with ease, I really can’t
imagine what the real issue is here. More details needed.
-- "Also . . . I can kill you with my brain." River Tam, Trash,
Firefly
Ivan Shmakov <ivan@siamics.net> writes:
Doc O'Leary <droleary@2017usenet1.subsume.com> writes:
That might be beyond the scope of what a basic web hosting company
provides but, really, given that a $15/year VPS can handle most
traffic for even a 200K page static site with ease [...]
Now, that's interesting. The VPS services I use (or used) are
generally $5/month or more.
... Also in my to-watch list. (I've certainly liked Serenity.)