Understanding Management of Benign Bot Tra�cJan 2019
Radware Research
� � � � � � �� � � �� � �
SummaryKey TakeawaysHumans vs. Good Bots vs. Bad Bots
Types of Good BotsSearch Engine Crawlers Outnumber Other Good Bots
Analysis of Good Bot Tra�cGood Bots by Industry
Travel
E-commerce
Classi�eds and Marketplaces
Media and Publishing
How Crawlers OperateData CentersUser Agents and IP RangesCrawling BudgetAverage Number of Requests Per Minute
Managing Good BotsCommon Strategies
Robots.txt
Validating the Authenticity of Good Bots
Whitelisting
Rate Limiting
Time Allocation
Adopting a Dedicated Bot Management Solution
The Unintentional Consequences of Good Bot Tra�cPitfalls in Managing Good Bots
Spoofed Search Engine Crawlers
Third-Party Bots
False Positives and False Negatives
How Blocking Good Bots Can Have a Negative Impact
Recommendations for E�ective Management of Good Bots
02030405060707070809101111121213141414151516161617171718181819
Table of Contents
While bad bots that cause harm are growing to become among the most serious threats to
organizations, good bots do not receive much attention from security experts. This is
because, rather than intentionally causing harm, good bots perform a range of important
functions that help their users in several ways. This report outlines the prevalence of good
bots and their various types, the main functions they carry out, their tra�c distribution from
July to December (H2) 2018 across four key verticals in our client base (Online Travel
Agencies, E-Commerce, Classi�eds, and Media & Publishing), and concludes with
recommended strategies to e�ciently manage good bots.
Summary
02 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Under certain circumstances, good bots can also have a negative impact on
organizations.
An industry-speci�c approach is needed by webmasters, security chiefs, and marketers.
Analysis of trends in good bot tra�c can provide actionable insights to drive marketing
strategy.
Webmasters and security specialists are well advised to adopt dedicated Bot
Management solutions which allow them to manage various types of good bots based
on speci�c business needs.
Key Takeaways
More than 50 percent of the Internet tra�c today comes from bots, and while we've seen an
increase in bot-based attacks in recent years, 'good' bots still play an important role in
applications' operations. As such, organizations need to look for bot management solutions
that not only e�ectively detect and mitigate bot attacks, but also can accurately distinguish
between 'good' and 'bad' bots in real-time."
David Aviv, CTO of Radware
03 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Let’s take a broad look at the overall composition of tra�c during H2, 2018 across four key
verticals in our user base. Genuine visitors comprised nearly 74% of tra�c, while good bots
totalled nearly 17%, and bad bots nearly 10%.
Breakdown of Overall Tra�c: Humans vs. Good Bots vs. Bad Bots
Bad Bots9.8%
Good Bots16.7%
Humans73.5%
Humans vs. Good Bots vs. Bad Bots
04 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Search Engine Crawlers
These bots or spiders crawl web pages to index them for search engines such as Google and Bing. Website administrators can specify non-binding rules in their ‘robots.txt’ �le for crawlers to follow while indexing web pages, such as their crawl rates and pages or sections that they should not index. Spoofed crawlers are generally hard for webmasters to detect, as they can often sneak through basic security measures.
GooglebotBingbotBaidu SpiderYandex Bot
Bot Type Description & Intent Examples
Partner Bots Partner bots provide essential services and information to websites and their customers. This category includes bots run by vendors which provide transactional/ CRM/ ERP services, credit-scoring, geo-location, inventory and background checks, and many other business-related services.
AlexaSlackbotIBM Watson
Social Network Bots
Bots that are run by social networking sites to provide visibility to their clients’ websites and drive engagement on their platforms.
Facebook BotPinterest BotSnapchat-proxy
Monitoring Bots
Monitoring bots are used to monitor the uptime and system health of websites. These bots periodically check and report on page load times, downtime duration, and so on.
PingdomAlertBotStatusCake
Backlink Checker Bots
These bots check the inbound URLs on a website to provide marketers and SEO specialists with insights, which help them optimize their content and campaigns.
SEMRushBotUAS Link Checker AhrefsBot
Aggregator/ Feed FetcherBots
Aggregator or Feed-Fetcher bots collate information from websites and provide users or subscribers with the latest news, alerts, and other desired content.
Google FeedfetcherSuperfeedrFeedly
Types of Good Bots
05 Inside Good Bots | Understanding Management of Benign Bot Tra�c
As the pie chart below shows, crawlers make up roughly 55% of all good bot tra�c, and all
other good bots combined make up the rest. These statistics represent overall good bot tra�c
in H2, 2018, and can signi�cantly vary every month, as well as by industry.
Search Engine Crawlers Outnumber Other
Good Bots
Percentage of Search Engine Crawlers vs. All Other Good Bots
Search Engine Crawlers
54.5%All Other Good Bots
45.5%
06 Inside Good Bots | Understanding Management of Benign Bot Tra�c
The breakdown of good bot tra�c across the four industries considered here reveals patterns
that are fairly unique to each industry. Not surprisingly, businesses in the Classi�eds &
Marketplaces industry have the highest level of Aggregator bot tra�c, followed by Online
Travel Agencies (OTAs). When it comes to Partner bots, E-Commerce sites have the highest
percentage, followed by OTAs. E-Commerce sites also have the highest levels of Social
Network bots, followed by Media & Publishing sites, while OTAs come in last with a tiny 0.1%
of their good bot tra�c consisting of Social Network bots.
Monitor Bots3.3%
Partner Bots12.1%
Social Network Bots
0.1%
Aggregator Bots
62.5%Backlink Checker Bots
21.9%
Distribution of Good Bot Tra�c: Online Travel Industry
53.5%
All Other Good Bots
46.5% Search Engine Crawlers
Good Bots By Industry
Travel
Analysis of Good Bot Tra�c
07 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Partner Bots19.5%
Social Network Bots
9.2%
Backlink Checker Bots
14.0%Monitor Bots15.5%
Aggregator Bots
41.8%
E-commerce
51.0%All Other Good Bots
49.0%Search Engine
Crawlers
Distribution of Good Bot Tra�c: E-commerce Industry
08 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Classi�eds and Marketplaces
Aggregator Bots
Partner Bots8.2%
Social Network Bots
2.4%
Backlink Checker Bots
67.8%
Monitor Bots3.3%
18.3%
48.0%All Other Good Bots
52.0%Search Engine
Crawlers
Distribution of Good Bot Tra�c: Classi�eds and Marketplaces
09 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Media and Publishing
Aggregator Bots
Partner Bots4.3%
Social Network Bots
3.3%
Backlink Checker Bots
57.3%
32.7%Monitor Bots2.4%
54.7%All Other Good Bots
45.3%Search Engine
Crawlers
Distribution of Good Bot Tra�c: Media and Publishing
10 Inside Good Bots | Understanding Management of Benign Bot Tra�c
The leading search engine bots originate from IP address ranges that generally do not change
often (but sometimes do vary, as with the Googlebot and others). Other types of bots such as
Partner bots generally work out of IP addresses that are provided by vendors or partners,
allowing webmasters to whitelist them. Webmasters are often advised to block crawlers from
data centers in countries that they do not do business in. For example, the Yandex crawler can
be disallowed from indexing your site if you do not do business in the Russian Federation.
Search engine crawlers constitute a clear majority of all good bots visiting our client base, and
have well-de�ned modes of operation when compared to the diversity of operational
methods seen in other good bots. Hence we have focused our attention on how search
engine crawlers operate.
Data Centers
How Crawlers Operate
11 Inside Good Bots | Understanding Management of Benign Bot Tra�c
For search engine crawlers, ‘Crawling Budget’ refers to the methodology used to determine
which pages of a websites, and how many of them, their crawler must index. With trillions of
web pages in existence and many more being added every day, search engines have their
work cut out for them, and a crawling budget sets parameters on how their crawlers can most
e�ciently index web pages.
According to Google, “Googlebot is designed to be a good citizen of the web. Crawling is its main
priority, while making sure it doesn't degrade the experience of users visiting the site. We call this the
"crawl rate limit," which limits the maximum fetching rate for a given site. Simply put, this represents
the number of simultaneous parallel connections Googlebot may use to crawl the site, as well as
the time it has to wait between the fetches. The crawl rate can go up and down based on a couple
of factors… If the site responds really quickly for a while, the limit goes up, meaning more
connections can be used to crawl. If the site slows down or responds with server errors, the limit goes
down and Googlebot crawls less.”
Crawling Budget
The top search engine bots such as those deployed by Google, Bing, Yahoo, and Baidu always
identify themselves in their User Agent (e.g. Googlebot). However, they do not necessarily
operate out of �xed IP address ranges. This is why the Googlebot’s originating IP addresses are
not o�cially listed, since they can change due to various reasons. Webmasters can run a
reverse DNS lookup to ascertain if a bot originates from the domain it claims to be from.
User Agents and IP Ranges
12 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Every search engine crawler has a predetermined range of requests that it may perform per
second as it indexes a website. For example, the Googlebot runs according to algorithms that
determine its optimal crawl speed, with the intention of indexing the maximum number of
pages on each visit without overwhelming the server's bandwidth.
If you discover a crawler making a high number of requests every second to your site — quite
possibly leading to server congestion — it can be asked to limit its crawl rate through
instructions in the ‘robots.txt’ �le at the root level of your site, which we have outlined further
down this report.
It’s important to note that instructions in the robots.txt �le are meant to be voluntarily obeyed,
and while the crawl rate of the Googlebot can be limited, not every crawler will obey these
instructions. Persistently problematic or disobedient crawlers can be blocked by blacklisting
them in your Bot Manager or Web Application Firewall (WAF).
The graph below shows the approximate ratios of the volume of crawler hits across our
customer base:
Average Number of Requests Per Minute
Google bots made up nearly 68% of search engine bots across our client base; Bing bots came in at nearly 26%, and bots from Yandex, Yahoo, and Baidu made up the rest.
Yandex Bot2.4%
Google Bots67.6%
Bing Bot25.7%
Others3.4%
Distribution Of Search Engine Crawler Tra�c
13 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Managing Good Bots
Webmasters have several options to help manage good bot tra�c, based on business needs,
SEO strategies, country of operation, the vendors and partners they use, as well as any
infrastructural constraints that may possibly be caused by good bots. Below we list some of
the key options to manage good bots.
Robots.txt
‘Robots.txt’ (also known as ‘The Robots Exclusion Protocol’) is a file located at the root of a
website (for example, https://www.shieldsquare.com/robots.txt) that lets webmasters set
rules for visiting crawlers to voluntarily obey. It is a standard format used by websites to
communicate their crawling preferences to web crawlers. Webmasters can specify which
sections of a website should not be scanned, such as administrative pages, pages with
duplicate content, internal staging pages, and so on.
Websites that do not have a robots.txt file are essentially allowing search engine crawlers to
index every page on the site without restraint. However, it is important to note that bots
can ignore your robots.txt file, as any instructions therein are non-binding — meant to be
voluntarily followed — and cannot be technically enforced solely through the robots.txt
file.
Bad bots that harvest email address, as well as spam bots and malware bots — and even
certain well-known crawlers such as those from the Internet Archive (AKA Wayback
Machine) — do not heed the robots.txt file, and webmasters can choose to block them
with their Bot Manager or WAF.
Common Strategies
14 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Validating the Authenticity of Good Bots
Spoofed search engine crawlers can be hard for webmasters to detect, with sophisticated
versions adapting techniques to evade conventional web security systems. However, a
dedicated bot manager can signi�cantly turn the tables against malicious botmasters when
it comes to defending against the most sophisticated and harmful bots extant on the Web.
While Google does not provide a list of IP addresses for its Googlebot for webmasters to
whitelist (since these IP address ranges can change without notice), webmasters must run a
DNS lookup to ascertain if a bot purporting to be a Googlebot is authentic or not. For
example, the Googlebot can be veri�ed by examining server logs and then running a reverse
DNS lookup on the accessing IP address using the ‘host’ command. After that, running a
forward DNS lookup on the domain name retrieved (again using the ‘host’ command) will
allow webmasters to verify that it is the same as the original accessing IP address from their
server logs. Googlebot’s User Agent will always be ‘Googlebot.com’, ‘Google.com’ or
‘Googlebot-Image’. Partner bots and social network bots generally operate out of data
centers and use known IP address ranges which can be obtained and whitelisted by
webmasters. In addition, specialized bot management solutions generally include measures
to identify the authenticity of known good bots through their unique �ngerprints and
signatures.
Whitelisting
Webmasters are generally advised to whitelist bots used by their business, such as partner
bots, social network bots, and other bots that are essential for the operation of their business.
Internal IP addresses or User Agents can be whitelisted as well depending on access
requirements and speci�c business needs.
15 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Rate Limiting
While the vast majority of good bots fall under the ‘Search Engine Crawler’ category, rate
limiting is usually carried out by webmasters in case search engines are found to be crawling
at rates close to (or even exceeding) the available server bandwidth needed to provide quick
page loads to visitors. The Googlebot’s crawl rate can be adjusted to ensure that there is no
negative impact on overall page load times. Another management strategy is based on
regional factors. For example, if your business does not cater to the Russian market, the Yandex
bot can be rate limited (or blocked outright). Going further, web pages that are rarely updated
pages can be made o�-limits for crawlers in order to reduce the time and bandwidth used by
crawlers.
Time Allocation
Legitimate bots usually allocate a �xed ‘ crawling budget’ for each website to be indexed.
During that budgeted time, the bot will crawl as many pages as it can. Sites that load faster
can thus get indexed faster as well, which subsequently contributes to a better search engine
ranking (as does having a mobile-optimized site with responsive design).
Adopting a Dedicated Bot Management Solution
A dedicated Bot Management solution will enable webmasters to take speci�c actions against
various types of bots in general, as well as see the crawling history of every type of bot that
visits. Analytics data can also be cleansed of good bot tra�c to provide accurate data on
human visitors and give marketers better insights to help develop their strategy.
Radware Bot Manager’s Dashboard, for example, provides detailed information for
webmasters and security experts, with highly-granular data on every type of bot that visits.
The information presented includes the bot family and category, number of hits
made, originating IP addresses, and more, with options to blacklist or whitelist them
depending on business needs.
16 Inside Good Bots | Understanding Management of Benign Bot Tra�c
The Unintentional Consequences of
Good Bot Tra�c
Unchecked good bot tra�c can slow down websites and a�ect user experience. For example,
high volumes of good bot tra�c can strain servers and slow down loading of pages for users.
Good bot tra�c can also skew website analytics and mislead marketers who rely on that data
to develop their strategies. High crawling rates by search engine bots are also known to
impact overall page load times for visitors. For a good overall user experience, it thus becomes
crucial to be able to manage every type of automated tra�c, both good and bad, which only
a dedicated Bot Management solution can provide.
Pitfalls in Managing Good Bots
Depending on business requirements, it is sometimes advisable to block good bots that are
irrelevant or unnecessary. For example, if your business does not operate in China (or does not
presently intend to do so), it may be a good idea to block Chinese search engine crawlers such
as Baidu and similar crawlers.
Spoofed Search Engine Crawlers
Malicious parties are known to spoof leading crawlers such as the Googlebot to evade
security measures. Countering such attack strategies requires measures such as performing
reverse lookups or comparison of the behavior of suspected spoofed crawlers
with the behavior of real crawlers. A dedicated Bot Management solution
such as Radware Bot Manager leverages collective bot intelligence from across its
global customer base to automatically update every customer’s rule sets and secure
websites against similar attack strategies.
17 Inside Good Bots | Understanding Management of Benign Bot Tra�c
Good bots used by business partners or third-party service providers may get misclassi�ed as
bad bots if they have not been updated or whitelisted in their Bot Management console.
Third-Party Bots
A multi-pronged approach is necessary for a Bot Management solution to eliminate False
Positives and False Negatives to the maximum extent possible. False Positives are cases in
which a human visitor is wrongly labeled a bot, and this can lead to a frustrating user
experience when a CAPTCHA is presented to ascertain a visitor’s humanity. On the other hand,
False Negatives — incorrectly labeling bots as humans — can lead to various harmful
consequences depending on the bot’s intent.
False Positives and False Negatives
Conventional measures or in-house solutions may block good bots on occasion and cause
negative impacts. Search engine crawlers are essential in leading to better search rankings,
and blocking them will certainly have an adverse e�ect on SEO results. Similarly, blocking
partner bots or social network bots can produce highly undesirable results, both to a business
and to its customers.
How Blocking Good Bots Can Have a Negative Impact
18 Inside Good Bots | Understanding Management of Benign Bot Tra�c
As we have seen, good bots span a spectrum of uses and functions, and are only going to
increase in number with the spread of smart, networked devices and the proliferation of the
IoT (Internet of Things). Webmasters must devise strategies to manage good and partner bots
in e�ective ways that contribute to business objectives, while minimizing any potential nega-
tive impact from good bot tra�c.
We recommend that webmasters, security experts, and marketers consider industry-speci�c
approaches to implement speci�c strategies in managing various types of good bots depend-
ing on their utility and other considerations unique to each organization. Business-critical
good bots such as partner bots or search engine crawlers should be prioritized over other
types of bots, according to your objectives.
A dedicated Bot Management solution can not only help ascertain the exact percentage of
human and bot tra�c, but also provide accurate, insightful analytics for marketers to optimize
their strategies. While blocking bad bots is of primary concern to webmasters and security
chiefs considering deployment of a dedicated bot manager, marketers can also gain valuable
insights to help forecast changing trends in the market.
Clearly, management of bots — both good and bad — is an essential component in a holistic
security program. Managing bots on websites, apps, and APIs is now canonical practice in
leading organizations. When it comes to securing business operations, personal data, and
revenue streams, a bot management solution is a keystone component that sets the stage for
success in today’s global marketplace.
We welcome your questions, comments, and feedback at [email protected].
Recommendations for E�ective
Management of Good Bots
19 Inside Good Bots | Understanding Management of Benign Bot Tra�c
www.shieldsquare.comwww.radware.com
Radware® (NASDAQ: RDWR), a leading provider of cyber security and application delivery solutions, acquired ShieldSquare in March 2019. ShieldSquare is now Radware Bot Manager.
Radware® (NASDAQ: RDWR) is a global leader of cybersecurity and application delivery solutions for physical, cloud and software-defined data centers. Its award-winning solutions portfolio secures the digital experience by providing infrastructure, application and corporate IT protection and availability services to enterprises globally. Radware’s solutions empower more than 12,500 enterprise and carrier customers worldwide to adapt quickly to market challenges, maintain business continuity and achieve maximum productivity while keeping costs down. For more information, please visit www.radware.com
Radware encourages you to join our community and follow us on: Radware Blog, LinkedIn, Facebook, Twitter, SlideShare, YouTube, Radware Connect app for iPhone® and our security center DDoSWarriors.com that provides a comprehensive analysis of DDoS attack tools, trends and threats.
Disclaimer
This document is provided for information purposes only. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law. Radware specifically disclaims any liability with respect to this document and no contractual obligations are formed either directly or indirectly by this document. The technologies, functionalities, services or processes described herein are subject to change without notice.
© 2020 Radware Ltd. All rights reserved. The Radware products and solutions mentioned in this document are protected by trademarks, patents and pending patent applications of Radware in the U.S. and other countries. For more details, please see: https://www.radware.com/LegalNotice/. All other trademarks and names are property of their respective owners.