Download pdf - Understanding Management of Benign Bot Tra˜c · 2020-04-09 · As the pie chart below shows, crawlers make up roughly 55% of all good bot tra˜c, and all other good bots combined

Understanding Management of Benign Bot Tra�cJan 2019

Radware Research

� � � � � � ��

SummaryKey TakeawaysHumans vs. Good Bots vs. Bad Bots

Types of Good BotsSearch Engine Crawlers Outnumber Other Good Bots

Analysis of Good Bot Tra�cGood Bots by Industry

Travel

E-commerce

Classi�eds and Marketplaces

Media and Publishing

How Crawlers OperateData CentersUser Agents and IP RangesCrawling BudgetAverage Number of Requests Per Minute

Managing Good BotsCommon Strategies

Robots.txt

Validating the Authenticity of Good Bots

Whitelisting

Rate Limiting

Time Allocation

Adopting a Dedicated Bot Management Solution

The Unintentional Consequences of Good Bot Tra�cPitfalls in Managing Good Bots

Spoofed Search Engine Crawlers

Third-Party Bots

False Positives and False Negatives

How Blocking Good Bots Can Have a Negative Impact

Recommendations for E�ective Management of Good Bots

02030405060707070809101111121213141414151516161617171718181819

Table of Contents

While bad bots that cause harm are growing to become among the most serious threats to

organizations, good bots do not receive much attention from security experts. This is

because, rather than intentionally causing harm, good bots perform a range of important

functions that help their users in several ways. This report outlines the prevalence of good

bots and their various types, the main functions they carry out, their tra�c distribution from

July to December (H2) 2018 across four key verticals in our client base (Online Travel

Agencies, E-Commerce, Classi�eds, and Media & Publishing), and concludes with

recommended strategies to e�ciently manage good bots.

Summary

02 Inside Good Bots | Understanding Management of Benign Bot Tra�c

Under certain circumstances, good bots can also have a negative impact on

organizations.

An industry-speci�c approach is needed by webmasters, security chiefs, and marketers.

Analysis of trends in good bot tra�c can provide actionable insights to drive marketing

strategy.

Webmasters and security specialists are well advised to adopt dedicated Bot

Management solutions which allow them to manage various types of good bots based

on speci�c business needs.

Key Takeaways

More than 50 percent of the Internet tra�c today comes from bots, and while we've seen an

increase in bot-based attacks in recent years, 'good' bots still play an important role in

applications' operations. As such, organizations need to look for bot management solutions

that not only e�ectively detect and mitigate bot attacks, but also can accurately distinguish

between 'good' and 'bad' bots in real-time."

David Aviv, CTO of Radware


Let’s take a broad look at the overall composition of tra�c during H2, 2018 across four key

verticals in our user base. Genuine visitors comprised nearly 74% of tra�c, while good bots

totalled nearly 17%, and bad bots nearly 10%.

Breakdown of Overall Tra�c: Humans vs. Good Bots vs. Bad Bots

Bad Bots9.8%

Good Bots16.7%

Humans73.5%

Humans vs. Good Bots vs. Bad Bots


Search Engine Crawlers

These bots or spiders crawl web pages to index them for search engines such as Google and Bing. Website administrators can specify non-binding rules in their ‘robots.txt’ �le for crawlers to follow while indexing web pages, such as their crawl rates and pages or sections that they should not index. Spoofed crawlers are generally hard for webmasters to detect, as they can often sneak through basic security measures.

GooglebotBingbotBaidu SpiderYandex Bot

Bot Type Description & Intent Examples

Partner Bots Partner bots provide essential services and information to websites and their customers. This category includes bots run by vendors which provide transactional/ CRM/ ERP services, credit-scoring, geo-location, inventory and background checks, and many other business-related services.

AlexaSlackbotIBM Watson

Social Network Bots

Bots that are run by social networking sites to provide visibility to their clients’ websites and drive engagement on their platforms.

Facebook BotPinterest BotSnapchat-proxy

Monitoring Bots

Monitoring bots are used to monitor the uptime and system health of websites. These bots periodically check and report on page load times, downtime duration, and so on.

PingdomAlertBotStatusCake

Backlink Checker Bots

These bots check the inbound URLs on a website to provide marketers and SEO specialists with insights, which help them optimize their content and campaigns.

SEMRushBotUAS Link Checker AhrefsBot

Aggregator/ Feed FetcherBots

Aggregator or Feed-Fetcher bots collate information from websites and provide users or subscribers with the latest news, alerts, and other desired content.

Google FeedfetcherSuperfeedrFeedly

Types of Good Bots


https://support.google.com/webmasters/answer/182072

https://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0

http://help.baidu.com/question?prod_en=master&class=Baiduspider

https://yandex.com/support/webmaster/robot-workings/robot.html

https://developer.amazon.com/alexa

https://get.slack.help/hc/en-us/articles/202026038-An-introduction-to-Slackbot

https://www.ibm.com/watson/ai-customer-service

https://developers.facebook.com/docs/sharing/webmasters/crawler

https://help.pinterest.com/en/business/article/pinterest-crawler

https://www.snap.com/en-US/

https://www.pingdom.com/

https://www.alertbot.com/

https://www.statuscake.com/

https://www.semrush.com/bot/

https://udger.com/support/UASlinkChecker

https://ahrefs.com/robot

https://support.google.com/webmasters/answer/178852?hl=en

http://superfeedr.com/

https://www.feedly.com/fetcher.html

As the pie chart below shows, crawlers make up roughly 55% of all good bot tra�c, and all

other good bots combined make up the rest. These statistics represent overall good bot tra�c

in H2, 2018, and can signi�cantly vary every month, as well as by industry.

Search Engine Crawlers Outnumber Other

Good Bots

Percentage of Search Engine Crawlers vs. All Other Good Bots

Search Engine Crawlers

54.5%All Other Good Bots

45.5%


The breakdown of good bot tra�c across the four industries considered here reveals patterns

that are fairly unique to each industry. Not surprisingly, businesses in the Classi�eds &

Marketplaces industry have the highest level of Aggregator bot tra�c, followed by Online

Travel Agencies (OTAs). When it comes to Partner bots, E-Commerce sites have the highest

percentage, followed by OTAs. E-Commerce sites also have the highest levels of Social

Network bots, followed by Media & Publishing sites, while OTAs come in last with a tiny 0.1%

of their good bot tra�c consisting of Social Network bots.

Monitor Bots3.3%

Partner Bots12.1%

Social Network Bots

0.1%

Aggregator Bots

62.5%Backlink Checker Bots

21.9%

Distribution of Good Bot Tra�c: Online Travel Industry

53.5%

All Other Good Bots

46.5% Search Engine Crawlers

Good Bots By Industry

Travel

Analysis of Good Bot Tra�c


Partner Bots19.5%

Social Network Bots

9.2%


14.0%Monitor Bots15.5%

Aggregator Bots

41.8%

E-commerce


49.0%Search Engine

Crawlers

Distribution of Good Bot Tra�c: E-commerce Industry


Classi�eds and Marketplaces

Aggregator Bots

Partner Bots8.2%

Social Network Bots

2.4%


67.8%

Monitor Bots3.3%

18.3%


52.0%Search Engine

Crawlers

Distribution of Good Bot Tra�c: Classi�eds and Marketplaces


Media and Publishing

Aggregator Bots

Partner Bots4.3%

Social Network Bots

3.3%


57.3%

32.7%Monitor Bots2.4%


45.3%Search Engine

Crawlers

Distribution of Good Bot Tra�c: Media and Publishing


The leading search engine bots originate from IP address ranges that generally do not change

often (but sometimes do vary, as with the Googlebot and others). Other types of bots such as

Partner bots generally work out of IP addresses that are provided by vendors or partners,

allowing webmasters to whitelist them. Webmasters are often advised to block crawlers from

data centers in countries that they do not do business in. For example, the Yandex crawler can

be disallowed from indexing your site if you do not do business in the Russian Federation.

Search engine crawlers constitute a clear majority of all good bots visiting our client base, and

have well-de�ned modes of operation when compared to the diversity of operational

methods seen in other good bots. Hence we have focused our attention on how search

engine crawlers operate.

Data Centers

How Crawlers Operate


For search engine crawlers, ‘Crawling Budget’ refers to the methodology used to determine

which pages of a websites, and how many of them, their crawler must index. With trillions of

web pages in existence and many more being added every day, search engines have their

work cut out for them, and a crawling budget sets parameters on how their crawlers can most

e�ciently index web pages.

According to Google, “Googlebot is designed to be a good citizen of the web. Crawling is its main

priority, while making sure it doesn't degrade the experience of users visiting the site. We call this the

"crawl rate limit," which limits the maximum fetching rate for a given site. Simply put, this represents

the number of simultaneous parallel connections Googlebot may use to crawl the site, as well as

the time it has to wait between the fetches. The crawl rate can go up and down based on a couple

of factors… If the site responds really quickly for a while, the limit goes up, meaning more

connections can be used to crawl. If the site slows down or responds with server errors, the limit goes

down and Googlebot crawls less.”

Crawling Budget

The top search engine bots such as those deployed by Google, Bing, Yahoo, and Baidu always

identify themselves in their User Agent (e.g. Googlebot). However, they do not necessarily

operate out of �xed IP address ranges. This is why the Googlebot’s originating IP addresses are

not o�cially listed, since they can change due to various reasons. Webmasters can run a

reverse DNS lookup to ascertain if a bot originates from the domain it claims to be from.

User Agents and IP Ranges



https://webmasters.googleblog.com/2017/01/what-crawl-budget-means-for-googlebot.html

Every search engine crawler has a predetermined range of requests that it may perform per

second as it indexes a website. For example, the Googlebot runs according to algorithms that

determine its optimal crawl speed, with the intention of indexing the maximum number of

pages on each visit without overwhelming the server's bandwidth.

If you discover a crawler making a high number of requests every second to your site — quite

possibly leading to server congestion — it can be asked to limit its crawl rate through

instructions in the ‘robots.txt’ �le at the root level of your site, which we have outlined further

down this report.

It’s important to note that instructions in the robots.txt �le are meant to be voluntarily obeyed,

and while the crawl rate of the Googlebot can be limited, not every crawler will obey these

instructions. Persistently problematic or disobedient crawlers can be blocked by blacklisting

them in your Bot Manager or Web Application Firewall (WAF).

The graph below shows the approximate ratios of the volume of crawler hits across our

customer base:

Average Number of Requests Per Minute

Google bots made up nearly 68% of search engine bots across our client base; Bing bots came in at nearly 26%, and bots from Yandex, Yahoo, and Baidu made up the rest.

Yandex Bot2.4%

Google Bots67.6%

Bing Bot25.7%

Others3.4%

Distribution Of Search Engine Crawler Tra�c


http://www.robotstxt.org/robotstxt.html


Managing Good Bots

Webmasters have several options to help manage good bot tra�c, based on business needs,

SEO strategies, country of operation, the vendors and partners they use, as well as any

infrastructural constraints that may possibly be caused by good bots. Below we list some of

the key options to manage good bots.

Robots.txt

‘Robots.txt’ (also known as ‘The Robots Exclusion Protocol’) is a file located at the root of a

website (for example, https://www.shieldsquare.com/robots.txt) that lets webmasters set

rules for visiting crawlers to voluntarily obey. It is a standard format used by websites to

communicate their crawling preferences to web crawlers. Webmasters can specify which

sections of a website should not be scanned, such as administrative pages, pages with

duplicate content, internal staging pages, and so on.

Websites that do not have a robots.txt file are essentially allowing search engine crawlers to

index every page on the site without restraint. However, it is important to note that bots

can ignore your robots.txt file, as any instructions therein are non-binding — meant to be

voluntarily followed — and cannot be technically enforced solely through the robots.txt

file.

Bad bots that harvest email address, as well as spam bots and malware bots — and even

certain well-known crawlers such as those from the Internet Archive (AKA Wayback

Machine) — do not heed the robots.txt file, and webmasters can choose to block them

with their Bot Manager or WAF.

Common Strategies


https://en.wikipedia.org/wiki/Robots_exclusion_standard

http://www.robotstxt.org/robotstxt.html

https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/

https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/

Validating the Authenticity of Good Bots

Spoofed search engine crawlers can be hard for webmasters to detect, with sophisticated

versions adapting techniques to evade conventional web security systems. However, a

dedicated bot manager can signi�cantly turn the tables against malicious botmasters when

it comes to defending against the most sophisticated and harmful bots extant on the Web.

While Google does not provide a list of IP addresses for its Googlebot for webmasters to

whitelist (since these IP address ranges can change without notice), webmasters must run a

DNS lookup to ascertain if a bot purporting to be a Googlebot is authentic or not. For

example, the Googlebot can be veri�ed by examining server logs and then running a reverse

DNS lookup on the accessing IP address using the ‘host’ command. After that, running a

forward DNS lookup on the domain name retrieved (again using the ‘host’ command) will

allow webmasters to verify that it is the same as the original accessing IP address from their

server logs. Googlebot’s User Agent will always be ‘Googlebot.com’, ‘Google.com’ or

‘Googlebot-Image’. Partner bots and social network bots generally operate out of data

centers and use known IP address ranges which can be obtained and whitelisted by

webmasters. In addition, specialized bot management solutions generally include measures

to identify the authenticity of known good bots through their unique �ngerprints and

signatures.

Whitelisting

Webmasters are generally advised to whitelist bots used by their business, such as partner

bots, social network bots, and other bots that are essential for the operation of their business.

Internal IP addresses or User Agents can be whitelisted as well depending on access

requirements and speci�c business needs.



Rate Limiting

While the vast majority of good bots fall under the ‘Search Engine Crawler’ category, rate

limiting is usually carried out by webmasters in case search engines are found to be crawling

at rates close to (or even exceeding) the available server bandwidth needed to provide quick

page loads to visitors. The Googlebot’s crawl rate can be adjusted to ensure that there is no

negative impact on overall page load times. Another management strategy is based on

regional factors. For example, if your business does not cater to the Russian market, the Yandex

bot can be rate limited (or blocked outright). Going further, web pages that are rarely updated

pages can be made o�-limits for crawlers in order to reduce the time and bandwidth used by

crawlers.

Time Allocation

Legitimate bots usually allocate a �xed ‘ crawling budget’ for each website to be indexed.

During that budgeted time, the bot will crawl as many pages as it can. Sites that load faster

can thus get indexed faster as well, which subsequently contributes to a better search engine

ranking (as does having a mobile-optimized site with responsive design).

Adopting a Dedicated Bot Management Solution

A dedicated Bot Management solution will enable webmasters to take speci�c actions against

various types of bots in general, as well as see the crawling history of every type of bot that

visits. Analytics data can also be cleansed of good bot tra�c to provide accurate data on

human visitors and give marketers better insights to help develop their strategy.

Radware Bot Manager’s Dashboard, for example, provides detailed information for

webmasters and security experts, with highly-granular data on every type of bot that visits.

The information presented includes the bot family and category, number of hits

made, originating IP addresses, and more, with options to blacklist or whitelist them

depending on business needs.



https://developers.google.com/search/mobile-sites/

https://developers.google.com/web/fundamentals/design-and-ux/responsive/

The Unintentional Consequences of

Good Bot Tra�c

Unchecked good bot tra�c can slow down websites and a�ect user experience. For example,

high volumes of good bot tra�c can strain servers and slow down loading of pages for users.

Good bot tra�c can also skew website analytics and mislead marketers who rely on that data

to develop their strategies. High crawling rates by search engine bots are also known to

impact overall page load times for visitors. For a good overall user experience, it thus becomes

crucial to be able to manage every type of automated tra�c, both good and bad, which only

a dedicated Bot Management solution can provide.

Pitfalls in Managing Good Bots

Depending on business requirements, it is sometimes advisable to block good bots that are

irrelevant or unnecessary. For example, if your business does not operate in China (or does not

presently intend to do so), it may be a good idea to block Chinese search engine crawlers such

as Baidu and similar crawlers.

Spoofed Search Engine Crawlers

Malicious parties are known to spoof leading crawlers such as the Googlebot to evade

security measures. Countering such attack strategies requires measures such as performing

reverse lookups or comparison of the behavior of suspected spoofed crawlers

with the behavior of real crawlers. A dedicated Bot Management solution

such as Radware Bot Manager leverages collective bot intelligence from across its

global customer base to automatically update every customer’s rule sets and secure

websites against similar attack strategies.


Good bots used by business partners or third-party service providers may get misclassi�ed as

bad bots if they have not been updated or whitelisted in their Bot Management console.

Third-Party Bots

A multi-pronged approach is necessary for a Bot Management solution to eliminate False

Positives and False Negatives to the maximum extent possible. False Positives are cases in

which a human visitor is wrongly labeled a bot, and this can lead to a frustrating user

experience when a CAPTCHA is presented to ascertain a visitor’s humanity. On the other hand,

False Negatives — incorrectly labeling bots as humans — can lead to various harmful

consequences depending on the bot’s intent.

False Positives and False Negatives

Conventional measures or in-house solutions may block good bots on occasion and cause

negative impacts. Search engine crawlers are essential in leading to better search rankings,

and blocking them will certainly have an adverse e�ect on SEO results. Similarly, blocking

partner bots or social network bots can produce highly undesirable results, both to a business

and to its customers.

How Blocking Good Bots Can Have a Negative Impact


As we have seen, good bots span a spectrum of uses and functions, and are only going to

increase in number with the spread of smart, networked devices and the proliferation of the

IoT (Internet of Things). Webmasters must devise strategies to manage good and partner bots

in e�ective ways that contribute to business objectives, while minimizing any potential nega-

tive impact from good bot tra�c.

We recommend that webmasters, security experts, and marketers consider industry-speci�c

approaches to implement speci�c strategies in managing various types of good bots depend-

ing on their utility and other considerations unique to each organization. Business-critical

good bots such as partner bots or search engine crawlers should be prioritized over other

types of bots, according to your objectives.

A dedicated Bot Management solution can not only help ascertain the exact percentage of

human and bot tra�c, but also provide accurate, insightful analytics for marketers to optimize

their strategies. While blocking bad bots is of primary concern to webmasters and security

chiefs considering deployment of a dedicated bot manager, marketers can also gain valuable

insights to help forecast changing trends in the market.

Clearly, management of bots — both good and bad — is an essential component in a holistic

security program. Managing bots on websites, apps, and APIs is now canonical practice in

leading organizations. When it comes to securing business operations, personal data, and

revenue streams, a bot management solution is a keystone component that sets the stage for

success in today’s global marketplace.

We welcome your questions, comments, and feedback at [email protected].

Recommendations for E�ective

Management of Good Bots


https://www.shieldsquare.com/web/wp-content/uploads/ShieldSquare_Research_Bad-Bots_Report.pdf

https://www.shieldsquare.com/web/wp-content/uploads/Why_Bot_Mitigation_Could_Be_Crucial_-For_GDPR_Compliance.pdf

www.shieldsquare.comwww.radware.com

Radware® (NASDAQ: RDWR), a leading provider of cyber security and application delivery solutions, acquired ShieldSquare in March 2019. ShieldSquare is now Radware Bot Manager.

Radware® (NASDAQ: RDWR) is a global leader of cybersecurity and application delivery solutions for physical, cloud and software-defined data centers. Its award-winning solutions portfolio secures the digital experience by providing infrastructure, application and corporate IT protection and availability services to enterprises globally. Radware’s solutions empower more than 12,500 enterprise and carrier customers worldwide to adapt quickly to market challenges, maintain business continuity and achieve maximum productivity while keeping costs down. For more information, please visit www.radware.com

Radware encourages you to join our community and follow us on: Radware Blog, LinkedIn, Facebook, Twitter, SlideShare, YouTube, Radware Connect app for iPhone® and our security center DDoSWarriors.com that provides a comprehensive analysis of DDoS attack tools, trends and threats.

Disclaimer

This document is provided for information purposes only. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law. Radware specifically disclaims any liability with respect to this document and no contractual obligations are formed either directly or indirectly by this document. The technologies, functionalities, services or processes described herein are subject to change without notice.

© 2020 Radware Ltd. All rights reserved. The Radware products and solutions mentioned in this document are protected by trademarks, patents and pending patent applications of Radware in the U.S. and other countries. For more details, please see: https://www.radware.com/LegalNotice/. All other trademarks and names are property of their respective owners.

https://www.radware.com/

https://www.radware.com/products/application-network-security/

https://www.radware.com/products/load-balancing-application-delivery/

https://blog.radware.com/

https://www.linkedin.com/company/165642

https://www.facebook.com/Radware

https://twitter.com/radware

https://www.slideshare.net/Radware

https://www.youtube.com/user/radwareinc

https://itunes.apple.com/us/app/radware-connect/id391124100?mt=8

https://security.radware.com/

https://www.radware.com/newsevents/pressreleases/2019/radware-to-acquire-shieldsquare

https://twitter.com/shieldsquare

https://www.linkedin.com/company/shieldsquare/