Ensuring Real Estate Website Listing Data Security
Avoid Litigation by Protecting Your Listing Data
Before the Theft Occurs
Presenters
Charlie Minesinger, Director of Solution Sales, Distil Networks
Matt Cohen, Chief Technologist, Clareity Consulting
○ Introductions and Background
○ Trends in Scraping Real Estate Websites
○ Overview of Study and Findings
○ Immediate Opportunities and Threats from Scraping
Agenda
Toward Better Security for Real Estate Data Online
Distil in Real Estate and Premium Brands
Market Leader in Bot Detection and Mitigation
● Only bot detection vendor to be included in Gartner’s 2015 Online Fraud Detection Market Guide
● Key Attack Trend: “Fraudsters spreading their attacks over thousands of IP addresses”
● Key Inclusion Criteria: “Ability to detect online fraud as transactions occur in real time or near real time”
● Interesting to note: No WAF vendors in this report (as their detection model is primarily rules-based)
What Is Web Scraping?
Web Scraping – Also known as screen scraping, web scraping is the act of copying large amounts of data from a website, either manually or with an automated program (bot)
Legitimate Scraping – Scraping can sometimes be benevolent and totally acceptable; for example, the search engine bots that index your website
Malicious Scraping – The systematic theft of intellectual property accessible on a website, including pricing, content, images, and proprietary data
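To make the definition concrete, here is a minimal sketch of what an automated scraper does: parse listing fields out of page markup. The HTML snippet, CSS class names, and field values are invented for illustration; a real bot would fetch thousands of live pages over HTTP instead of reading an inline string.

```python
# Minimal scraping sketch (hypothetical markup): extract listing fields
# from a page using only the Python standard library.
from html.parser import HTMLParser

PAGE = """
<div class="listing">
  <span class="address">123 Main St</span>
  <span class="price">$350,000</span>
</div>
"""

class ListingScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self._field = None   # class of the tag currently being read
        self.data = {}       # extracted listing fields

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("address", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.data[self._field] = data.strip()
            self._field = None

scraper = ListingScraper()
scraper.feed(PAGE)
print(scraper.data)  # {'address': '123 Main St', 'price': '$350,000'}
```

Run in a loop over every listing URL on a site, this is all it takes to copy an entire database of listings.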
MLSs:
○ Obligation to protect copyright
○ Higher cost to use reactive methods – beacons, legal, etc.
○ Duty to enforce NAR Policy (VOWs, so far)
○ Missed revenue opportunities for licensing content
Brokers / Agents:
○ Provided content license on listing for a specific purpose
○ Responsible for NAR Policy (VOWs, so far)
○ Stale (scraped) data undermines trust and reputation in the brand
○ Higher costs – bots drive up costs for online services
Why Bots / Scraping is a Problem in Real Estate
Software Vendors / Publishers:
○ Resource Utilization – more server and bandwidth costs
○ Poor Website Performance – latency and brownouts, etc.
○ Clean Up Marketing Metrics – optimize for humans
○ Ad Fraud – advertisers are not paying for non-human traffic
○ People Resources – keep your team focused on revenue!
Bottom Line
Scrapers scrape because they are making money with your listings! And the Real Estate industry is left with...
→ Higher costs
→ Lost revenues
Why Bots / Scraping is a Problem in Real Estate
Realtor.org offers free tools to track data – Reactive = expensive
○ Checklist for Syndication has many references to data scraping – legal guidance
○ NoScrape – aborted project; no update since 2010?
Problem is not going away
Industry Help? ...Way behind on Bad Bots
Ads for Scraping Programs on Realtor.com!
Realtor.com blog on how to “deter scraping” relies on obsolete IP address blocking and expensive IP litigation:
“REALTOR.com® logging, tracking and monitoring patterns that indicate data is being stolen for these illegitimate purposes. Once an offender is identified, their IP address is blocked from accessing the site.” (Oct 10, 2014)
Scraping-as-a-service sites proliferate – scraping is VERY accessible!
○ Search for “web data scraping” on elance.com, odesk.com, freelancer.com, etc.
○ Google search terms: “scraping real estate data” and “scrape MLS listings”
○ Services: Mozenda.com, 80legs.com, webharvey.com, scraping.pro, etc.
Problem is not going away
Web Scraping - Cheap, Easy & DIY
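A quick sketch of why the per-IP blocking approach quoted above fails against even a cheap DIY scraper: a bot that rotates through a proxy pool stays under any per-IP threshold on every address. The threshold, pool size, and IP addresses below are invented for illustration.

```python
# Illustrative only: a per-IP blocklist (block any IP after 5 requests)
# versus a bot cycling through a hypothetical pool of 200 proxy IPs.
from collections import Counter

THRESHOLD = 5
counts = Counter()
blocked = set()

def allow(ip):
    """Per-IP rate limit: permanently block an IP past the threshold."""
    if ip in blocked:
        return False
    counts[ip] += 1
    if counts[ip] > THRESHOLD:
        blocked.add(ip)
        return False
    return True

proxy_pool = [f"10.0.0.{i}" for i in range(200)]  # hypothetical proxies
served = sum(allow(proxy_pool[i % len(proxy_pool)]) for i in range(1000))
print(served, len(blocked))  # 1000 0 — every scrape request got through
```

With 1,000 requests spread over 200 IPs, each address makes only 5 requests, so the blocklist never triggers; this is the "IP cycling" tactic discussed later in the deck.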
Costs of Scraping MLS Data
○ Resource costs – 10% to 40% of server utilization and bandwidth
○ Customer care – cost per call from a consumer? Calls per month?
○ Website performance – brownouts result in 3 days of low traffic
○ Ad fraud – if 30% of ads are seen by bots, are advertisers paying?
○ Lead gen – $15/mover, $30/storage facility, … $100s per listing going to third parties, not the broker, not the agent
→ Biggest losers: MLSs and brokers
Value of solution?
○ Antivirus is $40 to $75 per year per member ( = $3 – $6/month)
○ Anti-scraping protection should be the same cost or less
Bottom Line on Scraping
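The antivirus price comparison above, worked out (simple arithmetic on the quoted $40–$75/year range; no other assumptions):

```python
# Per-member monthly cost implied by the $40-$75/year antivirus range.
low_per_year, high_per_year = 40, 75
low_monthly = round(low_per_year / 12, 2)
high_monthly = round(high_per_year / 12, 2)
print(low_monthly, high_monthly)  # 3.33 6.25
```

So "same or less" means anti-scraping protection should land at roughly $3–$6 per member per month or below.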
For now, two surveys:
○ MLS Executives – 100 MLS executives representing MLSs with over 600,000 subscribers
○ IDX Vendors – 14 vendors representing 400,000 IDX & VOW websites, surveyed because they manage the largest set of scraping targets. Others would only speak informally.
Email invitation, web survey over several weeks.
Study Methodology
MLS executives were surveyed because they play a part in all scraping contexts – MLS, Publishers, and IDX/VOW.
● Technology Selection. Selects and contracts for the MLS systems.
● Data Licensing. Manages the data license agreements with the advertising portals.
● Industry Policy. Collectively sets IDX / VOW rules.
99% say compliance with rules protecting against misuse of MLS data is important. Implementing anti-scraping should be a priority for MLS vendors:
95% agree that IDX sites should be subject to rules specifically mandating scraping protections. This needs follow-up with NAR committees.
59% of respondents do NOT test VOW sites for anti-scraping compliance.
○ Most testing performed is not rigorous
○ Some rely on self-reporting
98% of respondents want a set of standardized tests to verify that VOW and syndication sites are protected.
MLS Study – Key Results
43% of IDX/VOW vendors were not aware of how pervasive the issue is.
62% rate compliance with MLS rules as the most important factor in having IDX/VOW vendors implement an anti-scraping solution.
Other drivers for adoption of anti-scraping protection:
○ Customer demand for anti-scraping protections
○ Cost of infrastructure use/abuse
○ Security concerns
○ System performance issues
IDX / VOW Study – Key Results
○ 50% of IDX vendor respondents believe 15–30% bot traffic is acceptable
○ 50% believe less than 1% bot traffic is acceptable (more like MLSs)
○ Most IDX/VOW vendors are using reactive detection tactics:
   - Log analysis – reactive and labor-intensive monitoring
   - IP-based methods – ineffective against sophisticated scrapers
   - Obsolete preventions – IP-based rate limiting and CAPTCHAs
→ They are likely underestimating (missing bots) with these methods!
○ More than half cannot identify the costs of bots to their business... if you cannot measure it, you cannot manage it, and certainly cannot budget for it
○ While 100% put NAR compliance as a priority, only 25% have budgeted for services to provide anti-scraping protection to comply with VOW rules
IDX / VOW Study - Misaligned, Lacking Key Data
○ Scripts, such as cURL or Ruby, making requests at any rate
○ Selenium – a fully automated browser making requests at any rate
○ Headless browsers, with or without PhantomJS (fully simulating a browser; browser pre-rendering)
○ IP cycling – using any bot technology at a rate of less than 5 requests per IP address, then changing IP
○ Crawlers – at any speed, even slow crawlers making 10 requests per minute or less
○ Anonymized proxies – making requests using any technology or at any rate behind a proxy IP
○ Spoofed bot user-agents – e.g. a fake “googlebot” or “bingbot” user-agent, IE running on Linux, etc.
○ Non-browser user-agents – spoofed user-agents for mobile browsers or mobile applications
○ Blocking traffic from data centers and hosting providers (why would consumers be using those IPs?)
○ Blocking bots from consumer ISPs while letting legitimate requests through
It’s An Arms Race … More Detail:
Modern Anti-Scraping Tool Requirements
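One item from the list above, the spoofed “googlebot” user-agent, can be checked concretely: a genuine Googlebot IP reverse-resolves to a googlebot.com or google.com hostname that forward-resolves back to the same IP. Production code would use `socket.gethostbyaddr` and `socket.gethostbyname`; in this sketch the DNS lookups are injected as hypothetical lookup tables so it runs offline.

```python
# Verify a client claiming to be Googlebot via reverse-then-forward DNS.
def is_genuine_googlebot(ip, reverse_dns, forward_dns):
    host = reverse_dns(ip)
    if not host or not host.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward-confirm: a spoofer can fake the PTR record for its own IP
    # range, but not make Google's A record point back to it.
    return forward_dns(host) == ip

# Hypothetical lookup tables standing in for real DNS responses:
PTR = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com"}
A = {"crawl-66-249-66-1.googlebot.com": "66.249.66.1"}

rev = lambda ip: PTR.get(ip)
fwd = lambda host: A.get(host)

print(is_genuine_googlebot("66.249.66.1", rev, fwd))  # True
print(is_genuine_googlebot("203.0.113.9", rev, fwd))  # False: spoofed UA
```

This is only one layer; it says nothing about the other tactics in the list (headless browsers, IP cycling, slow crawlers), which is why layered detection is needed.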
○ 7 of the top 10 sources of bots are consumer ISPs: (1) Comcast, (2) Time Warner Cable, (3) Verizon FiOS, (4) Charter, (5) Cox, (6) CenturyLink, and (7) AT&T U-verse
○ 50% – 75% of bot traffic on real estate sites is from consumer ISPs
○ Most consumer ISPs had 1,500+ IPs with bot traffic
○ 18–45% automated browsers – mimicking humans
○ 14–25% in the bot database – fingerprinted, known bots
○ 16–42% slow crawlers – recycling IPs and user agents
Highlights of Bot Sophistication in Real Estate
The Facts on Scraping Real Estate Data
Purpose Built Solution, Not a Feature
Bot Detection is a New Category, NOT a Feature
○ NOT a Content Delivery Network (CDN)
○ NOT a Distributed Denial of Service (DDoS) protection solution
○ NOT a simple IP list or set of scripts
○ NOT a Web Application Firewall (WAF)
A purpose-built bot detection solution is always updating and evolving
Catch 99.9% of Malicious Bots with Distil – a typical WAF catches 20%
[Diagram: layered detection funnel – IP blocking, user-agent testing, IP analysis, JavaScript/cookie tests, Selenium detection, browser rate limiting, automated-browser and PhantomJS detection, machine learning, IP-cycling detection. A WAF stops at the first layers; Distil catches up to 99.9%.]
Detect Your Bot Traffic
Control Over Your Bot Traffic
○ Monitor – inspect requests and record the traffic to Distil and/or your own server logs
○ Block – serve the client an unblock verification form
○ CAPTCHA – serve a hardened CAPTCHA to test the client for verification
○ Drop – present them with an access denied page
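The four mitigation actions above amount to a dispatch on how confident the detector is that a request is a bot. The score scale and thresholds below are invented for illustration; the vendor's actual scoring is not described in this deck.

```python
# Hypothetical dispatch from a 0-100 bot-likelihood score to one of the
# four mitigation actions (thresholds are illustrative only).
def choose_action(bot_score):
    if bot_score < 30:
        return "monitor"  # log the request, serve content normally
    if bot_score < 60:
        return "captcha"  # challenge the client with a hardened CAPTCHA
    if bot_score < 90:
        return "block"    # serve an unblock verification form
    return "drop"         # access-denied page for known-malicious bots

print([choose_action(s) for s in (10, 45, 75, 99)])
# ['monitor', 'captcha', 'block', 'drop']
```

The point of making the action configurable is that a false positive under "monitor" costs nothing, while a false positive under "drop" turns away a real consumer.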
Flexible Deployment Options: Cloud
○ Deploys in hours
○ Blazing-fast Anycast DNS-based GeoIP routing; automatic content compression optimizes for faster delivery
○ 17 datacenters automatically fail over when a primary location goes offline
○ Automatically increases infrastructure and bandwidth to accommodate spikes
[Diagram: User → Distil Cloud CDN → Load Balancer → Web Server]
Flexible Deployment Options: Physical or Virtual Appliance(s)
○ Install on virtualized or bare-metal appliance(s)
○ Deploys in days
○ High-availability configurations with failover monitoring
○ Heartbeat up to the Distil Cloud
[Diagram: User → Internet → Load Balancer + Distil Appliance → Web Server]
Best-of-Breed Solution Will Include:
○ 99% accuracy – cannot rely on IP addresses to identify bots or use per-IP rate limiting
○ Dedicated service – NOT a button/feature/add-on
○ Layers of tactics – multiple detection tactics, with ongoing R&D
○ Easy to implement – deploy in days or weeks
○ Real-time detection and mitigation – be proactive to save time and money
○ Flexible, configurable options for actions to mitigate bots
○ Affordable cost per member, per site, or per MLS – flexible business model
Selection Criteria for Anti-Scraping
www.distilnetworks.com
QUESTIONS… COMMENTS?
Call Charlie: 1.703.962.1614
charlie@distilnetworks.com
http://resources.distilnetworks.com/h/c/175726-real-estate