A living hell - lessons learned in eight years of parsing real estate data

Slides of a talk delivered by Ed Freyfogle (@freyfogle) at #csvconf in Berlin on 15 July 2014.

A living hell: lessons learned in eight years of processing real estate listings

Ed Freyfogle, CSVConf Berlin

15 July 2014

Residential property search engine in nine markets

3-4 million unique users per month

Processing close to 20M listings daily

Extensive experience / painful lessons in ETL, geocoding, deduping, ...

http://www.nestoria.com

What we do

Real estate is a complex, high-value transaction. Our goal is to be:

Simple

Comprehensive

Fast (user time and time to market)

Where does the data come from?

Seller

Agent 1

Agent 2

Agent 3

Portal 1

Portal 2

Portal 3

Plenty of chances for data to go bad

Where we do it

India

Very, very good at:

Cricket

Amazing cuisine

World’s largest democracy

Too many other things to list here

Utterly fucking terrible at:

Real Estate data quality

Addresses / Geodata

Must garbage in be garbage out?

Can we turn multiple bits of shit into something useful?

What we really do

something useful

Chaos

Caveat: I love our clients

The examples you are about to see are all theoretical *wink, wink*

Examples / Horror stories

Us: “Please set up an automated data transfer. Thx!”

Them: “It’s impossible to export the data from the database”

Them: “Just crawl our website”

Them: “Let’s do incremental updates to save bandwidth”

Them: “I’ll just send you an email when there is new stuff … starting when I get back from holiday”

Getting the data

zip or tar full of subdirs, names of which change with each upload

filename “feed.xml?key=SsKpyM62QN0RbqCwnaAc”

One file per agent; when a file is not supplied, there is no way to know whether it is missing by error or intentionally

Format A on Monday, B on Tuesday, ...

Fun with files
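The first and third file problems above can at least be made visible. A minimal sketch, assuming each agent's feed is a file named after the agent (for example `agent_1.xml`, a hypothetical naming scheme) somewhere inside a zip upload: it ignores whatever directories wrap the files and reports which expected agents have no file. It cannot tell you why a file is missing; it only tells you whom to ask.

```python
import io
import zipfile
from pathlib import PurePosixPath

def feeds_by_agent(archive, suffix=".xml"):
    """Map agent name -> archive member, ignoring the directory names around it."""
    found = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            if name.endswith("/") or path.suffix.lower() != suffix:
                continue  # skip directory entries and unrelated files
            found[path.stem] = name
    return found

def missing_agents(found, expected):
    """Agents we expected a file for but did not get (reason unknown)."""
    return sorted(set(expected) - set(found))

# Usage with an in-memory upload whose directory names are arbitrary
upload = io.BytesIO()
with zipfile.ZipFile(upload, "w") as zf:
    zf.writestr("upload_2014_07_15/x/agent_1.xml", "<feed/>")
    zf.writestr("upload_2014_07_15/agent_3.xml", "<feed/>")
    zf.writestr("readme.txt", "hello")

found = feeds_by_agent(upload)
print(missing_agents(found, ["agent_1", "agent_2", "agent_3"]))
```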

<Description>Residential Plot available in Suncity&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;SUNCITY PROJECT&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;&amp;lt;br/&amp;gt;&#13;&amp;lt;br&amp;gt;A complete township...

"&amp;gt;" - for when you really, really want to be sure you've escaped your XML

&#13; anyone?

XML, LOL
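One way to cope with a description like the one above is to unescape until the text stops changing, then turn the leftover markup into plain line breaks. This is a sketch rather than anything from the talk: `unescape_fully` and `clean_description` are hypothetical helper names, and the round limit is an arbitrary safety bound.

```python
import html
import re

def unescape_fully(text, max_rounds=5):
    """Apply html.unescape repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text

def clean_description(raw):
    text = unescape_fully(raw)
    # Treat <br>, <br/> and <br /> as line breaks
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # &#13; decodes to a carriage return; normalise it to a newline
    text = text.replace("\r", "\n")
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)

sample = ("Residential Plot available in Suncity&amp;lt;br/&amp;gt;&#13;"
          "&amp;lt;br&amp;gt;SUNCITY PROJECT")
print(clean_description(sample))
```

Repeated unescaping will also decode text that was meant to stay escaped, so it is only safe on fields known to be over-escaped.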

One 500 MB file of XML

On a single line … to save space

Go grep yourself
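Line-oriented tools are useless on a single-line file, but a streaming parser does not care about line breaks. A minimal sketch using the standard library's `iterparse`, assuming each record is a `<listing>` element directly under the root (the element name and layout are assumptions, not the real feed format):

```python
import io
import xml.etree.ElementTree as ET

def iter_listings(source, tag="listing"):
    """Yield one dict per <listing> element without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {child.tag: (child.text or "") for child in elem}
            root.clear()  # drop already-processed children so memory stays flat

# Usage: everything on one line, just like the real thing
data = (b"<feed><listing><id>1</id><price>100</price></listing>"
        b"<listing><id>2</id><price/></listing></feed>")
for listing in iter_listings(io.BytesIO(data)):
    print(listing)
```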

Newlines, newlines,

newlines

Choose your delimiter wisely - ^B

So simple even a child could get it wrong

Microsoft quotes vs. ASCII quotes

Excel vs. CSV

CSV, LOL
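Several of the CSV problems above can be handled with the standard `csv` module, as long as you stop assuming commas and straight quotes. A sketch under assumptions: the delimiter is ^B (`\x02`), and curly "Microsoft" quotes were meant as ordinary quote characters. The second assumption is wrong for feeds that use curly quotes inside field text, so check a sample first.

```python
import csv
import io

# Map Microsoft "smart" quotes to their ASCII equivalents
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # left/right double quotation marks
    "\u2018": "'", "\u2019": "'",  # left/right single quotation marks
})

def read_rows(text, delimiter="\x02"):
    normalized = text.translate(SMART_QUOTES)
    # newline="" leaves line endings untouched so the csv module can
    # handle newlines inside quoted fields itself
    buffer = io.StringIO(normalized, newline="")
    return list(csv.reader(buffer, delimiter=delimiter))

raw = ("\u201c2 bed flat\nwith garden\u201d\x02250000\r\n"
       "\"Studio\"\x02900\r\n")
for row in read_rows(raw):
    print(row)
```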

Them: “We will send the data in X format” (where X is a large industry player)

Us: “Not even X uses that format”

Them: “We use X format, but changed it slightly so we could ….”

Us: *sigh*

Wrong tool for right job

Are they really unique?

Are they unique across time?

Partner re-uses numeric unique ids … in case there is ever a shortage of numbers

Unique identifiers
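If a partner's ids cannot be trusted across time, one defensive option is to never use them alone. The sketch below is an illustration, not the approach described in the talk: it pairs the partner id with a fingerprint of fields that should not change for a given property, and bumps a generation counter when the same id turns up on what looks like a different property. The field names and the `IdRegistry` class are hypothetical.

```python
import hashlib

# Hypothetical: fields that should stay stable for one real-world listing
STABLE_FIELDS = ("address", "property_type")

def fingerprint(listing):
    parts = [str(listing.get(f, "")).strip().lower() for f in STABLE_FIELDS]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

class IdRegistry:
    def __init__(self):
        self._seen = {}  # (source, external_id) -> (fingerprint, generation)

    def internal_id(self, source, listing):
        key = (source, str(listing["id"]))
        fp = fingerprint(listing)
        previous = self._seen.get(key)
        if previous is None:
            generation = 0
        elif previous[0] == fp:
            generation = previous[1]      # same property, same internal id
        else:
            generation = previous[1] + 1  # same partner id, different property
        self._seen[key] = (fp, generation)
        return f"{source}:{key[1]}:{generation}"

registry = IdRegistry()
first = registry.internal_id("partner_a", {"id": 7, "address": "1 High St", "property_type": "flat"})
reused = registry.internal_id("partner_a", {"id": 7, "address": "9 Low Rd", "property_type": "house"})
print(first, reused)
```

Namespacing by source also covers the case where two partners happen to use the same numbers.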

I’m ranting

Topics we haven’t yet even touched upon:

Character encodings

Geocoding / Parsing addresses

Image processing/classification at scale

Parsing free text descriptions

Deduplication

Too many other things to list here

Never trust, check everything, every single time

Tests, tests, tests, tests

Embrace UNIX philosophy of many small tools in a chain

Reuse rather than reinvent (but not always)

Technology helps manage the problem; it is not “the solution”.

Problems are almost always cultural, not technical

What have we learned?
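"Check everything" and "many small tools in a chain" combine naturally: each check is a tiny step that takes records in and passes good ones on. A minimal sketch with hypothetical step names and fields; a real pipeline would log the rejects rather than drop them silently.

```python
from functools import reduce

def require_fields(*names):
    """Build a step that drops records with missing or blank fields."""
    def step(records):
        for rec in records:
            if all(str(rec.get(n, "")).strip() for n in names):
                yield rec
    return step

def parse_price(records):
    """Convert price to a positive integer; drop records where that fails."""
    for rec in records:
        try:
            price = int(str(rec["price"]).replace(",", ""))
        except (KeyError, ValueError):
            continue
        if price > 0:
            yield {**rec, "price": price}

def chain(*steps):
    """Compose steps left to right, like a shell pipeline."""
    return lambda records: reduce(lambda acc, step: step(acc), steps, records)

pipeline = chain(require_fields("id", "price"), parse_price)

rows = [
    {"id": "1", "price": "250,000"},
    {"id": "2", "price": "call us"},
    {"id": "", "price": "100"},
    {"id": "4", "price": "0"},
]
print(list(pipeline(rows)))
```

Because each step is a small function with one job, each is also trivial to test on its own.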

Misaligned incentives

Technology laggards

Apathy

Ignorance

Why do they hate us?

Tricked you - there is of course no single perfect solution

Closest thing is dialog, ideally face to face.

People generally want to do the right thing; they need help to know why and how to do it.

One five-minute conversation is often more useful than five months of email

The solution

Unless you hate life, do NOT try to scrape real estate data

Re-read the line above.

Our API: http://nestoria.com/api

One more thing

http://nestoria.com and http://nestoria.com/api

http://devblog.nestoria.com - our dev blog

http://www.lokku.com - our parent company

http://opencagedata.com - all your geocoding are belong to us

Twitter: @nestoria, @lokku, @opencagedata, @freyfogle

Slides will be on http://slideshare.net/lokku later today

Learn more
