NLP @ HomeAway roundrockgeeks

Page 1: NLP @ HomeAway roundrockgeeks

NLP @ HomeAway

Page 2: NLP @ HomeAway roundrockgeeks

HomeAway

Key Facts

• 1,000,000+ global vacation rental listings
• 200,000,000+ vacation days / year
• Headquartered in Austin, TX
• ~190 countries, 22 languages
• Almost 2,000 employees worldwide

Page 3: NLP @ HomeAway roundrockgeeks
Page 4: NLP @ HomeAway roundrockgeeks

All those vacations … a lot of text

Reviews
• > 10,000,000

Property Descriptions
• > 1,000,000

Communications
• Real time between travelers and suppliers

We’ll look at Reviews and Property Descriptions

Page 5: NLP @ HomeAway roundrockgeeks

An NLP Pipeline

Corpus of Text → Tokenize → Stopword Filter? → Stem / Lemmatize → Vectorize → Vectorized Text and Metadata

Text assumed tagged with LANG and other metadata (e.g. geo codes, time stamps, etc.)

Page 6: NLP @ HomeAway roundrockgeeks

Tokenizing – finding “words”

There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, youll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride. Best. Vacation. Ever.

There, are, truly, inspiring, views, at, High, Point, Retreat, and, plenty, of, places, to, sit, and, enjoy, them, ., Take, a, load, off, in, one, of, the, many, rooms, with, views, of, the, ski, mountain, and, remember, how, lucky, you, are, to, live, like, this, ., Cozy, up, with, family, in, the, sunken, living, room, and, chat, for, hours, on, end, ., Sit, in, a, circle, of, tree, stumps, around, the, outdoor, fire, pit, and, roast, marshmallows, ., After, all, that, sitting, ,, youll, be, more, than, happy, to, walk, 250, yards, to, the, free, shuttle, to, get, the, blood, pumping, again, ., Then, ,, have, a, seat, and, enjoy, your, free, ride, ., Best, ., Vacation, ., Ever, .,
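A minimal sketch of a tokenizer that produces output shaped like the list above. The regular expression and the name `tokenize` are assumptions for illustration (ASCII letters only), not the tokenizer used in the talk.

```python
import re

# Hypothetical pattern: runs of ASCII letters (with optional internal
# apostrophes), runs of digits, or any single punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\d+|[^\w\s]")


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "After all that sitting, youll be more than happy to walk 250 yards."
tokens = tokenize(sample)
print(tokens)
```

Punctuation is kept as its own token here, which matches the slide; a later stage can drop it if it carries no signal.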

Page 7: NLP @ HomeAway roundrockgeeks

Stopword Filter – removing low-signal words

There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, youll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride. Best. Vacation. Ever.

There truly inspiring views High Point Retreat plenty places sit enjoy . Take load one many rooms views ski mountain remember lucky live like . Cozy family sunken living room chat hours end. Sit circle tree stumps around outdoor fire pit roast marshmallows. After sitting, youll happy walk 250 yards free shuttle get blood pumping . Then, seat enjoy free ride. Best. Vacation. Ever.
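Stopword filtering can be sketched as a set lookup. The stopword list below is a tiny illustrative assumption, not the list used in the talk; real lists are language-specific and much longer.

```python
# Hypothetical, deliberately small stopword set for illustration only.
STOPWORDS = {"a", "and", "are", "at", "in", "of", "off", "the", "them", "to", "with"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form appears in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = ["Take", "a", "load", "off", "in", "one", "of", "the", "many", "rooms"]
filtered = remove_stopwords(tokens)
print(filtered)
```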

Page 8: NLP @ HomeAway roundrockgeeks

Stemming – crude chopping of inflectional ends

There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, youll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride. Best. Vacation. Ever.

There are truli inspir view at High Point Retreat and plenti of place to sit and enjoy them . Take a load off in one of the mani room with view of the ski mountain and rememb how lucki you are to live like this . Cozi up with famili in the sunken live room and chat for hour on end . Sit in a circl of tree stump around the outdoor fire pit and roast marshmallow . After all that sit , youll be more than happi to walk 250 yard to the free shuttl to get the blood pump again . Then , have a seat and enjoy your free ride . Best . Vacat . Ever
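The stems above (truli, inspir, vacat) look like the output of a Porter-style stemmer. The sketch below is not that algorithm; it is a much cruder suffix stripper, with an assumed suffix list and minimum stem length, meant only to show the "chop the ending" idea.

```python
# Hypothetical suffix list, checked in order; not the Porter rule set.
SUFFIXES = ("ing", "ed", "ly", "s")


def crude_stem(word, min_stem=4):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["inspiring", "views", "marshmallows", "this"]:
    print(w, "->", crude_stem(w))
```

The minimum stem length keeps short words such as "this" from being mangled, at the cost of leaving some inflected short words untouched.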

Page 9: NLP @ HomeAway roundrockgeeks

Lemmatizing – morphological grouping of inflectional endings

There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, youll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride. Best. Vacation. Ever.

there, be, truly, inspiring, view, at, high, point, retreat, and, plenty, of, place, to, sit, and, enjoy, they, ., take, a, load, off, in, one, of, the, many, room, with, view, of, the, ski, mountain, and, remember, how, lucky, you, be, to, live, like, this, ., cozy, up, with, family, in, the, sunken, living, room, and, chat, for, hour, on, end, ., Sit, in, a, circle, of, tree, stump, around, the, outdoor, fire, pit, and, roast, marshmallow, ., after, all, that, sit, ,, youll, be, more, than, happy, to, walk, 250, yard, to, the, free, shuttle, to, get, the, blood, pump, again, ., then, ,, have, a, seat, and, enjoy, you, free, ride, ., best, ., vacation, ., ever, .
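Unlike stemming, lemmatizing maps each word to a dictionary form. A real lemmatizer uses a full lexicon and part-of-speech information; the sketch below assumes a hand-written lookup table (`LEMMAS` is hypothetical) just to show the shape of the operation.

```python
# Hypothetical mini lexicon: inflected form -> lemma.
LEMMAS = {
    "are": "be",
    "is": "be",
    "views": "view",
    "places": "place",
    "them": "they",
    "rooms": "room",
    "hours": "hour",
    "sitting": "sit",
    "pumping": "pump",
}


def lemmatize(tokens, lemmas=LEMMAS):
    """Lowercase each token and replace it with its lemma when one is known."""
    result = []
    for token in tokens:
        lower = token.lower()
        result.append(lemmas.get(lower, lower))
    return result


print(lemmatize(["There", "are", "truly", "inspiring", "views"]))
```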

Page 10: NLP @ HomeAway roundrockgeeks

Vectorizing – turning words into numbers

We’re going to look at two types of vectorizing:
• tf-idf
• Topic Modeling

Page 11: NLP @ HomeAway roundrockgeeks

tf-idf

Term Frequency (tf) multiplied by Inverse Document Frequency (idf)

• tf: a count of how many times a term is used in a document
  – Measures how important a term is to a document
• idf: an inverse measure of how many documents in the corpus contain a term (typically the log of total documents over documents containing the term)
  – Adjusts for very frequent words (statistical stop words)
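A minimal sketch of the tf-idf weighting, assuming raw counts for tf and log(N / df) for idf. Libraries differ in smoothing and normalization, so treat this as one common variant rather than the exact formula used in the talk.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


docs = [
    ["beach", "house", "beach"],
    ["house", "pool"],
    ["house", "mountain", "view"],
]
weights = tf_idf(docs)
print(weights[0])
```

Here "house" appears in every document, so its idf is log(1) = 0 and it drops out of every vector, which is the "statistical stop word" effect.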

Page 12: NLP @ HomeAway roundrockgeeks

Clustering Reviews

Preparation
• Stopword removal
• Stemming
• Document vectors of tf-idf weighted terms

Cluster
• Cosine distance between doc vectors
… and then color by review rating
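Cosine distance between two sparse document vectors can be sketched as follows. The dict-of-weights representation is an assumption; the talk does not say how its vectors were stored.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


review_a = {"dirty": 1.2, "oven": 0.8}
review_b = {"dirty": 2.4, "oven": 1.6}
review_c = {"view": 1.0, "beach": 0.5}
print(cosine_distance(review_a, review_b))
print(cosine_distance(review_a, review_c))
```

Because cosine ignores vector length, a long review and a short review that use the same vocabulary in the same proportions come out close together.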

Page 13: NLP @ HomeAway roundrockgeeks

What about this review?

Page 14: NLP @ HomeAway roundrockgeeks

That outlier...

The house situation is excellent, close to all facilities, restaurants, groceries, beach, stores, etc. The pool, the patio furniture, the deck, the beach chairs and the towels are very good for bathing and dining outside, The house offers enough space. We were disapointed by the old tv sets; the bathrooms need to be refreshed as well as the cupboard in the kitchen and the laundry room. We were expecting more. We already rented two other houses with HomeAway before of better quality. The other couple also rent something cleaner and nicer for a better price. The cleaning must have been done more metiscusly. The oven was very dirty. We found that kitchen pot and pans were chipped and old. There are many old stuff under the cupboard. The toaster heats properly only on one side. The BBQ grill was rusty; all the protection was gone on half the surface. We had problems twice with the internet. The manager/owner came once to try (without success) to repair the leaking sink. The bath was very slow to drain; a plumber came one morning and waited half an hour for the owner who never showed up, so no repair were done. The small carpets in the bathrooms were old, dirty and disgutting. In the yard, close to the pool, there were old mops, brooms, plastic plants that should all be sent to garbage. It's more a 3.5* than a 4*. There is a real potential for this house but now it seems a bit neglected. If you haven't seen other places, you don't know; the four of us can compare and we were all disapointed this time.

Page 15: NLP @ HomeAway roundrockgeeks

Negative reviews

Collocations
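Collocations are word sequences that co-occur more often than chance would predict. One common way to score two-word candidates is pointwise mutual information (PMI); the sketch below assumes that scoring and a hypothetical `min_count` cutoff, since the slides do not say which measure was used.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_count=2):
    """Return (bigram, PMI) pairs for bigrams seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scored.append(((w1, w2), math.log(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


text = "pots and pans were old pots and pans were dirty the oven was dirty"
ranked = score_bigrams(text.split())
for bigram, pmi in ranked:
    print(bigram, round(pmi, 3))
```

This treats the text as one token stream, so pairs that straddle sentence boundaries are counted too; splitting on sentences first would avoid that.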

Page 16: NLP @ HomeAway roundrockgeeks

Traveler’s Hierarchy of Needs

• Glass of Wine
• Hustle and Bustle
• Within Walking Distance
• Open Floor Plan
• Labor Day Weekend
• Visitor Recently Left
• Bring Your Own
• Washer and Dryer
• Pots and Pans

Sort of like Maslow’s

Page 17: NLP @ HomeAway roundrockgeeks

HomeAway … said there is a 10% to 15% overlap in HomeAway’s and Airbnb’s listings.
Wall Street Journal, Jan 18, 2016

On to Property Descriptions

We have > 1,000,000 descriptions in many languages

• Fraud Detection

• Competitive Intelligence

Page 18: NLP @ HomeAway roundrockgeeks

COIN data provenance

Page 19: NLP @ HomeAway roundrockgeeks

Breckenridge (Blue is HomeAway)

Page 20: NLP @ HomeAway roundrockgeeks

Breckenridge Zoomed In

How do these two properties relate?

Page 21: NLP @ HomeAway roundrockgeeks

Trick Question! Same Property!

Page 22: NLP @ HomeAway roundrockgeeks

The property descriptions

HomeAway
The Other Guys

Page 23: NLP @ HomeAway roundrockgeeks

Why did we use descriptions?

• Geolocation good for “within 5000 meters”
• Image detection can be slow
• Similar descriptions seemed probable
  Consistent owner branding, easy to replicate
• Tech team wanted to use natural language processing
• Didn’t know if this would work when we began

Page 24: NLP @ HomeAway roundrockgeeks

How

• Draw Geo Bounding Box
• Filter on metadata
  Bedrooms, bathrooms, &c.
• Compare text
• Lather, rinse, repeat
• Select a duplicate, if any
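The steps above can be sketched as a filter-then-compare loop. Everything named here is hypothetical: the `Listing` fields, the bounding box measured in degrees rather than meters, the thresholds, and Jaccard overlap standing in for the text comparison (the talk used tf-idf vectors with cosine distance).

```python
from dataclasses import dataclass


@dataclass
class Listing:
    listing_id: str
    lat: float
    lon: float
    bedrooms: int
    bathrooms: int
    tokens: frozenset  # preprocessed description tokens


def in_box(candidate, target, half_deg):
    """True if candidate lies in a square box (in degrees) around target."""
    return (abs(candidate.lat - target.lat) <= half_deg
            and abs(candidate.lon - target.lon) <= half_deg)


def jaccard(a, b):
    """Token-set overlap; a simple stand-in for tf-idf cosine similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_duplicate(target, candidates, half_deg=0.05, min_similarity=0.8):
    """Return the most similar candidate passing all filters, or None."""
    best, best_sim = None, min_similarity
    for cand in candidates:
        if not in_box(cand, target, half_deg):
            continue  # outside the geo bounding box
        if (cand.bedrooms, cand.bathrooms) != (target.bedrooms, target.bathrooms):
            continue  # metadata mismatch
        sim = jaccard(target.tokens, cand.tokens)
        if sim >= best_sim:
            best, best_sim = cand, sim
    return best


words = frozenset("cozy ski condo near free shuttle".split())
ours = Listing("a", 39.480, -106.040, 3, 2, words)
near = Listing("b", 39.481, -106.041, 3, 2, words)
far = Listing("c", 40.500, -106.040, 3, 2, words)
match = best_duplicate(ours, [near, far])
print(match.listing_id if match else None)
```

Running the cheap geo and metadata filters first keeps the number of expensive text comparisons small.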

Page 25: NLP @ HomeAway roundrockgeeks

How, continued

Most similar property description

Page 26: NLP @ HomeAway roundrockgeeks

Methodology concerns

TF-IDF vectors, cosine distance work for duplicates and fraud, but

A little slow
Many vectors, many dimensions
Vocab size limited to 4500 tokens -> ~4500 dimensions
Millions of vectors

Page 27: NLP @ HomeAway roundrockgeeks

Cluster computing, better math to the rescue!

Spark Clusters (Scala)
Topic Modeling (LDA)

Not sure if it will work for duplication
Cosine, Jensen-Shannon, or Hellinger distances?
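LDA represents each document as a probability distribution over topics, so distances designed for distributions become candidates alongside cosine. A minimal sketch of two of them, assuming both inputs are equal-length lists that each sum to 1. Jensen-Shannon distance is taken here as the square root of the base-2 Jensen-Shannon divergence.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon distance (sqrt of the divergence), in [0, 1] with log base 2."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return math.sqrt((kl_divergence(p, m) + kl_divergence(q, m)) / 2.0)


doc_a = [0.7, 0.2, 0.1]  # hypothetical topic mixtures
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.0, 0.1, 0.9]
print(hellinger(doc_a, doc_b), hellinger(doc_a, doc_c))
print(jensen_shannon(doc_a, doc_b), jensen_shannon(doc_a, doc_c))
```

Both are symmetric and bounded, which makes thresholds easier to reason about than raw KL divergence.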

Page 28: NLP @ HomeAway roundrockgeeks

Topic Modeling, quickly

In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

(Wikipedia)

“Pets”: Cat, Dog, Fish, Turtle, Hamster

“Demonic Invasion”: Cat, Dog, Mass, Hysteria, Sleeping, Together

“Weather”: Cat, Dog, Cold, Rain, Hot, Temperature

Page 29: NLP @ HomeAway roundrockgeeks

LDA Current results

Finding number of topics (curvature)

220 topics, ~600K (en_US) descriptions

Page 30: NLP @ HomeAway roundrockgeeks

LDA, continued

Page 31: NLP @ HomeAway roundrockgeeks

LDA Future

• Duplicates?
• Fraud Detection?
• Property topics in the “Vacation Rental” Space?
  - Marketing, SEO, UX

TOPIC 5: Neighborhood, Lovely, Backyard, Quiet, Residential

TOPIC 17: Beach, Pier, Boardwalk, Isle, Crescent

Page 32: NLP @ HomeAway roundrockgeeks

Logos

LingPipe

CoreNLP

Page 33: NLP @ HomeAway roundrockgeeks

Questions?

Brent Schneeman
Director, Data Science
HomeAway, [email protected]
@schnee