Twitter content-based Recommendation System
Barcelona Tourist City Monitor & Insights
01.07.2016
#MACHINELEARNING #SPARK #KAFKA #CASSANDRA
Juan Pablo López, Rodica Fazakas, Yulia Zvyagelskaya, Beatriz Martín
BIG DATA MANAGEMENT AND ANALYTICS POSTGRADUATE COURSE - FINAL PROJECT

Twitter + Lambda Architecture (Spark, Kafka, Flume, Cassandra) + Machine Learning



Page 2

The Challenge

Build a content-based recommendation system that provides real-time personalized recommendations to social media users, together with insight visualization for the tourism and smart-city sectors.

The product is addressed to:

● Small and medium-sized companies connected to the tourism sector, with both B2B and B2C models (leisure/travel, tour operators, online tourist portals, retail, HoReCa, etc.)
● City and neighborhood public departments and administrations
● Event agencies and managers
● Advertising and marketing agencies

Page 3

The Challenge

Businesses continue to invest budget in targeted social media advertising.

Page 4

The Challenge

Aims of the project:

● Twitter data collection and management
● Tourists vs. residents classification
● Topic (user interest) modeling
● Recommendation system implementation
● Real-time streaming statistics calculation
● Predictive model application for streaming

Page 5

The Challenge

Main tasks of the project:

● Design and implement an architecture that scales with, and can measure, high-volume data traffic
● Respond to requests in real time
● Use advanced supervised and unsupervised ML techniques
● Extract valuable, relevant information (insights) from the managed data to deliver tangible business results to customers
● Provide user-friendly visualization and presentation of the extracted information

Page 6

Page 7

Data

Page 8

Data Source

Languages: English, French, Russian

● Tweets geolocated in Barcelona, bounding box [41.34,2.03,41.45,2.25]
● Tweets with Barcelona keywords (KW)

Keywords (figure labels, partly truncated): Barcelona, Sagrada faM, WC
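The two collection filters on this slide (bounding box and language) can be sketched as below. This is only an illustration: the slide does not state the coordinate order, so reading the list as [lat_min, lon_min, lat_max, lon_max] is an assumption, as are the ISO 639-1 language codes and the names `BoundingBox` and `keep_tweet`.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Rectangle in degrees. The field order is an assumption about the slide's list."""
    lat_min: float
    lon_min: float
    lat_max: float
    lon_max: float

    def contains(self, lat: float, lon: float) -> bool:
        """True when the point lies inside the rectangle (borders included)."""
        return (self.lat_min <= lat <= self.lat_max
                and self.lon_min <= lon <= self.lon_max)


# Values copied from the slide, read as [lat_min, lon_min, lat_max, lon_max].
BARCELONA = BoundingBox(41.34, 2.03, 41.45, 2.25)

# English, French, Russian; the two-letter codes are an assumed encoding.
LANGUAGES = {"en", "fr", "ru"}


def keep_tweet(lat: float, lon: float, lang: str) -> bool:
    """Keep a geotagged tweet only if it is inside the box and in a kept language."""
    return lang in LANGUAGES and BARCELONA.contains(lat, lon)
```

A point near the city centre in English passes both filters, while the same point with a Spanish language code, or any point in Madrid, is dropped.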

Page 9

Data Source (amount of data)

Tweets geolocated in Barcelona [41.34,2.03,41.45,2.25]:
● All languages: 20,000 tweets/day
● Only EN, FR, RU: 7,000 tweets/day

Tweets with Barcelona keywords:
● All languages: 250,000 tweets/day
● Only EN, FR, RU: 80,000 tweets/day

Page 10

Data Management

Page 11

Cluster topology

Page 12

Architecture

Page 13

Architecture

Page 14

Data Collect Layer

Page 15

Data Collect Layer

CollectProcess

Page 16

Data Collect Layer: Apache Kafka

● Distributed publish-subscribe messaging service
● Fault-tolerant
● Decoupling, simplicity, efficiency
● Fast

Topics: twittergeobcn, twitterkwbcn, rtstats, rtpredictions

Page 17

Data Collect Layer

CollectProcess

Topics: twittergeobcn, twitterkwbcn

Page 18

Data Collect Layer

Page 19

Data Collection: Apache Flume

Page 20

Processing Analytics Layer

Page 21

Processing Analytics Layer

Page 22

Batch Processing: Pre Process

Pipeline (diagram labels): Collect, PreProcess

● Read geolocated tweets stored in HDFS
● Clean tweet text (lowercase, numbers, spaces, tabs, etc.)
● Categorize users (tourist, resident) by comparing the geolocation of their last 200 tweets
● Save in Cassandra for ML processes
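The user categorization step could look roughly like the sketch below. The slide only says that the geolocation of the last 200 tweets is compared; the decision rule used here (share of those tweets inside the Barcelona bounding box, with a 0.5 cut-off) is a hypothetical stand-in, and the coordinate order of the box is assumed to be [lat_min, lon_min, lat_max, lon_max].

```python
# Bounding box from the Data Source slide; the coordinate order is an assumption.
BBOX = (41.34, 2.03, 41.45, 2.25)
HISTORY_SIZE = 200      # "last 200 tweets", from the slide
RESIDENT_SHARE = 0.5    # hypothetical threshold, not stated in the deck


def in_barcelona(lat, lon, bbox=BBOX):
    """True when the point lies inside the bounding box."""
    lat_min, lon_min, lat_max, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


def categorize_user(coords, resident_share=RESIDENT_SHARE):
    """Label a user from the (lat, lon) pairs of their tweets, newest first.

    Only the most recent HISTORY_SIZE pairs are considered. Users without
    any geotagged history are reported as "unknown".
    """
    recent = list(coords)[:HISTORY_SIZE]
    if not recent:
        return "unknown"
    inside = sum(1 for lat, lon in recent if in_barcelona(lat, lon))
    return "resident" if inside / len(recent) >= resident_share else "tourist"
```

Under this rule a user whose recent history is mostly outside the box is treated as a tourist, even if the tweet that brought them into the dataset was posted in Barcelona.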

Page 23

Batch Processing: Topic Modeling Process

Pipeline (diagram labels): Collect, TPProcess

Page 24

Batch Processing: SVM Process

Pipeline (diagram labels): Collect, SVMProcess, Model

Page 25

Streaming Process

Diagram labels:

● Collect
● topic: twittergeobcn
● StatsProcess, topic: rtstats
● PredictProcess, Model, topic: rtpredictions
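The deck does not say which statistics StatsProcess computes or how it is wired to the topics. As a minimal sketch, assuming it consumes tweets from twittergeobcn and publishes per-window counts to rtstats, the core aggregation could be a tumbling-window count per language. The record fields (`timestamp`, `lang`), the 60-second window and the function name are all hypothetical, and plain Python lists stand in for the Kafka topics.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # hypothetical window length


def window_stats(tweets, window_seconds=WINDOW_SECONDS):
    """Count tweets per language in fixed, non-overlapping time windows.

    tweets: iterable of dicts with 'timestamp' (epoch seconds) and 'lang'.
    Returns one record per window, ordered by window start.
    """
    windows = defaultdict(Counter)
    for tweet in tweets:
        # Align the timestamp to the start of its window.
        start = int(tweet["timestamp"]) // window_seconds * window_seconds
        windows[start][tweet["lang"]] += 1
    return [
        {"window_start": start, "counts": dict(counts)}
        for start, counts in sorted(windows.items())
    ]
```

In a streaming engine the same grouping would be expressed with the engine's own windowing operator; the sketch only shows the arithmetic of assigning events to windows.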

Page 26

API Layer

Page 27

API Layer

REST API

Page 28

Dashboard HTML

Page 29

Data Analytics

Page 30

Data Analytics

Tasks:

● Implement a tourists vs. residents detection algorithm for geotagged data
● Classify tourists vs. residents in non-geotagged data with supervised machine learning
● Model topics (user interests) with unsupervised machine learning
● Build the recommendation system
● Calculate statistics
● Visualization

Page 31

Text Preprocessing

● Remove URLs
● Remove @ sign tags from the data
● Remove any number characters, e.g. 1 or 3.14 (removeNumbers)
● Remove any punctuation characters (removePunctuation)
● Convert all text to lower case (tolower)
● Include only words that have a minimum character length of 3
● Remove certain stop words from the data
● Reduce words to their 'stems', e.g. 'walk' is the stem of 'walking' and 'walked' (stemming)
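The preprocessing steps listed on this slide can be sketched in a few lines of standard-library Python. This is not the project's code: the stop-word list is a tiny illustrative sample, and `crude_stem` is a toy suffix stripper standing in for a real stemmer (the stems shown later in the deck, such as "peopl" and "plai", suggest a Porter-style stemmer, which is not reproduced here).

```python
import re
import string

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "and", "for", "with", "this", "that", "are", "was"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def crude_stem(word):
    """Toy suffix stripper; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply the slide's steps in order and return the list of stems."""
    text = URL_RE.sub(" ", text)          # remove URLs
    text = MENTION_RE.sub(" ", text)      # remove @ tags
    text = NUMBER_RE.sub(" ", text)       # remove numbers such as 1 or 3.14
    text = text.translate(PUNCT_TABLE)    # remove punctuation
    tokens = text.lower().split()         # lower case, then tokenize
    tokens = [t for t in tokens if len(t) >= 3]          # minimum length 3
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [crude_stem(t) for t in tokens]
```

Mentions and URLs are removed before punctuation is stripped, because stripping "@" and "/" first would leave the user name and URL fragments behind as ordinary words.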

Page 32

SVM: tourists vs. residents classification

Challenge: less than 1% of the data is geotagged, so Twitter users must be classified as tourists or residents by other means before further insights and topics of interest can be extracted.

Aim: build a predictive model that classifies non-geotagged tweet texts to distinguish tourists from residents.

Page 33

SVM: tourists vs. residents classification

Dataset: labeled collection of tweet texts (only from Barcelona), with the text as the independent (predictor) variable and the label (TRUE for tourist, FALSE for resident) as the dependent (target) variable.

Validation protocol:

● Training set (60% of the original dataset) to build the prediction algorithms
● Cross-validation set (20%) to compare performance and choose the best algorithm
● Test set (20%) to apply the best prediction algorithm and estimate its performance on unseen data
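A 60/20/20 split like the one in the validation protocol can be sketched as follows. The shuffle, the fixed seed and the function name are illustrative choices, not details taken from the deck.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and cut them into training, validation and test parts.

    The test part receives whatever remains after the first two cuts,
    so the three parts always cover the whole dataset exactly once.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_val],
        rows[n_train + n_val:],
    )
```

Candidate models are fitted on the first part and compared on the second; the third part is touched once, with the chosen model, to estimate performance on unseen data.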

Page 34

SVM: tourists vs. residents classification

Prototyping:

● Naive Bayes
● Logistic Regression (Maxent)
● k-NN
● SVM

Page 35

SVM: tourists vs. residents classification

Reasons why SVMs perform well for text categorization

SVMs:

● Acknowledge the particular properties of text: high-dimensional feature spaces, few irrelevant features (dense concept vector), and sparse instance vectors
● Outperform other techniques substantially and significantly
● Eliminate the need for feature selection, making text categorization considerably easier
● Are robust and do not require much parameter tuning
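To make the idea concrete, here is a minimal linear SVM over sparse bag-of-words vectors, trained by subgradient descent on the L2-regularised hinge loss. It is a teaching sketch under stated assumptions, not the project's model: the deck does not name the implementation it used, the bias term is omitted for brevity, the hyperparameters are arbitrary, and the toy tweets in `TRAIN` are invented.

```python
from collections import Counter


def featurize(tokens):
    """Sparse bag-of-words vector: token -> count."""
    return Counter(tokens)


def score(weights, features):
    """Dot product between the weight vector and a sparse feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())


def train_linear_svm(samples, epochs=50, lam=0.01, lr=0.1):
    """Fit weights on (tokens, is_tourist) pairs; no bias term, for brevity."""
    weights = {}
    data = [(featurize(t), 1.0 if label else -1.0) for t, label in samples]
    for _ in range(epochs):
        for x, y in data:
            margin = y * score(weights, x)
            # Regularisation step: shrink every weight slightly.
            for f in weights:
                weights[f] *= 1.0 - lr * lam
            # Hinge-loss step: only samples inside the margin push the weights.
            if margin < 1.0:
                for f, v in x.items():
                    weights[f] = weights.get(f, 0.0) + lr * y * v
    return weights


def predict_is_tourist(weights, tokens):
    """True when the tweet falls on the tourist side of the hyperplane."""
    return score(weights, featurize(tokens)) > 0.0


# Invented toy data, already preprocessed into tokens.
TRAIN = [
    (["hotel", "tour", "sagrada"], True),
    (["visit", "hotel", "beach"], True),
    (["tour", "museum", "holiday"], True),
    (["work", "office", "metro"], False),
    (["home", "work", "traffic"], False),
    (["office", "meeting", "home"], False),
]
```

Storing only non-zero counts keeps each instance vector small even when the vocabulary is large, which matches the "sparse instance vectors" property listed on the slide.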

Page 36

Topic Modeling

We use topic modeling to automatically detect topics of interest to Twitter users previously detected as tourists.

● Uncover the hidden topical structure in tweets
● Assign topics to users
● Use these assignments to make targeted recommendations

Page 37

Topic Modeling

Dataset:

● Geolocated tweets from Barcelona, aggregated per identified tourist

Algorithm: baseline Latent Dirichlet Allocation (LDA)

● Unsupervised learning technique
● Extracts key topics; each topic is an ordered list of representative words
● Describes each document in the corpus by its allocation to the extracted topics
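The two outputs described above (ordered word lists per topic, and a topic allocation per document) can be illustrated with a small collapsed Gibbs sampler for LDA. This is a didactic sketch only: the deck does not say which LDA implementation or inference method the project used, and the function name, hyperparameters and iteration count are assumptions.

```python
import random


def lda_gibbs(docs, n_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (top_words, doc_mix): up to five top words per topic, and
    the smoothed topic proportions of every document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # topic counts per doc
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                        # tokens per topic
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + len(vocab) * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[k][w])[:5]
        for k in range(n_topics)
    ]
    doc_mix = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return top_words, doc_mix
```

Because the tweets are aggregated per tourist, each "document" here corresponds to one user, so a row of `doc_mix` can be read as that user's interest profile.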

Page 38

Topic Modeling: LDA Topics

Topic 0  Topic 1   Topic 2    Topic 3     Topic 4
direct   love      primavera  humid       photo
work     peopl     sound      wind        love
lip      happi     festiv     cloud       beauti
june     life      drink      temperatur  hotel
book     birthdai  night      finish      camp
market   hope      plai       summer      centr
design   girl      live       sant        view
chang    game      stage      block       beach

Page 39

Recommendation System

user_id  topic      word    recommendation
6448     sports     game    Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
7296     festivals  festiv  Festival el Grec, Sonar
1239     sports     plai    Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
2980     shopping   market  Boqueria, La Roca Village, Portal del Angel
3501     nature     beach   Font Magica, Park Guell, Playa de la Barceloneta
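The table suggests a two-step lookup: a user's representative stemmed word maps to a named topic, and the topic maps to a list of places. The sketch below encodes exactly the rows shown on the slide. How the word is selected for each user is not stated in the deck, so passing it in as an argument is an assumption, as are the dictionary and function names.

```python
# Word -> topic pairs taken from the rows of the slide's table.
WORD_TO_TOPIC = {
    "game": "sports",
    "plai": "sports",
    "festiv": "festivals",
    "market": "shopping",
    "beach": "nature",
}

# Topic -> recommended places, as listed on the slide.
TOPIC_TO_PLACES = {
    "sports": ["Bowling Pedralbes", "Camp Nou", "Museu del FC Barcelona"],
    "festivals": ["Festival el Grec", "Sonar"],
    "shopping": ["Boqueria", "La Roca Village", "Portal del Angel"],
    "nature": ["Font Magica", "Park Guell", "Playa de la Barceloneta"],
}


def recommend(word):
    """Return (topic, places) for a stemmed word, or (None, []) if unknown."""
    topic = WORD_TO_TOPIC.get(word)
    if topic is None:
        return None, []
    return topic, TOPIC_TO_PLACES[topic]
```

Keeping the word-to-topic and topic-to-places mappings separate means new stems can be attached to an existing topic without repeating its list of places.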

Page 40

DEMO

Page 41

Thank you!