Data Mining for Moderation of Social Data

Fernando G. Guerrero CEO SolidQ fguerrero@solidq.com

3 © 2011 SolidQ

Introductions

• Fernando G. Guerrero
• Global CEO of SolidQ
• fguerrero@solidq.com
• Microsoft Regional Director for Spain since 2004
• SQL Server MVP from 2000 to 2007
• Usual suspect at many international conferences

SolidQ 2012… 10th anniversary

• 160 people in 23 countries:
  • Argentina, Australia, Austria, Bulgaria, Canada, Chile, Costa Rica, Croatia, Denmark, France, Germany, India, Israel, Italy, Mexico, Saudi Arabia, Serbia, Slovakia, Slovenia, Spain, Sweden, UK, USA
• 50 current or former RDs or MVPs
• Authors of many books, articles, and whitepapers
• Research collaboration with:
  • Universidad de Alicante
  • Universidad de les Illes Balears
  • Universidad de Santiago de Compostela
  • The European Union
  • The Spanish Ministry of Economy and Innovation

Agenda

• Social Data
• Market Research
• Sentiment Analysis, Text Mining
• Moderation, Data Mining
• SolidQ Research Lines in Social Data

Social data is everywhere

Social data is about everything

Music

Social is there

• Is your organization promoting social content about itself?
  • Products
  • Services
  • Stories

Social is there, reputation

• What is social saying about you?
  • Product
  • Services
  • Decisions
  • Image

Market Research

• What is social requesting from you?
  • Future services
  • Product updates
• Can you ask social questions?
  • Is this service going to succeed?
  • How can I fix the current problem?
  • Is society ready for this law?

Sentiment Analysis, Text Mining

• The movie stars Mr. X [ Factual ]
• The movie was fabulous! [ Sentimental ]
• The movie was horrible! [ Sentimental ]
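The factual/sentimental split in the three example sentences can be illustrated with a very small opinion lexicon. The sketch below is only an illustration: the word lists and the function name `classify_sentence` are assumptions and are not taken from the system described in this talk.

```python
import re

# Hypothetical, tiny opinion lexicon; a real system would use a much larger one.
POSITIVE = {"fabulous", "great", "wonderful"}
NEGATIVE = {"horrible", "awful", "terrible"}


def classify_sentence(sentence: str) -> str:
    """Label a sentence as sentimental (positive/negative) or factual.

    Simplification: if both positive and negative words occur,
    the positive label wins.
    """
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    if words & POSITIVE:
        return "Sentimental (positive)"
    if words & NEGATIVE:
        return "Sentimental (negative)"
    return "Factual"


for s in ["The movie stars Mr. X", "The movie was fabulous!", "The movie was horrible!"]:
    print(f"{s} -> {classify_sentence(s)}")
```

A lexicon lookup like this cannot handle negation or sarcasm; it only shows where the boundary between a factual and a sentimental statement lies.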

What is Data Mining?

• Informs actionable business decisions
• Contrasts with “machine learning”

Media Case Study

• Millions of posts per year (different moderation scenarios)
• About 25% are human moderated
• About 10% of the moderated posts fail
• No Business Intelligence applications for analysis or reporting

Moderation, Data Mining

• Contextual information
  • Time
  • Location
  • User
• Comments posted at 10 AM are safer than those posted at 2 AM.
• A user may be safe talking about science but dangerous talking about sports.
• If a thread is hot (dangerous), a comment in it may be hot too.
• By combining context patterns, the system assigns risk to posts without going into the text.
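One way to picture how context patterns combine into a risk score is the minimal sketch below. Everything in it is an assumption made for illustration: the fail rates are invented, the names (`PostContext`, `context_risk`) are hypothetical, and the noisy-OR combination rule is just one simple choice, not necessarily what the production system uses.

```python
from dataclasses import dataclass

# Hypothetical historical fail rates (fraction of moderated posts that failed).
FAIL_RATE_BY_HOUR = {10: 0.02, 2: 0.15}
FAIL_RATE_BY_USER_TOPIC = {("alice", "science"): 0.01, ("alice", "sports"): 0.30}
FAIL_RATE_BY_THREAD = {"calm-thread": 0.03, "hot-thread": 0.40}
DEFAULT_RATE = 0.10  # assumed fallback when no history exists


@dataclass
class PostContext:
    user: str
    topic: str
    thread: str
    hour: int  # hour of day, 0-23


def context_risk(ctx: PostContext) -> float:
    """Combine per-signal fail rates with a noisy-OR; the post text is never read."""
    rates = [
        FAIL_RATE_BY_HOUR.get(ctx.hour, DEFAULT_RATE),
        FAIL_RATE_BY_USER_TOPIC.get((ctx.user, ctx.topic), DEFAULT_RATE),
        FAIL_RATE_BY_THREAD.get(ctx.thread, DEFAULT_RATE),
    ]
    safe = 1.0
    for rate in rates:
        safe *= 1.0 - rate  # probability that no signal "fires"
    return 1.0 - safe


safe_post = PostContext("alice", "science", "calm-thread", hour=10)
risky_post = PostContext("alice", "sports", "hot-thread", hour=2)
print(round(context_risk(safe_post), 3), round(context_risk(risky_post), 3))
```

The same user gets a low score for a morning science comment and a high score for a 2 AM comment in a hot sports thread, which is the behavior the bullets describe.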

Solution – Logical Model

• Post Context (behavior analysis)
  • Patterns, data mining
• Post Content (text analysis)
  • Profanity, low score sentences, text mining, mood or tone (sentiment analysis)

Typically Available Data on Posts

• Historical and real-time data for:
  • User (e.g. userid, email, nationalid)
  • Location (e.g. Life & Style Fashion)
  • Time (e.g. 12 March 2011 18:56)
  • Content (e.g. text, link, picture, video)
  • Moderation result
• Other attributes, such as geography, age, and education, could be used

Post context, Patterns, Data Mining

• User behavior
• Time behavior
• Location behavior

Building useful attributes

1.- Thread (% Fails in a certain thread)
2.- User (% Fails per User)
3.- Diff Hour Forum Created (TimeDatePosted - TimeForumCreated)
4.- User Forum (% Fails in a certain forum)
5.- Diff Last for User (TimeDatePosted - TimeLastFailUser)
6.- Hour of the day
7.- Diff hour UserJoined-Now (TimeDatePosted - TimeUserJoined)
8.- User Thread (% Fails per User in a thread)
9.- Diff Hour Thread Created (TimeDatePosted - TimeThreadCreated)
10.- Day of Week

• More than 100 attributes.
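A few of the attributes listed above can be derived from a moderation history as in the following sketch. The history rows, the `THREAD_CREATED` lookup, and the function names are hypothetical; only the attribute definitions (fail percentages and time differences) come from the list.

```python
from datetime import datetime

# Hypothetical moderation history: (user, thread, posted_at, failed)
HISTORY = [
    ("u1", "t1", datetime(2011, 3, 12, 9, 0), False),
    ("u1", "t1", datetime(2011, 3, 12, 10, 0), True),
    ("u2", "t1", datetime(2011, 3, 12, 11, 0), False),
    ("u1", "t2", datetime(2011, 3, 12, 12, 0), False),
]
THREAD_CREATED = {
    "t1": datetime(2011, 3, 12, 8, 0),
    "t2": datetime(2011, 3, 12, 11, 30),
}


def fail_pct(rows):
    """Percentage of failed posts in rows; 0.0 when there is no history."""
    rows = list(rows)
    return 100.0 * sum(r[3] for r in rows) / len(rows) if rows else 0.0


def hours_between(later, earlier):
    return (later - earlier).total_seconds() / 3600


def build_attributes(user, thread, posted_at):
    past = [r for r in HISTORY if r[2] < posted_at]  # use only earlier posts
    user_fail_times = [r[2] for r in past if r[0] == user and r[3]]
    return {
        "thread_fail_pct": fail_pct(r for r in past if r[1] == thread),
        "user_fail_pct": fail_pct(r for r in past if r[0] == user),
        "user_thread_fail_pct": fail_pct(
            r for r in past if r[0] == user and r[1] == thread
        ),
        "hours_since_thread_created": hours_between(posted_at, THREAD_CREATED[thread]),
        "hours_since_user_last_fail": (
            hours_between(posted_at, max(user_fail_times)) if user_fail_times else None
        ),
        "hour_of_day": posted_at.hour,
        "day_of_week": posted_at.weekday(),  # 0 = Monday ... 6 = Sunday
    }


attrs = build_attributes("u1", "t1", datetime(2011, 3, 12, 18, 56))
print(attrs)
```

Restricting each calculation to posts made before the new post keeps the attributes usable in real time, since nothing from the future leaks into the score.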

Hard Work

• Periods
• Algorithms
• Algorithms' parameters
• Model refreshing
• Attribute analysis
• Outliers
• Overpopulating
• Behavior after this system is in production

Data Mining Algorithms

• Decision Trees/Linear Regression
• Sequence Analysis
• Neural Networks/Logistic Regression
• Clustering
• Text Mining (Words and Phrases)
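To make one of the listed algorithm families concrete, here is a minimal logistic regression trained with plain batch gradient descent, using only the Python standard library. The training rows, learning rate, and epoch count are invented for illustration, and nothing here is meant to reproduce the tooling actually used in the project.

```python
import math

# Hypothetical training rows: (user_fail_pct / 100, thread_fail_pct / 100)
X = [(0.05, 0.10), (0.10, 0.05), (0.20, 0.15), (0.60, 0.50), (0.70, 0.80), (0.90, 0.60)]
y = [0, 0, 0, 1, 1, 1]  # 1 = the post failed moderation


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_risk(w, b, x):
    """Estimated probability that a post with attributes x fails moderation."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


w, b = train(X, y)
low = predict_risk(w, b, (0.05, 0.05))
high = predict_risk(w, b, (0.80, 0.70))
print(f"low-risk context: {low:.2f}, high-risk context: {high:.2f}")
```

The output is a probability per post, which is convenient when posts need to be ranked by risk rather than simply accepted or rejected.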

Conclusion on Context

• Risk based on the context of the post
  • Time
  • User’s history
  • Publish location
• Enables risk analysis for all types of content
  • Comments (in any language)
  • Links
  • Pictures
  • Videos

Logical Model: Post content

• Profanity Analysis
• Text Mining

The first minister and his secretary were found sleeping together last night. They got drunk at a nearby pub.

• Sentiment Analysis
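The example sentence contains no profane word, so a plain profanity filter would let it through, while phrase-level text mining can still flag it. The sketch below illustrates that difference under stated assumptions: the word list, the phrase weights, and the "highest weight wins" scoring rule are all hypothetical.

```python
import re

# Hypothetical word and phrase lists; real lists would be far larger and curated.
PROFANITY = {"damn", "crap"}
RISKY_PHRASES = {"sleeping together": 0.6, "got drunk": 0.4}


def content_signals(text: str) -> dict:
    """Return profanity hits and risky phrases found in the text."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    matched = [phrase for phrase in RISKY_PHRASES if phrase in lowered]
    return {
        "profanity_hits": [w for w in words if w in PROFANITY],
        "risky_phrases": matched,
        # Assumed scoring rule: take the highest weight among matched phrases.
        "phrase_risk": max((RISKY_PHRASES[p] for p in matched), default=0.0),
    }


text = (
    "The first minister and his secretary were found sleeping together "
    "last night. They got drunk at a nearby pub."
)
signals = content_signals(text)
print(signals)
```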

Moderation, Data Mining System

Analysis and Reporting

• Published through an integrated web application
  • Moderation statistics
  • User statistics
  • News and stories statistics
  • Peaks

Conclusion: Benefits

• By moderating half of the total posts, the solution captures 90% of failing posts. The remaining 10% seem likely to be safe posts.
• Using Intelligent Moderation, media companies scan the whole universe of posts at a comparatively low cost.
• At peak times, Intelligent Moderation works perfectly.
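The "moderate half, capture 90%" figure is a capture rate: rank posts by risk, send the riskiest fraction to human moderators, and measure what share of all failing posts lands in that fraction. The sketch below shows the calculation on invented data; the scores and outcomes are hypothetical and were chosen only so that the toy result mirrors the figure quoted above.

```python
def capture_rate(posts, review_fraction):
    """posts: list of (risk_score, failed) pairs.

    Review the riskiest `review_fraction` of posts and return the share
    of all failing posts that falls inside the reviewed set.
    """
    ranked = sorted(posts, key=lambda p: p[0], reverse=True)
    n_review = int(len(ranked) * review_fraction)
    total_fails = sum(1 for _, failed in ranked if failed)
    caught = sum(1 for _, failed in ranked[:n_review] if failed)
    return caught / total_fails if total_fails else 0.0


# Hypothetical scored posts: (risk score, did the post fail moderation?)
POSTS = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True), (0.70, True),
    (0.65, True), (0.60, True), (0.55, True), (0.50, True), (0.45, True),
    (0.40, False), (0.35, False), (0.30, True), (0.25, False), (0.20, False),
    (0.15, False), (0.10, False), (0.08, False), (0.05, False), (0.02, False),
]
print(capture_rate(POSTS, 0.5))
```

Plotting this rate against the review fraction gives a curve that shows how much moderation effort is needed for a target level of coverage.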

Football night in Europe

• On January 25th, 2012:
  • Liverpool defeated Manchester City in the Carling Cup
  • Barcelona defeated Real Madrid in the Copa del Rey
• More than 100,000 comments arrived at the various BBC sites over 10 hours
• All comments were filtered through our system
• No problems were observed during that time

SolidQ Team in this project

• Project Managers
  • Francisco Gonzalez, Javier Torrenteras, Alejandro Leguizamo
• Developers
  • Itzik Ben-Gan, Enrique Puig, Ruben Pertusa, Carlos Martinez, Fernando G. Guerrero
• Technical Reviewers
  • Mark Tabladillo, Dejan Sarka
• Social Media Specialists
  • Jose Quinto, Rocio Díaz

SolidQ Research

• Incomplete Grammar Analysis
• Human interaction with IT systems
  • Collaboration
  • Contextual analysis
• Sentiment Analysis
  • Market Research
  • Reputation
• Data Mining of Social Context
  • Moderation
  • Market Research
  • Reputation

Invisible computing…

… Driven by Social Data

THANK YOU!

Fernando G. Guerrero Global CEO SolidQ fguerrero@solidq.com
