Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Data Science Company
Harvesting Business Value with Data Science
InfoFarm - Seminar 18/03/2015
Agenda
• 09:30 About us
• 09:40 Introduction to data science
• 10:00 Data science in practice:
- Fictitious examples
- InfoFarm use-cases
- Big Data at Essent (Els Descheemaeker)
- Fraud detection: Gotch’all (KULeuven)
• 11:30 Possibility to discuss your data
science ideas
Data Science
Big Data
Provide customers with new information by
identifying, extracting and modeling data of all types
and origins; exploring, correlating and using it in new
and innovative ways in order to extract meaning and
business value from it.
Java
PHP
E-Commerce
Web Development
Mobile
InfoFarm - Team
• Mixed skills team
- Data scientists
- Big Data developers
- Infrastructure specialist
• Complementary with client on domain expertise
• Certifications
– CCDH - Cloudera Certified Hadoop Developer
– CCAH - Cloudera Certified Hadoop Administrator
– OCJP - Oracle Certified Java Programmer
Visualization
Data science
Business Knowledge
Development
Introduction to data science
Being a Data Scientist
• Complementing business knowledge with figures
• “Getting meaning from data”: Finding patterns (data mining)
• Data Scientist: “A person who is better at statistics than any software engineer and better at software engineering than any statistician” - Josh Wills
Data Science = about asking the right question!
Data Science
• Relevance for business – use data to:
– Increase conversion
– Increase operational efficiency
– Understand your customers’ needs
– Make better offers
– Make better recommendations
– …
• The key point is spotting opportunities to outperform your
competitors using any data available!
• Data science is affordable for companies of all sizes
• Typical data science projects take tens of man-days of work
Business Knowledge vs Data Science (intuitive knowledge vs data-driven decisions)
Business Knowledge
Acquired by experience
(Assumed) insights
RISK: too much bias toward past experience and gut feeling
Data Science
Complementary to business knowledge
Confirmative or new insights
Data-driven decision making
RISK: too naive data interpretation, disconnected from the business
Data Science vs Business Intelligence
               | Business Intelligence           | Data Science
Basic concepts | Structure & query               | Experimenting & discovering
Processes      | DWH, OLAP, ETL                  | Avoid heavy ETL (loosely structured data and agile use of many sources)
Investment     | Big investment, delivers exactly | Limited investment, might or might not deliver
Cycle          | Development                     | Exploratory working
Perspective    | Past                            | Future
Questions      | What happened?                  | What will happen? What if?
               | Data Warehouse, silo            | Distributed, real-time, “unstructured”
Data Science vs Big Data
• What about the elephant in the room?
• Big Data allows:
– N=ALL (avoid sampling errors)
• Sampling issues can be overcome by processing ALL available data (process massive data)
– N=1 (avoid issues with non-homogeneous datasets)
• Categorization becomes true personalisation: project towards ONE individual (calculate per item)
The Data Science maturity model
• Don’t run before you can walk: the Data Science Maturity model
Each level builds on the quality of the underlying step. It’s science, not magic …
• The process is a scientific cycle, not a development cycle!
• Because it is a science, the outcome cannot be predicted
• Even without success, you have learned something
Collect
Describe
Discover
Predict
Advise
Data Science: Tools & Techniques
Machine Learning
Classification
Clustering
Association Rules
Regression
Information extraction
Techniques
Classification: Use Cases
• Incoming mail redirection
• Sentiment analysis
• Order picking optimization
Clustering: Use cases
• Customer segmentation
• Product segmentation
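As an illustration of how a segmentation could be computed, here is a minimal k-means sketch. The slides do not name a clustering algorithm; k-means is only one common choice. The two features (orders per year, average order value), the sample customers and the starting centroids are all made-up assumptions.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical customers: (orders per year, average order value)
customers = [(2, 30), (3, 25), (1, 35), (20, 300), (22, 280), (18, 320)]
labels, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)
```

In practice the features would need to be scaled to comparable ranges first, otherwise the feature with the largest values dominates the distance.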
Association Rule Learning: Use Cases
• Recommendations
• Data exploration
• Find connections between unrelated events
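Association rules are usually judged by their support (how often the items occur together) and confidence (how often the consequent appears given the antecedent). The sketch below computes both for a single rule; the baskets and product names are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for b in baskets if a <= set(b))          # baskets containing the antecedent
    n_ac = sum(1 for b in baskets if (a | c) <= set(b))   # baskets containing both sides
    support = n_ac / len(baskets)
    confidence = n_ac / n_a if n_a else 0.0
    return support, confidence

# Hypothetical shopping baskets
baskets = [
    {"paint", "brush"},
    {"paint", "brush", "primer"},
    {"paint"},
    {"brush"},
]
support, confidence = rule_metrics(baskets, {"paint"}, {"brush"})
print(support, confidence)
```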
Regression: Use Cases
• Order Quantity Prediction
• Trend estimation
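Trend estimation can be as simple as fitting a straight line through a series. A minimal least-squares sketch, assuming equally spaced periods and made-up weekly order quantities:

```python
def linear_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical weekly order quantities
weekly_orders = [10, 12, 14, 16]
slope, intercept = linear_trend(weekly_orders)
next_week = intercept + slope * len(weekly_orders)  # extrapolate one period ahead
print(slope, intercept, next_week)
```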
Information Extraction
• Extract variables from unstructured data such as text.
• Named Entity Extraction
Data Science Examples
Data Science examples
• Market segmentation
• Impact analysis
• Recommendations
• Water treatment
• Damage type research
• Call center aid
• Personalized client mailing (Essent)
• What do people write about us
• Fraud detection: Gotch’All (KU Leuven)
#1 Market Segmentation
Market segmentation
• Business knowledge based approach
– “We know our segments: -25y, 25y-35y, 35y+ groups, and male/female”
– But is this (still) true?
– E.g.: do we really want to send an ad for the new iPhone to a long-time Android user because he’s a 30-something male customer?
Market segmentation
• Example:
We want to send mailings about our new product
• Decisions to take:
– Which mail to send to which customers?
– We need customer segmentation!
• Risks in failing to do this correctly
– Missing opportunities (not informing customers)
– Annoying customers with irrelevant mailings (churn, reputation damage, …)
Market segmentation
WEB SERVER LOGS: Which customers looked at similar products?
ORDER HISTORY: Which complementary products does the customer own?
EXTERNAL DATA: Reviews or critics?
CRM INFORMATION: Typical profile of a customer who responds to campaigns for a similar product?
#2: Analysis on the impact of physical
stores on your webshop
InfoFarm example
Impact physical store on online?
– Are online sales higher when physical store is nearby?
– Where to open a new store?
– How to approach your customers to motivate them to buy (more) at
your store?
Impact physical shops - example
• Analysis for a retailer: Physical shops vs online sales
Impact physical shops - example
• Impact of opening a physical shop on local online sales
(brand awareness?)
• Simple statistical test
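The slide does not say which statistical test was used. As one possible choice, the sketch below computes Welch's t statistic to compare local online sales before and after a shop opening; the weekly sales figures are made up, and only the statistic is computed, not a p-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(before, after):
    """Welch's t statistic for the difference in means (after minus before)."""
    standard_error = sqrt(variance(after) / len(after) + variance(before) / len(before))
    return (mean(after) - mean(before)) / standard_error

# Hypothetical weekly online orders in one region
before_opening = [10, 12, 11, 13]
after_opening = [15, 17, 16, 18]
t_stat = welch_t(before_opening, after_opening)
print(round(t_stat, 2))
```

A large positive value suggests that sales rose after the opening. It says nothing about the cause, so other explanations (season, campaigns) would still have to be ruled out.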
Impact physical shops – now what?
• Use this correlation information:
– As extra input for determining new shop locations
– Use pop-up stores to build brand awareness
• Do these pop-up stores have the same non-temporary influence?
– Publish leaflets focusing on online sales in non-covered areas
– Discounts per region
– Google AdWords campaigns focusing on regions with limited brand presence
– Customer segmentation based on this information
#3 Recommendations of products to a
customer
Recommendations – Why? How?
– Why?
• Attempt to cross-sell or up-sell
• Provide customers with alternatives that might please them even more
– Traditional approach
• No recommendations at all
• Products in the same category
• Manually managed cross-selling opportunities per product
– Why are these approaches fundamentally flawed?
• They all start from the seller perspective, not the customer!
• “We know what you should be buying”
• Manual recommendations are too costly and time-consuming to maintain, even impossible with large catalogs
Recommendations
– Product based recommendations
• Main focus on online, but why?
• Who knows best what products to recommend?
• Learn from your data, don’t take decisions based on a feeling.
– Time based recommendations
• Recommend or cross sell different products depending on
– season?
– holiday?
– weather?
– Customer based recommendations
• Learn from your customers and their past.
• Android vs iOS smartphones.
Recommendations – Traditional approach
Current product: similar products and related products
Which related products to show?
Which brush would be appropriate?
Primer + paint combo?
Traditionally: unavailable
Which similar products to show?
Color alternatives? Glossy/matte alternatives?
Cheaper/better?
Traditionally: too similar
Recommendations – what does Amazon do?
Cross-selling as realized with other (similar?) customers
Starts from customer point of view!
Recommendations based on perceived customer journeys
Re-use the product comparisons that previous customers made!
DATA DRIVEN!
Recommendations – Other ideas
• Data Science ideas
– “x % of the people who looked at this item eventually bought product X or Y”
– Get cross-selling information from ERP in the physical shops and let this feed the
webshop recommendations!
– Similar product in different price ranges
(“best-buy alternative”, “deluxe alternative”)
– ...
• This is very achievable for a webshop of any size
– Just generate ideas, and test to see what actually increases sales!
• Secondary use of various kinds of non-structured data = Big Data!
– Weblogs of the e-commerce site (use to deduce customer journeys)
– ERP info with bills and/or invoices (use to deduce cross-selling in physical shops)
– Product information (product categorization, …)
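The “x % of the people who looked at this item eventually bought product X or Y” idea can be sketched by counting purchases among the sessions that viewed an item. The session structure (`viewed` and `bought` sets) and the product names are hypothetical; real weblogs would first need to be parsed into this shape.

```python
from collections import Counter

def bought_after_viewing(sessions, item):
    """Share of the sessions that viewed `item` in which each product was bought."""
    viewers = [s for s in sessions if item in s["viewed"]]
    if not viewers:
        return {}
    counts = Counter(product for s in viewers for product in set(s["bought"]))
    return {product: count / len(viewers) for product, count in counts.items()}

# Hypothetical sessions reconstructed from weblogs
sessions = [
    {"viewed": {"paint"}, "bought": {"brush"}},
    {"viewed": {"paint", "primer"}, "bought": {"brush", "primer"}},
    {"viewed": {"paint"}, "bought": set()},
    {"viewed": {"paint"}, "bought": set()},
    {"viewed": {"ladder"}, "bought": {"ladder"}},
]
shares = bought_after_viewing(sessions, "paint")
print(shares)
```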
#4 Water treatment
InfoFarm example
Context
• Rainfall and wastewater entering the sewer system sometimes exceed the maximum capacity, which requires dumping wastewater into rivers. To be avoided as much as possible!
• Long-term question: can we achieve better capacity management of the sewer system with the data currently available?
• Short-term action: Proof-of-Concept on the application of Data Science with Big Data tools (Hadoop)
Describe
Data quality – visual inspection
Lag-analysis between 2 points
17 minutes (= +/- 20km/h = avg wind direction & speed)
[Figure: wind rose titled “Wind”, with compass directions from North clockwise to NNW]
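A lag between two measurement points can be estimated by shifting one series against the other and keeping the shift with the highest correlation. The sketch below is one simple way to do that, not necessarily the method used in the project; the two series are made up and the lag is expressed in samples rather than minutes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0

def best_lag(first, second, max_lag):
    """Lag (in samples) at which `second` correlates best with `first`."""
    scores = {}
    for lag in range(max_lag + 1):
        # Compare first[t] with second[t + lag].
        scores[lag] = pearson(first[:len(first) - lag], second[lag:])
    return max(scores, key=scores.get)

# Hypothetical readings: the second point sees the same signal 3 samples later
point_a = [0, 0, 5, 9, 4, 1, 0, 0, 3, 7, 2, 0, 0, 0, 0]
point_b = [0, 0, 0] + point_a[:-3]
lag = best_lag(point_a, point_b, max_lag=6)
print(lag)
```

Multiplying the lag by the sampling interval gives the delay in minutes.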
Predict
• Attempted predictions, with very limited results due to
– data quality
– our limited business insights
– limited time (Data Science isn’t magic)
• Model predicting whether rain or only wastewater is in the sewer system, based on incoming water at the treatment plant
                 | Predicted No Rain | Predicted Rain |
Observed No Rain | 4504              | 171            | 96%
Observed Rain    | 836               | 602            | 42%
                 | 84%               | 78%            | 84%
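The percentages in the table follow directly from the four counts. The sketch below recomputes them, treating “Rain” as the positive class (an assumption made here; the slide does not label the percentages).

```python
# Counts from the table: rows are observed, columns are predicted.
true_neg, false_pos = 4504, 171   # observed no rain
false_neg, true_pos = 836, 602    # observed rain

specificity = true_neg / (true_neg + false_pos)      # observed no rain, predicted correctly
recall = true_pos / (true_pos + false_neg)           # observed rain, predicted correctly
neg_pred_value = true_neg / (true_neg + false_neg)   # predicted no rain that was right
precision = true_pos / (true_pos + false_pos)        # predicted rain that was right
accuracy = (true_neg + true_pos) / (true_neg + false_pos + false_neg + true_pos)

for name, value in [("specificity", specificity), ("recall", recall),
                    ("neg_pred_value", neg_pred_value), ("precision", precision),
                    ("accuracy", accuracy)]:
    print(f"{name}: {value:.0%}")
```

The 84% overall accuracy hides that only 42% of the observed rain events are detected, which matches the “very limited results” conclusion.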
#5 Damage type research
Future InfoFarm example
Damage type research
• Not limited to logistics:
– Telecom decoders
– Machinery
• Possible ideas:
– Which damage types occur most?
– Are certain damages restricted to certain types of machinery?
– Do certain damages trigger others?
– Do certain damages occur more on certain lines/with certain users?
– Which damages cause early maintenance, and can we predict these occurrences in advance?
– …
#6 Call center aid + omnichannel
Future InfoFarm example
Proactive calling
• Proactive calling:
– List of people most likely to respond to calls
• In omnichannel case: better to call, mail, …
– List of items they might be interested in
Call center information
• Call center information
– Personal information on the caller?
– What are they going to ask?
– What are they saying about you?
Omnichannel
• Are customers more likely to respond to:
– Internet-based contacts: mailings, webshop, …
– Paper brochures
– Phone calls
– Physical shop
#7 Personalized client mailing
Essent
• Belgian supplier of energy and natural gas to consumers and professional users
• 4th largest player in Belgium
• 350 000 customers, of which 24 000 professional
• Active since 2001
• 150 FTE
• For more information, contact els.descheemaeker@essent.be
We could give the answer before they call
3 A’s approach
“A”cquire Data
• Collection of data
• Quality check of data
• Descriptive data, consumption behavior data, call data
SEEMS EASY, BUT IT ISN’T
“A”nalyze Data
• What is the profile of the calling customer
• Which parameters are important
• OUTPUT: algorithm made by data scientist
Make it “Actionable”
• Apply the defined profile to NEW customers with highest risk of calling.
• HOW? Send a personalized video via email with all “relevant” data, for which they normally call.
• WHEN? Before the customer receives the invoice
Send to those new customers with the highest probability of calling.
Example of e-mail with video link
Learning
• Guerrilla approach – no big project
• Mixed team on top of daily business
• Focused innovation, DQ positioned as side-effect MUST
• “Guerrilla” led to attention for DQ towards the right audience
• Engage employees for good DQ output
– Input of employees that generate the output
– Leads to a long-term commitment
• More impact than big DQ initiatives – part of daily process
#8 What do people write about us
InfoFarm example
How do we get in the media?
• Find news articles containing certain keywords/concerning certain topics
• First model: identifying relevant texts
How do we get in the media?
• Second model: dividing relevant texts into topic clusters
How do we get in the media?
• Third model: are they talking positively or negatively about these topics?
How do we get in the media?
• Final idea, extract:
– Who is talking about you?
– To which organization do they belong?
– Can we confirm their figures?
#9 Fraud detection: Gotch’All
KU Leuven
Gotch’All
• Research (mini lecture: https://www.youtube.com/watch?v=6H5Lp3i05Cg)
– Prof. Dr. Bart Baesens
– Veronique Van Vlasselaer
– Prof. Dr. Tina Eliassi-Rad
– Prof. Dr. Leman Akoglu
– Prof. Dr. Monique Snoeck
• Social network analysis
– Is fraud a social phenomenon?
– Social security fraud
– Credit card fraud
Is fraud a social phenomenon?
Identity theft:
• Before: person calls his/her frequent contacts
• After: person also calls new contacts, which coincidentally overlap with another person’s contacts.
Social security fraud:
• Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities.
Fraud?
• Anomalous behavior
– Outlier detection: abnormal behavior and/or characteristics in a data set often indicate that a person perpetrates suspicious activities
– Behavior of a person/instance does not comply with overall behavior, e.g. illegal setup of a customer account
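A very small outlier detector illustrates the idea of behavior that “does not comply with overall behavior”. The slides do not specify a method; this sketch uses the modified z-score based on the median and the median absolute deviation (MAD), with the commonly used constant 0.6745 and cut-off 3.5. The transaction counts are made up.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Indices of values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:  # no spread at all: nothing can be scored
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical number of transactions per customer in one week
transactions = [12, 14, 11, 13, 12, 15, 13, 95]
suspects = mad_outliers(transactions)
print(suspects)
```

An outlier is only a candidate for investigation, not proof of fraud.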
Properties of fraud detection models
• Accuracy (AUC, precision and recall)
• Operational efficiency (e.g. 6-second rule in credit card fraud)
• Economic cost
• Interpretability (i.e. the model should make sense)
How to detect Mr. Hyde?
Social Network Analytics: Components
• Nodes (the objects of the network)
– People
– Computers
– Reviewers
– Companies
– Credit card holders
– …
• Links (the relationships between objects)
– Call record
– File sharing
– Product reviews
– Shared suppliers/buyers
– Merchant
– …
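Nodes and links map naturally onto an adjacency structure. The sketch below builds one from a list of links and computes a simple network feature: the share of a node's direct neighbours that are already known to be fraudulent. This is an illustrative feature only, not the Gotch’All method; the company names and fraud labels are hypothetical.

```python
from collections import defaultdict

def build_graph(links):
    """Undirected adjacency sets from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def fraud_exposure(graph, node, known_fraud):
    """Fraction of a node's direct neighbours that are known fraud cases."""
    neighbours = graph.get(node, set())
    if not neighbours:
        return 0.0
    return len(neighbours & known_fraud) / len(neighbours)

# Hypothetical companies linked by shared suppliers/buyers
links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
graph = build_graph(links)
known_fraud = {"C", "D"}
print(fraud_exposure(graph, "A", known_fraud))
```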
Social security institution
End of a company’s lifecycle:
(1) Regular suspension
No outstanding debts
(2) Regular bankruptcy
Outstanding debts
Cause: economic situation
(3) Fraudulent bankruptcy
Outstanding debts
Cause: intention
Goal: prevention of fraudulent bankruptcies (i.e., intentional bankruptcies to avoid contribution payments to the government)