Power product innovation with Big Data technologies
Introducing:
Zhixuan Wang Experian
Hua Li Experian
©Experian 3
In the 21st century, data is the new oil, Big Data analytics is the new engine, and Big Data tools are the new machinery.
4/21/2017 Experian Public Vision 2017
Big Data and open source landscape
Apache Hadoop Stack
Tip: Use Hadoop Streaming to write the mapper and reducer in your favorite programming language
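As a minimal sketch of the tip above, a Hadoop Streaming job can be written as two Python functions that read text lines and emit tab-separated key/value lines. The comma-separated input layout with the account number in the first field, and the names `mapper` and `reducer`, are assumptions for illustration and are not taken from the slides.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'account<TAB>1' for every non-empty transaction line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        account = line.split(",")[0]  # assumption: account number is field 0
        yield f"{account}\t1"


def reducer(lines):
    """Sum the counts per key; the framework delivers lines sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for account, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{account}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    # Invoked with 'map' or 'reduce' as the only argument inside a streaming job.
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Because both steps only read standard input and write standard output, the same script can be checked locally by piping a sample file through the map step, a sort, and the reduce step.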
Credit Card Attrition Trigger: Transaction Data Insight System (TDIS)
1. Historical spend enables a probability expectation (profile) to be computed
2. As time passes, new transactions adjust the probability expectation
3. Notify when a transaction does not occur within the probability expectation threshold
Hadoop Streaming with secondary sort
Calculate triggers in reducer
• Build up profile based on account-grouped date-time ordered transactions
• Reuse old Python code
Results
• 10M accounts with 1.2B transactions over 24 months
• No profile data to be stored: ~50GB / snapshot
• Finished in 1 hour 17 minutes on 6 machines with 8 cores each
• Trigger delivery moved from weekly to daily
Sort
• Primary key = account number
• Secondary key = date-time
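The reducer side of this design can be sketched as follows, assuming the framework has already delivered lines grouped by account number and ordered by date-time within each account. The tab-separated layout, the function name `attrition_triggers`, and the trigger rule itself (silence longer than `factor` times the account's mean gap between transactions) are illustrative assumptions; the slides do not describe the actual profile model.

```python
from datetime import datetime
from itertools import groupby


def parse(line):
    """Split 'account<TAB>timestamp<TAB>amount' (assumed layout)."""
    account, stamp, amount = line.rstrip("\n").split("\t")
    return account, datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), float(amount)


def attrition_triggers(lines, as_of, factor=3.0):
    """Yield (account, days_silent, mean_gap_days) for quiet accounts.

    Relies on the secondary sort: all lines of one account arrive together
    and in date-time order, so the profile is built in a single pass and
    nothing has to be stored between runs.
    """
    records = (parse(line) for line in lines if line.strip())
    for account, group in groupby(records, key=lambda rec: rec[0]):
        last_seen = None
        gap_total = 0.0
        gap_count = 0
        for _, stamp, _amount in group:
            if last_seen is not None:
                gap_total += (stamp - last_seen).total_seconds() / 86400
                gap_count += 1
            last_seen = stamp
        if gap_count == 0:
            continue  # a single transaction gives no gap to profile
        mean_gap = gap_total / gap_count
        days_silent = (as_of - last_seen).total_seconds() / 86400
        # Hypothetical rule: trigger when the silence exceeds the expectation.
        if days_silent > factor * mean_gap:
            yield account, days_silent, mean_gap
```

Without the secondary sort, the reducer would have to buffer and sort every account's transactions in memory before it could build the profile.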
Apache Spark
Spark use example
• Interactive data exploration via Spark Shell (Scala)
Interactively explore credit card transaction data
Credit card transaction data: 24 months
• 25GB bzipped
• 1.2B transactions
– 18 fields / transaction
• 8 machines
– 32 cores / machine
– 256GB memory / machine
Split, convert and load data
Fire up Spark-shell
Cache data
Check cached data and executors
Cache it!
Explore data (fast!)
Take a peek
Five number summary on TRAN_AMT
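The slides run this step in the Spark shell, and the shell session itself is not reproduced here. As a stand-in, the quantities that a five-number summary reports can be sketched in plain Python on a small list. The function name `five_number_summary` and the "inclusive" quartile convention are assumptions; other tools may interpolate quartiles differently.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    # method="inclusive" treats the data as the full population (assumed choice).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Illustrative amounts only, not real transaction data.
print(five_number_summary([5, 1, 4, 2, 3]))
```

On 1.2B transactions the same five values would be computed by the cluster against the cached data rather than by sorting a local list.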
Save results
Top merchant ZIP Codes™
Recap and tips
Start Spark Shell
• Set proper number of executors and memory per executor
Convert, load, cache data
• Spark 1.6 or later: memory efficient
• Partition data to fit the executor’s memory limit
Explore
Graph database
Who are my family members?
Challenge: Finding the missing link
Potential applications:
• Healthcare: Elderly patients who live close to their children
• Wealth service: Identify the heirs of elderly customers
• Retail: Condolence / celebration / holiday gifts and services
• Anti-money laundering: Domestic politically exposed persons
• Fraud prevention: Synthetic ID fraud
What is a graph database?
Graph
• A collection of vertices (nodes) and edges (relationships) that connect them
Graph database
• Index-free adjacency: connected nodes physically “point” to each other in the database
Design the graph for family search
• Extremely flexible data format
• Most of the time, family members are not directly connected
• Nodes that are useful family indicators:
– Address
– Phone number
– Email address
– Last name
• Other usage:
– Meetup / E-harmony (based on hobby, taste, etc.)
– Facebook / LinkedIn (based on co-workers, classmates, etc.)
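The design above can be sketched with a small in-memory adjacency structure: people are never linked to each other directly, only to indicator nodes (address, phone number, email address, last name), and family candidates are the people reachable in two hops through a shared indicator. The class name `FamilyGraph`, its methods, the scoring by number of shared indicators, and the sample records are all illustrative assumptions; the slides use a graph database and Cypher rather than this code.

```python
from collections import defaultdict


class FamilyGraph:
    """Adjacency sets keyed by (kind, value) nodes; people are ('person', name)."""

    def __init__(self):
        self.adj = defaultdict(set)

    def link(self, person, kind, value):
        """Connect a person to an indicator node such as ('address', '1 Main St')."""
        person_node = ("person", person)
        indicator = (kind, value)
        self.adj[person_node].add(indicator)
        self.adj[indicator].add(person_node)

    def family_candidates(self, person):
        """Return [(name, shared_indicator_count)], strongest candidates first."""
        start = ("person", person)
        shared = defaultdict(int)
        for indicator in self.adj.get(start, ()):      # hop 1: person -> indicator
            for other in self.adj[indicator]:          # hop 2: indicator -> person
                if other != start:
                    shared[other[1]] += 1
        return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


graph = FamilyGraph()
# Made-up sample records for illustration only.
graph.link("Ann Lee", "address", "1 Main St")
graph.link("Ann Lee", "last_name", "Lee")
graph.link("Bob Lee", "address", "1 Main St")
graph.link("Bob Lee", "last_name", "Lee")
graph.link("Cy Lee", "last_name", "Lee")
graph.link("Dee Wu", "phone", "555-0100")
print(graph.family_candidates("Ann Lee"))
```

Each hop follows stored neighbor references instead of joining tables, which is the property the slides call index-free adjacency.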
Comparison
SQL Query (RDBMS) vs. Cypher Query (Graph Database)
Geolocation database with PostgreSQL
Geolocation data
Exponential growth of mobile location data with the rise of smartphones
Wide applications:
• Home / work location detection
• Favorite shops
• Mobile marketing service
• Passenger analysis
Key question:
• Where has the consumer been?
Supporting components:
• Where is the point of interest (POI) data?
• Which POIs are around the consumer?
[Figure labels: where you work, where you shop, how you get there, events you attend, where you travel, where you live, where you spend free time]
OpenStreetMap: best free source for points of interest
OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world
• Not as accurate as Google, but getting closer and closer, especially in major cities
• Points, lines, polygons
• Rich tags:
Addr: House number, street, city, etc.
Shop: Alcohol, beverage, computer
Admin_level: 2 (country), 4 (state), 6 (city)
Highway: Residential, primary, cycle way, track, etc.
Amenity: Library, school, parking area, bar
Cuisine: coffee, pizza, Chinese, sushi
• Can be easily imported into PostgreSQL with the PostGIS extension
What POIs are around me?
Geohash
[Figure: geohash grid. Cell 9 is divided into 32 two-character cells (90 to 9z); cell 9q is in turn divided into 32 three-character cells (9q0 to 9qz).]
• Hierarchical group coding of (latitude, longitude) coordinates
• Arbitrary accuracy
• Fast encoding
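To show why the coding is hierarchical and why its accuracy is arbitrary, here is a minimal sketch of the standard geohash encoding: the longitude and latitude ranges are halved alternately, each halving contributes one bit, and every five bits become one base-32 character. The function name `geohash_encode` and its `precision` parameter are chosen for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=5):
    """Encode a (latitude, longitude) pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# A point in San Francisco falls in cell 9q, then 9q8, and so on.
print(geohash_encode(37.7749, -122.4194, precision=5))
```

A longer geohash always starts with the shorter one for the same point, so truncating a code yields the enclosing, coarser cell.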
Nearby points: easy case of vicinity search
Which store am I visiting?
1. Identify the search radius
2. Find POI candidates within the candidate geohash cells
3. Filter by actual distance calculation
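The filter-then-measure pattern can be sketched as below. A cheap latitude/longitude bounding box stands in for the geohash cell lookup (an assumption made to keep the example self-contained), and the haversine great-circle formula on a spherical Earth stands in for the "actual distance calculation". The function names, the `(name, lat, lon)` tuple layout, and the sample POIs are illustrative.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(lat, lon, pois, radius_m):
    """Return names of POIs within radius_m of (lat, lon), nearest first.

    pois is an iterable of (name, lat, lon) tuples. Wrap-around at the
    antimeridian is ignored in this sketch.
    """
    angle = radius_m / EARTH_RADIUS_M
    dlat = math.degrees(angle)
    ratio = math.sin(angle) / max(math.cos(math.radians(lat)), 1e-12)
    dlon = 180.0 if ratio >= 1 else math.degrees(math.asin(ratio))
    # Step 2 stand-in: cheap candidate selection by bounding box.
    candidates = [p for p in pois if abs(p[1] - lat) <= dlat and abs(p[2] - lon) <= dlon]
    # Step 3: exact distance only for the surviving candidates.
    hits = [(haversine_m(lat, lon, p[1], p[2]), p[0]) for p in candidates]
    return [name for dist, name in sorted(hits) if dist <= radius_m]


# Made-up POIs for illustration only.
pois = [
    ("cafe", 37.7750, -122.4195),
    ("museum", 37.8000, -122.4200),
    ("kiosk", 37.7749, -122.4194),
]
print(nearby(37.7749, -122.4194, pois, radius_m=100))
```

The expensive trigonometry runs only on the few candidates that pass the cheap test, which is the point of filtering first.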
Nearby polygons: advanced case of vicinity search
Which park am I visiting?
Challenge #1: The geohash of a polygon is the geohash of its center, but the boundary could be very far from the center.
Solution: Categorize polygons by size first, then customize the search radius for each size category.
Multi-level search: find polygons of all sizes
Which park am I visiting?
Am I in the park?
Challenge #2: Given a point, how do we determine whether it is inside the polygon?
Solution:
ST_Within (PostGIS built-in function), using the ray casting algorithm:
• Draw a ray from the point in an arbitrary direction
• Count the number of intersections with the polygon boundary
• Odd: inside. Even: outside.
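The ray casting steps above can be sketched in a few lines. This is a generic textbook version, not PostGIS source code: it fixes the ray direction to the positive x axis, treats coordinates as planar, and leaves points lying exactly on the boundary unspecified. The function name `point_in_polygon` is chosen for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside polygon, a list of (x, y) vertices.

    A horizontal ray is cast from the point toward +x; each edge it crosses
    flips the inside/outside state, so an odd crossing count means inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # edge from vertex i to the next vertex
        # The edge straddles the ray's height only if its endpoints lie on
        # opposite sides of y; this also skips horizontal edges safely.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
print(point_in_polygon(2, 2, square), point_in_polygon(3, 3, l_shape))
```

The half-open comparison `(y1 > y) != (y2 > y)` counts a vertex that sits exactly on the ray once rather than twice, which keeps the parity correct for concave shapes such as the L-shaped example.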
Key takeaways
• Use OpenStreetMap + PostgreSQL (PostGIS) to handle your geolocation data
• Filter the candidates before you calculate distances
Summary
Tips on some of the latest techniques, based on our experience
• Spark:
– Set proper number of executors and memory per executor
– Partition data to fit executor’s memory limit
• Graph Database:
– Much more efficient than a traditional RDBMS when a query needs multiple joins
– Much more flexible
• Geolocation data:
– OpenStreetMap + PostgreSQL
– Filter candidates before a proximity search
http://www.experian.com/big-data/datalabs.html
Experian contact:
Hua Li [email protected]
Zhixuan Wang [email protected]
Questions and answers
Share your thoughts about Vision 2017!
Please take the time now to give us your feedback about this session.
You can complete the survey at the kiosk outside.
How would you rate both the Speaker and Content?