World Cup soccer 2014.07.05 (Money Today) : IoT + Bigdata
German soccer Team
Slide 3
What is big data?
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
Slide 4
Big Data is Everywhere!
Lots of data is being collected and warehoused:
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
Slide 5
Slide 6
How much data?
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
"640K ought to be enough for anybody."
Slide 7
What does big data do?
Slide 8
Government
In 2012, the Obama administration announced the Big Data Research
and Development Initiative, which explored how big data could be
used to address important problems faced by the government. The
initiative was composed of 84 different big data programs spread
across six departments.
Big data analysis played a large role in Barack Obama's successful
2012 re-election campaign.
The United States Federal Government owns six of the ten most
powerful supercomputers in the world.
The Utah Data Center is a data center currently being constructed by
the United States National Security Agency. When finished, the
facility will be able to handle yottabytes of information collected
by the NSA over the Internet.
Slide 9
Business
Amazon.com handles millions of back-end operations every day, as
well as queries from more than half a million third-party sellers.
The core technology that keeps Amazon running is Linux-based, and as
of 2005 the company had the world's three largest Linux databases,
with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Walmart handles more than 1 million customer transactions every
hour, which are imported into databases estimated to contain more
than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167
times the information contained in all the books in the US Library
of Congress.
Facebook handles 50 billion photos from its user base.
The FICO Falcon Credit Card Fraud Detection System protects 2.1
billion active accounts worldwide.
The volume of business data worldwide, across all companies, doubles
every 1.2 years, according to estimates.
Windermere Real Estate uses anonymous GPS signals from nearly 100
million drivers to help new home buyers determine their typical
drive times to and from work throughout various times of the day.
Slide 10
Examples of free big data sites
Google Trends
Google Flu Trends
Google Correlate
Social Metrics Insight
Slide 11
Big data in Google Trends
Slide 12
Big data case: movement of carts, product display
Slide 13
Wildfires in Korea (1991-2011)
Slide 14
Google Flu service
Slide 15
Find a location for your business
Slide 16
Crime mapping in San Francisco: 71% accuracy
Slide 17
Similar names for big data:
Data science
Business analytics
Data analytics
Data mining
Business intelligence
Machine learning
Slide 18
Slide 19
Slide 20
Case 1: Big data analysis with MBA (Market Basket Analysis)
Association Rule: Relationship graph when the link is set to 0
Graphic Representations of Association Rules
Slide 26
Relationship graph when the distance is set by value (network form)
Slide 27
Application of MBA: product recommendation system
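The link between association rules and product recommendation can be illustrated with a small sketch. The baskets below are invented toy data and the function names `rule_metrics` and `recommend` are hypothetical; the support, confidence, and lift formulas are the standard textbook definitions, not necessarily the procedure used in this case.

```python
# Toy baskets, invented for illustration only (not from the slides).
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    cons = sum(1 for b in baskets if consequent in b)
    support = both / n                              # share of baskets with both items
    confidence = both / ante if ante else 0.0       # P(consequent | antecedent)
    lift = confidence / (cons / n) if cons else 0.0  # confidence relative to baseline
    return support, confidence, lift


def recommend(baskets, item, min_confidence=0.5):
    """Rank other items by the confidence of the rule item -> other."""
    others = set().union(*baskets) - {item}
    scored = [(other, rule_metrics(baskets, item, other)) for other in others]
    kept = [(other, m) for other, m in scored if m[1] >= min_confidence]
    # Highest confidence first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda pair: (-pair[1][1], pair[0]))


if __name__ == "__main__":
    for other, (support, confidence, lift) in recommend(BASKETS, "bread"):
        print(f"bread -> {other}: support={support:.2f} "
              f"confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than chance would predict, which is the kind of pair a recommendation system would surface.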
Slide 28
Case 2: SNS analysis
Slide 29
Social Network (http://nexus.ludios.net/view/demo)
Slide 30
Analysis of Human Relations (NodeXL)
Slide 31
Friends Networks
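A friends network of this kind can be analysed with a few lines of plain Python. This is a minimal sketch, not the NodeXL workflow: the names and friendship pairs are invented, and degree centrality is just one common measure chosen here for illustration.

```python
from collections import defaultdict

# Hypothetical friendship pairs, invented for illustration only.
FRIENDSHIPS = [
    ("Ann", "Bob"), ("Ann", "Cho"), ("Ann", "Dae"),
    ("Bob", "Cho"), ("Dae", "Eun"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (person, person) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(friends) / (n - 1) for node, friends in graph.items()}


def mutual_friends(graph, a, b):
    """Friends that a and b have in common."""
    return graph[a] & graph[b]


if __name__ == "__main__":
    g = build_graph(FRIENDSHIPS)
    for person, score in sorted(degree_centrality(g).items()):
        print(person, round(score, 2))
```

The person with the highest degree centrality is the one with the most direct friends, which is often the first thing a network plot makes visible.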
Slide 32
Case 3. Bankruptcy Prediction
The yearly financial data were collected by the Korea Credit
Guarantee Fund. The data consist of 944 bankrupt corporations and
944 healthy (non-bankrupt) corporations from fiscal years 1999 to
2002.
Slide 33
List of financial variables selected
Variable: Definition
X13: interest expenses to sales = (interest expenses / sales) × 100
X17: profit to sales = (profit / sales) × 100
X24: operating profit to sales = (operating profit / sales) × 100
X27: ordinary profit to total capital = (ordinary profit / total capital) × 100
X28: current liabilities to total capital = (current liabilities / total capital) × 100
X103: growth rate of tangible assets = (tangible assets at the end of the year / tangible assets at the beginning of the year × 100) - 100
X108: turnover of managerial assets = sales / {total assets - (construction in progress + investment assets)}
net financing cost = interest expenses - interest incomes
X127: net working capital to total capital = {(current assets - current liabilities) / total capital} × 100
X129: growth rate of current assets = (current assets at the end of the year / current assets at the beginning of the year × 100) - 100
X140: ordinary income to net worth = (ordinary income / net worth) × 100
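A few of these variables can be computed as follows. This is a sketch under stated assumptions: the ratio variables are read as percentages (ratio × 100) and the growth rates as (end / beginning × 100) - 100, the dictionary keys are hypothetical field names, and the figures for `FIRM` are made up.

```python
def pct(numerator, denominator):
    """Ratio expressed as a percentage: (numerator / denominator) x 100."""
    return numerator / denominator * 100


def growth_rate(end_value, start_value):
    """Year-over-year growth in percent: (end / start x 100) - 100."""
    return end_value / start_value * 100 - 100


def financial_ratios(f):
    """Compute a subset of the listed variables from one firm's figures."""
    return {
        "X13": pct(f["interest_expenses"], f["sales"]),
        "X24": pct(f["operating_profit"], f["sales"]),
        "X127": pct(f["current_assets"] - f["current_liabilities"],
                    f["total_capital"]),
        "X129": growth_rate(f["current_assets"], f["current_assets_prev"]),
    }


# Hypothetical firm, invented for illustration only.
FIRM = {
    "sales": 1000.0,
    "interest_expenses": 50.0,
    "operating_profit": 120.0,
    "current_assets": 600.0,
    "current_assets_prev": 500.0,
    "current_liabilities": 400.0,
    "total_capital": 800.0,
}

if __name__ == "__main__":
    for name, value in financial_ratios(FIRM).items():
        print(name, round(value, 2))
```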
Slide 34
Decision Tree Analysis
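The core step of a decision tree, choosing the split that best separates bankrupt from healthy firms, can be sketched in plain Python. This is a one-level tree (a decision stump) using Gini impurity, which is a simplification and an assumption: the slides do not say which tree algorithm or split criterion was used. The rows are invented values for two ratios (interest expenses to sales, operating profit to sales), with label 1 meaning bankrupt.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (weighted Gini, feature index, threshold) of the best split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


def fit_stump(rows, labels):
    """Fit a one-level tree: (feature, threshold, left label, right label)."""
    _, j, t = best_split(rows, labels)
    left = [y for r, y in zip(rows, labels) if r[j] <= t]
    right = [y for r, y in zip(rows, labels) if r[j] > t]

    def majority(ys):
        return int(sum(ys) * 2 >= len(ys))

    return j, t, majority(left), majority(right)


def predict(stump, row):
    """Classify one row with a fitted stump."""
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


# Invented training data: (interest/sales %, operating profit/sales %).
ROWS = [(12.0, -3.0), (9.5, 3.0), (8.0, -1.5), (3.0, 6.0), (2.5, 4.0), (4.0, 2.0)]
LABELS = [1, 1, 1, 0, 0, 0]  # 1 = bankrupt, 0 = healthy

if __name__ == "__main__":
    stump = fit_stump(ROWS, LABELS)
    print(stump, predict(stump, (10.0, 0.0)))
```

A full decision tree repeats this split search recursively on each side until a stopping rule is met.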
Slide 35
Case 4. Income Prediction
For our study we selected the United States Census (5%) 1990 Public
Use Microsample data (Census 1990). This data, which was divided
into 18 files, contained the entire 5% sample placed in the public
domain from the 1990 U.S. Census in STATA 6.0 format. Combined,
these 18 files included about 4.5 million males and 5 million
females, totaling 9.1 million records.
Census 1990 - http://www.macalester.edu/econdata/United_States/pums.html
Slide 36
Data Sampling
We converted the 18 data files into flat files; then, using Java
code, we merged these 18 flat files into a single file consisting of
9.1 million records with 85 variables (approximately 1.5 GB in
size).
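The merge step can be sketched as below. The study used Java; this Python version is only an illustration, and it assumes the flat files are comma-separated with an identical header row, which the slides do not state. The function name `merge_flat_files` is hypothetical.

```python
import csv


def merge_flat_files(paths, out_path):
    """Concatenate CSV files that share a header; return the record count."""
    total = 0
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"{path}: header differs from the first file")
                # Stream row by row so large files never sit fully in memory.
                for row in reader:
                    writer.writerow(row)
                    total += 1
    return total
```

Streaming one row at a time matters at this scale: a 1.5 GB result is written without ever being held in memory as a whole.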
Slide 37
Algorithm: Analogy of Discovering the Complete Set of Rules
(Drawing the Perfect Picture via Coin Scrubbing)
Slide 38
The Repetitive Methodology of Merging New Rules into the Domain
Knowledge Base
Slide 39
The Relationship Between IRA's Accuracy Level and Number of
Iterations for This Study
Slide 40
Performance Comparison
Method: CHAID | CART | ANN | LR | DA | See5 | This study
Tool used: Answer Tree (SPSS) | Answer Tree (SPSS) | Neural Connection (SPSS) | SPSS | See5 (with default rule) | IRA
Training sample size: 3.24m | 10000 | 300k
Accuracy (2/3-1/3): 80.195 | 80.30 | RBF: 76.12, MLP: 80.68 | 81.1 | 78.3 | 82.3 | 82.7
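The accuracy figures can be read against a simple holdout evaluation. The sketch below assumes that "(2/3-1/3)" means a two-thirds training, one-third test split, which is an interpretation of the slide, not something it states; `holdout_split` and `accuracy` are hypothetical helper names.

```python
import random


def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle records reproducibly and split them into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Percentage of predictions that match the actual labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100 * correct / len(actual)
```

A model is fitted on the training part only, and `accuracy` is then computed on the held-out part so that the reported figure reflects unseen records.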
Slide 41
Mining tools
Enterprise Miner (SAS)
Clementine (SPSS)
R
Python
Many visualisation tools: infographics etc.
RapidMiner
Hadoop
RHive
Slide 42
Future direction of bigdata
Slide 43
bigdata 2013 bigdata 2014
Slide 44
Google Glass
Mashup, big data, visualisation -> analysis of commerce area