BigData May2013 VN

Can Big Data Bring Us Big Value?

Ho Tu Bao

Japan Advanced Institute of Science and Technology (JAIST)

Outline


Research & Development on Big Data

Big Data, what is it?

Big Data in Viet Nam?

Three emerging IT technologies: smart devices, cloud computing, big data

Cloud computing

Smart devices

CEO TG Bình: “the infrastructure solution will be based on mobile technology, cloud computing, and big data”

CTO NL Phương: “FPT's current is IT as the infrastructure of infrastructure, with Mobility, Cloud Computing, Big Data… as its oars.”

What is big data?

Big data refers to datasets so VERY LARGE and COMPLEX that traditional IT techniques cannot process them.

Volume: sizes from terabytes to petabytes (10^15 bytes) and even zettabytes (10^21 bytes)

Variety: the complexity of data with many different structures, from relational data to logs, raw text…

Velocity: the flow of very large amounts of data in motion

Veracity: the reliability, accuracy, and correctness of the data.

Where does big data come from?

From social media: insights into the behavior and opinions of a company's customers.

From machines: industrial equipment, sensors and monitoring instruments, web logs…

From business transactions: product IDs and prices, payments, manufacturing and distribution data, …

Many other sources

Each day: 230M tweets, 2.7B comments to FB, 86400 hours of video to YouTube

Large Hadron Collider generates 40 terabytes/sec

Amazon.com: $10B in sales in Q3 2011, US pizza chain Domino's: 1 million customers per day

Big data can be very small. Not all large datasets are big.

"Big" refers to high complexity more than to large size.

Big data that is small

Nuclear reactors, aircraft… have hundreds of thousands of sensors. How much complexity does combining the data from these sensors create?

The data stream from all the sensors is big even though the dataset itself is not large (one hour of flight: 100,000 sensors x 60 minutes x 60 seconds x 8 bytes is less than 3 GB).

Large datasets that are not big

A growing number of systems generate very large amounts of simple data.

Source: MIKE2.0
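The one-hour flight estimate can be checked with a short calculation. This is a minimal sketch that assumes one 8-byte reading per sensor per second and decimal gigabytes (10^9 bytes). The constant and function names are illustrative, not from the slides.

```python
# Back-of-the-envelope size of one hour of flight sensor data.
# Assumption: every sensor emits one 8-byte reading per second.
SENSORS = 100_000
SECONDS_PER_HOUR = 60 * 60
BYTES_PER_READING = 8


def flight_data_bytes(hours: float = 1.0) -> int:
    """Return the raw data volume in bytes for the given flight time."""
    return int(SENSORS * SECONDS_PER_HOUR * BYTES_PER_READING * hours)


total = flight_data_bytes()
print(f"{total:,} bytes = {total / 1e9:.2f} GB")
```

The volume stays under 3 GB, so the difficulty lies in combining 100,000 interdependent streams rather than in storing them.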

Big data chases election 2012 undecided voters


More than 150 techies are quietly peeling back the layers of your life.

They know what you read, where you shop, what work you do, and who your friends are. They even know who your mother voted for last time…

From data mining to online organizing. Through Facebook, Twitter, and many other online sources, a tireless campaign is building a database of personal profiles of potential voters.

Obama has 16 million Twitter followers versus 500,000 for Romney. On Facebook, Obama has nearly 27 million followers versus 1.8 million for Romney.

Big data, big analytics, big opportunity

Some very large companies that were known mainly for making hardware are now gradually turning into service providers, for example in business analytics.

IBM’s past: building servers, desktop computers, laptops, and infrastructure equipment.

IBM today: has dropped some hardware lines such as laptops, and is instead investing billions of dollars to build credentials and establish a leading position in business analytics.

IBM has invested billions of dollars and uses SPSS in the business analytics market to capture retail market share. For large commercial ventures, IBM uses Cognos to provide full-service analytics.

Source: http://dawn.com/2012/07/25/big-data-big-analytics-big-opportunity/ (25 July 2012)

Google’s Cloud Storage and BigQuery

Google understands very well how to manage and process huge amounts of data, at a scale beyond what most other companies can handle.

Google built its own technology for fast, interactive analysis of huge amounts of data: BigQuery (connected to Tableau) and Cloud Storage.

http://www.wired.com/insights/2012/11/visual-analytics-brings-big-data-in-googles-cloud-to-life/

[Figure: Google Data Center]

Turning big data into value

Big data analytics lets organizations solve complex problems that were previously out of reach, and so make better decisions and take better actions.

Competitive advantages.

Deep insights into the complex behavior of human society.

Breakthroughs in science.

etc.

Data-driven approach to science

[Diagram: carefully designed data-generating experiment; generation of hypotheses; analyze and test hypotheses; inductive reasoning by computation]

Data-driven XYZ; data analytics

Source: Forbes and Gartner, Oct. 15, 2012

Big data inquiries, October 19, 2011 to October 10, 2012: by industry, by region, by enterprise

BIG DATA TORRENT BIG DATA VALUE

Source: McKinsey Global Institute

Gartner prediction on big data


IT to spend $232B on Big Data over 5 years

Outline





Key concepts

Big data (either data or technologies)

1. Big size: Volume

2. Complex: Variety (heterogeneous) , Velocity (dynamics), Veracity (data quality)

Technologies for Big data

1. Data management: Store, compress, transfer big data

2. Data analytics: Search, analyze, compare, visual analytics

Key challenges

1. The data cannot fit into memory for computation

2. Effective and efficient methods for complex data are lacking

Effective means doing the right thing; efficient means doing things right.
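One common response to the first challenge is to process the data as a stream, keeping only a few running statistics in memory. The sketch below uses Welford's online algorithm for the mean and variance. The `chunks` generator is a hypothetical stand-in for reading blocks from disk or a network, and nothing here comes from the slides themselves.

```python
from typing import Iterable, Iterator, List, Tuple


def chunks(n: int, chunk_size: int) -> Iterator[List[float]]:
    """Hypothetical data source: yields the values 0..n-1 in small chunks."""
    for start in range(0, n, chunk_size):
        yield [float(x) for x in range(start, min(start + chunk_size, n))]


def streaming_mean_var(stream: Iterable[List[float]]) -> Tuple[int, float, float]:
    """Welford's algorithm: one pass, constant memory beyond the current chunk."""
    count, mean, m2 = 0, 0.0, 0.0
    for chunk in stream:
        for x in chunk:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)  # uses the already-updated mean
    variance = m2 / count if count else 0.0  # population variance
    return count, mean, variance


count, mean, var = streaming_mean_var(chunks(100_000, 10_000))
print(count, mean, var)
```

Only one chunk is held in memory at a time, so the same loop works whether the full dataset is megabytes or terabytes.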

A framework of big data

[Figure: framework layers: Visual Analytics, Data Analytics, Data Management]

Source: WAMDM, Web group

Development of machine learning

[Figure: timeline of machine learning development, 1941 to 2010]

Eras: dark age, renaissance, enthusiasm, maturity, fast development

Milestone years marked: 1949, 1956, 1958, 1968, 1970, 1972, 1982, 1986, 1990, 1997

Topics shown: neural modeling; pattern recognition emerged; rote learning; Minsky criticism; symbolic concept induction; math discovery AM; supervised learning; unsupervised learning; PAC learning; NN, GA, EBL, CBL; experimental comparisons; revival of non-symbolic learning; multi-strategy learning; reinforcement learning; statistical learning; successful applications; active & online learning; data mining; ILP; kernel methods; Bayesian methods; probabilistic graphical models; nonparametric Bayesian; ensemble methods; transfer learning; semi-supervised learning; structured prediction; MIML; IR & ranking; dimensionality reduction; deep learning; sparse learning; abduction, analogy

Conferences: ICML (1982), ECML (1989), KDD (1995), PAKDD (1997), ACML (2009)

Sparse and Convex Methods

Convexity

Convex problems (minimizing convex functions over convex sets) can be solved quickly. If necessary, approximate the problem with a convex problem.

Sparsity and sparse modeling

Many interesting problems are high dimensional. But often, the relevant information is effectively low dimensional. Sparse modeling uses a small number of variables to build the model.

Sparse modeling

Selection and construction of a small set of highly predictive variables in high-dimensional datasets.

Lasso regression (Tibshirani, 1996): where sparsity meets convexity.

Sparvexity (the marriage of sparsity and convexity) is one of the biggest developments in statistics and machine learning.
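For reference, the Lasso estimate can be written in its standard textbook form. The notation is an assumption introduced here, not taken from the slides: a response vector $y \in \mathbb{R}^n$, a design matrix $X \in \mathbb{R}^{n \times p}$, and a regularization weight $\lambda \ge 0$.

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^p} \;
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2 \;+\; \lambda \lVert \beta \rVert_1
```

The squared-error term and the $\ell_1$ norm are both convex, so the whole objective is convex. The $\ell_1$ penalty drives many coefficients exactly to zero, which is where sparsity comes in.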

Sparse modeling: Beyond Lasso. S&P 500: Graphical Lasso vs. Parallel Lasso (VIASM, Lafferty lecture 2012)

Dimensionality reduction

The process of reducing the number of random variables under consideration. It can be divided into feature selection and feature extraction.
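The two families can be contrasted on synthetic data. This is a minimal sketch, assuming NumPy is available: PCA via the singular value decomposition stands in for feature extraction, and keeping the highest-variance columns stands in for feature selection. Both are illustrative choices, and the data is generated, not taken from the talk.

```python
import numpy as np


def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Feature extraction: project centered data onto the top-k principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def select_by_variance(X: np.ndarray, k: int) -> np.ndarray:
    """Feature selection: keep the k original columns with the largest variance."""
    idx = np.argsort(X.var(axis=0))[::-1][:k]
    return X[:, np.sort(idx)]


# Synthetic data: 10 observed variables driven by only 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z = pca(X, 2)                 # new variables (combinations of the originals)
S = select_by_variance(X, 2)  # a subset of the original variables
print(Z.shape, S.shape)
```

Extraction builds new variables from combinations of the originals, while selection keeps a subset of the originals unchanged, which is easier to interpret.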

Probabilistic graphical models

A way of describing and representing reality by probabilistic relationships between random variables (observed and unobserved ones).

Two key tasks

Learning: the structure and parameters of the model

Inference: use the observed variables to compute the posterior distributions of the other variables

Probability Theory + Graph Theory
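The inference task can be illustrated on a three-node Bayesian network. The sketch below uses the textbook rain, sprinkler and wet-grass example with made-up probabilities, which are assumptions and not from the slides. It computes a posterior by brute-force enumeration of the joint distribution, which is feasible only for very small networks.

```python
from itertools import product

# Hypothetical conditional probability tables.
P_RAIN = {True: 0.2, False: 0.8}
# P(sprinkler | rain), keyed as [rain][sprinkler]
P_SPRINKLER = {True: {True: 0.01, False: 0.99},
               False: {True: 0.4, False: 0.6}}
# P(wet = True | sprinkler, rain), keyed as (sprinkler, rain)
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0}


def joint(rain: bool, sprinkler: bool, wet: bool) -> float:
    """The graph structure gives P(R, S, W) = P(R) * P(S | R) * P(W | S, R)."""
    p_wet = P_WET[(sprinkler, rain)]
    return P_RAIN[rain] * P_SPRINKLER[rain][sprinkler] * (p_wet if wet else 1 - p_wet)


def posterior_rain_given_wet() -> float:
    """P(rain | wet grass): sum out the unobserved sprinkler variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence


print(round(posterior_rain_given_wet(), 4))
```

Enumeration costs time exponential in the number of unobserved variables, which is why larger networks need exact algorithms that exploit the graph structure, or approximate inference.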

[Figure: Bayesian network for monitoring intensive-care patients (the ALARM network). Nodes: PCWP, CO, HRBP, HREKG, HRSAT, ERRCAUTER, HR, HISTORY, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, VENTTUBE, DISCONNECT, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINVOL, PVSAT, FIO2, PRESS, INSUFFANESTH, TPR, LVFAILURE, ERRLOWOUTPUT, STROKEVOLUME, LVEDVOLUME, HYPOVOLEMIA, CVP, BP]

Graphical models: instances of graphical models

[Figure: taxonomy. Probabilistic models contain graphical models, which are directed (Bayes nets, DBNs) or undirected (MRFs). Instances shown: Hidden Markov Model (HMM), Naïve Bayes classifier, mixture models, Kalman filter model, conditional random fields, MaxEnt, LDA]

Source: Murphy, ML for life sciences

Probabilistic graphical models Topic models: Roadmap to text meaning

Key idea: documents are mixtures of latent topics, where a topic is a probability distribution over words.

Hidden variables, generative processes, and statistical inference are the foundation of probabilistic modeling of topics.

Normalized co-occurrence matrix: C (words × documents) = Φ (words × topics) × Θ (topics × documents)

Blei, D., Ng, A., Jordan, M., Latent Dirichlet Allocation, JMLR, 2003
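The key idea can be made concrete with a toy calculation. A document's word distribution is the mixture of the topic-word distributions, weighted by the document's topic proportions. The vocabulary, the two topics and the proportions below are all invented for illustration.

```python
from typing import List

vocab = ["gene", "dna", "ball", "team"]

# Each topic is a probability distribution over the vocabulary (rows sum to 1).
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # hypothetical "biology" topic
    [0.05, 0.05, 0.50, 0.40],  # hypothetical "sports" topic
]

# A document is a mixture of topics (proportions sum to 1).
doc_topic = [0.75, 0.25]


def word_distribution(doc_topic: List[float], topic_word: List[List[float]]) -> List[float]:
    """p(word | doc) = sum over topics k of p(k | doc) * p(word | k)."""
    n_words = len(topic_word[0])
    return [
        sum(theta_k * topic[w] for theta_k, topic in zip(doc_topic, topic_word))
        for w in range(n_words)
    ]


p = word_distribution(doc_topic, topic_word)
for word, prob in zip(vocab, p):
    print(f"{word}: {prob:.4f}")
```

Topic model learning runs this in reverse: given only observed word counts, it infers the hidden topic-word and document-topic distributions.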

Fully sparse topic model

Topic model: sparse vs. dense

Topic modeling is a key approach to automatically capturing the meaning of text (idea: a topic is a probability distribution over a set of words, and a document is a mixture of latent topics).

Our sparse topic model can deal with big text data (millions of documents and thousands of topics) that current dense topic models cannot handle (reducing storage from 23.3 GB to 33.3 MB for 350,000 documents).

Sparse vs. dense:

Number of topics: thousands vs. hundreds

Inference time: linear vs. nonlinear

Sparse topic representation: 100 times smaller

Sparse document representation: 350 times smaller

Storage: 700 times smaller

[Figure: plate diagram of FSTM with variables w, z, θ, β and plates N, D, K]

Khoat Than and Tu Bao Ho, papers in ECML 2012 and ACML 2012.
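The reported storage saving can be checked directly, and the reason sparsity helps can be shown on a toy vector. This sketch assumes the reported sizes are decimal gigabytes and megabytes. The dictionary-based sparse format is only an illustration and is not the representation used in FSTM.

```python
from typing import Dict, List

# Reported figures: 23.3 GB (dense) vs. 33.3 MB (sparse) for 350,000 documents.
dense_bytes = 23.3e9
sparse_bytes = 33.3e6
ratio = dense_bytes / sparse_bytes
print(f"storage ratio: about {ratio:.0f}x")


def to_sparse(vec: List[float]) -> Dict[int, float]:
    """Keep only the nonzero entries, indexed by position."""
    return {i: v for i, v in enumerate(vec) if v != 0.0}


# A document over 1000 topics that actually uses only two of them.
dense_doc = [0.0] * 1000
dense_doc[3] = 0.7
dense_doc[512] = 0.3
sparse_doc = to_sparse(dense_doc)
print(len(dense_doc), "dense entries vs.", len(sparse_doc), "stored entries")
```

When each document touches only a handful of topics, the stored size scales with the number of nonzero entries instead of the number of topics.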

How fast can the models learn?

[Figure: learning time (s) vs. number of topics on the AP, KOS, and Grolier datasets. Models compared: FSTM, PLSA, LDA, STC]

How fast can the models infer?

[Figure: inference time (s) vs. number of topics on the AP, KOS, and Grolier datasets]

Big data across the federal government (29 March 2012; retrieved 26 Sep 2012)

84 different big data programs, 6 departments

Defense: Autonomous systems ($250M/year)

Homeland security: COE on visualization and data analytics (from natural disasters to terrorist incidents), Rutgers & Purdue Univ.

Energy: High performance storage system to manage petabytes of data, mathematics for analysis of petascale data (machine learning, statistics,…)

Health and Human Services: Disease Control & Prevention

Food and Drug Administration (FDA)

National Aeronautics & Space Administration (NASA)

National Institutes of Health (NIH)

National Science Foundation (NSF): Core techniques and technologies for advancing big data S&E.

Source: www.WhiteHouse.gov/OSTP

NSF: 8 projects on Big Data (call: March 2012; selection: October 2012; start: January 2013)

1. Eliminating the Data Ingestion Bottleneck in Big-Data Applications

2. DataBridge - A Sociometric System for Long-Tail Science Data Collections

3. A Formal Foundation for Big Data Management

4. Analytical Approaches to Massive Data Computation with Applications to Genomics

5. Distribution-based machine learning for high dimensional datasets

6. Genomes Galore - Core Techniques, Libraries, and Domain Specific Languages for High-Throughput DNA Sequencing

7. Big Tensor Mining: Theory, Scalable Algorithms and Applications

8. Discovery and Social Analytics for Large-Scale Scientific Literature.


International collaboration on big data

JST: CREST call for projects

Next-generation application platform core technologies for big data

Next-generation core technologies for big data

Electronic medical record (EMR)

Electronic medical record (EMR) is a computerized medical record created in an organization that delivers care, such as a hospital or physician's office.

Discharge summary: The clinical notes written by the discharging physician or dentist at the time of releasing a patient from the hospital or clinic, outlining the course of treatment, the status at release, and the post-discharge expectations and instructions.

Japan: In 2009, 62.5% of 825 major hospitals (with at least 400 beds) had EMRs. All 40 national university hospitals operate an EMR system. Chiba: pioneer in EMR, >50K cases.

EMR framework

Goal: Create a framework for using EMRs in health care and medical research.

Part 1: Develop new and effective methods for preprocessing the different types of data in EMRs into ready-to-use (intermediate) forms.

Part 2: Develop new and effective methods for combining the preprocessed EMR data with other data sources, and new learning methods to mine those big data for medical problems.

EMR matching

Develop new similarity measures for preprocessed EMRs: density-based similarity, multiple-kernel similarity, string-based similarity, topic-based similarity, ontology-based similarity, etc.

Several matching methods will be considered: (a) k-nearest neighbors, (b) multi-label classification matching, (c) multiple-kernel matching, etc.

Experiment on the EMRs of three hospitals (Chiba University Hospital, Saga University Hospital and St. Luke’s Hospital): nearly 50,000 cases will be used in this study.
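As an illustration of option (a), the sketch below matches a query record to its k nearest neighbors using cosine similarity. It assumes each preprocessed EMR has already been reduced to a fixed-length vector, for example topic proportions. The record IDs, the vectors and the choice of cosine similarity are hypothetical and do not describe the project's actual measures.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def k_nearest(query: List[float], records: Dict[str, List[float]], k: int) -> List[str]:
    """Return the IDs of the k records most similar to the query."""
    ranked = sorted(records, key=lambda rid: cosine(query, records[rid]), reverse=True)
    return ranked[:k]


# Hypothetical records, each a 3-dimensional topic-proportion vector.
records = {
    "case_a": [0.9, 0.1, 0.0],
    "case_b": [0.1, 0.8, 0.1],
    "case_c": [0.7, 0.2, 0.1],
}
query = [0.85, 0.15, 0.0]
print(k_nearest(query, records, k=2))
```

Swapping `cosine` for another similarity function changes the matching behavior without touching the k-nearest-neighbor logic, which is why the similarity measure and the matching method can be studied separately.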

Expert System

Predicting drug side-effects

• EMR data + data on biological factors leading to side-effects (interactions of proteins, enzymes, DNA… in DrugBank, SIDER, KEGG Drug, PROMISCUOUS…) and literature, social networks. Highly heterogeneous.

• Predicting side-effects of a single drug by multi-view learning and multi-label classification.

• Predicting side-effects of poly-drug use by network reconstruction and link analysis: regression-based learning of graphical model structures, link prediction…

[Figure: drug, target protein, side effect]

Materials Design

“… to shorten the materials development cycle from its current 10-20 years to 2 or 3 years.” Materials Genome Initiative (launched in the US in 2011)

Finding an optimal structural model of a material and its physical properties involves a series of optimization processes and strong multivariate correlations that are difficult to uncover.

We use multiple linear regression with LASSO-regularized least squares and least-angle techniques to solve the sparse approximation problem on the space of structural and physical properties of materials.

Dam, Ho et al., 2013
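A sparse linear fit of this kind can be sketched on synthetic data. Note the swap: the code below solves the Lasso problem by cyclic coordinate descent, a simpler technique than the least-angle approach named above. The data, the regularization weight and all function names are assumptions made for illustration. NumPy is assumed to be available.

```python
import numpy as np


def soft_threshold(z: float, t: float) -> float:
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    return float(np.sign(z) * max(abs(z) - t, 0.0))


def lasso_cd(X: np.ndarray, y: np.ndarray, lam: float, n_iter: int = 200) -> np.ndarray:
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y.astype(float).copy()  # equals y - X @ beta while beta is zero
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            residual += X[:, j] * beta[j]          # remove feature j's contribution
            rho = float(X[:, j] @ residual) / n    # correlation with partial residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= X[:, j] * beta[j]          # add the updated contribution back
    return beta


# Synthetic example: 20 candidate descriptors, only two truly matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_beta = np.zeros(20)
true_beta[0], true_beta[5] = 3.0, -2.0
y = X @ true_beta + 0.1 * rng.normal(size=100)

beta = lasso_cd(X, y, lam=0.1)
print(np.flatnonzero(np.abs(beta) > 0.5))
```

The fitted coefficients are slightly shrunk toward zero, which is the price the ℓ1 penalty pays for selecting a small set of descriptors.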

Outline





New paradigm of science and big data

[Figure: paradigms of science: Experimentation, Theory, Computational Science, Data-Intensive Science]

Computational science (using math and computation to do work in other sciences) vs. Computer science (making hardware and software for computation)

CACM, Dec. 2010 CACM, Sep. 2010

Jim Gray (1944-2007)

Computational science (CS) Computational science and engineering (CSE)

[Diagram: CSE at the intersection of Mathematics, Computer Science, and Science & Engineering]

CSE: the development and application of computational models and simulations, often coupled with supercomputers, to solve complex problems in engineering analysis and design as well as in natural phenomena.

Three components of computational science:

Modeling and simulation

Computer science: networks, data analysis

Infrastructure (supercomputers)

Source: PITAC report and SIAM

Competition on supercomputers

Nov. 2010: China's Tianhe-1A, 2.56 petaflops, 23552 processors

Nov. 2012: Cray’s Titan computer, 17.59 petaflops, 560640 processors

June 2012: Japan’s K computer, 10.51 petaflops, 88128 processors

June 2012: SuperMUC, Europe's fastest, 2.9 petaflops, 18432 processors

Lessons learned from Japan’s K computer

Started 21 application programs at the beginning of the K computer project.

Japan national key project, 1 billion USD (2007-2012)

Some national-level problems

Natural disaster prevention and the impacts of climate change (river flow, flood forecasting, ocean simulation, soil erosion...)

Risk assessment for failures of large systems such as nuclear reactors, hydropower plants, banking systems…

CSE in defense, society...

Scientific breakthroughs

Life and biomedical sciences: modeling and predicting the spread of disease, fighting malaria…

Materials science and technology: developing multi-scale material models, from understanding nanostructures to engineering applications for fabricating nanomaterials.

Computational finance: managing risk in investment and markets, forecasting and simulating economic scenarios and plans.

Future work

SHIFT IN MEDICINE RESEARCH

Molecular medicine is essentially based on learning from omics data

[Figure: Black–Scholes European call option pricing surface]

Relationship of human and computer

[Figure: scope of ICT usage, 1990 to 2020]

Stages: Computer Centric, Network Centric, Human Centric

Roles: Productivity Improvement; Business Process Innovation; Creating Knowledge, Supporting Human Activities

Technologies: Internet, PC; ubiquitous terminals, mobile network; cloud computing, sensor technology

Source: Fujitsu (Copyright 2011 FUJITSU LIMITED)

Take home message

Big data and computational science and engineering (CSE) are an emerging technology and an emerging field that will shape the future.

Machine learning and data mining have been changing fast together with statistics, and are the key technology for big data analytics.

There is no universally powerful method. Each big data context needs its own most appropriate solution.

Big opportunities but also big challenges.

Why, and how, should these be pursued in Viet Nam?

Thanks
