soshi-nemoto
Big data - Overview -
2016/03/04 Mulodo Vietnam Co., Ltd.
What is “Big data”?
Types
 - Science : LHC (Large Hadron Collider)
 - Medical : Gene analysis
 - Market (IT?) : Business use
History of Data processing
50’s
 - “BI : Business Intelligence” (1958)
80’s
 - “DSS : Decision support system” (80’s)
 - “SQL86” (1986)
 - “Knowledge Discovery in Databases” (1989)
 - “BI (Redefinition)” (1989)
90’s
 - “Data Warehouse” (1990)
 - “OLAP: online analytical processing” (1993)
 - “Improvement of computing power” (90’s)
 - “Price reduction of storage” (90’s)
 - “Data Mining” (1996)
2000’s
 - “Spread of The Internet” (00’s)
 - ‘Google: Big data stack 1.0’ (00’s)
 - “MapReduce framework” (2004)
 - “Independence of Hadoop project from Nutch” (2006)
 - “Amazon: S3” (2006)
 - “Explosive prosperity of EC” (00’s)
2010’s
 - “Big data” in ‘The Economist (UK)’ (2010)
 - “Google: BigQuery” (2010)
 - “fluentd” (2011)
 - “Amazon: Redshift” (2012)
 - “DMP: data management platform” (10’s)
 - “Google: Big data stack 2.0-3.0” (10’s)
 - “Apache Crunch, Impala, Presto,...” (10’s)
Let's look back on the history of Big data
(Especially storage and query engine)
[Figure: timeline across the 80's, 90's, 00's and 10's]
[Timeline: SQL (86)]
SQL: easy to use, structured/ruled, independent from storage.
[Timeline adds: Map Reduce, big data stack/GFS]
Data too huge to handle on a usual RDBMS.
Process HUGE data in batch-like jobs (for huge logs).
But, proprietary.
[Timeline adds: Hadoop, HBase]
Open source products!
We need source. We love freedom.
[Timeline adds: Hive, pig]
Easy to use
E-commerce requires huge data analysis.
M/R is too heavy to use...
Hive: SQL -> (M/R) -> Result
Pig: Original language <=> (M/R)
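To make "SQL -> (M/R) -> Result" concrete, here is a minimal single-process sketch of how a query such as SELECT page, COUNT(*) FROM logs GROUP BY page could be expressed as map, shuffle and reduce steps. The LOGS data and all function names are hypothetical, and this is not how Hive itself is implemented; it only illustrates the shape of the work that the SQL layer hides.

```python
from collections import defaultdict

# Hypothetical access-log records: (user, page).
LOGS = [
    ("alice", "/home"),
    ("bob", "/cart"),
    ("alice", "/cart"),
    ("carol", "/home"),
    ("bob", "/home"),
]


def map_phase(records):
    """Map: emit one (page, 1) pair per log record."""
    for _user, page in records:
        yield page, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: sum the values of each key (the COUNT(*) per group)."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(records):
    """Rough equivalent of: SELECT page, COUNT(*) FROM logs GROUP BY page."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_by_page(LOGS))
```

Writing three functions for one GROUP BY is the "too heavy" part; a SQL or Pig layer generates these steps from a one-line query.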
[Timeline adds: big data stack/CFS, Dremel, BigQuery]
We want to analyze huge data interactively.
Google announced Dremel for interactive analysis of huge data.
Dremel
 1. Divide the SQL for shards.
 2. Process them in parallel.
It's not a wrapper of M/R, but processes SQL in a massively parallel way (i.e. a full scan for each query with thousands of servers, without an index).
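The two steps above can be sketched as follows. This is only an illustration under stated assumptions, not Dremel's actual design: threads stand in for servers, the rows and the `bytes` column are made up, and the query is fixed to SELECT COUNT(*), SUM(bytes) WHERE bytes >= min_bytes. Each worker fully scans its own shard without any index, and a root step combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(rows, n_shards):
    """Split the table into n_shards slices (round-robin)."""
    return [rows[i::n_shards] for i in range(n_shards)]


def scan_shard(shard, min_bytes):
    """Leaf step: full scan of one shard, no index.

    Returns the partial (count, sum) for rows with bytes >= min_bytes.
    """
    count = 0
    total = 0
    for row in shard:
        if row["bytes"] >= min_bytes:
            count += 1
            total += row["bytes"]
    return count, total


def run_query(rows, min_bytes, n_shards=4):
    """Root step: send the same query to every shard, then merge partials."""
    shards = make_shards(rows, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = list(pool.map(lambda s: scan_shard(s, min_bytes), shards))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total


# Hypothetical table: 100 rows with bytes = 1..100.
ROWS = [{"bytes": b} for b in range(1, 101)]

if __name__ == "__main__":
    print(run_query(ROWS, min_bytes=51))
```

COUNT and SUM merge by simple addition, which is why each shard can be scanned independently; the answer is the same as a single sequential scan over all rows.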
[Timeline adds: Presto, Impala]
Open source products!
We need source. We love freedom.
Add social circumstances on this figure.
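[Figure: timeline, 80's to 10's, with social circumstances added]
Storage and query engines: SQL (86), Map Reduce, big data stack/GFS, big data stack/CFS, Hadoop, HDFS, HBase, Hive, pig, Dremel, BigQuery, Presto, Impala, S3, Redshift
Data processing concepts: BI, DSS, BI (redefinition), DWH, Data Mining, DMP
Social circumstances: Improvement of computing power, Price reduction of storage, Spread of The Internet, Explosive prosperity of EC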
Many requests Many solutions...
But you can think about which solution is better for your project. (I hope)
How to use Big data
A) How to aggregate data?
 - huge amount of data
 - too high frequency data
B) How to maintain data?
 - Data will increase....
 - Query engine cost, Storage cost.
 - Data check cost
C) How to analyze data? (what for?)
 - UI / UX
 - Understanding of business requirements
How to aggregate data
<Libevent shock> parallel -> event driven.
 * similar to “parallel -> USB”
Fluentd
 - Async
 - (Pseudo) realtime <-> Periodic Batch
other
 - logstash
 - Lambda and Kinesis (AWS)
 - ...
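The "async, (pseudo) realtime" style of collection, as opposed to a periodic batch, can be sketched with a small event-driven forwarder. This is a hypothetical illustration built on Python's asyncio, not Fluentd's API or internals: the producer hands each event to a queue and moves on, while a consumer task forwards events downstream in small chunks as they arrive. The `sink` list stands in for a real destination, and a timer-based flush is left out to keep the sketch short.

```python
import asyncio


async def forwarder(queue, sink, chunk_size):
    """Consume events as they arrive and flush them in small chunks."""
    chunk = []
    while True:
        event = await queue.get()
        if event is None:  # sentinel: stop after flushing the remainder
            break
        chunk.append(event)
        if len(chunk) >= chunk_size:
            sink.append(list(chunk))  # "send" one chunk downstream
            chunk.clear()
    if chunk:
        sink.append(list(chunk))


async def run(events, chunk_size):
    queue = asyncio.Queue()  # unbounded, so put() never blocks the producer
    sink = []
    task = asyncio.create_task(forwarder(queue, sink, chunk_size))
    for event in events:
        await queue.put(event)  # producer does not wait for delivery
    await queue.put(None)
    await task
    return sink


def collect(events, chunk_size=3):
    """Run the sketch to completion and return the chunks that were sent."""
    return asyncio.run(run(events, chunk_size))


if __name__ == "__main__":
    print(collect(list(range(7)), chunk_size=3))
```

A periodic batch would instead wait for a whole file or time window before sending anything; here a chunk leaves as soon as `chunk_size` events have accumulated, which is what makes the delivery pseudo-realtime.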
How to analyze data
UI / UX
<solution set for log monitoring>
 * ELK : logstash + Elasticsearch + Kibana
 * Fluentd + Norikra + GrowthForecast
Next :
 * Trying some storage
 * Trying to build a system design
 * Diving into some solutions