Data Collection and Visualization using Big Data: President Election 2017 in Korea

Jongwook Woo, PhD, [email protected]
High-Performance Information Computing Center (HiPIC), California State University Los Angeles
Seoul Elasticsearch Community Meetup, Gangnam, Korea, Aug 10 2017




High Performance Information Computing Center, Jongwook Woo

CalStateLA

Contents

Myself
Introduction To Big Data
Architecture
Demo

Myself

Experience:
Since 2002: Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since Jan 2016: Co-Founder of The Big Link LLC and Wiken
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information Search and Integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data Academic Partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors

Myself (Cont’d)

Experience (Cont’d):
Bringing Big Data R&D and training to Korea since 2009
Collaborating with the City of Los Angeles since 2016
– Collect, Search, and Analyze City Data
• Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon, DongEui
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg

Experience in Big Data

Collaboration
– Council Member of IBM Spark Technology Center
– City of Los Angeles for OpenHub and Open Data
– Startup Companies in Los Angeles
– External Collaborator and Advisor in Big Data
• IMSC of USC
• Pennsylvania State University
• The Big Link, Softzen, Wiken in Korea
Grants and Awards
– Faculty Scholarship Winner of Teradata University Network 2017
– IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
Partnership
– Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata


Two Cores in Big Data

How to store Big Data
How to compute Big Data
Google
– How to store Big Data
• GFS
• Distributed systems on inexpensive commodity computers
– How to compute Big Data
• MapReduce
• Parallel computing with inexpensive computers
– Own super computers
– Published papers in 2003, 2004
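The MapReduce idea named on this slide can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases using a word count, not Google's or Hadoop's implementation. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical input; in a real cluster each document would live on a different node.
    print(word_count(["big data big compute", "store big data"]))
```

In a real cluster the map calls run in parallel on the machines that hold the data, and the shuffle moves the grouped pairs across the network to the reducers.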

Definition: Big Data

Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]
Hadoop and Spark
– Inexpensive super computer
– More accessible to the public than traditional super computers
• You can store and process your applications
• In your university labs, small companies, research centers
Others
– Cloud Computing Big Data services
• Amazon AWS, IBM Bluemix, Microsoft Azure
– NoSQL DB (Cassandra, MongoDB, Redis, HBase)
– ElasticSearch

Spark

In-Memory Data Computing
– Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
– HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New Programming with faster data sharing
– Good for iterative graph algorithms, Machine Learning
– Interactive query

ElasticSearch

Full Text Search and Visualization Server
Getting more popular than Solr
– ElasticSearch, Kibana, ES-Hadoop, Logstash,…
Based on Apache Lucene library
Horizontally Scalable

ElasticSearch (Cont’d)

Elastic Stack
– 100% open source
– No enterprise edition
– All new versions with 5.0

ElasticSearch (Cont’d)

ES-Hadoop: Elasticsearch for Hadoop
• Exchange data between Hadoop HDFS and ElasticSearch


Big Data Analysis Flow

Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau,…)
Data Visualization
– Qlik, Datameer, Excel PowerView

Data Engineering

Data Source
– Twitter streaming API
• using the keywords "문재인", "moonriver365", "안철수", "cheolsoo0919", "유승민", "yooseongmin2017", "홍준표", "HongSkyangel808", "심상정", "sangjungsim"
• Roughly: April 28 2017 – May 11 2017
Data Collection
– Apache NiFi for streaming data
• supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic
Data Storage
– ElasticSearch
– Hadoop HDFS at Azure
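As a rough illustration of what keyword tracking does to the tweet stream, the sketch below keeps only tweets whose text contains one of the keywords listed on this slide. It uses a simplified, case-insensitive substring match. That is an assumption made for illustration, not the Twitter streaming API's actual matching rules, and it is not how NiFi filters tweets. The function names and sample tweets are hypothetical.

```python
# Keywords copied from the slide: candidate names and their Twitter handles.
TRACK_KEYWORDS = [
    "문재인", "moonriver365",
    "안철수", "cheolsoo0919",
    "유승민", "yooseongmin2017",
    "홍준표", "HongSkyangel808",
    "심상정", "sangjungsim",
]


def matched_keywords(text, keywords=TRACK_KEYWORDS):
    """Return the tracked keywords found in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def filter_tweets(tweets, keywords=TRACK_KEYWORDS):
    """Keep only the tweets whose 'text' field mentions at least one tracked keyword."""
    return [t for t in tweets if matched_keywords(t.get("text", ""), keywords)]


if __name__ == "__main__":
    # Hypothetical sample tweets, shaped like the 'text' field of a Twitter status.
    sample = [
        {"text": "RT @moonriver365: 문재인 후보"},
        {"text": "weather is nice"},
        {"text": "Reply to @HONGSKYANGEL808"},
    ]
    for tweet in filter_tweets(sample):
        print(matched_keywords(tweet["text"]), tweet["text"])
```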

Data Engineering (Cont’d)

Data Analysis and Prediction: In the future
– Spark ML, Spark SQL, Hadoop Hive
Data Visualization
– Kibana in ElasticSearch

Apache NiFi

• NiFi-1.1.2: getTwitter, putElasticSearch5, putHDFS

Hadoop Spark Cluster: HDInsight in Azure

vCores    Memory (GB)    Local SSD (GB)
4         28             200

ElasticSearch in HDInsight

Did not launch the ElasticSearch service in Azure
Instead, installed ES5 on the Linux head node of the HDInsight cluster
– ElasticSearch
• 5.3.1
– Kibana
• 5.3.2

Mapping to ES

Temp-Spatial Analysis
– For matching the Twitter date format to ES

curl -XPUT localhost:9200/_template/elect17 -d '{
  "template" : "elect17*",
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "default" : {
      "properties" : {
        "created_at" : {
          "type" : "date",
          "format" : "EEE MMM dd HH:mm:ss Z YYYY"
        },
        "coordinates" : {
          "properties" : {
            "coordinates" : {
              "type" : "geo_point"
            },
            "type" : {
              "type" : "string"
            }
          }
        },
        "user" : {
          "properties" : {
            "screen_name" : {
              "type" : "string",
              "index" : "not_analyzed"
            },
            "lang" : {
              "type" : "string",
              "index" : "not_analyzed"
            }
          }
        }
      }
    }
  }
}'
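The custom date format in the template is needed because Twitter's created_at field is not ISO 8601. It looks like "Sun Apr 23 22:59:59 +0000 2017", as in the Hive output later in these slides. The sketch below parses that string in Python to show what each part of the pattern corresponds to. The strptime directives are a Python analogue chosen for illustration, not the pattern syntax ElasticSearch itself uses, and the helper name parse_created_at is hypothetical.

```python
from datetime import datetime

# Python analogue of the ES pattern "EEE MMM dd HH:mm:ss Z YYYY":
#   EEE -> %a (weekday), MMM -> %b (month), dd -> %d, HH:mm:ss -> %H:%M:%S,
#   Z   -> %z (UTC offset such as +0000), YYYY -> %Y (year)
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a Twitter 'created_at' string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


if __name__ == "__main__":
    created = parse_created_at("Sun Apr 23 22:59:59 +0000 2017")
    # Once parsed as a real date, the value can be bucketed by day or hour,
    # which is what a time-based visualization needs.
    print(created.isoformat())
```

Without a matching format, a string like this cannot be parsed as a date, so time-based charts would have no usable timestamp field.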


K-Election 2017 (April 29 – May 9)


ES-Hadoop

Install ES-Hadoop

$ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.1.zip
$ unzip /tmp/elasticsearch-hadoop-5.3.1.zip -d /tmp
$ cp /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp/elasticsearch-hadoop-5.3.1.jar
$ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp
$ sudo cp elasticsearch-spark-20_2.11-5.3.1.jar /usr/hdp/current/spark2-client/

ES-Hadoop (Cont’d)

Add the ES-Hadoop library to Hive with one of the following:

$ hive
hive> add jar hdfs:///tmp/elasticsearch-hadoop-5.3.1.jar;
hive> add jar /tmp/elasticsearch-hadoop-5.3.1.jar;
hive> add jar file:///tmp/elasticsearch-hadoop-5.3.1.jar;
hive> list jar;
file:///tmp/elasticsearch-hadoop-5.3.1.jar

ES-Hadoop (Cont’d)

hive> select * from elect17_test LIMIT 10;
OK
856281525070909440 NULL NULL NULL NULL RT @sydbris: 이정도는우리문재인후보님이절대말씀하시지않겠지. "넌내가유신반대투쟁하고민주화운동할때친구들이랑고대앞하숙방에모여서 xx모의했냐?" Sun Apr 23 22:59:59 +0000 2017
856281524995407872 NULL NULL NULL NULL RT @choomiae: 존경하는시흥시민여러분!…


Demo

Azure Portal
Ubuntu VM
ElasticSearch
NiFi
Kibana: April 29 – May 10
Hive with ES-Hadoop
– Test with the data on April 23 – April 24

Spark Big Data Training and R&D

HiPIC, California State University Los Angeles
Supported by
– Databricks and its cloud computing services
– Amazon AWS, IBM Bluemix, MS Azure
– Hortonworks, Cloudera
– Teradata
– ElasticSearch
– Qlik, Tableau


Databricks Partners

Training Hadoop and Spark

Cloudera visits to interview Jongwook Woo


Training Hadoop on IBM Bluemix at California State Univ. Los Angeles

Conclusion

K-Elect 2017 in ES5 and HDInsight
ES5
– Easy to collect and visualize
HDInsight
– Data and predictive analysis possible


Question?

References

1. Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas, July 18-21, 2011
2. Jongwook Woo, “Market Basket Analysis Algorithms with MapReduce”, DMKD-00150, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, “Big Data Trend and Open Data”, UKC 2016, Dallas, TX, Aug 12 2016
4. Jongwook Woo, Business Data Analysis LA at Databricks, HiPIC of CalStateLA, https://docs.databricks.com/spark/latest/training/cal-state-la-biz-data-la.html
5. HiPIC of California State University Los Angeles, https://github.com/hipic/spark_mba
6. Hadoop, http://hadoop.apache.org
7. Databricks, http://www.databricks.com
8. DS320: DataStax Enterprise Analytics with Spark
9. Cloudera, http://www.cloudera.com
10. Hortonworks, http://www.hortonworks.com