
Bootstrapping Big Data with Spark SQL and Data Frames

Brock Palen | @brockpalen | brockp@umich.edu

http://arc-ts.umich.edu

In Memory
● Small to modest data
● Interactive or batch work
● Might have many thousands of jobs
● Excel, R, SAS, Stata, SPSS

In Server
● Small to medium data
● Interactive or batch work
● Hosted/shared and transactional data
● SQL / NoSQL
● Hosted data pipelines
● iRODS / Globus
● Document databases

Big Data
● Medium to huge data
● Batch work
● Full table scans
● Hadoop, Spark, Flink
● Presto, HBase, Impala

Coming Soon: Bigger Big Data

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
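For example, the same reader API covers several of these backends. A minimal sketch from the pyspark shell, where sqlContext is predefined (the paths and bucket name below are placeholders, not real datasets; Cassandra and HBase additionally need their connector packages on the classpath):

# Same DataFrameReader regardless of where the data lives
df_hdfs = sqlContext.read.parquet("hdfs:///user/me/reddit.parquet")
df_s3 = sqlContext.read.json("s3a://my-bucket/comments/*.json")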

http://spark.apache.org

What we will run

SELECT author, subreddit_id, count(subreddit_id) AS posts
FROM reddit_table
GROUP BY author, subreddit_id
ORDER BY posts DESC

Spark Submit Options

spark-submit / pyspark take R, Python, or Scala:

pyspark \
  --master yarn-client \
  --queue training \
  --num-executors 12 \
  --executor-memory 5g \
  --executor-cores 4

● pyspark for interactive work
● spark-submit for scripts
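In the pyspark shell, sc and sqlContext come predefined; a script run with spark-submit must build its own. A minimal sketch in the Spark 1.6-style API (the filename and app name are hypothetical):

# my_job.py -- run with: spark-submit [options] my_job.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="reddit-counts")   # the shell defines this for you; scripts do not
sqlContext = SQLContext(sc)

reddit = sqlContext.read.parquet("reddit_full.parquet")
print(reddit.count())

sc.stop()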

Reddit History: August 2016 -- 279,383,793 Records

Data Format Matters

Format             Type    Size    Size w/ Snappy   Load / Query Time
Text / JSON / CSV  Row     1.7 TB  --               2,353 s / 1,292 s
Parquet            Column  229 GB  117 GB           3.8 s / 22.1 s

A JSON-to-Parquet conversion sketch follows after the format list below.

Details: http://bit.ly/2qITrEL

Other types:
● Avro
● ORC
● Sequence File
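Getting from a row format to Parquet is a one-time conversion job. A minimal sketch (paths are placeholders; the codec is set explicitly since Snappy is not the default in Spark 1.6):

# Convert line-delimited JSON to Snappy-compressed Parquet
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
raw = sqlContext.read.json("reddit_2016_08.json")
raw.write.parquet("reddit_full.parquet")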

Parquet Format

Use parquet-tools to inspect Parquet data and schemas on Hadoop filesystems.

Details: https://parquet.apache.org/

Data Frames
● Popularized by R and Python Pandas
● Spark Data Frames and Pandas share most of their API
● Similar to SQL Object Relational Models
  ○ i.e. table.column, table.function(), etc.
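The pandas parallel is close enough that most one-liners translate directly. A sketch (the pandas frame here is a toy example; 'reddit' refers to the Spark DataFrame loaded on the next slide):

import pandas as pd

# Tiny toy pandas frame with the same columns
pdf = pd.DataFrame({"author": ["alice", "bob"], "over_18": [0, 1]})

print(pdf[pdf["over_18"] == 1])          # pandas: filter rows
# reddit.filter(reddit["over_18"] == 1)  # Spark: same idea, evaluated lazily

print(pdf.groupby("author").size())      # pandas: count per group
# reddit.groupBy("author").count()       # Spark: same idea

The main difference: pandas evaluates eagerly in local memory, while Spark builds a lazy, distributed query plan.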

Getting Started

# Load the Parquet data into a DataFrame
reddit = sqlContext.read.parquet("reddit_full.parquet")

reddit.show()          # print the first rows

reddit.count()         # number of records

reddit.printSchema()   # column names and types

reddit.select("over_18").show()   # project a single column

reddit.filter(reddit['over_18'] == 1).count()   # count matching rows

Started Ctd.

reddit.groupBy("author", "subreddit_id").count().show()

from pyspark.sql.functions import *

# Same counts via an explicit aggregate function
reddit.groupBy("author", "subreddit_id").agg(count("subreddit_id")).show()

Aggregations Ctd.

# Name the aggregate column and sort descending
reddit.groupBy("author", "subreddit_id") \
    .agg(count("subreddit_id").alias("posts")) \
    .orderBy("posts", ascending=[0]).show()

# Same, but drop posts whose author was deleted
reddit.groupBy("author", "subreddit_id") \
    .agg(count("subreddit_id").alias("posts")) \
    .orderBy("posts", ascending=[0]) \
    .filter(reddit.author != '[deleted]').show()
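The same sort can be written with the desc() helper from pyspark.sql.functions, which reads more clearly than ascending=[0]; filtering before the groupBy is equivalent here and avoids aggregating rows that are about to be thrown away. A sketch:

from pyspark.sql.functions import count, desc

reddit.filter(reddit.author != '[deleted]') \
    .groupBy("author", "subreddit_id") \
    .agg(count("subreddit_id").alias("posts")) \
    .orderBy(desc("posts")).show()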

Data Frames and Spark SQL
● SQL in Spark is a feature of Data Frames
  ○ They can be mixed
  ○ Make a table with <dataframe>.registerTempTable("tablename")

Spark SQL

reddit.registerTempTable("reddit_table")

results = sqlContext.sql("""
    SELECT author, subreddit_id, count(subreddit_id) AS posts
    FROM reddit_table
    WHERE author NOT LIKE '[deleted]'
    GROUP BY author, subreddit_id
    ORDER BY posts DESC""")

results.show()

results.write.parquet("myresults")

Other Handy Functions

# Get the query plan
<statement>.explain()

# Save data partitioned by a column
results.write.parquet("myresults_p.parquet", mode="overwrite", partitionBy="over_18")
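Partitioning pays off at read time: a filter on the partition column lets Spark skip whole directories instead of scanning every file. A sketch using the output written above:

# Read back one partition; check the plan to see the pruned input
res = sqlContext.read.parquet("myresults_p.parquet")
res.filter(res["over_18"] == 1).explain()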

Read: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html

Use Sqoop to import data from DBMS: http://sqoop.apache.org/
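Sqoop is the bulk-import route; for smaller tables, Spark 1.6 can also read a DBMS directly over JDBC. A sketch (the URL, table name, and credentials are placeholders, and the JDBC driver jar must be on the classpath):

# Direct JDBC read into a DataFrame (placeholder connection details)
users = sqlContext.read.jdbc(
    url="jdbc:postgresql://dbhost:5432/mydb",
    table="users",
    properties={"user": "me", "password": "secret"})
users.show()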

Thank You - Contact

• hpc-support@umich.edu

• http://arc-ts.umich.edu/

• @ARCTS_UM

• brockp@umich.edu

• http://myumi.ch/aV7kz
