Twitter Tag: #briefr The Briefing Room
Welcome
Host: Eric Kavanagh
eric.kavanagh@bloorgroup.com @eric_kavanagh
Mission
Reveal the essential characteristics of enterprise software, good and bad
Provide a forum for detailed analysis of today’s innovative technologies
Give vendors a chance to explain their product to savvy analysts
Allow audience members to pose serious questions... and get answers!
Topics
September: HADOOP 2.0
October: DATA MANAGEMENT
November: ANALYTICS
The Great Divide
Close the Gap
Empower Business Users
Shift Focus of IT
Developers are Third Leg
Analyst: Mark Madsen
Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
Trifacta
Trifacta offers a platform for data transformation and preparation
The interface is rich in visualization and provides a productive data wrangling capability
The platform also includes access to raw data in Hadoop, providing analysts and data scientists with secure, governed data
Guests:
Will Davis, Director of Product Marketing, Trifacta
Alon Bartur, Principal Product Manager, Trifacta
Messy Data Requires Data Wrangling
Question, Analyze, Insight
Data Wrangling: Discover, Structure, Clean, Enrich, Distill
The Bottleneck on Hadoop
IT: Ingestion, Storage, Processing
LOB: Analysis & Consumption
Data sources: Business System Data, Machine Generated Data, Third Party Data
Tools: Java, Python, R, Pig, etc…
How do you move from here? To here?
80% of the work in any data project is preparing the data for analysis
Breakdown of Communication Between IT & LOB
LOB: How can I access the data in Hadoop?
IT: What do you want to analyze?
LOB: I can’t tell you until I see the data; let me see the data first.
IT: I can’t just point you to the raw data; you’ll need to tell me.
Bringing Hadoop to an Analyst’s Fingertips
“I want direct access to the raw data so I can actually see the content of different datasets to define my analytic requirements.”
John, Data Analyst
Wrangle Data Using This?
Analyst Workflow on Hadoop
1. Register Hadoop Data Sets in Trifacta (HDFS)
2. Visualize, Interact & Define Transformation Script (HDFS)
3. Execute Script on Entirety of Data Set at Scale in Hadoop (HDFS; execution in Pig or Spark)
4. Select Transformation Output Format & Location (Hadoop: HDFS as Parquet or Avro, or a table in HCatalog; analytic tools: Tableau, R, etc…)
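The workflow above (define a transformation script interactively on a small view of the data, then execute the same script over the whole data set) can be sketched in plain Python. This is only an illustration of the pattern, not Trifacta's API: the inline `RAW` data stands in for a file in HDFS, and the step names `strip_name` and `parse_amount`, the `SCRIPT` list, and the CSV output are all assumptions made for the example.

```python
import csv
import io
from itertools import islice

# Hypothetical raw data standing in for a registered data set in HDFS.
RAW = """id,name,amount
1, Alice ,10.5
2,BOB,
3, carol,7
"""

def strip_name(row):
    # Trim whitespace and normalize capitalization of the name field.
    row["name"] = row["name"].strip().title()
    return row

def parse_amount(row):
    # Convert amount to a float; empty strings become None.
    row["amount"] = float(row["amount"]) if row["amount"] else None
    return row

# The "transformation script": an ordered list of steps (illustrative names).
SCRIPT = [strip_name, parse_amount]

def read_rows(text):
    # Yield one dict per record from CSV text.
    return csv.DictReader(io.StringIO(text))

def run_script(rows, script):
    # Apply every step, in order, to each row.
    for row in rows:
        row = dict(row)
        for step in script:
            row = step(row)
        yield row

def write_csv(rows, fieldnames):
    # Serialize the transformed rows; CSV is used here only for simplicity.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Step 2: preview the script on a small sample (first two rows).
sample = list(run_script(islice(read_rows(RAW), 2), SCRIPT))

# Step 3: execute the same script over the entire data set.
full = list(run_script(read_rows(RAW), SCRIPT))

# Step 4: choose an output format and location (here, CSV text in memory).
output = write_csv(full, ["id", "name", "amount"])
```

The point of the design is that the script is data: the same ordered list of steps runs on the sample and on the full set, so what the analyst sees during preview is what gets executed at scale.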
Copyright Third Nature, Inc.
Ideas about how we make data available are changing
Making data available is not the same as enabling its use
From scarcity to abundance
All the data
Common, typed, tabular data
The bottleneck is us
Changed design assumption: analysis isn’t read-only
The results of analysis can, and often do, feed back into the system from which they originate.
Much of the data is being read, written and processed in real time.
Our design point in IT was not changing tables and ephemeral patterns.
Schema
In a reporting world data and processing are bounded
No consideration for feedback loops and change
Processing only happens here
Carefully controlled SQL-only access
Nobody creates new information
Sources few and well understood
Complex DI is controlled by IT
Schemas are few and designed
Tools are authorized, few in number and kind
One way flow
In an analysis world flow is unbounded and continuous
Feedback loops allowed
End-of-analysis dataset may be start of a BI dataset
Continuous data integration and delivery
Files are back as both input and storage
Minimal barrier of / control on collection
Areas of provisioned data
Any shape in, rectangles out
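"Any shape in, rectangles out" can be illustrated with a small sketch that takes nested, irregular records and produces a table with one fixed set of columns. The function names `flatten` and `to_rectangle`, the dotted column naming, and the sample `events` are assumptions for illustration only, not something taken from the presentation.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

def to_rectangle(records):
    """Return (columns, rows) where every row has the same columns."""
    rows = [flatten(r) for r in records]
    # The union of all keys defines the columns; missing values become None.
    columns = sorted({c for row in rows for c in row})
    return columns, [[row.get(c) for c in columns] for row in rows]

# Hypothetical irregular input: records with different shapes.
events = [
    {"user": {"id": 1, "city": "Oslo"}, "action": "click"},
    {"user": {"id": 2}, "action": "view", "ms": 120},
]

columns, table = to_rectangle(events)
```

Here the two input records have different fields, yet every output row has the same four columns, which is the shape most BI and analysis tools expect.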
The model and reality of ETL: one-way pipes
DI BI
Our methods tell us that data integration and analysis are separate, and schema comes first as the point of synchronization between them.
Schema
Schema
Data isn’t just source or target, it’s a continuum
Unusable data that needs engineering: ETL
Data that can be used: BI
Fuzzy areas of data that need engineering and / or composing: exploration, blending & discovery
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
Tools were designed with data model assumptions
Source data model complexity: Simple, Complex
Target data model complexity: Simple, Complex
Blending
Selectively linking and changing data, producing a simpler data model as output
ETL
Multiple complex source models, large complex target model
Application integration
Basic movement of data from one place to another, minimal changes to data
Processing & Analytics
Deriving new data from a relatively simple dataset (like an event stream)
Some questions to start discussion
1. Who is this product aimed at: end users, analysts or the people who get and manage data for others?
2. Can you get data from places other than Hadoop?
3. How do you deal with WYSIWYG data preparation when the dataset is very large?
4. How well does it handle small datasets?
5. How do you take something from one-time process to a repeatably executed process in a production environment?
6. What analysis tool integration is available?
7. What maintenance features are available?
CC Image Attributions
Thanks to the people who supplied the Creative Commons licensed images used in this presentation:
Tokyo forum - http://flickr.com/photos/fukagawa/2004106475/
klein_bottle_red.jpg - http://flickr.com/photos/sveinhal/2081201200/
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
About the Presenter
Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
About Third Nature
Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, information strategy and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place.
Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.
Upcoming Topics
www.insideanalysis.com
September: HADOOP 2.0
October: DATA MANAGEMENT
November: ANALYTICS
THANK YOU for your ATTENTION!
Some images provided courtesy of Wikimedia Commons and "Grand Canyon view from Pima Point 2010" by Chensiyuan - Own work. Licensed under GFDL via Commons
- https://commons.wikimedia.org/wiki/File:Grand_Canyon_view_from_Pima_Point_2010.jpg#/media/File:Grand_Canyon_view_from_Pima_Point_2010.jpg