
Bad Data is Polluting Big Data


Sponsored by:

‘Bad Data’ Is Polluting Big Data Enterprises Struggle with Real-Time Control of Data Flows

A Global Survey of Big Data Professionals June 2016


Executive Summary

The big data market is still maturing, especially where data in motion is concerned, as evidenced by the lack of best practices or consistent processes for cleansing data and managing data quality. For companies that use big data to optimize current business operations or to make strategic decisions, it is critical that their big data teams have real-time visibility into, and control over, the data at all times.

This report finds that companies leveraging big data are rarely capable of controlling their data flows. Almost 9 out of 10 companies (87%) report that ‘bad data’ pollutes their data stores, and nearly three quarters (74%) indicate there is ‘bad data’ in their stores currently. The findings also reveal a chasm between the problem detection capabilities data experts have today and those they desire. This translates into a lack of real-time visibility and control of data flows, operations, quality and security.


Key Findings

• 87% state that ‘bad data’ pollutes their data stores, while 74% state that ‘bad data’ is currently in their data stores
• Ensuring data quality was the most common challenge, cited by 68% of respondents, and only 34% claimed to be good at detecting divergent data
• 77% responded that they hand code their data flows, while 53% claimed they have to change each pipeline at least several times a month
• Tremendous gaps exist between the capabilities of today’s big data flow management tools and what is needed
• Only 12% of respondents rated their performance as good or excellent across all 5 key data flow operational performance areas
• 72% desire a single pane of glass solution to manage all data flows
• 81% state there is a significant operational impact when they upgrade big data components


METHODOLOGY AND PARTICIPANTS


Goals and Methodology

Research Goal

The primary research goal was to capture how companies manage the flow of big data. The research also investigated and documented current tools’ capabilities, data quality, and efforts to maintain big data pipelines and infrastructure.

Methodology

Big data professionals worldwide were invited to participate in a survey on the topic of big data and ensuring data flow operations and data quality.

The survey was administered electronically, and participants were offered a token compensation for their participation.

Participants

A total of 314 participants who manage big data operations completed the survey.


Companies Represented

Size

• 500 - 1,000: 25%
• 1,000 - 5,000: 29%
• 5,000 - 10,000: 16%
• More than 10,000: 30%

Industry

• Other: 2%
• Food and Beverage: 1%
• Hospitality and Entertainment: 1%
• Media and Advertising: 1%
• Non-Profit: 1%
• Retail: 4%
• Transportation: 5%
• Energy and Utilities: 5%
• Telecommunications: 5%
• Government: 6%
• Services: 6%
• Education: 6%
• Healthcare: 10%
• Manufacturing: 12%
• Financial Services: 18%
• Technology: 18%


Participant Demographics

Location

• United States or Canada: 75%
• Europe: 14%
• Mexico, Central America, or South America: 4%
• Australia or New Zealand: 3%
• Middle East or Africa: 3%
• Asia: 2%

Role

• Business analyst: 6%
• Business stakeholder who uses data to make decisions: 8%
• BI or Analytics Technology Owner (e.g. data architect, head of data platform): 17%
• IT executive with data initiatives in my portfolio: 34%
• IT manager responsible for delivering data initiatives: 52%
• IT staff responsible for implementing and operating data infrastructure (e.g. database administrator, data warehouse architect, …): 56%


DETAILED FINDINGS


What challenges does your company face when managing your big data flows?

Top 3 Challenges for Big Data Flows are Quality, Security and Reliable Operation

• Ensuring the quality of the data (accuracy, completeness, consistency): 68%
• Complying with security and data privacy policies: 60%
• Keeping data flow pipelines operating effectively: 52%
• Building pipelines for getting data into the data store: 47%
• Upgrading big data infrastructure components (Kafka, Hadoop, etc.): 40%
• Adapting pipelines to meet new requirements: 32%
• We have no challenges: 1%


Does ‘bad data’ occasionally get into your data stores?

87% State ‘Bad Data’ Pollutes Their Data Stores

• Yes: 87%
• No: 13%


Do you believe there is any ‘bad data’ in your data stores currently?

74% State ‘Bad Data’ is Currently in Their Data Stores

• Yes: 74%
• No: 26%


How does your company build big data flow pipelines today?

77% of Companies Still Use Hand Coding to Build Big Data Flows

• Coding with Python, Java, etc. or low-level frameworks such as Sqoop, Flume or Kafka: 77%
• Using ETL or data integration tools: 63%
• Using big data ingestion tools such as StreamSets, NiFi, etc.: 27%


On average, how often are changes or fixes made to a typical data flow pipeline?

53% Change Data Flow Pipelines At Least Several Times a Month

• Several times a day: 3%
• Several times a week: 19%
• Several times a month: 31%
• Several times a quarter: 26%
• Several times a year: 12%
• Less often than several times a year: 8%
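The 53% headline is the running total of the three most frequent response categories (3% + 19% + 31%). A minimal Python sketch of that arithmetic follows, using the percentages reported for this question; the variable and function names are illustrative and do not come from the survey.

```python
# Response shares (percent) reported for this question, most frequent category first.
change_frequency = [
    ("Several times a day", 3),
    ("Several times a week", 19),
    ("Several times a month", 31),
    ("Several times a quarter", 26),
    ("Several times a year", 12),
    ("Less often than several times a year", 8),
]


def cumulative_shares(rows):
    """Return (label, running_total) pairs, accumulating from the first row down."""
    total = 0
    result = []
    for label, share in rows:
        total += share
        result.append((label, total))
    return result


cumulative = cumulative_shares(change_frequency)

# Share of respondents who change pipelines at least several times a month.
at_least_monthly = dict(cumulative)["Several times a month"]

for label, running in cumulative:
    print(f"{label:<40} {running:>3}%")
```

The final running total is 99% rather than 100%, which is consistent with each published share being rounded to a whole percent.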


When data structure or semantics unexpectedly change, how big is the impact on the operation of your big data flows (failures, slowdowns, data corruption, etc.)?

85% State Unexpected Structure and Semantic Changes Have Substantial Impact on Dataflow Operations

• Significant impact: 31%
• Moderate impact: 54%
• Minor impact: 11%
• Structure and semantic changes have no effect on our big data flows: 2%
• Data structure and semantic changes never occur: 2%


How would you assess your ability to detect each of the following issues in real-time?

More Than Half of Companies Lack Real Time Information About Data Flow Quality

• Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6%
• The values of incoming data are diverging from historical norms: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3%
• Error rates are increasing: Excellent 7%, Good 37%, Average 38%, Poor 16%, None 1%
• Data flow throughput is degrading or latency is growing: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1%
• A specific data flow pipeline has stopped operating: Excellent 16%, Good 46%, Average 29%, Poor 9%, None 1%


Only 12% Rated Their Performance as ‘Good’ or ‘Excellent’ Across All Five Key Data Flow Metrics

Five Key Data Flow Metrics

1. A specific data flow pipeline has stopped operating
2. Data flow throughput is degrading or latency is growing
3. Error rates are increasing
4. The values of incoming data are diverging from historical norms
5. Identify personal information within the data flows

Number of Key Data Flow Metrics Participants Represented as ‘Good’ or ‘Excellent’

[Pie chart: participants grouped by how many of the five metrics they rated ‘Good’ or ‘Excellent’ (0, 1, 2, 3, 4, or all 5). Segment values shown in the source: 19%, 17%, 19%, 20%, 12%, 12%.]
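A distribution like this one is typically built by counting, for each respondent, how many of the five metrics they rated ‘Good’ or ‘Excellent’, and then tallying respondents by that count. The Python sketch below illustrates the idea only. The three response rows are invented for illustration and are not survey data, and the names are hypothetical.

```python
from collections import Counter

GOOD_OR_BETTER = {"Good", "Excellent"}

# Hypothetical ratings for three respondents, one rating per key metric.
# These rows are made up for illustration; they are NOT taken from the survey.
responses = [
    ["Excellent", "Good", "Good", "Good", "Excellent"],
    ["Good", "Average", "Poor", "Average", "Good"],
    ["Average", "Poor", "Average", "None", "Poor"],
]


def metrics_rated_well(ratings):
    """Count how many metrics a respondent rated 'Good' or 'Excellent'."""
    return sum(1 for rating in ratings if rating in GOOD_OR_BETTER)


# Tally respondents by the number of metrics they rated well (0 through 5).
distribution = Counter(metrics_rated_well(r) for r in responses)

# Share of respondents who rated all five metrics well.
share_all_five = distribution[5] / len(responses)

for count in range(6):
    print(f"{count} metrics: {distribution[count]} respondent(s)")
```

Applied to the real response set, the share for the "all five" group would correspond to the 12% in the headline.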


In your opinion, how valuable would it be to be able to detect each of these issues in real-time?

Substantial Value In Real-Time Data Flow Detection Capabilities

• Identify personal information within the data flows: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%
• The values of incoming data are diverging from historical norms: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%
• Error rates are increasing: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4%
• Data flow throughput is degrading or latency is growing: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%
• A specific data flow pipeline has stopped operating: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3%

The chart does not label the ‘Not valuable’ segments.


Gap Between Current Pipeline Real-Time Visibility Capabilities and Stated Value

A specific data flow pipeline has stopped operating

• Assessed value: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3% (84% Very valuable or Valuable)
• Real-time ability: Excellent 16%, Good 46%, Average 29%, Poor 9% (62% Excellent or Good)


Chasm Between Today’s Data Flow Throughput Metrics and What is Needed

Data flow throughput is degrading or latency is growing

• Assessed value: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%, Not valuable 1% (77% Very valuable or Valuable)
• Real-time ability: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1% (44% Excellent or Good)


Significant Gap Between Error Rate Visibility Value and Current Capabilities

Error rates are increasing

• Assessed value: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4% (79% Very valuable or Valuable)
• Real-time ability: Excellent 7%, Good 37%, Average 38%, Poor 16% (44% Excellent or Good)


Chasm Between Value of Detecting Divergent Data and Current Capabilities

The values of incoming data are diverging from historical norms

• Assessed value: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%, Not valuable 1% (69% Very valuable or Valuable)
• Real-time ability: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3% (34% Excellent or Good)


Large Gap Between Data Privacy Value and Current Capabilities

Identify personal information within the data flows

• Assessed value: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%, Not valuable 2% (75% Very valuable or Valuable)
• Real-time ability: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6% (51% Excellent or Good)
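Across the five issues, the gap between stated value and current ability can be expressed as the difference between two combined shares: respondents answering ‘Very valuable’ or ‘Valuable’, and respondents rating their ability ‘Excellent’ or ‘Good’. The Python sketch below computes that difference in percentage points from the combined figures quoted on the preceding gap charts. The dictionary layout and names are illustrative assumptions.

```python
# Combined top-two shares (percent) quoted on the five gap charts:
# ability = rated 'Excellent' or 'Good'; value = rated 'Very valuable' or 'Valuable'.
top_two = {
    "A specific data flow pipeline has stopped operating": (62, 84),
    "Data flow throughput is degrading or latency is growing": (44, 77),
    "Error rates are increasing": (44, 79),
    "The values of incoming data are diverging from historical norms": (34, 69),
    "Identify personal information within the data flows": (51, 75),
}

# Gap in percentage points between stated value and current real-time ability.
gaps = {issue: value - ability for issue, (ability, value) in top_two.items()}

# Print the issues from the widest gap to the narrowest.
for issue, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{gap:>2} points  {issue}")
```

On these figures the gap ranges from 22 points (pipeline has stopped operating) to 35 points (error rates and divergent data).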


How valuable is it to have a single control panel for comprehensive visibility and management across all of your data flows?

72% Desire A Single Pane of Glass Solution To Manage All Data Flows

• Very valuable: 24%
• Valuable: 48%
• Average value: 24%
• Limited value: 4%


Which of the following do you consider to be the most effective approach to ensuring data quality?

50% State that Data Cleansing at the Source is the Most Effective Quality Practice

• Cleanse data as it flows in from the source: 50%
• Cleanse and update data once it is in the store: 27%
• Data scientists or business analysts cleanse data before using it: 23%


What is the operational impact of upgrading big data components (ingest technologies, message queues, data stores, search stores, etc.)?

81% State There is Significant Operational Impact to Upgrading Big Data Components

• Heavy impact: 17%
• Moderate impact: 64%
• Minor impact: 17%
• No impact: 2%


For more information…

About Dimensional Research

Dimensional Research provides practical marketing research to help technology companies make smarter business decisions. Our researchers are experts in technology and understand how corporate IT organizations operate. Our qualitative research services deliver a clear understanding of customer and market dynamics. For more information, visit www.dimensionalresearch.com.

About StreamSets

For more information, visit www.streamsets.com.


APPENDIX


Tremendous Gaps Exist Between Current Big Data Flow Management Tool Capabilities and What is Needed

Ability to Detect Area in Real-Time Compared Against Stated Value To Detect in Real-Time

• Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store
  Stated Value: Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%, Not valuable 2%
  Current Ability: Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6%
• The values of incoming data are diverging from historical norms
  Stated Value: Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%, Not valuable 1%
  Current Ability: Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3%
• Error rates are increasing
  Stated Value: Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4%, Not valuable 0%
  Current Ability: Excellent 7%, Good 37%, Average 38%, Poor 16%, None 1%
• Data flow throughput is degrading or latency is growing
  Stated Value: Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%, Not valuable 1%
  Current Ability: Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1%
• A specific data flow pipeline has stopped operating
  Stated Value: Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3%
  Current Ability: Excellent 16%, Good 46%, Average 29%, Poor 9%, None 1%


Which of the following approaches for ensuring data quality does your company utilize?

Various Approaches To Managing Data Quality Indicates a Lack of Best Practice

• Data scientists or business analysts cleanse data before using it: 43%
• Cleanse data as it flows in from the source: 54%
• Cleanse and update data once it is in the store: 55%


Approximately, what percentage of data flow changes and fixes are made for day-to-day maintenance and troubleshooting purposes?

Many Must Perform Maintenance and Troubleshooting on Data Flows Routinely

• More than 80%: 3%
• 60% - 80%: 10%
• 40% - 60%: 24%
• 20% - 40%: 27%
• Less than 20%: 36%