Data Wrangling: The Elephant in the Room of Big Data · Data Wrangling Research • So data wrangling is a research challenge, currently without a community or established priorities

Data Wrangling: The Elephant in the Room of Big Data

Norman Paton

University of Manchester

Data Wrangling

• Definitions:

– a process of iterative data exploration and transformation that enables analysis [1].

– the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools [2].

[1] S. Kandal, et al., Research Directions in Data Wrangling: Vizualizations and

Transformations for usable and credible data, Information Visualization, 10(4),

271-288, 2011.

[2] http://en.wikipedia.org/wiki/Data_wrangling, 12th May 2015.

http://en.wikipedia.org/wiki/Data_wrangling

http://en.wikipedia.org/wiki/Data_wrangling

Extract, Transform and Load - 1

• Of course, this is not completely new, and Extract, Transform and Load (ETL) tools have been around for a significant time.

• ETL tools support source wrapping, warehouse population, workflow languages, etc.

• ETL vendors also have “big data” offerings.

www.talend.com

Extract, Transform and Load - 2

• ETL tools are clearly useful, with products from database vendors and data integration companies.

• ETL tools emerged to support data warehousing, and thus typically have roots in enterprise settings.

• ETL tools typically involve significant manual effort.

• ETL costs no doubt vary widely from project to project, but are quoted as representing up to 80% of the development time in warehousing projects [1].

[1] S. Kandal, et al., Research Directions in Data Wrangling: Vizualizations and

Transformations for usable and credible data, Information Visualization, 10(4),

271-288, 2011.

Big Data – does it make a difference?

• Big data is sometimes characterised by the 4 V’s: – Volume – the scale of the data,

– Velocity – speed of change,

– Variety – different forms of data, and

– Veracity – uncertainty of data.

• So size matters but, it isn't everything.

• Data wrangling for big data must address all four V’s at the same time.

• Classical, substantially manual ETL may struggle with numerous and rapidly changing sources.

The Business Case - 1

• There is strong support for big data being commercially important:

– The International Institute of Analytics estimate the Big Data market at $16.1B in 2014, growing 6 times faster than the overall IT market. Projection for 2017 is ~$50B.

– Gartner (2014) estimates the Data Integration tool market at over $2.2B at end 2013, predicted to rise to ~$3.6B by 2018.

– Gartner (2014) estimates the Data Quality market as $960M in software revenue at end 2012 predicted to rise to $2B by 2017.

The Business Case - 2

• … but many of the potential beneficiaries of big data cannot simply throw resource at data wrangling.

• The government’s Information Economy Strategy states:

– the overwhelming majority of information economy businesses – 95% of the 120,000 enterprises in the sector – employ fewer than 10 people.

Case Study: e-Commerce

• If you run an e-Commerce site, then you need to be able to understand pricing trends among your competitors.

• This may involve getting to grips with:

– Volume: thousands of sites;

– Velocity: sites, site descriptions and contents changing;

– Variety: in format, content, user community, …; and

– Veracity: unavailability, inconsistent descriptions, …

• Manual attempts at data wrangling are likely to be expensive, partial, unreliable, poorly targeted, …

Data Wrangling Research

• So data wrangling is a research challenge, currently without a community or established priorities.

• The VADA (Value Added DAta Systems) project seeks to define principles and solutions for adding value to data, supporting users in discovering, extracting, integrating, accessing and interpreting the data of relevance to their questions.

• VADA takes account of the: – user context: requirements such as the trade-off between

completeness and correctness; and the

– data context: availability, cost, provenance, quality….

User Context: e-Commerce

• The same application may involve different user contexts. For example:

– Price comparison may normally be able to work with a subset of high quality sources, but

– Issue investigation may require a more complete picture, at the risk of obtaining more incorrect data, where sales of a popular item have been falling.

• As a result, hard-wiring data wrangling tasks risks the production of data sets that are not always fit for purpose, where the reason for this is implicit.

VADA Components

• VADA seeks to support wrangling by integrating:

– Data extraction,

– Data integration,

– Quality analysis and

– Querying

• … in a best-effort, pay-as-you-go approach to data wrangling.

Qu

alit

y

Integratio

n

VADA

Example: Data Integration

• How do we avoid lots of slow, expensive expert input into data integration?

• In pay-as-you-go data integration, alternative ways of combining data from sources can be generated algorithmically.

• Automatically generated candidate integrations can be refined in the light of feedback, for example from users or crowds.

• Decision support techniques can be used to capture the user’s requirements (e.g. in terms of quality or cost), in ways that inform which integrations are generated.

Example: Mapping Selection - 1

• Problem statement: Given a set of candidate mappings, and feedback on their results, identify the subset that best meets the user’s requirements in terms of precision and recall.

• Associated definitions: – Precision: the fraction of the retrieved results that are correct.

– Recall: the fraction of the correct results that are retrieved.

• The following were among the mappings generated by a commercial schema mapping tool for populating a table with schema <name, country, province>

M1 = SELECT name, country, province from Mondial.city

M2 = SELECT city, country, province from Mondial.located

…

Example: Mapping Selection - 2

• We can estimate the quality of the generated mappings using feedback.

• How much feedback do we need?

• The results presented:

– report the precision obtained for a given precision threshold, for different amounts of feedback.

– report the recall obtained for a given precision threshold, for different amounts of feedback.

Khalid Belhajjame, et al., Incrementally improving dataspaces based on user

feedback. Inf. Syst. 38(5): 656-687 (2013).

Evaluate Result – Precision Threshold

Evaluate Result – Precision Threshold

VADA: All Change

• The component technologies for extraction, integration and cleaning must themselves: – provide automated analyses that are informed by and take account of

the user context;

– share information with each other, so that, e.g., integration can identify issues with extraction, etc; and

– make well informed decisions that use all available evidence about the data context, such as reference data sets and ontologies.

• Thus making data wrangling more cost effective and systematic involves a fundamental rethink across a wide front.

Conclusions

• Data wrangling is a problem and an opportunity:

– A problem because the 4V’s of big data may all be present together a lot of the time, undermining manual approaches.

– An opportunity because if we can make data wrangling much more cost effective, all sorts of hitherto impractical tasks come into reach.

• Call to arms: we will have a serious go at this in VADA, but there is much to do, and there must be different viable approaches to be taken.

Acknowledgements

VADA is funded by: The Engineering and Physical Sciences

Research Council

Through a grant to: Georg Gottlob, Thomas Lukasiewicz,

Dan Olteanu, Giorgio Orsi, Tim Furche

Norman Paton, Alvaro Fernandes,

John Keane

Leonid Libkin, Wenfei Fan, Peter

Buneman, Sebastian Maneth

In cooperation with:

Documents

Data Wrangling: The Elephant in the Room of Big Data · Data Wrangling Research • So data wrangling is a research challenge, currently without a community or established priorities