Big Data and Content Classification Paul Balas

Big Data and Classification


DESCRIPTION

A discussion on how to think about Big Data relative to improving analytical outcomes by leveraging Data Governance.


Page 1: Big Data and Classification

Big Data and Content Classification

Paul Balas

Page 2: Big Data and Classification

How to make meaning out of Big Data

Big Data, as the poster child for marketing open-source software built on alternative database storage structures, has become a 'Big Nothing'. The ambiguity around what Big Data means requires endless hours of explanation, and the discussion really only focuses on the problems of dealing with data of large volume, velocity, or variety (I'm waiting for more catchy V's such as Victory or Value!). My perspective centers on the phrase 'Big Understanding', an optimistic 'View' of making sense of our data and turning it into information. The focus has to shift.

Page 3: Big Data and Classification

Classification = Relevance

No matter what vendors say, the better the classification and structure of your data, the better your search and analytical capabilities will be. Even tools that help with classification require custom rules and dictionaries, and they tend to be domain-specific. If you want high-quality Big Data, you need Data Governance.
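To make the point about custom rules and dictionaries concrete, here is a minimal sketch of a dictionary-based classifier. The categories, keywords, and the `classify` function are hypothetical and chosen only for illustration; a real domain dictionary would be curated by Subject Matter Experts and would be far larger.

```python
import re

# Hypothetical domain dictionary: category -> keywords (illustrative only).
DICTIONARY = {
    "billing": {"invoice", "payment", "refund"},
    "shipping": {"delivery", "tracking", "courier"},
}


def classify(text, dictionary=DICTIONARY):
    """Return the category with the most keyword hits, or None if no keyword matches."""
    tokens = re.findall(r"[a-z]+", text.lower())
    best_category, best_hits = None, 0
    for category, keywords in dictionary.items():
        hits = sum(1 for token in tokens if token in keywords)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category


label = classify("Where is my refund for the last invoice?")
unknown = classify("Hello there")
```

Note how the classifier says nothing useful about text outside its dictionary: the quality of the result depends on how well the dictionary is governed, not on the volume of data.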

Page 4: Big Data and Classification

Data Governance = Big Quality

If you want a high-quality analysis, your data has to be standardized and consistent. This is especially true where there is a large degree of variety in your inputs. For example, if each input source uses a different geopolitical hierarchy, you have to align them to a single standard, or your customer won't find Colorado information when they type in CO (OK, a trivial example, but valid). Data Governance requires people, process, and tools, and often requires organizational change.

Many companies would benefit more from improving the quality and 'findability' of their data than from piling more data into an already inconsistent data store.
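The Colorado / CO example can be sketched as a small normalization step applied to both stored records and search queries. The alias table, record layout, and function names below are assumptions made for illustration, not part of any particular product.

```python
# Hypothetical alias table mapping source-specific spellings to one standard name.
STATE_ALIASES = {
    "co": "Colorado",
    "colo.": "Colorado",
    "colorado": "Colorado",
}


def normalize_state(raw):
    """Map a raw state value to its standard name, or None if it is unknown."""
    return STATE_ALIASES.get(raw.strip().lower())


def find_records(records, query):
    """Return records whose normalized state matches the normalized query."""
    target = normalize_state(query)
    if target is None:
        return []
    return [r for r in records if normalize_state(r["state"]) == target]


records = [
    {"id": 1, "state": "CO"},
    {"id": 2, "state": "Colorado"},
    {"id": 3, "state": "TX"},
]
matches = find_records(records, "co")
```

Without the shared alias table, a query for CO would match only record 1; with it, both Colorado records are found regardless of which source produced them.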

Page 5: Big Data and Classification

Data Governance Lifecycle

Applying Data Governance to Big Data helps you to:

Understand the quality of your data

Categorize it into well-defined groupings, with commonly shared definitions

Look at new data and categorize it into new or existing groups

Share it with your stakeholders

Manage it over time

Page 6: Big Data and Classification

A Framework to gain perspective

The following slides attempt to provide a framework for understanding the lifecycle around information management and understanding, from the perspective of managing and applying meaning to your data.

Page 7: Big Data and Classification

[Diagram: Data Governance Life Cycle. Labels: Communication between People and Processes; VTO Management; Vocabulary, Taxonomy, Ontology (VTO); Content Creation & Sourcing; Unstructured Content; Structured Content; Content Mining & Classification; Content + Governed VTO; Search; Transactions; Analytics]

Page 8: Big Data and Classification

Vocabulary, Taxonomy, and Ontology (VTO)

Humans use systems of organization to make order of their world

Effective experiences with Big Data are driven by Subject Matter Experts or machines categorizing content with a common language that can be shared and understood by consumers of the data

Governed Vocabularies, Taxonomies, and Ontologies are the pick-lists, hierarchies, and relationships that define content, which Subject Matter Experts use to categorize, share, and analyze data
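One way to picture pick-lists, hierarchies, and relationships is as three small data structures. The sketch below is an assumption about how a VTO might be represented in code; the terms, the `has_part` predicate, and the helper functions are hypothetical.

```python
# Vocabulary: the governed pick-list of allowed terms (illustrative values).
VOCABULARY = {"Vehicle", "Car", "Truck", "Engine"}

# Taxonomy: a hierarchy expressed as child -> parent.
TAXONOMY = {"Car": "Vehicle", "Truck": "Vehicle"}

# Ontology: typed relationships as (subject, predicate, object) triples.
ONTOLOGY = [
    ("Car", "has_part", "Engine"),
    ("Truck", "has_part", "Engine"),
]


def ancestors(term, taxonomy=TAXONOMY):
    """Walk up the hierarchy and return the chain of broader terms."""
    path = []
    while term in taxonomy:
        term = taxonomy[term]
        path.append(term)
    return path


def related(term, predicate, ontology=ONTOLOGY):
    """Return the objects linked to a term by the given relationship."""
    return [obj for subj, pred, obj in ontology if subj == term and pred == predicate]


car_parents = ancestors("Car")
truck_parts = related("Truck", "has_part")
```

Content tagged with "Car" can then be found by a search for "Vehicle" (through the taxonomy) or linked to content about "Engine" (through the ontology).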

Page 9: Big Data and Classification

Content Creation and Sourcing

Content is created by people interacting with computer systems as well as by machines generating data

When you have more than one stream of data being produced by different inputs, the rules for categorization differ between systems

Understanding your data sources, whether one system or many, requires you to know how the data is produced, and therefore how it can be analyzed

Big Data promises that you don’t need to know the meaning of your input data as you collect it

It doesn’t mean that you don’t need to define and understand it before you begin to analyze it

If you apply meaning and structure to your data, your analysis will improve in quality, or become possible in the first place

Page 10: Big Data and Classification

Content Mining and Classification

Categorization of your data isn’t a one-time event unless your analysis is a one-time event

Subject Matter Experts need the ability to analyze new data, and revisit old data to make sure nothing has changed

Content Mining is a technique to bring understanding to your data and how it fits your view of the world

Most Big Data platforms are weak (today) in this area

For Big Data, there is a disconnect in vendor tooling between when we analyze our data and when we categorize it and apply meaning

Page 11: Big Data and Classification

VTO Management

Vocabularies, Taxonomies, and Ontologies require management over time

They are not built in isolation; they require collaboration between Subject Matter Experts and stakeholders

They must be easily shared, versioned, and implemented against your data

Applying defined VTOs to Big Data is a challenge in current vendor offerings
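As a rough sketch of what "versioned and implemented against your data" could mean in practice, each classified record can carry the vocabulary version it was tagged with, so records needing review are easy to find after a change. The `VocabularyVersion` class, the record fields, and `stale_records` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabularyVersion:
    """A published, immutable snapshot of the governed vocabulary."""
    version: int
    terms: frozenset


def stale_records(records, current):
    """Return records tagged under an older version or with a retired term."""
    return [
        r for r in records
        if r["vto_version"] < current.version or r["term"] not in current.terms
    ]


v2 = VocabularyVersion(version=2, terms=frozenset({"Car", "Truck"}))
records = [
    {"id": 1, "term": "Car", "vto_version": 2},
    {"id": 2, "term": "Auto", "vto_version": 1},  # tagged before "Auto" was retired
]
needs_review = stale_records(records, v2)
```

Keeping the version on each record lets Subject Matter Experts revisit only the data affected by a vocabulary change instead of re-categorizing everything.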

Page 12: Big Data and Classification

Search, Transactions, Analytics

Search – keyword or navigated searching through detailed or aggregated data

Transactions – adding data to an existing store via people or machines

Analytics – statistics, probabilities, creating models …

Each of these activities, on Big, Medium, or Small data, benefits from good categorization and application of VTO standards

Page 13: Big Data and Classification

Conclusion

As Big Data continues to gain momentum in the confusing vendor marketplace, don’t lose sight of the basics. Don’t give in to unbounded promises of being able to analyze your data to perfection without considering the end goal of why you are collecting this data in the first place:

To apply meaning and understanding to your problem at hand, and share it with people who can take fruitful action that results in improvement