Processing Unstructured Data - DEV DAY

  • Dinesh Priyankara |

    Senior Architect Specialist, Virtusa(Pvt) Ltd

    http://dinesql.blogspot.com/

    Processing Unstructured Data

  • Dinesh Priyankara | @dinesh_priya

    Senior Architect Specialist, Virtusa(Pvt) Ltd.

    Microsoft Most Valuable Professional since 2006, Data Platform (SQL Server)

    Consultant, Trainer, Speaker

    MSc in IT, MCSE, MCDBA

    [email protected]

    http://dinesql.blogspot.com

  • Agenda

    01 | Understanding unstructured data

    02 | Introduction to Hadoop and MapReduce

    03 | The Microsoft way

    04 | Processing unstructured data with Integration Services

    05 | Processing unstructured data with Azure Cloud Services

    06 | Demo

    07 | Q & A

  • 01 | Understanding unstructured data

  • Structured Data

    Structured data resides in a fixed field within a record or file. Relational databases and spreadsheets hold structured data.

    Always integrated with a schema (model). The schema defines the structure of the data with data types such as string, integer, date, etc.

    The schema defines how data is stored, accessed, and processed.

    Easy maintainability and data management. Based on the schema-on-write method.

    Managed with the well-known Structured Query Language (SQL).
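
    As a rough illustration of schema-on-write, the sketch below uses Python's built-in sqlite3 module. The customer table, its columns, and the sample rows are hypothetical and not taken from the talk. The point is only that the schema exists before any data arrives, so a row that does not fit is rejected at write time.

```python
import sqlite3


def build_customer_table():
    """Create an in-memory table whose schema is fixed before any data is written."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE customer (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            joined_on TEXT NOT NULL
        )
        """
    )
    return conn


def try_insert(conn, row):
    """Return True if the row fits the schema, False if the database rejects it."""
    try:
        conn.execute(
            "INSERT INTO customer (id, name, joined_on) VALUES (?, ?, ?)", row
        )
        return True
    except sqlite3.IntegrityError:
        # Raised here because the NOT NULL constraint on name is violated.
        return False


conn = build_customer_table()
ok = try_insert(conn, (1, "Jane", "2015-08-01"))    # fits the schema
rejected = try_insert(conn, (2, None, "2015-08-02"))  # missing name, refused

# Once written, the data is queried with plain SQL.
rows = conn.execute("SELECT name, joined_on FROM customer ORDER BY id").fetchall()
```

    Only the first row ends up in the table; the second never gets stored because it breaks the declared structure.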

  • Semi-Structured Data

    Semi-structured data does not follow a standard model defined with a schema. Structure is imposed in the form of tags or markers.

    Elements can carry different sets of attributes even though they belong to the same class.

    Examples: XML, HTML, JSON, etc.

    Considered self-describing data.
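
    A small sketch of what "self-describing" can look like in practice, using JSON and Python's json module. The two customer records and their field names are made up for illustration. Both records belong to the same class, yet each carries its own set of attributes, and the keys themselves act as the markers.

```python
import json

# Two hypothetical records of the same class ("customer") with different attributes.
raw = """
[
  {"type": "customer", "name": "Jane", "email": "jane@example.com"},
  {"type": "customer", "name": "Ravi", "phone": "555-0100", "tags": ["vip"]}
]
"""

records = json.loads(raw)


def attribute_names(record):
    """Return the attribute names a record describes itself with, sorted."""
    return sorted(record.keys())


# Each record declares its own shape; no external schema is consulted.
shapes = [attribute_names(r) for r in records]

# Attributes that both records happen to share.
shared = sorted(set(records[0]) & set(records[1]))
```

    Here only name and type are common to both records; everything else has to be discovered per element.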

  • Unstructured Data: The Definition

    "Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner." ~ Wikipedia

    "Unstructured data represents any data that does not have a recognizable structure. It is unorganized and raw and can be non-textual or textual." ~ Techopedia

    ~ web

  • Unstructured Data

    Unstructured data does not reside in a field or record. No standard model; does not follow a schema.

    No specific definition for storing, accessing, and processing.

    Can be seen as Word documents, audio files, videos, photos, etc.

    Might follow a structure internally, but has no schema, tags, or markers describing the fields of data.

    Difficult to process using traditional computer modules. Has many irregularities and ambiguities.

    http://www.informationweek.com/it-life/cartoon-unstructured-data-fatigue/a/d-id/1316534

  • Why is it important?

    80%-90% of data is unstructured, and it keeps growing. Previously unidentified or ignored.

    Hidden business insight

    Provides a holistic view of the business

    Provides competitive advantages

    Reveals social trends for improving customer satisfaction

    Saves time and money

    2.5 quintillion bytes of data per day

    175 million tweets per day

    1.49 billion monthly active FB users

  • Common Unstructured and Semi-structured Data

    Sentiment data: mainly from social networks, online reviews, and customer support interactions

    Clickstream data

    Sensor or machine data

    Server log data

  • Ways of accessing Unstructured Data

    Mainly two methods:

    Impose a structure on unstructured data. Based on the schema-on-read method.

    Transform unstructured data into a structured schema. The permanent structure makes it accessible by traditional computer modules.
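
    The two methods can be sketched side by side in Python. Everything here is an assumption made for illustration: the log line layout, the regular expression, and the request table are hypothetical and not from the talk. read_with_schema imposes a structure only at read time and leaves the raw text untouched, while load_into_table stores the parsed result permanently in a structured table that SQL can query.

```python
import re
import sqlite3

# Hypothetical raw server log lines; the third one does not follow the layout.
RAW_LOG = [
    "2015-08-01 10:15:02 GET /index.html 200",
    "2015-08-01 10:15:07 GET /missing 404",
    "corrupted line without the expected fields",
]

# The "schema" applied at read time: named groups play the role of fields.
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)


def read_with_schema(lines):
    """Schema-on-read: yield a dict per matching line; non-matching lines are skipped."""
    for line in lines:
        match = LINE_PATTERN.fullmatch(line)
        if match:
            record = match.groupdict()
            record["status"] = int(record["status"])
            yield record


def load_into_table(lines):
    """Transform: persist the parsed records in a table with a fixed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE request "
        "(date TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany(
        "INSERT INTO request VALUES (:date, :time, :method, :path, :status)",
        read_with_schema(lines),
    )
    return conn


parsed = list(read_with_schema(RAW_LOG))
conn = load_into_table(RAW_LOG)
not_found = conn.execute("SELECT path FROM request WHERE status = 404").fetchall()
```

    With the first method a different pattern can be applied to the same raw lines later; with the second, the structure is decided once and downstream tools only see the table.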