Sabishaw bhaskaran siemens_big_datawarehousing_ieg_2012feb

Information Excellence informationexcellence.wordpress.com

Information Excellence 2013 Feb Knowledge Share Session

The best way to put distance between you and the crowd is to do an outstanding job with information.

How you gather, manage, and use information will determine whether you win or lose.

Building a BigData warehousing and analysis system around Apache Hadoop

Sabishaw Bhaskaran

OLTP

Used in operational systems

Registers transactions arising out of business workflows

Focus is accurate & consistent recording of transactions

Operational systems are purpose-built, e.g. SCM, HR, Financial, Manufacturing

Relational, heavily normalized; ACID properties are essential

Backbone of nearly all business IT systems as we know them today

Importance is on current data

Is about Running the business

But, past transactional data is

A trail of how well the business did

Can be curated into business insights - kind of “OpIntel”

Useful to analyze events, identify trends & make predictions

Is valuable for operational, tactical & strategic decision making

OLAP

Used in decision support systems

Purpose is to display, analyze & discover information

Focus is aggregation and fast query response

Data handled is typically historical (fewer updates), less detailed (aggregated) & holistic (integrated)

Capable of analyzing multidimensional data interactively from multiple perspectives & can handle ad-hoc queries

Basic organization – star/snowflake schemas (or sometimes also 3NF)

Is about Changing the business

ETL

Operational data exists in (departmental) silos. Extract: pick what’s relevant

Operational systems are purpose-built. Transform: ensure syntactic/semantic sanity (+ cleansing)

We need an enterprise (holistic) view of the business. Integrate (Load): apply the global schema (EDM*)

*Enterprise Data Model
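The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source record sets, their field names, the `Sale` target schema and the cleansing rule (drop rows whose amount is not numeric) are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass

# Hypothetical raw records from two departmental silos (illustrative data only).
SCM_ROWS = [
    {"order_id": "A-1", "amount": " 120.50 ", "region": "north", "note": "x"},
    {"order_id": "A-2", "amount": "n/a", "region": "South", "note": "y"},
]
FINANCE_ROWS = [
    {"invoice": "F-9", "total_eur": "80", "area": "NORTH"},
]


@dataclass(frozen=True)
class Sale:
    """Hypothetical global schema shared by all sources."""
    source: str
    key: str
    amount: float
    region: str


def extract(rows, wanted):
    """Extract: keep only the relevant fields of each record."""
    return [{field: row[field] for field in wanted} for row in rows]


def transform(rows, key_field, amount_field, region_field):
    """Transform: normalize syntax and drop rows that fail cleansing."""
    clean = []
    for row in rows:
        try:
            amount = float(row[amount_field].strip())
        except ValueError:
            continue  # assumed cleansing rule: skip non-numeric amounts
        clean.append((row[key_field], amount, row[region_field].strip().lower()))
    return clean


def load(warehouse, source, rows):
    """Load: append rows to the warehouse under the global schema."""
    for key, amount, region in rows:
        warehouse.append(Sale(source, key, amount, region))


def run_etl():
    warehouse = []
    scm = extract(SCM_ROWS, ["order_id", "amount", "region"])
    load(warehouse, "scm", transform(scm, "order_id", "amount", "region"))
    fin = extract(FINANCE_ROWS, ["invoice", "total_eur", "area"])
    load(warehouse, "finance", transform(fin, "invoice", "total_eur", "area"))
    return warehouse
```

Calling `run_etl()` returns one integrated list in which both silos share the same field names and the same lowercase region codes, which is the point of applying a global schema.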

Data Warehouse

[Diagram: operational systems (SCM, Finance, external data) -> ETL logic -> data warehouse]

A formal definition

A data warehouse is a subject-oriented, integrated, time-variant, non-volatile (non-updatable) collection of data, used in support of management decision-making processes*

Subject-oriented - The data in the data warehouse is organized so that all the data elements relating to the ‘same real-world event or object’ (e.g. sales) are linked together

Non-volatile - Data in the data warehouse are never overwritten or deleted; once committed, the data are static, read-only, and retained for future reporting

Integrated - The data warehouse contains data from most or all of an organization's operational systems and these data are made consistent

Time-variant - Values over time are available and hence trends can be observed

*By Bill Inmon

A word on BigData

Ubiquitous digitization (IT, automation, RFID), social media

Mind-boggling volumes

Rapid rate of generation

Structured, semi-structured & unstructured

3Vs (Volume, Velocity & Variety)

Challenges the capabilities of conventional (RDBMS based) systems (for storage & processing)

MapReduce & Hadoop

Source: Hadoop: the Definitive Guide – Tom White

New view on DWH

Old school: ETL prior to warehousing (a.k.a. schema on write)

New school: Store first, ETL & analyze when necessary (a.k.a. schema on read)

[Diagram. Old school: data sources -> ETL -> data warehouse. New school: data sources -> Hadoop -> ETL & analyze]
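The contrast between schema on write and schema on read can be illustrated with a small Python sketch. It assumes a hypothetical schema (`user` as string, `amount` as float) and hypothetical JSON input lines; neither is taken from the talk.

```python
import json

# Hypothetical raw input lines (illustrative only).
RAW_EVENTS = [
    '{"user": "u1", "amount": "10"}',
    '{"user": "u2"}',
    'not json at all',
]


def parse(line):
    """Apply the assumed schema; return None if the line does not fit."""
    try:
        rec = json.loads(line)
        return {"user": str(rec["user"]), "amount": float(rec["amount"])}
    except (ValueError, KeyError, TypeError):
        return None


def schema_on_write(lines):
    """Old school: validate at load time; only conforming rows are stored."""
    return [rec for rec in map(parse, lines) if rec is not None]


def schema_on_read(lines):
    """New school: store every raw line; apply the schema only when queried."""
    store = list(lines)  # nothing is rejected or reshaped at load time

    def query():
        return [rec for rec in map(parse, store) if rec is not None]

    return store, query
```

With schema on read the raw lines survive in the store, so a different schema can be applied to the same data later without reloading it; with schema on write the non-conforming lines are gone after loading.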

Apache Hive

Projects a relation-oriented structure onto the semi-structured data stored in the Hadoop Distributed File System (HDFS)

Provides an interface to query the data in HQL (similar to SQL) and translates each query into a plan consisting of a directed acyclic graph of MapReduce jobs, which Hadoop executes in a distributed fashion across the cluster

Is an open source data warehousing solution built on Hadoop that gives analysts the power of an SQL-like language alongside MapReduce programs

Since HQL is very closely related to SQL, a mapping from HQL to SQL is possible
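To give a feel for what translating a query into MapReduce jobs means, the sketch below hand-codes a single map/shuffle/reduce pass for a query of the form `SELECT userid, COUNT(*) FROM tweets GROUP BY userid`. It is not Hive's planner and runs in one Python process; the `tweets` table, its columns and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a table tweets(userid, text).
TWEETS = [("alice", "hello"), ("bob", "hi"), ("alice", "again")]


def map_phase(rows):
    """Map: emit (group key, 1) for every input row."""
    for userid, _text in rows:
        yield userid, 1


def shuffle(pairs):
    """Shuffle: bring all values with the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key (here, COUNT(*))."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_user(rows):
    return reduce_phase(shuffle(map_phase(rows)))
```

On a real cluster the map and reduce functions run on many nodes in parallel and the shuffle moves data between them; a more complex query becomes several such jobs chained into a graph.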

Our DW system

[Diagram: Hive on top of Hadoop]

Apache Sqoop

Using Hive for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise

Apache Sqoop is a tool designed for efficiently transferring large volumes of data between HDFS and structured data stores such as relational databases (e.g. MS-SQL, MySQL, Oracle)

Sqoop successfully graduated from the incubator in March 2012 and is now a top-level Apache project

Our DW system

[Diagram: streaming sources and relational sources (through Sqoop) -> Hive on Hadoop]

Microsoft Power Pivot

Microsoft Power Pivot is an add-in to Excel and is an in-memory processing engine that provides multi-dimensional visualization over the data present in Excel

It provides functions such as slicing, dicing and pivoting, giving the user interactive visualizations over the data loaded into Power Pivot

Another advantage of Power Pivot is its ability to publish the analysis results online using Microsoft SharePoint along with an interactive interface to users (thus bringing in the self-service feature!)

Our DW system - Final picture

[Diagram: streaming sources and relational sources (through Sqoop) -> Hive on Hadoop -> Hive ODBC driver -> Microsoft Power Pivot -> Microsoft SharePoint server (publish results online)]

Application - Twitter data analysis

Size of Twitter data used:

Data | Rows | Size (GB)

Follower content (userid of each user with the userid of each of his followers); format: *.txt; total number of users: 40,103,281 | 1,468,365,132 | 26

Tweets data (containing the time, user and content of the tweet); format: *.txt; total period of tweets: 1 month | 29,986,960 | 11

Twitter – Most aggressive users

Twitter – Most popular topics

A user-defined function is written in Java to pick the word following each hashtag in each tweet (to get the trending topics). This user-defined function is used in the query to derive the total count of each word
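The logic of such a hashtag function can be sketched as follows. This is a Python stand-in, not the Java UDF used in the talk; the regular expression, the lowercasing of topics and the sample tweets in the usage note are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a topic is the run of word characters right after '#'.
HASHTAG = re.compile(r"#(\w+)")


def hashtags(text):
    """Return the word following each hashtag, lowercased (an assumption)."""
    return [word.lower() for word in HASHTAG.findall(text)]


def top_topics(tweets, n=10):
    """Count every hashtag word across all tweets and return the n most common."""
    counts = Counter()
    for text in tweets:
        counts.update(hashtags(text))
    return counts.most_common(n)
```

For example, `top_topics(["Go #Hadoop and #hive", "#hadoop rocks"], 1)` counts `hadoop` twice because of the lowercasing. In Hive the extraction step would live in the UDF and the counting would be done by the query's GROUP BY.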

Twitter – Most popular users

Unexpectedly, none of the top 10 users in this list appears in the corresponding list by tweet count. So people with a very high number of followers do not necessarily tweet proportionally often. The top user, with 3 million followers, was also an astonishing result

Twitter - Spammers

This shows that for wpstudios and dominiquerdr more than 99% of the total tweet count consists of retweets. We can classify these users as spammers because they post very few original tweets
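The retweet-share criterion can be sketched in Python as below. The 99% threshold comes from the slide; everything else is an assumption: that a retweet can be recognized by the prefix `RT @`, that tweets arrive as `(user, text)` pairs, and that a minimum tweet count (`min_tweets`, hypothetical) is needed before a user is flagged.

```python
from collections import defaultdict


def is_retweet(text):
    """Assumption: a retweet is a tweet whose text starts with 'RT @'."""
    return text.startswith("RT @")


def retweet_ratios(tweets):
    """Return, per user, the fraction of that user's tweets that are retweets."""
    total = defaultdict(int)
    retweets = defaultdict(int)
    for user, text in tweets:
        total[user] += 1
        if is_retweet(text):
            retweets[user] += 1
    return {user: retweets[user] / total[user] for user in total}


def likely_spammers(tweets, threshold=0.99, min_tweets=100):
    """Flag users with enough tweets whose retweet share exceeds the threshold."""
    counts = defaultdict(int)
    for user, _text in tweets:
        counts[user] += 1
    ratios = retweet_ratios(tweets)
    return sorted(
        user for user, ratio in ratios.items()
        if ratio > threshold and counts[user] >= min_tweets
    )
```

The minimum tweet count keeps a user with a single retweet from being flagged at a ratio of 100%; the right value depends on the data set.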

Twitter – Activity spread during the day (Re-tweets)

This analysis uses the slicer feature in Power Pivot

Thank you for your attention

About Information Excellence Group

Reach us at:

blog: http://informationexcellence.wordpress.com/

linked in: http://www.linkedin.com/groups/Information-Excellence-3893869

facebook: http://www.facebook.com/pages/Information-excellence-group/171892096247159

presentations: http://www.slideshare.net/informationexcellence

twitter: #infoexcel

email: [email protected]@gmail.com

Have you enriched yourself by contributing to the community Knowledge Share?
