BigDataWarehousing by Sabishaw Bhaskaran, Siemens in Information Excellence Session 2013 Feb
Information Excellence informationexcellence.wordpress.com
Information Excellence 2013 Feb Knowledge Share Session
The best way to put distance between you and the crowd is to do an outstanding job with information.
How you gather, manage, and use information will determine whether you win or lose.
Building a BigData warehousing and analysis system around Apache Hadoop
Sabishaw Bhaskaran
OLTP
Used in operational systems
Registers transactions arising out of business workflows
Focus is on accurate & consistent recording of transactions
Operational systems are purpose-built, e.g. SCM, HR, Financial, Manufacturing
Relational and heavily normalized; ACID properties are essential
Backbone of nearly all business IT systems as we know them today
Emphasis is on current data
Is about Running the business
But, past transactional data is
A trail of how well the business did
Can be curated into business insights - kind of “OpIntel”
Useful to analyze events, identify trends & make predictions
Is valuable for operational, tactical & strategic decision making
OLAP
Used in decision support systems
Purpose is to display, analyze & discover information
Focus is on aggregation and fast query response
Data handled is typically historical (fewer updates), less detailed (aggregated) & holistic (integrated)
Capable of analyzing multidimensional data interactively from multiple perspectives & can handle ad-hoc queries
Basic organization – star/snowflake schemas (or sometimes also 3NF)
Is about Changing the business
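The star-schema organization mentioned above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than a real OLAP engine, and the table names (fact_sales, dim_product, dim_date), columns, and sample figures are all hypothetical, chosen only to show one fact table being aggregated along different dimensions.

```python
import sqlite3

def build_star_schema(conn):
    """Create one fact table and two dimension tables (all names hypothetical)."""
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            amount REAL
        );
    """)
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(1, "Drives"), (2, "Sensors")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(10, 2012, "Q3"), (11, 2012, "Q4")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 10, 100.0), (1, 11, 150.0), (2, 10, 40.0), (2, 11, 60.0)])

def sales_by(conn, dimension_column):
    """Aggregate the fact table along one dimension attribute."""
    allowed = {"category", "quarter", "year"}
    if dimension_column not in allowed:
        raise ValueError(f"unknown dimension: {dimension_column}")
    # The column name is checked against a whitelist before being interpolated.
    query = f"""
        SELECT {dimension_column}, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY {dimension_column}
        ORDER BY {dimension_column}
    """
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
by_category = sales_by(conn, "category")  # same facts, product perspective
by_quarter = sales_by(conn, "quarter")    # same facts, time perspective
```

The same fact rows are viewed from two perspectives simply by grouping on a different dimension attribute, which is the kind of ad-hoc, multi-perspective query the slide describes.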
ETL
Operational data exists in (departmental) silos. Extract – Pick what’s relevant
Operational systems are purpose-built. Transform – Ensure syntactic/semantic sanity (+ cleansing)
We need an enterprise (holistic) view of the business. Integrate (Load) – Apply the global schema (EDM*)
*Enterprise Data Model
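The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (customer, amount), the cleansing rules, and the sample records are hypothetical, and a plain list stands in for the warehouse.

```python
def extract(rows, wanted_fields):
    """Extract: keep only the fields relevant to the warehouse."""
    return [{k: row[k] for k in wanted_fields if k in row} for row in rows]

def transform(rows):
    """Transform: normalize syntax and drop records that fail a sanity check."""
    cleaned = []
    for row in rows:
        name = str(row.get("customer", "")).strip().title()
        try:
            amount = float(row.get("amount", ""))
        except (TypeError, ValueError):
            continue  # cleansing: skip rows whose amount cannot be parsed
        if not name:
            continue  # cleansing: skip rows without a customer name
        cleaned.append({"customer": name, "amount": amount})
    return cleaned

def load(warehouse, rows, source):
    """Integrate/load: append rows under one shared schema, tagged by source."""
    for row in rows:
        warehouse.append({"source": source, **row})
    return warehouse

# Hypothetical records from two departmental silos.
scm_rows = [
    {"customer": " acme corp ", "amount": "120.50", "internal_flag": "x"},
    {"customer": "Beta Ltd", "amount": "n/a"},
]
finance_rows = [{"customer": "ACME CORP", "amount": 80}]

warehouse = []
fields = ["customer", "amount"]
load(warehouse, transform(extract(scm_rows, fields)), "SCM")
load(warehouse, transform(extract(finance_rows, fields)), "Finance")
```

After the run, both silos describe the customer in the same form ("Acme Corp") and the unparseable record has been dropped, which is the point of applying one global schema.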
Data Warehouse
SCM
Finance
ExternalData
ETL Logic
Data warehouse
Operational Systems
A formal definition
A data warehouse is a subject-oriented, integrated, time-variant, non-volatile (non-updatable) collection of data, used in support of management decision-making processes*
Subject-oriented - The data in the data warehouse is organized so that all the data elements relating to the ‘same real-world event or object’ (e.g. sales) are linked together
Non-volatile - Data in the data warehouse is never overwritten or deleted; once committed, the data is static, read-only, and retained for future reporting
Integrated - The data warehouse contains data from most or all of an organization's operational systems and these data are made consistent
Time-variant - Values over time are available and hence trends can be observed
*By Bill Inmon
A word on BigData
Ubiquitous digitization (IT, automation, RFID), social media
Mind-boggling volumes
Rapid rate of generation
Structured, semi-structured & unstructured
3Vs (Volume, Velocity & Variety)
Challenges the capabilities of conventional (RDBMS based) systems (for storage & processing)
MapReduce & Hadoop
Source: Hadoop: the Definitive Guide – Tom White
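As a rough illustration of the MapReduce model, the sketch below runs the classic word-count example in a single Python process. It is not Hadoop code: the function names are hypothetical, and the shuffle step that Hadoop performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data", "Big warehouse"])
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over many machines, which is what makes the model suitable for very large inputs.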
New view on DWH
Old school : ETL prior to warehousing (a.k.a schema on write)
New school : Store first, ETL & analyze when necessary (a.k.a schema on read)
Old school: Data sources → ETL → Data warehouse
New school: Data sources → Hadoop → ETL & analyze
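The "store first, apply the schema when reading" idea can be sketched as follows. This is a minimal single-process illustration: a Python list stands in for raw files kept in Hadoop, and the column names and tab delimiter are assumptions made for the example.

```python
def read_with_schema(lines, columns, delimiter="\t"):
    """Apply a schema at read time; lines that do not fit it are skipped."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) == len(columns):
            yield dict(zip(columns, fields))

# Raw lines are stored exactly as they arrive, with no validation on write.
raw_store = [
    "2013-02-01 09:15:00\talice\thello world",
    "a malformed line without delimiters",
    "2013-02-01 09:20:00\tbob\tRT @alice hello world",
]

# The schema is chosen only when a question is asked of the data.
tweets = list(read_with_schema(raw_store, ("time", "user", "text")))
users = [row["user"] for row in tweets]
```

Under schema on write, the malformed line would have been rejected or repaired during loading; here it stays in storage and is simply ignored by this particular reading of the data.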
Apache Hive
Projects a relation-oriented structure onto the semi-structured data stored in the Hadoop Distributed File System (HDFS)
Provides an interface to query the data (in HQL, similar to SQL) and translates each query into a plan consisting of a directed acyclic graph of MapReduce jobs, which Hadoop executes in a distributed fashion across the cluster
Is an open source data warehousing solution built on Hadoop that gives analysts the power of a SQL-like language as well as MapReduce programs
Since HQL is very closely related to SQL, a mapping from HQL to SQL is possible
Our DW system
Hive
Hadoop
Apache Sqoop
Using Hive for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise
Apache Sqoop is a tool designed for efficiently transferring large volumes of data between HDFS and structured data stores such as relational databases (e.g. MS-SQL, MySQL, Oracle)
Sqoop successfully graduated from the incubator in March 2012 and is now a top-level Apache project
Our DW system
Streaming sources
Relational sources
Sqoop
Hive
Hadoop
Microsoft Power Pivot
Microsoft Power Pivot is an add-in to Excel and is an in-memory processing engine that provides multi-dimensional visualization over the data present in Excel
Provides functions such as slicing, dicing and pivoting, giving the user interactive visualizations of the data loaded into Power Pivot
Another advantage of Power Pivot is its ability to publish the analysis results online using Microsoft SharePoint along with an interactive interface to users (thus bringing in the self-service feature!)
Our DW system - Final picture
Streaming sources
Relational sources
Sqoop
Hive ODBC driver
Microsoft Power Pivot
Microsoft SharePoint server
Hive
Hadoop
Publish results online
Application - Twitter data analysis
Size of Twitter data used:
Data | Rows | Size (GB)
Follower content (user ID of each user with the user ID of each of his followers); format: *.txt; total number of users: 40,103,281 | 1,468,365,132 | 26
Tweets data (containing the time, user and content of each tweet); format: *.txt; total period of tweets: 1 month | 29,986,960 | 11
Twitter – Most aggressive users
Twitter – Most popular topics
A user-defined function written in Java picks the word following each hashtag in each tweet (to get the trending topics). This user-defined function is used in the query to derive the total count of each word
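The logic of such a hashtag function can be sketched as below. The original was a Java UDF called from a Hive query; this Python version is a stand-in that shows only the extraction and counting, and the regular expression, the lower-casing of tags, and the function names are assumptions made for the sketch.

```python
import re
from collections import Counter

# A hashtag is taken to be '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")

def hashtags(tweet_text):
    """Return the word following each hashtag in one tweet, lower-cased."""
    return [tag.lower() for tag in HASHTAG.findall(tweet_text)]

def topic_counts(tweets):
    """Count how often each hashtag word occurs across all tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(hashtags(text))
    return counts

sample = ["Loving #Hadoop and #hive", "#hadoop rocks", "no tags here"]
top_topics = topic_counts(sample).most_common(1)
```

In the Hive setting, the per-tweet extraction corresponds to the UDF and the counting corresponds to the GROUP BY part of the query.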
Twitter – Most popular users
This was quite unexpected: none of the top 10 users in this list appears in the list of top users by tweet count. So users with a very high number of followers do not necessarily tweet proportionally often. The top user having 3 million followers was also an astonishing result
Twitter - Spammers
This shows that wpstudios and dominiquerdr have more than 99% retweets in their total tweet count. We can therefore classify these users as spammers, since their number of original tweets is very small
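The retweet-share rule can be written down as a short sketch. The 99% threshold comes from the slide; the way a retweet is recognized (text starting with "RT @"), the function names, and the sample data are assumptions for illustration, not the query used in the talk.

```python
def is_retweet(text):
    """Assumption: a retweet is a tweet whose text starts with 'RT @'."""
    return text.startswith("RT @")

def retweet_share(texts):
    """Fraction of a user's tweets that are retweets (0.0 if there are none)."""
    if not texts:
        return 0.0
    return sum(1 for text in texts if is_retweet(text)) / len(texts)

def likely_spammers(tweets_by_user, threshold=0.99):
    """Users whose retweet share exceeds the threshold, sorted by name."""
    return sorted(
        user for user, texts in tweets_by_user.items()
        if retweet_share(texts) > threshold
    )

# Hypothetical users and tweets.
sample = {
    "user_a": ["RT @x hi"] * 200,
    "user_b": ["RT @x hi"] * 50 + ["my own tweet"] * 50,
    "user_c": ["hello"],
}
flagged = likely_spammers(sample)
```

A real analysis would probably also require a minimum number of tweets per user, so that an account with a single retweet is not flagged.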
Twitter – Activity spread during the day (Re-tweets)
This analysis uses the slicer function in Power Pivot
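The aggregation behind such a chart can be sketched in Python. This is not the Power Pivot slicer itself but a plain stand-in for the underlying count of retweets per hour of day; the timestamp format, the "RT @" retweet convention, and the sample records are assumptions.

```python
from collections import Counter
from datetime import datetime

def retweets_per_hour(records, time_format="%Y-%m-%d %H:%M:%S"):
    """Count retweets in each hour of the day (0-23) from (timestamp, text) pairs."""
    counts = Counter()
    for timestamp, text in records:
        if text.startswith("RT @"):  # assumed retweet convention
            counts[datetime.strptime(timestamp, time_format).hour] += 1
    # Return a fixed 24-slot list so hours without retweets show up as 0.
    return [counts.get(hour, 0) for hour in range(24)]

# Hypothetical records.
records = [
    ("2013-02-01 09:15:00", "RT @a hi"),
    ("2013-02-01 09:45:00", "RT @b yo"),
    ("2013-02-01 21:00:00", "an original tweet"),
    ("2013-02-02 21:30:00", "RT @c x"),
]
hourly = retweets_per_hour(records)
```

Restricting the input records to one user or one topic before counting plays the role that a slicer plays in the interactive view.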
Thank you for your attention
About Information Excellence Group
Reach us at:
blog: http://informationexcellence.wordpress.com/
linked in: http://www.linkedin.com/groups/Information-Excellence-3893869
facebook: http://www.facebook.com/pages/Information-excellence-group/171892096247159
presentations: http://www.slideshare.net/informationexcellence
twitter: #infoexcel
email: [email protected]@gmail.com
Have you enriched yourself by contributing to the community Knowledge Share?