IS 466 ADVANCE TOPICS IN INFORMATION SYSTEMS
Chapter 2 Information Collection, Processing & Retrieval: Part 1: Big
Data Analytics
1436/1437 Semester II
Dr. Sapiah Sakri, Assistant Professor
Slide 2
OPENING SCENARIO
Slide 3
LEARNING OBJECTIVES
By the end of the lecture, students should be able to:
- Recognize what Big Data Analytics is and the challenges it poses.
- Recognize new models and techniques in Big Data Analytics.
- Recognize examples of how Big Data Analytics is used today.
- Recognize the benefits of Big Data Analytics.
Slide 4
WHAT ARE WE GOING TO UNDERSTAND
- What is Big Data?
- Huge amount of data: are we ready?
- Why did we end up here?
- To whom does it matter?
- What are the concerns?
- Tools and technologies
- Is Big Data Hadoop?
- Moving towards analytics
- Applications
Slide 5
INTRODUCTION
Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand database
management tools or traditional data processing applications. The
challenges we face with DBMS tools and other technologies include
capture, curation, storage, search, sharing, transfer, analysis, and
visualization.
Slide 6
SIMPLE TO START
- What is the maximum file size you have dealt with so far? Movies,
  files, or streaming video that you have used?
- What have you observed?
- What is the maximum download speed you get?
- Simple computation: how much time does it take just to transfer
  the data?
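The "simple computation" can be sketched in a few lines of Python. This is a minimal illustration rather than a measurement: the file sizes and the 100 Mbit/s link speed are assumed values, and the formula ignores protocol overhead and congestion, so real transfers will be slower.

```python
def transfer_time_seconds(size_bytes: float, speed_bits_per_second: float) -> float:
    """Ideal time to move size_bytes over a link of the given speed."""
    if speed_bits_per_second <= 0:
        raise ValueError("speed must be positive")
    # Sizes are given in bytes, link speeds in bits per second.
    return size_bytes * 8 / speed_bits_per_second


GB = 10**9    # decimal gigabyte
TB = 10**12   # decimal terabyte
MBPS = 10**6  # megabit per second

# Assumed inputs: a 4 GB movie and a 1 TB data set on a 100 Mbit/s link.
movie = transfer_time_seconds(4 * GB, 100 * MBPS)
dataset = transfer_time_seconds(1 * TB, 100 * MBPS)

print(f"4 GB movie: {movie / 60:.1f} minutes")       # 5.3 minutes
print(f"1 TB data set: {dataset / 3600:.1f} hours")  # 22.2 hours
```

Even under these ideal assumptions, moving a single terabyte takes most of a day, which is why simply transferring the data is already one of the challenges.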
Slide 7
WHAT IS BIG DATA?
Every day, we create 2.5 quintillion bytes of data, so much that 90%
of the data in the world today has been created in the last two years
alone. This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals, to name a
few. This data is big data.
Slide 8
HUGE AMOUNT OF DATA
There are huge volumes of data in the world:
- From the beginning of recorded time until 2003, we created
  5 billion gigabytes (5 exabytes) of data.
- In 2011, the same amount was created every two days.
- In 2013, the same amount was created every 10 minutes.
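To make these figures comparable, the arithmetic below converts the slide's 5 billion gigabytes (5 exabytes) into an average creation rate for each period. It is a sketch that takes the slide's numbers at face value and uses decimal units; the variable names are illustrative.

```python
EXABYTE = 10**18
batch_bytes = 5 * EXABYTE      # the slide's "5 billion gigabytes"

seconds_2011 = 2 * 24 * 3600   # the same amount every two days
seconds_2013 = 10 * 60         # the same amount every ten minutes

rate_2011 = batch_bytes / seconds_2011  # bytes per second
rate_2013 = batch_bytes / seconds_2013  # bytes per second
speedup = seconds_2011 / seconds_2013

print(f"2011: {rate_2011 / 10**12:.1f} TB per second")
print(f"2013: {rate_2013 / 10**15:.1f} PB per second")
print(f"Rate increase from 2011 to 2013: {speedup:.0f}x")
```

On these numbers, the creation rate grew by a factor of 288 in two years.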
Slide 9
THE SOCIAL LAYER IN AN INSTRUMENTED INTERCONNECTED WORLD
- 2+ billion people on the Web by end of 2011
- 30 billion RFID tags today (1.3B in 2005)
- 4.6 billion camera phones worldwide
- 100s of millions of GPS-enabled devices sold annually
- 76 million smart meters in 2009, 200M by 2014
- 12+ TBs of tweet data every day
- 25+ TBs of log data every day
- ? TBs of data every day
Slide 10
HUGE AMOUNT OF DATA
Slide 12
WEB 2.0 IS DATA-DRIVEN
Slide 13
THE WORLD OF DATA-DRIVEN APPLICATIONS
Slide 14
CHARACTERISTICS OF BIG DATA
- Variety: collectively analyzing the broadening variety of data.
  80% of the world's data is unstructured.
- Velocity: responding to the increasing velocity of data.
  30 billion RFID sensors and counting.
- Volume: cost-efficiently processing the growing volume of data.
  50x growth from 2010 to 2020, reaching 35 ZB.
- Veracity: establishing the veracity of big data sources.
  1 in 3 business leaders don't trust the information they use to
  make decisions.
Slide 15
CHARACTERISTICS OF BIG DATA
Slide 16
EXAMPLES
Volume: Enterprises are awash with ever-growing data of all types,
easily amassing terabytes, even petabytes, of information.
- Turn 12 terabytes of Tweets created each day into improved product
  sentiment analysis
- Convert 350 billion annual meter readings to better predict power
  consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive
processes such as catching fraud, big data must be used as it streams
into your enterprise in order to maximize its value.
- Scrutinize 5 million trade events created each day to identify
  potential fraud
- Analyze 500 million daily call detail records in real time to
  predict customer churn faster
- The latest I have heard is that a delay of 10 nanoseconds is too
  much.
Variety: Big data is any type of data, structured and unstructured,
such as text, sensor data, audio, video, click streams, log files and
more. New insights are found when analyzing these data types together.
- Monitor 100s of live video feeds from surveillance cameras to
  target points of interest
- Exploit the 80% data growth in images, video and documents to
  improve customer satisfaction
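The velocity point, using data as it streams in rather than after it has been stored, can be illustrated with a small sliding-window check. This is a toy sketch, not a fraud model: the function name `flag_bursts`, the 60-second window, and the threshold of three events are all assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Iterable, Iterator, Tuple


def flag_bursts(events: Iterable[Tuple[float, str]],
                window_seconds: float = 60.0,
                max_events: int = 3) -> Iterator[Tuple[float, str]]:
    """Yield (timestamp, account) whenever an account has more than
    max_events events inside the trailing window.

    Events are processed one at a time and must arrive in timestamp order.
    """
    recent = defaultdict(deque)  # account -> timestamps still in the window
    for ts, account in events:
        window = recent[account]
        window.append(ts)
        # Drop timestamps that have fallen out of the trailing window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield ts, account


# Hypothetical trade events: (seconds since start, account id).
trades = [(0, "A"), (5, "B"), (10, "A"), (20, "A"), (30, "A"), (200, "A")]
print(list(flag_bursts(trades)))  # [(30, 'A')]
```

Because the function is a generator that keeps only the timestamps inside the current window, it raises a flag as soon as the fourth trade arrives and never needs the full history in memory.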
Slide 17
FINALLY
"Big data" is similar to "small data", but bigger. Because the data
is bigger, however, it requires different approaches (techniques,
tools, and architecture), with the aim of solving new problems, or
old problems in a better way.
Slide 18
WHY BIG DATA?
Key enablers for the appearance and growth of big data are:
- Increase in storage capabilities
- Increase in processing power
- Availability of data
Slide 19
TEN COMMON BIG DATA PROBLEMS
Slide 20
THE BIG DATA OPPORTUNITY
Slide 21
INDUSTRIES ARE EMBRACING BIG DATA
Slide 22
TO WHOM DOES IT MATTER
- Financial services
- Business community: new tools, new capabilities, new
  infrastructure, new business models, etc.
- Health services
Slide 23
BIG DATA EXPLORATION: VALUE
Find, visualize & understand all big data to improve business
knowledge:
- Greater efficiencies in business processes
- New insights from combining and analyzing data types in new ways
- Develop new business models with resulting increased market
  presence and revenue
Diagram labels: File Systems, Relational Data, Content Management,
Email, CRM, Supply Chain, ERP, RSS Feeds, Cloud, Custom Sources; Data
Explorer; Application/Users
Slide 24
BIG DATA EXPLORATION: CUSTOMER EXAMPLE
Airline Manufacturer
- Exploring 4 TB to drive point business solutions (supplier portal,
  call center, etc.)
- Single point of data fusion for all employees to use
- Reduced costs & improved operational performance for the business
Key Questions to Ask
- Can you navigate and explore all enterprise and external content
  in a single user interface?
- Can you quickly identify areas of data risk?
- Do you have a logical starting point for your big data
  initiatives?
- Can you separate the noise from useful content?
- Can you perform data exploration on large and complex data?
- Can you find insights in new or unstructured data types (e.g.
  social media and email)?
Slide 25
ENHANCED 360 VIEW OF THE CUSTOMER: NEEDS
- Need a deeper understanding of customer sentiment from both
  internal and external sources
- Extend existing customer views (MDM, CRM, etc.) by incorporating
  additional internal and external information sources
- Desire to increase customer loyalty and satisfaction by
  understanding what meaningful actions are needed
- Challenged to get the right information to the right people to
  provide customers what they need to solve problems, cross-sell &
  up-sell
Slide 26
ENHANCED 360 CUSTOMER VIEW: CUSTOMER EXAMPLE
- Create Facebook
- Identify 200+ different customer profiles to help in fulfillment &
  marketing efforts
- Leverage new data types in customer analysis
Key Questions to Ask
- How are you driving consistency across your information assets
  when representing your customers, clients, partners, etc.?
- How can a complete view of the customer enhance your line of
  business users and result in better business outcomes?
- Can you identify and deliver all data as it relates to a customer,
  product, or competitor to those who need it?
- Can you gather insights about your customers from social data,
  surveys, support emails, etc.?
- Can you combine your structured and unstructured data to run
  analytics?
Product Starting Point: InfoSphere MDM Server, Data Explorer,
BigInsights
Slide 27
OPERATIONS ANALYSIS: NEEDS
Benefits:
- Gain real-time visibility into operations, customer experience,
  transactions and behavior
- Proactively plan to increase operational efficiency
- Analyze a variety of machine data for improved business results
- Identify and investigate anomalies
- Monitor end-to-end infrastructure to proactively avoid service
  degradation or outages
Business Challenges:
- Complexity and rapid growth of machine data
- Difficult to capture even a small fraction of machine data for
  better decisions
- Inability to analyze machine data and combine it with enterprise
  data for a full-view analysis
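As a rough illustration of what identifying anomalies in machine data can mean in practice, the sketch below flags values that lie far from the mean of a series. The z-score rule, the threshold of 2.0, and the sample error counts are assumptions chosen for the example; production log analytics typically uses more robust methods.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return the indexes whose value lies more than `threshold` sample
    standard deviations away from the mean of the series."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical machine data: error lines per minute taken from a server log.
errors_per_minute = [4, 5, 3, 6, 4, 5, 48, 5, 4, 6]
print(find_anomalies(errors_per_minute))  # [6]
```

Here only the minute with 48 errors is reported, which is the kind of signal an operations team would then investigate for a root cause.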
Slide 28
OPERATIONS ANALYSIS: VALUE & DIAGRAM
Diagram labels: Raw Logs and Machine Data; Indexing, Search;
Statistical Modeling; Root Cause Analysis; Federated Navigation &
Discovery; Real-time Analysis; Only store what is needed; Machine Data
Accelerator
Slide 29
OPERATIONAL ANALYSIS
Capabilities: Hadoop & Stream Computing
- Intelligent Infrastructure Management: log analytics, energy bill
  forecasting, energy consumption optimization, anomalous energy
  usage detection, presence-aware energy management
- Optimized building energy consumption with centralized monitoring;
  automated preventive and corrective maintenance
Slide 30
DATA WAREHOUSE AUGMENTATION: NEEDS
Integrate big data and data warehouse capabilities to increase
operational efficiency
Need to leverage variety of data:
- Structured, unstructured, and streaming data sources required for
  deep analysis
- Low latency requirements (hours, not weeks or months)
- Required query access to data
Extend warehouse infrastructure:
- Optimized storage, maintenance and licensing costs by migrating
  rarely used data to Hadoop
- Reduced storage costs through smart processing of streaming data
- Improved warehouse performance by determining what data to feed
  into it
DATA WAREHOUSE AUGMENTATION: CUSTOMER EXAMPLE
- Creates pre-processing hub and performs ad hoc analysis
- Hadoop-based landing zone used to store, manage and analyze
  structured, semi-structured and multi-structured data before moving
  to the warehouse
- Benefits: data warehouse optimized for workload and performance
- Utilized InfoSphere BigInsights, InfoSphere DataStage
Key Questions to Ask
- Are you drowning in very large data sets (TBs to PBs) that are
  difficult and costly to store?
- Are you able to utilize and store new data types?
- Are you facing rising maintenance/licensing costs?
- Do you use your warehouse environment as a repository for all
  data?
- Do you have a lot of cold, or low-touch, data driving up costs or
  slowing performance?
- Do you want to perform analysis of data in motion to determine
  what should be stored in the warehouse?
- Do you want to perform data exploration on all data?
- Are you using your data for new types of analytics?
Slide 32
WHAT DO REVENUES LOOK LIKE?
Slide 33
WHAT DOES BIG DATA TRIGGER?
From "Big Data and the Web: Algorithms for Data Intensive Scalable
Computing", Ph.D. Thesis, Gianmarco
Slide 34
DEALING WITH BIG DATA IS HARD
When the operations on data are complex: simple counting, for example,
is not a complex problem, but modeling and reasoning with data of
different kinds can get extremely complex.
Good news with big data: often, because of the vast amount of data,
modeling techniques can get simpler (e.g., smart counting can replace
complex model-based analytics), as long as we can deal with the scale.
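One way to picture "smart counting" is next-word suggestion driven purely by counts of which word follows which, with no linguistic model at all. The sketch below is an illustration chosen for this note, not an example from the lecture; the tiny corpus and the function names are hypothetical, and the approach only becomes useful when the counts come from a very large corpus.

```python
from collections import Counter, defaultdict


def build_bigram_counts(sentences):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def suggest_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = [
    "big data needs new tools",
    "big data needs distributed storage",
    "big data is growing",
]
counts = build_bigram_counts(corpus)
print(suggest_next(counts, "data"))  # needs
```

The logic is nothing more than counting, so the hard part shifts from modeling to handling the scale of the data being counted.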
Slide 35
TECHNOLOGIES USED IN BIG DATA
- Manage & store huge volume of any data: Hadoop File System,
  MapReduce
- Manage streaming data: Stream Computing
- Analyze unstructured data: Text Analytics Engine
- Structure and control data: Data Warehousing
- Integrate and govern all data sources: Integration, Data Quality,
  Security, Lifecycle Management, MDM
- Understand and navigate federated big data sources: Federated
  Discovery and Navigation
Slide 36
TYPES OF TOOLS TYPICALLY USED IN BIG DATA SCENARIO
- Where is the processing hosted? Distributed server/cloud
- Where is the data stored? Distributed storage (e.g., Amazon S3)
- What is the programming model? Distributed processing (MapReduce)
- How is the data stored and indexed? High-performance schema-free
  database
- What operations are performed on the data? Analytic/semantic
  processing (e.g., RDF/OWL)
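The MapReduce programming model named above can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the map, shuffle, and reduce phases only; it does not use Hadoop or any distributed runtime, and the function names are chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document is mapped independently of the others.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data is big", "data is everywhere"]
print(word_count(docs))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each call to `map_phase` touches only one document and each call to `reduce_phase` touches only one key, a distributed framework can run these calls on different machines, which is what makes the model suit very large data sets.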
Slide 37
WHY HADOOP?
Slide 38
WHY HADOOP?
Slide 39
Time for thinking
What do you do with the data? Let's take an example: from application
developers to video streamers, organizations of all sizes face the
challenge of capturing, searching, analyzing, and leveraging as much
as terabytes of data per second, too much for the constraints of
traditional system capabilities and database management tools.
Slide 40
QUESTIONS FROM BUSINESSES
Slide 41
BIG DATA STRATEGY: MOVE THE ANALYTICS CLOSER TO THE DATA
And grow and evolve on your current IT infrastructure
New analytic applications drive the requirements for a big data
platform:
- Integrate and manage the full variety, velocity and volume of data
- Apply advanced analytics to information in its native form
- Visualize all available data for ad-hoc analysis (even in motion!)
- Development environment for building new analytic applications
- Workload optimization and scheduling
- Security and governance
Diagram labels: Analytic Applications (BI / Reporting, Exploration /
Visualization, Functional App, Industry App, Predictive Analytics,
Content Analytics); Big Data Platform (Visualization & Discovery,
Application Development, Systems Management, Accelerators, Hadoop
System, Stream Computing, Data Warehouse, Information Integration &
Governance)
Slide 42
FOUR ENTRY POINTS OF BIG DATA
- Unlock Big Data
- Simplify Your Warehouse
- Preprocess Raw Data
- Analyse Streaming Data
Diagram labels: Analytic Applications (BI / Reporting, Exploration /
Visualization, Functional App, Industry App, Predictive Analytics,
Content Analytics); Big Data Platform (Visualization & Discovery,
Application Development, Systems Management, Accelerators, Hadoop
System, Stream Computing, Data Warehouse, Information Integration &
Governance)
Slide 43
ADVANTAGES
- Dialogue with consumers
- Redevelop your products
- Perform risk analysis
- Keep data safe
- Customize your website in real time
- Reduce maintenance costs