B28-BICTIB


    SUBMITTED BY:

    POOJA MISHRA (12609071)

    SWATI GUPTA (12609048)

    SECTION B


    Integrating Data Sources

    (Chapter 15)


    Introduction

    Identifying the data you need

    Understanding the fundamentals of big data integration

    Using Hadoop as ETL

    Knowing best practices for data integration


    Identifying the data you need

    Before you can begin to plan for integration of your big data, you need to take stock of the type of data you are dealing with.

    By leveraging new tools, organizations are gaining new insight from previously untapped sources of unstructured data in e-mails, customer service records, sensor data, and security logs.

    As you begin your big data analysis, you probably do not know exactly what you will find. Your analysis will go through several stages:

    Exploratory stage

    Codifying stage

    Integration and incorporation stage


    Exploratory Stage

    In the early stages of your analysis, you will want to search for patterns in the data.

    It is only by examining very large volumes (terabytes and petabytes) of data that new and unexpected relationships and correlations among elements may become apparent.

    You will need a platform such as Hadoop for organizing your big data to look for these patterns.

    In the exploratory stage, you are not so concerned about integration with operational data.


    Using FlumeNG for big data integration

    Flume is used to collect large amounts of log data from distributed servers.

    Flume is designed for scalability and can continually add more resources to a system to handle extremely large amounts of data in an efficient way.

    Flume's output can be integrated with Hadoop and Hive for analysis of the data.

    Flume also has transformation elements to use on the data and can turn your Hadoop infrastructure into a streaming source of unstructured data.
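
    The sketch below is a conceptual illustration in Python (not the Flume API) of the source -> channel -> sink pipeline a Flume agent implements: a source tails log files, an in-memory channel buffers the events, and a sink drains them in batches toward a store such as HDFS or Hive. The file names and batch size are illustrative assumptions.

```python
# Conceptual stand-in for a Flume agent: source -> channel -> sink.
# File names and batch size are illustrative assumptions.
import queue
import threading
import time

channel = queue.Queue(maxsize=10_000)        # the channel: an in-memory event buffer

def source(logfile):
    """Tail a log file (like `tail -f`) and put each new line on the channel."""
    with open(logfile) as f:
        f.seek(0, 2)                          # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                channel.put(line.rstrip("\n"))
            else:
                time.sleep(0.5)

def sink(outfile, batch_size=100):
    """Drain the channel in batches; a real sink would write to HDFS or Hive."""
    batch = []
    while True:
        batch.append(channel.get())
        if len(batch) >= batch_size:
            with open(outfile, "a") as out:
                out.write("\n".join(batch) + "\n")
            batch.clear()

if __name__ == "__main__":
    threading.Thread(target=source, args=("app.log",), daemon=True).start()
    sink("collected.log")
```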


    Looking for patterns in big data

    In the exploratory stage, technology can be used to rapidly search through huge amounts of streaming data and pull out the trending patterns that relate to specific products or customers.

    As companies search for patterns in big data, the huge data volumes are narrowed down as if they are passed through a funnel.

    You may start with petabytes of data and then, as you look for data with similar characteristics or data that forms a particular pattern, you eliminate data that does not match up.
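
    As a rough illustration of this funnel, the sketch below applies successive filters to a record set; the fields and predicates are made-up examples rather than anything from the text.

```python
# Illustrative "funnel": successive filters that narrow a collection of
# records down to the ones matching a pattern of interest.
records = [
    {"product": "phone", "region": "EMEA", "sentiment": -0.8},
    {"product": "phone", "region": "APAC", "sentiment": 0.4},
    {"product": "tablet", "region": "EMEA", "sentiment": -0.1},
]

funnel = [
    lambda r: r["product"] == "phone",       # keep one product line
    lambda r: r["region"] == "EMEA",         # keep one region
    lambda r: r["sentiment"] < -0.5,         # keep strongly negative feedback
]

for predicate in funnel:
    records = [r for r in records if predicate(r)]   # each pass discards non-matches

print(records)   # only the records that survived every stage of the funnel
```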


    Codifying stage

    After you find something interesting in your big data analysis, you need to codify it and make it a part of your business process.

    You need to make the connection between your big data analytics and your inventory and product systems.

    To codify the relationship between your big data analytics and your operational data, you need to integrate the data.


    Integration and incorporation stage

    Once big data analysis is complete, an approach is needed that will allow you to integrate or incorporate the results of big data analysis into business processes and real-time business actions.

    Technologies for high-speed transport of very large and fast data are a requirement for integrating across distributed big data sources and between big data and operational data.

    A company that uses big data to predict customer interest in new products needs to make a connection between the big data and the operational data on customers and products to take action. If the company wants to use this information to buy new products or change pricing, it needs to integrate its operational data with the results of its big data analysis.


    Understanding the Fundamentals of Big Data Integration

    You must create a common understanding of data definitions.

    You must develop a set of data services to qualify the data and make it consistent and ultimately trustworthy.

    You need a streamlined way to integrate your big data sources and systems of record.


    Defining Traditional ETL

    Traditionally, ETL has been used with batch processing in data warehouse environments.

    ETL tools are used to transform the data into the format required by the data warehouse.

    However, ETL is evolving to support integration across much more than traditional data warehouses. ETL can support integration across transactional systems, operational data stores, BI platforms, MDM hubs, the cloud, and Hadoop platforms.


    Extract: Read data from the source database.

    Transform: Convert the format of the extracted data so that it conforms to the requirements of the target database.

    Load: Write data to the target database.
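
    Below is a minimal ETL sketch under illustrative assumptions: a CSV file stands in for the source database, a small cleanup function is the transform, and a SQLite table is the target database.

```python
# Minimal ETL sketch: extract from a CSV "source", transform the format,
# load into a SQLite "target". Assumes a sales.csv with columns
# customer_id,name,amount -- all names here are illustrative.
import csv
import sqlite3

def extract(path):
    """Extract: read rows from the source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: convert each row into the format the target expects."""
    return [
        (row["customer_id"], row["name"].strip().title(), float(row["amount"]))
        for row in rows
    ]

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows to the target."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))
```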


    Data transformation

    Data transformation is the process of changing the format of data so that it can be used by different applications.

    This process also includes mapping instructions so that applications are told how to get the data they need to process.

    The process of data transformation is made far more complex because of the staggering growth in the amount of unstructured data.

    Data transformation tools are not designed to work well with unstructured data.

    As a result, companies are faced with a significant amount of manual coding to accomplish the required data integration.
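
    The sketch below hints at the kind of manual coding involved: mapping an unstructured log line into the structured record an application expects. The log format and field names are assumptions, not from the text.

```python
# Manual transformation sketch: parse a free-form log line into a structured
# record. The log layout and field names are illustrative assumptions.
import re
from datetime import datetime

LINE = '2013-12-03 09:15:02 WARN payment-svc "card declined" user=4521'

PATTERN = re.compile(
    r'(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<level>\w+) (?P<service>[\w-]+) "(?P<message>[^"]*)" user=(?P<user>\d+)'
)

def to_record(line):
    """Return a structured record for one line, or None if it does not map."""
    m = PATTERN.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S")
    rec["user"] = int(rec["user"])
    return rec

print(to_record(LINE))
```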


    Prioritizing Big Data Quality

    You should follow a two-phase approach to data quality:

    Phase 1: Look for patterns in big data without concern for data quality.

    Phase 2: After you locate your patterns and establish results that are important to the business, apply the same data quality standards that you apply to your traditional data sources. You want to avoid collecting and managing big data that is not important to the business and will potentially corrupt other data elements in Hadoop or other big data platforms.
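
    A minimal sketch of what the Phase 2 checks might look like, assuming simple illustrative rules (required fields, value ranges, de-duplication):

```python
# Illustrative Phase 2 quality checks applied to records kept after the
# pattern-finding phase. The rules and fields are made-up examples.
def is_clean(record):
    return (
        record.get("customer_id") is not None             # required field present
        and isinstance(record.get("amount"), (int, float))
        and 0 <= record["amount"] < 1_000_000              # plausible value range
    )

records = [
    {"customer_id": "c1", "amount": 42.0},
    {"customer_id": "c1", "amount": 42.0},    # duplicate -> dropped
    {"customer_id": None, "amount": 13.0},    # missing key -> rejected
]

seen, clean = set(), []
for r in records:
    key = (r["customer_id"], r["amount"])
    if is_clean(r) and key not in seen:       # keep only valid, unseen records
        seen.add(key)
        clean.append(r)

print(clean)
```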


    Using Hadoop as ETL

    Hadoop can be used to handle some of the transformation process and to otherwise improve on the ETL and data-staging processes.

    You can speed up the data integration process by loading both unstructured data and traditional operational and transactional data directly into Hadoop, regardless of the initial structure of the data.

    After the data is loaded into Hadoop, it can be further integrated using traditional ETL tools.

    When Hadoop is used as an aid to the ETL process, it speeds the analytics process.
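
    One common way to push part of the transformation into Hadoop is Hadoop Streaming, where the mapper is an ordinary script. The sketch below is illustrative: the input layout (comma-separated id, name, amount) is an assumption.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper sketch: read raw lines from stdin and
# emit cleaned, tab-separated records for later ETL steps.
# The comma-separated input layout is an illustrative assumption.
import sys

for line in sys.stdin:
    parts = [p.strip() for p in line.rstrip("\n").split(",")]
    if len(parts) != 3:
        continue                              # skip malformed raw records
    record_id, name, amount = parts
    try:
        amount = f"{float(amount):.2f}"       # normalize the numeric field
    except ValueError:
        continue
    print(f"{record_id}\t{name.title()}\t{amount}")
```

    It could be launched with something like `hadoop jar hadoop-streaming.jar -input raw/ -output staged/ -mapper mapper.py -file mapper.py`, where the input and output paths are illustrative.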


    Best Practices for Data Integration in a Big Data World

    Keep data quality in perspective.

    Consider real-time data requirements.

    Don't create new silos of information.


    Dealing With Real-time Data Streams and Complex Event Processing

    (Chapter 16)


    Introduction

    Explaining Streaming Data: Meaning, Principles, Uses, Products for Streaming Data

    Explaining Complex Event Processing: Meaning, Uses, Vendors

    Differentiating CEP from Streams

    Understanding the Impact of Streaming Data and CEP on Business


    Data Streaming

    MEANING:

    Streaming data is an analytic computing platform that is focused on speed. This is because these applications require a continuous stream of often unstructured data to be processed.

    Therefore, data is continuously analyzed and transformed in memory before it is stored on a disk.

    Processing streams of data works by processing time windows of data in memory across a cluster of servers.

    It is a single-pass analysis, i.e., the analyst cannot reanalyze the data after it is streamed.

    Streaming data is useful when analytics need to be done in real time while the data is in motion.
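
    A minimal sketch of the time-window idea, assuming fixed (tumbling) windows and made-up events: each window is analyzed once in memory and then discarded, so the data cannot be reanalyzed afterwards.

```python
# Single-pass, in-memory processing of fixed time windows over a stream.
# Window length and event values are illustrative assumptions.
WINDOW_SECONDS = 60

def emit(window, values):
    print(f"window {window}: n={len(values)} avg={sum(values) / len(values):.2f}")

def process_stream(events):
    """events: an iterable of (timestamp_seconds, value) pairs in arrival order."""
    current_window, values = None, []
    for ts, value in events:
        window = int(ts // WINDOW_SECONDS)
        if current_window is not None and window != current_window:
            emit(current_window, values)      # analyze the finished window...
            values = []                       # ...then discard it (single pass)
        current_window = window
        values.append(value)
    if values:
        emit(current_window, values)

process_stream([(0, 1.0), (30, 3.0), (70, 10.0), (95, 20.0)])
```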


    PRINCIPLES:

    When it is necessary to determine a retail buying opportunity at the point of engagement, either via social media or via permission-based messaging

    Collecting information about the movement around a secure site

    To be able to react to an event that needs an immediate response, such as a service outage or a change in a patient's medical condition

    Real-time calculation of costs that are dependent on variables such as usage and available resources


    USES:

    A power plant: A power plant needs to be a highly secure environment. Companies often place sensors around the perimeter of a site to detect movement. Therefore, the vast amount of data coming from these sensors needs to be analyzed in real time so that an alarm is sounded only when an actual threat exists.

    A telecommunications company: It is a highly competitive market. Communications systems generate huge volumes of data that have to be analyzed in real time to take the appropriate action. A delay in detecting an error can seriously impact customer satisfaction.


    Continued:

    Oil exploration company: It needs to know exactly the sources of oil, environmental factors impacting its operations, water depth, temperature, ice flows, and so on. This massive amount of data needs to be analyzed and computed so that mistakes are avoided.

    Medical diagnostic group: These groups need to take massive amounts of data from brain scans and analyze the results in real time to determine where the source of a problem is and what type of action needs to be taken to help the patient.


    PRODUCTS FOR STREAMING DATA:

    IBM InfoSphere Streams

    InfoSphere Streams provides continuous analysis of massive data volumes. It is intended to perform complex analytics of heterogeneous data types.

    It can perform real-time and look-ahead analysis of regularly generated data, using digital filtering, pattern/correlation analysis, and decomposition, as well as geospatial analysis.


    Twitter's Storm

    Twitter's Storm is an open source real-time analytics engine. Twitter uses Storm internally.

    It is still available as open source and has been gaining significant traction among emerging companies.

    It can be used with any programming language for applications.

    Storm is designed to work with existing queuing and database technologies.


    Apache S4

    The four S's in S4 stand for Simple Scalable Streaming System.

    It allows programmers to easily develop applications for processing continuous streams of data.

    S4 is designed as a highly distributed system.

    The S4 design is best suited for large-scale applications for data mining and machine learning in a production environment.


    Complex Event Processing

    MEANING:

    CEP is an advanced approach based on simple event processing that collects and combines data from different relevant sources to discover events and patterns that can result in action.

    It is a technique for tracking, analyzing, and processing data as an event happens.

    It enables companies to establish the correlation between streams of information and match the resulting pattern with defined behaviors, such as mitigating a threat or seizing an opportunity.
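
    As a minimal sketch of the idea (with a made-up rule and made-up event fields), the code below correlates events that share a key and triggers an action when a defined pattern appears within a time window:

```python
# CEP-style sketch: correlate card transactions and trigger an action when
# the same card is used in two different cities within ten minutes.
# The rule, fields, and window length are illustrative assumptions.
from collections import defaultdict

WINDOW = 600  # seconds

last_seen = defaultdict(list)   # card_id -> recent (timestamp, city) events

def trigger_alert(card, city, history):
    print(f"possible fraud on card {card}: seen in {history[-1][1]} and {city} within 10 minutes")

def on_event(event):
    """event: {'card': ..., 'city': ..., 'ts': seconds since some epoch}"""
    card, city, ts = event["card"], event["city"], event["ts"]
    history = [(t, c) for t, c in last_seen[card] if ts - t <= WINDOW]   # keep the window
    if any(c != city for _, c in history):
        trigger_alert(card, city, history)    # defined pattern matched -> act
    history.append((ts, city))
    last_seen[card] = history

for e in [{"card": "42", "city": "Delhi", "ts": 0},
          {"card": "42", "city": "Mumbai", "ts": 300}]:
    on_event(e)
```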


    USES:

    Retail chain: It creates a tiered loyalty program to increase repeat sales. Using a CEP platform, the system triggers a process that offers the customer an extra discount on a related product.

    Credit card company: These use CEP to better manage fraud. The underlying system will correlate the incoming transactions, track the stream of event data, and trigger a process.

    Other applications: CEP is also implemented in financial trading applications, weather-reporting applications, and sales management applications.


    VENDORS OF CEP:

    Esper (open source vendor)

    IBM with IBM Operational Decision Manager

    Informatica with RulePoint

    Oracle with its Complex Event Processing solution

    Microsoft with StreamInsight

    SAS DataFlux Event Stream Processing Engine

    StreamBase CEP


    Differentiating CEP from streams

    Streaming computing is typically applied to analyzing vast amounts of data in real time, while CEP is focused on solving a specific use case based on events and actions.

    In many situations CEP is dependent on data streams; however, CEP is not required for streaming data.

    Streaming computing is used to handle unstructured data, while CEP deals with variables correlated with a specific business process.

    Streaming data is managed in a highly distributed clustered environment, while CEP often runs on less complex hardware.


    Impact of streaming data and CEP on business

    With streaming data, companies are able to process and analyze big data in real time to gain an immediate insight.

    With CEP approaches, companies can stream data and then leverage a business process engine to apply business rules to the results of that streaming data analysis.
