Oracle’s Big Data solutions Jean-Philippe Breysse Oracle Suisse

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remain at the sole discretion of Oracle.

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Use Case 3: Server Log Analysis. Short description: daily log analysis. Issues: finding correlations that reveal what leads to failures; log files are stored as flat files.
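As a rough illustration of the log-analysis use case, the sketch below counts error events per server in map/reduce style over flat-file log lines. The log line format (`<date> <server> <level> <message>`), the sample data, and all function names are assumptions made for this example; the slides do not specify them.

```python
from collections import Counter

# Hypothetical flat-file log lines: "<date> <server> <level> <message>"
SAMPLE_LOG = [
    "2012-03-01 srv01 INFO service started",
    "2012-03-01 srv02 ERROR disk failure on /dev/sdb",
    "2012-03-01 srv01 ERROR out of memory",
    "2012-03-02 srv02 ERROR disk failure on /dev/sdc",
    "2012-03-02 srv03 INFO backup completed",
]


def map_line(line):
    """Map step: emit ((server, level), 1) for each parsable log line."""
    parts = line.split(maxsplit=3)
    if len(parts) < 3:
        return []  # skip malformed lines
    _date, server, level = parts[:3]
    return [((server, level), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (server, level) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def errors_per_server(lines):
    """Run map then reduce, and keep only the ERROR counts per server."""
    pairs = [pair for line in lines for pair in map_line(line)]
    counts = reduce_counts(pairs)
    return {server: n for (server, level), n in counts.items() if level == "ERROR"}


if __name__ == "__main__":
    print(errors_per_server(SAMPLE_LOG))
```

On a real cluster the map and reduce steps would run as distributed tasks over files in HDFS; here they run in a single process only to show the shape of the computation.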

Oracle Technology Mapped to the Analytics Landscape. Stages: Acquire, Organize, Analyze, Decide. Data types: Structured, Semi-structured, Unstructured. Data sources: Master & Reference, Transactions, Machine Generated, Text, Image, Video, Audio. Oracle technologies: Oracle 12g, Files, Oracle NoSQL, Oracle Hadoop HDFS, Oracle Data Integrator, Oracle Hadoop MapReduce, Oracle Essbase, Oracle R Enterprise & Oracle Data Mining, Oracle BI Enterprise, Oracle Real Time Decisions, Oracle Endeca Information Discovery, Oracle TimesTen, Endeca MDEX, Oracle GoldenGate.

Agenda: Big Data Solution Spectrum; Inside the Big Data Appliance; Big Data Application Software; Big Data Analytics; Conclusions

Big Data: Why Everyone Should Care

Tapping into Diverse Data Sets. Information architectures today: decisions based on database data. Big Data: decisions based on all your data. Data sets: Transactions, Video and Images, Machine-Generated Data, Social Data, Documents.

A bit of history... Hadoop: developed initially by Doug Cutting (Nutch, an open-source web search engine) and Yahoo; inspired by Google's papers on GFS and MapReduce (2003-2004); resulted in Apache Hadoop (2006). Amazon Dynamo (2007): distributed systems technologies. Cassandra: developed at Facebook (2008) to power its Inbox Search feature; a column-oriented distributed database based initially on Dynamo and Bigtable (built by Google). Voldemort: a distributed data store designed as a key-value store, used by LinkedIn for high-scalability storage (NoSQL key-value). Cloudera: contributes to Hadoop and related Apache projects and provides a commercial distribution of Hadoop.

So What Is Big Data Anyway? It's a matter of perspective. Big Data is both:

LARGE AND VARIABLE DATASETS that are difficult for traditional database tools to manage easily, including datasets that once seemed unimportant or too problematic to deal with. Big Data datasets include extremely large files of unstructured or semi-structured data, and large, highly distributed datasets that are otherwise difficult to manage as a single unit of information.

A NEW SET OF TECHNOLOGIES that can economically capture, store, manage, and extract value from Big Data datasets, thus facilitating better, more informed business decisions.

Structured Data vs. Unstructured Data: Relational databases work best with structured data, that is, data with an underlying structure (schema) and a size that easily fits the specific confines of database columns and rows. Unstructured data is highly variable, lacks fixed structure, and is often too large for RDBMS systems to handle easily.

Source: IDC Digital Universe Study, Extracting Value from Chaos, June 2011 (sponsored by EMC)

Drive Value from Big Data: Building a Big Data Platform

Divided Solution Spectrum. Stages: Acquire, Organize, Analyze. Components: Distributed File Systems, Transaction (Key-Value) Stores, MapReduce Solutions, ETL, DBMS (OLTP), DBMS (DW), Advanced Analytics. NoSQL: Flexible, Specialized, Developer Centric. SQL: Trusted, Secure, Administered. Data variety: from Schema-less (unstructured) to Schema.

Hadoop to Oracle: Bridging the Gap. Stages: Acquire, Organize, Analyze. Components: HDFS, Cassandra, Hadoop MapReduce, ETL, Oracle Loader for Hadoop, RDBMS (OLTP), RDBMS (DW), Advanced Analytics. Data variety: from Schema-less (unstructured) to Schema.

Oracle Integrated Software Solution. Stages: Acquire, Organize, Analyze. Components: HDFS, Oracle NoSQL DB, Hadoop, Oracle Data Integrator, Oracle Loader for Hadoop, Oracle (OLTP), Oracle (DW), Oracle Analytics (Data Mining, R, Spatial, Graph, MapReduce), OBI EE. Data variety: from Schema-less (unstructured) to Schema.

Inside the Big Data Appliance: Overview

Oracle Engineered Solutions. Stages: Acquire, Organize, Analyze. Software: HDFS, Oracle NoSQL DB, Hadoop, Oracle Data Integrator, Oracle Loader for Hadoop, Oracle Database (OLTP), Oracle Database (DW), In-DB Analytics (R, Mining, Text, Graph, Spatial), Oracle BI EE. Axes: Data Variety (Unstructured to Schema), Information Density. Big Data Appliance: Hadoop, NoSQL Database, Oracle Loader for Hadoop, Oracle Data Integrator. Oracle Exadata: OLTP & DW, Data Mining & Oracle R, Semantics, Spatial. Exalytics: Speed of Thought Analytics.

Big Data Appliance Usage Model. Flow: Stream, Acquire, Organize, Analyze & Visualize. Systems: Oracle Big Data Appliance, Oracle Exadata, Oracle Exalytics. Interconnect: InfiniBand.

Why build a Hadoop Appliance? Time to Build? Required Expertise? Cost and Difficulty Maintaining?

Oracle Big Data Appliance Hardware: 18 Sun X4270 M2 servers; 48 GB memory per node = 864 GB memory; 12 Intel cores per node = 216 cores; 24 TB storage per node = 432 TB storage; 40 Gb/s InfiniBand; 10 Gb/s Ethernet.

Big Data Appliance: a cluster of industry-standard servers for Hadoop and NoSQL Database. Focus on scalability and availability at low cost. Compute and storage: 18 high-performance, low-cost servers acting as Hadoop nodes; 24 TB capacity per node; two 6-core CPUs per node; Hadoop triple replication; NoSQL Database triple replication. 10GigE network: eight 10GigE ports; datacenter connectivity. InfiniBand network: redundant 40 Gb/s switches; IB connectivity to Exadata.
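The per-node figures above can be rolled up into rack totals, and triple replication gives a first-order estimate of usable capacity. The sketch below is only back-of-the-envelope arithmetic: the function names are made up for this example, and the usable-capacity figure assumes that all raw storage holds triple-replicated data, ignoring operating system, temporary, and other overhead.

```python
# Figures taken from the slide; names are illustrative only.
NODES_PER_RACK = 18
RAW_TB_PER_NODE = 24
CPUS_PER_NODE = 2
CORES_PER_CPU = 6
REPLICATION_FACTOR = 3  # Hadoop and NoSQL Database triple replication


def rack_cores(nodes=NODES_PER_RACK):
    """Total CPU cores in one rack."""
    return nodes * CPUS_PER_NODE * CORES_PER_CPU


def raw_capacity_tb(nodes=NODES_PER_RACK, tb_per_node=RAW_TB_PER_NODE):
    """Total raw disk capacity in one rack, in TB."""
    return nodes * tb_per_node


def approx_usable_tb(raw_tb, replication=REPLICATION_FACTOR):
    """Rough usable capacity: every block is stored `replication` times.

    Assumption: all raw space is available for replicated data.
    """
    return raw_tb / replication


if __name__ == "__main__":
    raw = raw_capacity_tb()
    print("cores:", rack_cores())
    print("raw TB:", raw)
    print("approx. usable TB:", approx_usable_tb(raw))
```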

Scale Out to Infinity. Scale out by connecting racks to each other using InfiniBand. Expand up to eight racks without additional switches. Scale beyond eight racks by adding an additional switch.

Oracle Big Data Appliance Software: Oracle Enterprise Linux 5.6; Oracle HotSpot Java VM; Cloudera's Distribution including Apache Hadoop; Cloudera Manager; open-source distribution of R; Oracle NoSQL Database Community Edition.

Why Open-Source Apache Hadoop? Fast evolution in critical features: built by the Hadoop experts in the community. Practical instead of esoteric: focus on what is needed for large clusters. Proven at very large scale: in production at all the large consumers of Hadoop; extremely stable in those environments; well understood by practitioners.

Software Layout
Node 1: M: Name Node, Balancer & HBase Master; S: HDFS Data Node, NoSQL DB Storage Node
Node 2: M: Secondary Name Node, Management, Zookeeper, MySQL Slave; S: HDFS Data Node, NoSQL DB Storage Node
Node 3: M: JobTracker, MySQL Master, ODI Agent, Hive Server; S: HDFS Data Node, NoSQL DB Storage Node
Nodes 4-18: S: HDFS Data Node, Task Tracker, HBase Region Server, NoSQL DB Storage Node
Your MapReduce runs here!
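The layout above can be written down as a simple lookup table, which makes it easy to ask which nodes run a given service. This is only an illustrative data structure built from the slide; it is not an Oracle or Cloudera configuration format, and the function names are invented for the example.

```python
def build_layout():
    """Return {node number: [services]} as listed on the Software Layout slide."""
    common = ["HDFS Data Node", "NoSQL DB Storage Node"]
    layout = {
        1: ["Name Node", "Balancer", "HBase Master"] + common,
        2: ["Secondary Name Node", "Management", "Zookeeper", "MySQL Slave"] + common,
        3: ["JobTracker", "MySQL Master", "ODI Agent", "Hive Server"] + common,
    }
    # Nodes 4 through 18 share the same set of services.
    for node in range(4, 19):
        layout[node] = [
            "HDFS Data Node",
            "Task Tracker",
            "HBase Region Server",
            "NoSQL DB Storage Node",
        ]
    return layout


def nodes_running(layout, service):
    """Return the sorted node numbers on which `service` is listed."""
    return sorted(node for node, services in layout.items() if service in services)


if __name__ == "__main__":
    layout = build_layout()
    print("JobTracker:", nodes_running(layout, "JobTracker"))
    print("Task Tracker nodes:", len(nodes_running(layout, "Task Tracker")))
```

Per this layout, MapReduce tasks are scheduled by the JobTracker on node 3 and run on the Task Trackers of nodes 4 through 18.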

Big Data Application Software: Acquire New Information

Key-Value Store Workloads. Large dynamic schema-based data repositories. Data capture: web applications (click-through capture); online retail; sensor/statistics/network capture (factory automation, for example); backup services for mobile devices. Data services: scalable authentication; real-time communication (MMS, SMS, routing); personalization; social networks.

Oracle NoSQL DB: a distributed, scalable key-value database. Simple data model: key-value pair with major + sub-key paradigm; read/insert/update/delete operations. Scalability: dynamic data partitioning and distribution; optimized data access via intelligent driver. High availability: one or more replicas; disaster recovery through location of replicas; resilient to partition master failures; no single point of failure. Transparent load balancing: reads from master or replicas; driver is network topology and latency aware. Elastic (planned for Release 2): online addition/removal of Storage Nodes; automatic data redistribution. [Diagram: applications use the NoSQL DB driver to reach Storage Nodes in Data Center A and Data Center B.]
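To make the major + sub-key idea concrete, here is a minimal in-memory sketch. It is not the Oracle NoSQL DB API: the class name, method names, and hashing scheme are all assumptions for illustration. It shows one plausible reading of the model, in which the major key alone selects the partition, so all sub-keys of one major key live together.

```python
import hashlib


class TinyKVStore:
    """Toy key-value store: records are addressed by (major key, sub-key)."""

    def __init__(self, num_partitions=4):
        # Each partition maps major key -> {sub-key: value}.
        self.partitions = [dict() for _ in range(num_partitions)]

    def partition_for(self, major_key):
        """Hash only the major key so related sub-keys share a partition."""
        digest = hashlib.md5(major_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def put(self, major_key, sub_key, value):
        """Insert or update one value."""
        part = self.partitions[self.partition_for(major_key)]
        part.setdefault(major_key, {})[sub_key] = value

    def get(self, major_key, sub_key):
        """Read one value, or None if it is absent."""
        part = self.partitions[self.partition_for(major_key)]
        return part.get(major_key, {}).get(sub_key)

    def get_all(self, major_key):
        """Read every sub-key stored under one major key."""
        part = self.partitions[self.partition_for(major_key)]
        return dict(part.get(major_key, {}))

    def delete(self, major_key, sub_key):
        """Delete one value; return True if something was removed."""
        part = self.partitions[self.partition_for(major_key)]
        record = part.get(major_key)
        if record is None or sub_key not in record:
            return False
        del record[sub_key]
        if not record:
            del part[major_key]
        return True


if __name__ == "__main__":
    store = TinyKVStore()
    store.put("user42", "email", "ann@example.com")
    store.put("user42", "name", "Ann")
    print(store.get_all("user42"))
```

A real distributed store adds replication, durability, and a topology-aware driver on top of this basic addressing scheme.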

Oracle NoSQL DB Differentiation. Commercial-grade software and support: general-purpose; reliable, based on proven Berkeley DB JE HA; easy to install and configure; scalable throughput, bounded latency. Simple programming and operational model: simple major + sub-key and value data structure; ACID transactions; configurable consistency and durability. Easy management: web-based console, API accessible; manages and monitors topology, load, performance, events, and alerts. Completes Oracle's large-scale data storage offerings.

Big Data Application Software: Organizing Data for Analysis

Oracle Loader for Hadoop Features. Load data into a partitioned or non-partitioned table: single-level, composite, or interval-partitioned table; support for scalar datatypes of Oracle Database; load into Oracle Database 11g Release 2. Runs as a Hadoop job and supports standard options. Pre-partitions and sorts data on Hadoop. Online and offline load modes.

Oracle Loader for Hadoop. [Diagram: Input 1 and Input 2 pass through Map, Shuffle/Sort, and Reduce stages, with Oracle Loader for Hadoop as the final stage.]

Oracle Loader for Hadoop: Online Option. Read target table metadata from the database. Perform partitioning, sorting, and data conversion. Connect to the database from reducer nodes and load into database partitions in parallel. [Diagram: Map, Shuffle/Sort, Reduce, Oracle Loader for Hadoop.]

Oracle Loader for Hadoop: Offline Option. Read target table metadata from the database. Perform partitioning, sorting, and data conversion. Write from reducer nodes to Oracle Data Pump files. Import into the database in parallel using the external table mechanism. [Diagram: Map, Shuffle/Sort, Reduce, Oracle Loader for Hadoop.]

Oracle Loader for Hadoop Advantages. Offload database server processing to Hadoop: convert input data to final database format; compute table partition for each row; sort rows by primary key within a table partition; generate binary Data Pump files; balance partition groups across reducers.
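Three of the steps listed above (computing the partition for a row, sorting by primary key within a partition, and balancing partitions across reducers) can be sketched in a few lines. This is a simplified single-process illustration, not how Oracle Loader for Hadoop is implemented: it assumes range partitioning on one numeric column, and the greedy balancing strategy and all names are choices made for this example.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_index(value, upper_bounds):
    """Range partitioning: partition i holds values below upper_bounds[i].

    Values at or above the last bound fall into one extra, final partition.
    """
    return bisect_right(upper_bounds, value)


def partition_and_sort(rows, primary_key, partition_column, upper_bounds):
    """Group rows by target partition, then sort each group by primary key."""
    partitions = defaultdict(list)
    for row in rows:
        idx = partition_index(row[partition_column], upper_bounds)
        partitions[idx].append(row)
    return {
        idx: sorted(group, key=lambda r: r[primary_key])
        for idx, group in partitions.items()
    }


def balance(partition_sizes, num_reducers):
    """Greedy balancing: largest partition first, onto the least-loaded reducer."""
    loads = [0] * num_reducers
    assignment = {}
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for partition, size in ordered:
        target = loads.index(min(loads))
        assignment[partition] = target
        loads[target] += size
    return assignment


if __name__ == "__main__":
    rows = [
        {"id": 3, "amount": 50},
        {"id": 1, "amount": 150},
        {"id": 2, "amount": 20},
        {"id": 4, "amount": 250},
    ]
    parts = partition_and_sort(rows, "id", "amount", upper_bounds=[100, 200])
    sizes = {idx: len(group) for idx, group in parts.items()}
    print(parts)
    print(balance(sizes, num_reducers=2))
```

Doing this work on the Hadoop side means the database receives rows that are already grouped and ordered for their target partitions.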

Selecting an Output Option for the Use Case (Oracle Loader for Hadoop)
Output option: use case characteristics
Online load with JDBC: the simplest use case for non-partitioned tables
Online load with Direct Path: fast online load for partitioned tables
Offline load with Data Pump files: fastest load method for external tables
Direct HDFS (on Oracle Big Data Appliance): leave data on HDFS; parallel access from database; import into database when needed
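One way to read the table above is as a small decision procedure. The function below is a hypothetical helper that encodes that reading; the parameter names and the order of the checks are assumptions for illustration, and real choices would also depend on factors the slide does not cover.

```python
def choose_output_option(partitioned, online=True, leave_on_hdfs=False):
    """Pick an output option name from the table, given simple yes/no traits.

    partitioned:   the target table is partitioned
    online:        load directly into the database while the job runs
    leave_on_hdfs: keep the data on HDFS and query it from the database
    """
    if leave_on_hdfs:
        return "Direct HDFS"
    if not online:
        return "Offline load with Data Pump files"
    if partitioned:
        return "Online load with Direct Path"
    return "Online load with JDBC"


if __name__ == "__main__":
    print(choose_output_option(partitioned=False))
    print(choose_output_option(partitioned=True))
    print(choose_output_option(partitioned=True, online=False))
```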

Automate Usage of Oracle Loader for Hadoop with Oracle Data Integrator (ODI). ODI has knowledge modules to generate data transformation code to run on Hive/Hadoop and to invoke Oracle Loader for Hadoop. Use the drag-and-drop interface in ODI to include invocation of Oracle Loader for Hadoop in any ODI packaged flow.

Big Data Analytics: Real-Time Analytics Platform

R Statistical Programming Language. Open-source language and environment. Used for statistical computing and graphics. Strength in easily producing publication-quality plots. Highly extensible with open-source community R packages.

Drive Value from Big Data: Conclusions

Big Data Appliance: Big Data for the Enterprise. Optimized and complete: everything you need to store and integrate your lower-information-density data. Integrated with Oracle Exadata: analyze all your data. Easy to deploy: risk-free, quick installation and setup. Single-vendor support: full Oracle support for the entire system and software set.

Oracle Integrated Solution Stack for Big Data. ACQUIRE: Oracle NoSQL Database, HDFS, Enterprise Applications. ORGANIZE: Hadoop (MapReduce), Oracle Loader for Hadoop, Oracle Data Integrator. ANALYZE: In-Database Analytics, Data Warehouse. DECIDE: Oracle Analytic Applications.

Oracle: Big Data for the Enterprise. The most comprehensive solution: includes everything needed to acquire, organize, and analyze all your data. Optimized for extreme analytics: deepest analytics portfolio with access to all data. Engineered to work together: eliminate deployment risk and support risk. Enterprise ready: deliver extreme performance and scalability.

Questions