Hcatalog HUG Draft5

HCatalog

(and friends)

Sushanth Sowmyan

Committer, Apache HCatalog

[email protected] @khorgath

Hortonworks Inc. 2011

Let's think about data for a bit...

From Wikipedia: Data (/ˈdeɪtə/ DAY-tə, /ˈdætə/ DA-tə, or /ˈdɑːtə/ DAH-tə)

Qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e., unprocessed data, refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols.

Architecting the Future of Big Data Hortonworks Inc. 2011

So what is needed to make Data useful?

Arguably, tools to convert data into information.

Arguably also, knowledge about the data, so that the tools can then make use of the data in a meaningful sense, to extract information from it.

So what are the characteristics of a Data Warehouse?

Data is present, organized, recorded, and catalogued. Tools exist that are able to operate on the data.

So what do tools need to be able to operate on data?

Finding it

Photo credit : dkeats on flickr

Finding it

Knowing where data is.

Evolve : Knowing which data is where (naming data)

Evolve : Organization to support various data modeling concepts (tables, partitions, columns, records)

Evolve : "done" semantics (is the data completely written?) and existence semantics (is the data there at all?)
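
The progression above (naming data, organizing it into tables and partitions, then tracking "done" and existence) can be sketched as a toy in-memory catalog. This is only an illustration under stated assumptions, not HCatalog's API: the class names, the method names, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """One slice of a table, e.g. the data for a single day."""
    location: str          # where the files live (hypothetical path)
    done: bool = False     # "done" semantics: the writer has finished


@dataclass
class Table:
    columns: dict                                   # column name -> type name
    partitions: dict = field(default_factory=dict)  # partition key -> Partition


class Catalog:
    """Toy registry (hypothetical): maps names to data so tools need no paths."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, columns):
        self._tables[name] = Table(columns=dict(columns))

    def add_partition(self, table, key, location):
        self._tables[table].partitions[key] = Partition(location)

    def mark_done(self, table, key):
        self._tables[table].partitions[key].done = True

    def exists(self, table, key):
        # Existence semantics: has this partition been registered at all?
        return table in self._tables and key in self._tables[table].partitions

    def readable_locations(self, table):
        # Done semantics: consumers only see partitions that are complete.
        parts = self._tables[table].partitions
        return [p.location for _, p in sorted(parts.items()) if p.done]


catalog = Catalog()
catalog.create_table("clicks", {"user": "string", "url": "string"})
catalog.add_partition("clicks", "date=2011-11-01", "/data/clicks/2011-11-01")
catalog.add_partition("clicks", "date=2011-11-02", "/data/clicks/2011-11-02")
catalog.mark_done("clicks", "date=2011-11-01")
print(catalog.readable_locations("clicks"))
```

Only the first partition is returned, because the second one is registered (it exists) but has not been marked done yet.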

Reading it

Photo credit : kylesteed on flickr

Reading it

Each tool having its own storage space, its own private world

Evolve : Abstracting away the storage mechanism, so that tools sit on top of shared file formats and mechanisms and, suddenly, become interoperable.

Evolve : Having a storage abstraction that adapts to existing storage mechanisms in an easy-to-develop manner

Who are the various actors in a data ecosystem?

Analyst - uses SQL (Hive) and/or JDBC-based tools

Programmer - cares about data transformation, uses Pig or M/R

Project owner - cares about amount of resources used, data portability, data connectors

Ops - needs to manage data storage and the cluster, and needs to control data expiry, replication, import and export.

(stealing slide from Alan's TriHUG talk)

Also :

People who help the aforementioned people:

Tool writer - wants abstractions to deal with variances, and wants to be able to store and retrieve relevant metadata and data, so they can focus on their user

Storage subsystem writer - wants standardization so that their storage can be used by other actors.

What do they all want?

Need it Working

Correctness

Speed, Efficiency

Interoperability, Convenience

Did somebody say Interoperability?

Making Your Structured Data Available to the MapReduce Engine

[Diagram labels: MapReduce, Pig, Hive, HCatalog, HDFS, HBase, MPP Store]

Users can query data with Pig, Hive, or custom MapReduce jobs

Standard HDFS formats available Q1 2012

HBase data by early Q2 2012

HCatalog underlying architecture

[Diagram labels: HCatLoader, HCatStorer, HCatInputFormat, HCatOutputFormat, CLI, Notification, Hive MetaStore Client, Generated Thrift Client, Hive MetaStore, RDBMS]

Problem: Need to Know Where Data Is

[Diagram labels: PIG, HIVE, MapReduce, Storage]

Solution: Register Through HCatalog

[Diagram labels: PIG, HIVE, MapReduce, HCatalog, Storage]

Problem: Data in a variety of formats

Data files may be organized in different formats

Data files may contain different formats in different partitions

Storage (HDFS, HBase, etc.)

Solution: HCat provides common abstraction

HCat normalizes data to application

[Diagram labels: Hadoop Application, Registered Data w/ Schema, HCatalog, Storage]
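
A minimal sketch of what "normalizing data to the application" could look like when partitions of one table use different formats. Everything here is an assumption made for illustration: the schema, the two formats, the reader functions, and the in-memory strings standing in for files are hypothetical, and none of it is HCatalog code.

```python
import csv
import io
import json

# Hypothetical table schema: every reader must return rows in this shape.
SCHEMA = ("user", "clicks")


def read_csv(text):
    # CSV partition: positional fields, mapped onto the schema by order.
    for row in csv.reader(io.StringIO(text)):
        yield {"user": row[0], "clicks": int(row[1])}


def read_json_lines(text):
    # JSON-lines partition: named fields, one object per line.
    for line in text.splitlines():
        if line.strip():
            obj = json.loads(line)
            yield {"user": obj["user"], "clicks": int(obj["clicks"])}


# Format name -> reader, standing in for a per-format storage driver.
READERS = {"csv": read_csv, "jsonl": read_json_lines}

# Registered partitions: each records its own format next to its data.
# In-memory strings stand in for files in storage.
PARTITIONS = [
    {"key": "date=2011-11-01", "format": "csv", "data": "alice,3\nbob,5\n"},
    {"key": "date=2011-11-02", "format": "jsonl",
     "data": '{"user": "carol", "clicks": 7}\n'},
]


def read_table(partitions):
    """Yield rows in the table schema, whatever format each partition uses."""
    for part in partitions:
        reader = READERS[part["format"]]
        for row in reader(part["data"]):
            yield tuple(row[col] for col in SCHEMA)


rows = list(read_table(PARTITIONS))
print(rows)
```

The application only ever sees `(user, clicks)` tuples. The choice of reader is driven by what was registered for each partition, so a new format means a new reader and not a change to every tool.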

Getting Involved

Incubator site : http://incubator.apache.org/hcatalog

User list: [email protected]

Dev list: [email protected]

TODO

HCATALOG-8 : HCatalog needs a logo

HBase integration: trying to nail down a better table metaphor

Hive integration: interoperability between the notion of StorageDriver and StorageHandler, project dependency management

0.23 work

HCATALOG-182 : Improve the "and friends" bit.

Waitaminnit... what was that about "friends"?

Templeton

A Webservices API for Hadoop

Photo credit : PKMousie on flickr

Templeton: ISV Front-door for Hadoop

Insulation from interface changes release to release

Opens the door to languages other than Java

Thin clients through webservices vs forced fat-clients in gateway

Still prototyping! But see a common need.

Templeton Specific Support

Move data directly into/out-of HDFS through WebHDFS

Webservice calls to HCatalog

Register table relationships for data (e.g., createTable, createDatabase)

Adjust tables (e.g., AlterTable)

Look at statistics (e.g., ShowTable)

Webservice calls to start work

MapReduce, Pig, Hive

Poll for job status

Notification URL when job completes (optional)

Stateless Server

Horizontally scale for load

Configurable for HA

Currently requires ZooKeeper to track job status info
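
The "poll for job status" pattern, seen from a thin client, might look roughly like the sketch below. It makes no claim about Templeton's actual endpoints or responses, which were still being prototyped: `get_status`, the state names, and the job id are hypothetical, and a fake status function stands in for the HTTP call.

```python
import time


def wait_for_job(get_status, job_id, poll_seconds=1.0, max_polls=60,
                 sleep=time.sleep):
    """Poll until the job reaches a terminal state or the polls run out.

    get_status is a hypothetical callable (job_id -> status string); in a
    real client it would wrap an HTTP request to the webservice.
    """
    terminal = {"SUCCEEDED", "FAILED", "KILLED"}   # assumed state names
    for _ in range(max_polls):
        status = get_status(job_id)
        if status in terminal:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")


# Fake server for illustration: reports RUNNING twice, then SUCCEEDED.
_responses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])


def fake_status(job_id):
    return next(_responses)


# sleep is replaced with a no-op so the example finishes immediately.
result = wait_for_job(fake_status, "job_0001", sleep=lambda s: None)
print(result)
```

The client keeps only a job id between calls, so any server instance can answer a poll, which fits the stateless, horizontally scaled server described above. The optional notification URL would let a client skip the polling loop altogether.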

ANY QUESTIONS ?
