Apache Kite

MAKING LIFE EASIER IN HADOOP

Why Kite?

1. Codify expert patterns and practices for building data-oriented systems and applications.

2. Let developers focus on business logic, not plumbing or infrastructure.

3. Provide smart defaults for platform choices.

4. Support piecemeal adoption via loosely-coupled modules.

Kite Data Module

Provides APIs to interact with data in Hadoop.

The data module contains APIs and utilities for defining and performing actions on datasets. It covers:

entities

schemas

datasets

dataset repositories

loading data

dataset writers

viewing data

Entities, Schemas

Entity:

A single record in a dataset. An entity is a plain Java object, analogous to a row in a relational database table.

Schemas:

A schema specifies the field names and data types for a dataset. Kite relies on Apache Avro for schema definitions. A schema can be obtained in two ways:

◦ Using the Java API (the Avro schema is inferred from a Java class or Avro data)

◦ Using the command line (the Avro schema is inferred from a Java class or CSV data)

Datasets

Datasets:

A collection of zero or more entities, analogous to an RDBMS table.

The HDFS implementation stores a dataset as Snappy-compressed Avro data files by default. A dataset can also be stored in the column-oriented Parquet format.

◦ A partition strategy can improve the performance of a dataset

◦ It is based on one or more fields in the entity

◦ Partitioning can be done using Hash, Identity or Date (year, month, day, hour) strategies

◦ It provides coarse-grained organization

◦ The partition strategy is configured in a JSON-based format

◦ A partition strategy can be applied only when the dataset is created and cannot be altered later.

◦ We can work with a subset of dataset entities using the Views API.

◦ Datasets are identified using URIs.
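The three partitioning ideas listed above (Hash, Identity, Date) can be sketched in plain Java. This is an illustration of the concepts only, not Kite's API: the class name, the method names and the "year=/month=" output layout are assumptions made for this example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Illustrative sketch only; names and output layout are hypothetical, not Kite's API. */
class PartitionSketch {

    /** Hash strategy: map a field value onto one of a fixed number of buckets. */
    static int hashPartition(Object value, int buckets) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(value.hashCode(), buckets);
    }

    /** Identity strategy: the field value itself is used as the partition value. */
    static String identityPartition(Object value) {
        return String.valueOf(value);
    }

    /** Date strategy: derive year, month, day and hour (UTC) from an epoch-millisecond timestamp. */
    static String datePartition(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d/hour=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth(), t.getHour());
    }

    public static void main(String[] args) {
        System.out.println(hashPartition(42, 16));   // Integer 42 hashes to 42; 42 mod 16 = 10
        System.out.println(identityPartition("US"));
        System.out.println(datePartition(0L));       // the epoch: 1970-01-01 00:00 UTC
    }
}
```

Entities that share a partition value end up grouped together, which is what lets a query skip whole partitions instead of scanning every record.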

Datasets

◦ Dataset URIs:

Depending on the dataset scheme, a dataset URI follows one of these patterns:

Hive: dataset:hive:<namespace>/<dataset>

HDFS: dataset:hdfs:/<path>/<namespace>/<dataset-name>

Local FS: dataset:file:/<path>/<namespace>/<dataset-name>

HBase: dataset:hbase:<zookeeper>/<dataset-name>

◦ View URIs:

A view URI is constructed by changing the prefix of a dataset URI from ‘dataset:’ to ‘view:’. Query arguments can be added as name/value pairs, similar to the query arguments in an HTTP URL.
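The dataset-to-view URI rule described above is simple string manipulation, so it can be sketched without the Kite library. The helper below is hypothetical (it is not part of Kite), and the namespace "default", dataset "movie" and constraint "id=1" are made-up values for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; this helper is hypothetical and not part of Kite. */
class ViewUriSketch {

    /** Swaps the "dataset:" prefix for "view:" and appends name/value query arguments. */
    static String toViewUri(String datasetUri, Map<String, String> constraints) {
        String prefix = "dataset:";
        if (!datasetUri.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a dataset URI: " + datasetUri);
        }
        StringBuilder sb = new StringBuilder("view:").append(datasetUri.substring(prefix.length()));
        char separator = '?';
        for (Map.Entry<String, String> e : constraints.entrySet()) {
            sb.append(separator).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available on the JVM
        }
    }

    public static void main(String[] args) {
        Map<String, String> constraints = new LinkedHashMap<>();
        constraints.put("id", "1");
        // prints view:hive:default/movie?id=1
        System.out.println(toViewUri("dataset:hive:default/movie", constraints));
    }
}
```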

Dataset Repositories, Loading, Dataset Writers, Viewing Data

Dataset Repositories:

• The physical storage location for datasets. It is equivalent to a database in the RDBMS model.

• Required for logical grouping, security, access controls, backup policies, etc.

• Each dataset belongs to exactly one dataset repository.

• Kite does not provide functionality for copying or moving a dataset from one dataset repository to another. (However, this can be done with MapReduce.)

Loading:

• We can load comma-separated values (CSV) into a dataset repository using the CLI.

Dataset Writers:

• Used to add entities to datasets.

Viewing Data:

• We can query the data using Hive or Impala.

• We can also use the CLI.

Kite Dataset Lifecycle

Generate Schema

• A Kite dataset is defined using an Avro schema.

• It can be written manually or generated from a Java object or a CSV data file.

• CLI commands for:

• Schema generation from a Java class

kite-dataset obj-schema org.kitesdk.cli.example.Movie -o movie.avsc

• Schema generation from CSV file

kite-dataset csv-schema movie.csv --class Movie -o movie.avsc
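Generating a schema from CSV data means guessing a type for each column from sample values. The sketch below shows one simple way to do that; it is an assumption about the general approach, not Kite's actual inference logic, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; Kite's real CSV type inference may differ. */
class CsvTypeSketch {

    /** Returns "long" if every sample parses as a long, "double" if every sample parses as a double, else "string". */
    static String inferType(List<String> samples) {
        if (samples.isEmpty()) {
            return "string"; // nothing to go on, so fall back to the most general type
        }
        boolean allLong = true;
        boolean allDouble = true;
        for (String s : samples) {
            if (!isLong(s)) allLong = false;
            if (!isDouble(s)) allDouble = false;
        }
        if (allLong) return "long";
        if (allDouble) return "double";
        return "string";
    }

    private static boolean isLong(String s) {
        try {
            Long.parseLong(s.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    private static boolean isDouble(String s) {
        try {
            Double.parseDouble(s.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inferType(Arrays.asList("1", "2", "3")));     // long
        System.out.println(inferType(Arrays.asList("1", "2.5")));        // double
        System.out.println(inferType(Arrays.asList("1", "Toy Story")));  // string
    }
}
```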

Example – Schema Generation

package org.kitesdk.examples.data;

/** Movie class */
class Movie {

    private int id;
    private String title;
    private String releaseDate;

    . . .

    public Movie() {
        // Empty constructor for serialization purposes
    }
}

{"type":"record","name":"Movie","namespace":"org.kitesdk.examples.data","fields":[

{"name":"id","type":"int"},{"name":"title","type":"string"},{"name":"releaseDate","type":"string"}

]}
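To make the class-to-schema step less opaque, here is a minimal sketch of how a record schema could be derived from a Java class with reflection. It is not how Kite or Avro implement it: the class and method names are hypothetical, only a few field types are handled, and the namespace is passed in explicitly rather than read from the package.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; Avro's own reflection support handles far more cases. */
class ReflectSchemaSketch {

    /** Stand-in for the Movie entity from the example above. */
    static class Movie {
        private int id;
        private String title;
        private String releaseDate;
    }

    /** Maps a small set of Java types to Avro primitive type names. */
    static String avroType(Class<?> t) {
        if (t == int.class) return "int";
        if (t == long.class) return "long";
        if (t == double.class) return "double";
        if (t == boolean.class) return "boolean";
        if (t == String.class) return "string";
        throw new IllegalArgumentException("Type not handled in this sketch: " + t);
    }

    /** Builds an Avro-style record schema as a JSON string from the class's instance fields. */
    static String schemaFor(Class<?> type, String namespace) {
        List<String> fields = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            // Skip compiler-generated and static fields; only instance state belongs in the record
            if (f.isSynthetic() || Modifier.isStatic(f.getModifiers())) continue;
            fields.add("{\"name\":\"" + f.getName() + "\",\"type\":\"" + avroType(f.getType()) + "\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + type.getSimpleName()
                + "\",\"namespace\":\"" + namespace
                + "\",\"fields\":[" + String.join(",", fields) + "]}";
    }

    public static void main(String[] args) {
        System.out.println(schemaFor(Movie.class, "org.kitesdk.examples.data"));
    }
}
```

Field order in the output follows whatever order reflection returns, which the Java specification does not guarantee, so a real tool would not rely on it.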

Create Dataset

A dataset is created using the Avro schema.

kite-dataset create movie --schema movie.avsc

Partition Strategy:

• Logical partitions for improving performance

• Specified using a JSON file

Example: movie.json

[ {
  "source" : "id",
  "type" : "identity",
  "name" : "id"
} ]

kite-dataset create movie --schema movie.avsc --partition-by movie.json

Create Dataset

Column Mapping:

• Specifies how data should be stored in HBase for maximum performance

• Specified in a JSON file

• Each definition is a JSON object with the following fields:

• SOURCE – the field in the entity

• TYPE – where the field data is stored (cells in HBase)

• FAMILY – the column family in the HBase table

• QUALIFIER – the column name in the HBase table

Example

{"source" : "timestamp", "type" : "column", "family" : "m", "qualifier" : "ts"}

• There are five mapping types:

1. Column

2. Counter

3. keyAsColumn

4. Key

5. Version
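The mapping definition shown in the example can be modelled as a small data structure. The sketch below is hypothetical and not part of Kite; it assumes the JSON type names are the forms given in the list above with a lowercase first letter, as in the "column" example, and that family and qualifier may be omitted for types that do not need them.

```java
/** Illustrative sketch only; these types are hypothetical, not Kite classes. */
class ColumnMappingSketch {

    /** The five mapping types, each with its assumed JSON spelling. */
    enum MappingType {
        COLUMN("column"), COUNTER("counter"), KEY_AS_COLUMN("keyAsColumn"), KEY("key"), VERSION("version");

        final String jsonName;

        MappingType(String jsonName) {
            this.jsonName = jsonName;
        }
    }

    /** Renders one mapping definition; family and qualifier are skipped when null. */
    static String toJson(String source, MappingType type, String family, String qualifier) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"source\" : \"").append(source).append('"');
        sb.append(", \"type\" : \"").append(type.jsonName).append('"');
        if (family != null) {
            sb.append(", \"family\" : \"").append(family).append('"');
        }
        if (qualifier != null) {
            sb.append(", \"qualifier\" : \"").append(qualifier).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Reproduces the mapping from the example above
        System.out.println(toJson("timestamp", MappingType.COLUMN, "m", "ts"));
        // A key mapping needs no column family or qualifier in this sketch
        System.out.println(toJson("id", MappingType.KEY, null, null));
    }
}
```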

Populate-Validate-Update-Annihilate Dataset

Populate Dataset:

There are various ways to populate a dataset:

• Importing from CSV files

• Copying from another dataset

• Using Flume ingestion, etc.

Validate Dataset:

The CLI ‘show’ command can be used to validate the loaded data.

Update Dataset:

Kite supports schema evolution, following Avro's schema evolution rules.
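Schema evolution in the Avro style means records written with an old schema can still be read with a newer one: fields the reader adds are filled from defaults, and fields the reader no longer declares are dropped. The sketch below imitates that rule with plain maps. It is a heavily simplified illustration, not Avro's resolution algorithm, and the "rating" field is a made-up addition to the Movie example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; real Avro resolution also handles type promotion, aliases and unions. */
class SchemaEvolutionSketch {

    /**
     * Projects a written record onto the reader's field list.
     * Missing fields are filled from defaults; fields unknown to the reader are dropped.
     */
    static Map<String, Object> resolve(Map<String, Object> written,
                                       List<String> readerFields,
                                       Map<String, Object> defaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : readerFields) {
            if (written.containsKey(field)) {
                out.put(field, written.get(field));
            } else if (defaults.containsKey(field)) {
                out.put(field, defaults.get(field));
            } else {
                // A new field without a default cannot be read from old data
                throw new IllegalStateException("No value or default for field: " + field);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> oldRecord = new LinkedHashMap<>();
        oldRecord.put("id", 1);
        oldRecord.put("title", "Toy Story");

        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("rating", 0); // hypothetical field added in the newer schema

        System.out.println(resolve(oldRecord, Arrays.asList("id", "title", "rating"), defaults));
    }
}
```

This is why adding a field is safe only when the new field carries a default value.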

Annihilate Dataset:

Delete the dataset when it is no longer required.

QUERIES?
