
11/23/2014 Elastic Search integration with Hadoop | leveragebigdata

http://leveragebigdata.wordpress.com/2014/06/28/elasticsearch-integration-with-hadoop/ 1/9


Elastic Search integration with Hadoop

28 Saturday Jun 2014

POSTED BY LEVERAGEBIGDATA IN UNCATEGORIZED

Tags: Elastic Search, Hadoop, Hive, MapReduce

Elasticsearch is an open source distributed search engine based on the Lucene framework, exposed through a REST API. You can download Elasticsearch from http://www.elasticsearch.org/overview/elkdownloads/. Unzip the downloaded zip or tar file, then start a single node by running the script ‘elasticsearch-1.2.1/bin/elasticsearch’.

Installing a plugin:

Plugins add extra features; for example, elasticsearch-head provides a web interface for interacting with the cluster. Install it with the command ‘elasticsearch-1.2.1/bin/plugin -install mobz/elasticsearch-head’.


The elasticsearch-head web interface can then be reached at the URL http://localhost:9200/_plugin/head/.

Creating the index:

(You can skip this step.) In the search domain, an index is analogous to a relational database. By default an index is created with 5 shards and a replication factor of 1; both can be set at creation time depending on your requirements. The replication factor can be increased later, but the number of shards cannot be changed after creation.
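Why the shard count is fixed follows from how documents are routed: a document lands on shard hash(routing key) % number_of_shards. The sketch below illustrates the arithmetic, with Java's plain String.hashCode() standing in for Elasticsearch's real routing hash (Murmur3), so the concrete numbers are illustrative only:

```java
public class ShardRoutingSketch {
    // Illustration only: Elasticsearch routes a document to
    // shard = hash(routing key) % number_of_shards. Plain
    // String.hashCode() here stands in for the real routing hash.
    static int shardFor(String id, int numberOfShards) {
        return Math.floorMod(id.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        // With 2 shards the document with id "3" lands on one shard...
        System.out.println(shardFor("3", 2));
        // ...but with 3 shards it can land on a different one, so
        // documents indexed under the old layout would no longer be found.
        System.out.println(shardFor("3", 3));
    }
}
```

Replicas, by contrast, are plain copies of existing shards, which is why they can be added or removed freely after creation.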

curl -XPUT "http://localhost:9200/movies/" -d '{"settings" : {"number_of_shards" : 2, "number_of_replicas" : 1}}'

Loading data to Elastic Search:


Putting a document into the search domain automatically creates the index if it does not already exist.

Load data using -XPUT. We need to specify the id (here, 1) ourselves:

curl -XPUT "http://localhost:9200/movies/movie/1" -d '{"title": "Men with Wings", "director": "William A. Wellman", "year": 1938, "genres": ["Adventure", "Drama"]}'

Note: movies->index, movie->index type, 1->id

Load data using -XPOST. The id will be generated automatically:

curl -XPOST "http://localhost:9200/movies/movie" -d '{ "title": "Lawrence of Arabia", "director": "David Lean", "year": 1962, "genres": ["Adventure", "Biography", "Drama"] }'

Note: the _id U2oQjN5LRQCW8PWBF9vipA is automatically generated.

The _search endpoint

The indexed documents can be searched using the query below:

curl -XPOST "http://localhost:9200/_search" -d '{ "query": { "query_string": { "query": "men", "fields": ["title"] } } }'
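To see why this query returns "Men with Wings", here is a rough sketch of what a query_string match against the title field amounts to. This is a simplified stand-in (lowercase and whitespace split); Elasticsearch's real analysis chain is configurable and much richer:

```java
import java.util.Arrays;

public class TitleMatchSketch {
    // Approximates the matching idea: analyze the title into lowercase
    // tokens and look for the search term among them.
    static boolean matches(String title, String term) {
        return Arrays.stream(title.toLowerCase().split("\\s+"))
                     .anyMatch(t -> t.equals(term.toLowerCase()));
    }

    public static void main(String[] args) {
        System.out.println(matches("Men with Wings", "men"));      // matches
        System.out.println(matches("Lawrence of Arabia", "men"));  // does not
    }
}
```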


Integrating with MapReduce (Hadoop 1.2.1)

To integrate Elastic Search with MapReduce, follow the steps below:

Add a dependency to pom.xml:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>2.0.0</version>
</dependency>

or download the elasticsearch-hadoop jar file and add it to the classpath.

Elastic Search as source & HDFS as sink:
In the MapReduce job, you specify the index/index type of the search engine from which to fetch data into the HDFS file system, and set the input format to ‘EsInputFormat’ (this format type is defined in the elasticsearch-hadoop jar). In org.apache.hadoop.conf.Configuration, set the elastic search index type with the field ‘es.resource’, optionally a search query with the field ‘es.query’, and set the InputFormatClass to ‘EsInputFormat’ as shown below:

ElasticSourceHadoopSinkJob.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.elasticsearch.hadoop.mr.EsInputFormat;

public class ElasticSourceHadoopSinkJob {

    public static void main(String arg[]) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("es.resource", "movies/movie");
        //conf.set("es.query", "?q=kill");
        final Job job = new Job(conf,

                "Get information from elasticSearch.");
        job.setJarByClass(ElasticSourceHadoopSinkJob.class);
        job.setMapperClass(ElasticSourceHadoopSinkMapper.class);
        job.setInputFormatClass(EsInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(MapWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(arg[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

ElasticSourceHadoopSinkMapper.java

import java.io.IOException;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ElasticSourceHadoopSinkMapper extends Mapper<Object, MapWritable, Text, MapWritable> {

    @Override
    protected void map(Object key, MapWritable value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(key.toString()), value);
    }
}

HDFS as source & Elastic Search as sink:
In the MapReduce job, specify the index/index type of the search engine into which to load data from the HDFS file system, and set the output format to ‘EsOutputFormat’ (this format type is defined in the elasticsearch-hadoop jar).

ElasticSinkHadoopSourceJob.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class ElasticSinkHadoopSourceJob {

    public static void main(String str[]) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("es.resource", "movies/movie");
        final Job job = new Job(conf, "Get information from elasticSearch.");
        job.setJarByClass(ElasticSinkHadoopSourceJob.class);
        job.setMapperClass(ElasticSinkHadoopSourceMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(EsOutputFormat.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(MapWritable.class);
        FileInputFormat.setInputPaths(job, new Path("data/ElasticSearchData"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

ElasticSinkHadoopSourceMapper.java

import java.io.IOException;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ElasticSinkHadoopSourceMapper extends Mapper<LongWritable, Text, NullWritable, MapWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] splitValue = value.toString().split(",");
        MapWritable doc = new MapWritable();
        doc.put(new Text("year"), new IntWritable(Integer.parseInt(splitValue[0])));
        doc.put(new Text("title"), new Text(splitValue[1]));
        doc.put(new Text("director"), new Text(splitValue[2]));

        String genres = splitValue[3];
        if (genres != null) {
            String[] splitGenres = genres.split("\\$");
            ArrayWritable genresList = new ArrayWritable(splitGenres);
            doc.put(new Text("genres"), genresList);
        }
        context.write(NullWritable.get(), doc);
    }
}

Integrate with Hive:

Download the elasticsearch-hadoop jar file and include it on the path using hive.aux.jars.path as shown below:

bin/hive --hiveconf hive.aux.jars.path=<path-of-jar>/elasticsearch-hadoop-2.0.0.jar

or add elasticsearch-hadoop-2.0.0.jar to <hive-home>/lib and <hadoop-home>/lib.

Elastic Search as source & Hive as sink:
Now, create an external table to load data from Elastic search as shown below. You need to specify the elastic search index type using ‘es.resource’, and can optionally specify a query using ‘es.query’.

CREATE EXTERNAL TABLE movie (id BIGINT, title STRING, director STRING, year BIGINT, genres ARRAY<STRING>) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'movies/movie');

Elastic Search as sink & Hive as source:
Create an internal table in hive, such as ‘movie_internal’, and load data into it. Then load data from the internal table into elastic search as shown below:

Create internal table:

CREATE TABLE movie_internal (title STRING, id BIGINT, director STRING, year BIGINT, genres ARRAY<STRING>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '$' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
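Per the internal table's DDL, rows are comma-delimited and collection items within the genres field are separated by '$'. How one such row decomposes can be checked without a cluster; genresOf below is a hypothetical helper that just mirrors those delimiters:

```java
import java.util.Arrays;
import java.util.List;

public class DelimiterSketch {
    // Mirrors the internal table's row format: fields split on ','
    // and the genres collection split on '$' (regex-escaped as "\\$").
    static List<String> genresOf(String row) {
        String[] fields = row.split(",");
        return Arrays.asList(fields[4].split("\\$"));
    }

    public static void main(String[] args) {
        String row = "Title1,1,dire1,2003,Action$Crime$Thriller";
        System.out.println(genresOf(row));  // the ARRAY<STRING> column
    }
}
```

Note that '$' must be regex-escaped in Java's split(), just as the MapReduce mapper does with "\\$".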

Load data to internal table:

hiveElastic.txt:

Title1,1,dire1,2003,Action$Crime$Thriller
Title2,2,dire2,2007,Biography$Crime$Drama

LOAD DATA LOCAL INPATH '<path>/hiveElastic.txt' OVERWRITE INTO TABLE movie_internal;

Load data from the hive internal table to ElasticSearch:

INSERT OVERWRITE TABLE movie SELECT NULL, m.title, m.director, m.year, m.genres FROM movie_internal m;

Verify the inserted data with an Elastic Search query.

References:

1. ElasticSearch
2. Apache Hadoop
3. Apache Hbase
4. Apache Spark
5. JBKSoft Technologies
