
How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud


DESCRIPTION

Presented by Seshu Simhadri, Global Computer Enterprises (GCE). See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

A leader in bringing innovative technologies to the Federal Government, GCE looks to open source tools to drive down cost and provide the foundation for building value-added services for its customers. This talk will discuss GCE’s innovative use of Lucene/Solr combined with the GCE Big Data Cloud to open up access to Federal spending data. This data is in wide use across the Federal government, the Federal contracting community, media and press, and Capitol Hill. GCE has used this toolset to deliver the kind of capability that users typically find only in consumer web applications. This session will highlight the technical side of the challenge of implementing these tools across a large user community and data set in a Cloud environment.


Page 1: How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud

Seshubabu Simhadri Chief Technology Officer, GCE

Lucene in the Cloud: Leveraging the Power of Search and Big Data to Shed Light on Government Spending

Confidential, Do Not Disclose. Property of Global Computer Enterprises, Inc.

Page 2:

Overview

Background

What is USASpending.gov?

Moving to Our Big Data cloud

Some of the design decisions
• Tool Selection
• Cluster Design
• Hardware Design

Limitations and enhancements

Page 3:

What is USASpending.gov?

Page 4:

U.S. Government Spending vs. Other Entities

Page 5:

Distribution of U.S. Government Spending

Page 6:

What can users do on the site?

• Analytics
  • Stats
  • Top-K

• Free Text Search (with auto-suggestions)

• Large Data Feeds

• APIs

Page 7:

Who are the users of the site?

• Public

• Media

• Congress

• Value Added Resellers

Page 8:

GCE Big Data and Analytics Cloud

Leveraging the industry-leading open source platform to deliver cost savings and scalability within a Cloud computing model

Page 9:

What’s Inside the GCE Cloud?

Start by Looking at the Usual Suspects

• Hadoop
  − For indexing and downloads

• Distributed Solr
  − Analytics
  − Free text search

• Drupal static content

• Visualization

Page 10:

Solr Node Sizing

The greatest challenge is how to optimally design a node: which combination of CPUs, memory, and shard size delivers the desired performance?

Page 11:

Solr Node Sizing

Multiple index types
• Different types of spending
• Varying sizes

Break the complete dataset into shards as small as required to meet the response times
• Choose shard size based on response times
• Single Solr instance with multiple cores, or multiple Solr instances each with a single core?
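
The sizing rule on this slide (pick the shard size from measured response times) can be sketched as follows. This is a minimal illustration, not GCE's actual procedure: the function names, the benchmark figures, and the 300 ms target are all assumptions made up for the example.

```python
import math

def choose_shard_size(measurements, target_ms):
    """Return the largest tested shard size whose latency meets the target.

    measurements: dict mapping shard size (documents) -> observed latency (ms).
    Returns None if no tested size meets the target.
    """
    acceptable = [size for size, latency in measurements.items()
                  if latency <= target_ms]
    return max(acceptable) if acceptable else None

def shard_count(total_docs, shard_size):
    """Number of shards needed to hold total_docs at the given shard size."""
    return math.ceil(total_docs / shard_size)

# Hypothetical benchmark numbers, for illustration only.
measured = {1_000_000: 120, 2_000_000: 260, 5_000_000: 700, 10_000_000: 1500}

size = choose_shard_size(measured, target_ms=300)
if size is not None:
    print("shard size:", size, "shards needed:", shard_count(25_000_000, size))
```

Choosing the largest size that still meets the target keeps the shard count, and so the number of Solr instances to manage, as low as the response-time requirement allows.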

Page 12:

Solr Cluster Design

How do you design the cluster: which ones are individual nodes and which ones are aggregators?

Page 13:

Solr Cluster Design

Should all shards be treated equal?

User → Aggregator Nodes → Shards

Different requirements for nodes collecting the data and nodes serving a specific dataset

Aggregator Nodes 1, 2, 3, …, m
• Large Solr instances, no local index

Shard Nodes 1, 2, 3, …, 100, …, n
• Small Solr instance with index
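
A rough sketch of the User → Aggregator → Shards flow, assuming Solr's standard distributed search, where a request lists the shard endpoints in the `shards` parameter. The host names, the core name `spending`, and both helper functions are hypothetical; the merge step only illustrates how per-shard results could be combined into a global top-k and is not taken from the talk.

```python
import heapq
from urllib.parse import urlencode

def build_distributed_query(aggregator, shard_nodes, query, rows=10):
    """Build a select URL that asks an aggregator to fan out to the shards.

    aggregator and shard_nodes are "host:port/solr/core" strings
    (hypothetical values in the example below).
    """
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        # Solr distributed search: comma-separated list of shard endpoints.
        "shards": ",".join(shard_nodes),
    }
    return f"http://{aggregator}/select?{urlencode(params)}"

def merge_top_k(per_shard_hits, k):
    """Merge per-shard lists of (score, doc_id) into a global top-k by score."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

shards = [
    "shard1.example.com:8983/solr/spending",
    "shard2.example.com:8983/solr/spending",
]
print(build_distributed_query("agg1.example.com:8983/solr/spending",
                              shards, "agency:defense", rows=5))
```

Because the aggregator holds no local index, its sizing is driven by the memory and CPU needed to merge shard responses rather than by index size, which matches the split between large aggregator instances and small shard instances on the slide.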

Page 14:

What configuration did we choose?

• Separate Solr instances

• Multiple hard drives per server

• Solid state disks

• Infiniband

Page 15:

Solr Enhancements

Enhanced Faceting: enabling aggregation by more than one field

Will be contributed to the Solr project
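
To make "aggregation by more than one field" concrete, here is a small plain-Python sketch that totals a numeric value per combination of field values. It does not show GCE's Solr patch or any Solr API; the field names (`agency`, `state`, `amount`) and the sample records are invented for illustration.

```python
from collections import defaultdict

def aggregate_by_fields(records, fields, value_field):
    """Sum value_field for every distinct combination of the given fields."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[name] for name in fields)
        totals[key] += record[value_field]
    return dict(totals)

# Hypothetical spending records, for illustration only.
records = [
    {"agency": "DoD", "state": "VA", "amount": 100.0},
    {"agency": "DoD", "state": "VA", "amount": 50.0},
    {"agency": "DoD", "state": "MD", "amount": 25.0},
    {"agency": "HHS", "state": "MD", "amount": 10.0},
]

for key, total in sorted(aggregate_by_fields(records, ["agency", "state"], "amount").items()):
    print(key, total)
```

Single-field faceting would return one bucket per agency or per state; keying on the tuple yields one bucket per (agency, state) pair, which is the kind of result the enhancement describes.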

Page 16:

Solr Data Importer: Why Not?

When the number of shards increases, managing the SQL queries inside Solr becomes a challenge

External Data Importer Using Hadoop
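
One way an external importer can avoid per-shard SQL is to route each record to a shard by a stable hash of its key and build one batch per shard. The sketch below shows only that partitioning step in plain Python; it does not use the Hadoop API, and the `id` field, the MD5-based routing, and the function names are assumptions rather than details from the talk.

```python
import hashlib
from collections import defaultdict

def shard_for(doc_id, num_shards):
    """Map a document id to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per process.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def partition(records, num_shards, id_field="id"):
    """Group records into per-shard batches (the 'map' side of an import job)."""
    batches = defaultdict(list)
    for record in records:
        batches[shard_for(record[id_field], num_shards)].append(record)
    return dict(batches)

# Hypothetical records, for illustration only.
sample = [{"id": f"award-{n}", "amount": n * 10.0} for n in range(20)]
for shard, batch in sorted(partition(sample, num_shards=4).items()):
    print("shard", shard, "gets", len(batch), "records")
```

With routing kept outside Solr, adding shards changes only `num_shards` in the import job rather than a set of SQL definitions maintained inside every Solr instance.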

Page 17:

Utilizing Large Commodity Servers

Solr in the Cloud required building a cost-effective and high-performance infrastructure

Small vs. large commodity servers

Page 18:

Disadvantages of higher capacity servers

Failure of one node results in failure of multiple shards; careful design is required
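
The slide does not say what the "careful design" was. One common approach, shown here purely as an assumption, is to keep copies of each shard on different physical servers so that losing one large server does not take any shard fully offline. The server names, the replica count, and both functions are hypothetical.

```python
def place_replicas(num_shards, servers, replicas=2):
    """Assign each shard to `replicas` distinct servers, round-robin.

    Consecutive offsets modulo len(servers) are distinct as long as
    replicas <= len(servers), so no shard has two copies on one server.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        shard: [servers[(shard + r) % len(servers)] for r in range(replicas)]
        for shard in range(num_shards)
    }

def shards_lost(placement, failed_server):
    """Shards left with no surviving copy if failed_server goes down."""
    return [shard for shard, hosts in placement.items()
            if all(host == failed_server for host in hosts)]

servers = ["server1", "server2", "server3", "server4"]  # hypothetical hosts
single = place_replicas(8, servers, replicas=1)
double = place_replicas(8, servers, replicas=2)
print("one copy, server1 fails:", shards_lost(single, "server1"))
print("two copies, server1 fails:", shards_lost(double, "server1"))
```

With a single copy per shard, every shard hosted on the failed server becomes unavailable at once, which is the disadvantage the slide names; a second copy on another server trades extra storage for availability.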

Page 19:

Summary

Sharded architecture

Multiple Solr instances per server, each handling small datasets

Aggregator nodes + shards

Hadoop for data indexing and data feeds

Large Commodity Servers
• 48-core
• 256GB RAM
• SSD
• Infiniband

Page 20:

Come build the future of Big Data

GCECloud.com

We’re hiring!

Page 21:

Questions? ssimhadri at GCECloud.com

Visit us at www.GCECloud.com