
Technical white paper

Location Intelligence for Big Data
Maximizing the distributed nature of big-data clusters to achieve breakthrough performance



Location Intelligence for Big Data A Pitney Bowes technical white paper


Add agility to big data analysis.

Companies struggle to generate positive returns on big data implementations. Many find it difficult to generate actionable insight from their big data assets.

Geospatial processing changes the dynamics. Location Intelligence for Big Data makes vast quantities of data consumable using GeoEnrichment and location analytics. Run spatial operations within a native environment. Then visualize relationships in a spatial context to improve analysis and decision making.

Pitney Bowes offers a unique approach, embedding location technology within big data solutions. When you embed rather than connect, you can interpret transactional data faster and resolve critical business issues with the clarity you need.

Discover the technology that delivers richer insights and a faster ROI.

• High-scalability, high-speed data processing

• GeoEnrichment

• Cluster-level data partitioning

• Node-level data processing



The big data challenge

Big data technologies increasingly allow companies to store and process incredibly large datasets of customer calls, financial transactions and social media feeds. Yet, many companies struggle to generate meaningful, actionable insights. Key performance indicators remain elusive as data volume and velocity continue to grow.

The challenge is to connect data within and across datasets in a way that:

• Ensures accuracy and precision

• Enables enrichment

• Keeps pace with the extraordinary speed and scale required

The Location Intelligence advantage

Location Intelligence brings critical perspective to data analysis. Big data typically comes with a locational component. This could be a customer address, a mobile phone GPS signal, the location of an ATM, a store transaction or a social media check-in.

Through GeoEnrichment, a process for appending location-based ancillary data, organizations can augment records with latitude/longitude coordinates. Then, that coordinate data can be used to integrate additional context into every record.
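The two lookups described above, address to coordinates and then coordinates to context, can be sketched as follows. This is a toy illustration: the class name, lookup tables and tract id are made-up stand-ins, not the Pitney Bowes API or real reference data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy sketch of the GeoEnrichment idea: geocode a record's address to a
// latitude/longitude pair, then use that coordinate to append context
// (here, a census-tract id). The lookup tables are illustrative stand-ins
// for the geocoder and ancillary datasets described in the text.
public final class GeoEnrichSketch {
    static final Map<String, double[]> GEOCODER = new HashMap<>();
    static final Map<String, String> TRACTS = new HashMap<>();
    static {
        GEOCODER.put("3001 Summer St, Stamford CT", new double[] {41.0448, -73.5594});
        TRACTS.put("41.04,-73.56", "CT-001-021500"); // coarse cell -> tract id (toy data)
    }

    /** Returns {address, "lat,lon", tractId}; nulls where no match was found. */
    public static String[] enrich(String address) {
        double[] ll = GEOCODER.get(address);
        if (ll == null) return new String[] {address, null, null};
        // Round the coordinate to a coarse cell and look up contextual data.
        String cell = String.format(Locale.ROOT, "%.2f,%.2f", ll[0], ll[1]);
        return new String[] {address, ll[0] + "," + ll[1], TRACTS.get(cell)};
    }
}
```

Once every record carries a coordinate and appended context, the rules-based workflows and spatial aggregations below can operate on plain fields rather than raw geometry.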

With this enriched embedded insight:

• Rules-based workflows can utilize this appended data to automate business decisions.

• Spatial aggregation can condense data volumes, making them more manageable.

• Data can be generalized in a spatial context for results that are easier to visualize and analyze.

• Organizations can gain new perspectives into business drivers and subsequent company responses.

Valuable applications

An embedded approach enables companies to formulate business questions and solve them within a big data environment. For example:

Telecommunications companies continually process a huge number of call records. These can be condensed and presented via highly accurate, near real-time coverage maps. This type of visual analysis helps firms to improve customer service, reduce churn, market more effectively and gain market share.

Financial services firms continually process an incredibly vast number of transactions. Each can be appended with a latitude/longitude coordinate pair, and operational rules can help determine when to flag records as potentially fraudulent. This process enables financial services firms to better safeguard consumers and their privacy, while helping to reduce losses due to fraud.





The benefits of a native optimized approach

Many geospatial technology providers supply solutions that connect to big data platforms, then transfer data from these distributed platforms into their own GIS server-based technology. The disadvantage of this “connector” approach is that it doesn’t leverage the processing power of the distributed platform (Hadoop, Spark, etc.). Instead, the actual geospatial operations occur on a single server or a small server cluster, which limits your ability to process large datasets.

Pitney Bowes takes a different, big-data-ready approach. To maximize the capabilities of distributed processing environments, we enable geospatial operations to run natively within a variety of distributed platforms.

Let’s look first at the technology we provide; then at the process steps that enable it to work natively in a big data environment.

Innovative, modular technology

The Pitney Bowes Location Intelligence for Big Data solution consists of location technology software development kits (SDKs) that allow companies to GeoEnrich datasets and spatially aggregate results, condensing big data into a consumable output. It also includes APIs and data.

• Our Java-based SDKs can be transferred into any big data environment, such as Hadoop or Spark, so companies are not limited by their technology choices in a transient and evolving field.

• We offer 350+ datasets that can be used to add spatial context, serve as a container for aggregation, and be analyzed and visualized using our web-based mapping technologies.

Examples of performance achieved via the Pitney Bowes native and optimized technology strategy

These are actual numbers achieved in our clients’ use cases. We are continually improving the technology and can achieve much better performance using newer technologies like Spark.

Location Intelligence SDK

• Takes spatial primitives (points, lines and polygons) and applies a geometric function (“contains”, “combine”, “intersects”, etc.) using additional spatial and aspatial data

• Enables creation of spatial queries such as “Aggregate point data within this polygon” or “Find the nearest point to this line”

• Can be used to GeoEnrich a dataset by appending additional attributes using customer data, third-party data or any of the 350+ datasets in the Pitney Bowes Global Data Catalog

• U.S. customers may also choose to utilize the pre-enriched Pitney Bowes Master Location Database (MLD) assets for U.S. postal addresses to further improve processing speeds and location accuracy for operational workflows
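As one illustration of the geometry underneath a query like “Find the nearest point to this line”, the core building block is the distance from a point to a line segment. The sketch below is generic computational geometry over assumed flat, projected coordinates, not the SDK’s actual API.

```java
// Generic computational-geometry sketch (not the SDK's API) of the building
// block behind a "find the nearest point to this line" query: the squared
// distance from a point to a line segment, assuming flat projected coordinates.
public final class NearestToSegment {
    /** Squared distance from point (px, py) to segment (ax, ay)-(bx, by). */
    public static double distSq(double ax, double ay, double bx, double by,
                                double px, double py) {
        double dx = bx - ax, dy = by - ay;
        double len2 = dx * dx + dy * dy;
        // Project the point onto the segment, clamping to the endpoints.
        double t = len2 == 0 ? 0
                : Math.max(0, Math.min(1, ((px - ax) * dx + (py - ay) * dy) / len2));
        double cx = ax + t * dx, cy = ay + t * dy; // closest point on the segment
        return (px - cx) * (px - cx) + (py - cy) * (py - cy);
    }
}
```

A “find the nearest” join evaluates this predicate against many candidate segments, which is why the partitioning and indexing described later in this paper matter so much at scale.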

Global Geocoding API

• Geocoding turns a street address, place or point of interest into a latitude/longitude coordinate pair.

• Reverse geocoding takes a coordinate pair and returns a street address or administrative boundary.

Routing SDK

• Takes a known location (e.g. a retail store) and uses the road network to derive information such as equal drive times (isochrones) around that point, or the shortest path to that point.

• Geocoding: US Parcel Centroid geocoding of 106 million addresses in 30 minutes on a 5-node Hadoop cluster

• “Find the Nearest” spatial join: 1 billion mobile points spatially joined to 12 million points of interest in 36 minutes on a 20-node EMR cluster on AWS

• Point-in-polygon processing: 19 billion mobile call records aggregated to 950 million polygons in 30 minutes on a 56-node Hadoop cluster

Pitney Bowes Location Intelligence for Big Data capabilities



Running geocoding within Hadoop

Listed below is an example of how Pitney Bowes can run geocoding as a Hadoop MapReduce batch job from the command line. Note that the user can set up different geocoding parameters in the config.xml, such as the dictionary to use and the fields to return. Both forward geocoding and reverse geocoding are supported.

Making it more accessible

While a MapReduce batch job works well for users with a data-engineering background, it is not user-friendly for other data analysts. To make geocoding in Hadoop more accessible, we’ve developed a HIVE geocoding UDF so any user with a SQL background can use it in Hadoop. Most Pitney Bowes Location Intelligence capabilities can be deployed in Hadoop or Spark using an approach similar to the geocoding example above.

[jun@osboxes ~]$ hadoop jar Geocoding_Hadoop.jar com.pb.mr.GeocodingDriver -input /addressdatafolder -output /geocoderesult -appConfig Geocode_config.xml

HIVE> select geocode(street, city, state, zip, 'USA')
      from customersAddTable;


Putting our technology to work

The diagram below uses the Global Geocoding API to illustrate this architecture, as well as how it can be used in various big data related processes. It shows how Pitney Bowes integrates geocoding capabilities natively into Hadoop.

The key components of the solution are the Global Geocoding API (GGA) and geocoding dictionary files.

• GGA is a collection of Jar files that can be used in writing Java based MapReduce, Yarn or Spark applications.

• The geocoding dictionary data files can be pre-installed in all data nodes of the Hadoop cluster or distributed into the cluster dynamically before use.

Figure: The Global Geocoding API running natively in Hadoop. The Geocoding SDK Jar file is submitted as an MR/Yarn application via the NameNode; each Data Node holds a local copy of the Geocoding Dictionary and processes its share of the input data into output data.




Breaking down the Pitney Bowes approach

Data preparation is critical to a highly performant spatial process. This requires enhancements to both spatial data partitioning at the cluster level and spatial data processing at the node level.

Cluster-level data partitioning dictates how large datasets are divided so they can be efficiently processed on a single node.

Node-level data processing optimizes spatial indexing and processing of small pieces of the data subset on a local node to expedite join query processing.

We will explore each of these below using the following large-scale point-in-polygon use-case example.

Use case: Point-in-polygon

Objective

Join mobile log points with GPS information to store boundary polygons for the purpose of determining the store visit patterns of mobile users.

Challenge

Both data sets are too big to import to a single machine (terabytes of points; gigabytes of polygons).

Solution

A partition strategy and corresponding algorithm are needed.

Optimizing geospatial data processing in Hadoop and Spark

Geospatial data processing is a fundamental step in almost all location-related big data applications. For example, to analyze users’ mobile records with GPS locations, ancillary data is added to provide context, such as the individual’s address or a nearby point of interest. To enable these GeoEnrichment processes at scale, a set of highly efficient geospatial processes, such as point-in-polygon or “find the nearest” site searches, is needed. These processes need to be optimized for big data technologies like Hadoop or Spark in order to leverage large-scale parallel computing power.

Using point-in-polygon analysis in Hadoop as an example, there are different partitioning strategies, depending on the use case and the data to be analyzed.

In many use cases the number of polygons to be evaluated is small. When this is true, it can be sufficient to use a broadcaster for evaluation, for example, to evaluate whether point records fall within polygons representing administrative boundaries. These types of use cases may account for the majority of traditional spatial aggregation and analysis.

In the context of the Internet of Things (IoT) there are many more use cases for which the simplistic broadcaster approach breaks down. It becomes overwhelmed by the volume of polygons that would need to be broadcast to every node and held in memory. Its spatial processes become prohibitively slow. This is where a different approach becomes essential. Pitney Bowes brings agility to these big-data queries, stepping up to market needs to expedite and optimize results.



Cluster-level data partitioning

Cluster-level data partitioning consists of two main process steps:

01. Pre-partitioning
02. The matching process

01. Pre-partitioning

Pre-partitioning uses spatial attributes within the data to organize datasets in a big data file system (e.g. HDFS) prior to running the application. It allows the data to be queried or processed quickly as the application runs.

First, the nature of the data is examined to decide the best pre-partition approach.

Use case

In our use case, user mobile data logs are constantly streamed into HDFS daily. The store-boundary data is provided by data vendors like Pitney Bowes and updated quarterly. It is more efficient to pre-partition the store-boundary data than the mobile user data, updating this preparation once per quarter when the boundary data is refreshed.

There are multiple algorithms for partitioning boundary data; these range from space-oriented algorithms like Geohash Grid to data-oriented algorithms like R-tree. However, space-oriented algorithms are usually more parallel-friendly than data-oriented algorithms, so they are the preferred algorithm family at the cluster level.

Regular grid is the most commonly used algorithm in the industry. Figure 1 shows an example of how regular grid is used to partition a large polygon dataset.


Figure 1
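A regular-grid partition like the one in Figure 1 reduces to computing a cell key from each coordinate. The sketch below assumes an illustrative 1-degree cell size over longitude/latitude; the actual partitioner’s cell size and key scheme may differ.

```java
// Sketch of regular-grid partition-key assignment: each longitude/latitude
// point maps to a fixed-size grid cell whose id serves as the partition key.
// The 1-degree cell size and key layout are illustrative assumptions.
public final class RegularGrid {
    static final double CELL_DEG = 1.0; // assumed cell size in degrees

    /** Partition key for a point; valid for lon in [-180, 180), lat in [-90, 90). */
    public static long cellKey(double lon, double lat) {
        long col = (long) Math.floor((lon + 180.0) / CELL_DEG);
        long row = (long) Math.floor((lat + 90.0) / CELL_DEG);
        return row * 360L + col; // 360 columns at 1-degree resolution
    }
}
```

Records with the same cell key are routed to the same partition, so each node only needs the polygons whose bounding boxes touch its cells.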

Page 8: Location Intelligence for Big Data - Pitney Bowes€¦ · Location Intelligence for Big Data A Pitney Bowes technical white paper Page 4 The benefits of a native optimized approach

Location Intelligence for Big Data A Pitney Bowes technical white paper

Page 8

Balancing the data load

However, there is one key drawback of the grid-based method: the spatial data distribution is often highly skewed.

Use case

In a single day there may be thousands of mobile user data records generated at Grand Central in New York City and zero records generated in the Arizona desert.

The grid method is likely to create partitions with high-density data tiles which, in turn, will cause load balance issues in a Hadoop cluster-like environment.

To address this issue, Pitney Bowes has developed two algorithms:

• The bisect grid algorithm

• The adaptive tile-based algorithm

Figure 2 shows the results of a data load-balance comparison between the regular grid approach and the two new algorithms. The flatter the data distribution, the fewer the load-balancing issues, and the better the performance in the Hadoop cluster. You can see how much flatter the distribution is for the adaptive-tile algorithm. In point-in-polygon tests using large numbers of points and polygons, the adaptive tile algorithm outperforms the regular grid method by a factor of more than twenty.

Figure 2: Data load-balance comparison of the regular grid, bisect grid and adaptive tile algorithms.
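The intuition behind adaptive tiling can be sketched as a quadtree-style split: any tile holding more than a threshold of points is quartered, so dense areas get many small tiles while sparse areas stay coarse. The threshold and the simple quartering rule below are illustrative; the production bisect-grid and adaptive-tile algorithms differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative quadtree-style adaptive tiling: recursively quarter any tile
// whose point count exceeds a threshold, so dense areas (e.g. Grand Central)
// get many small tiles and empty areas stay as one large tile.
public final class AdaptiveTiles {
    public static final class Tile {
        public final double minX, minY, maxX, maxY;
        public Tile(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
    }

    /** Recursively quarter tiles until each holds at most maxPerTile points. */
    public static List<Tile> build(double[][] pts, Tile t, int maxPerTile, List<Tile> out) {
        int count = 0;
        for (double[] p : pts)
            if (p[0] >= t.minX && p[0] < t.maxX && p[1] >= t.minY && p[1] < t.maxY) count++;
        if (count <= maxPerTile || t.maxX - t.minX < 1e-6) {
            out.add(t);  // sparse enough (or minimum size reached): one partition
        } else {         // dense: split into four quadrants and recurse
            double mx = (t.minX + t.maxX) / 2, my = (t.minY + t.maxY) / 2;
            build(pts, new Tile(t.minX, t.minY, mx, my), maxPerTile, out);
            build(pts, new Tile(mx, t.minY, t.maxX, my), maxPerTile, out);
            build(pts, new Tile(t.minX, my, mx, t.maxY), maxPerTile, out);
            build(pts, new Tile(mx, my, t.maxX, t.maxY), maxPerTile, out);
        }
        return out;
    }
}
```

Because each emitted tile now holds a bounded number of points, mapping tiles to partitions flattens the load distribution in the way Figure 2 shows.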



02. The matching process

After pre-partitioning the store-boundary data, the spatial joining or query processes can be designed.

This can be done, for example, using a MapReduce application as illustrated in Figure 3:

• All pre-partitioned store boundaries are loaded first within each partition, each carrying a partition key.

• Then mobile point data are loaded and matched to a corresponding partition key.

• The matching process is similar to a simple geo-hashing process and can be accomplished quickly.

• Then, matched pairs of data are imported into the reducer for spatial joining at the local level.

In this example, the mobile point data records were not pre-processed. However, they could be. For example, if the process required repeated spatial querying or joining, an additional step could be added to pre-partition this point dataset using the partition results of the store-boundary dataset.

A spatial encoding step could also be executed. This would apply a geohashing-like algorithm to the latitude/longitude fields in each incoming point-data record during the streaming or data-import process, generate a variable-gridding-based key and append it to the record. This key could then be used in HDFS or a NoSQL database for storage indexing or partitioning, enabling fast spatial querying or joining of this point data in later uses.
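A geohashing-like key of the kind described above can be sketched by interleaving the bits of the quantized longitude and latitude (a Z-order curve), so nearby points share key prefixes. The bit depth and encoding details here are illustrative assumptions, not the production scheme.

```java
// Sketch of a geohash-like spatial key: interleave the bits of the quantized
// longitude and latitude (a Z-order curve) so that nearby points share key
// prefixes. Bit depth and encoding details are illustrative assumptions.
public final class SpatialKey {
    /** Interleaved key; valid for lon in [-180, 180), lat in [-90, 90). */
    public static long encode(double lon, double lat, int bitsPerAxis) {
        long x = (long) ((lon + 180.0) / 360.0 * (1L << bitsPerAxis)); // quantized lon
        long y = (long) ((lat + 90.0) / 180.0 * (1L << bitsPerAxis));  // quantized lat
        long key = 0;
        for (int i = bitsPerAxis - 1; i >= 0; i--) {
            key = (key << 1) | ((x >> i) & 1); // x bit, then y bit
            key = (key << 1) | ((y >> i) & 1);
        }
        return key;
    }
}
```

A prefix of this key can then act as the storage partition or index entry, so scans over a key-prefix range touch only spatially nearby data.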

Geospatial processing at the node level

Node-level geospatial processing also consists of two main steps: building the local spatial index, and applying detailed geometry operations.

Use case

After both the mobile point data records and store-boundary data are partitioned, matched, and sent to different slave nodes, the operation at the node level is very similar to single-machine geospatial processing.

Building the local spatial index

A local spatial index is built dynamically in the node’s memory when the application starts. Typically, a data-oriented spatial index algorithm like R-tree is used at this level, rather than the space-oriented index algorithms preferred at the cluster level.

Applying detailed geometry operations

The appropriate detailed geometry operations are then applied between the point data and the polygon data. Different types of point-in-polygon analysis can be accommodated, from simply identifying whether a polygon contains a particular point, to advanced analysis that also returns the distance from a point within a polygon to the polygon’s edges.
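At its simplest, the node-level operation is a “filter, then refine” containment test: a cheap bounding-box check prunes candidate polygons (the job a local R-tree does more efficiently), then an exact ray-casting test confirms containment. This is a generic sketch under assumed flat coordinates, not the SDK’s implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Generic "filter, then refine" point-in-polygon sketch: a precomputed
// bounding box rejects most candidates cheaply; survivors get an exact
// ray-casting containment test.
public final class LocalJoin {
    public static final class Polygon {
        final double[] xs, ys;
        final double minX, minY, maxX, maxY; // precomputed bounding box
        public Polygon(double[] xs, double[] ys) {
            this.xs = xs; this.ys = ys;
            double a = Double.MAX_VALUE, b = Double.MAX_VALUE;
            double c = -Double.MAX_VALUE, d = -Double.MAX_VALUE;
            for (int i = 0; i < xs.length; i++) {
                a = Math.min(a, xs[i]); b = Math.min(b, ys[i]);
                c = Math.max(c, xs[i]); d = Math.max(d, ys[i]);
            }
            minX = a; minY = b; maxX = c; maxY = d;
        }
        public boolean contains(double px, double py) {
            if (px < minX || px > maxX || py < minY || py > maxY) return false; // filter
            boolean in = false;                                                // refine
            for (int i = 0, j = xs.length - 1; i < xs.length; j = i++)
                if ((ys[i] > py) != (ys[j] > py)
                        && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i])
                    in = !in;
            return in;
        }
    }

    /** Indices of the polygons that contain the point. */
    public static List<Integer> matches(List<Polygon> polys, double px, double py) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < polys.size(); i++)
            if (polys.get(i).contains(px, py)) hits.add(i);
        return hits;
    }
}
```

In a real node, the linear scan in matches is replaced by an R-tree lookup so only polygons whose boxes overlap the point are tested at all.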

With this type of partition-based point-in-polygon analysis, users are able to join dynamic mobile logs with store boundaries within 30 minutes. They can quickly run multiple analyses daily to gain timely insights about customer-store visiting patterns.

This type of high-precision, high-speed, high-scalability spatial analysis on mobile data was previously difficult to accomplish at all. With the Pitney Bowes solution, execution is quick and insights are easy to assimilate.

Figure 3: Partition-based spatial join. Input datasets A and B are partitioned using samples, parts are assigned partition keys (Part 1: A1/B1, Part 2: A2/B2, Part 3: A3/B3), and a local join runs on each partition.




Built for today and tomorrow

Big data technology is advancing rapidly and will continue to evolve. Today we are starting to see Spark replace Hadoop as the latest big data technology, and the pace of innovation continues.

Businesses also have diverse use cases. These require different technology options such as batch in Hadoop, real-time streaming in Storm, or interactive spatial querying in NoSQL databases like HBase.

Pitney Bowes takes an agile approach to this diverse and rapidly changing environment so you can:

• Benefit from a high degree of understanding of both spatial processing and big data technology in each individual use case.

• Plug industry-leading capabilities into most big data components and platforms.

• Gain the flexibility to address myriad user requirements.

• Ensure highly efficient application of capabilities against any given spatial use case.

• Maximize the distributed nature of big data clusters to optimize today’s high-data-volume applications.

With the right technology and capabilities, you can capitalize on extraordinary big-data insights.

Learn more To learn more about Location Intelligence for Big Data visit us at pitneybowes.com


United States
3001 Summer Street
Stamford, CT 06926-0700
800 327
[email protected]

Europe/United Kingdom
The Smith Centre
The Fairmile
Henley-on-Thames
Oxfordshire RG9 6AB
0800 840
[email protected]

Canada
5500 Explorer Drive
Mississauga, ON L4W 5C7
800 268
[email protected]

Australia/Asia Pacific
Level 1, 68 Waterloo Road
Macquarie Park NSW 2113
+61 2 9475
[email protected]

Pitney Bowes and the Corporate logo are trademarks of Pitney Bowes Inc. or a subsidiary. All other trademarks are the property of their respective owners. © 2016 Pitney Bowes Inc. All rights reserved. 16DC03768_US

For more information, visit us online: pitneybowes.com