High Performance Indexing of Large Heterogeneous Data Sets using GPU Massimo Bernaschi IAC – National Research Council of Italy funded by the ISEC programme under GA n° 4000003856

High Performance Indexing of Large Heterogeneous Data Sets using GPU - GTC … · 2015. 3. 18.


High Performance Indexing of Large Heterogeneous Data Sets using GPU

Massimo Bernaschi IAC – National Research Council of Italy

funded by the ISEC programme under GA n° 4000003856


Why a new indexer?

• Law Enforcement Agencies need an easy and fast tool to index and search seized disk images

GTC 2015 2


How it works

• Extract raw files and metadata from (seized) disk images

• Distribute them over multiple systems

• Extract plain text and metadata from every file – including deleted files

• Create distributed indexes

• Provide a friendly user interface to query results

• Organize query results in an intuitive visual representation


Architecture Overview

[Diagram. Components shown: Web GUI (Search, Admin), Database, DBMS, Mediator, Coordinator (Status Manager, Job Scheduler), Searcher, Index Repo, and an HPC cluster of Worker Nodes running Worker Agents. Connections' legend: DB input/output, HPC cluster.]


Architecture Overview (cont.)

• Coordinator
  – Manages, coordinates and monitors the whole system
• DBMS
  – Provides the interface to the database
• Mediator
  – Mediates among all components to ease message communication
• Admin
  – Web UI used to manage the infrastructure, create investigation cases and add disk images for indexing
• Worker Agent
  – Runs on all worker nodes and provides services for monitoring, starting, stopping and configuring local components
• Index Repository
  – Repository used to store the results of all indexing jobs


Architecture Overview (cont.)

• Each worker node can run one or more
  – Image-Extractor: extracts files from seized disk images
  – Docu-Parser: transforms extracted documents into plain text and metadata
  – Docu-Indexer: creates searchable indexes from the transformed text and metadata
• Managed by worker agents
• They are connected to form an Extraction –> Parse –> Indexing pipeline
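
The Extraction, Parse and Indexing chain can be pictured as three stages that hand records to each other in memory. The sketch below is illustrative only: `extract_files`, `parse_documents`, `build_index` and the dictionary used as a "disk image" are hypothetical stand-ins, not the actual Image-Extractor, Docu-Parser or Docu-Indexer.

```python
from collections import defaultdict

def extract_files(disk_image):
    """Stage 1 (EXTRACT): yield (path, raw_bytes) pairs from a toy 'disk image'."""
    for path, raw in disk_image.items():
        yield path, raw

def parse_documents(files):
    """Stage 2 (PARSE): turn raw bytes into plain text plus minimal metadata."""
    for path, raw in files:
        text = raw.decode("utf-8", errors="replace")
        yield {"path": path, "size": len(raw), "text": text}

def build_index(documents):
    """Stage 3 (INDEXING): build a toy inverted index mapping term -> set of paths."""
    index = defaultdict(set)
    for doc in documents:
        for term in doc["text"].lower().split():
            index[term].add(doc["path"])
    return dict(index)

# Hypothetical in-memory stand-in for a seized disk image.
disk_image = {
    "/home/a/notes.txt": b"Meeting at noon",
    "/home/a/todo.txt": b"Call Bob at noon",
}

# The stages are chained lazily: each file flows through all three stages
# without intermediate results being written to disk.
index = build_index(parse_documents(extract_files(disk_image)))
print(sorted(index["noon"]))
```

Because the first two stages are generators, only one document is in flight at a time; in a real deployment each stage would run as a separate component on the worker nodes.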


Extract – Parse – Indexing Pipeline

[Diagram: one Image-Extractor (1: EXTRACT), four Docu-Parsers (2: PARSE) and four Docu-Indexers (3: INDEXING).]


Disk Image Extraction

• Performed by the Image-Extractor component
• Based on The Sleuth Kit Library®
• Supports Unix, Linux, OS X and Windows volumes and file systems
• Extracts raw files and file system metadata

The Sleuth Kit Library http://www.sleuthkit.org/

System metadata: CREATION_DATE, FILENAME, SIZE, PATH, LAST_MODIFICATION_DATE


Document Parsing

• Performed by the Docu-Parser component
• Based on the Apache Tika™ Library
• Detects and extracts document metadata and structured text
• Supports about 1400 file types

Tika Library http://tika.apache.org/

Document metadata: AUTHOR, TITLE, KEYWORDS, SUMMARY, LANGUAGE, TOOL, RIGHTS, FORMAT


Document Indexing

• Performed by the Docu-Indexer component
• Based on the Apache Lucene™ Libraries
• Provides indexing and search capabilities
• Index size is roughly 20-30% of the size of the indexed text
• Indexes are collected into the Index Repository

Apache Lucene™ Libraries http://lucene.apache.org/


Document Searching

• Based on the Apache Lucene™ Libraries
• Provides searching capabilities:
  – ranked searching
  – multiple-index searching with merged results
  – many powerful query types
  – fielded searching (e.g. title, author, contents)
• Working on presenting results through an efficient and interactive interface
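
To illustrate what "multiple-index searching with merged results" involves, the sketch below merges ranked hit lists coming from several indexes into one ranked list. It is plain Python and does not use the Lucene API; the function name `merge_ranked_hits`, the document names and the scores are made up, and it assumes scores from different indexes are directly comparable.

```python
import heapq
from itertools import islice

def merge_ranked_hits(per_index_hits, top_k):
    """Merge hit lists that are each sorted by descending score.

    per_index_hits: list of lists of (score, doc_id) tuples, one list per index.
    Returns the top_k hits overall, best first.
    """
    merged = heapq.merge(*per_index_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, top_k))

# Hypothetical hits from two indexes (scores are invented for illustration).
hits_index_a = [(0.92, "a/report.pdf"), (0.40, "a/memo.doc")]
hits_index_b = [(0.75, "b/slides.ppt"), (0.10, "b/readme.txt")]

top = merge_ranked_hits([hits_index_a, hits_index_b], top_k=3)
print(top)
```

`heapq.merge` consumes the per-index lists lazily, so only as many hits as requested are pulled from each index.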


HPC Document Indexing

• Text analysis requires tokenization, filtering and stop-word removal
• GPU cards offer huge computing power
• Combine CLucene indexing with GPU power to accelerate these steps

CLucene Libraries http://clucene.sourceforge.net/


GPU CUDA Text Analysis

Example input text: "My name is Bob." followed by the terminator \0

• One CUDA thread per character. Each thread applies the LowerCase filter.
  – Result: "my name is bob." followed by \0
• Each CUDA thread performs tokenization by locating delimiter positions.
  – Delimiter vector (-1 marks a non-delimiter character): -1 -1 2 -1 -1 -1 -1 7 -1 -1 10 -1 -1 -1 14 15
• Vector processing creates two vectors representing the start and end token indexes respectively.
  – Start indexes (relative to the input text): 0 3 8 11
  – End indexes (relative to the input text): 2 7 10 14
  – Tokens: my, name, is, bob
• One CUDA thread per token. Each thread applies the StopWords filter.
  – Remaining tokens: my, name, bob
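
The analysis steps on this slide can be followed on the CPU with a short sequential emulation. The sketch below is not the CUDA implementation: each list comprehension merely plays the role of a kernel with one thread per element, and the delimiter set and the stop-word list (`{"is"}`) are assumptions chosen to match the slide's example.

```python
def analyze(text, stop_words):
    """Emulate the data-parallel steps sequentially.

    Assumes the text ends with a delimiter (here the \\0 terminator),
    so every token has a matching end index.
    """
    delimiters = set(" .,;:!?\t\n\0")  # assumed delimiter set

    # Step 1: one "thread" per character applies the lowercase filter.
    lowered = [ch.lower() for ch in text]

    # Step 2: one "thread" per character marks delimiter positions (-1 otherwise).
    marks = [i if ch in delimiters else -1 for i, ch in enumerate(lowered)]

    # Step 3: derive token start and end indexes from the marks.
    n = len(lowered)
    starts = [i for i in range(n)
              if marks[i] == -1 and (i == 0 or marks[i - 1] != -1)]
    ends = [i for i in range(1, n)
            if marks[i] != -1 and marks[i - 1] == -1]

    # Step 4: one "thread" per token applies the stop-word filter.
    tokens = ["".join(lowered[s:e]) for s, e in zip(starts, ends)]
    kept = [t for t in tokens if t not in stop_words]
    return marks, starts, ends, kept

marks, starts, ends, kept = analyze("My name is Bob.\0", stop_words={"is"})
print(marks)
print(starts, ends)
print(kept)
```

Each step depends only on an element and its immediate neighbour, which is what makes a one-thread-per-character or one-thread-per-token mapping possible on a GPU.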


(2070 Fermi) GPU CUDA Results

[Bar chart: time in seconds (axis 0 to 70) versus plain-text size (4MB, 32MB, 128MB) for CLucene and GPU+CLucene. Speed-up labels: 2x, 7x, 9x.]


CUDA and (Java)Lucene 1/2

● How do they cooperate?


CUDA and (Java)Lucene 2/2

● How do they cooperate efficiently?
  o Smart and efficient memory transfer using the Java Unsafe API


Test Environment

• 4 Worker Nodes
  – 4 CPUs / 24 cores at 2.67 GHz, 48 GB RAM
  – Two 2070 GPUs per node
  – Running Worker Agents and Extract – Parse – Indexing pipelines
• 1 Management Node
  – Running all other components
• 1G Ethernet


Disk Images for Test

• Disk images built using the Govdocs1 document set
• The Govdocs1 digital corpus includes nearly 1 million freely redistributable files

Govdocs1 available @ http://digitalcorpora.org/corpora/files

[Bar chart: Govdocs1 File Types. Share of files (axis 0% to 25%) for pdf, image, doc, ppt, ps, gz.]


Results

Disk Image Size (GB)   Extract-Parse-Indexing Time   DD Time    # Files   Index Size (GB)
32                     00:09:15                      00:07:36   58225     4.8
80                     00:17:43                      00:14:50   117282    8.5
100                    00:37:16                      00:20:15   186305    12
210                    01:02:22                      00:33:21   368856    19
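
The table is easier to compare across rows once the times are converted to rates. The snippet below is a small worked calculation on the figures above; the derived quantities (GB per minute, pipeline time relative to the DD Time column, index size relative to image size) are additions for illustration and do not appear on the slide.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# Rows copied from the results table:
# (image size in GB, pipeline time, DD time, number of files, index size in GB)
rows = [
    (32, "00:09:15", "00:07:36", 58225, 4.8),
    (80, "00:17:43", "00:14:50", 117282, 8.5),
    (100, "00:37:16", "00:20:15", 186305, 12),
    (210, "01:02:22", "00:33:21", 368856, 19),
]

summary = []
for size_gb, pipeline, dd, n_files, index_gb in rows:
    t_pipe, t_dd = to_seconds(pipeline), to_seconds(dd)
    summary.append({
        "size_gb": size_gb,
        "gb_per_min": size_gb / t_pipe * 60,  # pipeline throughput
        "vs_dd": t_pipe / t_dd,               # pipeline time relative to DD time
        "index_share": index_gb / size_gb,    # index size relative to image size
    })

for row in summary:
    print(f'{row["size_gb"]:>4} GB  {row["gb_per_min"]:.2f} GB/min  '
          f'{row["vs_dd"]:.2f}x DD  index {row["index_share"]:.0%}')
```

For the 32 GB image this works out to roughly 3.5 GB/min, with the full pipeline taking about 1.2 times the DD time and the index amounting to 15% of the image size.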

[Line chart: "Extract-Parse-Index Time" (axis 0:00:00 to 1:12:00) versus seized disk image size in GB (32, 80, 100, 210).]

29/09/14 21


64 GB Disk Image Indexing

[Bar chart: size in GB (axis 0 to 70) of the Disk Image, the Extracted Text and the ISODAC Index, broken down by file type: text, pdf, xls, others, html, doc, csv, xml, ps, ppt, gz, image.]


Highlights

• Streamed In-Memory Extraction+Parse+Indexing
  – Only indexes are written to disk
  – Much faster than a Map-Reduce based solution
• File indexing failure recovery
  – Files are processed again in case of failure
  – Selectable files extraction and indexing
• Exportable indexes
  – Generated indexes can be exported and handed back to investigators
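
The "files are processed again in case of failure" behaviour can be sketched as a simple re-queue loop. This is only one possible reading of the recovery process: the function `index_with_recovery`, the retry limit and the flaky indexer used in the example are hypothetical and are not taken from the talk.

```python
def index_with_recovery(files, index_file, max_attempts=3):
    """Process every file, re-queuing the ones that fail.

    index_file is any callable that raises an exception on failure.
    Returns (indexed, failed) lists of file names.
    """
    pending = [(name, 1) for name in files]
    indexed, failed = [], []
    while pending:
        name, attempt = pending.pop(0)
        try:
            index_file(name)
            indexed.append(name)
        except Exception:
            if attempt < max_attempts:
                pending.append((name, attempt + 1))  # process again later
            else:
                failed.append(name)                  # give up after max_attempts
    return indexed, failed

# Hypothetical indexer that fails once on one file, then succeeds.
seen = set()

def flaky_indexer(name):
    if name == "b.pdf" and name not in seen:
        seen.add(name)
        raise RuntimeError("transient failure")

indexed, failed = index_with_recovery(["a.txt", "b.pdf", "c.doc"], flaky_indexer)
print(indexed, failed)
```

Re-queuing a failed file at the end of the list keeps one problematic document from stalling the rest of the stream.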


Future Works

• Distribute workload based on file type

• Enhance scheduling algorithm

• Support file extraction filtering

• Alternative ad-hoc parser based on file type

• CUDA version of Tesseract (for fast OCR)

• Enhanced and interactive results visualization


Tesseract OCR

• Profiling with Valgrind's callgrind tool shows that 3 functions account for approximately 50% of the self execution time
  – Go parallel: try to parallelize these in CUDA
• In multi-page documents, the ProcessPages function takes nearly 98% of the total execution time
  – Go parallel: OpenMP with 1 page per thread, or get the total number of pages and launch a process per page
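
The page-level option can be sketched with a worker pool that handles one page per worker. The code below does not call Tesseract: `ocr_page` is a hypothetical placeholder for the per-page recognition work, and a thread pool is used only to mirror the "1 page per thread" idea.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page):
    """Hypothetical stand-in for the per-page OCR work; returns fake 'text'."""
    return f"text of page {page}"

def ocr_document(pages, max_workers=4):
    """Recognize pages concurrently, one page per worker, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, pages))

result = ocr_document(range(1, 6))
print(result)
```

In standard CPython, threads do not speed up CPU-bound pure-Python code, so a real speed-up would need the per-page work to run in native code or in separate processes, which corresponds to the second option on the slide.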


Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!


Why Not ?

• Hadoop performance
  – MapReduce performance [Jiang et al. (2010)] [Lin et al. (2012)]
  – HDFS performance [Dong et al. (2014)]
• Seized disk images are neither stored on a cluster nor available on a distributed infrastructure
• As fast as possible
  – In-Memory Streaming Pipeline
  – Only indexes are written to disk
  – Ad-hoc recovery process
