
Next Revolution, Toward Open Platform

Jason Han, Founder & CEO, NexR


Terapot: Massive Email Archiving with Hadoop & Friends

- Commercial Hadoop Application

#2

NexR: Introduction

icube-cc (Compute)

icube-sc (Storage)

Hadoop Provisioning & Management

Massive Data Storage & Processing Platform

Cloud Computing Platform (Compatible with Amazon AWS)

Academic Support Program, Massive Email Archiving, MapReduce Workflow

Hadoop & Cloud Computing Services

Offering Hadoop & Cloud Computing Platform and Services

#3

Email Archiving: Objectives

- Regulatory compliance
- e-Discovery: litigation and legal discovery
- Email backup and disaster recovery
- Messaging system and storage optimization
- Monitoring of internal and external email content

#4

Email Archiving: Architecture

[Diagram: conventional email archiving architecture. Components: Email Servers, DB Server, Email Archiving Servers (HA), Storage Network, Archival Storage, DAS, NAS, SAN, Tape Library, Metadata Indexes. Operations: Journaling, Crawling, Email Aging, Search & Discovery.]

#5

Email Archiving: Challenges

- Explosive growth of digital data
  - Six times more data in 2010 (988 EB) than in 2006
  - 95% (939 EB) is unstructured data, including email
  - Rising cost and complexity of archiving
  - Requirement: scalable, low-cost archiving
- Tighter data retention regulation
  - Retention, disposal, e-Discovery, security
  - HIPAA (healthcare) 21-23 years, SEC17 (trading) 6 years, OSHA (toxic) 30 years, SOX (finance) 5 years, J-SOX, K-SOX
  - Requirement: scalable archiving and fast discovery
- Need for intelligent data management
  - Knowledge management from email data
  - Filtering, monitoring, data mining, etc.
  - Requirement: integration with intelligent systems

#6

Email Archiving: Regulatory Compliance

#7

Email Archiving: Problems

[Diagram: the conventional email archiving architecture (Email Servers, DB Server, Storage Network, Archival Storage, DAS, NAS, SAN, Tape Library, Metadata Indexes; Journaling, Crawling, Email Aging, Search & Discovery), annotated with three problems.]

- Storage is expensive and not scalable
- Centralized search is slow and not scalable
- Discovery from tape is slow

#8

Terapot: When Hadoop Met Email Archiving…

[Diagram: Terapot architecture. Components: Email Servers, DB Server, Metadata, Journaling Server, Hadoop HDFS (archiving), Hadoop MapReduce (crawling, indexing, etc.). Operations: Journaling, Distributed Crawling, Distributed Search & Discovery.]

- Scale-out architecture with Hadoop
  - Hadoop HDFS for archiving email data
  - Hadoop MapReduce for crawling and indexing
  - Apache Lucene for search and discovery

#9

Terapot: Overview

- Design principles
  - Shared-nothing architecture: unlimited scalability
  - Inexpensive hardware: low cost
  - Open source software: fast development
  - Exploiting parallelism: high performance
  - Integration with analysis: high intelligence
- Features
  - Distributed massive email archiving
  - High scalability
    - Thousands of servers, billions of emails
  - High performance
    - Fast search, under 1-2 seconds for each user account
    - Fast discovery in parallel with MapReduce
  - High intelligence
    - Email data mining, such as social network analysis
  - Supports both on-premise and cloud (hosted) versions
  - Developed with various open source software

#10

Terapot: Open Source Software Stack

[Diagram: software stack, divided into a Frontend Layer and a Backend Layer. Components: Hadoop HDFS, Hadoop MapReduce, Apache Lucene, Hive, ZooKeeper, Apache Tomcat, Apache JAMES, MySQL. Functions: Crawling, Indexing, Searching, Downloading, Email Mining.]

#11

Terapot: Architecture

[Diagram: Terapot architecture, built on Hadoop MapReduce, Lucene, & Hive. Labels: Terapot Clients (SOAP, REST, JSON); Terapot Frontend; Search Gateway; MailServer; MR Workflow Manager; Searching; Crawling; Real-Time Indexing; Batch processing; Indexing; Merging; Analysis; Analyzer; Mining; ETL; Local (index); HDFS (email, index); Email Sources (POP3 Server, HTTP/FTP/SFTP Server, Mail Server, NAS/NFS).]

#12

Terapot Data Archiving Flow

[Diagram: data archiving flow across three layers. Emails and indexes are kept in shards, backed by HDFS. Email sources: HTTP/FTP/SFTP Server, SMTP Server, NAS/NFS, Internet.]

Batch Processing Layer: Crawler (MR) and Indexing (MR)
1. Fetch emails in parallel
2. Save emails
3. Build index files

Real-Time Indexing Layer: Real-Time Shard
1. Send email
2. Deliver email
3. Push email
4. Save email & build index files at runtime
5. Forward email
6. Receive email

Search Layer: Shards
1. Search emails

#13

Terapot Data Analysis Flow

[Diagram: data analysis flow across three layers. Components: Terapot Archiving Storage (shards), HDFS, Hive (analysis data), Terapot Mining Engine, MySQL, NexR Terapot Front.]

ETL Layer: Transform (MR)
1. Fetch emails in parallel
2. Store large data

Data Analysis Layer
1. Send HiveQL to analysis data
2. Generate report in MySQL

Report Retrieval Layer
1. View report for archiving data

#14

Technical Features

- Distributed archiving
  - Hadoop HDFS for storing email data
  - Compression and deduplication for storage space efficiency
- Distributed crawling & indexing
  - Implemented with Hadoop MapReduce
  - Supports both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
  - Supports batch indexing & merging by MapReduce, and real-time indexing for instant archiving
- Distributed search
  - Shards a search job and executes it in parallel
  - Emails are searchable instantly on receipt (due to real-time indexing)
- Parallel download
  - Downloads full search results in parallel by MapReduce
  - Supports various download protocols (local FS, HDFS, FTP, SFTP, HTTP, etc.)
- Standard client interface
  - Supports REST/SOAP and JSON interfaces
- Management
  - Configurable MapReduce job scheduling (crawling, indexing, merging, etc.)
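The slide lists compression and deduplication without saying how Terapot implements them. As a rough illustration only, the sketch below assumes content-hash deduplication (SHA-256) combined with gzip compression. The class `DedupStore` and its in-memory dictionaries are hypothetical stand-ins for archive storage, not Terapot's design.

```python
import gzip
import hashlib


class DedupStore:
    """Toy content-addressed store: identical payloads are kept once, gzip-compressed."""

    def __init__(self):
        self.blobs = {}  # content digest -> compressed payload
        self.refs = {}   # message id -> content digest

    def put(self, message_id: str, raw: bytes) -> str:
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:
            # Compress and store the payload only the first time it is seen.
            self.blobs[digest] = gzip.compress(raw)
        self.refs[message_id] = digest
        return digest

    def get(self, message_id: str) -> bytes:
        return gzip.decompress(self.blobs[self.refs[message_id]])

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())
```

Storing the same message under two ids keeps a single compressed copy, which is where the space saving comes from when one email is delivered to many mailboxes.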

#15

Crawling

- Store massive email data in HDFS through MapReduce
  - The Hadoop utility (dfs -put) just copies data sequentially
  - Each Crawling MR task takes and stores a range of data in parallel

[Diagram: the Crawling Client splits the INPUT (data location information) across several Crawling MR tasks, which write {key,email}* records to HDFS.]
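The range-splitting idea on this slide can be sketched in a few lines. This is an illustration under assumptions, not Terapot's crawler: a thread pool stands in for MapReduce tasks, and `split_ranges`, `crawl_range`, and `crawl` are hypothetical names. The fetch step is faked so the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total: int, workers: int) -> list[range]:
    """Divide positions 0..total-1 into at most `workers` contiguous ranges."""
    size, extra = divmod(total, workers)
    ranges, start = [], 0
    for i in range(workers):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty ranges when there are few items
            ranges.append(range(start, end))
        start = end
    return ranges


def crawl_range(locations: list[str], span: range) -> list[tuple[str, str]]:
    """Stand-in for one crawling task: emit a (key, email) pair per location."""
    return [(locations[i], f"<contents of {locations[i]}>") for i in span]


def crawl(locations: list[str], workers: int = 3) -> list[tuple[str, str]]:
    """Split the input into ranges and process the ranges in parallel."""
    spans = split_ranges(len(locations), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda span: crawl_range(locations, span), spans)
    return [record for part in parts for record in part]
```

Each worker handles only its own range, so throughput grows with the number of workers instead of being bound by one sequential copy.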

#16

Indexing

- Indexing email data with MapReduce
  - Each Indexing MR task takes a range of data and builds a Lucene index in parallel

[Diagram: the Indexing Client splits the INPUT (email data in HDFS) across several Indexing MR tasks, which produce {key,index}* output.]
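To make "one index per range of data" concrete, here is a minimal sketch that builds a small inverted index per split and then merges the partial indexes. A plain dictionary stands in for a Lucene index, and the function names `build_index` and `merge_indexes` are hypothetical.

```python
import re
from collections import defaultdict


def build_index(emails: dict[str, str]) -> dict[str, set[str]]:
    """Index one split: map each lowercase term to the email keys containing it."""
    index = defaultdict(set)
    for key, body in emails.items():
        for term in re.findall(r"[a-z0-9]+", body.lower()):
            index[term].add(key)
    return dict(index)


def merge_indexes(indexes: list[dict[str, set[str]]]) -> dict[str, set[str]]:
    """Combine partial indexes by taking the union of keys for each term."""
    merged = defaultdict(set)
    for index in indexes:
        for term, keys in index.items():
            merged[term] |= keys
    return dict(merged)
```

Because each split is indexed independently, the `build_index` calls can run in parallel and only the merge step needs to see more than one partial index.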

#17

Real-Time Indexing

- Indexing email data at runtime
  - Index in memory when a new email arrives
  - Flush the RT shard into HDFS periodically

[Diagram: Apache JAMES receives mail and a Mailet component forwards the email data to the RT shards; each real-time shard keeps a local index with its emails and is flushed periodically into HDFS.]
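The index-in-memory, flush-periodically pattern might look roughly like the sketch below. It is an assumption-laden illustration: `RealTimeShard` is a hypothetical class, a Python list stands in for HDFS, and a substring scan stands in for a real in-memory index.

```python
import time


class RealTimeShard:
    """Keeps newly arrived emails searchable in memory and flushes them in batches."""

    def __init__(self, flush_interval: float, sink: list, clock=time.monotonic):
        self.flush_interval = flush_interval  # seconds between flushes
        self.sink = sink                      # stand-in for HDFS
        self.clock = clock                    # injectable for testing
        self.buffer = {}                      # key -> email body, held in memory
        self.last_flush = clock()

    def on_email(self, key: str, body: str) -> None:
        """Index the new email immediately, then flush if the interval has elapsed."""
        self.buffer[key] = body
        if self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def search(self, term: str) -> list[str]:
        """Return keys of buffered emails whose body contains the term."""
        needle = term.lower()
        return sorted(k for k, body in self.buffer.items() if needle in body.lower())

    def flush(self) -> None:
        """Move the buffered batch to the sink and start a new batch."""
        if self.buffer:
            self.sink.append(dict(self.buffer))
            self.buffer.clear()
        self.last_flush = self.clock()
```

In this sketch a flushed email is no longer found by the real-time shard; the assumption is that the regular shards would serve it from then on.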

#18

Searching

- Distributed search
  - Indexes are split and stored on local disks
  - Each shard is responsible for searching a range of the index

[Diagram: the Searching Client sends a search to the Shards and the RT Shard; each shard searches its local index and reads email from HDFS. ZooKeeper tracks shard state and index information and sends update notifications.]
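Splitting a search across shards is a scatter-gather pattern, sketched below under simplifying assumptions: each hypothetical `Shard` holds a small term-to-keys dictionary in place of a local Lucene index, and a thread pool stands in for remote calls. Shard discovery through ZooKeeper is left out.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Owns one slice of the index: term -> list of email keys."""

    def __init__(self, index: dict[str, list[str]]):
        self.index = index

    def search(self, term: str) -> list[str]:
        return self.index.get(term.lower(), [])


def distributed_search(shards: list[Shard], term: str) -> list[str]:
    """Send the query to every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: shard.search(term), shards)
    return sorted(key for part in partials for key in part)
```

Response time is then governed by the slowest shard rather than by the total index size, which is the property the design relies on.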

#19

Parallel Downloading

- Downloading massive search results in parallel
  - Supports various types of communication for downloading
  - The Downloading MR job sorts search results globally and pushes them to the targets

[Diagram: the Download Client sends a download request; the Shards write results to HDFS; DL Map and DL Reduce tasks perform a distributed global sort and write results directly to the target (Local, HDFS, FTP, SFTP, HTTP).]
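The slide does not say how the distributed global sort is done. One common approach, assumed here, is range partitioning: sample the keys to pick boundaries, route each record to the reducer that owns its range, and sort each range locally. The function names are hypothetical.

```python
import bisect


def choose_boundaries(sample: list, num_reducers: int) -> list:
    """Pick num_reducers - 1 split points from a sample of the keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[len(ordered) * i // num_reducers] for i in range(1, num_reducers)]


def global_sort(records: list, num_reducers: int, sample_size: int = 100) -> list[list]:
    """Return one sorted partition per reducer; concatenated, they are globally sorted."""
    boundaries = choose_boundaries(records[:sample_size], num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for record in records:
        # "Map" side: route each record to the reducer owning its key range.
        partitions[bisect.bisect_right(boundaries, record)].append(record)
    # "Reduce" side: each reducer sorts only its own range.
    return [sorted(part) for part in partitions]
```

Since every key in partition i is no larger than any key in partition i + 1, each reducer can push its output to the target independently and the pieces still line up in order.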

#20

Email Data Analysis

- Analysis process
  - ETL (Extract-Transform-Load) the archived email data into Hive table format
  - Analyze the data using Hive with various analysis algorithms
  - Generate the analysis result report

[Diagram: ETL MR tasks load the archiving data and write results to Hive; Terapot Mining executes HiveQL and generates the report in MySQL.]
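The transform step of the ETL can be pictured as flattening each raw email into rows that a table can hold. The sketch below uses Python's standard `email` package; the row layout (sender, link type, recipient) and the tab-separated output are assumptions for illustration, not Terapot's actual Hive schema.

```python
from email import message_from_string
from email.utils import getaddresses


def to_rows(raw: str) -> list[tuple[str, str, str]]:
    """Flatten one email into (sender, link_type, recipient) rows."""
    msg = message_from_string(raw)
    senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
    rows = []
    for header in ("To", "Cc"):
        for _, addr in getaddresses(msg.get_all(header, [])):
            rows.extend((sender, header.upper(), addr.lower()) for sender in senders)
    return rows


def to_tsv(rows: list[tuple[str, str, str]]) -> str:
    """Serialize rows as tab-separated text, one row per line."""
    return "\n".join("\t".join(row) for row in rows)
```

A tab-separated file like this can be read by a Hive table declared with a tab field delimiter, after which the link rows are available to HiveQL queries.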

#21

Types of Analysis

- Social network analysis
  - Personal network analysis
    - Computing the distance between recipients or senders based on TO, CC, and FROM links
    - Analyzing mail frequency statistics
  - Domain analysis
    - Computing the distance between recipients' domains based on TO, CC, and FROM links
- Keyword analysis (in progress)
  - Keyword frequency for each user
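The slide does not define "distance". One plausible reading, assumed here, is the number of hops between two addresses in an undirected graph whose edges are sender-recipient links. The sketch below computes that with breadth-first search; `build_graph` and `distance` are hypothetical names.

```python
from collections import defaultdict, deque


def build_graph(links: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Build an undirected graph from (sender, recipient) links."""
    graph = defaultdict(set)
    for sender, recipient in links:
        graph[sender].add(recipient)
        graph[recipient].add(sender)
    return graph


def distance(graph: dict[str, set[str]], source: str, target: str):
    """Return the hop count between two addresses, or None if unconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None
```

The same functions work for domain analysis if each address is first reduced to its domain before the links are built.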

#22

Terapot Performance

- Experimental environment
  - 11 Intel servers: 1 master + 10 slaves
    - Xeon 2.0 GHz, 2 CPUs, 16 GB memory, 4 TB disk
  - Number of emails: 270 million (index size: 270 GB)
- Results

Indexing in local disks

Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           1.4
134,434,596         25,094,796           1.4
201,651,894         37,642,194           1.4
268,869,192         50,189,592           1.4

Indexing in HDFS

Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           2.8
134,434,596         25,094,796           2.8
201,651,894         37,642,194           3.2
268,869,192         50,189,592           3.2

#23

Demonstration

#24

Hadoop & Cloud Computing Company

www.nexr.co.kr