Next Revolution, Toward Open Platform
Jason Han Founder & CEO, NexR
Terapot: Massive Email Archiving with Hadoop & Friends
- Commercial Hadoop Application
#2
NexR: Introduction
icube-cc (Compute)
icube-sc (Storage)
Hadoop Provisioning & Management
Massive Data Storage & Processing Platform
Cloud Computing Platform (Compatible with Amazon AWS)
Academic Support Program, Massive Email Archiving, MapReduce Workflow
Hadoop & Cloud Computing Services
Offering Hadoop & Cloud Computing Platform and Services
#3
Email Archiving: Objectives
- Regulatory compliance
- e-Discovery: litigation and legal discovery
- E-mail backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external e-mail content
#4
Email Archiving: Architecture
Email Servers
DB Server
Email Archiving Servers (HA)
Tape Library, NAS, SAN
Storage Network
Archival Storage
Metadata Indexes
Email Aging
Journaling
DAS
Search & Discovery
Crawling
#5
Email Archiving: Challenges
- Explosive growth of digital data
  - 988 XB in 2010, 6 times the 2006 volume
  - 95% (939 XB) is unstructured data, including email
  - Increases the cost and complexity of archiving
  - Result: requires scalable, low-cost archiving
- Reinforcement of data retention regulation
  - Retention, disposal, e-Discovery, security
  - HIPAA (Healthcare) 21 ~ 23 yrs, SEC17 (Trading) 6 yrs, OSHA (Toxic) 30 yrs, SOX (Finance) 5 yrs, J-SOX, K-SOX
  - Result: requires scalable archiving & fast discovery
- Need for intelligent data management
  - Knowledge management from email data
  - Filtering, monitoring, data mining, etc.
  - Result: requires integration with an intelligent system
#7
Email Archiving: Problems
Email Servers
DB Server
Email Archiving Servers (HA)
Tape Library, NAS, SAN
Storage Network
Archival Storage
Metadata Indexes
Email Aging
Journaling
DAS
Search & Discovery
Crawling
- Storage is expensive & not scalable
- Centralized search is slow & not scalable
- Discovery from tape is slow
#8
Terapot: When Hadoop Met Email Archiving…
Email Servers
DB Server
Email Archiving Servers (HA)
Metadata
Distributed Crawling
Journaling Server
Journaling
Hadoop HDFS (Archiving)
Hadoop MapReduce (Crawling, Indexing, etc)
Distributed Search & Discovery
Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
#9
Terapot: Overview
Design Principles
- Shared-nothing architecture: unlimited scalability
- Inexpensive hardware: low cost
- Using open source software: fast development
- Exploiting parallelism: high performance
- Integrating with analysis: high intelligence
Features
- Distributed massive email archiving
- High scalability: thousands of servers, billions of emails
- High performance
  - Fast search, under 1-2 seconds for each user account
  - Fast discovery in parallel with MapReduce
- High intelligence: email data mining, such as social network analysis
- Supports both an on-premise version and a cloud (hosted) version
- Developed with various open source software
#10
Terapot: Open Source Software Stack
Hadoop HDFS
Hadoop MapReduce
Apache Lucene
Hive
Crawling, Indexing, Searching, Downloading
Email Mining
ZooKeeper
Apache Tomcat
Apache JAMES
MySQL
Frontend Layer
Backend Layer
#11
Terapot: Architecture
Searching Crawling Real-Time Indexing
Search Gateway
Batch processing
MailServer MR Workflow Manager
Terapot Frontend
Terapot Clients
POP3 Server
HTTP/FTP/SFTP Server
Mail Server
NAS/NFS
Email Sources
SOAP, REST, JSON
Local (index)
HDFS (email, index)
Indexing Merging
Analyzer
Mining ETL
Analysis
Hadoop MapReduce, Lucene, & Hive
#12
Terapot Data Archiving Flow
Components: Internet, SMTP Server, HTTP/FTP/SFTP Server, NAS/NFS, Crawler (MR), Indexing (MR), Shards (emails, index), Real-Time Shard (emails, index), HDFS
Batch Processing Layer
1. Fetch emails in parallel
2. Save emails
3. Build index files
Real-Time Indexing Layer
1. Send email
2. Deliver email
3. Push email
4. Save email & build index files in runtime
5. Forward email
6. Receive email
Search Layer
1. Search emails
#13
Terapot Data Analysis Flow
Components: Terapot Archiving Storage (HDFS, Shards), Transform (MR), Hive (analysis data), Terapot Mining Engine, MySQL, NexR Terapot Front
ETL Layer
1. Fetch emails in parallel
2. Store large data
Data Analysis Layer
1. Send HiveQL to analysis data
2. Generate report in MySQL
Report Retrieval Layer
1. View report for archiving data
#14
Technical Features
- Distributed Archiving
  - Hadoop HDFS for storing email data
  - Compression and deduplication for storage space efficiency
- Distributed Crawling & Indexing
  - Implemented with Hadoop MapReduce
  - Supports both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
  - Supports batch indexing & merging by MapReduce, and real-time indexing for instant archiving
- Distributed Search
  - Shards a search job and executes it in parallel
  - Emails are searchable as soon as they are received (due to real-time indexing)
- Parallel Download
  - Downloads full search results in parallel by MapReduce
  - Supports various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)
- Standard Client Interface
  - Supports REST/SOAP and JSON interfaces
- Management
  - Configurable MapReduce job scheduling (crawling, indexing, merging, etc.)
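The slides do not say how compression and deduplication are implemented. As a rough illustration only, the sketch below assumes content-hash deduplication (SHA-256) combined with gzip compression; the `DedupStore` class and its in-memory dictionaries are hypothetical stand-ins for the archive storage, not Terapot code.

```python
import gzip
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): identical messages are
    kept once and stored gzip-compressed."""

    def __init__(self):
        self.blobs = {}  # content hash -> compressed bytes
        self.refs = {}   # message id -> content hash

    def put(self, message_id: str, raw: bytes) -> str:
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:  # store the payload only the first time
            self.blobs[digest] = gzip.compress(raw)
        self.refs[message_id] = digest
        return digest

    def get(self, message_id: str) -> bytes:
        return gzip.decompress(self.blobs[self.refs[message_id]])


store = DedupStore()
body = b"Quarterly report attached.\n" * 50
store.put("alice/1", body)
store.put("bob/7", body)  # same content delivered to a second mailbox
```

Both mailbox entries point at a single compressed blob, which is the space saving the feature list refers to.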
#15
Crawling
Store Massive Email Data in HDFS through MapReduce
- The Hadoop utility (dfs -put) just copies data sequentially
- Each Crawling MR takes & stores a range of data in parallel
Crawling Client
Crawling MR
Crawling MR
Crawling MR
HDFS
Data Location
Information
{key,email}*
INPUT
Splitting
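The splitting idea on this slide can be sketched in a few lines. This is not the Terapot crawler: threads stand in for MapReduce tasks, a dictionary stands in for HDFS, and `split_ranges`, `crawl_range`, and `crawl` are hypothetical names chosen for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total: int, parts: int):
    """Split [0, total) into `parts` contiguous ranges of near-equal size."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def crawl_range(source, start, end):
    """Stand-in for one crawling task: fetch a range, emit (key, email) pairs."""
    return [(f"mail-{i:06d}", source[i]) for i in range(start, end)]


def crawl(source, workers=3):
    """Run one task per range in parallel and collect the results."""
    store = {}  # stand-in for HDFS
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(crawl_range, source, s, e)
                   for s, e in split_ranges(len(source), workers)]
        for future in futures:
            store.update(future.result())
    return store
```

Unlike a sequential copy, each task only touches its own range, so adding workers shortens the wall-clock time as long as the email source can serve the ranges concurrently.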
#16
Indexing
Indexing Email Data with MapReduce
- Each Indexing MR takes a range of data and builds a Lucene index in parallel
Indexing Client
Indexing MR
Indexing MR
Indexing MR
HDFS Email Data
{key,index}*
INPUT
Splitting
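To make the range-per-task idea concrete, here is a minimal sketch in which a plain inverted index (term to email keys) stands in for a Lucene index. The tokenizer and the `index_range` and `build_shards` helpers are assumptions made for the example; a real indexing job would run as MapReduce tasks over HDFS input splits.

```python
from collections import defaultdict


def tokenize(text):
    """Naive whitespace tokenizer; Lucene analyzers are far richer."""
    return [t for t in text.lower().split() if t.isalnum()]


def index_range(emails):
    """One indexing task: build term -> sorted email keys for its range."""
    postings = defaultdict(set)
    for key, body in emails:
        for term in tokenize(body):
            postings[term].add(key)
    return {term: sorted(keys) for term, keys in postings.items()}


def build_shards(emails, num_shards):
    """Cut the input into contiguous ranges and index each one independently."""
    size = max(1, -(-len(emails) // num_shards))  # ceiling division
    return [index_range(emails[i:i + size])
            for i in range(0, len(emails), size)]


emails = [("k1", "budget meeting today"), ("k2", "lunch today"),
          ("k3", "budget review"), ("k4", "meeting notes")]
shards = build_shards(emails, 2)
```

Because no task reads another task's range, the per-range indexes can be built without coordination and merged later.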
#17
Real-Time Indexing
Indexing Email Data at Runtime
- Indexes in memory when a new email arrives
- Flushes the RT-Shard periodically into HDFS
Mailet Component
RT Shard
RT Shard
HDFS
Forwarding Email Data
JAMES
Local Index
emails
emails
emails
Real-Time Shard
Periodic flushing into HDFS
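A real-time shard of this kind can be sketched as an in-memory index that answers queries immediately and hands its contents off in batches. The sketch below is an assumption-laden toy: it flushes after a fixed number of emails rather than on a timer, and a Python list stands in for the segments written to HDFS. `RealTimeShard` is a hypothetical class, not the Terapot component.

```python
from collections import defaultdict


class RealTimeShard:
    """In-memory index, searchable at once and flushed in batches."""

    def __init__(self, flush_every=3):
        self.flush_every = flush_every  # assumed count-based trigger
        self.emails = {}                # key -> body, not yet flushed
        self.index = defaultdict(set)   # term -> keys, not yet flushed
        self.flushed = []               # stand-in for segments in HDFS

    def add(self, key, body):
        self.emails[key] = body
        for term in body.lower().split():
            self.index[term].add(key)
        if len(self.emails) >= self.flush_every:
            self.flush()

    def search(self, term):
        term = term.lower()
        hits = set(self.index.get(term, ()))
        for segment in self.flushed:
            hits |= segment["index"].get(term, set())
        return sorted(hits)

    def flush(self):
        if self.emails:
            self.flushed.append({"emails": self.emails,
                                 "index": dict(self.index)})
            self.emails, self.index = {}, defaultdict(set)
```

An email is searchable right after `add` returns, and stays searchable after a flush because `search` also consults the flushed segments.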
#18
Searching
Distributed Search
- Indexes are split & stored on local disks
- Each shard is responsible for searching a range of the index
Searching Client
Shard
Shard
RT Shard
HDFS Search
Read email
Local Index
Zookeeper
Notification Update shard state & index information
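The search path amounts to a scatter-gather: send the query to every shard, then merge the partial results. The following sketch assumes each hit is a `(date, key)` tuple and that results are ranked newest-first; the slides specify neither. The `Shard` class and `distributed_search` function are hypothetical, and the ZooKeeper-based shard state tracking is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Holds the index for one range of emails: term -> list of (date, key)."""

    def __init__(self, name, postings):
        self.name = name
        self.postings = postings

    def search(self, term):
        return self.postings.get(term, [])


def distributed_search(shards, term, limit=10):
    """Query every shard in parallel, then merge the hits newest-first."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: s.search(term), shards))
    merged = [hit for part in partials for hit in part]
    merged.sort(reverse=True)  # ISO dates sort correctly as strings
    return merged[:limit]


shards = [
    Shard("s0", {"budget": [("2010-01-05", "k1")]}),
    Shard("s1", {"budget": [("2010-03-02", "k9"), ("2010-02-11", "k4")]}),
    Shard("rt", {}),  # a real-time shard with no matching email yet
]
top = distributed_search(shards, "budget", limit=2)
```

Each shard only scans its own index range, so query latency depends on the slowest shard rather than on the total index size.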
#19
Parallel Downloading
Downloading Massive Search Results in Parallel
- Supports various types of communication for downloading
- Downloading MR sorts search results globally & pushes them to targets
Download Client
Shard
Shard
Shard
Download Request
write result
write result
write result
write result
HDFS
DL Map
DL Reduce
DL Reduce
DL Reduce
DL Map
DL Map
DL Map
DL Map
Local
HDFS
FTP
SFTP
HTTP
Distributed Global Sort
write result directly
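One common way to get a distributed global sort out of MapReduce is range partitioning: choose split keys, route each record to the reducer that owns its key range, and let every reducer sort locally. Whether Terapot does exactly this is not stated on the slide, so treat the sketch below as an assumption. `choose_boundaries` and `global_sort` are hypothetical names, and for simplicity the sample used to pick boundaries is the full key set.

```python
import bisect


def choose_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 split keys from a sample of the sort keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def global_sort(map_outputs, num_reducers):
    """Route records by key range on the map side; sort within each reducer."""
    keys = [key for part in map_outputs for key, _ in part]
    boundaries = choose_boundaries(keys, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for part in map_outputs:  # one list of (key, value) per map task
        for key, value in part:
            buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    # Every key in bucket i is <= every key in bucket i + 1.
    return [sorted(bucket) for bucket in buckets]


map_outputs = [[(5, "e"), (1, "a")], [(4, "d"), (2, "b")], [(3, "c"), (6, "f")]]
reducer_outputs = global_sort(map_outputs, 3)
```

Concatenating the reducer outputs in order yields a fully sorted result, so each reducer can push its slice to the download target independently.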
#20
Email Data Analysis
Analysis Process
- ETL (Extract-Transform-Load) email archiving data into Hive table format
- Analyze the data using Hive with various analysis algorithms
- Generate the analysis result report
Terapot Mining
ETL MR
ETL MR
ETL MR
Load Archiving Data
write result
write result
write result
write result
HIVE
Terapot Mining
MySQL
execute HiveQL
Generate Report
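The transform step of the ETL can be pictured as flattening each archived message into rows that a delimited Hive table could load. The row layout below (sender, recipient, link type) is an assumption made for illustration; the slides do not give the actual Hive schema. The sketch uses only Python's standard `email` package.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def email_to_rows(raw):
    """Flatten one RFC 822 message into (sender, recipient, link_type) rows."""
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    rows = []
    for header in ("To", "Cc"):
        for _, addr in getaddresses(msg.get_all(header, [])):
            if addr:
                rows.append((sender, addr.lower(), header.upper()))
    return rows


def to_hive_text(rows):
    """Serialize rows as tab-separated lines (hypothetical table layout)."""
    return "\n".join("\t".join(row) for row in rows)


raw = ("From: Alice <alice@a.com>\n"
       "To: bob@b.com, Carol <carol@c.org>\n"
       "Cc: dave@a.com\n"
       "Subject: hi\n"
       "\n"
       "body\n")
rows = email_to_rows(raw)
```

Tab-separated text of this shape matches a Hive table declared with `ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'`, after which the analysis can be expressed in HiveQL.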
#21
Types of Analysis
- Social Network Analysis
  - Personal Network Analysis
    - Computing distance between recipients or senders based on TO, CC, FROM links
    - Analyzing the statistics of mail frequency
  - Domain Analysis
    - Computing distance between recipients' domains based on TO, CC, FROM
- Keyword Analysis (in progress)
  - Keyword frequency for each user
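The slides do not define the distance metric used in personal network analysis. A simple reading, assumed here, is the hop count of the shortest path in a graph whose edges are sender-recipient links taken from the FROM, TO, and CC headers. The `build_graph` and `distance` helpers below are hypothetical and run in memory; at archive scale the same computation would be distributed.

```python
from collections import defaultdict, deque


def build_graph(links):
    """links: iterable of (sender, recipient) pairs from FROM/TO/CC headers."""
    graph = defaultdict(set)
    for sender, recipient in links:
        graph[sender].add(recipient)
        graph[recipient].add(sender)  # assumption: links are undirected
    return graph


def distance(graph, source, target):
    """Hop count of the shortest path, or None if the two are unconnected."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


graph = build_graph([("a@x.com", "b@x.com"), ("b@x.com", "c@y.org"),
                     ("d@z.net", "e@z.net")])
```

Domain analysis follows the same pattern if each address is first reduced to the part after the `@` sign.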
#22
Terapot Performance
Experimental Environment
- 11 Intel servers: 1 master + 10 slaves
- Xeon 2.0 GHz, 2 CPUs, 16 GB memory, 4 TB disk
- Number of emails: 270 million (index size: 270 GB)
Results

Indexing in local disks
Number of Emails   Number of Results   Response Time (sec)
67,217,298         12,547,398          1.4
134,434,596        25,094,796          1.4
201,651,894        37,642,194          1.4
268,869,192        50,189,592          1.4

Indexing in HDFS
Number of Emails   Number of Results   Response Time (sec)
67,217,298         12,547,398          2.8
134,434,596        25,094,796          2.8
201,651,894        37,642,194          3.2
268,869,192        50,189,592          3.2