© 2013 by Elbit Systems | Elbit Systems Proprietary
Elbit Systems Land and C4I
Defense Industry & Open Source & BigData
Speaker
German Gavrilov [email protected]
Elbit Systems Land and C4I
Intelligence Division
Cyber Domain
Defense Industry
Open Source
Big Data
Defense Industry & Open Source & Big Data
Agenda
Need
Growth in global data volume
Need for intelligence systems
What is Big Data?
3V Model of Big Data
Scale up / Scale out
CAP theorem
Types of solutions
The Apache Hadoop project
HDFS
Map Reduce
Hadoop Projects
Example architecture of an information system using Hadoop
Need: growth in global data volume
Twitter produces over 340 million tweets per day, with over 500
million registered users as of 2012
Over 32 billion searches were performed last month on Twitter
Facebook creates over 30 billion pieces of content, including
web links, news, blogs, and photos
Zynga processes 1 petabyte of content for players every day
More than 2 billion videos are watched on YouTube every day
By 2015, nearly 3 billion people will be online, pushing
the data created and shared to nearly 8 zettabytes.
Need: growth in global data volume
[Figure: quantity of global data]
Need: intelligence systems
Ability to ingest large volumes of data in a short time (near real-time)
Ability to ingest different types of data
Ability to process large volumes of information
Ability to run different analyses tailored to the type of information
Ability to investigate and present information quickly, conveniently, and clearly
The customer wants to be able to read the information that exists in the world conveniently
Examples of images that people uploaded to their Twitter accounts
What is Big Data?
What is data?
Data is information in raw or unorganized form, such as letters,
numbers, or symbols.
What is Big Data?
Big Data refers to large datasets which are difficult to store, manage
and analyze.
Every day, we create over 2.5 quintillion bytes of
data, so much that 90% of the data in the
world today has been created in the last two
years alone.
What is Big Data?
• O’Reilly Radar definition:
Big data is when the size of the data itself becomes part of the problem
• EMC/IDC definition of big data:
Big data technologies describe a new generation of technologies and
architectures, designed to economically extract value from very large volumes
of a wide variety of data, by enabling high-velocity capture, discovery, and/or
analysis.
• IBM says that "three characteristics define big data":
Volume (Terabytes -> Zettabytes)
Variety (Structured -> Semi-structured -> Unstructured)
Velocity (Batch -> Streaming Data)
3V Model of Big Data
Distributing information across machines
Scale up / Vertical scaling
To scale vertically means to add resources to a
single node in a system, typically involving the
addition of CPUs or memory to a single
computer.
Scale out / Horizontal scaling / Distributed systems
To scale horizontally means to add more nodes
to a system, such as adding a new computer to
a distributed software application.
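One way to picture scaling out is to look at how records are assigned to nodes. The following is a minimal sketch, not taken from the slides: it assumes simple hash-modulo partitioning, and the node names and the helpers `node_for_key` and `distribute` are hypothetical.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick the node responsible for a key by hashing the key (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    keys = [f"record-{i}" for i in range(1000)]
    # Scaling out: the same records spread over 2 nodes, then over 4 nodes.
    for cluster in (["node-1", "node-2"],
                    ["node-1", "node-2", "node-3", "node-4"]):
        placement = distribute(keys, cluster)
        print({node: len(items) for node, items in placement.items()})
```

With more nodes, each node holds a smaller share of the records. Plain modulo hashing moves most keys whenever the node count changes, which is why distributed stores often use consistent hashing instead.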
CAP theorem
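CA (Consistency + Availability)
RDBMSs (MySQL, …)
Greenplum
Vertica
Aster Data
AP (Availability + Partition tolerance)
Cassandra
CouchDB
SimpleDB
Dynamo
CP (Consistency + Partition tolerance)
HBase
MongoDB
Terrastore
BigTable
MemcacheDB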
Types of solutions
Store type | Description | Conceptual structure
Key-Value Stores | Schema-less | Key -> Value
Column-oriented databases | Storage by column | Columns stored separately: Weight (2.85 kg, 1.23 kg, 3.76 kg); Price (24.00 $, 17.50 $, 27.30 $); Target (Israel, Italia, Turkey)
Graph Databases | Uses nodes and edges to represent data. | Data Node, Data Node, Data Node
Document Oriented Databases | Store documents that are semi-structured. Often XML databases. | Key -> Structured Document (XML)
Sharded RDBMS (MPP databases) | | RDBMS, RDBMS, RDBMS
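The difference between row storage and storage by column can be sketched with the Weight / Price / Target values from the slide. This is an illustration only: the field names and the `to_columns` helper are hypothetical, and real column stores add compression and on-disk layouts that are not shown.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"weight_kg": 2.85, "price": 24.00, "target": "Israel"},
    {"weight_kg": 1.23, "price": 17.50, "target": "Italia"},
    {"weight_kg": 3.76, "price": 27.30, "target": "Turkey"},
]


def to_columns(records):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: all values of one field are stored together.
columns = to_columns(rows)

# An aggregate over one field only touches that field's column.
total_price = sum(columns["price"])

if __name__ == "__main__":
    print(columns["target"])
    print(round(total_price, 2))
```

Reading a single column for an aggregate is the access pattern that column-oriented databases are built around, whereas fetching one whole record is cheaper in the row layout.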
Types of solutions
Type | Examples | Performance | Horizontal Scalability | Flexibility in Data Variety | Complexity of Operation | Functionality
Key-Value stores | Berkeley, Scalaris, MemcacheDB | high | high | high | none | variable (none)
Column-oriented databases | Cassandra, HP Vertica, BigTable, HBase, OrientDB | high | high | moderate | low | minimal
Graph Databases | Neo4j, InfiniteGraph, Titan, OrientDB | variable | variable | high | high | graph theory
Document Oriented Databases | CouchDB, MongoDB, SimpleDB, Redis | high | variable (high) | high | low | variable (low)
Sharded RDBMS (MPP) | HP Vertica, EMC Greenplum, Aster Data | variable | variable | low | moderate | relational
The Apache Hadoop project
hadoop.apache.org
“The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using a
simple programming model”
wikipedia.org
Apache Hadoop is an open-source software framework that supports data-
intensive distributed applications. Hadoop implements a computational paradigm
named MapReduce, where the application is divided into many small fragments
of work, each of which may be executed or re-executed on any node in the
cluster.
Hadoop provides a distributed file system that stores data on the compute
nodes, providing very high aggregate bandwidth across the cluster. It enables
applications to work with thousands of computation-independent computers and
petabytes of data.
The Apache Hadoop project
Organizations using Hadoop to run large distributed computations
Facebook.com
Amazon.com
Ancestry.com
Akamai
American Airlines
AOL
Apple
eBay
Hortonworks
Federal Reserve Board of Governors
Foursquare
Yahoo!
InMobi
Intuit
Joost
Last.fm
Microsoft
NetApp
Netflix
Ooyala
Riot Games
The New York Times
SAP AG
SAS Institute
StumbleUpon
Yodlee
Fox Interactive Media
Gemvara
Hewlett-Packard
IBM
Companies that provide Hadoop in their products
IBM - InfoSphere BigInsights
Oracle - Big Data Appliance
EMC - Pivotal HD
Microsoft - HDInsight
Others
The Apache Hadoop project - HDFS
HDFS is a distributed, scalable, and portable
file system. HDFS is designed to store large
amounts of data across many servers in a cluster.
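To make the idea of storing one large file across many servers concrete, here is a toy sketch. It does not use any Hadoop API: `split_into_blocks` and `place_replicas` are hypothetical helpers, the round-robin placement is a simplification of HDFS's actual placement policy, and the 128 MB block size and 3 replicas are illustrative values that match common HDFS defaults.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (simplified)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB, 128 * MB)
    nodes = ["datanode-1", "datanode-2", "datanode-3", "datanode-4"]
    for block_id, holders in place_replicas(len(blocks), nodes, 3).items():
        print(block_id, blocks[block_id] // MB, "MB ->", holders)
```

Because every block lives on several nodes, losing one server does not lose data, and different blocks of the same file can be read from different machines in parallel.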
The Apache Hadoop project - MapReduce
MapReduce is the key algorithm that the
Hadoop MapReduce engine uses to distribute
work around a cluster.
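The MapReduce model can be sketched in a few lines with the classic word-count example. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's API; the function names are hypothetical, and on a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of input fragments."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Each call to `map_phase` depends only on its own fragment, and each call to `reduce_phase` depends only on one key, which is what lets the work be spread over a cluster and re-run on another node after a failure.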
The Apache Hadoop project
Data Access / Query abilities
• Pig (simple query language)
• Hive (SQL-like queries)
• Cascading (software abstraction layer)
• Mahout (machine learning)
• Hama (scientific computation)
• Avro (data serialization system)
Map Reduce / Distributed processing
• Hadoop Map Reduce implementation
Management tools
• Ambari (deploying, managing, and monitoring tool)
• Sqoop (data transfer tool)
• Oozie (workflow scheduler system)
• ZooKeeper (coordination service)
• Flume (framework for populating Hadoop)
Storage / Data structure
• Hadoop Distributed File System
• Hue (File Browser for HDFS)
• HBase (column-oriented database)
• HCatalog (table/storage management service)
Hadoop Ecosystem
Example architecture of an information system using Hadoop
The End
Thank You!
Growth in global data volume
The need for intelligence systems
Big Data solutions
Implementation using Apache Hadoop