1
Rapid Cluster Computing with Apache Spark
Zohar Elkayam CTO, Brillix
Twitter: @realmgic
2
Who am I?
• Zohar Elkayam, CTO at Brillix
• Programmer, DBA, team leader, database trainer, public speaker, and a senior consultant for over 18 years
• Oracle ACE Associate
• Part of ilOUG – Israel Oracle User Group
• Blogger – www.realdbamagic.com and www.ilDBA.co.il
3
About Brillix
• We offer complete, integrated end-to-end solutions based on best-of-breed innovations in database, security and big data technologies
• We provide complete end-to-end 24x7 expert remote database services
• We offer professional customized on-site trainings, delivered by our top-notch world recognized instructors
4
Some of Our Customers
5
Agenda
• The Big Data problem and possible solutions
• Basic Spark Core
• Working with RDDs
• Working with Spark Cluster and Parallel programming
• Spark modules: Spark SQL and Spark Streaming
• Performance and Troubleshooting
6
Our Goal Today
• Know more about Big Data and Big Data solutions
• Get a taste of Spark abilities – not becoming a Spark expert
• This is a starting point – don’t be afraid to try
7
The REAL Agenda
• At the end of the seminar day a feedback form will be distributed; we would love to hear your opinion. A tablet will be raffled each day among those who fill in the feedback!
10:30-10:45 Break
12:30-13:30 Lunch for all conference attendees in Sasson Hall 2
15:00-15:15 Sweet break in the reception area
16:30 Going home
The Challenge
And a Possible Solution…
9
The Big Data Challenge
10
Volume
• Big data comes in one size: Big.
• Size is measured in Terabytes (10^12), Petabytes (10^15), Exabytes (10^18), Zettabytes (10^21)
• Storing and handling the data becomes an issue
• Producing value out of the data in a reasonable time is an issue
11
Data Grows Fast!
12
Variety
• Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio and videos
• Wide variety of rapidly evolving data types requires highly flexible stores and handling

Un-Structured        Structured
Objects              Tables
Flexible             Columns and Rows
Structure Unknown    Predefined Structure
Textual and Binary   Mostly Textual
13
Data Types By Industry
14
Velocity
• The speed at which data is being generated and collected
• Streaming data and large-volume data movement
• High velocity of data capture requires rapid ingestion
• Might cause a backlog problem
15
The Backlog Problem
• Caused when data is produced very quickly
• The time it takes to digest new data is as long as, or very close to, the time it takes for new data to be generated
• If the intake of new data is down for any reason, there is no way to catch up on the missing data, causing a backlog
16
The Internet of Things (IoT)
17
Value
Big data is not about the size of the data,
It’s about the value within the data
18
19
So, We Define the Big Data Problem…
• When the data is too big or moves too fast to handle in a sensible amount of time
• When the data doesn't fit any conventional database structure
• When we think we can still produce value from that data and want to handle it
• When the technical solution to the business need becomes part of the problem
How to do Big Data
21
22
Big Data in Practice
• Big data is big: technological framework and infrastructure solutions are needed
• Big data is complicated:
– We need developers to manage handling of the data
– We need devops to manage the clusters
– We need data analysts and data scientists to produce value
23
Possible Solutions: Scale Up
• Older solution: use a giant server with a lot of resources (scale up: more cores, faster processors, more memory) to handle the data
– Process everything on a single server with hundreds of CPU cores
– Use lots of memory (1+ TB)
– Have a huge data store on high-end storage solutions
• Data needs to be copied to the processes in real time, so it's no good for large amounts of data (Terabytes to Petabytes)
24
Another Solution: Distributed Systems
• A scale-out solution: use distributed systems – multiple machines for a single job/application
• More machines means more resources
– CPU
– Memory
– Storage
• But the solution is still complicated: infrastructure and frameworks are needed
25
Distributed Infrastructure Challenges
• We need infrastructure that is built for:
– Large-scale
– Linear scale out ability
– Data-intensive jobs that spread the problem across clusters of server nodes
• Storage: efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data
• Network infrastructure that can quickly import large data sets and then replicate it to various nodes for processing
• High-end hardware is too expensive - we need a solution that uses cheaper hardware
26
Distributed System/Framework Challenges
• How do we distribute our workload across the system?
• Programming complexity – keeping the data synced
• What to do with faults and redundancy?
• How do we handle security demands to protect highly-distributed infrastructure and data?
A Big Data Solution: Apache Hadoop
28
Apache Hadoop
• Open source project run by the Apache Foundation (2006)
• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure
• It has been the driving force behind the growth of the big data industry
• Get the public release from:
– http://hadoop.apache.org/core/
29
Original Hadoop 1.0 Components
• HDFS (Hadoop Distributed File System) – distributed file system that runs in a clustered environment
• MapReduce – programming technique for running processes over a clustered environment
• Hadoop's main idea: distribute the data to many servers, and then bring the program to the data
30
Hadoop 2.0
• Hadoop 2.0 changed the Hadoop conception and introduced a better resource management concept:
– Hadoop Common
– HDFS
– YARN
– Multiple data processing frameworks including MapReduce, Spark and others
31
HDFS is...
• A distributed file system
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures and still stay resilient
• Intended for larger files
• Designed for batch inserts and appending data (no updates)
32
Files and Blocks
• Files are split into 128MB blocks (the single unit of storage)
– Managed by NameNode and stored on DataNodes
– Transparent to users
• Replicated across machines at load time
– Same block is stored on multiple machines
– Good for fault-tolerance and access
– Default replication factor is 3
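A back-of-the-envelope sketch of the block math above. This is plain Python, not an HDFS API; `hdfs_usage` is a hypothetical helper for illustration only:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128MB block, as on the slide
REPLICATION = 3                 # default replication factor

def hdfs_usage(file_size_bytes):
    """Return (block_count, raw_bytes_stored_with_replication)."""
    blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    raw = file_size_bytes * REPLICATION  # each byte stored on 3 machines
    return blocks, raw

# A 1GB file occupies 8 blocks and ~3GB of raw cluster storage
print(hdfs_usage(1024 * 1024 * 1024))
```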
33
HDFS Node Types
• HDFS has three types of nodes
• DataNode
– Responsible for actual file storage
– Serves file data to clients
• NameNode (master node)
– Distributes files in the cluster
– Responsible for replication between the DataNodes and for file block locations
• BackupNode
– A backup of the NameNode
34
Using HDFS in Command Line
35
What Does HDFS Look Like (GUI)
36
Interfacing with HDFS
37
MapReduce is...
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• MapReduce can be written in Java, Scala, C, Python, Ruby and others
• Bring the code to the data, not the data to the code
38
MapReduce Paradigm
• Implement two functions:
• MAP – takes a large problem, divides it into sub-problems, and performs the same function on all sub-problems
Map(k1, v1) -> list(k2, v2)
• REDUCE – combines the output from all sub-problems
Reduce(k2, list(v2)) -> list(v3)
• The framework handles everything else (almost)
• Values with the same key must go to the same reducer
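The Map/shuffle/Reduce contract above can be sketched in a few lines of plain Python. `run_mapreduce` and the example data are illustrative only (a single-process stand-in, not any framework's API); the shuffle step is what guarantees that values with the same key reach the same reducer:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # MAP: each (k1, v1) record yields a list of (k2, v2) pairs
    mapped = [kv for rec in records for kv in map_fn(*rec)]
    # SHUFFLE: group values so each key's values land at one reducer
    groups = defaultdict(list)
    for k2, v2 in mapped:
        groups[k2].append(v2)
    # REDUCE: Reduce(k2, list(v2)) -> final value per key
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}

# Hypothetical example: max reading per sensor
map_fn = lambda key, line: [(line.split()[0], int(line.split()[1]))]
reduce_fn = lambda key, values: max(values)
result = run_mapreduce([(1, "TLV 30"), (2, "NYC 25"), (3, "TLV 33")],
                       map_fn, reduce_fn)
```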
39
Divide and Conquer
40
YARN
• Takes care of distributed processing and coordination
• Scheduling
– Jobs are broken down into smaller chunks called tasks
– These tasks are scheduled to run on data nodes
• Task Localization with Data
– Framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task
– Code is moved to where the data is
41
YARN (cont.)
• Error Handling
– Failures are an expected behavior so tasks are automatically re-tried on other machines
• Data Synchronization
– Shuffle and Sort barrier re-arranges and moves data between machines
– Input and output are coordinated by the framework
42
Submitting a Job
• The yarn script, with a class argument, launches a JVM and executes the provided job

$ yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob \
/user/sample/hamlet.txt \
/user/sample/wordcount/
43
Resource Manager: UI
44
Application View
45
Hadoop Main Problems
• But Hadoop (the MapReduce framework – not the MapReduce paradigm) had some problems:
– Developing MapReduce was complicated – there was more than just business logic to develop
– Transferring data between stages requires the intermediate data to be written to disk (and then read by the next step)
– Multi-step jobs needed orchestration and abstraction solutions
– Initial resource management was very painful – the MapReduce framework was based on resource slots
A Different Big Data Solution: Apache Spark
47
Introducing Apache Spark
• Apache Spark is a fast, general engine for large-scale data processing on a cluster
• Originally developed at UC Berkeley in 2009 as a research project, and is now an open source Apache top-level project
• Main idea: use the memory resources of the cluster for better performance
• It is now one of the fastest-growing projects today
48
Spark Advantages
• High-level programming framework: programmers focus on what (logic), not how
• Cluster computing
– Managed by a single master node
– Distributed to worker nodes
– Scalable and fault tolerant by the framework
• Distributed storage
– Data is distributed when stored
– Replication for efficiency and fault tolerance
– "Bring code to the data" state of mind
• High performance via in-memory utilization and caching
49
Code Complexity
50
Scalability
• Spark is highly scalable
• Adding worker nodes to the cluster increases performance in a near-linear scale
– More processing power
– More memory
• Nodes can be added and removed according to load – ideal for cloud computing (EC2)
51
Fault Tolerance
• Commodity hardware is bound to fail
• Spark is built for low-cost clusters and has fault tolerance embedded in the framework
– The system continues to function
– The master re-assigns tasks to nodes
– Data is replicated so there is no data loss
– Nodes rejoin the cluster automatically when they recover
52
Spark and Hadoop
• Spark and Hadoop are built to co-exist
• Spark can use other storage systems (S3, local disks, NFS) but works best when combined with HDFS
– Uses Hadoop InputFormats and OutputFormats
– Fully compatible with Avro and SequenceFiles, as well as other types of files
• Spark can use YARN for running jobs
53
Spark and Hadoop (cont.)
• Spark interacts with the Hadoop ecosystem:
– Flume
– Sqoop (watch out for DDoS on the database…)
– HBase
– Hive
• Spark can also interact with tools outside the Hadoop ecosystem: Kafka, NoSQL, Cassandra, XAP, Relational databases, and more
54
The Spark Stack
• In addition to the core Spark engine, there are related projects that extend Spark functionality
55
Spark Use Cases
• Spark is especially useful when working with any combination of:
– Large amounts of data
– Intensive computing
– Iterative algorithms
• Spark does well because of:
– Distributed storage
– Distributed computing
– In-memory processing and pipelining
56
Common Spark Use Cases
• ETL Processing
• Text Mining
• Index Building
• Graph Creation and analysis
• Pattern recognition
• Fraud detection
• Collaborative filtering
• Stream processing
• Prediction models
• Sentiment analysis
• Risk assessment
• Machine learning
57
Examples for Common Use Cases
• Risk analysis
– How likely is this borrower to pay back a loan?
• Recommendation
– Which products will this customer enjoy?
• Predictions
– How can we prevent service outage?
• Classification
– How can we tell which mail is spam and which is not?
58
Spark 2.0
59
Spark 2.0 Major Changes
• Major performance improvements
• Unifying DataFrames and Datasets for Scala/Java
• Changes to extensions:
– Multiple changes to MLlib and Machine Learning
– Improvements to Spark Streaming [ALPHA]
– Spark SQL supports ANSI SQL 2003
– R UDFs
• Over 2000 bugs fixed
• Current version: 2.0.2 (released Nov 14, 2016)
Basic Spark
Spark Core
61
What is Apache Spark
• Apache Spark is a fast, general engine for large-scale data processing
• Written in Scala
• Spark Shell
– Interactive interface for learning, testing, and data exploration
– Scala and Python shells available
– Spark on R using RStudio and SparkR
• Spark Application
– Framework for running large-scale processes
– Supports Scala, Python, and Java
62
Starting the Shells

$ pyspark
Welcome to Spark version 1.6.0
Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

$ spark-shell
Welcome to Spark version 1.6.0
Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Spark context available as sc. SQL context available as sqlContext.
scala>
63
Spark Context
• Every Spark application requires a Spark Context
– The main entry point to the Spark API
• Spark shells provide a preconfigured Spark Context called sc
• Spark applications need to create their own Spark Context instance
• The Spark Context is where the "magic" happens
64
RDD: Resilient Distributed Datasets
• The basic component is the RDD
– Resilient – if data is lost, it can be recreated from previous steps
– Distributed – appears as a single collection, but is actually distributed across nodes
– Dataset – initial data can come from a file or be created programmatically
• RDDs are the fundamental unit of data in Spark
• Most Spark programs consist of performing operations on RDDs
65
How to Create an RDD
• We can create an RDD in 3 ways:
– Create the RDD from a file, set of files, or directory
– Create RDD from data already in memory
– Create RDD by manipulating another RDD
• Later, we will talk about creating RDDs from different data sources…
66
Creating RDD from Files
• Creating an RDD from files:
– Use SparkContext.textFile – it can read a single file, a comma-delimited list, or wildcards
– Each line in the file is a separate record in the RDD
• Files are referenced by absolute or relative URI
– Absolute: file:/home/myfile.txt
– Relative (uses the default file system): myfile.txt

sc.textFile("myfile.txt")
sc.textFile("mydata/*.txt")
sc.textFile("myfile1.txt,myfile2.txt")
67
Example: Creating RDD From Files (Scala)

scala> val mydata = sc.textFile("file:/home/spark/derby.log")
16/06/12 13:15:39 INFO SparkContext: Created broadcast 3 from textFile at <console>:27
mydata: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:27

scala> mydata.count()
[..]
16/06/12 13:15:41 INFO DAGScheduler: Job 0 finished: count at <console>:30, took 0.489132 s
res3: Long = 13

scala> val mydata = sc.textFile("hdfs:/tmp/eventlog-demo.log")
[..]
mydata: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:27

scala> mydata.count()
16/06/12 13:16:41 INFO DAGScheduler: Job 1 finished: count at <console>:30, took 1.515042 s
res4: Long = 451210
68
RDD Operations
• There are two types of operations:
– Actions – return values
– Transformations – define a new RDD based on the current one(s)
• RDD have a lazy execution model
– Transformations set things up
– Actions cause calculations to actually be performed
69
RDD Operations: Actions
• Common actions:
– take(n) – return an array of the first n elements
– collect() – return an array of all elements
– saveAsTextFile(file) – save the RDD to a text file
– count() – return the number of elements in the RDD

scala> mydata.take(2)
res6: Array[String] = Array(2016-06-08T16:48:49|121.170.77.248|FR|SUCCESS, 2016-06-08T16:48:49|142.13.127.131|FR|SUCCESS)

scala> for (line <- mydata.take(2)) println(line)
2016-06-08T16:48:49|121.170.77.248|FR|SUCCESS
2016-06-08T16:48:49|142.13.127.131|FR|SUCCESS

scala> mydata.count()
res4: Long = 451210
70
RDD Operations: Transformations
• Transformations define a new RDD based on the current one
• RDDs are immutable
– Data in an RDD cannot change
– Transform in sequence to modify as needed
• Operations can be chained (piped) for multiple operations
• Common transformations:
– map(function) – creates a new RDD by performing a function on each record in the base RDD
– filter(function) – creates a new RDD by including/excluding records of the base RDD according to the Boolean function it receives
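The lazy execution model described above can be imitated with plain Python generators. This is a pure-Python sketch, not Spark: `lazy_map`/`lazy_filter` are illustrative stand-ins showing that transformations define work without running it, and only the final "action" (consuming the iterator) triggers computation:

```python
def lazy_map(fn, source):
    for item in source:            # body runs only when consumed
        yield fn(item)

def lazy_filter(pred, source):
    for item in source:
        if pred(item):
            yield item

trace = []
def traced(x):
    trace.append(x)                # record when work actually happens
    return x * 10

# "Transformations": nothing executes yet
pipeline = lazy_map(traced, lazy_filter(lambda x: x % 2 == 0, range(6)))
assert trace == []                 # still lazy

# "Action": forces the whole chain to run
result = list(pipeline)
```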
What is Functional Programming?
72
Functional Programming
• Spark depends on the concept of functional programming
– Functions are the fundamental unit of programming
– Functions have input and output only – there is no state or side effects
• Key concepts
– Passing functions as input to other functions
– Anonymous functions
73
Passing Functions as Parameters
• Some of the RDD operations take functions as parameters
• The received function is applied to all the records in the RDD
• For example:
– map gets a function as a parameter. That function will convert each record in the RDD to a key and value tuple (AKA Pair RDD)
– filter gets a function which checks each record in the RDD and returns a Boolean value for filtering the data
74
Defining Functions and Passing Them
• Python example:

> def toUpper(s):
    return s.upper()
> mydata.map(toUpper).take(3)

• Scala example:

> def toUpper(s: String): String = { s.toUpperCase }
> mydata.map(toUpper).take(2)
75
Anonymous Functions
• Scala, Python, R, and Java can declare anonymous one-time functions
• These are in-line functions without a name, often used for one-off operations
• Spark doesn't require the use of anonymous functions
• Examples:
– Python: lambda x: …
– Scala: x => …
– Java 8: x -> …
76
Using Anonymous Functions
• Python:
> mydata.map(lambda line: line.upper()).take(3)
• Scala:
> mydata.map(line => line.toUpperCase()).take(2)
• Scala, using "_" as the anonymous parameter:
> mydata.map(_.toUpperCase()).take(2)
Working with RDDs
78
RDD Data Types
• An RDD can hold any type of element
– Primitive: integers, chars, Booleans, etc.
– Sequences: strings, lists, tuples, dictionaries, arrays, and all kinds of nested data types
– Scala and Java serialized types
– Mixed types
• Some RDD types have additional functionality– Pair RDDs consist of Key-Value pairs
– Double RDDs consist of numeric data
79
Generating RDDs From Collections
• We can create RDDs from collections rather than files. Common uses: testing, integration, and cases where we need to generate data programmatically
• Example:

> randomnumlist = [random.uniform(0, 10) for _ in xrange(10000)]
> randomrdd = sc.parallelize(randomnumlist)
> print "Mean is %f" % randomrdd.mean()
80
Common RDD Operations (1)
• Common transformations:
– flatMap – maps an element to 0 or more output elements
– distinct – returns a new dataset that contains the unique elements of the original RDD
• Other RDD operations:
– first – return the first element in the dataset
– foreach – apply a function to each element in an RDD
– top(n) – return the n largest elements using natural ordering
81
Common RDD Operations (2)
• Sampling operations
– sample(percent) – create a new RDD with a sampling of elements
– takeSample(percent) – return an array of sampled elements (with or without replacement)
• Double RDD operations
– Statistical functions: mean, sum, stdev etc.
82
Using flatMap and distinct

> sc.textFile(file) \
.flatMap(lambda line: line.split()) \
.distinct()

> sc.textFile(file).flatMap(line => line.split("\\W")).distinct
Input:
I see the world
and the world see me

After flatMap: I, see, the, world, and, the, world, see, me
After distinct: I, see, the, world, and, me
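The same flow can be simulated in plain Python (a sketch of the semantics, not Spark's API; note that Spark's distinct does not guarantee the order shown here):

```python
lines = ["I see the world", "and the world see me"]

# flatMap: one line maps to many words, flattened into one list
words = [w for line in lines for w in line.split()]

# distinct: keep unique elements (dict.fromkeys preserves first-seen order)
unique = list(dict.fromkeys(words))
```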
83
Pair RDD
• Pair RDDs are a special form of RDD
– Each element must be a key-value pair (tuple)
– Keys and values can be any type
• Use Pair RDDs when using MapReduce algorithms
• Many additional functions for common data processing needs: sorting, grouping, joining, etc.
84
Simple Pair RDD
• Create a Pair RDD from a comma-delimited file (CSV)

> users = sc.textFile("file:/users.csv") \
.map(lambda line: line.split(",")) \
.map(lambda fields: (fields[0], fields[1]))

> val users = sc.textFile("file:/users.csv")
.map(line => line.split(','))
.map(fields => (fields(0), fields(1)))
user001,Zohar Elkayam
user002,Efrat
user009,Tamar
user100,Ido

(user001, Zohar Elkayam)
(user002, Efrat)
(user009, Tamar)
(user100, Ido)
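The two map steps on this slide can be mimicked with plain Python lists (an illustrative sketch of the semantics, not PySpark itself):

```python
csv_lines = ["user001,Zohar Elkayam", "user002,Efrat",
             "user009,Tamar", "user100,Ido"]

# first "map": split each CSV line into fields
fields = [line.split(",") for line in csv_lines]

# second "map": build (key, value) tuples -> the "Pair RDD"
pairs = [(f[0], f[1]) for f in fields]
```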
85
Creating Pair RDDs
• Commonly used functions that create Pair RDDs:
– map
– flatMap
– flatMapValues
– keyBy
• Deciding what the key is and what the value is matters: the first step in most workflows is to convert the base RDD to a Pair RDD
86
Complex Values
• When creating key-value pairs, the value can be a complex value

> users = sc.textFile("file:/users.csv") \
.map(lambda line: line.split(",")) \
.map(lambda fields: (fields[0], (fields[1], fields[2])))
user001,Zohar,Elkayam
user002,Efrat,Elkayam
user009,Tamar,Fritzi
user100,Ido,Bob

(user001, (Zohar, Elkayam))
(user002, (Efrat, Elkayam))
(user009, (Tamar, Fritzi))
(user100, (Ido, Bob))
87
Using flatMapValues
• flatMapValues converts multiple values having the same key into separate key-value pairs

Input file:
0001 a1:b1:c1
0002 a2:b2
0003 c3

> users = sc.textFile("file") \
.map(lambda line: line.split(" ")) \
.map(lambda fields: (fields[0], fields[1])) \
.flatMapValues(lambda val: val.split(":"))

After the maps: (0001, a1:b1:c1), (0002, a2:b2), (0003, c3)
After flatMapValues: (0001, a1), (0001, b1), (0001, c1), (0002, a2), (0002, b2), (0003, c3)
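A plain-Python sketch of what flatMapValues does (illustrative only, not the Spark API): the key is repeated for every value produced from its original value.

```python
pairs = [("0001", "a1:b1:c1"), ("0002", "a2:b2"), ("0003", "c3")]

# flatMapValues: split each value, keep the key for every piece
flat = [(k, v) for k, vals in pairs for v in vals.split(":")]
```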
88
MapReduce: Reminder
• MapReduce is a common programming paradigm
• MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
• Hadoop MapReduce was the first major framework implementation, but it had some limitations:
– Each job can have only one Map and one Reduce
– Job output and intermediate data must be saved to files
• Spark implements the MapReduce model with greater flexibility:
– Map and Reduce functions can be interspersed
– Results are stored in memory (or spilled to disk if there is not enough memory)
– Operations can easily be chained
89
MapReduce in Spark
• MapReduce in Spark works on Pair RDDs
• Map phase:
– Operates on one record at a time
– "Maps" each record to one or more new records
– Use map and flatMap for the mapping phase
• Reduce phase:
– Works on the Map output
– Consolidates multiple records
– reduceByKey operation
90
Word Count using Spark
• The famous word count example, made very easy using Spark:

Input: the cat sat on the mat

> count = sc.textFile("file") \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda v1, v2: v1 + v2)

After flatMap: the, cat, sat, on, the, mat
After map: (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1)
After reduceByKey: (the, 2), (cat, 1), (sat, 1), (on, 1), (mat, 1)
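The same three-step pipeline, sketched in plain Python to show what each Spark stage contributes (a single-process illustration, not PySpark):

```python
from collections import Counter

lines = ["the cat sat on the mat"]

words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                        # map to (word, 1)

counts = Counter()
for word, one in pairs:                                # reduceByKey: v1 + v2
    counts[word] += one
```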
91
Why Is It Always Word Count?!
• Word count is an easy way to explain many things:
– File handling
– Breaking lines into key-value pairs
– Reducing by key and handling the values
• Calculated statistics are often simple aggregate functions, just like in the word count example
• Many common tasks are very similar to word count – log file analysis, for example
92
ReduceByKey
• The reduceByKey function behaves like the reduce phase in the original MapReduce model
• The function reduceByKey receives must be binary – it combines two values of the same key
• In order to work properly, the function must be
– Commutative (x+y = y+x)
– Associative ((x+y)+z = x+(y+z))
• All keys are being handled together (piped)
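Why commutativity and associativity matter: Spark is free to combine partial results per partition in any order, so any grouping must give the same answer. A plain-Python sketch (the "partitions" here are just slices for illustration):

```python
from functools import reduce

values = [3, 1, 4, 1, 5]
op = lambda a, b: a + b          # commutative and associative

# reducing everything in order...
sequential = reduce(op, values)

# ...must equal reducing per "partition" and then merging the partials
partition1, partition2 = values[:2], values[2:]
combined = op(reduce(op, partition1), reduce(op, partition2))
```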
93
Other Pair RDD Operations
• Pair RDDs can do other things besides reduce:
– countByKey – returns a map with the count of occurrences per key
– groupByKey – groups all values for each key in an RDD
– sortByKey – sorts in ascending/descending order by key
– join – returns an RDD containing all pairs with matching keys from two Pair RDDs
94
Pair RDD Operations Examples

> grpUsers = users.groupByKey()
> sortUsers = users.sortByKey(ascending=False)

Input: (0001, a1), (0001, b1), (0001, c1), (0002, a2), (0002, b2), (0003, c3)
After sortByKey(ascending=False): (0003, c3), (0002, b2), (0002, a2), (0001, c1), (0001, b1), (0001, a1)
95
Joining RDDs
• Using joins is a common programming pattern:
– Map separate datasets into key-value Pair RDDs
– Join by key (make sure the keys have the same data type and structure)
– Map the joined data into the desired format
– Save, display, or continue processing the new RDD
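The join-by-key step can be sketched in plain Python with hypothetical data (illustrative only; Spark's join matches every pair of values sharing a key across two Pair RDDs):

```python
users = {"u1": "Zohar", "u2": "Efrat"}                  # keyed dataset 1
orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp")]  # keyed dataset 2

# inner join by key: each order is matched with the user sharing its key
joined = [(k, (users[k], item)) for k, item in orders if k in users]
```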
96
Other Pair RDD Operations (cont.)
• Other pair operations:
– keys – return an RDD of just the keys (no values)
– values – return an RDD of just the values
– lookup(key) – return the value for a specific key
– leftOuterJoin, rightOuterJoin – left and right outer joins
– mapValues, flatMapValues – execute function on just the values, keeping the key the same
• See PairRDDFunctions class for a full (long) list of functions
Working with Clusters
Spark Clusters and Resource Management
98
Spark Clusters
• Spark is designed to run on a cluster
– Spark includes a basic cluster manager called "Spark Standalone"
– Can also run on Hadoop and Mesos
• Jobs are broken down into tasks and sent to worker nodes
– Each worker node runs executors in standalone JVMs
• Spark cluster workers can work closely with HDFS
99
Spark Cluster Options
• Locally (No distributed processing)
• Locally with multiple Worker threads
• On an actual cluster, resources managed by:
– Spark Standalone
– Apache Hadoop YARN (Yet Another Resource Negotiator)
– Apache Mesos
100
Spark Cluster Terms
• Cluster is a group of computers working together
– i.e. Spark Standalone Cluster, Hadoop (YARN), Mesos
• Node is an individual computer in the cluster
– The master node manages the distributed work
– Worker nodes do the actual work
• Daemon is a program running on a node
– Each performs a different function in the cluster
101
Spark Standalone Cluster Daemons
• Spark Master (cluster manager)
– One per cluster
– Manages applications, distributes individual tasks to Spark Workers
• Spark Worker
– One per node
– Starts and monitors Executors for applications
– Spark Workers can run on Hadoop DataNodes – for reading data from HDFS efficiently
102
Spark Driver
• The Spark driver is the main program
• Runs in a Spark Shell or as a Spark Application
• The driver creates the Spark Context for the run
• Communicates with the Cluster Manager to distribute the work between workers
103
Driver Modes: Client vs. Cluster
• Driver runs outside the cluster by default
– Called "client" deploy mode
– Most common
– Required for interactive use
• We can run the driver from one of the worker nodes in a cluster
– Called “cluster” deploy mode
– Doesn’t require interaction with cluster’s nodes
104
Running a Cluster Application
109
Supported Cluster Resource Managers
• Spark Standalone (EC2 or private)
– Included with Spark
– Easy to install and run
– Limited configurability and scalability
– Useful for testing, development, or smaller systems
• Hadoop YARN
– Requires a Hadoop Cluster
– Common for production sites
– Allows sharing cluster resources with other applications (MapReduce, Hive, etc.)
• Apache Mesos
– Original platform for Spark
– Less common
110
Setting sc.master
• Using the --master parameter, we set the SparkContext.master parameter in Spark shells
• The Spark shell can select different cluster masters on the command line:
– URL – the URL of the cluster manager (Spark Standalone Master or Mesos master)
– local[*] – runs locally with as many threads as cores (default)
– local[n] – runs locally with n worker threads
– local – does not use distributed processing
– yarn – use YARN as the cluster manager
• Examples:
$ pyspark --master spark://sparkmasternode:7077
$ spark-shell --master yarn
111
UI Management
• The Spark standalone master provides a UI interface for monitoring and history
• The UI runs by default on port 18080
112
Spark Job Details UI
113
Spark Job Details Timeline
114
Stage Breakdown
Parallel Programming with Spark
116
Datasets in the Cluster
• Resilient Distributed Datasets
– Data is partitioned across worker nodes
• Partitioning is done by the Spark framework – no action is needed by the programmer
• We can control the number of partitions
117
Working With Partitioned Files
• Partitioning of a single file is based on the size of the file – the default is 2 partitions
• We can control the number of partitions (optional):
sc.textFile("myfile.txt", 4)
• The more partitions we have, the more parallel the program is
118
Working With Multiple Files
• When working with sc.textFile, we can pass a wildcard or a directory
– Each file will become at least one partition
– Operations can be done per file (JSON and XML parsing)
• We can automatically create a Pair RDD by using sc.wholeTextFiles("mydir")
– Useful for many small files
– Key = file name
– Value = file contents

> sc.textFile("mydir/*")
119
Running Operations on Partitions
• Most operations work on single elements
• Some operations can be run at the partition level:
– foreachPartition – call a function for each partition
– mapPartitions – create a new RDD by executing a function on each partition in the RDD (transformation)
– mapPartitionsWithIndex – same as mapPartitions but includes the partition index (transformation)
• Commonly used for initializations
• Functions in partition operators get iterators as arguments to iterate through the elements
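A plain-Python sketch of the per-partition pattern (illustrative only; in Spark, mapPartitions passes your function an iterator per partition, so per-partition setup runs once per partition instead of once per element):

```python
def map_partitions(fn, partitions):
    # fn receives an iterator per partition and returns an iterator
    return [list(fn(iter(part))) for part in partitions]

def add_header(it):
    # per-partition initialization: happens once per partition
    yield "header"
    for x in it:
        yield x

parts = [[1, 2], [3, 4]]       # two toy "partitions"
out = map_partitions(add_header, parts)
```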
120
HDFS and Data Locality
• By default, Spark partitions file-based RDDs by block; each block is loaded into a single partition
• An action triggers execution: tasks on executors load data from blocks into partitions
• When using HDFS, workers run near their respective blocks
• The collect operation copies the data from the workers back to the driver (no locality here)
121
Parallel Operations
• RDD operations run in parallel on partitions
• Some operations preserve partitioning:
– map, flatMap, filter
• Some operations require repartitioning:
– reduce, sort, group
• Repartitioning requires data to move between workers – hurting performance
– Try to reduce data movement by running better sequences up to the reshuffle stages
122
Execution Terminology
• Job – a set of tasks executed as a result of an action
• Stage – a set of tasks in a job that can be executed in parallel
• Task – an individual unit of work sent to one executor
123
From Job to Stages
• Spark calculates a Directed Acyclic Graph (DAG) of RDD dependencies
• Narrow operations:
– Only one child depends on the RDD
– No shuffle required between nodes
– Can be collected into a single stage
– e.g., map, filter, union
• Wide operations:
– Multiple children depend on the RDD
– Define a new stage
– e.g., reduceByKey, join, groupByKey
124
Controlling the Level of Parallelism
• Wide operations partition result RDDs
– More partitions = more parallel tasks
– The cluster will be under-utilized if there are too few partitions
• We can control the number of partitions:
– Setting a default property (spark.default.parallelism)
– Setting an optional parameter in the function call:

> users = sc.textFile("file") \
.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda v1, v2: v1 + v2, 10)
125
Spark Lineage
• Each transformation operation creates a new child RDD
• Spark keeps track of the parent RDD for each new RDD
• Child RDDs depend on their parents
• Action operations execute the parent transformations
• Each action re-executes the lineage transformations starting with the base RDD
126
Caching
• Since running all the transformations from the base RDD can be expensive, RDDs can be cached
• Caching an RDD means saving it to memory to reduce the dependency link length
• Caching is a suggestion to Spark
– If not enough memory is available, transformations will be re-executed when needed
– Cache will never spill to disk – it's in memory only
127
Caching and Fault-Tolerance• Resilient Distributed Datasets
– Resiliency is a product of tracking lineage

• RDDs can always be recomputed from their base if needed

• In case a worker fails, the task can be re-run on a different node, recalculating data from the base RDD using the same partition
128
Persistence Levels• Cache stores data in-memory only
• The persist method offers other options called Storage Levels
• Storage location – where to store the data– MEMORY_ONLY – same as cache
– MEMORY_AND_DISK – stores partition in memory, use disk if not enough memory
– MEMORY_ONLY_SER – stores partition as serialized java object
– DISK_ONLY – store partition on disk
• Replication – store partition on two cluster nodes– MEMORY_ONLY_2, MEMORY_AND_DISK_2, DISK_ONLY_2
129
Persistence Options• To stop persisting and remove from memory
and/or disk– rdd.unpersist()
• To change the persistence level we need to unpersist and re-persist the data at the new level
130
When and Where to Cache• When should we cache a dataset
– When dataset is likely to be re-used (machine learning, iterative algorithms, etc.)
– When calculation is long and we don’t want to lose steps in case of failure
• How to choose a persistence level– Memory only – whenever possible; use serialized objects to
reduce memory usage if possible– Disk – choose when recomputation is more expensive than
disk read (filtering large datasets, expensive functions)– Replication – choose when fast recovery is worth the
extra memory
131
Checkpointing• Maintaining RDD lineage provides resilience, but can
also cause problems when the lineage gets very long
• Recovery can be very expensive
• Potential stack overflow
• Solution: Checkpointing, saving the data to HDFS (reliable) or to local disk (local)
– HDFS provides fault-tolerant storage across nodes

– Lineage is not saved
– Must be checkpointed before any actions on the RDD
Spark Modules
Spark SQL and Spark Streaming
133
The Spark Stack• In addition to the core Spark engine, there are
some related projects to extend Spark functionality
134
Spark SQL• Spark SQL is a Spark module for structured data
processing• Spark SQL provides more information about the
structure of both the data and the computation being performed for more optimization
• Supports basic SQL and HiveQL• Spark SQL can also act as a distributed query engine
using its JDBC/ODBC or command-line interface• For more information:
http://spark.apache.org/docs/latest/sql-programming-guide.html
135
Dataframes, Datasets and RDDs• A DataFrame is a distributed collection of data
organized into named columns– It is conceptually equivalent to a table in a relational
database or a data frame in R/Python, but with richer optimizations under the hood
• A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine– A Dataset can be constructed from JVM objects and then
manipulated using functional transformations
136
Spark SQL Context• Spark SQL entry point is Spark SQL Context:
Python:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Scala:
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

R:
sqlContext <- sparkRSQL.init(sc)
137
Running a Query
Python:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT * FROM table")

Scala:
val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")

R:
sqlContext <- sparkRSQL.init(sc)
df <- sql(sqlContext, "SELECT * FROM table")
138
Connecting Oracle and Spark SQL• Connecting Spark SQL and Oracle is easy when
using JDBC:
• More about it (and a demo): https://www.realdbamagic.com/spark-sql-and-oracle-database-integration/
scala> val employees = sqlContext.load("jdbc", Map(
         "url" -> "jdbc:oracle:thin:zohar/zohar@//localhost:1521/single",
         "dbtable" -> "hr.employees"))
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
employees: org.apache.spark.sql.DataFrame = [EMPLOYEE_ID: decimal(6,0),
  FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string,
  HIRE_DATE: timestamp, JOB_ID: string, SALARY: decimal(8,2),
  COMMISSION_PCT: decimal(2,2), MANAGER_ID: decimal(6,0), DEPARTMENT_ID: decimal(4,0)]
140
Agenda• What is stream processing?
• Principles
• Stream processing with Spark
• Demo
• How does it compare?
141
Spark Streaming• Spark Streaming is an extension of the core Spark
API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
142
What is Stream Data Processing?• Stream – a constant flow of data events
• Data (event) – any occurrence that happens at a clearly defined time and is recorded in a collection of fields
• Processing – the act of analyzing data
143
When Do You Use Stream Processing?• Wherever you have a continuous stream of data
• This data needs to be processed quickly so that the business can react
• Examples include trading, fraud detection, spam filtering, and many more
144
Data Delivery Methods• At-most-once – possibility for data loss.
• At-least-once – messages may be redelivered.
• Exactly-once – each message is only delivered once.
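As a rough illustration (plain Python, not any Spark API): exactly-once *effects* are often built on top of an at-least-once channel by deduplicating on a message ID, so redelivered messages are skipped rather than processed twice.

```python
def process_stream(messages):
    """Process (msg_id, payload) pairs from an at-least-once channel,
    applying each effect exactly once by tracking seen message IDs."""
    seen, results = set(), []
    for msg_id, payload in messages:
        if msg_id in seen:
            continue                     # redelivered duplicate: skip
        seen.add(msg_id)
        results.append(payload.upper())  # the (idempotent) "effect"
    return results

# Message 1 is redelivered by the channel, but is only processed once
deliveries = [(1, "a"), (2, "b"), (1, "a"), (3, "c")]
```

This is only one strategy; real systems may instead rely on idempotent writes or transactional sinks to get the same guarantee.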
145
Keep The Data Moving• Data events are processed in the stream (in
memory)
• Storage operations add unnecessary latency
• Data will be stored on disk at the end of the stream
146
Window Consideration• Windowing: grouping of events based on time
• Windowing can also be data-driven
• Out of order events make windowing tricky
• There are different types of windowing including fixed, sliding and count windows
147
Window Types (diagram): fixed windows split the input events into consecutive, non-overlapping time intervals, so each event belongs to exactly one window; sliding windows overlap, so the same event can fall into more than one window.
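The two window types can be sketched in plain Python (illustrative only, not a streaming API): a fixed window assigns each timestamped event to exactly one bucket, while sliding windows overlap.

```python
def fixed_windows(events, size):
    """events: list of (timestamp, value); each event lands in one window."""
    windows = {}
    for ts, value in events:
        windows.setdefault(ts // size * size, []).append(value)
    return windows

def sliding_windows(events, size, slide):
    """Each window covers [start, start + size); events may repeat."""
    last_ts = max(ts for ts, _ in events)
    result, start = {}, 0
    while start <= last_ts:
        result[start] = [v for ts, v in events if start <= ts < start + size]
        start += slide
    return result

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
```

Note how event "b" (timestamp 1) appears in two sliding windows but in only one fixed window.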
Stream Processing With Spark
149
Spark Streaming • An extension of the core Spark API
• Fault tolerant
• Built in support for merging streaming data with historical data
• Supports Scala, Python and Java
150
Spark Streaming - Metrics• Exactly once delivery
• Provides stateful state management
• Groups events into micro batches
• Latency depends on the configuration of the DStream microbatch interval
151
The DStream• An abstraction provided by Spark
• Represents a continuous stream of data
• Internally treated as a sequence of RDDs
• Each RDD contains the data received during the last batch interval (the last X seconds)
152
How do we use it?• Create a StreamingContext
• Define the input sources
• Apply transformations and output operations to DStreams
• Issue streamingContext.start() to start receiving data
• Wait for the processing to be stopped using streamingContext.awaitTermination()
153
DStreams and Receivers• Every input DStream is associated with a receiver
object
• The receiver receives the data and stores it in memory for processing
• If you receive multiple input DStreams, multiple receivers will be created
• Remember to allocate Spark enough cores to process the data as well as to run the receivers
154
Scala Example

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words, pairs, and count
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
155
Python Example

from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)

    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
Spark Streaming Demo
157
Source Types• Spark Streaming provides two built in types of
streaming sources:
– Basic sources: such as file systems and socket connections
– Advanced sources: such as Kafka, Flume, Kinesis, etc.
158
Basic Sources - Files• The StreamingContext API provides several
methods for creating input DStreams from files
• File Streams: monitors a given directory and processes any files created in it– Files must have the same format

– Once moved into the directory the files cannot be changed
• There is an easier method for simple text files which doesn’t require a receiver
159
Advanced Sources• Requires interfacing with non-Spark libraries
• Functionality to create DStreams from advanced sources has moved to separate libraries
• This is done to prevent version conflict issues
• Libraries need to be explicitly linked when needed
• There is also the ability to create a custom source using a user defined receiver
160
Sliding Window Operations• Spark Streaming allows for windowed computations
• Used to apply transformations on a sliding window of data
• Every time the window slides, the source RDDs that fall within the window are combined to produce the RDDs of the windowed DStream
• A window operation needs these two parameters
– Window length: the duration of the window
– Sliding interval: the interval in which the window slides
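These two parameters can be sketched in plain Python (an illustrative simulation of micro-batches, not the Spark API): with a window length of 3 batches and a sliding interval of 1 batch, each result combines the last 3 micro-batches and is recomputed every batch.

```python
def windowed_counts(batches, window_length):
    """batches: a list of micro-batches (lists of words).
    Returns, for each batch, the word counts over the trailing window."""
    results = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_length + 1): i + 1]
        counts = {}
        for batch in window:
            for word in batch:
                counts[word] = counts.get(word, 0) + 1
        results.append(counts)
    return results

# Four 1-second micro-batches; window length = 3 batches, slide = 1 batch
batches = [["a"], ["a", "b"], ["b"], ["c"]]
```

In real Spark Streaming both parameters are durations and must be multiples of the DStream's batch interval.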
161
Checkpointing• A streaming application usually runs 24/7
• Therefore it has to survive numerous types of failures outside the application logic
• Spark streaming checkpoints data to a fault tolerant storage system to recover if and when any failure occurs
• Two types of data are checkpointed for that purpose
162
Checkpointed Data Types• Metadata: the information defining the stream
computation
– The configuration used to create the streaming application
– The DStream operations that define the streaming application
– Batches whose jobs are queued but not yet completed
• Data itself: the RDDs containing the data
– Necessary in transformations that combine data from multiple batches
163
How To Checkpoint• Set up a directory in a fault tolerant file system
• Use StreamingContext.checkpoint(dir) to enable data checkpointing
• The application will recover from driver failures (metadata checkpointing) if it does the following:– On the first run, create the StreamingContext, set up the DStreams,
then call start()

– On restart, recreate the StreamingContext from the checkpoint directory
164
Monitoring And Tuning Streaming Applications• Statistics about a running streaming application
are available through the Spark web UI
• There are two main metrics to monitor and tune:
– Processing Time: the time to process each batch of data (lower is better)
– Scheduling Delay: are my batches being processed as fast as they are arriving (they should be)
165
Reducing Batch Processing Time• Divide your input DStream into several DStreams
• Increase parallelism of data processing if you feel your resources are under-utilized
• Reduce serialization overhead by tuning the serialization format
166
Setting The Right Batch Interval• Batch processing time should be less than the batch
interval time
• First start with a conservative interval (a few seconds)
• Make sure batch processing time is below batch interval
• If so you may increase data rate or lower batch interval
• Consistently monitor the batch processing time
167
Other Streaming Frameworks• Apache Storm
• Samza
• Apache Flink
168
How Does It Compare?
• Delivery semantics – Storm: at least once (exactly-once with Trident); Samza: at least once; Flink: exactly once; Spark: exactly once
• State management – Storm: stateless (roll your own or use Trident); Samza: stateful (embedded key-value store); Flink: stateful (periodically writes state without interrupting); Spark: stateful (writes state to storage)
• Latency – Storm: sub-second; Samza: sub-second; Flink: sub-second; Spark: seconds (depending on batch size)
• Language support – Storm: any JVM language, Ruby, Python, JavaScript, Perl; Samza: Scala, Java; Flink: Scala, Java; Spark: Scala, Java, Python
169
For More Info• For more information:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Applications
171
Spark Shell vs. Spark Applications• The Spark Shell allows interactive exploration and
manipulation of data (REPL – read, evaluate, print, loop)
• Spark applications run as independent programs
– Python, Scala, R with SparkR package, or Java
– Common uses: ETL processing, Streaming, and more
172
SparkContext• Every Spark program needs a SparkContext
– The interactive shell creates SC for us
– When creating our own application, we need to create the context ourselves
– A common convention is to name the context sc
173
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

    // get threshold
    val threshold = args(1).toInt

    // read in text file and split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // filter out words with fewer than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)

    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)

    System.out.println(charCounts.collect().mkString(", "))
  }
}
174
import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Spark Count")
    sc = SparkContext(conf=conf)

    # get threshold
    threshold = int(sys.argv[2])

    # read in text file and split each document into words
    tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))

    # count the occurrence of each word
    wordCounts = tokenized.map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    # filter out words with fewer than threshold occurrences
    filtered = wordCounts.filter(lambda pair: pair[1] >= threshold)

    # count characters
    charCounts = filtered.flatMap(lambda pair: pair[0]).map(lambda c: (c, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    list = charCounts.collect()
    print repr(list)[1:-1]
175
library(SparkR)

args <- commandArgs(trailing = TRUE)

if (length(args) != 2) {
  print("Usage: wordcount <master> <file>")
  q("no")
}

# Initialize Spark context
sc <- sparkR.init(args[[1]], "RwordCount")
lines <- textFile(sc, args[[2]])

words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words, function(word) { list(word, 1L) })

counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)

for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}
176
Building a Spark Application: Scala or Java• Scala or Java applications must be compiled and
assembled into JAR files.
– The JAR file will be passed (uploaded) to worker nodes
• Most developers use Apache Maven or SBT to build
– See http://spark.apache.org/docs/latest/building-spark.html for more details about building an application
• Build details will differ depending on Hadoop version, deployment platform, and other factors
177
Running a Spark Application• The easiest way to run a Spark application is by using spark-
submit– Python
– Scala and Java
• Spark-submit options:– --master (local, local[*], yarn, etc.)– --deploy-mode (client or cluster)– --name – application name for UI– --conf – configuration changes from default or settings– … more …

$ spark-submit WordCount.py fileURL

$ spark-submit --class WordCount myJarFile.jar fileURL
178
Runtime Configuration Options• Spark-submit can accept a properties file with
settings, instead of parameters– Tab or space-delimited list of properties
– Load with spark-submit --properties-file filename
– Example:
• Site defaults properties file– $SPARK_HOME/conf/spark-defaults.conf

spark.master     spark://masternode:7077
spark.local.dir  /tmp/spark
spark.ui.port    28080
179
Setting Configuration at Runtime• Spark allow changing the configuration when
creating the SparkContext
• Configure the parameters with the SparkConf object
• Some functions– setAppName(name)
– setMaster(master)
– set(property-name, value)
• Set functions return a SparkConf object to support chaining
180
Logging
• Spark uses Apache log4j for logging
• You can configure it by adding a log4j.properties file in the $SPARK_HOME/conf directory
• Log file locations depends on the cluster management platform– Spark daemons: /var/log/spark
– Individual tasks: $SPARK_HOME/work on each worker node
– YARN has a log aggregator for log files from workers
Spark Performance and Troubleshooting
Common use cases, problems and solutions
182
Broadcast Variables• Broadcast variables are set by the driver and retrieved
by workers
• They are read-only once set
• The first read of a Broadcast variable retrieves its value and caches it on the node
183
Why Use Broadcast Variables• Use to minimize transfer of data over the network,
which is usually the biggest bottleneck
• Spark Broadcast variables are distributed to worker nodes using a very efficient peer-to-peer algorithm

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res5: Array[Int] = Array(1, 2, 3)
184
Accumulators• Accumulators are shared variables
– Worker nodes can add to the value– Only the driver application can access the value
• Default accumulator is of type int or double, but we can create custom types when needed (extend class AccumulatorParam)
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
16/06/18 14:34:40 INFO DAGScheduler: Job 5 finished: foreach at <console>:30, took 0.111459 s

scala> accum.value
res6: Int = 10
185
Accumulators (cont.)
• Accumulators will only increment once per task

– If a task must be re-run due to failure, Spark will only count the attempt that succeeded
• Only the driver can access the value
– Code will throw an exception if we use .value on a worker
• Supports the increment (+=) operator
186
Common Performance Issues: Serialization• Serialization affects
– Network bandwidth– Memory (save memory by serializing to disk)
• The default serialization method in Spark is basic Java serialization– Simple, but slow
• Use Kryo Serialization for Scala and Java– Set spark.serializer = spark.KryoSerializer– Create KryoRegistrar class and set the class in
spark.kryo.registrator=MyRegistrator– Register classes with Kryo (kryo.register(classOf[MyClass]))
187
Small Partitions
• Problem: filter() can result in partitions with small amounts of data
– Results in many small tasks
• Solution: repartition(n)
– This is the same as coalesce(n, shuffle=true)

• This will build a newly-partitioned RDD, reducing the number of tasks
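The effect can be sketched in plain Python (illustrative only, not the Spark API): merging many near-empty partitions into a few keeps all the data while eliminating tiny tasks.

```python
def coalesce(partitions, n):
    """Merge the existing partitions into n groups, round-robin."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

# 6 partitions, mostly near-empty after a selective filter
filtered = [[1], [], [2], [], [], [3]]
compact = coalesce(filtered, 2)   # 2 partitions, 2 tasks instead of 6
```

Real coalesce also tries to preserve data locality; this sketch only shows the task-count effect.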
188
Passing Too Much Data in Functions• Problem: Passing large amounts of data to parallel
functions results in poor performance
• Solution:
– If the data is relatively small, use broadcast variable
– If the data is very large, parallelize into RDDs
189
Where to Look for Performance Issues• Scheduling and launching tasks
– Are you passing too much data to tasks?
– Use broadcast variable, or RDD
• Task execution– Are there tasks with a very high per-record overhead?
• mydata.map(dbLookup)
• Each lookup call opens a connection to the DB, reads, and closes
– Try mapPartitions
– Are a few tasks taking much more time than others?• Repartition, partition on a different key, or write custom partitioner
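The per-record-overhead point can be sketched in plain Python (illustrative only, not the Spark API): a map-style lookup pays the connection cost once per record, while a mapPartitions-style lookup amortizes it over the whole partition.

```python
def db_lookup_per_record(partitions):
    """map-style: pretend to open/close a DB connection for every record."""
    opens, out = 0, []
    for part in partitions:
        for record in part:
            opens += 1                 # one connection per record
            out.append(record * 10)    # the "lookup" (hypothetical)
    return out, opens

def db_lookup_per_partition(partitions):
    """mapPartitions-style: one connection serves the whole partition."""
    opens, out = 0, []
    for part in partitions:
        opens += 1                     # one connection per partition
        out.extend(record * 10 for record in part)
    return out, opens

partitions = [[1, 2, 3], [4, 5]]
```

Same results, far fewer connections – which is exactly the win mapPartitions offers for expensive per-record setup.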
190
Where to Look for Performance Issues (cont)• Shuffling
– Make sure you have enough memory for buffer cache
– Make sure spark.local.dir is local disk, ideally dedicated or SSD
• Collecting data to the Driver– Beware of returning large amounts of data to the
driver (using collect())
– Process data on the worker, not the driver
– Save large results to HDFS
191
Conclusion• We talked about the Big Data problem and
Hadoop
• We learned how to use Spark Core
• We got an overview about Spark Cluster and Parallel programming
• We reviewed Spark modules: Spark SQL, Spark Streaming, Machine learning and Graphs
• Spark is one of the leading technologies in today's world
192
What Did We Not Talk About?• Spark unified APIs and data frames
• Spark MLlib and machine learning in general
• Spark graph processing
Q&AAny Questions? Now will be the time!
Zohar Elkayamtwitter: @[email protected]
www.ilDBA.co.ilwww.realdbamagic.com
195