Big Data and Lynda.com
Subash DSouza
Who is Lynda.com?
• lynda.com is an online learning company that helps anyone learn software, design, and business skills to achieve their personal and professional goals.
• Founded in 1995 by Lynda Weinman and Bruce Heavin.
• Went online in 2002.
• As of January 2014, lynda.com offers more than 2,400 courses in business, design, web, programming, photography, video, 3D and animation, audio, education, and CAD.
Why Big Data?
• As the number of users on lynda.com has grown, the volume of data has increased rapidly.
• With the amount of data we collect, there has been a drive to derive more insights from it.
• We collect data from multiple sources, such as Google Analytics, internal logs, and user sessions.
Current Use Cases of Big Data at Lynda.com
• We use MongoDB as a Learning Record Store, to host user configuration for notifications, and as the data source for the localized text on the main website.
• A Learning Record Store (LRS) is a data store that serves as a repository for the learning records needed to use the Tin Can API.
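To make the Learning Record Store idea more concrete, here is a minimal sketch of the kind of record the Tin Can API deals in: a statement with an actor, a verb, and an object. The in-memory `LearningRecordStore` class, the email address, and the course URL are hypothetical stand-ins for illustration; this is not our MongoDB-backed store.

```python
import json
import uuid
from datetime import datetime, timezone


def make_statement(email, verb, activity_url):
    """Build a Tin Can style statement: actor, verb, object."""
    return {
        "id": str(uuid.uuid4()),
        "actor": {"objectType": "Agent", "mbox": "mailto:" + email},
        "verb": {
            "id": "http://adlnet.gov/expapi/verbs/" + verb,
            "display": {"en-US": verb},
        },
        "object": {"objectType": "Activity", "id": activity_url},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


class LearningRecordStore:
    """Hypothetical in-memory store; a real LRS would persist statements."""

    def __init__(self):
        self._statements = []

    def save(self, statement):
        self._statements.append(statement)
        return statement["id"]

    def by_actor(self, email):
        mbox = "mailto:" + email
        return [s for s in self._statements if s["actor"]["mbox"] == mbox]


lrs = LearningRecordStore()
lrs.save(make_statement("learner@example.com", "completed",
                        "http://example.com/courses/hadoop-intro"))
print(json.dumps(lrs.by_actor("learner@example.com"), indent=2))
```

A document store suits this shape because each statement is a self-contained nested record that can be saved and queried as-is.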
Current Use Cases of Big Data at Lynda.com
• Recommendation algorithms run on Myrrix. Data is fed once a day to our recommendation servers, which run Myrrix.
• Myrrix was a machine learning (“Big Learning”) product built on top of Apache Hadoop and Apache Mahout.
• It was bought out by Cloudera last August.
• It was succeeded by Oryx, which has tighter integration with CDH.
• We are working on migrating to Oryx.
The Future of Big Data at Lynda.com
• Use the data we collect to gain better insight for our business decision making.
• Combine Google Analytics with our own internal logs and user sessions to understand our users better. This will allow us to create customized experiences for our users.
• A better user experience will keep users on the site longer and should also improve the turnover rate.
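As a rough illustration of combining the three sources, the sketch below merges per-user rows into one profile per user. All field names (`user_id`, `page_views`, `level`, `minutes`) are assumptions made for the example, including the assumption that every source can be keyed by the same user id.

```python
from collections import defaultdict


def build_user_profiles(analytics_rows, log_rows, session_rows):
    """Merge three per-user data sources into one profile per user id."""
    profiles = defaultdict(lambda: {"page_views": 0, "errors": 0, "minutes": 0})
    for row in analytics_rows:      # e.g. rows exported from Google Analytics
        profiles[row["user_id"]]["page_views"] += row["page_views"]
    for row in log_rows:            # e.g. parsed internal log lines
        if row["level"] == "ERROR":
            profiles[row["user_id"]]["errors"] += 1
    for row in session_rows:        # e.g. archived user sessions
        profiles[row["user_id"]]["minutes"] += row["minutes"]
    return dict(profiles)


profiles = build_user_profiles(
    [{"user_id": "u1", "page_views": 12}],
    [{"user_id": "u1", "level": "ERROR"}, {"user_id": "u2", "level": "INFO"}],
    [{"user_id": "u1", "minutes": 30}, {"user_id": "u2", "minutes": 5}],
)
print(profiles)
```

A combined profile like this is the kind of input a customized experience could be built on.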
How are we achieving that?
• Building out Hadoop clusters on YARN
• Using HBase for some of our real-time use cases
• Testing out Spark and Storm
• Still in the early stages
Big Data Overview
Agenda
• Introduction of Hadoop to lynda.com
Hadoop Architecture Stack
[Diagram: Hadoop Stack and Data Access. Labels shown:]
• Data sources: Google Analytics, lynda logs, user sessions
• Pipeline stages: Extract Load Transform, Propagate
• Data stores: Staging, HDFS, Consumable Data, RDBMS/Files
• Interfaces: Hive/Pig, HBase, Services and APIs, API Access, Business Intelligence
• Security: Kerberos
• Metadata: HCatalog
• Job scheduling: Oozie
• Extract to RDBMS: Sqoop
• Monitoring tools: Nagios, Ganglia, Ambari
• Direct access to raw data: Hue
• Data serialization: Avro
• Data extraction: Flume
• Data movement: MapReduce
• Also labeled: Data Chronology, Governance
Data Collecting/Acquisition
• Start with archiving user sessions.
• Data acquisition: Google Analytics, lynda logs.
Staging
Data processing (ELT): put the data in one place so that it can be transformed efficiently by another process. This will be the “Extract” and “Load” part of the ELT process.
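The split between staging and transforming can be sketched as two separate steps. This is only an illustration under assumed conditions: a local temporary directory stands in for the staging area, and the tab-separated record layout (`user_id`, `action`) is made up for the example.

```python
import json
import tempfile
from pathlib import Path


def extract_and_load(raw_lines, staging_dir, batch_name):
    """'E' and 'L': copy raw records into the staging area unchanged."""
    target = Path(staging_dir) / (batch_name + ".log")
    target.write_text("\n".join(raw_lines) + "\n", encoding="utf-8")
    return target


def transform(staged_file):
    """'T': a later, separate pass parses the staged raw records."""
    records = []
    for line in staged_file.read_text(encoding="utf-8").splitlines():
        user_id, action = line.split("\t")
        records.append({"user_id": user_id, "action": action})
    return records


with tempfile.TemporaryDirectory() as staging:
    staged = extract_and_load(["u1\tplay", "u2\tpause"], staging, "2014-01-01")
    print(json.dumps(transform(staged)))
```

Because the raw records are stored untouched, the transform step can be rewritten and rerun later without collecting the data again.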
HDFS
With HDFS and the other components of the Hadoop stack, lynda.com will be able to acquire and store large amounts of data quickly and accurately.
Consumable Data
This is data that has been transformed and can be consumed by systems outside of Hadoop.
Given our lack of Java expertise, we will probably do the transformation during ingestion, that is, follow an ETL rather than an ELT strategy.
HBase
This interface to Hadoop is tightly integrated with HDFS. Hive and Pig do not have this tight integration.
Hive/Pig
Hive and Pig are SQL and scripting interfaces into Hadoop, respectively. Both of these interfaces sit outside of Hadoop.
RDBMS/Flat Files
Hadoop data will be “pushed” and/or “pulled” into RDBMSs or flat files for consumption outside of the Hadoop stack.
Services and APIs
APIs will be available for the consumption of data. These APIs will make data available from Hadoop and from RDBMSs.
Security
Authentication and access to the HDFS data will be handled with Kerberos.
Note: This security will not be comparable to that of an RDBMS.
HCatalog
HCatalog abstracts data locations and standardizes data types across Pig, Hive, and MapReduce. It is a metadata tool that is part of the Hadoop ecosystem.
MapReduce
With regard to Hadoop and manipulating data in HDFS, this is “lower-level” programming. It will be a while before we venture into this area of expertise. This is all written in Java and requires a strong understanding of the Hadoop Distributed File System (HDFS).
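For readers new to the model, the sketch below imitates the map, shuffle, and reduce steps in a single Python process. It is an illustration only: real Hadoop jobs are distributed and, as noted above, written in Java, and the log format (`user_id,course_id`) is an assumption made for the example.

```python
from collections import defaultdict


def map_phase(log_line):
    """Map: emit (course_id, 1) for the view recorded in one log line."""
    _user_id, course_id = log_line.split(",")
    yield course_id, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


views = ["u1,hadoop-101", "u2,hadoop-101", "u1,photoshop-basics"]
print(run_job(views))
```

The point of the model is that the map and reduce functions only ever see one record or one key at a time, which is what lets the framework spread them across many machines.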
Oozie
Scheduling
• MapReduce jobs need scheduling.
• Put MapReduce jobs somewhere for consumption. This could be in Hadoop itself.
• Oozie: workflow organizer
• Python or cron scripts
Data Output (output of scheduled jobs)
• Send emails for reports.
• Where will the data be put?
• In what format will it be put, for example a SQL table or a file?
Sqoop
Sqoop is an Apache project designed to transfer (“sqoop”) data between Hadoop and relational databases.
Data is “sqooped up” and put into SQL Server or dumped into a file.
Remember: “The tyranny of ‘OR’ and the inclusiveness of ‘AND’.”
We are not going to use SQL Server OR Hadoop. We will use SQL Server AND Hadoop. Facebook has to use both, and when it comes to this technology we are not better than Facebook.
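To show what ends up on the relational side, here is a minimal sketch that loads tab-separated job output into a SQL table. It does not use Sqoop; an in-memory SQLite database stands in for SQL Server, and the table and column names are assumptions made for the example.

```python
import sqlite3


def export_to_table(connection, table, tab_separated_lines):
    """Load tab-separated (course_id, views) lines into a relational table.

    `table` is interpolated into the SQL text, so it must be a trusted name.
    """
    connection.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (course_id TEXT, views INTEGER)"
    )
    rows = []
    for line in tab_separated_lines:
        course_id, views = line.rstrip("\n").split("\t")
        rows.append((course_id, int(views)))
    connection.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for SQL Server
export_to_table(conn, "course_views", ["hadoop-101\t2", "photoshop-basics\t1"])
print(conn.execute("SELECT * FROM course_views ORDER BY views DESC").fetchall())
```

Once the data is in a table, existing SQL reports and tools can use it without knowing that Hadoop produced it.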
Flume
Flume is a part of the Hadoop ecosystem used to collect data and/or data files from multiple locations and load them into HDFS.
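Flume moves events from sources through a channel to a sink. The sketch below imitates that flow in memory: a Python list plays the role of the HDFS destination, and the source names and log lines are made up for the example. It is not Flume configuration or the Flume API.

```python
from collections import deque


class Channel:
    """Buffer between the sources and the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def drain(self):
        """Yield buffered events in arrival order, emptying the channel."""
        while self._events:
            yield self._events.popleft()


def collect(sources, channel):
    """Source side: read every line from every named source into the channel."""
    for name, lines in sources.items():
        for line in lines:
            channel.put({"source": name, "body": line})


def deliver(channel, sink):
    """Sink side: move buffered events to the destination (a list here)."""
    count = 0
    for event in channel.drain():
        sink.append(event)
        count += 1
    return count


channel, hdfs_stand_in = Channel(), []
collect({"web-01": ["GET /course/1"], "web-02": ["GET /course/2"]}, channel)
print(deliver(channel, hdfs_stand_in), hdfs_stand_in)
```

The channel is what decouples the two sides: sources can keep collecting even when the destination is briefly slow.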
Nagios, Ganglia, Ambari, Cloudera Manager
Ganglia, Nagios, Ambari, and Cloudera Manager can be used to monitor the MapReduce operations. This will ensure that jobs run on time and that alerts are sent when jobs run too long. These tools will also assist in performance monitoring and optimization.
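The "alert when a job runs too long" rule can be sketched as a simple check over job records. The record fields (`name`, `started_at`, `finished_at`) and the two-hour limit are assumptions made for the example; the monitoring tools named above have their own ways of expressing such checks.

```python
from datetime import datetime, timedelta


def find_overdue_jobs(jobs, now, max_runtime):
    """Return names of jobs still running after more than `max_runtime`."""
    return [
        job["name"]
        for job in jobs
        if job["finished_at"] is None and now - job["started_at"] > max_runtime
    ]


now = datetime(2014, 1, 1, 12, 0)
jobs = [
    {"name": "daily-recommendations", "started_at": datetime(2014, 1, 1, 8, 0),
     "finished_at": None},
    {"name": "session-archive", "started_at": datetime(2014, 1, 1, 11, 30),
     "finished_at": None},
    {"name": "log-rollup", "started_at": datetime(2014, 1, 1, 6, 0),
     "finished_at": datetime(2014, 1, 1, 6, 20)},
]
for name in find_overdue_jobs(jobs, now, timedelta(hours=2)):
    print("ALERT: job running too long:", name)
```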
Services and API Access to Hive/Pig
Hue aggregates the most common Hadoop components (file browser for HDFS, job browser for MapReduce and YARN, HBase, Hive, Pig) into a single interface.
Avro
Avro uses JSON for defining data types and protocols, and serializes data in a compact binary format. It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services.
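To illustrate "JSON schema, compact binary data", the sketch below hand-encodes one record the way Avro's binary encoding lays it out: longs as zig-zag variable-length integers, strings as a length followed by UTF-8 bytes, and record fields concatenated in schema order. The `CourseView` schema is made up for the example, only the `string` and `long` types are handled, and real code would use an Avro library rather than this hand-rolled encoder.

```python
import json

# Avro schemas are written in JSON. This record type is made up for illustration.
SCHEMA = json.loads("""
{"type": "record", "name": "CourseView",
 "fields": [{"name": "user_id", "type": "string"},
            {"name": "seconds_watched", "type": "long"}]}
""")


def encode_long(value):
    """Zig-zag, then variable-length encode a 64-bit signed integer."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zigzag > 0x7F:
        out.append((zigzag & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def encode_string(text):
    """Length (as a long) followed by the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(schema, record):
    """Concatenate field encodings in schema order; no field names are written."""
    encoders = {"string": encode_string, "long": encode_long}
    return b"".join(
        encoders[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


print(encode_record(SCHEMA, {"user_id": "u1", "seconds_watched": 64}).hex())
```

Because field names live only in the schema, the encoded record is just a few bytes, which is where the compactness comes from.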
Business Intelligence
A B.I. strategy will need to be developed and enabled.
This will be critical because one of the most frequently cited benefits of Hadoop is discovery. We will need to enable discovery in this paradigm.
Governance
The fundamentals of data governance will need to be established. Core concepts like “Master Data” will need to be established, and the “Big Data” platform will need to be beholden to and integrated with these data governance values. Issues like data life cycle and entitlements to PII data will be part of the Big Data implementation.
Hadoop Anthology
Process: Implementation
• Ingest: Flume
• Describe: HCatalog
• Compute: MapReduce
• Persist: HDFS/HBase
• Monitor: Nagios
• Propagate: Sqoop
• Develop: Hive/Pig/Avro