Enterprise Data Management: A Perspective From the days of Data Silos and EDW to the present day of Hadoop & Data Lake

This document discusses the evolution of enterprise data management over the years, the challenges facing today's CTOs and chief enterprise architects, and the concept of the Data Lake as a means to tackle those challenges. It also covers some reference architectures and recommended toolsets in today's context.
March, 2016
Authors:
Selva Kumar VR
Saurav Mukherjee
Enterprise Data Management – A Perspective
Page 1 of 18
Contents

1. The Evolution of Data Management – what led to ‘Data Lake’?
   1.1. Data Silo
   1.2. Enterprise Data Warehouse (EDW)
   1.3. Big Data
   1.4. Hadoop
2. The Challenges of present CTOs
3. Data Lake
   3.1. Key Components of Data Lake
      3.1.1. Storage
      3.1.2. Ingestion
      3.1.3. Inventory & Cataloguing
      3.1.4. Exploration
      3.1.5. Entitlement
      3.1.6. API & User Interface
4. Data Lake – Implementing the Architecture
   4.1. Storage
   4.2. Ingestion
      4.2.1. The Challenges
      4.2.2. Recommendation
   4.3. Inventory, Catalogue & Explore
      4.3.1. Discovery
      4.3.2. Catalog & Visualization
   4.4. Entitlement & Auditing
   4.5. API & User Interface Access
5. Conclusion
6. Bibliography
7. Few Other Useful References
Figures

Figure 1: Data Management in Silos
Figure 2: Typical EDW Implementation
Figure 3: Typical Data Lake Implementation
Figure 4: Apache Nifi Data Flow View
Figure 5: Apache Nifi Data Provenance View
Figure 6: Nifi - The Power of Provenance
Figure 7: Apache Nifi Stats View

Tables

Table 1: Key Challenges for CTOs/Chief Enterprise Architects of today
Table 2: Data Ingestion Challenges - beyond just the tools
1. The Evolution of Data Management – what led to ‘Data Lake’?

The concept of data management has evolved over the last 30 years around the idea of providing better and more timely analytics to business teams. IT teams have always struggled with the business demand to provide everything 'the next minute' in service of new business ideas.

1.1. Data Silo

Initially, data management systems for analytics were created in silos. This approach helped extract some insights from the organization’s data assets. However, the silos were restricted to individual LOBs (lines of business) and hence were never considered comprehensive. LOBs typically sent data to other LOBs as required and requested; in most cases, these were just reports (static & analytical) pulled from application databases.
Figure 1: Data Management in Silos
1.2. Enterprise Data Warehouse (EDW)

To break away from data silos while still giving LOBs the freedom to create their own data marts, the industry widely adopted the idea of the Enterprise Data Warehouse (EDW). The concept has been researched for a long time; a joint paper by HP Labs and Microsoft Research provides a good overview of the concept and approach (Chaudhuri, et al., 1997). All data marts source their data from one central version of the data, thereby maintaining data integrity and consistency at the enterprise level.

Though EDW solved, to a certain extent, the problem of providing an enterprise-level view of data to all business teams, answering questions or delivering the necessary data within a minute of a new business idea remained a cherished but elusive dream for IT & business teams. Also, the ‘one version fits all’ idea did not sit well with every group in the organization, and the culture of business analysts downloading data from the EDW into Microsoft Excel spreadsheets or Microsoft Access and merging it with source data remained widespread.
EDW architecture also posed numerous technical challenges. A few are listed below.

- Cost
  - Licensing cost (database licenses, ETL tools etc.)
  - Storage cost
- Ridiculously long lead times before database schemas could be created as per standards, which in turn were followed by long ETL development cycles
- Every post-production fix involved a long and repetitive development cycle
- Complicated designs
- Need for a highly skilled labor force
Figure 2: Typical EDW Implementation
1.3. Big Data

Meanwhile, technology evangelists like Google, Netflix, Amazon, Facebook, Twitter, advanced oil-drilling equipment manufacturers, space companies etc. injected new types of problems into the data space, e.g., data type and volume. It was no longer a structured-data world: the data now included unstructured data like videos, social text streams, sensor data, data streams from IoT devices etc. These data types can neither be accommodated in traditional databases nor managed at scale as easily as structured data. In addition to data volume, the variety and velocity of data flow had to be tackled together to derive business advantage, and faster than the competition. These new-generation companies also created applications that are distributed from the ground up; new distributed file systems, new distributed processing applications etc. were required to handle the volume and the velocity. Seminal papers from companies like Google (Chang, et al., 2006) (Dean, et al., 2004) (Ghemawat, et al., 2003), Amazon (DeCandia, et al., 2007) etc. discuss this topic in detail. The dimensions of volume, variety and velocity gave birth to what came to be known as ‘Big Data’1.
1 Over time, a couple more V’s – veracity & volatility – got attributed to Big Data.
1.4. Hadoop

Doug Cutting (later Chief Architect at Cloudera) adopted the distributed-systems idea and created Hadoop, inspired by and modeled on Google’s high-volume data processing systems. Hadoop is open source and relies on the concept of bulk commodity hardware. It solves the cost issue (licensing cost, storage cost) and the data variety issue.
Over time, a new ecosystem formed around HDFS (Hadoop Distributed File System). It generated new efficiencies for data architecture through optimization of data processing workloads such as data transformation and integration, while simultaneously lowering the cost of storage. Ideas like flexible ‘schema-on-read’ access to all enterprise data, which circumvents long database schema design and long ETL development cycles, started taking shape.
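The schema-on-read idea can be illustrated with a minimal, hypothetical sketch (the field names and records below are invented for illustration): raw records land in storage untyped, and a schema is declared only by each consumer at read time, so new questions do not require an upfront schema design cycle.

```python
import json

# Raw events land in the lake as-is; no schema is enforced on write.
raw_lines = [
    '{"user": "u1", "amount": "42.50", "ts": "2016-03-01"}',
    '{"user": "u2", "amount": "7.00", "ts": "2016-03-02", "channel": "mobile"}',
]

# A schema is declared only when the data is read, per use case.
read_schema = {"user": str, "amount": float}

def apply_schema(line, schema):
    """Project and coerce only the fields this consumer cares about."""
    record = json.loads(line)
    return {field: cast(record[field]) for field, cast in schema.items()}

rows = [apply_schema(line, read_schema) for line in raw_lines]
# rows[0] == {"user": "u1", "amount": 42.5}
```

Note how the second record carries an extra field (`channel`) that the writer never had to negotiate with anyone; a consumer that cares about it simply declares a different read schema.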
Though Hadoop potentially solves the data storage problem, it suffers from high latency for data retrieval (batch processing). The latency issue led to new ways of storing & retrieving data in the form of NoSQL databases, e.g., Apache HBase and Apache Cassandra (inspired by Amazon (DeCandia, et al., 2007)), and to better processing engines like Spark (Zaharia, et al., 2012) (Zaharia, et al., 2010) (Zaharia, et al., 2012), Flink (Apache Software Foundation, 2015) etc. However, NoSQL databases have their own challenges, like complicated table designs and joins that do not work as well as in a traditional RDBMS.
This landed the industry at a juncture of good infrastructure frameworks and low cost open source tools (e.g., storage tools like HDFS; NoSQL databases like MongoDB, HBase, Cassandra, Memcached etc.; data processing tools like Spark, MapReduce, Pig, Hive, Flink, Nifi etc.; message brokering tools like Kafka (Kreps, et al.) and RabbitMQ) and, of course, the existing high-cost enterprise toolsets & easy-access storage (i.e., RDBMS like Oracle, DB2, SQL Server etc.; Massively Parallel Processing (MPP) tools like Teradata, Impala etc.; processing tools like AbInitio, Informatica, DataStage etc.).
Along the way, the revolution called open source added significant value to the technology community. It facilitated the creation of many start-ups, encouraged new ideas and, of course, added a lot of chaos. Each of these tools (whether low cost or high cost) is focused on solving a specific use case, and new open source products started getting released every other month. For an enterprise CTO or architect, however, it gets really challenging to identify sustainable open source solutions that would also solve multiple use cases instead of specific ones. Here came the open source bundling companies, e.g., Cloudera, Hortonworks, MapR etc. They took ownership of identifying software that is good and sustainable, and of managing tools that go through very frequent releases of improved versions. This solved, to a good extent, the basic problem of adopting the open source ecosystem into an enterprise. The bundling companies differ in their selection of tools and, of course, it is purely left to the enterprise’s use cases to decide which one to go for.
Once the new ecosystem (based mostly on open source solutions) stabilized, the next challenge was to adopt a suitable methodology for application development and maintenance. Adopting the open source ecosystem also mandated replacement of all or some of the well-accepted traditional enterprise software and tools, and such replacement entails its own share of risks.

Also, there are no widely practiced and adopted standards in the industry for open source based enterprise data management solutions. Most advanced business analysts still rely on power tools like SQL, metadata management repositories etc. to infer business insights, and Hadoop lacks the flexibility of data extraction using SQL at similar speed. On top of that, there have been challenges of dealing with regulations, preventing data from falling into the wrong hands, auditing etc.
2. The Challenges of present CTOs

The previous section discussed the evolution of data management, the multidimensional challenges it posed, and the difficulty of identifying a proper adoption framework or architecture that may be widely used, standardized and easily adopted by enterprises. CTOs and architects would be better served by having a reference architecture or framework to minimize the risks involved. The few exceptional use cases that do not fit well in such a framework or architecture can be handled separately.

Before delving deeper into the adoption framework or architecture, here is a quick summary of the critical challenges, from an enterprise data management perspective, that evolved out of the EDW era.
1. Provide low cost storage and processing. Accommodate any data type.
2. Provide a consolidated view of enterprise data to empower business teams to pull all required information the minute a new business idea pops up.
3. Provide a consolidated view of enterprise data and the flexibility of ad hoc reporting on any data element in the enterprise to the business analyst.
4. Provide metadata cataloguing and a search facility for metadata.
5. Store data in its original raw form to guarantee data fidelity.
6. Provide entitlement management features that take care of regulation, authorization, authentication, encryption, data masking, auditing etc.
7. Leverage existing licensed tools for use cases / problems which open source systems cannot solve.
8. Maintain existing good features, like fast data extraction using SQL for analysis, and add new features that significantly reduce the latency of creating advanced analytical applications like machine learning.
9. Provide data access to external & internal teams based on entitlement.
10. Provide enterprise data elements in raw form to a new category of analysts, called data scientists.
11. Select technologies that minimize tool replacement costs and keep up with technology trends, to keep the enterprise competitive.
12. Integrate data profiling and data quality results into the metadata management framework.

Table 1: Key Challenges for CTOs/Chief Enterprise Architects of today
3. Data Lake

The ‘Data Lake’ emerged as the next key concept in the data management area, conceptualized primarily to tackle the challenges discussed in the section above. It is more of an architectural concept and may be defined as a “repository of enterprise-wide, large quantities and variety of data elements, both structured and unstructured, in raw form.”

This definition is based on insights from multiple data management implementations in Hadoop environments: identifying challenges and coming up with an architecture to solve them. However, a repository alone will not suffice to meet the challenges listed in Table 1; supporting components are required to deliver the benefits.
3.1. Key Components of Data Lake

The Data Lake architecture involves the mandatory components below to make an implementation successful.
3.1.1. Storage

- Low cost
- Store raw data from different input sources
- Support any data type
- High durability

3.1.2. Ingestion

- Facilitate both batch & streaming ingestion frameworks
- Offer low latency

3.1.3. Inventory & Cataloguing

- Discover metadata and generate tags
- Discover lineage information
- Manage tags

3.1.4. Exploration

- Browse / search inventory
- Inspect data quality
- Tag data quality attributes
- Auditing

3.1.5. Entitlement

- Identity & Access Management
  - Authentication, Authorization, Encryption, Quotas, Data Masking

3.1.6. API & User Interface

- Expose search API
- Expose Data Lake to customers using API & SQL interfaces based on entitlements and access rights
4. Data Lake – Implementing the Architecture
The components shown in Figure 3 are the minimal requirements for implementing a Data Lake. Hadoop (HDFS) can accommodate application storage as well, and those applications can also leverage the Data Lake’s built-in framework components like Catalogue, Data Quality, Search & Entitlements.
4.1. Storage

The primary requirements for storage are low cost, the ability to accommodate high volume and any data type, and high durability. Current technology trends suggest that HDFS, MapR-FS and Amazon S3 suit the need. Even though they have different underlying implementations, they all adhere to Hadoop standards.
Along with storing data in distributed file systems, it is a good idea to identify suitable storage options for different data types, as below.

- Unstructured data
  - Store in native file format (logs, dump files, videos etc.)
  - Compress with a streaming codec (LZO, Snappy)
- Semi-structured data – JSON, XML files
  - Good to store in schema-aware formats, e.g., Avro. Avro allows versioning & extensibility, like adding new fields.
- Structured data
  - Flat records (CSV or some other field-separated format)
  - Avro or columnar storage (Parquet)
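The schema-evolution property that makes formats like Avro attractive can be sketched without the Avro libraries themselves. The snippet below is a stdlib-only illustration of the idea (the schema and field names are invented): a newer schema adds a field with a default, so records written under the older schema can still be read.

```python
import json

# Version 2 of a hypothetical schema adds a field with a default value,
# mirroring how schema-aware formats like Avro handle evolution.
schema_v2 = {
    "fields": [
        {"name": "id", "default": None},
        {"name": "country", "default": "US"},  # added in v2
    ]
}

def read_with_schema(serialized, schema):
    """Fill in defaults for fields an older record does not carry."""
    record = json.loads(serialized)
    return {f["name"]: record.get(f["name"], f["default"]) for f in schema["fields"]}

old_record = json.dumps({"id": 7})          # written under schema v1
decoded = read_with_schema(old_record, schema_v2)
# decoded == {"id": 7, "country": "US"}
```

In real Avro the reader/writer schema resolution is richer than this (type promotion, aliases etc.), but the default-value mechanism shown here is the core of why adding fields is a safe, non-breaking change.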
Figure 3: Typical Data Lake Implementation
A storage life-cycle policy can also be defined. There are open source tools like Apache Falcon (Apache Software Foundation, 2016) that operate based on pre-defined policies. A data directory structure can be defined to segregate data by life-cycle policy – e.g., latest data, data up to 7 years old as required by regulations, data older than 7 years etc.
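A tier-assignment policy of this kind is simple to express in code. The sketch below is a hypothetical example (the tier names, thresholds and path layout are invented, not Falcon's API): data is routed to a directory tier based on its age, which a policy engine could then act on.

```python
from datetime import date

# Hypothetical life-cycle policy: hot data, a 7-year regulatory tier, then archive.
def lifecycle_tier(ingest_date, today):
    age_days = (today - ingest_date).days
    if age_days <= 365:
        return "latest"
    if age_days <= 7 * 365:
        return "regulatory-7yr"
    return "archive"

# The directory layout can then segregate data by tier, e.g. /datalake/<tier>/<source>/
path = "/datalake/{}/lob1/".format(lifecycle_tier(date(2010, 1, 1), date(2016, 3, 1)))
```

In practice a tool like Falcon evaluates such policies on a schedule and moves or retires the data sets; the point of the sketch is only that the policy itself is a small, testable rule.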
4.2. Ingestion

Ingestion is the first piece of the puzzle to put in place after setting up the storage. This involves setting up an ingestion framework that handles both batch and streaming data. Looking at current trends in data processing tools, the next generation of technologies may well treat batch processing as legacy: newer processing tools (e.g., Spark (near real-time), Flink (real-time) etc.) promote batch as a special case of streams. The complexity of good stream processing depends on the use cases. O’Reilly offers an in-depth discussion of streaming and going beyond batch (Akidau, 2015) (Akidau, 2016).
4.2.1. The Challenges
Advanced processing tools alone, however, are not enough to ensure proper ingestion. The following list summarizes a few challenges that need to be circumvented.
1. Making use of advanced processing tools requires highly skilled resources in good numbers.
2. Traditional data processing engineers widely use GUI-based ETL tools (like AbInitio, Informatica etc.) that rely on data-flow programming techniques. For those engineers, coding applications with open source processing tools (like Spark, Flink etc.) still takes considerable time for development and testing.
3. Due to the nature of the open source ecosystem, there will always be a new processing tool that outruns the benefit of the current toolset and offers a business advantage over the enterprise’s competition. This requires easy and quick adoptability, which may be a big challenge.
4. Data processing tools are good at processing data. However, an ingestion framework also needs to go beyond that and solve challenges like:
   - Low latency & guaranteed data delivery
   - Handling back pressure
   - Data provenance (tracking data all the way from the data source)
   - Customizability
   - Quick implementation and a better UI for the operations team
   - Supporting the wide variety of protocols used for sending/receiving data (e.g., SSL, SSH, HTTPS, other encrypted content etc.)
   - Loading data into a wide number of destinations (HDFS, Spark, MapR-FS, S3, RDBMS, NoSQL etc.)
5. From an enterprise perspective, it is often desired to have the same tools used across the enterprise for any application that requires data push/pull. However, zeroing in on ‘the one toolset’ is always challenging.
Table 2: Data Ingestion Challenges - beyond just the tools
4.2.2. Recommendation
Based on tool evaluation research, two tools may be recommended to handle the ingestion problems and the quick-adoptability challenge.
4.2.2.1. Apache Nifi
Apache Nifi (Apache Software Foundation, 2015) is one of the best open source data-flow programming tools and fits the bill for most data push/pull use cases. Just to get the uninitiated excited about it, here are a few Nifi snapshots:
Figure 4: Apache Nifi Data Flow View
Figure 5: Apache Nifi Data Provenance View
Figure 6: Nifi - The Power of Provenance
Figure 7: Apache Nifi Stats View
Nifi can be used as a full-fledged ETL tool and supports most ETL features. However, Nifi still describes itself as a simple event processing and data provenance tool. With continued open source support, Nifi may well be transformed into a full-fledged ETL tool.
4.2.2.2. Cascading
To deal with the quick-adoptability part, it is a good idea to have wrapper technologies: they allow the code to be written once while the processing engine underneath is changed based on the latest trends or best fit. Our research recommends Cascading (Driven, Inc., 2015) as a good candidate here. At present, Cascading supports multiple processing engines underneath (Spark, MapReduce, Flink etc.).
Cascading supports development in Java and Scala and allows the business logic to be developed separately from the integration logic. Complete applications may be developed, and unit tests written, without touching a single Hadoop API. This provides the freedom to move easily through the application development life-cycle and to deal separately with integrating existing systems.
Cascading provides a rich API that allows thinking in terms of data and business problems with
capabilities such as sort, average, filter, merge, etc. The computation engine and process planner
convert the business logic into efficient parallel jobs, delivering the optimal plan at run-time to the
computation fabric of choice.
In simple terms, Cascading may be considered the plumbing used for building pipelines: it provides sources, sinks, traps, connections etc., and it is just a matter of plugging them together to build business logic without worrying about whether the code will run on MapReduce, Spark or Flink. This style of composition is often described as a pattern language. Developers can go all the way to unit testing without touching Hadoop or any processing engine. From a technology-category perspective, it is middleware for designing workflows.
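The wrapper idea can be sketched independently of Cascading's actual Java API. The Python snippet below is a hypothetical illustration of the pattern (every name in it is invented): the pipeline is declared once as plain business logic, and the execution engine is a pluggable object, with a local in-memory engine standing in for MapReduce, Spark or Flink.

```python
# The pipeline is declared once, as plain business logic ...
pipeline = [
    ("filter", lambda r: r["amount"] > 0),
    ("map",    lambda r: {**r, "amount": round(r["amount"], 2)}),
]

# ... and the execution fabric is pluggable. This local engine stands in for
# MapReduce / Spark / Flink, which is the idea behind Cascading's planner.
class LocalEngine:
    def run(self, pipeline, records):
        for kind, fn in pipeline:
            if kind == "filter":
                records = [r for r in records if fn(r)]
            elif kind == "map":
                records = [fn(r) for r in records]
        return records

data = [{"amount": 10.129}, {"amount": -3.0}]
result = LocalEngine().run(pipeline, data)
# result == [{"amount": 10.13}]
```

Swapping `LocalEngine` for a distributed one leaves `pipeline` untouched, which is exactly the degree of freedom the text attributes to Cascading: the business logic, and its unit tests, never depend on the engine.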
4.3. Inventory, Catalogue & Explore

The Data Lake storage & ingestion ideas above solve several of the challenges in Table 1: low cost storage, low cost processing, near real-time sync with data sources, and data kept in raw form to maintain fidelity. Enterprise data pushed into the Data Lake in raw form gives business analysts and data scientists the flexibility to pull any enterprise data element as required, without waiting for long ETL development and data modeling exercises to complete. Streaming ingestion keeps the Data Lake in sync with the data sources in as near real time as possible.

However, enterprise data in raw format can be huge, and finding anything in it will be like finding a needle in a haystack for a data scientist or any other user. This mandates a self-data-service framework for data discovery (Inventory), data preparation (Catalogue) and data visualization (Explore).
4.3.1. Discovery
The first step in data discovery is to provide a metadata framework (a sub-component of the self-data-service framework) to capture business, technical and operational metadata. This process needs to be automated to handle the sheer volume of files loaded into the Data Lake. Even though in theory a Data Lake makes data available to everyone, constraints in the form of entitlements need to be put in place for data governance purposes.

The metadata framework should also create data lineage information as part of the ingestion framework. This enables lineage all the way from the data source to the Data Lake.
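Automated capture of technical and operational metadata at ingest time can be sketched as below. This is a minimal, hypothetical example (the function and field names are invented, not any vendor's API): for each newly landed file it records size, ingest time, inferred format, originating source (the seed of lineage) and free-form tags.

```python
import os
import time

def capture_technical_metadata(path, source, tags=()):
    """Record technical + operational metadata for a newly ingested file."""
    stat = os.stat(path)
    return {
        "path": path,
        "source": source,           # lineage: where the data came from
        "size_bytes": stat.st_size,
        "ingested_at": time.time(),
        "format": os.path.splitext(path)[1].lstrip(".") or "unknown",
        "tags": list(tags),
    }
```

Hooking such a function into the ingestion framework means every file that enters the lake gets a catalog entry automatically, which is what makes discovery workable at volume.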
4.3.2. Catalog & Visualization
Once metadata (business, technical & operational) has been captured for the raw data provided by data sources, it may be used as a catalog, and a UI may then be used to explore it. Along with metadata, data profiling abilities & data quality metrics for all data pushed into the Data Lake are really valuable and desirable in this context.

Most of the available frameworks are tag based: they identify and mark metadata, profiling metrics & quality metrics, and come with built-in CRUD, Query or Analytics APIs for handling metadata management.
This area is fairly new to the industry; only a few vendors provide a data self-service framework. Below is a list of such vendors and their products.

- Cloudera (Cloudera, 2016) – Cloudera Navigator (not open source, license based).
- Waterline Data (Waterline Data, Inc., 2016) – an independent organization; integrates with any Hadoop distribution.
- Hortonworks (Hortonworks Inc., 2016) – Apache Atlas, still in incubation. However, a limited-feature version has been added to the HDP 2.3 release. Hortonworks has also actively partnered with Waterline Data.
4.4. Entitlement & Auditing

Entitlement is one of the primary pieces of data governance. Generally, data governance has a few mandatory components: data profiling, data quality, entitlement and auditing. The main goal of governance is to facilitate easy & secure data access along with reliable data (profiling & data quality measures). The previous section discussed profiling & data quality; this section focuses on entitlements and auditing.
Entitlement & auditing cover a wide range of activities, like:

- Authentication
- Authorization
- Encryption
- Auditing
- Data Masking
- Data Field Level Authorization
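Field-level authorization combined with masking can be sketched in a few lines. The example below is purely illustrative (the roles, fields and the `masked_view` helper are all invented, not the API of Sentry, Ranger or any vendor tool): each role maps to the set of fields it may see in clear, and everything else is masked rather than dropped.

```python
# Hypothetical field-level entitlements: each role maps to fields visible in clear.
ENTITLEMENTS = {
    "analyst": {"order_id", "amount"},
    "auditor": {"order_id", "amount", "ssn"},
}

def masked_view(record, role):
    """Return the record with unauthorized fields masked, not dropped."""
    allowed = ENTITLEMENTS.get(role, set())
    return {k: (v if k in allowed else "****") for k, v in record.items()}

row = {"order_id": 1, "amount": 9.99, "ssn": "123-45-6789"}
analyst_view = masked_view(row, "analyst")
# analyst_view == {"order_id": 1, "amount": 9.99, "ssn": "****"}
```

Masking instead of dropping keeps record shapes stable across roles, so the same downstream reports and joins work for every consumer; only the sensitive values differ.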
Almost all Hadoop distribution vendors use Kerberos as the authentication protocol. MapR uses a proprietary authentication tool that follows a similar approach to Kerberos.

For authorization, data masking & data field-level authorization, different Hadoop distribution vendors use different toolsets. Cloudera uses Sentry & Cloudera Navigator; Hortonworks uses Apache Ranger & Apache Knox; MapR uses proprietary ACEs (Access Control Expressions), which provide better flexibility than ACLs (access control lists). ACLs are supported by all vendors.

All vendors offer encryption for at-rest and in-transit data. The approaches taken to manage the keys used for encryption/decryption are quite proprietary.
There are multiple open source projects in the Hadoop security area. A few are listed below.

- Apache Knox (Apache Software Foundation, 2016): a REST API gateway that provides a single access point for all REST interactions with Hadoop clusters.
- Apache Sentry (Apache Incubator): a modular system providing role-based authorization for both data and metadata stored in HDFS. The Sentry project is primarily led by Cloudera, one of the best-known Hadoop distributors.
- Apache Ranger (Hortonworks, Inc., 2016): a centralized environment for administering and managing security policies across the Hadoop ecosystem. This project is led by Hortonworks, another well-known Hadoop distributor, and includes technology gained when it acquired XA Secure in mid-2014 (Hortonworks, Inc., 2014).
- Apache Falcon (Apache Software Foundation, 2016): a data governance engine that allows administrators to define and schedule data management and governance policies across the Hadoop environment. Section 4.1 also discusses this.
- Project Rhino (Williams, 2013): creates encryption, key management capabilities and a common authorization framework across Hadoop projects and subprojects (TechTarget). This project is led by Intel.

Most of these security tools are built in and distributed by the various Hadoop bundling vendors.
4.5. API & User Interface Access

To provide easy and secure access, it is recommended to allow controlled access to the Data Lake using either an API or interactive SQL. This in turn enforces the built-in entitlements discussed in the sections above.
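Controlled SQL access can be made structural rather than procedural: expose entitled views, not base tables. The sketch below uses sqlite3 purely as a stand-in for an interactive SQL engine (the table, view and column names are invented for illustration): the analyst role queries a view that simply omits the sensitive column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, ssn TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 9.99, '123-45-6789')")

# Entitlement by construction: analysts are granted the view, not the base table,
# so the sensitive column is unreachable from their SQL sessions.
conn.execute("CREATE VIEW orders_analyst AS SELECT order_id, amount FROM orders")

rows = conn.execute("SELECT * FROM orders_analyst").fetchall()
# rows == [(1, 9.99)]
```

The same pattern applies to SQL-on-Hadoop engines: grants on views (or their equivalent in Sentry/Ranger policies) keep entitlement enforcement inside the access layer instead of in every client application.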
A wide range of tools is available for API management and SQL (Spark SQL, Flink SQL, Impala etc.). Even with all these tools, data access might not be as fast as with RDBMS tools – a case in point for leveraging existing enterprise tools.

As mentioned earlier, most of the framework set up for the Data Lake can be re-used for other use cases. If data cleansing & standardization have to be done, they can be run in the Hadoop environment using data processing tools like MapReduce, Cascading, Spark, Flink etc., and the HDFS environment can be segmented to hold cleansed, standardized and aggregated information. The standardized version of the data may also be pushed to the existing EDW. This approach moves the complete ETL from the EDW to the Hadoop environment, minimizing processing and licensing costs; keeping the highly granular data in Hadoop also reduces RDBMS storage cost.
5. Conclusion

The Data Lake provides an architectural approach with an embedded governance model. It helps data management teams implement a variety of solutions using cost-effective storage, efficient processing engines and self-data-service features. Teams implementing a Data Lake need to pay close attention to defining metadata for all types of data objects ingested into it: metadata plays the key role in exposing self-data-service flexibility to analysts, data scientists and users, and it is a key component for defining entitlements.
6. Bibliography Akidau Tyler The world beyond batch: Streaming 101 - O'Reilly Media [Online] // The world beyond
batch: Streaming 101 - O'Reilly Media. - O'Reilly, Aug 05, 2015. - Mar 08, 2016. -
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101.
Akidau Tyler The world beyond batch: Streaming 102 - O'Reilly Media [Online] // The world beyond
batch: Streaming 102 - O'Reilly Media. - O'Reill, Jan 20, 2016. - Mar 08, 2016. -
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102.
Apache Incubator. Apache Sentry (incubating) [Online]. Apache Software Foundation. Accessed Mar 09, 2016. https://sentry.incubator.apache.org/.
Apache Software Foundation. Apache Flink: Scalable Batch and Stream Data Processing [Online]. Apache Software Foundation, 2015. Accessed Mar 07, 2016. http://flink.apache.org/.
Apache Software Foundation. Apache NiFi [Online]. Apache Software Foundation, 2015. Accessed Mar 08, 2016. https://nifi.apache.org/.
Apache Software Foundation. Falcon - Feed Management & Data Processing Platform [Online]. Apache Software Foundation, Feb 15, 2016. Accessed Mar 08, 2016. https://falcon.apache.org/.
Apache Software Foundation. Knox Gateway - REST API Gateway for Hadoop Ecosystem [Online]. Apache Software Foundation, Mar 01, 2016. Accessed Mar 09, 2016. https://knox.apache.org/.
Chang, Fay, et al. Bigtable: A Distributed Storage System for Structured Data [Online]. Google, 2006. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.
Chaudhuri, Surajit and Dayal, Umeshwar. An Overview of Data Warehousing and OLAP Technology [Online]. Microsoft Research, Mar 1997. Accessed Mar 03, 2016. http://research.microsoft.com/pubs/76058/sigrecord.pdf.
Cloudera. Cloudera [Online]. Cloudera, 2016. Accessed Mar 08, 2016. https://cloudera.com/.
Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters [Online]. Google, 2004. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf.
DeCandia, Giuseppe, et al. Dynamo: Amazon's Highly Available Key-value Store [Online]. Amazon, 2007. Accessed Mar 15, 2016. http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf.
Driven, Inc. Cascading | Application Platform for Enterprise Big Data [Online]. Driven, Sep 2015. Accessed Mar 08, 2016. http://www.cascading.org/.
Ghemawat, Sanjay, Gobioff, Howard and Leung, Shun-Tak. The Google File System [Online]. Google, 2003. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf.
Hortonworks, Inc. Hortonworks: Open and Connected Data Platforms [Online]. Hortonworks, 2016. Accessed Mar 08, 2016. http://hortonworks.com/.
Hortonworks, Inc. Apache Ranger [Online]. Hortonworks, 2016. Accessed Mar 09, 2016. http://hortonworks.com/hadoop/ranger/.
Hortonworks, Inc. Hortonworks Acquires XA Secure [Online]. Hortonworks, May 15, 2014. Accessed Mar 09, 2016. http://hortonworks.com/press-releases/hortonworks-acquires-xa-secure/.
Kreps, Jay, Narkhede, Neha and Rao, Jun. Kafka: A Distributed Messaging System for Log Processing [Online]. LinkedIn Corp. Accessed Mar 07, 2016. http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf.
TechTarget. Managing Hadoop Projects: What You Need to Know to Succeed [Online]. TechTarget. Accessed Mar 09, 2016. http://searchdatamanagement.techtarget.com/essentialguide/Managing-Hadoop-projects-What-you-need-to-know-to-succeed.
Waterline Data, Inc. Waterline Data | Find, Understand, and Govern Data in Hadoop [Online]. Waterline, 2016. Accessed Mar 09, 2016. http://www.waterlinedata.com/.
Williams, Alex. Intel Launches Hadoop Distribution And Project Rhino, An Effort To Bring Better Security To Big Data [Online]. TechCrunch, Feb 26, 2013. Accessed Mar 09, 2016. http://techcrunch.com/2013/02/26/intel-launches-hadoop-distribution-and-project-rhino-an-effort-to-bring-better-security-to-big-data/.
Zaharia, Matei, et al. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters [Online]. University of California, Berkeley, 2012. Accessed Mar 07, 2016. https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf.
Zaharia, Matei, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing [Online]. University of California, Berkeley, 2012. Accessed Mar 15, 2016. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.
Zaharia, Matei, et al. Spark: Cluster Computing with Working Sets [Online]. University of California, Berkeley, 2010. Accessed Mar 15, 2016. http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf.
7. A Few Other Useful References

Data Lake References
Hortonworks & Teradata Paper - Data Lake
Amazon's experience on Data Lake - Data Lake Implementation Guidelines
Knowledgent Reference - Data Lake Design
Waterline Data - Self Data Service
Flink: A New Breed in Processing Tools
Flink Streaming & Batching in One Engine

Data Security
Cloudera Security - Paper on Hadoop Security
Cloudera reference on Hadoop Encryption - Encryption in Cloudera
Hortonworks - Data Governance