A “little big data” solution from WorldDataOnline…
NoETL – a low cost, self-administered, self-service data lake
big data
The software industry is abuzz with “big data” at the moment: talk of petabyte-sized databases, structured and unstructured data, Hadoop, MapReduce, U-SQL. These new technologies have in turn given rise to a new field, data science. Data scientists are the new rock stars of the software industry; within your IT department they are the custodians and gatekeepers of this new universe.
Big data projects are - to state the blindingly obvious - time-consuming, expensive and complex. As a line-of-business manager or analyst you may be waiting a long time to see any benefits trickle down to your domain. If you own a small or medium-sized enterprise you will probably assume this technology is out of reach.
This is where WorldDataOnline comes in. We take elements of big data technology and make them accessible to people who don’t have the skills of a data scientist, or the budget of a Fortune 100 company.
We call this “little big data”.
little big data
WorldDataOnline’s approach is to deliver a predictable, low cost by utilizing existing infrastructure and software licenses. This contrasts with the big data vendors’ cloud pricing, which is complex and unpredictable:
little big data
pro: Secure – you own the infrastructure on which your data sits*
pro: Predictable cost
pro: Existing skills – both data center and end user
con: Restricted feature set
con: Scalability tied to infrastructure limits?
big data
pro: Built in scalability
pro: Feature rich
pro: Curb appeal (cutting edge technology)
con: Developer centric (high skill requirement)
con: Unpredictable cost (based on storage and query consumption)
* Little big data can be easily implemented in the cloud and will provide predictable low cost when implemented on virtual hardware
The self-service data lake
One easy-to-understand concept from big data is the “data lake”. In very simple terms, a data lake is just a place to store data until you know how you want to use it. The important attributes of a data lake are:
1. Data is loaded “as-is”. No pre-determination is made about how the data will be used.
2. Data from the lake may be re-organized and moved into data marts in order to provide answers to specific reporting needs.
3. A great deal of analysis and reporting can be carried out directly from the data lake with the latest versions of Microsoft Excel and tools such as Power BI.
Within little big data, the self-administered, self-service data lake represents an outstanding win.
Keep it simple, don’t try to do everything
The big data version of a data lake would include the following types of data:
• Tabular data from sources such as csv files, databases, OData sources etc.
• Unstructured text such as email, Facebook posts and log files
• Complex object data such as JSON and XML
• Images
In the little big data lake we work only with tabular data and some simple unstructured text files, because:
1. These represent 95% of the data that most people work with
2. There are great tools designed for analysts (rather than data scientists) that work well with this data.
How does this relate to “data warehousing”?
In traditional data warehousing, data is:
• Extracted from its source environment
• Transformed to the format needed by the data warehouse
• Loaded into the target structure

In a data lake, data is:
• Extracted from its source environment
• Loaded “as is” into the lake
• Transformed as needed into data marts
[Diagram: data warehouse vs. data lake - in a data warehouse, source data is transformed before loading; in a data lake, source data is replicated as-is and transformed later into data marts.]
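The extract/transform ordering above can be sketched in a few lines of Python. This is an illustration only, using SQLite as a lightweight stand-in for SQL Server; the sample data, table and column names are all invented for the example:

```python
import csv
import io
import sqlite3

# Hypothetical sample source data (CSV text standing in for a source file).
SOURCE_CSV = "region,sales\nNorth,100\nSouth,250\n"

db = sqlite3.connect(":memory:")  # stand-in for a SQL Server instance

# --- Traditional ETL: transform BEFORE loading ---
db.execute("CREATE TABLE warehouse_sales (region TEXT, sales_usd REAL)")
for row in csv.DictReader(io.StringIO(SOURCE_CSV)):
    # Schema and typing decisions are made up front, at load time.
    db.execute("INSERT INTO warehouse_sales VALUES (?, ?)",
               (row["region"].upper(), float(row["sales"])))

# --- Data lake (extract, load, transform): load "as is" ---
db.execute("CREATE TABLE lake_sales (region TEXT, sales TEXT)")
for row in csv.DictReader(io.StringIO(SOURCE_CSV)):
    # No pre-determination: everything is stored as raw text.
    db.execute("INSERT INTO lake_sales VALUES (?, ?)",
               (row["region"], row["sales"]))

# A "data mart" is carved out later, once a reporting need is known.
db.execute("""CREATE VIEW mart_sales AS
              SELECT upper(region) AS region,
                     CAST(sales AS REAL) AS sales_usd
              FROM lake_sales""")

print(db.execute("SELECT * FROM mart_sales").fetchall())
```

Note that both paths end in the same queryable shape; the lake simply defers the transformation until the question is asked.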
[Screenshot: “Place a file here” → “File detected…” → “Your table appears here”]
NoETL import service
The NoETL import service empowers end users to collect and organize the data in their personal lake.
NoETL features
• The end user controls the organization of the data lake through a simple folder hierarchy
• Files placed into this folder hierarchy are automatically imported in near real-time
• Data is replicated into SQL Server, ensuring a robust, scalable, high-performance data store
• Minimal IT support is required to set up the SQL Server instance and install the import service
• Data from the lake can be easily accessed using the data connectivity features in Microsoft Excel
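As a rough sketch of how a folder-driven import like this could work - not the actual NoETL implementation; the table-naming rule, folder layout, and use of SQLite in place of SQL Server are all assumptions for the example:

```python
import csv
import sqlite3
from pathlib import Path

def import_lake(lake_root: Path, db: sqlite3.Connection) -> None:
    """Scan the lake folder hierarchy once and load every CSV as-is.

    The folder path defines the table name, e.g.
    <root>/finance/2016/invoices.csv -> table "finance_2016".
    A real service would watch the folders for new files in near
    real-time and replicate into SQL Server; this sketch polls once.
    """
    for csv_path in lake_root.rglob("*.csv"):
        parts = csv_path.relative_to(lake_root).parent.parts
        table = "_".join(parts) or "root"
        with csv_path.open(newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Every column stays text: no up-front ETL decisions.
            cols = ", ".join(f'"{c}" TEXT' for c in header)
            db.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            placeholders = ", ".join("?" for _ in header)
            db.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    db.commit()
```

The end user never writes SQL or mapping definitions; dropping a file into the right folder is the whole administration step.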
little big data reporting options
Hierarchy of tools depending on skill level
1. Plain old Excel pivot tables
2. Excel 2016 Get &amp; Transform (formerly Power Query)
3. PowerPivot
4. Power BI