A “little big data” solution from WorldDataOnline…
NoETL – a low cost, self-administered, self-service data lake
big data
The software industry is abuzz with “big data” at the moment: talk of petabyte-sized databases, structured and unstructured data, Hadoop, MapReduce, U-SQL. These new technologies have in turn given rise to a new field, data science. Data scientists are the new rock stars of the software industry; within your IT department they are the custodians and gatekeepers of this new universe.
Big data projects are - to state the blindingly obvious - time-consuming, expensive and complex. As a line-of-business manager or analyst you may be waiting a long time to see any benefits trickle down to your domain. If you own a small or medium-sized enterprise you will probably assume this technology is out of reach.
This is where WorldDataOnline comes in. We take elements of big data technology and make them accessible to people who don’t have the skills of a data scientist, or the budget of a Fortune 100 company.
We call this “little big data”.
little big data
WorldDataOnline’s approach is to deliver a predictable, low cost by utilizing existing infrastructure and software licenses. This contrasts with the big data vendors’ cloud pricing, which is complex and unpredictable:
little big data
pro: Secure – you own the infrastructure on which your data sits*
pro: Predictable cost
pro: Existing skills – both data center and end user
con: Restricted feature set
con: Scalability tied to infrastructure limits?
big data
pro: Built in scalability
pro: Feature rich
pro: Curb appeal (cutting edge technology)
con: Developer centric (high skill requirement)
con: Unpredictable cost (based on storage and query consumption)
* Little big data can be easily implemented in the cloud and will provide predictable low cost when implemented on virtual hardware
The self-service data lake
One easy-to-understand concept from big data is the “data lake”. In very simple terms, a data lake is just a place to store data until you know how you want to use it. The important attributes of a data lake are:
1. Data is loaded “as-is”. No pre-determination is made about how the data will be used.
2. Data from the lake may be re-organized and moved into data marts in order to provide answers to specific reporting needs.
3. A great deal of analysis and reporting can be carried out directly from the data lake with the latest versions of Microsoft Excel and tools such as Power BI.
Within little big data, the self-administered, self-service data lake represents an outstanding win.
Keep it simple, don’t try to do everything
The big data version of a data lake would include the following types of data:
• Tabular data from sources such as csv files, databases, OData sources etc.
• Unstructured text such as email, Facebook posts and log files
• Complex object data such as JSON and XML
• Images
In the little big data lake we work only with tabular data and some simple unstructured text files, because:
1. These represent 95% of the data that most people work with
2. There are great tools designed for analysts (rather than data scientists) that work well with this data.
How does this relate to “data warehousing”?
In traditional data warehousing, data is:
• Extracted from its source environment
• Transformed to the format needed by the data warehouse
• Loaded into the target structure

In a data lake, data is:
• Extracted from its source environment
• Loaded “as is” into the lake
• Transformed as needed into data marts
[Diagram: data warehouse vs. data lake - in a data warehouse, source data is transformed before loading; in a data lake, source data is replicated as-is and transformed later into data marts.]
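The extract/transform ordering above can be sketched in a few lines of Python. This is an illustration only, using SQLite as a lightweight stand-in for SQL Server; the sample data, table and column names are all invented for the example:

```python
import csv
import io
import sqlite3

# Hypothetical sample source data (CSV text standing in for a source file).
SOURCE_CSV = "region,sales\nNorth,100\nSouth,250\n"

db = sqlite3.connect(":memory:")  # stand-in for a SQL Server instance

# --- Traditional ETL: transform BEFORE loading ---
db.execute("CREATE TABLE warehouse_sales (region TEXT, sales_usd REAL)")
for row in csv.DictReader(io.StringIO(SOURCE_CSV)):
    # Schema and typing decisions are made up front, at load time.
    db.execute("INSERT INTO warehouse_sales VALUES (?, ?)",
               (row["region"].upper(), float(row["sales"])))

# --- Data lake (extract, load, transform): load "as is" ---
db.execute("CREATE TABLE lake_sales (region TEXT, sales TEXT)")
for row in csv.DictReader(io.StringIO(SOURCE_CSV)):
    # No pre-determination: everything is stored as raw text.
    db.execute("INSERT INTO lake_sales VALUES (?, ?)",
               (row["region"], row["sales"]))

# A "data mart" is carved out later, once a reporting need is known.
db.execute("""CREATE VIEW mart_sales AS
              SELECT upper(region) AS region,
                     CAST(sales AS REAL) AS sales_usd
              FROM lake_sales""")

print(db.execute("SELECT * FROM mart_sales").fetchall())
```

Note that both paths end in the same queryable shape; the lake simply defers the transformation until the question is asked.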
[Screenshot: “Place a file here” → “File detected…” → “Your table appears here”]
NoETL import service
The NoETL import service empowers end users to collect and organize the data in their personal lake.
NoETL features
• The end user controls the organization of the data lake through a simple folder hierarchy
• Files placed into this folder hierarchy are automatically imported in near real-time
• Data is replicated into SQL Server, ensuring a robust, scalable, high-performance data store
• Minimal IT support is required to set up the SQL Server instance and install the import service
• Data from the lake can be easily accessed using the data connectivity features in Microsoft Excel
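As a rough sketch of how a folder-driven import like this could work - not the actual NoETL implementation; the table-naming rule, folder layout, and use of SQLite in place of SQL Server are all assumptions for the example:

```python
import csv
import sqlite3
from pathlib import Path

def import_lake(lake_root: Path, db: sqlite3.Connection) -> None:
    """Scan the lake folder hierarchy once and load every CSV as-is.

    The folder path defines the table name, e.g.
    <root>/finance/2016/invoices.csv -> table "finance_2016".
    A real service would watch the folders for new files in near
    real-time and replicate into SQL Server; this sketch polls once.
    """
    for csv_path in lake_root.rglob("*.csv"):
        parts = csv_path.relative_to(lake_root).parent.parts
        table = "_".join(parts) or "root"
        with csv_path.open(newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Every column stays text: no up-front ETL decisions.
            cols = ", ".join(f'"{c}" TEXT' for c in header)
            db.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            placeholders = ", ".join("?" for _ in header)
            db.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    db.commit()
```

The end user never writes SQL or mapping definitions; dropping a file into the right folder is the whole administration step.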
little big data reporting options
Hierarchy of tools depending on skill level
1. Plain old Excel pivot tables
2. Excel 2016 Get &amp; Transform (formerly Power Query)
3. PowerPivot
4. Power BI