

Big Data in Azure: Demo and Hands-On Labs

Pre-requisites

You will need:
- An Azure subscription with available HDInsight cores
- Power BI / Excel 2013

  o Download the Power Query add-in; choose 32-bit or 64-bit to match your Office installation: http://www.microsoft.com/en-us/download/details.aspx?id=39379&CorrelationId=d8002172-0438-4ef5-b0fa-e635f8f17251

  o Enable PowerPivot and Power View in your Excel options (COM add-ins).
- Download the HOL labs from https://github.com/Azure-Readiness/CloudDataCamp. For April 30 only, use https://github.com/cindygross/CloudDataCamp instead. If you already have GitHub installed, choose “Clone in Desktop”; otherwise choose “Download ZIP” and unzip the files. Save the location to a Notepad file.

- Data movement – one or both:
  o GUI: Install CloudXplorer from http://clumsyleaf.com/products/downloads. I will be using v3; you can download the v3 trial or the free v1 (with fewer features).
  o Command line: Install AzCopy from http://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/. Save the install location as you will need it later; it defaults to C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy (without the "x86" on a 32-bit OS).

- Install SQL 2014 SSMS: http://www.microsoft.com/en-gb/download/details.aspx?id=42299
- Today's slides: http://tinyurl.com/lxutdd4

Goal

Understand how to use some of the common pieces of an Azure-hosted Big Data and Analytics solution. These components are often part of an Internet of Things solution, which is a common Big Data and Analytics scenario.

At the end of this hands-on lab you will have:
  o Created an Azure storage account and container, then loaded data to it. You will also use this account for storage of data generated in other steps.
  o Created a Hadoop on Azure instance (HDInsight), added structure (tables) stored in HCatalog, and queried the data on the storage account using Hive.
  o Connected an AzureML experiment to Hive – Hadoop is “just another data source”.
  o Created and run an Azure Stream Analytics job that reads data generated on the fly from your laptop via a Service Bus Event Hub and outputs aggregated data to a SQL Azure database.
  o Used Power BI to visualize and present the data.

Labs

We're going to use a modified version of the Cloud Data Camp hands-on labs. Those labs have screenshots and more detailed instructions than what I have below; please refer to the original docs if you need more detailed steps.

Guidelines

- Many names within Azure have to be globally unique; try prefixing services with your initials or company name.
- Some service names must be all lower case, so it's easier to make all names lower case. For this lab, prefix all names with the same identifier. Open Notepad and type in the prefix you will use.
- Let's pick a single data center and use it for all our work (though some services are not yet available in all regions). For Montreal let's choose East US. Note that this is NOT the same as East US 2.


I suggest you start a single file in a simple editor like Notepad and keep all the links, names, and passwords/keys we use in that central location for the duration of the labs.

HOL1: Intro to the Azure Portal

The detailed lab file is in the CloudDataCamp download under docs, or you can get it here: https://github.com/Azure-Readiness/CloudDataCamp/blob/master/HOL/HOL1-IntroductionToAzure.md

In Lab 1 we’ll create a storage account and load data with AzCopy and/or CloudXplorer. Then we’ll create a SQL Database, open the firewall to our client machine, and create some SQL tables for structured data. Next we’ll generate some loosely structured data, simulating a “thing” or device that generates small chunks of data.

Portals

- Production management portal: https://manage.windowsazure.com/ - log in and choose your subscription
- Preview portal: https://portal.azure.com/ - log in and choose your subscription

Storage Account (creation takes 2-3 minutes)

In the Preview portal https://portal.azure.com/ (resource groupings are not available in the management portal) choose to create a new storage account: New -> Data + Storage -> Storage.
  o Name: Your prefix + storage. Mine is bddragonstorage.
  o Pricing: Locally Redundant. <Select>
  o Resource Group: New -> Your prefix + rg. Mine is bddragonrg.
  o Subscription: use one subscription for all steps!
  o Location: East US
  o Diagnostics: Not configured
  o Pin to Startboard: Yes
  o <Create>

Still in the preview portal, add a container to the storage account:
  o Name: data (this name is required due to the way the lab is set up)
  o Access type: Private

Click on Settings -> Keys in the storage account and copy the name and primary key to your Notepad file.

Ingest data

Either AzCopy

Open a command prompt and change directories (without the "x86" on a 32-bit OS):

cd c:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy

Use your actual local directory, storage account name, and storage account key:

azcopy /Source:"{your path}\CloudDataCamp\data\" /Dest:https://[storage account name].blob.core.windows.net/data/input /DestKey:[storage account key] /S
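For example, with the lab files extracted to C:\CloudDataCamp and the bddragonstorage account used earlier in this doc, the edited command would look roughly like this (the path and key below are placeholders – substitute your own values from Notepad):

REM Hypothetical example – replace the path, account name, and key with your own
azcopy /Source:"C:\CloudDataCamp\data\" /Dest:https://bddragonstorage.blob.core.windows.net/data/input /DestKey:<your primary key from Notepad> /S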

If you installed CloudXplorer you can add the storage account and key on the “accounts” button then view the files there.

Note that you can also drag/drop small files from your local File Explorer to CloudXplorer, but AzCopy is better for larger files or automated processes.

Or CloudXplorer

- Add your storage account
- Choose to add a “folder” called input to the data container
- Drag the file from {your path}\CloudDataCamp\data to the input “directory” under the data container on your account

Extra Credit

- Try both AzCopy and CloudXplorer


- Load the data from Bill's talk yesterday to a DIFFERENT FOLDER. Create tables that refer to it, then query the tables. Since Hive points to directories and not to single files, each type of data must be in its own folder!

Azure SQL DB

Create a new SQL database

- In the preview portal https://portal.azure.com/ choose New -> Data + Storage -> SQL Database
- Name: cdcasa (this is unique within your server and is hardcoded for the demo)
- Server: “Create a new server”
  o Name: Your prefix + SQL. Mine is bddragonsql
  o Server Admin Login: Something you will remember; put it in your Notepad file
  o Password: Something you will remember; put it in your Notepad file. If you are going to use the same password for other services, make it 10+ characters with upper/lower case, a number, and a special character.
  o Location: same as the rest (East US for Montreal)
  o Allow Azure Services to Access Server: Yes, check the box! (Very important!)
  o OK
- Select Source: Blank Database
- Pricing Tier: Standard (cheapest is fine for the demo)
- Optional Configuration: leave at defaults
- Resource Group: the one we created above
- Subscription: the same one we've been using
- Choose to add it to the Startboard.
- <Create> (wait 3-4 minutes)

Configure the firewall

- Open the non-preview management portal https://manage.windowsazure.com/.
- Click on SQL Databases in the left pane.
- Highlight cdcasa and then choose Servers from the upper menu (not the database, the server).
- Click on the server you created earlier (bddragonsql is mine) and go to Configure.
- Where it says “Current Client IP Address”, choose “add to the allowed IP addresses”.
- Double-check that “Windows Azure Services” is set to Yes.
- Choose Save in the bottom bar.
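If you want a quick connectivity check before opening SSMS and you have the SQL client tools installed, sqlcmd works too. This is just a sketch – the server name, login, and password below are placeholders for the values in your Notepad file:

REM Hypothetical values – substitute your own server, login, and password
sqlcmd -S bddragonsql.database.windows.net -d cdcasa -U <admin login> -P <password> -Q "SELECT @@VERSION;"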

Create SQL schemas for ASA

- Open SQL Server Management Studio (SSMS). Note that this can optionally be done from Visual Studio 2013 with Update 4 or later.
  o Server Type: Database Engine
  o Server Name: {yourSQLserver}.database.windows.net. For example, mine is bddragonsql.database.windows.net.
  o Authentication: SQL Server Authentication (note: in the real world, never log in with your sysadmin account for dbo activities)
     Login: the one you created earlier
     Password: the one you created earlier
- Choose the cdcasa database from the left menu (Object Explorer).
- Ctrl-O to open 1_CreateSQLTable.sql from C:\{your directory}\CloudDataCamp\scripts\ASA.
- Verify you are in the cdcasa database (there's a dropdown box over Object Explorer).
- Hit F5 or the Execute button to run it. Note: the table will be populated later by ASA.
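For orientation, the AvgReadings table created by 1_CreateSQLTable.sql has to match the columns the Stream Analytics query outputs later in HOL9. A hypothetical sketch is below; the data types are assumptions and the script in the download is authoritative:

-- Sketch only: the authoritative definition is in CloudDataCamp\scripts\ASA\1_CreateSQLTable.sql
CREATE TABLE dbo.AvgReadings (
    WinStartTime DATETIME2,     -- start of the 1-minute tumbling window
    WinEndTime   DATETIME2,     -- end of the window
    [Type]       NVARCHAR(50),  -- Temperature, Humidity, Energy, or Light
    RoomNumber   INT,
    AvgReading   FLOAT,
    EventCount   BIGINT
);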

Create Event Hub for Data Ingestion

- Open the non-preview management portal https://manage.windowsazure.com/
- Click on Service Bus in the left menu


- Choose New -> App Services -> Service Bus -> Event Hub -> Custom Create
  o Event Hub Name: Your prefix + eh. Mine is bddragoneh
  o Region: The same one we've been using
  o Namespace: Create a new namespace
  o Namespace Name: Your prefix + eh + -ns (it will default to this)
  o Choose Next using the arrow on the bottom right
  o Partition Count: 8
  o Message Retention: 2
  o Choose the checkmark to finish

Configure shared access
  o Click on the new Service Bus namespace
  o Choose Event Hubs from the top menu
  o Click on the Event Hub
  o Choose Configure from the top menu
  o In the “shared access policies” section, add a policy
     Name: mypolicy
     Permissions: send, listen
     Choose Save at the bottom
  o Copy the policy name and its primary key to your Notepad file.

Generate Data (DeviceSender)

- Open a command prompt
- cd {your directory}\CloudDataCamp\tools\DeviceSender
- Substitute your actual values into the command below:

DeviceSender GenerateDataToEventHub -n <eventHubNamespace> -e <eventHubName> -p <policyName> -k <policyKey>
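For instance, with the bddragon prefix used throughout this doc, the edited command would look roughly like this (the policy key is a placeholder – use the one you copied to Notepad):

REM Hypothetical example – substitute your own namespace, hub name, and key
DeviceSender GenerateDataToEventHub -n bddragoneh-ns -e bddragoneh -p mypolicy -k <primary key from Notepad>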

Paste the edited command into the command prompt and hit enter to execute it. You should see a series of “Messages fired onto the eventhub!” messages indicating data is being sent from your machine to Azure.

Do NOT close the window. This data will be used later.

HOL9: Azure Stream Analytics

Create Streaming Job

- Open http://manage.windowsazure.com
- Click on New -> Data Services -> Stream Analytics -> Quick Create
  o Job Name: prefix + stream
  o Region: (East US isn't available yet – use East US 2)
  o Regional Monitoring Storage Account: Create new
  o New Storage Account Name: prefix + streammonitor

Configure Streaming Job

Inputs

- Click on the job you just created, choose Inputs from the top ribbon, and click “Add Input”.
- Choose “Data stream” then “Event Hub”.
- Event Hub Settings:
  o Input Alias: MyEventHubStream (must be exactly this)
  o Subscription: Current
  o Namespace: The one you created in the Event Hub step (prefix + eh + -ns)
  o Event Hub Name: The one you created
  o Policy: mypolicy


  o Consumer Group: $Default
- Serialization settings:
  o Format: JSON
  o Encoding: UTF8

Output

- In the streaming job, choose Outputs from the upper ribbon and “Add Output”
- Choose SQL Database
- SQL Database Settings:
  o Output alias: output
  o Subscription: Current
  o SQL Database: cdcasa
  o Server Name: the one you created earlier, prefix + sql
  o Username/Password: The SQL admin account you created
  o Table: AvgReadings

Query

- Choose Query from the upper ribbon
- Paste in the following and then SAVE:

SELECT DateAdd(minute,-1,System.TimeStamp) AS WinStartTime, System.TimeStamp AS WinEndTime,
    Type = 'Temperature', RoomNumber, Avg(Temperature) AS AvgReading, Count(*) AS EventCount
FROM MyEventHubStream
WHERE Temperature IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type
UNION
SELECT DateAdd(minute,-1,System.TimeStamp) AS WinStartTime, System.TimeStamp AS WinEndTime,
    Type = 'Humidity', RoomNumber, Avg(Humidity) AS AvgReading, Count(*) AS EventCount
FROM MyEventHubStream
WHERE Humidity IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type
UNION
SELECT DateAdd(minute,-1,System.TimeStamp) AS WinStartTime, System.TimeStamp AS WinEndTime,
    Type = 'Energy', RoomNumber, Avg(Kwh) AS AvgReading, Count(*) AS EventCount
FROM MyEventHubStream
WHERE Kwh IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type
UNION
SELECT DateAdd(minute,-1,System.TimeStamp) AS WinStartTime, System.TimeStamp AS WinEndTime,
    Type = 'Light', RoomNumber, Avg(Lumens) AS AvgReading, Count(*) AS EventCount
FROM MyEventHubStream
WHERE Lumens IS NOT NULL
GROUP BY TumblingWindow(minute, 1), RoomNumber, Type

Start Streaming Job

- Click on Start in the bottom ribbon; choose the default (Job Start Time)
- Verify DeviceSender is running (or restart it)


View Data in SQL

After a few minutes you can query the SQL database from SSMS and see the data in AvgReadings. Stop the DeviceSender app if it's still running.
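A quick check from SSMS might look like this (a sketch – adjust as you like):

-- Most recent aggregated windows written by the streaming job
SELECT TOP 20 WinStartTime, WinEndTime, [Type], RoomNumber, AvgReading, EventCount
FROM dbo.AvgReadings
ORDER BY WinEndTime DESC;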

You have successfully ingested data from a “thing” (your laptop) to Azure! You pushed that data through a query (streaming) and sent the aggregated output to a destination in the cloud – Azure SQL Database.

---- Back to SLIDES -----

HOL2: Intro to HDInsight

In lab 2 we create a Hadoop cluster in Azure using the HDInsight service. Then we RDP to the head node and see that it's truly Apache open-source Hadoop running on Windows. HDInsight is also available on Linux, but we are using Windows for the lab.

Create an HDInsight Hadoop cluster

- Log in to https://manage.windowsazure.com/
- Choose HDInsight (the elephant) from the left menu
- Choose New -> Data Services -> HDInsight -> Custom Create
- Page 1 / Cluster Details
  o Cluster Name: Your prefix + hdi
  o Cluster Type: Hadoop
  o Operating System: Windows
  o Version: default

- Page 2 / Configure Cluster
  o Data Nodes: 1
  o Region: the same region you've been using; the storage account must be in the same region
  o Head Node Size: default A3
  o Data Node Size: default A3

- Page 3 / Configure Cluster User
  o Name: Your prefix + admin (you can use the same as the SQL db for the demo, but don't do that in production)
  o Password: (you can use the same as the SQL db for the demo, but don't do that in production)
  o Enable the remote desktop for cluster: Yes (you will generally choose No)
     RDP User Name: cluster name + 1 (don't do this in production)
     RDP Password: (you can use the same as the SQL db for the demo, but don't do that in production)
     Expires On: tomorrow
  o Enter the Hive/Oozie Metastore: No (you will generally choose Yes for production)
- Page 4 / Storage Account

  o Storage Account: Use existing storage
  o Account Name: the storage account we created earlier
  o Default Container: data
  o Additional Storage Accounts: 0

- Page 5 / Script Actions
  o Click the arrow to create the cluster; wait about 15 minutes

Use the Hadoop Distributed File System (HDFS)

- RDP to the head node
- Get a listing of files:

  hadoop fs -ls /


  hadoop fs -ls /example/data
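Because the cluster's default container is the data container we created in HOL1, you should also be able to list the files AzCopy uploaded. This assumes you used the data/input layout from HOL1:

  hadoop fs -ls /input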

---- Back to SLIDES -----

HOL3: HDI Batch Analysis and Power BI

We'll do some batch analysis and create aggregations. Then we will view the data in Power BI.

Hive

- Navigate to CloudDataCamp\scripts\Hive in your file explorer.
- In the Azure management portal, click on your HDInsight instance. Click on Query Console at the bottom of the screen to open a query window.
- Log in with the cluster credentials (not the RDP credentials). Choose the Hive editor.

Create an External Table DeviceReadings

Open CloudDataCamp\scripts\Hive\1_CreateDeviceReadings.txt in a text editor like Notepad. Update the location: replace <storage account name> with the storage account you created in Hands-On Lab 1 (remove the brackets). Paste the edited query into the Hive editor and hit Submit to create a Hive table.

LOCATION 'wasb://data@<storage account name>.blob.core.windows.net/input';
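The full statement is in 1_CreateDeviceReadings.txt; its shape is a CREATE EXTERNAL TABLE over the delimited files in that folder. The sketch below is hypothetical – the real column names, types, and delimiter come from the script itself – and it uses the bddragonstorage account as the example value:

-- Hypothetical sketch only; 1_CreateDeviceReadings.txt in the lab download is authoritative
CREATE EXTERNAL TABLE DeviceReadings (
    deviceId STRING,            -- assumed; the lab later queries deviceId from this table
    deviceType STRING,          -- assumed: Temperature, Humidity, Energy, Light
    roomNumber INT,             -- assumed
    readingDateTime STRING,     -- assumed
    reading DOUBLE              -- assumed
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://data@bddragonstorage.blob.core.windows.net/input';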

View the job output – it opens in a new window. For a create schema statement you want to verify there are no errors (the messages about logging are not errors). It will show the time taken.

Query the table

Copy the below query and run it from the Hive editor:

SELECT deviceId FROM DeviceReadings LIMIT 100;

View the job output.

Create External Tables for Averages

Create and populate tables that store aggregates.

- Open CloudDataCamp\scripts\Hive\2_CreateAverageReadingByType.txt. Edit the location and run it from the Hive editor.
- Repeat, changing the location and executing the remaining create/insert scripts (see the sketch after this list for the general pattern):
  o CloudDataCamp\scripts\Hive\3_CreateAverageReadingByMinute.txt
  o CloudDataCamp\scripts\Hive\4_CreateMaximumReading.txt
  o CloudDataCamp\scripts\Hive\5_CreateMinimumReading.txt
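Each of these scripts pairs a CREATE EXTERNAL TABLE pointing at an output folder with an INSERT that populates it. A rough, hypothetical sketch for AverageReadingByMinute is below; the real script in the download is authoritative, and the column names, types, and date truncation here are assumptions:

-- Sketch only: see 3_CreateAverageReadingByMinute.txt for the real definition
CREATE EXTERNAL TABLE AverageReadingByMinute (
    deviceType STRING,
    readingDateTime STRING,
    roomNumber INT,
    reading DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://data@<storage account name>.blob.core.windows.net/output/averageReadingByMinute';

INSERT OVERWRITE TABLE AverageReadingByMinute
SELECT deviceType,
       substr(readingDateTime, 1, 16),   -- assumed: truncate the timestamp to the minute
       roomNumber,
       avg(reading)
FROM DeviceReadings
GROUP BY deviceType, substr(readingDateTime, 1, 16), roomNumber;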

File Browser

The location of the data was specified in the table creation statements using LOCATION. The browser shows data on the default storage account for the cluster.

View the original and the aggregated data in the File Browser tab of the console. If you have CloudXplorer, view the data in CloudXplorer (hit refresh).

Extra Credit

- Write SELECT statements to view each table's dataset. Write more complex queries.
- Try: show tables; describe formatted AverageReadingByType;
- Connect to Hive from PowerPivot using the Microsoft Hive ODBC driver and a DSN


AzureML

Connect to Hadoop from AzureML. Note that this is not in the CloudDataCamp labs; HOL10 in that series points to a flat file, whereas here we use a Hive query.

From manage.windowsazure.com, click on AzureML and choose to sign in to your AzureML studio. Choose a new blank experiment. Drag a Reader from the left to the designer. Highlight the Reader and view the options you have for connecting.

  o Data source: Hive Query
  o Hive database query: SELECT * FROM AverageReadingByType
  o HCatalog server URI: http://{yourhdicluster}.azurehdinsight.net
  o Hadoop user account name: your cluster admin (not RDP) account
  o Hadoop user account password: your password
  o Location of output data: Azure
  o Azure storage account name: {your storage account}
  o Azure storage key: {your key}
  o Azure container name: data

- Choose Save and Run from the bottom ribbon
- When it completes, view the results dataset by right-clicking on the output circle and choosing either Visualize or Download

Reference: https://andersspur.wordpress.com/2014/10/10/use-hive-to-read-data-into-azure-ml/

Cluster Cleanup

At this point we have new datasets created based on aggregates of our first, static data file. We could either leave the cluster up and query it directly from tools like Power BI using Hive, or drop the cluster and access the data directly in the flat files. We'll use the latter – flat files. This emphasizes that these are on-demand clusters; you don't need to pay to keep them up all the time.

Drop the HDInsight cluster.

Power Query

- Open a new workbook in Excel 2013. Verify you installed and enabled Power Query.
- Click on Power Query. Choose From Azure -> From Microsoft Azure HDInsight. Enter the storage account you created earlier and the key you saved in Notepad.
- In Navigator, expand your storage account and double-click on the container named data to open the query editor.
- Find the “Folder Path” column on the far right and choose the dropdown arrow. Enter output in the search box and you'll see the 'directories' and files we have created today. If you chose OK, in “Applied Steps” on the far right click the red X next to “Filtered Rows” to remove this filter.
- Create a new filter on averageReadingByMinute – this will show a single row (because we had a small amount of data and only ran the insert once, there is only one file in that directory). Choose OK.
- Scroll back to the left and in the “Content” column click on “Binary” to import the file.
- Name the columns: DeviceType, ReadingDateTime, RoomNumber, Reading.
- Choose “Close & Load” from the upper left to create a new sheet called AverageReadingByMinute. Save the workbook to your desktop.

Power View

Go to the workbook created in the last step. Choose the Insert tab at the top, then choose Power View in the middle of the ribbon. It is populated with the table from the worksheet – you can see the columns in “Power View Fields” on the right.


Note that the numeric fields have a sum figure next to them. We don’t want to summarize room number, so go to the bottom of the “Power View Fields” in the “Fields” section and choose “Do Not Summarize” for RoomNumber.

Click inside the table in the report designer pane (left). On the Design tab in the ribbon (to the right of “Power View”), choose “Other Chart” -> “Line”.

In the Filters section choose Chart. Expand DeviceType and put a check next to energy. Edit the title to “Energy Reading By Minute”. Save the workbook and close it.

You have now done distributed processing with Hadoop on Azure (HDInsight) utilizing the power of WASB to access that same data outside of Hadoop. You then used Power BI to discover and visualize that data, opening up the possibilities for new data-driven insights.

Cleanup

- Verify you have dropped your HDInsight cluster – you are charged for its existence whether you are running anything or not.
- Stop the DeviceSender app if it's still running.
- Drop the other resources we've created – they have minimal costs if you aren't actively using them.
  o Streaming Job
  o Event Hub (under Service Bus)
  o Service Bus namespace
  o Storage
  o SQL Azure Database cdcasa (and optionally the hosting SQL Server)
  o AzureML Experiments
  o Resource Group

- Optionally delete the Excel workbook.
- Optionally remove some or all files and tools from this workshop:
  o CloudDataCamp folder and all files
  o CloudXplorer
  o AzCopy
  o DeviceSender