DataSift User Guide

June 2014


Copyright Statement

Copyright © 2000-2014 DataSift. All Rights Reserved.

This document, as well as the software described herein, is protected by copyright law and is proprietary to DataSift.

Disclaimer

The information in this document is subject to change without notice and should not be construed as a commitment by DataSift. DataSift assumes no responsibility for any errors that may appear in this document.

Trademarks and Patents

DataSift and the DataSift logo are trademarks of DataSift and may be registered in some jurisdictions. All other marks are the trademarks of their respective owners.

Because of the nature of this material, numerous hardware and software products are mentioned by name. In most, if not all cases, these product names are claimed as trademarks by the companies that manufacture the products. It is not the intent of DataSift to claim these names or trademarks as its own.


Contents

1 DataSift Overview
   Social Media
   DataSift Platform
   Stage One – Aggregation
   Stage Two – Processing
   Stage Three – Delivery
   Historic Streams
   Billing
   Registering an Account
   Web Application Interface

2 Configuring Sources
   Finding Sources
   Source Types
   Viewing Source Information
   Activating Sources
   Source Activation Impact

3 Configuring Streams in Query Builder
   Query Builder
   Enabling Sources
   Creating New Streams
   Creating Simple Filters
   Reviewing Filter Cost
   Previewing Streams
   Creating Multiple Filter Conditions
   Using Logical Operators
   Embedding & Customizing Query Builder

4 Analyzing Interactions
   Displaying Interaction Details
   Analyzing Interaction Details – Web Application
   Analyzing Interaction Details – API

5 Writing Simple Filters in CSDL
   Filtering Condition Elements
   Selecting Targets
   Selecting Operators
   Using Multiple Conditions
   Hints & Tips

6 Configuring Streams – CSDL Web Application
   Enabling Sources
   Writing Filters with CSDL Editor
   Validating Filters
   Compiling Filters
   Previewing Streams

7 Configuring Categorization
   Configuring Tagging
   Configuring Tag Namespaces
   Configuring Scoring
   Configuring Cascading
   Including Library Classifiers
   Billing

8 Configuring Streams – API
   Enabling Sources
   Making API Calls
   Validating Filters
   Compiling Filters
   Previewing Streams

9 Configuring Stream Recording
   Data Destinations
   Starting Record Tasks
   Viewing Record Tasks
   Pausing Record Tasks
   Stopping Record Tasks
   Exporting Record Task Data
   Deleting Record Tasks

10 Configuring Historic Previews
   Historic Archive
   Historic Preview
   Report Types
   Historic Preview Billing
   Configuring Historic Preview
   Downloading Reports

11 Configuring Historic Stream Recording
   Historic Tasks
   Data Destinations
   Starting Historic Tasks
   Viewing Historic Tasks
   Pausing Historic Tasks
   Stopping Historic Tasks
   Deleting Historic Tasks

12 Configuring Destinations – Amazon S3
   Configuring Amazon S3 for DataSift
   Configuring DataSift for Amazon S3

13 Configuring Destinations – Google BigQuery
   Google BigQuery
   Configuring Google BigQuery for DataSift
   Configuring DataSift Web Application for Google BigQuery
   Configuring DataSift API for Google BigQuery
   Querying Data in BigQuery
   Deleting Google Cloud Projects

14 Configuring Push Delivery – API
   Push Delivery Workflow
   Locating API Credentials
   Validating Push Destinations
   Creating Push Subscriptions
   Checking Push Subscriptions
   Retrieving Push Subscription Logs
   Pausing Push Subscriptions
   Resuming Push Subscriptions
   Stopping Push Subscriptions
   Deleting Push Subscriptions

15 Configuring Destinations – MySQL
   Configuring Amazon RDS
   Configuring Databases
   Configuring Database Tables
   Configuring Mapping
   Configuring DataSift Destination (API)
   Configuring DataSift Destination (Web Application)

16 Monitor Usage & Billing
   Subscription Model
   Modifying Billing Account Details
   Viewing Usage & Balance
   Setting Overage Limit
   Viewing Usage Statistics
   Viewing Current Streams
   Locating Pricing Information

17 Locating Help
   Locating Documentation
   Viewing Platform Status
   Viewing Known Issues
   Forum Discussions
   Subscribing to Updates
   Attending Workshops
   Submitting Support Requests
   Viewing Screencasts
   Attending Training



1 DataSift Overview

DataSift is a powerful cloud platform for extracting value from social networks, blogs and forums. It achieves this by capturing, augmenting, filtering and delivering social media interactions.

Social Media

Social media interactions come from a wide range of sources, including Twitter, Facebook, blogs, message boards, social news, comments and forums.

Every day, millions of new social interactions are created and shared across the networks. More than a billion people are sharing information, including comments on brands, products and services.

Figure 1: Social Media Interaction Figures

Consumers across all age groups spend more time on social networks than on any other category of site.

• 50% of people are recommending brands on social networks
• 90% of people are listening to recommendations on social networks
• 50% of people will see a company’s social presence before the corporate web site

Before the rise of social media, interactions were one-to-one or within small groups, whereas social interactions are typically many-to-many, with some users having considerable influence.

Social interactions are real time, come from many different sources and may be seen by millions of people.


Business Challenges

Many companies build a social network presence which is used for marketing and advertising. However, they may not have the ability to manage vast quantities of social network interactions which come from many different sources, have different formats and are presented without any context.

Social interactions require filtering and analysis before they can be used to inform business decisions.

Use Cases

Examples of how businesses can use social media data include:

• Social customer service
• Reputation management
  o Track conversations and content around a brand
  o Identify trends
• Competitive analysis
  o Measure share of social activity vs. the competition
• Marketing planning & measurement
  o Identify content which is generating most interest and engagement
• Identify social influencers
  o Identify and rank influencers of the brand and competing brands
• Identify social advocates


DataSift Platform

The DataSift platform uses a three-stage process to derive value from social media interactions.

1. Aggregate interactions from multiple sources

2. Process the interactions by normalizing the structure, adding meta-data and filtering based on user-defined parameters

3. Deliver a stream of filtered interactions to a destination

This allows relevant data, in a standard and enhanced format, to be integrated into products and applications for analysis. Streams can be delivered from live updates or from an archive of historic interactions.

Stage One – Aggregation

Interaction data is captured from a growing list of sources:

• Twitter
• IntenseDebate
• Instagram
• Topix
• Tumblr
• NewsCred
• IMDb
• Facebook
• Google+
• Reddit
• Videos
• Sina Weibo
• YouTube
• Wikipedia
• Blogs
• WordPress
• Bitly
• DailyMotion
• Boards

DataSift has agreements with many social networks to gather all public interactions. Billions of social interactions are captured at a rate of over 15,000 per second.

It is also possible to gather interactions from networks and sites which require a login. These are called Managed Sources.

Managed Sources

Companies often have a social media presence on networks such as Facebook and Google+ relating to their brand or products. To capture the interactions on these private pages and sites, the company provides DataSift with revocable access tokens. DataSift uses the tokens to collect interactions.

For simplicity of filtering, interactions from managed sources are aggregated with all other sources before augmentation and filtering. The number of managed sources is growing and includes the following:

• Facebook Pages
• Instagram
• Google+
• Yammer


Stage Two – Processing

Normalization

Normalization is the process of standardizing interaction format without losing any of the information unique to particular networks. This greatly simplifies the job of managing and analyzing the stream of filtered interactions.

Augmentation

Augmentation is the process of adding extra information to each interaction. The augmented interaction provides context which allows more sophisticated filtering and analysis.

• Demographic & Gender
  o Interactions are augmented with demographic and gender information, allowing companies to build a dynamic picture of a market segment.
    EXAMPLE: Starbucks demographic and gender analysis: http://demo.datasift.com/coffee/
• Klout
  o A Klout score is a measure of an author’s influence in the online community. The scale is from 1 to 100, with 100 being the most influential. Klout scores are provided by klout.com, which analyzes 400 signals from eight networks every day.
• Language
  o A language detector recognizes over 140 languages and provides an ISO 639-1 language code for interactions.
  o Detected languages are provided with a confidence rating percentage.
• Link
  o Links in social interactions are often shortened and may be redirected. The DataSift platform resolves links to their final destination.
• Salience & Sentiment
  o The salience augmentation lists the entities found in a post. Entities are typically proper nouns such as people, companies, products, and places.
  o Sentiment scoring rates the positive or negative assertions in a message.


Categorization

Categorization is the application of user-defined tags to interactions. These tags can be used when processing and analyzing the output stream. Tags form a hierarchical namespace; for example, tags could include "tag.device.name", "tag.device.type" and "tag.device.type.model".

Cumulative scores can be applied to tags. As interactions are matched by filter conditions, tag scores can be incremented or decremented.
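To illustrate (a minimal sketch; the tag names and conditions below are invented for this example and do not come from this guide), namespaced tags are declared in CSDL ahead of a return clause:

    // Hypothetical tag namespace for devices
    tag.device.type "mobile" { interaction.content contains_any "iPhone, Android" }
    tag.device.name "iPhone" { interaction.content contains "iPhone" }

    // Only interactions matching the return clause reach the output stream
    return { interaction.content contains "starbucks" }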

Filtering

Using more than 300 unique fields and advanced operators, one or more filters can be defined to generate streams of relevant interactions. Two methods are available for creating filters: a visual tool called Query Builder and a programming language called Curated Stream Definition Language (CSDL).

Query Builder

Query Builder is a web application allowing users to create and edit filters without having to learn CSDL. An instance of Query Builder is provided on the DataSift dashboard.

Figure 2: Query Builder Example


Curated Stream Definition Language (CSDL)

CSDL is a simple programming language which allows a filter to be created using targets, operators, arguments and logic. Custom tags and scores can be added to the interactions which are passed by the filter to the output stream.

The following CSDL has the same effect as the Query Builder example.

Figure 3: CSDL Example
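Figure 3 is an image in the original document. As a sketch of what such CSDL looks like, a filter equivalent to the three-condition Starbucks example built later in this guide would read:

    interaction.content contains "starbucks"
    AND klout.score > 40
    AND salience.content.sentiment > 0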

CSDL can be written and compiled using an editor on the DataSift dashboard, or submitted for compilation by Application Programming Interface (API) calls.


Stage Three – Delivery

The augmented and filtered interactions make up the output stream. The stream can be delivered to pre-configured destinations or is available for real-time consumption.

Destinations

The DataSift platform can be configured to send the output stream to a growing number of destinations, which are configured through the web application or API.

Figure 4: Example Destinations

Using Amazon Simple Storage Service (S3) as an example, the destination is configured in the DataSift platform with user credentials, keys, storage container, folder and file details. The amount and frequency of data delivery is configured to ensure none is lost. Configuration is by web application or API.

Data delivery is buffered for up to one hour and is guaranteed.


Streaming

Using the API, your stream of augmented and filtered interactions is available in real time. A connection is made between your application and the DataSift platform using HTTP or WebSockets.

Output Format

Each interaction in the stream is represented as attribute-value pairs in a JavaScript Object Notation (JSON) object. Some of the attributes are the same for all interactions; others are specific to the interaction source. For example, a retweet count would not be available in a Facebook JSON object.

Figure 5: Example Twitter JSON Object (excerpt)


Historic Streams

Interactions from many sources are archived, which allows filters to be run against historic data for a user-defined period. It is possible to estimate data volume and job completion time by running filters against a 1% or 10% sample of historic data. Data delivery from the archive happens within minutes or hours.

Billing

There are on-demand and subscription payment models. Billing for platform usage is based on the amount of processing required to generate the stream, plus a license fee for each interaction in the stream.

Platform Fees

Platform fees are charged as Data Processing Units (DPUs). A single DPU is equivalent to the processing that can be carried out in one hour by a single standardized processor.

Complex filters require more processing than simple filters, so each filter is automatically assigned a DPU rating. Running a filter rated at 3 DPUs for 10 hours results in 30 DPU hours of usage.

Data Fees

There is a charge for each interaction which is delivered in the stream from your filter. A single stream may contain licensing charges for interactions from many sources. Each source must be manually enabled and a license agreement must be signed.


Registering an Account

To register a DataSift platform account, open a web browser and go to http://datasift.com. Click the Login link.

Figure 6: datasift.com Login link

Login and Register tabs are shown. Click the Register tab.

Figure 7: Register Tab

Complete the form fields to register a new DataSift account, or link a new DataSift account to an existing account from any of the networks shown. Linking your DataSift account allows you to log in to the DataSift platform using single sign-on with another network’s credentials.

Figure 8: New Account Options


If completing the form, ensure the username contains only letters, numbers, periods and underscores. Click the terms and conditions link to view the terms and conditions in a new window. If you agree, select I agree and click the Create button.

Figure 9: DataSift Account Form


Web Application Interface

When the new streams dialogue has been completed or skipped, the web application interface is displayed with the default settings explained below.

Account Details

The top of the page displays account information, including a link to account settings, the number of unread notifications and the license cost.

Figure 10: Account Details

NOTE: This may look different when using a Pay-as-you-go (PAYG) account.

Tips

Tips are displayed at the top of each page until they are dismissed. Tips are used to provide assistance and suggestions. A new account has a tip suggesting that the Twitter data source be enabled.

Figure 11: Tips

Notifications

Clicking the notifications icon shows unread notifications from the platform, which may include information about completed jobs or billing.

Figure 12: Notifications


Adding Connected Identities

A variety of other online identities can be linked to the DataSift account to provide single sign-on using the other account's credentials.

Click Settings on the top navigation bar and click Identities. Add identities from the list on the right.

Figure 13: Adding Connected Identities

Pre-Configured Streams

Figure 14: Streams Tab

The web application has multiple pages identified by the tabs shown. New accounts start with the Streams page open.

Filters are configured in the Streams page. These filters can be used to create streams of interactions which match the filter conditions.

Figure 15: Pre-Configured Streams

How to edit and use these filters is covered in later modules.


Dashboard

Figure 16: Dashboard Tab

The dashboard page displays the notifications mentioned previously along with a list of configured stream filters.

Figure 17: Dashboard - Details

The lower half of the page shows API usage divided into the Sources & Augmentations used, and the number of hours.

Figure 18: Dashboard - API Usage


2 Configuring Sources

Sources are the social networks, news sites, forums, comments, message boards, blogs and other networks which provide input interactions for the DataSift platform. This section explains the source types and how to find detailed information on each source. It also demonstrates data source configuration.

Finding Sources

Data sources are listed in the DataSift web application. Log in to http://datasift.com and select the Data Sources tab. New sources appear here automatically.

Figure 19: Data Sources Page

Source Types

Sources are divided into three types: Feeds, Managed Sources, and Augmentations.

Feeds

Feeds are the most common type. They are identified on the Data Sources page with light blue tabs in the corner.

Figure 20: Example Feed Source


Some feeds are from sources which send interactions to the DataSift platform as they occur. Twitter is a good example of this type. There is a ‘firehose’ of interactions coming from Twitter directly to the DataSift platform. Tumblr is another example of this type.

Other feeds are from sources which are constantly monitored by the DataSift platform. In either case, interactions become input to the platform and available for filtering.

Use the links on the left to further filter the choice of sources.

Figure 21: Selecting Social Network Sources

Managed Sources

Companies often create a social media presence on various networks, such as Facebook and Google+. Interactions on these pages and sites are not always public; they may be protected by login credentials.

Managed sources allow a customer to include interactions from closed pages or public pages which require a login before they can be accessed. This allows all interactions from feeds and managed sources to be filtered as a single stream.

The company acquires an access token from the source, which is used by the DataSift platform to retrieve interactions. The token can be revoked at any time, and interactions from managed sources remain private. Managed sources are available on the Managed Sources page of the Data Sources tab.

Figure 22: Managed Sources Page


Augmentations

Interactions from data sources contain a lot of information specific to that data source and the content of the interaction. However, more information can be retrieved and calculated by the DataSift Platform, enriching the information in the interaction.

Augmentations are identified on the Data Sources page with purple tabs in the corner.

Figure 23: Example Augmentation Source

Information added by augmentations may include language, positive or negative sentiment in a message, topics, trends, social media influence, gender, and many more. Augmentations are applied to all applicable interactions in the same way, allowing a single stream of interactions to be generated from the filter.

Figure 24: Augmentation of all Interactions

Many of the data feeds share common features such as the main text, the creation time, the author, and frequently links to other online content. The Interaction augmentation also groups these features together, making it easier to write filters across all the feeds.


Viewing Source Information

Each source has a page with a detailed description. From the Sources or Managed Sources pages, click the source name or logo.

Figure 25: Open Source Description

In this example of the Facebook public source, information about the number of interactions and a breakdown of types and languages is shown.

Figure 26: Example Source Description


Pricing

In each source description, the price for using the source is shown. This is the data licensing price for each interaction returned by your filter from this source. It is in addition to the Data Processing Unit (DPU) hours charged for running the filter.

Figure 27: Example Data License Pricing

Target Fields

A target is a field which can be used in a filter. Each source may contain many tens of targets. Under the source description, the source targets are shown. This list can be used when selecting sources to identify which fields to use in a filter.

Figure 28: Example Target Fields

Sample Definition

The sample definition is an example of how the target fields can be used in a Curated Stream Definition Language (CSDL) filter. In this example, the CSDL matches Tweets containing the words “San Francisco” which have been retweeted more than nine times, by users with over 999 followers or at least 10 times as many people following them as they follow.

Figure 29: Twitter Sample Definition
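Figure 29 is an image; a hedged reconstruction of the definition it describes might read as follows (the twitter.user.follower_ratio target name is an assumption):

    twitter.text contains "San Francisco"
    AND twitter.retweet.count > 9
    AND (
        twitter.user.followers_count > 999
        // follower_ratio is assumed to be followers divided by accounts followed
        OR twitter.user.follower_ratio >= 10
    )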


Example Output

When an interaction has been matched by a filter, the interaction data, any augmented data and any extra tagging information added by the filter are sent to a destination or streamed.

The example output shown on the source page is an output interaction formatted as a JavaScript Object Notation (JSON) object.

Figure 30: Example Output

The actual attribute-value pairs in the JSON object will vary depending on which augmentations are enabled.

Activating Sources

Some sources are automatically activated when a new account is registered. Others require activation. Previous modules have covered Twitter as an example of an inactive source which requires a license to be signed before it is activated. From the Data Sources page, use the Activate and Deactivate buttons.

Figure 31: Source Activate and Deactivate Buttons

Some sources are restricted to accounts on premium subscriptions. These are displayed with an Enquire button. The Enquire button opens a web form to request access to the source.


Figure 32: Enquire Button on a Premium Source

NOTE: When activated, the Demographics source anonymizes all interactions.

Verifying Source State

The Data Sources page shows which sources are active or inactive. Sources with a gray Deactivate button are already activated.

Figure 33: Verifying Source State

Activating Augmentations

Augmentation sources are activated in the same way as feed sources.

Activating Managed Sources

Managed sources can be defined multiple times; for example, a customer may have more than one Google+ page to use as an interaction source. In this screenshot, two Google+ pages are being used as sources. Defined instances are listed under My Managed Sources.

Figure 34: Two Instances of a Managed Source


On the Managed Sources page, the Google+ source doesn’t have an Activate/Deactivate button. The + symbol is used to create another instance of the managed source.

Figure 35: Managed Source Available for New Instances

Source Activation Impact

When a feed is activated, the interactions become available immediately to all running filters. Any running filters which would match interactions from the new feed will then produce those interactions in the output stream.

As soon as the new feed is activated, the data license fee for interactions from the new feed is charged.


3 Configuring Streams in Query Builder

This section describes the Query Builder editor and walks through the process of creating, validating and compiling filters with Query Builder.

Query Builder

Query Builder is a simple and powerful web application for creating and editing filters. The alternative to Query Builder is writing filters using the Curated Stream Definition Language (CSDL). An instance of Query Builder is provided on the DataSift dashboard.

Figure 36: Choosing Query Builder Editor

HINT: Query Builder filters can be converted into CSDL with one click.


Enabling Sources

Creating filters in Query Builder allows interactions from a number of data sources to be filtered. Before creating a filter, choose which sources will be used and ensure they are enabled. Sources are divided into the three types shown below.

Source Type: Feeds
Description: Public sources of interactions
Examples: Twitter, Tumblr, Reddit

Source Type: Managed Sources
Description: Sources of interactions which require authentication
Examples: Facebook Pages, Google+, Instagram

Source Type: Augmentations
Description: Extra information retrieved or calculated about the interaction
Examples: Demographics, Sentiment, Language


Enable and disable sources on the Data Sources page of the dashboard.

Figure 37: Location of Data Sources Configuration

Refer to the training module on Configuring Sources for more detail on managing sources.

Creating New Streams

Query Builder is an editor which creates conditions to filter data from the enabled data sources. To use Query Builder, click the Create Stream button on the Streams page.

Figure 38: Creating a new Stream


Enter a Name for the new stream and, optionally, a Description. In this example, the filter name is Starbucks Filter. There are two choices of editor: Query Builder and CSDL Code Editor. Select Query Builder and click Start Editing.

Figure 39: Selecting Query Builder Editor

TIP: Multiple streams can have the same name, so devise your own naming convention.

Creating Simple Filters

Query Builder allows multiple filter conditions to be combined to generate the content for an output stream. The new stream definition opens with no filter conditions. Click Create New Filter to create the first filter condition.

Figure 40: Create New Filter Condition


Creating Filter One

A list of enabled sources is shown. Use the left and right arrows to scroll for more sources. If a required source is not visible, it may not be enabled.

Figure 41: Choose Sources

When a source is chosen, the targets within the source are shown. In this example, the filter condition will be applied to the TWEET message from the Twitter source.

Figure 42: Choose Targets


Depending on the source and target chosen, there may be more target layers to select from. In the example of filtering on the message in a tweet, the text and an operator must be selected.

Figure 43: Contains Words Condition

With this target, there are multiple operators to allow matching of text strings in a variety of ways.

Figure 44: Text Operators

When entering multiple words for Contains or Contains words near, press Enter between each word as shown below. Text strings are not case sensitive unless the ‘Aa’ button is selected.

Figure 45: Entering Multiple Text Strings


When the filter condition is complete, click Save and Preview. A summary of the new filter condition is shown.

Figure 46: Summary of Filter Condition

Reviewing Filter Cost

Pricing is made up of Data Processing Unit (DPU) hours and a data licensing cost per interaction in the output stream. More complicated filter conditions consume more DPU hours but may also reduce the data licensing cost by being more specific.

To review the cost of a filter, open the Streams page and click on the stream name.

Figure 47: Select Existing Stream


The cumulative DPU cost for the filter is shown on the left. At the bottom of the page, a Stream Breakdown shows how the DPU usage is divided across the stream.

Figure 48: Review DPU Cost

The following example shows the DPU breakdown for multiple filter conditions.

Figure 49: DPU Breakdown for Multiple Filter Conditions

The cost of a DPU may change over time. DPUs and pricing are covered in a separate training module.


Previewing Streams

The stream preview is used to review and fine-tune filters by identifying irrelevant interactions in the output stream. The filter conditions can then be modified to exclude them.

From the Streams page, click on the stream name to open the summary page. Click Live Preview.

Figure 50: Live Preview

A summary of enabled sources is shown. Check that only the required sources for this stream and any other running streams are in the list. Enabling more than the required sources may unnecessarily increase the data licensing cost.

Figure 51: Sources Summary

NOTE: DPU cost is incurred when using the live preview.


Click the play button at the bottom of the screen to start previewing interactions matched by the filter conditions.

Figure 52: Play Button

Interactions with their augmentation icons and information appear as they are matched.

Figure 53: Example Preview Interactions

NOTE: The number of interactions sent per second is limited in preview mode. The total number of interactions sent may also be limited.

To stop the preview, click on an interaction or click the pause button.

Figure 54: Pause Button


Creating Multiple Filter Conditions

Filtering becomes much more powerful when multiple conditions are combined. In this example, two more filter conditions are added to the Starbucks stream. Click on the stream name to open the stream summary.

Figure 55: Edit Stream

Click the Create New Filter button to add a new filter condition.

Figure 56: Add filter Conditions


Creating Filter Two

Add a filter condition for a Klout score over 40.

Figure 57: Klout over 40


Creating Filter Three

Add a third filter condition for any level of positive sentiment in the interaction content.

Figure 58: Above Zero Sentiment


All three filters are now listed in the filter preview.

Figure 59: Three Filters in Filter Preview

Using Logical Operators

Above the filter descriptions are three options for the logic which should be applied to the filter conditions. The default is ALL of the following.

Figure 60: Filter Condition Logic

If ALL of the following is selected, a logical AND is used between the conditions so all three conditions must be matched in an input interaction for it to be sent to the output stream.

If ANY of the following is selected, a logical OR is used between the conditions so if an input interaction is matched by one or more conditions, the interaction is sent to the output stream.
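Expressed in CSDL (a sketch using the three Starbucks conditions), the two options correspond to:

    // ALL of the following: logical AND
    interaction.content contains "starbucks"
    AND klout.score > 40
    AND salience.content.sentiment > 0

    // ANY of the following: logical OR
    interaction.content contains "starbucks"
    OR klout.score > 40
    OR salience.content.sentiment > 0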


Enabling Advanced Logic

The ADVANCED selection expands the logic to show what is currently applied and allows more complex logic to be applied. The current example shows logical ANDs between each condition.

Figure 61: Advanced Logic Expanded

Grouping Conditions

If using different logical operators between conditions, make use of brackets to make it clear which conditions are grouped. In this example, condition 1 must be matched along with either of conditions 2 or 3. Conditions, operators and brackets can be dragged to the required position.

Figure 62: Brackets and Dragging in Advanced Logic

Negating Conditions

Use the NOT operator to negate a condition or group of conditions. In this example, the interaction is sent to the output stream if condition 1 is matched and neither condition 2 nor condition 3 is matched.

Figure 63: Negated Conditions
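In CSDL terms (again a sketch, with the Starbucks conditions standing in for conditions 1, 2 and 3), the grouped and negated examples correspond to:

    // Condition 1 plus either of conditions 2 or 3
    interaction.content contains "starbucks"
    AND (klout.score > 40 OR salience.content.sentiment > 0)

    // Condition 1 and neither condition 2 nor condition 3
    interaction.content contains "starbucks"
    AND NOT (klout.score > 40 OR salience.content.sentiment > 0)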

Embedding & Customizing Query Builder

Query Builder is available under an open-source license as an independent, embeddable module written in JavaScript. Anyone can add it to their own web page, blog, or a web view in a desktop or mobile application.

Documentation and examples for embedding and customizing Query Builder are available at http://dev.datasift.com/editor



4 Analyzing Interactions

Interactions contain data from the original source as well as augmentation data provided by the DataSift Platform. This section shows how to analyze the details of an output interaction using the Web Application.

Displaying Interaction Details

Interaction details are available from the live preview of a stream. From the Streams page, click on a stream name to display the stream summary.

Click Live Preview.

Figure 64: Stream Summary Page

Review the summary of Live Preview sources to ensure the correct sources are enabled.

Figure 65: Live Preview Sources


Click the play button to start previewing interactions matched by the filter conditions.

Figure 66: Live Preview Play Button

Pause the preview when there are interactions to analyze by clicking the pause button. Clicking an interaction also pauses the preview.

Figure 67: Live Preview Pause Button

Move the mouse pointer over the interaction to be analyzed. The interaction is grayed out and a bug icon appears.

Figure 68: Bug Icon on an Interaction

Click anywhere on the interaction to open the debug viewer.

Figure 69: Interaction Debug Viewer


Analyzing Interaction Details – Web Application

Interaction information is available as icons below the message and in a debug window.

Augmentation Icons

Some interactions have icons displayed under the message summary. These mostly relate to information available from augmentations. If they are not visible, the necessary augmentations may not be enabled.

The following example shows the user’s avatar, a Klout score of 46 and message content with positive sentiment.

Figure 70: Augmentation Icons


Interaction Example

Interaction details are available for all interactions regardless of source. While each data source or managed source has its own attributes, the interaction augmentation provides a consistent set of attributes regardless of the source type.

Information about the author is separated from the message information.

Figure 71: Example Interaction Output

Not all attributes are available for every data source. For example, some data sources may not have an avatar.

REFERENCE: The interaction targets are documented here: http://dev.datasift.com/docs/targets/common-interaction


Klout Example

With this augmentation enabled, each interaction which matches the filter conditions is augmented with the author’s Klout score. Klout is a value for the author’s social media influence. Values are on a scale of 1-100, with higher values being more influential.

Figure 72: Klout Score Interaction Output

REFERENCE: Klout targets are documented here: http://dev.datasift.com/docs/targets/augmentation-klout

Language Example

With language augmentation enabled, the language of the message is calculated along with a value indicating the level of confidence that the language has been identified correctly. In this example from a different Tweet, the DataSift platform is 100% sure the message is in English.

Figure 73: Language Interaction Output

It may not be possible to calculate the language for every interaction. In that case, the language section is excluded from the output.

REFERENCE: The language augmentation and targets are documented here: http://dev.datasift.com/docs/targets/augmentation-language


Salience Example

If the Salience augmentation is enabled, it adds topics and sentiment to the interaction. In this example from a different Tweet, two topics have been discovered, both companies. In each case, the sentiment of the message about the companies is negative.

Figure 74: Salience Interaction Output

REFERENCE: Salience documentation is available here: http://dev.datasift.com/docs/targets/augmentation-salience


Twitter Example

Interactions matched in the Twitter feed contain attributes and values from Twitter. The attributes, and the meaning of each attribute, are defined by Twitter and may change.

Figure 75: Twitter Interaction Output


More detailed information about the user is available under the user attribute.

Figure 76: Twitter User Details


Analyzing Interaction Details – API

When consuming interactions programmatically, the interactions arrive as JavaScript Object Notation (JSON) objects. This is an open standard format providing data as text attribute-value pairs.

The following example shows a single Tweet matched by a filter and delivered as a JSON object in a stream from the DataSift platform:

{ "interaction": { "id": "1e376bcd3c3caa80e07422ab947f4e52", "type": "twitter" }, "twitter": { "created_at": "Mon, 06 Jan 2014 10:25:13 +0000", "filter_level": "medium", "id": "420138980948992000", "lang": "en", "source": "web", "text": "Test tweet message", "user": { "name": "Paul Smart", "description": "Test account", "statuses_count": 41, "followers_count": 2, "friends_count": 0, "screen_name": "DataSiftPaul", "profile_image_url": "http://abs.twimg.com/sticky/default_profile_2_normal.png", "profile_image_url_https": "https://abs.twimg.com/sticky/default_profile_2_normal.png", "lang": "en", "listed_count": 0, "id": 2268899964, "id_str": "2268899964", "geo_enabled": true, "verified": false, "favourites_count": 0, "created_at": "Mon, 30 Dec 2013 13:59:22 +0000" } } }

Notice how the whole JSON object is surrounded by braces, and further braces divide the object into blocks for interaction and twitter. Each augmentation adds an extra block. Each interaction is from a single source, so only one source block is ever present.

Attributes appear with values. JSON allows multiple data types for the values. When an attribute has no value, the attribute does not appear in the JSON object.


Generating JSON Object Stream

The Push API can be used to generate a stream of JSON objects for viewing. In this example, the hash from a previously compiled filter is used to stream 20 interactions:

$ curl -sX POST https://api.datasift.com/v1/push/create \
    -d name=pullsub -d hash=dbdf49e22102ed01e945f608ac05a57e \
    -d output_type=pull -d output_params.format=json \
    -H 'Authorization: paul:6cab930bdf40cf89e68f2ecad2c'

$ curl -sX POST https://api.datasift.com/v1/pull \
    -d id=37c2d26b6596f163276ed8ee1b8cacf4 -d size=1 \
    -H 'Authorization: paul:6cab930bdf40cf89e68f2ecad2c'
{
    "interaction": {
        "author": {
            "avatar": "http://pbs.twimg.com/profile_images/normal.jpeg",
            "id": 215741751,
            "language": "en",
            "link": "http://twitter.com/escamil61",
            "name": "\u2693\ufe0fcinthia escamill",
            "username": "escamil61"
        },
[output omitted]


5 Writing Simple Filters in CSDL

Curated Stream Definition Language (CSDL) is the language used to write filter conditions on the DataSift platform. While CSDL can be used to create very complex filters, this section covers the syntax and use of CSDL to create simple filters.

Filtering Condition Elements

Filters are made up of one or more filter conditions. Each filter condition usually has a Target, an Operator and an Argument; sometimes only the Target and Operator are required. In the editor, these are color coded blue, red and green.

The example shown below is a filtering condition with all three elements: interaction.content is the target, contains is the operator, and "starbucks" is the argument.

interaction.content contains "starbucks"

Selecting Targets
A single interaction contains many individual attributes and values. For example, a Reddit interaction contains an author name and a title. These attributes are called targets, and a filtering condition starts with the name of a target.

All the targets are listed in the developer documentation in three groups: feeds, augmentations and managed sources.

REFERENCE: The targets available from each source are documented here: http://dev.datasift.com/docs/targets



Target Data Types
A target contains data which has a data type. Some common data types are listed below.

• string – A sequence of characters, usually alphanumeric. Examples: "datasift", "starbucks Frappuccino"

• int – Integer; a whole number without a fractional or decimal component. Examples: 4, 7294745

• array(int) – A collection of integers. Example: [34, 42, 88, 1]

• float – Floating point number; a number with a decimal component. Examples: 4.123, 7294745.9984

• array(string) – A collection of strings. Example: www.yahoo.com/news.htm, www.yahoo.com/sport.htm

• geo – A geographic region represented as a circle (a point with a radius), a rectangle, or a polygon. Examples: 51.4553,-0.9689:50 (circle); 51.4911,-1.0617:51.4194,-0.8921 (rectangle); 51.4615,-0.9864:51.4586,-0.9472:51.4466,-0.9412:51.4443,-0.9651:51.4445,-0.9831 (polygon)

Target Examples
The following list shows some commonly used targets with their data type, an example value, and a description of how each may be used.

• interaction.content (string) – Example: "Just had a coffee with Dave at Starbucks". The interaction targets are normalized from the original data source; content is the body of the message.

• facebook.type (string) – Example: "photo". The type of content posted in a Facebook update.

• twitter.geo (geo) – Example: 51.4553,-0.9689:50. The location of the device used to send the Tweet. This may not always be available.

• links.domain (string) – Example: "bosch.com". The domain name of the final destination of a shared link.

• twitter.retweet.count (int) – Example: 100. The number of times the interaction has been retweeted.


Selecting Operators
The operator defines what comparison is made between the argument and the value of the target. The choice of operator may affect the DPU cost of the filter condition.

Exists
The exists operator returns a true result if the named target exists in the input interaction with any value. This is the only time a condition does not need an argument.

The example below shows the exists operator being used to identify interactions which have a geographic value.

interaction.geo exists

WARNING: Use the exists operator with great caution. It is likely to match a very large number of interactions and will rapidly use data licensing credit.
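In practice, exists is usually combined with a narrower condition so that only relevant interactions are matched. A minimal illustrative sketch, not from the original guide (logical operators such as AND are covered later in this section):

interaction.geo exists
AND interaction.content contains "starbucks"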

String Operators
The == and != operators match interactions where the argument is exactly the same string as the target value, or is not the same string as the target value. The following example matches interactions where the screen_name is exactly the string shown and nothing more.

twitter.user.screen_name == "datasift"

WARNING: Use the != operator with great caution. It is likely to match a very large number of interactions and will rapidly use data licensing credit. For example, twitter.user.screen_name != "datasift" matches all Twitter interactions not from @datasift, plus all interactions from all other active data sources.

The contains operator looks for a string anywhere in the value of a target. The example shown below matches interactions if the content has the string "hertz rental" anywhere in upper, lower, or mixed case.

interaction.content contains "hertz rental"

The following table shows the result for each example value. Note that the argument must appear as a contiguous phrase; words matched separately do not count.

"Where can I find Hertz Rental in SFO?" – True
"Where can I find hertz rental in SFO?" – True
"Why do Hertz never have my rental car" – False


The contains_any operator uses a comma-separated list of strings. The condition returns a true result if any one of the strings is matched in the target value.

interaction.content contains_any "Hewlett-Packard,Hewlett Packard"

WARNING: If two commas are used as shown in the example below, all interactions with a space in them will be matched. This returns a very large number of interactions.

interaction.content contains_any "thinkpad, ,lenovo"

The contains_near operator matches interactions when strings appear within a specified number of words from each other. The strings cannot contain spaces.

interaction.content contains_near "deere,tractor:5"

The following table shows the result for each example value; in the last example, "Deere" and "tractor" are more than five words apart.

"I have bought a new John Deere Tractor" – True
"Deere make the best tractor for me" – True
"John Deere isn't going to be the right tractor for me" – False

The contains_all operator matches interactions which contain all of the strings in a comma-separated list.

interaction.content contains_all "deere,tractor"

The following table shows the result for each example value.

"I have bought a new John Deere Tractor" – True
"Deere make the best tractor for me" – True
"John Deere isn't going to be the right tractor for me" – True

The in operator matches interactions where the value is one of the listed strings or integers. In the following example, the interaction is matched if the language is any one of the three shown.

language.tag in "en,de,es"

Square brackets must be used when matching integer data types.

twitter.user.id in [111111,111112,111113]


The previous string operators match whole words. To match strings which may be a part of a word, use the substr operator.

interaction.content substr "gator"

The following table shows the result for each example value.

"I almost got caught by an alligator" – True
"I need to drink some gatorade" – True

Case Sensitivity
Notice that in the previous examples, the results were true regardless of the case used in the string. By default, string operators are not case sensitive. To force case sensitivity, use the case modifier with any string operator, as shown below.

interaction.content [case(true)] contains_near "FBI,CIA,NSA:15"

Wildcard Operators
Wildcards are used to match strings with the following two special characters:

? – Matches exactly one character in a string. Can be included multiple times. "famous??" matches "famous12" and "famously".

* – Matches any string of 0 or more characters. "famous*" matches "famous", "famous1", and "famous123456789ly".

For example:

twitter.text wildcard "colo*r"


Numeric Operators
The == and != operators match numeric values which are, or are not, the same as the argument.

twitter.user.followers_count == 100

twitter.user.followers_count != 0

To compare values which are greater or less than the argument, use the >, <, >=, and <= operators. These work with integers and floating point numbers.

twitter.retweet_count >= 100
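The comparison operators can be combined to express a range. The following illustrative condition, not from the original guide, matches users with between 100 and 1,000 followers:

twitter.user.followers_count >= 100
AND twitter.user.followers_count <= 1000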


Geographic Operators
Some interactions contain geographic information. Operators are available which match interactions whose geographic values fall within an area defined by the argument.

The geo_radius operator defines a point and a radius. The point is defined as a latitude and longitude; the radius is defined in kilometers. The following example matches Twitter interactions with geographic values within a 50km radius of the coordinates.

twitter.geo geo_radius "51.4553,-0.9689:50"

When using the web application, geographical areas can be selected on a map.

Figure 77: Web Application Radius Configuration

The geo_box operator uses two sets of coordinates to define the upper-left and lower-right corners of a rectangle.

interaction.geo geo_box "51.5013,-1.0997:51.4193,-0.8669"


To define more complex areas, use the geo_polygon operator, which accepts up to 32 points. Interactions with geographic values within the polygon are matched.

interaction.geo geo_polygon "51.5002,-1.0815:51.4827,-1.0107:51.4857,-0.9737:51.5079,-0.9647:51.4925,-0.8693:51.4210,-0.8954:51.4296,-1.0567"

Figure 78: Web Application Geo Polygon Configuration

URL Operators
Links are normalized by the DataSift platform to make URL matching easier. The following are removed from URLs:

• Protocol
• Hostname
• Query strings
• Anchor tags
• Names of common default pages

The url_in operator is used to match the argument with normalized URLs in the target.

twitter.links url_in "http://apple.com, http://finance.yahoo.com"

HINT: The links.normalized_url target is useful for filtering on normalized URLs:

links.normalized_url any "apple.com, finance.yahoo.com"


Negating Conditions
The NOT keyword can be added to negate the effect of a condition. In the following example, interactions whose user description does not contain the string 'data' are matched.

NOT twitter.user.description contains "data"

WARNING: This also matches Twitter interactions which have no value in the twitter.user.description target and all interactions from other active data sources. Use with multiple conditions.
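For example, the negation can be scoped so that it only applies to Twitter interactions which actually carry a description. An illustrative sketch, using the logical operators covered in the next section:

twitter.user.description exists
AND NOT twitter.user.description contains "data"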

Using Multiple Conditions
Most filters have more than one condition. When joining multiple conditions, a logical operator is used.

Logical Operators
The AND operator is used when an interaction must match both conditions. In this example, the content of a Tweet must include the word Starbucks and the language must be English.

twitter.text contains "starbucks"

AND twitter.lang == "en"

However, the example above only matches the word Starbucks in Tweets, not Retweets. The OR operator can be used to look for Starbucks in either. Note the use of parentheses to group conditions.

( twitter.text contains "starbucks"
  OR twitter.retweet.text contains "starbucks" )
AND
( twitter.lang == "en"
  OR twitter.retweet.lang == "en" )

HINT: Use interaction.content as a normalized form of twitter.text and twitter.retweet.text. Use language.tag as a way of matching language across multiple data sources.
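Following that hint, the example above can be sketched more compactly with the normalized targets. Note this is broader than the original: language.tag also matches interactions from non-Twitter sources.

interaction.content contains "starbucks"
AND language.tag == "en"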


Hints & Tips Twitter Mentions When using the twitter.text or interaction.content targets, Twitter users mentioned as @username in the text of a Tweet are removed from the text. Mentions are available in the twitter.mentions and twitter.retweet.mentions targets with the @ symbol removed.

twitter.text contains "starbucks"

AND twitter.mentions == "beyonce"

HINT: Use interaction.raw_content to match mentions with the @ symbol still present.

Hashtags
The # symbol is treated as punctuation by the DataSift platform, so using '#datasift' as an argument would match interactions with '#' and 'datasift' separated by any amount of whitespace. Use the any operator to match hashtags in twitter targets, or use the hashtag targets to match hashtags in all interactions.

twitter.text any "#starbucks, #nero, #costa"

interaction.hashtags in "starbucks, nero, costa"


6 Configuring Streams – CSDL Web Application
Interactions which have been matched by your filter conditions are provided as an output stream. This section covers how to write a filter in CSDL using the web application, then validate it, compile it, and preview the stream.

Enabling Sources
Interactions from a number of data sources can be filtered with the web application CSDL editor. Before creating the filter, choose which sources to receive data from and ensure they are enabled. Sources are divided into the three types shown below.

• Feeds – Public sources of interactions. Examples: Twitter, Tumblr, Reddit.

• Managed Sources – Sources of interactions which require authentication. Examples: Google+, Instagram, Yammer.

• Augmentations – Extra information retrieved or calculated about the interaction. Examples: Demographics, Sentiment, Language.

Enable and disable sources on the Data Sources page of the dashboard.

Figure 79: Location of Data Sources Configuration

Refer to the training module on Configuring Sources for more detail on configuring sources.


Writing Filters with CSDL Editor
From the Streams page, click Create Stream.

Figure 80: Creating New Stream

Enter a Name for the new stream along with an optional Description. Select CSDL Code Editor before clicking Start Editing.

Figure 81: Stream Name and Editor Selection

Using Targets
The code editor opens with numbered lines. These line numbers are used to reference which line has a problem if code validation fails.

As targets and operators are typed, the editor provides a list of possible completions. Either continue typing or select one from the list. When a target is highlighted, a description appears. The More link shows more information.

Figure 82: Target Auto-Completion and Hints


Using Lists
When using an operator which allows a list of arguments to be defined, the List button becomes available. In this example, the in operator is used and the List button clicked.

Figure 83: List Configuration

The list editor opens in editing mode. Elements of the list are entered followed by return. To edit any existing element, click on its text.

Figure 84: Editing Lists


In re-ordering mode, the list editor allows elements to be dragged and dropped into a new order.

Figure 85: Manual Re-ordering

In deleting mode, click on an element to remove it from the list.

Figure 86: Deleting Elements


The Import button opens a dialogue box where comma-separated value (CSV) files are uploaded or pasted.

Figure 87: Importing Lists

Return to the CSDL editor to see a collapsed form of the list. Click the + symbol to expand and collapse the list.

Figure 88: Collapsed List in Editor


Using Geo Operators
Notice that targets are automatically colored blue, operators red and arguments green. When using geographic arguments, click the Geo – Selection button to open a map.

Figure 89: Coloring and Geo Button

The map allows coordinates and a radius to be selected by clicking on the map rather than typing latitudes and longitudes. Use the search box to locate a place, then click once to define the center of a geo_radius and click again to define the perimeter.

Figure 90: Geo Radius Configuration


To define an arbitrary shape with up to 32 points, use the geo_polygon operator. Click the Geo – Selection button and click for each point. Points may also be dragged to adjust the polygon. In this example, the perimeter of London Heathrow Airport is defined.

Figure 91: Geo Polygon Configuration


Using Versions
Every time the CSDL is saved, the editor saves it as a new version. Using the date drop-down menu, it is possible to revert to previously saved versions of the code.

Figure 92: Code Versions

WARNING: Different versions of the same filter have different stream hashes. Any recording or consumption of a stream is tied to a particular stream hash.


Validating Filters
The Validate button checks the CSDL syntax. Use this before saving.

Figure 93: Validating CSDL

If the code is free from errors, the following message is displayed:

Figure 94: Validation Pass

If there are problems with the CSDL, a descriptive message is displayed identifying the line and character position where the problem was found, along with the error. In this example, geographic coordinates are missing from a geo_radius operator.

Figure 95: Validation Fail


Compiling Filters
When the code has passed validation, click Save & Close. This compiles the code and saves it in the DataSift platform.

Figure 96: Saving and Closing

Every compiled filter is saved in the platform and referenced by a hash. To view the hash for a filter, click Consume via API from the stream summary page.

Figure 97: Consume via API Button

Among other details, the Stream Hash is displayed. This is the unique hash which references your filter. In later modules, this hash is used to reference the filter from the API and to reference these filter conditions from within another filter.

Figure 98: Stream Hash

NOTE: Remember to update any applications consuming a stream to the new hash after each change. The old hash still references the previous version.


Previewing Streams
The stream preview is used to review and fine-tune filters by identifying irrelevant interactions in the output stream. The filter conditions can then be modified to exclude them.

From the Streams page, click on the stream name to open the summary page. Click Live Preview.

Figure 99: Live Preview Button

A summary of enabled sources is shown. Check that only the required sources are in the list. Enabling more than the required sources may unnecessarily increase the data licensing cost.

Figure 100: Summary of Data Sources


Click the play button at the bottom of the screen to start previewing interactions matched by the filter conditions.

Figure 101: Play Button

Interactions with their augmentation icons and information appear as they are matched. To stop the preview, click the pause button.

Figure 102: Pause Button

NOTE: The number of interactions sent per second is limited in preview mode.


7 Configuring Categorization
Categorization allows user-defined meta-data to be added to interactions based on conditions.

While interactions may be filtered and sent to a destination based on one set of conditions, another set of conditions can be used to assign tag strings and values to those interactions. This extra information can greatly simplify the task of post-processing interactions and opens the platform to machine learning where tags and values are assigned based on programmable intelligence.

The assignment of simple tags has been present in the platform for some time. The more recent additions are tagging with namespaces, scoring, and cascading tags; these additions form part of the DataSift VEDO feature.

Files which define tags and scores are also known as classifiers.


Configuring Tagging
Tags are user-defined strings of text added to output interactions based on the results of conditions.

Writing Tag Conditions
Tags are defined using the tag keyword in CSDL, followed by one or more conditions grouped in curly brackets. In this example, the tag "US" is applied to interactions which match the conditions.

tag "US" { twitter.user.location contains_any "usa,united states" or twitter.place.country contains_any "usa,united states" or twitter.place.country_code == "US" }

The condition syntax used for tagging is exactly the same as the condition syntax used for filtering, but no filtering takes place. Multiple tags can be defined in a single CSDL file.

Filtering is a separate section of the CSDL file and must be identified using the return keyword. In this example, tag and return keywords are used in a single CSDL file.

tag "UK" { twitter.place.country_code == "UK" or twitter.user.location contains_any "uk,united kingdom" } tag "US" { twitter.place.country_code == "US" or twitter.user.location contains_any "usa,united states" } return { twitter.text contains "starbucks" }

NOTE: It is not possible to filter interactions based on tag values.


Interpreting Tags in Output Fields
Interactions which match the filtering conditions are augmented with the appropriate tags, and the tag values are available in the output interaction.

In this example, the preview is used to view an interaction which matched a filtering condition and a tagging condition which applied the US tag.

Figure 103: Interaction Preview With Tags

Where multiple tags are added to the same interaction, they become elements of an array named tags.

Tags in JSON
When consuming the filter output in JSON format, the tags appear as attributes and values.

"interaction": { "author": { <output omitted> "schema": { "version": 3 }, "source": "Twitter for iPhone", "tags": [ "US" ], "type": "twitter" },


Tags Mapped to Database Fields
A mapping file may be required to map interaction attributes to database fields. In the case of MySQL databases, this is achieved with an .ini file written by hand or constructed with the SQL Schema Mapper.

A list iterator is required to iterate multiple tag values.

REFERENCE: http://dev.datasift.com/docs/push/connectors/ini/list-iterator

Configuring Tag Namespaces
The previous examples used a single string (e.g. UK or US) as a tag, which is a flat namespace. A tag tree may be better suited to the desired tagging scheme.

Writing Tag Tree Conditions
In this example, the assignment of the US and UK tags is placed into a hierarchy with a branch for country. The format uses periods between elements of the path.

tag.country "UK" { twitter.place.country_code == "UK" or twitter.user.location contains_any "uk,united kingdom" } tag.country "US" { twitter.place.country_code == "US" or twitter.user.location contains_any "usa,united states" } return { twitter.text contains "starbucks" }


The following example assigns tags in a hierarchy to match device attributes:

• tag
  o device
    - name
    - os
    - format
    - manufacturer

tag.device.name "iPhone" {interaction.source contains "iPhone"}
tag.device.name "Android" {interaction.source contains_any "Android"}
tag.device.os "iOS" {interaction.source contains_any "iOS,iPhone,iPod,iPad"}
tag.device.os "Android" {interaction.source contains_any "Android"}
tag.device.format "Mobile" {interaction.source contains_any "iPhone,iPod, mobile web, phone, Blackberry"}
tag.device.format "Desktop" {interaction.source contains_any "web, Tweet Button, Twitter for Mac, Tweet for Web" and not interaction.source contains "mobile"}
tag.device.manufacturer "Apple" {interaction.source contains_any "iPhone,iPod,iPad,OS X,Mac"}
tag.device.manufacturer "HTC" {interaction.source contains "HTC"}

NOTE: Values cannot be assigned at branches, only at leaf nodes.


Interpreting Tag Namespace in Output Fields
The tags are assigned using a tree structure in the output. The preview shows tag_tree in the interaction augmentation, which is expanded to reveal the hierarchy and values.

Figure 104: Tag Tree Preview Example

The JSON output shows a similar hierarchy in the interaction section.

"interaction": { "author": { <output omitted> "source": "Twitter for Android", "tag_tree": { "device": { "name": [ "Android" ], "os": [ "Android" ] } }, <output omitted>


Configuring Scoring
Scoring extends tagging by associating a numeric value with an interaction which matches a condition.

Writing Scoring Conditions
Scoring allows the numeric value to be incremented or decremented, so the final value can vary as multiple conditions are evaluated. In this example, the score is assigned or incremented by different values depending on which group of conditions is matched. The final score is an indication of the probability that the author was in the USA.

tag.country.US +10 {
    interaction.geo geo_polygon "48.80686346108517,-124.33593928813934:48.922499263758255,<output omitted>"
}
tag.country.US +5 {
    twitter.user.location contains_any "usa,united states"
    or twitter.place.country contains_any "usa,united states"
    or twitter.place.country_code == "US"
}
tag.country.US +2 {
    twitter.user.location contains_any "Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida <output omitted>"
    or twitter.place.full_name contains_any "Alabama, Alaska, Arizona, <output omitted>"
    or twitter.user.location contains_any "Abilene, Akron, Albuquerque, Alexandria, Allentown, Amarillo, Anaheim, Anchorage"
    or twitter.place.full_name contains_any "Abilene, Akron, Albuquerque, Alexandria, Allentown, Amarillo, Anaheim, Anchorage"
}
tag.country.US +1 {
    twitter.user.time_zone contains_any "Alaska,Arizona,Atlantic Time (Canada),Central America,Central Time (US & Canada),Eastern Time (US & Canada),Mountain Time (US & Canada),Pacific Time (US & Canada)"
}
return {
    twitter.text contains "starbucks"
}


Interpreting Scores in Output Fields
In the example output, the interaction acquired a total score of 18: 10 for having geo parameters in the USA, another 5 for having a country code of "US" or country of "United States", then a final 3 (2 for matching a state or city name, plus 1 for the time zone, not shown).

Figure 105: Example Tag Scoring in Preview


Decrementing & Incrementing Scores Scores may be decremented as well as incremented. The following excerpt shows an excerpt of tagging logic which assigns a value based on likelihood that an interaction is a customer service 'rave' or 'rant'.

tag.rave 0.793474 {interaction.content contains "great"}
tag.rave 0.611286 {interaction.content contains "thank you"}
tag.rave -0.001199 {interaction.content contains "cancelled?"}
tag.rave -0.001199 {interaction.content contains "been on hold"}
tag.rave -0.001199 {interaction.content contains "any way to"}
tag.rant 0.699983 {interaction.content contains "fail"}
tag.rant 0.529781 {interaction.content contains "never"}
tag.rant -0.001199 {interaction.content contains "you try to"}
tag.rant -0.001199 {interaction.content contains "you try"}
tag.rant -0.001199 {interaction.content contains "you for your"}

Machine Learning
Scoring opens the platform to the advantages of machine learning. A sample set of interactions is classified or scored by a human, and a machine determines what content causes the human to classify or score in a particular way.

Classification conditions, tags and values are then generated by the machine to ensure future interactions are classified in the same way a human would.

The previous example of rules scoring rants and raves was constructed by machine learning.


Configuring Cascading
Tags defined in one CSDL file can be referenced from within other CSDL files – known as cascading. This allows tags to be written once and used in multiple filters. The tags are referenced by a hash which changes each time the tag file is modified.

Writing Re-usable Tag Definitions
A CSDL file containing just tags, which is to be referenced by another CSDL file, does not need to have a return statement. In this example, the file only contains tag statements.

tag.country "UK" { twitter.place.country_code == "UK" or twitter.user.location contains_any "uk,united kingdom" } tag.country "US" { twitter.place.country_code == "US" or twitter.user.location contains_any "usa,united states" }

When the file is saved in the web application, the usual stream summary buttons are not available because the file doesn't contain conditions to match interactions to be sent to a destination.

Figure 106: Stream Summary for Tag File

The only actions available are to edit and share the CSDL or use the tags from within another CSDL file.


Re-using Tag Definitions
The previously defined tags are used within another filter by using the tags statement followed by the hash.

tags "fb0960ab37ef2ed4f42049a4a812d066" return { interaction.content contains "starbucks" }

NOTE: The hash changes every time the tags file is modified.

Multiple tag files may be referenced in a filter and tags can be cascaded multiple times. In this example, two tag files are referenced in a filter file which also includes its own tags.

tag.model "iPhone" {interaction.source contains "iPhone"} tag.model "iPad" {interaction.source contains "iPad"} tag.model "Blackberry" {interaction.source contains "Blackberry"} Figure 107: Model Tags with hash 55a1eeab06858259a401bfc16b7771ce

tag.device.format "Mobile" {interaction.source contains_any "iPhone,Blackberry"}
tag.device.format "Tablet" {interaction.source contains "iPad"}

Figure 108: Device format tags with hash 2ff3a1745c92503d1a534228856ba4e4

tags "55a1eeab06858259a401bfc16b7771ce" tags "2ff3a1745c92503d1a534228856ba4e4" tag.device.model "Android" {interaction.source contains "Android"} return { interaction.content contains "Venice" OR links.title contains "Venice" }

Any tagging taking place in the file which contains the return block must come after the imported tags.

WARNING: Care should be taken to avoid overlapping tagging hierarchies.


Including Library Classifiers
A library of commonly-used classifiers is available for inclusion in filters. Some classifiers just tag interactions; others tag and score. Some are developed using machine learning.

To locate the classifiers, click the Streams tab and select Library.

Figure 109: Library Page

Example 1 – People vs. Organizations
People vs. Organizations is a machine-learned classifier that distinguishes between individual people and organizations as authors.

Figure 110: People vs. Organizations Classifier

Clicking on the classifier opens a summary page which starts with the classifier type, hash and the source it applies to. In this example, the classifier only applies to Tweets. The Copy button copies the hash for pasting into a filter.

Figure 111: Example Classifier Summary


The classifier definition shows a summary of the tagging CSDL. The Copy to stream button opens a new blank filter in the CSDL editor and adds the tags. The Copy button places the tags in the copy buffer to be pasted into another filter.

Figure 112: Classifier Definition

At the bottom of the page are tabs showing examples of using this classifier in other filters and the hierarchy of tags provided.

Figure 113: Tag Descriptions


Using Libraries
Locate the classifier hash from the classifier summary page, or copy the tags statement.

Figure 114: Tag Include Example

Edit a new or existing filter and paste the tags statement before any local tags statements. Ensure the filtering conditions are enclosed in a return block.

tags "bfb5bc9a599aa04f91b8a1dc4ae44d45" return { interaction.content contains_all "java,job" }

Verify the classifier by checking the preview output. This example shows tags and scores being applied.

Figure 115: Example of Scores Correctly Applied


Billing

Simple Tagging
Operators used inside a tag statement are normally charged at 10% of their usual DPU cost. For example, if the normal cost of a rule is 1 DPU, the same code inside a tag statement costs 0.1 DPU.

Advanced Tagging (VEDO)
If you use tags with namespaces or scoring rules, or cascade tags from one filter to another, the pricing is based on the combined cost of operators in the tagging logic and in the filter definition.

The number of times each operator appears is counted and the overall cost calculated.

For example, if the contains operator is used nine times in tagging and used twice in the filtering logic, the charge is for 11 uses of that operator.

Reviewing Processing Cost
The summary of any filter which includes tags displays the DPU cost.

Figure 116: DPU Cost of a Filter using VEDO


When VEDO features are used, the stream breakdown does not show DPU costs per operator. Instead, it shows a summary of the tag statements, indicating which of them fall within the return block.

Figure 117: Stream Breakdown


8 Configuring Streams – API
Filters can be written programmatically through the Application Programming Interface (API). The API is used for submitting, validating and compiling filters. This section explains how to write and submit filters to the DataSift platform and preview the output stream using the API.

Enabling Sources
Interactions are only sent to the filter if one or more sources are enabled. Sources cannot be enabled or disabled programmatically; they must be configured from the web application.

Log in to http://datasift.com and select the Data Sources tab. Ensure the required sources are activated.

Figure 118: Enabling Data Sources

More information on data sources is available in the 'Configuring Sources' training module.


Making API Calls
Calls to the DataSift platform API are made using HTTPS requests. The calls are usually made programmatically, but for testing it is possible to make the calls from the command line with utilities such as curl.

There are two types of API available, which use different URLs:

• REST API (https://api.datasift.com/v1/) – Representational state transfer. Used for validating and compiling CSDL; also used for configuring destinations and for starting, stopping and querying jobs.

• Streaming API (http://stream.datasift.com/) – Used for real-time streaming of interaction streams which continue until stopped.

Validating Filters
The first example of using the API validates a simple piece of CSDL code. The CSDL being used is shown below:

twitter.text contains "starbucks"

The CSDL can be passed as a query string in the URI. The following example shows the validate endpoint being called with the CSDL as a query string:

https://api.datasift.com/v1/validate?csdl=twitter.text%20contains%20%22starbucks%22

Notice that spaces and double quotes are URL-encoded, substituted with the percent-encoded values %20 and %22.


Using API Authentication
Every call to the REST API must be authenticated. User credentials may be added as parameters in the query string; the credentials required are the username and API key.

&username=<username>&api_key=<api_key>

The username and API key are available from the web application dashboard.

Figure 119: Location of Username and API Key

The complete validation request with authentication is shown below:

https://api.datasift.com/v1/validate?csdl=twitter.text%20contains%20%22starbucks%22&username=paul&api_key=b366978a8ee4e36c9d2171ccee4e1234

This request could be made using the curl utility and a query string.

$ curl \
  "https://api.datasift.com/v1/validate?csdl=twitter.text%20contains%20%22starbucks%22&username=paul&api_key=b366978a8ee4e36c9d2171ccee4e1234"

The command is more readable if used with arguments rather than a query string.

$ curl -X POST https://api.datasift.com/v1/validate \
  -d 'csdl=twitter.text contains "starbucks"' \
  -H 'Authorization: paul:366978a8ee4e36c9d2171ccee4e1234'
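Alternatively, curl can perform the URL encoding itself using its -G and --data-urlencode options, avoiding the manual %20 and %22 substitution. A sketch using the same illustrative credentials:

$ curl -G https://api.datasift.com/v1/validate \
  --data-urlencode 'csdl=twitter.text contains "starbucks"' \
  -d username=paul \
  -d api_key=b366978a8ee4e36c9d2171ccee4e1234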


Validation Failure
All information returned from REST API calls is in JavaScript Object Notation (JSON). If the CSDL is not correct, the validation fails and a JSON object is returned with an error message.

{"error":"We are unable to parse this stream... <output omitted>

Validation Success
If the CSDL has passed validation, a JSON object is returned similar to the following example.

{ "created_at": "2013-11-07 16:08:03", "dpu": "0.1" }

The timestamp is when the CSDL was first validated. The dpu value is the hourly Data Processing Unit (DPU) cost of the CSDL if it were executed.

Compiling Filters
Filters are submitted to the DataSift platform for compilation; compiled filters are then available to run. Use the same syntax as the validate example, but with the compile API endpoint.

https://api.datasift.com/v1/compile?csdl=twitter.text%20contains%20%22starbucks%22&username=paul&api_key=b366978a8ee4e36c9d2171ccee4e1234

In this example, a compile request is made using curl and a JSON object is returned.

$ curl -X POST https://api.datasift.com/v1/compile \
  -d 'csdl=twitter.text contains "starbucks"' \
  -H 'Authorization: paul:366978a8ee4e36c9d2171ccee4e1234'

{
    "hash": "bfd56316e55d8d480b89c7a653359e6d",
    "created_at": "2013-11-08 12:37:31",
    "dpu": "0.1"
}

The 'hash' returned in the JSON object is a unique reference for the compiled CSDL on the platform.


Referencing Web Application Filters
There may be times when it is more convenient to use the CSDL editor in the web application to create or modify filters. The CSDL editor provides color highlighting of code, target and operator auto-completion, and helpers for geo and list arguments.

After the filter has been written and saved in the web application, refer to the stream summary page to see its hash. The filter is referenced by this hash in the API.

Figure 120: Hash Location in Stream Summary

NOTE: Filters created in the API are not visible in the list of streams in the web application.


Previewing Streams
There is no API endpoint equivalent to the live preview in the web application. To see a few interactions from the filter, use the stream API endpoint, specifying the filter hash returned from the compile call:

hash=bfd56316e55d8d480b89c7a653359e6d

Along with a count of the number of interactions to display:

count=2

Example curl command retrieving two interactions:

$ curl -X POST https://api.datasift.com/v1/stream \
  -d 'hash=bfd56316e55d8d480b89c7a653359e6d' \
  -d 'count=2' \
  -H 'Authorization: paul:b366978a8ee4e36c9d2171ccee4e1234'


9 Configuring Stream Recording
So far, you have seen how to create multiple filter conditions and submit them for compilation in the DataSift platform using either the web application or the API. This section explains how to record the stream for later analysis.

Data Destinations
Previewing the stream of interactions in the web interface is a good way to verify that the filter conditions are matching the desired interactions, but the streamed data is not stored anywhere.

To store the streamed data for later analysis, a destination is required. Destinations are listed on the Data Destinations page of the web interface. Click Browse Destinations to see all available destinations.

Figure 121: Data Destinations Page

Notice that some destinations have a padlock icon; these deliver the data using more secure protocols and authentication. Any destinations which are unavailable to the account display an Inquire button to request access.

Click the + symbol to add one of these destinations. Configuration of each destination is covered in separate training modules.

DataSift Storage
DataSift Storage is available to all accounts as a way of temporarily storing streams; the data is held on the DataSift platform until deleted. If no other destinations have been configured, DataSift Storage is the default destination for recording tasks.


Starting Record Tasks
To start recording the stream from a filter, locate and click the filter name on the Streams page.

Figure 122: Select Filter

The summary page opens with options to Use This Stream. To create a recording task from this filter, click the Record Stream button.

Figure 123: Record Stream Button

In step one, the new recording task requires start and end times. Start times are available up to two years in the future, or the Start Now option starts the recording as soon as the task is submitted.


End times are also available up to two years in the future, or use the Keep Running option. Check that the time zone has been correctly detected and give the task a name before clicking Continue.

Figure 124: Recording Task Step 1

In step two, select which destination is used for the raw data stream. Use the settings icon next to each destination to review the configuration. DataSift Storage is a free temporary destination allowing download of the data or saving to an alternative destination.

Figure 125: Recording Task Step 2

NOTE: If no other destinations have been configured, DataSift Storage is used and the preceding step is not displayed.


In step three, review the new recording summary and click Start Task.

Figure 126: Recording Task Confirmation


Viewing Record Tasks
To view queued, running and completed recording tasks, open the Tasks page. Tasks are divided into Recordings and Historic Queries. The page opens showing all tasks, including the previously configured recording.

Figure 127: Tasks Page

The percentage shown after Task running reflects how far the recording has progressed through the total recording time. Click on the task name to view more information about the task. In this example, 21 interactions have been recorded from the Starbucks filter and the stream is still running.

Figure 128: Recording Task Details


Pausing Record Tasks
When recording to any destination except DataSift Storage, it is possible to pause data delivery. This could be used when the destination is offline for maintenance.

Figure 129: Pause Delivery Button

Data is buffered for up to one hour. To resume delivery, click the Resume Delivery button.

Figure 130: Resume Delivery Button

Stopping Record Tasks
To stop a recording task before the configured stop time, locate the task on the Tasks page and click Stop Task. Data continues to be delivered until the buffer is empty or the data in the buffer has expired.

Figure 131: Stop Task Button


Exporting Record Task Data
When a recording task using DataSift Storage is complete, the output interactions must be exported. Exporting makes the raw data collected by the recording task available in configurable formats and for specified time and date ranges.

Locate the completed recording in the Recordings or All Tasks tabs of the Tasks page. Click Export Data.

Figure 132: Locate Export Data Button

Complete the Name field and select the Format from a choice of Comma-Separated Values (CSV) and JavaScript Object Notation (JSON). Select start and end times and choose the storage destination for the exported interactions.


By default, all output fields are exported. To make all fields available for individual selection, click to clear the All check box. When the export configuration is complete, click Create.

Figure 133: Export Dialogue

The task details are displayed with completion of the export shown as a percentage.

Figure 134: Export Completion


When the export is complete, a notification is emailed to the logged-in user.

Figure 135: Export Complete Email Notification

Exporting Running Tasks
When using DataSift Storage as the destination for the raw data, it is possible to export the stream to a chosen destination, in a desired format, while the task is running. Use the Tasks page to locate the running record task and click Export Data.

Figure 136: Availability of Export on Running Tasks


Downloading Exported Data
The data is now available for download. All exports are listed on the Tasks page with download links. Click a Download link to download the exported data.

Figure 137: Exported Data in Tasks Summary

Deleting Exported Data
From the Tasks page, click on the task name to display the task summary. Click the Delete link by each export to delete the exported data from DataSift Storage. This does not delete downloaded copies.

Figure 138: Delete Links for Exported Data

NOTE: Exports to DataSift Storage expire after two weeks.


Deleting Record Tasks
To delete a recording task, open the Tasks page and click the Delete Task link.

Figure 139: Delete Task Link

Click OK to confirm the task deletion.

Figure 140: Delete Task Confirmation

WARNING: Deleted data is not recoverable.


10 Configuring Historic Previews
So far, you have seen how to create multiple filter conditions and submit them for compilation in the DataSift platform using the web application. After completing this module you will be able to query the historic archive for a preview of interactions which match a filter.

Historic Archive
DataSift has an archive of interactions which is queried using the same filters used for live interactions. The archive is many petabytes in size and growing rapidly.

Capturing of interactions to the archive started at different times for each source. Some of the oldest interactions may not be fully augmented. The most complete archive is for Twitter interactions.

• Pre-December 2011
  o Partially augmented Twitter

• May 2012
  o Facebook added
  o More augmentations added

• July 2012
  o Fully augmented

• August 2012
  o Demographics added

• November 2012
  o Bitly & Newscred added

• Aug-Dec 2013
  o Tumblr & WordPress added

REFERENCE: Full list of sources and earliest archived interactions http://dev.datasift.com/docs/historics/archive/schema

Interactions are retrieved from the archive many times faster than real-time interactions arrive. Destinations must be configured to cope with a data stream which may be 100x faster than live streams.

Historic tasks are queued for execution. Notifications are sent when tasks are complete.

Historic Preview
Historic preview allows filters to be applied to a random 1% sample of the archive for any duration between one hour and 30 days. One of several pre-configured reports is created from the filtered interactions.

This is ideal for testing a filter before running a larger historic task on 100% of the interactions. For analyzing trends, the results of the preview will be sufficient in most cases.


Report Types
There are five built-in reports plus the ability to configure a custom report.

1. Basic Preview
The basic preview report contains five charts.

Interaction Volume: 1% of the volume of interactions over the specified period of time.

Interaction Tags: When interactions have been tagged, the volume of each tag applied to matching interactions.

REFERENCE: http://dev.datasift.com/docs/tags/applying-tags

Interaction Content: A word cloud of words found in the content of the interactions.

Interaction Type: A pie chart showing the proportion of interactions found from each source.

Language Tag: A pie chart of the languages found in interactions matching the filter.


2. Twitter Preview
The Twitter preview only shows Twitter interactions from the filter, but provides more detailed Twitter information.

Twitter Text: A word cloud of the most commonly occurring words in the Tweet text.

Twitter User Description: A word cloud built from users' descriptions of themselves.

Language Tag: The proportions of each language identified.

Twitter User Time Zone: The distribution of users' time zones in the matching interactions.

Twitter ID: The volume of interactions in the preview.

Twitter User Language: The top languages found in Twitter text.


Twitter Hashtags: A pie chart of the most common hashtags found in matching Tweets.


3. Links Preview
The links preview shows the volume of links along with Twitter Card data and Facebook OpenGraph data.

Links Code vs. Links Meta vs. Links Meta OpenGraph: The dark blue line is the number of links found in all the interactions. When the links are followed, the light blue line shows the number of pages with Facebook OpenGraph meta tags; the orange line shows the number of pages with Twitter Card meta tags.

Links Meta Language: When links are followed, the values of language meta tags used on the pages.

Links Meta Keywords: The values of keyword meta tags used on the linked pages.

Links Title: A word cloud of the words found in the page titles of linked pages.


4. Natural Language Processing
Data of special value to natural language processing specialists, including sentiment for the content and title, plus entities found in the content and title.

Salience Content Sentiment: The positivity or negativity of comments, with neutral being a zero value.

Salience Title Sentiment: The positivity or negativity of interaction titles.

Salience Content Topics: The products and topics being talked about.

Salience Title Topics: Topics in the titles of interactions.


5. Demographics
This preview is only available on accounts with the Demographics augmentation activated. The preview shows anonymized demographic information including gender, age, location by city, state and country, profession, and likes and interests.

Twitter Text: A word cloud of words found in the Tweets.

Demographic Age Range: A pie chart of the most common ages of authors.

Twitter Hashtags: The top hashtags found in Tweets matched by the filter.

Demographic First Language: A pie chart of the most common first languages among the authors.

Demographic Sex: The sex of the author.

Demographic Location (US State): The top US state locations of the authors.


Demographic Location (City): The top city locations of interaction authors.

Demographic Professions: A pie chart of the top interaction author professions.

Demographic Location (Country): The top country locations of interaction authors.

Demographic Likes & Interests: The most common likes and interests of the interaction authors.


6. Custom Preview
A custom report is generated by selecting targets from a list. Targets with a numeric value allow selection of Volume or Mean, Min, Max charts. Targets with an array of string values allow selection of Volume or Top. Targets which have free text values allow selection of Volume or Word Cloud.

Historic Preview Billing
Historic previews are billed at a fixed cost of 10 DPUs, with an extra 2 DPUs added for each day in the preview. The cost is the same for every report type, even custom reports with many targets.

For example, a preview covering 5 days:

• Preview cost: 10 DPU
• Per-day cost: 5 × 2 DPU = 10 DPU
• Total: 20 DPU

There are no data licensing costs for historic preview because the interactions themselves are never delivered, only charts and aggregated statistics about the interactions.

Previews which are cancelled before completion are not charged.


Configuring Historic Preview
To run a historic preview, first select the filter.

Figure 141: Select Filter

From the filter summary page, click Historic Preview.

Figure 142: Click Historic Preview

Only one preview is available for a filter at any one time. If a preview has already been made, click the Overwrite Historic Preview button to allow a new preview to be configured.

Figure 143: Overwrite an existing preview


Ensure the time zone is correct and select the start and end dates and times.

Figure 144: Select start and end times

The DPU cost is calculated automatically.

Figure 145: Processing Cost

Clicking "?" shows the calculation.

Figure 146: DPU Cost Calculation


Select one of the six report types.

Figure 147: Select Report Type

Wait for the report to build; this typically takes less than 5 minutes. The more days included in the preview, the longer the report takes to build.

Figure 148: Report Building

When the building is complete, the report is displayed, a notification is created and an email is sent.

Figure 149: Report Complete Email


Downloading Reports
The whole report cannot be downloaded, but each individual chart can.

Each chart has a download icon. Click this icon to open a download window.

Click Download to download a PNG image file.

Figure 150: Download PNG File


11 Configuring Historic Stream Recording
So far, you have seen how to create multiple filter conditions and submit them for compilation in the DataSift platform using either the web application or the API. This section explains how to query the historic archive for all interactions which match a filter.

Historic Tasks
DataSift has an historic archive of interactions which is queried using the same filters used for live interactions.

Capturing of interactions to the archive started at different times for each source. The most complete archive is for Twitter interactions.

• Pre-December 2011
  o Partially augmented Twitter

• May 2012
  o Facebook added
  o More augmentations added

• July 2012
  o Fully augmented

• August 2012
  o Demographics added

• November 2012
  o Bitly & Newscred added

• Aug-Dec 2013
  o Tumblr & WordPress added

REFERENCE: Full list of sources and earliest archived interactions http://dev.datasift.com/docs/historics/archive/schema

Historic tasks are queued and processed 100x faster than live streams.

Up to 31 days of archive can be queried in one task. To retrieve a longer duration, use multiple historic tasks.


Data Destinations
To store the matching interactions for later analysis, a destination is required. Destinations are listed on the Data Destinations page of the web interface. Click Browse Destinations to see all available destinations.

Figure 151: Data Destinations Page

Notice that some destinations have a padlock icon; these deliver the data using more secure protocols and authentication. Any destinations which are unavailable to the account display an Inquire button to request access.

Click the + symbol to add one of these destinations. Configuration of each destination is covered in separate training modules.


Starting Historic Tasks
To start a historic task, locate and click the filter name on the Streams page.

Figure 152: Select Filter

The summary page opens with options to Use This Stream. To create a historic task from this filter, click the Historic Query button.

Figure 153: Historic Query Button

In step one, the new historic task requires start and end times. Start times are available back to the start of the archive in 2010, and the task can cover a duration of up to 31 days after the start date.

Check the time zone has been correctly detected and give the task a name before clicking Continue.

Select the required data source and choose to query all of the archive (100%) or a 10% sample of the interactions in the archive.


Figure 154: Historic Task Step 1

In step two, select which destination is used for the stream. Use the settings icon next to each destination to review destination configuration.

Figure 155: Historic Task Step 2


In step three, review the New Historic Query summary and click Start Task.

Figure 156: Historic Task Confirmation


Viewing Historic Tasks

To view queued, running and completed historic tasks, open the Tasks page. Tasks are divided into Recordings and Historic Queries. The page opens showing all tasks, including the previously configured Historic Query.

Figure 157: Tasks Page

The task stays in a Task queued state until it reaches the top of the queue.

When running, the percentage shown after Task running reflects how far through the total duration of the query the historic task has progressed. Click on the task name to view more information about the task. Historic queries are split into chunks for processing, and the progress of each chunk is shown.

Figure 158: Historic Task Details


Pausing Historic Tasks

It is possible to pause historic tasks in the queue and to pause data delivery. This could be used when the destination is offline for maintenance.

Figure 159: Pause Button

To resume delivery, click the Resume Delivery button.

Figure 160: Resume Button

Stopping Historic Tasks

To stop a task while it is queued or during delivery, locate the task in the Tasks page and click Stop Task.

Figure 161: Stop Task Button


Deleting Historic Tasks

When the historic task is complete, a notification is emailed to the logged-in user.

Figure 162: Export Complete Email Notification

To delete a historic task, open the Tasks page and click the Delete Task link.

Figure 163: Delete Task Link

Click OK to confirm the export task deletion.

Figure 164: Delete Task Confirmation

WARNING: Deleted tasks are not recoverable


12 Configuring Destinations – Amazon S3

Amazon Simple Storage Service (S3) is an online file storage web service. Amazon S3 is available in the DataSift platform as a data destination. This section explains how to configure Amazon S3 for the DataSift platform and how to configure the DataSift platform to use Amazon S3.

Configuring Amazon S3 for DataSift

S3 is part of Amazon Web Services (AWS). To configure S3, either create a new AWS account or log in using an existing AWS account.

Creating AWS Account

To create a new account, go to http://aws.amazon.com and click the Sign Up button. Enter an email address and select I am a new user.

Figure 165: Creating AWS Account

Follow the instructions on the page to complete creation of a new account.

Click the Create account and continue button.

Credit card details are required to complete sign-up. At the time of writing, it is possible to use S3 to store a small amount of data at no charge.


Add AWS services to the new account. The Add Now button adds all services, including S3.

Figure 166: Add AWS Services

Complete identity verification by entering a phone number, waiting for a call and entering the identification number shown on the page.

Figure 167: Identity Verification


Select an AWS support plan. It is possible to select a free plan.

Figure 168: Select Support Plan

When account configuration is complete, use the link to launch the AWS Management Console.

Figure 169: Launch Console Link


Signing in to AWS Account

Go to http://aws.amazon.com and complete the authentication page.

Figure 170: AWS Sign In

When at the AWS console page, select the S3 service from the Storage and Content Delivery group.

Figure 171: Selecting S3 Service


Creating Buckets

Storage is divided into buckets. Use the Create Bucket button to configure a bucket to be used by the DataSift platform.

Figure 172: Creating a Bucket

Select a region in which the data is stored and provide a bucket name. Some regions have restrictions on the characters which can be used in bucket names. Click Create to continue.

Figure 173: Naming a Bucket


Creating Folders

The DataSift platform requires a bucket name and a folder name. Create a folder by clicking Create Folder, entering a folder name and clicking the check mark.

Figure 174: Creating a Folder

NOTE: If a folder specified in a DataSift subscription does not exist, the DataSift platform will create it.

Locating Authorization Keys

The DataSift platform requires AWS security credentials to send data to S3. To locate AWS security credentials, click on the username in the S3 console and click Security Credentials.

Figure 175: Security Credentials Link in S3 Console


Either create and use AWS Identity and Access Management (IAM) users or click Continue to Security Credentials. This example assumes IAM is not being used.

Figure 176: Continue to Security Credentials

NOTE: AWS Best Practice is to use Identity and Access Management (IAM). For simplicity, IAM is not used in this training module.

Expand the Access Keys heading and click Create New Access Key.

Figure 177: Create New Access Key

Copy the keys from the page or click Download Key File to download a CSV file of the keys.

Figure 178: Copy or Download Access Key File (keys abbreviated in this example)


With the account configured with a bucket and folder, and a copy of the security credentials available, the next step is to configure Amazon S3 as a destination in the DataSift platform.

Configuring DataSift for Amazon S3

When Amazon S3 is configured in the web application, it remains available as a data destination when creating or modifying tasks. When using the API to configure Amazon S3 as a destination, the S3 configuration information must be provided as each task is defined.

Configuring Destination in Web Application

To configure Amazon S3 as a destination in the web application, open the Data Destinations page, click Browse Destinations and click the + symbol in the Amazon S3 box.

Figure 179: Adding Amazon S3 Destination


Complete the form with the information used when creating the Amazon S3 bucket and folder.

WARNING: If the combination of delivery frequency and max file size is not sufficient for all the interactions in the stream, you may lose data. For example, the stream may generate 15MB of data every 60 seconds but the frequency and max file size of the destination may impose a limit of 10MB every 60 seconds.
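
As a rough safety check, compare the stream's data rate against the maximum throughput the destination settings allow. A small illustrative Python calculation, using the figures from the warning above:

# Illustrative figures only, taken from the warning above.
stream_bytes_per_min = 15 * 1024 * 1024    # stream generates ~15MB per minute
max_size = 10 * 1024 * 1024                # destination max file size: 10MB
delivery_frequency = 60                    # destination delivers every 60 seconds

# Maximum bytes the destination can accept per minute.
dest_bytes_per_min = max_size * (60.0 / delivery_frequency)

if stream_bytes_per_min > dest_bytes_per_min:
    print 'Risk of data loss: increase max file size or delivery frequency'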

Figure 180: Destination Details

Locate the Amazon S3 security credentials from the AWS console and enter them in the Auth fields. Use the Test Connection button to ensure the destination is available and the credentials are correct.


Click Create & Activate to continue.

Figure 181: Adding Keys and Testing Connection

The new destination appears under My Destinations on the Data Destinations page. Amazon S3 can be configured multiple times with different buckets, folders and delivery rates.

Figure 182: My Destinations

When creating new tasks in the web application, the destination is available to select.

Figure 183: New Destination Available to Tasks


Configuring Destination in API

Amazon S3 destinations configured in the web application cannot be referenced when creating new tasks programmatically. The REST API is used to configure Amazon S3 as a destination every time a new task is created. The following information is needed:

Name                 Description                                           Example
Bucket Name          The bucket name in S3                                 siftersmithbucket1
Directory Name       The folder name in S3                                 Starbucks1
Access Key           The S3 access key                                     JHDSOWUREHDOSJDOA
Secret Key           The S3 secret key                                     hy64fgHJ85T43erOP045Fcvfd
Delivery Frequency   How often to deliver data to S3                       10 seconds
Max Delivery Size    How much to deliver each time (in bytes)              10485760
Username             DataSift platform username                            siftersmith
API Key              DataSift platform API key                             8a8ee4e36c9d2171ccee4eec55
                     (available in the web application)
Stream Hash          The hash for a compiled filter (the filter can be     bfd56316e55d8d480b89c7
                     compiled in the web application or programmatically)

The push/validate endpoint of the REST API allows validation of S3 parameters. In this example, the curl command is used to send the arguments in a query string:

$ curl \
  "https://api.datasift.com/v1/push/validate?output_type=s3&output_params.bucket=siftersmithbucket1&output_params.directory=Starbucks1&output_params.acl=private&output_params.auth.access_key=AKIAJILMG&output_params.auth.secret_key=H05GAejDyS&output_params.delivery_frequency=10&output_params.max_size=10485760&username=siftersmith&api_key=b36697c55"

Page 144: DataSift User Guide

138 Copyright© DataSift. All Rights Reserved.

The call is more readable when using the -d option of the curl command:

$ curl -X POST 'https://api.datasift.com/v1/push/validate' \
  -d 'output_type=s3' \
  -d 'output_params.bucket=siftersmithbucket1' \
  -d 'output_params.directory=interactions' \
  -d 'output_params.acl=private' \
  -d 'output_params.auth.access_key=AKIAJILMG' \
  -d 'output_params.auth.secret_key=H05GAejDyS' \
  -d 'output_params.delivery_frequency=10' \
  -d 'output_params.max_size=10485760' \
  -H 'Authorization: siftersmith:b36697c55'

If successful, the following JSON object is returned:

{ "success": true, "message": "Validated successfully" }

Other training modules are available that show how to create, pause, resume, stop, and monitor REST API delivery of streams to destinations.


13 Configuring Destinations – Google BigQuery

One of the strengths of the DataSift platform is the ease with which streams can be sent to destinations. This section explains how to configure Google BigQuery to accept data from the DataSift platform and how to configure tasks to send output streams to Google BigQuery.

Google BigQuery can be used by analysis tools such as the visual analysis tools provided by Tableau. This destination is only available to Enterprise Edition customers.

REFERENCE: DataSift documentation for Google BigQuery http://dev.datasift.com/docs/push/connectors/bigquery

Google BigQuery

Google Cloud Platform is a Platform as a Service (PaaS). It comprises multiple services for hosting applications, storing data, and computing. Google BigQuery is the Google Cloud Platform service for querying billions of rows of data within tables.

Google BigQuery is a web service and REST API.

When using BigQuery as a data destination, batches of interactions are queued and sent to BigQuery tables at 90-second intervals. Each interaction becomes a new table row.

The following example shows a Structured Query Language (SQL) query to a dataset which could have billions of rows.

Figure 184: BigQuery Example


Google BigQuery Terminology

Google BigQuery uses the following terminology:

Projects
The top-level container for Google cloud services is a project. It stores information about authentication and billing. Each project has a name, an ID, and a number.

Datasets
Datasets allow for organization and access control to multiple tables.

Tables
BigQuery data is held in tables along with a corresponding table schema to describe the fields. When used with DataSift, the schema is automatically derived from the JSON objects in the stream.

Jobs
Jobs are actions which are executed by BigQuery, for example querying a table for particular records. Jobs are executed asynchronously and may take a long time to complete.


Configuring Google BigQuery for DataSift

Logging in to Google

Go to http://developers.google.com/console and log in using existing credentials or use the Create an account link to create a new account.

Figure 185: Log in or Create an Account

To proceed, the Terms of Service box must be selected before clicking Continue.

Figure 186: Terms and Conditions


Creating a Project

Projects are the top-level container for all Google Cloud Platform services. The console opens on a pre-configured default Project. To create a new project, click the API Project link which returns to the top-level menu.

Figure 187: API Project Link

Click the CREATE PROJECT button.

Figure 188: Create Project


When using a new account or using an existing account with Google Cloud services for the first time, SMS verification may be required. Click Continue and follow the instructions to receive a code number on a mobile phone which must be entered into the web page.

Figure 189: SMS Account Verification

When SMS verification is complete, enter a Project name. The Project IDs are generated automatically. Click the refresh button in the Project ID field to create new IDs. Click Create to continue.

Figure 190: Project name and ID


From the project overview page, make note of the Project ID and Project Number. Both are required when configuring the DataSift platform to use BigQuery as a destination.

Figure 191: New Project Overview

Enabling Billing

A small amount of data may be stored in a Cloud Datastore and used with BigQuery for no charge. However, this is not available until billing information has been completed. Complete billing information by clicking Settings.

Figure 192: Enabling Project Billing

NOTE: Billing information must be entered for each project separately.


Complete the billing information form. When complete, the Enable billing button turns into a Disable billing button.

Figure 193: Enabled Billing

Retrieving Authentication Details

The DataSift platform requires authentication credentials to send data to a BigQuery table. This is done with public and private keys. To generate the credentials and public/private keys, click APIs & auth then Credentials from the project summary. Then click the CREATE NEW CLIENT ID button.

Figure 194: Retrieving Credentials


Select Service account and click Create client ID.

Figure 195: Create Client ID

The new public and private keys are created. The private key is automatically downloaded with a .p12 filename extension.

Figure 196: p12 File Downloaded

The private key password is also displayed. Make a note of the password as it is not shown again. When ready, click Okay, got it.

Figure 197: Public/Private Key Generated


The new credentials are displayed. Make a note (or download the JSON object) of the Client ID and Email address.

Figure 198: Service Account Credentials

The following credentials are required to configure BigQuery as a data destination:

• Client ID
• Email address
• p12 key file

Configuring Datasets

From the project summary page, click the BigQuery link. A new browser window opens for https://bigquery.cloud.google.com/.

Figure 199: Click BigQuery Link


The BigQuery console opens with example datasets. A new dataset must be configured which is used by the DataSift platform when sending the stream of interactions. The stream automatically creates a new table in the dataset.

Click the project menu and select Create new dataset.

Figure 200: Click Create new dataset

Enter a Dataset ID using only letters, numbers, and underscores.

Figure 201: Enter Dataset ID


Configuring DataSift Web Application for Google BigQuery

The dataset and credential information created in the previous section are used to configure the BigQuery destination in the DataSift platform. This section looks at the web application configuration.

Configuring BigQuery Destination

Open the DataSift platform web application by logging in at datasift.com. From the Data Destinations page, click the '+' symbol on the Google BigQuery tile.

Figure 202: Select Google BigQuery Destination

Complete the New Google BigQuery Destination form with the following information:

• Label – This name is only used in the web application. It is possible to define multiple BigQuery destinations to different projects, datasets and tables. Use this name to differentiate multiple instances.
• Table ID – The name of a table which is created automatically in the chosen dataset. Whitespace is not permitted in table names.
• Dataset ID – The dataset in which to create a table. This must exist in BigQuery.
• Project ID – The project ID or project number.
• Client ID – The client ID generated by the creation of new service account credentials.
• Service account email address – The email address generated by the creation of new service account credentials.
• P12 Key file – The private key file which was automatically downloaded by creation of new service account credentials.

Figure 203: New Google BigQuery Destination Form

The new destination appears in My Destinations. Notice how multiple instances of BigQuery destinations are referenced by their label.

Figure 204: New Data Destination Saved


Configuring Stream Tasks Using BigQuery

From the Streams page, create or select a stream to use with the Google BigQuery destination. A task is created using a live stream recording or historic data. In this example, a live recording is used. Select Record Stream from the stream summary.

Figure 205: Click Record Stream

Configure start & end times and give the recording a name. Click Continue.

Figure 206: Select Task start and end times


Select a destination. The new BigQuery destination is available for selection. Click Continue.

Figure 207: Select BigQuery Destination

Check the details and confirm by clicking Start Task.

Figure 208: Start Task


The stream sends interactions to Google BigQuery which automatically creates a table using the name provided in the destination configuration. In this example, the Starbucks_Table1 table has been created and the schema is displayed.

To view the table, expand the dataset and click on the table name.

Figure 209: New Table Schema

On the Table Details page, click the Details button to see the size of the table in bytes and rows.

Figure 210: Table Details

If an end date and time were not specified, remember to Pause or Stop the recording task in the DataSift web application when enough records have been received.


Configuring DataSift API for Google BigQuery

Google BigQuery is available as a data destination when using the push endpoints in the REST API. When using Google BigQuery as a destination, the p12 key file must be Base64 encoded and then URL encoded in order to remove URL-unsafe characters. This can be done with the following p12tobigquery Python script:

import argparse
import base64
import urllib
import sys

# parse arguments
parser = argparse.ArgumentParser(
    description='Convert a .p12 file into a string a Google BigQuery Push connector can use.')
parser.add_argument('-f', required=True, action='store', dest='fin',
                    help='the name of the .p12 file')
args = parser.parse_args()

with open(args.fin, 'r') as f:
    p12 = f.read()

sys.stdout.write(urllib.quote(base64.b64encode(p12)))

Figure 211: p12tobigquery Python Script

REFERENCE: Link to Python script https://gist.github.com/paulsmart/8197435

In the following example, the push/create endpoint is used with an existing stream hash. Command substitution is used to run the key file encoding script inline.

$ curl -X POST 'https://api.datasift.com/v1/push/create' \
  -d 'name=googlebigquery' \
  -d 'hash=c0e53815905869ac96aa80358' \
  -d 'output_type=bigquery' \
  -d 'output_params.project_id=617232322419' \
  -d 'output_params.dataset_id=Dataset_Starbucks' \
  -d 'output_params.table_id=Starbucks_Table1' \
  -d 'output_params.auth.client_id=0000.apps.googleusercontent.com' \
  -d '[email protected]' \
  -d "output_params.auth.key_file=`python ./p12tobigquery.py -f 7bdb72743e7da7605fef5c-privatekey.p12`" \
  -H 'Authorization: datasift-user:your-datasift-api-key'

See push delivery training modules for more information on the push API endpoints.

REFERENCE: REST API endpoints http://dev.datasift.com/docs/rest-api


Querying Data in BigQuery

From the Table Details page in the BigQuery console, click Query Table.

Figure 212: Click Query Table Button

An editor opens at the top of the page. BigQuery uses Structured Query Language (SQL) to query the table. Use the SELECT clause to select fields from the interactions table. The table name is already defined in the FROM clause. Use the LIMIT clause to limit the number of records returned. Click RUN QUERY to run the query statement.

Figure 213: Example Query

NOTE: Databases in BigQuery are append-only.

REFERENCE: Query language documentation https://developers.google.com/bigquery/query-reference


In this example, the first five records are returned showing only the author name. The query took 1.2 seconds of elapsed time to complete.

Figure 214: Example Query Output

Use the Download as CSV and Save as Table buttons to download the output as a CSV file or create a new table for further querying.

Query History

All queries are saved and available for editing or running by clicking Query History.

Figure 215: Recent Query History


Deleting Google Cloud Projects

Ensure the streaming task from DataSift is paused, stopped or deleted.

Figure 216: Paused Task

In the BigQuery project page, select Delete table from the table menu.

Figure 217: Delete Table


Select Delete dataset from the dataset menu.

Figure 218: Delete Dataset

Return to the project summary. From the settings page, click the Disable billing button.

Figure 219: Disable Billing


From the developer's console, select the project and click Delete. It is necessary to confirm the deletion by typing the project ID when prompted.

Figure 220: Delete Project

The deletion is scheduled and may take several hours to complete.


14 Configuring Push Delivery – API

Sending an output stream of filtered interactions to a destination is called a recording task in the web interface. When using the REST API, it is called a push delivery subscription. This section explains how to create push delivery subscriptions, monitor them throughout their duration, and finally stop them.

Push Delivery Workflow

After authentication credentials have been located, the destination is validated and a push delivery subscription created. While running, the subscription is monitored using the log and get endpoints, or paused and resumed while data is buffered. Finally, it is stopped and deleted.

Figure 221: Push Delivery Workflow

Locating API Credentials

Access to the REST API is controlled by Username and API Key. Both of these are available in the web application.

Figure 222: Locating API Credentials (API Key truncated in example)


Validating Push Destinations

Every time a new push subscription is created in the REST API, the configuration details of the destination are required. Using Amazon S3 as an example, the following information is needed:

Name                 Description                               Example
Bucket Name          The bucket name in S3                     siftersmithbucket1
Directory Name       The folder name in S3                     Starbucks1
Access Key           The S3 access key                         JHDSOWUREHDOSJDOA
Secret Key           The S3 secret key                         hy64fgHJ85T43erOP045Fcvfd
Delivery Frequency   How often to deliver data to S3           10 seconds
Max Delivery Size    How much to deliver each time             10485760
Username             DataSift platform username                siftersmith
API Key              DataSift platform API key                 8a8ee4e36c9d2171ccee4eec55
                     (available in the web application)

The validate endpoint of the REST API allows validation of the destination parameters. In this example, the curl command is used to send the arguments in a query string:

$ curl \
  "https://api.datasift.com/v1/push/validate?output_type=s3&output_params.bucket=siftersmithbucket1&output_params.directory=Starbucks1&output_params.acl=private&output_params.auth.access_key=AKIAJILMG&output_params.auth.secret_key=H05GAejDyS&output_params.delivery_frequency=10&output_params.max_size=10485760&username=siftersmith&api_key=b36697c55"

The call is more readable when using the -d option of the curl command:

$ curl -X POST 'https://api.datasift.com/v1/push/validate' \
  -d 'output_type=s3' \
  -d 'output_params.bucket=siftersmithbucket1' \
  -d 'output_params.directory=interactions' \
  -d 'output_params.acl=private' \
  -d 'output_params.auth.access_key=AKIAJILMG' \
  -d 'output_params.auth.secret_key=H05GAejDyS' \
  -d 'output_params.delivery_frequency=10' \
  -d 'output_params.max_size=10485760' \
  -H 'Authorization: siftersmith:b36697c55'


If successful, the following JSON object is returned:

{ "success": true, "message": "Validated successfully" }

REFERENCE: Documentation of the push/validate endpoint: http://dev.datasift.com/docs/api/1/pushvalidate
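
The same validation can be performed programmatically rather than with curl. The following is a minimal Python sketch using only the standard library; the credentials and bucket details are the placeholder values used above:

import json
import urllib
import urllib2

params = urllib.urlencode({
    'output_type': 's3',
    'output_params.bucket': 'siftersmithbucket1',
    'output_params.directory': 'interactions',
    'output_params.acl': 'private',
    'output_params.auth.access_key': 'AKIAJILMG',
    'output_params.auth.secret_key': 'H05GAejDyS',
    'output_params.delivery_frequency': 10,
    'output_params.max_size': 10485760,
})

request = urllib2.Request(
    'https://api.datasift.com/v1/push/validate',
    data=params,   # a POST body, matching the -d form of the curl example
    headers={'Authorization': 'siftersmith:b36697c55'})

response = json.load(urllib2.urlopen(request))
print response['message']   # "Validated successfully" on success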

Creating Push Subscriptions

The push/create endpoint uses the same syntax as the validate endpoint but also requires a stream hash. The stream hash is taken from the web application or the JSON response to an API compile request.

Name          Description                                           Example
Stream Hash   The hash for a compiled filter (the filter is         bfd56316e55d8d480b89c7
              compiled in the web application or programmatically)

Locating Stream Hashes

In the web application, go to the Streams page and click Consume via API.

Figure 223: Consume via API Button


The data shown on the Consume this stream via API screen includes:

1. Stream Hash - A unique reference for the stream filter
2. API Key - An authentication key is currently required to make any API calls
3. Explore our Libraries - Examples of how to use the API in multiple languages

Figure 224: API Information in the Web Application

When compiling filters programmatically, the stream hash is returned in the JSON object from a compile endpoint. In the following example, a simple CSDL filter is compiled using curl and the hash is returned.

$ curl \
  "https://api.datasift.com/v1/compile?csdl=twitter.text%20contains%20%22Starbucks%22&username=siftersmith&api_key=b366978a8ee55"

{ "hash": "bfd56316e55d8d480b89c73359e6d", "created_at": "2013-11-08 12:37:31", "dpu": "0.1" }

REFERENCE: Documentation of the compile endpoint: http://dev.datasift.com/docs/api/1/compile
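
Compiling a filter programmatically follows the same pattern. A Python sketch of the compile call above, again with the placeholder credentials:

import json
import urllib
import urllib2

# Compile a CSDL filter and extract the stream hash from the JSON response.
query = urllib.urlencode({
    'csdl': 'twitter.text contains "Starbucks"',
    'username': 'siftersmith',
    'api_key': 'b366978a8ee55',
})

response = json.load(
    urllib2.urlopen('https://api.datasift.com/v1/compile?' + query))
print response['hash']   # e.g. bfd56316e55d8d480b89c73359e6d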


Writing push/create Calls

The push/create call is made using the destination information, API credentials, and stream hash. The following example uses the curl utility to pass these arguments to the API endpoint.

$ curl -X POST 'https://api.datasift.com/v1/push/create' \
  -d 'name=myamazonsubscription' \
  -d 'hash=bfd56316e55d8d480b89c73359e6d' \
  -d 'output_type=s3' \
  -d 'output_params.bucket=siftersmithbucket1' \
  -d 'output_params.directory=interactions' \
  -d 'output_params.acl=private' \
  -d 'output_params.auth.access_key=AKIAJILMG' \
  -d 'output_params.auth.secret_key=H05GAejDyS' \
  -d 'output_params.delivery_frequency=60' \
  -d 'output_params.max_size=10485760' \
  -H 'Authorization: paulsmart:b36697c55'

If the push/create is successful, then a JSON object similar to the one shown is received:

{ "id": "3a9d78afc28d8c71262d1d5f4e280c9f", "output_type": "s3", "name": "myamazonsubscription", "created_at": 1384183625, "hash": "bfd56316e55d8d480b89c73359e6d", "hash_type": "stream", "output_params": { "bucket": "siftersmithbucket1", "directory": "interactions", "acl": "private", "delivery_frequency": 60, "max_size": 10485760 }, "status": "active", "last_request": null, "last_success": null, "remaining_bytes": null, "lost_data": false, "start": 1384183625, "end": 0 }

REFERENCE: Documentation of the push/create endpoint, including how to use scheduled start: http://dev.datasift.com/docs/api/1/pushcreate


Checking Push Subscriptions

The push/get endpoint is used to check the status of a push delivery subscription. With no arguments, the status of all subscriptions is returned in the JSON object.

$ curl -X POST 'https://api.datasift.com/v1/push/get' \
  -H 'Authorization: siftersmith:b36697c55'

{ "subscriptions": [ { "id": "3a9d78afc28d8c71262d1d5f4e280c9f", "output_type": "s3", "name": "myamazonsubscription", "created_at": 1384183625, "user_id": 28619, "hash": "bfd56316e55d8d480b89c73359e6d", "hash_type": "stream", "output_params": { "bucket": "siftersmithbucket1", "directory": "interactions", "acl": "private", "delivery_frequency": 60, "max_size": 10485760 }, "status": "active", "last_request": 1384183657, "last_success": 1384183658, "remaining_bytes": null, "lost_data": false, "start": 1384183625, "end": null } ], "count": 1 }

NOTE: The remaining_bytes field (the number of bytes buffered ready to send) is always null when calling push/get on all subscriptions. Specify an individual subscription to see a remaining_bytes value.


Notice the following two attributes have changed since the push/create call:

Attribute      Description                                   Value when created   Value now
LAST_REQUEST   The time of the most recent push delivery     null                 1384183657
               request sent to the associated data
               destination. A Unix timestamp.
LAST_SUCCESS   The time of the most recent successful        null                 1384183658
               delivery. A Unix timestamp.

To get the status of a particular subscription, use the id argument:

$ curl -X POST 'https://api.datasift.com/v1/push/get' \
  -d 'id=d468655cfe5f93741ddcd30bb309a8c7' \
  -H 'Authorization: datasift-user:your-datasift-api-key'

REFERENCE: Documentation of the push/get endpoint: http://dev.datasift.com/docs/api/1/pushget
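
For unattended monitoring, the call to push/get can be wrapped in a simple polling loop. A Python sketch, with the one-minute interval chosen for illustration and the response layout for a single-id call assumed to match the subscription objects shown above:

import json
import time
import urllib
import urllib2

SUB_ID = 'd468655cfe5f93741ddcd30bb309a8c7'   # placeholder id from above

def push_get(sub_id):
    # Query the status of a single subscription via push/get.
    request = urllib2.Request(
        'https://api.datasift.com/v1/push/get',
        data=urllib.urlencode({'id': sub_id}),
        headers={'Authorization': 'datasift-user:your-datasift-api-key'})
    return json.load(urllib2.urlopen(request))

# Poll once a minute while the subscription is active; remaining_bytes
# is populated because a single subscription is being queried.
subscription = push_get(SUB_ID)
while subscription.get('status') == 'active':
    print subscription.get('status'), subscription.get('remaining_bytes')
    time.sleep(60)
    subscription = push_get(SUB_ID)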

Retrieving Push Subscription Logs

Push subscriptions may return messages advising that delivery is complete or that an error has occurred. The push/log endpoint is the single most important endpoint for retrieving these messages and troubleshooting subscription problems. The minimum information required is just the API credentials; this retrieves log information for all subscriptions.

$ curl -X POST 'https://api.datasift.com/v1/push/log' \
  -H 'Authorization: datasift-user:your-datasift-api-key'


Example Output:

{ "success": true, "count": 4, "log_entries": [ { "subscription_id": "4b7ce39a5292b96ccd98f69324b0dc99", "success": true, "request_time": 1344859261, "message": "The delivery has completed" }, { "subscription_id": "13ba92f6784da5e60b82f532f43c7d17", "success": false, "request_time": 1344855061, "message": "The delivery was paused for too long" }, { "subscription_id": "4e097f46ef0dd2e8e3f25f84dddda775", "success": false, "request_time": 1344630221, "message": "Stopped due to too many failed delivery attempts" }, { "subscription_id": "4e097f46ef0dd2e8e3f25f84dddda775", "success": false, "request_time": 1344630221, "message": "The endpoint returned a 500 internal server error" } ] }

Subscription status changes are also provided:

"message": "The status has changed to: finished"

The following message advises that data is not being consumed fast enough and is being lost:

"message": "Some data remained in the queue for too long and so was expired (consumer too slow)"


Pausing Push Subscriptions

It is possible to pause a push subscription for up to one hour. This may be required if the destination requires scheduled downtime. The data is buffered and delivered when the push subscription is resumed. Use the push/pause endpoint with a subscription id as the argument.

$ curl -X POST 'https://api.datasift.com/v1/push/pause' \
  -d 'id=d468225f93741ddcd30bb309a8c7' \
  -H 'Authorization: datasift-user:your-datasift-api-key'

The returned JSON object includes the following status:

"status": "paused",

REFERENCE: Documentation on the push/pause endpoint: http://dev.datasift.com/docs/api/1/pushpause

Resuming Push Subscriptions

Paused subscriptions should be resumed as quickly as possible to prevent buffered data expiring and being lost. To resume a paused push subscription, use the push/resume endpoint with the id of the push subscription as an argument.

$ curl -X POST 'https://api.datasift.com/v1/push/resume' \
  -d 'id=d468225f93741ddcd30bb309a8c7' \
  -H 'Authorization: datasift-user:your-datasift-api-key'

The returned JSON object includes the following status:

"status": "active",

REFERENCE: Documentation on the push/resume endpoint: http://dev.datasift.com/docs/api/1/pushresume


Stopping Push Subscriptions

To stop an active push subscription, use the push/stop endpoint with the id of the push subscription as an argument.

$ curl -X POST 'https://api.datasift.com/v1/push/stop' \
  -d 'id=d468225f93741ddcd30bb309a8c7' \
  -H 'Authorization: datasift-user:your-datasift-api-key'

Stopped subscriptions cannot be restarted or resumed. The returned JSON object includes the following status:

"status": "finishing",

REFERENCE: Documentation on the push/stop endpoint: http://dev.datasift.com/docs/api/1/pushstop

Deleting Push Subscriptions

Deleting is not necessary as finished push subscriptions are automatically deleted after two weeks.

If immediate deletion is required, use the push/delete endpoint with the id of the push subscription as an argument. Any undelivered data in the buffer will not be delivered but will still be charged for. To avoid data loss, stop the push subscription and use the push/get endpoint to ensure the status equals delivered.

$ curl -X POST 'https://api.datasift.com/v1/push/delete' \
  -d 'id=d468225f93741ddcd30bb309a8c7' \
  -H 'Authorization: datasift-user:your-datasift-api-key'

There is no JSON object returned; the response is an HTTP 204 status code. Deleted subscriptions cannot be recovered and all logs are deleted immediately.

REFERENCE: Documentation on the push/delete endpoint: http://dev.datasift.com/docs/api/1/pushdelete


15 Configuring Destinations – MySQL

One of the strengths of the DataSift platform is the ease with which streams can be sent to destinations. This section explains how to configure a MySQL instance in Amazon Web Services to accept data from the DataSift platform.

MySQL is used directly by some analysis tools such as those provided by Tableau. This destination is only available to Enterprise Edition customers.

REFERENCE: DataSift documentation for MySQL http://dev.datasift.com/docs/push/connectors/mysql

Configuring Amazon RDS

Amazon RDS is a cloud relational database service which allows configuration of a MySQL database. Open the Amazon AWS Console and click RDS.

https://console.aws.amazon.com/console/

Figure 225: Click RDS


Creating RDS Instance

RDS allows multiple database instances of different types. To create a new instance using MySQL, click the Launch a DB Instance button.

Figure 226: Launch an RDS Instance


From the list of database types, click Select next to MySQL.

Figure 227: Select MySQL

Choose whether this instance will be used for production purposes. This example assumes it will not be used for production.

Figure 228: Select production purpose

Choose the appropriate parameters for your database. This example uses the latest DB engine version on a micro instance class with 10GB of storage.


Enter the database instance identifier (the instance name), username and password, and make a note of these details as they will be required later.

Figure 229: DB Instance Details

The RDS instance is created with one database. More databases can be created within the RDS instance later.

Enter a name for the first database and choose the port. 3306 is the default port and would only need to be changed if this port was blocked by a firewall.

The VPC is a Virtual Private Cloud and can be thought of as the virtual network with a firewall to which the database is connected. Multiple VPCs can be configured with different firewall rules. In this example, the Default VPC is used for the subnet group and security group.

NOTE: Rules to allow database queries through the VPC firewall will be added later.


Ensure the instance is Publicly Accessible.

Figure 230: Additional Config

Select automatic backups and a maintenance window if required. In this example, automated backups are disabled.

Figure 231: Select Management Options


Review the configuration and click Launch DB Instance.

Figure 232: Review


The following confirmation is displayed.

Figure 233: Creation Confirmation

Allow up to 10 minutes for the new RDS instance to be created. In this example, the Status is still creating.

Figure 234: Instance List


Click on the arrow at the start of the row to see the database details.

Figure 235: Instance Details

Configuring Network Security

The default VPC security group is blocking access to the database instance. To add a rule allowing access, click the name of the security group in the instance details page.

Figure 236: Security Group

The Security Groups page in the EC2 dashboard opens with a filter for the security group being used for the database instance.


Click the Inbound tab, followed by the Edit button to create a new rule which allows traffic from the Internet.

Figure 237: Security Group Inbound Rules

Click Add Rule.

Figure 238: Add inbound rule

Select MySQL as the Type and specify the permitted source of traffic for the database instance. If access should be permitted from anywhere, select Anywhere (0.0.0.0/0). Otherwise use CIDR notation to specify an individual IP address (e.g. 52.127.88.102/32) or network (e.g. 52.0.0.0/16).

Figure 239: Configure inbound rule


Verifying Network Security Configuration

Locate the database endpoint and port number from the instance details.

Figure 240: Instance Endpoint

Use the mysql command line utility to verify access is available. This example is being run from an Amazon EC2 instance, attempting to connect to the Amazon RDS instance.

$ mysql -h ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com \
    -P 3306 -u training --password=xxxxxxxx
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 15187
Server version: 5.6.13 MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit


Configuring Databases

Listing Databases

The new RDS instance has several pre-created databases, including the database named during instance configuration.

Use show databases; to show all databases in an RDS instance. In this example, MyDatabase is the database created by the Amazon RDS configuration wizard.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+

The only database available for use as a destination is MyDatabase. The others are internal databases used by the database server.

Creating Databases

Use the create database command to create more databases. Each database can be used as a separate DataSift destination.

mysql> create database banana;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| banana             |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+


Deleting Databases

The drop database command is used to delete a database. In this example, a database called banana is deleted.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| banana             |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+
7 rows in set (0.00 sec)

mysql> drop database banana;
Query OK, 0 rows affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+


Configuring Database Tables

Before the DataSift platform sends interactions to the database, a table must be created with the appropriate fields. The table and fields are created with SQL commands.

The interaction fields are mapped to the database fields with a mapping file (also known as an .ini file).

The configuration of a table depends on the information which is stored. It is possible to create a simple database table which only stores one attribute of each interaction, or a more complex one which stores a large number of attributes.

Listing Database Tables

Use the show databases; command to list the databases, then the use command to make one database the focus of the following commands.

Use the show tables; command to list all tables in the database which is currently in use.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| apple              |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+

mysql> use MyDatabase;
Database changed

mysql> show tables;
+----------------------+
| Tables_in_MyDatabase |
+----------------------+
| hashtags             |
| mentions             |
| twitter              |
+----------------------+


Creating Tables – CLI

The create table command is used to create new tables. The following example shows how to create a minimal table with only two fields.

mysql> create database banana;
Query OK, 1 row affected (0.00 sec)

mysql> use banana
Database changed

mysql> CREATE TABLE twitter (
    ->     interaction_id VARCHAR(64) PRIMARY KEY,
    ->     content TEXT DEFAULT NULL
    -> );
Query OK, 0 rows affected (0.08 sec)

mysql> show tables;
+------------------+
| Tables_in_banana |
+------------------+
| twitter          |
+------------------+
1 row in set (0.00 sec)

There are many examples of the SQL commands required to create tables for interactions in a DataSift GitHub repository.

Link https://github.com/datasift/push-schemas

An excerpt of an SQL schema which is used to create a table for Twitter data is shown below.

Example files: https://github.com/datasift/push-schemas/blob/master/sources/twitter/mysql.sql

CREATE TABLE twitter (
    interaction_id VARCHAR(64) PRIMARY KEY,
    interaction_type VARCHAR(64) NOT NULL,
    <output omitted>
    geo_latitude DOUBLE DEFAULT NULL,
    geo_longitude DOUBLE DEFAULT NULL,
    content TEXT DEFAULT NULL,
    twitter_lang VARCHAR(64) DEFAULT NULL,
    <output omitted>
)


Configuring Mapping

The DataSift platform sends interactions which contain many fields to the MySQL database. It is necessary to map the interaction fields to the database fields using a mapping file.

Example mapping files are available on DataSift's GitHub repository.

Example files: https://github.com/datasift/push-schemas/blob/master/sources/twitter/mapping.ini

In this example, a Twitter mapping file is being used. An excerpt of the file is shown below. It contains mappings of fields to interaction attributes and iterators which process arrays of information from the interaction.

[twitter]
interaction_id = interaction.id
interaction_type = interaction.type
created_at = interaction.created_at (data_type: datetime, transform: datetime)
author_name = interaction.author.name
author_username = interaction.author.username
<output omitted>
twitter_id = twitter.id
geo_latitude = interaction.geo.latitude
geo_longitude = interaction.geo.longitude
content = interaction.content
<output omitted>
retweet_count = twitter.retweet.count

[hashtags :iter = list_iterator(interaction.hashtags)]
interaction_id = interaction.id
interaction_type = interaction.type
created_at = interaction.created_at (data_type: datetime, transform: datetime)
hashtag = :iter._value

[mentions :iter = list_iterator(interaction.mentions)]
interaction_id = interaction.id
interaction_type = interaction.type
created_at = interaction.created_at (data_type: datetime, transform: datetime)
mention = :iter._value

Documentation: INI Files http://dev.datasift.com/docs/push/connectors/ini

Download and save the example mapping file.


Encoding Mapping Files

The contents of the file can be used with the DataSift API in clear text, or base64 encoded.

When encoded, the mapping file has no whitespace which makes it easier to handle. The following command is an example of encoding a mapping file and saving the result to a new file.

$ base64 -w 0 twittermapping.ini > twittermapping64.ini

NOTE: The encoded file does not end with a CR or LF
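
The same single-line encoding can be produced in Python, mirroring the p12tobigquery script shown earlier (the filenames are the examples used above):

import base64

# Read the mapping file and write a single-line base64 version,
# equivalent to: base64 -w 0 twittermapping.ini > twittermapping64.ini
with open('twittermapping.ini', 'r') as f:
    encoded = base64.b64encode(f.read())

with open('twittermapping64.ini', 'w') as out:
    out.write(encoded)   # no trailing CR or LF, as noted above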


Configuring DataSift Destination (API)

The first stage in using MySQL as a destination is to create the subscription using the push/create endpoint.

The push/create endpoint requires several parameters:

push/create parameter         Description/Example
name                          A user-specified name to identify the subscription
hash                          The hash of a filter
output_type                   mysql
output_params.host            The endpoint URL from the Amazon RDS instance
                              details page (excluding the port number)
output_params.port            The port number, usually 3306
output_params.schema          The clear text or base64 encoded mapping file
output_params.database        The database name. An RDS instance can have
                              multiple databases
output_params.auth.username   The username used when creating the RDS instance
output_params.auth.password   The password used when creating the RDS instance
authorization                 API authorization credentials

Creating a Subscription

The following example shows the push/create endpoint being used to create a subscription with an Amazon RDS MySQL database destination.

$ curl -sX POST https://api.datasift.com/v1/push/create \
  -d name=mysql \
  -d hash=dbdf49e22102ed01e945f608ac05a57e \
  -d output_type=mysql \
  -d output_params.host=ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com \
  -d output_params.port=3306 \
  -d output_params.schema=W3R3aXR0ZXJdCmludGVyYWN0aW9uX2lkID0gaW50ZXJhY3Rpb24uaWQKaW50ZXJhY3Rpb25fdHlwZSA9IGludGVyYWN0aW9uLnR5cGUKY3JlYXRlZF9hdCA9IGludGVyYWN0aW9uLmNyZWF0ZWRfYXQgKGRhdGFfdHlwZT<output omitted>0aW1lLCB0cmFuc2Zvcm06IGRhdGV0aW1lKQptZW50aW9uID0gOml0ZXIuX3ZhbHVlCgo= \
  -d output_params.database=MyDatabase \
  -d output_params.auth.username=training \
  -d output_params.auth.password=xxxxxxxx \
  -H 'Authorization: paulsmart:84768663b04a62ac7d4ac43'


If successful, a JSON object is returned with a status of active.

{ "created_at": 1397565784, "end": 0, "hash": "dbdf49e22102ed01e945f608ac05a57e", "hash_type": "stream", "id": "4bb94a4b68a1ec3e9d63af61802201cb", "last_request": null, "last_success": null, "lost_data": false, "name": "mysql", "output_params": { "database": "MyDatabase", "host": "ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com", "port": 3306, "schema": "W3R3aXR0ZXJdCmludGVyYWN0aW9uX2lkID0gaW50ZXJhY3Rpb24uaWQKaW50ZXJhY3Rpb25fdHlwZSA9IGludGVyYWN0aW9uLnR5cGUKY3JlYXRlZF9hdCA9IGludGVyYWN0aW9uLmN <output omitted> 6IGRhdGV0aW1lLCB0cmFuc2Zvcm06IGRhdGV0aW1lKQptZW50aW9uID0gOml0Z XIuX3ZhbHVlCgo=" }, "output_type": "mysql", "remaining_bytes": null, "start": 1397565784, "status": "active", "user_id": 28619 }

Documentation http://dev.datasift.com/docs/push/connectors/mysql


Monitoring a Subscription

Use the push/get endpoint to monitor running or paused subscriptions. This example displays the status of all active subscriptions. Use the id parameter to specify individual subscriptions.

$ curl -sX POST https://api.datasift.com/v1/push/get \
  -H 'Authorization: paulsmart:84768663b04a62ac7d4ac'

{
    "count": 1,
    "subscriptions": [
        {
            "created_at": 1397565784,
            "end": null,
            "hash": "dbdf49e22102ed01e945f608ac05a57e",
            "hash_type": "stream",
            "id": "4bb94a4b68a1ec3e9d63af61802201cb",
            "last_request": 1397566364,
            "last_success": 1397566360,
            "lost_data": false,
            "name": "mysql",
            "output_params": {
                "database": "MyDatabase",
                "host": "ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com",
                "port": 3306,
                "schema": "W3R3aXR0ZXJdCmludGVyYWN0aW9uX2lkID0gaW50ZXJhY3Rpb24uaWQKaW50ZXJhY3Rpb25fdHlwZSA9IGludGVyYWN0aW9uLnR5cGUKY3JlYXRlZF9hdCA9IGludGVyYWN0aW9uLmN<output omitted>nRlcmFjdGlvbi5jcmVhdGVkX2F0IChkYXRhX3R5cGU6IGRhdGV0aW1lLCB0cmFuc2Zvcm06IGRhdGV0aW1lLCB0cmFuc2Zvcm06IGRhdGV0aW1lKQptZW50aW9uID0gOml0ZXIuX3ZhbHVlCgo="
            },
            "output_type": "mysql",
            "remaining_bytes": null,
            "start": 1397565784,
            "status": "active",
            "user_id": 28619
        }
    ]
}

Documentation http://dev.datasift.com/docs/api/1/pushget


Stopping a Subscription

Use the push/stop endpoint with the subscription ID to stop a running subscription.

$ curl -sX POST https://api.datasift.com/v1/push/stop \
  -d id=4bb94a4b68a1ec3e9d63af61802201cb \
  -H 'Authorization: paulsmart:84768663b04a62ac7d4a'

{
    "created_at": 1397565784,
    "end": 1397566402,
    "hash": "dbdf49e22102ed01e945f608ac05a57e",
    "hash_type": "stream",
    "id": "4bb94a4b68a1ec3e9d63af61802201cb",
    "last_request": 1397566400,
    "last_success": 1397566401,
    "lost_data": false,
    "name": "mysql",
    "output_params": {
        "database": "MyDatabase",
        "host": "ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com",
        "port": 3306,
        "schema": "W3R3aXR0ZXJdCmludGVyYWN0aW9uX2lkID0gaW50ZXJhY3Rpb24uaWQKaW50ZXJhY3Rpb25fdHlwZSA9IGludGVyYWN0aW9uLnR5cGUKY3JlYXRlZF9hdCA9IGludGVyYWN0aW9uLmN<output omitted>nRlcmFjdGlvbi5jcmVhdGVkX2F0IChkYXRhX3R5cGU6IGRhdGV0aW1lLCB0cmFuc2Zvcm06IGRhdGV0aW1lLCB0cmFuc2Zvcm06IGRhdGV0aW1lKQptZW50aW9uID0gOml0ZXIuX3ZhbHVlCgo="
    },
    "output_type": "mysql",
    "remaining_bytes": null,
    "start": 1397565784,
    "status": "finishing",
    "user_id": 28619
}

Documentation: http://dev.datasift.com/docs/api/1/pushstop


Deleting a Subscription

Use the push/delete endpoint with an ID to delete the specified subscription. No JSON object is returned, just an HTTP 204 status code.

$ curl -sX POST -w 'Status: %{http_code}\n' \
  https://api.datasift.com/v1/push/delete \
  -d id=4bb94a4b68a1ec3e9d63af61802201cb \
  -H 'Authorization: paulsmart:84768663b04a62ac7d4ac4350cb'
Status: 204

Documentation: http://dev.datasift.com/docs/api/1/pushdelete


Configuring DataSift Destination (Web Application)

The details of the MySQL database are entered in the DataSift platform once and then used many times.

Adding MySQL Destination

Log in to the DataSift web application and open the Data Destinations page. From the Browse Destinations page, click the + on the MySQL destination.

Figure 241: Add new MySQL Destination

Complete the form using the following information.

Field           Description
Label           This name is used to refer to this destination in the web application
Host            A URL for the database instance. In Amazon RDS, this is called the endpoint.
Port            Port number, 3306 is usual
Database Name   The database name (MyDatabase in previous examples)
Mappings        Select an interaction field to database field mapping file
                (usually ending .ini)
Username        The MySQL user with permissions to create tables
Password        Corresponding password


After entering the fields, click Test Connection to test the DataSift platform can connect and authenticate with the database instance.

Figure 242: MySQL Destination setup form

When the connection test is OK, click Create & Activate.

The new destination is saved and available for all future recordings and historic tasks.

Figure 243: Listing of New MySQL Destination


16 Monitor Usage & Billing

It is important to be aware of how billing is designed and to monitor usage of the platform to optimize usage and prevent unexpected bills. This module describes the billing model, shows how to monitor usage in detail, and shows how to set usage limits.

Subscription Model

The subscription billing model is a monthly package that includes a pre-set number of DPUs which are credited at the beginning of each month and consumed during the course of that month.

Delivered interactions also have a data license cost. This is separate from DPU costs and is not included in the DPU allowance.

Overage Limit

In the event that more DPU Hours are consumed than are included in the monthly package, the user is charged 'overage' at the 'on-demand' price for DPU Hours.

The user may define a limit to control potential overspend. The limit is applied to DPU and Data License costs.

The user is notified at 80% and 100% of the limit. At 100%, processing of filters and delivery of interactions is stopped.


Modifying Billing Account Details
Billing account details must be kept up to date. To modify them, click Settings in the header of any web application page.

Figure 244: Settings Link

Click the Billing Account Details link and click Edit.

Figure 245: Edit Billing Account Details


Viewing Usage & Balance
To open the billing page, either click on the billing information in the header or click the Billing tab.

Figure 246: Open Billing Page

Subscription Type
The top of the Billing Overview page shows the subscription type and credit. In this example, it is a Professional level account, which receives a 20,000 DPU allowance every month.

Figure 247: Subscription Type

The 'Next billing cycle' date is when the allowance is reset to 20,000 DPU.

DPU Usage
The DPU allowance remaining in this billing period is displayed, along with the DPUs used in the last 24 hours and the number of days until the next billing cycle.

Data Cost
For most sources and augmentations, there is a cost for each interaction delivered. The Data Cost shows the running cost this billing cycle, the cost incurred in the last 24 hours, and the number of days remaining in the billing cycle.
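The figures shown on the Billing Overview page can also be retrieved programmatically. A sketch, assuming the balance and usage endpoints listed in the API documentation (the exact response fields vary by account):

$ curl -s https://api.datasift.com/v1/balance \
    -H 'Authorization: paulsmart:84768663b04a62ac7d4ac4350cb'

$ curl -s 'https://api.datasift.com/v1/usage?period=day' \
    -H 'Authorization: paulsmart:84768663b04a62ac7d4ac4350cb'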


Setting Overage Limit
Overage is the sum of the DPU cost over the allowance and the data cost incurred in this billing cycle. A limit can be set to prevent excessive bills.

Viewing Current Limit
The current limit is shown on the Billing Overview page. In this example it is set to $100.

Figure 248: Viewing Cost Limit

At 80% of the overage limit, a notification is sent to the user. Tasks continue to be processed.

Figure 249: Close to Limit Notification

At 100% of the limit, another notification is sent along with an email. All running tasks are stopped and new tasks cannot be started.

Figure 250: Reached Limit Notification


Setting New Limit
To set a new limit, click the Set Limit link on the Billing Overview page.

Figure 251: Set Limit Link

Alternatively, from the account settings, click Billing Account Details. Select the new limit or enter a custom limit, then click Save Limit.

Figure 252: Setting Cost Limit

Viewing Usage Statistics
Usage statistics are available for the past 7 days. From the Billing tab, click Usage Statistics. Statistics are divided into Public Sources and Managed Sources. Use the drop-down menu to select individual days from the past week.

Figure 253: Viewing Usage Statistics


Viewing Cost by DPU & Data License
The first chart shows DPU and Data License cost. In this example, the DPU cost does not appear, most likely because usage is still covered by the monthly DPU allowance.

Figure 254: Cost in DPU

Viewing DPU Usage
This chart shows DPU usage and a running total for the past week. In this example, over 100,000 DPUs have been used.

Figure 255: DPU Usage


Viewing Data Volume
This chart shows the volume of data received each day for the past 7 days. The total is broken down into Sources and Augmentations.

Figure 256: Data Volume

Viewing Connected Hours
The number of connected hours is displayed, using different colors for the types of connection. In this example, all connections were streaming. This number may exceed 24 hours per day if multiple simultaneous connections are made.

Figure 257: Connected Hours


Viewing Historic Hours
The number of hours covered by historic queries is shown in the final chart.

Figure 258: Historic Hours

Viewing Managed Source Statistics
The DPU usage for each type of managed source is reported in a dedicated chart. Click on Managed Sources to reveal this chart.

Figure 259: Managed Source Usage Statistics


Viewing Current Streams
To see an overview of currently running streams, go to the Billing page and click Currently Used Streams. This example shows three streams with their type, start time and DPU.

Figure 260: Viewing Current Streams

Click on a stream name to open the task summary.
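Running subscriptions can also be listed programmatically. A sketch, assuming the push/get endpoint described in the API documentation (called without an id parameter, it returns all of the account's subscriptions):

$ curl -s https://api.datasift.com/v1/push/get \
    -H 'Authorization: paulsmart:84768663b04a62ac7d4ac4350cb'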

Locating Pricing Information
Data License prices are available in the web application. From the Data Sources page, click on any source or augmentation to see the price per 1,000 interactions.

Figure 261: Data License Pricing

Data Processing prices are detailed in the documentation, along with a description of billing and an FAQ.

LINK:
• Understanding Billing: http://dev.datasift.com/docs/billing
• Billing FAQ: http://dev.datasift.com/docs/getting-started/billingfaq


17 Locating Help
This module reviews the resources available to you as you become more experienced with the DataSift platform.

Locating Documentation
Hundreds of pages of documentation are available on the developer web site, along with a user guide.

Documentation includes:

• Targets for all sources and augmentations
• API endpoints
• CSDL programming
• Destinations
• Billing
• Historics
• FAQs
• Best Practice

LINK: http://dev.datasift.com/docs

Figure 262: Documentation Pages


User Guide
Documentation of the platform and web application is available as a PDF user guide.

Figure 263: User Guide


Viewing Platform Status
The status.datasift.com site displays live status and latency information for all parts of the DataSift platform.

Figure 264: Status Web Site

Click on individual issues to see the status.

Figure 265: Individual Issue Status


Viewing Known Issues
The DataSift issue tracking system (dev.datasift.com/issues) is used to report and give updates on bugs found in the platform. Before reporting any new issue, be sure to check the issues list.

Existing issues are automatically linked to the documentation pages and related discussion threads.

Figure 266: Issue Tracking

Forum Discussions
The discussion forums (dev.datasift.com/discussions) are an ideal place to ask questions about any part of the platform and receive advice from more experienced users.

Figure 267: Discussion Forum


Subscribing to Updates
Platform status and known-issue updates are provided on Twitter. Follow these accounts to receive updates:

• @datasiftapi
• @datasiftdev

The developer blogs contain useful information on how to achieve particular goals. Use the following link for an RSS feed.

• http://dev.datasift.com/blog/feed

Attending Workshops
Remote workshops covering new features are delivered regularly. Look on datasift.com/learning for the latest schedule.

Figure 268: Workshop Schedule


Submitting Support Requests
The support site (support.datasift.com) is where support requests are raised. Be sure to check known issues and platform status before submitting a request.

Solutions to requests are often available in the forums or documentation.

When raising a support request, pay careful attention to selecting the correct priority level.

• Urgent
  o A major component of DataSift is unavailable, and status.datasift.com is not reporting any irregularities with the service in question
• High
  o A major component of DataSift is not behaving normally. This issue is causing considerable business impact
• Normal
  o Minor issues using DataSift. Workarounds are generally available. This issue is causing minor business impact
• Low
  o No immediate business impact

The support site also allows historical requests to be viewed.

Figure 269: View Historical Requests


Viewing Screencasts
DataSift has a YouTube channel with playlists of training screencasts and workshop recordings.

• Training Playlist
  o https://www.youtube.com/playlist?list=PLzM6Pg1YzDVoInlRyMiw6AQHisWwiDtZQ
• Workshop Recordings
  o https://www.youtube.com/playlist?list=PLzM6Pg1YzDVqTcakqPqFrMBKsje0R9cSO

Figure 270: DataSift YouTube Channel

Attending Training
Further instructor-led training is available, covering the platform fundamentals using the API as well as advanced filtering. The two courses and their modules are listed below.

DataSift Fundamentals (API)            Advanced Filtering
DataSift Overview                      Case Sensitivity and Tokenization
Configure Sources & Augmentations      Importing Filters
Write Simple CSDL Filters              Tagging
Analyze Interactions                   Cascading Tags
Configure Destinations                 Scoring
Configure Live Push Subscriptions      Using Classifier Libraries
Configure Live Pull Subscriptions      Wildcards and Regular Expressions
Configure Historic Previews            CSDL Optimization
Configure Historic Recordings
Billing and Usage Monitoring
Locating Help
