BDI-DST: 10-day Training Programme on Big Data Analytics (18th-27th May 2016)
REPORT
As set out in the training programme, its general objective was to give participants a basic
understanding of Big Data technologies, beginning with a description of Big Data and the need for
Hadoop, the Hadoop architecture, and how to work with the Hadoop Distributed File System (HDFS),
before moving on to more advanced techniques. Distinguished experts were invited to speak on current
problems and technical aspects in their areas of research, while a panel of experts from industry was
invited to share its perspectives with the participants.
Big Data is still in its infancy in India; most organisations deal mainly in data warehousing, but are
gradually moving towards analytics solutions. The learning percolates from global companies and,
importantly, from one's own work experience in a specific discipline or domain. It helps us take a step
further by making decisions based on real-time data rather than only on past experience. As an
initiative, we signed an MoU with C-DAC. The focus of Big Data research in our IT department extends
to training faculty and students in this domain. 23 students of our department have completed
their academic projects in Big Data under the guidance of C-DAC.
We received an excellent response: around 140 online registrations, starting immediately after the
event was posted. A steady stream of candidates, including doctorates, research scholars, faculty
members, industry professionals and students, applied to attend this training programme. From these,
we selected 32 external participants and 30 internal participants.
Programme Summary
Day 1 (18-05-2016)
Dr. Dharanipragada Janakiram, Ph.D
Professor
Department of Computer Science and Engineering
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
The inaugural function was held at the seminar hall of PSNA College of Engineering and Technology
on the morning of 18th May 2016, in the august presence of the chief guest, Dr. Dharanipragada
Janakiram, Professor, Department of Computer Science and Engineering, IIT Madras. Thiru. R.S.K.
Raguraam, Pro-Chairman, and Dr. V. Soundararajan, Principal, PSNACET, presided over the function.
Dr. A. Vincent Antony Kumar, Professor & Head, Department of IT, and Convener of this programme,
introduced the structure of the training programme. Then, Dr. Janakiram delivered the keynote
address on Big Data analytics, drawing on his own experience.
The keynote covered topics such as business analytics which, occupying the intersection of
management science, computer science and statistical science, is a potent force for innovation in both
the private and public sectors. Dr. Janakiram discussed business-case considerations for analytics
projects involving Big Data, proposed key questions that businesses should ask, and highlighted the
challenges that remain in analytic methods. He also shared his vision of building an Internet Data
Sharing Platform.
We realized that the next frontier of Big Data depends on effectively managing, using, and exploiting
these heterogeneous data.
Then he presented some examples of predictive analytics over Big Data, with case studies in
e-commerce marketing, online publishing and recommendation systems, and advertising targeting.
Finally, he gave very useful references and websites for our future research on big data analytics.
Day 1 (18-05-2016 AN) and Day 2 (19-05-2016 FN)
M.M.Shankar,
CARES,
Bangalore.
The following topics were covered by the speaker on R along with a hands-on session.
R basics: R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R is a free, open-source package based on the S language developed at
Bell Labs.
The benefits of R are
• R is free. R is open-source and runs on UNIX, Windows and Macintosh.
• R has an excellent built-in help system.
• R has excellent graphing capabilities.
• Students can easily migrate to the commercially supported S-Plus program if commercial software is
desired.
• R’s language has a powerful, easy to learn syntax with many built-in statistical functions.
• The language is easy to extend with user-written functions.
• R is a full programming language. Programmers will find it familiar, and for new computer users the
leap to programming is not so large.
• Data is stored in R as vectors. This simply means that R keeps track of the order in which data is
entered: there is a first element, a second element, and so on up to a last element.
• R is most easily used in an interactive manner.
R-LISTS: Lists are R objects that can contain elements of different types: numbers, strings, vectors,
and even another list. A list can also contain a matrix or a function as one of its elements. A list is
created using the list() function.
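As a quick illustration (the variable and field names below are ours, not from the lecture), a list mixing types can be built and indexed like this:

```r
# A list can hold values of different types, including another list
emp <- list(name = "Asha",
            scores = c(78, 91, 85),       # a numeric vector
            details = list(dept = "IT"))  # a nested list

emp$name          # "Asha"
emp$scores[2]     # 91
emp$details$dept  # "IT"
length(emp)       # 3 top-level elements
```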
R-MATRICES: Matrices are R objects in which the elements are arranged in a two-dimensional
rectangular layout. They contain elements of the same atomic type. Though we can create a matrix
containing only characters or only logical values, such matrices are not of much use; we generally use
matrices of numeric elements for mathematical calculations.
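For example, a small numeric matrix can be created with matrix() and used in calculations:

```r
# Build a 2 x 3 numeric matrix, filling by row
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
dim(m)      # 2 3
m[2, 3]     # element in row 2, column 3: 6
m %*% t(m)  # matrix multiplication with the transpose: a 2 x 2 result
```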
R-VECTOR: Vectors are the most basic R data objects and there are six types of atomic vectors.
They are logical, integer, double, complex, character and raw.
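The six atomic types can be demonstrated directly with typeof():

```r
v_log  <- c(TRUE, FALSE)      # logical
v_int  <- c(1L, 2L, 3L)       # integer (the L suffix forces integer)
v_dbl  <- c(1.5, 2.5)         # double
v_cplx <- c(1 + 2i)           # complex
v_chr  <- c("big", "data")    # character
v_raw  <- as.raw(c(1, 255))   # raw

sapply(list(v_log, v_int, v_dbl, v_cplx, v_chr, v_raw), typeof)
# "logical" "integer" "double" "complex" "character" "raw"
```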
R-BAR PLOT: A bar chart represents data as rectangular bars, with the length of each bar proportional
to the value of the variable. R uses the function barplot() to create bar charts. R can draw both
vertical and horizontal bars, and each bar can be given a different color.
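A minimal sketch (the sales figures are made up for illustration):

```r
# Monthly sales (hypothetical figures) as a bar chart
sales <- c(Jan = 12, Feb = 19, Mar = 15, Apr = 22)
mids <- barplot(sales,
                col = "steelblue",
                horiz = FALSE,   # set TRUE for horizontal bars
                main = "Monthly sales",
                ylab = "Units")
mids  # barplot() returns the midpoints of the bars, one per bar
```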
STATISTICS: Statistical analysis in R is performed using many built-in functions, most of which are
part of the R base package. These functions take an R vector as input, along with further arguments,
and give the result.
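A few of the most commonly used ones, on a small made-up vector:

```r
x <- c(4, 8, 15, 16, 23, 42)
mean(x)     # arithmetic mean: 18
median(x)   # 15.5
var(x)      # sample variance
sd(x)       # standard deviation
summary(x)  # five-number summary plus the mean
```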
REGRESSION ANALYSIS: Regression analysis is a very widely used statistical tool for establishing a
relationship model between two variables. One of these variables is called the predictor variable,
whose value is gathered through experiments. The other is called the response variable, whose value
is derived from the predictor variable.
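In R this is done with lm(); the data below are invented for illustration:

```r
# Hypothetical data: predictor = hours studied, response = exam score
hours <- c(1, 2, 3, 4, 5)
score <- c(52, 55, 61, 64, 68)

fit <- lm(score ~ hours)  # fit the linear model score = a + b * hours
coef(fit)                 # intercept 47.7 and slope 4.1
predict(fit, data.frame(hours = 6))  # predicted score for 6 hours: 72.3
```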
R Examples
1. Keeping track of a stock; adding to the data. Suppose the daily closing price of a stock for two
weeks is 45, 43, 46, 48, 51, 46, 50, 47, 46, 45. This could be tracked with R using a vector.
2. Working with mathematics R makes it easy to translate mathematics in a natural way once your
data is read in. For example, suppose the yearly number of whales beached in Texas during the
period 1990 to 1999 is 74 122 235 111 292 111 211 133 156 79. What is the mean, the variance,
the standard deviation?
3. Your cell phone bill varies from month to month. Suppose your year has the following monthly
amounts 46 33 39 37 46 30 48 32 49 35 30 48 Enter this data into a variable called bill. Use the
sum command to find the amount you spent this year on the cell phone. What is the smallest
amount you spent in a month? What is the largest? How many months was the amount greater
than $40? What percentage was this?
4. Suppose you track your commute times for two weeks (10 days) and you find the following times
in minutes 17 16 20 24 22 15 21 15 17 22 Enter this into R. Use the function max to find the
longest commute time, the function mean to find the average and the function min to find the
minimum. Oops, the 24 was a mistake. It should have been 18. How can you fix this? Do so, and
then find the new average. How many times was your commute 20 minutes or more? To answer
this one can try (if you called your numbers commutes) > sum( commutes >= 20) What do you
get? What percent of your commutes are less than 17 minutes? How can you answer this with R?
5. Find a dataset that is a candidate for linear regression (you need two numeric variables, one a
predictor and one a response.) Make a scatterplot with regression line using R
6. For the data set babies make a pairs plot (pairs(babies)) to investigate the relationships between
the variables. Which variables seem to have a linear relationship? For the variables for
birthweight and gestation make a scatter plot using different plotting characters (pch) depending
on the level of the factor smoke
7. Make a histogram of 100 exponential numbers with mean 10. Estimate the median. Is it more or
less than the mean?
8. The Bernoulli example is also skewed when p is not .5. Do an example with n = 100 and p = .25,
p = .05 and p = .01. Is the data approximately normal in each case? The rule of thumb is that it
will be approximately normal when np ≥ 5 and n(1 − p) ≥ 5. Does this hold?
9. The t distribution will be important later. It depends on a parameter called the degrees of freedom.
Use the rt(n,df) function to investigate the t-distribution for n=100 and df=2, 10 and 25.
10. Load the Simple data set vacation. This gives the number of paid holidays and vacation taken by
workers in the textile industry. 1. Is a test for ȳ appropriate for this data? 2. Does a t-test seem
appropriate? 3. If so, test the null hypothesis that µ = 24. (What is the alternative?)
11. In an effort to increase student retention, many colleges have tried block programs. Suppose 100
students are broken into two groups of 50 at random. One half are in a block program, the other
half not. The number of years in attendance is then measured. We wish to test if the block
program makes a difference in retention. The data is:
Do a test of hypothesis to decide if there is a difference between the two types of programs in
terms of retention
12. The cost of a home depends on the number of bedrooms in the house. Suppose the following data
is recorded for homes in a given town
Make a scatterplot, and fit the data with a regression line. On the same graph, test the hypothesis
that an extra bedroom costs $60,000 against the alternative that it costs more
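As an illustration of the style of answer expected, exercise 4 above (the commute times) can be worked through as follows:

```r
commutes <- c(17, 16, 20, 24, 22, 15, 21, 15, 17, 22)
max(commutes)   # longest commute: 24
mean(commutes)  # average: 18.9

commutes[commutes == 24] <- 18  # fix the data-entry mistake
mean(commutes)                  # new average: 18.3

sum(commutes >= 20)  # commutes of 20 minutes or more: 4
mean(commutes < 17)  # proportion under 17 minutes: 0.3
```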
Day 2 (19-05-2016 AN) and Day 3 (20-05-2016 FN)
Dr. R.B.V. Subramanyam,
Associate Professor,
Department of Computer Science & Engineering,
National Institute of Technology, Warangal.
The following topics were covered by the speaker.
SENSITIVITY AND SPECIFICITY
Sensitivity, also called the true positive rate, measures the proportion of positives that are correctly
identified as such (e.g., the percentage of sick people who are correctly identified as having the
condition). Specificity, also called the true negative rate, measures the proportion of negatives that
are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as
not having the condition).
PRECISION AND RECALL
In pattern recognition and information retrieval with binary classification, precision (also called
positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also
known as sensitivity) is the fraction of relevant instances that are retrieved.
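All four quantities can be computed directly from the cells of a 2x2 confusion matrix; the counts below are made up for illustration:

```r
TP <- 40; FN <- 10  # actual positives, split by classifier outcome
FP <- 5;  TN <- 45  # actual negatives, split by classifier outcome

sensitivity <- TP / (TP + FN)  # recall / true positive rate: 0.8
specificity <- TN / (TN + FP)  # true negative rate: 0.9
precision   <- TP / (TP + FP)  # positive predictive value: ~0.889
c(sensitivity = sensitivity, specificity = specificity, precision = precision)
```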
BAGGING
Bagging essentially takes repeated samples from a single training set in order to generate B different
bootstrapped training data sets. We then train our method on the bth bootstrapped training set and
average all the predictions.
Bootstrap model ("Bootstrap Aggregation"):
• Randomly generate L sets of cardinality N from the original set Z, with replacement; this corrects
the optimistic bias of the R-method.
• Create bootstrap samples of a training set using sampling with replacement.
• Each bootstrap sample is used to train a different component base classifier.
• Classification is done by plurality voting.
• Regression is done by averaging.
• Works for unstable classifiers such as neural networks and decision trees.
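A minimal sketch of bootstrap aggregation in base R only; since tree learners live in add-on packages, we bag simple linear models here, but the resample-train-average pattern is the same:

```r
set.seed(1)
n <- 50
x <- runif(n)
y <- 3 * x + rnorm(n, sd = 0.3)  # synthetic training data
train <- data.frame(x, y)

B <- 100
preds <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)    # draw a bootstrap sample
  fit <- lm(y ~ x, data = train[idx, ])  # train a base learner on it
  predict(fit, data.frame(x = 0.5))      # predict the same new point
})
mean(preds)  # bagged prediction: the average over the B bootstrap fits
```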
BOOSTING
A technique for combining multiple base classifiers whose combined performance is significantly better
than that of any of the base classifiers.
• Weak learners are trained sequentially.
• Each base classifier is trained on data that is weighted based on the performance of the previous
classifier.
• Each classifier votes to obtain the final outcome.
Boosting follows the model of an online algorithm:
• The algorithm allocates weights to a set of strategies used to predict the outcome of a certain
event; after each prediction the weights are redistributed.
• Correct strategies receive more weight, while the weights of incorrect strategies are reduced
further.
Relation to the boosting algorithm:
• Strategies correspond to classifiers in the ensemble, and the event corresponds to assigning a
label to a sample drawn randomly from the input.
RANDOM FORESTS
A Random Forest consists of a collection or ensemble
of simple tree predictors, each capable of producing a
response when presented with a set of predictor values.
For classification problems, this response takes the
form of a class membership, which associates, or
classifies, a set of independent predictor values with
one of the categories present in the dependent variable. Alternatively, for regression problems, the tree
response is an estimate of the dependent variable given the predictors.
A Random Forest consists of an arbitrary number of simple trees, which are used to determine the
final outcome. For classification problems, the ensemble of simple trees vote for the most popular
class. In the regression problem, their responses are averaged to obtain an estimate of the dependent
variable. Using tree ensembles can lead to significant improvement in prediction accuracy (i.e., better
ability to predict new data cases).
SVM
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating
hyperplane. In other words, given labeled training data, the algorithm outputs an optimal hyperplane
which categorizes new examples.
NEURAL NETWORK
A neural network usually involves a large number of processors operating in parallel, each with
its own small sphere of knowledge and access to data in its local memory. Typically, a neural network is
initially "trained" or fed large amounts of data and rules about data relationships. A program can then tell
the network how to behave in response to an external stimulus or can initiate activity on its own.
Day 3 (20-05-2016 AN)
Dr. B. Ramadoss,
Professor,
Department of Computer Applications,
National Institute of Technology,
Tiruchirappalli.
The following topics were covered by the speaker.
MULTI DIMENSIONAL SCALING
MDS is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a
set of related ordination techniques used in information visualization, in particular to display the
information contained in a distance matrix. An MDS algorithm aims to place each object in N-
dimensional space such that the between-object distances are preserved as well as possible.
Multidimensional scaling (MDS) can be considered to be an alternative to factor analysis. In
general, the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher
to explain observed similarities or dissimilarities (distances) between the investigated objects. In factor
analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix. With
MDS, you can analyze any kind of similarity or dissimilarity matrix, in addition to correlation matrices.
MDS attempts to arrange "objects" (for example, major cities) in a space with a particular number of
dimensions (say, two) so as to reproduce the observed distances. As a result, we can "explain" the
distances in terms of underlying dimensions; with cities, we could explain the distances in terms of
the two geographical dimensions: north/south and east/west.
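The city example can be sketched with base R's cmdscale(); the inter-city distances below are rough approximations used only for illustration:

```r
# Approximate distances (km) between three Indian cities
cities <- c("Delhi", "Chennai", "Kolkata")
d <- as.dist(matrix(c(   0, 2180, 1760,
                      2180,    0, 1330,
                      1760, 1330,    0),
                    nrow = 3, dimnames = list(cities, cities)))

coords <- cmdscale(d, k = 2)  # classical MDS into 2 dimensions
coords                        # 2-D coordinates reproducing the distances
dist(coords)                  # recovered distances match the input
```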
PRINCIPAL COMPONENTS ANALYSIS
Principal components analysis is a procedure for identifying a smaller number of uncorrelated
variables, called "principal components", from a large set of data. The goal is to explain the maximum
amount of variance with the fewest principal components. Principal components analysis is commonly
used in the social sciences, market research, and other fields that work with large data sets.
Principal components analysis is commonly used as one step in a series of analyses. You can use it to
reduce the number of variables and avoid multicollinearity, or when you have too many predictors
relative to the number of observations.
A consumer products company wants to analyze customer responses to several characteristics of a
new shampoo: color, smell, texture, cleanliness, shine, volume, amount needed to lather, and price. They
perform a principal components analysis to determine whether they can form a smaller number of
uncorrelated variables that are easier to interpret and analyze. The results identify the following patterns:
• Color, smell, and texture form a "Shampoo quality" component.
• Cleanliness, shine, and volume form an "Effect on hair" component.
• Amount needed to lather and price form a "Value" component.
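In R, PCA is available in base as prcomp(); here it is run on the built-in USArrests data set rather than the shampoo survey:

```r
# scale. = TRUE standardizes the variables before extracting components
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)         # proportion of variance explained per component
head(pca$x, 3)       # scores: the data expressed in the new coordinates
pca$rotation[, 1:2]  # loadings of the first two principal components
```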
LASSO AND LARS
In statistics and machine learning, lasso (least absolute shrinkage and selection operator) is a regression
analysis method that performs both variable selection and regularization in order to enhance the
prediction accuracy and interpretability of the statistical model it produces. In statistics, least-angle
regression (LARS) is an algorithm for fitting linear regression models to high-dimensional data,
developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. Suppose we expect a
response variable to be determined by a linear combination of a subset of potential covariates. Then the
LARS algorithm provides a means of producing an estimate of which variables to include, as well as
their coefficients.
Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each
value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression,
but instead of including variables at each step, the estimated parameters are increased in a direction
equiangular to each one's correlations with the residual.
Day 4 (21-05-2016 FN)
Dr. T. Senthil Kumar,
Associate Professor, Department of CSE,
Amrita School of Engineering,
Amrita Vishwa Vidyapeetham, Coimbatore.
The following topics were covered by the speaker.
CENTRALITY
In graph theory and network analysis, indicators of centrality identify the most
important vertices within a graph. Applications include identifying the most influential person(s) in
a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of
disease. Centrality concepts were first developed in social network analysis, and many of the terms used
to measure centrality reflect their sociological origin.
LINK ANALYSIS
In network theory, link analysis is a data-analysis technique used to evaluate relationships
(connections) between nodes. Relationships may be identified among various types of nodes (objects),
including organizations, people and transactions. Link analysis has been used for investigation of
criminal activity (fraud detection, counterterrorism, and intelligence), computer security analysis, search
engine optimization, market research, medical research, and art.
PAGE RANK
PageRank is an algorithm used by Google Search to rank websites in its search engine results.
PageRank was named after Larry Page, one of the founders of Google. It is a way of measuring the
importance of website pages: it works by counting the number and quality of links to a page to
determine a rough estimate of how important the website is. PageRank is a link analysis algorithm
that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the
World Wide Web, with the purpose of "measuring" its relative importance within the set. The
algorithm may be applied to any collection of entities with reciprocal quotations and references. The
numerical weight that it assigns to any given element E is referred to as the PageRank of E.
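The core computation can be sketched as power iteration on a toy four-page web (the link structure is invented; every page has at least one outlink):

```r
# A[i, j] = 1 if page j links to page i
A <- matrix(c(0, 0, 1, 0,
              1, 0, 1, 0,
              1, 0, 0, 1,
              0, 1, 0, 0), nrow = 4, byrow = TRUE)
M <- sweep(A, 2, colSums(A), "/")  # column-stochastic link matrix
d <- 0.85                          # damping factor
n <- nrow(M)

r <- rep(1 / n, n)                 # start from a uniform rank vector
for (i in 1:50) r <- (1 - d) / n + d * (M %*% r)  # power iteration
round(r, 3)                        # the ranks sum to 1
```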
Day 4 (21-05-2016 AN)
Dr.Karthikeyan Vaiyapuri,
Scientist R & D,
TCS Innovation Labs, Chennai.
The following topics were covered by the speaker.
K MEANS
k-means clustering is a method of vector quantization, originally from signal processing, that is
popular for cluster analysis in data mining. k-means clustering aims to partition n observations
into k clusters in which each observation belongs to the cluster with the nearest mean, serving as
a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
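Base R ships kmeans(); a minimal run on two synthetic, well-separated clusters:

```r
set.seed(42)
# 40 points: two 2-D clusters centred at 0 and at 5
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(pts, centers = 2, nstart = 10)
km$centers         # one row per cluster mean (the prototypes)
table(km$cluster)  # cluster sizes
```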
PAM
The most common realisation of k-medoid clustering is the Partitioning Around Medoids
(PAM) algorithm. PAM uses a greedy search which may not find the optimum solution, but it is faster
than exhaustive search.
CLUSTER EVALUATION
Silhouette refers to a method of interpretation and validation of consistency within clusters of
data. The technique provides a succinct graphical representation of how well each object lies within
its cluster; it was first described by Peter J. Rousseeuw in 1987. The silhouette value is a measure of
how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The
silhouette ranges from −1 to 1, where a high value indicates that the object is well matched to its own
cluster and poorly matched to neighboring clusters.
BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data
mining algorithm used to perform hierarchical clustering over particularly large data-sets.
ANOMALY DETECTION
In data mining, anomaly detection or outlier detection is the identification of items, events or
observations which do not conform to an expected pattern or other items in a dataset. Typically the
anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical
problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and
exceptions.
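A very simple base-R outlier detector uses robust z-scores built from the median and MAD (the data are made up, with one planted anomaly):

```r
x <- c(10, 12, 11, 13, 12, 11, 95, 12, 10, 11)

# Robust z-scores: the median and MAD are barely affected by the outlier,
# unlike the mean and standard deviation
z <- abs(x - median(x)) / mad(x)
which(z > 3.5)  # 7: the value 95 is flagged as an anomaly
```

Using median/MAD rather than mean/SD avoids the masking effect, where a large outlier inflates the standard deviation enough to hide itself.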
Day 5 (22-05-2016 FN)
Dr. Sairam,
Associate Dean,
SASTRA University,
Thanjavur
The following topics were covered by the speaker.
MAP REDUCE
MapReduce is a programming model and an associated implementation for processing and
generating large data sets with a parallel, distributed algorithm on a cluster. Conceptually similar
approaches have been very well known since 1995 with the Message Passing Interface standard having
reduce and scatter operations. A MapReduce program is composed of a Map() procedure that performs
filtering and sorting (such as sorting students by first name into queues, one queue for each name) and
a Reduce() method that performs a summary operation (such as counting the number of students in each
queue, yielding name frequencies).
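The map/group/reduce idea can be sketched in plain R with a word count, conceptually mirroring what a MapReduce job does across a cluster (here everything runs in one process):

```r
lines <- c("big data needs hadoop", "hadoop stores big data")

# Map: emit one (word, 1) pair per word in every line
words <- unlist(strsplit(lines, " "))

# Shuffle: group the emitted 1s by key (the word)
groups <- split(rep(1, length(words)), words)

# Reduce: sum the counts for each key
counts <- sapply(groups, sum)
counts  # big 2, data 2, hadoop 2, needs 1, stores 1
```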
ENTITIES
Automatic matching of entities (objects) and ontologies is a key technology for semantically
integrating heterogeneous data. These match techniques are needed to identify equivalent data
objects (duplicates) or semantically equivalent metadata elements. The proposed techniques demand
very high resources, which limits their applicability to large-scale (Big Data) problems unless a
powerful cloud infrastructure can be utilized.
HDFS
An important characteristic of Hadoop is the partitioning of data and computation across many
thousands of hosts and the execution of application computations in parallel close to their data. A Hadoop
cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity
servers. Many hundred organizations worldwide report using Hadoop. HDFS stores filesystem metadata
and application data separately. As in other distributed filesystems, like PVFS, Lustre and GFS, HDFS
stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers
called DataNodes. All servers are fully connected and communicate with each other using TCP-based
protocols. HDFS does not rely on data protection mechanisms such as RAID; instead, like GFS, the file
content is replicated on multiple DataNodes for reliability.
Day 5 (22-05-2016 AN)
Dr. Sudheesh Kumar Kattumannil
Assistant Professor
The following topics were covered by the speaker.
DECISION TREE
A decision tree is a graphical representation of possible solutions to a decision based on certain
conditions. It consists of a root node, decision/internal nodes, and leaf/terminal nodes.
CLASSIFICATION AND REGRESSION TREE
Classification is the problem of identifying to which of a set of categories a new observation belongs,
on the basis of a training set of data containing observations whose category membership is known.
In a classification tree the dependent variable is categorical; in a regression tree the dependent
variable is continuous.
BAGGING IN REGRESSION TREE
To apply bagging to regression trees, we simply construct B regression trees using bootstrapped
training sets and average the resulting predictions. Averaging these B trees reduces the variance.
RANDOM FORESTS
Random forests are an ensemble learning method for classification, regression and other tasks that
operates by constructing a multitude of decision trees at training time and outputting the class that
is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Random decision forests correct for decision trees' habit of overfitting to their training set.
ADABOOST
AdaBoost generates a sequence of weak classifiers, where at each iteration the algorithm finds the
best classifier based on the current sample weights. Samples that are incorrectly classified in the kth
iteration receive more weight in the (k + 1)st iteration, while samples that are correctly classified
receive less weight in the subsequent iteration. Each iteration of the algorithm is thus required to
learn a different aspect of the data, focusing on regions that contain difficult-to-classify samples.
COMPUTING IN R
Two widely used implementations of single regression trees in R are rpart and party. The ipred
package contains two functions for bagged trees: bagging uses the formula interface and ipredbagg
has the non-formula interface.
Day 6 (23-05-2016)
Mr. K. Vijaya Kumar,
Senior Scientist, C-DAC, Chennai.
During this session he covered many topics that were very interesting to learn, and we had a
hands-on session in Hadoop. He shared his knowledge of the RHIPE architecture; the RHadoop
architecture with examples; the Hadoop Streaming R package; and writing programs for linear and
logistic regression, clustering and classification using R and Hadoop.
In his next session, he pointed out that Rmpi provides an interface (wrapper) to the MPI APIs, as well
as an interactive R slave environment. By default, R will not take advantage of all the cores available
on a computer. In order to execute code in parallel, we first have to make the desired number of cores
available to R by registering a 'parallel backend', which effectively creates a cluster to which
computations can be sent. He explained that RHIPE is a Java package that integrates the R
environment with Hadoop, the open-source implementation of Google's MapReduce; using RHIPE, it
is possible to write MapReduce algorithms in R. He also gave insights on the topics below.
R PACKAGES
Many useful R functions come in packages: free libraries of code written by R's active user
community. R will download a package from CRAN, so you'll need to be connected to the internet.
Once you have a package installed, you can make its contents available in your current R session by
calling the library() function.
MySQL
It is a popular choice of database for use in web applications, and is a central component of the widely
used LAMP open-source web application software stack.
RExcel
The main features are:
• Data transfer (matrices and data frames) between R and Excel in both directions
• Running R code directly from Excel ranges
• Writing macros calling R to perform calculations without exposing R to the user
• Calling R functions directly from cell formulas, using Excel's autoupdate mechanism to trigger
recalculation by R
RMongoDB
MongoDB is a scalable, high-performance, document-oriented NoSQL database. The rmongodb package
provides an interface from the statistical software R to MongoDB and back using the mongodb-C library.
RHive
RHive is an R package developed by NexR with a focus on providing distributed analysis capabilities
through Hadoop. RHive allows an analyst to interact with data stored in a Hadoop Distributed File
System (HDFS) cluster by utilizing familiar SQL-like constructs established through Hive.
RHBase
This package provides basic connectivity to the HBASE distributed database, using the Thrift server. R
programmers can browse, read, write, and modify tables stored in HBASE from within R. This package
has to be installed only on the node that will run the R client.
Day 7 (24-05-2016)
Mr. Sachin P Bappalige,
HPC Development Manager, IBM, Bangalore
The following topics were covered by the speaker.
CEP
Complex event processing, or CEP, is event processing that combines data from multiple sources to
infer events or patterns that suggest more complicated circumstances. The goal of complex event
processing is to identify meaningful events (such as opportunities or threats) and respond to them as
quickly as possible.
ONE-PASS COMPUTING
In computer programming, a one-pass compiler is a compiler that passes through the parts of
each compilation unit only once, immediately translating each part into its final machine code.
ONLINE ALGORITHM
In computer science, an online algorithm is one that can process its input piece-by-piece in a serial
fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available
from the start.
STREAM SAMPLING
Stream sampling is the process of collecting a representative sample of the elements of a data stream. The
sample is usually much smaller than the entire stream, but can be designed to retain many important
characteristics of the stream, and can be used to estimate many important aggregates on the stream.
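A classic way to do this is reservoir sampling, which keeps a uniform sample of k elements without knowing the stream's length in advance; a base-R sketch (the function name is ours):

```r
# Reservoir sampling: after seeing i elements, each has probability k / i
# of being in the reservoir
reservoir_sample <- function(stream, k) {
  res <- stream[seq_len(k)]           # fill the reservoir with the first k items
  for (i in (k + 1):length(stream)) {
    j <- sample.int(i, 1)             # pick a slot in 1..i
    if (j <= k) res[j] <- stream[i]   # replace with probability k / i
  }
  res
}

set.seed(7)
reservoir_sample(1:1000, 5)  # a uniform sample of 5 elements from the stream
```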
CONCEPT DRIFT
In predictive analytics and machine learning, concept drift means that the statistical properties of the
target variable, which the model is trying to predict, change over time in unforeseen ways. This
causes problems because the predictions become less accurate as time passes.
MASSIVE ONLINE ANALYSIS
MOA (Massive Online Analysis) is the most popular open-source framework for data stream mining, with a
very active and growing community. It includes a collection of machine learning algorithms and tools for evaluation.
Day 8 (25-05-2016 FN)
Dr.P.G.Babu,
Professor,
Indira Gandhi Institute of Development Research,
Mumbai.
The speaker covered the features of various databases including relational, document, column-based,
graph and spatial databases.
Relational database
A relational database supports complex and sophisticated queries and searches thanks to two factors:
tables and cross-referencing. It stores data as tables rather than plain lists, making it easier to filter
individual elements of each record, and it allows cross-referencing between different sets of data.
A basic database, by contrast, stores all the details in a single file made up of a string of records.
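Cross-referencing between tables is done with a JOIN. A small self-contained example using Python's built-in sqlite3 module (the customer and order tables are hypothetical):

```python
import sqlite3

# Two related tables, cross-referenced through a foreign key (customer_id).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 99.5), (12, 2, 40.0);
""")

# JOIN combines the two tables; GROUP BY aggregates per customer.
rows = con.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id ORDER BY c.id
""").fetchall()
print(rows)  # -> [('Asha', 2, 349.5), ('Ravi', 1, 40.0)]
```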
Column oriented database
A column-oriented DBMS (or columnar database) is a database management system that stores data
tables as columns rather than as rows. Practical use of a column store versus a row store differs little in
the relational DBMS world. Both columnar and row databases use traditional database languages
like SQL to load data and perform queries. Both row and columnar databases can become the backbone
in a system to serve data for common ETL and data visualization tools.
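The storage difference can be sketched in plain Python: a row store keeps whole records together, while a column store keeps each attribute contiguous, so a scan over one attribute touches only that column.

```python
# The same small table in row layout and column layout.
rows = [("a", 1, 9.5), ("b", 2, 3.0), ("c", 3, 7.25)]

# Column-oriented layout: one list per attribute.
columns = {
    "key":   [r[0] for r in rows],
    "count": [r[1] for r in rows],
    "score": [r[2] for r in rows],
}

# Aggregating one attribute reads a single contiguous list,
# instead of skipping through every record.
print(sum(columns["count"]))  # -> 6
```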
Another benefit of columnar storage is compression efficiency: similar data within a column compresses
more efficiently than disparate data across records. For this reason, columnar databases are well known
for minimizing storage and reducing the I/O spent reading data to answer a query, compared to
row-based databases.
Graph database
A graph database stores data as nodes and edges with associated properties, making relationships
first-class citizens; traversal queries such as finding friends of friends can be answered without the
expensive joins a relational schema would require.
Day 8 (25-05-2016 AN)
Mr.Peter Manoj,
Centre for Nano Science and Engineering,
IISc,
Bengaluru.
The following topics were covered by the speaker.
HADOOP AS ETL
Hadoop's architecture is well suited to supporting many of the most commonly deployed
data integration functions. Hadoop has not replaced ETL, nor will it in the near future,
because Hadoop complements ETL for the processing of big data.
BIG DATA INTEGRATION
Big data integration is a key operational challenge for today's enterprise IT departments. IT groups may
find their skill sets, workloads, and budgets overstretched by the need to manage terabytes or petabytes of
data in a way that delivers genuine value to business users. Talend, a leading provider of open source
data management solutions, helps organizations large and small meet the big data challenge by making
big data integration easy, fast, and affordable.
CLOUD DEPLOYMENT AND DELIVERY MODELS FOR BIG DATA
A number of cloud delivery models exist for big data. Infrastructure as a Service (IaaS) is one of the most
straightforward of the cloud computing services. IaaS is the delivery of computing services including
hardware, networking, storage, and data center space based on a rental model. Platform as a Service
(PaaS) is a mechanism for combining IaaS with an abstracted set of middleware services, software
development, and deployment tools that allow the organization to have a consistent way to create and
deploy applications on a cloud or on premises. Software as a Service (SaaS) is a business application
created and hosted by a provider in a multitenant model. Data as a Service (DaaS) is closely related to
SaaS. DaaS is a platform-independent service that would let you connect to the cloud to store and retrieve
your data.
Day 8 (25-05-2016 AN)
Prof.C.T.Vinu,
Assistant Professor,
Indian Institute of Management,
Trichy
The following topics were covered by the speaker.
Financial statement analysis
Financial statement analysis is the process of reviewing and analyzing a company's financial
statements to make better economic decisions. These statements include the income statement, balance
sheet, statement of cash flows, and a statement of changes in equity.
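Two of the most common ratios computed from these statements, with invented figures for illustration:

```python
# Liquidity and profitability ratios from a hypothetical balance sheet
# and income statement.
def current_ratio(current_assets, current_liabilities):
    """Current assets / current liabilities: ability to cover short-term debt."""
    return current_assets / current_liabilities

def net_profit_margin(net_income, revenue):
    """Net income / revenue: profit earned per unit of sales."""
    return net_income / revenue

print(current_ratio(150_000, 60_000))                # -> 2.5
print(round(net_profit_margin(12_000, 80_000), 2))   # -> 0.15
```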
Predictive analytics
Predictive analytics encompasses a variety of statistical techniques from predictive modelling, machine
learning, and data mining that analyze current and historical facts to make predictions about future or
otherwise unknown events. In business, predictive models exploit patterns found in historical and
transactional data to identify risks and opportunities. Models capture relationships among many factors to
allow assessment of risk or potential associated with a particular set of conditions, guiding decision
making for candidate transactions.
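The simplest predictive model is a least-squares line fitted to historical points and extrapolated forward; the closed-form slope/intercept formulas make this a few lines of plain Python (the data here is synthetic):

```python
# Fit y = slope * x + intercept by ordinary least squares, then predict.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs, ys = [1, 2, 3, 4], [10, 20, 30, 40]   # historical observations
slope, intercept = fit_line(xs, ys)
print(slope * 5 + intercept)  # predicted value at x = 5 -> 50.0
```

Real predictive analytics adds many more features, regularization, and validation, but the fit-then-extrapolate loop is the same.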
Electronic governance
Electronic governance or e-governance is the application of information and communication
technology (ICT) to delivering government services, exchanging information, and integrating
stand-alone systems and services between government and citizen (G2C), government and business
(G2B), and government and government (G2G), as well as supporting back-office processes and
interactions within the entire government framework.
Spatial analysis
Spatial analysis or spatial statistics includes any of the formal techniques which study entities using
their topological, geometric, or geographic properties. It encompasses a variety of techniques with
different analytic approaches, applied in fields as diverse as astronomy, with its studies of the
placement of galaxies in the cosmos, and chip fabrication engineering.
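A basic building block for geographic spatial analysis is the great-circle distance between two latitude/longitude points, computed with the haversine formula (the coordinates below are rough, for illustration):

```python
import math

# Haversine distance: great-circle distance between two lat/lon points,
# used in nearest-neighbour and clustering queries on geographic data.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate Chennai -> Mumbai distance:
print(round(haversine_km(13.08, 80.27, 19.08, 72.88)))
```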
Day 9 (26-05-2016 FN)
Dr.S.P.Syed Ibrahim,
Professor,
School of Computing Science and Engineering (SCSE),
VIT University,
Chennai.
Summary
Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory
computing. It offers over 80 high-level operators that make it easy to build parallel apps and it can be
used interactively from the Scala, Python and R shells. Spark powers a stack of libraries including SQL
and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. Spark can run using its
standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos, and it can access data in
HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
Apache Spark provides programmers with an application programming interface centered on a data
structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over
a cluster of machines that is maintained in a fault-tolerant way. Spark is used at a wide range of
organizations to process large datasets.
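RDD operations are functional transformations chained into a pipeline. The classic word-count example, sketched here as a single-process Python analogue (on a real cluster the same chain would be written against the Spark API, e.g. flatMap/map/reduceByKey, with each stage distributed across RDD partitions):

```python
from collections import Counter

# Single-process analogue of the Spark word-count pipeline.
lines = ["spark makes parallel apps", "spark runs on yarn"]

words = (w for line in lines for w in line.split())   # flatMap: line -> words
pairs = ((w, 1) for w in words)                       # map: word -> (word, 1)
counts = Counter()
for word, n in pairs:                                 # reduceByKey: sum counts
    counts[word] += n

print(counts["spark"])  # -> 2
```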
Scala is a general-purpose programming language. It has full support for functional programming and
a very strong static type system. It has many features of functional programming languages like Scheme,
Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern
matching. It also has an advanced type system supporting algebraic data types, covariance and
contravariance, higher-order types (but not higher-rank types), and anonymous types.
The following topics were covered by the speaker, along with a hands-on session on Spark and Scala.
Sentiment analysis
Sentiment analysis refers to the use of natural language processing, text analysis and computational
linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely
applied to reviews and social media for a variety of applications, ranging from marketing to customer
service.
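The simplest approach is lexicon-based scoring: sum per-word polarity scores from a sentiment dictionary. A toy sketch (the tiny lexicon here is invented; real systems use large curated lexicons or trained models):

```python
# Toy lexicon-based sentiment scorer: positive words add 1, negative subtract 1.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}

def sentiment(text):
    score = sum(LEXICON.get(w.strip(".,!?"), 0) for w in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("great phone, good battery"))  # -> positive
```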
Text mining
Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the
process of deriving high-quality information from text. High-quality information is typically derived
by devising patterns and trends through means such as statistical pattern learning. Text
mining usually involves structuring the input text, deriving patterns within the structured
data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to
some combination of relevance, novelty, and interestingness. Typical text mining tasks include text
categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment
analysis, document summarization, and entity relation modeling.
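The "structuring the input text" step usually begins with turning raw text into a bag-of-words term-frequency table, which downstream tasks such as clustering and categorization consume:

```python
from collections import Counter

# Structure raw text as term frequencies: tokenize, normalize case,
# strip punctuation, then count.
def term_frequencies(doc):
    tokens = [t.strip(".,!?").lower() for t in doc.split()]
    return Counter(t for t in tokens if t)

tf = term_frequencies("Data, data everywhere. Text mining turns data into patterns.")
print(tf["data"])  # -> 3
```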
Web mining
In customer relationship management (CRM), Web mining is the integration of information gathered by
traditional data mining methodologies and techniques with information gathered over the World
Wide Web.
Social media mining
Social media mining is the process of representing, analyzing, and extracting actionable patterns
from social media data. Social media mining introduces basic concepts and principal algorithms suitable
for investigating massive social media data; it discusses theories and methodologies from different
disciplines such as computer science, data mining, machine learning, social network analysis, network
science, sociology, ethnography, statistics, optimization, and mathematics. It encompasses the tools to
formally represent, measure, model, and mine meaningful patterns from large-scale social media data.
Day 9 (26-05-2016 AN)
Dr. A. Vadivel,
Associate Professor,
Department of Computer Applications,
National Institute of Technology,
Trichy.
Summary
Recommender systems
Recommender systems or recommendation systems are a subclass of information filtering systems that
seek to predict the 'rating' or 'preference' that a user would give to an item.
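One standard approach is user-based collaborative filtering: find users with similar rating histories and weight their ratings of the unseen item by that similarity. A minimal sketch with invented ratings:

```python
import math

# Hypothetical user -> {item: rating} data.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "D": 2},
    "carol": {"A": 1, "B": 5, "C": 2},
}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    return num / (math.sqrt(sum(u[i] ** 2 for i in common)) *
                  math.sqrt(sum(v[i] ** 2 for i in common)))

def predict(user, item):
    """Similarity-weighted average of other users' ratings of the item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = cosine(ratings[user], r)
            num, den = num + s * r[item], den + s
    return num / den if den else None

print(round(predict("alice", "D"), 1))  # only bob rated D -> 2.0
```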
Fraud detection
Fraud detection is a topic applicable to many industries including banking and financial sectors,
insurance, government agencies and law enforcement, and more. Fraud attempts have seen a drastic
increase in recent years, making fraud detection more important than ever.
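A first line of defence is statistical outlier screening: flag transactions whose amounts deviate strongly from the mean. A toy rule (the amounts and the two-standard-deviation threshold are illustrative; production systems use trained models over many features):

```python
import statistics

# Flag amounts more than `threshold` sample standard deviations from the mean.
def flag_outliers(amounts, threshold=2.0):
    mean = statistics.mean(amounts)
    sd = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mean) / sd > threshold]

txns = [20, 25, 22, 19, 24, 21, 23, 5000]
print(flag_outliers(txns))  # -> [5000]
```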
Image analysis
Image analysis is the extraction of meaningful information from images; mainly from digital images by
means of digital image processing techniques. Image analysis tasks can be as simple as reading bar
coded tags or as sophisticated as identifying a person from their face.
Unstructured textual data analysis
The proliferation of textual data in business is overwhelming. Unstructured textual data is being
constantly generated via call-center logs, emails, documents on the web, blogs, tweets, customer
comments, customer reviews, and so on. While the amount of textual data is increasing rapidly,
businesses' ability to summarize, understand, and make sense of such data for better business
decisions remains challenging. The session looked at how to organize and analyze textual data
to extract insightful customer intelligence from a large collection of documents and to use such
information to improve business operations and performance. Several business case studies using
real data demonstrated applications of text analytics and sentiment mining with SAS Text Miner
and SAS Sentiment Analysis Studio. While SAS products were used as demonstration tools, the
topics and theories covered are generic.
Day 10 (27-05-2016 FN)
Mr. K. Vijaya Kumar,
Senior Scientist, C-DAC,Chennai.
Case Studies:
Typical case studies may include Text Analysis; Sentiment analysis; Social media mining; Web mining;
Unstructured and Image data analysis; Recommendation systems; Fraud detection; Financial data
analysis; Predictive analysis in time series data; Health and Environmental analytics; Genomic data
analysis; Agriculture; Sensor and Uncertain data analysis; Internet of Things; Data fusion; Geo-
informatics and Spatial statistical analysis; E-governance applications; etc.
Day 10 (27-05-2016 AN)
The valedictory function was held at the HRDC seminar hall, PSNA College of Engineering and Technology,
on 27th May 2016 from 2.00 PM to 4.00 PM, presided over by Thiru.R.S.K.Sukumaran, Vice-Chairman,
Dr.V.Soundararajan, Principal, and Dr.A.Vincent Antony Kumar, Convener and Head, Department of
Information Technology. Among others who attended the valedictory function were the distinguished
resource person and Chief Guest Mr. Muralidharan Subbukutty, Senior Manager – Technology, CTS,
participants and invitees. The honourable Chief Guest delivered the Valedictory Address and distributed
certificates to the participants on successful completion of the training programme. There was a time of
sharing where the participants shared their experiences of the ten-day training programme.