
NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY
(A Unit of Nitte Education Trust (R), Mangalore)
An Autonomous Institution
Department of Information Science and Engineering

Curriculum Handbook for M.Tech – Data Science


SEMESTER I


Semester: I    Year: 2019-2020
Department: Information Science and Engineering    Course Type: Core
Course Title: Introduction to Data Management    Course Code: 19DS11
L-T-P: 3-0-2    Credits: 04
Total Contact Hours: 39 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Pre-requisites:

Database Management Systems

Good programming skills

Course Outcomes:

Students will be able to

CO’s Course Learning Outcomes BL

CO1 Describe the need for managing/storing data and identify the value and relative importance of data management. L2

CO2 Describe fundamentals of Data Management techniques suitable for Enterprise Applications. L2

CO3 Apply Data Management Solutions for Internet Applications. L3

CO4 Describe various data analysis techniques in the Internet context. L2

Teaching Methodology:

Blackboard teaching and PPT

Programming Assignment

Assessment Methods

Open Book Test for 10 Marks.

Assignment evaluation for 10 Marks on the basis of Rubrics.

Three internals of 30 Marks each will be conducted and the average of the best two will be taken.

Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6

CO1 1 3 1

CO2 1 2 3

CO3 2 2 1 2 2

CO4 1 2 2

19DS11 1 2 1 2 2


COURSE CONTENT

Unit – I 10 Hrs

Introduction to Data Science and Class Logistics/Overview, Statistical Inference and Exploratory Data Analysis,

Principles of Data Management, SQL for Data Science: SQL Basics, SQL Joins and aggregates, Grouping and query

evaluation, SQL Sub-queries, Key Principles of RDBMS

Unit – II 10 Hrs

Data Models, Data Warehousing, OLAP, Data Storage and Indexing, Query Optimization and Cost Estimation, Datalog, E/R Diagrams and Constraints, Design Theory, BCNF

Unit – III 8 Hrs

Data Management Solutions for Enterprise Applications: Introduction to Transactions, Transaction

Implementations, Transaction Model, Database Concurrency Control Protocols, Transaction Failures and Recovery,

Database Recovery Protocols.

Unit – IV 12 Hrs

Parallel Databases: Introduction to NoSQL databases, Apache Cassandra, MongoDB, Apache Hive

(Text Book 3: Chapters 1, 2, 5)

Unit – V 12 Hrs

Data Management Solutions for Internet Applications: Google's Application Stack: Chubby Lock Service, BigTable Data Store, and Google File System; Yahoo!'s key-value store: PNUTS; Amazon's key-value store: Dynamo.

Text Books:

1. Database Systems: The Complete Book, by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom, Second edition.

2. Fundamentals of Database Systems by Elmasri and Navathe.

3. Seven NoSQL Databases in a Week: Get up and running with the fundamentals, by Xun (Brian) Wu, Sudarshan Kadambi, Devram Kandhare, Aaron Ploetz, Packt Publishers.

Reference Books/resources:

1. Database Management Systems by Raghu Ramakrishnan and Johannes Gehrke.

2. Foundations of Databases by Abiteboul, Hull and Vianu.

3. "Transactional Information Systems" by Gerhard Weikum and Gottfried Vossen, Morgan Kaufmann.

4. Programming Hive: Data Warehouse and Query Language for Hadoop by Edward Capriolo, Dean Wampler, Jason Rutherglen, O'Reilly.

5. https://ai.google/research/pubs/pub27897

6. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, "Bigtable: A Distributed Storage System for Structured Data", Google, Inc., OSDI 2006.

7. Brian F. Cooper et al., "PNUTS: Yahoo!'s hosted data serving platform", Proceedings of the VLDB Endowment, Volume 1, Issue 2, August 2008, pages 1277-1288.

8. Giuseppe DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store", SOSP '07: Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, pages 205-220, Stevenson, Washington, USA, October 14-17, 2007.


Semester: I    Year: 2019-2020
Department: Information Science and Engineering    Course Type: Core
Course Title: Statistics for Data Science    Course Code: 19DS12
L-T-P: 4-0-0    Credits: 04
Total Contact Hours: 52 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Pre-requisites:

Good understanding of engineering mathematics (especially Algebra and Arithmetic).

Inferring conclusions from two, and three dimensional graphs.

Course Outcomes:

Students will be able to:

COs Course Outcome Description Blooms Level

CO1 Describe the basic and intermediate concepts of probability, statistics, and distributions. L2

CO2 Describe the applications of discrete probability distributions. L2

CO3 Analyze inferences about population statistics based on the parameters of a sample population. L4

CO4 Analyze hypotheses to accept/reject the alternative hypothesis based on the statistical evidence available. L4

CO5 Apply regression, ANOVA, and goodness-of-fit tests to construct models and infer conclusions about a population/sample. L3

Teaching Methodology:

Black Board Teaching / Power Point Presentation.

Seminar

Assessment Methods:

Rubrics to evaluate Case Study (depends on the course)

Rubrics to evaluate Course Project (depends on the course)

Three internals of 30 Marks each will be conducted and the average of the best two will be taken.

Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6

CO1 1 2 2

CO2 2 2 1

CO3 2 2 3

CO4 3 3 3 2

CO5 2 3 3 2

19DS12 2 2 3 2


COURSE CONTENT

UNIT – I

Probability and statistics

10 hours

Why Study Statistics?, Modern Statistics, Statistics and Engineering, Two Basic Concepts: Population and Sample, A Case Study: Visually Inspecting Data to Improve Product Quality, Pareto Diagrams and Dot Diagrams, Frequency Distributions, Graphs of Frequency Distributions, Stem-and-Leaf Displays, Descriptive Measures, Quartiles and Percentiles, Calculation of x̄ and s, Problems with Aggregating Data, Sample Spaces and Events, Counting, Probability, The Axioms of Probability, Some Elementary Theorems, Conditional Probability, Bayes' Theorem.

UNIT – II

Probability Distributions

10 hours

Random Variables, The Binomial Distribution, The Hypergeometric Distribution, The Mean and the Variance of a Probability Distribution, Chebyshev's Theorem, The Poisson Distribution and Rare Events, Poisson Processes, The Geometric and Negative Binomial Distributions, The Multinomial Distribution, Simulation.

UNIT – III

Probability Densities and Sampling Distributions

12 hours

Continuous Random Variables, The Normal Distribution, The Normal Approximation to the Binomial Distribution, Other Probability Densities, The Uniform Distribution, The Log-Normal Distribution, The Gamma Distribution, The Beta Distribution, The Weibull Distribution, Populations and Samples.

UNIT – IV

Inferences concerning mean and variance

10 hours

Statistical Approaches to Making Generalizations, Point Estimation, Interval Estimation, Maximum Likelihood Estimation, Tests of Hypotheses, Null Hypotheses and Tests of Hypotheses, Hypotheses Concerning One Mean, The Relation between Tests and Confidence Intervals, Power, Sample Size, and Operating Characteristic Curves, The Estimation of Variances, Hypotheses Concerning One Variance, Hypotheses Concerning Two Variances.

UNIT – V

Analysis of Variance/ Regression/ Goodness-of-fit tests

10 hours

Single-Factor ANOVA, Multiple Comparisons in ANOVA, More on Single-Factor ANOVA, Two-Factor ANOVA with K_ij = 1, Two-Factor ANOVA with K_ij > 1, Three-Factor ANOVA, The Simple Linear Regression Model, Estimating Model Parameters, Inferences About the Slope Parameter, Inferences Concerning μY·x* and the Prediction of Future Y Values, Correlation, Assessing Model Adequacy, Polynomial Regression, Goodness-of-Fit Tests.

Text books:

1. Miller & Freund's Probability and Statistics for Engineers, Ninth Edition, Richard A. Johnson, Pearson.

2. Devore, J.L., "Probability and Statistics for Engineering and the Sciences", Cengage Learning, New Delhi, 8th Edition, 2012.

Reference books:

1. Walpole, R.E., Myers, R.H., Myers, S.L. and Ye, K., "Probability and Statistics for Engineers and Scientists", Pearson Education, Asia, 8th Edition, 2007.

2. Ross, S.M., "Introduction to Probability and Statistics for Engineers and Scientists", 3rd Edition, Elsevier, 2004.

3. Spiegel, M.R., Schiller, J. and Srinivasan, R.A., "Schaum's Outline of Theory and Problems of Probability and Statistics", Tata McGraw Hill Edition, 2004.

4. Griffiths, Dawn, Head First Statistics, O'Reilly Media, Inc., 2008.


Semester: I    Year: 2019-2020
Department: Information Science and Engineering    Course Type: Core
Course Title: Machine Learning-I    Course Code: 19DS13
L-T-P: 3-0-2    Credits: 04
Total Contact Hours: 39 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Prerequisite:

Linear Algebra, Probability & Statistics, Calculus, Data Mining

Any programming language such as C++ or Python.

Course Outcomes:

Students will be able to

CO's Course Learning Outcomes BL

CO1 Describe the basic underlying machine learning concepts. L2

CO2 Analyze a range of machine learning algorithms along with their strengths and weaknesses. L4

CO3 Apply appropriate machine learning techniques to solve problems of moderate complexity. L3

CO4 Implement ensemble methods to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. L3

Teaching Methodology:

Black board teaching / Power Point presentations

Executable Codes/ Live Demonstration

Programming Assignment

Assessment Methods: Online certification from NPTEL/Coursera

Programming Assignment (10M), evaluated on the basis of Rubrics.

Three internals of 30 Marks each will be conducted and the average of the best two will be taken.

Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6

CO1 1 1 2

CO2 1 2 1

CO3 2 1 2 3 1

CO4 1 2 2 1

19DS13 1 1 2 2 1


COURSE CONTENT

Unit – I 8 Hrs

Concept Learning: Learning problems, Designing a learning system, perspectives and issues in Machine Learning.

Concept Learning Task, Concept Learning as search, Find S, Version space and Candidate Elimination Algorithm.

(TextBook-1)

Decision Tree Learning: Introduction, Decision tree representation, Appropriate problems for Decision Tree Learning,

The Basic Decision Tree Learning Algorithm, Hypothesis Space Search in Decision Tree Learning, Inductive Bias in

Decision Tree Learning, Issues in Decision Tree Learning (TextBook-1)

Unit – II 9Hrs

Feature Engineering for Machine Learning: Machine Learning Pipeline, Binarization, Quantization/Binning, Log

Transformation, Feature Scaling/Normalization, Interaction features, and feature selection

Text Data: Flattening, Filtering and chunking: Bag-of-X: Turning Natural Text into Flat Vectors, Filtering for

cleaner features, Atoms of Meaning: From words to n-Grams to Phrases. (TextBook3)

Unit – III 10 Hrs

Categorical variables: Encoding categorical variables, dealing with large categorical variables: feature hashing, Bin

counting

Dimensionality reduction: Intuition, Derivation, PCA in Action, Whitening and ZCA, Considerations and limitations

of PCA, Use cases (TextBook3)

Unit – IV 6 Hrs

Bayesian Learning: Bayes theorem – An Example; Bayes theorem and concept learning: Brute-Force Bayes Concept

Learning, MAP Hypotheses and Consistent Learners; maximum likelihood and least-squared error hypotheses; Bayes

optimal classifier; Gibbs algorithm, naive Bayes classifier; Bayesian belief networks – Conditional Independence,

Representation, Inference, Learning Bayesian Belief Networks.

Cluster Analysis: Basic concepts and algorithms: Overview, K-Means, Agglomerative Hierarchical clustering,

DBSCAN. (TextBook2)

Unit – V 06 Hrs

Ensemble Methods: Rationale for ensemble method, methods for constructing an Ensemble classifier, Bias-Variance

decomposition, Bagging, Boosting, Random forests, Empirical comparison among Ensemble methods. (TextBook2)

Text Books:

1. Tom M. Mitchell, "Machine Learning", McGraw-Hill Education (Indian Edition), 2013.

2. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson Education, 2007.

3. Alice Zheng, Amanda Casari, "Feature Engineering for Machine Learning", O'Reilly, 2018.

Additional Reference Book:

1. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, "An Introduction to Statistical Learning: with Applications in R", Springer, 2016.

2. Trevor Hastie, Robert Tibshirani and Jerome Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer, 2016.

3. Andreas Müller, "Introduction to Machine Learning with Python: A Guide for Data Scientists", Shroff/O'Reilly, First edition, 2016.

4. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson Education, 2007.

Online Materials:

1. https://nptel.ac.in/courses/106106139/

2. Andrew Ng's online course

Programming Assignments: (Sample)


1) Implement the CANDIDATE-ELIMINATION algorithm. Show how it is used to learn from training examples and hypothesize new instances in the version space.

2) Implement the FIND-S algorithm. Show how it can be used to classify new instances of target concepts. Run experiments to deduce instances and hypotheses consistently. (A minimal sketch appears after this list.)

3) Implement the ID3 algorithm for learning Boolean-valued functions for classifying the training examples by searching through the space of a decision tree.

4) Design and implement the Back-propagation algorithm by applying it to a learning task involving an application like FACE RECOGNITION.

5) Design and implement the Naïve Bayes algorithm for learning and classifying TEXT DOCUMENTS.
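For reference, a minimal Python sketch of FIND-S (assignment 2), assuming a toy "enjoy sport"-style dataset; the attribute names and training tuples are illustrative, not part of the syllabus:

# FIND-S: start from the most specific hypothesis and generalize it
# just enough to cover every positive training example.

def find_s(examples):
    """examples: list of (attribute_tuple, label) pairs, label 'yes'/'no'."""
    positives = [x for x, label in examples if label == "yes"]
    h = list(positives[0])                      # most specific consistent start
    for x in positives[1:]:
        for i, value in enumerate(x):
            if h[i] != value:                   # conflict -> generalize to '?'
                h[i] = "?"
    return tuple(h)

def classify(h, x):
    # A new instance is positive iff it matches every non-'?' constraint.
    return all(c == "?" or c == v for c, v in zip(h, x))

# Illustrative training data (Sky, AirTemp, Humidity, Wind):
data = [
    (("Sunny", "Warm", "Normal", "Strong"), "yes"),
    (("Sunny", "Warm", "High", "Strong"), "yes"),
    (("Rainy", "Cold", "High", "Strong"), "no"),
    (("Sunny", "Warm", "High", "Strong"), "yes"),
]
h = find_s(data)
print("Learned hypothesis:", h)   # ('Sunny', 'Warm', '?', 'Strong')
print(classify(h, ("Sunny", "Warm", "Normal", "Strong")))   # True

The same '?'-based hypothesis representation carries over to the version space of assignment 1.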


Semester: I    Year: 2019-2020
Department: Information Science and Engineering    Course Type: Core
Course Title: Exploratory Data Analysis    Course Code: 19DS14
L-T-P: 3-0-2    Credits: 04
Total Contact Hours: 39 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Pre-requisites:

Graduate Mathematics.

Basic understanding of Probability and Statistics.

Ability to comprehend and understand relational, and unstructured datasets.

Course Outcomes:

Students will be able to:

COs Course Outcome Description BL

CO1 Describe the philosophy of exploratory data analysis. L2

CO2 Apply visualization techniques to discrete and continuous probability distributions. L3

CO3 Describe how to visualize and estimate the correlation between variables. L2

CO4 Apply linear and nonlinear models visually. L3

CO5 Describe the visualization and analysis of time series and survival calculations. L2

Teaching Methodology:

Black Board Teaching

Power Point Presentation.

Seminar

Assessment Methods:

Rubrics to evaluate Seminar

Three internals of 30 Marks each will be conducted and the average of the best two will be taken.

Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6

CO1 1 1 2

CO2 2 2 3 1

CO3 1 1 2

CO4 2 2 3 1

CO5 2 2 2 1 2

19DS14 2 2 2 2 1


COURSE CONTENT

UNIT – I

Introduction to Exploratory data analysis, and distributions

8 hrs

Creating a Data Frame, Getting Information About a Data Structure, Adding a Column to a Data Frame, Deleting a Column from a Data Frame, Renaming Columns in a Data Frame, Reordering Columns in a Data Frame, Getting a Subset of a Data Frame, Changing the Order of Factor Levels, Changing the Order of Factor Levels Based on Data Values, Changing the Names of Factor Levels, Removing Unused Levels from a Factor, Changing the Names of Items in a Character Vector, Recoding a Categorical Variable to Another Categorical Variable, Recoding a Continuous Variable to a Categorical Variable, Transforming Variables, Transforming Variables by Group, Summarizing Data by Groups, Summarizing Data with Standard Errors and Confidence Intervals, Converting Data from Wide to Long, Converting Data from Long to Wide, Converting a Time Series Object to Times and Values.

UNIT – II

Probability mass function, Cumulative distributions, and modeling distributions

8 hrs

Making a Basic Histogram, Making Multiple Histograms from Grouped Data, Making a Density Curve, Making Multiple Density Curves from Grouped Data, Making a Frequency Polygon, Making a Basic Box Plot, Adding Notches to a Box Plot, Adding Means to a Box Plot, Making a Violin Plot, Making a Dot Plot, Making Multiple Dot Plots for Grouped Data, Making a Density Plot of Two-Dimensional Data.

UNIT – III

Miscellaneous Graphs

8 hrs

Making a Correlation Matrix, Plotting a Function, Shading a Subregion Under a Function Curve, Creating a

Network Graph, Using Text Labels in a Network Graph, Creating a Heat Map, Creating a Three-Dimensional

Scatter Plot, Adding a Prediction Surface to a Three-Dimensional Plot, Saving a Three-Dimensional Plot,

Animating a Three-Dimensional Plot, Creating a Dendrogram, Creating a Vector Field, Creating a QQ Plot,

Creating a Graph of an Empirical Cumulative Distribution Function, Creating a Mosaic Plot, Creating a Pie

Chart, Creating a Map, Creating a Choropleth Map, Making a Map with a Clean Background

UNIT – IV

Relationship between variables, and estimation

8 hrs

Scatter Plots, Characterizing Relationships, Correlation, Covariance, Pearson’s Correlation, Nonlinear

Relationships, Spearman’s Rank Correlation, Correlation and Causation, The Estimation Game, Guess the

Variance, Sampling Distributions, Sampling Bias, Exponential Distributions, Classical Hypothesis Testing,

Hypothesis Test, Testing a Difference in Means, Other Test Statistics, Testing a Correlation, Testing

Proportions, Chi-Squared Tests, First Babies Again, Power, Replication,

UNIT – V

Time series and survival analysis

7 hrs

Survival Curves, Hazard Function, Estimating Survival Curves, Kaplan-Meier Estimation, The Marriage Curve,

Estimating the Survival Function, Confidence Intervals, Normal Distributions, Sampling Distributions,

Representing Normal Distributions, Central Limit Theorem, Testing the CLT, Applying the CLT, Correlation

Test, Chi-Squared Test

Text books:

1. Think Stats: Exploratory Data Analysis, 2nd Edition, Allen B. Downey, 2014, 226 pages, ISBN-13: 978-1-49190-733-7.

Reference books:

1. Making sense of Data: A practical Guide to Exploratory Data Analysis and Data Mining, by Glenn J. Myatt.

2. Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and

Applications, Glenn J. Myatt, and Wayne P. Johnson. Print ISBN:9780470222805 |Online

ISBN:9780470417409 |DOI:10.1002/9780470417409.


Semester: I    Year: 2019-2020
Department: Information Science and Engineering    Course Type: Elective
Course Title: Advanced Algorithms and Optimization    Course Code: 19DSE241
L-T-P: 4-0-0    Credits: 04
Total Contact Hours: 52 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Pre-requisites:

Students should have knowledge of 'C' programming.

Knowledge of data structures, discrete mathematics, probability, and basic mathematical concepts.

Students should have completed an Analysis and Design of Algorithms course.

Course Outcomes:

Students will be able to

CO's Course Learning Outcomes BL

CO1 Apply the most appropriate algorithms to solve a real-world problem through data science applications. L3

CO2 Evaluate and measure the performance of an algorithm. L4

CO3 Design algorithms for a given problem to find approximate solutions. L5

CO4 Describe optimization techniques using algorithms and perform a feasibility study for solving an optimization problem. L2

CO5 Apply optimization techniques to the given problems. L3

Teaching Methodology:

Blackboard teaching and PPT

Assignment

Assessment Methods

Open Book Test for 10 Marks.

Assignment evaluation for 10 Marks.

Three internals of 30 Marks each will be conducted and the average of the best two will be taken.

Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6

CO1 2 2 3 2 1

CO2 2 2 2

CO3 3 2 3 1

CO4 2 2 1

CO5 3 1 2 3 1 1

19DSE241 2 1 2 3 1 1


COURSE CONTENT

Unit – I 10 Hrs

Basics of Algorithm Analysis; Probabilistic Analysis & Randomized Algorithm: The hiring problem, Indicator

Random Variables, Randomized Algorithms

Dynamic Programming: Principles of Dynamic programming, Segmented Least Squares, Sequence Alignment in Linear Space.

Unit – II 12 Hrs

Network Flow: Maximum Flow Networks, Pre-flow Push Maximum Flow Algorithm.

Graph Algorithms: Basics - Searching and Traversing; Ideas Behind Map Searches: the A* Algorithm.

Spectral Algorithms: The Best-Fit Space, Mixture Models. Streaming algorithms for computing statistics on the data: Models and Basic Techniques, Hash Functions, Counting Distinct Elements, Frequency Estimation, Other Streaming Problems.

Unit – III 10 Hrs

NP and Computational Intractability: Polynomial-Time Reduction, The Satisfiability Problem, Polynomial-Time Verification, NP-Completeness and Reducibility, NP-Complete Problems.

Approximation Algorithms: Greedy Algorithms and Bounds on the Optimum, Center Selection Problem, The Pricing Method, Maximization via the Pricing Method, Linear Programming and Rounding.

Unit – IV 12 Hrs

Optimization Methods:

Need for unconstrained methods in solving constrained problems. Necessary conditions of unconstrained optimization,

Structure of methods, quadratic models. Methods of line search, Armijo-Goldstein and Wolfe conditions for partial line

search. Global convergence theorem, Steepest descent method. Quasi-Newton methods: DFP, BFGS, Broyden family.

Unit – V 8 Hrs

Conjugate-direction Methods: Fletcher-Reeves, Polak-Ribière. Derivative-free methods: finite differencing. Restricted step methods. Methods for sums of squares and nonlinear equations. Linear and Quadratic Programming. Duality in optimization.

Optimization algorithms for parameter tuning or design projects: genetic algorithms, quantum-inspired evolutionary algorithms, simulated annealing, particle swarm optimization, ant colony optimization.

Text Books:

1. Jon Kleinberg, Éva Tardos, "Algorithm Design", Pearson Addison Wesley.

2. Cormen T.H., Leiserson C.E., Rivest R.L., Stein C., Introduction to Algorithms, 3rd edition, PHI 2010, ISBN: 9780262033848.

3. Fletcher R., Practical Methods of Optimization, John Wiley, 2000.

Reference Material

1. Spectral Algorithms, by Ravindran Kannan and Santosh Vempala, 2009, https://www.cc.gatech.edu/~vempala/spectralbook.pdf

2. Streaming Algorithms, Great Ideas in Theoretical Computer Science, Saarland University, Summer 2014.

3. S. Muthukrishnan, Data Streams: Algorithms and Applications, Foundations and Trends in Theoretical Computer Science, 1(2), 2005.

4. http://theory.stanford.edu/~amitp/GameProgramming/AStarComparison.html


Semester: I    Year: 2019-2020
Department: Information Science and Engineering    Course Type: Elective
Course Title: Time Series Analysis and Forecasting    Course Code: 19DS152
L-T-P: 4-0-0    Credits: 04
Total Contact Hours: 52 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Pre-requisites:

Probability and Statistics for Data Science.

Good programming skills.

Course Outcomes:

Students will be able to

CO's Course Learning Outcomes BL

CO1 Describe the fundamental advantage and necessity of forecasting in various situations. L2

CO2 Identify how to choose an appropriate forecasting method in a particular environment. L2

CO3 Apply various forecasting methods, which include obtaining the relevant data and carrying out the necessary computation using suitable statistical software. L3

CO4 Improve forecasts with better statistical models based on statistical analysis. L4

Teaching Methodology:

Blackboard teaching and PPT

Programming Assignment

Assessment Methods

Open Book Test for 10 Marks.

Assignment evaluation for 10 Marks on the basis of Rubrics.

Three internals of 30 Marks each will be conducted and the average of the best two will be taken.

Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6

CO1 1 2 2

CO2 2 2 3 1 1

CO3 3 1 3 3 2 2

CO4 3 3 3

19DS152 2 1 2 3 1 1


COURSE CONTENT

Unit – I 10 Hrs

An Introduction to Forecasting: Forecasting and Data. Forecasting Methods. Errors in Forecasting. Choosing a

Forecasting Technique. An Overview of Quantitative Forecasting Techniques.

REGRESSION ANALYSIS: The Simple Linear Regression Model. The Least Squares Point Estimates. Point

Estimates and Point Predictions. Model Assumptions and the Standard Error. Testing the Significance of the Slope and

y Intercept. Confidence and Prediction Intervals. Simple Coefficients of Determination and Correlation. An F Test for

the Model.

Unit – II 10Hrs

Multiple Linear Regressions: The Linear Regression Model. The Least Squares Estimates, and Point Estimation and

Prediction. The Mean Square Error and the Standard Error. Model Utility: R2, Adjusted R2, and the Overall F Test.

Model Building and Residual Analysis: Model Building and the Effects of Multicollinearity. Residual Analysis in

Simple Regression. Residual Analysis in Multiple Regression. Diagnostics for Detecting Outlying and Influential

Observations

Unit – III 12 Hrs

Time Series Regression: Modelling Trend by Using Polynomial Functions. Detecting Autocorrelation. Types of

Seasonal Variation. Modelling Seasonal Variation by Using Dummy Variables and Trigonometric Functions. Growth

Curves. Handling First-Order Autocorrelation.

Decomposition Methods: Multiplicative Decomposition. Additive Decomposition. The X-12-ARIMA Seasonal Adjustment Method.

Exponential Smoothing: Simple Exponential Smoothing. Tracking Signals. Holt's Trend Corrected Exponential Smoothing. Holt-Winters Methods. Damped Trends and Other Exponential Smoothing Methods.

Unit – IV 10 Hrs

Non-seasonal Box-Jenkins Modelling and Their Tentative Identification: Stationary and Nonstationary Time

Series. The Sample Autocorrelation and Partial Autocorrelation Functions: The SAC and SPAC. An Introduction to

Non-seasonal Modelling and Forecasting. Tentative Identification of Non-seasonal Box-Jenkins Models.

Estimation, Diagnostic Checking, and Forecasting for Non-seasonal Box-Jenkins Models: Estimation. Diagnostic

Checking. Forecasting. A Case Study. Box-Jenkins Implementation of Exponential Smoothing.

Unit – V 10 Hrs

Box-Jenkins Seasonal Modelling: Transforming a Seasonal Time Series into a Stationary Time Series. Examples of

Seasonal Modelling and Forecasting. Box-Jenkins Error Term Models in Time Series Regression.

Advanced Box-Jenkins Modelling: The General Seasonal Model and Guidelines for Tentative Identification.

Intervention Models. A Procedure for Building a Transfer Function Model

Causality in time series: Granger causality. Hypothesis testing on rational expectations. Hypothesis testing on market

efficiency.

Text Books:

1. Bruce L. Bowerman, Richard O'Connell, Anne Koehler, “Forecasting, Time Series, and Regression,

4th Edition”, Cengage Unlimited Publishers

2. Enders W. Applied Econometric Time Series. John Wiley & Sons, Inc., 1995

Additional Reference Material

1. Mills, T.C. The Econometric Modelling of Financial Time Series. Cambridge University Press, 1999

2. Andrew C. Harvey. Time Series Models. Harvester wheatsheaf, 1993

3. P. J. Brockwell, R. A. Davis, Introduction to Time Series and Forecasting. Springer, 1996

4. Cryer, Jonathan D.; Chan, Kung-sik, “Time series analysis : with applications in R”, ed.: New York:

Springer, cop. 2008


Semester: I    Year: 2019-2020
Department: Information Science and Engineering    Course Type: Elective
Course Title: Computer Vision    Course Code: 19DS153
L-T-P: 4-0-0    Credits: 04
Total Contact Hours: 52 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Pre-requisites:

Basic knowledge of Data Mining

Programming knowledge in object oriented methodology

Course Outcomes:

Students will be able to:

COs Course Learning Outcomes BL

CO1 Identify image processing techniques to solve real world applications L2

CO2 Apply deep learning methods on images to solve high complexity problems L3

CO3 Develop a technique for image feature extraction L3

CO4 Design techniques for image analysis and classification L3

Teaching Methodology:

Black Board Teaching / Power Point Presentation

Programming Assignment

Assessment Methods:

Three internals, 30 Marks each will be conducted and the Average of best of two will be taken.

Rubrics for Programming Assignment for 20 marks.

Final examination of 100 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping:

COs PO1 PO2 PO3 PO4 PO5 PO6

CO1 2 2 1 1

CO2 2 3 3 1

CO3 3 3 2

CO4 2 2 3 3

19DS153 2 2 3 2 1


COURSE CONTENT

UNIT – I 10hrs

Introduction: Why is computer vision difficult?, image representation and analysis tasks. The image, its representations and properties: a few concepts, image digitization, digital image properties, color images, cameras. The image, its mathematical and physical background: linear integral transforms, images as stochastic processes, image formation physics.

UNIT – II 10 hrs

Data structures for image analysis: levels of image data representation, traditional image data structures, and hierarchical data structures. Image preprocessing: pixel brightness transformations, geometric transformations, local preprocessing, image restoration.

UNIT – III 10 hrs

Segmentation: thresholding, edge-based segmentation, region-based segmentation, matching, evaluation issues in segmentation. Image data compression: image data properties, discrete image transforms in image data compression, predictive compression methods, vector quantization, hierarchical and progressive compression methods, comparison of compression methods.

UNIT – IV 11 hrs

Shape representation and description: region identification; contour-based shape representation and description (chain codes, simple geometric border representation); region-based representation and description (simple scalar region descriptors, moments).

UNIT – V 11 hrs

Recognition: knowledge representation; statistical pattern recognition - classification principles, classifier settings, classifier learning; support vector machines; cluster analysis. Neural nets: feed-forward networks, unsupervised learning, Hopfield neural networks.

Text books:

1. Digital Image Processing and Computer Vision by Milan Sonka.

Reference books:

1. Digital Image Processing and Analysis by Chanda and Dutta Majumder.

2. Digital Image Processing by Gonzalez and Woods.


Semester: I    Year: 2019-2020
Department: Information Science and Engineering    Course Type: Core
Course Title: Data Engineering Lab    Course Code: 19DSL16
L-T-P: 0-0-4    Credits: 04
Total Contact Hours: 26 hrs    Duration of SEE: 3 hrs
SEE Marks: 50    CIE Marks: 50

Pre-requisites:

Basic Python Programming,

Machine learning

Fundamentals of Probability and Statistics

Course Outcomes:

Students will be able to

CO's Course Learning Outcomes BL

CO1 Describe the commands and set up the programming environment of Python, R and MapReduce. L2

CO2 Apply machine learning concepts to analyze real-world problems using data analysis. L3

CO3 Apply probability and statistical techniques to solve problems of moderate complexity. L3

CO4 Analyze large data sets to derive interesting inferences. L4

Teaching Methodology:

Blackboard teaching and PPT

Executables

Programming Assignment

Assessment Methods

Program Evaluation on the basis of Rubrics.

Two internals of 20 Marks each will be conducted and the average of the best two will be taken.

Final examination of 50 Marks will be conducted and will be evaluated for 50 Marks.

Course Outcome to Programme Outcome Mapping

PO1 PO2 PO3 PO4 PO5 PO6

CO1 2 2 1 1 2

CO2 3 2 2 3 1 2

CO3 3 2 2 3 1 2

CO4 3 1 3 2 3 2

19DSL16 3 2 2 3 1 2


COURSE CONTENT

Program No.    Domain    Assignment

1. Basic Python: The number of birds banded at a series of sampling sites has been counted by your field crew and entered into the following list. The first item in each sublist is an alphanumeric code for the site and the second value is the number of birds banded. Cut and paste the list into your assignment and then answer the following questions by printing them to the screen.

data = [['A1', 28], ['A2', 32], ['A3', 1], ['A4', 0],

['A5', 10], ['A6', 22], ['A7', 30], ['A8', 19],

['B1', 145], ['B2', 27], ['B3', 36], ['B4', 25],

['B5', 9], ['B6', 38], ['B7', 21], ['B8', 12],

['C1', 122], ['C2', 87], ['C3', 36], ['C4', 3],

['D1', 0], ['D2', 5], ['D3', 55], ['D4', 62],

['D5', 98], ['D6', 32]]

1. How many sites are there?
2. How many birds were counted at the 7th site?
3. How many birds were counted at the last site?
4. What is the total number of birds counted across all sites?
5. What is the average number of birds seen on a site?
6. What is the total number of birds counted on sites with codes beginning with C?
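One way to answer these in plain Python, reusing the list exactly as given (the variable data above):

counts = [n for site, n in data]
print("1. Number of sites:", len(data))
print("2. Birds at the 7th site:", data[6][1])          # index 6 = 7th item
print("3. Birds at the last site:", data[-1][1])
print("4. Total birds:", sum(counts))
print("5. Average per site:", sum(counts) / len(counts))
print("6. Total on C sites:", sum(n for site, n in data if site.startswith("C")))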

2. Basic Python: Dr. Granger is interested in studying the relationship between the length of house-elves' ears and aspects of their DNA. She has obtained DNA samples and ear measurements from a small group of house-elves to conduct a preliminary analysis. You are supposed to conduct the analysis for her. She has placed the file on the web for you to download.

Write a Python script that:

1. Imports the data into a data structure of your choice.
2. Loops over the rows in the dataset.
3. For each row in the dataset, checks to see if the ear length is large (>10 cm) or small (<=10 cm) and determines the GC-content of the DNA sequence (i.e., the percentage of bases that are either G or C).
4. Stores this information in a table where the first column has the ID for the individual, the second column contains the string 'large' or the string 'small' depending on the size of the individual's ears, and the third column contains the GC content of the DNA sequence.
5. Prints the average GC-content for both large-eared elves and small-eared elves to the screen.
6. Exports the table of individual-level GC values to a CSV (comma delimited text) file titled grangers_analysis.csv.
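A hedged sketch of such a script; the assignment does not fix the file layout, so the file name houseelves.csv and the column names id, earLength and dnaSequence below are assumptions:

import csv

rows = []
with open("houseelves.csv") as f:                  # hypothetical file name
    for rec in csv.DictReader(f):
        size = "large" if float(rec["earLength"]) > 10 else "small"
        seq = rec["dnaSequence"].upper()
        gc = 100.0 * sum(base in "GC" for base in seq) / len(seq)
        rows.append([rec["id"], size, gc])

for group in ("large", "small"):
    vals = [gc for _, s, gc in rows if s == group]
    print(f"Average GC-content ({group}-eared): {sum(vals) / len(vals):.2f}%")

with open("grangers_analysis.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)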

3. Basic Exploratory Data Analysis: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

Dataset: Individual Household Electric Power Consumption Data Set

https://github.com/mGalarnyk/datasciencecoursera/blob/master/4_Exploratory_Data_Analysis/project1/README.md

Perform the following:

1. Load the data.
2. Subset the data from the dates 2007-02-01 and 2007-02-02.
3. Create a histogram.
4. Create a time series.
5. Create a plot for sub-metering.
6. Create multiple plots.
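A possible pandas/matplotlib sketch of steps 1-6, assuming the UCI file household_power_consumption.txt described in the linked README (semicolon-separated, '?' for missing values):

import pandas as pd
import matplotlib.pyplot as plt

# 1. Load, then build a proper timestamp column
df = pd.read_csv("household_power_consumption.txt", sep=";",
                 na_values="?", low_memory=False)
df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%d/%m/%Y %H:%M:%S")

# 2. Subset to 2007-02-01 and 2007-02-02
sub = df[(df["DateTime"] >= "2007-02-01") & (df["DateTime"] < "2007-02-03")]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# 3. Histogram of global active power
axes[0, 0].hist(sub["Global_active_power"].dropna(), color="red")
axes[0, 0].set_title("Global Active Power")
# 4. Time series of the same variable
axes[0, 1].plot(sub["DateTime"], sub["Global_active_power"])
# 5. Sub-metering plot
for col in ("Sub_metering_1", "Sub_metering_2", "Sub_metering_3"):
    axes[1, 0].plot(sub["DateTime"], sub[col], label=col)
axes[1, 0].legend()
# 6. A fourth panel (voltage) completes the multiple plot
axes[1, 1].plot(sub["DateTime"], sub["Voltage"])
axes[1, 1].set_title("Voltage")
plt.tight_layout()
plt.savefig("plots.png")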


4. Exploratory Data Analysis Using R: The data for this assignment are available from the course web site as a single zip file: Data Set

The zip file contains two files:

PM2.5 Emissions Data (summarySCC_PM25.rds): This file contains a data frame with all of the PM2.5 emissions data for 1999, 2002, 2005, and 2008. For each year, the table contains the number of tons of PM2.5 emitted from a specific type of source for the entire year. Here are the first few rows.

## fips SCC Pollutant Emissions type year

## 4 09001 10100401 PM25-PRI 15.714 POINT 1999

## 8 09001 10100404 PM25-PRI 234.178 POINT 1999

## 12 09001 10100501 PM25-PRI 0.128 POINT 1999

## 16 09001 10200401 PM25-PRI 2.036 POINT 1999

## 20 09001 10200504 PM25-PRI 0.388 POINT 1999

## 24 09001 10200602 PM25-PRI 1.490 POINT 1999

fips: A five-digit number (represented as a string) indicating the U.S. county
SCC: The name of the source as indicated by a digit string (see source code classification table)
Pollutant: A string indicating the pollutant
Emissions: Amount of PM2.5 emitted, in tons
type: The type of source (point, non-point, on-road, or non-road)
year: The year of emissions recorded

Source Classification Code Table (Source_Classification_Code.rds): This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source. The sources are categorized in a few different ways, from more general to more specific, and you may choose to explore whatever categories you think are most useful. For example, source 10100101 is known as Ext Comb / Electric Gen / Anthracite Coal / Pulverized Coal.

You can read each of the two files using the readRDS() function in R. For example, reading in each file can be done with the following code:

NEI <- readRDS("summarySCC_PM25.rds")
SCC <- readRDS("Source_Classification_Code.rds")

You must address the following questions and tasks in your exploratory analysis. For each question/task you will need to make a single plot. Unless specified, you can use any plotting system in R to make your plot.

1. Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the total PM2.5 emissions from all sources for each of the years 1999, 2002, 2005, and 2008.

2. Have total emissions from PM2.5 decreased in Baltimore City, Maryland (fips == "24510") from 1999 to 2008? Use the base plotting system to make a plot answering this question.

3. Of the four types of sources indicated by the type (point, nonpoint, onroad, nonroad) variable, which of these four sources have seen decreases in emissions from 1999 to 2008 for Baltimore City? Which have seen increases in emissions from 1999 to 2008? Use the ggplot2 plotting system to make a plot to answer this question.

4. Across the United States, how have emissions from coal combustion-related sources changed from 1999 to 2008?

5. How have emissions from motor vehicle sources changed from 1999 to 2008 in Baltimore City?

6. Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California (fips == "06037"). Which city has seen greater changes over time in motor vehicle emissions?

Making and Submitting Plots


For each plot you should:

1. Construct the plot and save it to a PNG file.

2. Create a separate R code file (plot1.R, plot2.R, etc.) that constructs the

corresponding plot, i.e. code in plot1.R constructs the plot1.png plot. Your

code file should include code for reading the data so that the plot can be fully

reproduced. You should also include the code that creates the PNG file. Only

include the code for a single plot (i.e. plot1.R should only include code for

producing plot1.png)

3. Upload the PNG file on the Assignment submission page

4. Copy and paste the R code from the corresponding R file into the text box at

the appropriate point in the peer assessment.

Hint:

https://github.com/mGalarnyk/datasciencecoursera/blob/master/4_Exploratory_Data_Analysis/project2/project2.md

5. Decision Tree: Binary Decision Trees. One very interesting application area of machine learning is in making medical diagnoses.

Objective: To train and test a binary decision tree to detect breast cancer using real-world data in Python/R. Predict whether the cancer is benign or malignant.

DataSet:

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

The Dataset: We will use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The dataset consists of 569 samples of biopsied tissue. The tissue for each sample is imaged and 10 characteristics of the nuclei of cells present in each image are characterized. These characteristics are: 1. Radius 2. Texture 3. Perimeter 4. Area 5. Smoothness 6. Compactness 7. Concavity 8. Number of concave portions of contour 9. Symmetry 10. Fractal dimension.

The first 10 entries in this feature vector are the mean of the characteristics listed above

for each image. The second 10 are the standard deviation and last 10 are the largest

value of each of these characteristics present in each image. Each sample is also

associated with a label. A label of value 1 indicates the sample was for malignant (cancerous) tissue. A label of value 0 indicates the sample was for benign tissue. This

dataset has already been broken up into training, validation and test sets for you and is

available in the compressed archive for this problem on the class website. The names

of the files are “trainX.csv”, “trainY.csv”, “validationX.csv”, “validationY.csv”,

“testX.csv” and “testY.csv.” The file names ending in “X.csv” contain feature vectors

and those ending in “Y.csv” contain labels. Each file is in comma separated value

format where each row represents a sample.
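One possible scikit-learn sketch using the provided splits, assuming the CSV files have no header row:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def load(name):
    return pd.read_csv(name, header=None).values

Xtr, ytr = load("trainX.csv"), load("trainY.csv").ravel()
Xva, yva = load("validationX.csv"), load("validationY.csv").ravel()
Xte, yte = load("testX.csv"), load("testY.csv").ravel()

# Pick a tree depth on the validation set, then report test accuracy.
best_depth, best_acc = 1, 0.0
for depth in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    acc = accuracy_score(yva, clf.predict(Xva))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

clf = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(Xtr, ytr)
print("chosen depth:", best_depth)
print("test accuracy:", accuracy_score(yte, clf.predict(Xte)))   # label 1 = malignant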

6. Linear Regression with One Variable

Objective: Implement linear regression with one variable to predict profits for a food truck.

Data Set: https://searchcode.com/codesearch/view/5404318/#

Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. You would like to use this data to help you select which city to expand to next. The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.
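A minimal numpy sketch using batch gradient descent on ex1data1.txt:

import numpy as np

data = np.loadtxt("ex1data1.txt", delimiter=",")   # columns: population, profit
X = np.c_[np.ones(len(data)), data[:, 0]]          # prepend intercept column
y = data[:, 1]

theta, alpha = np.zeros(2), 0.01
for _ in range(1500):                              # batch gradient descent
    theta -= alpha / len(y) * X.T @ (X @ theta - y)

print("theta:", theta)
print("predicted profit at population value 7.0:", theta @ np.array([1.0, 7.0]))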

7. Linear Regression with Multiple Variables

Objective: Implement linear regression with multiple variables to predict the prices of houses.

Data Set: https://searchcode.com/codesearch/view/6577026/

Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.
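A sketch using the closed-form normal equation on ex1data2.txt; the query house below is illustrative:

import numpy as np

data = np.loadtxt("ex1data2.txt", delimiter=",")    # size, bedrooms, price
X = np.c_[np.ones(len(data)), data[:, :2]]          # intercept + two features
y = data[:, 2]

theta = np.linalg.pinv(X.T @ X) @ X.T @ y           # theta = (X'X)^(-1) X'y

size, bedrooms = 1650, 3                            # illustrative query point
print("predicted price:", np.array([1, size, bedrooms]) @ theta)

Feature normalization is only needed if you solve this by gradient descent instead; the closed form works on the raw features.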

8. Logistic Regression

Objective: Build a logistic regression model to predict whether a student gets admitted into a university.

Dataset: http://en.pudn.com/Download/item/id/2546378.html

Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on two exams and the admissions decision. Your task is to build a classification model that estimates an applicant's probability of admission based on the scores from those two exams.

Implement the following:

1. Visualize the data.
2. Implement the sigmoid function.
3. Implement the cost function and gradient for logistic regression.
4. Evaluate logistic regression.
5. Predict the results.
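A numpy sketch of steps 2-5; the download page does not name the data file, so ex2data1.txt and the column order exam1, exam2, admitted are assumptions:

import numpy as np

data = np.loadtxt("ex2data1.txt", delimiter=",")    # hypothetical file name
Xraw, y = data[:, :2], data[:, 2]
mu, sd = Xraw.mean(axis=0), Xraw.std(axis=0)
X = np.c_[np.ones(len(y)), (Xraw - mu) / sd]        # normalized + intercept

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta):
    h = sigmoid(X @ theta)
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return J, X.T @ (h - y) / len(y)

theta = np.zeros(3)
for _ in range(5000):                               # plain gradient descent
    _, grad = cost_and_grad(theta)
    theta -= 0.1 * grad

print("training accuracy:", ((sigmoid(X @ theta) >= 0.5) == y).mean())
x_new = np.r_[1.0, (np.array([45.0, 85.0]) - mu) / sd]
print("P(admit) for scores 45 and 85:", sigmoid(theta @ x_new))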

9. Probability Distribution: Generate and plot some data from a Poisson distribution with an arrival rate of 1.
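A short numpy/matplotlib sketch:

import numpy as np
import matplotlib.pyplot as plt

samples = np.random.poisson(lam=1, size=1000)       # arrival rate = 1
values, counts = np.unique(samples, return_counts=True)
plt.bar(values, counts / counts.sum())
plt.xlabel("count")
plt.ylabel("relative frequency")
plt.title("Poisson(1) samples")
plt.savefig("poisson.png")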

10. Uniform Probability Distribution: Calculate the area of A = {(x, y) ∈ ℝ²: 0 < x < 1, 0 < y < x²} using the Monte Carlo integration method.
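A sketch of the Monte Carlo estimate for the region as read above; the exact area is 1/3, which makes a convenient check:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0, 1, n)                 # A sits inside the unit square,
y = rng.uniform(0, 1, n)                 # so sample uniformly from it
estimate = np.mean(y < x**2)             # fraction of points falling in A
print("Monte Carlo area estimate:", estimate)   # close to 1/3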

11. Support Vector Machine

Objective: To model a classifier for predicting whether a patient is suffering from any heart disease or not.

Data Set: https://archive.ics.uci.edu/ml/datasets/heart+Disease

Hint: https://dataaspirant.com/2017/01/19/support-vector-machine-classifier-implementation-r-caret-package/
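The hint targets R's caret package; an equivalent scikit-learn sketch, assuming the processed Cleveland file from the UCI page ('?' marks missing values; the last column is 0 for no disease and 1-4 for disease):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("processed.cleveland.data", header=None, na_values="?").dropna()
X, y = df.iloc[:, :-1], (df.iloc[:, -1] > 0).astype(int)   # binarize the target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))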

12. Bayes Theorem

Data Set: specdata.zip

The zip file containing the data can be downloaded here: specdata.zip. The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor, and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables:

Date: the date of the observation in (year-month-day) format
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

1. Write a function named 'pollutantmean' that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function 'pollutantmean' takes three arguments: 'directory', 'pollutant', and 'id'. Given a vector of monitor ID numbers, 'pollutantmean' reads those monitors' particulate matter data from the directory specified in the 'directory' argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

2. Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

3. Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0.

Hint:

https://rpubs.com/ahmedtadde/DS-Rprogramming1

https://github.com/mGalarnyk/datasciencecoursera/blob/master/2_R_Programming/projects/project1.md
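The assignment is framed for R; a pandas sketch of the same three functions, assuming specdata/ holds files 001.csv through 332.csv with columns Date, sulfate, nitrate:

import os
import pandas as pd

def _read(directory, i):
    return pd.read_csv(os.path.join(directory, f"{i:03d}.csv"))

def pollutantmean(directory, pollutant, ids):
    frames = [_read(directory, i) for i in ids]
    return pd.concat(frames)[pollutant].mean()       # mean() skips NA by default

def complete(directory, ids):
    rows = [(f"{i:03d}.csv", len(_read(directory, i).dropna())) for i in ids]
    return pd.DataFrame(rows, columns=["file", "nobs"])

def corr(directory, threshold=0):
    out = []
    for name in sorted(os.listdir(directory)):
        obs = pd.read_csv(os.path.join(directory, name)).dropna()
        if len(obs) > threshold:
            out.append(obs["sulfate"].corr(obs["nitrate"]))
    return out                                       # empty list if none qualify

# Example: pollutantmean("specdata", "sulfate", range(1, 11))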

13. MapReduce: Write map and reduce methods to count the number of occurrences of each word in a file. For the purposes of this assignment, a word will be defined as any string of alphabetic characters appearing between non-alphabetic characters; "nature's" is two words. The count should be case-insensitive. If a word occurs multiple times in a line, all should be counted. A StringTokenizer is a convenient way to parse the words from the input line. There is documentation of StringTokenizer online, and there is an example of its use in the reader functions.
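The assignment targets Hadoop's Java API (hence StringTokenizer); a self-contained Python simulation of the map, shuffle and reduce phases to show the logic:

import re
from collections import defaultdict

def map_fn(line):
    # A word is a maximal run of alphabetic characters, lower-cased,
    # so "nature's" yields the two words "nature" and "s".
    for word in re.findall(r"[A-Za-z]+", line):
        yield word.lower(), 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run(lines):
    groups = defaultdict(list)                   # shuffle/sort phase
    for line in lines:
        for word, one in map_fn(line):
            groups[word].append(one)
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(run(["Nature's way", "the way of nature"]))
# {'nature': 2, 's': 1, 'way': 2, 'the': 1, 'of': 1}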

14. Objective: Write map and reduce methods to determine the average ratings of movies.

Data Set: http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a

The input consists of a series of lines, each containing a movie number, user number, rating, and date: 3980,294028,5,2005-11-15. The map should emit the movie number and a list of ratings, and reduce should return for each movie number the average rating as a Double and the number of ratings as an Integer. This data is similar to the Netflix Prize data.

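The same pattern sketched in Python for the ratings data; map emits (movie, rating) and reduce returns the average and the count:

from collections import defaultdict

def map_fn(line):
    movie, _user, rating, _date = line.split(",")
    yield movie, float(rating)

def reduce_fn(movie, ratings):
    return movie, (sum(ratings) / len(ratings), len(ratings))

def run(lines):
    groups = defaultdict(list)
    for line in lines:
        for movie, rating in map_fn(line):
            groups[movie].append(rating)
    return dict(reduce_fn(m, r) for m, r in groups.items())

print(run(["3980,294028,5,2005-11-15", "3980,104212,3,2005-11-16"]))
# {'3980': (4.0, 2)}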

15. K-means Clustering: Given the matrix X whose rows represent different data points, you are asked to perform a k-means clustering on this dataset using the Euclidean distance as the distance function. Here k is chosen as 3. The Euclidean distance d between a vector x and a vector y, both in R^p, is defined as d(x, y) = sqrt( sum_{i=1}^{p} (x_i - y_i)^2 ). All data in X were plotted in Figure 1. The centres of the 3 clusters were initialized as µ1 = (6.2, 3.2) (red), µ2 = (6.6, 3.7) (green), µ3 = (6.5, 3.0) (blue).

1. What's the centre of the first cluster (red) after one iteration? (Answer in the format [x1, x2]; round your results to three decimal places, same as problems 2 and 3.)

2. What's the centre of the second cluster (green) after two iterations?

3. What's the centre of the third cluster (blue) when the clustering converges?

4. How many iterations are required for the clusters to converge?
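A sketch of the k-means loop with the given initial centres; the matrix X and Figure 1 are not reproduced in this handbook, so the points below are placeholders only (empty-cluster handling is omitted for brevity):

import numpy as np

X = np.array([[6.1, 3.1], [6.3, 3.3],            # placeholder points only
              [6.7, 3.6], [6.5, 3.8],
              [6.4, 2.9], [6.6, 3.1]])
centres = np.array([[6.2, 3.2], [6.6, 3.7], [6.5, 3.0]])   # red, green, blue

for it in range(100):
    # assign each point to its nearest centre (Euclidean distance)
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new, centres):                # converged
        print("converged after", it, "update steps")
        break
    centres = new

print(np.round(centres, 3))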

16. Hierarchical Clustering: In the figure, there are two clusters A (red) and B (blue), each with four members plotted in the figure. The coordinates of each member are labeled in the figure. Compute the distance between the two clusters using Euclidean distance.

1. What is the distance between the two farthest members? (complete link) (Round to four decimal places here and in the next 2 problems.)

2. What is the distance between the two closest members? (single link)

3. What is the average distance between all pairs?

4. Among all three distances above, which one is robust to noise? Answer either "complete", "single", or "average".
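A sketch of the three linkage distances; the figure's coordinates are not reproduced here, so A and B below are placeholders:

import numpy as np

A = np.array([[4.7, 3.2], [4.9, 3.1], [5.0, 3.0], [4.6, 2.9]])   # placeholder
B = np.array([[5.9, 3.2], [6.7, 3.1], [6.0, 3.0], [6.2, 2.8]])   # placeholder

pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
print("complete link (farthest pair):", round(pairwise.max(), 4))
print("single link (closest pair):", round(pairwise.min(), 4))
print("average link (mean of all pairs):", round(pairwise.mean(), 4))
# Of the three, the average link is the most robust to noise.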


17. Multivariate Analysis

Multiple Linear Regression:

1. Using the matrices X and Y found in FILE2 do the following:

1* Compute the vector of raw regression weights
2 Compute the standard error of the regression weights
3 Compute t-tests for each regression weight
4* Compute the vector of predicted scores
5* Compute the vector of residual scores
6* Compute the squared multiple correlation
7* Compute the F-ratio for the model
8 Compute the standard error of estimate
9 Compute the vector of standardized regression weights

2. Using the matrix X found in FILE1 do the following:

1. Compute the deviation SSCP matrix, S
2. Compute the covariance matrix, C
3. Compute the correlation matrix, R
4. Compute the determinants of S, C and R
5. Compute the eigenvalues of S, C and R
6. Compute the eigenvectors of S, C and R

Canonical Correlation Analysis

Using the matrix XY (the first 3 columns of XY are the Y variables while the last 5 columns are the X variables) found in FILE5 do the following:

1. Compute the squared canonical correlation.
2. Compute the canonical correlation.
3. Compute the eigenvalues of A and B.
4. Compute the eigenvectors of A and B.
5. Compute the F statistic approximations.
6. Compute the degrees of freedom.
7. Determine which canonical dimensions are significant.

Note: Be sure to label your output and include comments.

File 1:

Mata
X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)

Stata
mat X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)

SAS IML
X = { 3 9 17 24,


7 8 11 25,

6 5 13 29,

4 7 15 32,

7 9 13 24,

8 8 1 23};

File 2:

Mata
X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)

Stata
mat X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)

SAS IML

X = {

3 9 17 24,

7 8 11 25,

6 5 13 29,

4 7 15 32,

7 9 13 24,

8 8 1 23};
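A numpy sketch of part 2 of the assignment, using the FILE1 matrix given above: deviation SSCP, covariance and correlation matrices, plus their determinants and eigensystems:

import numpy as np

X = np.array([[3, 9, 17, 24],
              [7, 8, 11, 25],
              [6, 5, 13, 29],
              [4, 7, 15, 32],
              [7, 9, 13, 24],
              [8, 8, 1, 23]], dtype=float)

n = X.shape[0]
D = X - X.mean(axis=0)            # deviation scores
S = D.T @ D                       # 1. deviation SSCP matrix
C = S / (n - 1)                   # 2. covariance matrix
sd = np.sqrt(np.diag(C))
R = C / np.outer(sd, sd)          # 3. correlation matrix

for name, M in (("S", S), ("C", C), ("R", R)):
    vals, vecs = np.linalg.eig(M)                       # 5./6. eigensystem
    print(f"det({name}) = {np.linalg.det(M):.4f}")      # 4. determinant
    print(f"eigenvalues of {name}:", np.round(vals, 4))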