NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY (A Unit of Nitte Education Trust (R), Mangalore)
An Autonomous Institution
Department of Information Science and Engineering
Curriculum Handbook for M.Tech – Data Science
SEMESTER I
Semester: I Year: 2019-2020
Department: Information Science and Engineering Course Type: Core
Course Title: Introduction to Data Management Course Code: 19DS11
L-T-P: 3-0-2 Credits: 04
Total Contact Hours: 39 hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Pre-requisites:
Database Management Systems
Good programming skills
Course Outcomes:
Students will be able to
COs Course Learning Outcomes BL
CO1 Describe the need for managing/storing data and identify the value and relative importance of data management. L2
CO2 Describe fundamentals of Data Management techniques suitable for Enterprise Applications. L2
CO3 Apply Data Management Solutions for Internet Applications. L3
CO4 Describe various data analysis techniques in the Internet context. L2
Teaching Methodology:
Blackboard teaching and PPT
Programming Assignment
Assessment Methods
Open Book Test for 10 Marks.
Assignment evaluation for 10 Marks on basis of Rubrics
Three internal tests of 30 Marks each will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Course Outcome to Programme Outcome Mapping
PO1 PO2 PO3 PO4 PO5 PO6
CO1 1 3 1
CO2 1 2 3
CO3 2 2 1 2 2
CO4 1 2 2
19DS11 1 2 1 2 2
COURSE CONTENT
Unit – I 10 Hrs
Introduction to Data Science and Class Logistics/Overview, Statistical Inference and Exploratory Data Analysis,
Principles of Data Management, SQL for Data Science: SQL Basics, SQL Joins and aggregates, Grouping and query
evaluation, SQL Sub-queries, Key Principles of RDBMS
Unit – II 10 Hrs
Data Models, Data Warehousing, OLAP, Data Storage and Indexing , Query Optimization and Cost Estimation,
Datalog, E/R Diagrams and Constraints, Design Theory, BCNF
Unit – III 8 Hrs
Data Management Solutions for Enterprise Applications: Introduction to Transactions, Transaction
Implementations, Transaction Model, Database Concurrency Control Protocols, Transaction Failures and Recovery,
Database Recovery Protocols.
Unit – IV 12 Hrs
Parallel Databases: Introduction to NoSQL databases, Apache Cassandra, MongoDB, Apache Hive
(Text Book 3: Chapters 1, 2, 5)
Unit – V 12 Hrs
Data Management Solution for Internet Applications: Google's Application Stack: Chubby Lock Service, BigTable
Data Store, and Google File System; Yahoo's key-value store: PNUTS; Amazon's key-value store: Dynamo;
Text Books:
1. Database Systems: The Complete Book, by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. Second edition.
2. Fundamentals of Database Systems by Elmasri and Navathe
3. Seven NoSQL Databases in a Week: Get up and running with the fundamentals, by Xun (Brian) Wu, Sudarshan Kadambi, Devram Kandhare, Aaron Ploetz, Packt Publishing
Reference Books/resources:
1. Database management systems by Raghu Ramakrishnan and Johannes Gehrke.
2. Foundations of Databases by Abiteboul, Hull and Vianu
3. "Transactional Information Systems" by Gerhard Weikum and Gottfried Vossen, Morgan Kaufmann.
4. Programming Hive: Data Warehouse and Query Language for Hadoop by Edward Capriolo, Dean Wampler, Jason Rutherglen, O'Reilly
5. https://ai.google/research/pubs/pub27897
6. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, "Bigtable: A Distributed Storage System for Structured Data", Google, Inc., OSDI 2006
7. Brian F. Cooper et al., "PNUTS: Yahoo!'s hosted data serving platform", Proceedings of the VLDB Endowment, Volume 1, Issue 2, August 2008, Pages 1277-1288
8. Giuseppe DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store", Proceedings of SOSP '07, the twenty-first ACM SIGOPS Symposium on Operating Systems Principles, Pages 205-220, Stevenson, Washington, USA, October 14-17, 2007
Semester: I Year: 2019-2020
Department: Information Science and Engineering Course Type: Core
Course Title: Statistics for Data Science Course Code: 19DS12
L-T-P: 4-0-0 Credits: 04
Total Contact Hours: 52 hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Pre-requisites:
Good understanding of engineering mathematics (especially algebra and arithmetic).
Inferring conclusions from two- and three-dimensional graphs.
Course Outcomes:
Students will be able to:
COs Course Outcome Description Bloom's Level
CO1 Describe the basic and intermediate concepts of probability, statistics, and distributions. L2
CO2 Describe the applications of discrete probability distributions. L2
CO3 Analyze inferences about population statistics based on the parameters of a sample population. L4
CO4 Analyze hypotheses to accept/reject the alternative hypothesis based on available statistical evidence. L4
CO5 Apply regression, ANOVA, and goodness-of-fit tests to construct models and infer conclusions about the population/sample. L3
Teaching Methodology:
Black Board Teaching / Power Point Presentation.
Seminar
Assessment Methods:
Rubrics to evaluate Case Study (depends on the course)
Rubrics to evaluate Course Project (depends on the course)
Three internal tests of 30 Marks each will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Course Outcome to Programme Outcome Mapping
PO1 PO2 PO3 PO4 PO5 PO6
CO1 1 2 2
CO2 2 2 1
CO3 2 2 3
CO4 3 3 3 2
CO5 2 3 3 2
19DS12 2 2 3 2
COURSE CONTENT
UNIT – I
Probability and statistics
10 hours
Why Study Statistics?, Modern Statistics, Statistics and Engineering, Two Basic Concepts—Population and
Sample, A Case Study: Visually Inspecting Data to Improve Product Quality, Pareto Diagrams and Dot
Diagrams, Frequency Distributions, Graphs of Frequency Distributions, Stem-and-Leaf Displays, Descriptive
Measures, Quartiles and Percentiles, Calculation of x̄ and s, Problems with aggregating data, Sample Spaces
and Events, Counting, Probability, The Axioms of Probability, Some Elementary Theorems, Conditional
Probability, Bayes’ Theorem.
UNIT – II
Probability Distributions
10 hours
Random Variables, The Binomial Distribution, The Hypergeometric Distribution, The Mean and the Variance of a Probability Distribution, Chebyshev's Theorem, The Poisson Distribution and Rare Events, Poisson Processes, The Geometric and Negative Binomial Distributions, The Multinomial Distribution, Simulation.
UNIT – III
Probability Densities and Sampling Distributions
12 hours
Continuous Random Variables, The Normal Distribution, The Normal Approximation to the Binomial Distribution, Other Probability Densities, The Uniform Distribution, The Log-Normal Distribution, The Gamma Distribution, The Beta Distribution, The Weibull Distribution, Populations and Samples.
UNIT – IV
Inferences concerning mean and variance
10 hours
Statistical Approaches to Making Generalizations, Point Estimation, Interval Estimation, Maximum Likelihood
Estimation, Tests of Hypotheses, Null Hypotheses and Tests of Hypotheses, Hypotheses Concerning One Mean,
The Relation between Tests and Confidence Intervals, Power, Sample Size, and Operating Characteristic Curve,
The Estimation of Variances, Hypotheses Concerning One Variance, Hypotheses Concerning Two Variances.
UNIT – V
Analysis of Variance/ Regression/ Goodness-of-fit tests
10 hours
Single-Factor ANOVA, Multiple Comparisons in ANOVA, More on Single-Factor ANOVA, Two-Factor ANOVA with Kij=1, Two-Factor ANOVA with Kij>1, Three-Factor ANOVA, The Simple Linear Regression Model, Estimating Model Parameters, Inferences About the Slope Parameter, Inferences Concerning the Mean Response and the Prediction of Future Y Values, Correlation, Assessing Model Adequacy, Polynomial Regression, Goodness-of-Fit Tests
Text books:
1. Miller & Freund's Probability and Statistics for Engineers, 9th Edition, Richard A. Johnson, Pearson.
2. Devore. J.L., “Probability and Statistics for Engineering and the Sciences”, Cengage Learning, New
Delhi, 8th Edition, 2012.
Reference books:
1. Walpole. R.E., Myers. R.H., Myers. S.L. and Ye. K., “Probability and Statistics for Engineers and
Scientists”, Pearson Education, Asia, 8th Edition, 2007.
2. Ross, S.M., “Introduction to Probability and Statistics for Engineers and Scientists”, 3rd Edition,
Elsevier, 2004.
3. Spiegel. M.R., Schiller. J. and Srinivasan. R.A., "Schaum's Outline of Theory and Problems of Probability and Statistics", Tata McGraw Hill Edition, 2004.
4. Griffiths, Dawn. Head First Statistics. O'Reilly Media, Inc., 2008.
Semester: I Year: 2019-2020
Department: Information Science and Engineering Course Type: Core
Course Title: Machine Learning-I Course Code: 19DS13
L-T-P: 3-0-2 Credits: 04
Total Contact Hours: 39 hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Prerequisite:
Linear Algebra, Probability & Statistics, Calculus, Data Mining
Any programming language, e.g., C++ or Python.
Course Outcomes:
Students will be able to
COs Course Learning Outcomes BL
CO1 Describe the basic underlying machine learning concepts. L2
CO2 Analyze a range of machine learning algorithms along with their strengths & weaknesses. L4
CO3 Apply appropriate machine learning techniques to solve problems of moderate complexity. L3
CO4 Implement Ensemble methods to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. L3
Teaching Methodology:
Black board teaching / Power Point presentations
Executable Codes/ Live Demonstration
Programming Assignment
Assessment Methods: Online certification from NPTEL/Coursera
Programming Assignment (10M), evaluated on the basis of Rubrics.
Three internal tests of 30 Marks each will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Course Outcome to Programme Outcome Mapping
PO1 PO2 PO3 PO4 PO5 PO6
CO1 1 1 2
CO2 1 2 1
CO3 2 1 2 3 1
CO4 1 2 2 1
19DS13 1 1 2 2 1
COURSE CONTENT
Unit – I 8 Hrs
Concept Learning: Learning problems, Designing a learning system, perspectives and issues in Machine Learning.
Concept Learning Task, Concept Learning as search, Find S, Version space and Candidate Elimination Algorithm.
(TextBook-1)
Decision Tree Learning: Introduction, Decision tree representation, Appropriate problems for Decision Tree Learning,
The Basic Decision Tree Learning Algorithm, Hypothesis Space Search in Decision Tree Learning, Inductive Bias in
Decision Tree Learning, Issues in Decision Tree Learning (TextBook-1)
Unit – II 9 Hrs
Feature Engineering for Machine Learning: Machine Learning Pipeline, Binarization, Quantization/Binning, Log
Transformation, Feature Scaling/Normalization, Interaction features, and feature selection
Text Data: Flattening, Filtering and chunking: Bag-of-X: Turning Natural Text into Flat Vectors, Filtering for
cleaner features, Atoms of Meaning: From words to n-Grams to Phrases. (TextBook3)
Unit – III 10 Hrs
Categorical variables: Encoding categorical variables, dealing with large categorical variables: feature hashing, Bin
counting
Dimensionality reduction: Intuition, Derivation, PCA in Action, Whitening and ZCA, Considerations and limitations
of PCA, Use cases (TextBook3)
Unit – IV 6 Hrs
Bayesian Learning: Bayes theorem – An Example; Bayes theorem and concept learning: Brute-Force Bayes Concept
Learning, MAP Hypotheses and Consistent Learners; maximum likelihood and least-squared error hypotheses; Bayes
optimal classifier; Gibbs algorithm, naive Bayes classifier; Bayesian belief networks – Conditional Independence,
Representation, Inference, Learning Bayesian Belief Networks.
Cluster Analysis: Basic concepts and algorithms: Overview, K-Means, Agglomerative Hierarchical clustering,
DBSCAN. (TextBook2)
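A minimal pure-Python sketch of the K-Means idea listed above, for 1-D data (the data points, initial centroids, and fixed iteration count are illustrative assumptions, not part of the syllabus):

```python
# Minimal 1-D K-Means sketch (data, initial centroids, and the fixed
# iteration count are illustrative assumptions).
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

data = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]
print(kmeans(data, centroids=[0.0, 5.0]))  # two centroids near 1.0 and 8.0
```

The same alternate assignment/update structure carries over to higher-dimensional data with a Euclidean distance in the assignment step.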
Unit – V 06 Hrs
Ensemble Methods: Rationale for ensemble method, methods for constructing an Ensemble classifier, Bias-Variance
decomposition, Bagging, Boosting, Random forests, Empirical comparison among Ensemble methods. (TextBook2)
Text Books:
1. Tom M. Mitchell, “Machine Learning”, McGraw-Hill Education (INDIAN EDITION), 2013.
2. Introduction to Data Mining-Pang-NingTan, Michael Steinbach,Vipin Kumar, Pearson Education, 2007.
3. Amanda Casari, Alice Zheng, “Feature Engineering for Machine Learning”, O’Reilly, 2018.
Additional Reference Book:
1. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, "An Introduction to Statistical
Learning: with Applications in R", Springer, 2016.
2. Trevor Hastie, Robert Tibshirani and Jerome Friedman, "The Elements of Statistical Learning: Data
Mining, Inference, and Prediction", Springer, 2016
3. Andreas Muller, "Introduction to Machine Learning with Python: A Guide for Data Scientists", Shroff/O'Reilly; First edition (2016)
Online Materials:
1. https://nptel.ac.in/courses/106106139/
2. Andrew Ng's online course
Programming Assignments: (Sample)
1) Implement the CANDIDATE-ELIMINATION algorithm. Show how it is used to learn from training examples and hypothesize new instances in Version Space.
2) Implement the FIND-S algorithm. Show how it can be used to classify new instances of target concepts. Run the experiments to deduce instances and hypotheses consistently.
3) Implement the ID3 algorithm for learning Boolean–valued functions for classifying the training examples
by searching through the space of a Decision Tree.
4) Design and implement the Back-propagation algorithm by applying it to a learning task involving an
application like FACE RECOGNITION.
5) Design and implement Naïve Bayes Algorithm for learning and classifying TEXT DOCUMENTS.
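As a starting point for assignment 2, a minimal FIND-S sketch (the EnjoySport-style toy data and attribute values are assumed for illustration, not prescribed by the syllabus):

```python
# Minimal FIND-S sketch: find the most specific hypothesis consistent with
# the positive training examples. Negative examples are ignored by FIND-S.
def find_s(examples):
    """examples: list of (attribute_tuple, label) pairs; label is 'yes'/'no'."""
    hypothesis = None
    for attributes, label in examples:
        if label != "yes":
            continue  # FIND-S only generalizes over positive examples
        if hypothesis is None:
            hypothesis = list(attributes)  # start with first positive example
        else:
            for i, value in enumerate(attributes):
                if hypothesis[i] != value:
                    hypothesis[i] = "?"  # generalize mismatching attributes
    return hypothesis

# Classic EnjoySport-style toy data (assumed for the example)
training = [
    (("sunny", "warm", "normal", "strong"), "yes"),
    (("sunny", "warm", "high", "strong"), "yes"),
    (("rainy", "cold", "high", "strong"), "no"),
    (("sunny", "warm", "high", "strong"), "yes"),
]
print(find_s(training))  # ['sunny', 'warm', '?', 'strong']
```

A new instance can then be classified as positive when every non-'?' attribute of the hypothesis matches it.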
Semester: I Year: 2019-2020
Department: Information Science and Engineering Course Type: Core
Course Title: Exploratory Data Analysis Course Code: 19DS14
L-T-P: 3-0-2 Credits: 04
Total Contact Hours: 39 hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Pre-requisites:
Graduate Mathematics.
Basic understanding of Probability and Statistics.
Ability to comprehend and understand relational, and unstructured datasets.
Course Outcomes:
Students will be able to:
COs Course Outcome Description BL
CO1 Describe the philosophy of exploratory data analysis. L2
CO2 Apply visualization to discrete and continuous probability distributions. L3
CO3 Describe how to visualize and estimate the correlation between variables. L2
CO4 Apply linear and nonlinear models visually. L3
CO5 Describe the visualization and analysis of time series and survival calculations. L2
Teaching Methodology:
Black Board Teaching
Power Point Presentation.
Seminar
Assessment Methods:
Rubrics to evaluate Seminar
Three internal tests of 30 Marks each will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Course Outcome to Programme Outcome Mapping
PO1 PO2 PO3 PO4 PO5 PO6
CO1 1 1 2
CO2 2 2 3 1
CO3 1 1 2
CO4 2 2 3 1
CO5 2 2 2 1 2
19DS14 2 2 2 2 1
COURSE CONTENT
UNIT – I
Introduction to Exploratory data analysis, and distributions
8 hrs
Creating a Data Frame, Getting Information About a Data Structure, adding a Column to a Data Frame, Deleting
a Column from a Data Frame, Renaming Columns in a Data Frame, Reordering Columns in a Data Frame,
Getting a Subset of a Data Frame, Changing the Order of Factor Levels, Changing the Order of Factor Levels
Based on Data Values, Changing the Names of Factor Levels, Removing Unused Levels from a Factor,
Changing the Names of Items in a Character Vector, Recoding a Categorical Variable to Another Categorical
Variable, Recoding a Continuous Variable to a Categorical Variable, Transforming Variables, Transforming
Variables by Group, Summarizing Data by Groups, Summarizing Data with Standard Errors and Confidence
Intervals, Converting Data from Wide to Long, Converting Data from Long to Wide, Converting a Time Series
Object to Times and Values.
UNIT – II
Probability mass function, Cumulative distributions, and modeling distributions
8 hrs
Making a Basic Histogram, Making Multiple Histograms from Grouped Data, Making a Density Curve ,Making
Multiple Density Curves from Grouped Data, Making a Frequency Polygon, Making a Basic Box Plot, Adding
Notches to a Box Plot, Adding Means to a Box Plot, Making a Violin Plot, Making a Dot Plot, Making Multiple
Dot Plots for Grouped Data, Making a Density Plot of Two-Dimensional Data.
UNIT – III
Miscellaneous Graphs
8 hrs
Making a Correlation Matrix, Plotting a Function, Shading a Subregion Under a Function Curve, Creating a
Network Graph, Using Text Labels in a Network Graph, Creating a Heat Map, Creating a Three-Dimensional
Scatter Plot, Adding a Prediction Surface to a Three-Dimensional Plot, Saving a Three-Dimensional Plot,
Animating a Three-Dimensional Plot, Creating a Dendrogram, Creating a Vector Field, Creating a QQ Plot,
Creating a Graph of an Empirical Cumulative Distribution Function, Creating a Mosaic Plot, Creating a Pie
Chart, Creating a Map, Creating a Choropleth Map, Making a Map with a Clean Background
UNIT – IV
Relationship between variables, and estimation
8 hrs
Scatter Plots, Characterizing Relationships, Correlation, Covariance, Pearson’s Correlation, Nonlinear
Relationships, Spearman’s Rank Correlation, Correlation and Causation, The Estimation Game, Guess the
Variance, Sampling Distributions, Sampling Bias, Exponential Distributions, Classical Hypothesis Testing,
Hypothesis Test, Testing a Difference in Means, Other Test Statistics, Testing a Correlation, Testing
Proportions, Chi-Squared Tests, First Babies Again, Power, Replication,
UNIT – V
Time series and survival analysis
7 hrs
Survival Curves, Hazard Function, Estimating Survival Curves, Kaplan-Meier Estimation, The Marriage Curve,
Estimating the Survival Function, Confidence Intervals, Normal Distributions, Sampling Distributions,
Representing Normal Distributions, Central Limit Theorem, Testing the CLT, Applying the CLT, Correlation
Test, Chi-Squared Test
Text books:
1. Think Stats, 2nd Edition: Exploratory Data Analysis, Allen B. Downey, 2014, 226 pages, ISBN-13: 978-1-49190-733-7
Reference books:
1. Making sense of Data: A practical Guide to Exploratory Data Analysis and Data Mining, by Glenn J. Myatt.
2. Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and
Applications, Glenn J. Myatt, and Wayne P. Johnson. Print ISBN:9780470222805 |Online
ISBN:9780470417409 |DOI:10.1002/9780470417409.
Semester: I Year: 2019-2020
Department: Information Science and Engineering Course Type: Elective
Course Title: Advanced Algorithms and Optimization Course Code: 19DSE241
L-T-P: 4-0-0 Credits: 04
Total Contact Hours: 52 hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Pre-requisites:
Students should have knowledge of ‘C’ programming.
Knowledge of data structures, discrete mathematics, probability, and basic mathematical concepts.
Students should have completed an Analysis and Design of Algorithms course.
Course Outcomes:
Students will be able to
COs Course Learning Outcomes BL
CO1 Apply the most appropriate algorithms to solve a real-world problem through data science applications. L3
CO2 Evaluate and measure the performance of an algorithm. L4
CO3 Design algorithms to find approximate solutions for a given problem. L5
CO4 Describe optimization techniques using algorithms and perform a feasibility study for solving an optimization problem. L2
CO5 Apply optimization techniques to the given problems. L3
Teaching Methodology:
Blackboard teaching and PPT
Assignment
Assessment Methods
Open Book Test for 10 Marks.
Assignment evaluation for 10 Marks.
Three internal tests of 30 Marks each will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Course Outcome to Programme Outcome Mapping
PO1 PO2 PO3 PO4 PO5 PO6
CO1 2 2 3 2 1
CO2 2 2 2
CO3 3 2 3 1
CO4 2 2 1
CO5 3 1 2 3 1 1
19DSE241 2 1 2 3 1 1
COURSE CONTENT
Unit – I 10 Hrs
Basics of Algorithm Analysis; Probabilistic Analysis & Randomized Algorithm: The hiring problem, Indicator
Random Variables, Randomized Algorithms
Dynamic Programming: Principles of Dynamic programming, Segmented Least Squares, Sequence Alignment in Linear Space.
Unit – II 12 Hrs
Network Flow: Maximum Flow Networks, Pre-flow Push Maximum Flow Algorithm.
Graph Algorithms: Basics - Searching and Traversing, Ideas Behind Map Searches: the A* Algorithm.
Spectral Algorithms: The Best Fit Space, Mixture Model. Streaming algorithms for computing statistics on the data: Models and Basic Techniques, Hash Functions, Counting Distinct Elements, Frequency Estimation, Other Streaming Problems
Unit – III 10 Hrs
NP and Computational Intractability: Polynomial Time Reduction, The Satisfiability Problem, Polynomial Time Verification, NP-Completeness & Reducibility, NP-Complete Problems
Approximation Algorithms: Greedy Algorithms and Bounds on the Optimum, Center Selection Problem, The Pricing Method, Maximization via the Pricing Method, Linear Programming & Rounding
Unit – IV 12 Hrs
Optimization Methods:
Need for unconstrained methods in solving constrained problems. Necessary conditions of unconstrained optimization,
Structure of methods, quadratic models. Methods of line search, Armijo-Goldstein and Wolfe conditions for partial line
search. Global convergence theorem, Steepest descent method. Quasi-Newton methods: DFP, BFGS, Broyden family.
Unit – V 8 Hrs
Conjugate-direction Methods: Fletcher-Reeves, Polak-Ribierre. Derivative-free methods: finite differencing.
Restricted step methods. Methods for sums of squares and nonlinear equations. Linear and Quadratic Programming.
Duality in optimization.
Optimization algorithms for parameter tuning or design projects: Genetic algorithms, quantum-inspired evolutionary algorithms, simulated annealing, particle-swarm optimization, ant colony optimization
Text Books:
1. Jon Kleinberg, Éva Tardos, "Algorithm Design", Pearson Addison Wesley
2. Cormen T.H., Leiserson C.E., Rivest R.L., Stein C., Introduction to Algorithms, 3rd edition, PHI 2010, ISBN: 9780262033848
3. Fletcher R., Practical Methods of Optimization, John Wiley, 2000.
Reference Material
1. Spectral Algorithms, by Ravindran Kannan, Santosh Vempala, 2009,
https://www.cc.gatech.edu/~vempala/spectralbook.pdf
2. Streaming Algorithms, Great Ideas in Theoretical Computer Science, Saarland University, Summer
2014
3. S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical
Computer Science, 1(2), 2005.
4. http://theory.stanford.edu/~amitp/GameProgramming/AStarComparison.html
Semester: I Year: 2019-2020
Department: Information Science and Engineering Course Type: Elective
Course Title: Time Series Analysis and Forecasting Course Code: 19DS152
L-T-P: 4-0-0 Credits: 04
Total Contact Hours: 52 hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Pre-requisites:
Probability and Statistics for Data Science.
Good programming skills
Course Outcomes:
Students will be able to
COs Course Learning Outcomes BL
CO1 Describe the fundamental advantage and necessity of forecasting in various situations. L2
CO2 Identify how to choose an appropriate forecasting method in a particular environment. L2
CO3 Apply various forecasting methods, which include obtaining the relevant data and carrying out the necessary computations using suitable statistical software. L3
CO4 Improve forecasts with better statistical models based on statistical analysis. L4
Teaching Methodology:
Blackboard teaching and PPT
Programming Assignment
Assessment Methods
Open Book Test for 10 Marks.
Assignment evaluation for 10 Marks on basis of Rubrics
Three internal tests of 30 Marks each will be conducted and the average of the best two will be taken.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Course Outcome to Programme Outcome Mapping
PO1 PO2 PO3 PO4 PO5 PO6
CO1 1 2 2
CO2 2 2 3 1 1
CO3 3 1 3 3 2 2
CO4 3 3 3
19DS22 2 1 2 3 1 1
COURSE CONTENT
Unit – I 10 Hrs
An Introduction to Forecasting: Forecasting and Data. Forecasting Methods. Errors in Forecasting. Choosing a
Forecasting Technique. An Overview of Quantitative Forecasting Techniques.
REGRESSION ANALYSIS: The Simple Linear Regression Model. The Least Squares Point Estimates. Point
Estimates and Point Predictions. Model Assumptions and the Standard Error. Testing the Significance of the Slope and
y Intercept. Confidence and Prediction Intervals. Simple Coefficients of Determination and Correlation. An F Test for
the Model.
Unit – II 10Hrs
Multiple Linear Regression: The Linear Regression Model. The Least Squares Estimates, and Point Estimation and
Prediction. The Mean Square Error and the Standard Error. Model Utility: R2, Adjusted R2, and the Overall F Test.
Model Building and Residual Analysis: Model Building and the Effects of Multicollinearity. Residual Analysis in
Simple Regression. Residual Analysis in Multiple Regression. Diagnostics for Detecting Outlying and Influential
Observations
Unit – III 12 Hrs
Time Series Regression: Modelling Trend by Using Polynomial Functions. Detecting Autocorrelation. Types of
Seasonal Variation. Modelling Seasonal Variation by Using Dummy Variables and Trigonometric Functions. Growth
Curves. Handling First-Order Autocorrelation.
Decomposition Methods: Multiplicative Decomposition. Additive Decomposition. The X-12-ARIMA Seasonal
Adjustment Method. Exercises.
Exponential Smoothing: Simple Exponential Smoothing. Tracking Signals. Holt’s Trend Corrected Exponential
Smoothing. Holt-Winters Methods. Damped Trends and Other Exponential Smoothing Methods.
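The simple exponential smoothing recurrence covered above can be sketched as follows (the demand series and the smoothing constant alpha are assumed for illustration; the level update is the standard form level = alpha*y + (1 - alpha)*level):

```python
# Simple exponential smoothing sketch (illustrative; the series and the
# smoothing constant alpha are assumed, not taken from the textbook).
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level at each step; the level starts at series[0]."""
    level = series[0]
    smoothed = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level  # update the level estimate
        smoothed.append(level)
    return smoothed

demand = [100, 102, 101, 105, 110]
print(simple_exponential_smoothing(demand, alpha=0.2))
```

The final level serves as the flat forecast for all future periods; Holt's and Holt-Winters methods extend the same recurrence with trend and seasonal components.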
Unit – IV 10 Hrs
Non-seasonal Box-Jenkins Modelling and Their Tentative Identification: Stationary and Nonstationary Time
Series. The Sample Autocorrelation and Partial Autocorrelation Functions: The SAC and SPAC. An Introduction to
Non-seasonal Modelling and Forecasting. Tentative Identification of Non-seasonal Box-Jenkins Models.
Estimation, Diagnostic Checking, and Forecasting for Non-seasonal Box-Jenkins Models: Estimation. Diagnostic
Checking. Forecasting. A Case Study. Box-Jenkins Implementation of Exponential Smoothing.
Unit – V 10 Hrs
Box-Jenkins Seasonal Modelling: Transforming a Seasonal Time Series into a Stationary Time Series. Examples of
Seasonal Modelling and Forecasting. Box-Jenkins Error Term Models in Time Series Regression.
Advanced Box-Jenkins Modelling: The General Seasonal Model and Guidelines for Tentative Identification.
Intervention Models. A Procedure for Building a Transfer Function Model
Causality in time series: Granger causality. Hypothesis testing on rational expectations. Hypothesis testing on market
efficiency.
Text Books:
1. Bruce L. Bowerman, Richard O'Connell, Anne Koehler, “Forecasting, Time Series, and Regression,
4th Edition”, Cengage Unlimited Publishers
2. Enders W. Applied Econometric Time Series. John Wiley & Sons, Inc., 1995
Additional Reference Material
1. Mills, T.C. The Econometric Modelling of Financial Time Series. Cambridge University Press, 1999
2. Andrew C. Harvey. Time Series Models. Harvester wheatsheaf, 1993
3. P. J. Brockwell, R. A. Davis, Introduction to Time Series and Forecasting. Springer, 1996
4. Cryer, Jonathan D.; Chan, Kung-sik, "Time Series Analysis: With Applications in R", New York: Springer, 2008
Semester: I Year: 2019-2020
Department: Information Science and Engineering Course Type: Elective
Course Title: Computer Vision Course Code: 19DS153
L-T-P: 4-0-0 Credits: 04
Total Contact Hours: 52 hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Pre-requisites:
Basic knowledge of Data Mining
Programming knowledge in object-oriented methodology
Course Outcomes:
Students will be able to:
COs Course Learning Outcomes BL
CO1 Identify image processing techniques to solve real-world applications L2
CO2 Apply deep learning methods on images to solve high-complexity problems L3
CO3 Develop a technique for image feature extraction L3
CO4 Design techniques for image analysis and classification L3
Teaching Methodology:
Black Board Teaching / Power Point Presentation
Programming Assignment
Assessment Methods:
Three internal tests of 30 Marks each will be conducted and the average of the best two will be taken.
Rubrics for Programming Assignment for 20 marks.
A final examination of 100 Marks will be conducted and evaluated for 50 Marks.
Course Outcome to Programme Outcome Mapping:
COs PO1 PO2 PO3 PO4 PO5 PO6
CO1 2 2 1 1
CO2 2 3 3 1
CO3 3 3 2
CO4 2 2 3 3
19DS251 2 2 3 2 1
COURSE CONTENT
UNIT – I 10 hrs
Introduction: Why is computer vision difficult? Image representation and analysis tasks. The image, its representations and properties - a few concepts, image digitization, digital image properties, color images, cameras. The image, its mathematical and physical background - linear integral transforms, images as stochastic processes, image formation physics.
UNIT – II 10 hrs
Data structures for image analysis - levels of image data representation, traditional image data structures, and hierarchical data structures. Image preprocessing - pixel brightness transformations, geometric transformations, local preprocessing, image restoration.
UNIT – III 10 hrs
Segmentation- thresholding, edge based segmentation, region based segmentation, matching, evaluation issues in
segmentation. Image data compression- image data properties, discrete image transforms in image data compression,
predictive compression methods, vector quantization, hierarchical and progressive compression methods, comparison of
compression methods.
UNIT – IV 11 hrs
Shape representation and description - region identification, contour based shape representation and description - chain codes, simple geometric border representation, region based representation and description - simple scalar region descriptors, moments.
UNIT – V 11 hrs
Recognition: knowledge representation, statistical pattern recognition - classification principles, classifier settings, classifier learning. Support vector machines, cluster analysis. Neural nets - feed forward networks, unsupervised learning, Hopfield neural networks.
Text books:
1. Digital image processing and computer vision by Milan Sonka
Reference book:
1. Digital Image Processing and Analysis by Chanda and Dutta Majumder
2. Digital Image Processing by Gonzalez and Woods.
Semester: I Year: 2019-2020
Department: Information Science and Engineering Course Type: Core
Course Title: Data Engineering Lab Course Code: 19DSL16
L-T-P: 0-0-4 Credits: 04
Total Contact Hours: 26 hrs Duration of SEE: 3 hrs
SEE Marks: 50 CIE Marks: 50
Pre-requisites:
Basic Python Programming,
Machine learning
Fundamentals of Probability and Statistics
Course Outcomes:
Students will be able to
COs Course Learning Outcomes BL
CO1 Describe the commands and set up the programming environment of Python, R and MapReduce. L2
CO2 Apply machine learning concepts to analyze real-world problems using Data Analysis. L3
CO3 Apply probability and statistical techniques to solve problems of moderate complexity. L3
CO4 Analyze large data sets to derive interesting inferences. L4
Teaching Methodology:
Blackboard teaching and PPT
Executables
Programming Assignment
Assessment Methods
Program Evaluation on the basis of Rubrics.
Two internal tests of 20 Marks each will be conducted and the average of the best two will be taken.
A final examination of 50 Marks will be conducted and evaluated for 50 Marks.
Course Outcome to Programme Outcome Mapping
PO1 PO2 PO3 PO4 PO5 PO6
CO1 2 2 1 1 2
CO2 3 2 2 3 1 2
CO3 3 2 2 3 1 2
CO4 3 1 3 2 3 2
19DSL16 3 2 2 3 1 2
COURSE CONTENT
Program No. Domain Assignment
1 Basic Python The number of birds banded at a series of sampling sites has been counted by your
field crew and entered into the following list. The first item in each sublist is an
alphanumeric code for the site and the second value is the number of birds banded. Cut
and paste the list into your assignment and then answer the following questions by
printing them to the screen.
data = [['A1', 28], ['A2', 32], ['A3', 1], ['A4', 0],
['A5', 10], ['A6', 22], ['A7', 30], ['A8', 19],
['B1', 145], ['B2', 27], ['B3', 36], ['B4', 25],
['B5', 9], ['B6', 38], ['B7', 21], ['B8', 12],
['C1', 122], ['C2', 87], ['C3', 36], ['C4', 3],
['D1', 0], ['D2', 5], ['D3', 55], ['D4', 62],
['D5', 98], ['D6', 32]]
1. How many sites are there?
2. How many birds were counted at the 7th site?
3. How many birds were counted at the last site?
4. What is the total number of birds counted across all sites?
5. What is the average number of birds seen on a site?
6. What is the total number of birds counted on sites with codes beginning with
C?
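A minimal Python sketch of one way to answer the six questions with list indexing and comprehensions (variable names are illustrative):

```python
# Site codes and banded-bird counts from the assignment.
data = [['A1', 28], ['A2', 32], ['A3', 1], ['A4', 0],
        ['A5', 10], ['A6', 22], ['A7', 30], ['A8', 19],
        ['B1', 145], ['B2', 27], ['B3', 36], ['B4', 25],
        ['B5', 9], ['B6', 38], ['B7', 21], ['B8', 12],
        ['C1', 122], ['C2', 87], ['C3', 36], ['C4', 3],
        ['D1', 0], ['D2', 5], ['D3', 55], ['D4', 62],
        ['D5', 98], ['D6', 32]]

n_sites = len(data)                                    # 1. number of sites
birds_7th = data[6][1]                                 # 2. birds at the 7th site
birds_last = data[-1][1]                               # 3. birds at the last site
total = sum(count for site, count in data)             # 4. total across all sites
average = total / n_sites                              # 5. average per site
total_c = sum(count for site, count in data
              if site.startswith('C'))                 # 6. total on 'C' sites

print(n_sites, birds_7th, birds_last, total, average, total_c)
```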
2. Basic Python Dr. Granger is interested in studying the relationship between the length of house-elves’ ears and aspects of their DNA. She has obtained DNA samples and ear measurements from a small group of house-elves to conduct a preliminary analysis. You are supposed to conduct the analysis for her. She has placed the file on the web for you to download.
Write a Python script that:
1. Imports the data into a data structure of your choice
2. Loops over the rows in the dataset
3. For each row in the dataset, checks whether the ear length is large (>10 cm) or small (<=10 cm) and determines the GC-content of the DNA sequence (i.e., the percentage of bases that are either G or C)
4. Stores this information in a table where the first column has the ID for the individual, the second column contains the string ‘large’ or the string ‘small’ depending on the size of the individual’s ears, and the third column contains the GC content of the DNA sequence.
5. Prints the average GC-content for both large-eared elves and small-eared
elves to the screen.
6. Exports the table of individual level GC values to a CSV (comma delimited
text) file titled grangers_analysis.csv.
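A sketch of the script, using a small made-up stand-in for Dr. Granger's file since its contents are not reproduced here (the IDs, ear lengths, and sequences below are invented):

```python
import csv

def gc_content(seq):
    """Percentage of bases that are G or C."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)

# Hypothetical rows shaped like (id, ear_length_cm, dna_sequence); in the real
# assignment these come from the downloaded dataset.
rows = [("HE001", 12.3, "ATGGCCAT"),
        ("HE002", 8.1, "ATATATGC"),
        ("HE003", 14.0, "GGCCGGCC")]

# Steps 2-4: classify ear size and compute GC content per individual.
table = [(eid, "large" if ear > 10 else "small", gc_content(dna))
         for eid, ear, dna in rows]

# Step 5: average GC content per ear-size group.
for size in ("large", "small"):
    vals = [gc for _, s, gc in table if s == size]
    print(size, sum(vals) / len(vals))

# Step 6: export the table.
with open("grangers_analysis.csv", "w", newline="") as f:
    csv.writer(f).writerows(table)
```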
3. Basic Exploratory
Data Analysis
Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some
sub-metering values are available.
Dataset: Individual household electric power consumption Data Set
https://github.com/mGalarnyk/datasciencecoursera/blob/master/4_Exploratory_Data_A
nalysis/project1/README.md
Perform the following:
1. Load the data
2. Subset the data from the dates 2007-02-01 and 2007-02-02
3. Create a histogram
4. Create a time series plot
5. Create a plot for sub-metering
6. Create multiple plots
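Assuming the file has already been parsed into a pandas DataFrame with a DatetimeIndex, the subsetting and plotting steps might look like the sketch below; the frame here is synthetic because the real file is not bundled with this handbook (it is ';'-separated with Date/Time columns):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")            # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in covering the two target days at one-minute resolution.
idx = pd.date_range("2007-02-01", "2007-02-02 23:59", freq="min")
df = pd.DataFrame({"Global_active_power": range(len(idx))}, index=idx)

subset = df.loc["2007-02-01":"2007-02-02"]           # task 2: subset the two dates
subset["Global_active_power"].plot.hist(bins=20)     # task 3: histogram
plt.savefig("plot1.png")
plt.close()
subset["Global_active_power"].plot()                 # task 4: time series
plt.savefig("plot2.png")
plt.close()
```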
4. Exploratory
Data Analysis
Using R
The data for this assignment are available from the course web site as a single zip file:
Data Set
The zip file contains two files:
$PM_{2.5}$ Emissions Data (summarySCC_PM25.rds): This file contains a data frame with all of the $PM_{2.5}$ emissions data for 1999, 2002, 2005, and 2008. For each year, the table contains the number of tons of $PM_{2.5}$ emitted from a specific type of source for the entire year. Here are the first few rows.
## fips SCC Pollutant Emissions type year
## 4 09001 10100401 PM25-PRI 15.714 POINT 1999
## 8 09001 10100404 PM25-PRI 234.178 POINT 1999
## 12 09001 10100501 PM25-PRI 0.128 POINT 1999
## 16 09001 10200401 PM25-PRI 2.036 POINT 1999
## 20 09001 10200504 PM25-PRI 0.388 POINT 1999
## 24 09001 10200602 PM25-PRI 1.490 POINT 1999
fips: A five-digit number (represented as a string) indicating the U.S. county
SCC: The name of the source as indicated by a digit string (see source code
classification table)
Pollutant: A string indicating the pollutant
Emissions: Amount of $PM_{2.5}$ emitted, in tons
type: The type of source (point, non-point, on-road, or non-road)
year: The year of emissions recorded
Source Classification Code Table (Source_Classification_Code.rds): This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the $PM_{2.5}$ source. The sources are categorized in a few different ways, from more general to more specific, and you may choose to explore whatever categories you think are most useful. For example, source 10100101 is known as Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal.
You can read each of the two files using the readRDS() function in R. For example, reading in each file can be done with the following code (variable names are illustrative):
NEI <- readRDS("summarySCC_PM25.rds")
SCC <- readRDS("Source_Classification_Code.rds")
You must address the following questions and tasks in your exploratory analysis. For
each question/task you will need to make a single plot. Unless specified, you can use
any plotting system in R to make your plot.
1. Have total emissions from $PM_{2.5}$ decreased in the United States from
1999 to 2008? Using the base plotting system, make a plot showing the total
$PM_{2.5}$ emission from all sources for each of the years 1999, 2002,
2005, and 2008.
2. Have total emissions from $PM_{2.5}$ decreased in Baltimore City, Maryland (fips == "24510") from 1999 to 2008? Use the base plotting system to make a plot answering this question.
3. Of the four types of sources indicated by the type variable (point, nonpoint, onroad, nonroad), which of these four sources have seen decreases in emissions from 1999 to 2008 for Baltimore City? Which have seen increases in emissions from 1999 to 2008? Use the ggplot2 plotting system to make a plot answering this question.
4. Across the United States, how have emissions from coal combustion-related sources changed from 1999 to 2008?
5. How have emissions from motor vehicle sources changed from 1999 to 2008 in Baltimore City?
6. Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California (fips == "06037"). Which city has seen greater changes over time in motor vehicle emissions?
Making and Submitting Plots
For each plot you should:
1. Construct the plot and save it to a PNG file.
2. Create a separate R code file (plot1.R, plot2.R, etc.) that constructs the
corresponding plot, i.e. code in plot1.R constructs the plot1.png plot. Your
code file should include code for reading the data so that the plot can be fully
reproduced. You should also include the code that creates the PNG file. Only
include the code for a single plot (i.e. plot1.R should only include code for
producing plot1.png)
3. Upload the PNG file on the Assignment submission page
4. Copy and paste the R code from the corresponding R file into the text box at
the appropriate point in the peer assessment.
Hint-
https://github.com/mGalarnyk/datasciencecoursera/blob/master/4_Exploratory_Data_A
nalysis/project2/project2.md
5. Decision Tree
Binary Decision Trees: One very interesting application area of machine learning is in
making medical diagnoses.
Objective: Train and test a binary decision tree to detect breast cancer on real-world data using Python/R. Predict whether the cancer is benign or malignant.
DataSet:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
The Dataset: We will use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.
The dataset consists of 569 samples of biopsied tissue. The tissue for each sample is
imaged and 10 characteristics of the nuclei of cells present in each image are
characterized. These characteristics are: 1. Radius, 2. Texture, 3. Perimeter, 4. Area, 5. Smoothness, 6. Compactness, 7. Concavity, 8. Number of concave portions of contour, 9. Symmetry, 10. Fractal dimension.
Each of the 569 samples used in the dataset consists of a feature vector of length 30.
The first 10 entries in this feature vector are the mean of the characteristics listed above
for each image. The second 10 are the standard deviation and last 10 are the largest
value of each of these characteristics present in each image. Each sample is also
associated with a label. A label of value 1 indicates the sample was for malignant (cancerous) tissue. A label of value 0 indicates the sample was for benign tissue. This
dataset has already been broken up into training, validation and test sets for you and is
available in the compressed archive for this problem on the class website. The names
of the files are “trainX.csv”, “trainY.csv”, “validationX.csv”, “validationY.csv”,
“testX.csv” and “testY.csv.” The file names ending in “X.csv” contain feature vectors
and those ending in “Y.csv” contain labels. Each file is in comma separated value
format where each row represents a sample.
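A binary decision tree grows by repeatedly choosing the split that minimizes label impurity. A minimal pure-Python sketch of that Gini-based split search is below; the full WDBC task would apply it recursively (or simply use a library tree such as scikit-learn's), and the toy data here is invented:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best
    binary split 'feature <= threshold' over all features and thresholds."""
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(len(y)) if X[i][j] <= t]
            right = [y[i] for i in range(len(y)) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy one-feature data that is cleanly separable at threshold 2.
feature, threshold, impurity = best_split([[1], [2], [8], [9]], [0, 0, 1, 1])
```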
6. Linear Regression with One Variable
Objective: Implement linear regression with one variable to predict profits for a
food truck.
Data Set: https://searchcode.com/codesearch/view/5404318/#
Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities. You would like to use this data to help you select which city to expand to next. The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.
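A batch-gradient-descent sketch for the one-variable case, run here on synthetic data generated from a known line since ex1data1.txt is not bundled with this handbook:

```python
def gradient_descent(x, y, alpha=0.05, iters=5000):
    """Fit h(x) = t0 + t1*x by batch gradient descent on mean squared error."""
    t0 = t1 = 0.0
    m = len(x)
    for _ in range(iters):
        pred = [t0 + t1 * xi for xi in x]
        g0 = sum(p - yi for p, yi in zip(pred, y)) / m       # d(cost)/d(t0)
        g1 = sum((p - yi) * xi
                 for p, yi, xi in zip(pred, y, x)) / m       # d(cost)/d(t1)
        t0 -= alpha * g0
        t1 -= alpha * g1
    return t0, t1

# Synthetic data from the line y = 1 + 2x (stand-in for population/profit pairs).
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
t0, t1 = gradient_descent(x, y)
```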
7. Linear Regression with Multiple Variables
Objective: Implement linear regression with multiple variables to predict the
prices of houses.
Data Set: https://searchcode.com/codesearch/view/6577026/
Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices. The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.
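With several features, mean normalization plus the normal equation (solved below via least squares) gives the weights directly. The rows here are synthetic, generated from a known linear rule, since ex1data2.txt is not bundled with this handbook:

```python
import numpy as np

def fit_normal_eq(X, y):
    """Mean-normalize features, add an intercept column, and solve the
    least-squares problem (equivalent to the normal equation)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Xn = (X - mu) / sigma
    A = np.hstack([np.ones((len(Xn), 1)), Xn])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta, mu, sigma

def predict(theta, mu, sigma, x):
    return theta[0] + ((np.asarray(x, float) - mu) / sigma) @ theta[1:]

# Synthetic rows shaped like ex1data2.txt (size, bedrooms, price), generated
# from the rule price = 100*size + 20000*bedrooms.
data = np.array([[1000, 2, 140000],
                 [1500, 3, 210000],
                 [2000, 3, 260000],
                 [2500, 4, 330000],
                 [3000, 4, 380000]], dtype=float)
theta, mu, sigma = fit_normal_eq(data[:, :2], data[:, 2])
```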
8. Logistic
Regression
Objective: Build a logistic regression model to predict whether a student gets
admitted into a university.
Dataset: http://en.pudn.com/Download/item/id/2546378.html
Suppose that you are the administrator of a university department and you want to
determine each applicant’s chance of admission based on their results on two exams.
You have historical data from previous applicants that you can use as a training set for
logistic regression. For each training example, you have the applicant’s scores on two
exams and the admissions decision. Your task is to build a classification model that estimates an applicant’s probability of admission based on the scores from those two exams.
Implement the following:
1. Visualize the data.
2. Implement the sigmoid function
3. Implement the cost function and gradient for logistic regression
4. Evaluate Logistic Regression
5. Predict the results
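Steps 2-3 (sigmoid, cost, gradient) can be sketched as below; the exam-score matrix here is a made-up stand-in for the real dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Cross-entropy cost and its gradient for logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Hypothetical stand-in rows: [intercept, exam1 score, exam2 score] (scaled).
X = np.array([[1, 0.2, 0.3], [1, 0.9, 0.8], [1, 0.1, 0.4], [1, 0.8, 0.9]])
y = np.array([0, 1, 0, 1])

theta = np.zeros(3)
for _ in range(1000):
    J, grad = cost_and_grad(theta, X, y)
    theta -= 0.5 * grad                      # plain gradient descent step
pred = (sigmoid(X @ theta) >= 0.5).astype(int)
```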
9. Probability
Distribution
Generate and plot some data from a Poisson distribution with an arrival rate of 1.
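One way to generate such data with NumPy (seed fixed for reproducibility); the empirical frequencies could then be drawn as a bar chart:

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed for reproducibility
samples = rng.poisson(lam=1, size=10_000)    # arrival rate (lambda) of 1

# Empirical distribution of the counts; these frequencies approximate the
# Poisson(1) pmf e^{-1}/k! and can be plotted, e.g., with matplotlib's bar().
values, counts = np.unique(samples, return_counts=True)
freqs = counts / samples.size
```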
10. Uniform
Probability
Distribution
Calculate the area of A = {(x, y) ∈ ℝ²: 0 < x < 1, 0 < x < y²} using the Monte Carlo Integration Method.
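Reading the region as bounded by the unit square, i.e. {(x, y) ∈ (0,1)²: x < y²}, its exact area is ∫₀¹ y² dy = 1/3, which a hit-or-miss estimate should approach:

```python
import random

def mc_area(n=200_000, seed=42):
    """Hit-or-miss Monte Carlo: sample the unit square uniformly and count
    the fraction of points falling inside {(x, y): x < y^2}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x < y * y:
            hits += 1
    return hits / n

estimate = mc_area()    # should be close to 1/3
```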
11. Support
Vector
Machine
Objective: To model a classifier for predicting whether a patient is suffering from any
heart disease or not.
Data Set:https://archive.ics.uci.edu/ml/datasets/heart+Disease
Hint: https://dataaspirant.com/2017/01/19/support-vector-machine-classifier-
implementation-r-caret-package/
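A minimal sketch of the classifier using scikit-learn's SVC on made-up two-feature data; the real task would load the UCI Heart Disease attributes and labels instead:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in features (e.g., two scaled clinical measurements);
# class 1 = heart disease present, class 0 = absent.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
preds = clf.predict([[1.5, 1.5], [8.5, 8.5]])
```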
12. Bayes
Theorem
Data Set: specdata.zip
The zip file containing the data can be downloaded here: specdata.zip. The zip file
contains 332 comma-separated-value (CSV) files containing pollution monitoring data
for fine particulate matter (PM) air pollution at 332 locations in the United States. Each
file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file
“200.csv”. Each file contains three variables. Date: the date of the observation in (year-
month-day) format, sulfate: the level of sulfate PM in the air on that date (measured in
micrograms per cubic meter), and nitrate: the level of nitrate PM in the air on that date
(measured in micrograms per cubic meter)
1. Write a function named ‘pollutantmean’ that calculates the mean of a
pollutant (sulfate or nitrate) across a specified list of monitors. The function
‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’.
Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’
particulate matter data from the directory specified in the ‘directory’ argument
and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.
2. Write a function that reads a directory full of files and reports the number of
completely observed cases in each data file. The function should return a data
frame where the first column is the name of the file and the second column is
the number of complete cases.
3. Write a function that takes a directory of data files and a threshold for
complete cases and calculates the correlation between sulfate and nitrate for
monitor locations where the number of completely observed cases (on all
variables) is greater than the threshold. The function should return a vector of
correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a
numeric vector of length 0.
Hint:
https://rpubs.com/ahmedtadde/DS-Rprogramming1
https://github.com/mGalarnyk/datasciencecoursera/blob/master/2_R_Programming/pro
jects/project1.md
13. MapReduce Write map and reduce methods to count the number of occurrences of each word in a
file. For the purposes of this assignment, a word is defined as any string of
alphabetic characters appearing between non-alphabetic characters; "nature's" counts as two
words. The count should be case-insensitive. If a word occurs multiple times in a line,
all occurrences should be counted. A StringTokenizer is a convenient way to parse the words from
the input line. There is documentation of StringTokenizer online, and there is an
example of its use in the reader functions.
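The Hadoop version would use Java's StringTokenizer; the same map/reduce logic can be sketched in Python, simulating the shuffle with a plain loop:

```python
import re
from collections import defaultdict

def word_map(line):
    """Emit (word, 1) for every maximal run of alphabetic characters,
    lower-cased; "nature's" yields "nature" and "s" (two words)."""
    for w in re.findall(r"[A-Za-z]+", line):
        yield w.lower(), 1

def word_reduce(pairs):
    """Sum the counts emitted by the mapper, per word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["Nature's nature", "the THE the"]
result = word_reduce(p for line in lines for p in word_map(line))
```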
14. Objective: Write map and reduce methods to determine the average ratings of
movies.
Data Set:
http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9
a
The input consists of a series of lines, each containing a movie number, user number, rating, and date: 3980,294028,5,2005-11-15
The map should emit the movie number and the rating, and the reduce should return, for each movie number, the average rating as a Double and the number of ratings as an Integer. This data is similar to the Netflix Prize data.
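The same pattern in a Python simulation; the first sample line is the one shown above and the second is an invented record in the same format:

```python
def rating_map(line):
    """Parse 'movie,user,rating,date' and emit (movie, rating)."""
    movie, user, rating, date = line.strip().split(",")
    yield movie, float(rating)

def rating_reduce(pairs):
    """Return {movie: (average rating as float, number of ratings as int)}."""
    totals = {}
    for movie, r in pairs:
        s, n = totals.get(movie, (0.0, 0))
        totals[movie] = (s + r, n + 1)
    return {m: (s / n, n) for m, (s, n) in totals.items()}

lines = ["3980,294028,5,2005-11-15",
         "3980,100271,3,2005-11-16"]   # second line is invented
result = rating_reduce(p for line in lines for p in rating_map(line))
```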
15. K-means
Clustering
Given the matrix X whose rows represent different data points, you are asked to perform a k-means clustering on this dataset using the Euclidean distance as the distance function. Here k is chosen as 3. The Euclidean distance d between vectors x and y, both in $R^p$, is defined as $d = \sqrt{\sum_{i=1}^{p}(x_i - y_i)^2}$. All data in X were plotted in Figure 1. The centres of the 3 clusters were initialized as µ1 = (6.2, 3.2) (red), µ2 = (6.6, 3.7) (green), µ3 = (6.5, 3.0) (blue).
1. What’s the centre of the first cluster (red) after one iteration? (Answer in the format
of [x1, x2], round your results to three decimal places, same as problems 2 and 3)
2. What’s the centre of the second cluster (green) after two iterations?
3. What’s the centre of the third cluster (blue) when the clustering converges?
4. How many iterations are required for the clusters to converge?
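A Lloyd's-algorithm sketch using the stated initial centres; since the matrix X itself is not reproduced here, the data points below are illustrative:

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Lloyd's algorithm with Euclidean distance.
    Returns (final centers, labels, number of iterations used)."""
    centers = np.asarray(centers, dtype=float).copy()
    for it in range(1, max_iter + 1):
        # Distance from every point to every centre, then nearest-centre labels.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.vstack([X[labels == k].mean(axis=0) if (labels == k).any()
                         else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):        # converged: centres stopped moving
            return new, labels, it
        centers = new
    return centers, labels, max_iter

# Illustrative points near the three stated initial centres.
X = np.array([[6.1, 3.1], [6.2, 3.3], [6.6, 3.8],
              [6.7, 3.6], [6.4, 2.9], [6.6, 3.0]])
init = [[6.2, 3.2], [6.6, 3.7], [6.5, 3.0]]
centers, labels, iters = kmeans(X, init)
```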
16. Hierarchical
Clustering
In the figure, there are two clusters A (red) and B (blue), each of which has four members; the coordinates of each member are labeled in the figure. Compute the distance between the two clusters using the Euclidean distance.
1. What is the distance between the two farthest members? (complete link) (round to four decimal places here and in the next 2 problems)
2. What is the distance between the two closest members? (single link)
3. What is the average distance between all pairs?
4. Among all three distances above, which one is robust to noise? Answer either “complete”, “single”, or “average”.
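The three linkage distances can be computed with small helpers; the clusters below are hypothetical, since the figure's coordinates are not reproduced here:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def pairwise(A, B):
    """All member-to-member distances between clusters A and B."""
    return [dist(a, b) for a in A for b in B]

def complete_link(A, B):     # distance between the two farthest members
    return max(pairwise(A, B))

def single_link(A, B):       # distance between the two closest members
    return min(pairwise(A, B))

def average_link(A, B):      # average distance over all pairs
    return sum(pairwise(A, B)) / (len(A) * len(B))

# Hypothetical clusters for illustration.
A = [(0, 0), (0, 1)]
B = [(3, 0), (3, 1)]
```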
17. Multivariate
Analysis
Multilinear Regression :
1. Using the matrices X and Y found in FILE2, do the following:
1. Compute the vector of raw regression weights
2. Compute the standard error of the regression weights
3. Compute t-tests for each regression weight
4. Compute the vector of predicted scores
5. Compute the vector of residual scores
6. Compute the squared multiple correlation
7. Compute the F-ratio for the model
8. Compute the standard error of estimate
9. Compute the vector of standardized regression weights
2. Using the matrix X found in FILE1 do the following:
1. Compute the deviation SSCP matrix, S
2. Compute the covariance matrix, C
3. Compute the correlation matrix, R
4. Compute the determinants of S, C and R
5. Compute the eigenvalues of S, C and R
6. Compute the eigenvectors of S, C and R
Canonical Correlation Analysis
Using the matrix XY (the first 3 columns of XY are the Y variables while the last 5
columns are the X variables) found in FILE5 do the following:
1. Compute the squared canonical correlation.
2. Compute the canonical correlation.
3. Compute the eigenvalues of A and B.
4. Compute the eigenvectors of A and B.
5. Compute the F statistic approximations.
6. Compute the degrees of freedom.
7. Determine which canonical dimensions are significant.
Note: Be sure to label your output and include comments.
File1:
Mata
X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
Stata
mat X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
SAS IML
X = { 3 9 17 24,
7 8 11 25,
6 5 13 29,
4 7 15 32,
7 9 13 24,
8 8 1 23};
File 2:
Mata
X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
Stata
mat X = (3,9,17,24\7,8,11,25\6,5,13,29\4,7,15,32\7,9,13,24\8,8,1,23)
SAS IML
X = {
3 9 17 24,
7 8 11 25,
6 5 13 29,
4 7 15 32,
7 9 13 24,
8 8 1 23};
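As a cross-check for assignment 2, the SSCP, covariance, and correlation computations on the File1 matrix can be sketched in NumPy (the assignment itself expects Mata/Stata/SAS IML):

```python
import numpy as np

# The File1 data matrix X.
X = np.array([[3, 9, 17, 24],
              [7, 8, 11, 25],
              [6, 5, 13, 29],
              [4, 7, 15, 32],
              [7, 9, 13, 24],
              [8, 8, 1, 23]], dtype=float)

n = X.shape[0]
Xc = X - X.mean(axis=0)           # column-centered (deviation) scores
S = Xc.T @ Xc                     # 1. deviation SSCP matrix
C = S / (n - 1)                   # 2. covariance matrix
sd = np.sqrt(np.diag(C))
R = C / np.outer(sd, sd)          # 3. correlation matrix
dets = [np.linalg.det(M) for M in (S, C, R)]   # 4. determinants
eigvals = [np.linalg.eig(M)[0] for M in (S, C, R)]   # 5. eigenvalues
eigvecs = [np.linalg.eig(M)[1] for M in (S, C, R)]   # 6. eigenvectors
```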