CS 626 Large Scale Data Science
Jun Zhang
Department of Computer Science
University of Kentucky
Based on materials prepared by Dr. Licong Cui
Lecture 1 – Introduction
Outline
Course Logistics
Student Introduction
Introduction to Big Data
Course Logistics
• Class hours: TR 12:30 pm - 1:45 pm
• Class location: F. Paul Anderson Tower Room 255
• Office hours: MW 9:00 am - 10:00 am
• Course documents: http://www.cs.uky.edu/~jzhang/CS626/cs626.html
  o Syllabus
  o Files
    - Slides
    - Homework and Project Assignments
Course Description
• Data => Actionable information
• Big Data Techniques
  – Hadoop/MapReduce
  – HBase
  – Hive
  – Pig
  – Spark
• Real-world data science problems
Prerequisites and Expected Background
• Algorithm design and analysis
• Database systems (e.g., MySQL)
• Programming languages
  – Java (preferred)
  – Python
• Linux basics (e.g., ssh, scp)
• Your own computer requirements:
  – 64-bit OS
  – 10+ GB RAM
Alternative Hardware Systems
• Use the CS Department’s OpenStack cluster
• Contact Mr. Jarad Downing [email protected] to obtain an account and learn the requirements
• The Cloudera system has been installed on OpenStack
• More information about OpenStack is at: https://www.cs.uky.edu/docs/users/openstack.html
What Do You Need for the OpenStack Cluster?
• You need to connect to the UK campus network via VPN; see: https://www.cs.uky.edu/docs/users/vpn.html
• You need to install NoMachine; it can be found here: https://www.nomachine.com
• You need to use your UK ID address and the credentials (cloudera/cloudera) to connect.
Textbook (Optional)
• Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (4th Edition)
• Author: Tom White
• ISBN-13: 978-1491901632
• ISBN-10: 1491901632
Grading Criteria
• Homework/Programming assignments (40%)
• Paper presentation (20%)
• Project (30%)
  – Project team: each team consists of up to 3 members
  – Clear statement of contribution for each team member
  – Deliverables: mid-project report (5%), live demos (5%), and final project report (20%)
• Attendance and participation (10%)
  – Attendance: 5%
  – Participation: 5% (taking part in class discussions)
Grading Scale
85 – 100% = A
75 – 84% = B
60 – 74% = C
< 60% = E
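The weights from the grading criteria and the cutoffs from the grading scale can be combined into a quick calculation. The following is a minimal sketch, assuming each component is scored from 0 to 100; the function names, the sample scores, and the handling of totals that fall between bands (for example 84.5) are assumptions, not course policy.

```python
# Weights taken from the Grading Criteria slide (they sum to 100%).
WEIGHTS = {
    "homework": 0.40,
    "presentation": 0.20,
    "project": 0.30,
    "attendance_participation": 0.10,
}


def weighted_total(scores):
    """Combine per-component scores (each 0-100) into a course percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(total):
    """Map a percentage to a letter using the Grading Scale slide.

    Assumption: cutoffs are treated as lower bounds with no rounding.
    """
    if total >= 85:
        return "A"
    if total >= 75:
        return "B"
    if total >= 60:
        return "C"
    return "E"


# Hypothetical student scores, for illustration only.
example = {
    "homework": 90,
    "presentation": 80,
    "project": 85,
    "attendance_participation": 100,
}
total = weighted_total(example)
print(round(total, 1), letter_grade(total))  # 87.5 A
```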
Course Policies
• Academic Integrity
  – Independently complete homework/programming assignments.
  – Proper acknowledgement is required if you borrow ideas or content from other sources.
• Submission Policy
  – See each assignment for deadlines.
  – Late submissions will not be accepted.
Course Policies
• Attendance Policy
  – In order to meet federal regulations, the instructor will monitor student participation in this class through attendance or assignments. Students whose attendance or participation cannot be determined at least once during the first three weeks of the semester may be dropped from the course.
Course Policies
• Attendance Policy
  – University policy: students are expected to withdraw from the class if more than 20% of the classes scheduled for the semester are missed (excused or unexcused).
• Excused Absences
  – http://www.uky.edu/Ombud/
Student Introduction
Introduction to Big Data
Why Big Data?
  o What launches the Big Data era?
  o What makes Big Data valuable?
Characteristics of Big Data
What Launches the Big Data Era?
Retail: 2 billion products sold in 2014
Social media:
204 million emails/min
1.8 million likes, 200,000 photos/min
278,000 tweets/min
40,000 queries/sec, 3.5 billion/day
Healthcare: Samaritan Medical Center, Watertown, NY: 120 TB as of 2013
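The per-second and per-day query figures above agree with each other, which a quick unit conversion confirms. A small sketch; the helper name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_day(rate_per_second):
    """Convert a per-second rate into a per-day count."""
    return rate_per_second * SECONDS_PER_DAY


queries_per_day = per_day(40_000)
# 40,000 queries/sec works out to about 3.5 billion queries/day.
print(f"{queries_per_day / 1e9:.2f} billion queries/day")  # 3.46 billion queries/day
```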
What Makes Big Data Valuable?
Big Data => Better Models => Higher Precision
Example: Recommendation Engines
Example: Using Big Data to Help Patients
Big Data for precision medicine
  o Personalized healthcare
  o Predict/Prevent disease
Data sources
  o Genome
  o Sensors
  o Electronic Health Record (EHR)
  o People
Genome Data
200 GB/genome
Sensor Data
Electronic Health Record (EHR)
People-generated Data: Fitness Device Data
2-5 GB/day
How Can Big Data Help?
Integration
Genome Data
Sensor Data
Electronic Health Records
People-generated Data
How Can Big Data Help?
Integration => Personalization => Precision
Basic principles for big data integration
• Create a common understanding of data definition
• Develop a set of data services to qualify the data and make it consistent and ultimately trustworthy
• Set up a streamlined way to integrate your big data sources and systems of record
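One way to picture these principles is a small normalization step that maps records from different sources onto a shared definition, checks their quality, and only then combines them. This is an illustrative sketch only: the field names, the two sample sources, and the plausibility rule are all hypothetical and do not come from the course materials.

```python
# Hypothetical common data definition: every record ends up with these fields.
COMMON_FIELDS = ("patient_id", "source", "heart_rate_bpm")


def from_ehr(row):
    """Map a (hypothetical) EHR row onto the common definition."""
    return {
        "patient_id": str(row["PatientID"]),
        "source": "ehr",
        "heart_rate_bpm": float(row["HR"]),
    }


def from_fitness_device(row):
    """Map a (hypothetical) fitness-device reading onto the common definition."""
    return {
        "patient_id": str(row["user"]),
        "source": "fitness",
        "heart_rate_bpm": float(row["bpm"]),
    }


def is_trustworthy(record):
    """Simple quality check: all fields present and a plausible heart rate."""
    return (
        all(field in record for field in COMMON_FIELDS)
        and 20 <= record["heart_rate_bpm"] <= 250
    )


def integrate(ehr_rows, device_rows):
    """Normalize both sources, then keep only records that pass the check."""
    records = [from_ehr(r) for r in ehr_rows]
    records += [from_fitness_device(r) for r in device_rows]
    return [r for r in records if is_trustworthy(r)]


merged = integrate(
    [{"PatientID": 7, "HR": "72"}],
    [{"user": "7", "bpm": 68}, {"user": "7", "bpm": 0}],  # 0 bpm is rejected
)
print(len(merged))  # 2
```

Keeping the per-source mapping functions separate from the quality check means a new source only needs its own mapping function.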
Characteristics of Big Data – 6V’s
• Volume
• Variety
• Velocity
• Veracity
• Valence
• Value
Volume of big data
• The amount of data
• Facebook has 250 billion images, and 2.5 trillion posts (2016)
• The amount of data is ever increasing
• How to store the data
• How to process the data
Variety of big data
• An ever-increasing number of different forms of data
• Photographs, sensor data, tweets, encrypted packages
• Traditional data tables
• E-mail messages, with attachments
• Photos, videos and audio recordings
Velocity of big data
• The speed at which big data is created, stored, and/or analyzed.
• Facebook users upload 900 million photos every day
• Packet analysis for cybersecurity
• Search engine queries
• Internet of Things
Veracity of big data
• Quality and trustworthiness of data
• Accuracy, precision, reliability
• Any bias, noise, or abnormality in the data?
• Falsification?
• No good data, no good results
Valence of big data
• Connectedness of big data in the form of graphs
• Data items bond with each other
• Forming connections between disparate data
• Positive valence and negative valence
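One common way to quantify connectedness in a graph is the fraction of possible pairwise connections that actually exist (graph density). Using this ratio as a measure of valence is an assumption here, since the slide gives no formula; the node names and the function name are hypothetical.

```python
def connectedness(nodes, edges):
    """Fraction of possible undirected node pairs that are directly connected."""
    n = len(set(nodes))
    possible = n * (n - 1) // 2
    if possible == 0:
        return 0.0
    # Treat (a, b) and (b, a) as the same edge and ignore self-loops.
    unique_edges = {frozenset(e) for e in edges if len(set(e)) == 2}
    return len(unique_edges) / possible


# Hypothetical data sources as nodes; an edge means two sources are linked.
nodes = ["genome", "sensor", "ehr", "fitness"]
edges = [("genome", "ehr"), ("sensor", "ehr"), ("ehr", "sensor")]
print(round(connectedness(nodes, edges), 3))  # 0.333
```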
Value of big data
• The ability to convert big data information into a monetary reward
• The final goal of big data
• Data mining?
• Decisions and results