Data and Data Types
BİL477-2021-2022-FALL- INTRODUCTION TO DATA MINING DR. GÖKHAN MEMIŞ
GENEL- PUBLIC
What is Data
Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance
Data vs Information vs Knowledge
Knowledge Discovery in Data: Process
Knowledge Discovery in Data: Challenges
Data Come from Everywhere
Attribute (Feature) Values
In the fields of machine learning and pattern recognition, a measurable attribute of an observed phenomenon is called a feature (or attribute).
Selecting clear, distinctive and independent features is a critical step for effective pattern recognition, classification and regression algorithms.
Features are usually numeric, but some pattern analysis also uses words and graphs.
Types of Attributes
◦ Nominal
  ◦ Examples: ID numbers, eye color, zip codes
◦ Ordinal
  ◦ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
◦ Interval
  ◦ Examples: calendar dates, temperatures in Celsius or Fahrenheit
◦ Ratio
  ◦ Examples: temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)
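The four attribute types differ in which operations give meaningful results. The sketch below encodes the usual textbook convention (an assumption here, since the slide only lists examples): each type allows the operations of the previous one plus a new pair. The names `MEANINGFUL_OPS`, `is_meaningful`, and `celsius_to_kelvin` are illustrative only.

```python
# Operations that are conventionally meaningful for each attribute type.
# Assumption: standard textbook hierarchy, not stated on the slide itself.
MEANINGFUL_OPS = {
    "nominal": {"==", "!="},
    "ordinal": {"==", "!=", "<", ">"},
    "interval": {"==", "!=", "<", ">", "+", "-"},
    "ratio": {"==", "!=", "<", ">", "+", "-", "*", "/"},
}


def is_meaningful(attribute_type, op):
    """Return True if `op` is meaningful for the given attribute type."""
    return op in MEANINGFUL_OPS[attribute_type]


def celsius_to_kelvin(celsius):
    """Convert an interval-scale temperature to the ratio-scale Kelvin."""
    return celsius + 273.15


# Dividing Celsius values suggests 20 °C is "twice as hot" as 10 °C,
# which is not physically meaningful; the Kelvin ratio is.
ratio_celsius = 20 / 10
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)
print(ratio_celsius, round(ratio_kelvin, 4))
```

This is why temperature in Celsius is listed as an interval attribute while temperature in Kelvin is listed as a ratio attribute.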
Discrete and Continuous Attributes
Important Characteristics of Data
◦ Dimensionality (number of attributes)
  ◦ High-dimensional data brings a number of challenges
◦ Resolution
  ◦ Patterns depend on the scale
◦ Size
  ◦ Type of analysis may depend on the size of the data
Types of Dataset
◦ Record Data
  ◦ Transactional Data
  ◦ Data Matrix
  ◦ Document Data
◦ Temporal Data
  ◦ Time Series Data
  ◦ Sequence Data
◦ Spatial & Spatial-Temporal Data
  ◦ Spatial Data
  ◦ Spatial-Temporal Data
◦ Graph Data
  ◦ Transactional Data
◦ Unstructured Data
  ◦ Twitter Status Message
  ◦ Review, news article
◦ Semi-Structured Data
  ◦ Paper Publications Data
  ◦ XML format
Record Data
Data Matrix Example for Documents
Data Matrix
Thickness  Load  Distance  Projection of y load  Projection of x load
1.1        2.2   16.22     6.25                  12.65
1.2        2.7   15.22     5.27                  10.23
Temporal Data – Sequence Data
Time Series Data
Biological Sequence Data
Interval Data
Spatial & Spatial-Temporal Data
Graph Data
Structured, Semi-structured, Unstructured Data
Can data help us solve specific problems?
How should these pictures be placed into 3 groups?
How many groups should there be?
Which genes are associated with a disease?
Where are the faces in this picture?
Is it likely that this stock was traded based on illegal insider information?
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
◦ Noise and outliers
◦ Wrong data
◦ Fake data
◦ Missing values
◦ Duplicate data
Noise
Outliers
Outliers are data objects whose characteristics differ considerably from those of most other data objects in the data set.
How to find Outliers?
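One common way to answer this question is the interquartile range (IQR) rule. The sketch below is an assumption about a reasonable method, not necessarily the one shown on the original slide: values further than 1.5 × IQR outside the quartiles are flagged. The function name `find_outliers_iqr` and the sample data are illustrative.

```python
import statistics


def find_outliers_iqr(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 10, 95]
print(find_outliers_iqr(data))
```

The multiplier `k = 1.5` is a convention rather than a law; a larger value flags fewer points.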
Missing Values
Reasons for missing values
◦ Information is not collected (e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values
◦ Eliminate data objects or variables
◦ Estimate missing values
  ◦ Example: time series of temperature
  ◦ Example: census results
◦ Ignore the missing value during analysis
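For the temperature time-series example, one simple way to estimate missing values is linear interpolation between the nearest known readings. This is a minimal sketch under that assumption; the function name `interpolate_missing` and the readings are hypothetical, and gaps at the very start or end are left as `None`.

```python
def interpolate_missing(series):
    """Fill interior None gaps by linear interpolation between known neighbors."""
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    # Walk over each pair of consecutive known positions and fill the gap.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = series[left] + fraction * (series[right] - series[left])
    return filled


# Hypothetical hourly temperatures with three missing readings.
temps = [20.0, None, 22.0, None, None, 25.0]
filled = interpolate_missing(temps)
print(filled)
```

Interpolation suits slowly changing signals such as temperature; for survey-style data such as a census, an estimate based on similar records would be the analogous choice.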
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another
◦ Major issue when merging data from heterogeneous sources
Examples:
◦ Same person with multiple email addresses
Data cleaning
◦ Process of dealing with duplicate data issues
When should duplicate data not be removed?
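The "same person with multiple email addresses" case can be sketched as follows, assuming records are matched on a normalized name. That matching rule is a simplification chosen for illustration; the records and the helper names `normalize_name` and `merge_by_name` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records merged from two sources.
records = [
    {"name": "Ayse Yilmaz", "email": "ayse@example.com"},
    {"name": "ayse  yilmaz", "email": "a.yilmaz@example.org"},
    {"name": "Mehmet Kaya", "email": "mkaya@example.com"},
]


def normalize_name(name):
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def merge_by_name(rows):
    """Group email addresses under one entry per normalized name."""
    merged = defaultdict(set)
    for row in rows:
        merged[normalize_name(row["name"])].add(row["email"])
    return {name: sorted(emails) for name, emails in merged.items()}


people = merge_by_name(records)
print(people)
```

Matching on name alone would also merge two different people who share a name, which is one situation where apparent duplicates should not be removed.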
Distance Functions
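As the dimensionality (number of attributes) of the data increases, the Manhattan distance is often preferred to the Euclidean distance.
The Minkowski distance generalizes both through its order parameter p (p = 1 gives Manhattan, p = 2 gives Euclidean), so it is the natural choice when the distance measure itself needs to be tuned.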
Euclidean Distance
[Figure: points p1, p2, p3, and p4 plotted on an x-axis from 0 to 6 and a y-axis from 0 to 3]
Minkowski Distance
Consider two points in a 7 dimensional space:
P1: (10, 2, 4, -1, 0, 9, 1)
P2: (14, 7, 11, 5, 2, 2, 18)
If we set p = 4 for this sample calculation, we find the following:
distance_p4 = (-4)^4 + (-5)^4 + (-7)^4 + (-6)^4 + (-2)^4 + (7)^4 + (-17)^4
distance_p4 = 4^4 + 5^4 + 7^4 + 6^4 + 2^4 + 7^4 + 17^4
distance_p4 = 256 + 625 + 2401 + 1296 + 16 + 2401 + 83521
distance_p4 = 90516
minkowski_distance = distance_p4 ^ 0.25
minkowski_distance = 90516 ^ 0.25
minkowski_distance ≈ 17.3453
Manhattan Distance
When p = 1, the Minkowski distance is the same as the Manhattan distance.
L1     p1     p2     p3     p4
p1     0      4      4      6
p2     4      0      2      4
p3     4      2      0      2
p4     6      4      2      0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1     p2     p3     p4
p1     0      2      3      5
p2     2      0      1      3
p3     3      1      0      2
p4     5      3      2      0

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
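The three distance matrices can be reproduced from the point coordinates. The sketch below assumes the third matrix is the L∞ (maximum, or Chebyshev) distance, which matches its values; the helper names are illustrative.

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chebyshev(a, b):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def distance_matrix(pts, metric):
    """Pairwise distances as a list of rows, in the order of pts."""
    names = list(pts)
    return [[metric(pts[r], pts[c]) for c in names] for r in names]


l1 = distance_matrix(points, manhattan)
l2 = distance_matrix(points, euclidean)
linf = distance_matrix(points, chebyshev)

for row in l2:
    print([round(value, 3) for value in row])
```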
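Cosine Similarity
If A and B are two document (text) vectors, then cos(A, B) = (A · B) / (||A|| ||B||), where · is the dot product and ||A|| is the length (Euclidean norm) of A.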
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
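For these two document vectors, the cosine similarity is the dot product divided by the product of the vector lengths. A minimal sketch, with the function name `cosine_similarity` chosen for illustration:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

# dot = 3*1 + 2*1 = 5, ||d1|| = sqrt(42), ||d2|| = sqrt(6)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 4))
```

Only the terms that appear in both documents contribute to the dot product, so sparse vectors with little overlap score close to 0.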
Correlation measures the linear relationship between objects
Correlation Ranges
Correlation Calculation
x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
y_i = x_i^2
mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74
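Carrying the calculation through gives a correlation of 0, even though y is completely determined by x. The sketch below assumes the Pearson correlation with sample (n - 1) standard deviations, which matches the std values listed; the function name `pearson_correlation` is illustrative.

```python
import statistics


def pearson_correlation(x, y):
    """Pearson correlation using sample covariance and sample std."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum(
        (a - mean_x) * (b - mean_y) for a, b in zip(x, y)
    ) / (len(x) - 1)
    return covariance / (statistics.stdev(x) * statistics.stdev(y))


x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # y_i = x_i^2

r = pearson_correlation(x, y)
print(round(statistics.stdev(x), 2), round(statistics.stdev(y), 2), r)
```

Because correlation only measures linear relationships, a perfectly symmetric quadratic relationship like this one is invisible to it.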