LECTURE 03:
DATA COLLECTION AND MODELS
February 4, 2015
COMP 150-04
Topics in Visual Analytics
Note: slide deck adapted from R. Chang, Fall 2010
Announcements
• Course location has moved:
Halligan 102
• Assignment 1 posted on course website
• If you haven’t yet installed RStudio: http://www.rstudio.com/products/rstudio/download/
• To download the materials for today’s demo: http://www.cs.tufts.edu/comp/150VAN/demos/Stats-with-R.Rmd
Outline
• Reminder: post on “Illuminating the Path”
• Recap: Keim’s VA Model
• Data Foundations
  - Basic Data Types
  - Dimensionality
• Metadata: “data about data”
• Structure vs. Value
  - Value
  - Derived Value
  - Derived Structure
  - Structure
Reminder: thoughts on “Illuminating the Path”
• What did you think?
• Who is the intended audience? (…is it us?)
• Do the goals make sense to you?
• Is anything missing?
• From what you see in the world, how far along this agenda have we come since 2006?
Recap: Keim’s Visual Analytics Model
[Figure: Keim’s visual analytics model. Recoverable labels: input, pre-process, interactions]
Image source: Keim, Daniel, et al. Visual analytics: Definition, process, and challenges. Springer Berlin Heidelberg, 2008.
Data types
Dimensionality
Metadata
Structure vs. Value
Statistical Models in R
Data: a definition
• A typical dataset in visualization consists of n records:
$(r_1, r_2, r_3, \ldots, r_n)$
• Each record $r_i$ consists of $m$ ($m \geq 1$) observations or variables:
$(v_1, v_2, v_3, \ldots, v_m)$
• A variable may be either independent or dependent:
  - An independent variable (iv) is not controlled or affected by another variable (e.g., time in a time-series dataset)
  - A dependent variable (dv) is affected by a variation in one or more associated independent variables (e.g., temperature in a region)
• Formal definition:
  - $r_i = (iv_1, iv_2, iv_3, \ldots, iv_{m_i}, dv_1, dv_2, dv_3, \ldots, dv_{m_d})$
  - where $m = m_i + m_d$
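The definition above can be sketched in code. This is a minimal illustration, not part of the original deck: the `Record` class, its field names, and the hourly temperature readings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Record:
    """One record r_i: independent variables followed by dependent variables."""
    independent: Tuple[float, ...]  # iv_1 .. iv_{m_i}
    dependent: Tuple[float, ...]    # dv_1 .. dv_{m_d}

    @property
    def m(self) -> int:
        # m = m_i + m_d
        return len(self.independent) + len(self.dependent)


# Hypothetical time series: hour of day (iv) and temperature in Celsius (dv).
dataset = [
    Record(independent=(0.0,), dependent=(1.5,)),
    Record(independent=(6.0,), dependent=(0.5,)),
    Record(independent=(12.0,), dependent=(7.0,)),
]

n = len(dataset)  # number of records
```

Here time is the independent variable and temperature the dependent one, matching the two examples on the slide.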
Basic Data Types
• Nominal
• Ordinal
• Scale / Quantitative
  • Ratio
  • Interval
Nominal: an unordered set of non-numeric values
Examples:
• Categorical (finite) data
  - {apple, orange, pear}
  - {red, green, blue}
• Arbitrary (infinite) data
  - {“12 Main St. Boston MA”, “45 Wall St. New York NY”, …}
  - {“John Smith”, “Jane Doe”, …}
Basic Data Types
Ordinal: an ordered set (also known as a tuple)
Examples:
• Numeric: <2, 4, 6, 8>
• Binary: <0, 1>
• Non-numeric: <G, PG, PG-13, R>
Basic Data Types
Scale / Quantitative: a numeric range
Ratios
- Distance from “absolute zero”
- Can be compared mathematically using division
- For example: height, weight
Intervals
- Ordered numeric elements that can be mathematically manipulated, but cannot be compared as ratios
- E.g.: date, current time
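As a rough illustration of the four kinds of data, the sketch below maps each onto a built-in Python type. The specific values (heights, dates) are made up for the example; the point is that ratio data supports division while interval data supports only differences.

```python
from datetime import date

# Nominal: an unordered set of non-numeric values.
fruits = {"apple", "orange", "pear"}

# Ordinal: an ordered tuple; position encodes rank.
ratings = ("G", "PG", "PG-13", "R")
pg_before_r = ratings.index("PG") < ratings.index("R")

# Ratio: measured from an absolute zero, so division is meaningful.
height_a_cm, height_b_cm = 180.0, 90.0
height_ratio = height_a_cm / height_b_cm  # "twice as tall"

# Interval: differences are meaningful, ratios are not.
start, end = date(2015, 1, 4), date(2015, 2, 4)
gap_days = (end - start).days

try:
    end / start  # Python's date type does not define division
    dates_divisible = True
except TypeError:
    dates_divisible = False
```

Python's `date` type happens to enforce the interval rule: subtraction yields a duration, and division raises `TypeError`.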
Basic Data Types (Formal)
• Nominal (N) {…}
• Ordinal (O) <…>
• Scale / Quantitative (Q) […]
• Q → O
  • [0, 100] → <F, D, C, B, A>
• O → N
  • <F, D, C, B, A> → {C, B, F, D, A}
• N → O (??)
  • {John, Mike, Bob} → <Bob, John, Mike>
  • {red, green, blue} → <blue, green, red>??
• O → Q (??)
  • Hashing?
  • Bob + John = ??
Readings in Information Visualization: Using Vision to Think. Card, Mackinlay, Shneiderman, 1999
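The conversions above can be sketched as follows. The grade cutoffs (60, 70, 80, 90) are an assumption for illustration only; the slide gives the range [0, 100] and the grade order but no boundaries.

```python
from bisect import bisect_right

GRADES = ("F", "D", "C", "B", "A")  # ordinal: lowest to highest
CUTOFFS = (60, 70, 80, 90)          # assumed boundaries, not from the slides


def to_grade(score):
    """Q -> O: bin a score in [0, 100] into an ordered grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    return GRADES[bisect_right(CUTOFFS, score)]


# O -> N: dropping the order leaves a plain set of labels.
grade_labels = frozenset(GRADES)

# N -> O (??): any ordering we impose is arbitrary; alphabetical is one choice.
names = {"John", "Mike", "Bob"}
alphabetical = tuple(sorted(names))
```

Going down the chain (Q → O → N) only discards information, which is why it is always possible. Going up (N → O, O → Q) requires inventing an order or a scale, hence the question marks on the slide.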
Operations on Basic Data Types
• What are the operations that we can perform on these data types?
• Nominal (N)
  • = and ≠
• Ordinal (O)
  • >, <, ≥, ≤
• Scale / Quantitative (Q)
  • everything else (+, -, *, /, etc.)
• Consider a distance function
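One way to think about the distance question is to ask what each data type's operations allow. The sketch below is one possible answer, not a definitive one; the function names and the grade scale are hypothetical.

```python
GRADES = ("F", "D", "C", "B", "A")  # hypothetical ordinal scale


def nominal_distance(a, b):
    """Nominal data supports only = and !=, so distance is 0 or 1."""
    return 0 if a == b else 1


def ordinal_distance(a, b, order=GRADES):
    """Ordinal data supports comparison; rank difference is one option.

    Assumption: treating ranks as evenly spaced is itself an O -> Q step.
    """
    return abs(order.index(a) - order.index(b))


def quantitative_distance(a, b):
    """Quantitative data supports arithmetic, so use the absolute difference."""
    return abs(a - b)
```

For example, `nominal_distance("red", "blue")` can only report that the values differ, while `ordinal_distance("F", "A")` can say how many steps apart they are.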
Dimensionality
• Scalar: a single value (0D array)
• Vector: collection of scalars (1D array)
• Matrix: a collection of vectors (2D array)
• Tensor: a collection of matrices (3+D array)
Think of a cube:
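A minimal sketch of these four levels using nested Python lists. The `ndim` helper is hypothetical and assumes regular nesting (every element at a given level has the same depth).

```python
scalar = 5                                    # 0D: a single value
vector = [1, 2, 3]                            # 1D: a collection of scalars
matrix = [[1, 2, 3], [4, 5, 6]]               # 2D: a collection of vectors
tensor = [matrix, [[7, 8, 9], [10, 11, 12]]]  # 3D: a collection of matrices (the cube)


def ndim(value):
    """Count nesting depth, assuming regular (non-ragged) nested lists."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        if not value:  # an empty list has no first element to descend into
            break
        value = value[0]
    return depth
```

The cube corresponds to the 3D case: `tensor[k][i][j]` picks one cell by its three coordinates.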
Operations on Multidimensional Data
Slice
• Selects a subset of the original nD cube
• Result set could be of any dimensionality
Roll up (consolidate)
• Creates a hierarchy based on the data
• Same as clustering
Drill down
• Expand a cluster
Pivot
• Changes the orientation of the cube
Combine with the 4 basic SQL commands:
• SELECT, UPDATE, INSERT, DELETE
Adapted from Wikipedia: OLAP Cube
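The four cube operations can be sketched on a small dictionary-backed cube. The sales data, axis order (region, product, month), and function names are all hypothetical. Roll up is shown in its simplest form, summing over one axis, rather than building a full hierarchy.

```python
from collections import defaultdict

# Hypothetical cube: (region, product, month) -> units sold.
cube = {
    ("East", "apple", "Jan"): 10, ("East", "apple", "Feb"): 12,
    ("East", "pear", "Jan"): 4,   ("East", "pear", "Feb"): 6,
    ("West", "apple", "Jan"): 7,  ("West", "apple", "Feb"): 9,
    ("West", "pear", "Jan"): 3,   ("West", "pear", "Feb"): 5,
}


def drop_axis(key, axis):
    """Remove one coordinate from a cell key."""
    return key[:axis] + key[axis + 1:]


def slice_cube(cube, axis, value):
    """Slice: keep cells where `axis` equals `value`; the result has one fewer dimension."""
    return {drop_axis(k, axis): v for k, v in cube.items() if k[axis] == value}


def roll_up(cube, axis):
    """Roll up: consolidate by summing over one axis."""
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[drop_axis(k, axis)] += v
    return dict(totals)


def drill_down(cube, axis, rolled_key):
    """Drill down: expand one rolled-up cell into the cells it summarizes."""
    return {k: v for k, v in cube.items() if drop_axis(k, axis) == rolled_key}


def pivot(cube, order):
    """Pivot: reorder the axes without changing any values."""
    return {tuple(k[i] for i in order): v for k, v in cube.items()}
```

For instance, `roll_up(cube, 2)` collapses months into per-region, per-product totals, and `drill_down(cube, 2, ("East", "apple"))` recovers the monthly cells behind one of those totals.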
Examples – Roll up and Drill down
Metadata
Defined as “data about data”
Introduced by Lisa Tweedie at CHI 1997 (“Characterizing Interactive Externalizations”)
Extends the original concept by Bertin of data values and data structures.
• Values (low-level): variables relevant to a problem
• Structures (high-level): relations that characterize the data as a whole (e.g. links, equations, constraints)
Metadata – 4 Relationships
1. Values → Derived Values
2. Values → Derived Structure
3. Structure → Derived Values
4. Structure → Derived Structure
Derived Values
• Example: average
Derived Structure
• Example: sorting a list of variables
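The two examples above, sketched on a made-up list of scores:

```python
from statistics import mean

# Values: hypothetical raw observations.
values = [72, 88, 95, 61, 79]

# Values -> Derived Values: a new number computed from the originals.
average = mean(values)

# Values -> Derived Structure: an ordering imposed on the originals.
ordered = sorted(values)
rank_of = {value: rank for rank, value in enumerate(ordered)}
```

The average is a value that did not exist in the data, while the sorted order is a relation among the existing values.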
Values → Derived Values → Derived Structure
• Values: a (text) document corpus
• Derived values: compute the similarities between the documents
• Derived Structure: apply multi-dimensional scaling to plot the documents in a spatial view.
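The pipeline above can be sketched end to end. This is a toy version under several assumptions: the four-document corpus is invented, similarity is bag-of-words cosine similarity, and the layout uses classical (Torgerson) MDS computed by power iteration, which is one of several MDS variants and not necessarily what any particular tool uses.

```python
import math
from collections import Counter

# Values: a tiny hypothetical corpus.
corpus = [
    "visual analytics combines visualization and analysis",
    "visual analytics supports analysis with visualization",
    "the recipe needs flour sugar and butter",
    "bake the recipe with butter and sugar",
]


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def top_eigenpairs(matrix, k, iterations=500):
    """Largest-magnitude eigenpairs of a symmetric matrix (power iteration + deflation)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        start_norm = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
        v = [(i + 1) / start_norm for i in range(n)]  # arbitrary non-constant start
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        value = sum(v[i] * work[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((value, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                work[i][j] -= value * v[i] * v[j]
    return pairs


def classical_mds(distances, dims=2):
    """Classical MDS: double-center squared distances, then scale top eigenvectors."""
    n = len(distances)
    sq = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand) for j in range(n)]
         for i in range(n)]
    pairs = top_eigenpairs(b, dims)
    # Negative eigenvalues (possible for non-Euclidean distances) are clamped to zero.
    return [[math.sqrt(max(val, 0.0)) * vec[i] for val, vec in pairs] for i in range(n)]


# Derived values: pairwise similarities between documents.
bags = [Counter(doc.split()) for doc in corpus]
similarity = [[cosine_similarity(a, b) for b in bags] for a in bags]

# Derived structure: a 2D layout in which similar documents sit close together.
distances = [[1.0 - s for s in row] for row in similarity]
layout = classical_mds(distances)
```

In the resulting `layout`, the two visual analytics documents should land near each other and away from the two recipe documents, which is the spatial view the slide describes.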
Values → Derived Values → Derived Structure
IN-SPIRE by PNNL
Structure → Derived Structure → Derived Values
• Structure: a tabular layout of individuals’ relationships with each other
• Derived Structure: convert the tabular structure to a graph
• Derived Values: compute centrality to identify the importance of the individual in this social network
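A small sketch of this second pipeline. The names in the relationship table are hypothetical, and degree centrality is used as one simple choice among many centrality measures.

```python
from collections import defaultdict

# Structure: a hypothetical table of pairwise relationships (one row per tie).
relationships = [
    ("Ann", "Bob"),
    ("Ann", "Cat"),
    ("Ann", "Dan"),
    ("Bob", "Cat"),
    ("Dan", "Eve"),
]


def table_to_graph(rows):
    """Derived structure: turn table rows into an undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in rows:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def degree_centrality(graph):
    """Derived values: fraction of the other nodes each node is directly tied to."""
    others = max(len(graph) - 1, 1)  # avoid dividing by zero for a single node
    return {node: len(neighbors) / others for node, neighbors in graph.items()}


graph = table_to_graph(relationships)
centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
```

The table and the graph hold the same information; the graph form simply makes neighbor counts, and therefore centrality, easy to compute.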
Structure → Derived Structure → Derived Values
Image taken from: http://beth.typepad.com/beths_blog/2009/12
Questions / Comments?
Guest speaker
Maja Milosavljevic
“Statistical Analysis with R”
For next week
• Assignment 1 due before class on Monday
• Wednesday:
  • Several VIPs coming in to pitch datasets for final projects
  • Start thinking about a topic you might like to explore!
  • Need help? Talk to Jordan