LECTURE 03: DATA COLLECTION AND MODELS February 4, 2015 COMP 150-04 Topics in Visual Analytics Note: slide deck adapted from R. Chang, Fall 2010

LECTURE 03:

DATA COLLECTION AND MODELS

February 4, 2015

COMP 150-04

Topics in Visual Analytics

Note: slide deck adapted from R. Chang, Fall 2010

Announcements

• Course location has moved: Halligan 102

• Assignment 1 posted on course website

• If you haven’t yet installed RStudio: http://www.rstudio.com/products/rstudio/download/

• To download the materials for today’s demo: http://www.cs.tufts.edu/comp/150VAN/demos/Stats-with-R.Rmd

Outline

• Reminder: post on “Illuminating the Path”
• Recap: Keim’s VA Model
• Data Foundations
  - Basic Data Types
  - Dimensionality
• Metadata: “data about data”
• Structure vs. Value
  - Value
  - Derived Value
  - Derived Structure
  - Structure

Reminder: thoughts on “Illuminating the Path”

• What did you think?
• Who is the intended audience? (…is it us?)
• Do the goals make sense to you?
• Is anything missing?
• From what you see in the world, how far along this agenda have we come since 2006?

Recap: Keim’s Visual Analytics Model

[Figure: Keim’s visual analytics model (input, pre-process, interactions), annotated with today’s topics: data types, dimensionality, metadata, structure vs. value, and statistical models in R]

Image source: Keim, Daniel, et al. Visual analytics: Definition, process, and challenges. Springer Berlin Heidelberg, 2008.

Data: a definition

• A typical dataset in visualization consists of n records:

(r_1, r_2, r_3, …, r_n)

• Each record r_i consists of m ≥ 1 observations or variables:

(v_1, v_2, v_3, …, v_m)

• A variable may be either independent or dependent:
  - An independent variable (iv) is not controlled or affected by another variable (e.g., time in a time-series dataset)
  - A dependent variable (dv) is affected by a variation in one or more associated independent variables (e.g., temperature in a region)

• Formal definition:
  - r_i = (iv_1, iv_2, iv_3, …, iv_{m_i}, dv_1, dv_2, dv_3, …, dv_{m_d})
  - where m = m_i + m_d
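As a rough illustration of the definition above, the sketch below stores each record as a pair of tuples, one for independent and one for dependent variables. The `Record` class and the hour/temperature readings are hypothetical, chosen only to mirror the time-series example on the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Record:
    """One record, split into independent and dependent variables."""
    iv: Tuple[float, ...]  # independent variables
    dv: Tuple[float, ...]  # dependent variables

    @property
    def m(self) -> int:
        # Total number of variables: independent count plus dependent count.
        return len(self.iv) + len(self.dv)


# Hypothetical time series: hour of day (iv) and temperature in a region (dv).
dataset = [
    Record(iv=(0.0,), dv=(3.5,)),
    Record(iv=(6.0,), dv=(2.0,)),
    Record(iv=(12.0,), dv=(9.5,)),
]

n = len(dataset)  # number of records
```

Here every record has one independent and one dependent variable, so each record's `m` is 2.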

Basic Data Types

• Nominal
• Ordinal
• Scale / Quantitative
  • Ratio
  • Interval

Nominal: an unordered set of non-numeric values

Examples:
• Categorical (finite) data
  - {apple, orange, pear}
  - {red, green, blue}
• Arbitrary (infinite) data
  - {“12 Main St. Boston MA”, “45 Wall St. New York NY”, …}
  - {“John Smith”, “Jane Doe”, …}

Basic Data Types

• Nominal
• Ordinal
• Scale / Quantitative
  • Ratio
  • Interval

Ordinal: an ordered set (also known as a tuple)

Examples:
• Numeric: <2, 4, 6, 8>
• Binary: <0, 1>
• Non-numeric: <G, PG, PG-13, R>

Basic Data Types

• Nominal
• Ordinal
• Scale / Quantitative
  • Ratio
  • Interval

Scale / Quantitative: a numeric range

Ratios
- Distance from “absolute zero”
- Can be compared mathematically using division
- For example: height, weight

Intervals
- Ordered numeric elements that can be mathematically manipulated, but cannot be compared as ratios
- E.g.: date, current time
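One way to see the ratio/interval distinction is to try the arithmetic. The sketch below uses made-up heights and dates: heights can be divided, while dates can be subtracted but Python's standard `date` type rejects division.

```python
from datetime import date

# Ratio data: a true zero exists, so division is meaningful.
height_a_cm = 180.0
height_b_cm = 90.0
ratio = height_a_cm / height_b_cm  # "a is twice as tall as b"

# Interval data: differences are meaningful...
d1 = date(2015, 2, 4)
d2 = date(2015, 1, 28)
gap_days = (d1 - d2).days

# ...but ratios are not; the date type does not define division.
try:
    d1 / d2
    dates_divisible = True
except TypeError:
    dates_divisible = False
```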

Basic Data Types (Formal)

• Nominal (N) {…}
• Ordinal (O) <…>
• Scale / Quantitative (Q) […]

• Q → O
  • [0, 100] → <F, D, C, B, A>
• O → N
  • <F, D, C, B, A> → {C, B, F, D, A}
• N → O (??)
  • {John, Mike, Bob} → <Bob, John, Mike>
  • {red, green, blue} → <blue, green, red>??
• O → Q (??)
  • Hashing?
  • Bob + John = ??

Readings in Information Visualization: Using Vision to Think. Card, Mackinlay, Shneiderman, 1999
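The two unproblematic conversions above (Q → O and O → N) can be sketched as follows. The grade cutoffs are an assumption made for illustration; the slide only gives the range [0, 100] and the ordered grades.

```python
GRADES = ("F", "D", "C", "B", "A")  # ordinal: position encodes the order
CUTOFFS = (60, 70, 80, 90)          # hypothetical lower bounds for D, C, B, A


def to_grade(score: float) -> str:
    """Q -> O: bin a score in [0, 100] into an ordered letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    rank = sum(score >= c for c in CUTOFFS)  # number of cutoffs reached
    return GRADES[rank]


def to_nominal(ordered):
    """O -> N: forget the order, keep only the distinct labels."""
    return frozenset(ordered)


scores = [55, 61, 78, 85, 93]
grades = [to_grade(s) for s in scores]
labels = to_nominal(GRADES)
```

Going the other way is where the question marks come from: calling `sorted()` on a set of names does yield an order, but it is alphabetical and says nothing about the people themselves.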

Operations on Basic Data Types

• What are the operations that we can perform on these data types?
• Nominal (N)
  • = and ≠
• Ordinal (O)
  • >, <, ≥, ≤
• Scale / Quantitative (Q)
  • everything else (+, -, *, /, etc.)
• Consider a distance function
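To make the distance-function prompt concrete, here is one possible sketch with a separate distance per data type. The function names are hypothetical, and using rank difference for ordinal data is an assumption: it treats neighboring ranks as equally spaced, which the data type itself does not guarantee.

```python
def nominal_distance(a, b) -> int:
    """Nominal: only = and != are defined, so the distance is 0 or 1."""
    return 0 if a == b else 1


def ordinal_distance(a, b, order) -> int:
    """Ordinal: compare positions in the ordering (rank difference is an assumption)."""
    return abs(order.index(a) - order.index(b))


def quantitative_distance(a: float, b: float) -> float:
    """Quantitative: arithmetic is defined, so use the absolute difference."""
    return abs(a - b)


ratings = ["G", "PG", "PG-13", "R"]
```

For example, `ordinal_distance("G", "R", ratings)` counts three steps between the ratings, while `nominal_distance("apple", "pear")` can only report that the two values differ.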

Dimensionality

• Scalar: a single value (0D array)
• Vector: a collection of scalars (1D array)
• Matrix: a collection of vectors (2D array)
• Tensor: a collection of matrices (3+D array)

Think of a cube:
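A small sketch of the same hierarchy using nested Python lists, where each level of nesting adds one dimension. The `ndim` helper is hypothetical and assumes non-empty, rectangular nesting.

```python
def ndim(x) -> int:
    """Count nesting depth: 0 for a scalar, 1 for a vector, and so on."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]  # assumes non-empty, rectangular nesting
    return depth


scalar = 7
vector = [1, 2, 3]
matrix = [[1, 2, 3], [4, 5, 6]]
tensor = [matrix, matrix]  # a 2 x 2 x 3 "cube" of values
```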

Operations on Multidimensional Data

Slice
• Selects a subset of the original nD cube
• Result set could be of any dimensionality

Roll up (consolidate)
• Creates a hierarchy based on the data
• Same as clustering

Drill down
• Expand a cluster

Pivot
• Changes the orientation of the cube

Combine with the 4 basic SQL commands:
• SELECT, UPDATE, INSERT, DELETE

Adapted from Wikipedia: OLAP Cube
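The cube operations above can be sketched on a tiny cube stored as a dictionary keyed by coordinates. Everything here is an illustrative assumption: the sales data, the function names, and the choice to implement roll up as a sum over one dimension (one common form of consolidation). Drill down is then simply going back to the finer-grained cube.

```python
from collections import defaultdict

# Hypothetical 3D cube: (region, product, quarter) -> units sold
cube = {
    ("east", "apple", "Q1"): 10,
    ("east", "apple", "Q2"): 12,
    ("east", "pear", "Q1"): 4,
    ("west", "apple", "Q1"): 7,
    ("west", "pear", "Q2"): 9,
}


def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a value, leaving a lower-dimensional cube."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == value}


def roll_up(cube, axis):
    """Roll up: aggregate (here, sum) over one dimension."""
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[k[:axis] + k[axis + 1:]] += v
    return dict(totals)


def pivot(cube, order):
    """Pivot: reorder the dimensions without changing any values."""
    return {tuple(k[i] for i in order): v for k, v in cube.items()}


q1 = slice_cube(cube, axis=2, value="Q1")
by_region_product = roll_up(cube, axis=2)
by_quarter_first = pivot(cube, order=(2, 0, 1))
```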

Examples – Roll up and Drill down

Metadata

Defined as “data about data”

Introduced by Lisa Tweedie at CHI 1997 (“Characterizing Interactive Externalizations”)

Extends the original concept by Bertin of data values and data structures.

• Values (low-level): variables relevant to a problem
• Structures (high-level): relations that characterize the data as a whole (e.g. links, equations, constraints)

Metadata – 4 Relationships

1. Values → Derived Values

2. Values → Derived Structure

3. Structure → Derived Values

4. Structure → Derived Structure

Derived Values
• Example: average

Derived Structure
• Example: sorting a list of variables
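Both examples fit in a few lines. In this minimal sketch the input list is made up; the average is a derived value and the sorted list is a derived structure.

```python
from statistics import mean

values = [4, 1, 3, 2]

derived_value = mean(values)        # Values -> Derived Values
derived_structure = sorted(values)  # Values -> Derived Structure
```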

Values → Derived Values → Derived Structure

• Values: a (text) document corpus

• Derived values: compute the similarities between the documents

• Derived Structure: apply multi-dimensional scaling to plot the documents in a spatial view.
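The pipeline above can be sketched on a toy corpus. Two assumptions to flag: the three documents are invented, and the last step is not multi-dimensional scaling. To stay short and dependency-free, it derives a simpler structure, linking each document to its most similar neighbor, as a stand-in for the spatial layout MDS would produce.

```python
from collections import Counter
from math import sqrt

# Values: a tiny hypothetical corpus
docs = {
    "a": "visual analytics combines visualization and analysis",
    "b": "visualization and analysis of large data",
    "c": "statistical models in r",
}


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


bags = {name: Counter(text.split()) for name, text in docs.items()}

# Derived values: pairwise similarities between documents
similarity = {
    (p, q): cosine(bags[p], bags[q]) for p in bags for q in bags if p != q
}

# Derived structure (stand-in for MDS): link each document to its nearest neighbor
nearest = {
    p: max((q for q in bags if q != p), key=lambda q: similarity[(p, q)])
    for p in bags
}
```

Documents "a" and "b" share three of their six words, so they end up linked to each other, while "c" shares no words with either.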

Values → Derived Values → Derived Structure

IN-SPIRE by PNNL

Structure → Derived Structure → Derived Values

• Structure: a tabular layout of individuals’ relationships with each other

• Derived Structure: convert the tabular structure to a graph

• Derived Values: compute centrality to identify the importance of the individual in this social network
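A minimal sketch of these three steps, assuming a made-up relationship table and using degree centrality. The slide does not say which centrality measure is meant, so degree centrality is only one reasonable choice.

```python
from collections import defaultdict

# Structure: a hypothetical table of who knows whom (one row per relationship)
table = [
    ("Ann", "Bob"),
    ("Ann", "Cat"),
    ("Ann", "Dan"),
    ("Bob", "Cat"),
]

# Derived structure: an undirected graph stored as adjacency sets
graph = defaultdict(set)
for a, b in table:
    graph[a].add(b)
    graph[b].add(a)

# Derived values: degree centrality = number of neighbors / (n - 1)
n = len(graph)
centrality = {person: len(nbrs) / (n - 1) for person, nbrs in graph.items()}
most_central = max(centrality, key=centrality.get)
```

Ann is connected to everyone else in this table, so she receives the maximum centrality of 1.0.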

Structure → Derived Structure → Derived Values

Image taken from: http://beth.typepad.com/beths_blog/2009/12

Questions / Comments?

Guest speaker

Maja Milosavljevic

“Statistical Analysis with R”

For next week

• Assignment 1 due before class on Monday
• Wednesday:
  • Several VIPs coming in to pitch datasets for final projects
  • Start thinking about a topic you might like to explore!
  • Need help? Talk to Jordan