Prof. Jason Hong, Carnegie Mellon University Rapid End-User Programming and Visualization for the Web IDA Session 5 2007 CS Study Panel 24 April 2008

Prof. Jason Hong, Carnegie Mellon University

Rapid End-User Programming and Visualization for the Web

IDA Session 5

2007 CS Study Panel

24 April 2008

Research Areas

• End-User Programming

– Extracting and visualizing data from the web

• Usable Privacy and Security

– Anti-phishing (training, detection)

– Managing privacy and security policies

• Mobile Computing

– Location-based services

– Context-aware computing

Jason Hong

Assistant Professor

Human-Computer Interaction Institute

Carnegie Mellon University

PhD: University of California, Berkeley

Potential Military Applications

• Tools for rapidly integrating data and web services

• Better visualizations of large data sets

• Effective training for security

• Automated algorithms for detecting phishing scams

• Better interfaces for managing security

Principal Investigator

Contact Information

School of Computer Science

Carnegie Mellon University

2504D Newell-Simon Hall

5000 Forbes Ave

Tel: (412) 268 1251

Fax: (412) 268 1266

E-mail: [email protected]

Web: http://www.cs.cmu.edu/~jasonh


30,000 Foot View

• High-level problems observed:

– Stovepipes: data and services are spread over multiple systems

– Agility: integration takes months or years

– Overload: too much information to easily process

• Goal: Make it easy for people to visualize and process data gathered from a variety of sources

– Information extraction + visualization + machine learning

– No PhD required

• Analogies:

– Spreadsheets

– Visual Basic

http://msdn.microsoft.com/library/en-us/dnvb600/html/vb6tovbdotnetfig2a.gif

Mashups as Key Focus Area

• More specifically, provide an end-user programming tool that makes it easy to create mashups

– Mashups are applications that combine content and services from multiple web sites

– Ex. Craigslist.com + GoogleMaps = Housingmaps.com

Other Example Mashups

• Other example mashups

– Ex. MySpace child predators

– Ex. Locations of friends on MySpace or Facebook

• Common themes

– Aggregating multiple sources (web pages, databases, etc)

– Handling multiple data formats (not designed to be shared)

– Processing the data (filtering, summarizing, etc)

– Supporting multiple forms of output (graphs, maps, lists)
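
The four common themes above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two in-memory sources, their field names, and the helper functions are hypothetical and stand in for real web pages or services.

```python
import csv
import io
import json

# Hypothetical source 1: rental listings exported as CSV.
LISTINGS_CSV = """address,rent
100 Main St,950
200 Oak Ave,1400
300 Pine Rd,1100
"""

# Hypothetical source 2: geocoding results delivered as JSON.
GEOCODES_JSON = """[
  {"address": "100 Main St", "lat": 40.44, "lon": -79.99},
  {"address": "300 Pine Rd", "lat": 40.45, "lon": -79.95}
]"""


def load_listings(text):
    """Handle format 1: parse CSV into dicts with a numeric rent."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"address": r["address"], "rent": int(r["rent"])} for r in rows]


def load_geocodes(text):
    """Handle format 2: parse JSON into an address -> (lat, lon) lookup."""
    return {g["address"]: (g["lat"], g["lon"]) for g in json.loads(text)}


def build_mashup(listings, geocodes, max_rent):
    """Aggregate and process: join on address, keep affordable listings."""
    points = []
    for item in listings:
        if item["rent"] <= max_rent and item["address"] in geocodes:
            lat, lon = geocodes[item["address"]]
            points.append({"address": item["address"], "rent": item["rent"],
                           "lat": lat, "lon": lon})
    return points


def as_list_view(points):
    """One possible output form: a plain text list (a map would be another)."""
    return [f'{p["address"]}: ${p["rent"]} at ({p["lat"]}, {p["lon"]})'
            for p in points]


points = build_mashup(load_listings(LISTINGS_CSV),
                      load_geocodes(GEOCODES_JSON), max_rent=1200)
for line in as_list_view(points):
    print(line)
```

Even this toy version touches parsing, joining, filtering, and rendering, which hints at why a real mashup demands expertise in several areas.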

Creating Mashups is Difficult

• Requires lots of skill to create a mashup

– Ex. Housingmaps creator has PhD in computer science

– Ex. MySpace predator list took months of custom coding

• Requires programming expertise in many areas

– Web crawling

– Text parsing and pattern matching

– Web services (WSDL and REST)

– Databases

– HTML

• Can we accelerate this process to a matter of days or hours for non-experts?

End-User Programming

• Haggis, an end-user programming tool

1. Rapidly extract and combine data from multiple sources

2. Quickly create high-quality interfaces and visualizations

3. Use programming-by-example techniques to specify what is normal and what is anomalous

1. Extract data from multiple sources

• Improved wizards for extracting data from web pages

– Can specify an example of the desired links, and the system generalizes

– Better support for other patterns on the web (tables, street addresses, etc)

• Support for real-time data

– Weather, traffic, stocks, any web page periodically updated

– Sensor Andrew, a sensor network being deployed at CMU (electrical usage, water usage, etc)
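
To make "specify an example of the desired links, and the system generalizes" concrete, here is a minimal Python sketch. It is not how Haggis works; it assumes one very simple generalization rule (links in the same directory as the example), and the page content and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def generalize(example_href):
    """Assumed rule: generalize one example link to its directory prefix."""
    path = urlparse(example_href).path
    return path.rsplit("/", 1)[0] + "/"


def links_like(html, example_href):
    """Return every link on the page that matches the generalized rule."""
    collector = LinkCollector()
    collector.feed(html)
    prefix = generalize(example_href)
    return [h for h in collector.links if urlparse(h).path.startswith(prefix)]


# Hypothetical page with two listing links and two navigation links.
PAGE = """
<a href="/about.html">About</a>
<a href="/apts/101.html">Apt 101</a>
<a href="/apts/202.html">Apt 202</a>
<a href="/contact.html">Contact</a>
"""

matches = links_like(PAGE, "/apts/101.html")
print(matches)
```

A real wizard would need richer rules (page structure, link text, several examples), but the interaction is the same: the user points at one link and the tool proposes the rest.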


2. Interfaces and Visualizations

• Wizards for supporting common UI patterns

– Table views, maps, graph views, alerts, etc

• Programming-by-example techniques


• Output as a web page or desktop widget

– Yahoo Widgets, Google Desktop, Windows Sidebar

3. Normal versus Anomalous

• Problem: Too much data, so some of it gets dropped on the floor

• Solution: “Teach” the system what patterns to look for

– Analyst-in-the-loop: infoviz + machine learning

– Long-term goal

• Example:

– eBay “penny sellers”: could create custom software to find them, but that is slow

– Analyst uses visualization to find some examples of penny sellers and gives hints to system as to why

– System finds more suspects, analyst gives relevance feedback

– As new data streams in, system can flag suspects

• Can help address high turnover rate at intelligence agencies, loss of organizational memory
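
The analyst-in-the-loop cycle in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it swaps in a nearest-centroid rule as the learner, and the two features (average item price, listings per day) and all values are hypothetical.

```python
from statistics import mean


def centroid(rows):
    """Average each feature column across the labeled examples."""
    return [mean(col) for col in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class SuspectFinder:
    """Nearest-centroid classifier that is refit from analyst labels."""

    def __init__(self):
        self.suspect = []
        self.normal = []

    def label(self, features, is_suspect):
        """Record an analyst judgment (initial example or later feedback)."""
        (self.suspect if is_suspect else self.normal).append(features)

    def flag(self, features):
        """True if the item is closer to known suspects than to normal items."""
        return (distance(features, centroid(self.suspect))
                < distance(features, centroid(self.normal)))


# Hypothetical features: (average item price in dollars, listings per day).
finder = SuspectFinder()
finder.label((0.01, 200), True)   # analyst-supplied penny sellers
finder.label((0.05, 150), True)
finder.label((25.0, 3), False)    # analyst-supplied ordinary sellers
finder.label((40.0, 1), False)

candidate = (5.0, 90)
before = finder.flag(candidate)   # system proposes this seller as a suspect
finder.label(candidate, False)    # analyst feedback: not a penny seller
after = finder.flag(candidate)    # the refit model no longer flags it
print(before, after)
```

Each round of feedback moves the centroids, so the stored labels double as a record of what analysts considered suspicious.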

Current Progress

• First round of interviews completed

– Sensor Andrew team (Civil and Electrical Engineers)

– Mashup Camp

– Programmers around CMU

• Initial prototype of “plumbing” in progress

– An Integrated Development Environment (IDE) for programmers, to facilitate extraction and visualization of data

– Low-level support for extracting data from tables, basic visualizations, etc

– Higher-level tools later to be built on top

• First round of user tests planned for August
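
As an illustration of the kind of low-level table extraction mentioned above, here is a minimal Python sketch using only the standard library. It is not the prototype's code; the class name, the sample table, and the assumption that the first row is a header are all hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Assume the first row is a header and map each later row onto it."""
    parser = TableExtractor()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sensor readings published as an HTML table.
HTML = """
<table>
  <tr><th>Building</th><th>kWh</th></tr>
  <tr><td>Hall A</td><td>120</td></tr>
  <tr><td>Hall B</td><td>95</td></tr>
</table>
"""

records = table_to_records(HTML)
print(records)
```

The sketch ignores nested tables, merged cells, and pages without a header row, which are the cases where higher-level tools would need to help the user.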

Past Work with Marmite

• Wizard for extracting data from arbitrary web pages

• Combine operators together in a dataflow (as with Unix pipes)

• View the data in multiple ways (table, map)

http://kettle.ubiq.cs.cmu.edu/projects/marmite/

How Marmite Works

• Wizard for getting data from web pages

• Combine operators together in a dataflow (as with Unix pipes)

• View the data in multiple ways (table, map)

How Marmite Works

• Operators let you know what operations can be done

• Input, processing, output

How Marmite Works

• Operators are chained together in a dataflow (as with Unix pipes)
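
The chaining idea can be sketched with Python generators, where each operator consumes a stream of rows and yields a new one, much like a stage in a Unix pipeline. This is not Marmite's implementation; the operator names and the sample rows are hypothetical.

```python
def source(rows):
    """Input operator: emit rows one at a time."""
    for row in rows:
        yield row


def keep(predicate):
    """Processing operator: pass along only rows matching the predicate."""
    def op(stream):
        return (row for row in stream if predicate(row))
    return op


def transform(fn):
    """Processing operator: apply fn to every row."""
    def op(stream):
        return (fn(row) for row in stream)
    return op


def pipeline(stream, *operators):
    """Chain operators so each one reads the previous one's output."""
    for op in operators:
        stream = op(stream)
    return stream


# Hypothetical rows, as they might come out of an extraction wizard.
ROWS = [
    {"address": "100 Main St", "rent": 950},
    {"address": "200 Oak Ave", "rent": 1400},
    {"address": "300 Pine Rd", "rent": 1100},
]

result = list(pipeline(
    source(ROWS),
    keep(lambda r: r["rent"] < 1200),
    transform(lambda r: r["address"].upper()),
))
print(result)
```

Because every operator has the same shape (stream in, stream out), a tool can show the intermediate data after any stage, which is what makes the per-operator views possible.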

How Marmite Works

• Current data is shown

How Marmite Works

• And multiple views too

How Marmite Works

• A wizard UI for helping people get the data they want

Some High-Level Design Issues

• Centralized model

– Clean data model: well-managed, well-formatted, common representations, well-known databases, etc

• Decentralized model

– “Anarchic”, multiple data formats in multiple places

– Hard to get lots of people to agree on data format and representation

– More likely scenario (look at how databases are used today)

– Haggis is being designed for this model, assuming that a person may have to clean up the data and resolve formats

Other High-Level Design Issues

• Discovery

– What data sources are available?

– May need some kind of centralized store that describes these (sort of like DNS for Internet)

• Security

– Access control, who can access what data sources?

– This is a general problem with sensor data

• Privacy

– What kinds of queries / apps should people be able to do?

– Unclear how to restrict those in practice
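
The discovery and access-control questions above can be made concrete with a small Python sketch of a centralized store that describes data sources. This is purely illustrative: the registry design, role names, and URLs are hypothetical, and it does not address the harder privacy question of restricting what queries are allowed.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """Description of one data source, as a directory entry."""
    name: str
    url: str
    fmt: str
    allowed_roles: set = field(default_factory=set)


class Registry:
    """Centralized store: look up sources by name, filtered by role."""

    def __init__(self):
        self._sources = {}

    def register(self, source):
        self._sources[source.name] = source

    def discover(self, role):
        """Discovery: list the names of sources this role may use."""
        return sorted(s.name for s in self._sources.values()
                      if role in s.allowed_roles)

    def resolve(self, name, role):
        """Access control: return the URL only if the role is allowed."""
        source = self._sources[name]
        if role not in source.allowed_roles:
            raise PermissionError(f"{role} may not access {name}")
        return source.url


# Hypothetical entries; the URLs are placeholders.
registry = Registry()
registry.register(DataSource("weather", "http://example.org/weather",
                             "html", {"analyst", "facilities"}))
registry.register(DataSource("power-usage", "http://example.org/power",
                             "csv", {"facilities"}))

print(registry.discover("analyst"))
print(registry.resolve("weather", "analyst"))
```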
