SEASR Overview Loretta Auvil National Center for Supercomputing Applications University of Illinois at Urbana-Champaign lauvil@illinois.edu

SEASR Overview

Loretta Auvil

National Center for Supercomputing Applications

University of Illinois at Urbana-Champaign

lauvil@illinois.edu

Overview of Course

Monday, Tuesday, Wednesday, Thursday, Friday

MORNING

SEASR Overview
• Overview of Course
• SEASR Overview and Motivation
• Example SEASR Analytics and Applications
• SEASR Architecture
• Introduction of Meandre
• SEASR Community Hub

Meandre Workbench
• Overview of Workbench
• Overview of Repositories
• Designing and Constructing Flows

SEASR Analytics for Zotero
• Demonstrations of SEASR Analytics
• Use of SEASR services with Zotero and VUE

Mashups and Dashboards
• SEASR as a service
• Other services, like Tapor
• Text Application: JSTOR
• Text Application: WEME

Future
• Audio Analytics: NEMA
• SEASR Central
• Future Meandre Features
• Future Meandre Workbench Features
• Attendee Plan Presentations
• Course Wrap-up

AFTERNOON

Text Analytics
• Overview of Text Analytics
• Dunning Loglikelihood Comparison
• Text Clustering
• Frequent Patterns Analysis
• Entity Extraction
• Text Application: MONK Workbench

More Text Analytics
• Emotion Tracking
• Concept Tracking
• Entity Extraction

Creating Zotero Flows
• Configuration Mechanism
• Specific Web Service Components
• Zotero-enabled Flows
• VUE-enabled Flows

Installation, Tools and Deployment of Flows
• Installation
• Community Collab Tools
• Architecture Details
• Overview of Development Tools
• Overview of ZigZag
• Parallelization
• Example ZigZag flows with Zotero, VUE and Fedora

Outline

• SEASR Overview and Motivation

• Example SEASR Analytics and Applications

• Attendee Plan

• SEASR Architecture

• Introduction of Meandre

• SEASR Community Hub

• Hands-On

SEASR Overview

SEASR

• This project will focus on developing, integrating, deploying, and sustaining a set of reusable and expandable software components and a supporting framework, SEASR, that will benefit a broad set of data mining applications for scholars in the humanities.

• The key goals established for this effort are a set of software-centric directives:

– Support the development of a state-of-the-art software environment for unstructured data management and analysis of digital libraries, repositories, and archives, as well as educational platforms that are expected to contribute to many of the humanities breakthroughs of the 21st century.

– Support the continued development, expansion, and maintenance of an end-to-end software system (user interfaces, workflow engines, data management, analysis and visualization tools, collaborative tools, and other software integrated into a complete environment, SEASR) to bring the full power of data analytics to scholars.

– Support education and training in the use of this software environment for analysis, through workshops that promote its usage among scholars.

The SEASR Picture

Workshop Objective

The objective of the workshop is:

• To explain and demonstrate the utility of SEASR for digital humanities, and to bring you to a point where you can deploy, contribute to, and utilize the SEASR environment.

SEASR + TOOLS + EXEMPLARS + HANDS ON

Workshop Goals

The goals of the workshop are:

• LEARN: Provide a detailed understanding of the SEASR framework

• LEARN: Provide a foundation and examples for participant teams to use SEASR in a study or inquiry

• ADOPT: Share participant-generated research plans to utilize SEASR

• INSTALL: Provide detailed instructions on how to install, build components, integrate existing applications, and maintain the SEASR environment

• SUPPORT: Develop plans for resolution of issues raised by the user community in utilization of SEASR

• SUSTAIN: Develop a plan for community-driven future development and dissemination of SEASR

SEASR @ Work – Tag Cloud

• Count tokens

• Filter options supported

• Stem words
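The three steps on this slide (count, filter, stem) can be sketched in a few lines of Python. This is an illustrative sketch only, not the SEASR component: the stop-word list, the `naive_stem` suffix stripper, and the function names are all assumptions made for the example, and a real flow would likely use a proper stemmer.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not SEASR's filter list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def naive_stem(token):
    """Crude suffix stripping that stands in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tag_cloud_counts(text, stop_words=STOP_WORDS, stem=True):
    """Tokenize, drop stop words, optionally stem, and count tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return Counter(tokens)


counts = tag_cloud_counts("The cats chased the cat and the dogs chased it")
# "cats" and "cat" are merged by stemming, so "cat" is counted twice.
print(counts.most_common(3))
```

The resulting counts are what a tag cloud viewer would map to font sizes.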

SEASR @ Work – Ngram Tag Cloud

• Count multiple words

• Filter options

• Stem
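Counting multiple words at a time amounts to sliding a window of n tokens over the text. A minimal sketch, assuming whitespace tokenization and a hypothetical helper name `ngram_counts`:

```python
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


tokens = "it was the best of times it was the worst of times".split()
bigrams = ngram_counts(tokens, 2)
print(bigrams.most_common(3))
```

Filtering and stemming, when wanted, would be applied to the token list before it is passed in.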

SEASR @ Work – Dunning Loglikelihood

• Feature comparison of tokens

• Specify an analysis document/collection

• Specify a reference document/collection

• Perform statistical comparison using Dunning Loglikelihood

Example showing over-represented tokens

Analysis Set: The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens

Reference Set: The Project Gutenberg EBook of Great Expectations, by Charles Dickens
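To make the comparison concrete, here is a hedged sketch of one common two-term formulation of the log-likelihood (G²) statistic for a single token. Whether the SEASR component uses exactly this variant is an assumption; the function and parameter names are hypothetical. Here `a` and `b` are the token's counts in the analysis and reference sets, and `c` and `d` are the total token counts of those sets.

```python
import math


def dunning_g2(a, b, c, d):
    """Log-likelihood (G^2) for one token.

    a, b: token count in the analysis / reference set
    c, d: total tokens in the analysis / reference set
    """
    # Expected counts if the token were equally frequent in both sets.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def over_represented(a, b, c, d):
    """True when the token is relatively more frequent in the analysis set."""
    return a / c > b / d


# Same relative frequency in both sets: no evidence of a difference.
print(dunning_g2(10, 20, 1000, 2000))
# Three times as frequent in the analysis set: a clearly positive score.
print(dunning_g2(30, 10, 1000, 1000), over_represented(30, 10, 1000, 1000))
```

The statistic itself is unsigned, so the direction (over- or under-represented) is read from the relative frequencies, as `over_represented` does.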

SEASR @ Work – HITS Summarizer

SEASR @ Work – Entity Mash-up

• Entity Extraction with OpenNLP or Stanford NER

• Locations viewed on Google Map

• Dates viewed on Simile Timeline

SEASR @ Work – Entities To Network

• Identify entities

• Define relationships between entities within the same sentence
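The relationship rule on this slide (two entities are linked when they appear in the same sentence) can be sketched as a co-occurrence count. The sketch assumes entity extraction has already produced one entity list per sentence; the sample entities and the function name are illustrative only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences):
    """Build weighted, undirected edges between entities sharing a sentence.

    sentences: one list of entity names per sentence.
    """
    edges = Counter()
    for entities in sentences:
        # Sort so each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical output of an entity extractor, one list per sentence.
sentences = [
    ["Darnay", "Carton", "London"],
    ["Carton", "Paris"],
    ["Darnay", "Carton"],
]
edges = cooccurrence_edges(sentences)
print(edges)
```

The edge weights could then drive a network visualization, with heavier edges drawn thicker.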

SEASR @ Work – Text Clustering

• Clustering of text by token counts

• Filtering options for stop words, part of speech

• Dendrogram visualization

SEASR @ Work – Audio Analysis

• NEMA: Executes a SEASR flow for each run

– Loads audio data

– Extracts features for every 10-second moving window of audio

– Loads and applies the models

– Sends results back to the WebUI

• NESTER: Annotation of Audio via Spectral Analysis
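The moving-window feature extraction mentioned for NEMA can be illustrated with a small generator. This is only a sketch under stated assumptions: the slide gives the 10-second window but not the hop between windows, so `hop_sec` is an assumed parameter, and `mean_abs` is a placeholder feature rather than anything NEMA is known to compute.

```python
def moving_windows(samples, sample_rate, window_sec=10, hop_sec=5):
    """Yield (start_time_sec, window) pairs over a sequence of samples.

    hop_sec is an assumption; the slide only specifies the window length.
    """
    size = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    for start in range(0, len(samples) - size + 1, hop):
        yield start / sample_rate, samples[start:start + size]


def mean_abs(window):
    """Placeholder feature: mean absolute amplitude of the window."""
    return sum(abs(x) for x in window) / len(window)


# 25 seconds of constant signal at 1 sample per second, for illustration.
samples = [0.5] * 25
features = [(t, mean_abs(w)) for t, w in moving_windows(samples, sample_rate=1)]
print(features)
```

Each window's feature vector would then be passed to the loaded models, one prediction per window.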

SEASR @ Work – MONK

Executes flows for each analysis requested

– Predictive modeling using Naïve Bayes

– Predictive modeling using Support Vector Machines (SVM)

– Feature comparisons

SEASR @ Work – DISCUS

• On-demand usage of analytics while surfing

– While navigating, request analytics to be performed on the page

– Text extraction and cleaning

• Summarization and keyword extraction

– List the important terms on the page being analyzed

– Provide relevant short summaries

• Visual maps

– Provide a visual representation of the key concepts

– Show the graph of relations between concepts

SEASR @ Work – Emotion Tracking

The goal is a visualization of this type that tracks emotions across a text document (leveraging flare.prefuse.org)

SEASR @ Work – Zotero

• Plugin to Firefox

• Zotero manages the collection

• Launch SEASR Analytics

– Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR

– Zotero Export to Fedora through SEASR

– Saves results from SEASR Analytics to a Collection

• Launch MONK Processing

– MONK DB Ingestion Workflow

SEASR @ Work – Fedora

[Diagram labels: Repository Search & Browse; Web Service; Interactive Web Application; Zotero Upload to Repository]

Attendee Project Plan

• Explore tool usage during learning exercises

• Participate in discussion

• Design a project plan to use SEASR this week for some analysis

• Modify and develop the project plan over the week

• Present and discuss project plan and results on Friday

Attendee Project Plan (2)

• Study/Project Title

• Team Members and their Affiliation

• Procedural Outline of Study/Project

– Research Question/Purpose of Study

– Data Sources

– Analysis Tools

• Activity Timeline or Milestones

• Report or Project Outcome(s)

• Ideas on what your team needs from SEASR staff to help you achieve your goal.

SEASR Architecture

SEASR Architecture

[Architecture diagram labels: Components; Virtualization Infrastructure; Meandre Infrastructure; Visualization; Component Repository; Component Discovery; Meandre Data-Intensive Flows; Apps; Services; Plugins; Web Apps; Analytics; Data; Developer Tools; Repositories; Data; Analysis; Components; Flows; User Interfaces; Cloud Computing; Visualizations; Meandre Workbench]

Data Driven Models

SEASR: Reach + Relevance + Reuse + Repeatability

SEASR emphasizes flexibility, scalability, and modularity, and provides a community hub and access to heterogeneous data and computational systems

– Semantic-driven environment for SOA interoperability

– Encourages sharing and participation for building communities

– Modular construction allows flows to be modified and configured to encourage reusability within and across domains

– Enables mashups and integration of tools

– Data-intensive flows can be executed on a simple desktop or a large cluster(s) without modification

– Computation can be created for distributed execution on servers where the content lives

– User accessibility to control trust and compliance with the required copyright license of content

– Relies on the standardized Resource Description Framework (RDF) to define components and flows

Enables Humanists

To ask key questions:

– What recurrent patterns would be of interest to literary scholars?

– Which patterns are characteristic of the English language and which are characteristic of a particular author, work, topic, or time?

– Patterns based on words can be extracted from literary bodies; however, can patterns be extracted based on grammar or plot constructs?

– When are correlated patterns meaningful? Can they be organized based on such criteria?

– How can an author’s intentionality be assessed given an extracted pattern?

SEASR Enables Scholarly Research

Discovery

– What hypothesis or rules can be generated by the “features” of the corpus?

– What “features” or language of the corpus best describes the corpus?

– What are the “similarities” among elements, documents, or corpora?

Meandre: Infrastructure

SEASR/Meandre Infrastructure:

– Dataflow execution paradigm

– Semantic-web driven

– Web Oriented

– Supports publishing services

– Modular components

– Encapsulation and execution mechanism

– Promotes reuse, sharing, and collaboration

Meandre: Semantic Web Concepts

• Relies on the Resource Description Framework (RDF), which uses a simple notation, usually written as XML, to express graph relations and provides a set of conventions and a common means to exchange information

• Provides a common framework to share and reuse data across application, enterprise, and community boundaries

• Focuses on common formats for integration and combination of data drawn from diverse sources

• Pays special attention to the language used for recording how the data relates to real world objects

• Allows navigation to sets of data resources that are semantically connected.

Meandre: Metadata Ontologies

• Meandre's metadata relies on three ontologies:

– The RDF ontology serves as a base for defining Meandre descriptors

– The Dublin Core Elements ontology provides basic publishing and descriptive capabilities in the description of Meandre descriptors

– The Meandre ontology describes a set of relationships that model valid components, as understood by the Meandre execution engine architecture

Meandre: Components in RDF

Existing Standards

@prefix meandre: <http://www.meandre.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <#> .

<http://dita.ncsa.uiuc.edu/meandre/e2k/components/limited-iterations>
    meandre:name "Limited iterations"^^xsd:string ;
    rdf:type meandre:executable_component ;
    dc:creator "Xavier Llora"^^xsd:string ;
    dc:date "2007-11-17T00:32:35"^^xsd:date ;
    dc:description "Allows only a limited number of iterations"^^xsd:string ;
    dc:format "java/class"^^xsd:string ;
    dc:rights "University of Illinois/NCSA Open Source License"^^xsd:string ;
    meandre:execution_context
        <http://norma.ncsa.uiuc.edu/public-dav/Meandre/demos/E2K/V1/resources/colt.jar> ,
        <http://norma.ncsa.uiuc.edu/public-dav/Meandre/demos/E2K/V1/resources/gacore.jar> ,
        <http://dita.ncsa.uiuc.edu/meandre/e2k/components/limited-iterations/implementation/> ,
        <http://norma.ncsa.uiuc.edu/public-dav/Meandre/demos/E2K/V1/resources/gacore-meandre.jar> ,
        <http://norma.ncsa.uiuc.edu/public-dav/Meandre/demos/E2K/V1/resources/formj2.0.jar> ;
    ...

Meandre: Components Types

• Components are the basic building blocks of any computational task.

• There are two kinds of Meandre components:

– Executable components

• Perform computational tasks that require no human interaction at runtime

• Processes are initialized during flow startup and are fired in accordance with the firing policies defined for them.

– Control components

• Used to pause dataflow during user interaction cycles

• WebUI may be an HTML form, applet, or other user interface

Meandre: Dataflow Example

• Dataflow Addition Example

– Logical Operation ‘+’

– Requires two inputs

– Produces one output

• When two inputs are available

– Logical operation can be performed

– Sum is output

• When output is produced

– Reset internal values

– Wait for two new input values to become available

[Diagram: inputs Value1 and Value2 enter the logical operation, which outputs Sum]
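The firing behavior described on this slide can be mimicked with a small Python class. This is a sketch of the dataflow idea only, not the Meandre component API: the class, port names, and `push` method are hypothetical.

```python
class AddComponent:
    """Toy dataflow component: fires once both input ports hold a value."""

    PORTS = ("value1", "value2")

    def __init__(self):
        self._pending = {}   # values waiting on each input port
        self.outputs = []    # sums produced so far

    def push(self, port, value):
        """Deliver a value to an input port; fire if both ports are filled."""
        if port not in self.PORTS:
            raise ValueError(f"unknown port: {port}")
        # A second value on the same port overwrites the first in this sketch.
        self._pending[port] = value
        if all(p in self._pending for p in self.PORTS):
            self._fire()

    def _fire(self):
        # Perform the logical operation, emit the sum, then reset
        # internal values and wait for two new inputs.
        self.outputs.append(self._pending["value1"] + self._pending["value2"])
        self._pending.clear()


adder = AddComponent()
adder.push("value1", 2)   # only one input available: nothing fires yet
adder.push("value2", 3)   # both available: fires, outputs 5, resets
adder.push("value2", 10)
adder.push("value1", 1)   # fires again, outputs 11
print(adder.outputs)
```

Note that inputs may arrive in any order; the component fires purely on data availability, which is the essence of the dataflow execution paradigm.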

Meandre: Create, Publish, & Share

• “Components” and “Flows” have RDF descriptors

– Easily shared; fosters sharing and reuse

– Allow machines to read and interpret

– Independent of the implementations

– Combine different implementations and platforms

– Components: Java, Python, Lisp, Web Services

– Execution: On a Laptop or a High Performance Cluster

• A “Location” is an RDF descriptor of one or more components, one or more flows, and their implementations

Meandre: Repository & Locations

• Each location represents a set of components/flows

• Users can

– Combine different locations together

– Create components

– Assemble flows

– Share components and flows

• Repositories help

– Administer complex environments

– Organize components and flows

Meandre: Metadata Properties

• Components and Flows share properties such as component name, creator, creation date, description, tags, and rights.

• Component-specific metadata describes the component's behavior: its location, type of implementation, firing policy, runnable, format, resource location, and execution context

• Flow-specific metadata describes the directed graph of components: component instances, connectors, connector instance data port source, connector instance data port target, connector instance source, connector instance target, and instance name

Meandre: Programming Paradigm

• The programming paradigm creates complex tasks by linking together specialized components. Meandre's publishing mechanism allows components developed by third parties to be assembled in a new flow.

• There are two ways to develop flows:

– Meandre’s Workbench visual programming tool

– Meandre’s ZigZag scripting language

Meandre: Workbench Existing Flow

[Screenshot panel labels: Locations; Components; Flows]

• Web-based UI

• Components and flows are retrieved from server

• Additional locations of components and flows can be added to server

• Create flow using a graphical drag and drop interface

• Change property values

• Execute the flow

Meandre: ZigZag Script Language

• ZigZag is a simple language for describing data-intensive flows

– Modeled on Python for simplicity

– ZigZag is a declarative language for expressing the directed graphs that describe flows

• Command-line tools allow ZigZag files to be compiled and executed

– A compiler is provided to transform a ZigZag program (.zz) into a Meandre archive unit (.mau)

– MAUs can then be executed by a Meandre engine

Community Hub

Community Hub

• Explore existing flows to find others of interest

– Keyword Cloud

– Connections

• Find related flows

• Execute flow

• Comments

Community Hub: Keyword Cloud Design

Keyword Cloud Implementation

Keyword Cloud functionality is currently implemented as a WordPress plugin

Detail View of Application

Detail View with Related Flows

Community Hub: Connections Design

Demonstration

• Community Hub – Keyword Cloud Functionality

• Tag Cloud Viewer

• Ngram Tag Cloud Viewer

• HITS Summarizer

• Date Entity to Simile Timeline

• Location Entity to Google Map

• Google Search to Tag Cloud Viewer

• Entity to Protovis Network Graph

• Readability

• NEMA's Son of Blinkie

Learning Exercises: Community Hub

1. Explore Community Hub's Keyword Cloud Functionality

A. Open a browser and go to http://seasr.org

B. Click on "View Projects"

C. Click on "Keyword Cloud"

D. Click on "visualization" to see all the existing applications that have a tag of "visualization"

E. Click on "cluster" to see all the existing applications that have a tag of "visualization" and "cluster"

F. Click on the delete button to remove "cluster" from the selection

G. Click on the "Tag Cloud Viewer" for more detailed information about this application

Learning Exercises: Tag Cloud Viewer

2. Perform analysis using "Tag Cloud Viewer" on a hard-coded web page

A. Open a browser and go to http://seasr.org/documentation/example-flows/tag-cloud-viewer/

B. Click on the "Execute" button to launch the creation of a tag cloud view for "Emma" by Jane Austen, retrieved from Project Gutenberg

3. Perform analysis using "Tag Cloud Viewer" on a web page of your choice

A. Open a browser and go to http://seasr.org/documentation/example-flows/tag-cloud-viewer/

B. Find a web URL that you are interested in analyzing

C. Click on the "Custom Execute" button to launch the application, where you can copy and paste the web URL that you are interested in analyzing

Learning Exercises: Google Search

4. Perform analysis using "Google Search to Tag Cloud Viewer" on a topic of your choice

A. Use Community Hub to open the "Google Search to Tag Cloud Viewer" page, or open a browser and go to http://seasr.org/documentation/example-flows/google-search-to-tag-cloud-viewer/

B. Click on the "Custom Execute" button to launch the application, where you can type your Google query for analysis

Attendee Project Plan

• Study/Project Title

• Team Members and their Affiliation

• Procedural Outline of Study/Project

– Research Question/Purpose of Study

– Data Sources

– Analysis Tools

• Activity Timeline or Milestones

• Report or Project Outcome(s)

• Ideas on what your team needs from SEASR staff to help you achieve your goal.

Identify Research Question

Discussion Questions

• What are data repositories that you utilize in your scholarly research?

• What tools or applications are being utilized against these repositories?