May 31st, 2013 First SICSA MMI Information Retrieval Workshop Looking beyond plain text for document representation in the enterprise Arjen P. de Vries [email protected] Centrum Wiskunde & Informatica Delft University of Technology Spinque B.V.

Looking beyond plain text for document representation in the enterprise


DESCRIPTION

In many real life scenarios, searching for information is not the user's end goal. In this presentation I look into the specific example of corporate strategy and business development in a university setting. In today's academic institutions, strategic questions are those that relate to dependency on funding instruments, the public private partnerships that exist (and those that should be extended!), and the match between topic areas addressed by the research staff and those claimed important by policy makers. The professional search tasks encountered to answer questions in this domain are usually addressed by business intelligence (BI) tools, and not by search engines. However, professionals are known to be busy people inspired by their own research interests, and not particularly fond of keeping the customer relationship management (CRM) or knowledge management systems up to date for the organisation's strategic interest. This then results in incomplete and inaccurate data. Instead of requiring research staff (or their administrative support) to provide this management information, I will illustrate by example how the desired information usually exists already in the documents inherent to the academic work process. Information retrieval could thus play an important role in the computer systems that support the business analytics involved, and could significantly improve the coverage of entities of interest - i.e., to reduce the effort involved in achieving good recall in business analytics. The ranking functionality over the enterprise's (textual) content should however not be an isolated component. Our example setting integrates the information derived from research proposals, research publications and the financial systems, providing an excellent motivation for a more unified approach to structured and unstructured data.


Page 1: Looking beyond plain text for document representation in the enterprise

May 31st, 2013 First SICSA MMI Information Retrieval Workshop

Looking beyond plain text for document representation in the enterprise

Arjen P. de Vries [email protected]

Centrum Wiskunde & Informatica
Delft University of Technology

Spinque B.V.

Page 2: Looking beyond plain text for document representation in the enterprise

Outline

Motivation
Mixed structured and unstructured sources
Search by strategy
Equip
Open ends

Page 3: Looking beyond plain text for document representation in the enterprise

Enterprise Information Needs

Hang Li et al. A new approach to intranet search based on information extraction. CIKM’05

Page 4: Looking beyond plain text for document representation in the enterprise

Strategic and business development needs

What funding schemes are the primary source of income?
  E.g., can we move to Europe when Dutch funding dries up?

Who has active relations with partner X?
  “Valorisation”; new national funding requirements

What industry sectors do we depend upon?
  E.g., how many projects in smart cities? Green energy? Cloud computing? Etc.

How are strategic decisions implemented?
  E.g., has objective “move from Telecom toward ICT” been achieved, and how does it develop over time?

Page 5: Looking beyond plain text for document representation in the enterprise

A week in the life

Page 6: Looking beyond plain text for document representation in the enterprise

Date: Wed, 15 May 2013 15:14:49 +0200
From: Theme Coordinator “INFORMATION”
To: Group Leaders Information Theme
Subject: List of company relations for internal CWI distribution

Dear Information Theme Group Leaders, The theme coordinators have been asked whether they "can make a short list of the company contacts, indicating the nature of each contact".

Could you send me the names of Dutch companies you are currently working with or have worked with in the recent past by the end of Friday 17th May?

The Theme Coordinator

Page 7: Looking beyond plain text for document representation in the enterprise

Date: Fri, 24 May 2013 11:33:04 +0200
From: Theme Coordinator Life Sciences
To: Group Leaders Life Sciences Team
Subject: Life Sciences: contacts with NL companies?

Dear all,

The CWI themes are currently collecting all contacts we have with Dutch industry and companies (but also hospitals and TNO etc.) in order to get an overview. I am doing this for the theme "Life Sciences". Can you please send me a list of your contacts with short description?

Life Sciences Theme Coordinator

Page 8: Looking beyond plain text for document representation in the enterprise

From: Project Leader Project X
Date: Sun, 26 May 2013 17:34:15 +0200
To: Project X
Subject: [Project X: 33] @WP leaders
X-BeenThere: Project X @ Y.org

Dear WP leaders,

I received the following request from Programme Management:
> May I ask you to send me a short list of the EU research and international research currently running at the partners that is related to Project X (international embedding).

This is my most urgent item. Could you send me, as soon as possible, a short list covering the following points:
- list of running EU projects in which people from your WP are involved; please indicate who the partners are, the funding source, whether it is a STREP (or NoE or ...), and whether your WP provides a participant or the coordinator;
- list of EU projects applied for, with the same extras
- list of any other international collaborations that are not covered by a formal project

Please send me the lists as soon as possible, but no later than Tuesday 6 pm. Thanks for your help. The Project Leader

Page 9: Looking beyond plain text for document representation in the enterprise

Surely, academia is not like…

Page 10: Looking beyond plain text for document representation in the enterprise

The High Cost of Not Finding Info

If you employ 1000 knowledge workers:
  50% of content unindexed: $2.5 million/year
  6.25% of effort is spent reproducing information that already exists: $5 million/year

Knowledge workers spend 15-25% of their time on non-productive information-related activities

Feldman and Sherman. IDC Technical Report #29127, 2003

Butler Group Report: Enterprise Search and Retrieval. Oct-2006
“many organisations are frittering away up to 10% of their staff costs on wasted effort because employees simply can’t find the right information to do their jobs.”
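The $5 million figure can be reproduced with simple arithmetic. The sketch below assumes an average yearly cost of $80,000 per knowledge worker; that amount is not stated on the slide and is used here only because it is consistent with the quoted number. All names in the code are hypothetical.

```python
# Hypothetical cost model. AVG_COST_PER_WORKER is an assumption, not a figure
# taken from the slide.
WORKERS = 1000
AVG_COST_PER_WORKER = 80_000  # USD per year (assumed)


def yearly_cost_of_wasted_effort(fraction_wasted: float) -> float:
    """Staff cost per year attributable to a wasted fraction of effort."""
    return WORKERS * AVG_COST_PER_WORKER * fraction_wasted


rework_cost = yearly_cost_of_wasted_effort(0.0625)
print(f"Reproducing existing information: ${rework_cost:,.0f}/year")
```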

Page 11: Looking beyond plain text for document representation in the enterprise

So… “the real world”

“Real” companies (as opposed to academic institutions) attempt to address these information needs a priori, by setting up a Customer Relationship Management system (CRM)

Shan L. Pan and Jae-Nam Lee, "Using e-CRM for a unified view of the customer", Communications of the ACM 46(4) (2003): 95-99

Page 12: Looking beyond plain text for document representation in the enterprise
Page 13: Looking beyond plain text for document representation in the enterprise

However…

So-called “Professionals” are well known to focus on their own expertise

They do not have (or take) the time to maintain adequate descriptions of their network, skills, projects etc., nor for most other types of “management overhead”

Page 14: Looking beyond plain text for document representation in the enterprise

We only need to organize ourselves!!

Page 15: Looking beyond plain text for document representation in the enterprise

Funding Proposals

Proposals submitted (are supposed to) pass by the faculty’s (TUD) “contract managers” or the institute’s (CWI) “project bureau”
  E.g., checks for liability, IPR and valid budget

Proposal and (partial) metadata are added to a content management system (CMS)
  The CMS used at my faculty at TUD is DECOS; a few other faculties plan to use Microsoft SharePoint; CWI deploys BSCW

Page 16: Looking beyond plain text for document representation in the enterprise
Page 17: Looking beyond plain text for document representation in the enterprise

Step 1

Index all the proposals submitted with your favourite IR system
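As a rough illustration of what "index all the proposals" amounts to, here is a toy inverted index with TF-IDF scoring. It is a minimal sketch, not the system used in the talk; the class name, document identifiers and example texts are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """A toy inverted index with TF-IDF scoring (illustration only)."""

    def __init__(self) -> None:
        # term -> {doc_id -> term frequency}
        self.postings: dict[str, dict[str, int]] = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id: str, text: str) -> None:
        self.doc_count += 1
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, query: str) -> list[tuple[str, float]]:
        scores: dict[str, float] = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical proposal texts.
index = TinyIndex()
index.add("proposal-1", "Cloud computing infrastructure for smart cities")
index.add("proposal-2", "Clouds in environmental models of green energy")
index.add("proposal-3", "Information retrieval for children")
results = index.search("cloud computing")
```

Note that exact token matching keeps "cloud" and "clouds" apart here; a real IR system would typically add stemming and a stronger ranking model.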

Page 18: Looking beyond plain text for document representation in the enterprise

Incompleteness

The DECOS metadata entered is usually incomplete from the start
  For many projects for example, only the coordinator is entered as partner

Also, a proposal’s metadata does not reflect subsequent change; e.g., as in PuppyIR:
  People hired after funding secured
  Partner change when key person moved job
  Teams evolved
  Priorities shifted
  New tasks introduced and tasks (re-)assigned
  …

Page 19: Looking beyond plain text for document representation in the enterprise

Incompleteness

In general: A project’s proposal or even the contract seldom represents the project’s exact future

Page 20: Looking beyond plain text for document representation in the enterprise

Inaccuracy

Key information necessary for strategy & business development scenarios missing

Adding those is error-prone
  Infer domain (big data, green energy, cloud computing, …) from keywords or content
  Extract names automatically
  Copy amounts manually; inconsistencies in tables in proposal text are not uncommon

Page 21: Looking beyond plain text for document representation in the enterprise

Incomplete & inaccurate Data

Ambiguity
  When describing domain, e.g., cloud computing vs. clouds in environmental models

Names of people and companies involved
  Typos & OCR mistakes
  Entity resolution

Amounts of funding per partner, own contribution
  Funding request may not equal funding received
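To make the entity resolution problem concrete, the sketch below groups name mentions that differ only by legal suffix, punctuation or a small typo. It is an assumption-laden illustration: the suffix list, the 0.85 threshold and the greedy clustering are arbitrary choices, and the misspelt mention is invented.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore; extend as needed.
LEGAL_SUFFIXES = {"bv", "nv", "ltd", "inc", "gmbh"}


def normalize(name: str) -> str:
    """Lower-case, drop punctuation and common legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower().replace(".", ""))
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two mentions as the same entity if their normalized forms are similar."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def resolve(mentions: list[str]) -> list[list[str]]:
    """Greedy clustering: attach each mention to the first cluster it matches."""
    clusters: list[list[str]] = []
    for mention in mentions:
        for cluster in clusters:
            if same_entity(mention, cluster[0]):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


mentions = [
    "Spinque B.V.",
    "Spinque BV",
    "Spinqve B.V.",  # invented OCR-style error
    "Centrum Wiskunde & Informatica",
]
clusters = resolve(mentions)
```

String similarity alone cannot separate truly ambiguous names, which is one reason a human still inspects the results.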

Page 22: Looking beyond plain text for document representation in the enterprise

The real world to the rescue (1)

Not much work gets done without payments…

Page 23: Looking beyond plain text for document representation in the enterprise

ERP

All large organisations deploy Enterprise Resource Planning (ERP) systems
  Typical modules include accounting, human resources, manufacturing, and logistics
  ERP integrates the modules, data storing/retrieving processes, and management and analysis functionalities

Baan, Oracle, PeopleSoft, SAP, …

Page 24: Looking beyond plain text for document representation in the enterprise

More complete and more accurate data from ERP

Financial details of each project as executed
Project leader
People who are reimbursed from the project
Exact duration of project activities
...

Page 25: Looking beyond plain text for document representation in the enterprise

Step 2

Index all the ERP data with your favourite DB + IR system

Link the ERP project identifiers to the CMS proposal identifiers
  Surprisingly, an n:m relationship…
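An n:m relationship between ERP projects and CMS proposals is typically stored in a separate link table. The following sketch uses SQLite with invented table names, column names and identifiers; the actual schema behind the talk is not given in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE erp_project (erp_id TEXT PRIMARY KEY, leader TEXT);
CREATE TABLE cms_proposal (cms_id TEXT PRIMARY KEY, title TEXT);
-- Link table: one row per (ERP project, CMS proposal) pair.
CREATE TABLE project_proposal (
    erp_id TEXT REFERENCES erp_project(erp_id),
    cms_id TEXT REFERENCES cms_proposal(cms_id),
    PRIMARY KEY (erp_id, cms_id)
);
""")

# Hypothetical data: proposal P1 is administered as two ERP projects,
# and ERP project E2 also covers a second proposal P2.
conn.executemany("INSERT INTO erp_project VALUES (?, ?)",
                 [("E1", "leader-a"), ("E2", "leader-b")])
conn.executemany("INSERT INTO cms_proposal VALUES (?, ?)",
                 [("P1", "first proposal"), ("P2", "second proposal")])
conn.executemany("INSERT INTO project_proposal VALUES (?, ?)",
                 [("E1", "P1"), ("E2", "P1"), ("E2", "P2")])


def proposals_for(erp_id: str) -> list[str]:
    """All CMS proposals linked to one ERP project."""
    rows = conn.execute(
        "SELECT cms_id FROM project_proposal WHERE erp_id = ? ORDER BY cms_id",
        (erp_id,))
    return [r[0] for r in rows]


def projects_for(cms_id: str) -> list[str]:
    """All ERP projects linked to one CMS proposal."""
    rows = conn.execute(
        "SELECT erp_id FROM project_proposal WHERE cms_id = ? ORDER BY erp_id",
        (cms_id,))
    return [r[0] for r in rows]
```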

Page 26: Looking beyond plain text for document representation in the enterprise

The real world to the rescue (2)

Page 27: Looking beyond plain text for document representation in the enterprise

Institutional Repository

Publication metadata helps validate (and may even extend) the existing management info required:
  Authors
  Author affiliations
  Projects and funding schemes (from acknowledgements)?

Again incomplete data though…
  My faculty in particular is notoriously bad at maintaining its part of the institutional repository

Page 28: Looking beyond plain text for document representation in the enterprise

Step 3

Crawl the Institutional Repository using the Open Archives Initiative (OAI) harvesting protocol

Index all the publications data with your favourite DB + IR system

Relate projects to publications by author name, similar title, etc.
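The harvesting and linking steps above can be sketched as follows. OAI-PMH responses are XML; a `ListRecords` request with `metadataPrefix=oai_dc` returns Dublin Core records, and a `resumptionToken` signals that more pages follow. The repository URL, the sample response, and the author-name matching rule are assumptions made for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_list_records(xml_data):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_data)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "authors": [c.text for c in rec.iterfind(".//dc:creator", NS) if c.text],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


def harvest(base_url):
    """Yield all records of a repository (performs HTTP requests; not run here)."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            records, token = parse_list_records(response.read())
        yield from records
        if token is None:
            return
        params = {"verb": "ListRecords", "resumptionToken": token}


def link_by_author(records, project_members):
    """Map project id -> ids of records sharing at least one author name."""
    links = {}
    for project_id, members in project_members.items():
        wanted = {m.casefold() for m in members}
        links[project_id] = [
            r["id"] for r in records
            if wanted & {a.casefold() for a in r["authors"]}
        ]
    return links


# Invented sample response, standing in for a real repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical paper</dc:title>
          <dc:creator>Jane Doe</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
links = link_by_author(records, {"project-x": ["jane doe"]})
```

Exact name matching is the simplest possible rule; matching on similar titles or fuzzy names would need a similarity measure on top of this.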

Page 29: Looking beyond plain text for document representation in the enterprise

Result: Unified Access

Proposals from an XML dump of the CMS

Actual project administration from CSVs extracted from ERP

Publications crawled using OAI, from the IRP

Page 30: Looking beyond plain text for document representation in the enterprise

Schema

Page 31: Looking beyond plain text for document representation in the enterprise

Heterogeneous content!

BAAN-project (ERP)
Decos-project (CMS)
Decos-document (CMS attachments)
Publication (Institutional Repository)
Publication-document (Institutional Repository PDFs)
Person (address lists, ERP + CMS mentions)
Company (CMS + ERP + document mentions)
Subsidy (CMS)
Department (address lists, CMS)
Web addresses (extracted from documents)
Topic (assigned to publications)
Research programme (dependent on funding scheme)

Page 32: Looking beyond plain text for document representation in the enterprise

Schema V2

Page 33: Looking beyond plain text for document representation in the enterprise

How to search that graph???!

Rank (un-/semi-)structured data to deal with incompleteness & inaccuracies

Structured data representation for attributes including project revenue, people’s names, starting dates, etc.

Use cases varying from “expert search” to “data cleaning” and “visual analytics”

Page 34: Looking beyond plain text for document representation in the enterprise

Search by Strategy

First, visually construct search strategies by connecting “building blocks”

Page 35: Looking beyond plain text for document representation in the enterprise

Search by Strategy

First, visually construct search strategies by connecting “building blocks”

Next, generate the search engine specified by that search strategy

Page 36: Looking beyond plain text for document representation in the enterprise

Strategies: DB+IR query plans

• Data flow (Spinque: strategy)
  Building blocks such as BB1(in1, in2, in3, u1, u2) and BB2(in1), each with inputs (in1, in2, in3) and one output (out)

• Query: strategy made operational (Spinque: PRA)

• Database (Spinque: RDBMS (MonetDB))

CREATE VIEW a AS SELECT ..

CREATE VIEW b AS SELECT ..

CREATE VIEW c AS SELECT ..

[Figure: a strategy mapped onto a relational DB]
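One way to picture a strategy becoming a chain of views is sketched below: every building block contributes one `CREATE VIEW` statement that reads from tables or from earlier views. This is only an illustration in SQLite with an invented `document` table and invented blocks; it is not Spinque's actual compiler, and it leaves out probabilities entirely.

```python
import sqlite3

# A "strategy" here is just an ordered list of (view name, SELECT) pairs;
# each block reads from base tables or from views defined by earlier blocks.
STRATEGY = [
    ("a", "SELECT doc_id, author FROM document WHERE kind = 'proposal'"),
    ("b", "SELECT author, COUNT(*) AS n_docs FROM a GROUP BY author"),
    ("c", "SELECT author, n_docs FROM b WHERE n_docs > 1"),
]


def compile_strategy(strategy):
    """Turn each block into one CREATE VIEW statement, in order."""
    return [f"CREATE VIEW {name} AS {select};" for name, select in strategy]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (doc_id TEXT, kind TEXT, author TEXT)")
conn.executemany("INSERT INTO document VALUES (?, ?, ?)", [
    ("d1", "proposal", "alice"),
    ("d2", "proposal", "alice"),
    ("d3", "proposal", "bob"),
    ("d4", "publication", "bob"),
])

statements = compile_strategy(STRATEGY)
for statement in statements:
    conn.execute(statement)

# Querying the last view evaluates the whole chain a -> b -> c.
result = conn.execute("SELECT author, n_docs FROM c").fetchall()
```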

Page 37: Looking beyond plain text for document representation in the enterprise

Probabilistic Relational Algebra

Strategy

Relational DB

• SQL: explicit probabilities

CREATE VIEW x AS SELECT a1, a3, 1-prod(1-prob) AS prob FROM y GROUP BY a1, a3;

• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)

x = Project DISTINCT [$1,$3](y);
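The SQL form of this projection combines the probabilities of duplicate tuples as if they were independent events: the result holds unless every contributing tuple fails, hence `1 - prod(1 - prob)`. The sketch below runs that aggregation in SQLite, which has no built-in `prod`, so a user-defined aggregate is registered first. The table contents are invented, and the query is run directly rather than wrapped in a view.

```python
import sqlite3


class Prod:
    """Aggregate returning the product of its inputs."""

    def __init__(self):
        self.value = 1.0

    def step(self, x):
        self.value *= x

    def finalize(self):
        return self.value


conn = sqlite3.connect(":memory:")
conn.create_aggregate("prod", 1, Prod)

# Hypothetical probabilistic relation y(a1, a2, a3, prob).
conn.execute("CREATE TABLE y (a1 TEXT, a2 TEXT, a3 TEXT, prob REAL)")
conn.executemany("INSERT INTO y VALUES (?, ?, ?, ?)", [
    ("p1", "u", "x", 0.5),
    ("p1", "v", "x", 0.5),
    ("p2", "u", "x", 0.2),
])

# Project on (a1, a3); duplicates are merged under an independence assumption.
rows = conn.execute("""
    SELECT a1, a3, 1 - prod(1 - prob) AS prob
    FROM y
    GROUP BY a1, a3
    ORDER BY a1
""").fetchall()
```

For `p1`, two tuples with probability 0.5 each combine to 1 - 0.5 * 0.5 = 0.75; the single `p2` tuple keeps its probability of 0.2 (up to floating-point rounding).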

Page 38: Looking beyond plain text for document representation in the enterprise

Rank by Text

Page 39: Looking beyond plain text for document representation in the enterprise

Expert Finding

Page 40: Looking beyond plain text for document representation in the enterprise
Page 41: Looking beyond plain text for document representation in the enterprise

Search User Interface

Page 42: Looking beyond plain text for document representation in the enterprise

Search results

Page 43: Looking beyond plain text for document representation in the enterprise

Result List Interactions

Zoom in on item using “+”:
  Open item in left pane
  Shows results of item as query, using a result-type specific search strategy
  Goal to provide contextually most related nodes from underlying graph

Marking any item red/yellow/green for later usage

Page 44: Looking beyond plain text for document representation in the enterprise
Page 45: Looking beyond plain text for document representation in the enterprise
Page 46: Looking beyond plain text for document representation in the enterprise

Browse by facet

Page 47: Looking beyond plain text for document representation in the enterprise
Page 48: Looking beyond plain text for document representation in the enterprise

Strategic and business development needs

What are our industry relations?
Which of these partners collaborate with more than one group?
What funding schemes support these collaborations?
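Once projects, partners and departments sit in one store, questions like these reduce to short queries. The sketch below uses SQLite and a single invented `collaboration` table with made-up partners, departments and funding schemes; the real system works over the graph of heterogeneous content described earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collaboration (
        partner TEXT, department TEXT, funding_scheme TEXT
    )
""")
# Hypothetical data.
conn.executemany("INSERT INTO collaboration VALUES (?, ?, ?)", [
    ("Acme", "dept-1", "scheme-A"),
    ("Acme", "dept-2", "scheme-B"),
    ("Initech", "dept-1", "scheme-A"),
    ("Initech", "dept-1", "scheme-C"),
])

# Partners that collaborate with more than one department (group).
shared_partners = conn.execute("""
    SELECT partner, COUNT(DISTINCT department) AS n_departments
    FROM collaboration
    GROUP BY partner
    HAVING COUNT(DISTINCT department) > 1
    ORDER BY partner
""").fetchall()


def schemes_for(partner: str) -> list[str]:
    """Funding schemes that support collaborations with one partner."""
    rows = conn.execute(
        "SELECT DISTINCT funding_scheme FROM collaboration "
        "WHERE partner = ? ORDER BY funding_scheme",
        (partner,))
    return [r[0] for r in rows]
```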

Page 49: Looking beyond plain text for document representation in the enterprise

Note: relations between partners and departments, edge strength represents revenue

Page 50: Looking beyond plain text for document representation in the enterprise

Note: relations between partners and departments, edge strength represents revenue

Page 51: Looking beyond plain text for document representation in the enterprise

Multi party relations

Grouping of external relations: Foreign Univ., NL Univ., Funding agency, Public NL, Public foreign, Private sector

Note: External relations with at least two departments; node size w.r.t. number of relations

Page 52: Looking beyond plain text for document representation in the enterprise

Initial Findings

The integrated search helps improve recall, reducing the effort involved and leading to higher quality analyses

Many things that could be done even more automatically (albeit not perfectly) seem less important than expected
  We use very simple rules to extract URIs and companies; no information extraction yet
  Information professional will always look into results in detail

Page 53: Looking beyond plain text for document representation in the enterprise

Open issues

Integrate visualization
  Idea: select result list and facet

Too many facets
  Idea: group facets

Result explanations
  Idea: describe path through graph

Entity support ++

Page 54: Looking beyond plain text for document representation in the enterprise

Open issues

What strategy is good? Why?
  Idea: test using past usage data

What are the right user roles?
  Who should do the searches?
  Who should write strategies?
  ~ who writes the SQL queries in traditional DB?

Human in the loop for retrieval, but not yet for indexing…

Page 55: Looking beyond plain text for document representation in the enterprise

Questions?