Data and text mining: the search for unknown knowns Geoffrey Bilder UKSG, 2007 gbilder@crossref.org


"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know."

The Mining Metaphor

Gold Mining

Diamond Mining

Data Mining

Data Mining: What it isn’t

≠ Information Retrieval

≠ Information Extraction

≠ Information Analysis

Information Retrieval + Information Extraction + Information Analysis

Data Mining: new, previously unknown information

And so what is text data mining?

Text Mining

Information Retrieval + Information Extraction + Information Analysis

The crucial question for publishers is: “If ‘hiding’ information in unstructured text is a problem, then shouldn’t we be exploring new ways to ‘publish’?”

So how did we get here?

• The word tobacco originates from the Taino Indians.

• There is no I in the word Team.

• The book captured the zeitgeist of the time.

• I am sure that I turned the gas off.

The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.

I am <emphasis>sure</emphasis> that I turned the gas off.
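The point of the two marked-up sentences can be sketched in a few lines: once a phrase is tagged, a program can pull it out without guessing. This is a minimal illustration using Python's standard `xml.etree.ElementTree`; the wrapping `<sentence>` element and the helper name `tagged_spans` are assumptions added here so that each fragment parses as well-formed XML, and are not part of the slides.

```python
import xml.etree.ElementTree as ET

# The two marked-up sentences from the slides, each wrapped in a
# hypothetical <sentence> root so it is a well-formed XML document.
MARKED_UP = [
    '<sentence>The book captured the '
    '<foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.</sentence>',
    '<sentence>I am <emphasis>sure</emphasis> that I turned the gas off.</sentence>',
]


def tagged_spans(xml_text):
    """Return (tag, attributes, text) for every marked-up span in a sentence."""
    root = ET.fromstring(xml_text)
    # root.iter() yields the root element first, so skip it.
    return [(el.tag, dict(el.attrib), el.text) for el in root.iter() if el is not root]


for sentence in MARKED_UP:
    print(tagged_spans(sentence))
```

Without the tags, deciding that "zeitgeist" is a German loan word or that "sure" carries emphasis would take language analysis; with them it is a simple lookup.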

Semantic Web “Light”

But we can do more...

The web as a database

Title       Author             ISBN-13         Publisher
Labyrinths  Jorge Luis Borges  978-0811200127  New Directions
Hopscotch   Julio Cortazar     978-0394752846  Pantheon
The Aleph   Jorge Luis Borges  978-0140286809  Penguin
...         ...                ...             ...

The Relational Model


Rows represent things


Columns are properties


The book has an author “Jorge Luis Borges”

The thing’s property

Subject Predicate Object

The book has an author “Jorge Luis Borges”

Subject (URI): http://www.amazon.com/isbn/978-0140286809
Predicate: has an author
Object (URI): http://www.wikipedia.com/borges

RDF: Resource Description Framework
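The step from table to triples can be sketched as follows: each row becomes a subject, each column a predicate, and each cell an object. This is a rough illustration in plain Python rather than a real RDF library. The subject URI pattern is the illustrative one from the slide, the function name `row_to_triples` is hypothetical, and using bare column names as predicates is a simplification (real RDF would use URIs for predicates too).

```python
# Rows from the slide's table; each dict describes one "thing" (a book).
BOOKS = [
    {"Title": "Labyrinths", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0811200127", "Publisher": "New Directions"},
    {"Title": "Hopscotch", "Author": "Julio Cortazar",
     "ISBN-13": "978-0394752846", "Publisher": "Pantheon"},
    {"Title": "The Aleph", "Author": "Jorge Luis Borges",
     "ISBN-13": "978-0140286809", "Publisher": "Penguin"},
]

# Illustrative URI prefix taken from the slide, not a verified endpoint.
SUBJECT_BASE = "http://www.amazon.com/isbn/"


def row_to_triples(row):
    """Turn one table row into (subject, predicate, object) triples."""
    subject = SUBJECT_BASE + row["ISBN-13"]
    # Every column other than the identifying one becomes a property.
    return [(subject, column, value)
            for column, value in row.items()
            if column != "ISBN-13"]


for triple in row_to_triples(BOOKS[2]):
    print(triple)
```

Because every statement stands on its own, triples from different tables, or from different sites, can be pooled without first agreeing on a shared set of columns.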

Journal A Journal B

Wiki

Blog

Personal Website

OPAC


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name

SPARQL
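To make the query's logic concrete, here is a rough Python approximation of what it asks for: find every resource typed as a `foaf:Person`, collect its `foaf:name`, drop duplicates, and sort. This is not how a SPARQL engine is implemented. The `http://example.org/...` URIs, the sample triples, and the function name `person_names` are all hypothetical, and Python's string sort only approximates `ORDER BY`.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical triples, as if gathered from several independent sites.
TRIPLES = {
    ("http://example.org/cortazar", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/cortazar", FOAF_NAME, "Julio Cortazar"),
    ("http://example.org/borges", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/borges", FOAF_NAME, "Jorge Luis Borges"),
    # Has a name but is not typed as foaf:Person, so it is skipped.
    ("http://example.org/pantheon", FOAF_NAME, "Pantheon"),
}


def person_names(triples):
    """Approximate SELECT DISTINCT ?name ... ORDER BY ?name for foaf:Person."""
    # Pattern 1: ?x rdf:type foaf:Person
    people = {s for (s, p, o) in triples if p == RDF_TYPE and o == FOAF_PERSON}
    # Pattern 2: ?x foaf:name ?name, joined on the same ?x
    names = {o for (s, p, o) in triples if p == FOAF_NAME and s in people}
    # The set gives DISTINCT; sorted() stands in for ORDER BY.
    return sorted(names)


print(person_names(TRIPLES))
```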

http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss

RSS 1.0

FRBR

Creative Commons

FOAF

Geo

SKOS

The Early Modern Internet

Data Mining = Information retrieval + Information extraction + Information analysis...

With the goal of discovering new, previously unknown information

Data Mining = Information retrieval + Information extraction + Information analysis...

Text Data Mining = Complex data extraction layer + data mining

With the goal of discovering new, previously unknown information

Why do we publish text?

Thank You

gbilder@crossref.org