Jakob Voß Revealing digital documents Concealed structures in data http://arxiv.org/abs/1105.5832 http://aboutdata.org International Conference on Theory and Practice in Digital Libraries (TPDL) Doctoral Consortium, Berlin 2011-09-25

Revealing digital documents - concealed structures in data


Presented September 25th 2011 at the Doctoral Consortium of Conference on Theory and Practice in Digital Libraries (TPDL), Berlin

Page 1: Revealing digital documents - concealed structures in data

Jakob Voß

Revealing digital documents: Concealed structures in data

http://arxiv.org/abs/1105.5832

http://aboutdata.org

International Conference on Theory and Practice in Digital Libraries (TPDL), Doctoral Consortium, Berlin 2011-09-25

Page 2: Revealing digital documents - concealed structures in data

Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th http://aboutdata.org

question

how are (digital) documents structured and described?

Page 3: Revealing digital documents - concealed structures in data

what is a document?

“[...] any physical or symbolic sign, preserved or recorded, intended to represent, to reconstruct, or to demonstrate a physical or conceptual phenomenon” – Suzanne Briet

“[...] consists of anything that someone wishes to store. A document is something designated by a person to be a document [...]” – Ted Nelson

Page 4: Revealing digital documents - concealed structures in data

scope

digital documents: somehow recorded (stable), eventually as a sequence of bits

Page 5: Revealing digital documents - concealed structures in data

AACR2, AAF, AAT, ADL, AES Core Audio, AES Process History, AGLS, Allegro, ASCII, ASN.1, Atom, BIBO, BibTeX, BISAC, BPEL, BPMN, BSON, CanCore, CCO, CDR, CDWA, CDWA Lite, CIDOC/CRM, CQL, CSDGM, CSV, DACS,

Data Committee Content Standard, DC, DCAM, DDC, DDI, DDL, DFDL, DIF, DIG35, DjVU, DOM, DTD, Dublin Core, DwC, EAC, EAC-CPF, EAD, ebXML,

ECN, Ediakt, EDIFAKT, eduPerson, EML, ERM, Etch, EXIF, Federal Geographic, FOAF, FRAD, FRBR, FRSAD, FRSAR, GEM, GILS, GKD, GML,

Hessian, HTML, HTTP, ID3, IDL, IEEE/LOM, indecs, inetOrgPerson, INI, IPTC, IRI, ISAAR(CPF), ISAD(G), ISBD, ISBN, ISO 19115, ISO 19119, JSON, KML,

LCC, LCSH, LDAP, Linked Data, LMER, MAB2, MADS, MARC, MARC21, MARC Relator Codes, MARCXML, MathML, MEI, MESH, METS, METS Rights,

MFC, MGraph, MIX, MO, MODS, MOTS, MPEG-21 , MPEG-7, MSchema, MuseumDat, MusicXML, MXF, NewsML, NFC, NFD, NFKC, NFKD, NIAM, OAI,

OAI-ORE, OAI-PMH, OAIS, ODRL, ONIX, Ontology for Media, OODBMS, OpenDocument, OpenSearch, OpenURL, ORM, OWL, PB Core, PDF, PI,

Pica+, Pica3, PND, PREMIS, PRISM, Proto, QDC, RAD, RAK, RDA, RDBMS, RDF, RDFS, RDF/XML, Relax NG, RELAX NG, Resource, RIS, RSS, RSWK,

Schematron, SCORM, SDXF, Seel, S-EXP, SGML, SIOC, SKOS, SMIL, SPECTRUM, SQL, SRU/SRW, SWAP, SWB, TEI, TEX, TextMD, TGM I, TGM II, TGN, Thrift, Topic Maps, UCS, ULAN, UML, unAPI, UNIMARC, URI, UTF8, vCard, Vorbis Comment, VRA, VSO Data Model, XDR, XMetaDiss, XML, XML Schema, XMP, XOBIS, XPath, XPDL, XQuery, XrML, XSLT, YAML, ZZStruct

there is not one single document format

Page 6: Revealing digital documents - concealed structures in data

thesis

but there are common patterns on all levels of description, independent of particular technologies

Page 7: Revealing digital documents - concealed structures in data

examples of particular technologies

XML
● Unicode
● XML Infoset
● XML Schema
● XPath

relational databases
● Relational Model
● SQL
● Entity-Relationship Diagrams

families of related standards

Page 8: Revealing digital documents - concealed structures in data

method

not statistical: this would limit my research to one level and technology of description

Page 9: Revealing digital documents - concealed structures in data

method

phenomenological: data description in all of its forms as it appears in our experience

Page 10: Revealing digital documents - concealed structures in data

phenomenological method

data description analyzed as phenomena:

1. critical intuiting (experience)
2. analyzing structures, free of known categories
3. describing the essence

[Portraits: Hegel, Merleau-Ponty*, Husserl]

* Image CC-BY Pierre-Alain Gouanvic

Page 11: Revealing digital documents - concealed structures in data

results

1) Categorization of data structuring methods

2) Collection of data structuring paradigms

3) Pattern language of data patterns

Page 12: Revealing digital documents - concealed structures in data

result 1: categorization of methods

● encodings express data (UTF-8 Unicode, IEEE floating point, Base64…)
● file and database systems store data
● identifiers and query languages refer to data
● data structuring and markup languages structure data
● schema languages constrain and validate data
● conceptual models describe data

Concrete methods appear as combinations of categories!
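To illustrate how a concrete method can combine several of these categories, here is a minimal Python sketch. It is only an illustration, not part of the talk: the record, its field names, the key "doc/1", and the hand-written `is_valid` check are all assumptions, and a plain dict stands in for a real file or database system.

```python
import base64
import json

# Hypothetical record; the field names are chosen for illustration only.
record = {"name": "Jakob", "born": 1979}

# Structuring language: JSON arranges the values as a structure.
text = json.dumps(record, sort_keys=True)

# Encodings: UTF-8 expresses the characters as bytes,
# Base64 expresses those bytes as ASCII characters.
raw = text.encode("utf-8")
wrapped = base64.b64encode(raw).decode("ascii")

# Storage: a dict stands in for a file or database system (assumption).
store = {"doc/1": wrapped}

# Identifier: the key "doc/1" refers to the stored data.
fetched = store["doc/1"]


def is_valid(doc):
    """Minimal hand-written constraint check, not a real schema language."""
    return isinstance(doc.get("name"), str) and isinstance(doc.get("born"), int)


# Reverse the layers: Base64 -> UTF-8 bytes -> JSON text -> structure.
decoded = json.loads(base64.b64decode(fetched).decode("utf-8"))
print(decoded, is_valid(decoded))
```

Each step belongs to a different category from the list, yet all of them are needed before the record can be read back and checked.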

Page 13: Revealing digital documents - concealed structures in data

result 2: paradigms

● Document- or Object-oriented approach
  ● Document-oriented (e.g. ordered tree with tagged character strings: XML, Relax NG…) ⇒ descriptive data description
  ● Object-oriented (objects with properties and defined value spaces: XML Schema, UML…) ⇒ prescriptive data description
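The contrast between the two approaches can be sketched in Python. This is one possible illustration under stated assumptions: the `<person>` markup and the `Person` class are hypothetical, the XML tree stands for the document-oriented view, and a dataclass with explicit checks stands for the object-oriented view.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Document-oriented: an ordered tree of tagged character strings.
# Any tags may occur; we only describe what we find (descriptive).
doc = ET.fromstring("<person><name>Jakob</name><born>1979</born></person>")
found = [(child.tag, child.text) for child in doc]


# Object-oriented: an object with fixed properties and defined
# value spaces; data that does not fit is rejected (prescriptive).
@dataclass
class Person:
    name: str
    born: int

    def __post_init__(self):
        # Dataclasses do not enforce annotations, so check explicitly.
        if not isinstance(self.name, str):
            raise TypeError("name must be a string")
        if not isinstance(self.born, int):
            raise TypeError("born must be an integer")


# Moving from the document to the object forces a decision
# about the value space of each property.
person = Person(name=found[0][1], born=int(found[1][1]))
print(found, person)
```

In the tree, "1979" is just a character string under a tag. The object only accepts it after it has been converted to an integer.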

Page 14: Revealing digital documents - concealed structures in data

result 2: paradigms

● Entities and connections

[Figure: diagrams connecting "Jakob" and "1979" in different ways, with the labels "Birth" and "born"]
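One plausible reading of this slide is that the same fact can be modeled either as a connection between two entities or as an entity of its own. The following Python sketch shows both variants as lists of triples. The identifiers "birth1", "type", "person", and "year" are hypothetical and chosen only for illustration.

```python
# Variant 1: "born" is a connection between the entities "Jakob" and "1979".
as_connection = [("Jakob", "born", "1979")]

# Variant 2: the birth is itself an entity, connected to person and year.
# "birth1", "type", "person", and "year" are hypothetical names.
as_entity = [
    ("birth1", "type", "Birth"),
    ("birth1", "person", "Jakob"),
    ("birth1", "year", "1979"),
]


def year_from_connection(triples, person):
    """Follow the direct 'born' connection from the person."""
    for subject, label, obj in triples:
        if subject == person and label == "born":
            return obj
    return None


def year_from_entity(triples, person):
    """Find a Birth entity for the person and read its year."""
    births = {s for s, p, o in triples if p == "type" and o == "Birth"}
    for birth in sorted(births):
        props = {p: o for s, p, o in triples if s == birth}
        if props.get("person") == person:
            return props.get("year")
    return None


print(year_from_connection(as_connection, "Jakob"))
print(year_from_entity(as_entity, "Jakob"))
```

Both variants carry the same information, but the second one can attach further properties (a place, a source) to the birth without changing the person.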

Page 15: Revealing digital documents - concealed structures in data

result 2: paradigms

● Layers of abstraction
● Standards and rules
● Collections and types
● Granularity

Page 16: Revealing digital documents - concealed structures in data

result 3: patterns

● patterns as systematic tool for describing good design practice, introduced by Christopher Alexander: “Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem […]”
● Adopted as design patterns in software engineering
● Collected in a pattern language with meaningful connections between patterns (network of patterns).

Page 17: Revealing digital documents - concealed structures in data

result 3: patterns

[Figure: network of connected patterns with the nodes: sequence, collection, known size, separator, position, ordered set, array]
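As a sketch of how such patterns can relate, the following Python code realizes a sequence of strings in two ways: with a separator between items, and with a known size in front of each item. Treating "separator" and "known size" as two alternative ways to implement "sequence" is an assumption about the figure, and all function names are hypothetical.

```python
def join_with_separator(items, sep=","):
    """Sequence via separator: items must not contain the separator."""
    if any(sep in item for item in items):
        raise ValueError("separator must not occur inside an item")
    return sep.join(items)


def split_with_separator(data, sep=","):
    return data.split(sep) if data else []


def join_with_known_size(items):
    """Sequence via known size: each item is prefixed with its length."""
    return "".join(f"{len(item)}:{item}" for item in items)


def split_with_known_size(data):
    items = []
    pos = 0
    while pos < len(data):
        colon = data.index(":", pos)      # end of the length prefix
        size = int(data[pos:colon])
        start = colon + 1
        items.append(data[start:start + size])
        pos = start + size                # jump to the next prefix
    return items


plain = ["sequence", "collection", "array"]
tricky = ["a,b", "", "xyz"]

print(split_with_separator(join_with_separator(plain)))
print(split_with_known_size(join_with_known_size(tricky)))
```

The separator variant is easy to read but restricts item content. The known-size variant accepts any content at the cost of an extra length field.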

Page 18: Revealing digital documents - concealed structures in data

applications

● data archeology
  ● In 200 years someone finds snapshots and archives of Wikipedia in different forms (SQL, XML, Wikitext, DBpedia, HTML…)
  ● What are the significant parts? How do the parts relate to each other?

Page 19: Revealing digital documents - concealed structures in data

Page 20: Revealing digital documents - concealed structures in data

… another document

to give a simple example…

Page 21: Revealing digital documents - concealed structures in data

… another document

sequence with delimiter

Page 22: Revealing digital documents - concealed structures in data

… another document

grouping of sequences with delimiter

sequence with delimiter

Page 23: Revealing digital documents - concealed structures in data

… another document

D A T A   P A T T E R N S

encoding (morse code)

sequence with delimiter

grouping of sequences with delimiter