Presented September 25th, 2011 at the Doctoral Consortium of the Conference on Theory and Practice in Digital Libraries (TPDL), Berlin
Jakob Voß
Revealing digital documents: Concealed structures in data
http://arxiv.org/abs/1105.5832
http://aboutdata.org
International Conference on Theory and Practice in Digital Libraries (TPDL), Doctoral Consortium, Berlin 2011-09-25
Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th http://aboutdata.org
question
how are (digital) documents structured and described?
what is a document?
“[...] any physical or symbolic sign, preserved or recorded, intended to represent, to reconstruct, or to demonstrate a physical or conceptual phenomenon” – Suzanne Briet
“[...] consists of anything that someone wishes to store. A document is something designated by a person to be a document [...]” – Ted Nelson
scope
digital documents: somehow recorded (stable), eventually as a sequence of bits
AACR2, AAF, AAT, ADL, AES Core Audio, AES Process History, AGLS, Allegro, ASCII, ASN.1, Atom, BIBO, BibTeX, BISAC, BPEL, BPMN, BSON, CanCore, CCO, CDR, CDWA, CDWA Lite, CIDOC/CRM, CQL, CSDGM, CSV, DACS,
Data Committee Content Standard, DC, DCAM, DDC, DDI, DDL, DFDL, DIF, DIG35, DjVU, DOM, DTD, Dublin Core, DwC, EAC, EAC-CPF, EAD, ebXML,
ECN, Ediakt, EDIFAKT, eduPerson, EML, ERM, Etch, EXIF, Federal Geographic, FOAF, FRAD, FRBR, FRSAD, FRSAR, GEM, GILS, GKD, GML,
Hessian, HTML, HTTP, ID3, IDL, IEEE/LOM, indecs, inetOrgPerson, INI, IPTC, IRI, ISAAR(CPF), ISAD(G), ISBD, ISBN, ISO 19115, ISO 19119, JSON, KML,
LCC, LCSH, LDAP, Linked Data, LMER, MAB2, MADS, MARC, MARC21, MARC Relator Codes, MARCXML, MathML, MEI, MESH, METS, METS Rights,
MFC, MGraph, MIX, MO, MODS, MOTS, MPEG-21 , MPEG-7, MSchema, MuseumDat, MusicXML, MXF, NewsML, NFC, NFD, NFKC, NFKD, NIAM, OAI,
OAI-ORE, OAI-PMH, OAIS, ODRL, ONIX, Ontology for Media, OODBMS, OpenDocument, OpenSearch, OpenURL, ORM, OWL, PB Core, PDF, PI,
Pica+, Pica3, PND, PREMIS, PRISM, Proto, QDC, RAD, RAK, RDA, RDBMS, RDF, RDFS, RDF/XML, Relax NG, RELAX NG, Resource, RIS, RSS, RSWK,
Schematron, SCORM, SDXF, Seel, S-EXP, SGML, SIOC, SKOS, SMIL, SPECTRUM, SQL, SRU/SRW, SWAP, SWB, TEI, TEX, TextMD, TGM I, TGM II, TGN, Thrift, Topic Maps, UCS, ULAN, UML, unAPI, UNIMARC, URI, UTF8, vCard, Vorbis Comment, VRA, VSO Data Model, XDR, XMetaDiss, XML, XML Schema, XMP, XOBIS, XPath, XPDL, XQuery, XrML, XSLT, YAML, ZZStruct
there is not one single document format
thesis
but there are common patterns on all levels of description, independent from particular technologies
examples of particular technologies
XML
● Unicode
● XML Infoset
● XML Schema
● XPath
relational databases
● Relational Model
● SQL
● Entity-Relationship Diagrams
families of related standards
method
not statistical: this would limit my research to one level and technology of description
method
phenomenological: data description in all of its forms as it appears in our experience
phenomenological method
data description analyzed as phenomena:
1. critical intuiting (experience)
2. analyzing structures, free of known categories
3. describing the essence
[Portraits: Hegel, Merleau-Ponty*, Husserl]
* Image CC-BY Pierre-Alain Gouanvic
results
1) Categorization of data structuring methods
2) Collection of data structuring paradigms
3) Pattern language of data patterns
result 1: categorization of methods
● encodings express data (UTF-8 Unicode, IEEE floating point, Base64…)
● file and database systems store data
● identifiers and query languages refer to data
● data structuring and markup languages structure data
● schema languages constrain and validate data
● conceptual models describe data
Concrete methods appear as combinations of categories!
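The point that concrete methods combine several categories can be illustrated with a minimal sketch. The record, the `validate` function and the choice of JSON as structuring language are assumptions made for illustration; they are not taken from the slides.

```python
import base64
import json
import struct

# Hypothetical record, invented for illustration only.
record = {"title": "Voß", "year": 2011}

# A structuring language (JSON) structures the data ...
text = json.dumps(record, ensure_ascii=False, sort_keys=True)

# ... an encoding (UTF-8) expresses the characters as bytes ...
raw = text.encode("utf-8")

# ... and a second encoding (Base64) expresses those bytes as ASCII text.
wrapped = base64.b64encode(raw).decode("ascii")

# IEEE 754 floating point is another encoding: 8 bytes for one number.
ieee = struct.pack(">d", 0.5)


def validate(rec):
    """Toy stand-in for a schema language: constrain keys and value types."""
    return isinstance(rec.get("title"), str) and isinstance(rec.get("year"), int)


# Reading the data back applies the same categories in reverse order.
restored = json.loads(base64.b64decode(wrapped).decode("utf-8"))
```

Even this small round trip touches three of the categories (encoding, structuring, validation), which is one way to read the remark about combinations.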
result 2: paradigms
● Document- or Object-oriented approach
  ● Document-oriented (e.g. ordered tree with tagged character strings: XML, Relax NG…)
    ⇒ descriptive data description
  ● Object-oriented (objects with properties and defined value spaces: XML Schema, UML…)
    ⇒ prescriptive data description
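The contrast between descriptive and prescriptive description might be sketched as follows. The XML snippet, the `describe` helper and the `Talk` class are hypothetical; Python's `xml.etree.ElementTree` and `dataclasses` merely stand in for the document-oriented and object-oriented technologies named above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical document, invented for illustration.
XML = "<talk><title>Revealing digital documents</title><year>2011</year></talk>"


def describe(element):
    """Document-oriented: report whatever tagged strings the tree contains."""
    return {child.tag: child.text for child in element}


@dataclass
class Talk:
    """Object-oriented: properties with defined value spaces."""
    title: str
    year: int

    def __post_init__(self):
        # Dataclasses do not enforce annotations, so check them explicitly.
        if not isinstance(self.title, str) or not self.title:
            raise ValueError("title must be a non-empty string")
        if not isinstance(self.year, int):
            raise ValueError("year must be an integer")


# Descriptive: any tags are accepted, all values stay character strings.
found = describe(ET.fromstring(XML))

# Prescriptive: the model decides which values are allowed.
talk = Talk(found["title"], int(found["year"]))
```

The descriptive reading accepts any tree and keeps `"2011"` as text, while the prescriptive model rejects anything that does not fit its declared properties.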
result 2: paradigms
● Entities and connections
[Diagram: the entities "Jakob" and "1979" shown in several arrangements, with the labels "Birth" and "born"]
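How the slide's diagram arranged these labels is not recoverable from the text, so the following is only a guess at the kind of alternatives it may have contrasted. Only the labels "Jakob", "1979", "born" and "Birth" come from the slide; the role names `person` and `year` and all function names are hypothetical.

```python
# (a) Two entities joined by a named connection.
as_connection = [("Jakob", "born", "1979")]

# (b) The connection itself becomes an entity with its own connections.
#     The role names "person" and "year" are invented for this sketch.
as_entity = [("Birth", "person", "Jakob"), ("Birth", "year", "1979")]

# (c) One entity carries the other as a plain property value.
as_property = {"Jakob": {"born": "1979"}}


def year_from_connection(triples, who):
    """Follow the 'born' connection starting at the given entity."""
    return next(o for s, p, o in triples if s == who and p == "born")


def year_from_entity(triples, who):
    """Find the birth entities of a person, then read their year."""
    births = {s for s, p, o in triples if p == "person" and o == who}
    return next(o for s, p, o in triples if s in births and p == "year")


def year_from_property(objects, who):
    """Look the value up as a property of the entity."""
    return objects[who]["born"]
```

All three lookups return the same year for "Jakob"; what differs is which parts of the fact count as entities and which as connections.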
result 2: paradigms
● Layers of abstraction
● Standards and rules
● Collections and types
● Granularity
result 3: patterns
● patterns as a systematic tool for describing good design practice, introduced by Christopher Alexander:
  “Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem […]”
● Adopted as design patterns in software engineering
● Collected in a pattern language with meaningful connections between patterns (network of patterns).
result 3: patterns
[Diagram: network of connected patterns: sequence, collection, known size, separator, position, ordered set, array]
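Two of the pattern names above, "known size" and "separator", can be read as alternative ways to mark where the elements of a sequence end. The sketch below follows that reading; the connections drawn in the original diagram are not recoverable, and the function names and the one-byte length prefix are assumptions.

```python
def read_with_separator(data: bytes, separator: bytes = b",") -> list:
    """Separator: a reserved value marks the boundary between elements."""
    return data.split(separator)


def read_with_known_size(data: bytes) -> list:
    """Known size: each element is preceded by one byte giving its length."""
    items, pos = [], 0
    while pos < len(data):
        size = data[pos]                      # length prefix (0..255)
        items.append(data[pos + 1:pos + 1 + size])
        pos += 1 + size
    return items


# The same sequence of three elements, expressed in both ways.
separated = b"ab,c,def"
prefixed = b"\x02ab\x01c\x03def"
```

A separator forbids that value inside elements (or requires escaping), whereas a length prefix allows any content but limits the element size to what the prefix can express.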
applications
● data archeology
  ● In 200 years someone finds snapshots and archives of Wikipedia in different forms (SQL, XML, Wikitext, DBpedia, HTML…)
  ● What are the significant parts? How do the parts relate to each other?
… another document
to give a simple example…
… another document
[Figure: Morse-coded message annotated with decoded letters: A A A ET T R N SPTD]
encoding (morse code)
sequence with delimiter
grouping of sequences with delimiter