28
XML damage control Silvan Jegen [email protected] 24. September 2016 Silvan Jegen () XML damage control 24. September 2016 1 / 22

XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

  • Upload
    trantu

  • View
    224

  • Download
    1

Embed Size (px)

Citation preview

Page 1: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML damage control

Silvan Jegen

[email protected]

24. September 2016

Silvan Jegen () XML damage control 24. September 2016 1 / 22

Page 2: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Table of Contents

1 XMLXML in theoryXML in practice

2 Dealing with XMLProgramming interfacesBenchmark

3 Conclusion

Silvan Jegen () XML damage control 24. September 2016 2 / 22

Page 3: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML in theory

Extensible Markup Language (XML)

XML aspects

• Well-formedness

• Validation

• Namespaces

• Entities

Silvan Jegen () XML damage control 24. September 2016 3 / 22

Page 4: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML in theory

Extensible Markup Language (XML)

XML aspects

• Well-formedness

• Validation

• Namespaces

• Entities

Silvan Jegen () XML damage control 24. September 2016 3 / 22

Page 5: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML in theory

Related specifications

• XSLT

• XPath

• XQuery

• XML Encryption

• ...

Silvan Jegen () XML damage control 24. September 2016 4 / 22

Page 6: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML-based formats

Variants• RDF XML

• XMPP

• EPUB

• XHTML

• ...

• 200+ more

Silvan Jegen () XML damage control 24. September 2016 5 / 22

Page 7: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML in practice

XML in practice

Silvan Jegen () XML damage control 24. September 2016 6 / 22

Page 8: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML in practice

Enterprise usage

• SOAP

• Configuration

• Data storage/exchange

• Java ecosystem...

Silvan Jegen () XML damage control 24. September 2016 7 / 22

Page 9: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Markup

Annotate parts of text with additional information

Text Markup

<document>Some text <tag>some other text that

should be tagged</tag> even <tag2>more</tag2>

text...</document>

Silvan Jegen () XML damage control 24. September 2016 8 / 22

Page 10: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Markup

Annotate parts of text with additional information

Text Markup

<document>Some text <tag>some other text that

should be tagged</tag> even <tag2>more</tag2>

text...</document>

Silvan Jegen () XML damage control 24. September 2016 8 / 22

Page 11: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML vs. JSON

XML<document>Some text <tag>some other text that

should be tagged</tag> even <tag2>more</tag2>

text...</document>

JSON

["Some text ", {"t": "tag", "s": "some other text

that should be tagged"}, "even", {"t": "tag2",

"s": "more"}, "text..."]

Silvan Jegen () XML damage control 24. September 2016 9 / 22

Page 12: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

XML vs. JSON

XML<document>Some text <tag>some other text that

should be tagged</tag> even <tag2>more</tag2>

text...</document>

JSON

["Some text ", {"t": "tag", "s": "some other text

that should be tagged"}, "even", {"t": "tag2",

"s": "more"}, "text..."]

Silvan Jegen () XML damage control 24. September 2016 9 / 22

Page 13: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Dealing with XML

Dealing with XML

Silvan Jegen () XML damage control 24. September 2016 10 / 22

Page 14: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Programming interfaces

• Stream-oriented (SAX, Stax)

• Tree traversal (DOM)

• XML Data binding

• Transformation languages (XSLT, XQuery)

• Other?

Silvan Jegen () XML damage control 24. September 2016 11 / 22

Page 15: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Programming interfaces

• Stream-oriented (SAX, Stax)

• Tree traversal (DOM)

• XML Data binding

• Transformation languages (XSLT, XQuery)

• Other?

Silvan Jegen () XML damage control 24. September 2016 11 / 22

Page 16: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Benchmark

• ezxml

• Golang encoding/xml

• mxml (’Mini-XML’, not ’Minimal XML’)

• Python 2 ElementTree

• sxmlc

• yxml

All code available at:git://git.sillymon.ch/slcon3.git

Silvan Jegen () XML damage control 24. September 2016 12 / 22

Page 17: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Benchmark setup

Linux machine, i7 CPU, 8GB RAM

• 627MB of XML in 10’000 files from PubMed Central

• Printing the article titles

• 20 runs (after cache warming)

• Single-threaded (except for Go)

Silvan Jegen () XML damage control 24. September 2016 13 / 22

Page 18: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

ezxml

ezxml• Program: 21 lines

• Library: 623 lines

• URL: http://ezxml.sourceforge.net/

• Type: DOM (Level 3; XPath)

ezxml_t title = ezxml_get(ezdoc, "front", 0,\

"article-meta", 0, "title-group", 0,\

"article-title", -1);

Silvan Jegen () XML damage control 24. September 2016 14 / 22

Page 19: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Go encoding/xml

Go encoding/xml

• Program: 30 lines

• Library: 6235 lines (stdlib)

• URL: https://golang.org/pkg/encoding/xml/

• Type: XML Data binding

type article struct {

Title string ‘xml:"front>article-meta>title-group>\

article-title"‘

}

Silvan Jegen () XML damage control 24. September 2016 15 / 22

Page 20: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

mxml

mxml• Program: 38 lines

• Library: 9633 lines

• URL: http://www.minixml.org/

• Type: DOM (Level 3; Xpath)?

node = mxmlFindElement(root, root, "title-group",

NULL, NULL, MXML_DESCEND);

Silvan Jegen () XML damage control 24. September 2016 16 / 22

Page 21: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Python 2 ElementTree

Python 2 ElementTree

• Program: 12 lines

• Library: 1107 lines (stdlib)

• URL:https://docs.python.org/2/library/xml.etree.element-tree.html

• Type: DOM (Level 3; Xpath)?

tg = r.findall("./front/article-meta/title-group")

at = tg[0].find("article-title")

Silvan Jegen () XML damage control 24. September 2016 17 / 22

Page 22: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

sxmlc

sxmlc• Program: 59 lines

• Library: 2690 lines

• URL: http://sxmlc.sourceforge.net/

• Type: DOM

const char *path[] = {"front", "article-meta",\

"title-group", "article-title", NULL};

for (int i = 0; path[i]; i++) {

next = find_child_node(next, path[i]);

if (!next) {

fprintf(stderr, "Could not find ’%s’

tag.\n", path[i]);

return;

}

}Silvan Jegen () XML damage control 24. September 2016 18 / 22

Page 23: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

yxml

yxml

• Program: 103 lines

• Library: 1039 lines

• URL: https://dev.yorhel.nl/yxml

• Type: Stream-oriented

case YXML_ELEMSTART:

if (!strcmp(state->elem, "title-group")) {

intitlegroup = 1;

} else if (!strcmp(state->elem, "article-title")

&& intitlegroup) {

printf("%s: ", state->elem);

inarticletitle = 1;

}

break;Silvan Jegen () XML damage control 24. September 2016 19 / 22

Page 24: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Time & Size

mean σ min max size

Python ElementTree 155.5 8.869 145.7 172.3 N/AGo encoding/xml 48.34 4.982 35.33 52.92 2Mmxml 23.96 1.841 22.32 27.64 N/Asxmlc 15.51 0.259 15.12 16.01 41Kezxml 6.460 0.058 6.366 6.592 31Kyxml 4.123 0.220 3.885 4.520 18K

Silvan Jegen () XML damage control 24. September 2016 20 / 22

Page 25: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Conclusion

• Complex specifications, (comparatively) hard to parseand verbose

• Use the ezxml or yxml libraries

• Ok:

<document>Some text <tag>some other text that

should be tagged</tag> even <tag2>more</tag2>

text...</document>

• No:

<thing><key>Key</key><value>Val</value></thing>

Silvan Jegen () XML damage control 24. September 2016 21 / 22

Page 26: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Conclusion

• Complex specifications, (comparatively) hard to parseand verbose

• Use the ezxml or yxml libraries

• Ok:

<document>Some text <tag>some other text that

should be tagged</tag> even <tag2>more</tag2>

text...</document>

• No:

<thing><key>Key</key><value>Val</value></thing>

Silvan Jegen () XML damage control 24. September 2016 21 / 22

Page 27: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Conclusion

• Complex specifications, (comparatively) hard to parseand verbose

• Use the ezxml or yxml libraries

• Ok:

<document>Some text <tag>some other text that

should be tagged</tag> even <tag2>more</tag2>

text...</document>

• No:

<thing><key>Key</key><value>Val</value></thing>

Silvan Jegen () XML damage control 24. September 2016 21 / 22

Page 28: XML damage control - Amazon S3 · PDF fileBenchmark 3 Conclusion Silvan Jegen () ... XML vs. JSON XML ... XML damage control 24. September 2016 18 / 22. yxml yxml

Thanks for your attention

Questions?

Silvan Jegen () XML damage control 24. September 2016 22 / 22