Processing XML: a rewriting system approach

Yet another method to parse XML: rewrite it!

Alberto Simões
[email protected]
Portuguese Perl Workshop – 2010

Motivation and Goals

XML is usually generated from structured information: databases, spreadsheets, forms, etc.

But it can also be generated from unstructured (or poorly structured) data: textual documents, domain-specific languages.

The question arises: how do we produce XML documents from textual documents?

- write a parser (natural language, domain-specific, etc.);
- produce XML by rewriting the textual document!

How does textual rewriting work?

Write rewriting rules:

    rule ≅ pattern × restriction × action

pattern      a regular (or irregular) expression that should be textually matched;
restriction  conditional code that checks whether the rule should be applied;
action       a piece of code (or simply a string) that produces the text that replaces the originally matched text.
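The rule triple can be illustrated in plain Perl, without Text::RewriteRules. This is only a minimal sketch: the `$year_rule` example and the `apply_rule` helper are hypothetical names introduced here, not part of the talk or of the module.

```perl
use strict;
use warnings;

# A rule as a (pattern, restriction, action) triple. The year example is
# hypothetical and not taken from the talk.
my $year_rule = {
    pattern     => qr/\b(\d{4})\b/,                         # what to match
    restriction => sub { $_[0] >= 1900 && $_[0] <= 2099 },  # when to apply
    action      => sub { "<year>$_[0]</year>" },            # replacement text
};

# Apply one rule everywhere it matches; matches that fail the restriction
# are copied back unchanged.
sub apply_rule {
    my ($rule, $text) = @_;
    my ($re, $ok, $act) = @{$rule}{qw(pattern restriction action)};
    $text =~ s/$re/ $ok->($1) ? $act->($1) : $& /ge;
    return $text;
}

print apply_rule($year_rule, "Item 4821 was shown at the 2010 workshop."), "\n";
```

Here the restriction keeps "4821" untouched while "2010" is wrapped in a `<year>` element.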

Are there text rewriting tools?

For this work we used Text::RewriteRules:

- written in Perl:
    - the power of the Perl regular expression engine;
    - a reflexive language (code can be generated on the fly);
- supports different rewriting approaches:
    - fixed-point rewriting approach;
    - sliding-cursor rewriting approach;
    - lexical analyzer approach;
- home-developed.

Fixed-point rewriting approach

Algorithm:

- easy to understand;
- a sequence of rules that are applied in order;
- the first rule is applied, and the following rules are only applied if no earlier rule can be applied;
- a rule may change the document in such a way that an earlier rule becomes applicable again;
- the process ends when no rule can be applied (or when a specific rule forces the system to stop).

Code example: anonymization of emails

    RULES anonymize
    \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
    ENDRULES
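The fixed-point strategy can be approximated in plain Perl as follows. This is a sketch under stated assumptions, not the module's implementation: `fixed_point` and the rule table are hypothetical, and the whitespace rule is an addition made only to show a second rule at work.

```perl
use strict;
use warnings;

# Ordered rule table: [ pattern, replacement ]. The first rule mirrors the
# slide; the whitespace rule is a hypothetical addition.
my @rules = (
    [ qr/\w+(?:\.\w+)*\@\w+\.\w+(?:\.\w+)*/, '[[hidden email]]' ],
    [ qr/[ ]{2,}/,                           ' '                ],
);

# Apply the first rule that matches, then start again from the first rule.
# Stop when a full pass over the rules changes nothing.
sub fixed_point {
    my ($text, @rules) = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        for my $rule (@rules) {
            my ($pattern, $replacement) = @$rule;
            if ($text =~ s/$pattern/$replacement/) {
                $changed = 1;
                last;    # restart from the first rule
            }
        }
    }
    return $text;
}
```

The loop only terminates if the rules eventually stop matching: here the replacement text contains no `@`, so the email rule cannot fire on its own output.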

Sliding-cursor rewriting approach

Algorithm:

- the cursor is placed at the beginning of the string;
- patterns are matched only if they occur right after the cursor;
- if a rule is applied, the cursor is placed after the rewritten region;
- if no rule matches, the cursor moves ahead one character;
- the process ends when the cursor reaches the end of the string;
- text that was already rewritten is never rewritten again.

Code example: brute-force translation

    RULES/m translate
    (\w+)=e=> $translation{$1} !! exists($translation{$1})
    ENDRULES

Example (_ marks the cursor):

    _ latest train
    último _ train
    último combóio _
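The cursor mechanics can be sketched in plain Perl with the `\G` anchor and `pos()`. The `sliding_cursor` function is a hypothetical helper written for illustration; it handles a single rule and is not how Text::RewriteRules is necessarily implemented.

```perl
use strict;
use warnings;
use utf8;

# Toy dictionary; the entries come from the slide's example.
my %translation = (latest => 'último', train => 'combóio');

# Sliding-cursor sketch: try the rule only at the cursor; on success emit
# the replacement and jump past the match, otherwise copy one character.
sub sliding_cursor {
    my ($text, $dict) = @_;
    my $out    = '';
    my $cursor = 0;
    while ($cursor < length $text) {
        pos($text) = $cursor;                       # anchor \G at the cursor
        if ($text =~ /\G(\w+)/g && exists $dict->{$1}) {
            $out .= $dict->{$1};                    # action
            $cursor = pos($text);                   # skip the matched region
        }
        else {
            $out .= substr($text, $cursor, 1);      # no rule: copy one char
            $cursor++;
        }
    }
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print sliding_cursor('latest train', \%translation), "\n";
```

Because the output is built separately from the input, a replacement is never scanned again: with the dictionary `{ a => 'b', b => 'c' }`, the text "a b" becomes "b c", not "c c".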

Valid Rewriting Rules

Different approaches have different possible rules, but the most relevant ones are:

==>       simple pattern substitution: the left-hand side is a Perl regular expression and the right-hand side is the string that replaces the match;
=e=>      similar to the previous one, but the right-hand side is Perl code to be evaluated; its result replaces the match;
=begin=>  has no left-hand side; the right-hand side code is executed before the rewrite starts;
=end=>    has no right-hand side; when the left-hand side pattern matches, the rewrite system quits.

Rules can include a restriction block (!!) to the right of the action.

Rewriting Text into XML

How to produce XML from weakly structured data?

- write a parser;
- or rewrite the data step by step into XML!

Two case studies:

- rewriting a dictionary in textual format into TEI;
- rewriting an XML DSL authoring tool into XML.

Rewriting Text into TEI

Rewrite this...

    *Cachimbo*,
    _m._
    Apparelho de fumador, composto d..
    Peça de ferro, em que entra o es..
    Buraco, em que se encaixa a vela..
    * _Bras. de Pernambuco._
    Bebida, preparada com aguardente..
    * _Pl. Gír._
    Pés.
    (Do químb. _quixima_)

...into this!

    <entry id="cachimbo">
    <form><orth>Cachimbo</orth></form>
    <sense>
    <gramGrp>m.</gramGrp>
    <def>
    Apparelho de fumador, composto d..
    Peça de ferro, em que entra o es..
    Buraco, em que se encaixa a vela..
    </def>
    </sense>
    <sense ast="1">
    <usg type="geo">Bras. de Pernamb..
    <def>
    Bebida, preparada com aguardente..
    </def>
    </sense>
    <sense ast="1">
    <gramGrp>Pl.</gra..
    <usg type="style">Gír.</usg>
    <def>
    Pés.
    </def>
    </sense>
    <etym ori="químb">(Do químb. _qu..
    </entry>

Rewriting Text into TEI

This rewrite was based entirely on:

- a few tables (grammatical and usage strings):

      entry genres:  Gír Fam Pop Des Fig Vulg Ant Chul Euph
      entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit

- rewriting the little existing mark-up into a better XML structure:

      ((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_sense.gramGrp($a)."\n".start_def

- rewriting the new XML structure to detect and annotate a more complex structure:

      <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".gramGrp($a)

- detecting and correcting wrong XML elements:

      </form></sense>==></form>
      </form></def>\n</sense>==></form>
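As a rough illustration of the first step in this kind of pipeline, the sketch below rewrites only the headword line of a dictionary entry into the opening of a TEI entry. The `headword_to_tei` function is hypothetical and much simpler than the rules shown on the slide; the element names follow the example output shown earlier in the talk.

```perl
use strict;
use warnings;

# Turn a headword line such as "*Cachimbo*," into the opening of a TEI entry.
# Only this first step is sketched; senses, usage labels and etymology
# would need further rules. The id is the lower-cased headword.
sub headword_to_tei {
    my ($text) = @_;
    $text =~ s{^\*([^*\n]+)\*,[ ]*$}{
        sprintf '<entry id="%s"><form><orth>%s</orth></form>', lc($1), $1
    }gme;
    return $text;
}

print headword_to_tei("*Cachimbo*,\n_m._\nApparelho de fumador\n");
```

Later rules can then work on the partially marked-up text, which is what makes the step-by-step approach practical.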

Rewriting Text into TEI

Case study conclusions:

- flexible tool;
- works on big files:
    - the text file is 13 MB;
    - the output XML is 30 MB;
    - the process takes about nine minutes!
- we even rewrote XML into XML.

Hey!! XML is text!! How can we rewrite it!?

Rewriting XML

- different from the usual DOM- or SAX-oriented approaches;
- looks at XML as text, as unstructured data;
- the rewrite can be done:
    - as in any other text rewriting system;
    - taking advantage of irregular expressions.

Irregular expressions? Are you kidding?

Not so regular expressions

Perl has a powerful regular expression engine:

- regular expressions can define capture zones: small pieces of the match that can be used later;
- regular expressions can define look-ahead or look-behind: checks on the context of the matching zone;
- since Perl 5.10, regular expressions can be recursive: a regular expression that depends on itself.

      my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

For XML, we defined two classes:

[[:XML:]]       matches any well-formed XML fragment;
[[:XML(tag):]]  matches an XML fragment with a specific root element.
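To give an idea of how a recursive pattern can match a block of XML, here is a hedged approximation of the `[[:XML(tag):]]` class in plain Perl. The `xml_element_re` function is a hypothetical helper, not the definition used by Text::RewriteRules, and it deliberately ignores comments, CDATA sections, processing instructions and self-closing tags.

```perl
use strict;
use warnings;

# Balanced parentheses, as on the slide: (?-1) re-enters the enclosing group.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

# Hypothetical approximation of [[:XML(tag):]]: one element with the given
# name, possibly containing nested elements of the same name.
sub xml_element_re {
    my ($tag) = @_;
    my $t = quotemeta $tag;
    return qr{
        (                              # group 1: the whole element
            <$t\b[^<>]*>               # opening tag with optional attributes
            (?: [^<]++                 # text content
              | <(?!/?$t\b)[^<>]*>     # any tag with a different name
              | (?-1)                  # nested element of the same name
            )*+
            </$t>                      # matching closing tag
        )
    }x;
}

my $tu    = xml_element_re('tu');
my @units = '<body><tu id="1"><seg>a</seg></tu><tu><seg>b</seg></tu></body>' =~ /$tu/g;
print scalar(@units), " translation units found\n";
```

The possessive quantifiers (`++`, `*+`) keep the engine from backtracking into already balanced content, the same trick used in the parentheses pattern.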

Rewriting XML

As a simple example, we can remove duplicate translation units in a translation memory file:

Code example

    RULES/m duplicates
    ([[:XML(tu):]])==>!!duplicate($1)
    ENDRULES

    # md5() is provided by Digest::MD5; dtstring() and $c by XML::DT
    sub duplicate {
        my $tu = shift;
        my $tumd5 = md5(dtstring($tu, -default => sub { $c }));
        return 1 if exists $visited{$tumd5};
        $visited{$tumd5}++;
        return 0;
    }
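The same idea can be tried without Text::RewriteRules or XML::DT. The sketch below is an assumption-laden stand-in: `remove_duplicate_tus` and `$tu_re` are hypothetical, and units are compared as raw text, whereas the slide's version appears to hash the output of `dtstring`.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Recursive pattern for one <tu> element (hypothetical stand-in for
# [[:XML(tu):]]; it ignores comments, CDATA and self-closing tags).
my $tu_re = qr{(<tu\b[^<>]*>(?:[^<]++|<(?!/?tu\b)[^<>]*>|(?-1))*+</tu>)};

# Keep the first copy of each translation unit and delete later copies.
# Units are compared as raw text, without any normalisation.
sub remove_duplicate_tus {
    my ($tmx) = @_;
    my %visited;
    $tmx =~ s{$tu_re}{ $visited{ md5_hex(encode_utf8($1)) }++ ? '' : $1 }ge;
    return $tmx;
}

print remove_duplicate_tus(
    '<body><tu><seg>a</seg></tu><tu><seg>b</seg></tu><tu><seg>a</seg></tu></body>'
), "\n";
```

Hashing each unit keeps the `%visited` table small even for large translation memories, at the cost of treating units that differ only in whitespace or attributes as distinct.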

Conclusions

The rewriting approach:

- is flexible;
- is powerful;
- is easy to learn;
- grows quickly;
- can be difficult to maintain in big systems.

The Perl regular expression engine:

- makes it easy to match anything;
- almost supports full grammars;
- makes it possible to define block structures.

So, it can be applied to XML easily!

Thank you!

Alberto Simões
[email protected]