andysh
How to make Google Books
at home
in Perl
not to beat Google
What do we have on the internet today?
Find a word
Find the words
Show the page
Show the page and highlight the words
... Russia ... ece yapc/coe ...
Pushkin?
... Russia ... ece yapc/coe ...
YAPC?
... Russia ... ece yapc/coe ...
XIX?
... Russia ... ece yapc/coe ...
WTF?
ece yapc/coe
все царское (Russian: "everything tsarist")
Amazon
Guess the next screen
Text archive
Berlin
How to make it
PDF (black box)
PDF WEB
Sample PDF: use.perl.org/~andy.sh/journal
Work with PDF? No
SVG
Scalable Vector Graphics
http://www.w3.org/Graphics/SVG/
SVG is XML
XML::LibXML
XPath
XSLT
SVG from PDF: http://www.pdftron.com/pdf2svg/
$ ./pdf2svg book.pdf book.svg
Structure
Geometry
<g>
  <g>
    <text>
      <tspan> </tspan>
    </text>
    <text> </text>
  </g>
  <g> </g>
</g>
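The nesting above can be walked with XML::LibXML and XPath. This is a minimal sketch, assuming the converter emits elements in the standard SVG namespace. The inline sample document and the `svg` prefix are illustrative and not taken from real pdf2svg output.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample that mimics the <g>/<text>/<tspan> nesting.
my $svg = <<'SVG';
<svg xmlns="http://www.w3.org/2000/svg">
  <g>
    <g>
      <text transform="matrix(1 0 0 -1 10 584)"><tspan>What is it</tspan></text>
      <text transform="matrix(1 0 0 -1 10 560)"><tspan>YAPC</tspan></text>
    </g>
    <g></g>
  </g>
</svg>
SVG

my $doc = XML::LibXML->load_xml(string => $svg);

# SVG elements live in a namespace, so XPath needs a registered prefix.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(svg => 'http://www.w3.org/2000/svg');

# Collect every tspan together with the transform of its parent <text>.
my @found;
for my $text ($xpc->findnodes('//svg:g/svg:text')) {
    my $transform = $text->getAttribute('transform') // '';
    for my $tspan ($xpc->findnodes('svg:tspan', $text)) {
        push @found, { transform => $transform, text => $tspan->textContent };
    }
}

print "$_->{transform}: $_->{text}\n" for @found;
```

If the real output carries no namespace declaration, plain paths such as `//g/text` on the document itself should be enough.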
<g>
<text>
<text transform=...>
<text transform="matrix(1 0 0 -1 10 584)">
Page
g
text + transform
<tspan>
Page
g
text + transform
tspan
my $transform = $node->findvalue('@transform');
my $num = qr/-?\d+(?:\.\d+)?/;
# matrix(a b c d e f): a and d are the scale factors, e and f the translation
if (my ($sx, $sy, $tx, $ty) = $transform =~ /matrix\(\s*($num)\s+$num\s+$num\s+($num)\s+($num)\s+($num)\s*\)/) {
    print "($sx, $sy, $tx, $ty)";
    # A negative scale flips the axis, so shift back by the page size
    $pos{x} = $sx * $tx;
    $pos{x} += $pos{pagew} if $sx < 0;
    $pos{y} = $sy * $ty;
    $pos{y} += $pos{pageh} if $sy < 0;
    print " [$pos{x}, $pos{y}]";
}
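As a worked example of the flip logic, here is the same calculation as a standalone function applied to the sample matrix from the earlier slide. The function name `text_origin` and the 612 x 792 page size are assumptions made for illustration. The slides do not state the page dimensions.

```perl
use strict;
use warnings;

my $num = qr/-?\d+(?:\.\d+)?/;

# Returns the (x, y) origin of a text element, or an empty list
# if the transform is not a matrix(a b c d e f).
sub text_origin {
    my ($transform, $page_w, $page_h) = @_;
    my ($sx, $sy, $tx, $ty) = $transform
        =~ /matrix\(\s*($num)\s+$num\s+$num\s+($num)\s+($num)\s+($num)\s*\)/
        or return;
    my $x = $sx * $tx;
    $x += $page_w if $sx < 0;   # mirrored horizontally
    my $y = $sy * $ty;
    $y += $page_h if $sy < 0;   # mirrored vertically
    return ($x, $y);
}

# Assumed page size of 612 x 792 units.
my ($x, $y) = text_origin('matrix( 1 0 0 -1 10 584 )', 612, 792);
print "$x, $y\n";   # prints "10, 208"
```

With a vertical scale of -1 the y value becomes -584 + 792 = 208, which is the distance from the top of the page instead of from the bottom.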
<tspan x="0,16.875,26.258,34.695,40.314,44.533,49.224,55.789,60.008,64.699" y="-0" class="ps00 ps23">What is it</tspan>
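The `x` attribute holds one offset per character: ten values for the ten characters of "What is it". A minimal sketch of how those offsets could be turned into per-word boxes for highlighting follows. The grouping by whitespace is an assumption about how the words were located, not something the slides spell out.

```perl
use strict;
use warnings;

my $x_attr = '0,16.875,26.258,34.695,40.314,44.533,49.224,55.789,60.008,64.699';
my $string = 'What is it';

my @x     = split /,/, $x_attr;
my @chars = split //, $string;
die "x list and text differ in length\n" unless @x == @chars;

# Collect each word with the x offset of its first character and
# of the character that follows it (the end of the word's box).
my @words;
my $current;
for my $i (0 .. $#chars) {
    if ($chars[$i] =~ /\s/) {
        if ($current) {
            $current->{end} = $x[$i];
            push @words, $current;
            undef $current;
        }
        next;
    }
    $current //= { word => '', start => $x[$i] };
    $current->{word} .= $chars[$i];
}
# The last word has no following character, so its end is unknown here.
push @words, $current if $current;

for my $w (@words) {
    printf "%-5s %s .. %s\n", $w->{word}, $w->{start}, $w->{end} // '?';
}
```

The right edge of the final word would need the glyph width, which the attribute alone does not provide.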
YAPC
<tspan>YAPC</tspan>
Y APC
<tspan>Y</tspan>
<tspan>APC</tspan>
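One way to rejoin a word that the converter split across two `<tspan>` elements is to merge neighbouring fragments whose boxes almost touch. This is only a sketch: the coordinates, the `x0`/`x1` field names, and the `$max_gap` threshold are all hypothetical, since the slides do not show how the fragments were actually reassembled.

```perl
use strict;
use warnings;

# Hypothetical fragments in reading order: text plus left and right edges.
my @fragments = (
    { text => 'Y',        x0 => 10.0, x1 => 17.2 },
    { text => 'APC',      x0 => 16.9, x1 => 40.5 },   # kerned tight against "Y"
    { text => 'attendee', x0 => 45.0, x1 => 92.3 },
);

my $max_gap = 1.5;   # assumed threshold, in the same units as x

my @merged;
for my $f (@fragments) {
    my $prev = $merged[-1];
    if ($prev && $f->{x0} - $prev->{x1} <= $max_gap) {
        # Close enough to be the same word: glue it on.
        $prev->{text} .= $f->{text};
        $prev->{x1}    = $f->{x1};
    }
    else {
        push @merged, { %$f };
    }
}

print join(' | ', map { $_->{text} } @merged), "\n";   # prints "YAPC | attendee"
```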
Dictionary
mysql> select * from base where base like 'seek';
+--------+------+-------+---------+
| id     | base | rules | grammar |
+--------+------+-------+---------+
| 189785 | seek | GRSZ  |         |
+--------+------+-------+---------+
mysql> select * from word where ref = 189785;
+--------+---------+
| ref    | word    |
+--------+---------+
| 189785 | seek    |
| 189785 | seeking |
| 189785 | seeker  |
| 189785 | seeks   |
| 189785 | seekers |
+--------+---------+
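The two tables suggest a lookup from any word form back to its base form. A minimal sketch follows, with in-memory hashes standing in for the `base` and `word` tables. The `base_form` function and the pass-through of unknown words are assumptions, not the author's code.

```perl
use strict;
use warnings;

# Rows copied from the two query results; hashes stand in for the tables.
my %base = (189785 => 'seek');
my @word = (
    [189785, 'seek'],  [189785, 'seeking'], [189785, 'seeker'],
    [189785, 'seeks'], [189785, 'seekers'],
);

# Reverse index: any word form -> its base form.
my %base_of;
$base_of{ $_->[1] } = $base{ $_->[0] } for @word;

sub base_form {
    my ($form) = @_;
    return $base_of{ lc $form } // lc $form;   # unknown words pass through
}

print base_form('Seeks'), "\n";   # prints "seek"
print base_form('YAPC'),  "\n";   # prints "yapc"
```

Indexing by base form means a search for "seek" can also match pages that contain "seeks" or "seekers".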
YAPC attendee seeks where to drink after the evening talk.
Morphology
YAPC attendee seeks where to drink after the evening talk.
Stop words
YAPC attendee seeks where to drink after the evening talk.
yapc attendee seeks where to drink after the evening talk.
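The lowercasing and stop-word steps could look roughly like this. The stop-word list is an assumption, because the slides do not say which words were dropped, and `index_terms` is a hypothetical name.

```perl
use strict;
use warnings;

# Assumed stop-word list; the slides do not say which words were dropped.
my %stop = map { $_ => 1 } qw(where to after the);

# Split into words, lowercase them, and drop the stop words.
sub index_terms {
    my ($sentence) = @_;
    return grep { !$stop{$_} } map { lc $_ } $sentence =~ /(\w+)/g;
}

my @terms = index_terms('YAPC attendee seeks where to drink after the evening talk.');
print "@terms\n";   # prints "yapc attendee seeks drink evening talk"
```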