Parli-N-Grams
Giuseppe Sollazzo
@puntofisso
Accountability Hack 2014
Parli-N-Grams
A search and analysis tool for Hansard
The best search lets you discover things while you look for them
N-Grams?
An N-Gram is a sequence of N words
● 1-gram: fox
● 2-gram: brown fox
● 3-gram: quick brown fox
● 4-gram: the quick brown fox
● ...
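To make the definition concrete, here is a minimal sketch that lists every N-gram in a sentence by sliding a window of N words across it. It is written in Python for illustration only; the function name `ngrams` and the whitespace tokenisation are assumptions, not the project's actual PHP code.

```python
def ngrams(text, n):
    """Return every n-gram (as a string) of the whitespace-separated words in text."""
    words = text.split()
    # Slide a window of n words across the sentence; too-short input yields [].
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "the quick brown fox"
for n in range(1, 5):
    print(n, ngrams(sentence, n))
```

The examples on the slide are the N-grams that end in "fox"; the sketch returns all of them, so the 2-grams of the same sentence also include "the quick" and "quick brown".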
Tech Stack
Harvesting/parsing: PHP
Front-End: jQuery, JavaScript
UI: Bootswatch, Bootstrap
Next time, PLAN!
Harvesting 6.4GB is slow
Parsing 6.4GB is slower
● Especially in PHP
Running grep because you’ve forgotten to extract data beforehand is slow AND stupid
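One way to read "extract data beforehand": count the N-grams once, while parsing, so that later queries are lookups rather than repeated greps over the raw text. The sketch below is a rough Python illustration of that idea, not the project's code. The `(year, text)` input shape, the `build_index` name and the toy speeches are all made up for the example.

```python
from collections import Counter, defaultdict


def build_index(speeches, max_n=3):
    """Count every 1..max_n-gram per year in a single pass over the corpus.

    `speeches` is a hypothetical iterable of (year, text) pairs.
    """
    index = defaultdict(Counter)  # n-gram -> Counter of year -> occurrences
    for year, text in speeches:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])][year] += 1
    return index


# Toy data standing in for parsed Hansard text (invented for illustration).
speeches = [
    (1990, "the honourable member"),
    (1990, "the honourable gentleman"),
    (2010, "the honourable member"),
]
index = build_index(speeches)
print(dict(index["honourable member"]))
```

With the index built, asking how often a phrase appears in a given year is a dictionary access. On a corpus the size mentioned above, an in-memory dictionary would probably not fit, so the same counts would more plausibly be written to a database.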
Next time, PLAN!
Most data is available
Extraction is still running for 1-grams...
Next time, PLAN!
sed s/=\'\'/=\'\\\\\'/g $filename |
sed s/\'\'\ /\\\\''\'\'\ /g |
sed "s/$/;/g" |
sed "s/\([a-z]\)'\(s\)/\1\\\'\2/g" |
sed "s/\([A-Z]\)'\(s\)/\1\\\'\2/g" |
sed "s/\([a-z]\)'\(l\)/\1\\\'\2/g" |
sed "s/\([a-z]\)'\(r\)/\1\\\'\2/g" |
sed "s/\(n\)'\(t\)/\1\\\'\2/g" |
sed "s/\(o\)'\(c\)/\1\\\'\2/g" |
sed "s/\(e\)'\(v\)/\1\\\'\2/g" |
sed "s/\(I\)'\(v\)/\1\\\'\2/g" |
sed "s/\(u\)'\(v\)/\1\\\'\2/g" |
sed "s/\([a-z]\)'\([A-Z]\)/\1\\\'\2/g" |
sed "s/\(O\)'\([a-z]\)/\1\\\'\2/g" |
sed "s/\(O\)'\([A-Z]\)/\1\\\'\2/g" |
sed "s/\(I\)'\(m\)/\1\\\'\2/g" |
sed "s/\([A-Z]\)'\(l\)/\1\\\'\2/g" |
sed "s/\([a-z]\)'\([a-z]\)/\1\\\'\2/g" |
sed "s/\([a-z]\)\'-\([a-z]\)/\1\\\'-\2/g" |
sed "s/\([A-Z]\)'\([A-Z]\)/\1\\\'\2/g" |
sed "s/\([A-Z]\)'\([a-z]\)/\1\\\'\2/g" |
sed "s/'\([a-z]\)'\([a-z]\)/\\'\1\\\'\2/g" |
sed "s/-'n\\\'/-\\\'n\\\'/g" |
sed "s/-'\([a-z]\)/-\\\'\1/g" |
sed "s/-o'-/-o\\\'-/g" |
sed "s/ght'-le/ght\\\'-le/g" |
sed "s/cats'-meat/cats\\\'-meat/g" |
sed "s/n'-roll/n\\\'-roll/g" |
sed "s/sou'-w/sou\\\'-w/g" |
sed "s/gleaf'-for/gleaf\\\'-for/g"
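The pipeline above looks like it escapes apostrophes case by case so that the extracted text can be loaded as SQL statements. That reading is an assumption, since the slides do not say what the output feeds into. If it is right, a planned alternative is to pass the values as query parameters and let the database driver do the quoting. The sketch below uses Python's built-in `sqlite3` module; the table layout, column names and counts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngrams (ngram TEXT, year INTEGER, occurrences INTEGER)")

# Hypothetical rows containing the kind of apostrophes the sed chain special-cases.
rows = [
    ("rock'n'roll", 1985, 3),
    ("cats'-meat", 1936, 1),
    ("O'Brien", 2001, 7),
]
# Placeholders (?) keep the values out of the SQL text, so no manual escaping is needed.
conn.executemany(
    "INSERT INTO ngrams (ngram, year, occurrences) VALUES (?, ?, ?)", rows
)
conn.commit()

stored = [r[0] for r in conn.execute("SELECT ngram FROM ngrams ORDER BY year")]
print(stored)
```

The apostrophes come back unchanged, with no per-pattern rules to maintain.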
Available on
http://github.com/puntofisso/AccHack14
http://parli-n-grams.puntofisso.net
Thank you!
Parli-N-Grams
Giuseppe Sollazzo
@puntofisso
Accountability Hack 2014