PIPING HOT: Little Bins in big workflows Alex Garnett Digital Preservation & Data Curation SFU Library


Thesis: I am a terrible programmer

• 20% of you are thinking “no kidding!”

• The other 80% of you are thinking “uh huh. Stupid false-modest shmuck.”

• Who needs impostor syndrome when you have a bash shell?

• For the record, this is the payoff from all those colonoscopy jokes. Yep.

But how does it apply to libraries?

[If MJ Suhonos is here this year, this is his cue to groan audibly]

LIBRARY PROBLEM #1: PDF/A

• ProQuest wants PDF/A submissions from now on

• “now on” apparently = the past five years’ backlog

• We have to convert five years of theses!

• This is now also being used at the UofA.
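
A backlog conversion like this can be scripted in a few lines. The sketch below is only an illustration: the talk does not say which converter was used, so Ghostscript is an assumption here, and the helper names (`run`, `to_pdfa`, `convert_backlog`) are hypothetical. Strict PDF/A validation usually also needs an ICC profile and a PDFA_def.ps definition file, which this sketch leaves out.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes Ghostscript ("gs") is installed; the talk does not
# say which converter was actually used.

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Convert one PDF ($1) to PDF/A-1 ($2) with Ghostscript's pdfwrite device.
to_pdfa() {
  run gs -dPDFA=1 -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
    -sColorConversionStrategy=RGB -dPDFACompatibilityPolicy=1 \
    -sOutputFile="$2" "$1"
}

# Convert every PDF in directory $1, writing results into directory $2.
convert_backlog() {
  local f
  mkdir -p "$2"
  for f in "$1"/*.pdf; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    to_pdfa "$f" "$2/$(basename "$f")"
  done
}
```

Usage, with hypothetical directory names: `DRY_RUN=1 convert_backlog theses theses_pdfa` prints the commands without running them; drop `DRY_RUN=1` to convert for real.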

LIBRARY PROBLEM #2: ARCHIVES PROBLEM: LIBRARY HARDER STARRING BRUCE WILLIS

CRAP, I USED UP THE WHOLE SLIDE ON THE TITLE

• Archives needed a GUI tool to be able to create restrictive FTP accounts for donors.

LIBRARY PROBLEM #3: PDF REDACTION (IT’S LIKE THE FIRST ONE BECAUSE NO ONE LIKED THE SEQUEL, DOES ANYONE WANT TO WATCH TEMPLE OF DOOM LATER, OH HELL I’VE DONE IT AGAIN)

• We learned we had some poorly redacted PDFs

• Blackout meant to obscure text; still selectable

• Solution:

– Detect offending pages with ghostscript…

• (this is the hard part; dumping PDF guts is appalling)

• … and then:

– Snip offending pages with pdftk

– Convert them to images with imagemagick

– OCR back into PDF (minus obscured text) with tesseract and fix up the dimensions with gs again

– Paste back in with pdftk.

– 5 lines, all free tools! Documentation & piping.
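
The "and then" steps above could look roughly like the sketch below. It is not the script from the talk: the page number is passed in by hand because the Ghostscript detection step is not described, and the helper names (`run`, `splice_ranges`, `redact_page`), the intermediate file names, the 300 dpi density and the letter paper size are all assumptions made for illustration. On ImageMagick 7 the `convert` command is usually spelled `magick`.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes pdftk, ImageMagick ("convert"), tesseract and
# Ghostscript ("gs") are installed. Detection of bad pages is NOT shown.

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print the pdftk "cat" ranges that put replacement page B in place of
# page $1 of a document A that has $2 pages.
splice_ranges() {
  local page="$1" total="$2" ranges=""
  if [ "$page" -gt 2 ]; then
    ranges="A1-$((page - 1)) "
  elif [ "$page" -eq 2 ]; then
    ranges="A1 "
  fi
  ranges="${ranges}B"
  if [ "$page" -lt "$total" ]; then
    ranges="$ranges A$((page + 1))-end"
  fi
  printf '%s\n' "$ranges"
}

# redact_page IN.pdf PAGE TOTAL_PAGES OUT.pdf
# Intermediate files are written to the current directory.
redact_page() {
  local in="$1" page="$2" total="$3" out="$4"
  # 1. Snip the offending page out of the original.
  run pdftk "$in" cat "$page" output page.pdf
  # 2. Rasterize it, so the blackout and the text under it become pixels.
  run convert -density 300 page.pdf -background white -alpha remove page.png
  # 3. OCR the image back into a searchable PDF (writes page_ocr.pdf).
  run tesseract page.png page_ocr pdf
  # 4. Refit the OCR'd page to a fixed paper size (letter is an assumption).
  run gs -o page_fixed.pdf -sDEVICE=pdfwrite -sPAPERSIZE=letter \
    -dFIXEDMEDIA -dPDFFitPage page_ocr.pdf
  # 5. Paste the cleaned page back in; the ranges are left unquoted on
  #    purpose so they split into separate arguments.
  run pdftk A="$in" B=page_fixed.pdf cat $(splice_ranges "$page" "$total") output "$out"
}
```

Usage, with hypothetical file names: `DRY_RUN=1 redact_page thesis.pdf 3 10 clean.pdf` prints the five commands for page 3 of a 10-page file. The page count can be read from `pdftk thesis.pdf dump_data`, which reports a `NumberOfPages` line.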

Takeaway

• If you find yourself doing a very bad job of learning PHP and feeling like you have something to prove: it doesn’t have to be this way

• There is a huge amount of useful space you can occupy as a barely-programmer if you’re comfortable using a terminal for problem solving (less so on Windows). StackOverflow and Google are your friends.

• Open-source command line tools are really good these days! They are powerful, they are straightforward, and they are often cutting edge.


Surprise: Everybody gets a free colonoscopy after all!

• Thanks! [email protected] ; @axfelix