Apache Tika API Usage Examples


  • 8/18/2019 Apache Tika API Usage Examples


    4/19/2016 Apache Tika Tika API Usage Examples

    The Apache Software Foundation

    Apache Tika API Usage Examples

    This page provides a number of examples on how to use the various Tika APIs. All of the examples shown are also available in the Tika Example module in SVN.

    • Apache Tika API Usage Examples
      o Parsing
        ° Parsing using the Tika Facade
        ° Parsing using the Auto-Detect Parser
      o Picking different output formats
        ° Parsing to Plain Text
        ° Parsing to XHTML
        ° Fetching just certain bits of the XHTML
      o Custom Content Handlers
        ° Extract Phone Numbers from Content into the Metadata
        ° Streaming the plain text in chunks
      o Translation
        ° Translation using the Microsoft Translation API
      o Language Identification

    Parsing

    Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.

    Parsing using the Tika Facade

    The Tika facade provides a number of very quick and easy ways to have your content parsed by Tika, and return the resulting plain text.

    public String parseToStringExample() throws IOException, SAXException, TikaException {
        // Tika is org.apache.tika.Tika; TikaException is org.apache.tika.exception.TikaException
        Tika tika = new Tika();
        // try-with-resources closes the stream once parsing has finished
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
            return tika.parseToString(stream);
        }
    }

    Parsing using the Auto-Detect Parser

    For more control, you can call the Tika Parsers directly. Most likely, you'll want to start out using the Auto-Detect Parser, which automatically figures out what kind of content you have, then calls the appropriate parser for you.

    public String parseExample() throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // BodyContentHandler (org.apache.tika.sax) collects the text of the document body
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }

    Picking different output formats

    With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, HTML, XHTML, the XHTML of one part of the file, etc. This is controlled based on the ContentHandler you supply to the Parser.
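    To see why the handler alone decides the output format, it can help to look at plain SAX without Tika. The following is a minimal sketch using only the JDK's SAX parser; the class names HandlerFormatSketch, TextHandler and MarkupHandler are hypothetical and are not Tika classes. The same stream of SAX events produces plain text with one handler and markup with the other. The markup handler skips attributes and character escaping to stay short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of Tika.
class HandlerFormatSketch {

    /** Keeps only character data, so the result is plain text. */
    public static class TextHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Writes element tags as well as text, so the result is markup. */
    public static class MarkupHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append('>'); // attributes omitted in this sketch
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length); // no escaping in this sketch
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Parses the given XHTML with the JDK SAX parser and returns what the handler collected. */
    public static String render(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XHTML", e);
        }
    }
}
```

    Calling render with the same input and a different handler is the only change needed to switch formats, which mirrors how the Tika examples in this section differ only in the handler they construct.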

    Parsing to Plain Text

    By using the BodyContentHandler, you can request that Tika return only the content of the document's body as a plain-text string.

    public String parseToPlainText() throws IOException, SAXException, TikaException {
        // The no-argument BodyContentHandler accumulates the body as plain text
        BodyContentHandler handler = new BodyContentHandler();

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }

    Parsing to XHTML

    By using the ToXMLContentHandler, you can get the XHTML content of the whole document as a string.

    public String parseToHTML() throws IOException, SAXException, TikaException {
        // ToXMLContentHandler (org.apache.tika.sax) serialises the full XHTML document
        ContentHandler handler = new ToXMLContentHandler();

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }

    If you just want the body of the XHTML document, without the header, you can chain together a BodyContentHandler and a ToXMLContentHandler as shown:

    public String parseBodyToHTML() throws IOException, SAXException, TikaException {
        // The outer handler passes only the body events on to the XML serialiser
        ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }

    Fetching just certain bits of the XHTML

    It is possible to execute XPath queries on the parse results, to fetch only certain bits of the XHTML.

    public String parseOnePartToHTML() throws IOException, SAXException, TikaException {
        // Only get things under html -> body -> div (class=header)
        // XPathParser, Matcher and MatchingContentHandler live in org.apache.tika.sax.xpath
        XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
        Matcher divContentMatcher = xhtmlParser.parse(
                "/xhtml:html/xhtml:body/xhtml:div/descendant::node()");
        ContentHandler handler = new MatchingContentHandler(
                new ToXMLContentHandler(), divContentMatcher);

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }
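
    To get a feel for what such a path selects, the same kind of expression can be tried against an XHTML string with the XPath engine that ships with the JDK. This is only a sketch: it uses javax.xml.xpath on an in-memory DOM rather than Tika's streaming XPathParser, it selects descendant::text() instead of descendant::node() so that the result is easy to print, and the class name XhtmlXPathSketch and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class, not part of Tika.
class XhtmlXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Returns the concatenated text found anywhere below html/body/div. */
    public static String textUnderBodyDivs(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so the xhtml: prefix can match
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "xhtml" prefix used in the expression to the XHTML namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "xhtml" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                            ? Collections.singletonList("xhtml").iterator()
                            : Collections.<String>emptyList().iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(
                    "/xhtml:html/xhtml:body/xhtml:div/descendant::text()",
                    doc, XPathConstants.NODESET);

            StringBuilder text = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                text.append(nodes.item(i).getNodeValue());
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the XPath expression", e);
        }
    }
}
```

    Content outside the div, such as the document title or sibling paragraphs, is left out of the result, which is the same filtering effect the MatchingContentHandler achieves while the document is being parsed.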

    Custom Content Handlers

    The textual output of parsing a file with Tika is returned via the SAX ContentHandler you pass to the parse method. It is possible to customise your parsing by supplying your own ContentHandler which does special things.

    Extract Phone Numbers from Content into the Metadata

    By using the PhoneExtractingContentHandler, you can have any phone numbers found in the textual content of the document extracted and placed into the Metadata object for you.
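
    The general idea behind a handler like this can be sketched with a plain SAX DefaultHandler from the JDK. The class below is hypothetical and is not how PhoneExtractingContentHandler is implemented: it uses a deliberately simple pattern (three digits, three digits, four digits) and stores matches in a list instead of a Tika Metadata object. It buffers all text until the end of the document, because SAX may deliver a single number split across several characters() calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of Tika.
class PhoneCollectingHandlerSketch extends DefaultHandler {
    // Assumed, simplified format: NNN-NNN-NNNN with "-", "." or space as separators.
    private static final Pattern PHONE =
            Pattern.compile("\\b\\d{3}[-. ]\\d{3}[-. ]\\d{4}\\b");

    private final StringBuilder text = new StringBuilder();
    private final List<String> phoneNumbers = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only buffer here; a number may span more than one callback.
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // Scan the complete text once parsing has finished.
        Matcher matcher = PHONE.matcher(text);
        while (matcher.find()) {
            phoneNumbers.add(matcher.group());
        }
    }

    public List<String> getPhoneNumbers() {
        return phoneNumbers;
    }

    /** Feeds the pieces to a fresh handler as separate SAX text events. */
    public static List<String> extract(String... pieces) {
        PhoneCollectingHandlerSketch handler = new PhoneCollectingHandlerSketch();
        for (String piece : pieces) {
            char[] chars = piece.toCharArray();
            handler.characters(chars, 0, chars.length);
        }
        handler.endDocument();
        return handler.getPhoneNumbers();
    }
}
```

    For example, extract("Call 555-12", "3-4567 or 555.987.6543 today") finds both numbers even though the first one arrives in two pieces.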

    Streaming the plain text in chunks

    Sometimes, you want to chunk the resulting text up, perhaps to output as you go, minimising memory use, perhaps to output to HDFS files, or for any other reason. With a small custom content handler, you can do that.

    public List<String> parseToPlainTextChunks() throws IOException, SAXException, TikaException {
        final List<String> chunks = new ArrayList<>();
        chunks.add("");
        // MAXIMUM_TEXT_CHUNK_SIZE is a constant defined elsewhere in the example class
        ContentHandlerDecorator handler = new ContentHandlerDecorator() {
            @Override
            public void characters(char[] ch, int start, int length) {
                String lastChunk = chunks.get(chunks.size() - 1);
                String thisStr = new String(ch, start, length);

                if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                    // The current chunk is full, so start a new one
                    chunks.add(thisStr);
                } else {
                    // Still room, so append to the current chunk
                    chunks.set(chunks.size() - 1, lastChunk + thisStr);
                }
            }
        };

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
            parser.parse(stream, handler, metadata);
            return chunks;
        }
    }

    Translation

    Tika provides a pluggable Translation system, which allows you to send the results of parsing off to an external system or program to have the text translated into another language.

    Translation using the Microsoft Translation API

    In order to use the Microsoft Translation API, you need to sign up for a Microsoft account, get an API key, then pass the key to Tika before translating.

    public String microsoftTranslateToFrench(String text) {
        // MicrosoftTranslator is in org.apache.tika.language.translate
        MicrosoftTranslator translator = new MicrosoftTranslator();
        // Change the id and secret! See http://msdn.microsoft.com/en-us/library/hh454950.aspx
        translator.setId("dummy-id");
        translator.setSecret("dummy-secret");
        try {
            return translator.translate(text, "fr");
        } catch (Exception e) {
            return "Error while translating.";
        }
    }

    Language Identification

    Tika provides support for identifying the language of text, through the LanguageIdentifier class.


    public String identifyLanguage(String text) {
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        return identifier.getLanguage();
    }

    Copyright © 2016 The Apache Software Foundation. Site powered by Apache Maven. Search powered by Lucid Imagination and Sematext. Apache Tika, Tika, Apache, the Apache feather logo, and the Apache Tika project logo are trademarks of The Apache Software Foundation.

    https://tika apache org /1 8 /exam pies htm I 6/6