Transcript
Page 1: Mime Magic With Apache Tika

MIME Magic with Apache Tika

Jukka Zitting

Tika committer and mentor

Page 2: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Page 3: Mime Magic With Apache Tika

The Problem

PDFBox

Apache POI

Apache Xerces

ICU4J

NekoHTML

etc.

Lucene index

Page 4: Mime Magic With Apache Tika

It's even worse!

Licensing/Patents

Dependencies

Metadata extraction

Structured content

Encryption/Compression

Package formats

Streaming/Performance

Processing of digital media

Page 5: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Page 6: Mime Magic With Apache Tika

The Solution: Technical

• Generic API for extracting metadata and structured text content from a document
  – Input: byte stream + optional metadata
  – Output: XHTML SAX events + metadata

• Automatic content type detection
  – Magic bytes
  – File name patterns
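
The two detection hints listed above can be illustrated with a minimal sketch in plain Java. This is not Tika's detector: the class name DetectionSketch, the handful of signatures and extensions, and the rule that magic bytes win over the file name are all assumptions made for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration only; not Tika's detector implementation.
class DetectionSketch {

    // Two well-known magic headers (file signatures).
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /** Returns true if data begins with the given magic bytes. */
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
            && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Guesses a media type from the leading bytes first, then from the file name. */
    static String detect(byte[] prefix, String fileName) {
        // 1. Magic bytes (assumed here to take precedence over the name).
        if (startsWith(prefix, PDF_MAGIC)) return "application/pdf";
        if (startsWith(prefix, ZIP_MAGIC)) return "application/zip";
        // 2. File name patterns, the equivalent of globs such as *.html.
        if (fileName != null) {
            String name = fileName.toLowerCase(Locale.ROOT);
            if (name.endsWith(".html") || name.endsWith(".htm")) return "text/html";
            if (name.endsWith(".txt")) return "text/plain";
            if (name.endsWith(".pdf")) return "application/pdf";
        }
        // 3. Generic fallback when no hint matches.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "unknown.bin"));
        System.out.println(detect(new byte[0], "index.html"));
    }
}
```

Checking the content before the name means a mislabeled file is still recognized when its signature is known, while the name remains useful for formats that have no distinctive header.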

Page 7: Mime Magic With Apache Tika

The Solution: Legal / Social

• Apache License
  – (L)GPL projects can implement the Tika API

• Pooling of efforts
  – Active development and maintenance
  – Already beyond the functionality of most custom solutions

Page 8: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Page 9: Mime Magic With Apache Tika

Project Status

• Incubating since March 2007

• Sponsoring PMC: Apache Lucene

• First release (0.1-incubating) in December 2007

• Interaction with PDFBox, POI, etc.

• Currently in early adopter phase

Page 10: Mime Magic With Apache Tika

Current Features

• 73 registered media types
  – 167 glob patterns
  – 26 magic header patterns

• 7 built-in parser classes
  – 51 supported media types
  – MS Office, OpenOffice, HTML, PDF, XML, RTF, plain text

Page 11: Mime Magic With Apache Tika

Project Statistics

Page 12: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Page 13: Mime Magic With Apache Tika

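Tika Parser API

package org.apache.tika.parser;

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public interface Parser {

    // Parses document content and metadata
    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;

    // Parses document metadata, @since Tika 0.2
    void parse(InputStream stream, Metadata metadata)
        throws IOException, TikaException;

}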
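
To make the "byte stream in, XHTML SAX events out" contract more concrete, here is a rough sketch of what a parser for plain text might emit. It deliberately avoids any Tika dependency: PlainTextParserSketch, its two-argument parse method, the UTF-8 assumption, and the one-paragraph-per-line rule are all hypothetical, and metadata handling is left out.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; it does not implement the Tika Parser interface.
class PlainTextParserSketch {

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Reads UTF-8 text and reports each non-empty line as an XHTML paragraph. */
    static void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(stream, StandardCharsets.UTF_8));
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) continue; // skip blank lines
            char[] chars = line.toCharArray();
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Renders the emitted events as a compact trace, for demonstration. */
    static String trace(String text) throws IOException, SAXException {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(local).append('>');
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(local).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), handler);
        return out.toString();
    }
}
```

Because the output is a stream of events rather than a finished string, the caller decides what to do with it: write plain text, build a DOM, or feed an indexer.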

Page 14: Mime Magic With Apache Tika

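Example: Text extraction

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.ContentHandler;

// The class name is illustrative; the slide shows only main()
public class TextExtraction {

    public static void main(String[] args) throws Exception {
        // Read the document from standard input
        InputStream stream = System.in;
        // Write the extracted text to standard output
        ContentHandler handler = new WriteOutContentHandler(System.out);
        Metadata metadata = new Metadata();
        // Detect the document type and delegate to the matching parser
        new AutoDetectParser().parse(stream, handler, metadata);
    }
}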
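
The handler passed to parse() is an ordinary SAX ContentHandler, so a caller can substitute its own. The sketch below collects the text into a string instead of writing it out. To stay free of a Tika dependency, the JDK's SAX parser stands in for a Tika parser and is fed an XHTML string directly; TextCollectorSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not a Tika class.
class TextCollectorSketch extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character events carry the document's text content.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // End each paragraph with a newline so paragraphs stay separated.
        if ("p".equals(qName)) {
            text.append('\n');
        }
    }

    String text() {
        return text.toString();
    }

    /** Feeds an XHTML string through the JDK SAX parser and returns the collected text. */
    static String extract(String xhtml) throws Exception {
        TextCollectorSketch handler = new TextCollectorSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extract("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

The same handler shape would work for any source of XHTML SAX events, which is the point of using SAX as the common output format.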

Page 15: Mime Magic With Apache Tika

Demo: Tika GUI

Page 16: Mime Magic With Apache Tika

Agenda

The Problem

The Solution

The Project

The Client

Thank You!

