Metadata Extraction and Content Transformations. Nick Burch, Software Engineer, Alfresco. twitter: @gagravarr

Metadata Extraction and Content Transformation

DESCRIPTION

In this session, we will look first at the rich metadata that documents in your repository have, how to control the mapping of this on to your content model, and some of the interesting things this can deliver. We'll then move on to the content transformation and rendition services, and see how you can easily and powerfully generate a wide range of media from the content you already have.

Page 1: Metadata Extraction and Content Transformation

Metadata Extraction and Content Transformations
Nick Burch
Software Engineer, Alfresco

twitter: @gagravarr

Page 2: Metadata Extraction and Content Transformation

Introduction – 3 Content Related Services

Covering

• Uses
• Interfaces
• Calling the Services
• Java & JavaScript APIs
• Demos
• Extensions
• Apache Tika

• Metadata Extractor

• Content Transformer

• Renditions

Page 3: Metadata Extraction and Content Transformation

The Metadata Extractor Service

What, How, Why?

• For a given piece of content, returns the metadata held within it
• Document metadata is converted into the content model
• Typically used with uploaded binary files
• Example: upload a PDF, extract the Title and Description, and save them as properties on the Alfresco Node
• Powered internally by a number of different extractors
• Service picks the appropriate extractor for you
• Since Alfresco 3.4, makes heavy use of Apache Tika
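The "service picks the appropriate extractor for you" point can be pictured as a lookup keyed on MIME type. The following is only a sketch of that idea in plain Java: the class `ExtractorRegistrySketch`, its methods and the toy text extractor are hypothetical names invented for illustration, and none of this is the Alfresco API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for an extractor registry; NOT the Alfresco API.
class ExtractorRegistrySketch {
    // One extractor per MIME type; each turns raw bytes into raw metadata.
    private final Map<String, Function<byte[], Map<String, String>>> extractors = new HashMap<>();

    void register(String mimetype, Function<byte[], Map<String, String>> extractor) {
        extractors.put(mimetype, extractor);
    }

    // Picks the extractor registered for the MIME type; empty map if there is none.
    Map<String, String> extract(String mimetype, byte[] content) {
        Function<byte[], Map<String, String>> extractor = extractors.get(mimetype);
        return extractor == null ? Collections.<String, String>emptyMap() : extractor.apply(content);
    }

    public static void main(String[] args) {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch();
        // Toy extractor: treats the first line of a text file as its title.
        registry.register("text/plain", bytes -> {
            Map<String, String> raw = new LinkedHashMap<>();
            raw.put("title", new String(bytes, StandardCharsets.UTF_8).split("\n", 2)[0]);
            return raw;
        });
        byte[] upload = "Quarterly Report\nbody text".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("text/plain", upload));
        System.out.println(registry.extract("application/pdf", upload));
    }
}
```

The caller only supplies a MIME type and the content; which extractor runs is decided by the registry, which is the behaviour the bullets describe.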

Page 4: Metadata Extraction and Content Transformation

The Content Transformation Service

What, How, Why?

• Transforms content from one format to another
• Driven by mime types, source and destination
• Used to generate plain text versions for indexing
• Used to generate SWF versions for preview
• Used to generate PDF versions for web download
• Powered by a large number of different transformers
• Transformers can be linked together, eg .doc -> .pdf via Open Office, then .pdf -> .swf via pdf2swf
• Since Alfresco 3.4, makes heavy use of Apache Tika
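Linking transformers together amounts to finding a path between two MIME types. As a rough illustration only (the slides do not say how Alfresco actually selects a chain), here is a breadth-first search over a table of single-step transformations. `TransformerChainSketch` and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not how Alfresco is known to pick its transformers.
class TransformerChainSketch {
    // Source MIME type -> target MIME types reachable in a single step.
    private final Map<String, List<String>> direct = new LinkedHashMap<>();

    void addTransformer(String source, String target) {
        direct.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
    }

    // Breadth-first search: returns the MIME types visited from source to
    // target (inclusive), or an empty list when no chain exists.
    List<String> findChain(String source, String target) {
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(source, null);
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                List<String> chain = new ArrayList<>();
                for (String step = target; step != null; step = cameFrom.get(step)) {
                    chain.add(step);
                }
                Collections.reverse(chain);
                return chain;
            }
            for (String next : direct.getOrDefault(current, Collections.<String>emptyList())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        TransformerChainSketch transformers = new TransformerChainSketch();
        transformers.addTransformer("application/msword", "application/pdf");
        transformers.addTransformer("application/pdf", "application/x-shockwave-flash");
        transformers.addTransformer("application/pdf", "text/plain");
        System.out.println(transformers.findChain("application/msword", "application/x-shockwave-flash"));
    }
}
```

Breadth-first search returns a chain with the fewest steps, which is a reasonable default when every extra conversion costs time and fidelity.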

Page 5: Metadata Extraction and Content Transformation

The Rendition Service

What, How, Why?

• Can turn content from one kind to another
• Or can just alter some content as-is
• Used to manipulate images, eg crop and resize
• Used to generate HTML .docx previews in Web Quick Start
• Often uses the Content Transformation Service
• Replaced the Thumbnail Service
• Renditions are actions

Page 6: Metadata Extraction and Content Transformation

Apache Tika

Apache Tika – http://tika.apache.org/

• Apache Project which started in 2006
• Grew out of the Lucene community, now widely used
• Provides detection of files – eg this binary blob is really a Word file
• Plain text, HTML and XHTML versions of a wide range of different file formats
• Consistent Metadata from different files
• Tika hides the complexity of the different formats, and presents a simple, powerful API
• Easy to use and extend
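Detection of the "this binary blob is really a ..." kind usually starts from the first few bytes of the file. The sketch below shows only that first step, using three well-known signatures. `MagicDetectSketch` is a hypothetical class and the label returned for OLE2 files is illustrative; it is not Tika's detector.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical magic-byte check; Tika's real detectors do considerably more.
class MagicDetectSketch {
    // True when data begins with the given byte values (each 0..255).
    static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    static String detect(byte[] data) {
        if (startsWith(data, '%', 'P', 'D', 'F')) {
            return "application/pdf";
        }
        // OLE2 container: shared by binary Word, Excel, PowerPoint and others.
        if (startsWith(data, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return "application/x-ole-storage"; // illustrative label
        }
        // ZIP container: also the outer wrapper of OOXML and ODF files.
        if (startsWith(data, 'P', 'K', 0x03, 0x04)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(detect("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(detect(new byte[] {'P', 'K', 3, 4, 0, 0}));
        System.out.println(detect(new byte[] {1, 2, 3}));
    }
}
```

Magic bytes alone cannot tell a Word file from an Excel file, since both share the OLE2 signature; that distinction needs a look inside the container, which this sketch does not attempt.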

Page 7: Metadata Extraction and Content Transformation

Metadata Extractor Service

Page 8: Metadata Extraction and Content Transformation

Alfresco 3.3 - Supported Formats

File Formats supported out of the box

• PDF
• Word, PowerPoint, Excel
• HTML
• Open Document Formats (OpenOffice)
• RFC822 Email
• Outlook .msg Email

Page 9: Metadata Extraction and Content Transformation

Alfresco 3.4 - Supported Formats – Page 1

File Formats supported out of the box, Page 1

• Audio – WAV, RIFF, MIDI
• DWG (CAD)
• Epub
• RSS and ATOM Feeds
• True Type Fonts
• HTML
• Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
• iWork (Keynote, Pages etc)
• RFC822 mbox Mail

Page 10: Metadata Extraction and Content Transformation

Alfresco 3.4 - Supported Formats – Page 2

File Formats supported out of the box, Page 2

• Microsoft Outlook .msg Email
• Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
• Microsoft Office (OOXML) – Word, PowerPoint, Excel
• MP3 (id3 v1 and v2)
• CDF (Scientific Data)
• Open Document Format (Open Office)
• Old-style Open Office (.sxw etc)
• PDF

Page 11: Metadata Extraction and Content Transformation

Alfresco 3.4 - Supported Formats – Page 3

File Formats supported out of the box, Page 3

• Zip and Tar archives
• RDF
• Plain Text
• FLV Video
• XML
• Java class files

And I probably forgot one...!

Page 12: Metadata Extraction and Content Transformation

Calling Apache Tika

// Get a content detector, and an auto-selecting Parser
TikaConfig config = TikaConfig.getDefaultConfig();
ContainerAwareDetector detector = new ContainerAwareDetector(
    config.getMimeRepository()
);
Parser parser = new AutoDetectParser(detector);

// We'll only want the plain text contents
ContentHandler handler = new BodyContentHandler();

// Tell the parser what we have
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

// Have it processed
parser.parse(input, handler, metadata, new ParseContext());

Page 13: Metadata Extraction and Content Transformation

Metadata Extractor – Java Use

// Note: the Alfresco classes are spelled "Extracter"
MetadataExtracterRegistry registry = (MetadataExtracterRegistry)
    context.getBean("metadataExtracterRegistry");

MetadataExtracter extractor = registry.getExtracter("application/vnd.ms-excel");

Map<QName, Serializable> properties = new HashMap<QName, Serializable>();

ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);

// Extract into the map, then dump what was found
extractor.extract(reader, properties);
System.err.println(properties);

Page 14: Metadata Extraction and Content Transformation

Metadata Extractor – JavaScript Use

JavaScript

var action = actions.create("extract-metadata");

action.execute(document);

• Full access is not directly available

• You can’t get at the raw properties

• You can, however, trigger extraction and saving to the node easily

• Available via an action

Page 15: Metadata Extraction and Content Transformation

Metadata Extractor – Geo Content Model

<aspect name="cm:geographic">
  <title>Geographic</title>
  <properties>
    <property name="cm:latitude">
      <title>Latitude</title>
      <type>d:double</type>
    </property>
    <property name="cm:longitude">
      <title>Longitude</title>
      <type>d:double</type>
    </property>
  </properties>
</aspect>

Page 16: Metadata Extraction and Content Transformation

Metadata Extractor – Geo Mapping

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

# Geo Mappings
geo\:lat=cm:latitude
geo\:long=cm:longitude

# Normal Mappings
author=cm:author
title=cm:title
description=cm:description
created=cm:created
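To see how such a mapping file could be applied, here is a small sketch that loads it with `java.util.Properties` and renames raw metadata keys to fully qualified content-model names. `MappingSketch` and `apply` are hypothetical names and the `{uri}localName` output form is an assumption made for illustration; Alfresco's own extractor code is not shown in these slides.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch of applying a mapping file; not Alfresco's implementation.
class MappingSketch {
    static final String NS_PREFIX = "namespace.prefix.";

    // Returns "{namespaceUri}localName" -> value for every raw key that has a mapping.
    static Map<String, String> apply(String mappingText, Map<String, String> raw) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> mapped = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            if (entry.getKey().startsWith(NS_PREFIX)) {
                continue; // namespace declarations are not mappings
            }
            String target = props.getProperty(entry.getKey());
            if (target == null) {
                continue; // raw metadata without a mapping is dropped
            }
            int colon = target.indexOf(':');
            if (colon < 0) {
                continue; // target must look like prefix:localName
            }
            String uri = props.getProperty(NS_PREFIX + target.substring(0, colon));
            if (uri != null) {
                mapped.put("{" + uri + "}" + target.substring(colon + 1), entry.getValue());
            }
        }
        return mapped;
    }

    public static void main(String[] args) {
        String mapping =
            "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
            + "geo\\:lat=cm:latitude\n"
            + "geo\\:long=cm:longitude\n"
            + "title=cm:title\n";
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("geo:lat", "51.5");
        raw.put("geo:long", "-0.12");
        raw.put("title", "Holiday photo");
        raw.put("Compression", "JPEG"); // no mapping, so it is dropped
        apply(mapping, raw).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The backslash in `geo\:lat` matters: in the properties format an unescaped colon ends the key, so without the escape the key would be read as just `geo`.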

Page 17: Metadata Extraction and Content Transformation

Demo: Geo Tagged Image in Share

Page 18: Metadata Extraction and Content Transformation

Content Transformation Service

Page 19: Metadata Extraction and Content Transformation

Supported Transformations

Transformations Supported in Alfresco v3.4

• Plain Text, HTML & XHTML for all Apache Tika supported text and document formats (around 30 file formats)
• PDF to Image
• PDF to SWF (for preview)
• Office File Formats to PDF (via Open Office, using JODConverter in Enterprise)
• Plain Text and XML to PDF
• Zip listing to Text
• Image to other Images (via ImageMagick)

Page 20: Metadata Extraction and Content Transformation

Content Transformer and Tika

Handlers

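// Option 1: plain text, captured by the Body Content Handler
ContentHandler handler = new BodyContentHandler();
// ... parse with this handler, then:
String text = handler.toString();

// Option 2 (instead of the above): XHTML, serialised from the SAX events
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter sw = new StringWriter();
handler.setResult(new StreamResult(sw));
// ... parse with this handler, then:
String text = sw.toString();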

• Tika generates HTML-like SAX events as it parses
• Uses Java SAX API
• Events can be captured or transformed
• Body Content Handler used for plain text
• HTML and XHTML available
• Can customise with your own handler, with XSLT or with E4X from JavaScript
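To make the "events can be captured" point concrete, here is a small handler written against the JDK's own SAX API that keeps only the text inside `<body>`. It is a sketch of the idea behind a body-text handler, not Tika's `BodyContentHandler`. `BodyTextSketch` is a hypothetical class, and the JDK XML parser stands in for Tika as the source of events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data found inside <body>.
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // 0 while outside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Feeds an XHTML string through the JDK SAX parser and returns the body text.
    static String bodyText(String xhtml) {
        BodyTextSketch handler = new BodyTextSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
            + "<body><h1>Hello</h1><p>World</p></body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

A custom handler of this shape is one way to customise the output: the parser keeps firing the same events, and the handler decides what to keep.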

Page 21: Metadata Extraction and Content Transformation

Content Transformer – Java Use

ContentTransformerRegistry registry = (ContentTransformerRegistry)
    context.getBean("contentTransformerRegistry");

ContentTransformer transformer = registry.getTransformer(
    "application/vnd.ms-excel", "text/csv", new TransformationOptions());

ContentReader reader = contentService.getReader(sourceNodeRef, ContentModel.PROP_CONTENT);

// The destination needs a writer (getWriter, not getReader); true = update the node
ContentWriter writer = contentService.getWriter(destNodeRef, ContentModel.PROP_CONTENT, true);
// Tell the writer which format it will receive
writer.setMimetype("text/csv");

transformer.transform(reader, writer);

Page 22: Metadata Extraction and Content Transformation

Content Transformer – JavaScript Use

JavaScript

var action = actions.create("transform");

// Transform into the same folder

action.parameters["destination-folder"] = document.parent;

action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";

action.parameters["assoc-name"] = document.name + "transformed";

action.parameters["mime-type"] = "text/html";

// Execute

action.execute(document);

• Full access is not directly available

• You can’t control which property is transformed, it’s always Content

• You can control where the transformed version goes

• Triggering the transformation is easier than in Java

• Available via an action

Page 23: Metadata Extraction and Content Transformation

Custom Tika Parsers - Interface

Interface

public interface Parser {
    Set<MediaType> getSupportedTypes(ParseContext context);

    void parse(InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
}

• The Tika Parser interface is quite simple

• Need to provide a list of supported mime types, so that auto-detection can work

• Accept an input stream, populate the Metadata object, and fire SAX events to the supplied handler

• That’s it!

Page 24: Metadata Extraction and Content Transformation

Custom Tika Parser – Hello World Parser

public class HelloWorldParser implements Parser {
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.parse("hello/world"));
        return types;
    }

    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.startElement("h1");
        xhtml.characters("Hello, World!");
        xhtml.endElement("h1");
        xhtml.endDocument();

        metadata.set("hello", "world");
        metadata.set("title", "Hello World!");
    }
}

Page 25: Metadata Extraction and Content Transformation

Custom Command Line Transformer

<bean id="transformer.worker.helloWorldCMD"
      class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
  <property name="mimetypeService"><ref bean="mimetypeService"/></property>
  <property name="transformCommand">
    <bean class="org.alfresco.util.exec.RuntimeExec">
      <property name="commandsAndArguments"><map>
        <entry key=".*"><list>
          <value>/bin/bash</value>
          <value>-c</value>
          <value>/bin/echo 'Hello World - ${source}' &gt; ${target}</value>
        </list></entry>
      </map></property>
      <property name="errorCodes"><value>1,127</value></property>
    </bean>
  </property>
  <property name="explicitTransformations">
    <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
      <property name="sourceMimetype"><value>text/plain</value></property>
      <property name="targetMimetype"><value>hello/world</value></property>
    </bean></list>
  </property>
</bean>

<bean id="transformer.helloWorldCMD"
      class="org.alfresco.repo.content.transform.ProxyContentTransformer"
      parent="baseContentTransformer">
  <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
</bean>

Page 26: Metadata Extraction and Content Transformation

Custom Transformer – Demo

JS Code

var action = actions.create("transform");
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "HW";

if (document.mimetype == "hello/world") {
  action.parameters["mime-type"] = "text/plain";
} else {
  action.parameters["mime-type"] = "hello/world";
}

action.execute(document);

• Use our Command Line transformer to generate a “hello/world” version

• Use our Tika transformer to turn this back into plain text

• Uses the JavaScript API to access the content transformation service

Page 27: Metadata Extraction and Content Transformation

Demo 2: Excel to Plain Text, CSV and HTML

Page 28: Metadata Extraction and Content Transformation

Rendition Service

Page 29: Metadata Extraction and Content Transformation

Standard Rendition Engines

Renditions Supported in Alfresco v3.4

• reformat – access to the Content Transformation Service
• image – crop, resize, etc
• freemarker – runs a Freemarker Template against the content of the node
• html – turns .docx files into clean HTML + images
• xslt – runs an XSLT Transformation against the content of the node (XML content nodes only!)
• composite – executes several renditions in series, eg reformat followed by image crop

Page 30: Metadata Extraction and Content Transformation

Persisted vs Transient Definitions

For your more complicated renditions

• To run a rendition, first create a rendition definition for a given rendering engine
• Then set all the parameters against it
• Finally execute it against a node

• For very complicated / common renditions, you can save the definition to the data dictionary
• It can then be retrieved and run
• Rendition Service provides support to create, load, save and execute definitions
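The create, parameterise, save, load and execute cycle can be modelled in a few lines. This sketch uses strings as "content" and an in-memory map in place of the data dictionary. `RenditionSketch`, `Engine`, `Definition` and the two toy engines are hypothetical names chosen for illustration and are not the Rendition Service API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of rendition definitions; NOT the Alfresco Rendition Service.
class RenditionSketch {
    // A rendering engine turns content into new content, guided by parameters.
    interface Engine {
        String render(String content, Map<String, Object> parameters);
    }

    // A definition pairs an engine with the parameters to run it with.
    static final class Definition {
        final String name;
        final Engine engine;
        final Map<String, Object> parameters = new LinkedHashMap<>();

        Definition(String name, Engine engine) {
            this.name = name;
            this.engine = engine;
        }

        String execute(String content) {
            return engine.render(content, parameters);
        }
    }

    // In-memory stand-in for persisting definitions to the data dictionary.
    private final Map<String, Definition> saved = new LinkedHashMap<>();

    void save(Definition definition) {
        saved.put(definition.name, definition);
    }

    Definition load(String name) {
        return saved.get(name);
    }

    // Composite behaviour: each step receives the output of the previous one.
    static String executeInSeries(String content, List<Definition> steps) {
        String current = content;
        for (Definition step : steps) {
            current = step.execute(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Engine truncate = (content, params) ->
            content.substring(0, Math.min(content.length(), (Integer) params.get("length")));
        Engine upperCase = (content, params) -> content.toUpperCase(Locale.ROOT);

        // Persisted definition: created once, saved, loaded again when needed.
        Definition firstFive = new Definition("firstFive", truncate);
        firstFive.parameters.put("length", 5);
        RenditionSketch service = new RenditionSketch();
        service.save(firstFive);

        // Transient definition: built and used without saving.
        Definition shout = new Definition("shout", upperCase);

        System.out.println(executeInSeries("hello world",
            Arrays.asList(service.load("firstFive"), shout)));
    }
}
```

Keeping the parameters on the definition rather than on the engine is what makes a definition worth saving: the same engine can back many named, reusable renditions.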

Page 31: Metadata Extraction and Content Transformation

Rendition Service – Calling From Java

Load, Edit, Save, Run

// Retrieve the existing Rendition Definition
QName renditionName = QName.createQName(
    NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
RenditionDefinition renditionDef =
    renditionService.loadRenditionDefinition(renditionName);

// Make some changes.
renditionDef.setParameterValue(
    AbstractRenderingEngine.PARAM_MIME_TYPE, MimetypeMap.MIMETYPE_PDF);
renditionDef.setParameterValue(
    RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);

// Persist the changes.
renditionService.saveRenditionDefinition(renditionDef);

// Run the Rendition
ChildAssociationRef assoc = renditionService.render(sourceNode, renditionDef);

Page 32: Metadata Extraction and Content Transformation

Rendition Service – Calling From JavaScript

Create, Run, List

// Create
var renditionDef = renditionService.createRenditionDefinition("cm:cropResize", "imageRenderingEngine");
renditionDef.parameters["destination-path-template"] = "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xsize"] = 50;
renditionDef.parameters["ysize"] = 50;

// Run
renditionService.render(nodeRef, renditionDef);

// List
var renditions = renditionService.getRenditions(nodeRef);

Page 33: Metadata Extraction and Content Transformation

Rendition Service – More Calling Options

Actions, Rules, CMIS

• Renditions are Actions, but normally hidden ones
• They won't show up in Share when defining Rules, or in Explorer for running a Custom Action

• Solution – create a JS Script, or some custom Java
• Use this from your Rule / to run as an Action

• No dedicated REST API, but Renditions are available through CMIS
• More details available in the CMIS talks!

Page 34: Metadata Extraction and Content Transformation

Custom Rendition Engines

When a composite just isn’t enough

• Rendition Engines are a special kind of Action Executor
• This delivers lots of flexibility, and means anyone who can write Custom Actions already knows enough to write Custom Rendition Engines!
• org.alfresco.repo.rendition.executer.AbstractRenderingEngine provides a helpful superclass

• To learn more about Custom Actions and Custom Action Executors, see Neil McErlean's talk

Page 35: Metadata Extraction and Content Transformation

Demo 1: Crop and Resize an Image

(Using Share Rules)

Page 36: Metadata Extraction and Content Transformation

Demo 2: Video Rendition

Page 37: Metadata Extraction and Content Transformation

Demo 3: Word .docx -> HTML & Images

(Using Web Quick Start)

Page 38: Metadata Extraction and Content Transformation

Any Questions?

Page 39: Metadata Extraction and Content Transformation

Learn More
wiki.alfresco.com
forums.alfresco.com
blogs.alfresco.com/wp/nickb/
twitter: @AlfrescoECM @Gagravarr