Metadata Extractors, Content Transformers & Renditions Neil Mc Erlean


Page 1: Metadata Extractors, Content Transformers & Renditions

Metadata Extractors, Content Transformers & Renditions

Neil Mc Erlean

Page 2: Metadata Extractors, Content Transformers & Renditions

Who am I?

Lead Engineer in the Services Team

4 years at Alfresco (since 3.2)

Previously worked on
• Hybrid Sync
• Alfresco in the Cloud
• Various services/components
  • Transformers & Extractors
  • REST APIs
  • Actions & Behaviours and more…

Ex-astrophysicist (of which more later)

Page 3: Metadata Extractors, Content Transformers & Renditions

Talk content

What data is in your content?

How does Alfresco get at it?

What does Alfresco do with it?

How can you use these features?

Introductory material
• no prior knowledge assumed

Page 4: Metadata Extractors, Content Transformers & Renditions

Talk content - Breaking it down

Your content & its metadata

Alternative renditions of your content

Overviews of the 3 services

Java Foundation APIs. JavaScript.

Configuring & extending Alfresco.

All code samples available as runnable tests - download from the website.

Page 5: Metadata Extractors, Content Transformers & Renditions

#1 Metadata Extraction

Page 6: Metadata Extractors, Content Transformers & Renditions

#2 Content Transformation

Alfresco uses them to produce

• images (thumbnails)
• plain text (indexing)
• inter-Office transforms

Also generally useful

Page 7: Metadata Extractors, Content Transformers & Renditions

#3 Rendition Service

• Very similar to transformations

• More general service

• More than just content to content

Page 8: Metadata Extractors, Content Transformers & Renditions

How do these components work?

Mostly by leveraging existing OSS Java libs
• Notably Apache Tika

Some external OS processes too
• OpenOffice.org (OOo), LibreOffice
• ImageMagick
• pdf2swf (swftools)

Some bespoke impls, e.g. zip to txt

‘Embedded’ thumbnails/previews: iWork, Office

Page 9: Metadata Extractors, Content Transformers & Renditions

General Considerations

CPU, memory

In process vs. out of process vs. Remote CPU

Selection of ‘best’ extractor/transformer

Stay for Andy Hunt’s talk for Support’s troubleshooting tips

Page 10: Metadata Extractors, Content Transformers & Renditions

Metadata Extraction

Page 11: Metadata Extractors, Content Transformers & Renditions

#1 Metadata Extraction

• Triggered on content creation or update.
  • or on demand
• ‘Best’ available extractor obtained from MetadataExtracterRegistry.
• This extractor pulls out the metadata.
  • Format depends on the extractor lib/impl.
  • key/value pairs
• These data are mapped onto the Alfresco content model
  • configurable mapping.
  • <ExtractorClass>.properties

Page 12: Metadata Extractors, Content Transformers & Renditions

Metadata extraction - Java

MetadataExtracterRegistry registry =
    appContext.getBean("metadataExtracterRegistry", MetadataExtracterRegistry.class);

// Open the node's content
ContentReader reader =
    contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);

// Ask the registry for the 'best' extractor for this MIME type
MetadataExtracter extractor = registry.getExtracter(reader.getMimetype());

// Extracted values are written into this map
Map<QName, Serializable> props = new HashMap<QName, Serializable>();
extractor.extract(reader, OverwritePolicy.EAGER, props);

Page 13: Metadata Extractors, Content Transformers & Renditions

Overwrite Policy – when re-extracting

• EAGER
  • extracted value is not null
• PRUDENT
  • db property doesn’t exist or is null or “” (+ above)
• CAUTIOUS
  • existing property == undefined
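The three conditions above can be sketched as a small decision function. This is an illustration only: OverwritePolicySketch, shouldOverwrite and the String-keyed map are hypothetical names invented here, not Alfresco's own OverwritePolicy enum, and the sketch assumes "(+ above)" means PRUDENT also requires a non-null extracted value.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for the three policies above; not Alfresco's own enum. */
enum OverwritePolicySketch {
    EAGER, PRUDENT, CAUTIOUS;

    /** True if the newly extracted value should replace what is stored under key. */
    boolean shouldOverwrite(Map<String, Serializable> stored, String key, Serializable extracted) {
        switch (this) {
            case EAGER:
                // Any non-null extracted value wins.
                return extracted != null;
            case PRUDENT:
                // Non-null extracted value, and the stored value is missing, null or "".
                Serializable existing = stored.get(key);
                boolean blank = existing == null || "".equals(existing);
                return extracted != null && blank;
            case CAUTIOUS:
                // Only write when the property is not defined at all.
                return !stored.containsKey(key);
            default:
                throw new IllegalStateException("unknown policy " + this);
        }
    }
}

final class OverwritePolicyDemo {
    public static void main(String[] args) {
        Map<String, Serializable> stored = new HashMap<>();
        stored.put("cm:author", "Alice"); // already has a value
        stored.put("cm:title", "");       // present but empty

        for (OverwritePolicySketch policy : OverwritePolicySketch.values()) {
            System.out.println(policy
                + " author=" + policy.shouldOverwrite(stored, "cm:author", "Bob")
                + " title=" + policy.shouldOverwrite(stored, "cm:title", "Report"));
        }
    }
}
```

In this sketch EAGER replaces both properties, PRUDENT replaces only the empty title, and CAUTIOUS replaces neither because both keys are already defined.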

Page 14: Metadata Extractors, Content Transformers & Renditions

<ExtractorClass>.properties mapping

namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

author=cm:author

title=cm:title

# Note: ':' must be escaped in key names

geo\:lat=cm:latitude

geo\:long=cm:longitude
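To see why the ':' in a key needs escaping, the mapping can be loaded with the standard java.util.Properties parser, which treats an unescaped ':' as a key/value separator. The sketch below is a standalone illustration, not Alfresco's loader: MappingPropertiesSketch and resolve are hypothetical names, and the "{uri}localName" output form is simply a convenient way to show the resolved property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper: resolves "prefix:name" mapping targets to "{uri}name". */
final class MappingPropertiesSketch {
    private static final String PREFIX_KEY = "namespace.prefix.";

    static Map<String, String> resolve(String mappingText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // First pass: collect the declared namespace prefixes.
        Map<String, String> namespaces = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                namespaces.put(key.substring(PREFIX_KEY.length()), props.getProperty(key).trim());
            }
        }

        // Second pass: resolve each extracted key's target property.
        Map<String, String> resolved = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                continue;
            }
            String target = props.getProperty(key).trim();
            int colon = target.indexOf(':');
            String uri = colon < 0 ? null : namespaces.get(target.substring(0, colon));
            if (uri == null) {
                throw new IllegalArgumentException("Unresolvable mapping: " + key + "=" + target);
            }
            resolved.put(key, "{" + uri + "}" + target.substring(colon + 1));
        }
        return resolved;
    }

    public static void main(String[] args) {
        String mapping = String.join("\n",
            "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0",
            "author=cm:author",
            "geo\\:lat=cm:latitude"); // the file itself contains geo\:lat
        resolve(mapping).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Without the backslash, the parser would read "geo" as the key and "lat=cm:latitude" as the value, so the extracted key geo:lat would never be matched.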

Page 15: Metadata Extractors, Content Transformers & Renditions

Mapping properties

• Can map extracted key-value onto multiple content properties

• Can ignore extracted key-values i.e. not map.

Page 16: Metadata Extractors, Content Transformers & Renditions

Metadata extraction - JavaScript

var action = actions.create('extract-metadata');
action.execute(nodeRef);

Page 17: Metadata Extractors, Content Transformers & Renditions

Ways to customise & extend

• Customisation of existing extractors
  • Define new mappings – to an existing or a new content model.
• Adding new extractors
  • Identify 3rd party lib that can read the binary file
  • Or write your own code to do this
  • Extend AbstractMappingMetadataExtracter
  • Or write a Tika plugin
  • Define metadata mappings
• org.alfresco.repo.content.metadata

Page 18: Metadata Extractors, Content Transformers & Renditions

Recap

• Metadata extraction harvests ‘hidden’ data and maps it into Alfresco content model.

• Support for many MIME types

• Metadata insertion coming
  • it’s on HEAD but currently disabled
  • also maps metadata tags to cm:taggable

• “Best” extractor selection covered below

Page 19: Metadata Extractors, Content Transformers & Renditions

Content Transformers

Page 20: Metadata Extractors, Content Transformers & Renditions

Out of the box transformers

• text, html, xml
• Microsoft Office (doc & docx formats)
• OpenDocument Format
• iWork (Keynote, Pages, Numbers)
• Images
• Shockwave Flash (SWF)
• RFC822 email, Outlook .msg email
• Adobe PDF, Illustrator, PSD
• Electronic publication (epub)
• Rich Text (RTF)
• MP3
• Archives (ZIP, tar)
• Many more

Page 21: Metadata Extractors, Content Transformers & Renditions

Available transformers

• No ‘graph’ of transform paths/mime types

• Spring beans extend “baseContentTransformer”

• They implement isTransformable(from, to)

• They can be
  • simple (A to B)
  • ‘complex’ (A to C, via B)
  • failover (A to B, A to B…)
  • overlapping (multiple beans for same path)
  • dynamically un/available (e.g. OOo)
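The simple, ‘complex’ and failover shapes listed above can be sketched with plain Java objects. Everything here is hypothetical and deliberately tiny: SketchTransformer is not Alfresco's ContentTransformer interface, content is a String rather than a reader/writer pair, and MIME types are shortened to labels such as "doc" and "pdf".

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a transformer bean. */
interface SketchTransformer {
    boolean isTransformable(String from, String to);
    String transform(String content);
}

/** Simple: A to B. */
final class SimpleTransformer implements SketchTransformer {
    private final String from;
    private final String to;
    private final UnaryOperator<String> fn;

    SimpleTransformer(String from, String to, UnaryOperator<String> fn) {
        this.from = from;
        this.to = to;
        this.fn = fn;
    }

    public boolean isTransformable(String f, String t) {
        return from.equals(f) && to.equals(t);
    }

    public String transform(String content) {
        return fn.apply(content);
    }
}

/** 'Complex': A to C via B, by chaining two transformers. */
final class ComplexTransformer implements SketchTransformer {
    private final String from;
    private final String via;
    private final String to;
    private final SketchTransformer first;
    private final SketchTransformer second;

    ComplexTransformer(String from, String via, String to,
                       SketchTransformer first, SketchTransformer second) {
        this.from = from;
        this.via = via;
        this.to = to;
        this.first = first;
        this.second = second;
    }

    public boolean isTransformable(String f, String t) {
        // Both legs of the chain must currently be possible.
        return from.equals(f) && to.equals(t)
            && first.isTransformable(from, via)
            && second.isTransformable(via, to);
    }

    public String transform(String content) {
        return second.transform(first.transform(content));
    }
}

/** Failover: several transformers for the same path, tried in order. */
final class FailoverTransformer implements SketchTransformer {
    private final List<SketchTransformer> candidates;

    FailoverTransformer(List<SketchTransformer> candidates) {
        this.candidates = candidates;
    }

    public boolean isTransformable(String f, String t) {
        return candidates.stream().anyMatch(c -> c.isTransformable(f, t));
    }

    public String transform(String content) {
        RuntimeException last = null;
        for (SketchTransformer candidate : candidates) {
            try {
                return candidate.transform(content);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next candidate
            }
        }
        throw last != null ? last : new IllegalStateException("no candidates");
    }
}
```

Because availability is answered by isTransformable at call time rather than by a precomputed graph, a chain such as doc to png via pdf simply stops being offered when one of its legs (for example an OOo-backed one) goes away.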

Page 22: Metadata Extractors, Content Transformers & Renditions

/api/service/mimetypes webscript

http://localhost:8080/alfresco/service/mimetypes

•MIME types

•Metadata Extractors

•Content Transformers

•As services come and go (OOo), entries may disappear

Page 23: Metadata Extractors, Content Transformers & Renditions

/api/service/mimetypes webscript

application/vnd.openxmlformats-officedocument.presentationml.presentation - pptx

Extractors: org.alfresco.repo.content.metadata.PoiMetadataExtracter

Transformable To:

application/pdf = Using a Direct Open Office Connection

application/vnd.ms-powerpoint = Using a Direct Open Office Connection

application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection

application/x-shockwave-flash = Complex via: application/pdf

image/jpeg = Complex via: application/pdf

image/png = Complex via: application/pdf

text/html = org.alfresco.repo.content.transform.TikaAutoContentTransformer

text/plain = org.alfresco.repo.content.transform.TikaAutoContentTransformer

text/xml = org.alfresco.repo.content.transform.TikaAutoContentTransformer

Transformable From:

application/vnd.ms-powerpoint = Using a Direct Open Office Connection

application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection

Page 24: Metadata Extractors, Content Transformers & Renditions

“Best” transformer selection

• Alfresco prefers
  • available transformers (obviously)
  • ‘explicit’ transformers
  • previously fast transformers*
• Alfresco doesn’t understand the output quality
  • pass/fail
  • fast/slow

* past performance is not a guide to future performance.
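One way to picture the preferences above is as a filter followed by a comparator. This is a sketch under stated assumptions, not the registry's actual algorithm: CandidateSketch, BestTransformerSketch and the averageMillis field are hypothetical, and the sketch assumes the preferences apply in the order listed (available, then explicit, then fastest so far).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical record of what is known about one candidate transformer. */
final class CandidateSketch {
    final String name;
    final boolean available;
    final boolean explicit;
    final long averageMillis; // average time of previous runs

    CandidateSketch(String name, boolean available, boolean explicit, long averageMillis) {
        this.name = name;
        this.available = available;
        this.explicit = explicit;
        this.averageMillis = averageMillis;
    }
}

final class BestTransformerSketch {
    /** Picks one candidate, or empty if none is currently available. */
    static Optional<CandidateSketch> pick(List<CandidateSketch> candidates) {
        return candidates.stream()
            .filter(c -> c.available)
            // false sorts before true, so explicit candidates come first
            .min(Comparator.comparing((CandidateSketch c) -> !c.explicit)
                .thenComparingLong(c -> c.averageMillis));
    }
}
```

Note that nothing in the comparison looks at output quality: the only inputs are whether a candidate can run and how quickly it ran before, which matches the pass/fail and fast/slow points above.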

Page 25: Metadata Extractors, Content Transformers & Renditions

Content Transformation - Java

ContentTransformerRegistry registry =
    appContext.getBean("contentTransformerRegistry", ContentTransformerRegistry.class);

// Reader on the source content
ContentReader reader =
    contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);

// Writer on the target node; 'true' updates the node's content property
ContentWriter writer =
    contentService.getWriter(targetNode, ContentModel.PROP_CONTENT, true);
writer.setEncoding("UTF-8");
writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);

// Now have a reader & writer ready to go

// Now have a reader & writer ready to go

Page 26: Metadata Extractors, Content Transformers & Renditions

Content Transformation – Java ctd.

ContentTransformer transformer =
    registry.getTransformer(MimetypeMap.MIMETYPE_ZIP,
                            reader.getSize(),
                            MimetypeMap.MIMETYPE_TEXT_PLAIN,
                            null);

transformer.transform(reader, writer);

Page 27: Metadata Extractors, Content Transformers & Renditions

Content Transformation - JavaScript

var action = actions.create('transform');
action.parameters["destination-folder"] = node.parent;
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = node.name + "transformed";
action.parameters["mime-type"] = "text/plain";

action.execute(testNode);

Page 28: Metadata Extractors, Content Transformers & Renditions

Config: Transformer Filtering/Debugging

• org.alfresco.service.cmr.repository.TransformationOptionLimits
  • timeouts, size limits, page limits
  • content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes=5120
• org.alfresco.repo.content.TransformerDebug
  • contextual logging
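The size-limit property above reads as transformer name, source extension, target extension, then the limit name. A rough sketch of how such a key could be composed and checked follows; SourceSizeLimitSketch and withinLimit are hypothetical names, the sketch only handles maxSourceSizeKBytes, and it assumes a missing key means no limit, which is an assumption made here rather than documented behaviour.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical check of a source size against a configured limit. */
final class SourceSizeLimitSketch {

    static boolean withinLimit(Properties config, String transformer,
                               String fromExt, String toExt, long sourceBytes) {
        String key = "content.transformer." + transformer
            + ".mimeTypeLimits." + fromExt + "." + toExt + ".maxSourceSizeKBytes";
        String value = config.getProperty(key);
        if (value == null) {
            return true; // assumption: no entry means no limit
        }
        long maxKBytes = Long.parseLong(value.trim());
        return sourceBytes <= maxKBytes * 1024L;
    }

    /** Convenience for the demo: parse properties from a string. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties config = parse(
            "content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes=5120");
        // 4 MB is under the 5120 KB limit; 6 MB is over it.
        System.out.println(withinLimit(config, "OpenOffice", "txt", "pdf", 4L * 1024 * 1024));
        System.out.println(withinLimit(config, "OpenOffice", "txt", "pdf", 6L * 1024 * 1024));
    }
}
```

With the value from the slide, a 4 MB text file would still be eligible for the OpenOffice txt to pdf path, while a 6 MB one would not.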

Page 29: Metadata Extractors, Content Transformers & Renditions

Extending

• Follow the Alfresco patterns
  • org.alfresco.repo.content.transform
• Remember the chains
• Remember the subsystems
  • ImageMagick
  • OpenOffice
• Remember the Enterprise variants
  • JodConverter

Page 30: Metadata Extractors, Content Transformers & Renditions

Recap

• Many transformations & paths possible
  • No graph
• Can be expensive in CPU/memory
• Transformation to text = free indexing
• No link between source & transformed content
  • Thumbnails are children of their source nodes
  • Bespoke behaviours ensure thumbnails are updated

Page 31: Metadata Extractors, Content Transformers & Renditions

Renditions

Page 32: Metadata Extractors, Content Transformers & Renditions

Renditions

• A more general feature than transformers

• Although with a strong overlap
  • Thumbnails are renditions
  • Previews are renditions

• Not all renditions are thumbnails/previews

Page 33: Metadata Extractors, Content Transformers & Renditions

Renditions

• Flexible location

• Always associated to their source node.
  • Child nodes of their source node.
  • Child nodes of another folder node.

• Updated when their source updates.

• Can be disabled with marker aspect
  • rn:preventRenditions
  • See ‘preventRenditions’ spring bean to register other ‘unrenditionable’ content classes

• Can reflect the content and/or metadata of their source node.

Page 34: Metadata Extractors, Content Transformers & Renditions

Standard rendition engines

• reformat: redirects to vanilla transforms

• image: image manipulation parameters

• freemarker: run some FTL against source content

• xslt: run XSLT on (XML) source node

• composite: rendition series [reformat, crop]

Page 35: Metadata Extractors, Content Transformers & Renditions

Persistence of Rendition Definitions

1. Create Rendition Definition

2. Set parameter values on it

3. Execute it against a source node

• Definitions can be persisted
  • Useful for complex or commonly used definitions
  • RenditionService.save(), .load()
• Saved into Alfresco’s Data Dictionary

Page 36: Metadata Extractors, Content Transformers & Renditions

Renditions - Java

NodeRef jpgNodeRef; // the source image node
QName renditionName = QName.createQName(
    NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");

// 1. Create the rendition definition
RenditionDefinition renditionDef = renditionService.createRenditionDefinition(
    renditionName, "imageRenderingEngine");

// 2. Set parameter values on it
renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_WIDTH, 128);
renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_HEIGHT, 512);
renditionDef.setParameterValue(ImageRenderingEngine.PARAM_MAINTAIN_ASPECT_RATIO, false);

// 3. Execute it against the source node
ChildAssociationRef chAssRef = renditionService.render(jpgNodeRef, renditionDef);

Page 37: Metadata Extractors, Content Transformers & Renditions

Renditions - JavaScript

var renditionDef = renditionService.createRenditionDefinition("cm:cropResize", "imageRenderingEngine");
renditionDef.parameters["destination-path-template"] = "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xSize"] = 50;
renditionDef.parameters["ySize"] = 50;

renditionService.render(testNode, renditionDef);

var renditions = renditionService.getRenditions(testNode);

Page 38: Metadata Extractors, Content Transformers & Renditions

Recap

• Renditions == Transformations++

• More complex, more powerful

Page 39: Metadata Extractors, Content Transformers & Renditions
Page 40: Metadata Extractors, Content Transformers & Renditions

End