alfresco-software
DESCRIPTION
In this session, we will look first at the rich metadata that documents in your repository have, how to control its mapping onto your content model, and some of the interesting things this can deliver. We'll then move on to the content transformation and rendition services, and see how you can easily generate a wide range of media from the content you already have.
1
Metadata Extraction and Content Transformations
Nick Burch
Software Engineer, Alfresco
twitter: @gagravarr
2
Introduction – 3 Content Related Services
• Metadata Extractor
• Content Transformer
• Renditions
Covering
• Uses
• Interfaces
• Calling the Services
• Java & JavaScript APIs
• Demos
• Extensions
• Apache Tika
3
The Metadata Extractor Service
What, How, Why?
• For a given piece of content, returns the Metadata held within that
• Document Metadata is converted into the content model
• Typically used with uploaded binary files
• Upload a PDF, extract out the Title and Description, save these as the properties on the Alfresco Node
• Powered internally by a number of different extractors
• Service picks the appropriate extractor for you
• Since Alfresco 3.4, makes heavy use of Apache Tika
4
The Content Transformation Service
What, How, Why?
• Transforms content from one format to another
• Driven by mime types, source and destination
• Used to generate plain text versions for indexing
• Used to generate SWF versions for preview
• Used to generate PDF versions for web download
• Powered by a large number of different transformers
• Transformers can be linked together, eg .doc -> .pdf via Open Office, then .pdf -> .swf via pdf2swf
• Since Alfresco 3.4, makes heavy use of Apache Tika
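The chaining bullet above can be pictured as a shortest-path search over mime types. What follows is a minimal sketch of that idea only, using the JDK alone. The class `TransformerChainSketch` and its methods are hypothetical names for illustration and are not part of the Alfresco API, and Alfresco's own transformer selection may work quite differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of mime-type-driven transformer chaining; not Alfresco code. */
final class TransformerChainSketch {
    // source mime type -> (target mime type -> name of the transformer that does it)
    private final Map<String, Map<String, String>> direct = new HashMap<>();

    void register(String source, String target, String transformerName) {
        direct.computeIfAbsent(source, k -> new LinkedHashMap<>()).put(target, transformerName);
    }

    /**
     * Breadth-first search for the shortest chain of transformer names.
     * Returns an empty list if no chain exists (or if source equals target).
     */
    List<String> findChain(String source, String target) {
        Map<String, String> cameFrom = new HashMap<>(); // mime type -> previous mime type
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(source, null);
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                break;
            }
            for (String next : direct.getOrDefault(current, Collections.emptyMap()).keySet()) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        List<String> chain = new ArrayList<>();
        if (!cameFrom.containsKey(target)) {
            return chain;
        }
        // Walk back from the target to the source, prepending each transformer used
        for (String step = target; cameFrom.get(step) != null; step = cameFrom.get(step)) {
            chain.add(0, direct.get(cameFrom.get(step)).get(step));
        }
        return chain;
    }
}
```

Registering "application/msword" to "application/pdf" and "application/pdf" to "application/x-shockwave-flash" lets `findChain` return the two transformer names in order, mirroring the .doc -> .pdf -> .swf example.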
5
The Rendition Service
What, How, Why?
• Can turn content from one kind to another
• Or can just alter some content as-is
• Used to manipulate images, eg crop and resize
• Used to generate HTML .docx previews in Web Quick Start
• Often uses the Content Transformation Service
• Replaced the Thumbnail Service
• Renditions are actions
6
Apache Tika
Apache Tika – http://tika.apache.org/
• Apache Project which started in 2006
• Grew out of the Lucene community, now widely used
• Provides detection of files – eg this binary blob is really a word file
• Plain text, HTML and XHTML versions of a wide range of different file formats
• Consistent Metadata from different files
• Tika hides the complexity of the different formats, and presents a simple, powerful API
• Easy to use and extend
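To give a feel for what "detection" means in the bullets above, here is a minimal sketch of magic-byte sniffing using only the JDK. It is not how Tika is implemented; the class `MagicDetectSketch` is a hypothetical name, and the label returned for OLE2 containers is made up for this example. Telling Word from Excel inside such a container requires looking into the container itself, which this sketch does not attempt.

```java
import java.util.Arrays;

/** Hypothetical magic-byte sniffer for illustration; real detectors know far more formats. */
final class MagicDetectSketch {
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    // Start of the OLE2 signature used by binary Office files (.doc, .xls, .ppt, ...)
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};

    static String detect(byte[] content) {
        if (startsWith(content, PDF)) return "application/pdf";
        if (startsWith(content, PNG)) return "image/png";
        if (startsWith(content, ZIP)) return "application/zip";
        // Made-up label: the container alone does not say which Office format it holds
        if (startsWith(content, OLE2)) return "application/x-ole2-container";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] content, byte[] magic) {
        return content.length >= magic.length
                && Arrays.equals(Arrays.copyOf(content, magic.length), magic);
    }
}
```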
7
Metadata Extractor Service
8
Alfresco 3.3 - Supported Formats
File Formats supported out of the box
• PDF
• Word, PowerPoint, Excel
• HTML
• Open Document Formats (OpenOffice)
• RFC822 Email
• Outlook .msg Email
9
Alfresco 3.4 - Supported Formats – Page 1
File Formats supported out of the box, Page 1
• Audio – WAV, RIFF, MIDI
• DWG (CAD)
• Epub
• RSS and ATOM Feeds
• True Type Fonts
• HTML
• Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
• iWork (Keynote, Pages etc)
• RFC822 mbox Mail
10
Alfresco 3.4 - Supported Formats – Page 2
File Formats supported out of the box, Page 2
• Microsoft Outlook .msg Email
• Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
• Microsoft Office (OOXML) – Word, PowerPoint, Excel
• MP3 (id3 v1 and v2)
• CDF (Scientific Data)
• Open Document Format (Open Office)
• Old-style Open Office (.sxw etc)
• PDF
11
Alfresco 3.4 - Supported Formats – Page 3
File Formats supported out of the box, Page 3
• Zip and Tar archives
• RDF
• Plain Text
• FLV Video
• XML
• Java class files
And I probably forgot one...!
12
Calling Apache Tika
// Get a content detector, and an auto-selecting Parser
TikaConfig config = TikaConfig.getDefaultConfig();
ContainerAwareDetector detector = new ContainerAwareDetector(
    config.getMimeRepository()
);
Parser parser = new AutoDetectParser(detector);

// We'll only want the plain text contents
ContentHandler handler = new BodyContentHandler();

// Tell the parser what we have
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

// Have it processed
parser.parse(input, handler, metadata, new ParseContext());
13
Metadata Extractor – Java Use
// Look up the registry bean, then the extracter for the source mime type
MetadataExtracterRegistry registry = (MetadataExtracterRegistry) context.getBean("metadataExtracterRegistry");
MetadataExtracter extractor = registry.getExtracter("application/vnd.ms-excel");

// Extract the mapped properties from the node's content
Map<QName, Serializable> properties = new HashMap<QName, Serializable>();
ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
extractor.extract(reader, properties);
System.err.println(properties);
14
Metadata Extractor – JavaScript Use
JavaScript
var action = actions.create("extract-metadata");
action.execute(document);
• Full access is not directly available
• You can’t get at the raw properties
• You can, however, trigger extraction and saving to the node easily
• Available via an action
15
Metadata Extractor – Geo Content Model
<aspect name="cm:geographic">
  <title>Geographic</title>
  <properties>
    <property name="cm:latitude">
      <title>Latitude</title>
      <type>d:double</type>
    </property>
    <property name="cm:longitude">
      <title>Longitude</title>
      <type>d:double</type>
    </property>
  </properties>
</aspect>
16
Metadata Extractor – Geo Mapping
# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

# Geo Mappings
geo\:lat=cm:latitude
geo\:long=cm:longitude

# Normal Mappings
author=cm:author
title=cm:title
description=cm:description
created=cm:created
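The mapping file above is in Java properties syntax, where `\:` escapes the colon so that `geo:lat` is read as the key. The sketch below shows one way such a mapping could be applied to raw extracted metadata, using only `java.util.Properties`. The class `MappingSketch` and the method `applyMapping` are hypothetical names, and the result keeps prefixed names such as `cm:latitude` as plain strings, whereas Alfresco would resolve them to QNames via the namespace entries.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical illustration of applying an extracter mapping file; not Alfresco code. */
final class MappingSketch {
    /** Maps raw metadata keys (eg "geo:lat") to content model names (eg "cm:latitude"). */
    static Map<String, Object> applyMapping(String mappingText, Map<String, Object> rawMetadata) {
        Properties mapping = new Properties();
        try {
            // Properties.load handles '#' comments and the "\:" escape in keys
            mapping.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Object> nodeProperties = new TreeMap<>();
        for (String rawKey : mapping.stringPropertyNames()) {
            if (rawKey.startsWith("namespace.prefix.")) {
                continue; // namespace declarations are not mappings
            }
            Object value = rawMetadata.get(rawKey);
            if (value != null) {
                nodeProperties.put(mapping.getProperty(rawKey), value);
            }
        }
        return nodeProperties; // raw keys without a mapping are simply dropped
    }
}
```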
17
Demo: Geo Tagged Image in Share
18
Content Transformation Service
19
Supported Transformations
Transformations Supported in Alfresco v3.4
• Plain Text, HTML & XHTML for all Apache Tika supported text and document formats (around 30 file formats)
• PDF to Image
• PDF to SWF (for preview)
• Office File Formats to PDF (via Open Office, using JODConverter in Enterprise)
• Plain Text and XML to PDF
• Zip listing to Text
• Image to other Images (via ImageMagick)
20
Content Transformer and Tika
Handlers
// Option 1 - plain text: capture only the body text
ContentHandler handler = new BodyContentHandler();
// ... after parsing with this handler:
String text = handler.toString();

// Option 2 - XHTML: serialise the SAX events as indented XML
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter sw = new StringWriter();
handler.setResult(new StreamResult(sw));
// ... after parsing with this handler:
String text = sw.toString();
• Tika generates HTML-like SAX events as it parses
• Uses Java SAX API
• Events can be captured or transformed
• Body Content Handler used for plain text
• HTML and XHTML available
• Can customise with your own handler, with XSLT or with E4X from JavaScript
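As an illustration of "customise with your own handler", here is a minimal sketch of a SAX handler that keeps only the text found inside `body`, roughly the job the slide attributes to the Body Content Handler. It uses only the JDK's SAX API and feeds itself from an XHTML string rather than from Tika. The class `BodyTextSketch` is a hypothetical name, and matching on the element's qualified name is a simplification that assumes a parser without namespace prefixes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a body-text handler; collects characters seen inside <body>. */
final class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Convenience for the example: parse an XHTML string and return the body text. */
    static String bodyText(String xhtml) {
        try {
            BodyTextSketch handler = new BodyTextSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XHTML", e);
        }
    }
}
```

With Tika, an instance of a handler like this would be passed to `parser.parse(...)` in place of the built-in one; here the JDK parser supplies the events so the sketch runs on its own.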
21
Content Transformer – Java Use
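// Look up the registry bean, then a transformer for the source and target mime types
ContentTransformerRegistry registry = (ContentTransformerRegistry) context.getBean("contentTransformerRegistry");
ContentTransformer transformer = registry.getTransformer("application/vnd.ms-excel", "text/csv", new TransformationOptions());

// Read from the source node, write to the destination node
ContentReader reader = contentService.getReader(sourceNodeRef, ContentModel.PROP_CONTENT);
ContentWriter writer = contentService.getWriter(destNodeRef, ContentModel.PROP_CONTENT, true);
// The transformer takes the target mime type from the writer
writer.setMimetype("text/csv");
transformer.transform(reader, writer);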
22
Content Transformer – JavaScript Use
JavaScript
var action = actions.create("transform");
// Transform into the same folder
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "transformed";
action.parameters["mime-type"] = "text/html";
// Execute
action.execute(document);
• Full access is not directly available
• You can’t control which property is transformed, it’s always Content
• You can control where the transformed version goes
• Triggering the transformation is easier than in Java
• Available via an action
23
Custom Tika Parsers - Interface
Interface
public interface Parser {
    Set<MediaType> getSupportedTypes(ParseContext context);

    void parse(InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
}
• The Tika Parser interface is quite simple
• Need to provide a list of supported mime types, so that auto-detection can work
• Accept an input stream, populate the Metadata object, and fire SAX events to the supplied handler
• That’s it!
24
Custom Tika Parser – Hello World Parser
public class HelloWorldParser implements Parser {
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.parse("hello/world"));
        return types;
    }

    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.startElement("h1");
        xhtml.characters("Hello, World!");
        xhtml.endElement("h1");
        xhtml.endDocument();

        metadata.set("hello", "world");
        metadata.set("title", "Hello World!");
    }
}
25
Custom Command Line Transformer
<bean id="transformer.worker.helloWorldCMD"
      class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
  <property name="mimetypeService"><ref bean="mimetypeService"/></property>
  <property name="transformCommand">
    <bean class="org.alfresco.util.exec.RuntimeExec">
      <property name="commandsAndArguments"><map>
        <entry key=".*"><list>
          <value>/bin/bash</value>
          <value>-c</value>
          <value>/bin/echo 'Hello World - ${source}' > ${target}</value>
        </list></entry>
      </map></property>
      <property name="errorCodes"><value>1,127</value></property>
    </bean>
  </property>
  <property name="explicitTransformations">
    <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
      <property name="sourceMimetype"><value>text/plain</value></property>
      <property name="targetMimetype"><value>hello/world</value></property>
    </bean></list>
  </property>
</bean>

<bean id="transformer.helloWorldCMD"
      class="org.alfresco.repo.content.transform.ProxyContentTransformer"
      parent="baseContentTransformer">
  <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
</bean>
26
Custom Transformer – Demo
JS Code
var action = actions.create("transform");
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "HW";

if (document.mimetype == "hello/world") {
    action.parameters["mime-type"] = "text/plain";
} else {
    action.parameters["mime-type"] = "hello/world";
}

action.execute(document);
• Use our Command Line transformer to generate a “hello/world” version
• Use our Tika transformer to turn this back into plain text
• Uses the JavaScript API to access the content transformation service
27
Demo 2: Excel to Plain Text, CSV and HTML
28
Rendition Service
29
Standard Rendition Engines
Renditions Supported in Alfresco v3.4
• reformat – access to the Content Transformation Service
• image – crop, resize, etc
• freemarker – runs a Freemarker Template against the content of the node
• html – turns .docx files into clean HTML + images
• xslt – runs an XSLT Transformation against the content of the node, XML content nodes only!
• composite – execute several renditions in a series, eg reformat followed by image crop
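The composite engine's "several renditions in a series" amounts to function composition: each step receives the output of the previous one. The following is a minimal sketch of that idea in plain Java, with strings standing in for content. `CompositeRenditionSketch` and `composite` are hypothetical names and say nothing about how Alfresco's composite rendering engine is written.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical illustration of running rendition steps in series; not Alfresco code. */
final class CompositeRenditionSketch {
    /** Builds one operation that runs each step on the output of the previous step, in order. */
    @SafeVarargs
    static <T> UnaryOperator<T> composite(UnaryOperator<T>... steps) {
        List<UnaryOperator<T>> ordered = Arrays.asList(steps);
        return input -> {
            T current = input;
            for (UnaryOperator<T> step : ordered) {
                current = step.apply(current);
            }
            return current;
        };
    }
}
```

For example, composing a stand-in "reformat" step with a stand-in "crop" step gives a single operation that applies the reformat first and the crop second, matching the order in the bullet above.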
30
Persisted vs Transient Definitions
For your more complicated renditions
• To run a rendition, first create a rendition definition for a given rendering engine
• Then set all the parameters against it
• Finally execute it against a node
• For very complicated / common renditions, you can save the definition to the data dictionary
• It can then be retrieved and run
• Rendition Service provides support to create, load, save and execute definitions
31
Rendition Service – Calling From Java
Load, Edit, Save, Run
// Retrieve the existing Rendition Definition
QName renditionName = QName.createQName(
    NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
RenditionDefinition renditionDef = renditionService.loadRenditionDefinition(renditionName);

// Make some changes.
renditionDef.setParameterValue(AbstractRenderingEngine.PARAM_MIME_TYPE, MimetypeMap.MIMETYPE_PDF);
renditionDef.setParameterValue(RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);

// Persist the changes.
renditionService.saveRenditionDefinition(renditionDef);

// Run the Rendition
ChildAssociationRef assoc = renditionService.render(sourceNode, renditionDef);
32
Rendition Service – Calling From JavaScript
Create, Run, List
var renditionDef = renditionService.createRenditionDefinition("cm:cropResize", "imageRenderingEngine");
renditionDef.parameters["destination-path-template"] = "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xsize"] = 50;
renditionDef.parameters["ysize"] = 50;

renditionService.render(nodeRef, renditionDef);

var renditions = renditionService.getRenditions(nodeRef);
33
Rendition Service – More Calling Options
Actions, Rules, CMIS
• Renditions are Actions, but normally hidden ones
• They won’t show up in Share when defining Rules, or in Explorer for running a Custom Action
• Solution – create a JS Script, or some custom Java
• Use this from your Rule / to run as an Action
• No dedicated REST API, but Renditions are available through CMIS
• More details available in the CMIS talks!
34
Custom Rendition Engines
When a composite just isn’t enough
• Rendition Engines are a special kind of Action Executor
• This delivers lots of flexibility, and means anyone who can write Custom Actions already knows enough to write Custom Rendition Engines!
• org.alfresco.repo.rendition.executer.AbstractRenderingEngine provides a helpful superclass
• To learn more about Custom Actions and Custom Action Executors, see Neil McErlean’s talk
35
Demo 1: Crop and Resize an Image
(Using Share Rules)
36
Demo 2: Video Rendition
37
Demo 3: Word .docx -> HTML & Images
(Using Web Quick Start)
38
Any Questions?
39
Learn More
wiki.alfresco.com
forums.alfresco.com
blogs.alfresco.com/wp/nickb/
twitter: @AlfrescoECM @Gagravarr