TYPO3 is an Open Source Content Management System that is very popular in Europe, especially in the German market, and is gaining traction in the U.S., too. TYPO3 is a good example of how to integrate Solr with a CMS. The challenges we faced are typical of any CMS integration. We came up with solutions and ideas for these challenges, and our hope is that they might help other CMS integrations as well. The challenges include content indexing, file indexing, keeping track of content changes, handling multi-language sites, search and faceting, access restrictions, result presentation, and how to keep all these things flexible and reusable for many different sites. For all of this we used a couple of additional Apache projects, and we would like to show how we use them and how we contributed back to them while building our Solr integration.
Text of Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013
2. APACHE SOLR CMS INTEGRATION
   TYPO3 CMS and Solr. How we did it.
   LUCENE/SOLR REVOLUTION, MAY.01.2013
   INFIELD DESIGN, we build smart.

3. ABOUT ID
   What we do and who we do it for
   - Strategy, Planning, Design, UX, Development & Integration

4. WHO IS THIS GUY?
   - Committer TYPO3 CMS
   - Committer and PMC member Apache Tika
   - Release Manager TYPO3 CMS 4.2
   - New San Franciscan
   - Snowboarding, mountain biking
   - Software Engineer, Architect at Infield Design
   - Caution: TYPO3-Evangelist

5. TYPO3 CMS

6. TYPO3 CMS
   - Free and Open Source Enterprise CMS
   - Estimated 500,000+ installations worldwide
   - 6,000+ public extensions
   - 6,000,000+ downloads
   - Content Management Framework
   - Multi-Site, Multi-Language, Versioning, Workflows, ...
   - Stable, Secure, Scalable

7. TYPO3 COMMUNITY
   - Community driven development
   - Conferences in North America, Europe, Asia
   - Barcamps, Developer Days, Snowboard Tour
   - 4 times Google Summer of Code participant
   - Backed by TYPO3 Association
   - Several other projects under the TYPO3 brand

8. SOLR & CMS INTEGRATION

9. Integration Challenges & Solutions: PAGE RENDERING
   - Different template engines
   - (Too) flexible page rendering engine
   - Identify relevant content on websites
   - Exclude navigation and common page elements
   - Content generated by plugins

10. Integration Challenges & Solutions: INDEX QUEUE
   - Index Queue to track and index content
   - Record Monitor to update Index Queue
   - Crawl pages, index unstructured content marked relevant
   - Exclude pages with plugin-generated content
   - Index structured plugin data directly from DB

11. Integration Challenges & Solutions: ACCESS RIGHTS
   - Intranet, Extranet, ...
   - Not everybody may see everything
   - Flexible user groups and permissions
   - Permissions extended to sub-pages

12. Integration Challenges & Solutions: SOLR ACCESS FILTER PLUGIN
   - Custom Solr access filter plugin
   - Query Parser and Filter
   - User group IDs stored in documents
   - Current user's groups submitted with query
   - Plugin matches document groups with user's groups

13. Integration Challenges & Solutions: FILE INDEXING
   - Finding file links in page content
   - Core file links vs. plugin file links
   - Track files for indexing
   - Reading file content
   - Separate tools for different file formats

14. Integration Challenges & Solutions: FILE INDEXING
   - File Detectors & File Index Queue
   - File system abstraction layer
   - Apache Tika
   - Knows 1,200+ file formats, reads about half of them
   - Content & meta data extraction
   - Language detection

15. Integration Challenges & Solutions: THE REST
   - PHP people vs. Java technology
   - Talking to Solr
   - Learning from mistakes

16. Integration Challenges & Solutions: THE REST
   - Fully automated bash install script
   - SolrPhpClient
   - Separate your languages

17. EXT:solr - Apache Solr for TYPO3: FEATURES
   - Faceted Search
   - File Indexing
   - Multi-Language & Multi-Site Support
   - Did you mean, More Like This
   - Search Word Highlighting
   - Auto Complete
   - Access Rights Support
   - Many More ...

18. QUESTIONS?

19. THANKS.

20. T3CON North America
   San Francisco, May 30-31
   20% off regular ticket price, use: LUCENETYPO3
   INFIELD DESIGN is hiring!

21. CONFERENCE PARTY
   - The Tipsy Crow: 770 5th Ave
   - Starts after Stump The Chump
   - Your conference badge gets you in the door
   TOMORROW
   - Breakfast starts at 7:30
   - Keynotes start at 8:30
   CONTACT
   @firstname.lastname@example.org, email@example.com
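
Note on the access filter (slide 12): the matching idea described there (group IDs stored in each document, the current user's groups submitted with the query, documents kept only when the two overlap) can be illustrated with a minimal Python sketch. This is not the actual plugin, which the slides describe as a Solr query parser and filter. The `Document` class, the function names, and the rule that an empty group set means "public" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A simplified indexed document (hypothetical structure)."""
    doc_id: str
    title: str
    # IDs of the user groups allowed to see this document.
    # Assumption for this sketch: an empty set means "public".
    access_groups: frozenset = field(default_factory=frozenset)


def is_visible(doc, user_groups):
    """Return True if a user with the given group IDs may see the document."""
    if not doc.access_groups:
        return True  # public document (assumption of this sketch)
    # Visible when the user shares at least one group with the document.
    return bool(doc.access_groups & set(user_groups))


def filter_results(docs, user_groups):
    """Keep only the documents visible to a user with the given groups."""
    return [doc for doc in docs if is_visible(doc, user_groups)]


index = [
    Document("1", "Public news"),
    Document("2", "Intranet handbook", frozenset({1})),
    Document("3", "Partner price list", frozenset({2, 3})),
]

# A user who belongs only to group 3 sees the public page and the price list.
visible_ids = [d.doc_id for d in filter_results(index, user_groups=[3])]
print(visible_ids)  # prints ['1', '3']
```

Doing this check inside the search server, as the slides describe, means restricted documents never leave the index for users who lack the matching group, so result counts and facets stay consistent with what the user is allowed to see.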