ProjectWise 101 – Chapter 9
Document Indexing
Gary Cochrane – Technical Director
Geospatial Sales – North America
Introduction
• ProjectWise Document Indexing
  – Really means three things
    • Full Text Indexing, in support of full text searching
    • Thumbnail Extraction
    • Document Property Extraction
      – We won’t cover this one in PW101
      – See Bentley Institute PW Admin course guide for this
Full Text Indexing
• We did not write the engine for this
  – But elected to use the one Microsoft provides
    • Included with every copy of Windows
  – That engine is called the MS Indexing Service
    • And it was installed in the VM as an optional Windows component
  – Microsoft indexes the following file formats
    • MSWord, Excel, PPT, HTML, XML, TXT
Pre-installed in VM
Windows Server 2003 with SP2
Microsoft Indexing Service
Microsoft Message Queuing Service
Microsoft .NET Framework 2.0
MicroStation V8i-SS1
ProjectWise Orchestration Framework
ProjectWise Integration Server
Supported Database Engine
Extending the MS Index Service
• Microsoft provides an SDK for third parties to extend the Indexing service
  – So the Indexing service will know how to “filter” files from that vendor
    • For instance, Adobe provides an “iFilter” that teaches the MS Index Service how to extract text from a PDF file
    • The Adobe PDF iFilter is installed with Acrobat Reader V9x
Indexing Overview
• Within PW, Indexing consists of:
  – Scheduling
    • A process that wakes up, checks for new (or modified) files, adds them to the Copy-out queue, and goes back to sleep
  – Copy-out
    • Copy the file from the Storage Area to the machine running the Indexing Service. Then add the file to the Extraction queue.
    • Remember, files may be stored on multiple servers
    • Also, in large installations, a machine may be dedicated to indexing
Indexing Overview – Part II
• Overview – continued
  – Extraction
    • This process gets the text from the file and adds it to the MS Index catalog. Then adds the file to the Update queue
  – Update
    • This process sets the flag on the file (in the PW database) that says it is “done”
    • New files are added with the flag set to “undone”
    • Check-out/in causes the flag to be set to “undone”
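The four stages on the last two slides can be sketched in a few lines of Python. This is only an illustration of the queue hand-offs and the done flag, not how ProjectWise implements them: the `Document` class, `run_pass`, `check_in`, and the dictionary used as a "catalog" are all hypothetical names made up for this sketch, and the Copy-out step is reduced to moving an item between queues.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Document:
    name: str
    text: str           # stand-in for the text a filter could extract
    done: bool = False  # the "done"/"undone" flag kept in the database


def run_pass(documents, catalog):
    """Run one scheduler wake-up through all four stages; return the count processed."""
    # Scheduling: queue every document still flagged "undone".
    copy_out_queue = deque(d for d in documents if not d.done)

    # Copy-out: the real step copies the file to the indexing machine;
    # here we only hand the document to the extraction queue.
    extraction_queue = deque()
    while copy_out_queue:
        extraction_queue.append(copy_out_queue.popleft())

    # Extraction: add each word to the catalog, then queue the document for update.
    update_queue = deque()
    while extraction_queue:
        doc = extraction_queue.popleft()
        for word in doc.text.lower().split():
            catalog.setdefault(word, set()).add(doc.name)
        update_queue.append(doc)

    # Update: set the flag to "done".
    processed = 0
    while update_queue:
        update_queue.popleft().done = True
        processed += 1
    return processed


def check_in(doc, new_text):
    """A check-out/in with changes sets the flag back to "undone"."""
    doc.text = new_text
    doc.done = False
```

Calling `run_pass` a second time with no changes processes nothing, because every flag is already "done"; after `check_in`, only the changed document is picked up on the next pass.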
A note on “done”
• Done does not necessarily mean it was successful
  – It means the file has been processed
    • In other words, what happens if an unknown file (Ex: an AutoCAD file) is sent to the Indexing Service?
  – The file is attempted…
    • And the indexing service says, “I don’t know how to extract text from this file”
  – There would be no point in trying the file again
    • So it is marked as “done”, even when unsuccessful
MicroStation and AutoCAD
• ProjectWise provides a mechanism to index the text from these file types
  – Instead of writing an iFilter, Bentley elected to:
    • Copy-out the file
    • Run MicroStation in the background, extract all the text, and write it to an XML file
    • Send the XML file to the Indexing Engine
  – Since MicroStation can parse DWG as well…
    • Then this method saved us from having to write two iFilters
Summary
• So within ProjectWise, we index:
  – Word, PPT, Excel, XML, HTML, TXT
  – Adobe PDF
  – DGN and DWG
• More good news
  – iFilters can be found for many file formats
    • Some free, and some for purchase
PW Orchestration Framework
• Remember when we installed this?
  – PWOF is responsible for managing batch processes for ProjectWise
    • This includes all those processes discussed on the previous slides
  – For Full Text Indexing, that means
    • Scheduler process, Copy-out process, Extraction process, Updater process, and the MicroStation instance running in the background
Lab 1a
• PW Orchestration Framework
  – Start the Windows Task Manager
    • Hint: Right-click on empty part of Taskbar
  – Examine memory usage
    • On the Performance tab
  – Switch to Processes tab
    • Sort by Mem Usage column (descending)
    • Look for ustation.exe
    • Look for DmsAfpEngine(s)
      – Lots of memory consumed here…
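The same check can be scripted instead of done in Task Manager. The sketch below parses the CSV output of the Windows `tasklist /FO CSV /NH` command and sorts it by memory, descending. The `top_memory` function is a hypothetical helper, and the exact image name of the DmsAfpEngine processes on your machine is an assumption.

```python
import csv
import io


def top_memory(tasklist_csv, limit=5):
    """Return (image name, memory in KB) pairs, largest memory usage first."""
    rows = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        # The memory column looks like "180,432 K"; keep only the digits.
        digits = "".join(ch for ch in row[4] if ch.isdigit())
        if digits:
            rows.append((row[0], int(digits)))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:limit]
```

On the VM, the input text could come from `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout`; look for ustation.exe and the DmsAfpEngine entries near the top of the result.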
Lab 1b
• Now open Services dialog
  – Remember “gears” icon on Quick-Launch
• Locate PW Orchestration Framework service
  – Select the PW OF service, and choose> Stop
    • Watch memory usage in Task Manager
  – For remainder of exercise, we need PWOF running
    • So start it back up now
    • Note PWOF is configured for automatic startup
      – It will run each time machine is booted
  – Close Services and Task Manager
Lab 2a
• Open PW Administrator
  – Log in as> adminpw
  – Drill down to:
    • Document Processors> Full Text Indexing
  – Right-click, choose> Properties
Lab 2b - Full Text Indexing
adminpw / adminpw
Turn on
Set to 60
Accept default, unless Indexing is to be run on another machine
Lab 2c - Full Text Indexing
Set to 2
Enable all times in the schedule
Lab 2d
• Switch to File Type Associations tab
  – Press> Add
    • In the Extension field, enter> DWG
    • In the bottom field, enter> DGN
      – So that DWG files are processed as if they were DGN
  – Press> OK
Lab 2e
Lab 2f
• Still on the File Type Associations tab
  – Again, press> Add
    • In the Extension field, enter> itiff
    • In the bottom, enable> Do not process these documents
      – You can’t extract text from a raster so this prevents wasted file transfers
  – Press> OK
    • Press OK again
      – To close the Full Text Indexing Properties
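The two associations entered in Labs 2d and 2f behave like a small lookup table: an extension is either redirected to another type or marked as not to be processed. A minimal Python sketch of that idea follows; the `associations` dictionary, the `SKIP` marker, and the `resolve` function are hypothetical and only model the behavior described on the slides.

```python
SKIP = None  # stands for "Do not process these documents"

associations = {
    "dwg": "dgn",    # Lab 2d: process DWG files as if they were DGN
    "itiff": SKIP,   # Lab 2f: rasters carry no extractable text
}


def resolve(filename, associations):
    """Return the type a file should be processed as, or None to skip it."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in associations:
        return associations[ext]
    return ext  # no association: process under its own extension
```

Because the skip decision is made from the extension alone, a skipped file never has to be copied out of the Storage Area, which is the saving the slide refers to.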
Lab 2g
• Open Task Manager again
  – Switch to Performance tab
    • Within 2 minutes, you should see heavy CPU usage
    • Memory usage will also go up
  – Up to 60 documents will be indexed in the first pass
    • If there are more than 60 documents to be done, then they will be queued for the next pass
      – 2 minutes from now
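The batch size (60) and the interval (2 minutes) give a rough way to estimate how long a backlog takes to clear. The sketch below is simple arithmetic under one stated assumption: every pass finishes before the next one is due. The function names are made up for this example.

```python
import math


def passes_needed(pending, batch_size=60):
    """Number of passes required to work through the pending documents."""
    return math.ceil(pending / batch_size)


def latest_start_of_last_pass(pending, batch_size=60, interval_min=2):
    """Upper bound, in minutes from now, on when the final pass begins.

    Assumes the first pass starts within one interval and each pass
    completes before the next is scheduled.
    """
    return passes_needed(pending, batch_size) * interval_min
```

For example, 150 pending documents need 3 passes (60 + 60 + 30), so the last pass begins within about 6 minutes.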
Analysis
• All documents will eventually be processed
  – When done, the index will be ready for fast full text searches
    • Once the indexer has caught up, future load will be lighter because only new or modified documents need processing
Lab 3a
• When done, close Task Manager, open PW Explorer
  – Log in as user1
• From the main tool box, select> Find Documents
  – Binocular icon
• Change to Full Text tab
  – Enter Look For> detail
• Press OK to start search
  – Then Close the Search dialog
• Your results should include: DGNs, DWGs, and PDFs
Lab 3b
• Browse to:
  – User1/Document Indexing/MS-SHT
• These files were not successful because they have an unknown extension
• But they were attempted, and flagged as done
• Return to PW Administrator
  – Select datasource name (pwdemo)
    • Right-click, choose> Properties
    • Change to Statistics tab
    • Choose Refresh
    • Review Full Text Statistics
  – Close dialog
Lab 3c
• While still in PW Administrator
  – Open Full Text Indexing Properties again
• Switch to the File Type Associations tab
  – Press Add
    • In the Extension field, enter> SHT
    • In the bottom Extension field, enter> DGN
      – So that SHT files will be processed as if they were DGN files
• Press OK to complete the Extension mapping
  – Press OK again to close the Properties dialog
Lab 3d
• Once new file type has been added…
  – Now a small problem
    • These files were flagged as done, and the Indexer won’t try them again unless they are checked out/in
    • And even that won’t work unless you actually make changes…
    • PW compares files to version on server, and doesn’t transfer back if there are no changes
Lab 3e
• Rather than check them all out, and back in
  – From PW Administrator
    • Right-click Full Text Indexing
      – Choose> Mark folder Documents for Reprocessing
      – Browse “…” to User1/Document Indexing/MS-SHT
      – Press OK
    • Press OK again
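Conceptually, marking a folder for reprocessing just sets the done flag back to "undone" for every document in that folder, so the scheduler picks them up on its next pass. The Python sketch below models that with a plain dictionary of path to flag; `mark_folder_for_reprocessing` is a hypothetical function, and the document paths are examples, not the actual lab files.

```python
def mark_folder_for_reprocessing(done_flags, folder):
    """Set the done flag to False for documents under folder; return how many changed."""
    prefix = folder.rstrip("/") + "/"
    changed = 0
    for path in done_flags:
        # Only documents inside the folder that are currently "done" are reset.
        if path.startswith(prefix) and done_flags[path]:
            done_flags[path] = False
            changed += 1
    return changed
```

This avoids the check-out/in route entirely, which matters here because an unchanged file would not be transferred back and so would never be re-flagged.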
Analysis
• Within 2 minutes, these documents will be re-processed
  – If you run the search again (in a few minutes), you should also get SHT files in your results
  – Re-visit Datasource statistics to see if the Full Text categories have changed
Summary
• Once the index is created,
  – You can stop the PW Orchestration Framework service
    • It is used to create the index, but not to search the index
  – This will save memory and CPU cycles
    • So in a demo, your machine will run faster
    • BUT, new (or modified) files will not be re-indexed
  – Up until now, the PWOF was not being used at all
    • Full Text Indexing is the first time we’ve needed PWOF, even though it has been running since installation
PW Thumbnails
• PW Thumbnails is not “indexing” in the proper sense, but it is similar in nature to Full Text
  – PW Thumbnails extracts a thumbnail from the document, and stores a copy in the PW database
    • This allows one to browse PW Explorer, and see thumbnails in the Preview Pane
  – Not all file types support thumbnails
    • Among those that do, some don’t do it per the industry standard
Thumbnails – Part II
• Important to remember
  – ProjectWise does not create thumbnails
    • It only extracts what might be in the file
  – A good test is to check to see if Windows Explorer displays a thumbnail for the file
    • If it does, then PW should as well
Lab 4a
• Open Windows Explorer
  – Browse to:
    • C:\PW-101 Class Files\Document Indexing\MS-V8
  – Change to Thumbnail display
    • MicroStation V8 files have thumbnails
Lab 4b
• Browse through remaining Document Indexing folders
  – Note which include thumbnails
  – Additional notes
    • PDF files take a long time because you are really looking at a small view of the whole file, not a thumbnail
    • AutoCAD doesn’t adhere to the industry standard
      – These files only display correctly because MicroStation is installed, and is responsible for displaying a thumbnail
      – Autodesk may have fixed this in later versions?
Lab 5a
• Open PW Administrator
  – Log in as> adminpw
  – Drill down to:
    • Document Processors> Thumbnail Extraction
  – Right-click, choose> Properties
• Similar to Full Text Indexing
  – But actually less involved
Lab 5b
Turn on
adminpw / adminpw
Set to 60
Lab 5c
Enable all times in the schedule
Set to 2
Lab 5d
• No changes required on the File Type Associations tab
  – Press OK to complete the configuration and close the dialog
    • Within a few minutes, thumbnails should show up in the preview pane
Analysis
• Thumbnails are extracted and stored in the PW database
  – Because document storage may not be local
    • Thus “touching” the document to see the thumbnail in real time is not practical
  – Thumbnail notes
    • Requires less processing than full text
      – MicroStation not running in this process
      – Requires PWOF to extract, but not to display
Review
• Topics covered in this Chapter
  – Full Text Indexing – Configuration
  – Full Text Searches
  – ProjectWise Orchestration Framework
  – Thumbnail Extraction
  – Microsoft Indexing Service
    • And iFilters to extend default supported file types
    • (I have a free Visio, and MSG iFilter from Microsoft)