ProjectWise 101 – Chapter 9 Document Indexing Gary Cochrane – Technical Director Geospatial Sales – North America


Page 1: ProjectWise 101 – Chapter 9 Document Indexing

ProjectWise 101 – Chapter 9
Document Indexing
Gary Cochrane – Technical Director
Geospatial Sales – North America

Page 2: ProjectWise 101 – Chapter 9 Document Indexing

Introduction

• ProjectWise Document Indexing
  – Really means three things
    • Full Text Indexing, in support of full text searching
    • Thumbnail Extraction
    • Document Property Extraction
      – We won’t cover this one in PW101
      – See Bentley Institute PW Admin course guide for this

Page 3: ProjectWise 101 – Chapter 9 Document Indexing

Full Text Indexing

• We did not write the engine for this
  – But elected to use the one Microsoft provides
    • Included with every copy of Windows
  – That engine is called the MS Indexing Service
    • And it was installed in the VM as an optional Windows component
  – Microsoft indexes the following file formats
    • MS Word, Excel, PPT, HTML, XML, TXT

Page 4: ProjectWise 101 – Chapter 9 Document Indexing

Pre-installed in VM

Windows Server 2003 with SP2

Microsoft Indexing Service

Microsoft Message Queuing Service

Microsoft .NET Framework 2.0

MicroStation V8i-SS1

ProjectWise Orchestration Framework

ProjectWise Integration Server

Supported Database Engine

Page 5: ProjectWise 101 – Chapter 9 Document Indexing

Extending the MS Index Service

• Microsoft provides an SDK for third parties to extend the Indexing service
  – So the Indexing service will know how to “filter” files from that vendor
    • For instance, Adobe provides an “iFilter” that teaches the MS Index Service how to extract text from a PDF file
    • The Adobe PDF iFilter is installed with Acrobat Reader V9x

Page 6: ProjectWise 101 – Chapter 9 Document Indexing

Indexing Overview

• Within PW, Indexing consists of:
  – Scheduling
    • A process that wakes up, checks for new (or modified) files, adds them to the Copy-out queue, and goes back to sleep
  – Copy-out
    • Copy the file from the Storage Area to the machine running the Indexing Service. Then add the file to the Extraction queue.
    • Remember, files may be stored on multiple servers
    • Also, in large installations, a machine may be dedicated to indexing

Page 7: ProjectWise 101 – Chapter 9 Document Indexing

Indexing Overview – Part II

• Overview – continued
  – Extraction
    • This process gets the text from the file and adds it to the MS Index catalog. Then it adds the file to the Update queue
  – Update
    • This process sets the flag on the file (in the PW database) that says it is “done”
    • New files are added with the flag set to “undone”
    • Check-out/in causes the flag to be set to “undone”
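The four stages on the two Indexing Overview slides (Scheduling, Copy-out, Extraction, Update) can be pictured as a chain of queues. The following is only a minimal Python sketch of that idea, not ProjectWise code: the `Document` class, the `run_pass` function, and the dictionary used as a "catalog" are all hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Document:
    name: str
    text: str = ""
    done: bool = False  # new and checked-in files start as "undone"


def run_pass(documents, catalog):
    """Run one illustrative scheduler pass; return how many files were processed."""
    # Scheduling: queue every document whose flag is still "undone".
    copy_out_queue = deque(d for d in documents if not d.done)

    # Copy-out: the real step copies the file from the Storage Area to the
    # indexing machine; this sketch only moves it to the next queue.
    extraction_queue = deque()
    while copy_out_queue:
        extraction_queue.append(copy_out_queue.popleft())

    # Extraction: add the document's words to the catalog (a plain dict here).
    update_queue = deque()
    while extraction_queue:
        doc = extraction_queue.popleft()
        for word in doc.text.lower().split():
            catalog.setdefault(word, set()).add(doc.name)
        update_queue.append(doc)

    # Update: set the flag that says the document is "done".
    processed = 0
    while update_queue:
        update_queue.popleft().done = True
        processed += 1
    return processed
```

Running `run_pass` a second time on the same documents processes nothing, because every flag is already "done". Resetting a flag to "undone" (as a check-out/in would) makes that one document eligible again.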

Page 8: ProjectWise 101 – Chapter 9 Document Indexing

A note on “done”

• Done does not necessarily mean it was successful
  – It means the file has been processed
    • In other words, what happens if an unknown file (e.g., an AutoCAD file) is sent to the Indexing Service?
  – The file is attempted…
    • And the indexing service says, “I don’t know how to extract text from this file”
  – There would be no point in trying the file again
    • So it is marked as “done”, even when unsuccessful
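The distinction between "done" and "successful" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the Indexing Service is implemented: the `EXTRACTORS` table and the `process` function are hypothetical, and a real system would rely on installed iFilters rather than a dictionary.

```python
import os

# Hypothetical extractors keyed by extension (stand-ins for installed filters).
EXTRACTORS = {
    ".txt": lambda data: data,
}


def process(filename, data):
    """Return (done, indexed_text) for one file."""
    ext = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(ext)
    if extractor is None:
        # No filter knows this format. Retrying would fail the same way,
        # so the file is still flagged "done", with nothing indexed.
        return True, None
    return True, extractor(data)
```

In both branches the first value is `True`: the flag records that the file was attempted, while the second value shows whether any text was actually extracted.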

Page 9: ProjectWise 101 – Chapter 9 Document Indexing

MicroStation and AutoCAD

• ProjectWise provides a mechanism to index the text from these file types
  – Instead of writing an iFilter, Bentley elected to:
    • Copy out the file
    • Run MicroStation in the background, extract all the text, and write it to an XML file
    • Send the XML file to the Indexing Engine
  – Since MicroStation can parse DWG as well…
    • Then this method saved us from having to write two iFilters

Page 10: ProjectWise 101 – Chapter 9 Document Indexing

Summary

• So within ProjectWise, we index:
  – Word, PPT, Excel, XML, HTML, TXT
  – Adobe PDF
  – DGN & DWG
• More good news
  – iFilters can be found for many file formats
    • Some free, and some for purchase

Page 11: ProjectWise 101 – Chapter 9 Document Indexing

PW Orchestration Framework

• Remember when we installed this?
  – PWOF is responsible for managing batch processes for ProjectWise
    • This includes all those processes discussed on the previous slides
  – For Full Text Indexing, that means
    • Scheduler process, Copy-out process, Extraction process, Updater process, and the MicroStation instance running in the background

Page 12: ProjectWise 101 – Chapter 9 Document Indexing

Lab 1a

• PW Orchestration Framework
  – Start the Windows Task Manager
    • Hint: Right-click on empty part of Taskbar
  – Examine memory usage
    • On the Performance tab
  – Switch to Processes tab
    • Sort by Mem Usage column (descending)
    • Look for ustation.exe
    • Look for DmsAfpEngine(s)
      – Lots of memory consumed here…

Page 13: ProjectWise 101 – Chapter 9 Document Indexing

Lab 1b

• Now open Services dialog
  – Remember “gears” icon on Quick-Launch
    • Locate PW Orchestration Framework service
  – Select the PW OF service, and choose> Stop
    • Watch memory usage in Task Manager
  – For remainder of exercise, we need PWOF running
    • So start it back up now
    • Note PWOF is configured for automatic startup
      – It will run each time machine is booted
  – Close Services and Task Manager

Page 14: ProjectWise 101 – Chapter 9 Document Indexing

Lab 2a

• Open PW Administrator
  – Log in as> adminpw
  – Drill down to:
    • Document Processors> Full Text Indexing
  – Right-click, choose> Properties

Page 15: ProjectWise 101 – Chapter 9 Document Indexing

Lab 2b - Full Text Indexing

adminpw / adminpw

Turn on

Set to 60

Accept default, unless Indexing is to be run on another machine

Page 16: ProjectWise 101 – Chapter 9 Document Indexing

Lab 2c - Full Text Indexing

Set to 2

Enable all times in the schedule

Page 17: ProjectWise 101 – Chapter 9 Document Indexing

Lab 2d

• Switch to File Type Associations tab
  – Press> Add
    • In the Extension field, enter> DWG
    • In the bottom field, enter> DGN
      – So that DWG files are processed as if they were DGN
  – Press> OK

Page 18: ProjectWise 101 – Chapter 9 Document Indexing

Lab 2e

Page 19: ProjectWise 101 – Chapter 9 Document Indexing

Lab 2f

• Still on the File Type Associations tab
  – Again, press> Add
    • In the Extension field, enter> itiff
    • In the bottom, enable> Do not process these documents
      – You can’t extract text from a raster, so this prevents wasted file transfers
  – Press> OK
    • Press OK again
      – To close the Full Text Indexing Properties
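The two associations entered in Labs 2d and 2f behave like a small lookup table: DWG is processed as if it were DGN, and itiff is skipped entirely. A minimal Python sketch of that lookup follows. The `ASSOCIATIONS` dictionary and the `resolve` function are hypothetical names, and using `None` to stand for "Do not process these documents" is an assumption of this sketch, not the way ProjectWise stores the setting.

```python
# Hypothetical in-memory model of the File Type Associations tab.
ASSOCIATIONS = {
    "dwg": "dgn",    # process DWG files as if they were DGN
    "itiff": None,   # None stands for "Do not process these documents"
}


def resolve(filename):
    """Return the extension to process the file as, or None to skip it."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in ASSOCIATIONS:
        return ASSOCIATIONS[ext]
    # No association: the file is processed under its own extension.
    return ext
```

With this table, `resolve("plan.DWG")` yields `"dgn"`, while `resolve("aerial.itiff")` yields `None`, so the raster is never copied out for extraction.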

Page 20: ProjectWise 101 – Chapter 9 Document Indexing

Lab 2g

• Open Task Manager again
  – Switch to Performance tab
    • Within 2 minutes, you should see heavy CPU usage
    • Memory usage will also go up
  – Up to 60 documents will be indexed in the first pass
    • If there are more than 60 documents to be done, then they will be queued in the next pass
      – 2 minutes from now
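The batch size (60) and the interval (2 minutes) set in Lab 2 give a rough way to estimate how long a backlog takes to clear. The Python sketch below does that arithmetic. It assumes every pass takes a full batch and finishes within the interval, which the slides do not guarantee, and the function names are made up for illustration.

```python
import math


def passes_needed(document_count, batch_size=60):
    """Number of scheduler passes needed to cover all documents."""
    return math.ceil(document_count / batch_size)


def minutes_until_last_pass_starts(document_count, batch_size=60, interval_minutes=2):
    """Upper bound, assuming the first pass starts within one interval."""
    return passes_needed(document_count, batch_size) * interval_minutes
```

For example, 500 unindexed documents need 9 passes, so under these assumptions the final pass starts no later than about 18 minutes after indexing is enabled.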

Page 21: ProjectWise 101 – Chapter 9 Document Indexing

Analysis

• All documents will eventually be processed
  – When done, the index will be ready for fast full text searches
    • Once the indexer has caught up, future load will be lighter due to only processing incremental documents

Page 22: ProjectWise 101 – Chapter 9 Document Indexing

Lab 3a

• When done, close Task Manager, open PW Explorer
  – Log in as user1
• From the main tool box, select> Find Documents
  – Binocular icon
• Change to Full Text tab
  – Enter Look For> detail
• Press OK to start search
  – Then Close the Search dialog
• Your results should include: DGNs, DWGs, and PDFs

Page 23: ProjectWise 101 – Chapter 9 Document Indexing

Lab 3b

• Browse to:
  – User1/Document Indexing/MS-SHT
    • These files were not indexed successfully because they have an unknown extension
    • But they were attempted, and flagged as done
• Return to PW Administrator
  – Select datasource name (pwdemo)
    • Right-click, choose> Properties
    • Change to Statistics tab
    • Choose Refresh
    • Review Full Text Statistics
  – Close dialog

Page 24: ProjectWise 101 – Chapter 9 Document Indexing

Lab 3c

• While still in PW Administrator
  – Open Full Text Indexing Properties again
    • Switch to the File Type Associations tab
  – Press Add
    • In the Extension field, enter> SHT
    • In the bottom Extension field, enter> DGN
      – So that SHT files will be processed as if they were DGN files
    • Press OK to complete the Extension mapping
  – Press OK again to close the Properties dialog

Page 25: ProjectWise 101 – Chapter 9 Document Indexing

Lab 3d

• Once new file type has been added…
  – Now a small problem
    • These files were flagged as done, and the Indexer won’t try them again unless they are checked out/in
    • And even that won’t work unless you actually make changes…
    • PW compares files to version on server, and doesn’t transfer back if there are no changes
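Why an unchanged check-out/in does not trigger re-indexing can be shown with a short Python sketch. The slide only says that PW "compares" the file with the server version. Comparing SHA-256 digests is an assumption chosen here for illustration, and `check_in` is a hypothetical function, not a ProjectWise API.

```python
import hashlib


def check_in(server_copy: bytes, local_copy: bytes, done: bool) -> bool:
    """Return the new value of the "done" flag after a check-in."""
    unchanged = (
        hashlib.sha256(server_copy).digest() == hashlib.sha256(local_copy).digest()
    )
    if unchanged:
        # Nothing is transferred back, so the flag keeps its old value.
        return done
    # A real change was transferred, so the document is "undone" again.
    return False
```

A file that was flagged "done" under its old, unknown extension therefore stays "done" after an untouched check-out/in, which is the small problem this slide describes.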

Page 26: ProjectWise 101 – Chapter 9 Document Indexing

Lab 3e

• Rather than check them all out, and back in
  – From PW Administrator
    • Right-click Full Text Indexing
      – Choose>
        • Mark folder Documents for Reprocessing
      – Browse “…” to
        • User1/Document Indexing/MS-SHT
      – Press OK
    • Press OK again

Page 27: ProjectWise 101 – Chapter 9 Document Indexing

Analysis

• Within 2 minutes, these documents will be re-processed
  – If you run the search again (in a few minutes), you should also get SHT files in your results
  – Re-visit Datasource statistics to see if the Full Text categories have changed

Page 28: ProjectWise 101 – Chapter 9 Document Indexing

Summary

• Once the index is created,
  – You can stop the PW Orchestration Framework service
    • It is used to create the index, but not to search the index
  – This will save memory and CPU cycles
    • So in a demo, your machine will run faster
    • BUT new (or modified) files will not be re-indexed
  – Up until now, the PWOF was not being used at all
    • Full Text Indexing is the first time we’ve needed PWOF, even though it has been running since installation

Page 29: ProjectWise 101 – Chapter 9 Document Indexing

PW Thumbnails

• PW Thumbnails is not “indexing” in the proper sense, but it is similar in nature to Full Text
  – PW Thumbnails extracts a thumbnail from the document, and stores a copy in the PW database
    • This allows one to browse PW Explorer, and see thumbnails in the Preview Pane
  – Not all file types support thumbnails
    • Among those that do, some don’t do it per the industry standard

Page 30: ProjectWise 101 – Chapter 9 Document Indexing

Thumbnails – Part II

• Important to remember
  – ProjectWise does not create thumbnails
    • It only extracts what might be in the file
  – A good test is to check to see if Windows Explorer displays a thumbnail for the file
    • If it does, then PW should as well

Page 31: ProjectWise 101 – Chapter 9 Document Indexing

Lab 4a

• Open Windows Explorer
  – Browse to:
    • C:\PW-101 Class Files\Document Indexing\MS-V8
  – Change to Thumbnail display
    • MicroStation V8 files have thumbnails

Page 32: ProjectWise 101 – Chapter 9 Document Indexing

Lab 4b

• Browse through remaining Document Indexing folders
  – Note which include thumbnails
  – Additional notes
    • PDF files take a long time because you are really looking at a small view of the whole file, not a thumbnail
    • AutoCAD doesn’t adhere to the industry standard
      – These files only display correctly because MicroStation is installed, and is responsible for displaying a thumbnail
      – Autodesk may have fixed this in later versions?

Page 33: ProjectWise 101 – Chapter 9 Document Indexing

Lab 5a

• Open PW Administrator
  – Log in as> adminpw
  – Drill down to:
    • Document Processors> Thumbnail Extraction
  – Right-click, choose> Properties
    • Similar to Full Text Indexing
      – But actually less involved

Page 34: ProjectWise 101 – Chapter 9 Document Indexing

Lab 5b

Turn on

adminpw / adminpw

Set to 60

Page 35: ProjectWise 101 – Chapter 9 Document Indexing

Lab 5c

Enable all times in the schedule

Set to 2

Page 36: ProjectWise 101 – Chapter 9 Document Indexing

Lab 5d

• No changes required on the File Type Associations tab
  – Press OK to complete the configuration and close the dialog
    • Within a few minutes, thumbnails should show up in the preview pane

Page 37: ProjectWise 101 – Chapter 9 Document Indexing

Analysis

• Thumbnails are extracted and stored in the PW database
  – Because document storage may not be local
    • Thus “touching” the document to see thumbnail in real-time is not practical
  – Thumbnail notes
    • Requires less processing than full text
      – MicroStation not running in this process
      – Requires PWOF to extract, but not to display
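The split between extracting a thumbnail (a batch job) and displaying it (a lookup) can be sketched as a tiny store. This is a toy Python model under stated assumptions: the `ThumbnailStore` class and its method names are hypothetical, and a dictionary stands in for the PW database.

```python
class ThumbnailStore:
    """Toy stand-in for thumbnails kept in the database (names are made up)."""

    def __init__(self):
        self._thumbs = {}

    def extract(self, doc_name, embedded_thumbnail):
        # Extraction step: only stores what the file already contains.
        # A file with no embedded thumbnail gets no entry.
        if embedded_thumbnail is not None:
            self._thumbs[doc_name] = embedded_thumbnail

    def preview(self, doc_name):
        # Display step: a lookup only; the document itself is never opened.
        return self._thumbs.get(doc_name)
```

Because `preview` reads only from the store, it keeps working after the extraction side has been shut down, which mirrors the note that PWOF is needed to extract but not to display.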

Page 38: ProjectWise 101 – Chapter 9 Document Indexing

Review

• Topics covered in this Chapter
  – Full Text Indexing – Configuration
  – Full Text Searches
  – ProjectWise Orchestration Framework
  – Thumbnail Extraction
  – Microsoft Indexing Service
    • And iFilters to extend default supported file types
    • (I have a free Visio, and MSG iFilter from Microsoft)