8/12/2019 AutoWeb_Guide_2.0a
A Guide to AutoWeb
Release 2.0
Memex Technology Limited
2 Redwood Court
Peel Park
East Kilbride G74 5PF
Scotland UK
Tel: +44 (0) 1355 233 804
Fax: +44 (0) 1355 239 676
Web: http://www.memex.com
Copyright 2007 Memex Technology Limited. All rights reserved.
This manual and the software described herein are the copyright of Memex Technology Limited and may not be copied or disclosed to a third party without the prior written permission of Memex. Whilst all possible care is taken in the preparation of this manual, Memex assumes no responsibility or liability for any errors or inaccuracies that may appear in this document. Memex reserves the right to make changes without notice both to this manual and to the software and hardware it describes.
The software described in this document is furnished under licence and may only be used in accordance with the terms of such licence.
The people, places, organisations, telephone numbers, vehicle identification numbers and other details referred to in the sample record data in this publication are entirely fictitious. These details have been created for demonstration purposes only and do not refer to any actual organisation, telephone number, vehicle, etc., or to any actual person, living or dead.
The text of this document may include references to previous releases of the product, for example, in screenshots and procedural examples. Regardless of any versions that may be mentioned, this manual describes the current functionality provided by the release of the software identified on the title page.
Trademarks
Memex, Textract and Total Content Access are registered trademarks of Memex Technology Limited. Microsoft, PowerPoint and Windows are registered trademarks of Microsoft Corporation. Other product, brand and company names mentioned herein are trademarks or registered trademarks of their respective owners and should be treated as such.
2.0a-5-IJ-AC-20070912-1.6
Contents
Scope
    Related documents
    Product names
Introduction
    AutoWeb toolbar
    AutoWeb server
Chapter 1 Installing the AutoWeb server
    Server components
    Installation prerequisites
        SFU requirements
    Installing the server components
        Installing using the auto-installer
        Using the auto-installer on Windows
        Using the auto-installer on Solaris or Linux
        Installing using the tar file
        Creating extra databases
    Setting up the AutoWeb configuration file
        The default spider.cfg file
        HTTrack options and robots.txt
    Upgrading to AutoWeb 2.0
        Unpack the installation package
        Updating the configuration database
        Run the upgrade scripts
Chapter 2 Installing the AutoWeb client
    Installing the toolbar
        Configuring the toolbar
        Configuring the toolbar from the Windows registry
        How the toolbar works
    Memex Analyst forms
    Installation tasks
        Memex Intelligence Engine
        Memex Patriarch
        AutoWeb databases for Memex Patriarch
    Configuration tasks
        Modifying the spider.cfg file
        Linking to the WebConfig database
        Linking to the WebArchive database
    Setting up picklists
    Adding additional web archives
Chapter 4 Using AutoWeb
    Selecting a Memex database
    Specifying keywords
    Indexing Web page text
    Indexing a Web page
    Viewing indexed pages
    Monitoring Web sites
    Specifying the sites you want to monitor
        Specifying sites - Memex Patriarch
        Specifying sites - Memex Analyst
        Fields on the configuration form
    How Web site monitoring works
    Stopping getsite.pl
        Extracting the Web page text
Appendix A Known limitations
Appendix B Troubleshooting
Appendix C HTTrack options
Appendix D Upgrading to AutoWeb 1.3
    Backing up your previous AutoWeb setup
    Installing AutoWeb 1.3
    Converting your AutoWeb data
        Setting up the conversion script
        Running the conversion script
Scope
This guide provides detailed installation and user instructions for release 2.0 of AutoWeb.
The document contains:
- An overview of the AutoWeb application
- Installation and configuration instructions for the client and server components
- Detailed user instructions
- Information on known limitations
- Instructions on how to upgrade from a previous release
If you have any comments about this guide, please contact Memex Customer Support:
Related documents
For further information about this release of AutoWeb, please read the AutoWeb Release Notes.
Product names
This manual contains references to other Memex products. The names of some of these products were changed recently for new releases of the software. The name changes are shown in the following table.

Current name: Memex Patriarch
Previous name: Intelligence Manager
Notes: Memex Patriarch is a desktop client application, whereas Intelligence Manager comprises a desktop application plus various server components.

Current name: Memex Analyst
Previous name: Intelligence Analyst

Current name: Memex Series VI
Previous name: The Intelligence Manager bundle
Notes: Memex Series VI and the Intelligence Manager bundle are sets of compatible products.

Current name: Memex Series VI Server
Previous name: The Intelligence Manager server components plus the Memex Intelligence Engine
Notes: The Memex Series VI Server comprises the Memex Intelligence Engine plus various server components that support the client applications.
This manual uses the name of the current release of the software unless specifically referring to an older release. Unless stated otherwise, details referring to a product by its current name also apply to releases of the products that used the previous name.
Introduction
AutoWeb provides an easy way to extract text from a Web site and transfer it to a Memex database.
AutoWeb has two main components:
- A toolbar that integrates into Internet Explorer and allows you to index individual pages directly from the browser.
- A server-side process that you can either run manually or as part of a cronjob.
AutoWeb toolbar
When you use the AutoWeb toolbar, you can choose to index all the text from a Web page or just index selected text. The toolbar also allows you to specify the Memex database where you want to index the Web page, and to enter keywords associated with the page.
AutoWeb server
The server-side process reads the contents of a configuration database containing information on which pages should be indexed. The process then mirrors (that is, stores a local copy of) each Web page and creates a record in a Memex database. The mirrored files are used for displaying the Web page in a browser. The database is used for retrieving a Web page based on a search query entered in Memex Patriarch or Memex Analyst.
Whenever a page is indexed, either from the toolbar or from the server process, AutoWeb makes a copy of the page. This enables you to access historical copies of the pages you have indexed.
Note AutoWeb is designed to be integrated with Memex Patriarch and Memex Analyst, or with Intelligence Manager and Intelligence Analyst if you are using older versions of these applications. You can use either application to view the configuration and index records and access the indexed Web pages.
Chapter 1 Installing the AutoWeb server
Server components
This table lists the components that the AutoWeb server installation process installs.

Name: Details

bin/HTTrack
    HTTrack is a utility that is used to mirror Web pages.
bin/libhttrack.so.1
    Shared library for HTTrack (for Solaris).
bin/lynx
    Lynx is a text-based browser utility that is used to extract the text from Web pages.
bin/lynx.cfg
    Configuration file for the Lynx utility.
bin/getsite.pl
    This perl script is run as a cronjob. It looks at the contents of the config.db database and indexes any sites that have been set up.
bin/addtomemex.pl
    This perl script is called by any HTTrack process that is launched from getsite.pl. HTTrack calls this script every time it downloads a file. The script then decides what to do with the file and adds a record to a database if necessary.
bin/addpagefile.pl
    This perl script is called by any HTTrack process that is launched from the file IndexPage.pl. HTTrack calls this script every time it downloads a file. The script then decides what to do with the file.
cgi-bin/Bar.pl
    This is a cgi script for backwards compatibility with the original Memex toolbar (Version 1.0a). This controls what appears on that version of the toolbar and the actions that the toolbar buttons perform.
cgi-bin/Databases.pl
    This is a cgi script that is used by the new Memex toolbar (Version 1.0b) to determine the list of databases.
cgi-bin/IndexPage.pl
    This is a cgi script that is called whenever a user selects Index Selected Text or Index Page.
config.db
    The database that contains information on what sites getsite.pl should index.
databases
    This directory contains all the databases where the indexed pages are stored.
dbconfigs
    This directory contains the database configs.
images/memexbar.bmp
    This bitmap is an image list for the toolbar.
install
    The install script for the server installation.
mirror
    This directory contains the mirrored Web pages.
spider.cfg
    This is the config file for AutoWeb.
locales/EN.loc
    English locale file.
perlmodules/Config/General.pm
perlmodules/Config/General/Extended.pm
perlmodules/Config/General/Interpolated.pm
perlmodules/File/Basename.pm
perlmodules/File/CheckTree.pm
perlmodules/File/Compare.pm
perlmodules/File/Copy.pm
perlmodules/File/DosGlob.pm
perlmodules/File/Find.pm
perlmodules/File/Path.pm
perlmodules/File/Spec.pm
perlmodules/File/stat.pm
perlmodules/File/Spec/Functions.pm
perlmodules/File/Spec/Mac.pm
perlmodules/File/Spec/OS2.pm
perlmodules/File/Spec/Unix.pm
perlmodules/File/Spec/VMS.pm
perlmodules/File/Spec/Win32.pm
    Each of these files is a required perl module.
Installation prerequisites
Before you can install the AutoWeb server, your system must contain:
- One of the following operating systems:
    - Sun Solaris 10
    - Red Hat Enterprise Linux 4
    - Microsoft Windows Services for UNIX 3.5
- Perl 5.0 or greater
- Memex Intelligence Engine (MIE) 6.0
- Apache 2 HTTP server.
Apache must be configured to run as the Memex administrator user.
To configure Apache 2 to run as the Memex administrator user:
- Change to the directory where Apache's httpd.conf file is located. For example:
    cd /usr/local/apache2/conf
- Edit the httpd.conf file with a plain text editor, such as vi.
- Locate the section of the configuration file that specifies the user as whom the httpd service will run. For example, to force Apache 2 to run as the user mxadmin in the group mxadmins, add or change the User and Group lines:
    User mxadmin
    Group mxadmins
Apache's log files must be writable by the Memex administrator user (typically mxadmin or mxroot).
To do this on Solaris or Linux:
- su as root
- Change the ownership of the directory where Apache's log files reside. The location of the log files is specified in Apache's httpd.conf file. The directory and its contents should be owned by the Memex administrator user. For example:
    chown -R mxadmin:mxadmins /var/apache2/logs
To do this on Windows SFU:
- From an SFU command console, su as Administrator.
- Change the ownership of the directory where Apache's log files reside. The location of the log files is specified in Apache's httpd.conf file. The directory and its contents should be owned by the Memex administrator user. For example:
    chown -R SERVERNAME+mxadmin:SERVERNAME+mxadmins /usr/local/apache2/logs
To run the getsite.pl script as a cronjob (see Monitoring Web sites on page 33), the Memex administrator account (usually mxadmin or mxroot) must have a home directory.
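The User and Group settings described above can be checked from a shell before you install. The following is a minimal sketch, not part of AutoWeb: apache_run_as is a hypothetical helper name, and the demonstration runs against a throwaway file rather than a live httpd.conf.

```shell
#!/bin/sh
# apache_run_as CONF: print "user:group" from the uncommented User and
# Group directives in an Apache configuration file.
# (Hypothetical helper for illustration; not shipped with AutoWeb.)
apache_run_as() {
    conf="$1"
    # Keep the last matching directive found in the file.
    user=$(awk '$1 == "User" { u = $2 } END { print u }' "$conf")
    group=$(awk '$1 == "Group" { g = $2 } END { print g }' "$conf")
    echo "${user}:${group}"
}

# Demonstration against a temporary file.
demo_conf=$(mktemp)
printf '%s\n' '# User nobody' 'User mxadmin' 'Group mxadmins' > "$demo_conf"
apache_run_as "$demo_conf"
rm -f "$demo_conf"
```

On a real server you would pass the path of your own httpd.conf and compare the result with the owner of the log directory, for example as reported by ls -ld.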
SFU requirements
If you are installing on SFU, you must first install the following software packages:

Package name    Description
httpd           Apache 2 HTTP Server
lynx            Lynx Web browser for terminals
zlib            Zlib data compression library

These packages are available from the SFU Tools Warehouse Web site:
    http://www.interopsystems.com/tools/warehouse.htm
To install these packages, first download and install the package installer that is available as a shell script from the same Web site. You can then issue simple commands from a shell console window that use the package installer to download and install the software packages and all their dependencies. For example, to install Apache 2, run the command:
    pkg_update -L httpd
For more information, see the SFU Tools Warehouse Web site.
Installing the server components
The method for installing the AutoWeb server components varies depending on whether your MIE was installed as part of a Memex Series VI Server installation. If you are adding AutoWeb to a Memex Series VI system, use the auto-installer method described here. Otherwise use the tar file method on page 12.
Installing using the auto-installer
The auto-installer is available for Windows, Linux and Solaris. You must have a Memex Series VI Server setup to be able to use the AutoWeb auto-installer.
Using the auto-installer on Windows
1. Locate the autoweb_windows.exe file in Windows Explorer.
2. Right-click this file and choose Run As.
3. Select The following user and enter \Administrator.
4. Enter the password for Administrator and click OK.
5. Follow the setup instructions on screen:
   - Memex recommends leaving the destination directory as C:\SFU\opt\memex
   - In most cases you can leave the host name and port settings at their default values:
       Host name: localhost
       Port: 9001
   - Enter the name and password of an MIE superuser. To check the names of current MIE superusers, look at the values of the superusers element in the memexsvr.xml file (usually located in /opt/memex/etc).
6. As instructed at the end of the auto-installation process, add an Include statement to Apache's httpd.conf file.
   For example, from an SFU shell, run the command:
       echo "Include /opt/memex/autoweb/config/apache2.conf" >> /usr/local/apache2/conf/httpd.conf
7. Start, or restart, Apache Web server:
       /usr/local/apache2/bin/apachectl restart
Using the auto-installer on Solaris or Linux
1. Log on to the server as the local root user.
2. Locate the autoweb_linux.sh install script and run it by typing the command:
       sh autoweb_linux.sh
3. Follow through the setup instructions on screen. (The default values are usually correct for each):
   - Memex recommends leaving the destination directory as /opt/memex
   - Enter the host name and port number of your Memex Series VI Server. The default values of localhost and 9001 are usually correct, but you can modify them. If you are installing AutoWeb on a server other than the one that hosts your Memex Series VI setup, you must also provide the port number for that server's MIE. Otherwise, enter the same value as you entered for the previous port number.
   - Enter the name and password of an MIE superuser. To check the names of current MIE superusers, look at the values of the superusers element in the memexsvr.xml file (usually located in /opt/memex/etc).
   Note If any of the values you enter for the previous two steps are incorrect, the installer will display an error and prompt you to re-enter the correct values.
4. As instructed at the end of the auto-installation process, add an Include statement to Apache's httpd.conf file. For example, run the command:
echo "Include /opt/memex/autoweb/config/apache2.conf" >>
/httpd.conf
Where is a path such as /etc/apache2.
5. Start, or restart, the Apache Web server:
/bin/apachectl restart
Where is a path such as /usr/apache2.
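Running the echo command in step 4 more than once would add the Include line to httpd.conf more than once. The following is a minimal sketch of a guarded append, assuming a POSIX shell. add_include is a hypothetical helper name, and the demonstration uses a temporary file in place of the real httpd.conf.

```shell
#!/bin/sh
# add_include CONF LINE: append LINE to CONF unless an identical line
# is already present. (Hypothetical helper; not shipped with AutoWeb.)
add_include() {
    conf="$1"
    line="$2"
    # -x matches whole lines, -F treats the text literally, -q is quiet.
    if grep -qxF "$line" "$conf"; then
        echo "already present"
    else
        printf '%s\n' "$line" >> "$conf"
        echo "added"
    fi
}

# Demonstration against a temporary file: the second call changes nothing.
demo_conf=$(mktemp)
echo "ServerName localhost" > "$demo_conf"
add_include "$demo_conf" "Include /opt/memex/autoweb/config/apache2.conf"
add_include "$demo_conf" "Include /opt/memex/autoweb/config/apache2.conf"
rm -f "$demo_conf"
```

Apache still needs to be restarted after the file changes, as described in step 5.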
Installing using the tar file
This method of installation should only be used if your Memex server was set up manually and not with the Memex Series VI auto-installer. If you are unsure which type of Memex setup you have, [email protected].
Note You must install the AutoWeb server components as the Memex administrator account. For example, mxadmin or mxroot.
11. Configure your web server so that the images subdirectory is visible as a Web subdirectory.
    To do this, add a line to Apache's httpd.conf file, such as:
        Alias /autoweb-images/ /opt/memex/autoweb/images/
    Note: The name that you give to this alias will have an impact on the imglst entry within the spider.cfg file.
12. Add a cgi-bin directory to your Web server called /autoweb-bin/. This directory must be aliased to the cgi-bin subdirectory within the autoweb directory.
    To do this, add a line to Apache's httpd.conf file, such as:
        ScriptAlias /autoweb-bin/ /opt/memex/autoweb/cgi-bin/
13. Make a note of the full URL location of this ScriptAlias.
    You enter this URL when configuring the AutoWeb client toolbar.
14. Add a directory to your Web server that points to the cgi-bin subdirectory within the autoweb directory.
    To do this, add the following lines to Apache's httpd.conf file:
        AllowOverride None
        Options None
        Order allow,deny
        Allow from all
    Where is the location of your AutoWeb installation, typically /opt/memex/autoweb.
    You must also add another directory to your Web server for each of the mirror and images directories similar to the one shown above for the cgi-bin directory. For example:
        AllowOverride None
        Options None
        Order allow,deny
        Allow from all
    and
        AllowOverride None
        Options None
        Order allow,deny
        Allow from all
Creating extra databases
One sample database is created as part of the installation process. The sample database is called webarchive. The directory for AutoWeb databases is: /opt/memex/autoweb/databases.
You can create extra databases by using the ns_create command followed by the mkphonetic command. For example:
ns_create -c /opt/memex/autoweb/dbconfigs/config.archive
-n 8192 /opt/memex/autoweb/databases/mynewdb
mkphonetic /opt/memex/autoweb/databases/mynewdb
See the Memex Intelligence Engine Administrator's Guide for more information on the ns_create and mkphonetic utilities.
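If you create several extra databases, it can help to generate the two commands from the database name. The following is a minimal sketch that only prints the commands shown in the example above and does not run them. new_db_commands and AUTOWEB_HOME are hypothetical names, and the paths and the -n 8192 value are copied from that example.

```shell
#!/bin/sh
# Assumed installation directory, as used in the example above.
AUTOWEB_HOME=/opt/memex/autoweb

# new_db_commands NAME: print the ns_create and mkphonetic commands
# for an extra database called NAME. Nothing is executed.
# (Hypothetical helper; not shipped with AutoWeb.)
new_db_commands() {
    db="$AUTOWEB_HOME/databases/$1"
    echo "ns_create -c $AUTOWEB_HOME/dbconfigs/config.archive -n 8192 $db"
    echo "mkphonetic $db"
}

new_db_commands mynewdb
```

Review the printed commands, then run them as the Memex administrator account.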
Setting up the AutoWeb configuration file
spider.cfg is the configuration file for AutoWeb. This table lists the entries that the configuration file must contain. The default spider.cfg file is shown on page 17.

Name: Details

installpath
    The installation directory of the AutoWeb server. This is set automatically by the install script.
    For example: /opt/memex/autoweb
locale
    The language locale to use for the server responses to the Memex toolbar. This must be set to match one of the files in the locales directory in the installation path.
    For example: EN
mirrorurl
    The URL for the mirror directory. This must contain the full domain name and the alias that you gave for the mirror directory.
    For example: http://server.domain.com/autoweb-mirror
httracklib
    The path to the lib file for HTTrack.
    For example: /opt/memex/autoweb/bin
httrack
    The path to the HTTrack executable.
    For example: /opt/memex/autoweb/bin/httrack
opts
    The options that getsite.pl uses to call HTTrack (see HTTrack options and robots.txt on page 17).
    For example: -n -%e0
stdopts
    More options that getsite.pl uses to call HTTrack.
    For example: -I0 -Qq --assume cfm=text/html,php=text/html -X0 -%F ""
append
    The path to the ns_append utility.
    For example: /opt/memex/mie/bin/ns_append
decode
    The path to the decode utility.
    For example: /opt/memex/mie/bin/decode
configdb
    The path to the config database for getsite.pl.
    For example: /opt/memex/autoweb/config.db
lynx
    The path to the lynx utility and the parameters that must be passed.
    For example: /opt/memex/autoweb/bin/lynx -cfg="/opt/memex/autoweb/bin/lynx.cfg"
domain
    The web server domain.
    For example: server.domain.com
imglst
    The path that will be added to the domain to retrieve the image list for the toolbar. The first part of this must be the name that you gave to the alias for the /images directory.
    For example: /autoweb-images/memexbar.bmp
cgi-bin
    The path that will be added to the domain to access the cgi-bin for AutoWeb. This must be the name of the alias that you gave for the cgi-bin directory.
    For example: /autoweb-bin/
pageopts
    The options used in the call from indexpage.pl to HTTrack.
    For example: -%P0 -C0 -I0 -%Q -n -Qq -d --assume cfm=text/html,php=text/html -X0 -%F ""
logfile
    The location of the log file for AutoWeb. If this entry does not exist, no log file is created.
    For example: /opt/memex/logs/crawlerlog.txt
filtertypes
    A list of the file types that AutoWeb will not write a record for.
    For example: ra|ram|jpg|gif|pbm|mov|avi|wmv|css|pdf|ps|js|xml|rdf
lockfile
    The lock file that is used to prevent getsite.pl from running more than once.
    For example: /tmp/autoweblock
notrenamed
    A list of the file types that HTTrack does not rename as html.
    For example: html|htm|txt
imbase
    The installation directory of the Memex Patriarch software on the server. This entry is optional and is only necessary if you want to use AutoWeb from within Memex Patriarch.
    This parameter should usually be set to: /opt/memex/im
rollover
    The number of days before the mirror directory is rolled over.
    Rolling over the mirror directory involves creating a new subdirectory in the location specified by the mirrorurl setting. If you leave this at the default of 7, a new mirror subdirectory is created every 7 days for storing Web pages in (2007001, 2007002 and so on).
    To turn off this process, set the value to 0, although this is not recommended. The default and recommended value in the provided file is 7.
Note You use different configuration file variables to specify the HTTrack options, depending on how you are running AutoWeb:
- If you are running the AutoWeb toolbar, use the pageopts variable to specify the HTTrack options.
- If you are running AutoWeb as a cronjob via getsite.pl, use the stdopts variable to specify the HTTrack options.
The default spider.cfg file
#Config file for Intelligence Mirror
installpath /opt/memex/autoweb
mirrorurl http://localhost/autoweb-mirror
httracklib /opt/memex/autoweb/bin
httrack /opt/memex/autoweb/bin/httrack
opts -n -%e0 -A32000
stdopts -I0 -Qq --assume cfm=text/html,php=text/html -X0 -%F ""
append /opt/memex/mie/bin/ns_append
decode /opt/memex/mie/bin/decode
configdb /opt/memex/autoweb/config.db
lynx /opt/memex/autoweb/bin/lynx -cfg="/opt/memex/autoweb/bin/lynx.cfg"
domain localhost
imglst /autoweb-images/memexbar.bmp
cgi-bin /autoweb-bin/
pageopts -%P0 -C0 -I0 -%Q -n -Qq -d --assume cfm=text/html,php=text/html -X0 -%F ""
logfile /opt/memex/autoweb/crawlerlog.txt
filtertypes ra|ram|jpg|gif|pbm|mov|avi|wmv|css|pdf|ps|js|xml|rdf
lockfile /tmp/spiderlock
notrenamed html|htm|txt
locale EN
imbase /opt/memex/im
rollover 7
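When checking an installation, it can be useful to read a single setting out of spider.cfg from a shell. The following is a minimal sketch that assumes each entry is a name followed by whitespace and then the value, as in the default file. cfg_get is a hypothetical helper name, and the demonstration file holds only a few of the default entries.

```shell
#!/bin/sh
# cfg_get FILE KEY: print the value of KEY from a "name value" style file.
# (Hypothetical helper; not shipped with AutoWeb.)
cfg_get() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                  # skip comment lines
        $1 == key {
            # Remove the name and the gap after it, leaving the value.
            sub(/^[ \t]*[^ \t]+[ \t]+/, "")
            print
            exit
        }
    ' "$1"
}

# Demonstration against a temporary copy of a few default entries.
demo_cfg=$(mktemp)
cat > "$demo_cfg" <<'EOF'
#Config file for Intelligence Mirror
installpath /opt/memex/autoweb
opts -n -%e0 -A32000
rollover 7
EOF
cfg_get "$demo_cfg" opts
cfg_get "$demo_cfg" rollover
rm -f "$demo_cfg"
```

The whole remainder of the line is returned, so multi-word values such as the HTTrack option strings come back intact.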
HTTrack options and robots.txt
A robots.txt file is stored in the root of most Web servers. This file alerts crawlers and web spiders, such as AutoWeb, as to which pages they should ignore when retrieving pages from the remote Web server.
The original specification of this standard and the IETF draft are available from the following sites:
    http://www.robotstxt.org/wc/norobots.html
    http://www.robotstxt.org/wc/norobots-rfc.html
Because robots.txt restricts the files that can be downloaded by web spiders, it has an impact on the AutoWeb server software and its ability to track and store Web pages.
AutoWeb uses HTTrack software to retrieve remote Web pages. If required, you can configure HTTrack to either follow or ignore the directives in the robots.txt file. You do this by changing the opts setting in the spider.cfg file. For more information, see Appendix C HTTrack options on page 41.
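To see what a site's robots.txt asks crawlers to skip, you can save the file and list its Disallow entries. The following is a simplified sketch, not how HTTrack itself reads the file: it assumes one User-agent line per group and a space after each field name, and disallowed_paths is a hypothetical helper name.

```shell
#!/bin/sh
# disallowed_paths FILE: list the Disallow paths in the "User-agent: *"
# group of a saved robots.txt file.
# (Hypothetical helper; a simplified reading of the format.)
disallowed_paths() {
    awk '
        # A User-agent line starts a new group; remember if it is "*".
        tolower($1) == "user-agent:" { in_star = ($2 == "*"); next }
        # Print non-empty Disallow values while inside the "*" group.
        in_star && tolower($1) == "disallow:" && $2 != "" { print $2 }
    ' "$1"
}

# Demonstration against a temporary file.
demo_robots=$(mktemp)
cat > "$demo_robots" <<'EOF'
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: otherbot
Disallow: /
EOF
disallowed_paths "$demo_robots"
rm -f "$demo_robots"
```

If pages you expect to be mirrored are missing, comparing their paths with this list is a reasonable first check before changing the opts setting.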
Upgrading to AutoWeb 2.0
The following series of instructions must be performed to upgrade an AutoWeb 1.3 installation to AutoWeb 2.0. If an upgrade is being performed from AutoWeb 1.0 or 1.1, the configuration must be upgraded to AutoWeb 1.3 before the following steps can be applied. Instructions for upgrading to version 1.3 are given in the appendix on page 43.
Unpack the installation package
Unpack the AutoWeb 2.0 installation package in a temporary location. For example:
tar -xvf mxwasvr--.tar
Updating the configuration database
Memex Analyst config.db database
If you are using Memex Analyst for adding/editing configuration records for AutoWeb, two new fields must be added to the config file for the config.db database. The path to this file is typically /opt/memex/autoweb/config.db/config. Use a plain text editor, such as vi, to edit this file, adding the following two lines to the end of the file:
    field: 6 index xxindex ""
    field: 7 priority xxpriority ""
Note If the field numbers 6 and 7 are currently used by other fields, use the next available highest numbers that are not currently in use.
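The note above can be worked through with a small script that finds the highest field number already in use and prints the two new lines with the next free numbers. This is a minimal sketch assuming that existing definitions follow the same "field: N name ..." layout as the two lines shown. next_field_lines is a hypothetical helper name, and the field definitions in the demonstration are placeholders, not the real config.db layout.

```shell
#!/bin/sh
# next_field_lines FILE: print the index and priority field lines using
# the two numbers after the highest field number found in FILE.
# (Hypothetical helper; not shipped with AutoWeb.)
next_field_lines() {
    awk '
        # Track the largest field number seen so far.
        $1 == "field:" && $2 + 0 > max { max = $2 + 0 }
        END {
            printf "field: %d index xxindex \"\"\n", max + 1
            printf "field: %d priority xxpriority \"\"\n", max + 2
        }
    ' "$1"
}

# Demonstration with placeholder definitions numbered 1 to 7.
demo_config=$(mktemp)
i=1
while [ "$i" -le 7 ]; do
    echo "field: $i name$i xxname$i \"\"" >> "$demo_config"
    i=$((i + 1))
done
next_field_lines "$demo_config"
rm -f "$demo_config"
```

The script only prints the lines. Add them to the config file yourself once you have checked the numbers.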
Memex Patriarch WebConfig Database
If you use Memex Patriarch for adding/editing configuration records for AutoWeb (that is, if config.db is a symbolic link to the Memex Patriarch WebConfig database), you must add index and priority fields to the WebConfig database definition. Do this within Memex Patriarch, using Entity Manager. See the Memex Patriarch online help for details of how to add new fields.
The Memex Patriarch form for WebConfig records (and, optionally, the form for WebArchive records) should be replaced by the forms supplied in the im13autoweb/forms directory of the distribution. For example:
    cp im13autoweb/forms/WebConfig.form /opt/memex/im/CS/files/forms
    cp im13autoweb/forms/WebArchive.form /opt/memex/im/CS/files/forms
Two new picklists should be added within List Management to the WebConfig database definition for the index and priority fields. Index should have the values YES and NO. Priority should have the values HIGH, MEDIUM and LOW.
See the Memex Patriarch online help for details on creating picklists.
Note These picklist files are supplied with the AutoWeb distribution in im13autoweb/picklists.
Run the upgrade scripts
In the directory where the AutoWeb 2.0 installation package was unpacked, enter the following command:
    sh upgrade-scripts
Where is the installation directory of the existing AutoWeb 1.3 software. This is normally /opt/memex/autoweb.
Chapter 2 Installing the AutoWeb client
Installing the toolbar
To install the AutoWeb toolbar:
1. In Windows Explorer, browse to the location of the supplied AutoWeb.exe file for the client application.
2. Double-click AutoWeb.exe.
   This launches the AutoWeb InstallShield program.
3. Click Yes to accept the license agreement.
   This displays the Choose Destination Location page.
4. Browse to the location where you want to install the files, and click Next.
   The InstallShield program installs the AutoWeb files and displays a confirmation message when the installation is complete.
5. Click Finish to acknowledge the message.
Configuring the toolbar
After installing the AutoWeb toolbar, open Internet Explorer and make sure that the toolbar is now available.
If the toolbar is not visible, choose View > Toolbars > AutoWeb. This adds the AutoWeb toolbar to Internet Explorer.
The toolbar should look like this:
To configure the AutoWeb toolbar:
1. Click the arrow beside the AutoWeb button and choose Configuration from the drop-down list.
This displays the Configuration dialog box.
2. Enter the URL of the cgi-bin directory on the web server where the AutoWeb server software is installed. Typically, this is: http://server.domain/autoweb-bin/
For example: http://achilles.memex.com/autoweb-bin/
You can check this value by looking for the relevant ScriptAlias entry in Apache's httpd.conf file (or in the /opt/memex/autoweb/config/apache2.conf file for an installation with Memex Series VI Server).
3. Click OK.
This enables the AutoWeb toolbar. All the toolbar options will now be available.
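The ScriptAlias check mentioned in step 2 can be done from a shell on the web server. The following is a minimal sketch, not part of the AutoWeb distribution: the function name script_aliases is hypothetical, and the location of the Apache configuration file depends on your installation.

```shell
#!/bin/sh
# script_aliases FILE
# Print the URL path and the directory of every ScriptAlias directive in an
# Apache configuration file. Commented-out directives are skipped because
# their first field is "#" or starts with "#".
# (Hypothetical helper; the function name is an assumption.)
script_aliases() {
    awk 'tolower($1) == "scriptalias" { print $2, $3 }' "$1"
}

# Example (adjust the path for your server):
#   script_aliases /opt/memex/autoweb/config/apache2.conf
```

The first field of each output line is the path to append to the server name when you build the URL for the Configuration dialog box.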
Configuring the toolbar from the Windows registry
If you are installing the AutoWeb toolbar on a significant number of machines, or if you want to restrict user access to the Configuration option, you can configure the toolbar via a specific registry file, autoweb.reg. This file is supplied by Memex alongside the client installation file.
You specify the following settings in the autoweb.reg file:
URL
The full URL of the cgi-bin directory on the web server where the AutoWeb server software is installed.
ConfigDisabled
A DWORD value in the registry. Set this to 1 (or any non-zero value) to disable the AutoWeb toolbar's Configuration menu option.
For example, a typical autoweb.reg file looks like this:
REGEDIT4
[HKEY_LOCAL_MACHINE\SOFTWARE\Memex Technology Ltd\AutoWeb]
"URL"="http://server.domain/autoweb-bin/"
"ConfigDisabled"=dword:00000000
Double-click this file to apply the changes to the Windows registry of the local computer.
Note: These settings apply to all user accounts on the computer. The changes are applied to Internet Explorer the next time it is started.
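If you prepare registry files for several servers, the file can be generated rather than edited by hand. The sketch below simply reproduces the four-line layout of the example above; the function name make_autoweb_reg and its arguments are hypothetical and are not part of the AutoWeb distribution.

```shell
#!/bin/sh
# make_autoweb_reg URL [CONFIG_DISABLED]
# Write the contents of an autoweb.reg file to standard output.
#   URL              full URL of the cgi-bin directory on the web server
#   CONFIG_DISABLED  0 (default) leaves the Configuration option enabled,
#                    1 disables it
# (Hypothetical helper; only the file layout comes from the example above.)
make_autoweb_reg() {
    url=$1
    disabled=${2:-0}
    printf 'REGEDIT4\n'
    printf '[HKEY_LOCAL_MACHINE\\SOFTWARE\\Memex Technology Ltd\\AutoWeb]\n'
    printf '"URL"="%s"\n' "$url"
    # dword values are written as eight hexadecimal digits
    printf '"ConfigDisabled"=dword:%08x\n' "$disabled"
}

# Example:
#   make_autoweb_reg "http://achilles.memex.com/autoweb-bin/" 1 > autoweb.reg
```

Depending on how the file reaches the client machines, you may need to convert it to Windows line endings before applying it.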
To add a further level of security, you can place security permissions on these registry keys to prevent them being changed. This stops users from reconfiguring the toolbar themselves. For more information on setting permissions for registry keys, see your Microsoft Windows documentation.
How the toolbar works
Implementation
The AutoWeb toolbar is implemented as a native DeskBand component for Internet Explorer using Visual C++. This requires the MXAutoWeb.dll file to be registered on each client machine. After the library is registered, users can display the toolbar by opening Internet Explorer and selecting View > Toolbars > Memex AutoWeb Toolbar.
Configuration
The toolbar configuration is controlled by the following registry key:
HKEY_LOCAL_MACHINE\SOFTWARE\Memex Technology Ltd\AutoWeb
This key holds the string value URL, which contains the base URL of the cgi-bin directory on the web server containing the CGI scripts.
Processing index requests
When a user clicks Index Page or Index Selected Text on the toolbar, AutoWeb sends an HTTP request to the IndexPage.pl Perl CGI script, located within the cgi-bin directory on the server.
This request contains the following parameters:
• The Memex database where the indexed text will be stored
• The keywords to add to the database record
• The (selected) text from the page
• An indication as to whether the user is indexing the entire page or just selected text
• The Web page's URL
IndexPage.pl then calls HTTrack for the URL (this call is run in the background). HTTrack attempts to create a mirror of that page.
This call to HTTrack contains a parameter specifying whether each indexed file will contain a timestamp in the filename. HTTrack in turn calls addpagefile.pl, which compares the new indexed file with the most recent version on the local server.
• If the files are the same, the new version is deleted and replaced with a symbolic link to the most recent file.
• If the files are different, the new file becomes the most recent version and is used for any subsequent comparisons.
After completing the call to HTTrack, IndexPage.pl writes a record into the specified database containing:
• The original URL
• The URL of the mirror
• The keywords
• The (selected) text from the page
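The comparison step described above for addpagefile.pl can be illustrated in shell. This is a sketch of the described behaviour only, not the actual script: the function name keep_or_link and the file names in the example are hypothetical.

```shell
#!/bin/sh
# keep_or_link NEW LATEST
# Compare a newly mirrored file with the most recent stored version.
# If the contents are identical, delete NEW and replace it with a symbolic
# link to LATEST; otherwise leave NEW in place as the new most recent
# version. Prints the path that later comparisons should use.
# Pass absolute paths so that the symbolic link resolves correctly.
# (Hypothetical helper illustrating the behaviour described above.)
keep_or_link() {
    new=$1
    latest=$2
    if cmp -s "$new" "$latest"; then
        rm -f "$new"
        ln -s "$latest" "$new"
        printf '%s\n' "$latest"
    else
        printf '%s\n' "$new"
    fi
}

# Example (hypothetical file names):
#   keep_or_link /mirror/site/index_20070302.html /mirror/site/index_20070301.html
```

Replacing unchanged copies with links means that every timestamped name still opens a page, while only one copy of identical content uses disk space.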
Chapter 3 Using AutoWeb with Memex Patriarch
Note: This chapter contains information on configuring AutoWeb to be used with Memex Patriarch on a Memex system that was manually installed. If your system is a Memex Series VI Server that was installed using the provided auto-installer (i.e. you use Memex Patriarch to administer your system), you can skip this chapter and continue reading Chapter 4 Using AutoWeb on page 31.
AutoWeb is designed to integrate with Memex Patriarch and Memex Analyst. However, you must perform some extra installation and setup tasks to use AutoWeb within Memex Patriarch.
Important: You can use either Memex Patriarch or Memex Analyst for choosing the Web sites you want AutoWeb to monitor. However, you cannot configure AutoWeb from both applications. The steps described in this section enable configuration from within Memex Patriarch. This will disable configuration from within Memex Analyst. You will still be able to view the configuration records in Memex Analyst, but you will only be able to add or edit configuration records from Memex Patriarch.
Installation tasks
Memex Intelligence Engine
For Memex Patriarch and AutoWeb to work together, MIE 6.0 must be installed on all the servers that will be used to host both Memex Patriarch and AutoWeb.
Notes:
• You do not need to place Memex Patriarch and AutoWeb on completely separate physical machines. A single MIE instance can host both the Memex Patriarch and AutoWeb databases.
• If your system uses multiple physical servers, all the physical machines must share the same secret file to allow for certificate authentication.
For details on how to set up the MIE on your servers, read the MIE 6.0 Installation Guide.
Memex Patriarch
The Memex Patriarch server-side components can be installed in two ways:
1. Using the Memex Series VI Server auto-installer
2. Using the Perl-based installer
The Perl-based installer provides a way to specify many of the configuration options during the installation process, whereas the auto-installer provides a quick way to install a pre-built installation.
This section relates to Memex server installations done using the Perl-based installer. This installer will also be used to install the two AutoWeb databases for Memex Patriarch.
For more information on the Perl-based installer see the Memex Series VI Server Installation Guide: Part II Patriarch Components.
AutoWeb databases for Memex Patriarch
AutoWeb contains two Memex database definitions that you can use to install AutoWeb databases for Memex Patriarch. These databases allow you to search and control AutoWeb from inside Memex Patriarch.
To enable these database definitions, copy the im13autoweb directory into the im-install directory (which was created when the Perl-based installer was used to install the Memex Patriarch server components). For example:
cp -R /opt/memex/autoweb/im13autoweb /opt/memex/im/im-2.0a-105-vanilla-interix/im-install
Important:
• If you deleted the im-install directory after installing the Memex Series VI Server, you will no longer have the Perl-based installer. You need this to proceed with this installation procedure. Contact Memex Customer Services and request a copy of the tar file containing the Perl-based installer for the Memex Patriarch server components.
• The installer for the Memex Patriarch server components must be run on the physical machine that hosts the Memex configuration server. If AutoWeb is installed on a machine that is not the configuration server, you must copy the AutoWeb database definitions to the configuration server, by transferring the im13autoweb directory across the network to the physical machine that is hosting the configuration server.
Before you can install the AutoWeb database definitions, you need the following information about your Memex Series VI Server setup:
• The host name and port number for the Memex Intelligence Engine that you will use to access the AutoWeb databases
• The prefix and name of the logical server that will host the AutoWeb databases
You will add this information to the installer's setup.xml file to specify where the AutoWeb databases will be created.
Editing the setup.xml file
When you have copied the im13autoweb directory to the im-install directory, you must modify the setup.xml file within the im-install/im13autoweb directory. This file contains the database definitions for the two new AutoWeb databases: WebConfig and WebArchive. It also defines a new logical server named AutoWeb (prefix AW).
If you want to create the AutoWeb databases on a remote server, you must edit the attributes for the host element, specifying the server where the new AutoWeb databases will be created. To do this, change the attributes to: host name="hostname" port="number".
Alternatively, to create the AutoWeb databases on the same physical machine as the Memex Patriarch configuration server, leave the host element as supplied.
Installing the AutoWeb databases
After editing the setup.xml file, you must run the Perl-based installer for Memex Patriarch to install the new AutoWeb databases and logical server.
To install the AutoWeb databases and server:
1. Change to the im-install directory on the configuration server. For example:
cd /opt/memex/im/im-2.0a-105-vanilla-interix/im-install
2. Run the following command:
perl install.pl -c <prefix> -i <im-dir> -m <mie-dir> -x <mie-config> -p <port> -f autoweb/im13autoweb
Where:
• <prefix> is the prefix of the logical server used as the configuration server (usually CS).
• <im-dir> is the directory where the Memex Patriarch server-side components are installed (usually /opt/memex/im).
• <mie-dir> is the directory where the MIE is installed (usually /opt/memex/mie).
• <mie-config> is the path to the MIE configuration file (usually /opt/memex/etc/memexsvr.xml).
• <port> is the TCP port on which the local MIE listens for connections.
For example:
perl install.pl -c CS -i /opt/memex/im -m /opt/memex/mie -x /opt/memex/etc/memexsvr.xml -p 9001 -f autoweb/im13autoweb
3. When the details of the installation are displayed, enter y to confirm that you want to continue with the installation.
4. Enter the user name and password of the Memex Patriarch superuser.
The script completes the installation.
Configuration tasks
To configure AutoWeb to work with Memex Patriarch, you must update AutoWeb to use the new entities that have been created.
Modifying the spider.cfg file
This task is mandatory if you want to use AutoWeb with Memex Patriarch.
The spider.cfg file is located in the autoweb directory. The file contains the setting imbase. You must edit this setting to point to the directory where Memex Patriarch is installed.
For example, if Memex Patriarch is installed in /opt/memex/im, you would change the spider.cfg file setting to:
imbase /opt/memex/im
Important: You must modify spider.cfg before you make any of the other changes described in this section. If you do not make this change, AutoWeb will not be able to detect that it is inserting data into a Memex Patriarch database, and the resulting records will be inaccessible from the client software.
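Before moving on, it can be useful to confirm that the imbase setting really points to an existing directory. The following sketch assumes that spider.cfg holds one setting per line, with the setting name followed by whitespace and the value, as in the example above; the function name cfg_value is hypothetical.

```shell
#!/bin/sh
# cfg_value FILE NAME
# Print the value of setting NAME from a file that uses the "name value"
# layout (for example "imbase /opt/memex/im"). Prints nothing if the
# setting is not present.
# Assumption: one setting per line, name first, value after whitespace.
cfg_value() {
    awk -v name="$2" '
        $1 == name {
            $1 = ""                 # drop the setting name
            sub(/^[ \t]+/, "")      # trim the separator that is left behind
            print
            exit
        }' "$1"
}

# Example:
#   imbase=$(cfg_value /opt/memex/autoweb/spider.cfg imbase)
#   [ -d "$imbase" ] || echo "imbase does not point to a directory" >&2
```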
Linking to the WebConfig database
The getsite.pl script, which is used to index Web pages automatically, is configured using records in a legacy-format database called config.db. You can add and edit records in this database using Memex Analyst. However, to add or edit configuration records from within Memex Patriarch you must use the new Memex Patriarch WebConfig database that was created when you installed the im13autoweb setup.
The config.db database is stored in the AutoWeb installation directory. For example, if AutoWeb has been installed in /opt/memex/autoweb, the path to this database is /opt/memex/autoweb/config.db.
Important: You can only configure AutoWeb from either Memex Patriarch or Memex Analyst. You cannot configure AutoWeb from both applications.
Creating the symbolic link
Creating a symbolic link from the config.db database to the new WebConfig database forces AutoWeb to use Memex Patriarch's WebConfig database.
To create the symbolic link:
1. As the Memex administrative user, move to the AutoWeb installation directory. For example:
cd /opt/memex/autoweb
2. Move the config.db database aside by entering the following command:
mv config.db config.db.old
3. Create a link to the WebConfig database by entering the following command:
ln -s <im-dir>/<prefix>/databases/WebConfig config.db
Where <im-dir> is the Memex Patriarch installation directory and <prefix> is the prefix of the logical server that hosts the AutoWeb databases.
For example:
ln -s /opt/memex/im/AW/databases/WebConfig config.db
AutoWeb will now use the WebConfig database rather than the config.db database.
Note: If you are upgrading your AutoWeb setup from a previous version, you must make sure that a uniq_id file is stored in the WebConfig database's directory. You can do this manually, or by adding a record to the database in Memex Patriarch.
For more information, consult the MIE Administrator's Guide.
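The steps above can be wrapped in a small function that refuses to run if it would overwrite an earlier backup. This is a hedged sketch rather than a supplied tool: the function name link_database is hypothetical, and the paths in the example are the ones used in this chapter.

```shell
#!/bin/sh
# link_database LEGACY TARGET
# Move an existing legacy database aside (as LEGACY.old) and create a
# symbolic link named LEGACY that points to TARGET.
# Stops with an error if TARGET is missing or LEGACY.old already exists,
# so that an earlier backup is never overwritten.
# (Hypothetical helper; the steps mirror the manual procedure above.)
link_database() {
    legacy=$1
    target=$2
    [ -e "$target" ] || { echo "missing target: $target" >&2; return 1; }
    [ ! -e "$legacy.old" ] || { echo "backup exists: $legacy.old" >&2; return 1; }
    if [ -e "$legacy" ] || [ -L "$legacy" ]; then
        mv "$legacy" "$legacy.old" || return 1
    fi
    ln -s "$target" "$legacy"
}

# Example, run as the Memex administrative user:
#   link_database /opt/memex/autoweb/config.db /opt/memex/im/AW/databases/WebConfig
```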
Reverting to the legacy database
If, at a later date, you decide that you would prefer to use Memex Analyst for configuring Web site monitoring, you can reverse the above process, deleting the symbolic link and renaming the config.db.old file as config.db. However, after doing this, the configuration database will be empty and you will be left with a WebConfig database in Memex Patriarch that is no longer connected to AutoWeb.
Linking to the WebArchive database
By default, AutoWeb uses a legacy-format database called webarchive for indexing pages. This database is located in the /opt/memex/autoweb/databases directory. To configure AutoWeb from Memex Patriarch you must use the Memex Patriarch WebArchive database that was created when you installed the im13autoweb setup.
To use the WebArchive database, you must create a symbolic link to force AutoWeb to use this database rather than the /opt/memex/autoweb/databases/webarchive database.
Creating the symbolic link
You create the symbolic link to the WebArchive database in the same way as you created the symbolic link to the WebConfig database.
To create the symbolic link:
1. Move to the databases subdirectory of the autoweb installation directory. For example:
cd /opt/memex/autoweb/databases
2. Create a link to the WebArchive database by entering the following command:
ln -s <im-dir>/<prefix>/databases/WebArchive <link-name>
Where <im-dir> is the Memex Patriarch installation directory, <prefix> is the prefix of the logical server that hosts the AutoWeb databases, and <link-name> is the name of the symbolic link.
For example:
ln -s /opt/memex/im/AW/databases/WebArchive webarchive
Notes:
• The AutoWeb toolbar will list the WebArchive database by the name of the symbolic link, usually webarchive.
• If you are upgrading your AutoWeb setup from a previous version, you must make sure that a uniq_id file is stored in the WebArchive database's directory. You can do this manually, or by adding a record to the database in Memex Patriarch.
For more information on the uniq_id file, see the Memex Intelligence Engine Administrator's Guide.
You will be able to use Memex Patriarch to view Web pages indexed from the AutoWeb toolbar by searching the WebArchive database on the AutoWeb logical server within Memex Patriarch.
Setting up picklists
This is an optional task.
In Memex Patriarch, the WebConfig entity contains a single picklist field, Database, which holds a list of all the databases in AutoWeb. This list is not automatically populated. You should update this list whenever you add a database to AutoWeb.
For information on modifying picklists in Memex Patriarch, refer to the Memex Patriarch Online Help.
Adding additional web archives
The instructions in this chapter describe how to create a single AutoWeb archive database that is accessible from Memex Patriarch. However, you can use multiple databases to store Web pages indexed by AutoWeb.
For each new database you want to use from AutoWeb, you must create a new logical server in Memex Patriarch, or use an existing logical server that does not contain an AutoWeb database.
Create the new database (and logical server, if required) by using the Perl-based installer for Memex Patriarch server components. This example shows how to create a new WebArchive database on a new logical server called AutoWeb2, with the server prefix ZW:
1. As the Memex administrative user, copy the supplied im13autoweb directory:
cd /opt/memex/autoweb
cp -R im13autoweb im13autoweb2
2. Edit the setup.xml file within the new im13autoweb2 directory, removing the two include statements and changing the name and prefix attributes for the server element, ensuring you use a prefix that is not already used by an existing logical server.
Note: For more information on server prefixes see the topic "Use the installer to add a logical server" in the Memex Patriarch online help.
3. Change to the im-install directory and run the installer with the im13autoweb2 setup. For example:
cd /opt/memex/im/im-2.0a-105-vanilla-interix/im-install
perl install.pl -c CS -i /opt/memex/im -m /opt/memex/mie -x /opt/memex/etc/memexsvr.xml -p 9001 -f /opt/memex/autoweb/im13autoweb2
4. Change to the databases subdirectory of the autoweb installation directory.
5. Create a symbolic link to the new WebArchive database by entering the following command:
ln -s /opt/memex/im/ZW/databases/WebArchive webarchive2
Note: Each archive database (or symbolic link) in the databases directory must have a unique name. For example: webarchive1, webarchive2, and so on.
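When several archives already exist, a short loop can pick the next unused name of the form shown in the note. This is an illustrative sketch only; the function name next_archive_name is hypothetical, and the numbering scheme is simply the webarchive1, webarchive2 pattern from the note.

```shell
#!/bin/sh
# next_archive_name DIR
# Print the first name of the form webarchiveN (N = 1, 2, 3, ...) that is
# not already used by a file, directory or symbolic link in DIR.
# The -L test also catches symbolic links whose target no longer exists.
# (Hypothetical helper; the naming pattern follows the note above.)
next_archive_name() {
    n=1
    while [ -e "$1/webarchive$n" ] || [ -L "$1/webarchive$n" ]; do
        n=$((n + 1))
    done
    printf 'webarchive%s\n' "$n"
}

# Example:
#   next_archive_name /opt/memex/autoweb/databases
```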
Chapter 4 Using AutoWeb
AutoWeb is a utility that lets you easily add the text of a Web page to a Memex database. In addition, when you extract text from a Web page, AutoWeb creates a mirror of the Web page on a local server. You can then use Memex Analyst to view the records created from the Web page text and to view the mirrored copy of the Web page.
Selecting a Memex database
To specify where the Web page text will be stored, choose a database from the Select Database drop-down list.
Specifying keywords
To associate keywords with an indexed Web page, type the keywords into the Enter Keywords text box.
Indexing Web page text
To extract specific text from a Web page, highlight the text and then click the Index Selected Text button.
When you click this button, AutoWeb also mirrors the entire Web page to the local server.
Indexing a Web page
To extract the text of an entire Web page, click the Index Page button.
When you click this button, AutoWeb also mirrors the entire Web page to the local server.
Viewing indexed pages
You can use Memex Patriarch or Memex Analyst to retrieve the indexed records.
The indexed record for each Web page contains:
• The URL of the original page
• The URL of the mirrored copy of the page
• The date and time that the page was indexed
• The text (or the selected text) from the page
• The keywords that are associated with the page
If Memex Analyst has been set up to use the forms distributed with the AutoWeb toolbar, the result form displays the mirrored copy of the page when you view one of the records. The screenshot below shows an example of this.
The following screenshot shows an example of creating a configuration record for the Memex Web site using Memex Patriarch.
Enter values for the Name, URL and Database fields to specify what you want to index and where you want to store the indexed Web page data.
Enter values for the other fields, as required. These fields are described in the table on page 35.
Click Append to save the new record.
Specifying sites - Memex Analyst
The following screenshot shows an example of creating an index record for the Memex Web site in Memex Analyst.
Enter values for the Keywords, Site To Index and Database fields to specify what you want to index and where you want to store the indexed Web page data.
Enter values for the other fields, as required. These fields are described in the table below.
Save the new record.
Fields on the configuration form
The following table explains the fields on the configuration forms used within Memex Patriarch and Memex Analyst.
Note: The default configuration forms have the heading Index Request. This is part of the form design and can be changed, if required. The labelling of fields on the forms can also be changed as part of the form design. The first two columns in the following table show the labels as they appear in the default forms supplied for Memex Patriarch (Field MP) and Memex Analyst (Field MA).
Field MP    Field MA           Details
URN                            This field is populated by Memex Patriarch when you save the record. The field is not included on the default form for Memex Analyst.
Name        Keywords           Enter the name of the Web site you want to index. The name should be relevant to the site you want to index, as text entered here can be used as keywords when searching for it later.
URL         Site To Index      Enter the full URL of the Web site you want to index. If you enter the URL of a Web site without specifying a particular Web page (for example, http://www.yourcompany.com), AutoWeb uses the home page of the site as the start page from which to index. You can index an area within a Web site by specifying a particular page on a site (for example, http://www.yourcompany.com/personnel/vacancies.html).
Indexed     Index              This field allows indexing to be temporarily turned off by setting the field value to NO. To resume indexing, set the value to YES. The default value is YES, so Web sites for records with no value in this field (such as records from upgraded versions of AutoWeb) are indexed.
Database    Database           This is the name of the database to which index records for the Web site are saved. The value is the name of the database as it appears on the file system, within the autoweb/databases directory. The webarchive database is the default database available for saving new index records.
Priority    Priority           The value in this field allows indexing to be performed at different frequencies. This is achieved by running the getsite.pl script against a subset of records, based on the value of this field (as shown in the crontab listing on page 33). The AutoWeb auto-installer creates three Priority options to choose from. Choose your priority depending on how often you want the site to be indexed and updated.
                               The frequency of updates is defined as follows:
                               • HIGH priority sites are indexed every hour
                               • MEDIUM priority sites are indexed every day
                               • LOW priority sites are indexed every week
                               Note: These frequencies are defined in the Memex administrator user's crontab. See Monitoring Web sites on page 33 for more details.
Options     Crawler Options    Use this field to pass specific options to the HTTrack Website Copier software. HTTrack is a third-party tool used by AutoWeb to copy Web pages. By specifying options you can override many aspects of AutoWeb's default behaviour.
                               For full details of the many options for HTTrack see the online User's Guide at:
                               http://www.httrack.com/html/fcguide.html
                               The option that you are most likely to want to specify is the link depth. AutoWeb's default link depth is 2. This means that you will index all the pages that are linked to from the specified start page (e.g. the home page of a Web site) plus all the pages that are linked to from those primary-link pages. On a large Web site, with pages that each contain many links, a link depth of 2 could result in hundreds of pages being indexed, and you may, therefore, want to reduce the link depth. On a small Web site, however, you might want to increase the link depth to 3 or 4.
                               The option for setting link depth is:
                               -%eN
                               Where N is an integer typically between 0 and 4.
                               Notes:
                               • You must be extremely careful when specifying options. If you enter invalid options, or the wrong option for the behaviour you intended, it can result in nothing being indexed, unexpected indexing results, or everything on the entire domain being indexed.
                               • If you do not set a value here, the link depth defaults to 2.
                               • Setting a high link depth value for a large Web site can quickly result in you using up a great deal of available disk space.
                               • By default, AutoWeb does not index pages that are located outside the domain on which the start page is located. This helps to restrict indexing to a single Web site. You can bypass this restriction by using the -e option. However, you should use this option with extreme caution as it can easily result in you indexing a vast number of pages from the internet at large.
                               • Link depth, by default, only extends to pages on or below the current directory level. For example, if you index http://www.memex.co.uk/AboutMemex/index.php with a link depth of 2, AutoWeb will index pages such as http://www.memex.co.uk/AboutMemex/Awards/index.php, as this page is located in a directory below the start page, but it will not index http://www.memex.co.uk/index.php, which is in a directory above the start page. You can use the -B option to allow AutoWeb to index up the directory structure as well as down it.
                               • HTTrack Website Copier is open source, third-party software. Memex is not responsible for any of the content on the www.httrack.com Web site.
Notes       Notes              You can enter any text about the site or this particular record here for your own reference.
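The relationship between the Priority field and the update frequency can be expressed as a cron schedule for each value. The sketch below is illustrative only: the function name priority_schedule is hypothetical, and the minute, hour and weekday chosen here are assumptions, so the entries that the auto-installer places in the Memex administrator user's crontab may use other times.

```shell
#!/bin/sh
# priority_schedule PRIORITY
# Print a five-field cron schedule (minute hour day-of-month month weekday)
# that matches the frequencies in the table:
#   HIGH = every hour, MEDIUM = every day, LOW = every week.
# Assumption: the exact times are illustrative, not taken from AutoWeb.
priority_schedule() {
    case $1 in
        HIGH)   echo '0 * * * *' ;;   # on the hour, every hour
        MEDIUM) echo '0 2 * * *' ;;   # 02:00 every day
        LOW)    echo '0 3 * * 0' ;;   # 03:00 every Sunday
        *)      echo "unknown priority: $1" >&2; return 1 ;;
    esac
}

# Example:
#   priority_schedule MEDIUM
```

A site whose content changes several times a day is a candidate for HIGH, while a largely static reference site can usually be left at LOW to reduce load and disk use.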
How Web site monitoring works
Web site monitoring is accomplished by running a Perl script called getsite.pl at regular intervals. This script performs the following actions:
1. Decodes the configuration database.
2. Parses the output to determine which Web sites to mirror and index.
3. Calls HTTrack for each site that should be indexed.
Note: If getsite.pl was run with a specific priority setting (e.g. HIGH), only a subset of the configuration records may produce calls to HTTrack.
The HTTrack Website Copier program then creates a mirror of the site in the mirror directory of the server installation.
Stopping getsite.pl
If you have started getsite.pl and want to stop it, you must do so manually by killing its process and any httrack processes.
To kill any getsite.pl and httrack processes:
1. As root or the Memex administrator user, open a shell console.
2. Type the following command:
ps -eo pid,args | grep autoweb
This lists the currently running processes whose details mention autoweb.
For example:
1545 grep autoweb
3197 /opt/memex/autoweb/bin/httrack -V /opt/memex/autoweb/bin/addtomemex
5371 /usr/contrib/perl -I/opt/memex/autoweb/perlmodules /opt/memex/autow
5513 sh -c /opt/memex/autoweb/bin/httrack -V '/opt/memex/autoweb/bin/add
3. Use the kill command with the relevant process ID number to stop each of the listed processes, apart from the one mentioning grep, which simply reports the search you ran.
For example:
kill 3197
kill 5371
kill 5513
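Copying process IDs by hand is error-prone, so the filtering can be left to a small function. This is a sketch under the assumption that the listing has the pid,args layout shown above; the function name autoweb_pids is hypothetical.

```shell
#!/bin/sh
# autoweb_pids
# Read "ps -eo pid,args" output on standard input and print the process ID
# of every line that mentions autoweb. Lines for the grep and awk commands
# themselves are skipped, because they only match on their own arguments.
# (Hypothetical helper; not part of the AutoWeb distribution.)
autoweb_pids() {
    awk '/autoweb/ && $2 != "grep" && $2 != "awk" { print $1 }'
}

# Example: review the list first, then pass each ID to kill.
#   ps -eo pid,args | autoweb_pids
```

Checking the printed IDs against the full ps listing before running kill avoids stopping an unrelated process whose command line happens to mention autoweb.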
Extracting the Web page text
For each page that HTTrack downloads, it calls the addtomemex.pl script.
addtomemex.pl checks what type of file has been downloaded and whether the text can be extracted from the file. It then uses the Lynx text-based Web page browser to output a text-only version of the page, from which it extracts the text.
When the text has been extracted successfully, addtomemex.pl writes a record to the specified Web archive database containing the following information:
• The keywords from the config record
• The original URL of the file
• The mirrored URL of the file
• The text from the page
• The date and time the page was mirrored
Appendix A Known limitations
AutoWeb has the following limitations:
• If a page contains any cross-domain frames, the index selection and index page buttons will not work. For more information, see the Microsoft web site: http://msdn.microsoft.com/library/default.asp?url=/workshop/author/om/xframe_scripting_security.asp
• AutoWeb will not index URLs that are redirected. For example, if you are in the UK and you browse to www.memex.com you are redirected to www.memex.co.uk. As a result you cannot use AutoWeb to index http://www.memex.com. The workaround is to index a specific page below the redirected domain, for example, http://www.memex.com/AboutMemex/
• If a user attempts to index a page that has cross-domain frames, the following error message is displayed: Browser security restrictions prevent you from indexing this page
• When AutoWeb mirrors a Web page it does not automatically mirror documents linked to from that page. The depth of mirroring depends on the options specified in the configuration record. As a consequence, style sheets used by the page, or images that appear on the page, may not be mirrored.
• Memex strongly advises that you change the Internet security zone of the mirror to disable scripting. As files copied to the local mirror are on your local Intranet, they may have more security rights than is desirable. Select Tools > Internet Options > Security > Restricted Sites, click Sites, and add your mirror domain to the list.
• When indexing Web sites using getsite.pl, images are not timestamped. This means that if a Web page contains an image that changes (but keeps the same name), the old copy of the image will be overwritten. As a result, the earlier version of the page will reference the newer version of the image.
Appendix B Troubleshooting
If a Web site is not indexed or is not indexed in the way you expected:
• Check the known limitations listed in Appendix A.
• Make sure you are aware of the default indexing behaviour of AutoWeb and the various HTTrack options. See page 41 for a list of the default options and the online User Guide for HTTrack Website Copier at http://www.httrack.com/html/fcguide.html for a complete list of available options.
• Check the messages in the log file. The path and name of this file are given as the value of the logfile parameter in the spider.cfg configuration file (/opt/memex/autoweb/spider.cfg). For example: /opt/memex/logs/crawlerlog.txt
• If you get the message "Already Running Resource temporarily unavailable" when you run the getsite.pl script, it indicates that a previous run of the script has not finished indexing pages. This may be because the configuration records are causing it to index more pages than you had expected, or require. If this happens you should either wait for the script to complete, or kill the process (as described on page 37), and then check the configuration records before running getsite.pl again.
• If the getsite.pl script runs more frequently than expected, check the entries in the crontab for the Memex administrator. The auto-installer for AutoWeb adds cron jobs for getsite.pl to the crontab of the Memex administrator user. If the auto-installer was run more than once, the crontab will contain duplicate cron jobs, which must be removed by editing the crontab.
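Duplicate cron jobs of the kind described in the last item can be spotted without reading the whole crontab. The following is a minimal sketch; the function name duplicate_cron_lines is hypothetical, and it only reports lines that are identical character for character.

```shell
#!/bin/sh
# duplicate_cron_lines
# Read a crontab listing on standard input and print each line that occurs
# more than once. Comment lines and blank lines are ignored.
# (Hypothetical helper; not part of the AutoWeb distribution.)
duplicate_cron_lines() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' | sort | uniq -d
}

# Example, run as the Memex administrator user:
#   crontab -l | duplicate_cron_lines
```

If the function prints nothing, the crontab has no exact duplicates; entries that differ only in spacing still need to be checked by eye.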
Appendix C HTTrack options
HTTrack Website Copier is open source software that is used to mirror Web pages. Memex has altered the software slightly for use with AutoWeb.
Note: For more information about HTTrack, visit: http://www.httrack.com/ and http://www.httrack.com/html/fcguide.html.
This table lists the options that AutoWeb uses by default.
Option                                    Description
-n                                        Get non-HTML files "near" an HTML file
-%e2                                      Sets the external link depth to 2
-A32000                                   Sets the maximum transfer rate in bytes/second
-I0                                       Don't make an index page
-Qq                                       No log and no questions
--assume cfm=text/html,php=text/html      Assume that a type (cfm, php) is always linked with a mime type
-X0                                       Do not purge old files after update
-%F ""                                    Do not put a footer into the HTML pages
-%P0                                      Do not do extended parsing
-C0                                       Do not use a cache
-%Q                                       Do not follow any hyperlinks from the page. This option has been added to HTTrack by Memex
-d                                        Stay on the same principal domain
This table lists other options that you can use, if necessary. To use either option, add it to the opts parameter in the spider.cfg file. If no option is set, the default behaviour is to follow the rules in robots.txt. See Setting up the AutoWeb configuration file on page 15.
Option    Description
-s0       When retrieving Web pages, do not follow the rules specified in robots.txt on the remote web server.
-s2       Follow all of the robots.txt rules with the exception of "Disallow: /", as this will prevent the software from retrieving any pages from a Web site.
Appendix D Upgrading to AutoWeb 1.3
If you are currently using AutoWeb 1.0 or 1.1 you must upgrade to version 1.3 before you can upgrade to version 2.0. Once you have a 1.3 system you can upgrade to 2.0 by following the instructions on page 18.
Upgrading from AutoWeb 1.0 or 1.1 to AutoWeb 1.3 is a two-stage process. First, you must back up your previous AutoWeb setup; then you need to install AutoWeb 1.3.
Important: You will need the installation package for version 1.3 of AutoWeb to complete this procedure.
Backing up your previous AutoWeb setup
Before beginning the upgrade, you should back up your existing AutoWeb configuration and databases. If the upgrade process encounters any problems, you can then revert to your known, valid setup.
After the backup is complete, shut down the existing MIE and move the AutoWeb directories aside. For example, if you installed your previous version of AutoWeb in /opt/memex/autoweb you should move this whole directory to /opt/memex/autowebold.
Installing AutoWeb 1.3
After backing up your existing AutoWeb setup, you must perform a new, clean installation of AutoWeb 1.3.
Note: You must install AutoWeb into the same directory as your previous version. For example: /opt/memex/autoweb.
If you are using this product with Memex Patriarch, it is essential that you read Chapter 3 Using AutoWeb with Memex Patriarch on page 24. You must perform all the steps detailed there before you proceed with the conversion.
Converting your AutoWeb data
After installing AutoWeb 1.3, you must run a conversion script to convert the data from your previous setup and create any new databases that may be required.
Setting up the conversion script
The conversion script reads a configuration file, convert.conf, which is stored in the bin directory of the new AutoWeb installation. This file specifies the details of the AutoWeb databases that will be converted.
Before running the conversion script, you must set the following options to reflect your AutoWeb setup:
Option          Details
MIEDecodeDir    The path to the MIE installation used by the previous version of AutoWeb
MIEDir          The path to the new MIE installation
MIEPort         The network port that the new MIE is listening on
OldAutoWeb      The path to the previous AutoWeb setup
NewAutoWeb      The path to the new AutoWeb installation
IMBase          The installation directory for Memex Patriarch (if installed)
TempDir         A directory to use for storing temporary files
Verbosity       How detailed the AutoWeb output will be:
                0 = basic output
                1 = tracks processed databases
                2 = detailed output
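A quick check that every option from the table has an entry can catch omissions before the conversion starts. The sketch below makes one assumption that this guide does not confirm: that each option appears in convert.conf with its name at the start of a line. The function name missing_convert_options is hypothetical.

```shell
#!/bin/sh
# missing_convert_options FILE
# Print the name of every option from the table above that does not appear
# at the start of a line in FILE. Prints nothing if all are present.
# Assumption: each option is written with its name at the start of a line;
# the exact syntax of convert.conf is not described in this guide.
missing_convert_options() {
    for name in MIEDecodeDir MIEDir MIEPort OldAutoWeb NewAutoWeb \
                IMBase TempDir Verbosity; do
        # Require a non-word character or end of line after the name, so
        # that MIEDir does not match a MIEDecodeDir line.
        grep -Eq "^[[:space:]]*$name([^[:alnum:]_]|\$)" "$1" || echo "$name"
    done
}

# Example, run from the bin directory of the new installation:
#   missing_convert_options convert.conf
```

IMBase only applies when Memex Patriarch is installed, so a report that lists only IMBase may be acceptable on systems without it.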
Running the conversion script
After specifying the conversion options, you can run the conversion script.
To run the conversion script:
1. Move to the bin directory of your new AutoWeb installation.
2. Run the following command:
perl aw-convert.pl
The script converts all the data from your previous AutoWeb setup and creates any new databases that are needed to match your previous setup.
Note: After running the conversion script you still need to open the spider.cfg file in the new AutoWeb installation directory and make sure the options are configured correctly.