FAST ESP Guide

FAST Enterprise Search Platform version: 5.3.SP1

BrowserEngine

Document Number: ESP1046, Document Revision: A, February 04, 2009

Copyright

Copyright © 1997-2009 by Fast Search & Transfer ASA (“FAST”). Some portions may be copyrighted by FAST’s licensors. All rights reserved. The documentation is protected by the copyright laws of Norway, the United States, and other countries and international treaties. No copyright notices may be removed from the documentation. No part of this document may be reproduced, modified, copied, stored in a retrieval system, or transmitted in any form or any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser’s use, without the written permission of FAST. Information in this documentation is subject to change without notice. The software described in this document is furnished under a license agreement and may be used only in accordance with the terms of the agreement.

Trademarks

FAST ESP, the FAST logos, FAST Personal Search, FAST mSearch, FAST InStream, FAST AdVisor, FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective, GetSmart, NXT, LivePublish, Folio, FAST Unity, FAST Radar, RetrievalWare, AdMomentum, and all other FAST product names contained herein are either registered trademarks or trademarks of Fast Search & Transfer ASA in Norway, the United States and/or other countries. All rights reserved. This documentation is published in the United States and/or other countries.

Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

Netscape is a registered trademark of Netscape Communications Corporation in the United States and other countries.

Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

Red Hat is a registered trademark of Red Hat, Inc.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.

AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business Machines Corporation in the United States, other countries, or both.

HP and the names of HP products referenced herein are either registered trademarks or service marks, or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries.

Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States and/or other countries.

XML Parser is a trademark of The Apache Software Foundation.

All other company, product, and service names are the property of their respective holders and may be registered trademarks or trademarks in the United States and/or other countries.

Restricted Rights Legend

The documentation and accompanying software are provided to the U.S. government in a transaction subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of the documentation and software by the government is subject to restrictions as set forth in FAR 52.227-19 Commercial Computer Software-Restricted Rights (June 1987).

Contact Us

Web Site

Please visit us at: http://www.fastsearch.com/

Contacting FAST

FAST
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA 02492 USA
Tel: +1 (781) 304-2400 (8:30am - 5:30pm EST)
Fax: +1 (781) 304-2410

Technical Support and Licensing Procedures

Technical support for customers with active FAST Maintenance and Support agreements, e-mail: [email protected]

For obtaining FAST licenses or software, contact your FAST Account Manager or e-mail: [email protected]

For evaluations, contact your FAST Sales Representative or FAST Sales Engineer.

Product Training

E-mail: [email protected]

To access the FAST University Learning Portal, go to: http://www.fastuniversity.com/

Sales

E-mail: [email protected]



Contents

Preface
    Copyright
    Contact Us

Chapter 1: About BrowserEngine
    About the BrowserEngine
    Architecture

Chapter 2: Configuring the BrowserEngine
    Enterprise Crawler considerations
    Configuration via XML File
        Modifying BrowserEngine server settings
        Setting browser attributes
        Configuring the extractor pipeline
        Flash settings
        Example

Chapter 3: Operating the BrowserEngine
    Starting and Stopping
        Starting from the administrator interface
        Stopping from the administrator interface
        Starting from the command line
        Stopping from the command line
    Logging
        Change the BrowserEngine logging
    Monitoring
    Tuning
    Restrictions

Chapter 4: BrowserEngine reference information
    BrowserEngine binary
    XML-RPC Browser Interface
    XML-RPC Status Interface
    Extractor processing examples


Chapter 1: About BrowserEngine

The BrowserEngine is a highly scalable and configurable component that extracts links and text from JavaScript and Adobe Flash files. The BrowserEngine is used by the FAST Enterprise Crawler and may be called from the Document Processing pipeline.

Topics:

• About the BrowserEngine
• Architecture

About the BrowserEngine

The BrowserEngine is a highly scalable and configurable component that extracts links and text from JavaScript and Adobe Flash files. The BrowserEngine is used by the FAST Enterprise Crawler (EC) and can also be used from the Document Processing pipeline.

The BrowserEngine is a new component that replaces functionality previously available only to the Enterprise Crawler. It is intended to provide superior web page content, through the following new features:

• Improved Document Object Model (DOM) coverage
• Cookie extraction
• Frame support
• Extensibility and customization
• Scalable architecture
• Link and metadata extraction from Flash

The new BrowserEngine will enable more links to be extracted, improving the scope of a crawl, as well as improved document content, enhancing the index quality. In addition, customers and partners can modify the behavior of the engine according to individual needs. Because more thorough emulation of a browser environment requires additional system resources, the design allows the crawler to take advantage of multiple BrowserEngines (on one or more hosts) in order to distribute the load and scale the number of pages processed.

Architecture

The BrowserEngine is a stand-alone ESP component, capable of processing HTML documents containing JavaScript and Flash files. It accomplishes this by emulating a browser's internal environment, without the need for a display.

The BrowserEngine is implemented in Java and runs as a separate process. This provides isolation from other components (in particular, from the Enterprise Crawler), in the case of a fatal error. This design also allows a component to use multiple BrowserEngines, or multiple components can use the same BrowserEngine.

The following diagram illustrates the major functional modules within the BrowserEngine, and shows the datapaths that will be referenced in the following discussion.

Figure 1: BrowserEngine Architecture

To give an overview of how the BrowserEngine works, consider the flow of an HTML page through the internal processing. When the BrowserEngine receives a processing request, it assigns the task to a thread from its pool of idle threads. If the file is a Flash binary content file, it is simply parsed for text and links and the result returned. Otherwise, it is delivered to the JavaScript Handler. The first step is to run a user-definable page preprocessor to initialize the DOM tree, before any processing of the page contents takes place. This allows the BrowserEngine to simulate support for browser plug-ins such as Adobe Reader, Apple QuickTime or Windows Media Player, and also permits initialization of settings such as User-Agent, or the screen size. The page preprocessor is written in JavaScript, in order to provide quick and easy customization.

After the page preprocessor has initialized the DOM tree, the BrowserEngine parses the HTML document, fetches external dependencies and populates the DOM tree with HTML elements. External dependencies, such as scripts and frames, will be looked up in a local dependency cache, or fetched indirectly via the Enterprise Crawler, which acts as a caching proxy. The BrowserEngine is also capable of fetching resources directly from the network, if used by components other than the crawler. The document is loaded just as a real browser would, by executing scripts and onLoad handlers.

In addition to the page preprocessor, there is an optional script preprocessor that can modify the source code of every snippet of JavaScript code before it is executed.

After the document is loaded, the constructed DOM tree is passed to a configurable pipeline of extractors. The pipeline stages create a text representation of the HTML document, extract cookies, generate a document checksum, simulate user interactions and extract links. This data and metadata is returned to the calling component.


Chapter 2: Configuring the BrowserEngine

The BrowserEngine can run out of the box with FAST ESP. However, you may want to change the preprocessors and/or the pipeline to fit your needs.

Topics:

• Enterprise Crawler considerations
• Configuration via XML File

Enterprise Crawler considerations

The BrowserEngine does work on behalf of, and in conjunction with, the Enterprise Crawler, and that component must be configured properly to make use of the BrowserEngine.

There are two requirements in configuring the Enterprise Crawler to make use of the BrowserEngine. The first is that one of the following attributes must be enabled, by setting it to the value Yes:

• JavaScript support
• Macromedia Flash support

The Enterprise Crawler also needs to be configured with the location of all available BrowserEngines in the FAST ESP installation. Normally this setup is done by the FAST ESP installation itself, as each BrowserEngine is enabled. For information about the details, please see the section CrawlerGlobalDefaults.xml options in the FAST Enterprise Crawler Guide.

Configuration via XML File

The BrowserEngine is configured with default settings that are appropriate for most FAST ESP installations. You can change the configuration, including the preprocessors and the pipeline, to fit the needs of your installation.

The BrowserEngine is configured through an XML file, located on the Config Server node at:

$FASTSEARCH/etc/config_data/BrowserEngine/BrowserConfig.xml

Changes made to this file, or any other files used by the BrowserEngine configuration, will not take effect until the BrowserEngine is restarted.

Modifying BrowserEngine server settings

The BrowserEngine XML file includes a server tag that defines the port number range, and other attributes used to tune the performance.

Parameter: Description

port: Base port number, which is used to listen for requests from the Enterprise Crawler. Note: The BrowserEngine also uses port number "port+1". Both ports must be free.

maxThreads: The number of BrowserEngine threads created to process documents. This attribute limits the number of documents which can be processed concurrently. Note that setting this value too high can result in wasted CPU utilization due to scheduling, resulting in lower document throughput. It can also cause the BrowserEngine to run out of Java heap space. Thus, a better solution is to start multiple instances of the BrowserEngine.

maxQueueSize: The limit on requests that may be accepted and queued, waiting for an available processing thread. If the queue becomes full, the BrowserEngine will deny further requests from the Enterprise Crawler until processing threads become available.

Example:

<server maxThreads="100" maxQueueSize="100" port="50000"/>
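Because every instance listens on both its base port and "port+1", two instances on the same host need base ports that are at least two apart. The following is a minimal sketch for checking a planned set of base ports; the helper names are hypothetical and not part of FAST ESP.

```javascript
// Hypothetical helper, not part of FAST ESP: each BrowserEngine instance
// occupies its base port and the next one (port and port+1).
function portsUsedBy(basePort) {
  return [basePort, basePort + 1];
}

// Returns the port numbers that are claimed by more than one instance.
function findPortConflicts(basePorts) {
  var seen = {};
  var conflicts = [];
  basePorts.forEach(function (base) {
    portsUsedBy(base).forEach(function (port) {
      if (seen[port] && conflicts.indexOf(port) === -1) {
        conflicts.push(port);
      }
      seen[port] = true;
    });
  });
  return conflicts;
}

console.log(findPortConflicts([50000, 50001])); // overlapping plan
console.log(findPortConflicts([50000, 50002])); // non-overlapping plan
```

An empty result means that no two instances in the plan compete for the same port; it does not check whether other processes already use those ports.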



Setting browser attributes

The browser tag in the BrowserEngine XML file includes general browser attributes, and cache, blacklist, flash, and javascript sub-tags with corresponding attributes.

Browser Tag

type: Specifies the browser type to emulate. Legal values are:

• Mozilla
• InternetExplorer

allowPopups: Specifies whether pop-ups are allowed in the BrowserEngine.

useSSL: Specifies if the BrowserEngine should use SSL when requesting external dependencies from the Enterprise Crawler. The attribute should be set to false when used in a FAST ESP installation with the crawler. Note: This setting only affects the BrowserEngine's interactions with the Enterprise Crawler, which may still use SSL to retrieve the dependency.

evaluationTimeout: The total maximum time (in seconds) that processing of a document may take. This includes time spent waiting for external dependencies. Documents that take longer than the specified value are aborted by the BrowserEngine. In this case the Enterprise Crawler will store the original document and follow the links it finds.

terminateTimeout: The maximum time (in seconds) a thread can run before the BrowserEngine is shut down. This prevents endlessly spinning threads, not properly timed out by the evaluationTimeout mechanism, from hogging all system resources.

Example:

<browser type="Mozilla" allowPopups="false" useSSL="true" evaluationTimeout="3600">

Browser sub-tags

Within the browser tag, there are four configurable tags:

• cache
• blacklist
• flash
• javascript

Cache

size: Specifies the cache size in megabytes (MB). The cache improves the performance by reducing the traffic between the BrowserEngine and the Enterprise Crawler whenever there are external dependencies.

ttl: The maximum time (in milliseconds) that a cache entry may exist in the cache. If the cache becomes full, cache entries are removed in a Least Recently Used order.

Example:

<cache size="25" ttl="3600000"/>
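To make the interplay of the two cache settings concrete, here is a small conceptual model of a cache with a time-to-live and Least Recently Used eviction. It is only a sketch and not the BrowserEngine implementation: the real cache is sized in megabytes, whereas this model counts entries, and the name DependencyCache is made up for the illustration.

```javascript
// Illustrative model only: the real cache is sized in megabytes, while this
// sketch counts entries to stay short.
function DependencyCache(maxEntries, ttlMillis) {
  this.maxEntries = maxEntries;
  this.ttlMillis = ttlMillis;
  this.entries = new Map(); // a Map keeps insertion order: oldest first
}

DependencyCache.prototype.get = function (uri, now) {
  var entry = this.entries.get(uri);
  if (!entry) return undefined;
  if (now - entry.storedAt > this.ttlMillis) {
    this.entries.delete(uri); // expired: older than the ttl
    return undefined;
  }
  // Re-insert to mark the entry as most recently used.
  this.entries.delete(uri);
  this.entries.set(uri, entry);
  return entry.body;
};

DependencyCache.prototype.put = function (uri, body, now) {
  this.entries.delete(uri);
  if (this.entries.size >= this.maxEntries) {
    // Evict the least recently used entry (the first key in the Map).
    var oldest = this.entries.keys().next().value;
    this.entries.delete(oldest);
  }
  this.entries.set(uri, { body: body, storedAt: now });
};
```

In this model a larger size keeps more dependencies available, and a longer ttl keeps them valid for longer; both reduce the number of requests that have to go back to the crawler.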


Blacklist

regexp value: The blacklist tag contains a list of regular expressions used to exclude requests for external dependencies. Before the BrowserEngine requests an external dependency, it checks if the URI matches a regular expression. If there is a match, the request is not submitted, and the BrowserEngine will continue to process the document without downloading the dependency. A common usage is to block advertisements.

Example:

<blacklist>
  <regexp value="as-us\.falkag\.net"/>
  <regexp value="doubleclick\.net"/>
</blacklist>
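The blacklist check can be pictured with the sketch below, which tests a URI against the two patterns from the example. It assumes that a pattern matches when it is found anywhere in the URI; this guide does not state the matching mode, so verify the behavior before relying on it. The function name isBlacklisted is hypothetical.

```javascript
// Assumption: a pattern matches if it is found anywhere in the URI
// (search semantics); the guide does not state the matching mode.
var blacklist = ["as-us\\.falkag\\.net", "doubleclick\\.net"].map(function (p) {
  return new RegExp(p);
});

// Returns true if any blacklist pattern matches the URI.
function isBlacklisted(uri) {
  return blacklist.some(function (re) {
    return re.test(uri);
  });
}

console.log(isBlacklisted("http://ad.doubleclick.net/banner.js"));
console.log(isBlacklisted("http://www.example.com/menu.js"));
```

Note the escaped dots: an unescaped "." matches any character, so a pattern such as "doubleclick.net" would also match unrelated host names.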

JavaScript

timeout: Specifies the maximum time (in milliseconds) that the JavaScript engine is allowed to execute a snippet of JavaScript code. If the timeout limit is reached, the execution of the JavaScript code will be aborted. This prevents the BrowserEngine from becoming stuck in endless loops.

scriptPreProcessor: Specifies the URL or Java resource path to the script preprocessor JavaScript code.

pagePreProcessor: Specifies the URL or Java resource path to the page preprocessor JavaScript code.

Example:

<javascript timeout="5000">
  <pagePreProcessor src="/pagePreProcessor.js"/>
  <scriptPreProcessor src="/scriptPreProcessor.js"/>
</javascript>

Specifying a customized page preprocessor

The page preprocessor is a regular text file containing JavaScript code. The purpose of the page preprocessor is to initialize the DOM tree before document processing begins. This allows the BrowserEngine to simulate support for browser plug-ins, such as Adobe Reader, and allows browser settings such as screen size to be set.

1. Create or modify the page preprocessor file according to your needs, and save it to the directory containing the BrowserEngine configuration file.

2. Edit the BrowserEngine configuration file to specify this page preprocessor.

3. Restart the BrowserEngine.

Example: A page preprocessor which emulates support for the Adobe Reader.

navigator.plugins = new Array();
navigator.plugins[0] = new Object();
navigator.plugins[0].name = "Adobe Reader 7.0";
navigator.plugins[0].description = "The Adobe Reader plug-in is used to enable viewing of PDF and FDF files from within the browser.";



Specifying a customized script preprocessor

The purpose of the script preprocessor is to modify JavaScript code before processing begins.

The script preprocessor is a text file containing JavaScript code to be executed before the BrowserEngine executes the current document's JavaScript code. This allows the BrowserEngine to modify the source code before it is executed. A script preprocessor file must define a function that accepts four parameters:

• page
• sourceCode
• sourceName
• htmlElement

The last line of the script must return the output of that function. See the example below.

1. Create or modify the script preprocessor file according to your needs and save it to the directory containing the BrowserEngine configuration file.

2. Edit the BrowserEngine configuration file to specify this script preprocessor.

3. Restart the BrowserEngine.

Script PreProcessor example: Returns the source code unmodified.

function scriptPreProcessor(page, sourceCode, sourceName, htmlElement) {
  return sourceCode;
}
scriptPreProcessor;

The four parameters to the script preprocessor are:

page: The HTML source page

sourceCode: The snippet of JavaScript code to be executed

sourceName: The script name

htmlElement: The "this" object in a JavaScript context
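As an illustration of a preprocessor that actually changes the code, the sketch below replaces calls to window.print() with a harmless expression. The rewrite rule is purely hypothetical; only the four-parameter signature and the trailing function reference are taken from the example above. The String(...) call is a precaution in case the engine passes the source as a host string object rather than a JavaScript string, which is an assumption.

```javascript
// Hypothetical script preprocessor: neutralizes window.print() calls so that
// no script tries to open a print dialog during processing. The rewrite rule
// is an illustration; only the four-parameter signature comes from this guide.
function scriptPreProcessor(page, sourceCode, sourceName, htmlElement) {
  // String(...) coerces the input to a JavaScript string (precaution).
  return String(sourceCode).replace(/\bwindow\.print\s*\(\s*\)/g, "void 0");
}
scriptPreProcessor;
```

Because the function is applied to every snippet of JavaScript, keep the rewrite cheap and make sure it returns code that is still syntactically valid.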

Configuring the extractor pipeline

After document processing is completed the page is sent through the extractor pipeline, which can be customized to fit specific needs.

Pipeline tag

The extractor pipeline has four primary responsibilities:

• create the processed HTML document
• retrieve cookies
• create a checksum
• extract links

Additional functionality can also be included in the pipeline.

The configuration of the pipeline consists of parameters to control overall processing, and the list of extractors to be run for each page.



Attribute: Description

maxIterations: Sets a limit on the number of times the pipeline

obeyNoIndex: Specifies whether the extractors should obey the HTML noindex meta tag or not (boolean).

abortOnFailure: Specifies if the pipeline should abort if an extractor in the pipeline fails, or if the BrowserEngine should return the partially processed document. If set to "true" and a document fails, the document will not be stored by the Enterprise Crawler and none of the links will be followed. If set to "false" the document will be stored, and the extracted links may be followed (depending on the crawl collection configuration).

Example:

<pipeline maxIterations="1" obeyNoIndex="false" abortOnFailure="false">
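The effect of abortOnFailure can be modeled as follows. This is a conceptual sketch, not the BrowserEngine implementation: the extractor objects and the result layout are invented for the illustration, and it assumes that the remaining extractors still run after a failure when abortOnFailure is "false", which this guide does not spell out.

```javascript
// Conceptual model, not the BrowserEngine implementation. Each extractor is
// an object with a name and a run(dom, result) function.
function runPipeline(extractors, dom, abortOnFailure) {
  var result = { failed: [], aborted: false };
  for (var i = 0; i < extractors.length; i++) {
    var extractor = extractors[i];
    try {
      extractor.run(dom, result);
    } catch (e) {
      result.failed.push(extractor.name);
      if (abortOnFailure) {
        result.aborted = true; // the caller discards the document
        break;
      }
      // Assumption: otherwise continue with the next extractor and
      // return the partially processed document.
    }
  }
  return result;
}
```

With abortOnFailure set to "false", a single misbehaving extractor costs only its own output; with "true", it costs the whole document and all of its links.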

Pipeline sub-tags

Within the pipeline tag, many extractors may be defined. The BrowserEngine will execute the extractors in the specified order. Each extractor tag has two attributes: name and class. In addition, there may be multiple param tags.

Attribute: Description

name: The extractor identification

class: The extractor class path

param: An optional list of parameters. A param tag has three attributes: name, value and type (the data type).

HTMLOutput

The extractor generates an HTML document from the DOM tree.

Note: This extractor must always be first in the pipeline!

Example:

<extractor name="HtmlOutput" class="com.fastsearch.jscriptserver.extractors.HtmlOutput"></extractor>

Cookies

The extractor extracts any cookies which have been created or modified by the executed JavaScript code.

Example:

<extractor name="Cookies" class="com.fastsearch.jscriptserver.extractors.Cookies"></extractor>

Checksum

This extractor generates an MD5 checksum of the document. The checksum is based on the result of HTMLOutput, with the HTML tags removed. This is the same algorithm used by default in the Enterprise Crawler.



Example:

<extractor name="Checksum" class="com.fastsearch.jscriptserver.extractors.Checksum"></extractor>
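The idea of a checksum over the tag-free content can be sketched in a few lines. This is an approximation for illustration only: it strips tags with a simple regular expression and leaves whitespace untouched, whereas the exact normalization performed by the Checksum extractor is not described in this guide. The sketch uses the Node.js crypto module; contentChecksum is a hypothetical name.

```javascript
var crypto = require("crypto");

// Assumption: tags are stripped with a simple regular expression and
// whitespace is left untouched; the actual normalization rules of the
// Checksum extractor are not documented here.
function contentChecksum(html) {
  var text = html.replace(/<[^>]*>/g, "");
  return crypto.createHash("md5").update(text, "utf8").digest("hex");
}

// Markup-only changes leave the checksum unchanged.
console.log(contentChecksum("<p>hello world</p>") ===
            contentChecksum("<div><b>hello world</b></div>"));
```

The practical consequence is that a page whose markup changes but whose text does not is treated as unmodified.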

AttributeValueExtractor

This extractor retrieves links from HTML attributes. The AttributeValueExtractor takes a series of string parameters. The "name" parameter is the name of the HTML tag, and "value" is the attribute within this HTML tag to extract links from.

Example:

<extractor name="AttributeValueExtractor" class="com.fastsearch.jscriptserver.extractors.AttributeValueExtractor">
  <param name="body" value="background" type="str"/>
  <param name="embed" value="src" type="str"/>
</extractor>

Clicker

The extractor attempts to simulate user input by “clicking” on elements. This extractor takes one string parameter, "click". The parameter contains a semicolon separated list of elements to click on.

Example:

<extractor name="Clicker" class="com.fastsearch.jscriptserver.extractors.Clicker">
  <param name="click" value="a; area" type="str"/>
</extractor>

EventHandlerRunner

This extractor gets links by triggering JavaScript events. The event handler runner class has one string parameter, the "events" parameter. The value of this parameter is a semicolon separated list of events, which the extractor will execute to retrieve new links.

Example:

<extractor name="EventHandlerRunner" class="com.fastsearch.jscriptserver.extractors.EventHandlerRunner">
  <param name="events" value="onFocus; onBlur; onClick; onMouseDown;" type="str"/>
</extractor>

ScriptExtractor

The script extractor uses regular expressions to extract links from JavaScript tags.

Example:

<extractor name="ScriptExtractor" class="com.fastsearch.jscriptserver.extractors.ScriptExtractor"></extractor>



FormExtractor

This extractor tries to extract links from forms by "triggering" the submit button of each form.

Example:

<extractor name="FormExtractor" class="com.fastsearch.jscriptserver.extractors.FormExtractor"></extractor>

CSSExtractor

The extractor retrieves links from cascading style sheet definitions.

Example:

<extractor name="CSSExtractor" class="com.fastsearch.jscriptserver.extractors.CSSExtractor"></extractor>

MetaURLFinder

The MetaURLFinder extractor extracts links from within HTML meta tags.

Example:

<extractor name="MetaURLFinder" class="com.fastsearch.jscriptserver.extractors.MetaURLFinder"></extractor>

UserScript

The UserScript extractor makes it possible to create extractors using JavaScript. Thus, if none of the other extractors are able to retrieve the links, you can write your own extractor. The extractor has one parameter, "src". The parameter specifies the location of your JavaScript file. It can be a URL or a Java resource path.

Example:

<extractor name="JavaScriptExtractor" class="com.fastsearch.jscriptserver.extractors.UserScript">
  <param name="src" value="/JavaScriptExtractor.js" type="str"/>
</extractor>

Note that this script will be executed like any other script within a page. Please be cautious when naming variables and functions. The last line in the script must be an object containing the extracted links. The object must have named properties with their corresponding values being arrays of strings. The name of a property is the link type, and the array is the list of URIs found for that particular link type.

Example: A user script which extracts image links from a page.

var links = new Object();
links['images'] = new Array();
for (var i = 0; i < document.images.length; i++) {
  var image = document.images[i];
  links['images'].push(image.src);
}
links;
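The same pattern can be applied to other link sources. The sketch below collects the href values of anchors and wraps the work in a function so that helper variables do not leak into the page's global scope. It rests on two assumptions that are not confirmed by this guide: that document.links is available in the emulated DOM, and that "anchors" is an acceptable link type name.

```javascript
// Wrapped in a function so that no helper variables leak into the page's
// global scope. Assumptions: document.links is available in the emulated
// DOM, and "anchors" is an acceptable link type name.
function extractAnchorLinks(doc) {
  var result = { anchors: [] };
  for (var i = 0; i < doc.links.length; i++) {
    var href = doc.links[i].href;
    // Skip empty and duplicate URIs.
    if (href && result.anchors.indexOf(href) === -1) {
      result.anchors.push(href);
    }
  }
  return result;
}

// The last expression is the object holding the extracted links.
// The typeof guard only exists so the file also loads outside a page.
extractAnchorLinks(typeof document !== "undefined" ? document : { links: [] });
```

Removing duplicates in the script keeps the result small; whether the calling component would deduplicate the URIs anyway is not stated here.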

Flash settings

Setting: Description

config: Specifies the URI to the Flash configuration file, which is used to configure Flash extraction.

timeout: Maximum time (in milliseconds) that the BrowserEngine will use to process a Flash file before the processing is aborted.

Example:

<flash config="file:///home/user/FlashConfig.xml" timeout="5000"/>

If a Flash configuration file is not specified in the BrowserEngine configuration, the BrowserEngine will use its default configuration for Flash processing.

Configuration file

The Flash configuration file includes an ExtractLinksFromText tag. This tag has an attribute, enabled, which can be set to true or false. Setting this attribute to true allows the BrowserEngine to identify links in the text extracted from the Flash file. Enabling this option will increase the processing time of Flash files.

Note: Most of the links in a Flash file are not contained within the text itself, thus this is just an extra option to find additional links.

Setting: Description

prefix: Specifies a prefix. Tokens starting with this value will be identified as links.

suffix: Specifies a suffix. Tokens ending with this value will be identified as links.

Below is an example of a Flash configuration file:

<FlashConfig>
  <ExtractLinksFromText enabled="false">
    <prefix> http </prefix>
    <suffix> txt </suffix>
    <prefix> ftp </prefix>
    <suffix> js </suffix>
    <suffix> html </suffix>
  </ExtractLinksFromText>
</FlashConfig>
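The prefix and suffix rule can be sketched as a small filter over the extracted text. This is an illustration rather than the BrowserEngine's own logic: it assumes that the text is split on whitespace and that matching is case-sensitive, neither of which is stated in this guide. The function name findLinksInText is hypothetical.

```javascript
// Assumptions: text is split on whitespace and matching is case-sensitive;
// the guide only states the prefix/suffix rule itself.
function findLinksInText(text, prefixes, suffixes) {
  return text.split(/\s+/).filter(function (token) {
    if (token === "") return false;
    var byPrefix = prefixes.some(function (p) {
      return token.indexOf(p) === 0;
    });
    var bySuffix = suffixes.some(function (s) {
      // The length check avoids false matches on tokens shorter than s.
      return token.length >= s.length &&
             token.lastIndexOf(s) === token.length - s.length;
    });
    return byPrefix || bySuffix;
  });
}

var links = findLinksInText(
  "Visit http://www.example.com/ or open readme.txt today",
  ["http", "ftp"],
  ["txt", "js", "html"]
);
console.log(links);
```

Short suffixes such as "js" match any token that happens to end in those letters, which is one reason this option only supplements the regular Flash link extraction.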

Example

Below is an example file.

<config>

  <server maxThreads="50" maxQueueSize="20" port="50000"/>

  <browser type="Mozilla" allowPopups="false" useSSL="true">
    <cache size="25" ttl="3600000"/>

    <blacklist>
      <regexp value="http://ads\."/>
      <regexp value="doubleclick\.net"/>
    </blacklist>

    <javascript timeout="5000">
      <scriptPreProcessor src="/scriptPreProcessor.js"/>
      <pagePreProcessor src="/pagePreProcessor.js"/>
    </javascript>

  </browser>

  <pipeline maxIterations="1" obeyNoIndex="false" abortOnFailure="true">

    <extractor name="HtmlOutput" class="com.fastsearch.jscriptserver.extractors.HtmlOutput">
    </extractor>

    <extractor name="Cookies" class="com.fastsearch.jscriptserver.extractors.Cookies">
    </extractor>

    <extractor name="Checksum" class="com.fastsearch.jscriptserver.extractors.Checksum">
    </extractor>

    <extractor name="MetaURLFinder" class="com.fastsearch.jscriptserver.extractors.MetaURLFinder">
    </extractor>

  </pipeline>

</config>



Chapter 3: Operating the BrowserEngine

This chapter describes how to perform tasks such as starting/stopping, monitoring and logging of the BrowserEngine.

Topics:

• Starting and Stopping
• Logging
• Monitoring
• Tuning
• Restrictions

Starting and Stopping

Starting and stopping of the BrowserEngine can be done from the administrator interface or from the command line.

Starting from the administrator interface

To start the BrowserEngine from the administrator interface:

1. Select System Management on the navigation bar.

2. Locate the Browser Engine on the Installed module list - Module name. Select the Start symbol.

Stopping from the administrator interface

To stop the BrowserEngine from the administrator interface:

1. Select System Management on the navigation bar.

2. Locate the Browser Engine on the Installed module list - Module name. Select the Stop symbol.

Starting from the command line

Use the nctrl tool to start the BrowserEngine from the command line.

Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information.

Run the following command to start the BrowserEngine:

1. $FASTSEARCH/bin/nctrl start browserengine

Stopping from the command line

Use the nctrl tool to stop the BrowserEngine from the command line.

Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information.

Run the following command to stop the BrowserEngine:

1. $FASTSEARCH/bin/nctrl stop browserengine

Logging

The BrowserEngine produces logs which can help determine the state of a URI or the state of the whole system. By default, it logs to the $FASTSEARCH/var/log/browserengine directory.

To reduce network traffic, startup, shutdown, and status messages are the only types of messages sent to the Log Server. Document-level messages are therefore logged only on the node where the BrowserEngine is running.

If you are using the Enterprise Crawler with the BrowserEngine, the crawler also produces log messages that can be valuable in tracking down what is happening to a specific URI. Refer to the FAST Enterprise Crawler Guide for more information.

Change the BrowserEngine logging

It is generally not recommended to change the log level. However, it is sometimes necessary to change the log level to reveal why a certain page failed to be processed.



Knowledge about log4j is required. General log4j information is available at http://logging.apache.org/log4j/docs/

Note: If multiple BrowserEngines run on the same machine, they will all log to the same file. To log to different files, the log4j configuration has to be different for each engine.

1. Open $FASTSEARCH/components/browserengine/WEB-INF/classes/log4j.xml

2. Change the configuration to your needs and save the file.

3. Using the Node Controller, restart the BrowserEngine.

MonitoringThe BrowserEngine can currently be monitored by reading the log files and by using a set of methods exposedthrough XML-RPC.

If you are using the BrowserEngine in combination with the crawler, the crawleradmin tool has an option thatdisplays statistics for a particular Master (Crawler) node:$FASTSEARCH/bin/crawleradmin --browserengine

When run on an UberMaster, the output is a list of all the BrowserEngines that are used by the Master nodes.

TuningThe BrowserEngine may easily get overloaded or run out of Java heap space due to the fact that processingan HTML document like a browser and executing JavaScripts is a heavy task. This section explains how tomodify configuration settings in order to balance the workload.

Server

Performance may be improved by changing the maxThreads setting to increase or decrease the thread pool size. If the BrowserEngine uses too many threads, valuable CPU cycles are wasted on thread scheduling, which lowers throughput. Configuring the BrowserEngine with too many threads also increases the probability of running out of Java heap space, so a better solution may be to run multiple BrowserEngine instances. Configuring an engine with too few threads may also result in low throughput, as many of the threads may be blocked waiting for external dependencies. The optimal number of threads depends on the operating system, the hardware, and the content that is crawled. To tune it, closely monitor the system before and after the thread pool size has been modified, and measure the effect of each change on performance.

Browser

Increase the Cache section size parameter, or the TTL setting. This should increase the cache hit ratio, which means that fewer requests are made for external scripts and frames. As a result, fewer threads in the BrowserEngine will be blocked.

Pipeline

Configure the pipeline to use the minimal set of extractors you need. For instance, if you are only interested in extracting image links, the default pipeline configuration would involve too much unneeded processing.


Node deployment

Move the BrowserEngine to a faster (or less heavily utilized) server, or run multiple BrowserEngine instances on several nodes. Note that the Enterprise Crawler must be reconfigured if the BrowserEngine deployment is changed.

Enterprise Crawler tuning

The Enterprise Crawler generates a heavy load for the BrowserEngine during the first crawl cycle, as all documents are new and need to be processed. In subsequent crawl cycles a large portion of these documents are unmodified, so the load on the BrowserEngine is significantly reduced. It is therefore recommended to set up several BrowserEngine nodes for the first crawl cycle. After this cycle the number of nodes can be reduced.

Try to limit the number of documents that will be processed by the BrowserEngine. This can be achieved in the crawler by creating subdomains with JavaScript enabled. Furthermore, decreasing the javascript_delay attribute in the crawler will help the throughput in the BrowserEngine, as less time will be spent waiting for external dependencies.

Restrictions

This section discusses two common limitations of the BrowserEngine.

AJAX

The BrowserEngine does not fully support AJAX (Asynchronous JavaScript and XML). It extracts all links found in XMLHttpRequest calls, so the crawler will follow these links if permitted. Note, however, that it will not try to download and execute the code.

The HTTP POST method

The BrowserEngine and crawler do not support the POST method of HTTP. The POST method is quite commonly used to update frames and/or iframes. A potential workaround for this issue is to create a customized stage in the BrowserEngine's pipeline that extracts the links and returns them to the crawler as GET operations, so that the required content can be obtained.

Chapter 4

BrowserEngine reference information

This chapter contains various reference information about the BrowserEngine, such as command line parameters and the XML-RPC interface.

Topics:

• BrowserEngine binary
• XML-RPC Browser Interface
• XML-RPC Status Interface
• Extractor processing examples

BrowserEngine binary

The BrowserEngine is invoked by a shell script located at:

UNIX: $FASTSEARCH/components/browserengine/bin/browserengine.sh

Windows: %FASTSEARCH%\components\browserengine\bin\browserengine.cmd

Syntax: browserengine.(sh|cmd) [options] configfile

Option        Description

-h            Displays the option list.

-v            Shows version information.

-p            Sets the listening port number.
              Note: The option overrides a value set in the BrowserEngine configuration file.

-l            Sets the log directory path.

configfile    The configuration file as a URL or Java resource path. You can specify a
              configuration file from the configserver by using the following URL syntax:

              configserver://<ModuleName>/<FilePath>

              For instance:

              configserver://BrowserEngine/BrowserConfig.xml

              Note: If you want to specify a configuration file on the file system, the URL looks like this:

              file:///<FilePath>/<FileName>
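The syntax and option table above can be sketched as a small command-line builder. This is only an illustration: the helper names and the port number are hypothetical, and it assumes that -p and -l each take their value as the following argument, which the guide implies but does not spell out.

```python
# Default UNIX start script, as given in this section.
SCRIPT = "$FASTSEARCH/components/browserengine/bin/browserengine.sh"

def configserver_url(module, path):
    """Build a configserver://<ModuleName>/<FilePath> URL."""
    return f"configserver://{module}/{path}"

def build_command(config_url, port=None, log_dir=None, script=SCRIPT):
    """Assemble the argument list: script, options, then the config file."""
    args = [script]
    if port is not None:
        # Assumption: the port value follows -p as a separate argument.
        args += ["-p", str(port)]
    if log_dir is not None:
        # Assumption: the directory follows -l as a separate argument.
        args += ["-l", log_dir]
    args.append(config_url)  # configfile is the final argument
    return args

# 14195 is a made-up port used only for this example.
command = build_command(
    configserver_url("BrowserEngine", "BrowserConfig.xml"), port=14195
)
print(" ".join(command))
```

Because -p overrides the port set in the configuration file, leaving it out keeps the configured value in effect.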

XML-RPC Browser Interface

The BrowserEngine exposes methods for processing HTML and Flash through XML-RPC on its baseport.

HTML processing

Map Browser.process(String url, byte[] content, List headers, String proxyHost, int proxyPort, List extraHeaders)

where

Option          Description

url             The URL of the page.

content         The content of the page.

headers         A list of HTTP headers where each entry in the list is a list of length two containing the name and value of a header. As a minimum, a content-type header with text/html must be supplied.
                By adding Set-Cookie headers, you can define which cookies should be available to JavaScripts on the page.

proxyHost       The hostname or IP address of an HTTP proxy (if any).

proxyPort       The port number of an HTTP proxy (if any).

extraHeaders    Headers that will be sent with external dependency requests.

Returns a map containing the result (links, cookies, HTML and so on).
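As a rough sketch of how a client might assemble a Browser.process call, the following uses Python's standard xmlrpc.client module to build the six arguments and serialise the request without contacting a server. The helper name, the sample URL and cookie, and the use of an empty string and 0 to mean "no proxy" are assumptions made for illustration.

```python
import xmlrpc.client

def build_process_params(url, html, cookies=(), proxy_host="", proxy_port=0):
    """Build the six Browser.process arguments in the documented order."""
    # A content-type header with text/html is the documented minimum.
    headers = [["Content-Type", "text/html"]]
    # Each Set-Cookie header makes one cookie visible to scripts on the page.
    headers += [["Set-Cookie", cookie] for cookie in cookies]
    extra_headers = []  # headers for external dependency requests (none here)
    # Assumption: "" and 0 stand for "no proxy"; the guide does not say.
    return (url, xmlrpc.client.Binary(html), headers,
            proxy_host, proxy_port, extra_headers)

params = build_process_params(
    "http://www.example.com/",
    b"<html><body></body></html>",
    cookies=["session=abc; path=/"],
)
# Serialise the call locally so the request body can be inspected.
payload = xmlrpc.client.dumps(params, methodname="Browser.process")
print(payload)

# Against a running engine (host, port and path are placeholders):
# server = xmlrpc.client.ServerProxy("http://localhost:PORT/")
# result = server.Browser.process(*params)
```

The page content is wrapped in xmlrpc.client.Binary so that it is transmitted as base64, which is how XML-RPC carries byte arrays.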

Flash processing

byte[] Flash.process(String url, byte[] content)

where

Option     Description

url        The URL or some other identifier for the Flash content.

content    The content of the Flash file.

Returns an XML representation of the Flash file.

XML-RPC Status Interface

The BrowserEngine exposes methods that return various status information through XML-RPC on baseport + 1. The required configserver module methods (ping, ReRegister, ConfigurationChanged and so on) are also implemented on this port, but are not described any further in this document.

Map statistics()

Returns a map containing various statistical information about the server since it started. Example output (in the form of a Python dictionary):

...
'Total Requests': 2,
'Failed Requests': 0,
'Percentage Statistics': {
    'CacheHit': 50.0
},
'Pipeline Performance (ms)': {
    'AttributeValueExtractor': {'avg': 39, 'count': 2, 'max': 54, 'min': 24, 'tot': 78},
    'CSSExtractor': {'avg': 3, 'count': 2, 'max': 4, 'min': 3, 'tot': 7},
    ...
},
'Time Statistics (ms)': {
    'ExternalResource': {'avg': 532, 'count': 1, 'max': 532, 'min': 532, 'tot': 532},
    'PageLoading': {'avg': 1193, 'count': 2, 'max': 1990, 'min': 397, 'tot': 2387},
    ...
}
...
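A map of this shape lends itself to simple post-processing when tuning the pipeline. The sketch below ranks pipeline stages by average time and computes a failure rate; the function names are hypothetical, and the sample data is the excerpt from the example output above with the elided entries left out.

```python
def slowest_stages(stats, top=3):
    """Return (stage, avg ms) pairs, slowest first."""
    perf = stats.get("Pipeline Performance (ms)", {})
    ranked = sorted(perf.items(), key=lambda kv: kv[1]["avg"], reverse=True)
    return [(name, timing["avg"]) for name, timing in ranked[:top]]

def failure_rate(stats):
    """Fraction of requests that failed, or 0.0 if none were made."""
    total = stats.get("Total Requests", 0)
    return stats.get("Failed Requests", 0) / total if total else 0.0

# Excerpt of the example output shown in this section.
sample = {
    "Total Requests": 2,
    "Failed Requests": 0,
    "Percentage Statistics": {"CacheHit": 50.0},
    "Pipeline Performance (ms)": {
        "AttributeValueExtractor": {"avg": 39, "count": 2, "max": 54, "min": 24, "tot": 78},
        "CSSExtractor": {"avg": 3, "count": 2, "max": 4, "min": 3, "tot": 7},
    },
}

print(slowest_stages(sample))  # [('AttributeValueExtractor', 39), ('CSSExtractor', 3)]
print(failure_rate(sample))    # 0.0
```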

Map threads()

Returns a map where the keys are thread-ids and the values are maps describing the work status of the corresponding thread. Example output (in the form of a Python dictionary):

...
'pool-2-thread-43': {'started': 1180012778, 'status': 'loading_page', 'url': 'http://somewhere.com/somepage1.html'},
'pool-2-thread-44': {'status': 'idle/dead'},
'pool-2-thread-45': {'started': 1180013128, 'status': 'processing_page', 'url': 'http://somewhere.com/somepage2.html'},
...
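One possible use of the threads() map is to spot pages that have been in progress for a long time. The sketch below assumes that 'started' is a Unix timestamp in seconds, which the example values suggest but the guide does not state; the function name, the 'now' value, and the threshold are hypothetical.

```python
def long_running(threads, now, limit_seconds):
    """List (thread id, status, url, age) for threads busy longer than the limit."""
    busy = []
    for thread_id, info in threads.items():
        started = info.get("started")
        if started is None:
            # Idle/dead threads carry no start time in the example output.
            continue
        age = now - started  # assumption: both values are epoch seconds
        if age > limit_seconds:
            busy.append((thread_id, info.get("status"), info.get("url"), age))
    # Oldest work first.
    return sorted(busy, key=lambda item: item[3], reverse=True)

# The three threads from the example output in this section.
sample = {
    "pool-2-thread-43": {"started": 1180012778, "status": "loading_page",
                         "url": "http://somewhere.com/somepage1.html"},
    "pool-2-thread-44": {"status": "idle/dead"},
    "pool-2-thread-45": {"started": 1180013128, "status": "processing_page",
                         "url": "http://somewhere.com/somepage2.html"},
}

for entry in long_running(sample, now=1180013188, limit_seconds=120):
    print(entry)
```

With the chosen 'now', thread 43 has been busy for 410 seconds and is reported, while thread 45 (60 seconds) stays under the limit.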

Map getQueueStatus()

Returns a map containing two values, QueueSize and MaxQueueSize. This can be useful to determine whether or not the BrowserEngine is overloaded.

void quit()

Terminates the server.
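The map returned by getQueueStatus() can be reduced to a simple overload check, sketched below. The function names, the 80% threshold, and the sample values are assumptions chosen for illustration; the guide does not define when a queue counts as overloaded.

```python
def queue_fill_ratio(status):
    """Fraction of the queue in use, from QueueSize and MaxQueueSize."""
    max_size = status["MaxQueueSize"]
    return status["QueueSize"] / max_size if max_size else 0.0

def is_overloaded(status, threshold=0.8):
    """True when the queue is at or above the chosen fill threshold."""
    return queue_fill_ratio(status) >= threshold

# Hypothetical values; a real map comes from the status interface.
status = {"QueueSize": 45, "MaxQueueSize": 50}
print(queue_fill_ratio(status))  # 0.9
print(is_overloaded(status))     # True
```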

Extractor processing examples

Below are examples demonstrating how the different extractors work, and how they extract URIs.

HTMLOutput

Input to extractor:

<html>
  <head>
    <script language="javascript">
      document.writeln('standalone<br>');
      function test(arg) {
        document.writeln('## function test run from: '+arg+'<br>');
      }

      test('HEADER');
    </script>
  </head>

  <body>
    <script language="javascript">test('BODY');</script>
  </body>
</html>

Output from extractor:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    standalone<br/>
    ## function test run from: HEADER<br/>
    ## function test run from: BODY<br/>
  </body>
</html>

Cookies extractor

Input to extractor:

<html>
  <head>
    <script language="javascript">
      function test() {
        var param = "cookie_name_";
        for (i=1; i<10; i++) {
          createCookie(param+i, "val"+i, i);
        }
      }

      function createCookie(name, value, days) {
        var date = new Date();
        date.setTime(date.getTime()+(days*24*60*60*1000));
        var expires = "; expires="+date.toGMTString();
        document.cookie = name+"="+value+expires+"; path=/";
      }
    </script>
  </head>
  <body>
    <script language="javascript"> test() </script>
  </body>
</html>

Cookies extracted from page:

{'domain': 'www.example.com', 'name': 'cookie_name_1', 'value': 'val1', 'max-age': 86399, 'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_2', 'value': 'val2', 'max-age': 172799, 'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_3', 'value': 'val3', 'max-age': 259199, 'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_4', 'value': 'val4', 'max-age': 345599, 'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_5', 'value': 'val5', 'max-age': 431999, 'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_6', 'value': 'val6', 'max-age': 518399, 'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_7', 'value': 'val7', 'max-age': 604799, 'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_8', 'value': 'val8', 'max-age': 691199, 'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_9', 'value': 'val9', 'max-age': 777599, 'path': '/', 'spec': 'rfc2109'}
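The max-age values follow from the days argument passed to createCookie(), which sets the expiry to that many days ahead. The short check below compares the reported values with whole days in seconds. Each reported value is one second short; a plausible reading, offered here only as an assumption, is that a moment passes between setting the cookie and reading it back.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # same factor as in createCookie(), without the ms

def expected_max_age(days):
    """Lifetime in seconds of a cookie that expires 'days' days from now."""
    return days * SECONDS_PER_DAY

# days -> max-age as reported in the extracted cookie list above
reported = {1: 86399, 2: 172799, 3: 259199, 4: 345599, 5: 431999,
            6: 518399, 7: 604799, 8: 691199, 9: 777599}

for days, max_age in reported.items():
    # Print the gap between the nominal lifetime and the reported one.
    print(days, expected_max_age(days) - max_age)
```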

Checksum generator

Input to extractor:

<html>
  <head>
    <script language="javascript">
      function test() {
        document.writeln("<a href=\"test.html\">test.html </a>");
      }
    </script>
  </head>
  <body>
    <script language="javascript"> test() </script>
  </body>
</html>

HTML used for checksum generation in BrowserEngine:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>

  <body>
    <a href="test.html">test.html </a>
  </body>
</html>

Checksum generated by BrowserEngine: eac0a7ec83537763d3ba7671828d0989

If the BrowserEngine is not configured and the Enterprise Crawler generates the checksum, the result can be a different checksum. In that case the JavaScript code of the document is not processed, so the content of the document may differ.

Checksum generated by Enterprise Crawler: 1ed6cfe48b7a613ef93848c98aa1f88b

If the Enterprise Crawler were to process an HTML document that is identical to the JavaScript processed document, it would generate the same checksum as the BrowserEngine.

Example: HTML used in Enterprise Crawler for checksum generation

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <a href="test.html">test.html </a>
  </body>
</html>

Checksum generated by Enterprise Crawler: eac0a7ec83537763d3ba7671828d0989
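The principle can be illustrated with any content digest. The sketch below uses MD5 purely as an assumption: the 32-hex-digit checksums in this section are consistent with it, but the guide does not name the algorithm or say how the HTML is normalised first, so the digests produced here are not expected to match the values shown above.

```python
import hashlib

def checksum(html):
    """Hex digest of the document text (MD5 is an assumption, see above)."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

# What a crawler sees without JavaScript processing: the script source.
unprocessed = ('<body><script>document.writeln('
               '"<a href=\\"test.html\\">test.html </a>");</script></body>')
# What the page looks like after the script has run.
processed = '<body><a href="test.html">test.html </a></body>'
# A static page whose markup equals the processed output.
static_equivalent = '<body><a href="test.html">test.html </a></body>'

print(checksum(unprocessed) == checksum(processed))        # False
print(checksum(static_equivalent) == checksum(processed))  # True
```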

AttributeValue extractor

Input to extractor:

<img src="img_src_dyn.gif">

Links reported to the Enterprise Crawler:

img_src_dyn.gif

Clicker extractor

Input to extractor:

<html>
  <head>
    <title> JavaScript testing... </title>

    <script language="javascript">
      function createLink() {
        var protocol = "http";
        var sitename = "www.example.com";
        var doc = "/cl.html";

        document.getElementById("click").innerHTML = "<a href=\"deadlink.html\">Dead link</a><br><br>";
        document.getElementById("click").innerHTML += "<a href=\"" + protocol + "://" + sitename + "/" + doc + "\">New link</a>";
      }
    </script>
  </head>

  <body>
    <center>
      <div id="click">
        <img src="image.jpg" onclick="createLink();">
      </div>
    </center>
  </body>
</html>

Links reported to the Enterprise Crawler:

deadlink.html
http://www.example.com/cl.html

EventHandlerRunner extractor

Input to extractor:

<html>
  <head>
    <script language="javascript">
      function createLink() {
        var protocol = "http";
        var sitename = "www.example.com";
        var doc = "event.html";

        document.getElementById("click").innerHTML = "<a href=\"deadlink.html\">Dead link</a><br><br>";
        document.getElementById("click").innerHTML += "<a href=\"" + protocol + "://" + sitename + "/" + doc + "\">New link</a>";
      }
    </script>
  </head>

  <body>
    <center>
      <div id="click">
        <img src="picture.jpg" onMouseOut="createLink();">
      </div>
    </center>
  </body>
</html>

Links reported to the Enterprise Crawler:

deadlink.html
http://www.example.com/event.html

Script extractor

Input to extractor:

// document.location = 'http://www.example.com/docLoc.html';
// window.open("http://www.example.com/someOpen4.html", "window name");

Links reported to the Enterprise Crawler:

http://www.example.com/docLoc.html
http://www.example.com/someOpen4.html

Form extractor

Input to extractor:

document.writeln("<form action=\"action_dyn.html\" method=\"post\"><input type=\"submit\" value=\"Send\"> <input type=\"reset\"></form>");
<form action="action_static.html" type="submit"></form>

Links reported to the Enterprise Crawler:

action_dyn.html
action_static.html

CSS extractor

Input to extractor:

<style type="text/css">
  @import "1.css";
  @import url('2.css');
  body{background-image: url('3.jpg')}
</style>

Links reported to the Enterprise Crawler:

1.css
2.css
3.jpg
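To make the CSS example concrete, here is a rough regular-expression approximation of the same extraction. It is not the BrowserEngine's implementation, which the guide does not describe; the patterns and function name are assumptions that only cover the @import and url() forms shown above.

```python
import re

# @import "x.css";  or  @import url('x.css');
IMPORT_RE = re.compile(r"""@import\s+(?:url\(\s*)?['"]?([^'")\s;]+)""")
# url('x.jpg'), url("x.jpg") or url(x.jpg) anywhere in the stylesheet
URL_RE = re.compile(r"""url\(\s*['"]?([^'")\s]+)""")

def css_links(css):
    """Collect referenced URIs, without duplicates, imports first."""
    links = []
    for pattern in (IMPORT_RE, URL_RE):
        for match in pattern.finditer(css):
            uri = match.group(1)
            if uri not in links:
                links.append(uri)
    return links

css = """
@import "1.css";
@import url('2.css');
body{background-image: url('3.jpg')}
"""
print(css_links(css))  # ['1.css', '2.css', '3.jpg']
```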

MetaURLFinder extractor

Input to extractor:

<meta name="description" content="'http://fast.no/link.html'">
<meta name="description" content="http://noquoutesused.no/wont_find.html">

Links reported to the Enterprise Crawler:

http://fast.no/link.html
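The behaviour in this example, where only the URL wrapped in quotes inside the content attribute is found, can be imitated with two regular expressions. This is a sketch under that reading of the example, not a description of how MetaURLFinder actually works; the patterns and function name are assumptions.

```python
import re

# Value of the content attribute of each <meta> tag (double-quoted form only).
META_CONTENT_RE = re.compile(r'<meta\b[^>]*\bcontent="([^"]*)"', re.IGNORECASE)
# A URL that is itself wrapped in quotes inside that value.
QUOTED_URL_RE = re.compile(r"""['"](https?://[^'"]+)['"]""")

def meta_urls(html):
    """Return quoted URLs found inside meta content attributes."""
    urls = []
    for content in META_CONTENT_RE.findall(html):
        urls.extend(QUOTED_URL_RE.findall(content))
    return urls

html = (
    '<meta name="description" content="\'http://fast.no/link.html\'">'
    '<meta name="description" content="http://noquoutesused.no/wont_find.html">'
)
print(meta_urls(html))  # ['http://fast.no/link.html']
```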

UserScript extractor

JavaScript defined as userscript:

var test = new Object();
test['test'] = new Array();
test['test'].push(testvar);
test;

Input to userscript:

<html>
  <body>
    <script language="javascript">
      var testvar = 'test.html';
      vartest = 'MAGIC_'+testvar;
    </script>
  </body>
</html>


MAGIC_test.html.
