XML Transformations and Content-based Crawling Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems June 18, 2022

XML Transformations and Content-based Crawling Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems August 7, 2015


XML Transformations and Content-based Crawling

Zachary G. Ives
University of Pennsylvania

CIS 455 / 555 – Internet and Web Systems

April 19, 2023


Reminders

Homework 2 “release” version is now on the Web site
  Simple web crawling
  XPath
  XSLT
  Storage (Berkeley DB)

Milestone 1 due March 1
Milestone 2 due March 8


More than XPath

XPath identifies or extracts subtrees from an XML document

… But there are lots of cases where we want to convert from XML → XML, or to something else:
  XML → text (document extraction)
  XML → HTML
  XML → SVG
  etc.

Here we need something more – often XSLT


A Functional Language for XML

XSLT is based on a series of templates that match different parts of an XML document
  There’s a policy for what rule or template is applied if more than one matches (it’s not what you’d think!)

XSLT templates can invoke other templates
  XSLT templates can be nonterminating (beware!)

XSLT templates are based on XPath “match”es, and we can also apply other templates (potentially to “select”ed XPaths)
  Within each template, directly describe what should be output


An XSLT Template

An XSLT stylesheet is itself an XML document
  XML tags create output OR are XSL operations
  All XSL tags are prefixed with the “xsl” namespace
  All non-XSL tags are part of the XML output

Common XSL operations:
  template with a match XPath
  Recursive call to apply-templates, which may also select where it should be applied

Attach to XML document with a processing-instruction:

<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="http://www.com/my.xsl" ?>


An Example XSLT Stylesheet

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/dblp">
    <html>
      <head><title>This is DBLP</title></head>
      <body>
        <xsl:apply-templates />
      </body>
    </html>
  </xsl:template>
  <xsl:template match="article">
    <h2><xsl:apply-templates select="title" /></h2>
    <p><xsl:apply-templates select="author" /></p>
  </xsl:template>
  …
</xsl:stylesheet>


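XML Data

[Figure: tree view of the DBLP XML data. The root has a processing-instruction child (?xml) and a dblp element. The dblp element contains a mastersthesis element (attributes mdate, key; children author, title, year, school) and an article element (attributes mdate, key; children editor, title, journal, volume, year, ee), each child holding a text node. Legend of node kinds: root, p-i, element, attribute, text.]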


XSLT Processing Model

List of source nodes → result tree fragment(s)

Start with root
  Find all template rules with matching patterns from root
  Find “best” match according to some heuristics
  Set the current node list to be the set of things it matches

Iterate over each node in the current node list
  Apply the operations of the template
  “Append” the results of the matching template rule to the result tree structure
  Repeat recursively if specified to by apply-templates
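
The processing loop above can be imitated in a few lines of Python. This is only a toy sketch, not an XSLT processor: the templates are hypothetical Python functions keyed by element name (standing in for match patterns), there is no priority resolution, and the built-in rules are reduced to "recurse into children and copy text". The two templates mirror the dblp and article rules of the example stylesheet.

```python
import xml.etree.ElementTree as ET

def apply_templates(node, templates, select=None):
    """Rough analogue of <xsl:apply-templates/>: process the children of
    `node` (only those named `select`, if given) and join their output."""
    out = []
    if select is None and node.text:
        out.append(node.text.strip())   # built-in rule: copy text (tails ignored)
    for child in node:
        if select is None or child.tag == select:
            out.append(process(child, templates))
    return "".join(out)

def process(node, templates):
    """Pick the template whose pattern (here just a tag name) matches."""
    rule = templates.get(node.tag)
    if rule is None:                    # built-in rule: keep recursing
        return apply_templates(node, templates)
    return rule(node, templates)

def dblp_rule(node, templates):
    return "<html><body>" + apply_templates(node, templates) + "</body></html>"

def article_rule(node, templates):
    title = apply_templates(node, templates, "title")
    author = apply_templates(node, templates, "author")
    return "<h2>" + title + "</h2>" + "<p>" + author + "</p>"

TEMPLATES = {"dblp": dblp_rule, "article": article_rule}

def transform(xml_text):
    """Start at the document element and build the output string."""
    return process(ET.fromstring(xml_text), TEMPLATES)
```

Note that the article rule only selects title and author, so other children such as year never reach the output; that is the same effect the select attribute has in the stylesheet.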


What If There’s More than One Match?

Eliminate rules of lower precedence due to importing

Break a rule into any | branches and consider separately

Choose rule with highest computed or specified priority

Simple rules for computing priority based on “precision”:
  QName preceded by a child or attribute axis specifier: priority 0
  NCName:* preceded by a child or attribute axis specifier: priority -0.25
  Any other NodeTest preceded by a child or attribute axis specifier: priority -0.5
  Else: priority 0.5
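
As a rough illustration of these rules, the sketch below scores simple match patterns in Python. It is a simplification under stated assumptions: patterns are split naively on "|" (so a "|" inside a predicate would confuse it), only the child:: / attribute:: / @ axis prefixes are recognised, and the function names are made up for this example.

```python
import re

# Node tests that are not names; a bare one of these gets priority -0.5.
_KIND_TESTS = {"node()", "text()", "comment()", "processing-instruction()"}

def default_priority(alternative):
    """Default priority of one '|'-free pattern, per the rules above."""
    p = alternative.strip()
    if "/" in p or "[" in p:
        return 0.5                      # multi-step path or predicate
    # Drop an optional child/attribute axis specifier.
    p = re.sub(r"^(child::|attribute::|@)", "", p)
    if p == "*" or p in _KIND_TESTS:
        return -0.5                     # bare NodeTest
    if p.endswith(":*"):
        return -0.25                    # NCName:*
    if re.fullmatch(r"[\w.-]+(:[\w.-]+)?", p):
        return 0.0                      # QName
    if re.fullmatch(r"processing-instruction\(.+\)", p):
        return 0.0                      # processing-instruction('target')
    return 0.5                          # anything else

def priorities(pattern):
    """Break a pattern into its '|' branches and score each separately."""
    return {alt.strip(): default_priority(alt) for alt in pattern.split("|")}
```

For instance, the default text/attribute rule text()|@* yields two branches that both score -0.5, while a pattern such as dblp/article scores 0.5 and would win over a plain article pattern at 0.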


Other Common Operations

Iteration:
<xsl:for-each select="path">
</xsl:for-each>

Conditionals:
<xsl:if test="./text() &lt; 'abc'">
</xsl:if>

Copying current node and children to the result set:
<xsl:copy>
  <xsl:apply-templates />
</xsl:copy>


Creating Output Nodes

Return text/attribute data (this is a default rule):
<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>

Create an element from text (attribute is similar):
<xsl:element name="{text()}">
  <xsl:apply-templates/>
</xsl:element>

Copy nodes matching a path:
<xsl:copy-of select="*"/>


Embedding Stylesheets

You can “import” or “include” one stylesheet from another:
<xsl:import href="http://www.com/my.xsl"/>
<xsl:include href="http://www.com/my.xsl"/>

“Include”: the rules get the same precedence as those in the including stylesheet

“Import”: the imported rules are given lower precedence


XSLT Summary

A very powerful, template-based transformation language for XML document → other structured document
  Commonly used to convert XML → PDF, SVG, GraphViz DOT format, HTML, WML, …

Primarily useful for presentation of XML or for very simple conversions

What if we want to:
  Manage and combine collections of XML documents?
  Make Web service requests for XML?
  “Glue together” different Web service requests?
  Query for keywords within documents, with ranked answers?
This is where XQuery plays a role – see CIS 330 / 550 for details


Now… How Do We Crawl the Web and Get Data?

A few remarks on basic crawlers…

… Then an XML-specific crawler


Crawling the Web: The Basic Process

Start with some initial page P0

Collect all URLs from P0 and add to the crawler queue
  Consider <base href> tag, anchor links, optionally image links, CSS, DTDs, scripts

Considerations:
  What order to traverse (polite to do BFS – why?)
  How deep to traverse
  What to ignore (coverage)
  How to escape “spider traps” and avoid cycles
  How often to crawl
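
A minimal breadth-first sketch of this process in Python follows. It makes several assumptions: pages are obtained through a caller-supplied fetch(url) function (hypothetical, so the example runs without network access), only anchor links and the base tag are considered, cycles are avoided with a seen set, and depth and page limits stand in for real spider-trap defences. Politeness delays and re-crawl scheduling are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags and honours <base href>."""
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base":
            self.base = urljoin(self.base, href)
        elif tag == "a":
            self.links.append(href)

    def absolute_links(self):
        # Resolve against the base URL and drop #fragments.
        return [urldefrag(urljoin(self.base, h)).url for h in self.links]

def crawl(start_url, fetch, max_depth=2, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None.
    Returns the URLs in the order they were fetched."""
    seen = {start_url}                  # avoids cycles
    queue = deque([(start_url, 0)])     # FIFO queue gives BFS order
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        if depth == max_depth:          # do not expand links any deeper
            continue
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.absolute_links():
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

One common argument for BFS is that it tends to spread consecutive requests over many pages and hosts instead of drilling into a single site, though a production crawler would still need per-host rate limiting.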


Essential Crawler Etiquette

Robot exclusion protocols

First, ignore pages with:
<META NAME="ROBOTS" CONTENT="NOINDEX">

Second, look for robots.txt at root of web server
  See http://www.robotstxt.org/wc/robots.html

To exclude all robots from a server:
User-agent: *
Disallow: /

To exclude one robot from two directories:
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
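
The two robots.txt examples above can be checked with Python's standard urllib.robotparser module. This sketch parses the rules from strings rather than fetching them from a server; the helper name make_checker and the example.org URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

def make_checker(robots_txt):
    """Parse robots.txt text and return its can_fetch(agent, url) method."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch

# Excludes every robot from the whole server.
EXCLUDE_ALL = """\
User-agent: *
Disallow: /
"""

# Excludes only BobsCrawler, and only from two directories.
EXCLUDE_BOB = """\
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
"""
```

A crawler would call the returned function with its own user-agent name before queueing each URL, for example make_checker(EXCLUDE_BOB)("BobsCrawler", "http://example.org/news/today.html"), and skip the URL when the result is False.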


Suppose We Want to Crawl XML Documents Based on User Interests

We need several parts:
  A list of “interests” – expressed in an executable form, perhaps XPath queries
  A crawler – goes out and fetches XML content
  A filter / routing engine – matches XML content against users’ interests, sends them the content if it matches


XML-Based Information Dissemination

Basic model (XFilter, YFilter, Xyleme):
  Users are interested in data relating to a particular topic, and know the schema
    /politics/usa//body
  A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties


Engine for XFilter [Altinel & Franklin 00]


How Does It Work?

Each XPath segment is basically a subset of regular expressions over element tags
  Convert into finite state automata

Parse data as it comes in – use SAX API
  Match against finite state machines

Most of these systems use modified FSMs because they want to match many patterns at the same time
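
To make the idea concrete, here is a small Python sketch that drives one finite state machine from SAX events. It is deliberately limited and is not how XFilter itself is organised: it handles a single absolute path with child steps and "*" wildcards only (no "//", no predicates), and the class and function names are invented for the example.

```python
import xml.sax

class PathFSM(xml.sax.ContentHandler):
    """Matches one absolute path such as /politics/usa/body."""
    def __init__(self, path):
        super().__init__()
        self.steps = path.strip("/").split("/")
        # stack[-1] is the FSM state at the current depth:
        # the number of steps matched so far, or -1 for a dead state.
        self.stack = [0]
        self.matched = False

    def startElement(self, name, attrs):
        state = self.stack[-1]
        if (state != -1 and state < len(self.steps)
                and self.steps[state] in (name, "*")):
            state += 1                      # transition on this element
            if state == len(self.steps):
                self.matched = True         # reached the accepting state
        else:
            state = -1                      # nothing below here can match
        self.stack.append(state)

    def endElement(self, name):
        self.stack.pop()                    # return to the parent's state

def matches(path, xml_bytes):
    """Stream the document through the FSM and report whether it matched."""
    fsm = PathFSM(path)
    xml.sax.parseString(xml_bytes, fsm)
    return fsm.matched
```

Running one such machine per query would repeat work for every subscriber, which is the motivation for the shared, index-based structures described next.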


Path Nodes and FSMs

XPath parser decomposes XPath expressions into a set of path nodes

These nodes act as the states of the corresponding FSM
  A node in the Candidate List denotes the current state
  The rest of the states are in corresponding Wait Lists

Simple FSM for /politics[@topic="president"]/usa//body:

  State:    Q1_1      Q1_2  Q1_3
  Element:  politics  usa   body


Decomposing Into Path Nodes

Query ID
Position in state machine
Relative Position (RP) in tree:
  0 for root node if it’s not preceded by “//”
  -1 for any node preceded by “//”
  Else = 1 + (number of “*” nodes from predecessor node)
Level:
  If current node has fixed distance from root, then 1 + distance
  Else if RP = -1, then -1, else 0
Finally, NextPathNodeSet points to next node

Q1 = /politics[@topic="president"]/usa//body

  Path node           Q1-1  Q1-2  Q1-3
  Query ID            Q1    Q1    Q1
  Position            1     2     3
  Relative Position   0     1     -1
  Level               1     2     -1

Q2 = //usa/*/body/p

  Path node           Q2-1  Q2-2  Q2-3
  Query ID            Q2    Q2    Q2
  Position            1     2     3
  Relative Position   -1    2     1
  Level               -1    0     0
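
The decomposition rules above can be sketched in Python as follows. This is an illustrative reading of the slide, not XFilter's actual code: the regular expression assumes predicates contain no "/" or nested brackets, a "*" step directly after "//" simply yields RP -1, and the PathNode field names are chosen for this example.

```python
import re
from typing import NamedTuple, Optional

class PathNode(NamedTuple):
    query_id: str
    position: int             # 1-based position in the state machine
    name: str                 # element name this node waits for
    rp: int                   # relative position
    level: int
    predicate: Optional[str]  # raw predicate text, if any

# One location step: axis ('/' or '//'), name test, optional predicate.
_STEP = re.compile(r'(//|/)([^/\[]+)(\[[^\]]*\])?')

def decompose(query_id, xpath):
    """Turn an XPath into path nodes with RP and Level as defined above."""
    nodes = []
    depth = 0            # distance from root so far; None once '//' was seen
    wildcards = 0        # '*' steps skipped since the previous path node
    descendant = False   # was there a '//' since the previous path node?
    for axis, name, pred in _STEP.findall(xpath):
        if axis == "//":
            descendant, depth = True, None
        elif depth is not None:
            depth += 1
        if name == "*":          # wildcards produce no path node
            wildcards += 1
            continue
        if descendant:
            rp = -1
        elif not nodes and wildcards == 0:
            rp = 0               # root node not preceded by '//'
        else:
            rp = 1 + wildcards
        if depth is not None:
            level = depth        # fixed distance from the root
        else:
            level = -1 if rp == -1 else 0
        nodes.append(PathNode(query_id, len(nodes) + 1, name, rp, level,
                              pred or None))
        wildcards, descendant = 0, False
    return nodes
```

Applied to Q1 and Q2, this reproduces the Relative Position and Level rows of the two tables above.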


Query Index

Query index entry for each XML tag

Two lists: Candidate List (CL) and Wait List (WL) divided across the nodes

“Live” queries’ states are in CL; “pending” queries + states are in WL

Events that cause state transition are generated by the XML parser

[Figure: query index for Q1 and Q2, one entry per element name]

  Element    CL      WL
  politics   Q1-1
  usa        Q2-1    Q1-2
  body               Q1-3, Q2-2
  p                  Q2-3


Encountering an Element

Look up the element name in the Query Index and all nodes in the associated CL

Validate that we actually have a match

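[Figure: on the event startElement: politics, the Query Index entry for politics is looked up. Its CL holds Q1-1, the path node with Query ID = Q1, Position = 1, Relative Position = 0, Level = 1, and a NextPathNodeSet pointer; its WL is empty.]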


Validating a Match

We first check that the current XML depth matches the level in the user query:
  If level in CL node is less than 1, then ignore height
  Else level in CL node must = height

This ensures we’re matching at the right point in the tree!

Finally, we validate any predicates against attributes (e.g., [@topic="president"])


Processing Further Elements

Queries that don’t meet validation are removed from the Candidate Lists

For other queries, we advance to the next state
  We copy the next node of the query from the WL to the CL, and update the RP and level
  When we reach a final state (e.g., Q1-3), we can output the document to the subscriber

When we encounter an end element, we must remove that element from the CL
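
The steps on the last few slides can be put together in a compact Python sketch. It is a simplified reading of the XFilter idea, not the published implementation: path nodes for Q1 and Q2 are hard-coded from the earlier tables, the Wait List is implicit in each query's node list, a copied node's level is assumed to be current depth + RP (or -1 when RP is -1), nodes that fail validation are skipped rather than deleted, and an end element removes the CL entries that its start element added. All class and function names are invented for this example.

```python
import xml.sax
from collections import defaultdict
from typing import NamedTuple, Optional, Tuple

class PathNode(NamedTuple):
    query: str
    pos: int                                # 1-based position in the query
    name: str
    rp: int
    level: int
    attr: Optional[Tuple[str, str]] = None  # (attribute, required value)

QUERIES = {
    "Q1": [PathNode("Q1", 1, "politics", 0, 1, ("topic", "president")),
           PathNode("Q1", 2, "usa", 1, 2),
           PathNode("Q1", 3, "body", -1, -1)],
    "Q2": [PathNode("Q2", 1, "usa", -1, -1),
           PathNode("Q2", 2, "body", 2, 0),
           PathNode("Q2", 3, "p", 1, 0)],
}

class XFilterSketch(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.queries = queries
        self.cl = defaultdict(list)   # element name -> [(node, level), ...]
        for nodes in queries.values():
            self.cl[nodes[0].name].append((nodes[0], nodes[0].level))
        self.depth = 0
        self.added = []               # per open element: CL entries it added
        self.matched = set()

    def startElement(self, name, attrs):
        self.depth += 1
        inserted = []
        for node, level in list(self.cl.get(name, [])):
            if level >= 1 and level != self.depth:
                continue              # wrong depth for this node
            if node.attr and attrs.get(node.attr[0]) != node.attr[1]:
                continue              # attribute predicate failed
            nodes = self.queries[node.query]
            if node.pos == len(nodes):
                self.matched.add(node.query)      # final state reached
                continue
            nxt = nodes[node.pos]     # next node (pos is 1-based)
            nxt_level = -1 if nxt.rp == -1 else self.depth + nxt.rp
            self.cl[nxt.name].append((nxt, nxt_level))
            inserted.append((nxt.name, (nxt, nxt_level)))
        self.added.append(inserted)

    def endElement(self, name):
        for cl_name, entry in self.added.pop():
            self.cl[cl_name].remove(entry)
        self.depth -= 1

def matching_queries(xml_bytes, queries=QUERIES):
    """Return the IDs of the queries that the document satisfies."""
    handler = XFilterSketch(queries)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matched
```

The set returned by matching_queries identifies the subscribers whose interests the document satisfies, which is where a dissemination system would route it.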


Publish-Subscribe Model Summarized

Well-suited to an XML format called RSS (Rich Site Summary or Really Simple Syndication)

Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles

Seems like a perfect fit for publish-subscribe models!