Upload
dorthy-charles
View
216
Download
0
Tags:
Embed Size (px)
Citation preview
XML Transformations andContent-based Crawling
Zachary G. IvesUniversity of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 19, 2023
Reminders
Homework 2 “release” version is now on the Web site Simple web crawling XPath XSLT Storage (Berkeley DB)
Milestone 1 due March 1 Milestone 2 due March 8
2
More than XPath
XPath identifies or extracts subtrees from an XML document
… But there are lots of cases where we want to convert from XML XML, or something else XML text (document extraction) XML HTML XML SVG etc.
Here we need something more – often XSLT
3
4
A Functional Language for XML
XSLT is based on a series of templates that match different parts of an XML document There’s a policy for what rule or template is
applied if more than one matches (it’s not what you’d think!)
XSLT templates can invoke other templates XSLT templates can be nonterminating (beware!)
XSLT templates are based on XPath “match”es, and we can also apply other templates (potentially to “select”ed XPaths) Within each template, directly describe what
should be output
5
An XSLT Template
An XML document itself XML tags create output OR are XSL operations
All XSL tags are prefixed with “xsl” namespace All non-XSL tags are part of the XML output
Common XSL operations: template with a match XPath Recursive call to apply-templates, which may also select
where it should be applied
Attach to XML document with a processing-instruction:
<?xml version = “1.0” ?><?xml-stylesheet type=“text/xsl” href=“http://www.com/my.xsl” ?>
6
An Example XSLT Stylesheet
<xsl:stylesheet version=“1.1”> <xsl:template match=“/dblp”> <html><head>This is DBLP</head> <body> <xsl:apply-templates /> </body> </html> </xsl:template> <xsl:template match=“article”>
<h2><xsl:apply-templates select=“title” /></h2> <p><xsl:apply-templates select=“author”/></p> </xsl:template> …</xsl:stylesheet>
7
XML DataRoot
?xml dblp
mastersthesis article
mdate key
author title year school editor title yearjournal volume eeee
mdatekey
2002…
ms/Brown92
Kurt P….
PRPL…
1992
Univ….
2002…
tr/dec/…
Paul R.
The…
Digital…
SRC…
1997
db/labs/dec
http://www.
attributeroot
p-i element
text
8
XSLT Processing Model
List of source nodes result tree fragment(s) Start with root
Find all template rules with matching patterns from root Find “best” match according to some heuristics Set the current node list to be the set of things it maches
Iterate over each node in the current node list Apply the operations of the template “Append” the results of the matching template rule to the
result tree structure Repeat recursively if specified to by apply-templates
9
What If There’s More than One Match?
Eliminate rules of lower precedence due to importing
Break a rule into any | branches and consider separately
Choose rule with highest computed or specified priority
Simple rules for computing priority based on “precision”: QName preceded by XPath child/axis specifier: priority 0 NCName preceded by child/axis specifier: priority -0.25 NodeTest preceded by child/axis specifier: pririty -0.5 else priority 0.5
10
Other Common Operations
Iteration:<xsl:for-each select=“path”></xsl:for-each>
Conditionals:<xsl:if test=“./text() < ‘abc’”></xsl:if>
Copying current node and children to the result set:
<xsl:copy><xsl:apply-templates />
</xsl:copy>
11
Creating Output Nodes
Return text/attribute data (this is a default rule):<xsl:template match=“text()|@*”>
<xsl:value-of select=“.”/></xsl:template>
Create an element from text (attribute is similar):
<xsl:element name=“text()”><xsl:apply-templates/>
</xsl:element>
Copy nodes matching a path<xsl:copy-of select=“*”/>
12
Embedding Stylesheets
You can “import” or “include” one stylesheet from another:<xsl:import href=“http://www.com/my.xsl/”><xsl:include href=“http://www.com/my.xsl/”>
“Include”: the rules get same precedence as in including template
“Import”: the rules are given lower precedence
13
XSLT Summary
A very powerful, template-based transformation language for XML document other structured document Commonly used to convert XML PDF, SVG, GraphViz DOT
format, HTML, WML, …
Primarily useful for presentation of XML or for very simple conversions
What if we want to: Manage and combine collections of XML documents? Make Web service requests for XML? “Glue together” different Web service requests? Query for keywords within documents, with ranked answers This is where XQuery plays a role – see CIS 330 / 550 for details
Now… How Do We Crawl the Web and Get Data?
A few remarks on basic crawlers…
… Then an XML-specific crawler
14
15
Crawling the Web: The Basic Process
Start with some initial page P0
Collect all URLs from P0 and add to the crawler queue Consider <base href> tag, anchor links, optionally
image links, CSS, DTDs, scripts
Considerations: What order to traverse (polite to do BFS – why?) How deep to traverse What to ignore (coverage) How to escape “spider traps” and avoid cycles How often to crawl
16
Essential Crawler Etiquette
Robot exclusion protocols First, ignore pages with:
<META NAME="ROBOTS” CONTENT="NOINDEX">
Second, look for robots.txt at root of web server See http://www.robotstxt.org/wc/robots.html
To exclude all robots from a server:User-agent: *Disallow: /
To exclude one robot from two directories:User-agent: BobsCrawlerDisallow: /news/Disallow: /tmp/
Suppose We Want to Crawl XML Documents Based on User Interests
We need several parts: A list of “interests” – expressed in an
executable form, perhaps XPath queries A crawler – goes out and fetches XML content A filter / routing engine – matches XML content
against users’ interests, sends them the content if it matches
17
18
XML-Based Information Dissemination
Basic model (XFilter, YFilter, Xyleme): Users are interested in data relating to a particular topic,
and know the schema/politics/usa//body
A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties
19
Engine for XFilter [Altinel & Franklin 00]
20
How Does It Work?
Each XPath segment is basically a subset of regular expressions over element tags Convert into finite state automata
Parse data as it comes in – use SAX API Match against finite state machines
Most of these systems use modified FSMs because they want to match many patterns at the same time
21
Path Nodes and FSMs
XPath parser decomposes XPath expressions into a set of path nodes
These nodes act as the states of corresponding FSM A node in the Candidate List denotes the current state The rest of the states are in corresponding Wait Lists
Simple FSM for /politics[@topic=“president”]/usa//body:
politics usa body
Q1_1 Q1_2 Q1_3
22
Decomposing Into Path Nodes
Query IDPosition in state machineRelative Position (RP) in tree:
0 for root node if it’s not preceded by “//”
-1 for any node preceded by “//”
Else =1+ (no of “*” nodes from predecessor node)
Level:If current node has fixed
distance from root, then 1+ distance
Else if RP = –1, then –1, else 0Finaly, NextPathNodeSet points to
next node
Q1=/politics[@topic=“president”]/usa//body
Q1 Q1 Q1
1 2 3
0 1 -1
1 2 -1Q1-1 Q1-2 Q1-3
Q2 Q2 Q2
1 2 3
-1 2 1-1 0 0
Q2-1 Q2-2 Q2-3
Q2=//usa/*/body/p
23
Query Index Query index entry
for each XML tag Two lists:
Candidate List (CL) and Wait List (WL) divided across the nodes
“Live” queries’ states are in CL; “pending” queries + states are in WL
Events that cause state transition are generated by the XML parser
politics
usa
body
p
Q1-1
Q2-1
Q1-3 Q2-2
Q2-3
X
X
X
X
X
X
X
X CLWL
Q1-2
24
Encountering an Element
Look up the element name in the Query Index and all nodes in the associated CL
Validate that we actually have a match
Q1
1
0
1Q1-1politics
Q1-1X
X
WL
startElement: politics
CL
Query IDPositionRel.
PositionLevelEntry in Query Index:
NextPathNodeSet
25
Validating a Match
We first check that the current XML depth matches the level in the user query: If level in CL node is less than 1, then ignore
height else level in CL node must = height
This ensures we’re matching at the right point in the tree!
Finally, we validate any predicates against attributes (e.g., [@topic=“president”])
26
Processing Further Elements
Queries that don’t meet validation are removed from the Candidate Lists
For other queries, we advance to the next state We copy the next node of the query from the
WL to the CL, and update the RP and level When we reach a final state (e.g., Q1-3), we
can output the document to the subscriber
When we encounter an end element, we must remove that element from the CL
27
Publish-Subscribe Model Summarized
Well-suited to an XML format called RSS (Rich Site Summary or Really Simple Syndication)
Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles
Seems like a perfect fit for publish-subscribe models!