© 2008 AT&T Intellectual Property. All rights reserved.
XTreeNet: A Framework for Flexible Large Scale Information Dissemination & Retrieval
TaeWon Cho, Divesh Srivastava, K. K. Ramakrishnan, Yin Zhang and many others
AT&T Labs Research, NJ, USA
August 2011
Network as the Vehicle for Information Dissemination
• The ‘network’ will become (and already has become) increasingly information-centric
– Information of all types is becoming electronic and network accessible
– Information is accessed based on content of interest, instead of location
• Information overload and scale: producers and consumers face challenges
– Large number of producers (publishers; data sources)
– Even larger number of consumers (subscribers, users querying/looking for content)
o The tremendous number of information producers makes it difficult for a consumer to know where to find relevant information
– Significant challenge: “whom and what to ask” & “whom and what to tell”
• XTreeNet looks at the various problems related to a network-based Information Dissemination and Retrieval environment
– Obtain “information” of interest by asking the network to find it
– Tell the network to deliver “information” of interest
– Ask the network what “information” I should be interested in
Role of the Network in Information Dissemination
• The success of information aggregators (search engines etc.) is unquestionable
– Information aggregators do play a key role
• Limitation:
– Disintermediates producers: constrains the business model of producers
• Timeliness and coverage are also key criteria for information dissemination
– Timeliness: information (including real-time information) needs to be available right away
o E.g., for a consumer to access real-time media content
o Ability for the content to be withdrawn is also desirable
– Coverage: availability of information depends on the set of information that intermediaries, like an aggregator, make available to the consumer
o Information providers can be “dynamic”/transient. Complete coverage by an aggregator may be difficult
o Desirable to enable information producers themselves to make information available on an as-needed basis
• Publish-subscribe based access has become somewhat popular
– (E.g., news groups, RSS feeds)
• Scalable information dissemination and query-response for information retrieval is essential
– Inherently N-to-N communication
– We seek to exploit XML-tagging of information
XML Routing: Overlay Services based on XML
• An XML Network: overlay network of XML switches/routers
• XTreeNet project: investigate the design for a large-scale integrated publish/subscribe + query/response application
• How can we partition functions between the overlay and underlay?
[Figure: an XML overlay network of XML routers on top of the IP network infrastructure, connecting a publisher, a database, a subscriber for alerts, a subscriber for information, and data query generation.]
XTreeNet Overview
• Publishers and Subscribers submit Content Descriptors (CDs) to the network
• As soon as a CD (from a producer or consumer) hits the network, it is mapped into a single hash-id at the first overlay router
– Subsequent routers downstream forward based on the hash-id
o Much more efficient than matching against aggregated query filters
• XTreeNet builds a common core-based tree (CBT) on a per-“CD” basis, integrating both producers and consumers of information
– Dynamically create CBT on first arrival of CD from producer
• Groups (overlay multicast) formed on an as-needed basis for each CD
– Very fine grained distribution tree connecting producers & consumers
– Branches to subscribers for disseminating published content & branches to publishers for forwarding queries
– Different cores for different CDs – reduce likelihood of traffic concentration
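The hash-id step above can be sketched in a few lines of Python. This is a minimal illustration, not XTreeNet's implementation: the use of SHA-256, the 64-bit truncation, the modulo rule for picking a core, and the names `cd_hash_id`, `core_for`, and `OverlayRouter` are all assumptions made for the example.

```python
import hashlib


def cd_hash_id(cd: str, bits: int = 64) -> int:
    """Map a content descriptor string to a fixed-size hash-id.

    Assumption: SHA-256 truncated to `bits` bits; the slides do not name a hash.
    """
    digest = hashlib.sha256(cd.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def core_for(hash_id: int, routers: list) -> str:
    """Pick a core router for a hash-id (hypothetical modulo rule).

    Different CDs tend to land on different cores, which spreads traffic.
    """
    return routers[hash_id % len(routers)]


class OverlayRouter:
    """Forwards on the hash-id alone, with no XML filter matching."""

    def __init__(self, name: str):
        self.name = name
        self.next_hops = {}  # hash_id -> set of neighbor names on the tree

    def add_branch(self, hash_id: int, neighbor: str) -> None:
        self.next_hops.setdefault(hash_id, set()).add(neighbor)

    def forward(self, hash_id: int) -> list:
        # A single dictionary lookup replaces matching against query filters.
        return sorted(self.next_hops.get(hash_id, set()))
```

A publisher and a subscriber that submit the same CD string obtain the same hash-id, so both end up on the same tree without the intermediate routers ever parsing the CD.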
Content Descriptors
• A CD can be an element of a topic hierarchy; multiple hierarchies may be supported (e.g., topics, geographic location)
– An XML schema path (root-to-leaf path) may also be used as the basis of a hierarchically structured domain for constructing CDs
o Disambiguate between multiple XML documents using string values at leaves
<rss> <channel> <editor> Jupiter </editor> <item> <title> ReutersNews </title> <link> reuters.com </link> </item> <description> abc </description></channel> </rss>
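[Figure: the document above drawn as a tree. rss has child channel; channel has children editor, item and description; item has children title and link. Leaf values: Jupiter (editor), ReutersNews (title), reuters.com (link), abc (description).]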
• Content Descriptors (CDs) act like “indexes” in a distributed database environment
– Each data item generated by a producer and each consumer query filter are independently mapped to a set of CDs
– A data item matches a query when the respective sets of CDs have at least one CD in common
• CDs decouple producers from the consumers
– Can support heterogeneous producer schemas
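The matching rule above (a data item matches a query when their CD sets share at least one CD) amounts to a set-intersection test. A minimal sketch, where the function name `matches` and the sample CD strings are chosen only for illustration:

```python
def matches(item_cds: set, query_cds: set) -> bool:
    """Return True when the item and the query share at least one CD."""
    return not item_cds.isdisjoint(query_cds)


# Illustrative CD sets; the strings follow the RSS example on this slide.
item = {"/rss/channel/item/title/ReutersNews", "/rss/channel/editor/Jupiter"}
query_hit = {"/rss/channel/item/title/ReutersNews"}
query_miss = {"/rss/channel/item/title/OtherFeed"}

print(matches(item, query_hit))   # True
print(matches(item, query_miss))  # False
```

Because both sides are mapped to CDs independently, the producer never needs to see the consumer's filter, and vice versa.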
Scalability of CDs
• Publisher guidance
o Information publisher provides guidance on which XML tags are of potential interest
• Strategies
o Fullpath: /rss/channel/item/title/ReutersNews
o Last Tag: /title/ReutersNews
o Keyword: ReutersNews
• Estimated by extracting CDs from the XML version of Wikipedia
[Figure: unique CDs generated by Wikipedia articles. x-axis: # of Wikipedia articles; y-axis: # of unique CDs (0 to 8,000,000); curves: Fullpath, Last Tag, Keyword, Last Tag + Keyword.]
• ~5M CDs for about 1M articles; the count grows slowly because CDs are duplicated across documents
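The three extraction strategies can be illustrated on the RSS document from the Content Descriptors slide. The sketch below is an assumption about how such extraction might look, not the code used for the Wikipedia estimate; the function name `extract_cds` and the strategy labels `"fullpath"`, `"lasttag"` and `"keyword"` are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<rss> <channel> <editor> Jupiter </editor> "
    "<item> <title> ReutersNews </title> <link> reuters.com </link> </item> "
    "<description> abc </description></channel> </rss>"
)


def extract_cds(xml_text: str, strategy: str = "fullpath") -> set:
    """Extract one CD per leaf element, using the chosen strategy."""
    if strategy not in ("fullpath", "lasttag", "keyword"):
        raise ValueError("unknown strategy: " + strategy)
    cds = set()

    def walk(elem, path):
        path = path + [elem.tag]
        value = (elem.text or "").strip()
        if len(elem) == 0 and value:  # leaf element with a string value
            if strategy == "fullpath":
                cds.add("/" + "/".join(path + [value]))
            elif strategy == "lasttag":
                cds.add("/" + elem.tag + "/" + value)
            else:  # keyword
                cds.add(value)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return cds


for s in ("fullpath", "lasttag", "keyword"):
    print(s, sorted(extract_cds(DOC, s)))
```

Shorter CDs (Last Tag, Keyword) collide more often across documents, which is one plausible reading of why the number of unique CDs differs between the curves.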
Scalable Multicast: Multicast Architecture with Adaptive Dual-state
• Multicast is key to efficient information dissemination
• Requirements for information-centric multicast:
– Scalability in group membership
o Fine granularity of access: support for a large number of groups
– Persistent access to group
o Network should be responsible for maintaining group membership unless users explicitly un-subscribe from the group
– Minimize loss of information
– Keep control traffic scalable
• Limitations of existing IP / overlay multicast
o Forwarding state grows linearly with number of groups
– State overhead (at multiple routers)
o Soft-state needs to be refreshed
– Control overhead
o Hence, limits scalability and has inadequate persistence
• How to achieve scalable and persistent multicast?
• MAD seeks to solve issues of scale and persistence with multicast
Group Memberships Lifetime & Activity Level
• Group activity can vary widely
– Analyzed publishing activity of RSS feeds
o Only 5% of RSS feeds publish more than 100 updates/month
o Median rate is 10 updates/month
– The 10% most active feeds contribute 75% of updates
• IP multicast: inactive groups are usually treated the same as active groups
o But can’t afford loss of information
[Figure: RSS publishing rate (# updates/month)]
[Figure: subscription count to YouTube channels]
• Membership (e.g., in a pub-sub environment) is likely to be long-lived
• Users subscribe, and remain interested in receiving information even when publishers distribute infrequently
• Only 2.3% of groups see a reduction
• Long-lived membership results in
– Network state grows for the group; increased group size
Using an IP-Multicast Style Approach
[Figure: example topology of 16 routers (00 to 15) with an IP-multicast-style distribution tree. Legend: first-hop router (FH), forwarder, router not participating, user.]
• A lot of routers maintain forwarding state:
• 6 intermediate routers keep state that has to be constantly refreshed
• 4 first-hop routers also keep state
• Every intermediate router has to maintain state
o Forwarding state grows linearly with number of groups
– State overhead (at multiple routers)
o Soft-state needs to be refreshed
– Control overhead
The MAD environment
• MAD multicast service overlay consists of a set of logical overlay routers
• Each logical router serves as a single aggregated local subscriber for all users attached to it
• Subscription manager is responsible for the users’ subscription management: it maintains subscriptions for users connected to the site
Differentiate the Roles of Multicast State
• Membership State vs. Forwarding State
• Group membership can be separated from forwarding state
– Group membership must be stored scalably and persistently
o Especially for groups that have low frequency of information flow
– Forwarding state: efficient forwarding of active groups
o Can be re-generated when a group becomes active
• Active and inactive groups can be treated differently
– Small percent of (active) groups generate data at a high rate: forward efficiently
– Large percent of (inactive) groups generate low traffic volume
The MAD Solution
• Group membership is separated from forwarding state: Multicast with Adaptive Dual State
• Use Membership Tree (MT) for scalable state maintenance
– Store group membership information in MT
o Minimize number of intermediate routers keeping group state
– Impose static virtual hierarchy => no control overhead
o But, static hierarchy may not result in optimal delivery path
• Use Dissemination Tree (DT) for forwarding efficiency
– Use DT for active groups
o Can use any “state-of-the-art” multicast protocol
• MAD may begin as an overlay multicast service
– Use IP multicast to improve forwarding efficiency for DT
– MT may also eventually evolve to being supported by the underlay
• MAD achieves best of both worlds: scalability and forwarding efficiency
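The dual-state idea can be sketched as a per-group record that always keeps membership state and only holds forwarding state while the group is active. This is a toy model under stated assumptions, not the MAD protocol: the class `GroupState`, the function `update_mode`, and the activity threshold of 100 updates/month are hypothetical (the slides report that figure as a measurement, not as a switching rule).

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class GroupState:
    # Membership state: kept persistently for every group (MT side).
    members: Set[str] = field(default_factory=set)
    # Forwarding state: present only while the group is active (DT side).
    forwarding: Optional[Set[str]] = None


def update_mode(state: GroupState, updates_per_month: int,
                active_threshold: int = 100) -> str:
    """Return "DT" for active groups and "MT" for inactive ones.

    The threshold is an illustrative assumption.
    """
    if updates_per_month > active_threshold:
        if state.forwarding is None:
            # Forwarding state is re-generated from the stored membership.
            state.forwarding = set(state.members)
        return "DT"
    # Inactive group: drop forwarding state, keep membership.
    state.forwarding = None
    return "MT"
```

The point of the sketch is that dropping forwarding state never loses subscribers, because membership is stored separately.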
MAD Membership Tree protocol overview
• Goal of Membership Tree: reduce # routers keeping multicast group state
• MT selects the core (root) based on hash of group ID
– Define a single base tree at this root (static)
– All groups selecting this root use the base tree to construct MT
• Subscriber join is forwarded up on the base tree until it reaches the first on-tree node for this group’s MT
– When a subtree rooted at an en-route router has more than a min. # of first-hop routers with attached subscribers, the parent node on the MT requires that the en-route router join the MT
• MAD protocol provides for seamless transition from DT to MT as the level of group activity changes (reduces) over time
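One way to read the join rule above is sketched below. It computes which routers end up holding membership state for a group; it does not model the join messages themselves. Everything specific here is an assumption: the heap-style layout of the static base tree, the SHA-256 core selection, the names `root_for_group`, `base_parent` and `membership_tree`, and the interpretation that an en-route router joins once its subtree contains at least `min_first_hops` subscribed first-hop routers.

```python
import hashlib
from typing import Dict, Optional, Set


def root_for_group(group_id: str, n: int) -> int:
    """Select the core (root) router from a hash of the group ID."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def base_parent(router: int, root: int, n: int, fanout: int) -> Optional[int]:
    """Parent of `router` in a static fanout-ary base tree rooted at `root`.

    Hypothetical layout: routers are numbered 0..n-1 and placed heap-style.
    """
    v = (router - root) % n  # position of the router relative to the root
    if v == 0:
        return None
    return ((v - 1) // fanout + root) % n


def membership_tree(first_hops: Set[int], root: int, n: int, fanout: int,
                    min_first_hops: int) -> Dict[int, Optional[int]]:
    """Map each on-tree router to its parent on the group's MT."""
    # Count subscribed first-hop routers in every base-tree subtree.
    count: Dict[int, int] = {}
    for fh in first_hops:
        node: Optional[int] = fh
        while node is not None:
            count[node] = count.get(node, 0) + 1
            node = base_parent(node, root, n, fanout)
    # The root, the first-hop routers, and sufficiently loaded en-route
    # routers keep state; all other routers are skipped.
    on_tree = {root} | set(first_hops) | {
        r for r, c in count.items() if c >= min_first_hops
    }
    mt_parent: Dict[int, Optional[int]] = {}
    for r in on_tree:
        p = base_parent(r, root, n, fanout)
        while p is not None and p not in on_tree:
            p = base_parent(p, root, n, fanout)  # join travels further up
        mt_parent[r] = p
    return mt_parent
```

With 16 routers, fan-out 8 and a threshold of 2, subscribers behind routers 9 and 10 cause their common base-tree parent to join the MT, while a lone subscriber behind router 9 attaches directly to the root and leaves the en-route router stateless.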
Routers Maintaining State in MAD
• Fewer routers maintain state:
– 2 intermediate routers and 4 FH routers
• Forwarding by multicast/unicast – not necessarily efficient
• MT reduces number of routers keeping Multicast State by aggregating subscriber state in a virtual sub-tree
[Figure: the 16-router example topology with the Membership Tree (4 first-hops, 5 users), shown alongside the static Base Tree drawn as a virtual membership tree (fan-out 8, aggregation threshold 2).]
Scalability of Multicast with MAD
[Figure: two plots against the number of first-hop routers in a group. y-axes: number of groups (trillions) and total delay (msec).]
• State efficiency with MAD is significantly better than IP multicast-like approaches (DT)
• Forwarding efficiency with MAD is as good as IP multicast (DT)
• Evaluation using simulation and measurements with implementation
– Implementation measured on Emulab with about 100 routers
– Simulation with 16,000 routers; Power-law topology
• MAD achieves both efficient state maintenance and efficient forwarding
Summary
• XTreeNet: project we have been working on – primarily focused on the meta-data plane
– XTreeNet Architecture – complex processing at the edges; efficient forwarding in the core
– MAD: Scalable Multicast – Large # groups; Large # subscribers
– QDTs: Query Distribution Trees for Distribution of Complex Queries – Load Balancing, Privacy preservation, Censorship Resistant
– Recommendation Systems: Scalable, Privacy Preserving
• More recent work: “COPSS: An Efficient Content-Oriented Publish/Subscribe System” in collaboration with folks from University of Goettingen, Germany