Mining and supporting community structures in sensor network research
Alberto Pepe (University of California at Los Angeles)
Marko A. Rodriguez (Los Alamos National Laboratory)
CENS Friday Seminar | May 2, 2008
Outline.
Studying Collaboration at CENS
Introduction to Data Practices
Detection of Structural Communities
Data Set and Methods
Results
Supporting Collaboration at CENS
Introduction to the Semantic Web
Semantic Networks and Graph Databases
Analyzing Semantic Networks
Demo
Alberto Marko
Data practices group.
Background research questions:
What are CENS data?
What context data is necessary to support interpretation during
re-use?
How can we automate the capture of context data?
How can we link scholarly and scientific data into meaningful
aggregations/chains?
What are the social and academic settings that yield the
production of scientific and engineering data/knowledge?
Current study.
Question: how do collaboration communities differ from
socioacademic communities?
Method: comparative analysis of coauthorship network community
structure and selected socioacademic community structures (e.g.
academic department, affiliation, country of origin, academic
position).
Rodriguez, M.A., Pepe, A., On the relationship between the
structural and socioacademic communities of a coauthorship network,
Journal of Informetrics, in press, 2008.
Steps of the study.
Gather bibliographic and socioacademic data.
Generate coauthorship network.
Determine structural communities in the coauthorship
network.
Test for statistical independence between the structural and
socioacademic communities.
Gathered academic department, academic affiliation, country of
origin, and academic position
Generate coauthorship network.
@article{
  author  = {Marko A. Rodriguez and Alberto Pepe},
  title   = {On the relationship},
  journal = {Journal of Informetrics},
  year    = 2008,
  editor  = {Leo Egghe},
}
Alberto -- coauthor -- Marko
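The step from bibliographic records to coauthorship edges can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the record layout (one author list per paper) and the placeholder names "Author A", "Author B", "Author C" are assumptions; only the first record mirrors the BibTeX entry above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bibliographic records: each entry is the author list of one paper.
# Only the first record comes from the slide; the others are placeholders.
records = [
    ["Marko A. Rodriguez", "Alberto Pepe"],
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
]

def coauthorship_edges(records):
    """Map each unordered author pair to the number of papers they share."""
    edges = Counter()
    for authors in records:
        # Every pair of distinct authors on a paper gets one coauthor link per paper.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges

edges = coauthorship_edges(records)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- coauthor -- {b} (papers: {weight})")
```

Sorting each author list before pairing keeps every undirected edge under a single key, so repeated collaborations accumulate as an edge weight instead of creating duplicate edges.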
CENS population statistics. Socioacademic communities
Study model.
Alberto -- coauthor -- Marko
Alberto: Affiliation: UCLA; Department: IS; Origin: Italy; Position: PhD Student
Marko: Affiliation: LANL; Department: CS; Origin: USA; Position: PostDoc
Structural communities.
Structural communities are cliquish subgraphs: groups of
vertices that are densely connected to one another but
sparsely connected to the rest of the network.
Girvan, M., & Newman, M. E. J., Community structure in social
and biological networks. Proceedings of the National Academy of
Sciences, 99, 7821, 2002.
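The definition above can be checked on a toy graph by counting the edges that fall inside a candidate group against the edges that leave it. This is only a sketch of the definition, not one of the detection algorithms used in the study; the graph (two triangles joined by one bridge) and the function name are hypothetical.

```python
def internal_external_edges(edges, group):
    """Count edges with both ends inside `group` and edges crossing its boundary."""
    group = set(group)
    internal = sum(1 for u, v in edges if u in group and v in group)
    external = sum(1 for u, v in edges if (u in group) != (v in group))
    return internal, external

# Hypothetical undirected graph: two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

counts = internal_external_edges(edges, {1, 2, 3})
print(counts)  # many internal edges and few external ones suggest a community
```

A group with many internal and few external edges matches the "densely connected inside, sparsely connected outside" description; detection algorithms search for partitions where this holds for every group.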
Community detection methods.
edge betweenness [1]
walktrap (random walks) [2]
spinglass [3]
leading eigenvector [4]
[1] Girvan, M., & Newman, M. E. J., Community structure in social and biological networks, Proceedings of the National Academy of Sciences, 99:7821, 2002.
[2] Pons, P., & Latapy, M., Computing communities in large networks using random walks, Journal of Graph Algorithms and Applications, 10:2, 2006.
[3] Reichardt, J., & Bornholdt, S., Statistical mechanics of community detection, Physical Review E, 74 (016110), 2006.
[4] Newman, M. E. J., Finding community structure in networks using the eigenvectors of matrices, Physical Review E, 74, 2006.
Coauthorship network map. 27 structural communities detected in the
CENS coauthorship network by the leading eigenvector method (LEV).
Coauthorship network statistics.
Typical clustering coefficients:
mathematics: 0.34
physics: 0.56
biology: 0.60
less-cliquish, sparse collaboration patterns
CENS community fragmented in research agenda
Newman, M. E. J., The structure and function of complex
networks, SIAM Review, 45, 167, 2003.
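One common definition of the clustering coefficient is the fraction of connected triples (paths of length two) that close into triangles. The sketch below computes that quantity on a hypothetical toy graph; whether the figures quoted above were produced with exactly this variant is an assumption.

```python
from itertools import combinations

def clustering_coefficient(edges):
    """Global clustering coefficient: closed triples / all connected triples."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    triples = 0  # paths of length two, counted once per center vertex
    closed = 0   # triples whose two end vertices are also linked
    for nbrs in neighbors.values():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in neighbors[a]:
                closed += 1
    return closed / triples if triples else 0.0

# Hypothetical graph: one triangle (1, 2, 3) with a pendant vertex 4 attached to 3.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
c = clustering_coefficient(toy_edges)
print(round(c, 2))
```

Each triangle contributes three closed triples (one per corner), which is why this count equals the usual "three times the number of triangles over the number of connected triples".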
Chi square test.
The chi-square test determines whether two nominal (categorical)
properties are statistically independent.
Alberto -- coauthor -- Marko
Alberto: Community: A; Affiliation: UCLA; Department: IS; Origin: Italy; Position: PhD Student
Marko: Community: B; Affiliation: LANL; Department: CS; Origin: USA; Position: PostDoc
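The test statistic can be sketched from a contingency table that cross-tabulates structural community against one socioacademic property. The counts below are hypothetical and are not the study's data; the function returns only the Pearson statistic and the degrees of freedom, and it assumes every row and column total is nonzero.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in cell (i, j) if the two properties were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts (NOT the study's data):
# rows are structural communities A and B, columns are two affiliations.
table = [[10, 2],
         [3, 9]]
stat, dof = chi_square(table)
print(round(stat, 3), dof)
```

Turning the statistic into a p-value requires the chi-square distribution, for example `scipy.stats.chi2.sf(stat, dof)` if SciPy is available. For one degree of freedom the 0.05 critical value is about 3.841, so a statistic above that value rejects independence at that level.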
Chi square analysis. N.B. A p-value greater than 0.05 is taken to
indicate statistical independence. Community detection algorithms:
leading eigenvector (LEV), walktrap (WT), edge betweenness (EB),
spinglass (SG).
Anecdotal example.
Remarks.
Findings:
Community structure is representative of department and
affiliation.
Academic position and country of origin are independent of the
structural community of the scholar.
Generalization:
Policy recommendations to increase interdisciplinarity
Extension to other coauthorship networks and other socioacademic
(demographic) variables
Useful for predicting or inferring topological/socioacademic
configuration when data are scarce
Metadata reuse.
Metadata can be used to support scholarly collaboration.
Everything is metadata.
[Figure: example metadata network. Nodes: Borgman, Pepe, Article1, Article2, JCDL, Italy, UCLA, CENS, Sensor Networks. Edge labels: writtenBy, member, country, attended, hasLab, cites, topic, researches, contains.]
Introduction to the Semantic Web.
The World Wide Web is used to link documents, where documents
are given universal identifiers/locators called URIs (e.g.
URL).
The structure is machine processable, but the
documents/elements are primarily human processable.
The Semantic Web is used to link data, where data is given
universal identifiers/locators called URIs (e.g. URL).
The structure and the data are both human and machine
processable.
T. Berners-Lee, J. Hendler. Publishing on the Semantic Web. Nature,
410(6832):1023-1024, April 2001.
The Uniform Resource Identifier.
Resource = Anything.
Anything that can be identified.
Some discrete entity.
The Uniform Resource Identifier (URI):
<scheme> : <hierarchical part> [ ? <query> ] [ # <fragment> ]
http://www.lanl.gov
urn:uuid:550e8400-e29b-41d4-a716-446655440000
urn:issn:0892-3310
http://www.lanl.gov#MarkoRodriguez
prefix it to make it easier on the eyes --
lanl:MarkoRodriguez
The Semantic Web
first identify it, then relate it!
W3C/IETF. URIs, URLs, and URNs: Clarifications and recommendations
1.0, September 2001.
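The URI parts and the prefix shorthand can be illustrated with Python's standard `urllib.parse` module. This is a small sketch: the `PREFIXES` table and the `expand` helper are hypothetical names, and the `lanl` binding simply follows the example above.

```python
from urllib.parse import urlsplit

# Prefix table; the "lanl" binding follows the slide's example.
PREFIXES = {"lanl": "http://www.lanl.gov#"}

def expand(prefixed_name):
    """Expand a prefixed name such as 'lanl:MarkoRodriguez' into a full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local

uri = expand("lanl:MarkoRodriguez")
parts = urlsplit(uri)

print(uri)
print(parts.scheme, parts.netloc, parts.fragment)
```

The prefixed form is only a notational convenience: the identifier that gets related to other resources is always the full URI.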
The undirected network.
There is the undirected network of common knowledge.
Sometimes called an undirected single-relational network.
e.g. vertex i and vertex j are related.
The semantics of the edge determine the network type.
e.g. friendship network, collaboration network, etc.
i -- j
Example undirected network (figure). Vertices: Herbert, Marko, Aric, Ed, Zhiwu, Alberto, Jen, Johan, Luda, Stephan, Whenzong.
The directed network.
Then there is the directed network of common knowledge.
Sometimes called a directed single-relational network.
For example, vertex i is related to vertex j, but j is not
related to i.
i -> j
Example directed network (figure). Vertices: Muskrat, Bear, Fish, Fox, Meerkat, Lion, Human, Wolf, Deer, Beetle, Hyena.
The semantic network.
Finally, there is the semantic network.
Sometimes called a directed multi-relational network.
For example, vertex i is related to vertex j by the semantic s,
but j is not related to i by the semantic s.
i -s-> j
Example semantic network (figure). Vertices: SantaFe, Marko, NewMexico, Ryan, California, UnitedStates, LANL, Cells, Atoms, Oregon, Arnold. Edge labels: livesIn, worksWith, cityOf, originallyFrom, stateOf, locatedIn, hasLab, madeOf, researches, southOf, northOf, hasResident, governerOf.
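The three network types differ only in what an edge is: an unordered pair, an ordered pair, or an ordered pair with a label. The sketch below makes that concrete with plain Python sets. The specific edges are assumptions chosen for illustration, since the figures' actual connections are not recoverable from the vertex and label lists.

```python
# Undirected single-relational network: an edge is an unordered pair.
undirected = {frozenset(("Marko", "Alberto"))}

# Directed single-relational network: an edge is an ordered pair (tail, head).
directed = {("Fox", "Fish")}  # hypothetical direction, for illustration only

# Semantic (directed multi-relational) network: an edge is (tail, label, head).
semantic = {
    ("Marko", "livesIn", "SantaFe"),      # hypothetical edges built from
    ("SantaFe", "cityOf", "NewMexico"),   # the figure's vertex and label names
}

def undirected_related(edges, i, j):
    return frozenset((i, j)) in edges

def directed_related(edges, i, j):
    return (i, j) in edges

def semantic_related(edges, i, s, j):
    return (i, s, j) in edges

print(undirected_related(undirected, "Alberto", "Marko"))          # order is irrelevant
print(directed_related(directed, "Fish", "Fox"))                   # reverse edge is absent
print(semantic_related(semantic, "Marko", "livesIn", "SantaFe"))   # label must match
```

Storing the label inside the edge is what lets a single semantic network hold several relation types at once, where the single-relational forms need one network per relation.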
The technologies of the Semantic Web.
Resource Description Framework (RDF): The foundation technology
of the Semantic Web. RDF is a distributed, semantic network data
model. In RDF, URIs and literals (e.g. ints, doubles, strings) are
related to one another in triples.
RDF Schema (RDFS) and the Web Ontology Language (OWL): The
ontology is to the Semantic Web as the schema is to the relational
database.
Anything of rdf:type lanl:Human can lanl:drive anything of
rdf:type lanl:Car .
Triple-store: The triple-store is to semantic networks what
the relational database is to the data table.
RDF and RDFS.
Ontology:
lanl:isEating rdfs:domain lanl:Human
lanl:isEating rdfs:range lanl:Food
Instance:
lanl:marko rdf:type lanl:Human
lanl:cookie rdf:type lanl:Food
lanl:marko lanl:isEating lanl:cookie
RDF is not a syntax. It is a data model. Various syntaxes exist to
encode RDF, including RDF/XML, N-Triples, TriX, N3, etc.
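How an ontology constrains instance data can be sketched with the RDFS domain and range rules: if a property has a declared domain, its subjects are inferred to be of that class, and likewise for range and objects. The triples below reuse the slide's `lanl:` names; representing triples as Python tuples and the `infer_types` helper are assumptions made for illustration, not an RDF library API.

```python
RDF_TYPE, DOMAIN, RANGE = "rdf:type", "rdfs:domain", "rdfs:range"

# Ontology-level triples (schema) and instance-level triples (data).
ontology = {
    ("lanl:isEating", DOMAIN, "lanl:Human"),
    ("lanl:isEating", RANGE, "lanl:Food"),
}
instance = {("lanl:marko", "lanl:isEating", "lanl:cookie")}

def infer_types(ontology, instance):
    """Apply the RDFS domain and range rules once; return the implied rdf:type triples."""
    domains = {(p, c) for p, rel, c in ontology if rel == DOMAIN}
    ranges = {(p, c) for p, rel, c in ontology if rel == RANGE}
    inferred = set()
    for s, p, o in instance:
        inferred |= {(s, RDF_TYPE, c) for q, c in domains if q == p}
        inferred |= {(o, RDF_TYPE, c) for q, c in ranges if q == p}
    return inferred

inferred = infer_types(ontology, instance)
for triple in sorted(inferred):
    print(triple)
```

Note that RDFS uses domain and range to infer types and does not reject data: a statement that marko is eating a cookie entails that marko is a human and the cookie is a food.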
General-purpose modeling.
[Figure: List, Map, and Set data structures modeled as semantic networks. Edge labels: next, item, entry, key, value, el.]
General-purpose computing.
[Figure: a Program and a Virtual Machine modeled as a semantic network. Labels: next, value, test, true, false, PC, item, heap, stack, el.]
Rodriguez, M.A., General-Purpose Computing on a Semantic
Network Substrate, in review, Journal of Web Semantics,
LA-UR-07-2885, April 2007.
A web of data and process.
[Figure: hosts 127.0.0.1, 127.0.0.0, 127.0.0.2, 127.0.0.3.]
The triple-store.
SELECT ?a ?c
WHERE {
  ?a type human .
  ?a wrote ?b .
  ?b type article .
  ?c wrote ?b .
  ?c type human .
  FILTER (?a != ?c)
}
There are two primary ways to distribute information on the
Semantic Web.
1.) publish a serialized RDF document on a web server.
2.) expose a public interface to an RDF triple-store.
The triple store is to semantic networks what the relational
database is to data tables.
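What a triple-store does with such a query can be sketched as graph pattern matching: each pattern triple is matched against the stored triples while variable bindings are carried along. The matcher below is a deliberately naive illustration (no indexes, no query planning); the data, the `match` function, and the shorthand terms are assumptions modeled on the coauthor query above.

```python
def match(pattern, triples, binding=None):
    """Yield variable bindings (dicts) under which every pattern triple is stored."""
    binding = binding or {}
    if not pattern:
        yield binding
        return
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable binds on first use and must agree on later uses.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)

# Hypothetical store contents.
triples = {
    ("marko", "type", "human"), ("alberto", "type", "human"),
    ("article1", "type", "article"),
    ("marko", "wrote", "article1"), ("alberto", "wrote", "article1"),
}
pattern = [
    ("?a", "type", "human"), ("?a", "wrote", "?b"), ("?b", "type", "article"),
    ("?c", "wrote", "?b"), ("?c", "type", "human"),
]
# The final comparison plays the role of the query's inequality filter.
pairs = {(m["?a"], m["?c"]) for m in match(pattern, triples) if m["?a"] != m["?c"]}
print(sorted(pairs))
```

A production triple-store answers the same question with indexes over subject, predicate, and object instead of scanning every triple for every pattern.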
Storing and querying triples in a triple store.
SPARQLUpdate query language.
like SQL, but for triple-stores.
INSERT { ?a coauthor ?c }
WHERE {
  ?a type human .
  ?a wrote ?b .
  ?b type article .
  ?c wrote ?b .
  ?c type human .
  FILTER (?a != ?c)
}
DELETE { ?s ?p ?o }
WHERE { ?s ?p ?o }
Triple-store vs. relational database.
Triple-store (SPARQL interface):
SELECT ?x1 ?x2
WHERE {
  ?x1 lanl:hasFriend ?x2 .
  ?x2 lanl:worksFor ?x3 .
  ?x3 lanl:collaboratesWith ?x4 .
  ?x4 lanl:hasEmployee ?x1 .
}
Relational database (SQL interface):
SELECT friendTable.personId1, friendTable.personId2
FROM personTable, authorTable, articleTable, friendTable,
     hasEmployeeTable, organizationTable, worksForTable,
     collaboratesWithTable
WHERE personTable.id = authorTable.personId
  AND personTable.id = friendTable.personId1
  AND friendTable.personId2 = worksForTable.personId
  AND worksForTable.orgId = collaboratesWithTable.orgId2
  AND collaboratesWithTable.orgId2 = personTable.id
Give me all pairs of people who are friends but who work for
collaborating companies. Now!
Triple-store and graph-analysis.
Nearly all network analysis algorithms can be decomposed into a
graph traversal problem.
Spreading activation and energy diffusion.
PageRank and the random walker.
Geodesics and breadth-first search.
Relational databases are not optimized for graph traversal.
Indexes are not appropriate for graph traversal.
Every traversal is a table join.
The triple-store is better suited to graph analysis.
Although the triple-store is optimized for graph pattern matching,
it handles graph traversal better than the relational
database.
Hybrid statement/linked-list databases are good at both pattern
matching and traversal.
Graph analysis can be used for ranking and recommendation.
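As one example of an analysis that reduces to traversal, the sketch below finds the geodesic (shortest path) length between two vertices with a breadth-first search over the out-edges of a set of triples. Edge labels are ignored here, which is a simplifying assumption, and the sample triples are hypothetical.

```python
from collections import deque

def geodesic_length(triples, source, target):
    """Shortest directed path length from source to target, ignoring edge labels."""
    out = {}
    for s, _p, o in triples:
        out.setdefault(s, []).append(o)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        vertex, dist = queue.popleft()
        if vertex == target:
            return dist
        # Each hop follows an adjacency lookup instead of a table join.
        for nxt in out.get(vertex, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # target is unreachable from source

# Hypothetical citation triples.
triples = {
    ("article1", "cites", "article2"),
    ("article2", "cites", "article3"),
    ("article1", "writtenBy", "pepe"),
}
print(geodesic_length(triples, "article1", "article3"))
```

The adjacency dictionary stands in for the store's subject index; in a relational layout each hop of the same search would be expressed as another self-join on the edge table.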
Rodriguez, M.A., "A Multi-Relational Network to Support the
Scholarly Communication Process", International Journal of Public
Information Systems, volume 2007, issue 1, pages 13-29, ISSN:
1653-4360, LA-UR-06-2416, March 2007.
Rodriguez, M.A., Bollen, J., Van de Sompel, H., A Practical
Ontology for the Large-Scale Modeling of Scholarly Artifacts and
their Usage, 2007 ACM/IEEE Joint Conference on Digital Libraries,
pages 278-287, Vancouver, Canada, ACM/IEEE Computing,
doi:10.1145/1255175.1255229, LA-UR-07-0665, June 2007.