CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Advanced Data Management
Medha Atre
Office: [email protected]
July 28, 2016
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Goals
Learn tools and techniques of the “big data” management.
Big data encompasses relational data, graph data, textdata, and any other data (media etc).
“Graph shaped” data has become predominant as itcaptures the ideology of internet and social networks – anetwork of information, persons, objects, and things ingeneral.
E.g., DBPedia1 is a “graphical” representation of theWikipedia network. UniProt2 is a graphical representationproteins and genes network, MusicBrainz3 is a graphicalrepresentation of music database.
1http://wiki.dbpedia.org/
2http://www.uniprot.org/
3https://musicbrainz.org/
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Goals
Is this graph data of any significance?
DBPedia (Wikipedia network in RDF format, pivotal in thesuccess of IBM Watson supercomputer in the Jeopardychallenge, along with the Yago RDF network)4
Adaption of RDFa by Google, and RDF by Facebook.
Understand the challenges of management of this datafrom the “scalability” point of view – how would youmanage such a huge data? Would you use
one commodity computer, ora cluster of commodity computers, ora supercomputer?
4http://en.wikipedia.org/wiki/Watson (computer)#Data
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
A glimpse of the Linked Data network
Linked Datasets as of August 2014
Uniprot
AlexandriaDigital Library
Gazetteer
lobidOrganizations
chem2bio2rdf
MultimediaLab University
Ghent
Open DataEcuador
GeoEcuador
Serendipity
UTPLLOD
GovAgriBusDenmark
DBpedialive
URIBurner
Identifiers
EionetRDF
lobidResources
WiktionaryDBpedia
Viaf
Umthes
RKBExplorer
Courseware
Opencyc
Olia
Gem.Thesaurus
AudiovisueleArchieven
DiseasomeFU-Berlin
Eurovocin
SKOS
DNBGND
Cornetto
Bio2RDFPubmed
Bio2RDFNDC
Bio2RDFMesh
IDS
OntosNewsPortal
AEMET
ineverycrea
LinkedUser
Feedback
MuseosEspaniaGNOSS
Europeana
NomenclatorAsturias
Red UnoInternacional
GNOSS
GeoWordnet
Bio2RDFHGNC
CticPublic
Dataset
Bio2RDFHomologene
Bio2RDFAffymetrix
MuninnWorld War I
CKAN
GovernmentWeb Integration
forLinkedData
Universidadde CuencaLinkeddata
Freebase
Linklion
Ariadne
OrganicEdunet
GeneExpressionAtlas RDF
ChemblRDF
BiosamplesRDF
IdentifiersOrg
BiomodelsRDF
ReactomeRDF
Disgenet
SemanticQuran
IATI asLinked Data
DutchShips and
Sailors
Verrijktkoninkrijk
IServe
Arago-dbpedia
LinkedTCGA
ABS270a.info
RDFLicense
EnvironmentalApplications
ReferenceThesaurus
Thist
JudaicaLink
BPR
OCD
ShoahVictimsNames
Reload
Data forTourists in
Castilla y Leon
2001SpanishCensusto RDF
RKBExplorer
Webscience
RKBExplorerEprintsHarvest
NVS
EU AgenciesBodies
EPO
LinkedNUTS
RKBExplorer
Epsrc
OpenMobile
Network
RKBExplorerLisbon
RKBExplorer
Italy
CE4R
EnvironmentAgency
Bathing WaterQuality
RKBExplorerKaunas
OpenData
Thesaurus
RKBExplorerWordnet
RKBExplorer
ECS
AustrianSki
Racers
Social-semweb
Thesaurus
DataOpenAc Uk
RKBExplorer
IEEE
RKBExplorer
LAAS
RKBExplorer
Wiki
RKBExplorer
JISC
RKBExplorerEprints
RKBExplorer
Pisa
RKBExplorer
Darmstadt
RKBExplorerunlocode
RKBExplorer
Newcastle
RKBExplorer
OS
RKBExplorer
Curriculum
RKBExplorer
Resex
RKBExplorer
Roma
RKBExplorerEurecom
RKBExplorer
IBM
RKBExplorer
NSF
RKBExplorer
kisti
RKBExplorer
DBLP
RKBExplorer
ACM
RKBExplorerCiteseer
RKBExplorer
Southampton
RKBExplorerDeepblue
RKBExplorerDeploy
RKBExplorer
Risks
RKBExplorer
ERA
RKBExplorer
OAI
RKBExplorer
FT
RKBExplorer
Ulm
RKBExplorer
Irit
RKBExplorerRAE2001
RKBExplorer
Dotac
RKBExplorerBudapest
SwedishOpen Cultural
Heritage
Radatana
CourtsThesaurus
GermanLabor LawThesaurus
GovUKTransport
Data
GovUKEducation
Data
EnaktingMortality
EnaktingEnergy
EnaktingCrime
EnaktingPopulation
EnaktingCO2Emission
EnaktingNHS
RKBExplorer
Crime
RKBExplorercordis
Govtrack
GeologicalSurvey of
AustriaThesaurus
GeoLinkedData
GesisThesoz
Bio2RDFPharmgkb
Bio2RDFSabiorkBio2RDF
Ncbigene
Bio2RDFIrefindex
Bio2RDFIproclass
Bio2RDFGOA
Bio2RDFDrugbank
Bio2RDFCTD
Bio2RDFBiomodels
Bio2RDFDBSNP
Bio2RDFClinicaltrials
Bio2RDFLSR
Bio2RDFOrphanet
Bio2RDFWormbase
BIS270a.info
DM2E
DBpediaPT
DBpediaES
DBpediaCS
DBnary
AlpinoRDF
YAGO
PdevLemon
Lemonuby
Isocat
Ietflang
Core
KUPKB
GettyAAT
SemanticWeb
Journal
OpenlinkSWDataspaces
MyOpenlinkDataspaces
Jugem
Typepad
AspireHarperAdams
NBNResolving
Worldcat
Bio2RDF
Bio2RDFECO
Taxon-conceptAssets
Indymedia
GovUKSocietal
WellbeingDeprivation imd
EmploymentRank La 2010
GNULicenses
GreekWordnet
DBpedia
CIPFA
Yso.fiAllars
Glottolog
StatusNetBonifaz
StatusNetshnoulle
Revyu
StatusNetKathryl
ChargingStations
AspireUCL
Tekord
Didactalia
ArtenueVosmedios
GNOSS
LinkedCrunchbase
ESDStandards
VIVOUniversityof Florida
Bio2RDFSGD
Resources
ProductOntology
DatosBne.es
StatusNetMrblog
Bio2RDFDataset
EUNIS
GovUKHousingMarket
LCSH
GovUKTransparencyImpact ind.Households
In temp.Accom.
UniprotKB
StatusNetTimttmy
SemanticWeb
Grundlagen
GovUKInput ind.
Local AuthorityFunding FromGovernment
Grant
StatusNetFcestrada
JITA
StatusNetSomsants
StatusNetIlikefreedom
DrugbankFU-Berlin
Semanlink
StatusNetDtdns
StatusNetStatus.net
DCSSheffield
AtheliaRFID
StatusNetTekk
ListaEncabezaMientosMateria
StatusNetFragdev
Morelab
DBTuneJohn PeelSessions
RDFizelast.fm
OpenData
Euskadi
GovUKTransparency
Input ind.Local auth.Funding f.
Gvmnt. Grant
MSC
Lexinfo
StatusNetEquestriarp
Asn.us
GovUKSocietal
WellbeingDeprivation ImdHealth Rank la
2010
StatusNetMacno
OceandrillingBorehole
AspireQmul
GovUKImpact
IndicatorsPlanning
ApplicationsGranted
Loius
Datahub.io
StatusNetMaymay
Prospectsand
TrendsGNOSS
GovUKTransparency
Impact IndicatorsEnergy Efficiency
new Builds
DBpediaEU
Bio2RDFTaxon
StatusNetTschlotfeldt
JamendoDBTune
AspireNTU
GovUKSocietal
WellbeingDeprivation Imd
Health Score2010
LoticoGNOSS
UniprotMetadata
LinkedEurostat
AspireSussex
Lexvo
LinkedGeoData
StatusNetSpip
SORS
GovUKHomeless-
nessAccept. per
1000
TWCIEEEvis
AspireBrunel
PlanetDataProject
Wiki
StatusNetFreelish
Statisticsdata.gov.uk
StatusNetMulestable
Enipedia
UKLegislation
API
LinkedMDB
StatusNetQth
SiderFU-Berlin
DBpediaDE
GovUKHouseholds
Social lettingsGeneral Needs
Lettings PrpNumber
Bedrooms
AgrovocSkos
MyExperiment
ProyectoApadrina
GovUKImd CrimeRank 2010
SISVU
GovUKSocietal
WellbeingDeprivation ImdHousing Rank la
2010
StatusNetUni
Siegen
OpendataScotland Simd
EducationRank
StatusNetKaimi
GovUKHouseholds
Accommodatedper 1000
StatusNetPlanetlibre
DBpediaEL
SztakiLOD
DBpediaLite
DrugInteractionKnowledge
BaseStatusNet
Qdnx
AmsterdamMuseum
AS EDN LOD
RDFOhloh
DBTuneartistslast.fm
AspireUclan
HellenicFire Brigade
Bibsonomy
NottinghamTrent
ResourceLists
OpendataScotland SimdIncome Rank
RandomnessGuide
London
OpendataScotland
Simd HealthRank
SouthamptonECS Eprints
FRB270a.info
StatusNetSebseb01
StatusNetBka
ESDToolkit
HellenicPolice
StatusNetCed117
OpenEnergy
Info Wiki
StatusNetLydiastench
OpenDataRISP
Taxon-concept
Occurences
Bio2RDFSGD
UIS270a.info
NYTimesLinked Open
Data
AspireKeele
GovUKHouseholdsProjectionsPopulation
W3C
OpendataScotland
Simd HousingRank
ZDB
StatusNet1w6
StatusNetAlexandre
Franke
DeweyDecimal
Classification
StatusNetStatus
StatusNetdoomicile
CurrencyDesignators
StatusNetHiico
LinkedEdgar
GovUKHouseholds
2008
DOI
StatusNetPandaid
BrazilianPoliticians
NHSJargon
Theses.fr
LinkedLifeData
Semantic WebDogFood
UMBEL
OpenlyLocal
StatusNetSsweeny
LinkedFood
InteractiveMaps
GNOSS
OECD270a.info
Sudoc.fr
GreenCompetitive-
nessGNOSS
StatusNetIntegralblue
WOLD
LinkedStockIndex
Apache
KDATA
LinkedOpenPiracy
GovUKSocietal
WellbeingDeprv. ImdEmpl. Rank
La 2010
BBCMusic
StatusNetQuitter
StatusNetScoffoni
OpenElection
DataProject
Referencedata.gov.uk
StatusNetJonkman
ProjectGutenbergFU-BerlinDBTropes
StatusNetSpraci
Libris
ECB270a.info
StatusNetThelovebug
Icane
GreekAdministrative
Geography
Bio2RDFOMIM
StatusNetOrangeseeds
NationalDiet Library
WEB NDLAuthorities
UniprotTaxonomy
DBpediaNL
L3SDBLP
FAOGeopolitical
Ontology
GovUKImpact
IndicatorsHousing Starts
DeutscheBiographie
StatusNetldnfai
StatusNetKeuser
StatusNetRusswurm
GovUK SocietalWellbeing
Deprivation ImdCrime Rank 2010
GovUKImd Income
Rank La2010
StatusNetDatenfahrt
StatusNetImirhil
Southamptonac.uk
LOD2Project
Wiki
DBpediaKO
DailymedFU-Berlin
WALS
DBpediaIT
StatusNetRecit
Livejournal
StatusNetExdc
Elviajero
Aves3D
OpenCalais
ZaragozaTurruta
AspireManchester
Wordnet(VU)
GovUKTransparency
Impact IndicatorsNeighbourhood
Plans
StatusNetDavid
Haberthuer
B3Kat
PubBielefeld
Prefix.cc
NALT
Vulnera-pedia
GovUKImpact
IndicatorsAffordable
Housing Starts
GovUKWellbeing lsoa
HappyYesterday
Mean
FlickrWrappr
Yso.fiYSA
OpenLibrary
AspirePlymouth
StatusNetJohndrink
Water
StatusNetGomertronic
Tags2conDelicious
StatusNettl1n
StatusNetProgval
Testee
WorldFactbookFU-Berlin
DBpediaJA
StatusNetCooleysekula
ProductDB
IMF270a.info
StatusNetPostblue
StatusNetSkilledtests
NextwebGNOSS
EurostatFU-Berlin
GovUKHouseholds
Social LettingsGeneral Needs
Lettings PrpHousehold
Composition
StatusNetFcac
DWSGroup
OpendataScotland
GraphSimd Rank
DNB
CleanEnergyData
Reegle
OpendataScotland SimdEmployment
Rank
ChroniclingAmerica
GovUKSocietal
WellbeingDeprivation
Imd Rank 2010
StatusNetBelfalas
AspireMMU
StatusNetLegadolibre
BlukBNB
StatusNetLebsanft
GADMGeovocab
GovUKImd Score
2010
SemanticXBRL
UKPostcodes
GeoNames
EEARodAspire
Roehampton
BFS270a.info
CameraDeputatiLinkedData
Bio2RDFGeneID
GovUKTransparency
Impact IndicatorsPlanning
ApplicationsGranted
StatusNetSweetie
Belle
O'Reilly
GNI
CityLichfield
GovUKImd
Rank 2010
BibleOntology
Idref.fr
StatusNetAtari
Frosch
Dev8d
NobelPrizes
StatusNetSoucy
ArchiveshubLinkedData
LinkedRailway
DataProject
FAO270a.info
GovUKWellbeing
WorthwhileMean
Bibbase
Semantic-web.org
BritishMuseum
Collection
GovUKDev LocalAuthorityServices
CodeHaus
Lingvoj
OrdnanceSurveyLinkedData
Wordpress
EurostatRDF
StatusNetKenzoid
GEMET
GovUKSocietal
WellbeingDeprv. imdScore '10
MisMuseosGNOSS
GovUKHouseholdsProjections
totalHouseolds
StatusNet20100
EEA
CiardRing
OpendataScotland Graph
EducationPupils by
School andDatazone
VIVOIndiana
University
Pokepedia
Transparency270a.info
StatusNetGlou
GovUKHomelessness
HouseholdsAccommodated
TemporaryHousing Types
STWThesaurus
forEconomics
DebianPackageTrackingSystem
DBTuneMagnatune
NUTSGeo-vocab
GovUKSocietal
WellbeingDeprivation ImdIncome Rank La
2010
BBCWildlifeFinder
StatusNetMystatus
MiguiadEviajesGNOSS
AcornSat
DataBnf.fr
GovUKimd env.
rank 2010
StatusNetOpensimchat
OpenFoodFacts
GovUKSocietal
WellbeingDeprivation Imd
Education Rank La2010
LODACBDLS
FOAF-Profiles
StatusNetSamnoble
GovUKTransparency
Impact IndicatorsAffordable
Housing Starts
StatusNetCoreyavisEnel
Shops
DBpediaFR
StatusNetRainbowdash
StatusNetMamalibre
PrincetonLibrary
Findingaids
WWWFoundation
Bio2RDFOMIM
Resources
OpendataScotland Simd
GeographicAccess Rank
Gutenberg
StatusNetOtbm
ODCLSOA
StatusNetOurcoffs
Colinda
WebNmasunoTraveler
StatusNetHackerposse
LOV
GarnicaPlywood
GovUKwellb. happy
yesterdaystd. dev.
StatusNetLudost
BBCProgram-
mes
GovUKSocietal
WellbeingDeprivation Imd
EnvironmentRank 2010
Bio2RDFTaxonomy
Worldbank270a.info
OSM
DBTuneMusic-brainz
LinkedMarkMail
StatusNetDeuxpi
GovUKTransparency
ImpactIndicators
Housing Starts
BizkaiSense
GovUKimpact
indicators energyefficiency new
builds
StatusNetMorphtown
GovUKTransparency
Input indicatorsLocal authorities
Working w. tr.Families
ISO 639Oasis
AspirePortsmouth
ZaragozaDatos
AbiertosOpendataScotland
SimdCrime Rank
Berlios
StatusNetpiana
GovUKNet Add.Dwellings
Bootsnall
StatusNetchromic
Geospecies
linkedct
Wordnet(W3C)
StatusNetthornton2
StatusNetmkuttner
StatusNetlinuxwrangling
EurostatLinkedData
GovUKsocietal
wellbeingdeprv. imd
rank '07
GovUKsocietal
wellbeingdeprv. imdrank la '10
LinkedOpen Data
ofEcology
StatusNetchickenkiller
StatusNetgegeweb
DeustoTech
StatusNetschiessle
GovUKtransparency
impactindicatorstr. families
Taxonconcept
GovUKservice
expenditure
GovUKsocietal
wellbeingdeprivation imd
employmentscore 2010
Linking Open Data cloud diagram 2014, by M. Schmachtenberg, C. Bizer, A. Jentzsch, R. Cyganiak http://lod-cloud.net/
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Course Structure
50% – Course project, supposed to be innovative, chosenfrom the cutting-edge research topic. A list of researchpapers of interesting topics will be referred to by theinstructor. Students may choose an outside topic of theirinterest as long as it is relevant to the broader theme ofthe course. Breakup of the grades:
5% – A written proposal with a literature review.30% – Implementation and execution on the Linuxenvironment.15% – Demonstrable evaluation results.
All the course projects need to be well defined and needinstructor’s approval by the middle of the semester.
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Course Structure
Course projects can be done in the groups of 1, 2, or 3students together.
Groups of size more than 3 need special approval. Such agroup needs to demonstrate a significant component of theproject by the middle of the semester to get the approval.
Recommended size of a group is 2 or 3.
15% – Bi-weekly (once in 2 weeks) assignments.
15% – Mid-semester State-of-the-art paper presentation.
20% – Final detailed course project report (expected to bewritten as professionally as possible like a research paper).
Office hours: Mon-Thur 12-1pm (else by prior appointment).Course webpage: www.cse.iitk.ac.in/users/atrem/courses/cs698f2016fall/
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Academic Integrity
We value academic integrity and honesty above good grades.
Every student is required to go through the CSE departmentacademic integrity (anti-cheating) policy given onhttp://www.cse.iitk.ac.in/pages/AntiCheatingPolicy.html
After the add/drop deadline, each registered student will berequired to sign an acknowledgement of having read this policy.
There will be 0 tolerance shown towards plagiarism, copying5,and any other dishonest means.
5Using open source codes or components of them with properacknowledgement is allowed.
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Relational Data
Figure taken from https://docs.oracle.com/.
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
SQL Queries
Find description and brand of all the products in all the storesin the New York city.
SELECT PRODUCT.Description, PRODUCT.BrandWHERE STORE.City=“New York” ANDSTORE.Store key=SALES FACT.Store key ANDSALES FACT.Product key=PRODUCT.Product key
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Query Planning and Optimization
Cost based estimation for selecting the best query plan.
Costs are in terms of number of rows (disk blocks) youhave to access to perform a join operation, whether thetable has any indexes on it etc.
In a nested loop join, suppose STORE has S pages and tStuples per page, and SALES FACT has F pages and tF tuplesper page, then with STORE as the outer relation andSALES FACT as the inner relation, the cost of the joinbetween the two will be: S + tS ∗ S ∗ F .
In a block nested loop join with B buffer pages, the cost of thisjoin will be: S + F ∗ d S
B−2e where 2 buffer pages are reservedfor the inner relation SALES FACT.
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Query Optimization – well known techniques
Create indexes on frequently joined columns.
Hash-joins.
Pipelining of data through the operators to avoid creationof temporary intermediate relation – popular method inthe contemporary database systems.
Creation of join-indices for frequently used joins.
Pre-join pruning of data based on the schema of the dataor other techniques such as semi-joins.
Column-wise storage instead of traditional row-wisestorage.
Data compression – column compression, compressedbit-vectors, delta encoding techniques.
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Challenges
When the difference between the size of the disk residentdata and available main memory becomes larger.
When the data cannot be stored on the single disk on asingle computer.
When cost-estimation methods of query planning arerendered ineffective due to heavy skew in the data –common in the “graph shaped data”.
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Challenges
E.g., A graph with 1 billion edges (109) cannot be fullyloaded in memory even on a machine with 32GB mainmemory. Its disk resident size in the uncompressed form isabout 400GB, e.g., UniProt network.
Cost of a PC with 8GB RAM, 1TB disk Rs. 30,000.
Cost of a PC with 32GB RAM 1TB disk is Rs. 1,75,280.
If your data processing requires cumulative X amount ofmemory and Y amount of disk space:
(cost(PC (X ,Y ))× 1)� (cost(PC (X
n,Y
n))× n)
Where n is the number of commodity compute nodes. That iswhat Google is doing through their server farms!
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Challenges
When the data gets distributed on a farm of servers,instead of a single server, the cost computation changes.
Now network communication cost has to be included whileplanning a query.
Let us go back to our example of the join between STORE andSALES FACT tables. Assuming none of the join data isavailable locally, and STORAGE table has to be communicatedacross the entire cluster of 5 compute nodes, an approximatecost computation of the join will be:
5 ∗ (S
5+ 4 ∗ 2 ∗ Comm(
S
5) + tS ∗ S ∗
F
5)
We can do better than this... and in this course we will seehow!
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Graph data of TV sitcoms
:Jerry
:Larry :Julia
:Veep:CurbYourEnthu :Seinfeld :NewAdvOldChristine
:LosAngeles :D.C. :Jersey:NewYorkCity
:hasFriend :hasFriend
:actedIn
:actedIn:actedIn
:actedIn
:location :location :location :location
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Graph data and queries
Data
:Jerry :hasFriend :Larry
:Jerry :hasFriend :Julia
:Larry :actedIn :CurbYourEnthu
:Julia :actedIn :Seinfeld
:Julia :actedIn :Veep
:Julia :actedIn :CurbYourEnthu
:Julia :actedIn :NewAdvOldChristine
:Seinfeld :location :NewYorkCity
:Veep :location :D.C.
:CurbYourEnthu :location :LosAngeles
:NewAdvOldChristine :location :Jersey
Graphical Representation
:Jerry
:Larry :Julia
:Veep:CurbYourEnthu :Seinfeld :NewAdvOldChristine
:LosAngeles :D.C. :Jersey:NewYorkCity
:hasFriend :hasFriend
:actedIn
:actedIn:actedIn
:actedIn
:location :location :location :location
SPARQLSELECT ?friend ?sitcom WHERE {
:Jerry :hasFriend ?friend .
?friend :actedIn ?sitcom .
?sitcom :location :NewYorkCity .
}
Eqv. SQL query
SELECT t1.o, t2.o from rdf as t1, rdf as
t2, rdf as t3 WHERE t1.s=“:Jerry” and
t1.p=“:hasFriend” and t2.p=“:actedIn”
and t3.p=“:location” and
t3.o=“:NewYorkCity” and t1.o=t2.s and
t2.o=t3.s
:Jerry
?sitcom
:NewYorkCity
:hasFriend
:actedIn
:location
?friend
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Query Evaluation Results
Data
:Jerry :hasFriend :Larry
:Jerry :hasFriend :Julia
:Larry :actedIn :CurbYourEnthu
:Julia :actedIn :Seinfeld
:Julia :actedIn :Veep
:Julia :actedIn :CurbYourEnthu
:Julia :actedIn :NewAdvOldChristine
:Seinfeld :location :NewYorkCity
:Veep :location :D.C.
:CurbYourEnthu :location :LosAngeles
:NewAdvOldChristine :location :Jersey
:Jerry
:Larry :Julia
:Veep:CurbYourEnthu :Seinfeld :NewAdvOldChristine
:LosAngeles :D.C. :Jersey:NewYorkCity
:hasFriend :hasFriend
:actedIn
:actedIn:actedIn
:actedIn
:location :location :location :location
Query
SELECT ?friend ?sitcom WHERE {:Jerry :hasFriend ?friend .
?friend :actedIn ?sitcom .
?sitcom :location :NewYorkCity .
}
:Jerry
?sitcom
:NewYorkCity
:hasFriend
:actedIn
:location
?friend
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Graphs – Types of problems
Pattern matching, a.k.a., joins.
Optional pattern matching, where part of the pattern ismatch is optional, a.k.a., left-outer-joins.
Path pattern queries, hard problems –
Given a regular expression, find all pairs of nodes that haveat least one path matching that expression – recursive joinqueries.Given a regular expression, and a pair of nodes, find allpaths that satisfy that expression.
Reachability queries – static and dynamic graphs.
Keyword search over graphs.
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Graphs – Challenges
Lacks schema, hence standard techniques of normalizationas in relational DBs not applicable.
Has a heavy skew in the data – a few nodes with very highdegree, and a large number of nodes with very less degree.
Lots of strongly connected components and hence cycles.
Distribution strategies need to take care of graphstructure, to avoid large cuts – methods like METIS6 helpin efficient graph partitioning.
6http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Solution space – general purpose
Apache Hadoop – Map Reduce framework.
Apache SPARK – in-memory Hadoop like framework.
H20 – Big data analysis.
Apache Mahout – for substrate independent big scalemachine learning tasks, typically run on Hadoop, SPARK,H20, Flink.
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Solution space – relational data
Single compute node:
Row-wise stores – MySQL, PostgreSQL, Many othercommercial stores like IBM DB2.Column-wise stores – MonetDB, Virtuoso.
Cluster computing:
Cloudera, Cassandra, BigTable, HBase, HyperTable(inspired by BigTable).
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Solution space – graph data
Single compute node:
BitMat7 [Atre2008, Atre2010, Atre2015, Atre2016] – forpattern matching, a.k.a. joins.RDF-3X [Neumann2010]Neo4j, HypergraphDB.
Cluster computing:
Pregel by Google (Apache Giraph as opensource version)Trinity by Microsoft, GraphX library (in Apache SPARK)H-RDF-3X [Huang2011], SHARD [Rohloff2011], TriAD[Gurajada2014].
7http://bitmatrdf.sourceforge.net
CS698F
M. Atre
Introduction
CourseStructure
AcademicIntegrity
RelationalData
QueryOptimization
Challenges
Graph Data
Challenges -Graphs
Solutions
Course Flow
Flow of the course
Go over graph data processing systems on a single nodecompute environment, to understand the problems andsolutions in detail.
Through that understand the scalability problems.
Gradually move towards distributed systems, andunderstand the solutions for scalability.
Through the above exercises, identify problems in thespace of both – scalability of the existing solutions andproblems that lack good solutions even on the single nodecompute environment.
Introduction to open and/or advanced problems overgraph and relational data processing for further research.