Building a Faceted Browser in CouchDB
Using Views on Views and Erlang Metaprogramming
Claus Zinn
Overview
Research Infrastructure
Faceted Search
Implementation
CouchDB
Map-Reduce
Processing Stages
Views
Views on Views
Evaluation
Future Work
Related Work and Conclusion
Building a Faceted Browser in CouchDB Using Views on Views and Erlang Metaprogramming
WFLP-2011
Odense, July 19 2011
Claus Zinn
The NaLiDa Project: Nachhaltigkeit Linguistischer Daten (Sustainability of Linguistic Data)
http://www.sfs.uni-tuebingen.de/nalida/
Overview
• Research infrastructure (in Linguistics)
• Faceted Search
• Implementation
  • CouchDB
  • Map-Reduce
  • Processing Stages
  • Views
  • Views on Views
• Evaluation
• Future Work
• Related Work and Conclusion
Research Infrastructure
State of affairs in the Humanities (and elsewhere)
• no systematic management of the underlying research data
• increasing pressure from funding agencies to document and make public research data
⇒ eScience infrastructure needed to
  ⇒ support reproduction of results over identical data sets
  ⇒ increase scientific quality and fight fraud in science
  ⇒ help avoid unintended duplication of research work
NaLiDa Project
• contributes to infrastructure for language resources (corpora, lexica, ...) and software tools (part-of-speech taggers, parsers, ...)
• supports the scientific community with infrastructure building, metadata management and storage
• assists institutions in systematically describing and exposing their research with metadata in terms of XML-based documents
• increases access to and visibility of resources
Data Aggregation and Exposure
[Figure: data providers A, B and C expose XML metadata; OAI-PMH harvesting collects it into a document storage, on which faceted search is built. At regular intervals, new providers may join.]
Faceted Search
Metadata Descriptions in Linguistics
• can be very detailed, with large variety in the usage of metadata field descriptors and their structural organisation
• most of the information is of little use for most users
• some information pieces matter for most users
Increasing Popularity of Faceted Browsing
• well-suited for naive users to explore large data sets with a small but informative set of facets
• customers can identify “products” along many dimensions
• facets, their value ranges and the number of corresponding items show the structure and content of the search space
• many users learn the main criteria for navigation
Faceted Search
Facet Selection
• governed by search for common denominator across collections
• will yield a rather small set of (semantically similar) metadata fields
• main facets: organisation, language, resource type, modality
• conditional facets, e.g., lifecycle status and tool type if the resource type is “tool”
Facetification
• Facets: $F_1, \dots, F_n$
• with value ranges $\{f_{11}, \dots, f_{1n}\}, \dots, \{f_{n1}, \dots, f_{nm}\}$
• document must be indexed by at least one facet-value pair
• document can be described by more than one value $f_{ij}$ for $F_i$
• e.g., metadata for a multimodal corpus with $F_i$ = “modality” and values $f_{ij}$ “gesture”, “sign language” and “spoken language”
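The facetification rules can be sketched as an inverted index from facet to value to document ids. This is a minimal illustration in modern JavaScript (run under Node.js), not the author's code; the sample documents, ids and field names are hypothetical.

```javascript
// Hypothetical sample documents; ids and field names are illustrative only.
const docs = [
  { id: "d1", modality: ["gesture", "sign language", "spoken language"], language: ["German"] },
  { id: "d2", modality: ["spoken language"], language: ["German", "Dutch"] },
  { id: "d3", modality: ["sign language"], language: ["Dutch"] },
];

// Build facet -> value -> list of document ids.
// A document is listed under every value it carries for a facet.
function buildFacetIndex(documents, facets) {
  const index = {};
  for (const facet of facets) {
    index[facet] = {};
    for (const doc of documents) {
      for (const value of doc[facet] || []) {
        if (!index[facet][value]) index[facet][value] = [];
        index[facet][value].push(doc.id);
      }
    }
  }
  return index;
}

const facetIndex = buildFacetIndex(docs, ["modality", "language"]);
console.log(facetIndex.modality["sign language"]); // [ 'd1', 'd3' ]
```

Document d1 appears under three values of “modality”, which is the multimodal-corpus case described above.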
Faceted Search Computations
[Figure: ring diagram of the facet “Languages”, one ring segment per value (German, English, French, Dutch Sign Language, British Sign Language, ...). The labels include uncurated variants such as “Bulgarian”/“Bulgariian”, “Portuguese”/“Portugueze”, “Tibetan”/“Tibetisch”, “dk” and “jp”.]
• Once facet-value pair $f_{ik}$ is selected, the corresponding document set of $f_{ik}$ must be intersected with each of the subsets of every other facet $F_j$ with $1 \le j \le n$, $j \ne i$:
  • document set of ring segment $f_{ik}$ must be intersected with the document sets of all segments of all rings other than $F_i$
• When users select facet $F_i$ with value $f_{ik}$ and facet $F_j$ with value $f_{jl}$
  • first build the intersection of the two corresponding document collections
  • then, intersect the (non-empty) result with all ring segments of all rings other than $F_i$ and $F_j$
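The intersection steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the talk: the index contents are invented, and plain arrays stand in for document sets.

```javascript
// Hypothetical index: facet -> value -> array of document ids (illustrative data).
const index = {
  country:  { Germany: ["d1", "d2", "d4"], Netherlands: ["d3"] },
  language: { German: ["d1", "d2"], Dutch: ["d2", "d3"], English: ["d5"] },
  modality: { speech: ["d1", "d3", "d4"], gesture: ["d2"] },
};

// Intersection of two id arrays.
function intersect(a, b) {
  const inB = new Set(b);
  return a.filter((id) => inB.has(id));
}

// selections: array of [facet, value] pairs chosen by the user.
// Returns the matching documents plus, for every value of every
// unselected facet, the number of matching documents (empty ones dropped).
function facetCounts(idx, selections) {
  let docs = null;
  for (const [facet, value] of selections) {
    const set = (idx[facet] && idx[facet][value]) || [];
    docs = docs === null ? set.slice() : intersect(docs, set);
  }
  const selected = new Set(selections.map(([facet]) => facet));
  const counts = {};
  for (const facet of Object.keys(idx)) {
    if (selected.has(facet)) continue;
    counts[facet] = {};
    for (const [value, set] of Object.entries(idx[facet])) {
      const n = intersect(docs || [], set).length;
      if (n > 0) counts[facet][value] = n;
    }
  }
  return { docs: docs || [], counts };
}

const one = facetCounts(index, [["country", "Germany"]]);
const two = facetCounts(index, [["country", "Germany"], ["language", "German"]]);
console.log(one.counts.language); // { German: 2, Dutch: 1 }
console.log(two.counts.modality); // { speech: 1, gesture: 1 }
```

Computing these intersections at query time is what the precomputed views described later are meant to avoid.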
Implementation
Requirements
• cope with metadata heterogeneity, given that documents will adhere to different schemas, each defining its own structured set of descriptors and values
• preserve the original format of all metadata descriptions, and consider storing primary data in addition to the metadata describing it
• handle regular additions to document storage with only incremental update for document access
• provide effective and user-friendly access to all documents
• use a REST-based approach to make data storage read & write web-accessible
CouchDB
• schema-less database design permits the inclusion of arbitrarily structured documents into the database
• original metadata format can be preserved, and primary data can also be associated with the metadata describing it
• map-reduce framework promises incrementality and scalability
• features a REST-based interface for document uploading, downloading and querying
• also “hosts” GUI, and provides Lucene port
CouchDB Views
• correspond to hardwired DB queries; also stored in CouchDB
• once a query is executed, its result is also stored
• defined in terms of map & reduce
• written in Erlang, JavaScript, and other languages
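As an illustration of views being stored in CouchDB, a design document holding one view might look like the sketch below. The view name and the document field “organisation” are hypothetical; the map source is evaluated locally with a stand-in for the emit() that CouchDB's view server supplies.

```javascript
// Sketch of a design document; view name and the field "organisation" are hypothetical.
const designDoc = {
  _id: "_design/organisation",
  language: "javascript",
  views: {
    organisation: {
      // map and reduce are stored as source strings inside the design document
      map: "function (doc) { if (doc.organisation) { emit(doc.organisation, 1); } }",
      reduce: "_sum", // CouchDB's built-in summing reduce
    },
  },
};

// Local check only: evaluate the stored map source against a sample document,
// with a stand-in for the emit() provided by CouchDB's view server.
function runMap(mapSource, doc) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  const mapFn = new Function("emit", "return (" + mapSource + ");")(emit);
  mapFn(doc);
  return rows;
}

const rows = runMap(designDoc.views.organisation.map, { organisation: "MPI" });
console.log(rows); // [ { key: 'MPI', value: 1 } ]
```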
Map-Reduce Motivation
• process lots of data to produce other data
• using many CPUs
• supporting automatic parallelization & distribution, fault-tolerance, I/O scheduling, status and monitoring
Programming Model: Map
• processes input documents (key-value pairs)
• produces set/table of intermediate pairs
  map(in_key, in_value) → list(out_key, intermed_value)
• must be referentially transparent
  • given a document, the function will always emit the same key-value pairs
• document indexing process is incremental, can run in parallel
• can be written in JavaScript and Erlang (& other ports)
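The map model can be tried outside CouchDB with a small harness. In this sketch emit is passed in explicitly, and the “languages” field and sample documents are hypothetical.

```javascript
// A map function in the shape CouchDB expects, except that emit is passed in
// explicitly so the sketch runs outside CouchDB. The "languages" field is hypothetical.
function mapLanguages(doc, emit) {
  for (const lang of doc.languages || []) {
    emit(lang, 1);
  }
}

// Apply a map function to every document and collect the intermediate pairs,
// sorted by key as a view index would be.
function runMapOver(documents, mapFn) {
  const rows = [];
  for (const doc of documents) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const sampleDocs = [
  { _id: "a", languages: ["German", "Dutch"] },
  { _id: "b", languages: ["German"] },
  { _id: "c" },
];

const mapRows = runMapOver(sampleDocs, mapLanguages);
console.log(mapRows.map((r) => r.key)); // [ 'Dutch', 'German', 'German' ]
```

Because the emitted rows depend only on the document itself, a changed or added document can be re-mapped without touching the rows of any other document.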
Programming Model: Reduce
• combines all values for a particular key
• produces a set of merged output values (usually just one)
reduce(out_key, list(intermed_value)) → list(out_value)
• map function can be complemented by a reduce function
• takes as input the table of emitted values with identical keys as generated by the map function, and aggregates them, e.g.,
  • summing up the values associated with the same key:
    function(key, values) {
      return sum(values);
    }
• must be referentially transparent, commutative and associative
• must be callable with the output of the map process, but also with intermediate values computed by a prior reduce (rereduce).
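The rereduce requirement matters whenever reduce output differs in kind from map output. A minimal sketch, using the three-argument form of CouchDB's JavaScript reduce (keys, values, rereduce) and made-up input:

```javascript
// Reduce in CouchDB's JavaScript signature: (keys, values, rereduce).
// Counting rows needs different handling in the two modes:
// on reduce, values are map outputs; on rereduce, they are earlier counts.
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    return values.reduce((a, b) => a + b, 0);
  }
  return values.length;
}

// Simulate reducing two chunks of map output and then combining the results.
const chunk1 = countReduce([["German", "a"], ["German", "b"]], [1, 1], false);
const chunk2 = countReduce([["German", "c"]], [1], false);
const total = countReduce(null, [chunk1, chunk2], true);
console.log(total); // 3
```

A plain sum, as in the example above, needs no such case distinction because summing partial sums gives the same result.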
Map-Reduce Framework
[Figure: map-reduce data flow. Documents are passed to map functions, which emit intermediate values for key-1, key-2 and key-3; the values are aggregated per key and passed to reduce functions, which produce the final values for each key.]
Implementation
Stages
1 ingestion: OAI-PMH-harvested documents are validated against their schema, and then
  • converted from XML to JSON
  • supplied with unique id, timestamp, source, and schema information, and
  • added to the DB with the original XML as attachment
2 indexing: to attack data heterogeneity at schema level
3 curation: to address variability in facet values
4 faceted search indexing: to precompute all possible queries
5 presentation: to give users navigation access to datasets
Document Indexing with map-reduce
• document indexing tackles data heterogeneity given that documents may adhere to different schemas
Map Example (template)
function(doc) {
  switch (doc.schema) {
    case "<reference_to_schema_a>":
      if (<tree_has_node>) {
        emit(<path_to_node_val>, 1);
      }
      break; // after the if, so a missing node does not fall through to the next case
    case "<reference_to_schema_b>":
      [...]
    [...]
  }
}
Map to index organisations (fragment)
function(doc) {
  switch (doc.schema) {
    case "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/[...]1694580/xsd":
      // emit the legal owner only if every node on the path exists
      if (doc.CMD
          && doc.CMD.Components
          && doc.CMD.Components.TextCorpusProfile
          && doc.CMD.Components.TextCorpusProfile.GeneralInfo
          && doc.CMD.Components.TextCorpusProfile.GeneralInfo.LegalOwner
          && doc.CMD.Components.TextCorpusProfile.GeneralInfo.LegalOwner.$t
      ) {
        emit(doc.CMD.Components.TextCorpusProfile.GeneralInfo.LegalOwner.$t, 1);
      }
      break; // after the if, so a missing node does not fall through to the next case
    case "http://theharvestingday.eu/schemas/clarin_bamdes-1.1.xsd":
      if (doc.LexicalResource
          && doc.LexicalResource.organization
          && doc.LexicalResource.organization.$t
      ) {
        emit(doc.LexicalResource.organization.$t, 1);
      }
      break;
    ...
  }
}
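The chained && checks guard against missing nodes on the path. One way to express the same guard is a small path-walking helper; getPath is a hypothetical name and not part of CouchDB, and the sketch uses modern JavaScript that an older CouchDB view server may not accept.

```javascript
// Hypothetical helper: follow a list of property names and return undefined
// as soon as a node is missing, instead of chaining && checks by hand.
function getPath(obj, path) {
  let node = obj;
  for (const name of path) {
    if (node === null || typeof node !== "object" || !(name in node)) {
      return undefined;
    }
    node = node[name];
  }
  return node;
}

// Illustrative document in the XML-to-JSON shape used above ($t holds the text).
const lexDoc = { LexicalResource: { organization: { $t: "Example Institute" } } };

console.log(getPath(lexDoc, ["LexicalResource", "organization", "$t"])); // Example Institute
console.log(getPath(lexDoc, ["CMD", "Components"])); // undefined
```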
Map Result (organisations)
Reduce Result (organisations)
Note:
• need for data curation
Document Indexing with map-reduce
• initially, manually coded, and adapted after schema change
• but this is tedious and prone to error
⇒ now automatic generation of views from declarative facet specification using JavaScript (string concatenation)
Facet specification
{ "facet" : "modality",
  "pathInfos" : [
    { "schema": "http://catalog.clarin.eu/...:cr1:p_129094580/...",
      "path" : "doc.CMD.Components.TextCorpusProfile..."
    },
    { "schema": "http://catalog.clarin.eu/...:cr1:p_129094579/...",
      "path" : "doc.CMD.Components.LexicalResourceProfile..."
    },
    ...
  ]
}
{ "facet" : "language",
  "pathInfos" : [ ... ]
}
[...]
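Generating a map function from such a specification by string concatenation could look roughly like this. It is a sketch, not the author's generator: the specification values below (schema names and paths) are invented, and generateMap and guardFor are hypothetical names.

```javascript
// Hypothetical facet specification in the shape shown above;
// schema references and paths are invented for illustration.
const spec = {
  facet: "organisation",
  pathInfos: [
    { schema: "schema-a", path: "doc.CMD.Components.Profile.Owner.$t" },
    { schema: "schema-b", path: "doc.LexicalResource.organization.$t" },
  ],
};

// Turn "doc.a.b.c" into the guard "doc.a && doc.a.b && doc.a.b.c".
function guardFor(path) {
  const parts = path.split(".");
  const guards = [];
  for (let i = 2; i <= parts.length; i++) {
    guards.push(parts.slice(0, i).join("."));
  }
  return guards.join(" && ");
}

// Generate the source text of a map function by string concatenation.
function generateMap(facetSpec) {
  let src = "function (doc) {\n  switch (doc.schema) {\n";
  for (const info of facetSpec.pathInfos) {
    src += "    case " + JSON.stringify(info.schema) + ":\n";
    src += "      if (" + guardFor(info.path) + ") {\n";
    src += "        emit(" + info.path + ", 1);\n";
    src += "      }\n      break;\n";
  }
  src += "  }\n}";
  return src;
}

const mapSource = generateMap(spec);
console.log(mapSource);
```

The generated text has the same switch-per-schema shape as the hand-written map functions, so a schema change only requires editing the specification.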
Data Curation
• each map function gives a view of the document space in terms of the facet it represents
• analysis shows large variability for many facet values, e.g., organisations with different names
• devised curation tables that map given names to preferred names
• data curation performed on the indices (for faceted search) rather than the original documents
Conversion of Views to Documents
• faceted search to be defined in terms of the document indexing established in the first map-reduce cycle
• but CouchDB’s map-reduce framework is defined in terms of documents
• thus, not possible to define views on views, at least not directly
Views on Views
• re-using the result of document indexing by converting the resulting views into documents
• conversion takes care of data curation
• conversion written in JavaScript, implementing a hash table of hash tables
  • outer hash table gives access to the facets
    • e.g., “language”
  • inner hash table gives access to all the values a chosen facet can take
    • e.g., associating key “German” with all documents with this piece of information
• new index (of type “docIndex”) is stored in an extra CouchDB DB
  • also holds all views to implement faceted search
• one index file for each document collection
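The conversion of view rows into an index document, with curation applied on the way, might be sketched as below. The curation entries and view rows are invented, buildDocIndex is a hypothetical name, and only the “docType”: “docIndex” marker is taken from the slides.

```javascript
// Hypothetical curation table: given name -> preferred name.
const curation = {
  language: { Bulgariian: "Bulgarian", Portugueze: "Portuguese" },
};

// Hypothetical rows of the non-reduced indexing views, one array per facet;
// each row pairs an emitted key with the id of the document that emitted it.
const viewRows = {
  language: [
    { key: "Bulgarian", id: "d1" },
    { key: "Bulgariian", id: "d2" },
    { key: "German", id: "d3" },
    { key: "German", id: "d1" },
  ],
};

// Build the hash table of hash tables: facet -> curated value -> sorted ids.
function buildDocIndex(rowsByFacet, curationTables) {
  const indexDoc = { docType: "docIndex" };
  for (const [facet, rows] of Object.entries(rowsByFacet)) {
    const table = curationTables[facet] || {};
    const inner = new Map();
    for (const { key, id } of rows) {
      const value = Object.prototype.hasOwnProperty.call(table, key) ? table[key] : key;
      if (!inner.has(value)) inner.set(value, new Set());
      inner.get(value).add(id);
    }
    indexDoc[facet] = {};
    for (const [value, ids] of inner) {
      indexDoc[facet][value] = Array.from(ids).sort();
    }
  }
  return indexDoc;
}

const docIndex = buildDocIndex(viewRows, curation);
console.log(JSON.stringify(docIndex));
// {"docType":"docIndex","language":{"Bulgarian":["d1","d2"],"German":["d1","d3"]}}
```

Ids are kept sorted and free of duplicates so that a consumer can treat each list as an ordered set.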
document index for one collection
Map View for Country
fun ({Doc}) ->
    case proplists:get_value(<<"docType">>, Doc) of
        <<"docIndex">> ->
            {CountryHash} = proplists:get_value(<<"country">>, Doc, {[]}),
            {LanguageHash} = proplists:get_value(<<"language">>, Doc, {[]}),
            <other hashes>
            lists:foreach(
              fun (CountryItem) ->
                  DocSet = proplists:get_value(CountryItem, CountryHash),
                  DocSetSize = ordsets:size(DocSet),
                  if
                      DocSetSize > 0 ->
                          %% all documents for this country
                          Emit(CountryItem,
                               {[{<<"facet">>, <<"_total_">>},
                                 {<<"value">>, <<"_total_">>},
                                 {<<"docs">>, DocSet}]}),
                          %% number of documents per language within this country
                          lists:foreach(
                            fun (LanguageItem) ->
                                Intersection = ordsets:intersection(
                                                 proplists:get_value(LanguageItem, LanguageHash),
                                                 proplists:get_value(CountryItem, CountryHash)),
                                case Intersection == [] of
                                    false ->
                                        Emit(CountryItem,
                                             {[{<<"facet">>, <<"language">>},
                                               {<<"value">>, LanguageItem},
                                               {<<"docs">>, ordsets:size(Intersection)}]});
                                    _ -> ok
                                end
                            end,
                            proplists:get_keys(LanguageHash)),
                          <other intersections for other facets[...]>;
                      true -> ok
                  end
              end,
              proplists:get_keys(CountryHash));
        _ -> ok
    end
end.
Result for Country View (fragment)
Reduce Function (common to FB views)
fun (Key, Values) ->
    %% merge one emitted entry into the dictionary, keyed by {Facet, Value}
    AddToDict = fun (CurrentEntry, Dict) ->
        {[{<<"facet">>, Facet},
          {<<"value">>, Value},
          {<<"docs">>, Documents}]} = CurrentEntry,
        DictKey = {Facet, Value},
        case Facet of
            <<"_total_">> ->
                %% totals carry document lists: concatenate them
                dict:append_list(DictKey, Documents, Dict);
            _ ->
                %% all other entries carry counts: add them up
                dict:update(DictKey,
                            fun (Old) -> Old + Documents end,
                            Documents, Dict)
        end
    end,
    %% turn the dictionary back into a list of facet/value/docs entries
    DictToList = fun (Dict) ->
        lists:map(fun (Entry) ->
                      {{Facet, Value}, Docs} = Entry,
                      {struct, [{<<"facet">>, Facet},
                                {<<"value">>, Value},
                                {<<"docs">>, Docs}]}
                  end,
                  dict:to_list(Dict))
    end,
    DictToList(lists:foldl(fun (Value, Dict) -> AddToDict(Value, Dict) end,
                           dict:new(), Values))
end.
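For readers less familiar with Erlang, the merge logic of this reduce can be rendered in JavaScript. This is an illustrative rendering only, with made-up input; mergeEntries is a hypothetical name and the code is not a drop-in CouchDB reduce function.

```javascript
// JavaScript rendering of the merge logic, for illustration only:
// entries are {facet, value, docs}; "_total_" entries carry id lists,
// all other entries carry counts.
function mergeEntries(entries) {
  const merged = new Map();
  for (const { facet, value, docs } of entries) {
    const key = JSON.stringify([facet, value]);
    const old = merged.get(key);
    if (facet === "_total_") {
      // concatenate document lists
      merged.set(key, { facet, value, docs: (old ? old.docs : []).concat(docs) });
    } else {
      // add up counts
      merged.set(key, { facet, value, docs: (old ? old.docs : 0) + docs });
    }
  }
  return Array.from(merged.values());
}

const mergedRows = mergeEntries([
  { facet: "_total_", value: "_total_", docs: ["d1", "d2"] },
  { facet: "language", value: "German", docs: 2 },
  { facet: "_total_", value: "_total_", docs: ["d7"] },
  { facet: "language", value: "German", docs: 1 },
]);
console.log(mergedRows);
```

In this rendering the output entries have the same shape as the input entries, so the function can be applied again to previously merged results.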
Coding of Views
• initially, views were coded manually in JavaScript
• but poor performance in view computation on large index files
• led to the use of Erlang instead, which resulted in a significant performance boost
• writing views by hand is tedious and prone to error
• have written Erlang code that generates the code definitions for Erlang views automatically
• Erlang meta-code based on the concatenation of Erlang code strings
facet specification
-define( FACETS, ["country", "language", "modality", "organisation", "resourceclass"] ).
-define( COND_FACETS, [{ "resourceclass", "corpus", ["genre"] },
                       { "resourceclass", "Tool", ["tooltype", "applicationtype",
                                                   "inputtype", "outputtype",
                                                   "lifecyclestatus" ]}]).
Coding of Views (cont’d)
• specification leads to the generation of 121 views, with each view having between 5000 and 12000 bytes of Erlang code
• not all possible combinations of set intersections are necessary
• document sets resulting from first selecting facet F1 and then selecting facet F2 are identical to those when F2 is selected first and then F1
• realized computation of all necessary intersections using Erlang combinators
Use of Erlang Combinators
comb_4(L) ->
    case length(L) < 4 of
        true -> "supply lists with length >= 4";
        _ -> [ {A,B,C,D,Z} || A <- L,
                              B <- L--[A], A < B,
                              C <- L--[A,B], B < C,
                              D <- L--[A,B,C], C < D,
                              Z <- [L--[A,B,C,D]] ]
    end.
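The order-independence argument means that only subsets of facets, not sequences, need a view. A generic sketch of enumerating such subsets in JavaScript follows; it uses the five unconditional facets named in the evaluation, and whether the author's generator enumerates exactly these subsets is an assumption.

```javascript
// All k-element subsets of a list, in the order of the input list.
function combinations(items, k) {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const [head, ...tail] = items;
  const withHead = combinations(tail, k - 1).map((c) => [head, ...c]);
  return withHead.concat(combinations(tail, k));
}

// The five unconditional facets named in the evaluation.
const facets = ["country", "genre", "language", "modality", "organisation"];

// One candidate view per non-empty, order-independent facet selection.
const selections = [];
for (let k = 1; k <= facets.length; k++) {
  selections.push(...combinations(facets, k));
}
console.log(selections.length); // 31
```

The count of 31 agrees with the number of map-reduce pairs reported in the evaluation setting (5 + 10 + 10 + 5 + 1 subsets of sizes one to five).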
Faceted Search GUI
Faceted Search Queries = Map-Reduce
View request
/mpi_mgt/_design/country/_view/country?key="Germany"&reduce=true
View result
{"rows":[
  {"key":"Germany",
   "value":[
     {"facet":"modality","value":"Unspecified","docs":140},
     {"facet":"modality","value":"Speech/gestures","docs":230},
     {"facet":"language","value":"German Sign Language","docs":433},
     {"facet":"genre","value":"Secondary document","docs":3},
     {"facet":"genre","value":"Movie","docs":458},
     {"facet":"_total_","value":"_total_",
      "docs":["oai:www.mpi.nl:MPI100","oai:www.mpi.nl:MPI1002978"...]}
     [...]
   ]}
]}
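A client might build such a request and read the counts for one facet as sketched below. viewUrl and countsFor are hypothetical helper names; the sample result reuses two entries from the fragment above, and no HTTP request is made.

```javascript
// Build the path and query string of a view request. View keys are JSON values,
// so a string key is sent with its double quotes, then percent-encoded.
function viewUrl(db, designDoc, view, key, reduce) {
  const query =
    "key=" + encodeURIComponent(JSON.stringify(key)) +
    "&reduce=" + (reduce ? "true" : "false");
  return "/" + db + "/_design/" + designDoc + "/_view/" + view + "?" + query;
}

// Pick the counts for one facet out of a reduced view result.
function countsFor(result, facet) {
  const out = {};
  for (const row of result.rows) {
    for (const entry of row.value) {
      if (entry.facet === facet) out[entry.value] = entry.docs;
    }
  }
  return out;
}

const url = viewUrl("mpi_mgt", "country", "country", "Germany", true);
console.log(url);
// /mpi_mgt/_design/country/_view/country?key=%22Germany%22&reduce=true

// Sample result with two entries copied from the fragment above.
const sample = { rows: [{ key: "Germany", value: [
  { facet: "modality", value: "Unspecified", docs: 140 },
  { facet: "genre", value: "Movie", docs: 458 },
] }] };
console.log(countsFor(sample, "genre")); // { Movie: 458 }
```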
Evaluation
Views for Document Indexing
• views for document indexing are automatically generated from the facet specification using JavaScript
• resulting map and reduce functions are in JavaScript too, CouchDB’s default view language
• computation of the view “organisation” takes approximately 25 minutes on 86k documents
  • one-time cost
• no effort has been made yet to increase the speed of view computation
• small changes in the document database will have only a small impact on view recomputation at the document indexing level
Views for Faceted Search
• computing faceted search views is computationally expensive
  • JavaScript too slow
  • Erlang much faster (better in memory and processor usage)
Evaluation setting
• each Erlang view stored in a separate CouchDB design document
• map-reduce computations executed on a 24-core, 96 GB machine
• harvested and ingested approximately 86,000 metadata documents on language resources
• five unconditional facets: “language” (371), “country” (67), “organisation” (39), “modality” (32), and “genre” (50)
• many different facet values: “modality” = “speech” (59463); “language” = “Dutch” (18345); “country” = “Germany” (16178); “organisation” = “Max Planck Institute for Psycholinguistics” (16568), and “genre” = “Discourse” (33676)
• 31 different map-reduce pairs
Evaluation
Views Computation for Faceted Search
• generation of the views “language”, “country”, “organisation”, “modality”, and “genre” takes altogether less than one minute (using 5 CPUs)
• generation of the ten 2-level views (users selected two facets, e.g., “country”:“genre”, “country”:“language”, ...) was computed in less than 1 minute (using 10 CPUs)
• computation of the ten 3-level views, where users selected three facets: < 7.5 minutes
• computation of the five 4-level views: more than 2 hours
Future work for optimisation
• currently, one indexing document for each of the metadata providers
⇒ update from one data provider only requires a limited view recomputation
• but some data providers provide tens of thousands of documents
⇒ optimise index documents for faceted search
  • reflect additions by a new index document, so that incremental updates are indeed limited to document additions
  • handle modifications and deletions by introducing MODIFY and DELETE lists that a revised map-reduce combination would need to consider
Related Work: Flamenco
• toolkit with web-based interface to give faceted access to large data collections, given import format:
  • the file facets.tsv listing all facets
  • the file attrs.tsv listing all attributes of a given item
  • the file items.tsv listing each collection item (following definition in attrs.tsv) with unique id
  • for each entry facet in facets.tsv
    • file facet_term: lists all terms for given facet with unique facet term ids
    • facet_map associates unique facet term id with item ids
• data files ingested into Flamenco relational database (MySQL)
• Flamenco generates faceted browser’s default/customizable GUI
• user’s selection of facet terms translated into corresponding MySQL queries to compute all necessary set intersections
• results of executing MySQL queries are cached to avoid re-computation
Flamenco used in VLO
• faceted search access to language resources using Flamenco with same dataset
• See http://www.clarin.eu/vlo
• used Perl to translate 80,000+ XML-based metadata files into Flamenco’s indexing data format (incl. curation)
• ingested data into the Flamenco database and adapted GUI
• script to generate all queries to warm up the cache
Comparison
• data preparation required for Flamenco roughly corresponds to our CouchDB-based document indexing phase (simple views)
• data curation only happens when the views of the indexing phase are converted into the indexing documents
• MySQL queries fired by Flamenco correspond to the views computed in terms of the indexing documents
Advantages of CouchDB
• CouchDB also stores original metadata documents (with varying schemata), thus also serves as permanent storage
• conditional facets contribute to usability, guiding users’ navigation
  • need only be computed in subsets whose documents are indexed against terms the conditional facet depends on
• index generation accommodates incremental updates on the metadata sets, supporting regular harvesting without recomputing all indices/views
  • in Flamenco, any change in the data set requires overwriting of all contexts/caches
• facet specification offers more declarative view
  • index generation taken to higher level
  • easy to experiment with different facet configurations
  • but, once facet specification is changed, index generation starts from scratch
Conclusion
• CouchDB with its native language Erlang is well suited for the development of industrial-strength applications
• CouchDB’s REST-based interface offers a lean alternative to established software (Java-based Apache Tomcat webserver)
• Erlang’s main limitation is the lack of a full macro package allowing users to write programs that write other programs
  • Common-Lisp-like defmacro would have made life easier
  • currently, no strong support for a Lisp (or Haskell) port to index and query documents in CouchDB
• CouchDB’s main limitation, when used with Erlang, is the lack of documentation and example code available
Conclusion
• general approach to aggregate heterogeneously structured documents and to make them accessible via faceted (and full-text) search
  • works as long as documents’ relevant content can be given in JSON (CouchDB’s native format)
• for given context, facet specification was straightforward
• desirable to detect good facet candidates automatically
  • Castanet algorithm
    • requires definition of target terms to best reflect the topics present in given collection
    • combines target terms with hypernymy (IS-A) information of WordNet to both
      • build facet hierarchies and
      • assign documents to the facets
Questions