Assignment 2: Full text search with Lucene Mathias Mosolf, Alexander Frenzel



General

• Problems?

• Which documentation?
  o Getting Started
  o mainly examples
  o Javadoc


Indexer

• XML parsing with Digester
  o http://commons.apache.org/digester/

• Lucene
  o org.apache.lucene.index.IndexWriter;
  o org.apache.lucene.store.Directory;
  o org.apache.lucene.analysis.Analyzer;
  o org.apache.lucene.analysis.WhitespaceAnalyzer;
  o org.apache.lucene.document.Document;
  o org.apache.lucene.document.Field;


    Analyzer analyzer = new WhitespaceAnalyzer();
    boolean createFlag = true;
    writer = new IndexWriter(indexDir, analyzer, createFlag, IndexWriter.MaxFieldLength.UNLIMITED);

[..]

digester.parse(xml);

[..]

    public void addMedlineDocument(MedlineDocument doc) throws IOException {
        this.counter++;

        // strip markup tags and lowercase; WhitespaceAnalyzer does not normalize case itself
        String title = doc.getTitle().replaceAll("\\<.*?\\>", " ").toLowerCase();
        String text = ((doc.getAbstract() != null) ? doc.getAbstract() : "").replaceAll("\\<.*?\\>", " ").toLowerCase();

        Document medlineDocument = new Document();
        medlineDocument.add(new Field("pmid", doc.getPmid(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        medlineDocument.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        medlineDocument.add(new Field("abstract", text, Field.Store.YES, Field.Index.ANALYZED));
        // a space between title and abstract keeps their boundary tokens from fusing
        medlineDocument.add(new Field("combined", title + " " + text, Field.Store.YES, Field.Index.ANALYZED));

        writer.addDocument(medlineDocument);
    }

[..]

    // optimize and close the index
    writer.optimize();
    writer.close();
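The preprocessing in addMedlineDocument can be tried out without Lucene. The following is a minimal sketch using only the Java standard library: the class name NormalizeSketch and both method names are hypothetical, and splitting on whitespace runs only approximates what WhitespaceAnalyzer does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original indexer.
class NormalizeSketch {
    // Same regex as in addMedlineDocument: replace each "<...>" tag with a space, then lowercase.
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replaceAll("\\<.*?\\>", " ").toLowerCase();
    }

    // Approximation of whitespace tokenization: split on runs of whitespace only.
    static List<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String title = normalize("Duchenne's <i>muscular</i> dystrophy.");
        System.out.println(tokens(title));
        // Concatenating two fields with and without a separating space:
        System.out.println(tokens("protein" + "complex"));
        System.out.println(tokens("protein" + " " + "complex"));
    }
}
```

Because tokens are split on whitespace only, punctuation stays attached to the neighbouring word ("dystrophy." is a single token), so a query term has to include that punctuation to match. Concatenating two fields without a separating space fuses the last token of the first with the first token of the second.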


BoolSearch

• Collect the IDs in a Set<String>

• Lucene
  o org.apache.lucene.search.IndexSearcher;
  o org.apache.lucene.index.Term;
  o org.apache.lucene.search.BooleanClause;
  o org.apache.lucene.search.BooleanQuery;
  o org.apache.lucene.search.ScoreDoc;
  o org.apache.lucene.search.TermQuery;
  o org.apache.lucene.search.TopDocs;


    this.searcher = new IndexSearcher(indexDir); // field used by search() below
    pmids = new HashSet<String>();

    public void search(String field, String[] keywords) throws IOException {
        BooleanQuery query = new BooleanQuery();

        // create BOOL query: every keyword must occur (AND)
        for (String word : keywords) {
            TermQuery tq = new TermQuery(new Term(field, word.toLowerCase()));
            query.add(tq, BooleanClause.Occur.MUST);
        }

        // extract PMIDs; maxDoc() as the hit limit returns all matches
        TopDocs docs = this.searcher.search(query, searcher.maxDoc());
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            pmids.add(doc.get("pmid")); // add to set
        }
    }
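The effect of combining all keywords with Occur.MUST can be illustrated with a small in-memory inverted index. This is a sketch of the AND semantics only, not of how Lucene evaluates a BooleanQuery; the class BoolAndSketch and its methods are hypothetical and use only the Java standard library.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index, not part of the original assignment code.
class BoolAndSketch {
    // term -> IDs of all documents containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Lowercase the text, split it on whitespace, and record each token for this ID.
    void add(String pmid, String text) {
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(pmid);
            }
        }
    }

    // AND search: intersect the ID sets of all keywords.
    Set<String> searchAll(String... keywords) {
        Set<String> result = null;
        for (String word : keywords) {
            Set<String> docs = postings.getOrDefault(word.toLowerCase(), Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        BoolAndSketch index = new BoolAndSketch();
        index.add("1", "protein complex formation");
        index.add("2", "protein folding");
        index.add("3", "a complex of one protein");
        System.out.println(index.searchAll("protein", "complex"));
    }
}
```

As in the slide code, collecting the IDs in a Set means each document is reported once, and the order of the keywords does not matter.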


PhraseSearch

• PhraseQuery vs SpanQuery

• Lucene
  o org.apache.lucene.search.IndexSearcher;
  o org.apache.lucene.index.Term;
  o org.apache.lucene.search.spans.SpanNearQuery;
  o org.apache.lucene.search.spans.SpanQuery;
  o org.apache.lucene.search.spans.SpanTermQuery;
  o org.apache.lucene.search.spans.Spans;


    this.searcher = new IndexSearcher(indexDir); // field used by search() below
    pmids = new HashSet<String>();

    public void search(String field, String[] phrase) throws IOException {
        // generate query
        int l = phrase.length;
        SpanQuery[] sq = new SpanQuery[l];

        for (int i = 0; i < l; i++) {
            // the index is lowercased, so lowercase the query terms as well
            sq[i] = new SpanTermQuery(new Term(field, phrase[i].toLowerCase()));
        }
        // slop 0, in order: the terms must be adjacent and in the given order
        SpanNearQuery query = new SpanNearQuery(sq, 0, true);

        // search
        Spans sp = query.getSpans(this.searcher.getIndexReader());

        int id = -1;
        Document doc;
        // runs through all occurrences of the phrase
        while (sp.next()) {
            this.occ++; // number of occurrences
            if (id != sp.doc()) { // next doc
                id = sp.doc(); // save current id

                // add pmid
                doc = searcher.doc(id);
                this.pmids.add(doc.get("pmid"));
            }
        }
    }
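The counting logic above (every occurrence increments occ, each document is added once) can be reproduced on plain token arrays. The following sketch assumes whitespace tokenization and exact, gap-free, in-order matching, which is what slop 0 with inOrder = true asks for. It is not Lucene's span implementation, and the class PhraseSketch with its methods is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the span-based search, standard library only.
class PhraseSketch {
    int occ = 0;                                // total number of phrase occurrences
    final Set<String> pmids = new TreeSet<>();  // IDs of documents with at least one occurrence

    // Count positions where all phrase terms appear adjacent and in order.
    static int countOccurrences(String[] tokens, String[] phrase) {
        if (phrase.length == 0) {
            return 0;
        }
        int count = 0;
        for (int start = 0; start + phrase.length <= tokens.length; start++) {
            boolean match = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!tokens[start + j].equals(phrase[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    // docs maps an ID to its text; the text is lowercased and split on whitespace.
    void search(Map<String, String> docs, String[] phrase) {
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            String[] tokens = entry.getValue().toLowerCase().trim().split("\\s+");
            int n = countOccurrences(tokens, phrase);
            if (n > 0) {
                occ += n;
                pmids.add(entry.getKey());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "the protein complex binds the protein complex");
        docs.put("2", "complex protein");
        docs.put("3", "a protein complex");
        PhraseSketch sketch = new PhraseSketch();
        sketch.search(docs, new String[] {"protein", "complex"});
        System.out.println(sketch.occ + " occurrences in " + sketch.pmids);
    }
}
```

Document "2" contains both terms but in the wrong order, so it counts for a boolean AND search but not for the phrase search.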


Tests/Examples

• Examples
  o "reduce the appeal of" => ~0.399 s
  o "duchenne's muscular" => ~0.398 s
  o dicyclocarbodimide => ~0.399 s
  o experiment => ~0.458 s
  o protein complex => ~0.446 s
  o duchenne's disease => ~0.447 s

• "Complex"
  o and => ~1.446 s
  o and the => ~1.298 s
  o and to the => ~1.212 s
  o and to the you => ~0.449 s
  o "operation, the patient presents no signs of" => ~0.400 s