51
Bleve Text Indexing for Go 1 February 2015 Marty Schoch

Bleve - GitHub

  • Upload
    others

  • View
    15

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Bleve - GitHub

BleveText Indexing for Go1 February 2015

Marty Schoch

Page 2: Bleve - GitHub

Say What?

blev-ee bih-leev

1

Page 3: Bleve - GitHub

Marty Schoch

NoSQL Document Database

Official Go SDK

Projects Using Go

N1QL Query Language

Secondary Indexing

Cross Data-Center Replication

2

Page 4: Bleve - GitHub

Why?

Lucene/Solr/Elasticsearch are awesome

Could we build 50% of Lucene's text analysis, combine it with off-the-shelf KV storesand get something interesting?

3

Page 5: Bleve - GitHub

Bleve Core Ideas

Text Analysis Pipeline

We only have to build common core

Users customize for domain/language through interfaces

Pluggable KV storage

No custom file format

Plug-in Bolt, LevelDB, ForestDB, etc

Search

Make term search work

Almost everything else built on top of that...

4

Page 6: Bleve - GitHub

What is Search?

Page 7: Bleve - GitHub

Simple Search

6

Page 8: Bleve - GitHub

Advanced Search

7

Page 9: Bleve - GitHub

Search Results

8

Page 10: Bleve - GitHub

Faceted Search

9

Page 11: Bleve - GitHub

Getting Started

Page 12: Bleve - GitHub

Install bleve

go get github.com/blevesearch/bleve/...

11

Page 13: Bleve - GitHub

1011121314151617181920212223242526272829

Import

import "github.com/blevesearch/bleve"

type Person struct { Name string}

func main() { mapping := bleve.NewIndexMapping() index, err := bleve.New("people.bleve", mapping) if err != nil { log.Fatal(err) }

person := Person{"Marty Schoch"} err = index.Index("m1", person) if err != nil { log.Fatal(err) } fmt.Println("Indexed Document")}

12

Page 14: Bleve - GitHub

10111213141516

17181920212223242526272829

Data Model

import "github.com/blevesearch/bleve"

type Person struct { Name string}

func main() {

mapping := bleve.NewIndexMapping() index, err := bleve.New("people.bleve", mapping) if err != nil { log.Fatal(err) }

person := Person{"Marty Schoch"} err = index.Index("m1", person) if err != nil { log.Fatal(err) } fmt.Println("Indexed Document")}

13

Page 15: Bleve - GitHub

101112131415

1617181920212223242526272829

Index Mapping

import "github.com/blevesearch/bleve"

type Person struct { Name string}

func main() { mapping := bleve.NewIndexMapping() index, err := bleve.New("people.bleve", mapping) if err != nil { log.Fatal(err) }

person := Person{"Marty Schoch"} err = index.Index("m1", person) if err != nil { log.Fatal(err) } fmt.Println("Indexed Document")}

14

Page 16: Bleve - GitHub

1011121314

151617181920212223242526272829

Create a New Index

import "github.com/blevesearch/bleve"

type Person struct { Name string}

func main() { mapping := bleve.NewIndexMapping() index, err := bleve.New("people.bleve", mapping) if err != nil { log.Fatal(err) }

person := Person{"Marty Schoch"} err = index.Index("m1", person) if err != nil { log.Fatal(err) } fmt.Println("Indexed Document")}

15

Page 17: Bleve - GitHub

Index Data

10111213

14151617181920212223242526272829

import "github.com/blevesearch/bleve"

type Person struct { Name string

}

func main() { mapping := bleve.NewIndexMapping() index, err := bleve.New("people.bleve", mapping) if err != nil { log.Fatal(err) }

person := Person{"Marty Schoch"} err = index.Index("m1", person) if err != nil { log.Fatal(err) } fmt.Println("Indexed Document")} Run

16

Page 18: Bleve - GitHub

151617

1819202122232425262728

Open Index

func main() { index, err := bleve.Open("people.bleve") if err != nil {

log.Fatal(err) }

query := bleve.NewTermQuery("marty") request := bleve.NewSearchRequest(query) result, err := index.Search(request) if err != nil { log.Fatal(err) } fmt.Println(result)}

17

Page 19: Bleve - GitHub

1516

171819202122232425262728

Build Query

func main() { index, err := bleve.Open("people.bleve")

if err != nil { log.Fatal(err) }

query := bleve.NewTermQuery("marty") request := bleve.NewSearchRequest(query) result, err := index.Search(request) if err != nil { log.Fatal(err) } fmt.Println(result)}

18

Page 20: Bleve - GitHub

1516171819202122232425262728

Build Request

func main() { index, err := bleve.Open("people.bleve") if err != nil { log.Fatal(err) }

query := bleve.NewTermQuery("marty") request := bleve.NewSearchRequest(query) result, err := index.Search(request) if err != nil { log.Fatal(err) } fmt.Println(result)}

19

Page 21: Bleve - GitHub

Search

1516171819202122232425262728

func main() { index, err := bleve.Open("people.bleve") if err != nil { log.Fatal(err) }

query := bleve.NewTermQuery("marty") request := bleve.NewSearchRequest(query) result, err := index.Search(request) if err != nil { log.Fatal(err) } fmt.Println(result)} Run

20

Page 22: Bleve - GitHub

More Realistic Examples

Page 23: Bleve - GitHub

FOSDEM Schedule of Events (iCal)

BEGIN:VEVENTMETHOD:PUBLISHUID:2839@[email protected]:Europe-BrusselsDTSTART:20150201T140000DTEND:20150201T144500SUMMARY:bleve - text indexing for GoDESCRIPTION: Nearly every application today has a search component. But delivering high quality search results requires a long list of text analysis and indexing techniques. With the bleve library, we bring advanced text indexing and search to your Go applications. In this talk we'll examine how the bleve library brings powerful text indexing and search capabilities to Go applications.CLASS:PUBLICSTATUS:CONFIRMEDCATEGORIES:GoURL:https:/fosdem.org/2015/schedule/event/bleve/LOCATION:K.3.401ATTENDEE;ROLE=REQ-PARTICIPANT;CUTYPE=INDIVIDUAL;CN="Marty Schoch":invalid:nomailEND:VEVENT

22

Page 24: Bleve - GitHub

FOSDEM Event Data Structure

type Event struct { UID string ̀json:"uid"̀ Summary string ̀json:"summary"̀ Description string ̀json:"description"̀ Speaker string ̀json:"speaker"̀ Location string ̀json:"location"̀ Category string ̀json:"category"̀ URL string ̀json:"url"̀ Start time.Time ̀json:"start"̀ Duration float64 ̀json:"duration"̀}

23

Page 25: Bleve - GitHub

Index FOSDEM Events

363738394041424344454647484950515253545556

count := 0 batch := bleve.NewBatch() for event := range parseEvents() { batch.Index(event.UID, event) if batch.Size() > 100 { err := index.Batch(batch) if err != nil { log.Fatal(err) } count += batch.Size() batch = bleve.NewBatch() } } if batch.Size() > 0 { index.Batch(batch) if err != nil { log.Fatal(err) } count += batch.Size() } fmt.Printf("Indexed %d Events\n", count) Run

24

Page 26: Bleve - GitHub

Search FOSDEM Events

1112131415161718192021222324252627

func main() {

index, err := bleve.Open("fosdem.bleve") if err != nil { log.Fatal(err) }

q := bleve.NewTermQuery("bleve") req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker"} res, err := index.Search(req) if err != nil { log.Fatal(err) } fmt.Println(res)} Run

25

Page 27: Bleve - GitHub

Phrase Search

111213141516171819202122232425262728

func main() {

index, err := bleve.Open("fosdem.bleve") if err != nil { log.Fatal(err) }

phrase := []string{"advanced", "text", "indexing"} q := bleve.NewPhraseQuery(phrase, "description") req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker"} res, err := index.Search(req) if err != nil { log.Fatal(err) } fmt.Println(res)} Run

26

Page 28: Bleve - GitHub

Combining Queries

11121314151617181920212223242526272829

func main() {

index, err := bleve.Open("fosdem.bleve") if err != nil { log.Fatal(err) }

tq1 := bleve.NewTermQuery("text") tq2 := bleve.NewTermQuery("search") q := bleve.NewConjunctionQuery([]bleve.Query{tq1, tq2}) req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker"} res, err := index.Search(req) if err != nil { log.Fatal(err) } fmt.Println(res)} Run

27

Page 29: Bleve - GitHub

Combining More Queries

11121314151617181920212223242526

2728293031

func main() {

index, err := bleve.Open("fosdem.bleve") if err != nil { log.Fatal(err) }

tq1 := bleve.NewTermQuery("text") tq2 := bleve.NewTermQuery("search") tq3 := bleve.NewTermQuery("believe") q := bleve.NewConjunctionQuery( []bleve.Query{tq1, tq2, tq3}) req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker"} res, err := index.Search(req)

if err != nil { log.Fatal(err) } fmt.Println(res)} Run

28

Page 30: Bleve - GitHub

Fuzzy Query

111213141516171819202122232425

262728293031

func main() {

index, err := bleve.Open("fosdem.bleve") if err != nil { log.Fatal(err) }

tq1 := bleve.NewTermQuery("text") tq2 := bleve.NewTermQuery("search") tq3 := bleve.NewFuzzyQuery("believe") q := bleve.NewConjunctionQuery( []bleve.Query{tq1, tq2, tq3}) req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker"}

res, err := index.Search(req) if err != nil { log.Fatal(err) } fmt.Println(res)} Run

29

Page 31: Bleve - GitHub

Numeric Range Query

1112131415161718192021222324

25262728

func main() {

index, err := bleve.Open("fosdem.bleve") if err != nil { log.Fatal(err) }

longTalk := 110.0 q := bleve.NewNumericRangeQuery(&longTalk, nil) req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker", "duration"} res, err := index.Search(req) if err != nil {

log.Fatal(err) } fmt.Println(res)} Run

30

Page 32: Bleve - GitHub

Date Range Query

11121314151617181920212223

242526272829

func main() {

index, err := bleve.Open("fosdem.bleve") if err != nil { log.Fatal(err) }

lateSunday := "2015-02-01T17:30:00Z" q := bleve.NewDateRangeQuery(&lateSunday, nil) q.SetField("start") req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker", "start"}

res, err := index.Search(req) if err != nil { log.Fatal(err) } fmt.Println(res)} Run

31

Page 33: Bleve - GitHub

Query Strings

111213141516171819202122

23242526272829303132

func main() {

index, err := bleve.Open("fosdem.bleve") if err != nil { log.Fatal(err) }

qString := ̀+description:text ̀ qString += ̀summary:"text indexing" ̀ qString += ̀summary:believe~2 ̀ qString += ̀-description:lucene ̀ qString += ̀duration:>30̀

q := bleve.NewQueryStringQuery(qString) req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker", "description", "duration"} res, err := index.Search(req) if err != nil { log.Fatal(err) } fmt.Println(res)} Run

32

Page 34: Bleve - GitHub

Default Mapping vs Custom Mapping

The default mapping has worked really well, but...

Earlier today we heard talk named "Finding Bad Needles in Worldwide Haystacks".

Will we find it if we search for "haystack"?

1819202122232425

26

q := bleve.NewTermQuery("haystack") req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker"} res, err := index.Search(req) if err != nil { log.Fatal(err) }

fmt.Println(res) Run

33

Page 35: Bleve - GitHub

Custom Mapping

2728293031323334353637383940414243444546

enFieldMapping := bleve.NewTextFieldMapping() enFieldMapping.Analyzer = "en"

eventMapping := bleve.NewDocumentMapping() eventMapping.AddFieldMappingsAt("summary", enFieldMapping) eventMapping.AddFieldMappingsAt("description", enFieldMapping)

kwFieldMapping := bleve.NewTextFieldMapping() kwFieldMapping.Analyzer = "keyword"

eventMapping.AddFieldMappingsAt("url", kwFieldMapping) eventMapping.AddFieldMappingsAt("category", kwFieldMapping)

mapping := bleve.NewIndexMapping() mapping.DefaultMapping = eventMapping

index, err := bleve.New("custom.bleve", mapping) if err != nil { log.Fatal(err) } Run

34

Page 36: Bleve - GitHub

Search Custom Mapping

181920212223242526

q := bleve.NewTermQuery("haystack") req := bleve.NewSearchRequest(q) req.Highlight = bleve.NewHighlightWithStyle("html") req.Fields = []string{"summary", "speaker"} res, err := index.Search(req) if err != nil { log.Fatal(err) } fmt.Println(res) Run

35

Page 37: Bleve - GitHub

Analysis Wizard

http://analysis.blevesearch.com

36

Page 38: Bleve - GitHub

Precision vs Recall

Precision - are the returned results relevant?

Recall - are the relevant results returned?37

Page 39: Bleve - GitHub

Faceted Search

1112131415

16171819202122232425262728

func main() {

index, err := bleve.Open("custom.bleve") if err != nil { log.Fatal(err)

}

q := bleve.NewMatchAllQuery() req := bleve.NewSearchRequest(q) req.Size = 0 req.AddFacet("categories", bleve.NewFacetRequest("category", 50)) res, err := index.Search(req) if err != nil { log.Fatal(err) } fmt.Println(res)} Run

38

Page 40: Bleve - GitHub

Optional HTTP Handlers

import "github.com/blevesearch/bleve/http"

All major bleve operations mapped

Assume JSON document bodies

See bleve-explorer sample app

https://github.com/blevesearch/bleve-explorer

39

Page 41: Bleve - GitHub

Putting it All Together

Page 42: Bleve - GitHub

FOSDEM Schedule Search

http://fosdem.blevesearch.com

41

Page 43: Bleve - GitHub

Performance

Page 44: Bleve - GitHub

Micro Benchmarks

Use Go benchmarks to test/compare small units of functionality in isolation.

$ go test -bench=. -cpu=1,2,4PASSBenchmarkBoltDBIndexing1Workers 1000 3075988 ns/opBenchmarkBoltDBIndexing1Workers-2 1000 4004125 ns/opBenchmarkBoltDBIndexing1Workers-4 500 4470435 ns/opBenchmarkBoltDBIndexing2Workers 500 3148049 ns/opBenchmarkBoltDBIndexing2Workers-2 1000 3336268 ns/opBenchmarkBoltDBIndexing2Workers-4 1000 3461157 ns/opBenchmarkBoltDBIndexing4Workers 1000 3642691 ns/opBenchmarkBoltDBIndexing4Workers-2 500 3130814 ns/opBenchmarkBoltDBIndexing4Workers-4 1000 3312662 ns/opBenchmarkBoltDBIndexing1Workers10Batch 1 1350916284 ns/opBenchmarkBoltDBIndexing1Workers10Batch-2 1 1493538328 ns/opBenchmarkBoltDBIndexing1Workers10Batch-4 1 1256294099 ns/opBenchmarkBoltDBIndexing2Workers10Batch 1 1393491792 ns/opBenchmarkBoltDBIndexing2Workers10Batch-2 1 1271605176 ns/opBenchmarkBoltDBIndexing2Workers10Batch-4 1 1343410709 ns/opBenchmarkBoltDBIndexing4Workers10Batch 1 1393552247 ns/opBenchmarkBoltDBIndexing4Workers10Batch-2 1 1144501920 ns/opBenchmarkBoltDBIndexing4Workers10Batch-4 1 1311805564 ns/opBenchmarkBoltDBIndexing1Workers100Batch 3 425731147 ns/opBenchmarkBoltDBIndexing1Workers100Batch-2 3 439312970 ns/op

43

Page 45: Bleve - GitHub

Bleve Bench

Long(er) running test, index real text from Wikipedia. Measure stats periodicaly,compare across time.

Does indexing performance degrade over time?

How does search performance relate to number of matching documents?

44

Page 46: Bleve - GitHub

Join the Community

Page 47: Bleve - GitHub

Community

#bleve is small/quiet room, talk to us real time

Discuss your use-case

Plan a feature implementation

Apache License v2.0, Report Issues, Submit Pull Requests46

Page 48: Bleve - GitHub

Contributors

47

Page 49: Bleve - GitHub

Roadmap

Result Sorting (other than score)

Better Spell Suggest/Fuzzy Search

Performance

Prepare for 1.0 Release

48

Page 50: Bleve - GitHub

Speaking

GopherCon India February 2015 (Speaking)

GopherCon July (Attending/Proposal to be Submitted)

Your Conference/Meetup Here!

49

Page 51: Bleve - GitHub

Thank you

Marty [email protected] (mailto:[email protected])

http://github.com/blevesearch/bleve (http://github.com/blevesearch/bleve)

@mschoch (http://twitter.com/mschoch)

@blevesearch (http://twitter.com/blevesearch)