Searching Chinese Patents: Challenges and Solutions When Building an Innovative Discovery Interface ERIC PUGH | [email protected] | @dep4b

Searching Chinese Patents Presentation at Enterprise Data World


DESCRIPTION

War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.


Page 1: Searching Chinese Patents Presentation at Enterprise Data World

Searching Chinese Patents: Challenges and Solutions When Building an Innovative Discovery Interface

ERIC PUGH | [email protected] | @dep4b

Page 2: Searching Chinese Patents Presentation at Enterprise Data World

Who am I?

• Principal at OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

Page 3: Searching Chinese Patents Presentation at Enterprise Data World

Co-Author

Next Edition May!

Page 4: Searching Chinese Patents Presentation at Enterprise Data World

Agilista

Page 5: Searching Chinese Patents Presentation at Enterprise Data World

Selected Customers

Page 6: Searching Chinese Patents Presentation at Enterprise Data World

Telling some war stories

Page 8: Searching Chinese Patents Presentation at Enterprise Data World

Risks

• Cloud new at USPTO

• Discovery is a tenuous concept

• Conflicting User Goals

• Fixed Budget: trade scope for budget/quality

Page 14: Searching Chinese Patents Presentation at Enterprise Data World

Telling some stories

➡ How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 15: Searching Chinese Patents Presentation at Enterprise Data World

Flow of understanding

Data -> Information -> Understanding

Page 16: Searching Chinese Patents Presentation at Enterprise Data World

Building “Discovery”

Engine

UX

Data

Tension

Page 17: Searching Chinese Patents Presentation at Enterprise Data World

Data

• Grok data at gut level

• Look for outliers

UX

• User Interviews

• Surveys

• Card Sorting

• Scenarios/Personas

Brainstorm

Mockups

Proof of concept

Page 18: Searching Chinese Patents Presentation at Enterprise Data World

Where to spend time?

UX / Engine / Data

40% / 20% / 40%

40% / 40% / 20% (We spent)

Page 19: Searching Chinese Patents Presentation at Enterprise Data World

Telling some stories

• How to inject “Discovery” into your app

➡ The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 20: Searching Chinese Patents Presentation at Enterprise Data World

Boy meets Girl Story

Page 21: Searching Chinese Patents Presentation at Enterprise Data World

Boy meets Girl Story

Metadata

Ingest Pipeline

Discovery UX

Content Files

Page 22: Searching Chinese Patents Presentation at Enterprise Data World

How we built it

EmberJS Single Page Search App

HTML

XML

JSON

Server Dashboard

GPSN UI (Bootstrap CSS)

Browsers

Mobile/Tablet

Third Party Application

Servers

S3 Bucket

Solr

Page 23: Searching Chinese Patents Presentation at Enterprise Data World

Solr as a NoSQL Datastore

• Used “atomic updates” to merge three source datasets into single final dataset.

• All text displayed in application stored in Solr.

• Dynamic schema supports many languages: English (en) and Chinese (cn) right now.
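To make the "atomic updates" bullet concrete, here is a minimal sketch of the JSON body Solr accepts for an atomic update that uses the `set` modifier. The document id and the field names (`title_en`, `title_cn`) are made up for illustration and are not GPSN's actual schema. The string escaping is deliberately minimal, and posting the body to Solr's update handler is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds the JSON body of a Solr atomic update that sets fields on one document. */
final class AtomicUpdateSketch {

    /** Minimal JSON string quoting: escapes backslashes and double quotes only. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Only the listed fields are changed; other fields of the document are kept. */
    static String setFields(String id, Map<String, String> fields) {
        StringJoiner doc = new StringJoiner(",", "[{", "}]");
        doc.add(quote("id") + ":" + quote(id));
        for (Map.Entry<String, String> e : fields.entrySet()) {
            doc.add(quote(e.getKey()) + ":{\"set\":" + quote(e.getValue()) + "}");
        }
        return doc.toString();
    }

    public static void main(String[] args) {
        // Hypothetical id and dynamic-field names, one update per source dataset.
        Map<String, String> english = new LinkedHashMap<>();
        english.put("title_en", "Golf glove");
        System.out.println(setFields("CN-0000001", english));

        Map<String, String> chinese = new LinkedHashMap<>();
        chinese.put("title_cn", "高尔夫手套");
        System.out.println(setFields("CN-0000001", chinese));
    }
}
```

Sending one such update per source dataset against the same id is one way three datasets could end up merged into a single document. It relies on the fields being stored, which fits the bullet about all displayed text living in Solr.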

Page 24: Searching Chinese Patents Presentation at Enterprise Data World

Lessons Learned

Page 25: Searching Chinese Patents Presentation at Enterprise Data World

Don’t Move Files

• Copying 5 TB data up to S3 was very painful.

• We used S3Funnel which is “rsync like”

• We bought more network bandwidth for our office
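A back-of-the-envelope sketch of why the copy hurt: the time to push 5 TB over a link at a few bandwidths. The bandwidth figures are assumptions chosen for illustration (the talk does not state the office's actual link speed), and protocol overhead and retries are ignored.

```java
/** Estimates wall-clock time to upload a dataset over a fixed-bandwidth link. */
final class TransferTimeSketch {

    /** Hours to move `terabytes` (decimal TB) at `megabitsPerSecond`, ignoring overhead. */
    static double hours(double terabytes, double megabitsPerSecond) {
        double bits = terabytes * 1e12 * 8;
        double seconds = bits / (megabitsPerSecond * 1e6);
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        // Assumed link speeds; only the 5 TB figure comes from the slide.
        for (double mbps : new double[] {20, 100, 1000}) {
            System.out.printf("5 TB at %.0f Mbit/s: about %.1f hours%n", mbps, hours(5, mbps));
        }
    }
}
```

At an assumed 100 Mbit/s the copy takes roughly 111 hours of uninterrupted transfer, which is the kind of number that makes buying bandwidth look cheap.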

Page 26: Searching Chinese Patents Presentation at Enterprise Data World

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

–Andrew Tanenbaum, 1981

Page 27: Searching Chinese Patents Presentation at Enterprise Data World

Data Size

[Chart: Patent Count by year, 1985 to 2011; y-axis from 0 to 1,000,000; labeled data point: 277871]

Page 28: Searching Chinese Patents Presentation at Enterprise Data World

Think about Data Volume

• Started with the older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.

• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!)

• 8 shards dropped time from 12 hours to 2 hours. Merging took 5!

• We had too many steps in our pipeline
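The sharding lesson can be sketched as a simple router that splits document ids into buckets, so each bucket can be indexed on its own machine. This is an illustrative stand-in using `String.hashCode()`. It does not reproduce Solr's own document routing, and the ids are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Assigns documents to index shards by hashing their ids. */
final class ShardRouterSketch {

    /** Picks a shard in [0, shardCount) from the document id. */
    static int shardFor(String docId, int shardCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    /** Splits ids into shardCount buckets that can be indexed independently. */
    static List<List<String>> partition(List<String> ids, int shardCount) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        for (String id : ids) {
            shards.get(shardFor(id, shardCount)).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add("CN" + i); // made-up patent ids
        }
        List<List<String>> shards = partition(ids, 8);
        for (int s = 0; s < shards.size(); s++) {
            System.out.println("shard " + s + ": " + shards.get(s).size() + " docs");
        }
    }
}
```

Because the assignment depends only on the id, re-running the pipeline sends each patent to the same shard, which matters when later passes update documents already indexed.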

Page 29: Searching Chinese Patents Presentation at Enterprise Data World

Building a Patents Index

[Chart: Machine Count (y-axis, 0 to 300) against index build time: 1 machine, 5 days; 5 machines, 3 days; 300 machines, 30 minutes]
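Reading the chart as 1 machine for 5 days, 5 machines for 3 days, and 300 machines for 30 minutes, a quick sketch of the total machine time each option consumes. Machine-hours are only a rough proxy for cloud cost here. Actual pricing and billing granularity are not given in the talk.

```java
/** Compares index-build options by total machine time consumed. */
final class MachineHoursSketch {

    /** Machine-hours = number of machines times wall-clock hours. */
    static double machineHours(int machines, double wallClockHours) {
        return machines * wallClockHours;
    }

    public static void main(String[] args) {
        int[] machines = {1, 5, 300};
        double[] hours = {5 * 24, 3 * 24, 0.5}; // 5 days, 3 days, 30 minutes
        for (int i = 0; i < machines.length; i++) {
            System.out.printf("%d machine(s) x %.1f h = %.0f machine-hours%n",
                    machines[i], hours[i], machineHours(machines[i], hours[i]));
        }
    }
}
```

On these figures the 300-machine run uses about 150 machine-hours against 120 for the single machine, so the wall-clock time collapses while the total compute stays in the same ballpark.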

Page 30: Searching Chinese Patents Presentation at Enterprise Data World

Key scaling concept behind GPSN:


Cloud meets Ocean

Page 31: Searching Chinese Patents Presentation at Enterprise Data World

More prosaically…

Database

Server

Server

Server

Client

Client

Client

$

$

$

$

Page 32: Searching Chinese Patents Presentation at Enterprise Data World

Telling some stories

• How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

➡ Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 33: Searching Chinese Patents Presentation at Enterprise Data World

Why so many pipelines?

Morphlines

Page 34: Searching Chinese Patents Presentation at Enterprise Data World

Tika as a pipeline?

Page 35: Searching Chinese Patents Presentation at Enterprise Data World

Lots of File Types

• Sometimes in ZIP archives, sometimes not!

• Multiple XML formats as well as CSV and EDI

• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…

Page 36: Searching Chinese Patents Presentation at Enterprise Data World

Tika as a pipeline!

• Auto detects content type

• Metadata structure has all the key/value needed for Solr

• Allows us to scale up with Behemoth project (and others!).

Page 37: Searching Chinese Patents Presentation at Enterprise Data World

Lots of files!

HHHHHT APS1 ISSUE - 760106
PATN
WKU 039302717
SRC 5
APN 5328756
APT 1
ART 353
APD 19741216
TTL Golf glove
ISD 19760106
NCL 4
ECL 1

<PatentGrant>
 <BibliographicData>
  <GrantIdentification>
   <DocumentKindCode>B1</DocumentKindCode>
   <GrantNumber>06644224</GrantNumber>
   <CountryCode>US</CountryCode>
   <IssueDateText>2003-11-11</IssueDateText>
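The tagged-line sample above maps naturally onto the key/value metadata that feeds Solr. A minimal sketch, assuming each field is a short uppercase tag followed by whitespace and a value. It ignores continuation lines and repeated sections and keeps only the first occurrence of a tag, so it is not a complete Greenbook parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Turns "TAG value" lines into a key/value map. */
final class GreenbookLinesSketch {

    /** Splits each line at the first run of whitespace; lines without a value are skipped. */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            String[] parts = line.trim().split("\\s+", 2);
            // Assumption: field tags are 3 or 4 uppercase letters (WKU, TTL, ISD, ...).
            if (parts.length == 2 && parts[0].matches("[A-Z]{3,4}")) {
                fields.putIfAbsent(parts[0], parts[1]);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = String.join("\n",
                "HHHHHT APS1 ISSUE - 760106",
                "PATN",
                "WKU 039302717",
                "TTL Golf glove",
                "ISD 19760106");
        parse(record).forEach((tag, value) -> System.out.println(tag + " = " + value));
    }
}
```

The header line and the bare `PATN` marker fall out because they do not match the tag-plus-value shape. What remains is a flat map that can be copied field by field into an index document.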

Page 38: Searching Chinese Patents Presentation at Enterprise Data World

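Detector to pick File

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // Greenbook files carry a "PATN" record marker near the top.
    private static final Pattern PATTERN = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        if (stream == null) {
            return MediaType.OCTET_STREAM; // nothing to sniff
        }
        // Peek at the first 1024 bytes; closing the lookahead resets the
        // underlying stream so the parser can read it from the start.
        try (InputStream lookahead = new LookaheadInputStream(stream, 1024)) {
            String extract = IOUtils.toString(lookahead, "UTF-8");
            return PATTERN.matcher(extract).find()
                    ? GreenbookParser.MEDIA_TYPE
                    : MediaType.OCTET_STREAM;
        }
    }
}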

Page 39: Searching Chinese Patents Presentation at Enterprise Data World

Telling some stories

• How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

➡ Don’t be Afraid to Share!

Page 40: Searching Chinese Patents Presentation at Enterprise Data World

Your BigData solution isn’t perfect

• Allow users to export data

• Most business users want to work in Excel. Accept it!

• Allow other applications to build on top of your application.

Page 41: Searching Chinese Patents Presentation at Enterprise Data World

GPSN has

• Lots of easy “Print to PDF” options.

• Data stored in S3 as:

• individual patent files

• chunky downloads.

• Filtering to expand or select specific data sets.

• Permalinks: simple, very sharable URLs.

• Underlying Solr service is exposed to public via proxy. You can query Solr yourself.

• Need advanced querying? Use Lucene syntax in the search bar.
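For anyone building on the exposed Solr service, a minimal sketch of assembling a select URL from a Lucene-syntax query. The host, core name, and field name below are placeholders, not GPSN's real endpoint or schema. `q`, `rows`, and `wt` are standard Solr request parameters. The `URLEncoder.encode(String, Charset)` overload needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds a Solr select URL for a Lucene-syntax query. */
final class SolrQueryUrlSketch {

    /** URL-encodes the query so quotes, colons, and spaces survive the trip. */
    static String selectUrl(String baseUrl, String luceneQuery, int rows) {
        return baseUrl + "/select?q="
                + URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8)
                + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        // Placeholder host, core, and field name.
        String url = selectUrl("https://example.org/solr/patents",
                "title_en:\"golf glove\"", 10);
        System.out.println(url);
    }
}
```

The same query string typed into the search bar and sent to the proxy should reach the same parser, which is also what makes permalinks easy to share.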

Page 42: Searching Chinese Patents Presentation at Enterprise Data World

One more thought...

Page 43: Searching Chinese Patents Presentation at Enterprise Data World

Measuring the impact of our algorithm changes is just getting harder with Big Data.

Page 44: Searching Chinese Patents Presentation at Enterprise Data World

www.quepid.com

Quepid: Give your Queries some Love

Project SolrPanl

We need beta users!

Page 45: Searching Chinese Patents Presentation at Enterprise Data World

Thank you!

Questions?

[email protected]

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask me later!