Workshop 1 - Stellenbosch Universityconferences.sun.ac.za/public/conferences/25/slides/IT-Section-2015... · future libraries planning Customer satisfaction improvements with better

1/90

2/90

Workshop 1

Big Data for Libraries

Kia Siang Hock

3/90

The Workshop Programme

14:00 Welcome

14:10 About National Library Board, Singapore

14:20 What is Big Data?

14:40 Big Data in Libraries

15:15 Break

15:45 Examples of Big Data Implementations: Recommendations, Text Analytics, Ngram Viewer, Named Entity Extraction, Image Matching

16:45 More Q&A

17:00 End of workshop

4/90

About the National Library Board, Singapore

5/90

Libraries & Archives

1 National Library 1 National Archives 26 Public Libraries

About National Library Board, Singapore

6/90

Vision

Readers for Life, Learning Communities, Knowledgeable Nation

Mission

We make knowledge come alive, spark imagination and create possibilities.


7/90


The Public Library seeks to be a social learning space that nurtures active readers and knowledge seekers, through the provision of relevant, timely and engaging library services and reading programmes, using physical and digital means.

Public Library Services

8/90


Only library in Singapore that comprehensively collects content published and distributed in the country, for preservation and long-term access

Enable easy access to the country’s Shared Memory to build rootedness and national identity

Forge International Collaborations and advise on library development

National Library

9/90


The National Archives of Singapore (NAS) is the official custodian of Singapore’s collective memory. Ranging from government files, private memoirs, historical maps and photographs to oral history interviews and audio-visual materials, the NAS is responsible for the collection, preservation and management of Singapore's public and private archival records.

The Asian Film Archive was founded to preserve the rich film heritage of Singapore and Asian Cinema, to encourage scholarly research on film, and to promote a wider critical appreciation of this art form.

http://www.asianfilmarchive.org/

10/90

A typical day in Singapore libraries…

79,000 people visit libraries

300 new members join the library

100,000 loans are made

27,000 people attend library programs and exhibitions

11/90


Libraries & Archives

1 National Library 1 National Archives 26 Public Libraries

Membership More than 2m members

Visits More than 27m visits

Collection More than 1m titles More than 8.5m items

Loans More than 35m loans

Online Usage Digital User Visits: > 11m e-Retrievals: > 70m

FY2013 figures

12/90

What is Big Data?

13/90

What is Big Data?

“data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

Oxford English Dictionary

“an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”

Wikipedia

“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”

McKinsey

“The ability of society to harness information in novel ways to produce useful insights or goods and services of significant values” and “… things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”

Viktor Mayer-Schonberger & Kenneth Cukier

“The broad range of new and massive data types that have appeared over the last decade or so.”

Tom Davenport

Source: http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/

14/90

What is Big Data?

Source: http://datascience.berkeley.edu/what-is-big-data/

Top recurrent themes in the definitions of Big Data by 40 thought leaders

15/90

The Four V’s of Big Data

Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data

16/90

The Fifth V: Values

Big is relative.

Five broad ways in which using Big Data can create value

Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey) http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

❶ Unlocks significant value by making information transparent and usable at much higher frequency

❷ Collects more accurate and detailed performance information

❸ Allows ever-narrower segmentation of customers

❹ Sophisticated analytics can substantially improve decision-making

❺ Improves the development of the next generation of products and services

17/90

IDA Infocomm Technology Roadmap

Opportunities

> Analysis of unstructured data such as images and audio on top of text data to unearth insights from a bigger data pool

> Insights from the data analytics outcomes to augment decision making processes

> Analytics (retrospective to predictive) to proactively identify opportunities or tackle problems

Challenges

> Understanding and framing Big Data problems

> Maturity in some of the underlying analytics algorithms

> Shortage of data analytics talent

Source: IDA’s Public Consultation on Infocomm Technology Roadmap 2012, 17 Aug 2012 http://www.ida.gov.sg/Technology/20060417212727.aspx

‘Big Data’ is a key technology theme that will shape the ICT landscape

18/90

Technology Stack Radar

< Hadoop MapReduce & distributed file system
< NoSQL DBMS
< Text Analytics
< Visualisation-based discovery
< In-memory analytics
< Audio analytics
< Predictive analytics
< Master data management
< SaaS-based business analytics

- Complex event processing
- Data-federation/visualisation
- Video analytics
- Mobile business analytics
- Non-volatile memory

Time horizons shown: < 3 years; 3-5 years

Source: IDA’s Public Consultation on Infocomm Technology Roadmap 2012, 17 Aug 2012 http://www.ida.gov.sg/Technology/20060417212727.aspx

IDA Infocomm Technology Roadmap

‘Big Data’ is a key technology theme that will shape the ICT landscape

19/90


20/90

Disclaimers

Not a comprehensive study of the use of big data in libraries.

A practitioner's high-level overview of the use of big data in libraries.

Does not cover big data issues such as data management, privacy and ownership.

21/90

Big Data Goals

Leverages NLB’s unique data assets

Actionable Insights

Better foresights for future libraries planning

Customer satisfaction improvements with better service offerings

Better usage of NLB services and resources

Unearthing the hidden treasures

Productivity gain with better decisions

Structured & Unstructured Data: Patrons, Books, Loans, Visits, Newspapers, DVDs, VCDs, E-databases, E-Books, Digitised newspapers, Demographics, Locations, Digitised books, Facebook pages, Events, Browse Count, Blogs, Tweets

22/90


Library Planning

Patron Profiling

Collection Optimisation

Business Operations

Digital Library

Service Delivery

23/90


Library Planning using Geospatial Analytics

Where are our users?
What do they read?
Are our libraries serving the residents in the vicinity?
Where shall we target our outreach campaign?
What is the impact on the usage of existing libraries when a new library opens?
Can our libraries cope with the population growth?

24/90


Patron Profiling & Footfall Analysis to Optimise Use of Library Space

Crowd Density Audience Profiling Human Traffic Flow

Source: Video Analytics as a Service http://vaaas.kaisquare.com/

25/90


Measuring & Analysing Energy Consumption using Smart Meters

26/90


Collection Optimisation – Collection Planning

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf

27/90


Collection Optimisation – Collection Planning

Collection Planning Model

Forecast of usage

Cost of books

Shelf space

Initial collection

Available budget

Min/max collection size

Planned Budget

Planned Acquisition

Planned Weeding

Planned Space

Projected Loans

Planned Final Collection


28/90


Collection Optimisation – Demand Forecast


29/90


Business Operations (Corporate KPIs, Finance, HR)


30/90


Library Analytics Toolkit

Source: https://osc.hul.harvard.edu/liblab/projects/library-analytics-toolkit

The Library Analytics Toolkit is a dashboard that pulls library data together in a way that allows both librarians and library users to identify and respond to trends and changes in collections, usage, and other data

31/90


Integrated & Operational Analytics

~20% of items are borrowed within 3 days of their return

Auto-sorter

Just Return Bin

Patrons can easily access popular items

Libraries can reduce resources for shelving

32/90


Digital Library - Curation

• For staff to generate pathfinders
• An easier publishing tool for staff curation of content
• Crowdsensing of user interests interfaces with NLB content and pushes recommended content back to patron.

Find Curate Publish

Analyse search keywords at NLB websites

User Interests

Analyse search results & click-throughs

Collection Gaps

Building relationships between entities

Relationships

Analyse social media & relevant websites

Trending topics

33/90


Digital Library – Contextual Discovery

The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in Esplanade Park, the Lim Bo Seng Memorial and the Tan Kim Seng Fountain…

Gwee Peng Kwee (Oral History)

Dalhousie Obelisk (Article) Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall 'needle-like' monument...

Lim Bo Seng (Article) Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June 1944, Perak, Malaya) was a prominent ...

Master Plan for Singapore - Central Area (1958)

Singapore’s War Memorial to the Glorious Dead (11 Nov 1920)

Lest we forget (8 Nov 1953)

Singapore students learn to care about history (13 Jul 1997)

Arrival of the Prince (31 Mar 1922)

Singapore’s War Memorial (21 Sep 1921)

Newspaper articles

His daily routine school… Laying of foundation stone and unveiling of Cenotaph…

34/90


Digital Library – Content Analytics to search the ‘Un-searchable’

Image/video search Voice-to-text

Named entity recognition

Buildings

People

Streets

Dates

Organisations

welcome 欢迎

Selamat datang

நல்வரவு

Cross-language discovery

35/90

Examples of Big Data Implementations

36/90

Recommending Good Reads to Patrons

Source: http://www.amazon.com/

37/90

Patron also borrowed these titles…

> 33m loans a year
> 2m patrons
Recommendations tailored to NLB patrons

Flap Your Wings by P. D. Eastman

1,070 patrons borrowed this title:

M00014123D M00025872A M00032776C M00032897A M00039928K M00040123B M00042334H M00045167I M00051921E M00056997H . . .

Recommendations

402 patrons

289 patrons

260 patrons

Other titles the 1,070 patrons borrowed:

9342951 12547108 12910631 13085283 . . .

8734188 10247657 13046840 13085283 . . .

Collaborative filtering

38/90

A Simple Implementation

loans table columns: book_id, patron_id

SELECT book_id, COUNT(*)
FROM loans
WHERE patron_id IN (SELECT patron_id FROM loans WHERE book_id = 5127546)
GROUP BY book_id
ORDER BY 2 DESC
LIMIT 20;

Patrons who borrowed title ‘5127546’ also borrowed these other titles:

+----------+----------+
| book_id  | count(*) |
+----------+----------+
|  5127546 |      115 |
|  3652671 |       23 |
|  9136504 |       21 |
|  3857787 |       20 |
|  6132951 |       19 |
|  4235852 |       19 |
|  3049673 |       18 |
| 12863855 |       18 |
|  4624247 |       18 |
|  4643539 |       18 |
|  3718516 |       18 |
|  5018345 |       18 |
|  2908246 |       17 |
|  4235878 |       17 |
|  2085361 |       17 |
|  3718517 |       17 |
|  3260602 |       16 |
|  4317577 |       16 |
|  9043935 |       16 |
|  6373666 |       16 |
+----------+----------+
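The "also borrowed" query can be tried end to end with Python's built-in sqlite3 module. What follows is a minimal sketch rather than NLB's implementation: the toy loans rows, the patron and book ids, and the helper name also_borrowed are all made up for illustration. Unlike the query on the slide, it leaves out the seed title itself and counts distinct patrons rather than loan rows.

```python
import sqlite3


def also_borrowed(conn, book_id, limit=20):
    """Return (book_id, patron_count) pairs for titles co-borrowed with book_id."""
    sql = """
        SELECT book_id, COUNT(DISTINCT patron_id) AS patrons
        FROM loans
        WHERE patron_id IN (SELECT patron_id FROM loans WHERE book_id = ?)
          AND book_id <> ?          -- drop the seed title from its own list
        GROUP BY book_id
        ORDER BY patrons DESC, book_id ASC
        LIMIT ?
    """
    return conn.execute(sql, (book_id, book_id, limit)).fetchall()


# Toy data (hypothetical ids): three patrons borrowed title 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (patron_id TEXT, book_id INTEGER)")
toy_loans = [
    ("P1", 100), ("P1", 200), ("P1", 300),
    ("P2", 100), ("P2", 200),
    ("P3", 100), ("P3", 300), ("P3", 200),
    ("P4", 400),
]
conn.executemany("INSERT INTO loans VALUES (?, ?)", toy_loans)

recommendations = also_borrowed(conn, 100)
print(recommendations)
```

Counting distinct patrons keeps a single patron who renews or re-borrows a title many times from inflating its rank.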

39/90

A Simple Implementation

Title level recommendations Patron level recommendations

NLB Mobile app

40/90

Contextual discovery via text analytics

41/90

The ability to mine unstructured data is key to an organisation’s competitive advantage

[Chart: all digital data vs. unstructured digital data, with storage cost per GB (US$), for 2005, 2010 and 2015]

Data volumes shown: 130 EB, 1,227 EB, 7,910 EB

Storage cost per GB (US$) values shown: $20, $10, $0.50

1 EB (Exabyte) = 1,000,000 TB

90% - unstructured data

68% of all unstructured data in 2015 will be created by consumers

Source: IDC’s Digital Universe Study, sponsored by EMC, Jun 2011

IDC - Singapore National Library Transforms Structured and Unstructured Data Into Insights on Cloud, 2014

“For companies who are looking to derive insights from nontransactional data sources with Big Data analytics and cloud technologies, NLB's analytics journey shines the light on approaches and lessons learned in deriving value from both transactional and nontransactional data sources on cloud infrastructure.”

42/90

The Growing Digital Collection

Digitised books

Historic newspapers

Images

Oral history recordings

Audio-visual recordings

Other collections

Infopedia articles

Web Archives

Singapore Memories

Music

Posters

Building Plans

Govt Records

Private Records Maps

43/90

NLB users retrieve tens of millions of e-content items every year

It would be really nice if we could convert every single e-retrieval instance into an enriching discovery experience for every single user every time…

44/90

Contextual Discovery

NLB users collectively contribute to tens of millions of e-retrievals every year

The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in…

Gwee Peng Kwee His daily routine school… Laying of foundation stone and unveiling of Cenotaph…

Dalhousie Obelisk (Article) Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall…

Lim Bo Seng (Article) Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June…

Master Plan for Singapore - Central Area (1958)

Singapore’s War Memorial to the Glorious Dead (11 Nov 1920)

Lest we forget (8 Nov 1953)

Singapore students learn to care about history (13 Jul 1997)

Arrival of the Prince (31 Mar 1922)

Singapore’s War Memorial (21 Sep 1921)

Newspaper articles

http://eresources.nlb.gov.sg/infopedia/OpenNLBCMSthumbnail.aspx?id=b0ae5759-c9c4-4630-ab12-00c24ff67850

45/90

Using text analytics to automatically identify related content

Text tokenised; tokens parsed and weighted (TF/IDF)

Text tokenised; tokens parsed and weighted (TF/IDF)

Weighted tokens similarity computed

Similarity = 0.295
















46/90

Using Mahout to identify related content

Scalable, commercial-friendly, machine learning for building intelligent applications

Use cases:
• Recommendation: User Info + Community Info
• Classification: Places new items into categories
• Clustering: Groups documents based on the notion of similarity
• Frequent Itemset Mining: Analyzes items in a group and then identifies which items typically appear together

What is Apache Mahout?

http://mahout.apache.org/

47/90


The steps

Obtain the text files

•Obtain the text of the content to be analysed, one file per item. Put them in the “datafiles” folder.

Create the sequence files

•mahout seqdirectory -c UTF-8 -i datafiles -o seqfiles

Create TF/IDF weighted vectors

•mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -nv

Get the similarity results

•mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix

•mahout rowsimilarity -i matrix/matrix -o similarity -similarityClassName SIMILARITY_COSINE -m -ess

48/90


The steps





•mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -nv






Sequence file is a binary key-value file format used extensively in Mahout and Hadoop.

49/90


The steps





•mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -wt tfidf -nv

Get the similarity results

Key parameters:

-s Min support of the term in the entire collection

-md Min document frequency

-x Max document frequency percentage

-ng Maximum size of the n-grams

50/90


The steps

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Sources: https://en.wikipedia.org/wiki/Tf%E2%80%93idf http://filotechnologia.blogspot.sg/2014/01/a-simple-java-class-for-tfidf-scoring.html http://criminalintent.org/2011/01/rapid-prototyping-with-mathematica/
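To make the TF-IDF definition concrete, here is a small pure-Python sketch. It assumes the common textbook variant (term frequency divided by document length, multiplied by the natural log of N over document frequency); Mahout's seq2sparse applies its own weighting formula, so the numbers will not match its output. The function name tf_idf and the three toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the cenotaph is a war memorial".split(),
    "the obelisk is a landmark".split(),
    "the war ended".split(),
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

A term such as "the", which occurs in every document, gets weight zero, while a term unique to one document ("cenotaph") scores highest. This is the same effect the -x and -md pruning options aim for at the dictionary level.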

51/90


The steps





•mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -wt tfidf -nv

•mahout rowsimilarity -i matrix/matrix -o similarity -similarityClassName SIMILARITY_COSINE -m 10


52/90


The steps

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle.

① Julie loves me more than Linda loves me

② Jane likes me more than Julie loves me

Term frequency:

① Julie (1) loves (2) me (2) more (1) than (1) Linda (1) likes (0) Jane (0)

② Julie (1) loves (1) me (2) more (1) than (1) Linda (0) likes (1) Jane (1)

$$\text{sim}(①, ②) = \frac{1\cdot1 + 2\cdot1 + 2\cdot2 + 1\cdot1 + 1\cdot1 + 1\cdot0 + 0\cdot1 + 0\cdot1}{\sqrt{1^2+2^2+2^2+1^2+1^2+1^2+0^2+0^2}\,\sqrt{1^2+1^2+2^2+1^2+1^2+0^2+1^2+1^2}} = \frac{9}{\sqrt{12}\,\sqrt{10}} \approx 0.822$$
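The worked example can be checked with a few lines of Python. This is a sketch on raw term counts (no TF-IDF weighting, no stemming), and the function name cosine_similarity is an illustrative choice, not a Mahout API.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


s1 = Counter("Julie loves me more than Linda loves me".split())
s2 = Counter("Jane likes me more than Julie loves me".split())
similarity = cosine_similarity(s1, s2)
print(round(similarity, 3))  # 0.822
```

Only terms present in both sentences contribute to the dot product, which is why sparse dictionaries are enough here and no full vocabulary vector needs to be built.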

53/90


The results

• mahout seqdumper -i similarity > similarity.txt

Key: 0: Value:

{14458:0.2966480826934176,

11399:0.30290014772966095,

12793:0.22009858979452146,

3275:0.1871791030103281,

14613:0.3534278632679437,

4411:0.2516380602790199,

17520:0.3139731583634198,

13611:0.18968888212315968,

14354:0.17673965754661425,

0:1.0000000000000004}

Key: 1: Value:

...

Article ID

Similarity score between article 0 and article 14458 is 0.297

Similarity scores are between 0 and 1
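A dump like the one above is easier to use once it is parsed into a ranked list per article. The sketch below assumes the text layout is exactly as shown on the slide ("Key: N: Value: {id:score,...}"); real seqdumper output may carry extra header or footer lines, which this pattern simply ignores. The name parse_similarity_dump is hypothetical.

```python
import re

# One row per article: "Key: <id>: Value: {<id>:<score>,...}"
ROW_PATTERN = re.compile(r"Key:\s*(\d+):\s*Value:\s*\{([^}]*)\}")


def parse_similarity_dump(text):
    """Map each article id to [(other_id, score), ...], best match first."""
    result = {}
    for key, body in ROW_PATTERN.findall(text):
        pairs = []
        for item in body.split(","):
            item = item.strip()
            if not item:
                continue
            other, score = item.split(":")
            if int(other) != int(key):  # skip the article's match with itself
                pairs.append((int(other), float(score)))
        result[int(key)] = sorted(pairs, key=lambda p: p[1], reverse=True)
    return result


sample = """Key: 0: Value:
{14458:0.2966480826934176,
11399:0.30290014772966095,
14613:0.3534278632679437,
0:1.0000000000000004}"""

related = parse_similarity_dump(sample)
top_id, top_score = related[0][0]
print(top_id, round(top_score, 3))
```

The self-match (0:1.0000000000000004) is dropped so that the first entry is always the most similar other article.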

54/90

An event unfolds…

55/90

Handling large data sets

Online resource of current and historic Singapore and Malaya newspapers, including The Straits Times, The Business Times, 星洲日报, 南洋商报, 联合早报, Berita Harian and TODAY

Over 20,000,000 articles published, and growing

NewspaperSG

56/90

Using clustering to handle large datasets

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)

Mahout K-Means Clustering with Cosine Distance



































57/90

Using Mahout to cluster related content

58/90


mahout kmeans -i vectors/tfidf-vectors/ -c initial-clusters -o kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl

-k Number of clusters

-x Maximum number of iterations

-cd Threshold of convergence (default: 0.5)

-dm Distance measurement (default: SquaredEuclidean)

Source: https://mahout.apache.org/users/clustering/k-means-commandline.html
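The parameters above map onto a very small algorithm. The following pure-Python sketch shows k-means with cosine distance on dense toy vectors, with k, max_iter and delta playing the roles of -k, -x and -cd. It is a simplification under stated assumptions: it seeds with the first k points (Mahout selects random vectors), runs on a single machine, and is not Mahout's code.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def kmeans(points, k, max_iter=20, delta=0.1):
    centroids = [list(p) for p in points[:k]]  # simplification: first k points as seeds
    assignment = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        moved = max(cosine_distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved < delta:  # converged: no centroid moved further than delta
            break
    return assignment, centroids


vectors = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.9], [0.9, 0.2, 0.0], [0.1, 0.8, 1.0]]
labels, centers = kmeans(vectors, k=2)
print(labels)
```

With TF-IDF vectors as input, each resulting cluster groups articles that share vocabulary, so similarity only needs to be computed within a cluster rather than across the whole collection.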

59/90

Handling large data sets

Size of cluster

Top 40 stemmed terms

52,678 exhibit, art, artist, paint, museum, singapor, work, displai, galleri, open, mr, year, organis, centr, chines, cultur, held, on, nation, world, pictur, photograph, collect, hall, colour, includ, first, time, featur, sculptur, design, dai, peopl, piec, two, societi, part, visitor, fair, intern

86,881 olymp, athlet, game, sport, medal, event, team, gold, championship, record, world, singapor, metr, swim, won, year, champion, win, nation, women, time, coach, asian, meet, competit, train, swimmer, two, second, race, compet, first, amateur, bronz, intern, associ, best, finish, yesterdai, silver

142,289 school, student, educ, teacher, univers, secondari, singapor, children, year, primari, studi, pupil, parent, mr, teach, colleg, cours, ministri, english, languag, chines, on, institut, examin, time, learn, princip, train, programm, work, graduat, help, nation, class, two, govern, girl, scienc, boi, first

125,629 polic, arrest, offic, suspect, two, men, yesterdai, man, investig, report, found, mr, gang, road, raid, on, station, crime, detain, year, night, arm, robberi, car, believ, peopl, forc, charg, singapor, robber, hous, todai, told, stolen, seiz, spokesman, old, held, four, escap

67 clusters averaging 93,000 articles each

Worked well for Chinese and Malay articles too

Close to 1 billion associations identified

Using automatic clustering technique

60/90

Implementations

Implemented in Jul 2013

Implemented in Sep 2013

Implemented in Nov 2013

Implemented in Jul 2014

Infopedia

PictureSG

NewspaperSG

61/90

NLB’s Hadoop cluster for text analytics

VM hosts: VM Host 1, VM Host 2, VM Host 3

Server 1: Cluster Mgr
Server 2: JobTracker
Server 3: NameNode
Server 4: Checkpoint
Server 5: TaskTracker
Server 6: TaskTracker
Server 7: DataNode
Server 8: TaskTracker
Server 9: TaskTracker
Server 10: DataNode
Server 11: TaskTracker
Server 12: TaskTracker
Server 13: DataNode

62/90

Benefits of Contextual Discovery

Referrals from Infopedia: 0.14%; 10.65% (after 6 months)

Pageviews per month: 37,841; 84,341

Pageviews per visit: 3.64; 6.41

Infopedia

PictureSG

63/90

N-gram Viewer

64/90

Google Books Ngram Viewer

Graph showing how phrases have occurred in a corpus over time

https://books.google.com/ngrams
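The idea behind an n-gram viewer can be sketched in a few lines: for each year, count how often a phrase occurs relative to all n-grams of the same length published that year. The code below is a toy illustration under simple assumptions (whitespace tokenisation, lower-casing, a three-document corpus made up from headline-like text); it is not how Google Books or Bookworm is implemented, and ngram_frequency_by_year is a hypothetical name.

```python
from collections import Counter


def ngram_frequency_by_year(docs, phrase):
    """docs: iterable of (year, text). Returns {year: relative frequency of phrase}."""
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = Counter()
    totals = Counter()
    for year, text in docs:
        tokens = text.lower().split()
        # All consecutive n-token windows in this document.
        grams = list(zip(*(tokens[i:] for i in range(n))))
        totals[year] += len(grams)
        hits[year] += sum(1 for g in grams if g == target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


corpus = [
    (1921, "the war memorial was unveiled"),
    (1921, "a memorial service"),
    (1953, "lest we forget the war memorial and the war dead"),
]
series = ngram_frequency_by_year(corpus, "war memorial")
print(series)
```

Plotting the returned values against the year gives the familiar n-gram curve; dividing by the yearly total is what makes years with very different amounts of text comparable.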

65/90

Ngram viewer using Bookworm

Open source software from Culturomics

http://bookworm.culturomics.org/

66/90

Ngram viewer using Bookworm

To create an Ngram Viewer for your collection

Metadata Catalog

• /metadata/jsoncatalog.txt

• The list of the metadata for each text

Field Descriptions

• /metadata/field_descriptions.json

• Describes the properties of each available metadata field

Raw Text

• /texts/raw/*.txt

• The text files in your collection (in .txt format)

67/90


Metadata Catalog



Field Descriptions



Raw Text



3 required fields:

Key | Description of Value
filename | The filename of the corresponding text file (with .txt omitted and no whitespace in the name).
date | The date corresponding to a text file. Dates which are not integers should be specified as a string in the format: YYYY-MM-DD.
searchstring | The HTML code displayed for a text when points are clicked on in the ngram graph.

{"filename": "s1541-104", "date": "1997-2-7", "searchstring": "A bill to extend, reform, and improve agricultural commodity, trade, conservation, and other programs, and for other purposes. | Read at: <a href=\"http://www.govtrack.us/congress/bills/104/s1541\" target=\"_blank\">govtrack.us</a>"}
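Catalog entries like the one above are easy to generate from existing library metadata. The sketch below builds one JSON line per text and enforces the two filename rules from the table. It assumes jsoncatalog.txt holds one JSON object per line, and the helper name catalog_line and the "enacted" extra field are used purely for illustration; it has not been checked against Bookworm's own ingestion tooling.

```python
import json

REQUIRED_FIELDS = ("filename", "date", "searchstring")


def catalog_line(filename, date, searchstring, **extra):
    """Build one JSON catalog entry; extra keyword arguments become extra metadata fields."""
    if filename.endswith(".txt") or any(ch.isspace() for ch in filename):
        raise ValueError("filename must omit .txt and contain no whitespace")
    entry = {"filename": filename, "date": date, "searchstring": searchstring}
    entry.update(extra)
    return json.dumps(entry)


line = catalog_line(
    "s1541-104",
    "1997-02-07",
    'A bill to extend, reform, and improve agricultural commodity programs. '
    '| Read at: <a href="http://www.govtrack.us/congress/bills/104/s1541" '
    'target="_blank">govtrack.us</a>',
    enacted="false",  # hypothetical optional field
)
print(line)
```

Using json.dumps takes care of escaping the double quotes inside the HTML search string, which is the part most easily broken when such lines are written by hand.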

68/90


Metadata Catalog



Field Descriptions



Raw Text



Key Description

field The name of the metadata variable.

datatype The type of the data: searchstring, time, categorical, etc.

type The format of the data: integer, decimal, character, text.

unique Whether any given text can have only one type of this field (e.g. title) or not (e.g. subject).

{"datatype": "searchstring", "field": "searchstring", "unique": true, "type": "text"},
{"datatype": "categorical", "field": "enacted", "unique": false, "type": "text"},
{"datatype": "time", "field": "date", "unique": true, "type": "character", "derived":[{"resolution":"year"}]}

69/90


Metadata Catalog



Field Descriptions



Raw Text



Example Files: /texts/raw/s1541-104.txt /texts/raw/hr2854-104.txt

70/90

Ngram viewer using Bookworm Demonstration

http://bookworm.culturomics.org/congress/


71/90

Named Entity Recognition

72/90

Automatic extraction of time-based and location related information

12 Aug 1956

07 Sep 1971

30 Mar 1988

26 Jul 1992

16 Aug 2002

11 Feb 2009

Users navigate through old images of Singapore buildings, streets, satellite images and events via augmented reality apps

Resources can be mapped for contextual discovery

Resources are time-stamped for discovery on a time-line

Time and location are two of the most fundamental ways we organise things. The automatic extraction of geo- and time-based references from the full-text can yield more data than through manual tagging.

73/90

Natural Language Processing (NLP)

Typical Steps of NLP

Sentence detection → Tokenization → Part-of-speech tagging → Named-entity detection

Sentence detection: Hi, How are you? This is Mike. → ①Hi, How are you? ②This is Mike.

74/90



Sentence detection



This is Mike. → ①This ②is ③Mike


75/90



Sentence detection





①This (DT) ②is (VBZ) ③Mike (NNP)

76/90



Sentence detection





①This (DT) ②is (VBZ) ③Mike (NNP)

Mike [person] was in Singapore [location] on 3rd October [date]

This is Mike [person]
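The first two steps of the pipeline can be approximated with regular expressions. This is a deliberately naive sketch using the slides' own example sentence; production toolkits use trained models or rule sets, and a pattern this simple will split wrongly on abbreviations such as "Mr." or "St.". The function names are illustrative.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sentences = split_sentences("Hi, How are you? This is Mike.")
tokens = tokenize(sentences[1])
print(sentences)
print(tokens)
```

Unlike the slide, this tokenizer keeps the final full stop as its own token; whether punctuation is kept or dropped is a design choice that depends on what the later tagging steps expect.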

77/90


Named Entity Recognition using GATE/ANNIE

General Architecture for Text Engineering

Developed at the University of Sheffield in 1995

A Java suite of tools, GUI & library
Provides means of analyzing text
Makes computers analyze and understand the language that humans use naturally

Plugin to support different languages

ANNIE

A Nearly-New IE system
IE: Information Extraction
Distributed in GATE

78/90



Dates in Infopedia articles (http://eresources.nlb.gov.sg/infopedia/)

79/90



Handling local street and building names using ‘Gazetteers’
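A gazetteer is essentially a lookup list of known names with a label attached. The sketch below shows the idea in plain Python, combined with a regular expression for dates in the "31 March 1922" style. The place names are taken from the Cenotaph text earlier in the deck; the labels, the dictionary layout and the function name extract_entities are assumptions for illustration and do not reflect GATE's gazetteer file format.

```python
import re

# Hypothetical mini-gazetteer: known local names and the label to assign.
GAZETTEER = {
    "Esplanade Park": "location",
    "Connaught Drive": "location",
    "Empress Place": "location",
    "Cenotaph": "building",
}

DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def extract_entities(text):
    """Return (value, label) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(value, label) for _, value, label in sorted(found)]


text = ("The Cenotaph, located at Esplanade Park along Connaught Drive, "
        "was unveiled on 31 March 1922.")
entities = extract_entities(text)
print(entities)
```

A general-purpose model is unlikely to know local street and building names, which is why a curated list gives better recall here; the trade-off is that the list has to be maintained.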

80/90


Some other options for NLP

http://www.alchemyapi.com/

Free for up to 1,000 transactions per day

81/90


Some other options for NLP

http://new.opencalais.com/

Free for up to 5,000 submissions per day

82/90

Image matching

83/90

Visual Search & Discovery

Visual Search: User uploads an old photo to search for similar images

Upload another image

128 similar images found:

Visual Discovery: Images without metadata descriptions; cannot use text mining to cluster similar images

84/90


Image Database

Query Image

Feature Detector & Descriptor Extractor

Features (Super Matrix)

Feature Detector & Descriptor Extractor

Descriptor Matcher Similar Images

Algorithms:
SIFT (Scale Invariant Feature Transform)
SURF (Speeded Up Robust Features)
FAST (Features from Accelerated Segment Test)
BRIEF (Binary Robust Independent Elementary Features)
ORB (Oriented FAST and Rotated BRIEF)
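The descriptor-matcher stage of the pipeline can be illustrated without any imaging library. Binary descriptors such as BRIEF and ORB are bit strings compared by Hamming distance, so the sketch below represents each descriptor as a small integer and ranks database images by how many query descriptors find a close match. Everything here is a toy assumption: the 8-bit descriptors, the image names and the max_distance threshold are made up, and feature detection itself (which a library such as OpenCV would provide) is not shown.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def count_matches(query, candidate, max_distance=2):
    """Count query descriptors whose nearest candidate descriptor is close enough."""
    matches = 0
    for q in query:
        best = min(hamming(q, c) for c in candidate)
        if best <= max_distance:
            matches += 1
    return matches


def rank_images(query, database, max_distance=2):
    """Return (image_name, match_count) pairs, best match first."""
    scores = {
        name: count_matches(query, descriptors, max_distance)
        for name, descriptors in database.items()
        if descriptors  # skip images with no descriptors
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy 8-bit descriptors; real ORB descriptors are 256 bits long.
database = {
    "photo_a": [0b10110010, 0b01101100, 0b11110000],
    "photo_b": [0b00001111, 0b01010101, 0b00110011],
}
query = [0b10110011, 0b01101000, 0b11100000]
ranking = rank_images(query, database)
print(ranking)
```

This brute-force comparison is quadratic in the number of descriptors, so a large image database would normally use an index over the descriptors rather than a linear scan.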

85/90


Options for image matching

OpenCV (http://opencv.org/)

Free open-source library based on BSD license

Image processing, computer vision and machine learning

Supports large number of algorithms

Key Features
• Optimized for real time image processing and computer vision applications
• Interfaces to C++, C, C#, Java, Python
• Runs on Windows, Linux, Mac, iOS and Android

ViSenze (https://visenze.com/)

Hosted on cloud
Scalable image database
Search returns result in milliseconds
APIs available

86/90

Questions?

Kia Siang Hock

87/90

Backup slides

88/90


Parameters for seq2sparse command

Option | Flag | Description | Default value

Overwrite (bool) | -ow | If set, the output folder is overwritten. If not set, the output folder is created if the folder doesn’t exist. If the output folder does exist, the job fails and an error is thrown. Default is unset. | NA

Lucene analyzer name (String) | -a | The class name of the analyzer to use. | org.apache.lucene.analysis.standard.StandardAnalyzer

Chunk size (int) | -chunk | The chunk size in MB. For large document collections (sizes in GBs and TBs), you won’t be able to load the entire dictionary into memory during vectorization, so you can split the dictionary into chunks of the specified size and perform the vectorization in multiple stages. It’s recommended you keep this size to 80 percent of the Java heap size of the Hadoop child nodes to prevent the vectorizer from hitting the heap limit. | 100

Weighting (String) | -wt | The weighting scheme to use: tf for term-frequency based weighting and tfidf for TFIDF based weighting. | tfidf

Minimum support (int) | -s | The minimum frequency of the term in the entire collection to be considered as a part of the dictionary file. Terms with lesser frequency are ignored. | 2

Minimum document frequency (int) | -md | The minimum number of documents the term should occur in to be considered a part of the dictionary file. Any term with lesser frequency is ignored. | 1

89/90


Parameters for seq2sparse command

Option | Flag | Description | Default value

Max document frequency percentage (int) | -x | The maximum number of documents the term should occur in to be considered a part of the dictionary file. This is a mechanism to prune out high frequency terms (stop-words). Any word that occurs in more than the specified percentage of documents is ignored. | 99

N-gram size (int) | -ng | The maximum size of n-grams to be selected from the collection of documents. | 1

Minimum log-likelihood ratio (LLR) (float) | -ml | This flag works only when n-gram size is greater than 1. Very significant n-grams have large scores, such as 1000; less significant ones have lower scores. Although there’s no specific method for choosing this value, the rule of thumb is that n-grams with a LLR value less than 1.0 are irrelevant. | 1.0

Normalization (float) | -n | The normalization value to use in the Lp space. A detailed explanation of normalization is given in section 8.4. The default scheme is to not normalize the weights. | 0

Create sequential access sparse vectors (bool) | -seq | If set, the output vectors are created as SequentialAccessSparseVectors. By default the dictionary vectorizer generates RandomAccessSparseVectors. The former gives higher performance on certain algorithms like k-means and SVD due to the sequential nature of vector operations. By default the flag is unset. | NA

90/90


Parameters for kmeans command

Option | Description

--input (-i) input | Path to job input directory. Must be a SequenceFile of VectorWritable

--clusters (-c) clusters | The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first

--output (-o) output | The directory pathname for output.

--distanceMeasure (-dm) distanceMeasure | The classname of the DistanceMeasure. Default is SquaredEuclidean

--convergenceDelta (-cd) convergenceDelta | The convergence delta value. Default is 0.5

--maxIter (-x) maxIter | The maximum number of iterations.

--maxRed (-r) maxRed | The number of reduce tasks. Defaults to 2

--k (-k) k | The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.

--overwrite (-ow) | If present, overwrite the output directory before running job

--clustering (-cl) | If present, run clustering after the iterations have taken place
