
Workshop 1 - Stellenbosch Universityconferences.sun.ac.za/public/conferences/25/slides/IT-Section-2015... · future libraries planning Customer satisfaction improvements with better


Workshop 1

Big Data for Libraries

Kia Siang Hock


The Workshop Programme

14:00 Welcome

14:10 About National Library Board, Singapore

14:20 What is Big Data?

14:40 Big Data in Libraries

15:15 Break

15:45 Examples of Big Data Implementations: Recommendations, Text Analytics, Ngram Viewer, Named Entity Extraction, Image Matching

16:45 More Q&A

17:00 End of workshop


About the National Library Board, Singapore


Libraries & Archives

1 National Library, 1 National Archives, 26 Public Libraries

About National Library Board, Singapore


Vision

Readers for Life, Learning Communities, Knowledgeable Nation

Mission

We make knowledge come alive, spark imagination and create possibilities.

About National Library Board, Singapore


About National Library Board, Singapore

The Public Library seeks to be a social learning space that nurtures active readers and knowledge seekers, through the provision of relevant, timely and engaging library services and reading programmes, using physical and digital means.

Public Library Services


About National Library Board, Singapore

The only library in Singapore that comprehensively collects content published and distributed in the country, for preservation and long-term access

Enables easy access to the country’s shared memory to build rootedness and national identity

Forges international collaborations and advises on library development

National Library


About National Library Board, Singapore

The National Archives of Singapore (NAS) is the official custodian of Singapore’s collective memory. Ranging from government files, private memoirs, historical maps and photographs to oral history interviews and audio-visual materials, the NAS is responsible for the collection, preservation and management of Singapore's public and private archival records.

The Asian Film Archive was founded to preserve the rich film heritage of Singapore and Asian cinema, to encourage scholarly research on film, and to promote a wider critical appreciation of this art form.


A typical day in Singapore libraries…

79,000 people visit libraries

300 new members join the library

100,000 loans are made

27,000 people attend library programs and exhibitions


About National Library Board, Singapore

Libraries & Archives

1 National Library, 1 National Archives, 26 Public Libraries

Membership: More than 2m members

Visits: More than 27m visits

Collection: More than 1m titles, more than 8.5m items

Loans: More than 35m loans

Online Usage: Digital User Visits > 11m, e-Retrievals > 70m

FY2013 figures


What is Big Data?


What is Big Data?

“data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

Oxford English Dictionary

“an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”

Wikipedia

“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”

McKinsey

“The ability of society to harness information in novel ways to produce useful insights or goods and services of significant values” and “… things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”

Viktor Mayer-Schönberger & Kenneth Cukier

“The broad range of new and massive data types that have appeared over the last decade or so.”

Tom Davenport

Source: http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/


What is Big Data?

Source: http://datascience.berkeley.edu/what-is-big-data/

Top recurrent themes in the definitions of Big Data by 40 thought leaders


The Four V’s of Big Data

Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data


The Fifth V: Values

Big is relative.

Five broad ways in which using Big Data can create value

Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey) http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

❶ Unlocks significant value by making information transparent and usable at much higher frequency

❷ Collects more accurate and detailed performance information

❸ Allows ever-narrower segmentation of customers

❹ Substantially improves decision-making through sophisticated analytics

❺ Improves the development of the next generation of products and services


IDA Infocomm Technology Roadmap

Opportunities

> Analysis of unstructured data such as images and audio on top of text data to unearth insights from a bigger data pool

> Insights from the data analytics outcomes to augment decision making processes

> Analytics (retrospective to predictive) to proactively identify opportunities or tackle problems

Challenges

> Understanding and framing Big Data problems

> Maturity in some of the underlying analytics algorithms

> Shortage of data analytics talent

Source: IDA’s Public Consultation on Infocomm Technology Roadmap 2012, 17 Aug 2012 http://www.ida.gov.sg/Technology/20060417212727.aspx

‘Big Data’ is a key technology theme that will shape the ICT landscape


Technology Stack Radar

< 3 years:
• Hadoop MapReduce & distributed file system
• NoSQL DBMS
• Text Analytics
• Visualisation-based discovery
• In-memory analytics
• Audio analytics
• Predictive analytics
• Master data management
• SaaS-based business analytics

3 to 5 years:
• Complex event processing
• Data-federation/visualisation
• Video analytics
• Mobile business analytics
• Non-volatile memory

Source: IDA’s Public Consultation on Infocomm Technology Roadmap 2012, 17 Aug 2012 http://www.ida.gov.sg/Technology/20060417212727.aspx

IDA Infocomm Technology Roadmap

‘Big Data’ is a key technology theme that will shape the ICT landscape


Big Data for Libraries


Disclaimers

Not a comprehensive study of the use of big data in libraries.

A practitioner's high-level overview of the use of big data in libraries.

Does not cover big data issues such as data management, privacy and ownership.


Big Data Goals

Leverages NLB’s unique data assets

Structured & Unstructured Data: Patrons, Books, Loans, Visits, Newspapers, DVDs, VCDs, E-databases, E-Books, Digitised newspapers, Digitised books, Demographics, Locations, Facebook pages, Events, Browse Count, Blogs, Tweets

Actionable Insights

Better foresight for future library planning

Customer satisfaction improvements with better service offerings

Better usage of NLB services and resources

Unearthing the hidden treasures

Productivity gain with better decisions


Big Data for Libraries

Library Planning

Patron Profiling

Collection Optimisation

Business Operations

Digital Library

Service Delivery


Big Data for Libraries

Library Planning using Geospatial Analytics

Where are our users?
What do they read?
Are our libraries serving the residents in the vicinity?
Where shall we target our outreach campaign?
What is the impact on the usage of existing libraries when a new library opens?
Can our libraries cope with the population growth?


Big Data for Libraries

Patron Profiling & Footfall Analysis to Optimise Use of Library Space

Crowd Density, Audience Profiling, Human Traffic Flow

Source: Video Analytics as a Service http://vaaas.kaisquare.com/


Big Data for Libraries

Measuring & Analysing Energy Consumption using Smart Meters


Big Data for Libraries

Collection Optimisation – Collection Planning

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf


Big Data for Libraries

Collection Optimisation – Collection Planning

Collection Planning Model

Inputs: Forecast of usage, Cost of books, Shelf space, Initial collection, Available budget, Min/max collection size

Outputs: Planned Budget, Planned Acquisition, Planned Weeding, Planned Space, Projected Loans, Planned Final Collection

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf


Big Data for Libraries

Collection Optimisation – Demand Forecast

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf


Big Data for Libraries

Business Operations (Corporate KPIs, Finance, HR)

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf


Big Data for Libraries

Library Analytics Toolkit

Source: https://osc.hul.harvard.edu/liblab/projects/library-analytics-toolkit

The Library Analytics Toolkit is a dashboard that pulls library data together in a way that allows both librarians and library users to identify and respond to trends and changes in collections, usage, and other data


Big Data for Libraries

Integrated & Operational Analytics

~20% of items are borrowed within 3 days of their return

Auto-sorter

Just Return Bin

Patrons can easily access popular items

Libraries can reduce resources for shelving


Big Data for Libraries

Digital Library - Curation

• For staff to generate pathfinders
• An easier publishing tool for staff curation of content
• Crowdsensing of user interests interfaces with NLB content and pushes recommended content back to patron.

Find, Curate, Publish

User Interests: Analyse search keywords at NLB websites
Collection Gaps: Analyse search results & click-throughs
Relationships: Building relationships between entities
Trending topics: Analyse social media & relevant websites


Big Data for Libraries

Digital Library – Contextual Discovery

The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in Esplanade Park, the Lim Bo Seng Memorial and the Tan Kim Seng Fountain…

Gwee Peng Kwee (Oral History) His daily routine school… Laying of foundation stone and unveiling of Cenotaph…

Dalhousie Obelisk (Article) Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall 'needle-like' monument...

Lim Bo Seng (Article) Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June 1944, Perak, Malaya) was a prominent ...

Master Plan for Singapore - Central Area (1958)

Singapore’s War Memorial to the Glorious Dead (11 Nov 1920)

Lest we forget (8 Nov 1953)

Singapore students learn to care about history (13 Jul 1997)

Arrival of the Prince (31 Mar 1922)

Singapore’s War Memorial (21 Sep 1921)

Newspaper articles


Big Data for Libraries

Digital Library – Content Analytics to search the ‘Un-searchable’

Image/video search

Voice-to-text

Named entity recognition: Buildings, People, Streets, Dates, Organisations

Cross-language discovery: welcome, 欢迎, Selamat datang, நல்வரவு


Examples of Big Data Implementations


Recommending Good Reads to Patrons

Source: http://www.amazon.com/


Patrons also borrowed these titles…

> 33m loans a year
> 2m patrons
Recommendations tailored to NLB patrons

Flap Your Wings by P. D. Eastman

1,070 patrons borrowed this title:

M00014123D M00025872A M00032776C M00032897A M00039928K M00040123B M00042334H M00045167I M00051921E M00056997H . . .

Other titles the 1,070 patrons borrowed:

9342951 12547108 12910631 13085283 . . .

8734188 10247657 13046840 13085283 . . .

Recommendations: 402 patrons, 289 patrons, 260 patrons

Collaborative filtering


A Simple Implementation

select book_id, count(*)
from loans
where patron_id in (select patron_id from loans where book_id = 5127546)
group by book_id
order by 2 desc
limit 20;

loans table: book_id, patron_id

Patrons who borrowed title ‘5127546’ also borrowed these other titles:

+----------+----------+

| book_id | count(*) |

+----------+----------+

| 5127546 | 115 |

| 3652671 | 23 |

| 9136504 | 21 |

| 3857787 | 20 |

| 6132951 | 19 |

| 4235852 | 19 |

| 3049673 | 18 |

| 12863855 | 18 |

| 4624247 | 18 |

| 4643539 | 18 |

| 3718516 | 18 |

| 5018345 | 18 |

| 2908246 | 17 |

| 4235878 | 17 |

| 2085361 | 17 |

| 3718517 | 17 |

| 3260602 | 16 |

| 4317577 | 16 |

| 9043935 | 16 |

| 6373666 | 16 |

+----------+----------+
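
Note that the first row of the result is the seed title itself (5127546), because every patron in the subquery borrowed it. A minimal sketch of a variant that excludes the seed title and counts each patron once is given below, using Python's built-in `sqlite3` module. The in-memory table contents, the small book IDs 101 to 103, and the function name `also_borrowed_sql` are assumptions for illustration only.

```python
import sqlite3


def also_borrowed_sql(conn, seed_book_id, limit=20):
    """Titles borrowed by patrons of seed_book_id, excluding the seed itself."""
    query = """
        select book_id, count(distinct patron_id) as patrons
        from loans
        where patron_id in (select patron_id from loans where book_id = ?)
          and book_id <> ?
        group by book_id
        order by patrons desc, book_id
        limit ?
    """
    return conn.execute(query, (seed_book_id, seed_book_id, limit)).fetchall()


# Hypothetical data: patrons A, B and C borrowed the seed title 5127546.
conn = sqlite3.connect(":memory:")
conn.execute("create table loans (book_id integer, patron_id text)")
conn.executemany("insert into loans values (?, ?)", [
    (5127546, "A"), (5127546, "B"), (5127546, "C"),
    (101, "A"), (101, "B"), (101, "A"),   # patron A borrowed 101 twice
    (102, "C"),
    (103, "D"),                           # patron D never borrowed the seed
])
print(also_borrowed_sql(conn, 5127546))
```

Whether repeat loans by the same patron should count once or several times is a design choice; `count(distinct patron_id)` is used here so that one enthusiastic reader does not dominate the ranking.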


A Simple Implementation

Title level recommendations

Patron level recommendations

NLB Mobile app


Contextual discovery via text analytics


The ability to mine unstructured data is key to an organisation’s competitive advantage

[Chart: all digital data and unstructured digital data, 2005 to 2015, with storage cost per GB (US$)]

All digital data: 130 EB (2005), 1,227 EB (2010), 7,910 EB (2015)

1 EB (Exabyte) = 1,000,000 TB

Storage cost per GB (US$): $20, $10, $0.50

90% - unstructured data

68% of all unstructured data in 2015 will be created by consumers

Source: IDC’s Digital Universe Study, sponsored by EMC, Jun 2011

IDC - Singapore National Library Transforms Structured and Unstructured Data Into Insights on Cloud, 2014

“For companies who are looking to derive insights from nontransactional data sources with Big Data analytics and cloud technologies, NLB's analytics journey shines the light on approaches and lessons learned in deriving value from both transactional and nontransactional data sources on cloud infrastructure.”


The Growing Digital Collection

Digitised books

Historic newspapers

Images

Oral history recordings

Audio-visual recordings

Other collections

Infopedia articles

Web Archives

Singapore Memories

Music

Posters

Building Plans

Govt Records

Private Records

Maps


NLB users retrieve tens of millions of e-content items every year

It would be really nice if we could convert every single e-retrieval instance into an enriching discovery experience for every single user every time…


Contextual Discovery

NLB users collectively contribute to tens of millions of e-retrievals every year

The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in…

Gwee Peng Kwee His daily routine school… Laying of foundation stone and unveiling of Cenotaph…

Dalhousie Obelisk (Article) Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall…

Lim Bo Seng (Article) Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June…

Master Plan for Singapore - Central Area (1958)

Singapore’s War Memorial to the Glorious Dead (11 Nov 1920)

Lest we forget (8 Nov 1953)

Singapore students learn to care about history (13 Jul 1997)

Arrival of the Prince (31 Mar 1922)

Singapore’s War Memorial (21 Sep 1921)

Newspaper articles


Using text analytics to automatically identify related content

Each of the two texts is tokenised; tokens are parsed and weighted (TF/IDF)

The similarity of the weighted tokens is computed

Similarity = 0.295


Using Mahout to identify related content

What is Apache Mahout?

Scalable, commercial-friendly machine learning for building intelligent applications

Use cases:
• Recommendation: user info + community info
• Classification: places new items into categories
• Clustering: groups documents based on the notion of similarity
• Frequent Itemset Mining: analyses items in a group and then identifies which items typically appear together

http://mahout.apache.org/


Using Mahout to identify related content

The steps

Obtain the text files
• Obtain the text of the content to be analysed, one file per item. Put them in the “datafiles” folder.

Create the sequence files
• mahout seqdirectory -c UTF-8 -i datafiles -o seqfiles

Create TF/IDF weighted vectors
• mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -nv

Get the similarity results
• mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
• mahout rowsimilarity -i matrix/matrix -o similarity -similarityClassName SIMILARITY_COSINE -m -ess


Using Mahout to identify related content


A sequence file is a binary key-value file format used extensively in Mahout and Hadoop.


Using Mahout to identify related content

The steps

Obtain the text files
• Obtain the text of the content to be analysed, one file per item. Put them in the “datafiles” folder.

Create the sequence files
• mahout seqdirectory -c UTF-8 -i datafiles -o seqfiles

Create TF/IDF weighted vectors
• mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -wt tfidf -nv

Get the similarity results
• mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
• mahout rowsimilarity -i matrix/matrix -o similarity -similarityClassName SIMILARITY_COSINE -m -ess

Key parameters:
-s Min support of the term in the entire collection
-md Min document frequency
-x Max document frequency percentage
-ng Maximum size of the n-grams
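
To make the effect of the four key parameters concrete, here is a small Python sketch of vocabulary pruning. It only mimics what the thresholds mean as described on the slide; it is not Mahout's implementation, and the function names, the whitespace tokeniser and the sample documents are assumptions.

```python
from collections import Counter


def ngrams(tokens, max_size):
    """All n-grams of length 1..max_size, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_size + 1)
            for i in range(len(tokens) - n + 1)]


def prune_vocabulary(docs, min_support=1, min_df=1, max_df_percent=100, max_ngram=1):
    """Keep terms that pass thresholds analogous to -s, -md, -x and -ng."""
    term_counts = Counter()  # occurrences in the entire collection (-s)
    doc_counts = Counter()   # number of documents containing the term (-md, -x)
    for text in docs:
        terms = ngrams(text.lower().split(), max_ngram)
        term_counts.update(terms)
        doc_counts.update(set(terms))
    max_df = len(docs) * max_df_percent / 100
    return sorted(t for t in term_counts
                  if term_counts[t] >= min_support
                  and min_df <= doc_counts[t] <= max_df)


# Hypothetical mini-collection
docs = ["war memorial unveiled", "war memorial park", "war history", "park history"]
print(prune_vocabulary(docs, min_support=2, min_df=2, max_df_percent=50, max_ngram=2))
```

In this toy collection "war" is dropped because it appears in more than 50% of the documents, while the bigram "war memorial" survives because it appears in exactly two.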


Using Mahout to identify related content

The steps

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Sources: https://en.wikipedia.org/wiki/Tf%E2%80%93idf http://filotechnologia.blogspot.sg/2014/01/a-simple-java-class-for-tfidf-scoring.html http://criminalintent.org/2011/01/rapid-prototyping-with-mathematica/
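
A minimal sketch of one common TF-IDF variant follows: raw term count multiplied by log(N / document frequency). Mahout's exact weighting formula may differ; the function name `tfidf`, the whitespace tokeniser and the three sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical documents
docs = ["cenotaph war memorial", "war memorial park", "library park"]
w = tfidf(docs)
print(w[0])
```

In the first document "cenotaph" receives the highest weight because it occurs in only one of the three documents, while "war" and "memorial" are shared with the second document and are weighted lower.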


Using Mahout to identify related content

The steps

Obtain the text files
• Obtain the text of the content to be analysed, one file per item. Put them in the “datafiles” folder.

Create the sequence files
• mahout seqdirectory -c UTF-8 -i datafiles -o seqfiles

Create TF/IDF weighted vectors
• mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -wt tfidf -nv

Get the similarity results
• mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
• mahout rowsimilarity -i matrix/matrix -o similarity -similarityClassName SIMILARITY_COSINE -m 10


Using Mahout to identify related content

The steps

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle.

① Julie loves me more than Linda loves me

② Jane likes me more than Julie loves me

Term frequency:

① Julie (1) loves (2) me (2) more (1) than (1) Linda (1) likes (0) Jane (0)

② Julie (1) loves (1) me (2) more (1) than (1) Linda (0) likes (1) Jane (1)

$$\operatorname{sim}(\text{①}, \text{②}) = \frac{1\cdot 1 + 2\cdot 1 + 2\cdot 2 + 1\cdot 1 + 1\cdot 1 + 1\cdot 0 + 0\cdot 1 + 0\cdot 1}{\sqrt{1^2+2^2+2^2+1^2+1^2+1^2+0^2+0^2}\;\sqrt{1^2+1^2+2^2+1^2+1^2+0^2+1^2+1^2}} = \frac{9}{\sqrt{12}\,\sqrt{10}} \approx 0.822$$
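
The worked example can be checked with a short Python sketch that builds the term-frequency vectors from the two sentences and applies the cosine formula. The function name `cosine_similarity` and the whitespace tokenisation are assumptions for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


s1 = "Julie loves me more than Linda loves me"
s2 = "Jane likes me more than Julie loves me"
print(round(cosine_similarity(s1, s2), 3))  # 0.822
```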


Using Mahout to identify related content

The results

• mahout seqdumper -i similarity > similarity.txt

Key: 0: Value:

{14458:0.2966480826934176,

11399:0.30290014772966095,

12793:0.22009858979452146,

3275:0.1871791030103281,

14613:0.3534278632679437,

4411:0.2516380602790199,

17520:0.3139731583634198,

13611:0.18968888212315968,

14354:0.17673965754661425,

0:1.0000000000000004}

Key: 1: Value:

...

Article ID

The similarity score between article 0 and article 14458 is 0.297

Similarity scores are between 0 and 1
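
A dump in the shape shown above could be turned into ranked "related articles" lists with a small parser. This is a sketch that assumes the text layout is exactly as excerpted on the slide ("Key: id: Value: {id:score,...}"); the function name `parse_similarity_dump` and the shortened sample are hypothetical.

```python
import re


def parse_similarity_dump(text):
    """Return {article_id: [(other_id, score), ...]} sorted by score, highest first."""
    results = {}
    # Each record is assumed to look like "Key: <id>: Value: {<id>:<score>,...}".
    for key, body in re.findall(r"Key:\s*(\d+):\s*Value:\s*\{(.*?)\}", text, flags=re.S):
        pairs = re.findall(r"(\d+)\s*:\s*([0-9.eE+-]+)", body)
        # Drop the article's similarity with itself.
        scored = [(int(i), float(s)) for i, s in pairs if int(i) != int(key)]
        results[int(key)] = sorted(scored, key=lambda p: -p[1])
    return results


# Shortened sample taken from the output shown above
sample = """Key: 0: Value:
{14458:0.2966480826934176,
11399:0.30290014772966095,
14613:0.3534278632679437,
0:1.0000000000000004}"""
top = parse_similarity_dump(sample)[0]
print(top[0])
```

The first entry of `top` is the article most similar to article 0, which is what a "related content" panel would display first.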


An event unfolds…


Handling large data sets

NewspaperSG

Online resource of current and historic Singapore and Malaya newspapers, including The Straits Times, The Business Times, 星洲日报, 南洋商报, 联合早报, Berita Harian and TODAY

Over 20,000,000 articles published, and growing


Using clustering to handle large datasets

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)

Mahout K-Means Clustering with Cosine Distance


Using Mahout to cluster related content


Using Mahout to cluster related content

mahout kmeans -i vectors/tfidf-vectors/ -c initial-clusters \
  -o kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.1 -k 20 -x 20 -cl

-k Number of clusters

-x Maximum number of iterations

-cd Threshold of convergence (default: 0.5)

-dm Distance measurement (default: SquaredEuclidean)

Source: https://mahout.apache.org/users/clustering/k-means-commandline.html
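If the command is launched from a script, the flags can be kept as named parameters. The helper below is a hypothetical convenience (the function name and its arguments are not part of Mahout); it only assembles the argument list using the flags shown on the slide and does not run anything.

```python
import shlex

def build_kmeans_command(input_dir, clusters_dir, output_dir, k, max_iter,
                         convergence_delta, distance_measure, run_clustering=True):
    """Assemble the argument list for a `mahout kmeans` run (nothing is executed here)."""
    args = [
        "mahout", "kmeans",
        "-i", input_dir,                # input vectors
        "-c", clusters_dir,             # initial centroids
        "-o", output_dir,               # output directory
        "-dm", distance_measure,        # distance measure class name
        "-cd", str(convergence_delta),  # convergence threshold
        "-k", str(k),                   # number of clusters
        "-x", str(max_iter),            # maximum number of iterations
    ]
    if run_clustering:
        args.append("-cl")              # run clustering after the iterations
    return args

cmd = build_kmeans_command(
    input_dir="vectors/tfidf-vectors/",
    clusters_dir="initial-clusters",
    output_dir="kmeans-clusters",
    k=20,
    max_iter=20,
    convergence_delta=0.1,
    distance_measure="org.apache.mahout.common.distance.CosineDistanceMeasure",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list could be handed to `subprocess.run(cmd)` on a machine where Mahout is installed.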


Handling large data sets

Size of cluster (articles): top 40 stemmed terms

52,678: exhibit, art, artist, paint, museum, singapor, work, displai, galleri, open, mr, year, organis, centr, chines, cultur, held, on, nation, world, pictur, photograph, collect, hall, colour, includ, first, time, featur, sculptur, design, dai, peopl, piec, two, societi, part, visitor, fair, intern

86,881: olymp, athlet, game, sport, medal, event, team, gold, championship, record, world, singapor, metr, swim, won, year, champion, win, nation, women, time, coach, asian, meet, competit, train, swimmer, two, second, race, compet, first, amateur, bronz, intern, associ, best, finish, yesterdai, silver

142,289: school, student, educ, teacher, univers, secondari, singapor, children, year, primari, studi, pupil, parent, mr, teach, colleg, cours, ministri, english, languag, chines, on, institut, examin, time, learn, princip, train, programm, work, graduat, help, nation, class, two, govern, girl, scienc, boi, first

125,629: polic, arrest, offic, suspect, two, men, yesterdai, man, investig, report, found, mr, gang, road, raid, on, station, crime, detain, year, night, arm, robberi, car, believ, peopl, forc, charg, singapor, robber, hous, todai, told, stolen, seiz, spokesman, old, held, four, escap

67 clusters averaging 93,000 articles each

Worked well for Chinese and Malay articles too

Close to 1 billion associations identified

Using automatic clustering technique
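One way to see why clustering helps with large data sets is to count pairwise comparisons. The arithmetic below uses only the two figures on the slide (67 clusters, about 93,000 articles each) and assumes, purely for illustration, that every cluster has exactly the average size. The slides do not say how the associations were actually computed.

```python
def pair_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

clusters = 67
avg_cluster_size = 93_000                     # figure from the slide
total_articles = clusters * avg_cluster_size  # articles covered by these clusters

all_pairs = pair_count(total_articles)                     # compare every article with every other
within_cluster_pairs = clusters * pair_count(avg_cluster_size)  # compare only inside each cluster

reduction = all_pairs / within_cluster_pairs
print(f"all pairs:            {all_pairs:,}")
print(f"within-cluster pairs: {within_cluster_pairs:,}")
print(f"reduction factor:     {reduction:.1f}x")
```

With equal-sized clusters the saving is roughly a factor equal to the number of clusters, since each article is compared only with its own group.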


Implementations

Implemented in Jul 2013

Implemented in Sep 2013

Implemented in Nov 2013

Implemented in Jul 2014

Infopedia

PictureSG

NewspaperSG


NLB’s Hadoop cluster for text analytics

[Diagram: 13 servers hosted on three VM hosts (VM Host 1, VM Host 2, VM Host 3)]

• Server 1: Cluster Mgr
• Server 2: JobTracker
• Server 3: NameNode
• Server 4: Checkpoint
• Servers 5, 6, 8, 9, 11, 12: TaskTracker
• Servers 7, 10, 13: DataNode


Benefits of Contextual Discovery

Referrals from Infopedia: 0.14% → 10.65% (after 6 months)

Pageviews per month: 37,841 → 84,341

Pageviews per visit: 3.64 → 6.41

Infopedia

PictureSG


N-gram Viewer


Google Books Ngram Viewer

Graph showing how phrases have occurred in a corpus over time

https://books.google.com/ngrams
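The quantity plotted by an n-gram viewer can be sketched as the share of all n-grams in each year that match a phrase. The code below is a simplified illustration on an invented three-document corpus; it is not how Google Books or Bookworm compute their indexes, and it tokenizes by whitespace only.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All consecutive n-token phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency_by_year(docs, phrase):
    """Map each year to (occurrences of phrase) / (n-grams of the same length that year)."""
    n = len(phrase.split())
    target = phrase.lower()
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in docs:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Invented corpus of (year, text) pairs.
docs = [
    (1950, "the tram service ended"),
    (1950, "a new bus service"),
    (1990, "the mrt train service opened"),
]
series = relative_frequency_by_year(docs, "bus service")
```

Plotting `series` against the year gives the kind of curve shown in an n-gram viewer.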


Ngram viewer using Bookworm

Open source software from Culturomics

http://bookworm.culturomics.org/


Ngram viewer using Bookworm

To create an Ngram Viewer for your collection:

Metadata Catalog
• /metadata/jsoncatalog.txt
• The list of the metadata for each text

Field Descriptions
• /metadata/field_descriptions.json
• Describes the properties of each available metadata field

Raw Text
• /texts/raw/*.txt
• The text files in your collection (in .txt format)


3 required fields in the Metadata Catalog:

• filename: The filename of the corresponding text file (with .txt omitted and no whitespace in the name).
• date: The date corresponding to a text file. Dates which are not integers should be specified as a string in the format: YYYY-MM-DD.
• searchstring: The HTML code displayed for a text when points are clicked on in the ngram graph.

Example entry:

{"filename": "s1541-104", "date": "1997-2-7", "searchstring": "A bill to extend, reform, and improve agricultural commodity, trade, conservation, and other programs, and for other purposes. | Read at: <a href=\"http://www.govtrack.us/congress/bills/104/s1541\" target=\"_blank\">govtrack.us</a>"}
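A catalog entry like the one above can be generated rather than typed by hand. The sketch below builds one entry with the three required fields; the function name, the sample filename and the sample text are invented, and writing one JSON object per line of jsoncatalog.txt is an assumption to check against the Bookworm documentation.

```python
import json
from datetime import date

def catalog_entry(filename, when, searchstring):
    """Return one metadata catalog entry as a JSON string with the three required fields."""
    if filename.endswith(".txt") or any(ch.isspace() for ch in filename):
        raise ValueError("filename must omit .txt and contain no whitespace")
    return json.dumps({
        "filename": filename,
        "date": when.isoformat(),      # YYYY-MM-DD
        "searchstring": searchstring,  # HTML shown when a point on the graph is clicked
    })

# Hypothetical article; the real values would come from your collection's metadata.
line = catalog_entry(
    "article-0001",
    date(1956, 8, 12),
    "Sample article title | <a href=\"http://example.org/article-0001\">read</a>",
)
print(line)
```

Using `json.dumps` avoids the manual escaping of quotes that is visible in the example entry.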


Keys in each Field Descriptions entry:

• field: The name of the metadata variable.
• datatype: The type of the data: searchstring, time, categorical, etc.
• type: The format of the data: integer, decimal, character, text.
• unique: Whether any given text can have only one type of this field (e.g. title) or not (e.g. subject).

Example entries:

{"datatype": "searchstring", "field": "searchstring", "unique": true, "type": "text"}, {"datatype": "categorical", "field": "enacted", "unique": false, "type": "text"}, {"datatype": "time", "field": "date", "unique": true, "type": "character", "derived":[{"resolution":"year"}]}
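Before building a Bookworm, it can help to sanity-check the field descriptions. The checker below is a hypothetical helper, not part of Bookworm: it treats the four keys listed above as expected in every entry and the four listed formats as the only valid values of `type`, both of which are assumptions drawn from the slide. Wrapping the entries in a JSON list is also an assumption.

```python
import json

EXPECTED_KEYS = {"field", "datatype", "type", "unique"}      # keys listed on the slide
ALLOWED_TYPES = {"integer", "decimal", "character", "text"}  # formats listed on the slide

def check_field_descriptions(fields):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for i, entry in enumerate(fields):
        missing = EXPECTED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        if entry["type"] not in ALLOWED_TYPES:
            problems.append(f"entry {i}: unexpected type {entry['type']!r}")
        if not isinstance(entry["unique"], bool):
            problems.append(f"entry {i}: 'unique' should be true or false")
    return problems

# The three example entries from the slide.
fields = [
    {"datatype": "searchstring", "field": "searchstring", "unique": True, "type": "text"},
    {"datatype": "categorical", "field": "enacted", "unique": False, "type": "text"},
    {"datatype": "time", "field": "date", "unique": True, "type": "character",
     "derived": [{"resolution": "year"}]},
]
problems = check_field_descriptions(fields)
field_descriptions_json = json.dumps(fields, indent=2)
```

A curly quote in place of a straight one, as easily happens when copying from slides, would make the file invalid JSON, so loading it with `json.loads` is a useful first check.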


Example Files: /texts/raw/s1541-104.txt /texts/raw/hr2854-104.txt


Ngram viewer using Bookworm: Demonstration

http://bookworm.culturomics.org/congress/


Named Entity Recognition


Automatic extraction of time-based and location related information

[Timeline figure with sample dates: 12 Aug 1956, 07 Sep 1971, 30 Mar 1988, 26 Jul 1992, 16 Aug 2002, 11 Feb 2009]

• Users navigate through old images of Singapore buildings, streets, satellite images and events via augmented reality apps
• Resources can be mapped for contextual discovery
• Resources are time-stamped for discovery on a timeline

Time and location are two of the most fundamental ways we organise things. The automatic extraction of geo- and time-based references from the full text can yield more data than manual tagging.


Natural Language Processing (NLP)

Typical Steps of NLP

• Sentence detection
• Tokenization
• Part-of-speech tagging
• Named-entity detection

Sentence detection: Hi, How are you? This is Mike. → ①Hi, How are you? ②This is Mike.


Tokenization: This is Mike. → ①This ②is ③Mike


Part-of-speech tagging: ①This (DT) ②is (VBZ) ③Mike (NNP)


Named-entity detection: Mike (person) was in Singapore (location) on 3rd October (date)

This is Mike (person)
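The first two steps can be approximated with regular expressions, as in the sketch below. This is a deliberately naive stand-in for a real NLP toolkit: it splits on sentence-ending punctuation followed by whitespace, so abbreviations such as "Mr." would be split incorrectly, and it keeps punctuation marks as separate tokens.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive sentence detection)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentences = split_sentences("Hi, How are you? This is Mike.")
tokens = tokenize(sentences[1])
print(sentences)
print(tokens)
```

Part-of-speech tagging and named-entity detection need trained models or rule sets and are better left to a toolkit.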


Natural Language Processing (NLP)

Named Entity Recognition using GATE/ANNIE

GATE: General Architecture for Text Engineering
• Developed at the University of Sheffield in 1995
• A Java suite of tools, GUI & library
• Provides means of analyzing text
• Makes computers analyze and understand the language that humans use naturally
• Plugins to support different languages

ANNIE: A Nearly-New IE system
• IE: Information Extraction
• Distributed in GATE


Dates in Infopedia articles (http://eresources.nlb.gov.sg/infopedia/)
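For a sense of what date extraction involves, the sketch below finds dates written as day, month name, year (for example "12 Aug 1956") with a regular expression. It is a simplified substitute for GATE/ANNIE, which handles many more date forms; the sample sentence is invented.

```python
import re
from datetime import date

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Day, abbreviated or full month name, four-digit year.
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+(\d{4})\b"
)

def extract_dates(text):
    """Return the valid calendar dates mentioned in the text, in order of appearance."""
    found = []
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), MONTHS[month.lower()], int(day)))
        except ValueError:
            continue  # skip impossible dates such as 31 Feb
    return found

sample = "The building opened on 12 Aug 1956 and was renovated on 16 August 2002."
dates = extract_dates(sample)
```

Once extracted, the dates can be attached to an article as time stamps for a timeline view.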


Handling local street and building names using ‘Gazetteers’
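A gazetteer is essentially a list of known names to look up in the text. The sketch below shows the idea with a small hand-made list and a regular expression that prefers longer names; it is not GATE's gazetteer component, and the place list and sentence are only examples.

```python
import re

def build_matcher(names):
    """Compile a case-insensitive pattern matching any listed name, longest names first."""
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
                      re.IGNORECASE)

def find_locations(text, matcher):
    """Return (matched text, character offset) for each gazetteer hit."""
    return [(m.group(0), m.start()) for m in matcher.finditer(text)]

# Example gazetteer entries; a real list would hold local street and building names.
gazetteer = ["Orchard Road", "Raffles Place", "Raffles Hotel"]
matcher = build_matcher(gazetteer)
hits = find_locations("He walked from Raffles Place to Orchard Road.", matcher)
```

Local names that a general-purpose model would miss can be added to the list without retraining anything.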


Some other options for NLP

http://www.alchemyapi.com/

Free for up to 1,000 transactions per day


http://new.opencalais.com/

Free for up to 5,000 submissions per day


Image matching


Visual Search & Discovery

Visual Search: the user uploads an old photo to search for similar images

[Screenshot: search results page reporting 128 similar images found, with an option to upload another image]

Visual Discovery: for images without metadata descriptions, text mining cannot be used to cluster similar images


Indexing: Image Database → Feature Detector & Descriptor Extractor → Features (Super Matrix)

Query: Query Image → Feature Detector & Descriptor Extractor → Descriptor Matcher (against the stored features) → Similar Images

Algorithms:
• SIFT (Scale Invariant Feature Transform)
• SURF (Speeded Up Robust Features)
• FAST (Features from Accelerated Segment Test)
• BRIEF (Binary Robust Independent Elementary Features)
• ORB (Oriented FAST and Rotated BRIEF)
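The matching stage can be illustrated without an image library. Binary descriptors such as those produced by BRIEF or ORB are usually compared by Hamming distance (the number of differing bits). The toy below uses invented 8-bit integers in place of real descriptors and ranks database images by how many query descriptors find a close match; the image names and the distance threshold are arbitrary.

```python
def hamming(a, b):
    """Number of bit positions in which two integer-encoded descriptors differ."""
    return bin(a ^ b).count("1")

def count_good_matches(query, candidate, max_distance=2):
    """Count query descriptors whose nearest candidate descriptor is within max_distance bits."""
    good = 0
    for q in query:
        if min(hamming(q, c) for c in candidate) <= max_distance:
            good += 1
    return good

def rank_images(query, database, max_distance=2):
    """Return image names ordered from most to fewest good matches."""
    scores = {name: count_good_matches(query, descs, max_distance)
              for name, descs in database.items() if descs}
    return sorted(scores, key=lambda name: scores[name], reverse=True)

# Invented descriptors: real ones are typically 256 bits per keypoint.
database = {
    "old_street": [0b10110010, 0b01101100],
    "harbour": [0b00001111, 0b11110000],
}
query = [0b10110011, 0b01101000]
ranking = rank_images(query, database)
```

A library such as OpenCV performs the detection, description and matching steps on real images; this sketch only shows the comparison logic.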


Options for image matching

OpenCV (http://opencv.org/)
• Free open-source library based on BSD license
• Image processing, computer vision and machine learning
• Supports a large number of algorithms
• Optimized for real-time image processing and computer vision applications
• Interfaces to C++, C, C#, Java, Python
• Runs on Windows, Linux, Mac, iOS and Android

ViSenze (https://visenze.com/)
• Hosted on cloud
• Scalable image database
• Search returns results in milliseconds
• APIs available


Questions?

Kia Siang Hock


Backup slides


Using Mahout to identify related content

Parameters for seq2sparse command

Each entry gives the flag, the option name and type, the description, and the default value.

-ow Overwrite (bool): If set, the output folder is overwritten. If not set, the output folder is created if the folder doesn’t exist. If the output folder does exist, the job fails and an error is thrown. By default the flag is unset. (default: NA)

-a Lucene analyzer name (String): The class name of the analyzer to use. (default: org.apache.lucene.analysis.standard.StandardAnalyzer)

-chunk Chunk size (int): The chunk size in MB. For large document collections (sizes in GBs and TBs), you won’t be able to load the entire dictionary into memory during vectorization, so you can split the dictionary into chunks of the specified size and perform the vectorization in multiple stages. It’s recommended you keep this size to 80 percent of the Java heap size of the Hadoop child nodes to prevent the vectorizer from hitting the heap limit. (default: 100)

-wt Weighting (String): The weighting scheme to use: tf for term-frequency based weighting and tfidf for TF-IDF based weighting. (default: tfidf)

-s Minimum support (int): The minimum frequency of the term in the entire collection to be considered as a part of the dictionary file. Terms with lesser frequency are ignored. (default: 2)

-md Minimum document frequency (int): The minimum number of documents the term should occur in to be considered a part of the dictionary file. Any term with lesser frequency is ignored. (default: 1)

-x Max document frequency percentage (int): The maximum percentage of documents the term may occur in to be considered a part of the dictionary file. This is a mechanism to prune out high frequency terms (stop-words). Any word that occurs in more than the specified percentage of documents is ignored. (default: 99)

-ng N-gram size (int): The maximum size of n-grams to be selected from the collection of documents. (default: 1)

-ml Minimum log-likelihood ratio (LLR) (float): This flag works only when n-gram size is greater than 1. Very significant n-grams have large scores, such as 1000; less significant ones have lower scores. Although there’s no specific method for choosing this value, the rule of thumb is that n-grams with a LLR value less than 1.0 are irrelevant. (default: 1.0)

-n Normalization (float): The normalization value to use in the Lp space. The default scheme is to not normalize the weights. (default: 0)

-seq Create sequential access sparse vectors (bool): If set, the output vectors are created as SequentialAccessSparseVectors. By default the dictionary vectorizer generates RandomAccessSparseVectors. The former gives higher performance on certain algorithms like k-means and SVD due to the sequential nature of vector operations. By default the flag is unset. (default: NA)
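The three pruning options (minimum support, minimum document frequency, maximum document frequency percentage) can be illustrated on a toy corpus. The sketch below is plain Python, not Mahout's vectorizer: the parameter names are invented, the corpus is made up, and the weight uses the textbook formula tf × log(N / df), which may differ from the formula Mahout applies.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_support=2, min_df=1, max_df_percent=99):
    """Build a pruned dictionary, then return it with one {term: weight} vector per document."""
    term_freq, doc_freq = Counter(), Counter()
    for tokens in docs:
        term_freq.update(tokens)      # occurrences in the whole collection
        doc_freq.update(set(tokens))  # number of documents containing the term
    n_docs = len(docs)
    dictionary = {
        term for term in term_freq
        if term_freq[term] >= min_support                        # like -s
        and doc_freq[term] >= min_df                             # like -md
        and 100.0 * doc_freq[term] / n_docs <= max_df_percent    # like -x
    }
    vectors = []
    for tokens in docs:
        counts = Counter(t for t in tokens if t in dictionary)
        vectors.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()})
    return dictionary, vectors

# Invented, already tokenized documents.
docs = [
    ["the", "library", "opens"],
    ["the", "museum", "opens"],
    ["the", "library", "archive"],
]
dictionary, vectors = tfidf_vectors(docs)
```

Here "the" is dropped because it appears in every document, while "museum" and "archive" are dropped because they occur only once in the collection.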


Using Mahout to cluster related content

Parameters for kmeans command

--input (-i): Path to job input directory. Must be a SequenceFile of VectorWritable

--clusters (-c): The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first

--output (-o): The directory pathname for output.

--distanceMeasure (-dm): The classname of the DistanceMeasure. Default is SquaredEuclidean

--convergenceDelta (-cd): The convergence delta value. Default is 0.5

--maxIter (-x): The maximum number of iterations.

--maxRed (-r): The number of reduce tasks. Defaults to 2

--k (-k): The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.

--overwrite (-ow): If present, overwrite the output directory before running job

--clustering (-cl): If present, run clustering after the iterations have taken place