3/90
The Workshop Programme
14:00 Welcome
14:10 About National Library Board, Singapore
14:20 What is Big Data?
14:40 Big Data in Libraries
15:15 Break
15:45 Examples of Big Data Implementations: Recommendations, Text Analytics, Ngram Viewer, Named Entity Extraction, Image Matching
16:45 More Q&A
17:00 End of workshop
4/90
About the National Library Board, Singapore
5/90
Libraries & Archives
1 National Library, 1 National Archives, 26 Public Libraries
About National Library Board, Singapore
6/90
Vision
Readers for Life, Learning Communities, Knowledgeable Nation
Mission
We make knowledge come alive, spark imagination and create possibilities.
About National Library Board, Singapore
7/90
About National Library Board, Singapore
The Public Library seeks to be a social learning space that nurtures active readers and knowledge seekers, through the provision of relevant, timely and engaging library services and reading programmes, using physical and digital means.
Public Library Services
8/90
About National Library Board, Singapore
The only library in Singapore that comprehensively collects content published and distributed in the country, for preservation and long-term access
Enables easy access to the country’s shared memory to build rootedness and national identity
Forges international collaborations and advises on library development
National Library
9/90
About National Library Board, Singapore
The National Archives of Singapore (NAS) is the official custodian of Singapore’s collective memory. Ranging from government files, private memoirs, historical maps and photographs to oral history interviews and audio-visual materials, the NAS is responsible for the collection, preservation and management of Singapore's public and private archival records.
The Asian Film Archive was founded to preserve the rich film heritage of Singapore and Asian cinema, to encourage scholarly research on film, and to promote a wider critical appreciation of this art form.
10/90
A typical day in Singapore libraries…
79,000 people visit libraries
300 new members join the library
100,000 loans are made
27,000 people attend library programmes and exhibitions
11/90
About National Library Board, Singapore
FY2013 figures:
Libraries & Archives: 1 National Library, 1 National Archives, 26 Public Libraries
Membership: more than 2m members
Visits: more than 27m visits
Collection: more than 1m titles, more than 8.5m items
Loans: more than 35m loans
Online usage: digital user visits > 11m, e-retrievals > 70m
12/90
What is Big Data?
13/90
What is Big Data?
“data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”
Oxford English Dictionary
“an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”
Wikipedia
“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”
McKinsey
“The ability of society to harness information in novel ways to produce useful insights or goods and services of significant values” and “… things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”
Viktor Mayer-Schonberger & Kenneth Cukier
“The broad range of new and massive data types that have appeared over the last decade or so.”
Tom Davenport
Source: http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/
14/90
What is Big Data?
Source: http://datascience.berkeley.edu/what-is-big-data/
Top recurrent themes in the definitions of Big Data by 40 thought leaders
15/90
The Four V’s of Big Data
Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
16/90
The Fifth V: Values
Big is relative.
Five broad ways in which using Big Data can create value
Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey) http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
❶ Unlocks significant value by making information transparent and usable at much higher frequency
❷ Collects more accurate and detailed performance information
❸ Allows ever-narrower segmentation of customers
❹ Substantially improves decision-making through sophisticated analytics
❺ Improves the development of the next generation of products and services
17/90
IDA Infocomm Technology Roadmap
Opportunities
> Analysis of unstructured data such as images and audio, on top of text data, to unearth insights from a bigger data pool
> Insights from the data analytics outcomes to augment decision-making processes
> Analytics (retrospective to predictive) to proactively identify opportunities or tackle problems
Challenges
> Understanding and framing Big Data problems
> Maturity of some of the underlying analytics algorithms
> Shortage of data analytics talent
Source: IDA’s Public Consultation on Infocomm Technology Roadmap 2012, 17 Aug 2012 http://www.ida.gov.sg/Technology/20060417212727.aspx
‘Big Data’ is a key technology theme that will shape the ICT landscape
18/90
Technology Stack Radar
< 3 years: Hadoop MapReduce & distributed file system; NoSQL DBMS; Text analytics; Visualisation-based discovery; In-memory analytics; Audio analytics; Predictive analytics; Master data management; SaaS-based business analytics
3-5 years: Complex event processing; Data-federation/visualisation; Video analytics; Mobile business analytics; Non-volatile memory
Source: IDA’s Public Consultation on Infocomm Technology Roadmap 2012, 17 Aug 2012 http://www.ida.gov.sg/Technology/20060417212727.aspx
IDA Infocomm Technology Roadmap
‘Big Data’ is a key technology theme that will shape the ICT landscape
19/90
Big Data for Libraries
20/90
Disclaimers
Not a comprehensive study of the use of big data in libraries.
A practitioner's high level overview of use of big data in libraries.
Does not cover big data issues such as data management, privacy and ownership.
21/90
Big Data Goals
Leverages NLB’s unique data assets
Actionable insights
Better foresight for future library planning
Customer satisfaction improvements with better service offerings
Better usage of NLB services and resources
Unearthing the hidden treasures
Productivity gain with better decisions
Structured & Unstructured Data: Patrons, Books, Loans, Visits, Newspapers, DVDs, VCDs, E-databases, E-Books, Digitised newspapers, Digitised books, Demographics, Locations, Facebook pages, Events, Browse count, Blogs, Tweets
22/90
Big Data for Libraries
Library Planning
Patron Profiling
Collection Optimisation
Business Operations
Digital Library
Service Delivery
23/90
Big Data for Libraries
Library Planning using Geospatial Analytics
Where are our users?
What do they read?
Are our libraries serving the residents in the vicinity?
Where shall we target our outreach campaign?
What is the impact on the usage of existing libraries when a new library opens?
Can our libraries cope with the population growth?
24/90
Big Data for Libraries
Patron Profiling & Footfall Analysis to Optimise Use of Library Space
Crowd Density Audience Profiling Human Traffic Flow
Source: Video Analytics as a Service http://vaaas.kaisquare.com/
25/90
Big Data for Libraries
Measuring & Analysing Energy Consumption using Smart Meters
26/90
Big Data for Libraries
Collection Optimisation – Collection Planning
Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf
27/90
Big Data for Libraries
Collection Optimisation – Collection Planning
Collection Planning Model
Forecast of usage
Cost of books
Shelf space
Initial collection
Available budget
Min/max collection size
Planned Budget
Planned Acquisition
Planned Weeding
Planned Space
Projected Loans
Planned Final Collection
Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf
28/90
Big Data for Libraries
Collection Optimisation – Demand Forecast
Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf
29/90
Big Data for Libraries
Business Operations (Corporate KPIs, Finance, HR)
Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf
30/90
Big Data for Libraries
Library Analytics Toolkit
Source: https://osc.hul.harvard.edu/liblab/projects/library-analytics-toolkit
The Library Analytics Toolkit is a dashboard that pulls library data together in a way that allows both librarians and library users to identify and respond to trends and changes in collections, usage, and other data
31/90
Big Data for Libraries
Integrated & Operational Analytics
~20% of items are borrowed within 3 days of their return
Auto-sorter
Just Return Bin
Patrons can easily access popular items
Libraries can reduce the resources needed for shelving
32/90
Big Data for Libraries
Digital Library - Curation
• For staff to generate pathfinders
• An easier publishing tool for staff curation of content
• Crowdsensing of user interests interfaces with NLB content and pushes recommended content back to patrons
Find, Curate, Publish
User interests: analyse search keywords at NLB websites
Collection gaps: analyse search results & click-throughs
Relationships: building relationships between entities
Trending topics: analyse social media & relevant websites
33/90
Big Data for Libraries
Digital Library – Contextual Discovery
The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in Esplanade Park, the Lim Bo Seng Memorial and the Tan Kim Seng Fountain…
Gwee Peng Kwee (Oral History) His daily routine school… Laying of foundation stone and unveiling of Cenotaph…
Dalhousie Obelisk (Article) Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall 'needle-like' monument...
Lim Bo Seng (Article) Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June 1944, Perak, Malaya) was a prominent ...
Master Plan for Singapore - Central Area (1958)
Newspaper articles
Singapore’s War Memorial to the Glorious Dead (11 Nov 1920)
Lest we forget (8 Nov 1953)
Singapore students learn to care about history (13 Jul 1997)
Arrival of the Prince (31 Mar 1922)
Singapore’s War Memorial (21 Sep 1921)
34/90
Big Data for Libraries
Digital Library – Content Analytics to search the ‘Un-searchable’
Image/video search Voice-to-text
Named entity recognition
Buildings
People
Streets
Dates
Organisations
welcome 欢迎
Selamat datang
நல்வரவு
Cross-language discovery
35/90
Examples of Big Data Implementations
36/90
Recommending Good Reads to Patrons
Source: http://www.amazon.com/
37/90
Patron also borrowed these titles…
> 33m loans a year
> 2m patrons
Recommendations tailored to NLB patrons
Flap Your Wings by P. D. Eastman
1,070 patrons borrowed this title:
M00014123D M00025872A M00032776C M00032897A M00039928K M00040123B M00042334H M00045167I M00051921E M00056997H . . .
Other titles the 1,070 patrons borrowed:
9342951 12547108 12910631 13085283 . . .
8734188 10247657 13046840 13085283 . . .
Recommendations: the titles borrowed by the most of these patrons (402 patrons, 289 patrons, 260 patrons)
Collaborative filtering
38/90
A Simple Implementation
loans table: book_id, patron_id
select book_id, count(*)
from loans
where patron_id in (select patron_id from loans where book_id = 5127546)
group by book_id
order by 2 desc
limit 20;
Patrons who borrowed title ‘5127546’ also borrowed these other titles:
+----------+----------+
| book_id | count(*) |
+----------+----------+
| 5127546 | 115 |
| 3652671 | 23 |
| 9136504 | 21 |
| 3857787 | 20 |
| 6132951 | 19 |
| 4235852 | 19 |
| 3049673 | 18 |
| 12863855 | 18 |
| 4624247 | 18 |
| 4643539 | 18 |
| 3718516 | 18 |
| 5018345 | 18 |
| 2908246 | 17 |
| 4235878 | 17 |
| 2085361 | 17 |
| 3718517 | 17 |
| 3260602 | 16 |
| 4317577 | 16 |
| 9043935 | 16 |
| 6373666 | 16 |
+----------+----------+
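The query above can be tried on a toy table. The following is a minimal sketch using Python's built-in sqlite3 module; the patron and book IDs are made up, and two details differ from the slide as assumptions of this sketch: the queried title itself is excluded (it appears as the first row in the output above), and distinct patrons are counted rather than loan rows.

```python
import sqlite3

# Toy loans data; every ID below is invented for illustration.
LOANS = [
    ("P1", 100), ("P1", 200), ("P1", 300),
    ("P2", 100), ("P2", 200),
    ("P3", 100), ("P3", 300), ("P3", 200),
    ("P4", 400), ("P4", 200),
]

def also_borrowed(conn, book_id, limit=20):
    """Titles borrowed by patrons who borrowed book_id, most shared first."""
    sql = """
        SELECT book_id, COUNT(DISTINCT patron_id) AS patrons
        FROM loans
        WHERE patron_id IN (SELECT patron_id FROM loans WHERE book_id = ?)
          AND book_id <> ?          -- leave out the queried title itself
        GROUP BY book_id
        ORDER BY patrons DESC, book_id
        LIMIT ?
    """
    return conn.execute(sql, (book_id, book_id, limit)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (patron_id TEXT, book_id INTEGER)")
conn.executemany("INSERT INTO loans VALUES (?, ?)", LOANS)

recs = also_borrowed(conn, 100)
print(recs)
```

Patron P4 never borrowed title 100, so title 400 does not appear in the recommendations.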
39/90
A Simple Implementation
Title level recommendations Patron level recommendations
NLB Mobile app
40/90
Contextual discovery via text analytics
41/90
The ability to mine unstructured data is key to an organisation’s competitive advantage
[Chart: growth of digital data and storage cost per GB, 2005 to 2015]
All digital data: 130 EB (2005), 1,227 EB (2010), 7,910 EB (2015)
1 EB (Exabyte) = 1,000,000 TB
Unstructured digital data: 90% of all digital data
68% of all unstructured data in 2015 will be created by consumers
Storage cost per GB (US$): $20, $10, $0.50
Source: IDC’s Digital Universe Study, sponsored by EMC, Jun 2011
IDC - Singapore National Library Transforms Structured and Unstructured Data Into Insights on Cloud, 2014
“For companies who are looking to derive insights from nontransactional data sources with Big Data analytics and cloud technologies, NLB's analytics journey shines the light on approaches and lessons learned in deriving value from both transactional and nontransactional data sources on cloud infrastructure.”
42/90
The Growing Digital Collection
Digitised books
Historic newspapers
Images
Oral history recordings
Audio-visual recordings
Other collections
Infopedia articles
Web Archives
Singapore Memories
Music
Posters
Building Plans
Govt Records
Private Records Maps
43/90
NLB users retrieve tens of millions of e-content items every year
It would be really nice if we could convert every single e-retrieval instance into an enriching discovery experience for every single user, every time…
44/90
Contextual Discovery
NLB users collectively contribute to tens of millions of e-retrievals every year
The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in…
Gwee Peng Kwee His daily routine school… Laying of foundation stone and unveiling of Cenotaph…
Dalhousie Obelisk (Article) Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall…
Lim Bo Seng (Article) Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June…
Master Plan for Singapore - Central Area (1958)
Singapore’s War Memorial to the Glorious Dead (11 Nov 1920)
Lest we forget (8 Nov 1953)
Singapore students learn to care about history (13 Jul 1997)
Arrival of the Prince (31 Mar 1922)
Singapore’s War Memorial (21 Sep 1921)
Newspaper articles
45/90
Using text analytics to automatically identify related content
Text 1: tokenised; tokens parsed and weighted (TF/IDF)
Text 2: tokenised; tokens parsed and weighted (TF/IDF)
Similarity of the weighted tokens computed
Similarity = 0.295
46/90
Using Mahout to identify related content
What is Apache Mahout?
Scalable, commercial-friendly machine learning for building intelligent applications
Use cases:
• Recommendation: user info + community info
• Classification: places new items into categories
• Clustering: groups documents based on the notion of similarity
• Frequent itemset mining: analyses items in a group and then identifies which items typically appear together
http://mahout.apache.org/
47/90
Using Mahout to identify related content
The steps
Obtain the text files
• Obtain the text of the content to be analysed, one file per item. Put them in the “datafiles” folder.
Create the sequence files
• mahout seqdirectory -c UTF-8 -i datafiles -o seqfiles
Create TF/IDF weighted vectors
• mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -nv
Get the similarity results
• mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
• mahout rowsimilarity -i matrix/matrix -o similarity -similarityClassName SIMILARITY_COSINE -m -ess
48/90
Using Mahout to identify related content
Sequence file is a binary key-value file format used extensively in Mahout and Hadoop.
49/90
Using Mahout to identify related content
The steps
Create TF/IDF weighted vectors
• mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -wt tfidf -nv
Key parameters:
-s Min support of the term in the entire collection
-md Min document frequency
-x Max document frequency percentage
-ng Maximum size of the n-grams
50/90
Using Mahout to identify related content
The steps
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Sources: https://en.wikipedia.org/wiki/Tf%E2%80%93idf http://filotechnologia.blogspot.sg/2014/01/a-simple-java-class-for-tfidf-scoring.html http://criminalintent.org/2011/01/rapid-prototyping-with-mathematica/
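As an illustration of the definition above, here is a minimal sketch of TF-IDF weighting in plain Python. It assumes the textbook form, weight = term frequency × log(N / document frequency), with whitespace tokenisation; Mahout's seq2sparse uses its own analyzer and weighting variant, so the numbers will not match its output. The three sample texts are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "war memorial at esplanade park",
    "war memorial unveiled by the prince",
    "public library opens at esplanade",
]
w = tfidf(docs)
# "park" occurs in one document only, so it outweighs "war" (two documents).
print(round(w[0]["park"], 3), round(w[0]["war"], 3))
```

A term that occurs in every document gets weight log(1) = 0, which is how common words are played down.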
51/90
Using Mahout to identify related content
The steps
Get the similarity results
• mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
• mahout rowsimilarity -i matrix/matrix -o similarity -similarityClassName SIMILARITY_COSINE -m 10
52/90
Using Mahout to identify related content
The steps
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle.
① Julie loves me more than Linda loves me
② Jane likes me more than Julie loves me
① Julie (1) loves (2) me (2) more (1) than (1) Linda (1) likes (0) Jane (0)
② Julie (1) loves (1) me (2) more (1) than (1) Linda (0) likes (1) Jane (1)
term frequency
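$$\mathrm{sim}(\text{①}, \text{②}) = \frac{1 \times 1 + 2 \times 1 + 2 \times 2 + 1 \times 1 + 1 \times 1 + 1 \times 0 + 0 \times 1 + 0 \times 1}{\sqrt{1^2 + 2^2 + 2^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2} \times \sqrt{1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2 + 1^2 + 1^2}} = \frac{9}{\sqrt{12} \times \sqrt{10}} \approx 0.822$$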
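The worked example above can be checked with a few lines of Python. This is a minimal sketch that uses raw term frequencies, as on the slide, rather than TF-IDF weights; the function name is illustrative.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency Counters."""
    # A Counter returns 0 for a missing term, so unmatched terms add nothing.
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

s1 = Counter("Julie loves me more than Linda loves me".split())
s2 = Counter("Jane likes me more than Julie loves me".split())
score = cosine_similarity(s1, s2)
print(round(score, 3))  # 0.822
```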
53/90
Using Mahout to identify related content
The results
• mahout seqdumper -i similarity > similarity.txt
Key: 0: Value:
{14458:0.2966480826934176,
11399:0.30290014772966095,
12793:0.22009858979452146,
3275:0.1871791030103281,
14613:0.3534278632679437,
4411:0.2516380602790199,
17520:0.3139731583634198,
13611:0.18968888212315968,
14354:0.17673965754661425,
0:1.0000000000000004}
Key: 1: Value:
...
The key and the number before each colon are article IDs.
The similarity score between article 0 and article 14458 is 0.297.
Similarity scores are between 0 and 1.
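To turn such a dump into a "related articles" list, the text can be parsed and ranked. The sketch below assumes the dump looks like the excerpt above (a `Key: N: Value: {id:score,...}` record per article); the function names are hypothetical and real seqdumper output may carry extra header lines, which this regular expression simply skips.

```python
import re

def parse_similarity_dump(text):
    """Parse 'Key: N: Value: {id:score,...}' records into nested dicts."""
    scores = {}
    for key, body in re.findall(r"Key:\s*(\d+):\s*Value:\s*\{([^}]*)\}", text):
        pairs = re.findall(r"(\d+):([0-9.eE+-]+)", body)
        scores[int(key)] = {int(i): float(s) for i, s in pairs}
    return scores

def top_related(scores, article_id, n=3):
    """IDs of the n most similar articles, excluding the article itself."""
    row = scores[article_id]
    ranked = sorted(((s, i) for i, s in row.items() if i != article_id), reverse=True)
    return [i for _, i in ranked[:n]]

dump = """Key: 0: Value:
{14458:0.2966480826934176,
11399:0.30290014772966095,
12793:0.22009858979452146,
14613:0.3534278632679437,
17520:0.3139731583634198,
0:1.0000000000000004}"""

scores = parse_similarity_dump(dump)
print(top_related(scores, 0))
```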
54/90
An event unfolds…
55/90
Handling large data sets
NewspaperSG
Online resource of current and historic Singapore and Malaya newspapers, including The Straits Times, The Business Times, 星洲日报, 南洋商报, 联合早报, Berita Harian and TODAY
Over 20,000,000 articles published, and growing
56/90
Using clustering to handle large datasets
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)
Mahout K-Means Clustering with Cosine Distance
57/90
Using Mahout to cluster related content
58/90
Using Mahout to cluster related content
mahout kmeans -i vectors/tfidf-vectors/ -c initial-clusters \
  -o kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.1 -k 20 -x 20 -cl
-k Number of clusters
-x Maximum number of iterations
-cd Threshold of convergence (default: 0.5)
-dm Distance measurement (default: SquaredEuclidean)
Source: https://mahout.apache.org/users/clustering/k-means-commandline.html
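To show what k-means with a cosine distance measure does, here is a minimal single-machine sketch in plain Python. It is not Mahout's implementation: it seeds the centroids with the first k vectors (Mahout picks a random set when -k is given), stops when assignments no longer change instead of using a convergence delta, and the toy vectors are invented.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine_distance(a, b):
    # Inputs are unit vectors, so the dot product equals the cosine.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def kmeans_cosine(vectors, k, max_iter=20):
    """Return a cluster index for each input vector."""
    points = [normalise(v) for v in vectors]
    centroids = points[:k]  # deterministic seeding, an assumption of this sketch
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalise(mean)
    return assignment

# Toy term-weight vectors: items 0, 2, 4 share one vocabulary; 1, 3 another.
labels = kmeans_cosine([[5, 1, 0], [0, 1, 4], [4, 0, 0], [0, 2, 5], [3, 1, 0]], k=2)
print(labels)
```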
59/90
Handling large data sets
Size of cluster
Top 40 stemmed terms
52,678 exhibit, art, artist, paint, museum, singapor, work, displai, galleri, open, mr, year, organis, centr, chines, cultur, held, on, nation, world, pictur, photograph, collect, hall, colour, includ, first, time, featur, sculptur, design, dai, peopl, piec, two, societi, part, visitor, fair, intern
86,881 olymp, athlet, game, sport, medal, event, team, gold, championship, record, world, singapor, metr, swim, won, year, champion, win, nation, women, time, coach, asian, meet, competit, train, swimmer, two, second, race, compet, first, amateur, bronz, intern, associ, best, finish, yesterdai, silver
142,289 school, student, educ, teacher, univers, secondari, singapor, children, year, primari, studi, pupil, parent, mr, teach, colleg, cours, ministri, english, languag, chines, on, institut, examin, time, learn, princip, train, programm, work, graduat, help, nation, class, two, govern, girl, scienc, boi, first
125,629 polic, arrest, offic, suspect, two, men, yesterdai, man, investig, report, found, mr, gang, road, raid, on, station, crime, detain, year, night, arm, robberi, car, believ, peopl, forc, charg, singapor, robber, hous, todai, told, stolen, seiz, spokesman, old, held, four, escap
67 clusters averaging 93,000 articles each
Worked well for Chinese and Malay articles too
Close to 1 billion associations identified
Using automatic clustering technique
60/90
Implementations
Implemented in Jul 2013
Implemented in Sep 2013
Implemented in Nov 2013
Implemented in Jul 2014
Infopedia
PictureSG
NewspaperSG
61/90
NLB’s Hadoop cluster for text analytics
[Diagram: servers hosted across VM Host 1, VM Host 2 and VM Host 3]
Server 1: Cluster Mgr
Server 2: JobTracker
Server 3: NameNode
Server 4: Checkpoint
Servers 5, 6, 8, 9, 11, 12: TaskTracker
Servers 7, 10, 13: DataNode
62/90
Benefits of Contextual Discovery
Referrals from Infopedia
Referrals from Infopedia: 0.14%, 10.65% (after 6 months)
Pageviews per month: 37,841, 84,341
Pageviews per visit: 3.64, 6.41
Infopedia
PictureSG
63/90
N-gram Viewer
64/90
Google Books Ngram Viewer
Graph showing how phrases have occurred in a corpus over time
https://books.google.com/ngrams
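The idea behind such a graph can be sketched in a few lines: for each year, count how often a phrase occurs and divide by the number of n-grams of the same length published that year. This is a simplified illustration with invented sample texts and whitespace tokenisation, not the method used by Google or Bookworm.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency_by_year(docs, phrase):
    """docs is a list of (year, text); returns {year: share of n-grams}."""
    n = len(phrase.split())
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in docs:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

docs = [
    (1950, "the public library opens"),
    (1950, "a new library"),
    (2000, "the digital library and the public library"),
]
freq = relative_frequency_by_year(docs, "public library")
print(freq)
```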
65/90
Ngram viewer using Bookworm
Open source software from Culturomics
http://bookworm.culturomics.org/
66/90
Ngram viewer using Bookworm
To create an Ngram Viewer for your collection
Metadata Catalog
• /metadata/jsoncatalog.txt
• The list of the metadata for each text
Field Descriptions
• /metadata/field_descriptions.json
• Describes the properties of each available metadata field
Raw Text
• /texts/raw/*.txt
• The text files in your collection (in .txt format)
67/90
Ngram viewer using Bookworm
To create an Ngram Viewer for your collection
Metadata Catalog (/metadata/jsoncatalog.txt): 3 required fields
Key | Description of Value
filename | The filename of the corresponding text file (with .txt omitted and no whitespace in the name).
date | The date corresponding to a text file. Dates which are not integers should be specified as a string in the format: YYYY-MM-DD.
searchstring | The HTML code displayed for a text when points are clicked on in the ngram graph.
Example entry:
{"filename": "s1541-104", "date": "1997-2-7", "searchstring": "A bill to extend, reform, and improve agricultural commodity, trade, conservation, and other programs, and for other purposes. | Read at: <a href=\"http://www.govtrack.us/congress/bills/104/s1541\" target=\"_blank\">govtrack.us</a>"}
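A catalog entry like the one above can be produced with Python's json module, which takes care of escaping the quotes inside the HTML. This is a minimal sketch: the helper name is hypothetical, and writing one JSON object per line is an assumption based on the single-entry example shown here.

```python
import json

def catalog_line(filename, date, searchstring, **extra_fields):
    """Build one metadata catalog entry as a single line of JSON."""
    if filename.endswith(".txt") or any(ch.isspace() for ch in filename):
        raise ValueError("filename must omit .txt and contain no whitespace")
    record = {"filename": filename, "date": date, "searchstring": searchstring}
    record.update(extra_fields)  # optional extra metadata, e.g. subject
    return json.dumps(record, ensure_ascii=False)

line = catalog_line(
    "s1541-104",
    "1997-2-7",
    'A bill to extend, reform, and improve agricultural programs. | Read at: '
    '<a href="http://www.govtrack.us/congress/bills/104/s1541" target="_blank">govtrack.us</a>',
)
print(line)
```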
68/90
Ngram viewer using Bookworm
To create an Ngram Viewer for your collection
Field Descriptions (/metadata/field_descriptions.json)
Key | Description
field | The name of the metadata variable.
datatype | The type of the data: searchstring, time, categorical, etc.
type | The format of the data: integer, decimal, character, text.
unique | Whether any given text can have only one type of this field (e.g. title) or not (e.g. subject).
Example:
[
{"datatype": "searchstring", "field": "searchstring", "unique": true, "type": "text"},
{"datatype": "categorical", "field": "enacted", "unique": false, "type": "text"},
{"datatype": "time", "field": "date", "unique": true, "type": "character", "derived": [{"resolution": "year"}]}
]
69/90
Ngram viewer using Bookworm
To create an Ngram Viewer for your collection
Raw Text (/texts/raw/*.txt)
Example files:
/texts/raw/s1541-104.txt
/texts/raw/hr2854-104.txt
70/90
Ngram viewer using Bookworm
Demonstration
http://bookworm.culturomics.org/congress/
71/90
Named Entity Recognition
72/90
Automatic extraction of time-based and location related information
Time and location are two of the most fundamental ways we organise things. The automatic extraction of geo- and time-based references from the full text can yield more data than manual tagging.
Resources are time-stamped for discovery on a timeline: 12 Aug 1956, 07 Sep 1971, 30 Mar 1988, 26 Jul 1992, 16 Aug 2002, 11 Feb 2009
Resources can be mapped for contextual discovery
Users navigate through old images of Singapore buildings, streets, satellite images and events via augmented reality apps
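As a small illustration of extracting time references from full text, the sketch below pulls day-month-year dates out of a passage with a regular expression. It only handles dates written like "31 March 1922" or "12 Aug 1956"; a production system such as the GATE/ANNIE pipeline discussed later covers many more patterns. The sample sentence is taken from the Cenotaph article quoted earlier.

```python
import re
from datetime import date

MONTH_NAMES = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Day, month name (abbreviated or in full), four-digit year.
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+(\d{4})\b",
    re.IGNORECASE,
)

def extract_dates(text):
    """Return the dates mentioned in text, in order of appearance."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTHS[month[:3].lower()], int(day)))
        except ValueError:
            continue  # skip impossible dates such as 31 Feb
    return found

text = ("It was unveiled on 31 March 1922 by the Prince of Wales. "
        "On 28 December 2010, it was gazetted as a national monument.")
dates = extract_dates(text)
print(dates)
```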
73/90
Natural Language Processing (NLP)
Typical Steps of NLP: Sentence detection, Tokenization, Part-of-speech tagging, Named-entity detection
Sentence detection: Hi, How are you? This is Mike. → ①Hi, How are you? ②This is Mike.
74/90
Natural Language Processing (NLP)
Typical Steps of NLP
Tokenization: This is Mike. → ①This ②is ③Mike
75/90
Natural Language Processing (NLP)
Typical Steps of NLP
Part-of-speech tagging: ①This (DT) ②is (VBZ) ③Mike (NNP)
76/90
Natural Language Processing (NLP)
Typical Steps of NLP
Named-entity detection:
This is Mike [person]
Mike [person] was in Singapore [location] on 3rd October [date]
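The first two steps can be approximated with regular expressions, as in the minimal sketch below. It is a rough rule-based illustration only (it would split wrongly after abbreviations such as "Mr."), and unlike the slide it keeps punctuation marks as separate tokens; part-of-speech tagging and named-entity detection need trained models or rule sets and are not attempted here.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Words become tokens; each punctuation mark is its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sentences = split_sentences("Hi, How are you? This is Mike.")
tokens = tokenize(sentences[1])
print(sentences)
print(tokens)
```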
77/90
Natural Language Processing (NLP)
Named Entity Recognition using GATE/ANNIE
GATE: General Architecture for Text Engineering
• Developed at the University of Sheffield in 1995
• A Java suite of tools, GUI & library
• Provides means of analyzing text
• Makes computers analyze and understand the language that humans use naturally
• Plugins to support different languages
ANNIE: A Nearly-New IE system (IE: Information Extraction)
• Distributed in GATE
78/90
Natural Language Processing (NLP)
Named Entity Recognition using GATE/ANNIE
Dates in Infopedia articles (http://eresources.nlb.gov.sg/infopedia/)
79/90
Natural Language Processing (NLP)
Named Entity Recognition using GATE/ANNIE
Handling local street and building names using ‘Gazetteers’
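A gazetteer is essentially a list of known names with their entity types. The sketch below shows the lookup idea in plain Python; it is not GATE's gazetteer implementation, and the entries and type labels are illustrative assumptions drawn from the Cenotaph example earlier in the deck.

```python
import re

# Illustrative gazetteer: known local names mapped to an entity type.
GAZETTEER = {
    "Esplanade Park": "location",
    "Esplanade": "building",
    "Connaught Drive": "street",
    "Cenotaph": "building",
}

def tag_entities(text, gazetteer):
    """Return (name, type) for each gazetteer entry found in text."""
    # Longest names first, so "Esplanade Park" wins over "Esplanade".
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.group(0), gazetteer[m.group(0)]) for m in pattern.finditer(text)]

text = "The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial."
entities = tag_entities(text, GAZETTEER)
print(entities)
```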
80/90
Natural Language Processing (NLP)
Some other options for NLP
http://www.alchemyapi.com/
Free for up to 1,000 transactions per day
81/90
Natural Language Processing (NLP)
Some other options for NLP
http://new.opencalais.com/
Free for up to 5,000 submissions per day
82/90
Image matching
83/90
Visual Search & Discovery
Visual Search: User uploads his old photo to search for similar images (example result: 128 similar images found)
Visual Discovery: Images without meta-data description; cannot use text mining to cluster similar images
84/90
Visual Search & Discovery
Image Database → Feature Detector & Descriptor Extractor → Features (Super Matrix) → Descriptor Matcher
Query Image → Feature Detector & Descriptor Extractor → Descriptor Matcher → Similar Images
Algorithms:
SIFT (Scale Invariant Feature Transform)
SURF (Speeded Up Robust Features)
FAST (Features from Accelerated Segment Test)
BRIEF (Binary Robust Independent Elementary Features)
ORB (Oriented FAST and Rotated BRIEF)
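The descriptor matcher stage can be illustrated without any imaging library. Binary descriptors such as BRIEF and ORB are compared by Hamming distance, so the sketch below assumes the descriptors have already been extracted and represents each one as a small integer bit pattern; the image names, the 8-bit descriptors and the distance threshold are all invented for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def match_score(query_desc, image_desc, max_distance=2):
    """Count query descriptors whose nearest image descriptor is close enough."""
    good = 0
    for q in query_desc:
        best = min(hamming(q, d) for d in image_desc)
        if best <= max_distance:
            good += 1
    return good

def rank_images(query_desc, database, max_distance=2):
    """Image names ordered by number of good matches, best first."""
    scores = {name: match_score(query_desc, descs, max_distance)
              for name, descs in database.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))

database = {
    "old_shophouse": [0b10110011, 0b01101000, 0b00001111],
    "river": [0b00000000, 0b11111111],
}
query = [0b10110010, 0b01101100, 0b11110000]
ranking = rank_images(query, database)
print(ranking)
```

Real descriptors are much longer (ORB uses 256 bits), and a library matcher would normally replace the nested loops.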
85/90
Visual Search & Discovery
Options for image matching
OpenCV (http://opencv.org/)
• Free open-source library based on BSD license
• Image processing, computer vision and machine learning
• Supports a large number of algorithms
Key features:
• Optimized for real-time image processing and computer vision applications
• Interfaces to C++, C, C#, Java, Python
• Runs on Windows, Linux, Mac, iOS and Android
ViSenze (https://visenze.com/)
• Hosted on cloud
• Scalable image database
• Search returns result in milliseconds
• APIs available
87/90
Backup slides
88/90
Using Mahout to identify related content
Parameters for seq2sparse command
Option | Flag | Description | Default value
Overwrite (bool) | -ow | If set, the output folder is overwritten. If not set, the output folder is created if the folder doesn’t exist. If the output folder does exist, the job fails and an error is thrown. Default is unset. | NA
Lucene analyzer name (String) | -a | The class name of the analyzer to use. | org.apache.lucene.analysis.standard.StandardAnalyzer
Chunk size (int) | -chunk | The chunk size in MB. For large document collections (sizes in GBs and TBs), you won’t be able to load the entire dictionary into memory during vectorization, so you can split the dictionary into chunks of the specified size and perform the vectorization in multiple stages. It’s recommended you keep this size to 80 percent of the Java heap size of the Hadoop child nodes to prevent the vectorizer from hitting the heap limit. | 100
Weighting (String) | -wt | The weighting scheme to use: tf for term-frequency based weighting and tfidf for TF-IDF based weighting. | tfidf
Minimum support (int) | -s | The minimum frequency of the term in the entire collection to be considered as a part of the dictionary file. Terms with lesser frequency are ignored. | 2
Minimum document frequency (int) | -md | The minimum number of documents the term should occur in to be considered a part of the dictionary file. Any term with lesser frequency is ignored. | 1
89/90
Using Mahout to identify related content
Parameters for seq2sparse command
Option | Flag | Description | Default value
Max document frequency percentage (int) | -x | The maximum number of documents the term should occur in to be considered a part of the dictionary file. This is a mechanism to prune out high frequency terms (stop-words). Any word that occurs in more than the specified percentage of documents is ignored. | 99
N-gram size (int) | -ng | The maximum size of n-grams to be selected from the collection of documents. | 1
Minimum log-likelihood ratio (LLR) (float) | -ml | This flag works only when n-gram size is greater than 1. Very significant n-grams have large scores, such as 1000; less significant ones have lower scores. Although there’s no specific method for choosing this value, the rule of thumb is that n-grams with a LLR value less than 1.0 are irrelevant. | 1.0
Normalization (float) | -n | The normalization value to use in the Lp space. A detailed explanation of normalization is given in section 8.4. The default scheme is to not normalize the weights. | 0
Create sequential access sparse vectors (bool) | -seq | If set, the output vectors are created as SequentialAccessSparseVectors. By default the dictionary vectorizer generates RandomAccessSparseVectors. The former gives higher performance on certain algorithms like k-means and SVD due to the sequential nature of vector operations. By default the flag is unset. | NA
90/90
Using Mahout to cluster related content
Parameters for kmeans command
Option | Description
--input (-i) input | Path to job input directory. Must be a SequenceFile of VectorWritable
--clusters (-c) clusters | The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first
--output (-o) output | The directory pathname for output.
--distanceMeasure (-dm) distanceMeasure | The classname of the DistanceMeasure. Default is SquaredEuclidean
--convergenceDelta (-cd) convergenceDelta | The convergence delta value. Default is 0.5
--maxIter (-x) maxIter | The maximum number of iterations.
--maxRed (-r) maxRed | The number of reduce tasks. Defaults to 2
--k (-k) k | The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.
--overwrite (-ow) | If present, overwrite the output directory before running job
--clustering (-cl) | If present, run clustering after the iterations have taken place