DESCRIPTION
(See the “Notes” section for an explanation of each slide.) A presentation given at the Lucene/Solr Revolution 2014 conference on metadata search in television news for the NewsScape project. Please also visit http://bitly.com/lsr2014tvnews for updates. Video: Coming soon!
Summary: UCLA’s NewsScape has over 200,000 hours of television news from the United States and Europe. In the last two years, the project has generated a large set of “metadata”: story segment boundaries, story types and topics, named entities, on-screen text, image labels, etc. Including them in searches opens new opportunities for research, understanding, and visualization, and helps answer questions such as “Who was interviewed on which shows about the Ukraine crisis in May 2014?” and “What text or image is shown on the screen as a story is being reported?”
However, metadata search poses significant challenges, because the search engine needs to consider not only the content of a metadata instance, but also its position and time relative to other metadata instances, whether search terms are found in the same or different metadata instances, etc. This session describes how UCLA has implemented metadata search with Lucene/Solr’s block join and custom query types, as well as the collection’s position-time data. It also describes UCLA’s work on using time as the distance unit for proximity search, filtering search results by metadata boundaries, and a metadata-aware, multi-field implementation of auto-suggest.
Reading Metadata
Between the Lines:
Searching for
Stories, People, Places and More
in Television News
Kai Chan
Social Sciences Computing
University of California, Los Angeles
What?
What We Do with Television News
Make Metadata Searchable
metadata (not searchable)
caption (searchable)
THESE RECALLED CARS ARE AMONG
THE MOST POPULAR FOR THE PAST 12
YEARS.
Story Segment
Story 1 Story 2
Named Entity
Name: John McCain, Role: US Senator, Party: Republican
Name: Greta Van Susteren, Role: Anchor, Network: Fox News Channel
NJ Governor: cooperation from US President “outstanding”, “deserves great credit”
Republican Democrat praise (!)
Non-Verbal Communication
On-Screen Text
How?
1. Help Users Search
Define Metadata Structure
Tag, Start Time, End Time
Attribute Name: Value
Attribute Name: Value
Attribute Name: Value
Example (story segment):
Tag: SEG
Start time: 1:00:00
End time: 1:03:00
Attributes:
Type: Headline
Topic: Ebola Scare
Country: US
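The structure on this slide (a tag, a start time, an end time, and a set of attribute name/value pairs) can be written down as a small record type. The following is only a sketch: the class name `MetadataInstance`, the helper `parse_time`, and the choice of seconds as the time unit are assumptions made for illustration, not the project's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataInstance:
    """One metadata instance: a tag with a time span and attributes."""
    tag: str      # e.g. "SEG" for a story segment
    start: int    # start time, in seconds (assumed unit)
    end: int      # end time, in seconds (assumed unit)
    attributes: dict = field(default_factory=dict)


def parse_time(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# The story-segment example from the slide.
segment = MetadataInstance(
    tag="SEG",
    start=parse_time("1:00:00"),
    end=parse_time("1:03:00"),
    attributes={"Type": "Headline", "Topic": "Ebola Scare", "Country": "US"},
)
```

Keeping the time span on every instance is what later makes it possible to compare instances by time rather than only by content.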
Search in Multiple Places
Offer Suggestions
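As a rough illustration of suggestions that draw on several metadata fields at once, the sketch below matches a typed prefix against the values of every field and returns each suggestion together with its field name. The function `suggest`, the in-memory `index`, and the word-prefix matching rule are all assumptions; the slides do not say how the actual auto-suggest is implemented.

```python
def suggest(prefix, index, limit=5):
    """Return 'FIELD:"value"' suggestions whose value has a word starting with prefix."""
    prefix = prefix.casefold()
    hits = []
    for field_name, values in index.items():
        for value in values:
            words = value.casefold().split()
            if any(word.startswith(prefix) for word in words):
                hits.append(f'{field_name}:"{value}"')
    return sorted(hits)[:limit]


# A tiny hand-built index using field names and values that appear in the slides.
index = {
    "SEG_Topic": ["Drought", "Ebola Scare"],
    "NER_Name": ["John McCain", "John Chiang"],
    "NER_State": ["Arizona", "California"],
}

state_suggestions = suggest("ca", index)
name_suggestions = suggest("john", index)
```

Returning the field name with each value lets the user see where a suggestion comes from, for example a state as opposed to a story topic.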
2. Make the Search Happen
Map to Documents and Fields
document: (program info, caption)
fields:
SEG_Type: Headline
SEG_Topic: Ebola Scare
NER_Name: John McCain
NER_Role: Senator
NER_State: Arizona
SEG_Type: Politics
SEG_Topic: Drought
NER_Name: John Chiang
NER_Role: Controller
NER_State: California
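A minimal sketch of this mapping, assuming each attribute becomes a multi-valued field named `TAG_Attribute` on a single program document. The helper names `flatten` and `matches_all` are hypothetical, and the grouping of the two SEG instances is a guess from the slide. The sketch also shows a side effect of the flat layout: a query for a Senator from California matches the document even though no single person has both attributes.

```python
from collections import defaultdict

# Metadata instances of one program, as (tag, attributes) pairs.
instances = [
    ("SEG", {"Type": "Headline", "Topic": "Ebola Scare"}),
    ("NER", {"Name": "John McCain", "Role": "Senator", "State": "Arizona"}),
    ("SEG", {"Type": "Politics", "Topic": "Drought"}),
    ("NER", {"Name": "John Chiang", "Role": "Controller", "State": "California"}),
]


def flatten(instances):
    """Map every attribute to a multi-valued field named TAG_Attribute."""
    doc = defaultdict(list)
    for tag, attrs in instances:
        for name, value in attrs.items():
            doc[f"{tag}_{name}"].append(value)
    return dict(doc)


def matches_all(doc, required):
    """True if every required (field, value) pair occurs somewhere in the document."""
    return all(value in doc.get(field_name, []) for field_name, value in required)


doc = flatten(instances)
query = [("NER_Role", "Senator"), ("NER_State", "California")]
hit = matches_all(doc, query)  # True: the two values come from different people
```

Once the instances are merged into one document, the information about which values belonged together is lost, so the flat layout cannot tell "Senator" and "California" on the same person from the same values on two different people.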
Define Semantics
Query:
+TEXT_Text:“drought”
+NER_Role:“Senator”
+NER_State:“California”
Interpretation 1: [timeline diagram with labels: “drought”; Role: Senator State: California; start/end markers]
Interpretation 2: [timeline diagram with labels: “drought” (twice); Role: Senator State: California; two pairs of start/end markers]
Interpretation 3: [timeline diagram with labels: “drought”; Role: Senator; State: California; start/end markers]
Interpretation 4: [timeline diagram with labels: “drought”; Role: Senator; State: California; start/end markers]
3. Make the Search Meaningful
Two Levels of Document
program document (1 document = 1 news program)
tag document, tag document, tag document (1 document = 1 metadata instance)
1. search metadata content
2. look up program document(s)
3. filter by program information
tag document (Tag: NER; NER_Role: Senator; NER_State: California): match
tag document (Tag: NER; NER_Role: Senator; NER_State: Arizona): NOT match
tag document (Tag: NER; NER_Role: Controller; NER_State: California): NOT match
program document: Date, Network, Show
tag document: Tag: NER; NER_Role: Senator; NER_State: California
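The three steps of the two-level layout (search the tag documents, look up their program documents, filter by program information) can be imitated in memory as follows. This is a sketch under stated assumptions: the program IDs and the Date/Network/Show values are made up, and `search` is a hypothetical helper. The last function builds a query string for Solr's block join parent query parser (`{!parent which=...}`); the `doc_type:program` parent filter is a hypothetical field, and the slides do not show the exact query the project uses.

```python
# Parent (program) documents; IDs and values are made up for illustration.
programs = {
    "p1": {"Date": "2014-05-01", "Network": "Network A", "Show": "Show X"},
    "p2": {"Date": "2014-05-02", "Network": "Network B", "Show": "Show Y"},
}

# Child (tag) documents: one per metadata instance, each pointing at its program.
tags = [
    {"program": "p1", "Tag": "NER", "NER_Role": "Senator", "NER_State": "Arizona"},
    {"program": "p1", "Tag": "NER", "NER_Role": "Controller", "NER_State": "California"},
    {"program": "p2", "Tag": "NER", "NER_Role": "Senator", "NER_State": "California"},
]


def search(tag_query, program_filter=None):
    """Return IDs of programs that own a tag document matching the whole tag query."""
    # 1. Search metadata content: all terms must match within ONE tag document.
    hits = [t for t in tags if all(t.get(f) == v for f, v in tag_query.items())]
    # 2. Look up the program document(s) that own the matching tag documents.
    program_ids = sorted({t["program"] for t in hits})
    # 3. Filter by program information.
    program_filter = program_filter or {}
    return [
        pid for pid in program_ids
        if all(programs[pid].get(f) == v for f, v in program_filter.items())
    ]


def block_join_query(parent_filter, child_query):
    """Build a Solr block join parent query string (parent filter is an assumption)."""
    return f'{{!parent which="{parent_filter}"}}{child_query}'


senator_from_california = search({"NER_Role": "Senator", "NER_State": "California"})
solr_q = block_join_query("doc_type:program", "+NER_Role:Senator +NER_State:California")
```

Program `p1` contains both "Senator" and "California", but on different tag documents, so it is not returned; only `p2` is.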
Filter by Metadata Boundaries
“drought”
time
start end
“drought” “drought”
Role: Governor
State: California
Filter by Metadata Boundaries
...
EMERGENCY PLED TO THE STATE
OF CALIFORNIA IN MAY TO
CONSERVE WATER.
>> THIS DROUGHT IS A BIG
WAKE-UP CALL, A REMINDER.
THE COUPLE SAYS
THAT THEY NEED NO
REMINDERS.
...
36:18
36:22
36:18 – 36:22 Tag: NER, Name: Jerry Brown, Role: Governor, State: California
36:19
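Filtering by metadata boundaries can be pictured as keeping only those text hits whose time falls inside the span of a tag with the required attributes. The sketch below uses the times from the slide (a hit at 36:19 inside the 36:18 to 36:22 tag) plus one made-up hit at 40:00 for contrast. The helper names and the inclusive boundary test are assumptions, not the project's implementation.

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def within_boundaries(hit_times, tags, required):
    """Keep hit times that fall inside a tag whose attributes include `required`."""
    kept = []
    for t in hit_times:
        for tag in tags:
            inside = tag["start"] <= t <= tag["end"]
            has_attrs = required.items() <= tag["attrs"].items()
            if inside and has_attrs:
                kept.append(t)
                break
    return kept


tags = [{
    "start": to_seconds("36:18"),
    "end": to_seconds("36:22"),
    "attrs": {"Tag": "NER", "Name": "Jerry Brown",
              "Role": "Governor", "State": "California"},
}]

# Times at which "drought" occurs in the caption; 40:00 is a made-up second hit.
hits = [to_seconds("36:19"), to_seconds("40:00")]
kept = within_boundaries(hits, tags, {"Role": "Governor", "State": "California"})
```

Only the 36:19 hit survives, because it is the only one that lies within the span of the Governor of California tag.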
4. Make the Search More Powerful
Proximity Search – Word as Unit
...
>> THIS DROUGHT IS A BIG
WAKE-UP CALL, A REMINDER.
THE COUPLE SAYS
THAT THEY NEED NO
REMINDERS
THEY DO ADMIT THAT THEIR
LAWN HAS BECOME A BIT
UNSIGHTLY.
...
position 100
position 121
20 words
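With words as the unit, the distance between two terms is the number of words between their positions. A minimal sketch using the positions from the slide (100 and 121, with 20 words in between) follows; the function name `within_words` is hypothetical.

```python
def within_words(positions_a, positions_b, max_between):
    """True if some pair of occurrences has at most `max_between` words between them."""
    return any(
        abs(a - b) - 1 <= max_between
        for a in positions_a
        for b in positions_b
    )


# Term A occurs at position 100, term B at position 121: 20 words lie between them.
close_enough = within_words([100], [121], 20)
too_far = within_words([100], [121], 19)
```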
Proximity Search – Time as Unit
...
>> THIS DROUGHT IS A BIG
WAKE-UP CALL, A REMINDER.
THE COUPLE SAYS
THAT THEY NEED NO
REMINDERS
THEY DO ADMIT THAT THEIR
LAWN HAS BECOME A BIT
UNSIGHTLY.
...
36:19
36:25
6 s
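With time as the unit, the same test compares timestamps instead of word positions. The sketch below assumes every term occurrence carries the time at which it was spoken, in seconds; `within_seconds` and `to_seconds` are hypothetical helpers, and the values are the ones on the slide (36:19 and 36:25, 6 s apart).

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def within_seconds(times_a, times_b, max_seconds):
    """True if some pair of occurrences is at most `max_seconds` apart in time."""
    return any(
        abs(a - b) <= max_seconds
        for a in times_a
        for b in times_b
    )


first = [to_seconds("36:19")]
second = [to_seconds("36:25")]
gap = abs(first[0] - second[0])
near = within_seconds(first, second, 6)
not_near = within_seconds(first, second, 5)
```

A time-based distance does not depend on how fast a speaker talks or how many words the captions contain, which is one reason it can be a useful alternative to word counts for broadcast material.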
Make Metadata Searchable – Accomplished
metadata (now searchable)
caption (searchable)
THESE RECALLED CARS ARE AMONG
THE MOST POPULAR FOR THE PAST 12
YEARS.
Thank you for coming!
Questions or comments? My e-mail: [email protected]
Slides available at: http://bit.ly/lsr2014tvnews (or scan this barcode)