Medical Information Retrieval Challenges in a Webbed World William Hersh, M.D. Associate Professor and Chief Division of Medical Informatics and Outcomes Research Oregon Health Sciences University [email protected]

Medical Information Retrieval Challenges in a Webbed World

William Hersh, M.D.

Associate Professor and Chief

Division of Medical Informatics and Outcomes Research

Oregon Health Sciences University

[email protected]

Overview

Describe current information retrieval technology

Summarize IR research activities and their results

Discuss the implications of the World Wide Web (WWW) for IR

Overview of IR Process

[Figure: IR process diagram with the elements Queries, Documents, Indexing language, and Search engine, and the processes Retrieval and Indexing]

What is the field of IR?

Concerned with creation, storage, organization, and retrieval of computer-based information

“IR” has traditionally focused on retrieval of information from heterogeneous textual databases

Recent expansion to multimedia and integration with “traditional” databases

Why is IR pertinent to health care?

Growth of knowledge has long surpassed human memory capabilities

Clinicians have frequent and unmet information needs

Primary literature on a given topic can be scattered and hard to synthesize

Non-primary literature sources are often neither comprehensive nor systematic

Further reading

Hersh WR, Information Retrieval: A Health Care Perspective, Springer-Verlag, 1996

Hersh WR, Hickam DH, How well do physicians use electronic information retrieval systems? A framework for investigation and systematic review, Journal of the American Medical Association, 1998, 280: 1347-1352

IR state of the art

Databases

Indexing

Retrieval

Evaluation

Databases

Bibliographic
– References to journal literature
– Used in initial IR systems
– Most famous example is MEDLINE

» Nearly 9 million references to peer-reviewed literature dating back to 1966

» Covers about 3,000 journals, mostly English-based

» About 300,000 new references added yearly

» Maintained by National Library of Medicine

Databases (cont.)

Full-text
– Journal literature has been available for over a decade, in text-only form and at high cost
– Last decade has seen increasing growth of CD-ROM market
– New “evidence-based” resources are becoming available, e.g., Best Evidence, Cochrane

Hypertext
– Information linked in non-linear fashion

Indexing

Two major types:
– Human indexing with controlled vocabulary
» MEDLINE uses the 18,000-term Medical Subject Headings (MeSH) vocabulary
– Computer assignment of all words in record
» Often a stop word list to remove common words (e.g., the, and, which) is used
» Some systems “stem” words to root form (e.g., coughs to cough)
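The word-indexing steps above (stop word removal, then stemming) can be sketched in a few lines. This is a minimal illustration only: the stop word list and the plural-stripping rule are assumptions made for the example, not the list or stemmer used by any particular system.

```python
# Toy word indexing: lowercase, drop stop words, then crudely stem.
# STOP_WORDS and the suffix rule are illustrative assumptions only.
import re

STOP_WORDS = {"the", "and", "which", "a", "of", "in", "is"}


def naive_stem(word: str) -> str:
    """Strip a trailing plural 's' (e.g. coughs -> cough); not a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def index_terms(text: str) -> list[str]:
    """Return the indexable terms of a record, in order of appearance."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


print(index_terms("The patient coughs, and the cough persists"))
```

After this step, "coughs" and "cough" index to the same term, so a query on either form can match the record.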

Limitations of indexing approaches

Human indexing
– Inconsistency
– Inadequate indexing vocabulary

Word indexing
– Synonymy - e.g., cancer and carcinoma
– Polysemy - e.g., lead
– Granularity - e.g., antibiotics, penicillin
– Focus

Retrieval

Traditional approach: indexing terms connected by AND, OR

Most bibliographic systems allow searching on both vocabulary and text words

Proximity operators require words to be within a certain range

Some systems hide Boolean operators
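A minimal sketch of the traditional Boolean approach, assuming a tiny inverted index built from three invented records. The names `DOCS`, `build_index`, and `term` are hypothetical helpers for this example, not part of any MEDLINE system.

```python
# Toy Boolean retrieval over an inverted index (term -> set of document ids).
# The three records below are invented for illustration.
from collections import defaultdict

DOCS = {
    1: "penicillin therapy for pneumonia",
    2: "antibiotics in pneumonia",
    3: "penicillin allergy",
}


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each word to the set of documents that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)


def term(word: str) -> set[int]:
    """Documents containing the word (empty set if the word is unseen)."""
    return INDEX.get(word.lower(), set())


# AND narrows the result (set intersection); OR broadens it (set union).
print(sorted(term("penicillin") & term("pneumonia")))    # AND
print(sorted(term("penicillin") | term("antibiotics")))  # OR
```

Seen as set operations, AND can only shrink a result set and OR can only grow it, which is the distinction the operators encode.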

Limitations of retrieval approaches

Novices confuse ANDs and ORs

Complex user interfaces dissuade busy users

Returned documents displayed in arbitrary or, at best, reverse chronological order

An alternative approach to indexing and retrieval

Called vector-space, word-statistical, automated retrieval…

Developed by Salton in the 1960s, but because it works best for end users, it did not achieve commercial prominence until the 1990s

Based on notion of finding similarity in words between user’s query and document

Used in Knowledge Finder (Aries) and most Web search engines

Word-statistical indexing

Indexing done of all words (though nothing precludes use of MeSH or other terms)

After stop word filtering and stemming, each word in each document is assigned a weight based on the product IDF*TF:
– Inverse document frequency of term i
» IDF_i = log(# documents / # documents with term i) + 1
– Term frequency of term i in document j
» TF_ij = log(frequency of term i in document j) + 1
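As a rough sketch, the two formulas translate directly into code. The slide does not state the base of the logarithm, so the natural log used here is an assumption; the function names are hypothetical.

```python
# IDF*TF weighting as given on the slide; natural log is an assumption.
import math


def idf(n_docs: int, n_docs_with_term: int) -> float:
    """Inverse document frequency: rarer terms get larger values."""
    return math.log(n_docs / n_docs_with_term) + 1


def tf(freq_in_doc: int) -> float:
    """Term frequency weight; freq_in_doc must be at least 1."""
    return math.log(freq_in_doc) + 1


def weight(freq_in_doc: int, n_docs: int, n_docs_with_term: int) -> float:
    """Weight of a term in a document: the product of TF and IDF."""
    return tf(freq_in_doc) * idf(n_docs, n_docs_with_term)


# A term occurring 3 times in a document and found in 10 of 1,000 documents
print(round(weight(3, 1000, 10), 2))
```

A term present in every document gets an IDF of 1, so it still contributes but never dominates, while the logarithm on the term frequency damps the effect of heavy repetition within one document.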

Word-statistical retrieval

Queries entered in natural language, subject to same stop list and stemming

Each document gets a score based on sum of weights for each query term in the document

Results are sorted and presented to user (relevance ranking)
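The retrieval step can be sketched as follows, under stated assumptions: three invented documents, a small illustrative stop list, natural logs, and no stemming (omitted to keep the example short). The function names are hypothetical.

```python
# Toy relevance ranking: score = sum of TF*IDF weights of the query terms
# that occur in each document. Documents and stop list are invented.
import math
from collections import Counter

DOCS = {
    "d1": "hypertension treatment with diuretics",
    "d2": "treatment of asthma in children",
    "d3": "diuretics diuretics and potassium loss",
}
STOP_WORDS = {"the", "and", "of", "in", "with", "for"}


def terms(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs for matching documents, best first."""
    counts = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores: dict[str, float] = {}
    for doc_id, tf_counts in counts.items():
        score = 0.0
        for term in set(terms(query)):
            freq = tf_counts.get(term, 0)
            if freq == 0:
                continue  # query term absent from this document
            df = sum(1 for c in counts.values() if term in c)
            score += (math.log(freq) + 1) * (math.log(n_docs / df) + 1)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


for doc_id, score in rank("treatment for hypertension with diuretics", DOCS):
    print(doc_id, round(score, 2))
```

Unlike a Boolean AND, a document that matches only some of the query terms is still returned; it simply ranks lower.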

Page 17: Medical Information Retrieval Challenges in a Webbed World William Hersh, M.D. Associate Professor and Chief Division of Medical Informatics and Outcomes

This approach allows other features:

Relevance feedback
– After user designates relevant documents, query modified

Query expansion
– Same but using top-ranked documents without user relevance designations
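The slides do not say how the query is modified. One common formulation is a Rocchio-style update, in which the weights of terms from the relevant documents are averaged and added to the original query. The sketch below assumes that formulation; the `alpha` and `beta` values and the term weights are illustrative, not taken from the source.

```python
# Rocchio-style query modification (an assumed formulation, not the
# method of any specific system named in these slides).

def modify_query(
    query: dict[str, float],
    relevant_docs: list[dict[str, float]],
    alpha: float = 1.0,
    beta: float = 0.75,
) -> dict[str, float]:
    """Return alpha * query + beta * (mean of the relevant document vectors)."""
    new_query = {term: alpha * w for term, w in query.items()}
    if not relevant_docs:
        return new_query
    for doc in relevant_docs:
        for term, w in doc.items():
            new_query[term] = new_query.get(term, 0.0) + beta * w / len(relevant_docs)
    return new_query


query = {"cough": 1.0}
relevant = [
    {"cough": 1.0, "bronchitis": 2.0},
    {"bronchitis": 2.0, "sputum": 1.0},
]
print(modify_query(query, relevant))
```

For relevance feedback, `relevant_docs` holds the documents the user marked as relevant. For query expansion, the same function would be called with the top-ranked documents from the first search instead.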

Evaluation

What questions to ask?
– Is system used?
– Are users satisfied?
– Do they find relevant information?
– Do they complete their desired task?

Most research has focused on retrieval of relevant documents

Relevance-based measures

Recall = (# retrieved and relevant documents) / (# relevant documents in collection)

Precision = (# retrieved and relevant documents) / (# documents retrieved in search)
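A small sketch of both measures, assuming the retrieved and relevant documents are given as sets of identifiers. Recall divides the hits by the number of relevant documents in the collection; precision divides them by the number of documents the search retrieved. The document ids are invented.

```python
# Recall and precision from sets of document ids (ids are invented).

def recall_precision(retrieved: set[int], relevant: set[int]) -> tuple[float, float]:
    """Return (recall, precision); either is 0.0 if its denominator is empty."""
    hits = len(retrieved & relevant)  # retrieved AND relevant
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


retrieved = {1, 2, 3, 4}     # what the search returned
relevant = {2, 4, 5, 6, 7}   # what a judge marked relevant in the collection
print(recall_precision(retrieved, relevant))
```

Here two of the five relevant documents were found, and two of the four retrieved documents were relevant.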

Comments about recall and precision

There tends to be a trade-off between the two

“Relevance” can be a slippery notion

It is unclear whether they correlate with a user’s success in using an IR system

The proliferation of standard test collections leads to a great deal of research that excludes real users

How well do users search? (Haynes et al., Annals of Internal Medicine, 1990)

Searcher               Recall  Precision
Novice                 27%     38%
Experienced clinician  48%     48%
Librarian              49%     57%

More searching results (Hersh et al., Bull Med Libr Assoc, 1994)

Searchers          System            Retrieved  Recall  Precision
Novice physicians  Knowledge Finder  88.8       68.2    14.7
Novice physicians  KF top 15         14.6       31.2    24.8
Librarians         Full MEDLINE      18.0       37.1    36.1
Librarians         Text words only   17.0       31.5    31.9
Exp. physicians    Full MEDLINE      10.9       26.6    34.9
Exp. physicians    Text words only   14.8       30.6    31.4

Other results

Little overlap among retrieval sets
– Searchers tend to find similar quantities of disparate relevant documents

Novice searchers are satisfied with results
– Adequate information or ignorant bliss?

New approaches to evaluation

Changing the research questions
– How well can clinical users answer questions?
– What factors are associated with success?
» Demographics, experience, cognitive factors, and searching mechanics?

Ongoing study funded by NLM

Challenges
– Appropriate questions, database, sample size, etc.

IR research directions

Enhancing word-statistical approaches

Linguistic approaches

Enhancing conventional indexing and retrieval

Enhancing word-statistical approaches

Passage retrieval
– Giving weight to documents that have sections mapping closely to the query

Use of phrases
– High, blood, and pressure have more meaning when occurring near each other
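One simple way to detect that query words occur near each other is a sliding window over the document's words. This is a minimal sketch; the window sizes and sample texts are assumptions for illustration, and real systems may identify phrases quite differently.

```python
# Sliding-window proximity check: do all the terms fall within
# `window` consecutive word positions? Sample texts are invented.

def within_window(words: list[str], phrase_terms: list[str], window: int) -> bool:
    """True if every phrase term occurs inside some run of `window` words."""
    wanted = set(phrase_terms)
    for start in range(len(words)):
        if wanted <= set(words[start:start + window]):
            return True
    return False


terms = ["high", "blood", "pressure"]
near = "patients with high blood pressure".split()
far = "high fever and low blood sugar under pressure".split()

print(within_window(near, terms, window=3))
print(within_window(far, terms, window=3))
```

Both texts contain all three words, so a plain word match treats them alike; only the first has them adjacent, which is the case the phrase is meant to capture.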

Linguistic approaches

Syntactic approaches
– Conceptual matter tends to occur in noun phrases

Semantic approaches
– Can we overcome problems of synonymy, polysemy, granularity, etc.?

Identifying semantics in documents

SAPHIRE (Hersh and Hickam, 1995)
– Direct mapping of text to terms in large controlled vocabulary (UMLS Metathesaurus)
– Works best when exact terms and synonyms present, less well when terms vague or synonyms non-standard

MEDSPACE (Schatz, 1997)
– Large-scale processing to uncover underlying related terms and literatures

Enhancing conventional systems

Better content
– Evidence-based resources
» More informative abstracts, e.g., Best Evidence
» Systematic reviews, e.g., Cochrane Database of Systematic Reviews
» Critically-Appraised Topics (CATs)

Better indexing
– NLM’s MedIndex system provides expert assistance to indexers

Enhancing conventional systems (cont.)

Better retrieval
– NLM’s Internet Grateful Med looks for common searching mistakes (e.g., excessive ANDs) and informs searcher

Better vocabularies
– NLM’s UMLS Project adds terminology from other vocabularies

IR and the World Wide Web

Indexing and retrieval approaches

Implications for scientific publishing

Implications for health care

Limitations

Indexing and retrieval on the Web

Web crawlers
– Index everything they find
– Examples: Alta Vista, InfoSeek, Lycos
– Problems: non-discriminating, word only

Filtering and/or classifying
– Sites filtered and/or classified based on criteria
– Examples: Yahoo, CliniWeb, OMNI
– Problems: maintenance, intended audience

Implications for scientific publishing

Peer-review process
– Imperfect but best means for controlling quality in publications

Responsibility
– Increased anonymity of Web enhances ability for misrepresentation

Liability
– Who is liable for inaccurate information?

Implications for health care

Informativeness vs. marketing
– There is potential conflict between providing information and self-promotion

Patient empowerment
– Absolutely important but much potential for damage from misinformation

Much medical information is on the Web

“Free” information from government agencies, medical schools, and advocacy groups is easy to access and use

“Best” information from traditional medical publishers still costly and fragmented

Some well-known launching pads
– Medical Matrix: www.medmatrix.org/
– CliniWeb: www.ohsu.edu/cliniweb/

Limitations of the Web (Hersh, ACPJC, 1996)

Difficult to find information - a diversity of different search engines, each with its own benefits and limitations

Everyone can be a publisher - Good for democratic society, less so for scientific and professional fields

Misrepresentation and fraud - Web can amplify misinformation and allow easy fraud

Some have expressed concern about free information on Web

Silberg et al. (JAMA, 1997) suggested standards for health information on Web

– Authorship - names, affiliations, and credentials

– Attribution - references, sources, and (where appropriate) copyright

– Disclosure - potential and real conflicts of interest

– Currency - dates content posted and updated

But applicability and quality of Web content are poor

Hersh, Gorman, and Sacherek, JAMA, 1998

Searched on 50 questions generated by clinicians

Less than 10% of pages relevant, none for half of queries

Low percentage of JAMA quality indicators

Final thoughts

We are on the threshold of an exciting new era in communications and information dissemination
– Integrity of information and responsibility for it must be maintained
– It should augment and not substitute for human communication

References Cited:

1. Hersh, W., Information Retrieval: A Health Care Perspective. 1996, New York: Springer-Verlag.

2. Hersh, W. and D. Hickam, How well do physicians use electronic information retrieval systems? A framework for investigation and systematic review. Journal of the American Medical Association, 1998. 280: 1347-1352.

3. Haynes, R., et al., Online access to MEDLINE in clinical settings. Annals of Internal Medicine, 1990. 112: 78-84.

4. Hersh, W. and D. Hickam, The use of a multi-application computer workstation in a clinical setting. Bulletin of the Medical Library Association, 1994. 82: 382-389.

5. Hersh, W. and D. Hickam, Information retrieval in medicine: the SAPHIRE experience. Journal of the American Society for Information Science, 1995. 46: 743-747.

6. Schatz, B., Information retrieval in digital libraries: bringing search to the net. Science, 1997. 275: 327-334.

7. Hersh, W., Evidence-based medicine and the Internet. ACP Journal Club, 1996. 5(4): A12-A14.

8. Silberg, W., G. Lundberg, and R. Musacchio, Assessing, controlling, and assuring the quality of medical information on the Internet: caveat lector et viewor - let the reader and viewer beware. Journal of the American Medical Association, 1997. 277: 1244-1245.

9. Hersh, W., P. Gorman, and L. Sacherek, Applicability and quality of information for answering clinical questions on the Web. Journal of the American Medical Association, 1998. 280: 1307-1308.

URLs:

Division of Medical Informatics & Outcomes Research: www.ohsu.edu/bicc-informatics/

CliniWeb: www.ohsu.edu/cliniweb/

SAPHIRE International: www.ohsu.edu/cliniweb/saphint/