Alexandria ACM SC | Introduction to Natural Language Processing

  • 8/22/2019 Alexandria ACM SC | Introduction to Natural Language Processing

    Ahmad M. Bakr

    Computer and Systems Engineering Department

    Faculty of Engineering, Alexandria University, Egypt

    Introduction to Natural Language Processing

    Agenda

    Introduction.

    Basic text processing techniques.

    Information Retrieval.

    Sentiment Analysis.

    Named Entity Recognition.

    Question Answering.

    Relation Extraction.

    Introduction

    NLP is a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages that humans use naturally, in order to interface with computers.

    Natural language processing aims to teach computers to understand the way humans learn and use language.

    Typical NLP applications include:

    Speech processing: get flight information or book a hotel over the phone.

    Information extraction: discover the names of people, and the events they participate in, from a document.

    Machine translation: translate a document from one human language into another.

    Question answering: find answers to natural language questions in a text collection or database.

    Summarization: generate a short biography of Noam Chomsky from one or more news articles.

    Text Processing

    Text processing is the manipulation of text, especially the transformation of text from one format to another.

    Usually the transformation is from plain text (a set of paragraphs) to a form that is easy to use in calculations.

    The Vector Space Model (VSM) is one of the forms used by applications to represent a document as a vector of its words:

    $d_j = \{W_1, W_2, W_3, \ldots, W_n\}$

    Each word is assigned a weight (e.g., TF-IDF):

    $\text{Weight} = \text{Term Frequency} \times \frac{1}{\text{Document Frequency}}$

    Similarity between two documents can be calculated as the similarity between the vectors of these documents.
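
    The weighting and similarity steps can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own implementation: it assumes whitespace tokenization, uses the simplified weight above (term frequency divided by document frequency), and picks cosine similarity as the vector similarity measure. Production systems usually use a logarithmic inverse document frequency instead. The function names and sample documents are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, weight = TF * 1/DF."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({word: count / df[word] for word, count in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

    The first two documents share several words, so their similarity is higher than that of the first and third documents, which share none.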

    Information Retrieval

    Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.

    Information is usually indexed to speed up queries.

    An inverted index is one of the primary ways to index text based on its words: it maps each word to the list of documents that contain it.
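
    A minimal sketch of such an index follows, assuming whitespace tokenization and documents identified by their position in a list. The helper names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search_all(index, query):
    """Return IDs of documents that contain every word of the query."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents.
docs = ["new home sales rise", "home prices rise in july", "new car sales fall"]
index = build_inverted_index(docs)
hits = search_all(index, "home rise")
```

    A query is answered by intersecting the posting lists of its words, so only the documents listed for every query word are touched rather than the whole collection.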

    Can we use an inverted index to search for the sentence "A B C"?
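
    A plain inverted index only records which documents contain a word, so it cannot tell "A B C" apart from "A C B". One common workaround, which is not necessarily the one these slides have in mind, is a positional index that also stores where each word occurs. The sketch below assumes whitespace tokenization; all names are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return IDs of documents containing the phrase's words consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    matches = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word k of the phrase must appear at position start + k.
            if all(start + k in index[word].get(doc_id, [])
                   for k, word in enumerate(words)):
                matches.append(doc_id)
                break
    return sorted(matches)


# Hypothetical sample documents: all three contain a, b and c.
docs = ["a b c", "a c b", "c a b c"]
positional_index = build_positional_index(docs)
phrase_hits = phrase_search(positional_index, "a b c")
```

    All three sample documents contain the words a, b, and c, but only the first and third contain them in the requested order.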

    Document Index Graph

    Sentiment Analysis

    Sentiment analysis, or opinion mining, refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials.

    Techniques:

    Maintaining a list of words for each class. Example: "This is a nice movie", "This is a bad movie".

    Using classifiers that are trained separately with sentences for each class.
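
    The word-list technique can be sketched as follows. The two word lists are tiny, hypothetical examples, and the scoring rule (count positive words, subtract negative words) is an assumption made for illustration.

```python
POSITIVE_WORDS = {"nice", "good", "great", "excellent"}  # hypothetical list
NEGATIVE_WORDS = {"bad", "poor", "awful", "boring"}      # hypothetical list


def classify_sentiment(sentence):
    """Label a sentence by counting the words found in each class list."""
    words = [word.strip(".,!?").lower() for word in sentence.split()]
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


label_nice = classify_sentiment("This is a nice movie")
label_bad = classify_sentiment("This is a bad movie")
```

    A pure word list ignores context, so a sentence such as "This is not a nice movie" would still be labeled positive. Handling cases like this is one motivation for the trained classifiers mentioned above.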

    Named Entity Recognition

    NER is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, etc.

    Approaches:

    Database-based recognition (e.g., WordNet).

    Rule-based models.

    Statistical models (e.g., HMM and Maximum Entropy).

    Wikipedia-based NER

    Index all page titles.

    Two-phase algorithm:

    Phase one: given a text, search it for all indexed titles.

    Phase two: score the candidate titles.

    What factors should the scoring formula consider?
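
    The two phases can be sketched as below. The slides leave the scoring formula open, so the score used here (prefer the title with the most words and drop overlapping shorter ones) is only one assumed factor. The title index and its category labels are hypothetical stand-ins for a real index built from Wikipedia page titles.

```python
def find_candidates(text, title_index):
    """Phase one: return (start, end, title) for every indexed title in text."""
    words = [word.strip(".,!?") for word in text.split()]
    max_len = max((len(title.split()) for title in title_index), default=0)
    candidates = []
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            span = " ".join(words[start:end])
            if span in title_index:
                candidates.append((start, end, span))
    return candidates


def score_candidates(candidates):
    """Phase two: keep non-overlapping candidates, preferring longer titles."""
    # Assumed scoring factor: the number of words in the title.
    ranked = sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen, used = [], set()
    for start, end, title in ranked:
        positions = set(range(start, end))
        if not positions & used:
            chosen.append((start, end, title))
            used |= positions
    return sorted(chosen)


# Hypothetical title index with made-up category labels.
title_index = {
    "Alexandria": "location",
    "Alexandria University": "organization",
    "Egypt": "location",
}
text = "Ahmad studies at Alexandria University in Egypt."
candidates = find_candidates(text, title_index)
entities = [(title, title_index[title])
            for _, _, title in score_candidates(candidates)]
```

    Phase one finds both "Alexandria" and "Alexandria University" at the same position. Phase two keeps only the longer title, which is why the length of the match is a plausible scoring factor.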

    Question Answering

    What is Question Answering?

    QA is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.

    A QA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, QA systems can pull answers from an unstructured collection of natural language documents.

    Question Classification

    The question classifier module determines the type of question and the type of answer.

    Examples:

    1) "Who discovered x-rays?" should be classified into the type human (individual).

    2) "Where is Alexandria located?" should be classified into the type place.

    Rule-based approaches.

    Classifiers trained with the possible question types.

    The question is put in the form of a parse tree to capture the relationships between its entities (e.g., subjects, objects, etc.).

    The main purpose of the parse tree is to understand the question and the links between its entities.
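
    A rule-based classifier can be sketched with a small table that maps leading question words to answer types. The rules below are hypothetical and cover only a few cases. The sketch does not build a parse tree.

```python
# Hypothetical rules: (leading question words, expected answer type).
QUESTION_RULES = [
    ("who", "human (individual)"),
    ("where", "place"),
    ("when", "time"),
    ("how many", "quantity"),
]


def classify_question(question):
    """Return the expected answer type based on the leading question words."""
    normalized = question.strip().lower()
    for prefix, answer_type in QUESTION_RULES:
        if normalized.startswith(prefix + " "):
            return answer_type
    return "unknown"


type_who = classify_question("Who discovered x-rays?")
type_where = classify_question("Where is Alexandria located?")
```

    Questions that match no rule fall through to "unknown", which is where a trained classifier would normally take over.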

    Query Formulation

    Apply text processing techniques to form a query from the question.

    Techniques such as:

    Stemming (Swimming → Swim).

    Adding synonyms (USA → United States of America).

    Giving weights to the words of the question (nouns take higher weights).
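
    The three techniques can be combined in a rough sketch. Everything specific here is an assumption made for illustration: the suffix-stripping rule stands in for a real stemmer such as the Porter stemmer, the synonym table has a single hypothetical entry, and a fixed noun set replaces a part-of-speech tagger.

```python
SYNONYMS = {"usa": ["united states of america"]}  # hypothetical synonym table
NOUNS = {"swimming", "medals", "usa"}  # stand-in for a part-of-speech tagger


def stem(word):
    """Very rough suffix stripping; real systems use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling: "swimm" -> "swim".
            if word[-1] == word[-2]:
                word = word[:-1]
            break
    return word


def formulate_query(question):
    """Turn a question into a {term: weight} query."""
    query = {}
    for raw in question.lower().split():
        word = raw.strip("?.,!")
        if not word:
            continue
        weight = 2.0 if word in NOUNS else 1.0  # nouns take higher weights
        query[stem(word)] = weight
        for synonym in SYNONYMS.get(word, []):
            query[synonym] = weight
    return query


query = formulate_query("Who won swimming medals for the USA?")
```

    The resulting query contains the stem "swim" and the expansion "united states of america", both with the higher noun weight.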

    Search Knowledge Base

    The main target is to identify the paragraphs that possibly contain answers to the user's question.

    The knowledge base is usually indexed.

    Answer Extraction

    Parse the candidate paragraphs to extract sentences with possible answers.

    Construct the parse tree of the matching sentences.

    The parse tree gives insights about the relationships between the entities of a candidate sentence.

    Rank the possible answers based on their relevance to the question.
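
    The extraction and ranking steps can be approximated without a parser. The sketch below swaps the parse tree for plain word overlap between the question and each candidate sentence, which is a much cruder relevance measure than the slides describe. The function names and the sample paragraphs are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_answers(question, paragraphs):
    """Split paragraphs into sentences and rank them by word overlap."""
    question_words = tokenize(question)
    scored = []
    for paragraph in paragraphs:
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            overlap = len(question_words & tokenize(sentence))
            if overlap:
                scored.append((overlap, sentence))
    # Highest overlap first; ties keep their original order (stable sort).
    scored.sort(key=lambda pair: -pair[0])
    return [sentence for _, sentence in scored]


# Hypothetical candidate paragraphs returned by the search step.
paragraphs = [
    "Wilhelm Roentgen discovered x-rays in 1895. He was a German physicist.",
    "Alexandria is located on the Mediterranean coast of Egypt.",
]
ranked = rank_answers("Who discovered x-rays?", paragraphs)
```

    Sentences that share no word with the question are dropped, and the remaining ones are returned with the best match first.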