
Speech To Text Conversion

    CVSR COLLEGE OF ENGINEERING DEPARTMENT OF ECE

    CHAPTER 1

    INTRODUCTION


Since the beginning of the 21st century, technical and communication fields have undergone revolutionary development. These developments have affected several areas such as medicine, industry, and even translation. Although this generation faced many problems and difficulties, people searched for new ideas that save time, effort, and money. In translation, many computer programs, electronic tools, and websites have been developed to serve this area, for instance online translation sites, translation programs, electronic dictionaries, corpora, etc. One such area of interest is speech recognition.

Speech recognition is the process of capturing spoken words using a microphone or telephone and converting them into a digitally stored set of words. Speech recognition reduces the overhead caused by alternative communication methods. Speech has not been used much in the field of electronics and computers due to the complexity and variety of speech signals and sounds. However, with modern processes, algorithms, and methods, we can process speech signals easily and recognize the text.


1.1 OBJECTIVE

This seminar presents an on-line speech-to-text engine implemented as a system-on-a-programmable-chip (SOPC) solution. The system acquires speech at run time through a microphone and processes the sampled speech to recognize the uttered text. A hidden Markov model (HMM) is used for speech recognition, which converts the speech to text. The recognized text can be stored in a file on a PC that connects to an FPGA on a development board using a standard RS-232 serial cable. Our speech-to-text system directly acquires and converts speech to text. It can supplement other larger systems, giving users a different choice for data entry. A speech-to-text system can also improve system accessibility by providing data entry options for blind, deaf, or physically handicapped users.


    CHAPTER 2

DESIGN IMPLEMENTATION DESCRIPTION PARTS


2.1 AUDIO INPUT HARDWARE

PULSE CODE MODULATION

Pulse Code Modulation (PCM) is an extension of PAM wherein each analogue sample value is quantized into a discrete value for representation as a digital code word. Thus, as shown below, a PAM system can be converted into a PCM system by adding a suitable analogue-to-digital (A/D) converter at the source and a digital-to-analogue (D/A) converter at the destination.

Fig 2.1 PCM modulation and demodulation (modulator: analogue input → LPF → sampler → A/D converter → binary coder → parallel-to-serial converter → digital pulse generator → PCM output; demodulator: PCM input → serial-to-parallel converter → D/A converter → LPF → analogue output)

In PCM the speech signal is converted from analogue to digital form. PCM is standardised for telephony by the ITU-T (International Telecommunication Union - Telecommunication Standardization Sector, a branch of the UN), in a series of recommendations called the G series. For example, the ITU-T recommendations for out-of-band signal rejection in PCM voice coders require that 14 dB of attenuation is provided at 4 kHz. Also, the ITU-T transmission quality specification for telephony terminals requires that the frequency response of the handset microphone has a sharp roll-off from 3.4 kHz.

In quantization the levels are assigned a binary codeword. All sample values falling between two quantization levels are considered to be located at the centre of the quantization interval. In this manner the quantization process introduces a certain amount of error or distortion into the signal samples. This error, known as quantization noise, is minimised by establishing a large number of small quantization intervals. Of course, as the number of quantization intervals increases, so must the number of bits increase to uniquely identify the quantization intervals.


For example, if an analogue voltage level is to be converted to a digital system with 8 discrete levels or quantization steps, three bits are required. In the ITU-T version there are 256 quantization steps, 128 positive and 128 negative, requiring 8 bits. A positive level is represented by having bit 8 (the MSB) at 0, and for a negative level the MSB is 1.
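To make the bit assignment concrete, the following is a minimal C sketch of uniform quantization into such 8-bit code words, with the MSB as the sign bit. It is an illustration only: the actual ITU-T G.711 coders use logarithmic (A-law/µ-law) companding rather than the uniform intervals shown here.

    #include <stdint.h>
    #include <math.h>

    /* Uniformly quantize a sample in [-1.0, 1.0] into an 8-bit code word:
     * MSB = 0 for a positive level, 1 for a negative level; the remaining
     * 7 bits select one of 128 quantization intervals per polarity. */
    uint8_t quantize_uniform(double sample)
    {
        uint8_t sign = (sample < 0.0) ? 0x80 : 0x00;  /* MSB is the sign bit */
        double mag = fabs(sample);
        if (mag > 1.0)
            mag = 1.0;                  /* clip overloads to full scale */
        int level = (int)(mag * 128.0); /* interval index, 0..127 */
        if (level > 127)
            level = 127;
        return sign | (uint8_t)level;   /* decoder reconstructs interval centre */
    }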

    2.2 FPGA INTERFACE

The trend in hardware design is towards implementing a complete system, intended for various applications, on a single chip. The advent of high-density FPGAs with high-capacity RAMs, and support for soft-core processors such as Altera's Nios II processor, has enabled designers to implement a complete system on a chip. FPGAs provide the following benefits:

• FPGA systems are portable, cost effective, and consume very little power compared to PCs. A complete system can be implemented easily on a single chip because complex integrated circuits (ICs) with millions of gates are available now.

• System-on-a-programmable-chip (SOPC) Builder can trim months from a design cycle by simplifying and accelerating the design process. It integrates complex system components such as intellectual property (IP) blocks, memories, and interfaces to off-chip devices including application-specific standard products (ASSPs) and ASICs on Altera high-density FPGAs.

• SOPC methodology gives the designer flexibility when writing code, and supports both high-level languages (HLLs) and hardware description languages (HDLs). The time-critical blocks can be implemented in an HDL while the remaining blocks are implemented in an HLL. It is easy to change the existing algorithm in hardware and software by simply modifying the code.

• FPGAs provide the best of both worlds: a microcontroller or RISC processor can efficiently perform control and decision-making operations while the FPGA can perform digital signal processing (DSP) operations and other computationally intensive tasks.

• An FPGA supports hardware/software co-design in which the time-critical blocks are written in HDL and implemented as hardware units, while the remaining application logic is written in C. The challenge is to find a good tradeoff between the two. Both the processor and the custom hardware must be optimally designed such that neither is idle or under-utilized.

    2.3 NIOS II PROCESSOR

The Nios II processor is a general-purpose RISC processor core with the following features:

• Full 32-bit instruction set, data path, and address space
• 32 general-purpose registers
• Optional shadow register sets
• 32 interrupt sources
• External interrupt controller interface for more interrupt sources
• Single-instruction 32 × 32 multiply and divide producing a 32-bit result
• Floating-point instructions for single-precision floating-point operations
• Single-instruction barrel shifter
• Access to a variety of on-chip peripherals, and interfaces to off-chip memories and peripherals
• Hardware-assisted debug module enabling processor start, stop, step, and trace under control of the Nios II software development tools
• Optional memory management unit (MMU) to support operating systems that require MMUs
• Optional memory protection unit (MPU)
• Software development environment based on the GNU C/C++ tool chain and the Nios II Software Build Tools (SBT) for Eclipse
• Integration with Altera's SignalTap II Embedded Logic Analyzer, enabling real-time analysis of instructions and data along with other signals in the FPGA design
• Instruction set architecture (ISA) compatible across all Nios II processor systems
• Performance up to 250 DMIPS

A Nios II processor system is equivalent to a microcontroller or "computer on a chip" that includes a processor and a combination of peripherals and memory on a single chip. A Nios II processor system consists of a Nios II processor core, a set of on-chip peripherals, on-chip memory, and interfaces to off-chip memory, all implemented on a single Altera device. Like a microcontroller family, all Nios II processor systems use a consistent instruction set and programming model.

2.4 DEVELOPMENT AND EDUCATION BOARD (DE2)

The DE2 package includes:

• DE2 board
• USB cable for FPGA programming and control
• CD-ROM containing the DE2 documentation and supporting materials, including the User Manual, the Control Panel utility, reference designs and demonstrations, device datasheets, tutorials, and a set of laboratory exercises
• CD-ROMs containing Altera's Quartus II Web Edition and the Nios II Embedded Design Suite Evaluation Edition software
• Bag of six rubber (silicon) covers for the DE2 board stands; the bag also contains some extender pins, which can be used to facilitate easier probing of the board's I/O expansion headers with testing equipment
• Clear plastic cover for the board
• 9 V DC wall-mount power supply

The DE2 board has many features that allow the user to implement a wide range of designed circuits, from simple circuits to various multimedia projects.

The following hardware is provided on the DE2 board:

• Altera Cyclone II 2C35 FPGA device
• Altera Serial Configuration device (EPCS16)
• USB Blaster (on board) for programming and user API control; both JTAG and Active Serial (AS) programming modes are supported
• 512-Kbyte SRAM
• 8-Mbyte SDRAM
• 4-Mbyte Flash memory (1 Mbyte on some boards)
• SD Card socket
• 4 pushbutton switches
• 18 toggle switches
• 18 red user LEDs
• 9 green user LEDs
• 50-MHz oscillator and 27-MHz oscillator for clock sources
• 24-bit CD-quality audio CODEC with line-in, line-out, and microphone-in jacks
• VGA DAC (10-bit high-speed triple DACs) with VGA-out connector
• TV decoder (NTSC/PAL) and TV-in connector
• 10/100 Ethernet controller with a connector
• USB Host/Slave controller with USB type A and type B connectors
• RS-232 transceiver and 9-pin connector
• PS/2 mouse/keyboard connector
• IrDA transceiver
• Two 40-pin expansion headers with diode protection

In addition to these hardware features, the DE2 board has software support for standard I/O interfaces and a control panel facility for accessing various components. Also, software is provided for a number of demonstrations that illustrate the advanced capabilities of the DE2 board.


    CHAPTER 3

    DESIGN IMPLEMENTATION


    3.1 BLOCK DIAGRAM

Fig 3.1 Block diagram (external hardware: speech acquisition through microphone → speech preprocessing → hidden Markov model → text storage)


    3.2 FUNCTIONAL DESCRIPTION

Functionally, the speech-to-text conversion system has the following blocks:

• Speech acquisition
• Speech preprocessing
• Hidden Markov model (HMM)
• Text storage

SPEECH ACQUISITION

During speech acquisition, speech samples are obtained from the speaker in real time and stored in memory for preprocessing. The microphone input port with the audio codec receives the signal, amplifies it, and converts it into 16-bit PCM digital samples at a sampling rate of 8 kHz. The system needs a parallel/serial interface to the Nios II processor and an application running on the processor that acquires and stores data in memory. The received samples are stored in memory on the Altera Development and Education (DE2) development board.

The codec requires initial configuration, which is performed using custom hardware implemented in the Altera Cyclone II FPGA on the board. The audio codec provides a serial communication interface, which is connected to a UART. We used SOPC Builder to add the UART to the Nios II processor to enable the interface. The UART is connected to the processor through the Avalon bus. The C application running on the HAL transfers data from the UART to the SDRAM. Direct memory access (DMA) transfers data efficiently and quickly, and we may use it instead in future designs.
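As an illustration, the acquisition step on the processor can be sketched in C using the HAL's POSIX-style file interface. The device name /dev/uart_0, buffer length, and one-second capture are assumptions for illustration, not values taken from the design.

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdint.h>

    #define NUM_SAMPLES 8000                /* one second of speech at 8 kHz */

    int16_t speech_buf[NUM_SAMPLES];        /* placed in SDRAM by the linker */

    /* Read 16-bit PCM samples from the UART into memory; returns the
     * number of samples stored, or -1 on error. */
    int acquire_speech(void)
    {
        int fd = open("/dev/uart_0", O_RDONLY);   /* device name assumed */
        if (fd < 0)
            return -1;

        size_t got = 0;
        while (got < sizeof(speech_buf)) {        /* fill the sample buffer */
            ssize_t n = read(fd, (char *)speech_buf + got,
                             sizeof(speech_buf) - got);
            if (n <= 0)
                break;
            got += (size_t)n;
        }
        close(fd);
        return (int)(got / sizeof(int16_t));
    }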

SPEECH PREPROCESSING

The speech signal consists of the uttered digit along with a pause period and background noise. Preprocessing reduces the amount of processing required in later stages. Generally, preprocessing involves taking the speech samples as input, blocking the samples into frames, and returning a unique pattern for each sample, as described in the following steps.

1. The system must identify useful or significant samples from the speech signal. To accomplish this goal, the system divides the speech samples into overlapped frames.


2. The system checks the frames for voice activity using endpoint detection and energy threshold calculations.

3. The speech samples are passed through a pre-emphasis filter.

4. The frames with voice activity are passed through a Hamming window.

5. The system performs autocorrelation analysis on each frame.

6. The system finds linear predictive coding (LPC) coefficients using the Levinson-Durbin algorithm (see the sketch after this list).

7. From the LPC coefficients, the system determines the cepstral coefficients and weights them using a tapered window. The cepstral coefficients serve as feature vectors.
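As a sketch of step 6, the Levinson-Durbin recursion below converts the autocorrelation values r[0..p] of one frame into p LPC coefficients. The function and variable names and the order bound are illustrative, not taken from the design.

    /* Levinson-Durbin recursion: compute LPC coefficients a[1..p] from
     * autocorrelation values r[0..p]; returns the final prediction error. */
    double levinson_durbin(const double *r, double *a, int p)
    {
        double tmp[32];                       /* scratch; assumes p < 32 */
        double err = r[0];

        for (int i = 1; i <= p; i++) {
            double k = r[i];                  /* reflection coefficient k_i */
            for (int j = 1; j < i; j++)
                k -= a[j] * r[i - j];
            k /= err;

            a[i] = k;
            for (int j = 1; j < i; j++)       /* update lower-order coefficients */
                tmp[j] = a[j] - k * a[i - j];
            for (int j = 1; j < i; j++)
                a[j] = tmp[j];

            err *= 1.0 - k * k;               /* prediction error shrinks */
        }
        return err;
    }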

Fig 3.2 Flow chart for speech preprocessing (sampled speech input → block into frames → find maximum energy and ZCR → find starting and ending of sample → pre-emphasis filter → windowing on frame → autocorrelation → LPC/cepstral analysis → cepstrum output for sample)


VOICE ACTIVITY DETECTION

The system uses an endpoint detection algorithm to find the start and end points of the speech. The speech is sliced into frames that are 450 samples long. Next, the system finds the energy and number of zero crossings of each frame. The threshold energy and zero-crossing values are determined based on the computed values, and only frames crossing the thresholds are considered, removing most background noise. We include a small number of frames beyond the starting and ending frames so that we do not miss starting or ending parts that do not cross the threshold but may be important for recognition.
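A minimal C sketch of the two per-frame measurements used here, short-time energy and zero-crossing count, follows; the threshold selection logic built on top of these values is omitted.

    #include <stdint.h>

    /* Short-time energy of one frame of 16-bit samples. */
    double frame_energy(const int16_t *x, int n)
    {
        double e = 0.0;
        for (int i = 0; i < n; i++)
            e += (double)x[i] * (double)x[i];
        return e;
    }

    /* Zero crossings in one frame: count sign changes between
     * consecutive samples. */
    int frame_zero_crossings(const int16_t *x, int n)
    {
        int zc = 0;
        for (int i = 1; i < n; i++)
            if ((x[i - 1] >= 0) != (x[i] >= 0))
                zc++;
        return zc;
    }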

PRE-EMPHASIS

The digitized speech signal s(n) is put through a low-order digital filter to flatten the signal spectrally and make it less susceptible to finite-precision effects later in the signal processing. The filter is represented by the equation

H(z) = 1 - a z^(-1), where a is 0.9375.
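In code, this filter reduces to the one-line difference equation y(n) = s(n) - a·s(n - 1); a minimal C sketch:

    /* Pre-emphasis filter: y(n) = s(n) - a*s(n-1), a = 0.9375.
     * Filters the buffer in place, iterating backwards so that each
     * s(n-1) is read before it is overwritten. */
    void pre_emphasis(double *s, int n)
    {
        const double a = 0.9375;
        for (int i = n - 1; i > 0; i--)
            s[i] -= a * s[i - 1];
    }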

FRAME BLOCKING

Speech frames are formed with a duration of 56.25 ms (N = 450 samples) and a shift of 18.75 ms (M = 150 samples) between adjacent frames, so consecutive frames overlap by 300 samples. The overlapping ensures that the resulting LPC spectral estimates are correlated from frame to frame and are quite smooth.

x_q(n) = s(Mq + n)

with n = 0 to N - 1 and q = 0 to L - 1, where L is the number of frames.
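A minimal C sketch of the blocking step, using the N = 450 and M = 150 values above (names are illustrative):

    #include <stdint.h>

    #define FRAME_LEN   450   /* N: samples per frame (56.25 ms at 8 kHz) */
    #define FRAME_SHIFT 150   /* M: shift between adjacent frames (18.75 ms) */

    /* Copy frame q out of the speech buffer: x_q(n) = s(Mq + n).
     * Number of whole frames: L = (num_samples - N) / M + 1. */
    void get_frame(const int16_t *s, int num_samples, int q, int16_t *frame)
    {
        int start = FRAME_SHIFT * q;
        for (int n = 0; n < FRAME_LEN; n++)
            frame[n] = (start + n < num_samples) ? s[start + n] : 0;
    }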

WINDOWING

A Hamming window is applied to each frame to minimize signal discontinuities at the beginning and end of the frame, according to the equation

x_q(n) = x_q(n) · w(n)

where w(n) = 0.54 - 0.46 cos(2πn/(N - 1)).
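A corresponding C sketch that applies the Hamming window to one frame in place:

    #include <math.h>

    #define FRAME_LEN 450

    /* Apply w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to a frame in place. */
    void hamming_window(double *frame)
    {
        const double pi = 3.14159265358979323846;
        for (int n = 0; n < FRAME_LEN; n++)
            frame[n] *= 0.54 - 0.46 * cos(2.0 * pi * n / (FRAME_LEN - 1));
    }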


HIDDEN MARKOV MODEL

The hidden Markov model (HMM) is a powerful statistical tool for modeling generative sequences that can be characterised by an underlying process generating an observable sequence. HMMs have found application in many areas interested in signal processing, in particular speech processing, but have also been applied with success to low-level natural language processing (NLP) tasks such as part-of-speech tagging, phrase chunking, and extracting target information from documents. Andrei Markov gave his name to the mathematical theory of Markov processes in the early twentieth century, but it was Baum and his colleagues who developed the theory of HMMs in the 1960s [2].

HMM TRAINING

An important part of speech-to-text conversion using pattern recognition is training. Training involves creating a pattern representative of the features of a class using one or more test patterns that correspond to speech sounds of the same class. The resulting pattern (generally called a reference pattern) is an exemplar or template, derived from some type of averaging technique. It can also be a model that characterizes the reference pattern statistics. A model commonly used for speech recognition is the HMM, which is a statistical model used for modeling an unknown system using an observed output sequence. The system trains an HMM for each digit in the vocabulary using the Baum-Welch algorithm. The codebook index created during preprocessing is the observation vector for the HMM model.

After preprocessing the input speech samples to extract feature vectors, the system builds the codebook. The codebook is the reference code space that we can use to compare input feature vectors. The weighted cepstrum matrices for various users and digits are compared with the codebook. The nearest corresponding codebook vector indices are sent to the Baum-Welch algorithm for training an HMM model.

The HMM characterizes the system using three matrices:

• A - the state transition probability distribution
• B - the observation symbol probability distribution
• π - the initial state distribution
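In code, the three distributions for one digit can be held in a small C structure, as in the sketch below; the state count is an assumed, typical value, while the symbol count matches the 128-entry codebook used in this design.

    #define N_STATES  5     /* number of HMM states (assumed) */
    #define N_SYMBOLS 128   /* observation symbols = codebook size */

    /* One trained digit model: lambda = (A, B, pi). */
    typedef struct {
        double A[N_STATES][N_STATES];    /* state transition probabilities   */
        double B[N_STATES][N_SYMBOLS];   /* observation symbol probabilities */
        double pi[N_STATES];             /* initial state distribution       */
    } hmm_t;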


Any digit is completely characterized by its corresponding A, B, and π matrices. The A, B, and π matrices are modeled using the Baum-Welch algorithm, which is an iterative procedure (we limit the iterations to 20). The Baum-Welch algorithm gives 3 matrices for each digit corresponding to the 3 users with whom we created the vocabulary set. The A, B, and π matrices are averaged over the users to generalize them for user-independent recognition.

For the design to recognize the same digit uttered by a user for which the design has not been trained, the zero probabilities in the B matrix are replaced with a low value so that it gives a non-zero value on recognition. To some extent, this arrangement overcomes the problem of limited training data. Training is a one-time process. Due to its complexity and resource requirements, it is performed using standalone PC application software that we created by compiling our C program into an executable. For recognition, we compile the same C program but target it to run on the Nios II processor instead. We were able to accomplish this cross-compilation because of the wide support for the C language in the Nios II processor IDE.

The C program running on the PC takes the digit speech samples from a MATLAB output file and performs preprocessing, feature vector extraction, vector quantization, Baum-Welch modeling, and averaging, which outputs the normalized A, B, and π matrices for each digit. The normalized A, B, and π matrices are then embedded in the recognition C program code and stored in the DE2 development board's SDRAM using the Nios II IDE.

    BAUM-WELCH ALGORITHM

The Baum-Welch algorithm is an instance of a general algorithm, the expectation-maximisation (EM) algorithm, which maximises the probability of the observations depending on hidden data. The algorithm requires specifying the number of states n of the learnt model. The algorithm finds a local maximum in the parameter space of n-state HMMs, rather than a global maximum.


Fig 3.3 Flow chart for training (speech sample input for each digit → preprocessing → append weighted cepstrum for each user for digits 0-9 → compress the vector space by vector quantization into a codebook of size 128 → find the index of the nearest codebook vector, in a Euclidean sense, for each frame of the input speech → train the HMM for each digit for each user and average the parameters (A, B, π) over all users → trained models for each digit)


HMM-BASED RECOGNITION

Recognition or pattern classification is the process of comparing the unknown test pattern with each sound class reference pattern and computing a measure of similarity (distance) between the test pattern and each reference pattern. The digit is recognized using a maximum likelihood estimate, obtained with the Viterbi decoding algorithm; the digit whose model has the maximum probability is taken as the spoken digit.

Preprocessing, feature vector extraction, and codebook generation are the same as in HMM training. The input speech sample is preprocessed and the feature vector is extracted. Then, the index of the nearest codebook vector for each frame is sent to all digit models. The model with the maximum probability is chosen as the recognized digit.

After preprocessing in the Nios II processor, the required data is passed to the hardware for Viterbi decoding. Viterbi decoding is computationally intensive, so we implemented it in the FPGA for better execution speed, taking advantage of hardware/software co-design. We wrote the Viterbi decoder in Verilog HDL and included it as a custom instruction in the Nios II processor. Data passes through the dataa and datab ports, and the prefix port is used for control operations. The custom instruction copies or adds two floating-point numbers from dataa and datab, depending on the prefix input. The output (result) is sent back to the Nios II processor for further maximum likelihood estimation.

VITERBI ALGORITHM

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states (called the Viterbi path) that results in a sequence of observed events, especially in the context of Markov information sources and, more generally, hidden Markov models. The forward algorithm is a closely related algorithm for computing the probability of a sequence of observed events. These algorithms belong to the realm of probability theory.

The algorithm makes a number of assumptions:


• First, both the observed events and hidden events must be in a sequence. The sequence is often temporal, i.e. in time order of occurrence.

• Second, these two sequences need to be aligned: an instance of an observed event needs to correspond to exactly one instance of a hidden event.

• Third, computing the most likely hidden sequence (which leads to a particular state) up to a certain point t must depend only on the observed event at point t and the most likely sequence which leads to that state at point t - 1.
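Under these assumptions, the decoder's recurrence can be sketched as a plain-C reference version (the hardware custom instruction accelerates the same floating-point additions and comparisons). Log probabilities are used to avoid underflow, hmm_t is the structure sketched earlier, and MAX_T is an assumed bound on the observation length.

    #include <math.h>

    #define N_STATES  5
    #define N_SYMBOLS 128
    #define MAX_T     256   /* maximum observation length (assumed) */

    typedef struct {
        double A[N_STATES][N_STATES];
        double B[N_STATES][N_SYMBOLS];
        double pi[N_STATES];
    } hmm_t;

    /* Log probability of the single best state path for the observation
     * sequence obs[0..T-1] (codebook indices). Zero entries in B were
     * replaced by a small value during training, so log() stays finite. */
    double viterbi_log_prob(const hmm_t *m, const int *obs, int T)
    {
        static double delta[MAX_T][N_STATES];

        for (int i = 0; i < N_STATES; i++)          /* initialisation, t = 0 */
            delta[0][i] = log(m->pi[i]) + log(m->B[i][obs[0]]);

        for (int t = 1; t < T; t++)                 /* recursion over time */
            for (int j = 0; j < N_STATES; j++) {
                double best = -INFINITY;
                for (int i = 0; i < N_STATES; i++) {
                    double v = delta[t - 1][i] + log(m->A[i][j]);
                    if (v > best)
                        best = v;
                }
                delta[t][j] = best + log(m->B[j][obs[t]]);
            }

        double best = -INFINITY;                    /* termination: best final state */
        for (int i = 0; i < N_STATES; i++)
            if (delta[T - 1][i] > best)
                best = delta[T - 1][i];
        return best;
    }

The recognizer evaluates this score once per digit model and picks the digit whose model scores highest, which is the maximum likelihood decision described above.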

Fig 3.4 Flow chart for HMM recognition (sample speech input to be recognized → preprocessing → find the index of the nearest codebook vector, in a Euclidean sense, for each frame using the codebook input → find the probability of the input being digit k = 1 to M against the trained digit models for all digits and all users → find the model with the maximum probability and the corresponding digit → recognized digit)


TEXT STORAGE

Our speech-to-text conversion system can send the recognized digit to a PC via the serial, USB, or Ethernet interface for backup or archiving. For our testing, we used a serial cable to connect the PC to the RS-232 port on the DE2 board. The Nios II processor on the DE2 board sends the recognized text to the PC; a target program running on the PC receives the text and writes it to the disk.

We wrote the PC program in Visual Basic 6 (VB) using a Microsoft serial port control. The VB program must run in the background for the PC to receive the data and write it to the hard disk. The Windows HyperTerminal software or any other RS-232 serial communication receiver could also be used to receive and view the data. The serial port communication runs at 115,200 bits per second (bps) with 8 data bits, 1 stop bit, and no parity. Handshaking signals are not required. The speech-to-text conversion system can also operate as a standalone network device using an Ethernet interface for PC communication and appropriate speech recognition server software designed for the Nios II processor.
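The original receiver is a VB6 program; as an equivalent illustration, the following is a minimal POSIX C sketch that opens a serial port at 115,200 bps with 8 data bits, 1 stop bit, no parity, and no handshaking, and appends everything received to a file. The device path and output file name are placeholders.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <termios.h>

    int main(void)
    {
        int fd = open("/dev/ttyS0", O_RDONLY | O_NOCTTY);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        struct termios tio;
        tcgetattr(fd, &tio);
        cfmakeraw(&tio);               /* raw bytes, 8 data bits, no parity */
        cfsetispeed(&tio, B115200);    /* 115,200 bps */
        cfsetospeed(&tio, B115200);
        tio.c_cflag &= ~(CSTOPB | CRTSCTS);   /* 1 stop bit, no handshaking */
        tio.c_cflag |= CLOCAL | CREAD;
        tcsetattr(fd, TCSANOW, &tio);

        FILE *out = fopen("recognized.txt", "a");  /* placeholder file name */
        if (!out) { perror("fopen"); return 1; }

        char buf[256];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)  /* copy serial data to file */
            fwrite(buf, 1, (size_t)n, out);

        fclose(out);
        close(fd);
        return 0;
    }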


Fig 3.5 Flow chart for text storage (start → wait until serial port data is received → ask for a filename and open the file → copy the serial buffer to the file until the serial port data is over → close the file → stop)


    3.3 WORKING

The microphone input port with the audio codec receives the signal, amplifies it, converts it into 16-bit PCM digital samples at a sampling rate of 8 kHz, and transmits them to the Nios II processor in the FPGA through the communication interface.

Generally, a speech signal consists of noise-speech-noise. The detection of actual speech in the given samples is important. The speech signal is divided into frames of 450 samples each with an overlap of 300 samples, i.e., two-thirds of a frame length. The speech is separated from the pauses using voice activity detection (VAD) techniques. The received samples are stored in memory on the Altera Development and Education (DE2) development board.

The speech signal consists of the uttered digit along with a pause period and background noise. Preprocessing involves taking the speech samples as input, blocking the samples into frames, and returning a unique pattern for each sample. The system performs speech analysis using the linear predictive coding (LPC) method. From the LPC coefficients we get the weighted cepstral coefficients and cepstral time derivatives, which form the feature vector for a frame. Then, the system performs vector quantization using a vector codebook. The resulting vectors form the observation sequence. For each word in the vocabulary, the system builds a hidden Markov model (HMM) and trains the model during the training phase. The training steps, from VAD to HMM model building, are performed using PC-based C programs. The resulting HMM models are then loaded onto the FPGA for the recognition phase.

In the recognition phase, the speech is acquired dynamically from the microphone through the codec and is stored in the FPGA's memory. These speech samples are preprocessed, and the probability of producing the observation sequence is calculated for each model. The uttered word is recognized based on a maximum likelihood estimate.


    3.4 SOFTWARE AND HARDWARE USED

• Sound recorder
• MATLAB version 6
• C
• Dev-C++ with gcc and gdb
• Quartus II version 5.1
• SOPC Builder version 5.1
• Nios II processor
• Nios II IDE version 5.1
• MegaCore IP library
• Altera Development and Education (DE2) board
• Microphone and headset
• RS-232 cable


    CHAPTER 4

    APPLICATIONS


• Interactive voice response systems (IVRS)
• Voice dialing in mobile phones and telephones
• Hands-free dialing in wireless Bluetooth headsets
• PIN and numeric password entry modules
• Automated teller machines (ATMs)


    CHAPTER 5

    REFERENCES


1. Topic taken from seminartopics.co.in/ece-seminar-topics/

2. Garg, Mohit. "Linear Prediction Algorithms." Indian Institute of Technology, Bombay, India, April 2003.

3. Li, Gongjun, and Taiyi Huang. "An Improved Training Algorithm in HMM-Based Speech Recognition." National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing.

4. Altera Nios II documentation.