My lightblue
Course Project Presentation
COMP4431 Artificial Intelligence

Department of Computing, The Hong Kong Polytechnic University

MA Mingyu Derek
derek.ma@connect.polyu.hk, BSc (Hons) Computing, 14110562D

derek.ma
December 1, 2017

Contents

• Preparation
• Training Strategies and Observations
• Reflections

Preparation

Training Data is the Heart

Training data lays the solid foundation for smallblue to grow up.

Challenges for Data

• What kind of data is needed?

• Where can I get the proper data?

• How to pre-process the data?

• What kind of topics should be trained?

What kind of data is needed?
Pre-training and literature review

1. Full-structure sentences with consistent grammar: decreases the complexity of training samples
2. Sentences that are not too long: guess that the chatbot is a generation model
3. Wider coverage of vocabulary: handles more topics
4. Commutative Q&A interactions: the chatbot links neighbouring sentences

Where can I get the proper data?
Common datasets in academia

Dataset                 Sentence Structure and   Not Too Long   Wide Coverage    Commutative Q&A
                        Consistent Grammar       Sentences      of Vocabulary    Interactions
NUS SMS Corpus          No                       Yes            Yes              Yes
Cornell Movie Dialogs   Yes                      No             Yes              Yes
Cornell Court Dialogs   Yes                      No             Yes              No
UCSB Spoken English     No                       No             Yes              Yes
Eslfast                 Yes                      Yes            Yes              Yes

How to pre-process the data?
Clean and polite data ensure controlled ethics

• Processes
  • Remove out-of-vocabulary words
  • Shorten sentence length
  • Enhance commutative elements

• Methods
  • Preliminary check by a Python program
  • Manual double check
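The preliminary Python check itself is not shown in the slides. A minimal sketch of what such a check might look like follows, assuming a hypothetical vocabulary set `VOCAB` and a hypothetical limit `MAX_WORDS`; both values are illustrative and not taken from the project. Sentences that come back with issues would then go to the manual double check.

```python
import re

# Hypothetical values for illustration; the slides do not state the real ones.
MAX_WORDS = 12
VOCAB = {"hello", "how", "are", "you", "i", "am", "fine", "thanks"}


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def check_sentence(sentence, vocab=VOCAB, max_words=MAX_WORDS):
    """Return a list of issues that need a manual double check."""
    tokens = tokenize(sentence)
    issues = []
    # Flag every token that is not in the known vocabulary.
    oov = [t for t in tokens if t not in vocab]
    if oov:
        issues.append("out-of-vocabulary: " + ", ".join(oov))
    # Flag sentences that exceed the length limit.
    if len(tokens) > max_words:
        issues.append("too long: %d words" % len(tokens))
    return issues


if __name__ == "__main__":
    for line in ["Hello, how are you?", "I am fine, thanks, buddy!"]:
        print(line, "->", check_sentence(line) or "ok")
```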

What kind of topics should be trained?
Topic selection by vocabulary analysis

Word2vec (Mikolov et al., 2013)
• Learns word vectors from context
• Computationally efficient predictive model

Find the semantic meaning of the vocabulary and check relationships

Processes
• Train relationships on a dataset of the 50,000 most common words for 50,000 iterations
• Test the model using our vocabulary
• Plot the relationships by semantic meaning
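Word2vec learns vectors by predicting words from their context. A minimal sketch of the data preparation step for its skip-gram variant is given here, assuming a window of one word on each side and an `UNK` id for words outside the kept vocabulary. The function names and the tiny vocabulary size are illustrative only; the slides do not say which variant or tooling the project used.

```python
from collections import Counter


def build_vocab(tokens, vocab_size):
    """Map the (vocab_size - 1) most common words to ids; id 0 is UNK."""
    counts = Counter(tokens).most_common(vocab_size - 1)
    word_to_id = {"UNK": 0}
    for word, _ in counts:
        word_to_id[word] = len(word_to_id)
    return word_to_id


def skipgram_pairs(tokens, word_to_id, window=1):
    """Yield (center_id, context_id) pairs within the given window."""
    # Words outside the vocabulary fall back to the UNK id 0.
    ids = [word_to_id.get(t, 0) for t in tokens]
    for i, center in enumerate(ids):
        lo = max(0, i - window)
        hi = min(len(ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, ids[j]


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    vocab = build_vocab(words, vocab_size=3)
    print(vocab)
    print(list(skipgram_pairs(words, vocab)))
```

A model trained on such pairs places words that share contexts close together, which is what makes plotting the vocabulary by semantic meaning possible.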

No significant clusters

11 common daily topics
• Each topic may have multiple dialogs
• Each dialog has three mutation versions
• Different versions are slightly different in language

Training Strategies and Observations

47 dialogs
270+ unique sentences
8219 sentences

Simulated Annealing and Training Flow

[Figure: training flow over topic 1 to topic 6]

RNN/LSTM Structure
• Common approach for chatbots
• Still not good at memory

Simulated Annealing
• Proper repeating can “cool down” the high-entropy new incoming conversations and let the chatbot settle its structure and knowledge.
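For reference, standard simulated annealing accepts a worse state with probability exp(-ΔE / T) and lowers the temperature T step by step so that the system gradually settles. The sketch below shows only this generic rule with a geometric cooling schedule. The function names and parameter values are assumptions, and it is not the chatbot's actual training mechanism, which the slides appear to invoke as an analogy.

```python
import math


def acceptance_probability(delta_e, temperature):
    """Metropolis rule: always accept improvements, sometimes accept worse states."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)


def geometric_cooling(t_start, alpha, steps):
    """Return the temperatures t_start * alpha**k for k = 0 .. steps - 1."""
    return [t_start * alpha ** k for k in range(steps)]


if __name__ == "__main__":
    # The same worsening move becomes less likely to be accepted as T drops.
    for t in geometric_cooling(t_start=10.0, alpha=0.5, steps=4):
        print(t, acceptance_probability(1.0, t))
```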

Repeating and Its Effect

[Figure: topic 1 to topic 6]

Repeating Ratio: 15

Sequence matters
• RNN/LSTM can reflect the sequence of input
• <1,2,3,…,1,2,3> takes shorter rumination time than <1,2,3,…,3,2,1>
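One way to picture the two orderings compared above is as input schedules over topic ids. The sketch below is an assumption about how such schedules could be generated. The helper name `build_schedule` and the reading of a repeat as one more pass over all topics are illustrative and not confirmed by the slides.

```python
def build_schedule(topics, passes, reverse_on_repeat=False):
    """Concatenate `passes` passes over the topics.

    With reverse_on_repeat=True every second pass runs backwards,
    giving <1,2,3,...,3,2,1> instead of <1,2,3,...,1,2,3>.
    """
    schedule = []
    for p in range(passes):
        order = list(topics)
        if reverse_on_repeat and p % 2 == 1:
            order.reverse()
        schedule.extend(order)
    return schedule


if __name__ == "__main__":
    print(build_schedule([1, 2, 3], passes=2))
    print(build_schedule([1, 2, 3], passes=2, reverse_on_repeat=True))
```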

Significant Rumination Effect and Long Rumination Time

After several repeats, a rumination can significantly improve the performance.

Long rumination time
• 1h for 1000 inputs

Possible explanation
• The time for rumination is when the data is still in a high-entropy state

Training Tools

An automatic Chrome extension
JavaScript + Chrome
• Input data
• Modify response
• Like modified response
• Open new session for next topic

Reflections

• Utilize “Ruminate” operations: faster training and better results
• Avoid a “big data” strategy: history is crowded and hard to ruminate

Thanks!

My lightblue: Course Project Presentation

derek.ma

MA Mingyu Derek