Data Structure Project 2 Calculating Word Frequency in a Document



TA's website & reminders

http://mpc.cs.nctu.edu.tw/forum/
Quiz this Thursday, 11/6 (Thu); "5. Threaded Binary Tree" will not be on it.
Midterm exam: 11/15 (Sat) 10:10~12:00!

About Project One…

About the extra-line problem: the >> version
    ifstream input(argv[1]);
    while (!input.eof() && input.peek() > 0) {
        input >> buf;
        cout << buf;
        input >> buf;
        input.get();  /* consume the '\n' character */
        cout << " " << buf << endl;
    }

About Project One…

getline version
    ifstream input(argv[1]);
    while (!input.eof()) {
        input.getline(buf, 500);
        if (input.gcount() > 0)  /* check whether anything was actually read */
            cout << buf << endl;
    }

Another one
    ifstream input(argv[1]);
    while (input.getline(buf, 500)) {
        cout << buf << endl;
    }
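As a hedged alternative to the fixed 500-byte buffer, the same loop shape can be written with std::getline and std::string. The function name echo_lines and the use of generic streams are assumptions made for illustration; they are not part of the TA's code.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the slides): copy a stream line by line.
// std::getline returns the stream, which tests false once no further line
// could be read, so the body never runs after a failed read and no extra
// line is printed at end of file.
std::size_t echo_lines(std::istream& input, std::ostream& output) {
    std::size_t count = 0;
    std::string line;  // grows as needed, so there is no buffer size to get wrong
    while (std::getline(input, line)) {
        output << line << '\n';
        ++count;
    }
    return count;  // number of lines copied
}
```

With a file, this would be called as echo_lines(input, cout) on an ifstream opened from argv[1].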

About Project One…

About the ^@ problem
  ◦ If ^@ shows up during the demo, you have written '\0' (that is, the value 0) into the output file.
  ◦ From now on, a program that produces this extra character in the demo will not pass; it will be counted as an error.

How to fix?
  ◦ The most common cause is writing a buffer/string to the file without computing its length correctly.
    int i; FILE* fw; const char *a = "123";
    fw = fopen(argv[1], "w");
    /* this does NOT output ^@ */
    for (i = 0; i < 3; i++) fprintf(fw, "%c", a[i]);
    /* this DOES output ^@ */
    for (i = 0; i < 4; i++) fprintf(fw, "%c", a[i]);
    fclose(fw);

Contents of a: '1' '2' '3' '\0'
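One way to avoid hard-coding the length is to let strlen compute it, since strlen stops before the terminating '\0'. The sketch below assumes a hypothetical helper named write_text and writes to a generic output stream so that it can be checked in memory; the slide's own code uses FILE* instead.

```cpp
#include <cassert>
#include <cstring>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the slides): write a C string without its
// terminating '\0'. std::strlen("123") is 3, whereas sizeof("123") is 4
// because sizeof also counts the terminator.
void write_text(std::ostream& out, const char* text) {
    out.write(text, static_cast<std::streamsize>(std::strlen(text)));
}
```

With a FILE*, fputs(a, fw) or fwrite(a, 1, strlen(a), fw) has the same effect of stopping before the terminator.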

About Project One

To make up the Project 1 demo, first upload your code to ftp://mpc.cs.nctu.edu.tw, in a directory named after your own student ID.

Grades for the first demo: http://www.cs.nctu.edu.tw/~hhyou/ds.php

Project Two

Input: a text file and a stop-word list
  ◦ Use argc and argv
  ◦ ./a.out stopword textfile

Output: pairs of each word and its number of occurrences
  ◦ To stdout (the screen)

Project Two

Text file (without stop words)
    Hello, I’m Billy, not bi|ly or 6illy or b.

Output
  ◦ Hello,:1
  ◦ I’m:1
  ◦ Billy,:1
  ◦ not:1
  ◦ bi|ly:1
  ◦ or:2
  ◦ 6illy:1
  ◦ b.:1

Project Two

Text file (same as the previous example)

Stop-word list
  ◦ and
  ◦ not
  ◦ or

Output
  ◦ Hello,:1
  ◦ I’m:1
  ◦ Billy,:1
  ◦ bi|ly:1
  ◦ 6illy:1
  ◦ b.:1

Project Two

Text file
  ◦ a b c d e f g h i j a b c d e

Stop-word list
  ◦ a b c d

Output
  ◦ e:2 ; f:1 ; g:1 ; h:1 ; i:1 ; j:1

Project Two

Input
  ◦ Text file
      Words are separated by ' ', '\t', or '\n'.
      Case sensitive: "Do" and "do" are different words.
      There are at most 2000 characters in one line.
      There will be no Chinese input.
      A text file may contain more than one line.
      There may be consecutive '\t', ' ', or '\n' characters.
      Program execution time is limited.
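The separator rules above can be sketched as a small splitting routine. The name split_words is hypothetical, and the sketch treats exactly the three listed characters as separators, so runs of consecutive separators produce no empty words and the comparison stays case-sensitive.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from the slides): split one line into words
// separated by ' ', '\t', or '\n'. Consecutive separators are skipped.
std::vector<std::string> split_words(const std::string& line) {
    const std::string delims = " \t\n";
    std::vector<std::string> words;
    std::string::size_type start = line.find_first_not_of(delims);
    while (start != std::string::npos) {
        std::string::size_type end = line.find_first_of(delims, start);
        if (end == std::string::npos) {
            end = line.size();  // last word runs to the end of the line
        }
        words.push_back(line.substr(start, end - start));
        start = line.find_first_not_of(delims, end);
    }
    return words;
}
```

Punctuation is left attached to the word, which matches the earlier example where "Hello," and "b." are counted as they appear.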

Project Two

Input
  ◦ Stop-word list
      One word per line.
      No space or '\t' within a line.
      No more than 2000 characters per line.

Correct
  ◦ Haha
  ◦ Hehe
  ◦ kerker

Incorrect
  ◦ 囧oo
  ◦ A b

Project Two

Word occurrence
  ◦ String + ' ' + number + '\n'
    A 3
    B 5

The order of the strings does not matter.
    B 5
    A 3

Project Two

You can use any data structure to store the (word, occurrence) pairs, such as an array (but watch out for the large test case).
  ◦ For example, one array for the strings and another for the occurrence counts.

Your data structure must be fast at insertion and search.
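As one possible structure that is faster than a plain array at both insertion and search, here is a minimal sketch of a binary search tree keyed by the word. The names WordNode and WordTree are hypothetical, and the tree is deliberately left unbalanced to keep the sketch short; this is an illustration, not the required solution.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical node type: one word, its count, and two children.
struct WordNode {
    std::string word;
    int count;
    std::unique_ptr<WordNode> left;
    std::unique_ptr<WordNode> right;
    explicit WordNode(const std::string& w) : word(w), count(1) {}
};

// Unbalanced binary search tree ordered by case-sensitive string comparison.
class WordTree {
public:
    // Add one occurrence of `word`, inserting it with count 1 if absent.
    void add(const std::string& word) {
        std::unique_ptr<WordNode>* slot = &root_;
        while (*slot) {
            if (word == (*slot)->word) {
                ++(*slot)->count;
                return;
            }
            slot = (word < (*slot)->word) ? &(*slot)->left : &(*slot)->right;
        }
        slot->reset(new WordNode(word));
    }

    // Return the stored count, or 0 if the word was never added.
    int count(const std::string& word) const {
        const WordNode* node = root_.get();
        while (node) {
            if (word == node->word) return node->count;
            node = (word < node->word) ? node->left.get() : node->right.get();
        }
        return 0;
    }

private:
    std::unique_ptr<WordNode> root_;
};
```

On average each operation visits O(log n) nodes, but sorted input degrades an unbalanced tree to a linked list, so a balanced tree or a hash table may be a better fit for the large test case.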

Project Two

We'll use a program to judge your homework.
  ◦ Please be careful with the I/O format.

You may not read the whole file at once.
  ◦ Read at most one line at a time.

We'll release some test data.
Due: 11/21
Your bonus will depend on the efficiency of your program.

Project Two

Large case
  ◦ A lot of different words (more than 1000000)
  ◦ A lot of words in the text file
  ◦ 30%
  ◦ One of them will be released

10% per test case
We will release 2 normal test cases and 1 large test case for testing.

Project Two

A simple algorithm
Assume STOPWORD has N words and TEXTFILE has M words.
We build SW_LIST to store the stop words and TXT_LIST to store the words of the text file.

Project Two (Brute Force)

    Read in STOPWORD, store it as SW_LIST              O(N)
    foreach ( word read from TEXTFILE ) {              O(M) iterations
        if ( the word is in SW_LIST )                  O(N) per lookup
            continue to read another word
        else if ( the word is in TXT_LIST )            O(M) per lookup
            add 1 to the count of the word
        else
            insert the word into TXT_LIST with count 1
    }
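The brute-force outline can be sketched with the two lists swapped for hash containers, so that each membership test takes expected constant time instead of a linear scan. The function names count_words and print_counts are hypothetical, and using std::unordered_set and std::unordered_map is an assumption of this sketch; the course may expect a hand-written structure instead.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>

typedef std::unordered_map<std::string, long> WordCounts;

// Hypothetical function: same control flow as the brute-force outline, but
// SW_LIST and TXT_LIST are hash containers. Reads one line at a time.
WordCounts count_words(std::istream& stopwords, std::istream& text) {
    std::unordered_set<std::string> sw_list;
    std::string line;
    while (std::getline(stopwords, line)) {  // one stop word per line
        if (!line.empty()) sw_list.insert(line);
    }

    WordCounts txt_list;
    const std::string delims = " \t\n";
    while (std::getline(text, line)) {
        std::string::size_type start = line.find_first_not_of(delims);
        while (start != std::string::npos) {
            std::string::size_type end = line.find_first_of(delims, start);
            if (end == std::string::npos) end = line.size();
            const std::string word = line.substr(start, end - start);
            if (sw_list.count(word) == 0) {
                ++txt_list[word];  // inserts with 0 first if the word is new
            }
            start = line.find_first_not_of(delims, end);
        }
    }
    return txt_list;
}

// Print each pair as string + ' ' + number + '\n'; order is unspecified.
void print_counts(const WordCounts& counts, std::ostream& out) {
    for (WordCounts::const_iterator it = counts.begin(); it != counts.end(); ++it) {
        out << it->first << ' ' << it->second << '\n';
    }
}
```

Ignoring word length, the expected cost drops from roughly O(N + M(N + M)) for the two linear scans to O(N + M), at the price of relying on the standard library's hashing.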

Project Two

Faster programs on this assignment will get a bonus.
Everyone's programs will be run on a certain mystery workstation to see whose is fast and whose is slow.
If you have concerns about the fairness of the bonus, please raise them before class on 11/6 (Thu).

Project Two – How to hand in

First create a folder named after your student ID on ftp://mpc.cs.nctu.edu.tw.

Upload C/C++ source files that compile and run to ftp://mpc.cs.nctu.edu.tw.

Q & A

Any questions?