COMP323 Foundations of Chinese Computing
COMP323 Lecture 1 2
Course Introduction
Lecturer
  Qin LU [email protected], Room PQ814, Tel. 2766 7247
Teaching Assistant (responsible for some labs and project assignments)
  Chen Yirong [email protected], Room QT416, Tel. 2766 7326
COMP323 Lecture 1 3
Course Introduction
COMP323 Reference Books
  CJKV Information Processing: Chinese, Japanese, Korean and Vietnamese Computing (PL1074.5 .L86)
  An Introduction to Chinese, Japanese and Korean Computing (QA76.H7795)
  計算機中文信息處理 (PL1074.5.C42), and others
Tutorials and labs: PQ604A
  Tuesday Group: 9:30 – 10:30 Tuesdays
  Thursday Group: 9:30 – 10:30 Thursdays
  Try to finish the labs and the online assignment/QA during lab hours
COMP323 Lecture 1 4
Course Introduction
COMP323 Website
  WebCT
  Lecture notes available Wed. by 5pm
  Print as NotePage
Method of Assessment
  Course Work 55%
    2 Programming Assignments 20%
    2 online quizzes 20%
    1 online homework 5%
    4 online QA (labs) 8%
    Class attendance (punctuality) 2%
  Final Examination 45%
COMP323 Lecture 1 5
Course Introduction
Introduction to Chinese Computing
  Chinese computing is the computer processing of data related to Chinese. It covers any human-computer interaction in which communication takes place in the Chinese language.
  About one-fifth of the people in the world speak some form of Chinese as their native language, making it the language with the most native speakers.
COMP323 Lecture 1 6
Course Introduction
Fundamental Problems with Chinese Computing
  At Chinese Character Level
    Large and not Closed Character Set
    Computer Representation, Input and Output
  At Chinese Language Level
    Lack of Morphological Variation
    Lack of Grammar
      Very Arbitrary and Flexible
      Superimposed Grammar
    Texts are Running Together (no spaces between words)
COMP323 Lecture 1 7
Course Introduction Fundamental Problems with Chinese Computing
COMP323 Lecture 1 8
Course Introduction
Fundamental Problems with Chinese Language
  Bi-lingual, Tri-lingual and Multi-lingual Computing
    Question: Is Hong Kong a multi-lingual society?
    How can a system be designed so that it can be used with different languages with minimal changes?
    How can a system be designed so that it can be used for multiple languages?
    Distinguish Chinese and English characters: is the input Chinese text, English text, or Chinese text mixed together with English text?
COMP323 Lecture 1 9
Course Introduction
Fundamental Problems with Chinese Language
  Bi-lingual, Tri-lingual and Multi-lingual Computing
    Example: Count the Number of (Chinese and/or English) Characters or Words
Multilingual Computing
多語言文字處理技術
?
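The counting example above can be sketched in Python. This is a minimal illustration, not the course's own solution. The function names `is_cjk_ideograph` and `count_characters` are hypothetical, and "Chinese character" is approximated here by a few Unicode CJK ideograph ranges, which is a simplifying assumption (other extension blocks exist).

```python
def is_cjk_ideograph(ch: str) -> bool:
    """Rough test (assumption): does ch fall in a common CJK ideograph range?"""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF          # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF       # CJK Unified Ideographs Extension A
        or 0x20000 <= cp <= 0x2A6DF     # CJK Unified Ideographs Extension B
        or 0xF900 <= cp <= 0xFAFF       # CJK Compatibility Ideographs
    )


def count_characters(text: str) -> dict:
    """Count Chinese characters, English letters, and everything else."""
    counts = {"chinese": 0, "english": 0, "other": 0}
    for ch in text:
        if is_cjk_ideograph(ch):
            counts["chinese"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["english"] += 1
        else:
            counts["other"] += 1   # digits, spaces, punctuation, other scripts
    return counts


sample = "Multilingual Computing 多語言文字處理技術"
print(count_characters(sample))
# {'chinese': 9, 'english': 21, 'other': 2}
```

Counting characters only needs a per-character test. Counting words is harder: English words can be split on spaces, while Chinese text runs together and needs word segmentation first.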
COMP323 Lecture 1 10
Tentative Teaching Content
  Characteristics of Chinese Language
    Reading System (Pronunciation)
    Writing System (Look)
  Computer Representation of Chinese Characters
    Character Set Standards (GB, Big5 and Unicode ...)
    Encoding Schemes (ISO and UTF …)
  Chinese Character Input
    Chinese Input Processing by (Pen, Image, Speech and) Key Stroke
    Shape-based Keystroke Input Method
    Phonetic-based Keystroke Input Method
COMP323 Lecture 1 11
Tentative Teaching Content
  Chinese Character Output
    Bitmap and Outline Font Representation
    Compression
    Scaling Problem
  Software Development for Chinese
    Text Processing, such as Character Searching, Editing, and Deletion …
    Software Localization and Internationalization
COMP323 Lecture 1 12
Tentative Teaching Content
  Chinese Language Processing
    Word Segmentation
    Part-of-Speech (POS) Tagging
    Syntactic Analysis (Grammatical Analysis)
  Chinese Information (Document) Retrieval
    Document Retrieval Models
    Language-Related Issues
  Advanced Topics (possibly)
    Information Extraction
    Text Summarization
Lecture 1: Characteristics of Chinese
COMP323 Lecture 1 14
The Chinese Language
General Characteristics
  The official language in China is Mandarin (普通話), but there are many spoken dialects (50+).
  Different Pronunciation across Different Dialects
  Relatively Unified Writing System
  Dialect-specific Characters and Variant Character Writing
    Different words express the same meaning, e.g. 係 and 是 (to be)
    Word order reversal, e.g. 找尋 and 尋找 (look for)
    Examples of dialect-specific (Cantonese) characters: 叻吓吔呃咁咗咩哂哋唔唥唧啱啲喐喥喺嗰嘅嘜嘞嘢
COMP323 Lecture 1 15
The Chinese Language
COMP323 Lecture 1 16
The Chinese Language
Characteristics of Chinese Characters
  Each Chinese character is associated with three features: its look (graphemics), its pronunciation (phonetics), and its meaning (semantics).
    Graphemics (The Look)
    Phonetics (The Sound)
    Semantics (The Meaning)
COMP323 Lecture 1 17
Chinese Writing System
Radicals (部首)
  Chinese characters are composed of smaller units, called radicals.
  214+ radicals are used for indexing Chinese characters.
  The advantage of indexing by radical is that one can look up a character in a dictionary without knowing its pronunciation.
  一丨丶丿乙亅二亠人儿入八冂冖冫几凵刀力勹匕匚匸十卜卩厂厶又口囗土士夂夊夕大女子宀寸小尢尸屮山巛工己巾干幺广廴廾弋弓彐彡彳心戈戶手支攴文斗斤方无日曰月木欠止歹殳毋比毛氏气水火爪父爻爿片牙牛犬玄玉瓜瓦甘生用田疋疒癶白皮皿目矛矢石示禸禾穴立竹米糸缶网羊羽老而耒耳聿肉臣自至臼舌舛舟艮色艸虍虫血行衣襾見角言谷豆豕豸貝赤走足身車辛辰辵邑酉釆里金長門阜隶隹雨靑非面革韋韭音頁風飛食首香馬骨高髟鬥鬯鬲鬼魚鳥鹵鹿麥麻黃黍黑黹黽鼎鼓鼠鼻齊齒龍龜龠
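The 214 traditional (Kangxi) radicals can also be listed programmatically. Unicode encodes them in the Kangxi Radicals block (U+2F00 to U+2FD5), and NFKC normalization maps each radical symbol to the ordinary ideograph of the same shape. The sketch below relies on that. The names `kangxi_radicals` and `radical_number` are hypothetical helpers for illustration only.

```python
import unicodedata

KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5   # Unicode "Kangxi Radicals" block


def kangxi_radicals() -> list:
    """Return the 214 radicals as ordinary ideographs, in radical-number order."""
    return [
        unicodedata.normalize("NFKC", chr(cp))   # e.g. U+2F00 -> 一 (U+4E00)
        for cp in range(KANGXI_FIRST, KANGXI_LAST + 1)
    ]


RADICALS = kangxi_radicals()


def radical_number(radical: str) -> int:
    """1-based position of a radical in the index (raises ValueError if absent)."""
    return RADICALS.index(radical) + 1


print(len(RADICALS))          # 214
print(RADICALS[0])            # 一
print(radical_number("木"))   # 75
```

A radical-indexed dictionary groups characters under these radical numbers, so a reader can find a character from its shape alone.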
COMP323 Lecture 1 18
Chinese Writing System
Radicals
  Remark: Several radicals can stand alone as single and meaningful Chinese characters.

  Radical  Standalone  Examples
  木       Yes         本未术札朽朴朳杀杂机朵权
  火       Yes         炜炬炅炖炒炝炙炘炊炆炕炉
  心       Yes         伈芯志忐吣忘忍态忠念忿忽
  石       Yes         岩矾矿宕砀码研砆砌砂泵砍
COMP323 Lecture 1 19
Chinese Writing System
Strokes (筆劃)
  Radicals in turn are composed of smaller units, called strokes.
  30+ strokes are the most basic elements of a character.
  The 5 basic strokes are “一” (横, a horizontal stroke), “丨” (竖, a vertical stroke), “丶” (点, a dot), “丿” (撇, a stroke curved to the left) and “乙” (折, a bent stroke).
COMP323 Lecture 1 20
Chinese Writing System
Strokes
  Stroke Order (筆順)
    The strokes of each Chinese character are drawn in a defined order.
    Basic principles are: from left to right, top to bottom, outside to inside, horizontal before vertical, left slant before right slant, center before two sides, etc.
    See animations here: http://www.chinawestexchange.com/Chinese/characters.htm
COMP323 Lecture 1 21
Tree Structure of Chinese Characters
Chinese Writing System
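One way to picture the tree structure (character, then components or radicals, then strokes) is a small recursive data type. The following is only a sketch under stated assumptions: the class name `Component` and its fields are hypothetical, and leaves store a stroke count instead of the individual strokes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A node in the character tree: a whole character or one of its parts."""
    glyph: str
    strokes: int = 0                                  # used only by leaf nodes
    parts: List["Component"] = field(default_factory=list)

    def stroke_count(self) -> int:
        """Total strokes: a leaf's own count, or the sum over its parts."""
        if not self.parts:
            return self.strokes
        return sum(part.stroke_count() for part in self.parts)

    def leaves(self) -> List[str]:
        """The bottom-level components, left to right."""
        if not self.parts:
            return [self.glyph]
        result = []
        for part in self.parts:
            result.extend(part.leaves())
        return result


SUN = Component("日", strokes=4)
MOON = Component("月", strokes=4)
TREE = Component("木", strokes=4)

BRIGHT = Component("明", parts=[SUN, MOON])      # 日 + 月
WOODS = Component("林", parts=[TREE, TREE])      # 木 + 木
FOREST = Component("森", parts=[TREE, WOODS])    # 木 over 林

print(BRIGHT.stroke_count())   # 8
print(FOREST.leaves())         # ['木', '木', '木']
```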
COMP323 Lecture 1 22
Chinese Writing System
Character Classifications and Formation
  Type 1: Pictographs (Picture Characters) (象形)
    They look like the things they represent.
    Does this character 月 really look like a moon to you? Centuries ago, it was written like this: [figure: ancient form of 月]
    Other examples are 日 (sun), 山 (mountain), 水 (water), 鸟 (bird), 火 (fire), 木 (tree), 車 (car, cart), and 口 (mouth, opening), etc.
COMP323 Lecture 1 23
Chinese Writing System
Evolution of Chinese Characters
COMP323 Lecture 1 24
Chinese Writing System
Character Classifications and Formation
  Type 2: (Simple) Ideographs (指事 or 表意)
    They represent abstract concepts or ideas, such as numbers and directions, e.g. 一 (one), 二 (two), 三 (three), 中 (center, middle), 上 (above), 下 (below), etc.
COMP323 Lecture 1 25
Chinese Writing System
Character Classifications and Formation
  Type 3: Compound Ideographs (會意)
    Pictographs and ideographs can be combined to form more complex characters, whose meaning usually reflects the combined meanings of the components.
    Examples:
      sun 日 + moon 月 = bright 明
      person 人 + person 人 = agree/follow 从
      sun 日 + tree 木 = east 東 (sun rising above the trees in the east)
      tree 木 + tree 木 = forest 林; + one more tree 木 = full of trees 森
    More interesting animations from the Internet: http://www.language.berkeley.edu/fanjian/compound_ideographs.html
COMP323 Lecture 1 26
Character Classifications and Formation Type 3: Compound Ideographs
Chinese Writing System
COMP323 Lecture 1 27
Character Classifications and Formation Type 3: Compound Ideographs
Chinese Writing System
COMP323 Lecture 1 28
Chinese Writing System
Character Classifications and Formation
  Type 4: Phonetic Ideographs (形聲)
    They usually have at least two components: one indicates the sound and the other indicates the meaning.
    They account for more than 90% of all Chinese characters in use today.
    For example, in the character “跳” (jump, tiào), the left part “足” means “foot”. Characters that contain “足” are related to “foot” in some way. The right part “兆” (zhào) indicates the sound: the two characters share the same vowel.
COMP323 Lecture 1 29
Chinese Writing System
Thought to be the oldest types of characters, pictographs were originally pictures of things. During the past 5,000 years or so they have become simplified and stylised.
Ideographs are graphical representations of abstract ideas.
Compound pictographs and ideographs combine one or more pictographs or ideographs to form new characters. Both component parts contribute to the meaning of the compound character.
COMP323 Lecture 1 30
Chinese Writing System
Semantic-phonetic compounds represent around 90% of all existing characters and consist of two parts: a semantic component or radical which hints at the meaning of the character, and a phonetic component which gives a clue to the pronunciation of the character. Characters containing the same phonetic component may have the same sound and the same tone, the same sound but a different tone, the same initial or final sound, or a different sound and a different tone. Phonetic components are generally a more reliable indication of pronunciation than semantic components are of meaning.
COMP323 Lecture 1 31
Chinese Writing System
Traditional and Simplified Characters
  Over time, frequently used and complex Chinese characters tend to be simplified.
  One simplification method: retain only one part from the traditional character.
  More about Pitfalls and Complexities of Chinese to Chinese Conversion: http://www.cjk.org/cjk/c2c/c2cbasis.htm
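One reason conversion between simplified and traditional text has pitfalls is that the mapping is not one-to-one: a single simplified character can stand for several traditional ones. The sketch below shows this with a tiny hand-made table. The table `S2T` and the helper names are hypothetical and cover only four characters; a real converter needs a full mapping plus word-level context.

```python
# Simplified -> possible traditional forms (tiny illustrative table, not complete).
S2T = {
    "发": ["發", "髮"],        # 發 "to emit/develop" vs 髮 "hair"
    "后": ["後", "后"],        # 後 "after/behind" vs 后 "queen"
    "干": ["幹", "乾", "干"],  # 幹 "to do" vs 乾 "dry" vs 干 "to concern"
    "马": ["馬"],              # unambiguous: "horse"
}


def traditional_candidates(text: str) -> list:
    """For each character, list the traditional forms it may correspond to."""
    return [S2T.get(ch, [ch]) for ch in text]


def is_ambiguous(ch: str) -> bool:
    """True if a simplified character maps to more than one traditional form."""
    return len(S2T.get(ch, [ch])) > 1


print(traditional_candidates("马后"))   # [['馬'], ['後', '后']]
print(is_ambiguous("发"))               # True
```

Choosing the right candidate generally depends on the surrounding word, so character-by-character replacement alone is not reliable.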
COMP323 Lecture 1 32
Chinese Writing System
Chinese Language (Chinese Text)
  Chinese characters are combined with other Chinese characters into words to express more complex ideas and concepts.
  Question: How many Chinese characters are there?
The Chinese writing system is open-ended, meaning that there is no upper limit to the number of characters. The largest Chinese dictionaries include about 56,000 characters, but most of them are archaic, obscure or rare variant forms. Knowledge of about 3,000 characters enables you to read about 99% of the characters in Chinese newspapers and magazines. To read Chinese literature, technical writings or classical Chinese, though, you need to be familiar with about 6,000 characters.
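The coverage figures above (a few thousand characters covering most running text) come from frequency counts over large corpora. As a rough sketch of how such a figure could be computed, the hypothetical function `chars_needed_for_coverage` below counts how many of the most frequent characters are needed to cover a target share of a text. The tiny sample string is made up for illustration and is not real corpus data.

```python
from collections import Counter


def chars_needed_for_coverage(text: str, target: float) -> int:
    """Number of most-frequent distinct characters needed to cover
    at least `target` (0..1) of all non-whitespace characters in text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)


sample = "的的的的一一是"   # toy text: 的 x4, 一 x2, 是 x1
print(chars_needed_for_coverage(sample, 0.5))   # 1
print(chars_needed_for_coverage(sample, 0.8))   # 2
print(chars_needed_for_coverage(sample, 1.0))   # 3
```

Run over a large newspaper corpus, the same calculation would show how quickly coverage rises with the most common characters.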
COMP323 Lecture 1 33
Chinese Reading System
Pronunciation
  The phonetic information is not explicit.
    Sometimes, you can guess the pronunciation from the component characters.
    Sometimes, the pronunciation has no relation to the components at all.
    This makes learning Chinese difficult without a phonetic transcription system.
  Phonetic transcription: notation of pronunciations
    Sufficiency: there are symbols to indicate all sounds in the language.
    Uniqueness: one sound is denoted by only one symbol.
COMP323 Lecture 1 34
Chinese Reading System
Pronunciation
  Pinyin: transcribing Mandarin Chinese
    Consonant (輔音, initial) and Vowel (元音, final)
    In speech, Chinese words are created using just 21 beginning sounds called initials, and 37 ending sounds called finals. Initials and finals combine to create the basic sounds of Chinese.
    For example, consider Beijing:
      bei: b is an initial, and ei is a final
      jing: j is an initial, and ing is a final
    More about Pronunciation: http://www.chinese-outpost.com/language/pronunciation/mandarin-chinese-initials-and-finals-table-1.asp
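The initial/final split shown for Beijing can be sketched as a simple prefix match. This is an illustrative sketch, not a full Pinyin parser. The name `split_syllable` is hypothetical, input is assumed to be a single toneless syllable, and syllables spelled with y or w (or starting with a vowel) are treated as having no initial.

```python
from typing import Tuple

# The 21 Pinyin initials; two-letter initials come first so "zh" wins over "z".
INITIALS = [
    "zh", "ch", "sh",
    "b", "p", "m", "f",
    "d", "t", "n", "l",
    "g", "k", "h",
    "j", "q", "x",
    "r", "z", "c", "s",
]


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split one toneless Pinyin syllable into (initial, final)."""
    s = syllable.lower()
    for initial in INITIALS:
        if s.startswith(initial):
            return initial, s[len(initial):]
    return "", s   # no initial, e.g. "an"


print(split_syllable("bei"))    # ('b', 'ei')
print(split_syllable("jing"))   # ('j', 'ing')
print(split_syllable("zhong"))  # ('zh', 'ong')
```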
COMP323 Lecture 1 35
Pronunciation Pinyin
Chinese Reading System
COMP323 Lecture 1 36
Chinese Reading System
Pronunciation
  Tones of Chinese
    Chinese is a tonal language.
    Mandarin has 4 (5) tones and Cantonese has 6 (9) tones, which makes Cantonese much harder to learn than Mandarin.
COMP323 Lecture 1 37
Chinese Reading System
Pronunciation
  Tones differentiate meanings.
  Everyone seems to know this one: just by saying “ma” in different tones, you can ask, “Did mother scold the horse?”
    妈骂马吗? (mā mà mǎ ma?)
  鞏俐 (Gong Li, with third and fourth tones: gǒng lì) is the name of the star of “Raise the Red Lantern” and other contemporary Chinese films. However, 公里 (gong li, with first and third tones: gōng lǐ) means kilometer.
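In Pinyin, the four Mandarin tones are written with diacritics over a vowel (macron, acute, caron, grave), and the neutral tone is left unmarked. A program can read the tone back from the diacritic. The sketch below does this through Unicode decomposition. The names `tone_of` and `strip_tone` are hypothetical, and numbering the neutral tone as 5 is a convention assumed here.

```python
import unicodedata

# Combining diacritics used as Pinyin tone marks.
TONE_MARKS = {
    "\u0304": 1,   # macron  (first tone)
    "\u0301": 2,   # acute   (second tone)
    "\u030C": 3,   # caron   (third tone)
    "\u0300": 4,   # grave   (fourth tone)
}


def tone_of(syllable: str) -> int:
    """Return the tone number 1-4 of a tone-marked syllable, or 5 if unmarked."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return 5


def strip_tone(syllable: str) -> str:
    """Remove the tone mark but keep other diacritics such as the dots on ü."""
    decomposed = unicodedata.normalize("NFD", syllable)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


print([tone_of(s) for s in "mā mà mǎ ma".split()])   # [1, 4, 3, 5]
print(strip_tone("mǎ"))                              # ma
```

All four syllables in the example reduce to the same toneless “ma”, which is why the tone has to be kept to tell 妈, 骂, 马 and 吗 apart.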