An Experience of the Language Observatory Project
Yoshiki Mikami, Leader, Language Observatory Project
Japan Science & Technology Agency
Workshop on “Recent Experiences on Measuring Languages on the Cyberspace”, UNESCO, Paris, February 22, 2007
2LANGUAGE OBSERVATORY
Outlines
1. Global Digital Divide
2. Language Observatory: How It Functions?
3. Major Findings
3.1 Survey Snapshots, Asia and Africa
3.2 Technical aspect of the Divide
3.3 Social aspect of the Divide
3.4 Several non-linguistic aspects
4. Future Agenda
Regarding Measurement
From Measurement to Empowerment
1. Global Digital Divide: Income, telephony
[Figure: two panels, 1999 and 2004, showing telephone, mobile, and Internet numbers (left axis, million) and population (right axis, million) by per capita GDP band in US$, from 100 to 21,544 and above. Source: ITU Statistics]
The Degree of Inequality: Telephony < Income < Internet
[Figure: Lorenz curves plotting accumulated numbers (GDP, number of fixed lines, number of cellular subscribers, number of internet domains) against accumulated population, both axes from 0% to 100%]
Gini-coefficient: Telephony 0.51 < GDP 0.73 < Internet 0.91
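The Gini coefficients quoted here summarize how far each Lorenz curve departs from the diagonal of perfect equality. A minimal sketch of that calculation follows, assuming per-country figures are available as (population, quantity) pairs. The function name and input layout are illustrative assumptions, not the project's own code or data.

```python
def gini(groups):
    """Gini coefficient from (population, quantity) pairs, one pair per country.

    Countries are ordered by quantity per head, the Lorenz curve is built from
    cumulative shares, and the area under it is taken with the trapezoid rule.
    Populations are assumed to be strictly positive.
    """
    ranked = sorted(groups, key=lambda g: g[1] / g[0])
    total_pop = sum(pop for pop, _ in ranked)
    total_qty = sum(qty for _, qty in ranked)
    area = 0.0
    cum_qty = 0.0
    for pop, qty in ranked:
        prev = cum_qty
        cum_qty += qty / total_qty
        # Trapezoid between the previous and current Lorenz points.
        area += (pop / total_pop) * (prev + cum_qty) / 2
    return 1 - 2 * area
```

A value of 0 means the quantity is spread in proportion to population; values near 1 mean it is concentrated in a small share of the population.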
UNESCO Recommendation
Recommendation concerning the Promotion and Use of Multilingualism and Universal Access to Cyberspace, October 2003
[PREAMBLE] Noting that linguistic diversity in the global information networks and universal access to information in cyberspace are at the core of contemporary debates and can be a determining factor in the development of a knowledge-based society,
Linguistic activities moving onto the Web
Activity | Real world: oral/vocal | Real world: recorded | Cyberspace (Web media)
speak & listen | conversation/chat, telephone, conference | proceedings | chat room, email, SMS, SNS, web forum
listen | songs, radio/TV, movie film | music CD, DVD | audio files, web radio/TV [subtitled]
read | ----- | advertisement, magazines, newspaper, book/textbook | web-ads, online magazine, online news, e-books, etc.
write | ----- | letter, diary, articles | email, SMS, SNS, weblogs, online journals
[Diagram: Internet -> pages -> Crawler [UbiCrawler] -> Language Identifier [LI], supported by Language Resources -> Analysis on Digital Language Divide]
http://gii.nagaokaut.ac.jp/gii/papers.php
<HTML><HEAD>
<TITLE>Language Observatory</TITLE>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
</HEAD>
<BODY>
<A href = "http://www.language-observatory.org"><IMG height=137 alt="logo" src = "LO.files/logo.gif" width=155></A>
<H2>About us</H2>
<P>Astronomical observatory catches the light from stars, likewise.................
Content Analysis
Tag Analysis
2. Language Observatory: How It Functions?
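The two labels on the sample page, tag analysis and content analysis, can be illustrated with Python's standard `html.parser` module. This is only a sketch: the class and function names are assumptions made here, and the project's actual Language Identifier is not described in enough detail on the slide to reproduce it.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects the declared charset (tag analysis) and the page text (content analysis)."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.text_parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "meta":
            if attrs.get("charset"):
                self.charset = attrs["charset"]
            elif (attrs.get("http-equiv") or "").lower() == "content-type":
                # e.g. content="text/html; charset=UTF-8"
                for part in (attrs.get("content") or "").split(";"):
                    key, _, value = part.strip().partition("=")
                    if key.lower() == "charset" and value:
                        self.charset = value

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def scan(html):
    """Return (declared charset or None, text joined by single spaces)."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.charset, " ".join(scanner.text_parts)
```

The declared charset is only a hint. Pages in proprietary font encodings may declare nothing useful, which is why the text itself also has to be examined.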
Unit of Identification = LSE (Language + Script + Encoding)
Language | Script | Encoding
Dari | Arabic | UTF-8
Farsi | Arabic | UTF-8
Hindi | Devanagari | UTF-8
Hindi | Devanagari | Arjun
Hindi | Devanagari | Shusha
Hindi | Devanagari | Shivaji
Azeri | Latin | Latin-1
Azeri | Cyrillic | KOI8-R
Azeri | Arabic | ASMO
Difference of encoding (the Hindi rows)
Difference of script (the Azeri rows)
Difference of language (Dari and Farsi)
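The LSE unit can be modeled as a simple three-field record. The sketch below is illustrative only: the type and helper names are assumptions made here, not part of the project's software.

```python
from typing import NamedTuple


class LSE(NamedTuple):
    """One unit of identification: language, script, and encoding."""
    language: str
    script: str
    encoding: str


def differing_fields(a, b):
    """Names of the LSE components in which two units differ."""
    return [name for name in LSE._fields if getattr(a, name) != getattr(b, name)]
```

For example, `differing_fields(LSE("Dari", "Arabic", "UTF-8"), LSE("Farsi", "Arabic", "UTF-8"))` reports only the language component, while two Hindi units in UTF-8 and Shusha differ only in encoding.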
The First Workshop on the IMLD, 2004
http://portal.unesco.org/ci/en/ev.php-URL_ID=14480&URL_DO=DO_TOPIC&URL_SECTION=201.html
UNESCO reported the launch of the project
Milestones, 2003 to 2007
Oct. 2003 UNESCO Adopted “Cyberspace Recommendation”
Oct. 2003 Project started with the support of Japan Science and Technology Agency (JST)
Feb. 2004 The First Language Observatory Workshop
Jun. 2004 Started to collect web data by “UbiCrawler”
Aug. 2005 The First version of Language Identification Module (LIM)
Nov. 2005 WSIS Tunis meeting inspired the collaboration with ACALAN.
Feb. 2006 The first meeting of the World Network for Linguistic Diversity
Jun. 2006 Workshop at Bamako, Mali on African Survey
Expert Collaboration: Case of African Survey
June 26-28, 2006 at Bamako, Mali (ACALAN)
Mali
Algeria
Burkina Faso
Ethiopia
Kenya
Malawi
Nigeria
Tunisia
CNRS, France
Researchers Network: Over 35 countries
Experts’ contributions are essential for collecting text in local encodings and seed URLs, and for verifying LI results
3.1 Survey Snapshot: Languages on the net, Asia
[Figure: four stacked bar charts of language shares (%Local, %Arabic, %Others, %Russian, %English), 0% to 100%, per country. Panel 1: Cyprus, Turkey, Israel, Lebanon, Jordan, Syria, Palestine, GCC, Iran, Afghanistan. Panel 2: Myanmar, Thailand, Lao, Cambodia, Malaysia, Indonesia, Philippines, Brunei, Vietnam, Singapore. Panel 3: Kazakhstan, Kyrgyzstan, Uzbekistan, Turkmenistan, Tajikistan, Azerbaijan, Mongolia. Panel 4: Pakistan, India, Sri Lanka, Maldives, Bhutan, Nepal, Bangladesh. As of June 2006]
3.1 Survey Snapshot (cont.): Languages on the net, Africa
[Figure: stacked bar chart of language shares (English, French, Arabic, Other Languages, African Languages), 0% to 100%, for All African domains, Commonwealth, Francophonie, and League of Arab States. As of October 2006]
3.2 Technical Aspect: Localization Problem
“Language Localization” has been the key obstacle to the use of new information technologies since the age of movable type printing.
"Before I end this letter I wish to bring before Your Paternity's mind the fact that for many years I very strongly desired to see in this Province some books printed in the language and alphabet of the land, as there are in Malabar with great benefit for that Christian community. And this could not be achieved for two reasons; the first because it looked impossible to cast so many moulds amounting to six hundred, whilst as our twenty-four in Europe."
source: Priolkar, The Printing Press in India, Bombay, 1958
Doctrina Christam in Tamil, 1578
A Jesuit Friar’s letter, 1608: Six hundred versus 24
“Doctrina Christiana”, bilingual version, printed in Tagalog in Tagalog script / in Tagalog in Latin script / in Spanish in Latin script.
Philippines postal stamp issued in 1995
Doctrina in Tagalog, 1593: The script was finally lost
Encoding Chaos leads to delay of localization
Language | Standard encoding and its share | Examples of other encodings found [footnote]
Turkish | ISO 8859 (99.5%) |
Hebrew | ISO 8859 (87.7%) |
Vietnamese | UTF-8 (96.4%) | TCVN, VIQR, VPS
Thai | TIS 620 (97.3%) |
Mongolian | UTF-8 (95.5%) | Latin-Cyrillic
Sinhala | UTF-8 (44.5%) | Metta, Kaputa, etc.
Telugu | UTF-8 (16.6%) | Shree, TLH, etc.
Tamil | UTF-8 (14.9%) | Amudham, Kumudam, Shree, Vikatan, etc.
Burmese | UTF-8 (0.7%) | WinResearcher, etc.
Note: Local proprietary encodings are shown in this table by the names of fonts (font families). As of June 2006.
As of June 2006
Script / Region
Latin Cyrillic Arabic hanzi Indic Others
Europe Major European languages (17)
Russian --- --- --- Greek
Asia / Africa
Indonesian
African languages, Tagalog, etc.
---
Bulgarian, Ukrainian, Belarusian, Central Asian
Arabic
Farsi, Urdu, Pashtu, etc.
Chinese / Japanese / Korean
---
Indic, Thai, Lao, Khmer, Myanmar, Tibetan, etc.
Hebrew
Ethiopic, Georgian, Armenian, Divehi
Unavailability of search engines: another problem
Technical Aspect of the Digital Language Divide
[Diagram relating the following actors, factors, and problems]
Actors: global IT firms, int’l standard bodies, gov., local IT firms, users, local media
Factors: less attention from IT vendors; various localization by overseas communities; differentiation strategy to enclose customers; lack of leadership in standardization; difficulty in access to standardization process; lack of standard in typewriter keyboard
Problems: encoding chaos, delay in localization; non-availability of search engines (SEs)
3.3 Social Aspect: languages in multilingual society
Personal domain: Conversation, mail, phone, blog, magazines, newspaper, novel, songs, etc.
Public domain: Official documents, laws and regulations, traffic signs, contract, legal, etc.
Occupational domain: Business letter, invoice, manual, contract, name card, packaging, etc.
Educational domain: Textbook, academic journal, dictionary, scientific communication, etc.
Based on EU’s “Common European Framework of Reference for Languages” (2004)
Language plays a different role in multilingual society
[Diagram: language types set against socio-economic domains, with each domain identified by secondary level domain]
Secondary level domain and socio-economic domain:
ac.xx: educational
com.xx: occupational
gov.xx: public
others: personal
Language types: global languages, official language(s), regional languages, minority language(s)
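The mapping from secondary level domain to socio-economic domain can be written as a small lookup. This is a sketch under stated assumptions: the function name is hypothetical, and treating `co` the same as `com` is an assumption based on the `co` labels that appear in the per-country charts.

```python
SLD_TO_DOMAIN = {
    "ac": "educational",
    "com": "occupational",
    "co": "occupational",  # assumption: "co" is treated like "com"
    "gov": "public",
}


def socio_economic_domain(host):
    """Classify a host name under a two-letter country-code TLD.

    Anything without a recognized secondary level domain falls into
    the "others" bucket, which the slide labels as personal.
    """
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and len(labels[-1]) == 2:
        return SLD_TO_DOMAIN.get(labels[-2], "personal")
    return "personal"
```

Counting identified languages per returned category would then give the kind of per-domain breakdown used in a secondary domain analysis.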
Specialization of Language: Secondary domain analysis
[Figure: language shares by secondary level domain (ac, co/com, gov, others) for four countries. Turkey: English, Turkish, Tatar, Others. Kazakhstan: English, Russian, Kazakh, Others. Iran: English, Arabic, Farsi, Others. Cyprus: English, Greek, Turkish, Others]
Social Aspect of the Digital Language Divide
[Diagram relating the following actors and factors]
Actors: gov., overseas community, global IT firms, media/press, e-business, local business, local media, higher education, primary/secondary education, users
Factors: non-availability of SEs; absence of mother language; low literacy; restricted social activities
3.4 Non-linguistic Aspects: a. Network and Server
As of December 2005
80% of servers under African domains are located outside of the country. 60% of servers in Asian domains are also “offshore”.
[Map legend]
• ○ rw: Rwanda
• △ ml: Mali
• □ mz: Mozambique
• White: servers installed in the country
• Colored: servers installed overseas
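A share such as the 80% figure is a simple ratio once each server's hosting country is known. The sketch below assumes that information is already available as (ccTLD, hosting country) pairs. How the hosting country is determined is not covered on the slide and is out of scope here, and the function name is hypothetical.

```python
def offshore_share(servers):
    """Fraction of servers hosted outside the country of their ccTLD.

    `servers` is an iterable of (cctld, hosting_country) pairs, both given as
    two-letter codes. Assumes the ccTLD equals the country code, which holds
    for most but not all country domains.
    """
    servers = list(servers)
    if not servers:
        return 0.0
    offshore = sum(1 for cctld, country in servers if cctld.lower() != country.lower())
    return offshore / len(servers)
```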
Complaint against access: A letter from Namibia
I am the web master of the XXXXXXX Database. We are being severely hit by your Language Observatory’s web crawler - already 37000 page hits this month. In December 2005 you hit us 34000 times. We are on limited bandwidth, and this puts unacceptable strain on our server. I notice that you consider one HTTP request every 5 seconds 'polite' and 'modest'. This may be true in Japan, but not in Africa - our connections are very slow and very narrow. I would appreciate it if you could prevent your crawlers from visiting our URL again. In return, I will be happy to provide you directly with whatever statistics about our site you need for your research.
Sincerely
We carefully control data collection speed using a set of parameters, such as revisiting interval, depth, maximum pages per server, and a list of prohibited URLs.
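The parameters named here can be gathered into a single policy object that the crawler consults before each request. This is a minimal sketch, not the UbiCrawler configuration: the class, field names, and default values are assumptions, apart from the 5-second request interval mentioned in the letter.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlPolicy:
    request_interval_s: float = 5.0    # pause between requests to one server (from the letter)
    revisit_interval_days: int = 30    # assumed default
    max_depth: int = 8                 # assumed default
    max_pages_per_server: int = 1000   # assumed default
    prohibited_prefixes: list = field(default_factory=list)

    def may_fetch(self, url, depth, pages_fetched_from_server):
        """True if fetching `url` stays within the depth, volume, and URL limits."""
        if depth > self.max_depth:
            return False
        if pages_fetched_from_server >= self.max_pages_per_server:
            return False
        return not any(url.startswith(prefix) for prefix in self.prohibited_prefixes)
```

A site that asks not to be visited, as in the letter, would be handled by adding its URL prefix to `prohibited_prefixes`. Servers on narrow links could be given a larger `request_interval_s` or a smaller `max_pages_per_server`.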
b. Domain Governance
[Figure: scatter plot of pages (1.E+00 to 1.E+08) against population in thousands (1.E+00 to 1.E+06), logarithmic scales, one point per African country-code domain. Points labeled by name: DJ: Djibouti, ZA: South Africa, SH: St. Helena, ST: Sao Tome & Principe, SC: Seychelles, AC: Ascension, IO: Indian Ocean Territory. As of December 2005]
Management of small islands’ domains is often re-delegated to overseas web-hosting operators, who tend to admit spam, porn, etc.
c. Access regulations by the government
[Figure: scatter plot of Press Linkage Ratio (0 to 2.5) against TV receivers per population (0.00 to 0.70), one point per country-code domain: ae, af, az, bd, bh, bn, bt, cy, id, il, in, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, pg, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tr, uz, vn, ye]
Countries where only state-controlled TV stations are available show a higher percentage of links going to global news sites abroad.
4. Future Agenda
Regarding Measurement
• Improvement of accuracy and coverage
• Multi-stakeholder Collaboration
• Global Observatories Network
From Measurement to Empowerment
• Goals/Targets/Indicators system which helps and guides stakeholders in empowering languages
World Network for Linguistic Diversity
“Language Empowerment”: Mother language for creation
[Diagram relating the following actors and elements]
Actors: language community, gov, OSS developers, IT firms, users, media/press, higher education
Elements: language portal; mother language for creation; rich mother-language content; creation of local contents; literacy; local language search engines; electronic delivery of public services; localization of application SW based on standard; mother language use in higher education; mother-language information processing technologies (OCR, TTS, translation); promotion of NLP (OCR, TTS, MT, e-dictionary, etc.)
Millennium Development Goals: Structure
Jehan Rictus Square, Paris (photo: courtesy of Wunna Ko Ko, June 2005)
Thanks for your attention