An Experience of the Language Observatory Project Yoshiki Mikami Leader, Language Observatory Project Japan Science & Technology Agency Workshop on “Recent Experiences on Measuring Languages on the Cyberspace” UNESCO, Paris, February 22, 2007


Page 1: An Experience of the Language Observatory Project

Yoshiki Mikami
Leader, Language Observatory Project
Japan Science & Technology Agency

Workshop on “Recent Experiences on Measuring Languages on the Cyberspace”
UNESCO, Paris, February 22, 2007

Page 2: Outline

1. Global Digital Divide
2. Language Observatory: How Does It Function?
3. Major Findings
3.1 Survey Snapshots, Asia and Africa
3.2 Technical aspect of the Divide
3.3 Social aspect of the Divide
3.4 Several non-linguistic aspects
4. Future Agenda
• Regarding Measurement
• From Measurement to Empowerment

Page 3: 1. Global Digital Divide: Income, telephony

[Figure: two charts by per capita GDP bracket (US$, thresholds from 100 to 21,544). Legend: Population (right axis, million), Telephone, Mobile, Internet (million). Years shown: 1999 and 2004.]

Source: ITU Statistics

Page 4: The Degree of Inequality: Telephony < Income < Internet

[Figure: cumulative-share curves. X-axis: accumulated population (0% to 100%). Y-axis: accumulated numbers (0% to 100%). Series: GDP, number of fixed lines, number of cellular subscribers, number of Internet domains.]

Gini coefficient: Telephony 0.51 < GDP 0.73 < Internet 0.91
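The Gini coefficients quoted here summarize how far each cumulative-share curve departs from perfect equality. As a minimal sketch of that calculation, assuming grouped data (for example, one entry per country), the coefficient can be obtained with the trapezoid rule. The helper name `gini` and the sample figures are illustrative assumptions, not ITU data or the project's own code.

```python
def gini(populations, quantities):
    """Gini coefficient from grouped data, using the trapezoid rule.

    populations[i] and quantities[i] describe group i (for example, one country).
    Populations are assumed to be positive.
    """
    if len(populations) != len(quantities) or not populations:
        raise ValueError("inputs must be non-empty and of equal length")
    # Order groups by quantity per head, lowest first
    groups = sorted(zip(populations, quantities), key=lambda g: g[1] / g[0])
    total_pop = sum(p for p, _ in groups)
    total_qty = sum(q for _, q in groups)
    area = 0.0     # area under the cumulative-share curve
    cum_qty = 0.0  # accumulated share of the quantity so far
    for pop, qty in groups:
        prev = cum_qty
        cum_qty += qty / total_qty
        area += (pop / total_pop) * (prev + cum_qty) / 2.0
    return 1.0 - 2.0 * area


# Illustrative numbers only: two equally sized groups, one holding everything
print(gini([1, 1], [0, 10]))
```

A value of 0 means every group holds the same amount per head; values near 1, such as the 0.91 reported for Internet domains, mean that almost everything is held by a small share of the population.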

Page 5: UNESCO Recommendation

Recommendation concerning the Promotion and Use of Multilingualism and Universal Access to Cyberspace, October 2003

[PREAMBLE] Noting that linguistic diversity in the global information networks and universal access to information in cyberspace are at the core of contemporary debates and can be a determining factor in the development of a knowledge-based society,

Page 6: Linguistic activities moving onto the Web

Activity | Real world: oral/vocal | Real world: recorded | Cyberspace (Web media)
speak & listen | conversation/chat, telephone, conference | proceedings | chat room, email, SMS, SNS, web forum
listen | songs, radio/TV, movie film | music CD, DVD | audio files, web radio/TV [subtitled]
read | --- | advertisement, magazines, newspaper, book/textbook | web-ads, online magazine, online news, e-books, etc.
write | --- | letter, diary, articles | email, SMS, SNS, weblogs, online journals

Page 7: 2. Language Observatory: How Does It Function?

[Diagram components: Internet, pages, Crawler [UbiCrawler], Language Identifier [LI], Language Resources, Analysis on Digital Language Divide]

http://gii.nagaokaut.ac.jp/gii/papers.php

<HTML><HEAD>
<TITLE>Language Observatory</TITLE>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
</HEAD>
<BODY>
<A href="http://www.language-observatory.org"><IMG height=137 alt="logo" src="LO.files/logo.gif" width=155></A>
<H2>About us</H2>
<P>Astronomical observatory catches the light from stars, likewise.................

Content Analysis

Tag Analysis
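The “Tag Analysis” step can be illustrated with the META tag from the sample page on this slide. What follows is only a sketch of one plausible approach, not the project's Language Identifier: `declared_charset` is a hypothetical helper that reads the charset a page declares about itself.

```python
import re

# Matches charset=... inside a META tag, with or without quotes around the value
_CHARSET_RE = re.compile(r"""<meta[^>]*charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def declared_charset(html):
    """Return the charset declared in a META tag, lower-cased, or None."""
    match = _CHARSET_RE.search(html)
    return match.group(1).lower() if match else None


sample = '<META http-equiv=Content-Type content="text/html; charset=UTF-8">'
print(declared_charset(sample))
```

A declared charset is only a hint. Pages written with proprietary font encodings may declare nothing useful, which is presumably why tag analysis is paired with content analysis.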

Page 8: Unit of Identification = LSE (Language + Script + Encoding)

Language | Script | Encoding
Dari | Arabic | UTF-8
Farsi | Arabic | UTF-8
Hindi | Devanagari | UTF-8
Hindi | Devanagari | Arjun
Hindi | Devanagari | Shusha
Hindi | Devanagari | Shivaji
Azeri | Latin | Latin-1
Azeri | Cyrillic | KOИ-R
Azeri | Arabic | ASMO

Legend: Difference of encoding; Difference of script; Difference of language
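Treating the language-script-encoding triple as the unit of identification maps naturally onto a small record type. The sketch below is illustrative only: `LSE` and `differs_in` are hypothetical names, not the project's code, and the values are rows from this slide.

```python
from typing import NamedTuple


class LSE(NamedTuple):
    language: str
    script: str
    encoding: str


def differs_in(a, b):
    """Return the names of the fields in which two LSE units differ."""
    return [name for name in LSE._fields if getattr(a, name) != getattr(b, name)]


dari = LSE("Dari", "Arabic", "UTF-8")
farsi = LSE("Farsi", "Arabic", "UTF-8")
hindi_utf8 = LSE("Hindi", "Devanagari", "UTF-8")
hindi_shusha = LSE("Hindi", "Devanagari", "Shusha")
azeri_latin = LSE("Azeri", "Latin", "Latin-1")
azeri_arabic = LSE("Azeri", "Arabic", "ASMO")

print(differs_in(dari, farsi))                # difference of language
print(differs_in(hindi_utf8, hindi_shusha))   # difference of encoding
print(differs_in(azeri_latin, azeri_arabic))  # script and encoding both differ
```

Because the triple is hashable, it can also serve directly as a dictionary key when counting pages per unit.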

Page 9: The First Workshop on the IMLD, 2004

http://portal.unesco.org/ci/en/ev.php-URL_ID=14480&URL_DO=DO_TOPIC&URL_SECTION=201.html

UNESCO reported the launch of the project

Page 10: Milestones, 2003 to 2007

• Oct. 2003: UNESCO adopted the “Cyberspace Recommendation”
• Oct. 2003: Project started with the support of the Japan Science and Technology Agency (JST)
• Feb. 2004: The First Language Observatory Workshop
• Jun. 2004: Started to collect web data with “UbiCrawler”
• Aug. 2005: The first version of the Language Identification Module (LIM)
• Nov. 2005: WSIS Tunis meeting inspired the collaboration with ACALAN
• Feb. 2006: The first meeting of the World Network for Linguistic Diversity
• Jun. 2006: Workshop at Bamako, Mali on the African Survey

Page 11: Expert Collaboration: Case of African Survey

June 26-28, 2006, at Bamako, Mali (ACALAN)

Mali, Algeria, Burkina Faso, Ethiopia, Kenya, Malawi, Nigeria, Tunisia, CNRS (France)

Page 12: Researchers Network: Over 35 countries

Experts’ contributions are essential for collecting text in local encodings, for collecting seed URLs, and for verifying LI results.

Page 13: 3.1 Survey Snapshot: Languages on the net, Asia

[Figure: four stacked bar charts (0% to 100%) showing the share of %Local, %Arabic, %Others, %Russian, and %English pages per country or territory.
Panel 1: Cyprus, Turkey, Israel, Lebanon, Jordan, Syria, Palestine, GCC, Iran, Afghanistan
Panel 2: Myanmar, Thailand, Lao, Cambodia, Malaysia, Indonesia, Philippines, Brunei, Vietnam, Singapore
Panel 3: Kazakhstan, Kyrgyzstan, Uzbekistan, Turkmenistan, Tajikistan, Azerbaijan, Mongolia
Panel 4: Pakistan, India, Sri Lanka, Maldives, Bhutan, Nepal, Bangladesh]

As of June 2006

Page 14: 3.1 Survey Snapshot (cont.): Languages on the net, Africa

[Figure: stacked bar chart (0% to 100%) showing the share of English, French, Arabic, Other Languages, and African Languages for four groups: All African domains, Commonwealth, Francophonie, League of Arab States]

As of October 2006

Page 15: 3.2 Technical Aspect: Localization Problem

“Language localization” has been the key obstacle to the use of new information technologies since the age of type printing.

Page 16: A Jesuit Friar’s letter, 1608: Six hundred versus 24

"Before I end this letter I wish to bring before Your Paternity's mind the fact that for many years I very strongly desired to see in this Province some books printed in the language and alphabet of the land, as there are in Malabar with great benefit for that Christian community. And this could not be achieved for two reasons; the first because it looked impossible to cast so many moulds amounting to six hundred, whilst as our twenty-four in Europe."

Source: Priolkar, The Printing Press in India, Bombay, 1958

[Image: Doctrina Christam in Tamil, 1578]

Page 17: Doctrina in Tagalog, 1593: The script was finally lost

“Doctrina Christiana”, bilingual version, printed in Tagalog in Tagalog script, in Tagalog in Latin script, and in Spanish in Latin script.

[Image: Philippines postal stamp issued in 1995]

Page 18: Encoding chaos leads to delay of localization

Language | Standard encoding and its share | Examples of other encodings found [note]
Turkish | ISO 8859 (99.5%) |
Hebrew | ISO 8859 (87.7%) |
Vietnamese | UTF-8 (96.4%) | TCVN, VIQR, VPS
Thai | TIS 620 (97.3%) |
Mongolian | UTF-8 (95.5%) | Latin-Cyrillic
Sinhala | UTF-8 (44.5%) | Metta, Kaputa, etc.
Telugu | UTF-8 (16.6%) | Shree, TLH, etc.
Tamil | UTF-8 (14.9%) | Amudham, Kumudam, Shree, Vikatan, etc.
Burmese | UTF-8 (0.7%) | WinResearcher, etc.

Note: Local proprietary encodings are shown in this table by the names of fonts (font families). As of June 2006.

Page 19: Unavailability of search engines: another problem

As of June 2006

Search engine shown: Google

Script columns: Latin | Cyrillic | Arabic | hanzi | Indic | Others

Europe: Major European languages (17) | Russian | --- | --- | --- | Greek

Asia, Africa (cell entries in slide order): Indonesian; African languages, Tagalog, etc.; ---; Bulgarian, Ukrainian, Belarusian, Central Asian; Arabic; Farsi, Urdu, Pashtu, etc.; Chinese/Japanese/Korean; ---; Indic, Thai, Lao, Khmer, Myanmar, Tibetan, etc.; Hebrew; Ethiopic, Georgian, Armenian, Divehi

Page 20: Technical Aspect of the Digital Language Divide

[Diagram]
Actors: global IT firms, international standards bodies, government, local IT firms, local media, users
Issues shown:
• less attention from IT vendors
• various localization by overseas communities
• differentiation strategy to enclose customers
• lack of leadership in standardization
• difficulty in access to standardization process
• lack of standard in typewriter keyboard
• encoding chaos, delay in localization
• non-availability of search engines (SEs)

Page 21: 3.3 Social Aspect: languages in multilingual society

• Personal domain: conversation, mail, phone, blog, magazines, newspaper, novel, songs, etc.
• Public domain: official documents, laws and regulations, traffic signs, contract, legal, etc.
• Occupational domain: business letter, invoice, manual, contract, name card, packaging, etc.
• Educational domain: textbook, academic journal, dictionary, scientific communication, etc.

Based on EU’s “Common European Framework of Reference for Languages” (2004)

Page 22: Language plays a different role in multilingual society

Secondary-level domain and corresponding socio-economic domain:
• ac.xx: educational
• com.xx: occupational
• gov.xx: public
• others: personal

Language categories: global languages, regional languages, official language(s), minority language(s)
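The mapping from secondary-level domain to socio-economic domain can be expressed as a simple lookup. The sketch below is an assumption about how such a classification might be coded, not the project's implementation. `socio_economic_domain` is a hypothetical helper, the host names are made up, and treating `co` like `com` is an added convenience because both forms are in use under different country-code domains.

```python
# Second-level labels and the socio-economic domain each is taken to represent
_SECOND_LEVEL = {
    "ac": "educational",
    "com": "occupational",
    "co": "occupational",  # assumption: same role as "com"
    "gov": "public",
}


def socio_economic_domain(hostname):
    """Classify a host under a two-letter country-code domain by its second-level label.

    Anything not matched, including hosts outside country-code domains,
    falls into the "others" bucket, which the slide labels as personal.
    """
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2 or len(labels[-1]) != 2:
        return "personal"
    return _SECOND_LEVEL.get(labels[-2], "personal")


for host in ["www.example.ac.ir", "shop.example.com.tr", "www.example.gov.cy", "www.example.kz"]:
    print(host, socio_economic_domain(host))
```

Counting identified languages per returned category would then give the kind of per-domain breakdown that a secondary domain analysis reports.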

Page 23: Specialization of Language: Secondary domain analysis

[Figure: four bar charts showing language shares by secondary domain (ac, co/com, gov, others).
Turkey: English, Turkish, Tatar, Others
Kazakhstan: English, Russian, Kazakh, Others
Iran: English, Arabic, Farsi, Others
Cyprus: English, Greek, Turkish, Others]

Page 24: Social Aspect of the Digital Language Divide

[Diagram]
Actors: government, overseas community, global IT firms, media/press, e-business, local business, local media, higher education, primary/secondary education, users
Issues shown:
• non-availability of SEs
• absence of mother language
• low literacy
• restricted social activities

Page 25: 3.4 Non-linguistic Aspects: a. Network and Server

As of December 2005

80% of servers under African domains are located outside the country. 60% of servers in Asian domains are also “offshore”.

[Figure legend]
• ○ rw: Rwanda
• △ ml: Mali
• □ mz: Mozambique
• White: servers installed in the country
• Colored: servers installed overseas

Page 26: Complaint against access: A letter from Namibia

I am the web master of the XXXXXXX Database. We are being severely hit by your Language Observatory's web crawler - already 37000 page hits this month. In December 2005 you hit us 34000 times. We are on limited bandwidth, and this puts unacceptable strain on our server. I notice that you consider one HTTP request every 5 seconds 'polite' and 'modest'. This may be true in Japan, but not in Africa - our connections are very slow and very narrow. I would appreciate it if you could prevent your crawlers from visiting our URL again. In return, I will be happy to provide you directly with whatever statistics about our site you need for your research.

Sincerely

We carefully control data collection speed using a set of parameters, such as revisiting interval, depth, maximum pages per server, and a prohibited-URL list.
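The parameters named on this slide can be pictured as a per-crawl policy object. The sketch below is not UbiCrawler's configuration format: the class, its field names, and the default values are assumptions chosen for illustration. The revisiting interval is read here as a minimum delay between requests to one host, and the 5-second default echoes the figure quoted in the letter.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlPolicy:
    revisit_interval_s: float = 5.0    # assumed meaning: minimum seconds between requests to one host
    max_depth: int = 8                 # assumed limit on link depth from the seed URL
    max_pages_per_server: int = 1000   # assumed cap on pages fetched from one host
    prohibited_prefixes: list = field(default_factory=list)  # URL prefixes never to visit

    def allows(self, url, depth, pages_fetched, seconds_since_last):
        """Return True if fetching url now would respect every limit."""
        if any(url.startswith(prefix) for prefix in self.prohibited_prefixes):
            return False
        if depth > self.max_depth or pages_fetched >= self.max_pages_per_server:
            return False
        return seconds_since_last >= self.revisit_interval_s


# Hypothetical policy for slow links: one request per minute, one site excluded
policy = CrawlPolicy(revisit_interval_s=60.0, prohibited_prefixes=["http://db.example.na/"])
print(policy.allows("http://db.example.na/page", 1, 0, 1000.0))
print(policy.allows("http://www.example.ml/", 1, 0, 60.0))
```

Keeping the limits in one object makes it straightforward to apply a gentler policy to domains with narrow bandwidth, which is the adjustment the letter asks for.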

Page 27: b. Domain Governance

[Figure: scatter plot of African country-code domains. X-axis: population (1,000), log scale from 1.E+00 to 1.E+06. Y-axis: pages, log scale from 1.E+00 to 1.E+08. Points are labeled by country code; labels spelled out on the slide: DJ (Djibouti), ZA (South Africa), SH (St. Helena), ST (Sao Tome & Principe), SC (Seychelles), AC (Ascension), IO (Indian Ocean Territory).]

As of December 2005

Management of small islands’ domains is often re-delegated to overseas web-hosting operators, who tend to admit spam, porn, etc.

Page 28: c. Access regulations by the government

[Figure: scatter plot. X-axis: TV receivers per population (0.00 to 0.70). Y-axis: Press Linkage Ratio (0 to 2.5). Points are labeled by country-code domain, from ae to ye.]

Countries where only state-controlled TV stations are available show a higher percentage of links going to global news sites abroad.

Page 29: 4. Future Agenda

Regarding Measurement
• Improvement of accuracy and coverage
• Multi-stakeholder collaboration
• Global Observatories Network

From Measurement to Empowerment
• Goals/Targets/Indicators system which helps and guides stakeholders in empowering languages

Page 30: World Network for Linguistic Diversity

Page 31: “Language Empowerment”: Mother language for creation

[Diagram]
Actors: language community, government, OSS developers, IT firms, users, language portal, media/press, higher education
Actions shown:
• local language search engines
• electronic delivery of public services
• localization of application SW based on standard
• creation of local contents (rich mother-language content)
• literacy
• mother language use in higher education
• promotion of NLP: OCR, TTS, MT, e-dictionary, etc. (mother-language information processing technology: OCR, TTS, translation)

Page 32: Millennium Development Goals: Structure

Page 33: Thanks for your attention

[Photo: Jehan Rectus Square, Paris. Photo courtesy of Wunna Ko Ko, June 2005]