Upload
hanteng-liao
View
181
Download
1
Embed Size (px)
DESCRIPTION
Wanna better analyze the geographic and linguistic outreach/dynamics of web traffic? We propose a method of geo‐linguistic normalization to do so, with multilingual Wikipedia projects as the example.
Citation preview
Han-Teng Liao defended his PhD successfully at the Oxford Internet Institute (OII) July 2014. His research focus in is on user-generated content and data, Web analytics (webometrics), Chinese Internet Research and integrated digital research designs (both qualitative and quantitative).
Thomas Petzold is a social technology analyst, TED speaker and professor of media management at HMKW – University of Applied Sciences for Media, Communication and Management in Berlin, Germany. As a research fellow at the WZB (2011–2013), he led a project on languages and big data in social technology.[photo: David Ausserhofer]
Abstract
What is Data Normalization?
Finer normalization: geolinguistic unit
A language tag:
• Often starts with a language code followed by a country code
• e.g. “fr‐CA” = the geolinguistic unit of French used in Canada.
• has corresponding data points in the Unicode’s Common Locale Data Repository (CLDR) Project.
• e.g. “fr‐CA” = 7,605,004 [12]
Finer geolinguistic data normalization is useful …
• for finer comparison between, say, Egyptian Arabic and Saudi Arabia Arabic speakers, or that of Spanish Spanish and Mexican Spanish speakers
• for analysts or designers to better know and thus support their users by to providing appropriate interfaces and content[7]
• for better understanding of the Wikipedia traffic data
References (partial: those mentioned in this poster)[1] American Planning Association 2006. Planning and Urban Design Standards. John Wiley & Sons.
[2] Cote, P. Effective Cartography: Mapping with Quantitative Data. Harvard Graduate School of Design.
[3] Crowston, K. et al. 2013. Sustainability of Open Collaborative Communities: Analyzing Recruitment Efficiency. Technology Innovation Management Review. January: Open Source Sustainability (2013).
AcknowledgmentsWe appreciate the Wikimedia UK for the scholarship for Han‐Teng Liao to present the findings at the OpenSym 2014. We also acknowledge the open source software tools called Scrapy for making the web mining tasks easier.
Data normalization, or geographic normalization, allows data to be compared using a sensible common denominator, thereby producing measurements of intensity or density, such as population density [1, 2]
Data normalization is useful …
• in “factoring out the size” in order to facilitate comparisons across unequal areas or populations [2]
• in dividing a certain numeric attribute (e.g. GDP) by another (e.g. population), andso as to derive another numeric attribute (e.g. GDP per capita)
• in minimizing the differences caused by the size of a geographic unit
It is similar to Crowston, Julien and Ortega[3] in “factoring out the size” but different in the choice of size unit.
• Crowston et al’s work[3] have proposed a measurement to compare how efficient a language version turns potential users into actual contributors.
• They found “a strong (but not perfect) correlation” between the total number of Wikipedia contributors on one side, and the Internet population, and total tertiary‐educated population on the other.
Han‐Teng Liao ([email protected]) and Thomas Petzold ([email protected])
Towards a better understanding of the geolinguistic dynamics of knowledge
Geographic And Linguistic Normalization
OpenSym '14 , Aug 27‐29 2014,Berlin, GermanyACM 978‐1‐4503‐3016‐9/14/08.http://dx.doi.org/10.1145/2641580.2641623
We propose a method of geo‐linguistic normalization to advance the existing comparative analysis of open collaborative communities, with multilingual Wikipedia projects as the example. Such normalization requires data regarding the potential users and/or resources of a geolinguistic unit.
0%
20%
40%
60%
80%
Percent of the traffic
Year/Month
pgViews_perLang
Egypt
Saudi Arabia
Other
Algeria
0
2
4
6
8
Norm
alized
by language
population
Year/Month
pgViews_perLang
IsraelKuwaitSaudi ArabiaUAEJordanBahrainQatarEgypt
Figure 1. Viewing traffic trend lines Figure 3. Normalized viewing traffic trend lines
Comparing results: before and after data normalization
Arabic Wikipedia viewing traffic
Arabic Wikipedia editing traffic?
Please refer to the extended abstract or ask the authors for more (Figure 2 and Figure 4).
English Wikipedia editing traffic
English Wikipedia viewing traffic
Figure EN1. Viewing traffic trend lines Figure EN3. Normalized viewing traffic trend lines
0%
10%
20%
30%
40%
50%
Percent of the traffic
Year/Month
pgViews_perLang
United States
Other
UnitedKingdomCanada
0
0.5
1
1.5
2
Norm
alized
by language
population
Year/Month
pgViews_perLang
CanadaUnited KingdomNew ZealandAustraliaIrelandUnited StatesMalaysiaNetherlandsSpainItalyFranceGermany
Figure EN2. Editing traffic trend lines Figure EN4. Normalized editing traffic trend lines
0%
10%
20%
30%
40%
50%
Percent of the traffic
Year/Month
pgEdits_perLang
United States
UnitedKingdomOther
Canada
0
0.5
1
1.5
2
2.5Norm
alized
by language
population
Year/Month
pgEdits_perLangUnitedKingdomNewZealandCanada
Ireland
Australia
[7] Liao, H.-T. 2013. How does localization influence online visibility of user-generated encyclopedias? A case study on Chinese-language Search Engine Result Pages (SERPs). Proceedings of the 9th International Symposium on Open Collaboration (Hong Kong, Aug. 2013).
[12]Unicode Consortium 2014. Language-Territory Information, CLDR Version 25.