ETD 2005
International Accesses to a Digital Library of ETDs
Ana Pavani
Departamento de Engenharia Elétrica
Pontifícia Universidade Católica do Rio de Janeiro
[email protected]
http://www.maxwell.lambda.ele.puc-rio.br/
Presentation outline
Profile of the digital library
Generation of data
Combination and analysis of data – interesting results
Next steps
Profile of the digital library
Beginning of the collection – 2nd semester of 1995
Items to start the collection – courseware (texts, exercises, technical manuals, tests, etc.)
The digital library is part of a system that:
Is an LMS (Learning Management System)
Has administrative functions that allow data exchange with the university’s administrative system
Is linked (in both directions) to CNPq’s Lattes Platform (curricula database with more than 595 K CVs)
Allows the control of series collections
Is multilingual and has interfaces in 3 languages
Evolution of the collection:
Administrative documents
Preprints, published papers & online articles
Interactive courseware
ETDs (2000)
Online journals (2003)
Senior projects (2003)
Online bulletins – distributed through mailing lists, archived and published automatically (2004)
Books (Oct. 2005)
Numbers of titles in the collection:
Courseware (many types) – 2,700+
Administrative documents – 33
Technical documents – 94
ETDs – 1,873 (PUC-Rio) + 31 (UNICAP)
Preprints, published papers & online articles – 280
Senior projects – 305
Online journals – 3 (+ 1 in Oct. 2005 + 1 in Dec. 2005)
Online bulletins – 2
Books – 1 (to be published in Oct. 2005)
Total number of digital objects (DOs): 16,400+
Technological characteristics:
Machine – IBM RS/6000
Operating system – IBM AIX
Web server – Apache
DBMS – IBM DB2
The Apache log contains info on accesses to ALL digital contents on the system, besides all transactions that users perform (clicking buttons, reading posts, reading help pages, etc.) – data on transactions with contents must be extracted from the server log to generate the numbers to be analyzed
Generation of data
Data have 2 different natures: production and accesses
Production data come from functions of the system that are not related to the Apache server but only to the DB
Example: ETD time series – PUC-Rio & UNICAP
[Figure: number of ETDs (0 to 2,000) by semester, Jun. 2000 to Sep. 2005]
(*) PUC-Rio started requiring ETDs in Aug. 2002.
(*) UNICAP does not require ETDs.
Access data are obtained from both the Apache Server log and the DB:
Logs are mined (according to the following definitions) and the results are stored on the DB
Mined data are combined with production data (metadata) already in the database (types of contents, authors, programs, areas of knowledge, dates, countries, etc.) to yield results
Definitions for mining the log
When access statistics came into discussion, it was necessary to define how data should be mined from the log and how it should be combined afterwards
The definitions follow – (M) mining definitions and (C) combining definitions
(M) Visits and complete visits
An ETD can have one or many digital objects. The number of visits is the sum of all accesses to all digital objects in a given month. A complete visit is a set of visits to all digital objects from a country in a given month.
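One possible reading of these two definitions, as a minimal sketch. The object names and sample accesses are made up, and counting complete visits as the minimum access count over the digital objects of the ETD is an assumption; the slides do not say how the system counts them.

```python
from collections import defaultdict

# Hypothetical ETD made of three digital objects (file names are made up).
ETD_OBJECTS = ["cover.pdf", "chapter1.pdf", "chapter2.pdf"]

# Hypothetical mined accesses for one month: (digital object, country).
accesses = [
    ("cover.pdf", "BR"), ("chapter1.pdf", "BR"), ("chapter2.pdf", "BR"),
    ("cover.pdf", "BR"), ("chapter1.pdf", "BR"),
    ("cover.pdf", "PT"), ("chapter1.pdf", "PT"),
]

def visits(accesses):
    """Visits: the sum of all accesses to all digital objects in the month."""
    return len(accesses)

def complete_visits(accesses, objects):
    """Complete visits per country, read here (an assumption) as the number
    of times a country covered every digital object of the ETD, i.e. the
    minimum access count over the objects."""
    per_country = defaultdict(lambda: defaultdict(int))
    for obj, country in accesses:
        per_country[country][obj] += 1
    return {
        country: min(counts.get(obj, 0) for obj in objects)
        for country, counts in per_country.items()
    }

print(visits(accesses))                        # 7
print(complete_visits(accesses, ETD_OBJECTS))  # {'BR': 1, 'PT': 0}
```

In this sample, Portugal never reached chapter2.pdf, so it contributes visits but no complete visit.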
(M) Country x IP address
The decision to use the country and not the IP address to establish a visit was based on the fact that the visits to an ETD can be made at different times (and reconnecting may assign a new IP address) and from different locations (with fixed IP addresses).
(M) Counting visits from the same IP address
Visits from the same IP are counted individually due to the fact that networks with many machines can be identified by the IP address of a firewall.
(M) Counting visits to restricted digital objects
Some ETDs are totally or partially restricted – approximately 30% have some type of permanent or temporary restriction. Metadata, abstracts included, are publicly available for all of them. It was decided that attempts followed by denials of access would be counted as accesses.
!! This is stated in the help pages of the system; it is suggested that authors consider allowing their contents to become public if many attempts occur.
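The counting rule for restricted objects can be sketched as a predicate on the HTTP status code of a log line. The specific status codes below are assumptions; the slides do not say which codes the server records for served or denied requests.

```python
# Assumed status codes: the source does not state which ones the server logs.
SERVED = {200, 206, 304}   # content delivered (or confirmed as cached)
DENIED = {401, 403}        # attempt on a restricted digital object

def counts_as_access(status: int) -> bool:
    """Count both served requests and denied attempts as accesses."""
    return status in SERVED or status in DENIED

statuses = [200, 403, 404, 401, 500]
print([s for s in statuses if counts_as_access(s)])  # [200, 403, 401]
```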
(C) Lines to mine
Since the interest was in accesses to digital objects, the decision was to mine only the lines with extensions .dcr, .doc, .htm, .pdf, etc. All possible extensions on the database are considered, as long as the corresponding item is cataloged on the digital library, so that a static html page of the system itself is not counted.
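A minimal sketch of this filter, assuming the catalog can be represented as a set of request paths. The paths and the helper name `should_mine` are hypothetical.

```python
import posixpath

# Hypothetical catalog: request paths of digital objects registered in the library.
CATALOGED = {"/etd/1234/chapter1.pdf", "/etd/1234/index.htm"}

# Extensions are taken from the catalog itself, so every extension
# present on the database is considered.
EXTENSIONS = {posixpath.splitext(path)[1].lower() for path in CATALOGED}

def should_mine(path: str) -> bool:
    """Keep a log line only if its extension is known AND the item is cataloged."""
    ext = posixpath.splitext(path)[1].lower()
    return ext in EXTENSIONS and path in CATALOGED

print(should_mine("/etd/1234/chapter1.pdf"))  # True
print(should_mine("/help/about.htm"))         # False: static system page
print(should_mine("/img/logo.gif"))           # False: extension not in the catalog
```

The catalog check is what excludes static pages: /help/about.htm has a valid extension but is not a cataloged item.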
Observations
(1) Statistics were planned on a monthly basis. The model treats data as sequences of points with discrete-time intervals of a month. Past months’ data are unchanged and the current month is updated according to the Update definition.
(2) IPs are resolved using a plug-in called GeoIP Free that is available with AWStats.
(C) Information to get from a log line
The month and the year are extracted along with identification of the digital object and the country of the IP address that accessed the digital object.
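The extraction can be sketched as follows, assuming the server writes the Apache Common Log Format. The sample line, the IP addresses and the `COUNTRY_BY_IP` table (a stand-in for the GeoIP lookup) are made up for illustration.

```python
import re

# Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Stand-in for the GeoIP lookup; addresses and countries are made up.
COUNTRY_BY_IP = {"192.0.2.10": "BR", "198.51.100.7": "PT"}

def extract(line):
    """Return (year, month, digital object, country) or None if unparsable."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    # Timestamp looks like 28/Sep/2005:14:03:11 -0300
    _day, month_name, rest = match.group("when").split("/", 2)
    year = int(rest[:4])
    country = COUNTRY_BY_IP.get(match.group("ip"), "unknown")
    return year, MONTHS[month_name], match.group("path"), country

line = ('192.0.2.10 - - [28/Sep/2005:14:03:11 -0300] '
        '"GET /etd/1234/chapter1.pdf HTTP/1.1" 200 52344')
print(extract(line))  # (2005, 9, '/etd/1234/chapter1.pdf', 'BR')
```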
(C) Update of the DB
The lines are read every hour at the full hours (00:00, 01:00, etc.); incremental lines are mined. Accesses are summed for each month-year-DO-country, so the table is not very big – in the first 6 months of 2005 the average number of lines per month was 10,000.
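A sketch of the incremental update, with an in-memory counter standing in for the DB table. The class name, the way the already-mined position is remembered, and the sample records are assumptions; the slides only state that incremental lines are mined and summed per month-year-DO-country.

```python
from collections import Counter

class MonthlyAccessTable:
    """In-memory stand-in for the DB table keyed by (year, month, DO, country)."""

    def __init__(self):
        self.counts = Counter()
        self.lines_seen = 0  # number of log records mined by earlier runs

    def update(self, all_records):
        """Mine only the records appended since the previous run."""
        new = all_records[self.lines_seen:]
        for year, month, obj, country in new:
            self.counts[(year, month, obj, country)] += 1
        self.lines_seen = len(all_records)
        return len(new)

table = MonthlyAccessTable()
log = [(2005, 9, "a.pdf", "BR"), (2005, 9, "a.pdf", "BR"), (2005, 9, "a.pdf", "PT")]
first = table.update(log)           # first hourly run mines 3 records
log.append((2005, 9, "a.pdf", "BR"))
second = table.update(log)          # next run mines only the new record
print(first, second, dict(table.counts))
```

Because accesses are summed per key, four log records collapse into two table rows here, which is why the table stays small.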
(C) When to start computing
The log of the Apache Server started being saved on Jun 01, 2004. So, either this date was used or a later one, for example Jan 01, 2005. The decision was to use all available monthly logs.
When the process started, some days of offline processing were required. Afterwards the update became automatic according to the Update definition.
Observations
(1) Maybe these were not the best definitions – we are willing to discuss alternatives!!
(2) The (original) logs are stored and saved offline in case some change in the mining strategy is decided (we have not sunk the ships!!).
Definitions for computing statistics
By author
Visited ETDs by year, month and country
Visited ETDs by country, month and year
25 most visited ETDs (on the system = PUC-Rio + UNICAP)
20 most visited ETDs by institution
10 most visited ETDs by graduate program
Visited ETDs by institution, program, year and month
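The "most visited" rankings above could be computed along these lines. The rows, ETD identifiers and the function name are hypothetical; only the idea of summing mined visits and ranking them comes from the slides.

```python
from collections import Counter

# Hypothetical mined rows: (ETD id, institution, visits in one month).
rows = [
    ("etd-1", "PUC-Rio", 120), ("etd-2", "PUC-Rio", 340),
    ("etd-3", "UNICAP", 95), ("etd-1", "PUC-Rio", 80),
    ("etd-4", "UNICAP", 15),
]

def most_visited(rows, n, institution=None):
    """Sum visits per ETD, optionally for one institution, and rank them."""
    totals = Counter()
    for etd, inst, visits in rows:
        if institution is None or inst == institution:
            totals[etd] += visits
    return totals.most_common(n)

print(most_visited(rows, 2))                        # [('etd-2', 340), ('etd-1', 200)]
print(most_visited(rows, 1, institution="UNICAP"))  # [('etd-3', 95)]
```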
Initial Results
Access to ETDs is increasing (Sep 28, 2005)
# ETDs May/Sep – 13%
# accesses May/Sep – 54.6%
[Figure: Visits per month, Jun 04 - Sep 05 (months 1 to 16), by origin: Brazil, United States, pt speaking countries, es speaking countries, others; y-axis 0 to 1,600]
Number of total visits is increasing (Sep 28, 2005)
[Figure: Total Visits per month, Jun 04 - Sep 05 (months 1 to 16); y-axis 0 to 3,500]
Accumulated average total visits is increasing (Sep 28, 2005)
[Figure: Accumulated Average Total Visits, Jun 04 - Sep 05 (months 1 to 16); y-axis 0 to 900]
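The accumulated average is read here as a running mean: the value for month k is the average of the monthly totals of months 1 to k. That reading is an assumption, and the monthly values below are made up, not taken from the chart.

```python
from itertools import accumulate

def accumulated_average(monthly):
    """Average of months 1..k for every k (a running mean)."""
    return [total / k for k, total in enumerate(accumulate(monthly), start=1)]

monthly_visits = [100, 200, 300, 400]  # made-up monthly totals
print(accumulated_average(monthly_visits))  # [100.0, 150.0, 200.0, 250.0]
```

A running mean rises more slowly than the monthly series itself, which is consistent with the two charts having different vertical scales.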
Brazil accounts for 55% of the accesses since Jun 01, 2004 (Sep 28, 2005)
Origin Share of accesses (%)
Brazil 54.87
United States 11.94
pt speaking countries 7.00
es speaking countries 13.53
others 12.66
But… Brazil + pt speaking + es speaking = 75%
Brazil + US + pt speaking + es speaking = 87%
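The two sums quoted on the slide can be checked against the chart values. The pairing of each percentage with its label follows the order in which they appear on the slide, which is an assumption that the sums below support.

```python
# Shares of accesses (%), paired with labels in slide order (assumed pairing).
shares = {
    "Brazil": 54.87,
    "United States": 11.94,
    "pt speaking countries": 7.00,
    "es speaking countries": 13.53,
    "others": 12.66,
}

iberian = (shares["Brazil"] + shares["pt speaking countries"]
           + shares["es speaking countries"])
with_us = iberian + shares["United States"]

print(round(iberian, 2))               # 75.4, quoted as 75%
print(round(with_us, 2))               # 87.34, quoted as 87%
print(round(sum(shares.values()), 2))  # 100.0
```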
On Jun 15, 2007 the numbers of ETDs in Iberian languages on the NDLTD DB were
Brazilian ETDs were 83% of all ETDs in Iberian languages (total number 13,369)
Institution Country Language(s) Number
National Library Portugal Portuguese 185
IBICT (includes PUC-Rio) Brazil Portuguese 11,118
UAB Spain (Catalunya) Catalan or English or Spanish 1,011
UIB Spain (Catalunya) Catalan or English or Spanish 22
UJI Spain (Catalunya) Catalan or English or Spanish 42
UOC Spain (Catalunya) Catalan 1
UPC Spain (Catalunya) Catalan or English or Spanish 415
UPF Spain (Catalunya) Catalan or English or Spanish 67
URL Spain (Catalunya) Spanish 1
URV Spain (Catalunya) Catalan or English or Spanish 106
UdG Spain (Catalunya) Catalan or English or Spanish 131
UdL Spain (Catalunya) Catalan or English or Spanish 70
UV Spain (Catalunya) Catalan or English or Spanish 200
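The total and the Brazilian share follow directly from the table rows; a short check using the numbers listed above:

```python
# Number of ETDs per institution, copied from the table.
etds = {
    "National Library": 185, "IBICT": 11118, "UAB": 1011, "UIB": 22,
    "UJI": 42, "UOC": 1, "UPC": 415, "UPF": 67, "URL": 1,
    "URV": 106, "UdG": 131, "UdL": 70, "UV": 200,
}

total = sum(etds.values())
brazil_share = 100 * etds["IBICT"] / total

print(total)                   # 13369
print(round(brazil_share, 1))  # 83.2
```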
Percentage of visits from Brazil is decreasing (Sep 28, 2005)
[Figure: Percentage of Visits from Brazil per month, Jun 04 - Sep 05 (months 1 to 16); y-axis 0 to 80]
Accumulated percentage averages of visits from Brazil (Sep 28, 2005)
[Figure: Accumulated Average Percentage of Visits from Brazil, Jun 04 - Sep 05 (months 1 to 16); y-axis 50 to 68]
Total accesses top 10 countries (Sep 28, 2005)
# identified countries: 122
+ unidentified countries
+ satellite access host
Country Visits
Brazil 12,845
USA 2,795
Portugal 1,489
Spain 679
Peru 652
Mexico 432
Chile 364
France 245
Colombia 225
Argentina 224
Some interesting results
Some ETDs are permanent ‘best sellers’
They are on specific subjects (examples: a specific philosopher and history of modern architecture in Brazil)
They are linked from sites on the subjects (examples: the first from the US & Brazil and the second from Germany)
They are accessed from different countries
Some topics are permanent ‘best sellers’ (example: energy)
Some ETDs are temporary ‘best sellers’ – this seems to happen while they are listed by the ‘last published ETDs’ functions (system and graduate program)
Some graduate programs are permanent ‘best sellers’
They research topics that are very specific to the country (examples: education and history of culture)
They are indexed in other sites and/or digital libraries (examples: Universia in Spain for social sciences and humanities)
They are accessed from different countries
The 25 most visited ETDs have a large number of visits
No average is lower than 100 visits per month
[Figure: Average Number of Visits of the 25 Most Visited ETDs, Jan 04 - Sep 05 (months 1 to 16); y-axis 0 to 400]
Next steps
Find out how readers got to ETDs (BDTD, NDLTD, SCIRUS, etc.) – an online survey is planned
Interview faculty to check if some ETDs are recommended reading in courses
Gather more data and analyze in a ‘more scientific’ manner (must find a student!!)
Develop additional functions comparing accesses with production
Extend to other digital contents (at the moment only ETDs and online journals have access statistics)
Thank you!
Muito obrigada!