SESSION TIMEOUT THRESHOLDS IMPACT ON QUALITY AND QUANTITY OF EXTRACTED SEQUENCE RULES


    International Journal on New Computer Architectures and Their Applications (IJNCAA) 2(1): 34-51. The Society of Digital Information and Wireless Communications, 2012 (ISSN: 2220-9085)


    knowledge from the online courses [3]. We note herein the effort of some authors to design tools that automate typical tasks performed in the pre-processing phase [4], and of authors who prepare step-by-step tutorials [5, 6].

    Good-quality data are a prerequisite for a well-realized data analysis. If there is junk at the input, there will be junk at the output, regardless of the knowledge extraction method used. This applies even more in the area of web log mining, where the log file requires thorough data preparation. As an example we can present usage analysis, where we aim to find out what our web visitors are interested in. For this purpose we can use:

    1. survey sampling: we collect answers to particular items in a questionnaire, and a visitor to our site knows that he/she is the object of our survey [7],

    2. web log mining: we analyse the log file of the web server, which contains information on accesses to the pages of our website, and the visitor does not know that he/she is the object of our survey.

    While in the case of survey sampling we can provide good-quality data by using a reliable and valid measuring procedure for collecting them, in the case of web log mining we can provide them through good preparation of the data from the log file.

    Data pre-processing itself often represents the most time-consuming phase of web page analysis [8]. We carried out an experiment to find out to what extent it is necessary to execute data pre-processing tasks in order to gain valid data from the log files obtained from learning management systems. Specifically, we would like to assess the impact of the session timeout threshold and path completion on the quantity and quality of extracted sequence rules that represent the learners' behavioural patterns in a learning management system [9].

    We compare six datasets of different quality obtained from the logs of the learning management system and pre-processed in different ways. We use three datasets with user sessions identified on the basis of a 15-, 30- and 60-minute session timeout threshold (STT), and three further datasets with the same thresholds that also include reconstructed paths among course activities.

    The rest of the paper is structured as follows. In section 2 we summarize related work of other authors who deal with data pre-processing issues in connection with educational systems. In particular, we pay attention to authors who were concerned with the problem of finding the most suitable value of STT for session identification. In section 3 we detail the research methodology and describe how we prepared the log files in different manners. Section 4 gives a detailed summary of the experiment results. Finally, we discuss the obtained results and indicate our future work in section 5.

    2 LITERATURE OVERVIEW

    The aim of the pre-processing phase is to convert the raw data into a suitable input for the next-stage mining algorithms [1]. Before applying a data mining algorithm, a number of general data pre-processing tasks can be applied. In this paper we focus only on data cleaning, user identification, session identification and path completion.

    Marquardt et al. [4] published a comprehensive paper about the


    application of web usage mining in the e-learning area with a focus on the pre-processing phase. They did not deal with the session timeout threshold in detail. Romero et al. [5] paid more attention to data pre-processing issues in their survey. They summarized specific issues of web data mining in learning management systems and provided references to other relevant research papers. Moreover, Romero et al. dealt with some specific features of data pre-processing tasks in LMS Moodle in [5, 10], but they excluded the problems of user identification and session identification from their discussion.

    A user session, which is closely associated with user identification, is defined as a sequence of requests made by a single user over a certain navigation period; a user may have a single session or multiple sessions during this time period. Session identification is the process of segmenting the log data of each user into individual access sessions [11]. Romero et al. argued that these tasks are solved by logging into and logging out of the system. We can agree with them in the case of user identification.

    In the e-learning context, unlike other web-based domains, user identification is a straightforward problem because the learners must log in using their unique ID [1]. An excellent review of user identification was made in [3] and [12].

    Assuming the user is identified, the next step is to perform session identification by dividing the click stream of each user into sessions. Many approaches to session identification can be found [13-17].

    In order to determine when a session ends and the next one begins, the session timeout threshold (STT) is often used. An STT is a pre-defined period of inactivity that allows web applications to determine when a new session occurs [18]. Each website is unique and should have its own STT value. The correct session timeout threshold has been discussed by several authors, who experimented with a variety of timeouts to find an optimal value [19-24]. However, no generalized model has been proposed to estimate the STT used to generate sessions [19]. Some authors noted that the number of identified sessions is directly dependent on time. Hence, it is important to select the correct time interval in order for the number of sessions to be estimated accurately [18].
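    The effect of the STT on the number of identified sessions can be illustrated with a minimal sketch. It assumes a hypothetical record layout of (user ID, timestamp, page) tuples and splits on inactivity gaps only; the function name identify_sessions and the sample clicks are illustrative and are not taken from the study.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def identify_sessions(records, stt_minutes):
    """Split each user's click stream into sessions using an inactivity timeout.

    records: iterable of (user_id, timestamp, page) tuples (hypothetical layout).
    Returns a list of (user_id, [pages]) sessions.
    """
    stt = timedelta(minutes=stt_minutes)
    by_user = defaultdict(list)
    for user_id, timestamp, page in records:
        by_user[user_id].append((timestamp, page))

    sessions = []
    for user_id, clicks in by_user.items():
        clicks.sort(key=lambda click: click[0])  # chronological order per user
        current, previous_time = [], None
        for timestamp, page in clicks:
            # A gap longer than the threshold closes the current session.
            if previous_time is not None and timestamp - previous_time > stt:
                sessions.append((user_id, current))
                current = []
            current.append(page)
            previous_time = timestamp
        if current:
            sessions.append((user_id, current))
    return sessions


# Illustrative clicks of one user: gaps of 10, 20 and 45 minutes.
t0 = datetime(2010, 10, 1, 9, 0)
log = [
    ("u1", t0, "A"),
    ("u1", t0 + timedelta(minutes=10), "B"),
    ("u1", t0 + timedelta(minutes=30), "C"),
    ("u1", t0 + timedelta(minutes=75), "D"),
]
session_counts = {stt: len(identify_sessions(log, stt)) for stt in (15, 30, 60)}
print(session_counts)
```

    On this toy log a shorter threshold yields more, shorter sessions, which is the dependence on the chosen time interval described above.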

    In this paper, we used a reactive time-oriented heuristic method to define the users' sessions. From our point of view, sessions were identified as delimited series of clicks realized within the defined time period. We prepared three different files (A1, A2, A3) with a 15-minute STT (mentioned for example in [25]), a 30-minute STT [12, 19, 26, 27] and a 60-minute STT [28] to start a new session, with regard to the setting used in the learning management system.

    The analysis of the path completion of users' activities is another problem. The reconstruction of activities focuses on the retroactive completion of records on the path traversed by the user by means of the Back button, since the use of this button is not automatically recorded in the log entries of a web-based educational system. Path completion consists of completing the log with inferred accesses. The site topology, represented by the sitemap, is fundamental for this inference and contributes significantly to the quality of the resulting dataset, and thus to the precision and reliability of the patterns [4]. The sitemap can be obtained using a crawler. We used the application Web


    Crawling implemented in the Data Miner used for the needs of our analysis. Having ordered the records according to the IP address, we searched for linkages between consecutive pages.

    We found and analyzed several approaches mentioned in the literature [12, 17]. Finally, we chose the same approach as in our previous paper [9]. A sequence for the selected IP address can look like this: A→B→C→D→X. In our example, based on the sitemap, the algorithm can find out that there is no hyperlink from page D to page X. Thus we assume that page X was accessed by the user by means of the Back button from one of the previous pages.

    Then, by browsing backwards, we can find out which of the previous pages contains a reference to page X. In our sample case, if there is no hyperlink to page X from page C, page C is entered into the sequence, i.e. the sequence will look like this: A→B→C→D→C→X. Similarly, we find that there is no hyperlink from page B to page X, so B is added into the sequence, i.e. A→B→C→D→C→B→X. Finally, the algorithm finds out that page A contains a hyperlink to page X, and after the termination of the backward path analysis the sequence will look like this: A→B→C→D→C→B→A→X. This means that the user used the Back button to move from page D to C, from C to B and from B to A [29].

    After the application of this method we obtained the files (B1, B2, B3) with an identification of sessions based on user ID, IP address, different timeout thresholds and completion of the paths [9].
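    The backward path analysis described above can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the sitemap is assumed to be a dictionary mapping each page to the set of pages it links to, and the function name complete_path is hypothetical.

```python
def complete_path(sequence, sitemap):
    """Insert inferred Back-button accesses into a sequence of visited pages.

    sitemap: dict mapping a page to the set of pages it links to (assumed format).
    """
    completed = []  # output sequence including inferred accesses
    stack = []      # pages reachable via the Back button, most recent last
    for page in sequence:
        if stack and page not in sitemap.get(stack[-1], set()):
            # No hyperlink from the last page: browse backwards until an
            # earlier page that links to `page` is found.
            for depth in range(len(stack) - 2, -1, -1):
                if page in sitemap.get(stack[depth], set()):
                    # Record the Back-button steps down to that page.
                    completed.extend(reversed(stack[depth:-1]))
                    del stack[depth + 1:]
                    break
        completed.append(page)
        stack.append(page)
    return completed


# Toy sitemap for the example: only page A links to X.
sitemap = {"A": {"B", "X"}, "B": {"C"}, "C": {"D"}, "D": set()}
print("→".join(complete_path(list("ABCDX"), sitemap)))
```

    If no earlier page links to the requested page, the sketch leaves the sequence unchanged at that point; how such cases were handled in the study is not stated in this section.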

    3 USED METHODOLOGY

    3.1 Sequence Rules

    Sequence rules have been derived from association rules, thus the differences between them are not so great [30]. A k-sequence is a sequence of length k, i.e. it contains k pages. A frequented k-sequence is a variation of the frequented k-item set, or a combination. Usually, the frequented single-item set is identical with the frequented single sequence.

    In the following example we illustrate the differences between the algorithms Apriori and AprioriAll, where Apriori serves for searching for association rules and AprioriAll for searching for sequence rules: $D = \{S_1 = \{U_1, \langle a, b, c \rangle\}, S_2 = \{U_2, \langle a, c \rangle\}, S_3 = \{U_1, \langle b, c, e \rangle\}, S_4 = \{U_3, \langle a, c, d, c, e \rangle\}\}$, where $D$ is a database of transactions with a time label, $a, b, c, d, e$ are web sections and $U_1, U_2, U_3$ represent users. Each transaction is identified by a user. Let us assume that the minimum support is 30 %.

    In this case, user $U_1$ actually has two transactions. When searching for sequence rules we consider his sequence to be the concatenation of the web sections in the transactions $S_1$ and $S_3$, i.e. a sequence can consist of several transactions, and consecutive accesses to pages are not required. Similarly, the support of a sequence is determined not by the percentage of transactions but by the percentage of users who own the given sequence. A sequence is large (frequented) if it is contained in at least one sequence identified by the user and meets the condition of minimum support. The set of frequented sequences that have k items is denoted $L_k$. To find $L_k$, we use a set of candidates,


    denoted $C_k$, which contains sequences with k items.

    The first step is sorting the transactions by user, with a time label for each page visited by him; the remaining steps are similar to those in the Apriori algorithm. Having sorted the transactions, we obtain concatenated sequences identified by the user, which represent the complete references from a single user: $D = \{S_1 = \{U_1, \langle a, b, c \rangle\}, S_3 = \{U_1, \langle b, c, e \rangle\}, S_2 = \{U_2, \langle a, c \rangle\}, S_4 = \{U_3, \langle a, c, d, c, e \rangle\}\}$. Similarly to the Apriori algorithm, we start by generating the set of candidates of length 1: $C_1 = \{\langle a \rangle, \langle b \rangle, \langle c \rangle, \langle d \rangle, \langle e \rangle\}$, from which we define the set $L_1 = \{\langle a \rangle, \langle b \rangle, \langle c \rangle, \langle d \rangle, \langle e \rangle\}$ of single-item sequences, where each page is referenced by at least one user.

    Following this, we generate the set of candidates $C_2$ from $L_1$ by means of so-called full linking, i.e. we allow for a web user who browses through the pages forwards or backwards. This is the reason why the Apriori algorithm is not suitable for web log mining, in contrast to the AprioriAll algorithm, which respects the above-mentioned fact. Out of the set of candidates of length 2: $C_2 = \{\langle a, b \rangle, \langle a, c \rangle, \langle a, d \rangle, \langle a, e \rangle, \langle b, a \rangle, \langle b, c \rangle, \langle b, d \rangle, \langle b, e \rangle, \langle c, a \rangle, \langle c, b \rangle, \langle c, d \rangle, \langle c, e \rangle, \langle d, a \rangle, \langle d, b \rangle, \langle d, c \rangle, \langle d, e \rangle, \langle e, a \rangle, \langle e, b \rangle, \langle e, c \rangle, \langle e, d \rangle\}$, we define the set $L_2 = \{\langle a, b \rangle, \langle a, c \rangle, \langle a, d \rangle, \langle a, e \rangle, \langle b, c \rangle, \langle b, e \rangle, \langle c, b \rangle, \langle c, d \rangle, \langle c, e \rangle, \langle d, c \rangle, \langle d, e \rangle\}$ of double-item sequences, where each sequence is contained in at least one sequence identified by the user. We then proceed analogously.
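    The first two passes of this example can be reproduced with a short sketch. It is an illustration of the worked example only, not of the STATISTICA implementation: transactions are assumed to be listed in time order, support is counted per user, and candidates of length 2 are taken to be all ordered pairs of distinct frequented items, as in the example. All variable and function names are illustrative.

```python
from itertools import permutations

# Transactions from the example: (user, ordered list of visited web sections).
transactions = [
    ("U1", ["a", "b", "c"]),            # S1
    ("U2", ["a", "c"]),                 # S2
    ("U1", ["b", "c", "e"]),            # S3
    ("U3", ["a", "c", "d", "c", "e"]),  # S4
]
MIN_SUPPORT = 0.30

# Concatenate each user's transactions (input is assumed to be time-ordered).
user_sequences = {}
for user, pages in transactions:
    user_sequences.setdefault(user, []).extend(pages)


def contains(sequence, candidate):
    """True if candidate is an ordered, not necessarily contiguous, subsequence."""
    remaining = iter(sequence)
    return all(item in remaining for item in candidate)


def support(candidate):
    """Share of users whose concatenated sequence contains the candidate."""
    hits = sum(contains(seq, candidate) for seq in user_sequences.values())
    return hits / len(user_sequences)


items = sorted({page for seq in user_sequences.values() for page in seq})
L1 = [(item,) for item in items if support((item,)) >= MIN_SUPPORT]
# "Full linking": every ordered pair of distinct frequented items is a candidate.
C2 = list(permutations([seq[0] for seq in L1], 2))
L2 = [candidate for candidate in C2 if support(candidate) >= MIN_SUPPORT]

print(len(C2), len(L2))
print(L2)
```

    With three users, a minimum support of 30 % is met as soon as one user owns the sequence, which is why a pair such as (c, b), found only in the concatenated sequence of U1, is kept.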

    3.2 Steps of Experiment

    We aimed at specifying the necessary steps that are required for gaining valid data from the log file of a learning management system. Specifically, we focused on the identification of sessions based on timeouts of various lengths, on the reconstruction of students' activities, and on the influence of the interaction of these two data preparation steps on the derived rules. We tried to assess the impact of these

    Figure 1. Application of data pre-processing technique on log file obtained from VLE.


    advanced techniques on the quantity and quality of the extracted rules. These rules contribute to the overall representation of the students' behaviour patterns. The experiment was realized in several steps (Figure 1).

    The first step was data acquisition, i.e. defining the variables observed in the log file from the point of view of obtaining the necessary data (ID, IP address, URL address, date and time of access, activity, etc.).

    The second step led to the creation of data matrices. The data come from the log file (information on accesses) and sitemaps (information on the course contents).

    The aim of the next step was data preparation on various levels. We identified sessions based on a 15-minute STT (File A1), a 30-minute STT (File A2) and a 60-minute STT (File A3). Subsequently, we identified sessions based on a 15-minute STT with completion of the paths (File B1), a 30-minute STT with completion of the paths (File B2) and a 60-minute STT with completion of the paths (File B3).

    The next step was data analysis. We searched for behaviour patterns of students in the individual files. We used STATISTICA Sequence, Association and Link Analysis for sequence rule extraction. It is an implementation that uses the powerful Apriori algorithm [31-34] together with a tree-structured procedure that requires only one pass through the data [35].

    After the data analysis we tried to understand the output data: we created data matrices from the outcomes of the analysis and defined assumptions.

    The last step was a comparison of the results of the data analysis carried out on various levels of data preparation from the point of view of the quantity and quality of the rules found, i.e. patterns of student behaviour when browsing the course. We compared the portion of rules found in the examined files, the portion of inexplicable rules in the examined files, and the values of the degree of support and confidence of the rules found in the examined files.

    Contemporary learning management systems store information about their users mainly in a relational database rather than in a server log file. Highly extensive log data on students' activities can be found there. Learning management systems usually have built-in student monitoring features, so they can record any student activity [36].


    The analyzed course consisted of 12 activities and 145 course pages. Students' records of their activities on individual course pages in the learning management system were observed in the e-learning course in the winter term of 2010.

    We used logs stored in the relational database of LMS Moodle. LMS Moodle keeps detailed logs of all activities that students perform. It logs every click that students make for navigation in the e-learning course [5].

    We used records from the mdl_log and mdl_log_display tables. These records contained the entities from the e-learning course with 180 participants. In this phase, the log file was cleaned of irrelevant items. First of all, we removed the entries of all users with a role other than student. After performing this task, 75 530 entries were accepted to be used in the next task.

    These records were pre-processed in different manners. In each file, the variable Session identifies an individual course visit. The variable Session was based on the variables User ID, IP address and a timeout threshold of selected length (15-, 30- and 60-minute STT) in the case

    Table 1. Number of accesses and sequences in particular files.

    File  Count of web accesses  Count of customer's sequences  Count of frequented sequences  Average size of customer's sequences
    A1    70553                  12992                          71                             5
    A2    70553                  12058                          81                             6
    A3    70553                  11378                          89                             6
    B1    75372                  12992                          73                             6
    B2    75372                  12058                          82                             6
    B3    75439                  11378                          93                             7

    Figure 2. Sequential/Stacked plot for derived rules in examined files.


    of files X1, X2 and X3, where X = {A, B}. The paths were completed for each file BY separately, where Y = {1, 2, 3}, based on the sitemap of the course.

    Compared to the file X1 with the identification of sessions based on a 15-minute STT (Table 1), the number of visits (customer's sequences) decreased by approximately 7 % in the case of the identification of sessions based on a 30-minute STT (X2) and by 12.5 % in the case of a 60-minute STT (X3). On the contrary, the number of frequented sequences increased by 14 % (A2) to 25 % (A3), and in the case of completing the paths by 12 % (B2) to 27 % (B3) in the examined files.

    Having completed the paths (Table 1), the number of records increased by almost 7 % and the average length of visits/sequences increased from 5 to 6 (X2), and in the case of the identification of sessions based on a 60-minute STT even to 7 (X3).
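    The relative changes quoted above can be recomputed directly from Table 1. The snippet below is only an arithmetic check on the table values; the dictionary layout and the helper pct_change are illustrative.

```python
# Values copied from Table 1:
# file -> (web accesses, customer's sequences, frequented sequences)
table1 = {
    "A1": (70553, 12992, 71),
    "A2": (70553, 12058, 81),
    "A3": (70553, 11378, 89),
    "B1": (75372, 12992, 73),
    "B2": (75372, 12058, 82),
    "B3": (75439, 11378, 93),
}


def pct_change(old, new):
    """Relative change from old to new, in per cent."""
    return (new - old) / old * 100.0


changes = {
    "visits A1 -> A2": pct_change(table1["A1"][1], table1["A2"][1]),
    "visits A1 -> A3": pct_change(table1["A1"][1], table1["A3"][1]),
    "frequented A1 -> A2": pct_change(table1["A1"][2], table1["A2"][2]),
    "frequented A1 -> A3": pct_change(table1["A1"][2], table1["A3"][2]),
    "frequented B1 -> B2": pct_change(table1["B1"][2], table1["B2"][2]),
    "frequented B1 -> B3": pct_change(table1["B1"][2], table1["B3"][2]),
    "records A1 -> B1": pct_change(table1["A1"][0], table1["B1"][0]),
}
for label, value in changes.items():
    print(f"{label}: {value:+.1f} %")
```

    The computed values agree approximately with the percentages given in the text.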

    We articulated the following assumptions:

    1. we expect that the identification of sessions based on a shorter STT will have a significant impact on the quantity of extracted rules in terms of decreasing the portion of trivial and inexplicable rules,

    2. we expect that the identification of sessions based on a shorter STT will have a significant impact on the quality of extracted rules in terms of their basic quality measures,

    3. we expect that the completion of paths will have a significant impact on the quantity of extracted rules in terms of increasing the portion of useful rules,

    4. we expect that the completion of paths will have a significant impact on the quality of extracted rules in terms of their basic quality measures.

    4 EXPERIMENT RESULTS

    We summarize the results of the experiment with regard to the last step of the methodology used in the next three sections.

    Table 2. Incidence of discovered sequence rules in particular files.

    Body ==> Head                                                    A1    A2    A3    B1    B2    B3    Type of rule
    course view ==> resource - final test requirements, course view  0     1     1     0     1     1     trivial
    . . . ==> . . .                                                  ...   ...   ...   ...   ...   ...   ...
    course view ==> view forum about ERD and relation schema         0     0     1     0     0     1     inexplicable
    . . . ==> . . .                                                  ...   ...   ...   ...   ...   ...   ...
    course view ==> view collaborative activities                    1     1     1     1     1     1     useful
    Count of derived sequence rules                                  63    78    89    68    81    98
    Per cent of derived sequence rules (Per cent 1's)                62.4  77.2  88.1  67.3  80.2  97.0
    Per cent 0's                                                     37.6  22.8  11.9  32.7  19.8  3.0

    Cochran Q test Q = 93.84758, df = 5, p