9
 An improved collaborative movie recommendation system using computational intelligence $ Zan Wang a , Xue Yu b,n , Nan Feng b , Zhenhua Wang c a School of Computer Software, Tianjin University, Tianjin 300072, China b College of Management and Economics, Tianjin University, Tianjin 300072, China c  American Electric Power, Gahanna, OH 43230, United States a r t i c l e i n f o  Article history: Received 20 September 2014 Accepted 29 September 2014 Available online 14 October 2014 Keywords: Movie recommendation Collaborative filtering Sparsity data Genetic algorithms K -means a b s t r a c t Recommendation systems have become prevalent in recent years as they dealing with the information overlo ad probl em by suggestin g users the most relevan t produ cts from a massive amount of data. For media product, online collaborative movie recommendations make attempt s to assist users to access their prefer red movies by capt uring precisel y similar neighbors among users or movies from their historical common ratings. However, due to the dat a sparsely, neighbor selecting is get ting mor e dif fic ult with the fas t increasing of movies and users. In this paper, a hybrid model-based movie recommenda- tio n system whi ch uti lize s the imp rov ed  K -means clust ering coupled with genetic algorithms (GAs) to partition transformed user space is proposed. It employs principal component analysis (PCA) data reduction technique to dense the movie population space which could reduce the computation complexity in intelligent movie recom-mendation as well. The experiment results on Movielens dataset indicate that the proposed approach can provi de hig h per for man ce in terms of acc uracy , and gen era te more rel iab le and personalized movie recommendations when compared with the existing methods. & 2014 Elsevier Ltd. All rights reserved. 1. Intr oduct ion Fast development of internet technology has resulted in explosive growth of available information over the last decade. Recommendation systems (RS), as one of the most successful information filtering applications, have become an effic ient way to solv e the informat ion overloa d pro- ble m. The aim of Recommendation systems is to aut o- matically generate suggested items (movies, books, news, music, CDs, DVDs, webpages) for users according to their historical preferences and save their searching time online by extracting useful data. Movie recommendation is the most widely used appli- cati on cou ple d wit h onli ne mul timedi a plat forms whi ch aims to help customers to access preferred movies intelli- gently from a huge movie library. A lot of works have been done both in the academic and industry area in developing new movie recommendation algorit hms and extensions. The majority of existing recommendation systems is based on collaborative filtering (CF) mechanism [13]  which has been successfully developed in the past few years. It first col lec ts rat ing s of mov ies giv en by ind ivi dual s and then rec ommends pro mising movies to tar get cus tomer based on the  like-minded individuals with similar tastes and pre- ferences in the past. There have been many famous online multimedia platforms (e.g., youtube.com, Netflix.com, and douban .com) incorporat ed with CF tech nique to sugg est Contents lists available at  ScienceDirec t journal homepage:  www.elsevier.com/locate/jvlc  Journal of Visual Languages and Computing http://dx.doi.org/10.1016/j.jvlc.2014.09.011 1045-926X/ & 2014 Elsevier Ltd. All rights reserved. This paper has been recommend ed for accep tance by S.-K. Chang. n Corresponding author. Tel.:  þ 86 22 27406125; fax: þ 86 22 87401540. E-mail addresses: [email protected]  (Z. Wang), [email protected] (X. Yu),  [email protected]  (N. Feng), [email protected] (Z. Wang).  Journal of Visual Languages and Computing 25 (2014) 66 7 675

An Improved Collaborative Movie Recommendation System Using Computational Intelligence-2

  • Upload
    jalatf

  • View
    228

  • Download
    0

Embed Size (px)

DESCRIPTION

Collaborative filtering

Citation preview

  • eco

    Wa

    China

    a r t i c l e i n f o

    Received 20 September 2014Accepted 29 September 2014Available online 14 October 2014

    Keywords:Movie recommendationCollaborative filtering

    a b s t r a c t

    idely used appli-platforms whichd movies intelli-works have beenrea in developingand extensions.systems is based[13] which has

    been successfully developed in the past few years. It first

    the like-minded individuals with similar tastes and pre-

    Contents lists available at ScienceDirect

    journal homepage: www.e

    Journal of Visual Langu

    This paper has been recommended for acceptance by S.-K. Chang.n Corresponding author. Tel.: 86 22 27406125; fax: 86 22 87401540.

    Journal of Visual Languages and Computing 25 (2014) 6676751045-926X/& 2014 Elsevier Ltd. All rights reserved.ferences in the past. There have been many famous onlinemultimedia platforms (e.g., youtube.com, Netflix.com, anddouban.com) incorporated with CF technique to suggest

    http://dx.doi.org/10.1016/j.jvlc.2014.09.011

    E-mail addresses: [email protected] (Z. Wang),[email protected] (X. Yu), [email protected] (N. Feng),[email protected] (Z. Wang).collects ratings of movies given by individuals and thenrecommends promising movies to target customer based onFast development of internet technology has resultedin explosive growth of available information over the lastdecade. Recommendation systems (RS), as one of the mostsuccessful information filtering applications, have becomean efficient way to solve the information overload pro-blem. The aim of Recommendation systems is to auto-matically generate suggested items (movies, books, news,music, CDs, DVDs, webpages) for users according to their

    by extracting useful data.Movie recommendation is the most w

    cation coupled with online multimediaaims to help customers to access preferregently from a huge movie library. A lot ofdone both in the academic and industry anew movie recommendation algorithmsThe majority of existing recommendationon collaborative filtering (CF) mechanism1. Introduction historical preferences and save their searching time onlineSparsity dataGenetic algorithmsK-meansRecommendation systems have become prevalent in recent years as they dealing with theinformation overload problem by suggesting users the most relevant products from amassive amount of data. For media product, online collaborative movie recommendationsmake attempts to assist users to access their preferred movies by capturing preciselysimilar neighbors among users or movies from their historical common ratings. However,due to the data sparsely, neighbor selecting is getting more difficult with the fastincreasing of movies and users. In this paper, a hybrid model-based movie recommenda-tion system which utilizes the improved K-means clustering coupled with geneticalgorithms (GAs) to partition transformed user space is proposed. It employs principalcomponent analysis (PCA) data reduction technique to dense the movie population spacewhich could reduce the computation complexity in intelligent movie recom-mendation aswell. The experiment results on Movielens dataset indicate that the proposed approachcan provide high performance in terms of accuracy, and generate more reliable andpersonalized movie recommendations when compared with the existing methods.

    & 2014 Elsevier Ltd. All rights reserved.Article history:An improved collaborative movie rusing computational intelligence$

    Zan Wang a, Xue Yu b,n, Nan Feng b, Zhenhuaa School of Computer Software, Tianjin University, Tianjin 300072, Chinab College of Management and Economics, Tianjin University, Tianjin 300072,c American Electric Power, Gahanna, OH 43230, United Statesmmendation system

    ng c

    lsevier.com/locate/jvlc

    ages and Computing

  • Z. Wang et al. / Journal of Visual Languages and Computing 25 (2014) 667675668media products to their customers. However, traditionalrecommendation systems always suffer from some inherentlimitations: poor scalability, data sparsity and cold startproblems [3,4]. A number of works have developed model-based approaches to deal with these problems and provedthe benefits on prediction accuracy in RS [58].

    Model-based CF employs the user-item ratings to learna model which is then used to generate online prediction.Clustering and dimensionality reduction techniques areoften employed in model-based approaches to address thedata sparse problem [5,8,9]. The sparsity issues arise dueto the insufficiency of users history rating data and it ismade even more severe in terms of the dramaticallygrowth of users and items. Moreover, high-dimensionalrating data may cause it difficult to extract commoninteresting users by similarity computation, which resultsin poor recommendations. In the literature, there havebeen many model-based recommendation systems devel-oped by partitioning algorithms coupled, such as K-meansand self-organizing maps (SOM) [1518,20]. The aim ofclustering is to divide users into different groups to formlike-minded (nearest) neighbors instead of searching thewhole user space, which could dramatically improve thesystem scalability. It has been proved that clustering-basedrecommendation systems outperform the pure CF-basedones in terms of efficiency and prediction quality [7,911].In many works, the clustering methods are conducted withthe entire dimensions of data which might lead to some-what inaccuracy and consume more computation time. Ingeneral, making high quality movie recommendations isstill a challenge, and exploring an appropriate and efficiencyclustering method is a crucial problem in this situation.

    To address challenges aforementioned, a hybrid model-based movie recommendation approach is proposed toalleviate the issues of both high dimensionality and datasparsity. In this article, we develop an optimized clusteringalgorithm to partition user profiles which have beenrepresented by denser profile vectors after principal com-ponent analysis transforming. The whole system consistsof two phases, an online phase, and an offline phase. Inoffline phase, a clustering model is trained in a relativelylow dimensional space, and prepares to target active usersinto different clusters. In online phase, a TOP-N movierecommendation list is presented for an active user due topredicted ratings of movies. Furthermore, a genetic algo-rithm (GA) is employed in our new approach to improvethe performance of K-means clustering, and the proposedclustering algorithm is named as GA-KM. We furtherinvestigate the performance of the proposed approach inMovieslens dataset. In terms of accuracy and precision, theexperiment results prove that the proposed approach iscapable of providing more reliable movie recommendationscomparing with the existing cluster-based CF methods.

    The remainder of this paper is organized as follows:Section 2 gives a brief overview on collaborative recommen-dation systems and clustering-based collaborative recommen-dation. Then we discuss the development of our proposedapproach called PCA-GAKM movie recommendation systemin detail in Section 3. In Section 4, experiment results onmovielens dataset and discussion are described. Finally, wesummarize this paper and the future work is given.2. Related work

    2.1. Movie recommendation systems based on collaborativefiltering

    Recommendation systems (RS), introduced by Tapestryproject in 1992, is one of the most successful informationmanagement systems [12]. The practical recommenderapplications help users to filter mass useless informationfor dealing with the information overloading and providingpersonalized suggestions. There has been a great success ine-commerce to make the customer access the preferredproducts, and improve the business profit. In addition, toenhance the ability of personalization, recommendationsystem is also widely deployed in many multimedia web-sites for targeting media products to particular customers.Nowadays, collaborative filtering (CF) is the most effectivetechnique employed by movie recommendation systems,which is on the basis of the nearest-neighbor mechanism.It assumes that people who have similar history ratingpattern may be on the maximum likelihood that have thesame preference in the future. All like-minded users, calledneighbors, are derived from their rating database that isrecording evaluation values to movies. The prediction of amissing rating given by a target user can be inferred by theweighted similarity of his/her neighborhood.

    Ref. [6] divides CF techniques into two important classesof recommender systems: memory-based CF and model-based CF. Memory-based CF operate on the entire userspace to search nearest neighbors for an active user, andautomatically produce a list of suggested movies to recom-mend. This method suffers from the computation complex-ity and data sparsity problem. In order to addresscomputational and memory bottleneck issues, Sarwaret al. proposed an item-based CF in which the correlationsbetween items are computed to form the neighborhood fora target item [4]. Their empirical studies proved that thatitem-based approach can shorten computation time appar-ently while providing comparable prediction accuracy.

    Model-based CF, on the other hand, develops a pre-buildmodel to store rating patterns based on user-rating databasewhich can deal with the scalability and sparsity issues. Interms of recommendation quality, model-based CF applica-tions can perform as well as memory-based ones. However,model-based approaches are time-consuming in buildingand training the offline model which is hard to be updated aswell. Algorithms that often used in model-based CF applica-tions include Bayesian networks [6], clustering algorithms[911], neural networks [13], and SVD (singular valuedecomposition) [5,14]. While traditional collaborativerecommendation systems have their instinct limitations,such as computational scalability, data sparsity and cold-start, and these issues are still challenges that affect theprediction quality. Over the last decade, there have been highinterests toward RS area due to the possible improvement inperformance and problems solving capability.

    2.2. Clustering-based collaborative recommendation

    In movie recommendation, clustering is a widely usedapproach to alleviate the scalability problem and provides

  • Z. Wang et al. / Journal of Visual Languages and Computing 25 (2014) 667675 669a comparable accuracy. Many works have proved thatthe benefits of clustering-based CF frameworks [1518].The aim of clustering algorithms is to partition objectsinto clusters that minimize the distance between objectswithin a same cluster to identify similar objects. As one ofmodel-based CF methods, clustering-based CF is used toimprove k-nearest neighbor (k-NN) performance by pre-building an offline clustering model. Typically, numbers ofusers can be grouped into different clusters based on theirrating similarity to find like-minded neighbors by theclustering techniques. Then the clustering process is per-formed offline to build the model. When a target userarrived, the online module assigns a cluster with a largestsimilarity weight to him/her, and the prediction rating of aspecified item is computed based on the same clustermembers instead of searching whole user space.

    According to early studies in [3,6], CF coupled withclustering algorithms is a promising schema to provideaccuracy personal recommendations and address the largescale problems. But they also concluded that good perfor-mance of clustering-based CF depends on appropriate cluster-ing techniques and the nature of dataset as well. Li and Kimapplied fuzzy K-means clustering method to group itemswhich combined the content information for similarity mea-surement to enhance the recommendation accuracy [9]. Inthe conclusion of their work, it shows that the proposedcluster-based approach is capable of dealing with the coldstart problem. Furthermore, Wang et al. developed [19] a newapproach to cluster both the rows and columns fuzzily inorder to condense the original user rating matrix. In Kim andAhns research, a new optimal K-means clustering approachwith genetic algorithms is introduced to make online shop-ping market segmentation [10]. The proposed approach istested to exhibit better quality than other widely usedclustering methods such as pure K-means and SOM algo-rithms in the domain of market segmentation, and could be apromising tool for e-commerce recommendation systems.

    Liu and Shih proposed two hybrid methods that exploitedthe merits of the Weighted RFM-based and the preference-based CF methods to improve the quality of recommenda-tions [20]. Moreover, K-means clustering is employed togroup customers based on their enhanced profile. Theexperiments prove that the combined models perform betterthan the classical k-NN mechanism. Xue et al. proposed anovel CF framework that uses clustering technique to addressdata sparsity and scalability in common CF [7]. In their work,K-means algorithm is employed to classify users for smooth-ing the rating matrix that is to generate estimated values formissing ratings corresponding to cluster members. In latterrecommendation phrase, the clustering result is utilized toneighborhood selection for an active user. The experimentresults show that the novel approach can demonstratesignificant improvement in prediction accuracy. Georgiouand Tsapatsoulis developed a genetic algorithm based clus-tering method which allows overlapping clusters to perso-nalized recommendation, and their experiment findingsshow that the new approach outperforms K-means cluster-ing in terms of efficiency and accuracy [21].

    The above works have proved that clustering-based CFsystems show more accuracy prediction and help dealwith scalability and data sparse issues.3. PCA-GAKM based collaborative ltering framework

    In this section, we aim at developing a hybrid cluster-based model to improve movie prediction accuracy, inwhich offline and online modules are coupled to makeintelligent movie recommendations. Traditional CF searchthe whole space to locate the k-nearest neighbors for atarget user, however, considering the super high dimen-sionality of user profile vectors, it is hard to calculate asimilarity to find like-minded neighbors based on ratingswhich leads to poor recommendation because of sparse.To address such an issue, our offline clustering moduleinvolves two phases: (1) to concentrate feature informa-tion into a relatively low and dense space using PCAtechnique; (2) to develop an effective GA-KM clusteringalgorithm based on the transformed user space.

    Fig. 1 shows an overview of the new approach: offlinemodule represented by light flow arrows, is used tooptimize and train the user profiles into different clusterson the basis of history rating data; online module is real-time movie recommendation noted with dark flow arrows,to which a target users rating vector input, and come outwith a TOP-N movie recommendation list. We explain thedetails in the following.

    3.1. Pre-processing data using PCA

    In this section, we employ a linear feature extractiontechnique to transfer the original high dimensional spaceinto a relatively low dimensional space in which carriesdenser feature information. Since the high dimensionalityof user-rating matrix which is mostly empty at the begin-ning makes the similarity computation very difficult, ourapproach is started with PCA-based dimension reductionprocess. As one of the most successful feature extractiontechniques, PCA is widely used in data prefilling anddimensional reduction of collaborative filtering systems[14,22,23].

    The main idea of PCA is to convert the original data toa new coordinate space which is represented by principalcomponent of data with highest eigenvalue. The firstprincipal component vector carries the most significantinformation after ordering them by eigenvalues from highto low. In general, the components with lesser significanceare ignored to form a space with fewer dimensions thanthe original one. Suppose we have user-rating mnmatrix in which n-dimension vector represents usersprofile. It turns out the n principal components afterperforming eigenvalue decomposition, and we select theonly first d components (d5n) to keep in the new dataspace which is based on the value of accumulated propor-tion of 90% of the original one. As a result, the reducedfeature vectors from PCA are prepared to feed to GA-KMalgorithm for classification.

    3.2. An enhanced K-means clustering optimized by geneticalgorithms

    Memory-based CF systems suffer from two main com-mon flaws: cold-start and data sparse. Many researchworks have proved benefits of cluster-based CF in terms

  • Z. Wang et al. / Journal of Visual Languages and Computing 25 (2014) 667675670ofneclasamaccemph

    1.the increased quality of recommendation and robust-ss. The objective of this section is to propose an effectivessification method to ensure the users who have thee preference could fall into one cluster to generateurate like-minded neighbors. The GA-KM algorithm weployed in this work can be roughly performed in twoases:

    K-means clustering:K-means algorithm is one of the most commonly usedclustering approaches due to its simplicity, flexibilityand computation efficiency especially considering largeamounts of data. K-means iteratively computes k clus-ter centers to assign objects into the most nearestcluster based on distance measure. When center points

    Fig. 1. Overview of proposed movie recommhavto aof sinacseeinfeentproGiveanpartcally(1)(2)

    ende no more change, the clustering algorithm comesconvergence. However, K-means lacks the abilityelecting appropriate initial seed and may lead tocuracy of classification. Randomly selecting initiald could result in a local optimal solution that is quiterior to the global optimum. In other words, differ-initial seeds running on the same dataset mayduce different partition results.n a set of objects x1; x2;; xn, where each object ism-dimensional vector, K-means algorithm aims toition these objects into k groups automatically. Typi-, the procedure consists of the following steps [24,25]:choose k initial cluster centers Cj, j1,2,3 , k;each xi is assigned to its closest cluster centeraccording to the distance metric;

    ation system framework.

  • (3) compute the sum of squared distances from allmembers in one cluster:

    J k

    j 1

    iACtemp

    jjxiMjjj2 1

    where Mj denotes the mean of data points in Ctemp;(4) if there is no further change, the algorithm has

    converged and clustering task is end; otherwise,recalculate the Mj of k clusters as the new clustercenters and go to (2).

    To overcome the above limitations, we introducegenetic algorithm to merge with K-means clusteringprocess for the enhancement of classification qualityaround a specified k.

    2. Genetic algorithms:Genetic algorithms (GAs) are inspired by nature evolu-tionary theory which is known for its global adaptiveand robust search capability to capture good solutions[26]. It can solve diverse optimization problems withefficiency due to its stochastic search on large andcomplicated spaces. The whole process of GA is guidedby Darwins nature survival principle and provides amechanism to model biological evolution. A GA utilizesa population of individuals as chromosomes, repre-senting possible solutions for a given problem.Each chromosome contains number of genes which is

    used to compute fitness to determine the likelihood ofreproduction for the next generation. Typically, a chro-mosome with the fittest value will be more likely toreproduce than unfit ones. GA iteratively creates newpopulations replace of the old ones by selecting solu-tions (chromosomes) on the basis of pre-specifiedfitness function. During each successive iteration, threegenetic operators are executed to construct the newgeneration known as selection, crossover and mutation.Selection process selects a proportion of the currentpopulation to breed a new generation according totheir fitness value. Crossover operator allows swappinga portion of two parent chromosomes for each other tobe recombined into new offspring. Mutation operatorrandomly alters the value of a gene to produce off-spring. All above operators provide the means toextend the diversity of population over time and bringnew information to it. Finally, the iterations tend toterminate when the fitness threshold is met or a pre-defined number of generations is reached.

    A common drawback of K-means algorithm hasdescribed above that sensitivity selection of initial seedscould influence final output and easy to fall into localoptimum. In order to avoid the premature convergence ofK-means clustering, we considered a genetic algorithm asthe optimization tool for evolving initial seeds in the first

    ring a

    Z. Wang et al. / Journal of Visual Languages and Computing 25 (2014) 667675 671Fig. 2. GA-KM cluste lgorithm procedure.

  • Z. Wang et al. / Journal of Visual Languages and Computing 25 (2014) 667675672step of K-means process in order to identify optimalpartitions. In our study, a chromosome with k genes isdesigned for k cluster centers as x1; x2;; xk, where xi is avector with n dimensions. During the evolution process,we applied fitness function to evaluate the quality ofsolutions that is:

    f chromosome xjAX

    min 1r irk distCi; xj 2

    The fitness value is the sum of distances for all innerpoints to their cluster centers and tries to minimize thevalues which correspond to optimized partitions. In everysuccessive iteration, three genetic operators precede toconstruct new populations as offspring according to thefittest survival principles. The populations tend to con-verge to an optimum chromosome (solution) when thefitness criterion is satisfied. Once the optimal clustercenters have come out, we use them as initial seeds toperform K-means algorithm in the last step of clustering.Pseudo-code of the hybrid GA-KM approach is presentedas follows, and other configuration parameters will bepointed in Fig. 2.

    The offline model was constructed by our PCA-GAKMapproach, and once the target user is arrived, the mostinteresting movies will be calculated as the TOP-N recom-mendation list online from a cluster neighborhood insteadof searching the whole user space. The estimated ratingvalue for an un-rated movie given by Ua is predicted asfollows [27]:

    PUa;item RuyACx simUa; y Ry;iRy

    yACx jsimUa; yj 3

    where Ru is average rating score given by Ua, Cx is a set ofneighbors belonging to one common cluster with Ua, Rydenotes the average rating given by Uas neighbor y, sim(Ua,y) is a similarity function based on Pearson correlationmeasure to decide the similarity degree between two users.

    4. Experiments and results

    In this section, we describe the experimental design,and empirically investigate the proposed movie recom-mendation algorithm via PCA-GAKM technique. We eval-uate the performance of the proposed method withbenchmark clustering-based CF. Finally the results will beanalyzed and discussed. We carried out all our experi-ments on Dual Xeon 3.0 GHz, 8.0 GB RAM computer andrun Matlab R2011b to simulate the model.

    4.1. Data set and evaluation criteria

    We consider the well-known Movielens dataset toconduct the experiments, which is available online, includ-ing 100,000 ratings by 943 users on 1682 movies, andassigned to a discrete scale of 15. Each user has rated atleast 20 movies. We use to describe the sparsity level ofdataset: ml1100,000/94316820.9369. Then thedataset was randomly split into training and test datarespectively with a ratio of 80%/20%. We utilized trainingdata to build the offline model, and the remaining datawere used to make prediction. To verify the quality ofrecommendation, we employed the mean absolute error(MAE), precision, recall as evaluation measures which havebeen widely used to compare and measure the perfor-mance of recommendation systems. The MAE is a statis-tical accuracy metric which measures the average absolutedifference between the predicted ratings and actualratings on test users as shown in Eq. (4). A lower MAEvalue corresponds to more accurate predictions.

    MAEj~Pi;jri;jjM

    4

    where M is the total number of predicted movies, ~Pi;j,represents the predicted value for user i on item j, and ri;jis the true rating.

    To understand whether users are interested with therecommendation movies, we employ the precision andrecall metrics which are widely used in movie recommen-der systems to evaluate intelligence level of recommenda-tions. Precision is the ratio of interesting movies retrievedby a recommendation method to the number of recom-mendations. Recall gives the ratio of interesting moviesretrieved that is considered interesting. These two mea-sures are clearly conflict in nature because increasing thesize of recommended movies N leads to an increase inrecall but decrease the precision. The precision and recallfor Top-N (N is the number of predicted movies) recom-mendation are defined in (5) and (6), respectively.

    Precision jinteresting \ TopNjN

    5

    Recall jinteresting \ TopNjjinterestingj 6

    4.2. Experimental designs

    We try to conduct different clustering algorithmsK-means, SOM, and the proposed GA-KM on a relativelylow dimension space after PCA transformation. K-means iseasy with efficiency, but sensitive for initial cluster andoften convergence to a local optimum. SOM, as an artificialneural network, has been applied to many intelligentsystems for its good performance. In Genetic Algorithmsprocessing, we use Euclidean distance measure to decidethe similarity of n-dimensional vectors in search space.The initial population is generated with the size of 50 andnumber of generation: Tmax200. Parameter N representsthe number of movies on the recom-mendation list asdescribed in the previous section. To decide the clusternumber K suitably, we first make a robust estimation onunrated scores by performing global K-means clusteringoperations on dataset where K varies from 4 to 28. It hasbeen seen in Fig. 3 that smaller MAE values occur when Kis between 12 and 18. Thus we set K to 16 as the totalnumber of clusters to guide our numerical experiments.

    In this part, we use an all-but-10 method, in which werandomly held out ten ratings for each user in test data thatshould be predicted based on model. Prediction is calcu-lated for the hidden votes by three clustering algorithmsto compare the recommendation accuracy. In addition,

  • shows that all methods tries to reach the optimumprediction values where the neighborhood size variesfrom 15 to 20, and it becomes relatively stable around60 nearest neighbors. All clustering with PCA algo-rithms performed better accuracy than pure cluster-based CFs. We consider that PCA process could benecessary to dense the original user-rating space, andthen improve the partition results. Without the firststep of dimensional reduction, GAKM and SOM gavevery close MAE values and it seems that GAKM produceslightly better prediction than SOM. When couplingwith PCA technique, GAKM shows a distinct improve-ment on recommendation accuracy compared withSOM. Moreover, the proposed PCA-GAKM performs

    2)

    Z. Wang et al. / Journal of Visual Languages and Computing 25 (2014) 667675 673Fig. 4. Comparing accuracy with existing clustering-based CF.Fig. 3. Precision error on different cluster numbers.to(Pecluhyb

    neuseratrepGivbytartoabotra

    4.3

    finoutresysvartio

    1)examine the prediction precision, we employed UPCCarson Correlation Coefficient based CF) and up-to-datestering-based CF methods to compare with the proposedrid clustering approach.We also conduct experiments to verify quality of thew hybrid model. Here cold-start users are defined asrs who have rated less than five movies. We get fiveings visible for each test user; the rest of ratingslaced with null, and tried to prediction their values.en5 data is designed to test cold-start problem causeda new user with little history information. To normalget users, we also build Given10 to Given20 and Allbut5Allbut10 test data to generate recommendation. In eachve experiment, we repeat five times for randomlyining and test datasets, and average the results.

    . Results and discussion

    The sparse of user-item rating matrix makes it hard tod real neighbors to form the final recommendation list. Inr experiments, we compare the performances and somends of the existing baseline CF movie recommendationtems with our approach, while the neighborhood sizeies from 5 to 60 in an increment of 5. Detailed explana-n is showed as the follows from experiment results:

    Performance of PCA-GAKM CF approach:We first try to evaluate the movie recommendationquality with the traditional cluster-based CFs. Fig. 4

    3)

    TabThe

    M

    PSUKPGPneighbor size with 20. As seen from Fig. 5, the preci-sions of PCA-GAKM are improved for any value used inthe number of recommendations which indicates thatit can recommend more interesting and reliable moviesto users than other clustering-based algorithms when arelatively small number of movies on recommendationlist are considered. In addition, Fig. 6 compares the recallrates of user interesting movies, and its apparently thatPCA-GAKM still provide greater recall rates with eachvalue in the number of recommendation. The existingcluster-based CFs show lower precision and recall ratescomparing to our optimal clustering approach.Recommendation of cold-start users:We finally experiment the cold-start problem withless information users who has rated few movies in

    le 1t-test results for various cluster-based methods in terms of MAEs.

    ethod Mean Std. Dev. PCA-GAKM Sign

    t-Value

    CA-SOM 0.8018 5.900e03 9.0633 .000OM-CLUSTER 0.8078 2.589e03 16.5194 .000PCC 0.8232 1.292e03 28.9869 .000MEANS-CLUSTER 0.8175 2.244e03 23.3691 .000CA-KMEANS 0.8412 2.845e03 37.0515 .000AKM-CLUSTER 0.8040 2.786e03 13.8541 .000CA-GAKM 0.7821 4.747e03

    n Statistically significant at the 1% level.CF generate increasing MAE values which indicate thedecreasing quality for recommendation due to sensi-tiveness of the algorithm. Traditional user-based CFproduces relatively worse prediction compared withthe basic clustering-based methods.To exam the difference of predictive accuracy between ourproposed method and other comparative cluster-basedmethods, we applied t-test in the recommendation results.As shown in Table 1, the differences between MAE valuesare statistically significant at the 1% level. Therefore, wecan affirm that the proposed PCA-GAKM outperforms withrespect to the comparable cluster-based methods.Precision of PCA-GAKM CF approach:To analyze precision of recommendation, we fix theapparently high accuracy among all the algorithms,and produces the smallest MAE values continuallywhere the neighbor size varies. All K-means clustering

  • users. Based on the Movielens dataset, the experimental

    Z. Wang et al. / Journal of Visual Languages and Computing 25 (2014) 667675674Fig. 5. Precision comparison with existing cluster-based CF.history. It is understandable that searching neighbors inhigh dimensional space become difficult for cold-startusers with few ratings. Fig. 7 enables us to discover thatclustering coupled with PCA methods may produce ageneralized improvement in prediction accuracy forcold-start users. Among examined clustering methods,the proposed PCA-GAKM seems to have the bestperformance in alleviating cold-start with the satisfac-tory MAE values. With increasing number of ratingsused to make prediction, all approaches show thesimilar trend that prediction accuracy is getting higheras presented in Fig. 7.

    5. Conclusions and future work

    In this paper we develop a hybrid model-based CFapproach to generate movie recommendations whichcombines dimensional reduction technique with cluster-ing algorithm. In the sparse data environment, selectionof like-minded neighborhood on the basis of commonratings is a vital function to generate high quality movierecommendations. In our proposed approach, featureselection based on PCA was first performed on whole dataspace, and then the clusters were generated from rela-tively low dimension vector space transformed by the first

    Fig. 6. Recall comparison as the recommendation size grows.evaluation of the proposed approach proved that it iscapable of providing high prediction accuracy and morereliable movie recommendations for users preferencecomparing to the existing clustering-based CFs. As forcold-start issue, the experiment also demonstrated thatour proposed approach is capable of generating effectiveestimation of movie ratings for new users via traditionalmovie recommendation systems.

    As for future work, we will continue to improve ourapproach to deal with higher dimensionality and sparsityissues in practical environment, and will explore moreeffective data reduction algorithms to couple withclustering-based CF. Furthermore, we will study how thevariation number of clusters may influence the movierecommendation scalability and reliability. To generatehigh personalized movie recommendations, other featuresof users, such as tags, context, and web of trust should beconsidered in our future studies.

    Acknowledgements

    The research was supported by the National NaturalScience Foundation of China (nos. 71271149, 70901054and 61202030). It was also supported by the Program forstep. In this way, the original user space becomes muchdenser and reliable, and used for neighborhood selectioninstead of searching in the whole user space. In addition,to result in best neighborhood, we apply genetic algo-rithms to optimize K-means process to cluster similar

    Fig. 7. MAE comparisons in different rating reveal level.New Century Excellent Talents in University (NCET). Theauthors are very grateful to all anonymous reviewerswhose invaluable comments and suggestions substantiallyhelped improve the quality of the paper.

    References

    [1] G. Adomavicius, A. Tuzhilin, Toward the next generation of recom-mender system: a survey of the state-of-the-art and possibleextensions, IEEE Trans. Knowl. Data Eng. 17 (6) (2005) 734749.

    [2] G. Linden, B. Smith, J. York, Amazon.com recommendations: item toitem collaborative filtering, IEEE Internet Comput. 7 (1) (2003)7680.

    [3] B.M. Sarwar, G. Karypis, J. Konstan, J. Riedl, Recommender systemsfor large-scale e-commerce: scalable neighborhood formation usingclustering, in: Proceedings of International Conference on Computerand Information Technology, Dhaka, Bangladesh, 2002.

  • [4] B.M. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborativefiltering recommendation algorithm, in: Proceedings of the 10thInternational WWW Conference, Hong Kong, 2001, pp. 285295.

    [5] B.M. Sarwar, G. Karypis, J. Konstan, J. Riedl, Application of dimen-sionality reduction in recommender systema case study, in:Proceedings of ACM WebKDD Workshop, Boston, MA, 2000.

    [6] J.S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictivealgorithms for collaborative filtering, in: Proceedings of the 14thConference on Uncertainty in Artificial Intelligence, Madison,Wisconsin, USA, 1998, pp. 4352.

    [7] G. Xue, C. Lin, Q. Yang, et al., Scalable collaborative filtering usingcluster-based smoothing, in: Proceedings of the 28th InternationalConference on ACM SIGIR, Brazil: ACM Press, 2005, pp. 114121.

    [8] F. Gao, C. Xing, Y. Zhao, An Effective Algorithm for DimensionalReduction in Collaborative Filtering, in LNCS 4822, Springer, Berlin,2007, 7584.

    [9] Q. Li, B.M. Kim, Clustering approach for hybrid recommendationsystem, in: Proceedings of the International Conference on WebIntelligence, Halifax, Canada, 2003, pp. 3338.

    [10] K. Kim, H. Ahn, A recommender system using GA K-means clusteringin an online shopping market, Expert Syst. Appl. 34 (2) (2008)12001209.

    [11] A. Kohrs, B. Merialdo, Clustering for collaborative filtering applica-tions, in: Proceedings of Computational Intelligence for Modeling,Control and Automation (CIMCA), Vienna: IOS Press, 1999, pp. 199204.

    [12] D. Goldberg, D. Nichols, B.M. Oki, D. Terry, Using collaborativefiltering to weave an information tapestry, Commun. ACM 35 (12)(1992) 6170.

    [13] D. Billsus, M.J. Pazzani, Learning collaborative information filters, in:Proceedings of the 15th International Conference on MachineLearning, Madison, 1998, pp. 4653.

    [14] K. Goldberg, T. Roeder, D. Gupta, C. Perkins, Eigentaste: a constant

    [16] G. Pitsilis, X.L. Zhang, W. Wang, Clustering recommenders in colla-borative filtering using explicit trust information, in: Proceedings ofthe Fifth International Conference on Trust Management IFIPTM,Denmark, Copenhagen, 2011, pp. 8297.

    [17] M. Zhang, N. Hurley, Novel item recommendation by user profilepartitioning, in: Proceedings of the International. Conference onWeb Intelligence and Intelligent Agent Technology, Washington DC,USA, 2009, pp. 508515.

    [18] C. Huang, J. Yin, J, Effective association clusters filtering to cold-startrecommendations, in: Proceedings of the Seventh InternationalConference on Fuzzy Systems and Knowledge Discovery, Yantai,Shandong, 2010, pp. 24612464.

    [19] J. Wang, N.-Y. Zhang, J. Yin, J, Collaborative filtering recommendationbased on fuzzy clustering of user preferences, in: Proceedings of theSeventh International Conference on Fuzzy Systems and KnowledgeDiscovery, Yantai, Shandong, 2010, pp. 19461950.

    [20] D.R. Liu, Y.Y. Shih, Hybrid approaches to product recommendationbase on customer lifetime value and purchase preferences, J. Syst.Softw. 77 (22) (2005) 181191.

    [21] O. Georgiou, N. Tsapatsoulis, Improving the scalability of recom-mender systems by clustering using genetic algorithms, in: Proceed-ings of the 20th International conference on Artificial NeuralNetworks, Thessaloniki, Greece, 2010, pp. 442449.

    [22] K. Honda, N. Sugiura, H. Ichihashi, S. Araki, Collaborative filteringusing principal component analysis and fuzzy clustering, in: Pro-ceedings of the First Asia-Pacific Conference on Web Intelligence:Research and Development, Maebashi City, Japan, 2001, pp. 394402.

    [23] A. Selamat, S. Omatu, Web page feature selection and classificationusing neural networks, Inf. Sci. 158 (2004) 6988.

    [24] J. Han, M. Kamber, Data Mining: Concepts and Techniques, MorganKaufmann Publishers, San Francisco, 2001.

    [25] J.T. Tou, R.C. Gonzalez, Pattern Recognition Principle, AddisonWesley, Massachusetts, MA, 1974.

    [26] D. Goldberg, Genetic Algorithms in Search, Optimization, andMachine Learning, Addison-Wesley, New York, NY, 1989.

    [27] P. Resnick, N. Iacovou, M. Sushak, P. Bergstrom, J. Riedl, GroupLens:an open architecture for collaborative filtering of netnews, in:Proceedings of ACM 1994 Conference on Computer Supported

    Z. Wang et al. / Journal of Visual Languages and Computing 25 (2014) 667675 675(2007) 13631373. Cooperative Work, Chapel Hill, 1994, pp. 175186.time collaborative filtering algorithm, Inf. Retrieval. 4 (2) (2001)133151.

    [15] K.Q. Truong, F. Ishikawa, S. Honiden, Improving accuracy of recom-mender system by item clustering, IEICE Trans. Inf. Syst. E90-D (9)

    An improved collaborative movie recommendation system using computational intelligenceIntroductionRelated workMovie recommendation systems based on collaborative filteringClustering-based collaborative recommendation

    PCA-GAKM based collaborative filtering frameworkPre-processing data using PCAAn enhanced K-means clustering optimized by genetic algorithms

    Experiments and resultsData set and evaluation criteriaExperimental designsResults and discussion

    Conclusions and future workAcknowledgementsReferences