
Document clustering using K-means under Mahout

1. Edit system variables.
Edit some environment variables for your system. In your home directory, edit the file .bash_profile (for Mac users) or .bashrc (for Linux users) and add the following lines. Replace the Hadoop paths with your own Hadoop directory. The java_home helper exists only on Mac; on Linux, set JAVA_HOME to your JDK directory instead.
$ vi .bash_profile
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/Users/DoubleJ/Software/hadoop-1.0.3
export HADOOP_CONF_DIR=/Users/DoubleJ/Software/hadoop-1.0.3/conf
Then save and exit the file.
$ source .bash_profile
Download Mahout and modify the mahout file under bin: uncomment the following line and change it to the corresponding path. There must be no space around the equals sign.
HADOOP_CONF_DIR="/usr/local/Cellar/hadoop/1.0.3/libexec/conf"

2. Download data
Create the working directory first, because curl does not create it.
$ mkdir -p ~/Software/mahout-work
$ curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o ~/Software/mahout-work/reuters.tar.gz

3. Uncompress data
$ mkdir -p ~/Software/mahout-work/sgm
$ tar xzf ~/Software/mahout-work/reuters.tar.gz -C ~/Software/mahout-work/sgm

4. Obtain the segment data
Run this from your Mahout directory.
$ cd ~/Software/mahout-distribution-0.7
$ ./bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters ~/Software/mahout-work/sgm ~/Software/mahout-work/out
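The later steps depend on the variables from step 1, so it can help to confirm them before starting Hadoop. The following is a minimal sketch, not part of Hadoop or Mahout: check_dir_var is a hypothetical helper name, and the check only confirms that each variable is set and names an existing directory.

```shell
#!/bin/sh
# check_dir_var is a hypothetical helper, not a Hadoop or Mahout command.
# It reports whether the named variable is set and points to a directory.
check_dir_var() {
    name=$1
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value"
        return 1
    fi
    echo "$name is OK: $value"
}

# Check the three variables exported in step 1; keep going after a failure.
for var in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR; do
    check_dir_var "$var" || true
done
```

If a line reports a missing variable, run source on the profile file again or open a new terminal before continuing.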

5. Generate the sequence directory, which contains all texts

5.1 Send the segment data to the Hadoop file system
The -put command might take a while.
$ cd $HADOOP_HOME
$ ./bin/start-all.sh
$ ./bin/hadoop fs -mkdir hadoop-mahout
$ ./bin/hadoop fs -put ~/Software/mahout-work/out hadoop-mahout
$ ./bin/hadoop fs -ls hadoop-mahout/out

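5.2 Create the sequence directory
The hadoop script lives under $HADOOP_HOME, not in the Mahout directory, so call it by its full path here.
$ cd ~/Software/mahout-distribution-0.7
$ ./bin/mahout seqdirectory -i hadoop-mahout/out -o hadoop-mahout/out-seqdir -c UTF-8 -chunk 5
$ $HADOOP_HOME/bin/hadoop fs -ls hadoop-mahout
$ $HADOOP_HOME/bin/hadoop fs -ls hadoop-mahout/out-seqdir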

6. Create vector files to represent these texts
The seq2sparse command takes a while.
$ ./bin/mahout seq2sparse -i hadoop-mahout/out-seqdir/ -o hadoop-mahout/vectors --maxDFPercent 85 --namedVector
$ cd $HADOOP_HOME
$ ./bin/hadoop fs -ls hadoop-mahout
$ ./bin/hadoop fs -ls hadoop-mahout/vectors

In addition, you can check the completed jobs in the JobTracker web interface at http://localhost:50030.

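7. Running K-means
Run this from your Mahout directory. The --clustering option and -cl are the long and short forms of the same flag, so only one is needed.
$ cd ~/Software/mahout-distribution-0.7
$ ./bin/mahout kmeans -i hadoop-mahout/vectors/tfidf-vectors -c hadoop-mahout/cluster-centroids -o hadoop-mahout/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow -cl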

$ cd $HADOOP_HOME
$ ./bin/hadoop fs -ls hadoop-mahout/
$ ./bin/hadoop fs -ls hadoop-mahout/cluster-centroids

As before, you can check the completed k-means jobs in the JobTracker web interface at http://localhost:50030.
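The flags of the k-means command in step 7 are easier to experiment with when each one is named. The sketch below only prints the command for a chosen number of clusters and iteration limit; it does not run Mahout. print_kmeans_cmd is a hypothetical helper, and the paths are the ones used in this guide.

```shell
#!/bin/sh
# print_kmeans_cmd is a hypothetical helper: it prints, but does not run,
# the k-means command from step 7.
#   -i   input TF-IDF vectors produced by seq2sparse
#   -c   directory for the initial centroids; because -k is given, Mahout
#        samples k random input vectors and writes them here
#   -o   output directory for the clustering
#   -dm  distance measure class (cosine distance in this guide)
#   -x   maximum number of iterations
#   -k   number of clusters
#   -ow  overwrite the output directory if it already exists
#   -cl  assign every document to a cluster after the iterations finish
print_kmeans_cmd() {
    k=${1:-20}         # defaults match the values used in step 7
    max_iter=${2:-10}
    base=hadoop-mahout # HDFS working directory used throughout this guide
    echo ./bin/mahout kmeans \
        -i "$base/vectors/tfidf-vectors" \
        -c "$base/cluster-centroids" \
        -o "$base/kmeans" \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
        -x "$max_iter" -k "$k" -ow -cl
}

# Print the command exactly as used in step 7.
print_kmeans_cmd 20 10
```

To try a different setting, for example 10 clusters and 20 iterations, call print_kmeans_cmd 10 20 and run the printed line from the Mahout directory.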

8. View results in a human-readable way.
Dump the document-to-cluster mapping into a local text file. Run this from your Mahout directory.
$ cd ~/Software/mahout-distribution-0.7
$ ./bin/mahout seqdumper -i hadoop-mahout/kmeans/clusteredPoints/part-m-00000 > ~/Software/mahout-work/cluster-docs.txt
$ ls -l ~/Software/mahout-work

Then download view.pl and run it using the following command.
$ perl view.pl ~/Software/mahout-work/cluster-docs.txt ~/Software/mahout-work/cluster-results.txt

The cluster-results.txt file is a text file in which each line has two tab-separated columns: the cluster and the corresponding file name.
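Given that two-column layout, a quick way to judge the clustering is to count how many documents fall into each cluster. The following is a minimal sketch that assumes the cluster is in the first column and the file name in the second; count_per_cluster is a hypothetical helper, and the sample rows are made up for illustration.

```shell
#!/bin/sh
# count_per_cluster is a hypothetical helper. It reads a tab-separated file
# (cluster, file name) and prints "cluster<TAB>count", largest cluster first.
count_per_cluster() {
    tab=$(printf '\t')
    awk -F'\t' '{ n[$1]++ } END { for (c in n) print c "\t" n[c] }' "$1" |
        sort -t "$tab" -k2,2nr -k1,1
}

# Demonstration on made-up rows; replace "$sample" with
# ~/Software/mahout-work/cluster-results.txt for the real data.
sample=$(mktemp)
printf '%s\t%s\n' \
    12 /reut2-000.sgm-0.txt \
    12 /reut2-000.sgm-1.txt \
    7  /reut2-000.sgm-2.txt > "$sample"
count_per_cluster "$sample"
rm -f "$sample"
```

A few very large clusters next to many tiny ones may suggest trying a different value of -k or a different distance measure in step 7.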
