
Document clustering using K-means under Mahout

1. Edit system variables.

Edit some environment variables for your system. In your home directory, edit the file .bash_profile (for Mac users) or .bashrc (for Linux users) and add the following lines.

$ vi .bash_profile

export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/Users/DoubleJ/Software/hadoop-1.0.3    (replace with your Hadoop directory)
export HADOOP_CONF_DIR=/Users/DoubleJ/Software/hadoop-1.0.3/conf

Then save and exit the file.

$ source .bash_profile

Download Mahout and modify the mahout file under bin: uncomment the following line and change it to the corresponding path. Note that there must be no space before the equals sign.

HADOOP_CONF_DIR="/usr/local/Cellar/hadoop/1.0.3/libexec/conf"

2. Download data

The ~/Software/mahout-work directory must already exist, because curl does not create it.

$ curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o ~/Software/mahout-work/reuters.tar.gz

3. Uncompress data

$ mkdir ~/Software/mahout-work/sgm
$ tar xzf ~/Software/mahout-work/reuters.tar.gz -C ~/Software/mahout-work/sgm

4. Obtain the segment data

Run this from the Mahout directory.

$ ./bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters ~/Software/mahout-work/sgm ~/Software/mahout-work/out
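Before moving on, it can help to confirm that the variables from step 1 are actually visible in the current shell. The following is a minimal sketch, not part of the original tutorial; the function name check_mahout_env is hypothetical, and it only checks that each variable is set and points to an existing directory.

```shell
# Hypothetical helper: verify the variables exported in step 1.
# Prints one line per variable and returns non-zero if any check fails.
check_mahout_env() {
  status=0
  for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value"
      status=1
    else
      echo "$name=$value"
    fi
  done
  return $status
}

# Usage, after running "source .bash_profile":
#   check_mahout_env || echo "fix the variables before continuing"
```

If any line reports a problem, re-open .bash_profile (or .bashrc) and correct the path before starting Hadoop.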


5. Generate the sequence directory, which contains all texts

5.1 Send the segment data to the Hadoop file system

$ cd $HADOOP_HOME
$ ./bin/start-all.sh
$ ./bin/hadoop fs -mkdir hadoop-mahout
$ ./bin/hadoop fs -put ~/Software/mahout-work/out hadoop-mahout    (might take a while)
$ ./bin/hadoop fs -ls hadoop-mahout/out

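5.2 Create the sequence directory

$ cd ~/Software/mahout-distribution-0.7
$ ./bin/mahout seqdirectory -i hadoop-mahout/out -o hadoop-mahout/out-seqdir -c UTF-8 -chunk 5

The hadoop script lives under $HADOOP_HOME, not under the Mahout directory, so call it by its full path to list the results:

$ $HADOOP_HOME/bin/hadoop fs -ls hadoop-mahout
$ $HADOOP_HOME/bin/hadoop fs -ls hadoop-mahout/out-seqdir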


6. Create vector files to represent these texts

Run this from the Mahout directory.

$ ./bin/mahout seq2sparse -i hadoop-mahout/out-seqdir/ -o hadoop-mahout/vectors --maxDFPercent 85 --namedVector    (takes a while)

$ cd $HADOOP_HOME
$ ./bin/hadoop fs -ls hadoop-mahout
$ ./bin/hadoop fs -ls hadoop-mahout/vectors

In addition, when you check the completed jobs using http://localhost:50030, you will see the following screen.
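The --maxDFPercent 85 option in step 6 drops terms that occur in more than 85% of the documents, since such terms do little to tell documents apart. To get a rough feel for which terms would be affected, the document frequency of a word can be estimated on the local copy of the extracted text. This is only a sketch: the function name df_percent is hypothetical, and grep's whole-word, case-insensitive matching is an approximation of Mahout's tokenization, not a reproduction of it.

```shell
# Hypothetical helper: percentage of files under a directory that
# contain a given word (integer division, so the result is rounded down).
df_percent() {
  term=$1
  dir=$2
  total=$(find "$dir" -type f | wc -l | tr -d ' ')
  if [ "$total" -eq 0 ]; then
    echo "no files under $dir" >&2
    return 1
  fi
  # -r recurse, -l list matching files only, -i ignore case, -w whole word
  matched=$(grep -rliw -- "$term" "$dir" | wc -l | tr -d ' ')
  echo $((100 * matched / total))
}

# Usage:
#   df_percent the ~/Software/mahout-work/out
```

A word whose percentage comes out above 85 would likely be excluded from the vectors under this setting.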

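7. Running K-means

Change back to the Mahout directory first. The --clustering flag and its short form -cl mean the same thing, so only one of them is needed.

$ cd ~/Software/mahout-distribution-0.7
$ ./bin/mahout kmeans -i hadoop-mahout/vectors/tfidf-vectors -c hadoop-mahout/cluster-centroids -o hadoop-mahout/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering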


$ cd $HADOOP_HOME
$ ./bin/hadoop fs -ls hadoop-mahout/
$ ./bin/hadoop fs -ls hadoop-mahout/cluster-centroids

In addition, when you check the completed jobs using http://localhost:50030, you will see the following screen.
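The -dm option in step 7 selects CosineDistanceMeasure, which scores two document vectors as one minus the cosine of the angle between them: 0 for vectors pointing in the same direction, 1 for vectors that share no terms. The following is a minimal sketch of that formula in shell and awk, for illustration only. The function name cosine_distance and the toy vectors are hypothetical, and this is not Mahout's implementation.

```shell
# Hypothetical helper: cosine distance between two space-separated vectors,
# computed as 1 - (a . b) / sqrt(|a|^2 * |b|^2).
cosine_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " ")
    m = split(b, y, " ")
    if (n != m || n == 0) {
      print "error: vectors must be non-empty and of equal length" > "/dev/stderr"
      exit 1
    }
    for (i = 1; i <= n; i++) {
      dot += x[i] * y[i]   # dot product
      nx  += x[i] * x[i]   # squared length of a
      ny  += y[i] * y[i]   # squared length of b
    }
    if (nx == 0 || ny == 0) {
      print "error: cosine distance is undefined for a zero vector" > "/dev/stderr"
      exit 1
    }
    printf "%.4f\n", 1 - dot / sqrt(nx * ny)
  }'
}

# Usage:
#   cosine_distance "1 2 3" "2 4 6"   # same direction, prints 0.0000
#   cosine_distance "1 0" "0 1"       # no shared terms, prints 1.0000
```

Because the measure ignores vector length, a long article and a short one with the same mix of terms count as close, which is why it is a common choice for text.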

8. View results in a human-readable way.

Dump the document-to-cluster mapping. Run this from the Mahout directory.

$ cd ~/Software/mahout-distribution-0.7
$ ./bin/mahout seqdumper -i hadoop-mahout/kmeans/clusteredPoints/part-m-00000 > ~/Software/mahout-work/cluster-docs.txt
$ ls -l ~/Software/mahout-work


Then download view.pl and run it using the following command.

$ perl view.pl ~/Software/mahout-work/cluster-docs.txt ~/Software/mahout-work/cluster-results.txt

The cluster-results.txt file is a text file in which each line has two tab-separated columns: the cluster ID and the corresponding file name.
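Once cluster-results.txt exists, a quick way to see how the documents were distributed is to count the lines per cluster. The following is a minimal sketch, assuming the first column holds the cluster ID as described above; the function name count_per_cluster is hypothetical.

```shell
# Hypothetical helper: count documents per cluster in a two-column,
# tab-separated file (column 1 = cluster ID, column 2 = file name).
# Output: cluster ID, a tab, then the count, largest clusters first.
count_per_cluster() {
  tab=$(printf '\t')
  awk -F '\t' '
    NF >= 2 { count[$1]++ }                      # skip malformed lines
    END { for (c in count) printf "%s\t%d\n", c, count[c] }
  ' "$1" | sort -t "$tab" -k2,2nr -k1,1
}

# Usage:
#   count_per_cluster ~/Software/mahout-work/cluster-results.txt
```

With -k 20 in step 7, the output should list at most 20 cluster IDs; a few very large clusters next to many tiny ones may suggest trying a different k or more iterations.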