GENDER DETECTION IN BLOGS


Page 1: Gender Detection on Blogs

GENDER DETECTION IN BLOGS

Page 2: Gender Detection on Blogs

Presented By (Team No. 32)

Nitish Jain (201301227)

Ganesh Borle (201505587)

Vamshikrishna Reddy (201202177)

Mentored By

Lokesh Walase

IRE [CSE474]

Page 3: Gender Detection on Blogs

The Big Picture

Page 4: Gender Detection on Blogs

ABSTRACT

● Through the sands of time, textual content has remained a prominent feature of internet media, especially blogs.

● But . . .

● A lot of content brings a lot of responsibility.

● The internet can't take responsibility for all the content; that responsibility should lie with the author.

● Thus, author profiling and attribution becomes an important task, and we try to capture one aspect of it, i.e. gender.

Page 5: Gender Detection on Blogs

The Question

Given the text of a blog, can we identify whether the writer is male or female?

Page 6: Gender Detection on Blogs

WHO IS THE AUTHOR?

Page 7: Gender Detection on Blogs

OUR APPROACH

Page 8: Gender Detection on Blogs

THE APPROACH

● We take advantage of the linguistic features of the blog and create a feature file.

● Various classifiers are then trained on this feature file, producing one model per classifier.

● An ensemble is applied to these models, and the input document is classified as written by a male or a female.

Page 9: Gender Detection on Blogs

WORKFLOW

Page 10: Gender Detection on Blogs

The Dataset

● Koppel's blog dataset

● Contains about 19 thousand documents

● Each document contains the text of about 35 blog entries in XML format.

[Dataset Link: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm]

Page 11: Gender Detection on Blogs

PARSING

● Language used: Python

● Each blog entry is stored in XML format:

<Blog>
<date> ....... </date>
<post> …. </post>
...
</Blog>

● Each blog filename contains the name and gender of the author.
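The parsing step could look roughly like the sketch below. It assumes a dot-separated filename whose second field is the gender (for example `1234.male.rest.xml`); that layout, the sample date text, and the helper names `gender_from_filename` and `extract_posts` are illustrative assumptions, not taken from the project code. A regular expression is used instead of a strict XML parser, on the assumption that raw blog text may contain characters that are not valid XML.

```python
import re
from pathlib import Path

# Matches the body of every <post>...</post> element, across newlines.
POST_RE = re.compile(r"<post>(.*?)</post>", re.DOTALL)


def gender_from_filename(filename):
    """Read the gender label from a name like '1234.male.rest.xml' (assumed layout)."""
    parts = Path(filename).name.split(".")
    if len(parts) < 2 or parts[1].lower() not in ("male", "female"):
        raise ValueError(f"no gender label found in {filename!r}")
    return parts[1].lower()


def extract_posts(xml_text):
    """Return the stripped text of each post found in one blog file."""
    return [body.strip() for body in POST_RE.findall(xml_text)]


# Made-up sample in the shape shown on the slide.
sample = "<Blog><date>some date</date><post> Hello world. </post></Blog>"
print(gender_from_filename("1234.female.rest.xml"), extract_posts(sample))
```

The label comes from the filename only, so the post text itself never needs to be inspected to build the training target.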

Page 12: Gender Detection on Blogs

The Feature Extraction

Page 13: Gender Detection on Blogs

FEATURES

For our task of gender identification, we rely on the following linguistic features:

● Character-Based Features

● Word-Based Features

● Syntactic Features

● Structural Features

● Function Words

● POS Start Probability
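To make these feature families concrete, here is a minimal sketch of a feature extractor. The specific features, the tiny `FUNCTION_WORDS` list, and the name `extract_features` are assumptions chosen for illustration; the slides do not list the exact features used. POS start probabilities are left out because they need a part-of-speech tagger.

```python
import re
import string

# Tiny illustrative list; a real function-word list would be much longer.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "i", "you")


def extract_features(text):
    """Map one blog text to a dict of simple numeric features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)  # avoid division by zero on empty text
    n_words = max(len(words), 1)
    features = {
        # Character-based
        "char_count": len(text),
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        # Word-based
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        # Syntactic (approximated here by punctuation usage)
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        # Structural
        "sentence_count": len(sentences),
        "paragraph_count": len([p for p in text.split("\n\n") if p.strip()]),
    }
    # Function words: relative frequency of each listed word
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = words.count(fw) / n_words
    return features
```

One dict per document can then be written out as one row of the feature file.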

Page 14: Gender Detection on Blogs

The Classification

Page 15: Gender Detection on Blogs

THE CLASSIFICATION TASK

For the task of classification, we tried several classification algorithms and arrived at a model that uses an ensemble of the following:

● Random Forest Classifier

● Neural Networks Classifier

● AdaBoost Tree Classifier

● Gradient Boosting Classifier

● Bagging Classifier

Page 16: Gender Detection on Blogs

THE CLASSIFICATION TASK

For each classifier:

● We trained it on subsets of the features to see how accuracy varies with the features used.

● We applied 10-fold cross-validation to measure accuracy.

To measure the accuracy of the ensemble, we took the majority class among the predictions of the individual classifiers.
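The 10-fold procedure can be sketched in plain Python as follows. The helper names `k_fold_indices` and `cross_val_accuracy`, the shuffling seed, and the `fit(X, y)` interface are assumptions made for this illustration rather than the project's actual code.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once.

    Assumes n_samples >= k so that no fold is empty.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_val_accuracy(fit, X, y, k=10):
    """Mean accuracy over k folds; `fit(X, y)` must return a predict(x) callable."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[j] for j in train], [y[j] for j in train])
        correct = sum(predict(X[j]) == y[j] for j in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Each classifier is trained ten times on nine tenths of the data and scored on the remaining tenth, so the reported accuracy is an average over held-out documents only.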

Page 17: Gender Detection on Blogs

RANDOM FOREST CLASSIFIER

● A meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting.

● By using Random Forest Classifier we were able to achieve an accuracy of 69.79%

Page 18: Gender Detection on Blogs

NEURAL NETWORKS CLASSIFIER

● Consists of multiple layers of nodes, with each layer fully connected to the next; each node is a neuron with a non-linear activation function.

● Uses a supervised learning technique called backpropagation to train the network.

● By using Neural Networks Classifier we were able to achieve an accuracy of 69.51%

Page 19: Gender Detection on Blogs

ADABOOST TREE CLASSIFIER

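● A meta estimator that begins by fitting a classifier on the original dataset and then fits further rounds of classifiers on the same dataset, with the weights of misclassified instances increased so that later classifiers focus on the harder cases.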

● By using Adaboost tree Classifier we were able to achieve an accuracy of 69.57%
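The reweighting idea behind AdaBoost can be sketched with decision stumps as the weak learners. This is a simplified binary version with labels encoded as +1 and -1 (for instance +1 for male and -1 for female, which is an assumption of this sketch); the function names are hypothetical and this is not the implementation used in the project.

```python
import math


def fit_weighted_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                # The rule predicts `sign` when x[f] <= t, and `-sign` otherwise.
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (sign if x[f] <= t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best


def fit_adaboost(X, y, n_rounds=10):
    """Train on labels in {-1, +1}; returns a predict(x) callable."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, f, t, sign = fit_weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, f, t, sign))
        # Increase the weights of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * (sign if x[f] <= t else -sign))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        score = sum(a * (s if x[f] <= t else -s) for a, f, t, s in ensemble)
        return 1 if score >= 0 else -1

    return predict
```

The final prediction is a weighted vote: stumps with lower weighted error get a larger `alpha` and therefore more say.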

Page 20: Gender Detection on Blogs

GRADIENT BOOSTING CLASSIFIER

● Builds the model in a forward stage-wise fashion.

● At each subsequent stage, a new weak classifier is introduced to compensate for the shortcomings of the existing weak learners; these shortcomings are identified by the gradients of the loss function.

● By using Gradient Boosting Classifier we were able to achieve an accuracy of 70.81%

Page 21: Gender Detection on Blogs

BAGGING CLASSIFIER

● A meta estimator that fits base classifiers, each on a random subset of the dataset, and then aggregates their individual predictions.

● By using Bagging Classifier we were able to achieve an accuracy of 70.03%
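Bagging can be illustrated with a short sketch: each base classifier is trained on a bootstrap sample (drawn with replacement) and the predictions are combined by majority vote. The decision-stump base learner, the number of estimators, and the function names here are assumptions for illustration only.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Base learner: the single-feature threshold rule with the fewest errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [lab for x, lab in zip(X, y) if x[f] <= t]
            right = [lab for x, lab in zip(X, y) if x[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab


def fit_bagging(X, y, n_estimators=11, seed=0):
    """Train each stump on a bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        sample = [rng.randrange(len(X)) for _ in X]  # draw with replacement
        models.append(fit_stump([X[i] for i in sample], [y[i] for i in sample]))
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]
```

Because every base model sees a slightly different sample, their individual errors tend to differ, which is what makes the aggregated vote more stable than any single model.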

Page 22: Gender Detection on Blogs

THE ENSEMBLE

● An Ensemble takes the outputs of the other classifiers and applies majority voting to them to determine the final output.

● By using the Ensemble model on the above discussed classifiers we were able to achieve an accuracy of 71.10%
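Majority voting over the individual classifiers can be sketched in a few lines. The names `majority_vote` and `ensemble_predict`, and the convention that each classifier is a callable returning "male" or "female", are assumptions of this sketch.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers for one document."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, document_features):
    """`classifiers` is a list of callables mapping features to a label."""
    return majority_vote([clf(document_features) for clf in classifiers])


votes = ["male", "female", "male", "female", "male"]
print(majority_vote(votes))
```

With five classifiers and two classes, a tie is impossible, so the vote always yields a single label.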

Page 23: Gender Detection on Blogs

FINAL RESULTS

Page 24: Gender Detection on Blogs

THE FINAL RESULTS

● By using the ensemble, we were able to increase accuracy by nearly 1% in each case, irrespective of the performance of the individual classifiers.

● The maximum accuracy observed during the experiments was 73.19%, achieved by the Ensemble model.

Page 25: Gender Detection on Blogs

73.188406%

The Maximum Accuracy Achieved

Page 26: Gender Detection on Blogs

USEFUL LINKS

● Github - https://github.com/nitishjain2007/Gender_Identification

● Youtube - https://www.youtube.com/watch?v=T04BJ6cIeTs

● Slideshare - http://bit.ly/1Q8UiCe

● Website - http://nitishjain2007.github.io/Gender_Identification/

● Dropbox - http://bit.ly/1Xx0ppL

Page 28: Gender Detection on Blogs

Thanks! Any questions?