Presented By (Team No. 32)
Nitish Jain (201301227), Ganesh Borle (201505587), Vamshikrishna Reddy (201202177)
Mentored By
Lokesh Walase
IRE [CSE474]
ABSTRACT
🔸Over the years, textual content has remained a prominent feature of internet media, especially blogs.
🔸Author profiling and attribution is therefore an important task, and we try to capture one aspect of it: gender.
● A lot of content brings a lot of responsibility.
● But . . .
● The internet cannot take responsibility for all the content; that responsibility lies with the authors themselves.
THE APPROACH
● We take advantage of the linguistic features of the blogs and create a feature file.
● Various classifiers are then trained on this feature file, producing one model per classifier.
🔸An ensemble is applied to these models, and the input document is classified as written by a male or a female author.
THE DATASET
● Koppel's blog dataset
● Contains about 19 thousand documents
🔸Each document contains the text of about 35 blog posts in XML format.
[Dataset Link: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm]
PARSING
● Language used: Python
● Each blog entry is stored in XML format:
<Blog>
<date>....... </date>
<post>….
</post>
...
</Blog>
● Each blog filename contains the name and gender of the author.
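The parsing step can be sketched as follows. This is a minimal illustration, not the project's actual parser: the function names, the regex-based extraction, and the assumption that the gender appears as a dot-separated `male`/`female` token in the filename are all hypothetical. A regular expression is used instead of a strict XML parser on the assumption that raw blog text may not always be well-formed XML.

```python
import re
from pathlib import Path

# Matches the body of every <post> element, across line breaks.
POST_RE = re.compile(r"<post>(.*?)</post>", re.DOTALL | re.IGNORECASE)

def extract_posts(xml_text):
    """Return the stripped text of every <post> element in one blog file."""
    return [post.strip() for post in POST_RE.findall(xml_text)]

def gender_from_filename(filename):
    """Look for a 'male' or 'female' token among the dot-separated name parts.

    The filename layout is an assumption made for this sketch.
    """
    for part in Path(filename).name.lower().split("."):
        if part in ("male", "female"):
            return part
    return None

# Hypothetical sample data in the <Blog>/<date>/<post> layout shown above.
sample = (
    "<Blog><date>01,May,2004</date><post> First entry. </post>"
    "<date>02,May,2004</date><post> Second entry. </post></Blog>"
)
posts = extract_posts(sample)
label = gender_from_filename("12345.female.example.xml")
```

Each blog file then yields a list of post texts plus one gender label, which is the input the feature extraction needs.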
FEATURES
For our task of gender identification, we use the following linguistic features:
🔸Character-Based Features
🔸Word-Based Features
🔸Syntactic Features
🔸Structural Features
🔸Function Words
🔸POS Start Probability
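A few of these feature families can be sketched in plain Python. The slides name only the categories, so every concrete feature below (ratios, averages, the short function-word list) is an illustrative assumption rather than the project's actual feature set. Syntactic and POS-based features are left out because they need a part-of-speech tagger.

```python
import re

# Illustrative subset of English function words (assumption, not the project's list).
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "that", "with", "for")

def extract_features(text):
    """Compute a small dictionary of character, word, structural and function-word features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)   # guard against division by zero
    n_words = max(len(words), 1)
    features = {
        # Character-based
        "char_count": len(text),
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        # Word-based
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        # Structural
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }
    # Function words: relative frequency of each listed word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    return features

features = extract_features("The cat sat. The dog ran to the park!")
```

Writing one such dictionary per document, in a fixed key order, gives the rows of a feature file.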
THE CLASSIFICATION TASK
For the task of classification, we tried several classification algorithms and arrived at a model that uses an ensemble of the following:
🔸Random Forest Classifier
🔸Neural Networks Classifier
🔸AdaBoost Tree Classifier
🔸Gradient Boosting Classifier
🔸Bagging Classifier
For each of the classifiers:
🔸We fed it partial feature sets to see how accuracy varies with the features used.
🔸We applied 10-fold cross-validation to measure accuracy.
To measure the accuracy of the ensemble, we took the majority class from the predictions of the individual classifiers.
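The 10-fold evaluation described above can be sketched with the standard library alone. The threshold classifier and the synthetic one-feature data are stand-ins invented for this sketch so that it runs without extra dependencies; the project's real classifiers and feature file would take their place.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit, predict, k=10):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        test_idx = set(held_out)
        train = [i for i in range(len(X)) if i not in test_idx]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Stand-in classifier (hypothetical): threshold halfway between the class means.
def fit_threshold(X_train, y_train):
    male = [x[0] for x, label in zip(X_train, y_train) if label == "male"]
    female = [x[0] for x, label in zip(X_train, y_train) if label == "female"]
    return (sum(male) / len(male) + sum(female) / len(female)) / 2

def predict_threshold(threshold, x):
    return "male" if x[0] > threshold else "female"

# Synthetic data: one numeric feature, 50 documents per class.
X = [[float(i)] for i in range(100)]
y = ["female" if i < 50 else "male" for i in range(100)]
accuracy = cross_val_accuracy(X, y, fit_threshold, predict_threshold)
```

To study partial feature sets, the same function can be called repeatedly with `X` restricted to different feature columns.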
RANDOM FOREST CLASSIFIER
● A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and averages their predictions
● Using the Random Forest classifier, we achieved an accuracy of 69.79%
NEURAL NETWORKS CLASSIFIER
● Consists of multiple layers of nodes, with each layer fully connected to the next; each node is a neuron with a non-linear activation function.
● Uses a supervised learning technique called backpropagation to train the network.
● Using the Neural Networks classifier, we achieved an accuracy of 69.51%
ADABOOST TREE CLASSIFIER
● A meta-estimator that begins by fitting a classifier on the original dataset and then fits further rounds of classifiers on the same dataset, with the weights of misclassified instances increased so that later classifiers focus on the harder cases
● Using the AdaBoost tree classifier, we achieved an accuracy of 69.57%
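The re-weighting idea behind AdaBoost can be sketched with one-feature decision stumps. This is a simplified illustration of the classic binary AdaBoost update with labels in {-1, +1}, not the library implementation used in the project; the data and all function names are made up for the sketch.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` above the threshold, `-polarity` otherwise."""
    return polarity if x > threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def boost_round(xs, ys, weights):
    """Fit one stump, then raise the weights of the samples it misclassifies."""
    error, threshold, polarity = fit_stump(xs, ys, weights)
    error = min(max(error, 1e-10), 1 - 1e-10)      # avoid log(0)
    alpha = 0.5 * math.log((1 - error) / error)    # the stump's vote strength
    new_weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
    total = sum(new_weights)
    return (alpha, threshold, polarity), [w / total for w in new_weights]

# Toy data that no single threshold separates.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, 1, -1, 1, 1, -1, 1]
weights = [1 / len(xs)] * len(xs)
model = []
for _ in range(3):
    stump, weights = boost_round(xs, ys, weights)
    model.append(stump)

def boosted_predict(x):
    """Sign of the alpha-weighted sum of stump votes."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in model)
    return 1 if score > 0 else -1
```

After each round, the samples the latest stump got wrong carry more weight, so the next stump is pushed towards them.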
GRADIENT BOOSTING CLASSIFIER
● Builds the model in a forward stage-wise fashion.
● At each subsequent stage, a new weak learner is introduced to compensate for the shortcomings of the existing weak learners; these shortcomings are identified by the gradients of the loss.
● Using the Gradient Boosting classifier, we achieved an accuracy of 70.81%
BAGGING CLASSIFIER
● A meta-estimator that fits each base classifier on a random subset of the dataset and then aggregates their individual predictions.
● Using the Bagging classifier, we achieved an accuracy of 70.03%
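Bagging can be sketched as bootstrap sampling plus a vote. The base learner here is a hypothetical one-feature threshold rule and the data is synthetic; both are assumptions chosen to keep the sketch short and dependency-free.

```python
import random
from collections import Counter

def fit_midpoint(xs, ys):
    """Base learner (stand-in): threshold halfway between the two class means."""
    male = [x for x, y in zip(xs, ys) if y == "male"]
    female = [x for x, y in zip(xs, ys) if y == "female"]
    return (sum(male) / len(male) + sum(female) / len(female)) / 2

def bagging_fit(xs, ys, n_estimators=11, seed=0):
    """Fit each base learner on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    while len(thresholds) < n_estimators:
        idx = [rng.randrange(n) for _ in range(n)]
        if len({ys[i] for i in idx}) < 2:
            continue  # resample if one class is missing from the draw
        thresholds.append(fit_midpoint([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the base learners' predictions by majority vote."""
    votes = Counter("male" if x > t else "female" for t in thresholds)
    return votes.most_common(1)[0][0]

# Synthetic data: one numeric feature, ten documents per class.
xs = list(range(20))
ys = ["female"] * 10 + ["male"] * 10
bag = bagging_fit(xs, ys)
```

An odd number of estimators avoids tied votes in the two-class case.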
THE ENSEMBLE
● An ensemble takes the outputs of the other classifiers and applies majority voting to them to determine the final output.
● Using the ensemble model on the classifiers discussed above, we achieved an accuracy of 71.10%
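Majority voting over the classifiers' outputs can be sketched in a few lines. The prediction lists below are invented for illustration and do not come from the experiments.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by most classifiers for one document."""
    return Counter(votes).most_common(1)[0][0]

def ensemble_predict(per_classifier_predictions):
    """Combine one prediction list per classifier, document by document."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]

# Hypothetical outputs of the five classifiers for four documents.
predictions = [
    ["male", "female", "male", "female"],    # random forest
    ["male", "male", "male", "female"],      # neural network
    ["female", "female", "male", "male"],    # AdaBoost
    ["male", "female", "female", "female"],  # gradient boosting
    ["male", "female", "male", "male"],      # bagging
]
combined = ensemble_predict(predictions)
```

With five voters and two classes a tie cannot occur, so no tie-breaking rule is needed in this setting.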
THE FINAL RESULTS
● By using the ensemble, we were able to increase accuracy by nearly 1% in each case, irrespective of the performance of the individual classifiers.
● The maximum accuracy observed during the experiments was 73.19%, achieved by the ensemble model.
USEFUL LINKS
🔸GitHub - https://github.com/nitishjain2007/Gender_Identification
🔸YouTube -
🔸SlideShare -
🔸Website - http://nitishjain2007.github.io/Gender_Identification/
🔸Dropbox -