Opinion Mining and Classification Technique to help make better choices before buying a product

1. Data Mining and Business Intelligence, PGP 2012-14, Group no. 1: Amit Singh Chauhan (60), Komal Billu (21)

2. The consumer market is flooded with products of the most varied sorts, each advertised as better, cheaper, and more resistant. Are the advertisements really true? INDIAN INSTITUTE OF MANAGEMENT RAIPUR

3. A good solution is to rely on word of mouth on the web. The ideal situation is that one is able to read all the available reviews and form an opinion. In practice, the time spent reading reviews would be huge, and product reviews are written in different languages.

4. How can we extract the features of a given product that could be commented upon in a customer review?

5. Significance of the problem: Mining the web for customer opinions on different products is both a useful and a challenging task. This research gives the customer a clear polarity, which is binary in nature. Eventually it will help the customer form a firm opinion about the product for which he turns to opinion mining.

6. What are the expected results of the project? The project will develop methods to evaluate a system implementing the presented method, and we show the evaluation results obtained when applying our system to a set of previously manually annotated texts containing customer reviews in English and Spanish.

7. The approach to the problem has been divided into two major phases: preprocessing and main processing. The remaining steps are assigning polarity to feature attributes, summarization of feature polarity, and discussion and evaluation.
9. Once the user enters a query about the product, a series of documents is downloaded in different languages. A second operation determines the category of the product. After the category is determined, the product-specific features are extracted using WordNet and ConceptNet. Product-independent features, which apply to all products, are also extracted.

10. Once we are done with WordNet, we search ConceptNet for further attributes and features. In the next step we look for undiscovered features of the product. For example, for a camera these features would be battery life, picture resolution, and auto mode. These features are extracted using bigrams, built from a corpus of target words and the other words used with them in customer reviews.

11. English / Spanish

12. Main processing starts with anaphora resolution, in which we replace anaphoric references with their corresponding referents. For example, "I bought this camera about a week ago, and so far have found it very simple to use" becomes, after anaphora resolution, "I bought this camera about a week ago, and so far have found this camera very simple to use". Sentence chunking then splits the modified text into sentences, and sentence extraction removes text of no importance.

13. Sentence parsing is done to obtain the sentence structure and component dependencies. In the next step the features and their values, i.e. attributes, are extracted. We also assign a modifier to each feature attribute to determine whether the attribute is positive or negative. The result is a set of triplets of the form (feature, attributeFeature, valueOfModifier).
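The bigram step mentioned for discovering further product features (slide 10) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `bigrams_with_targets`, the simple lowercase tokenizer, and the sample reviews are hypothetical and do not come from the original system.

```python
import re
from collections import Counter


def bigrams_with_targets(reviews, targets):
    """Count adjacent word pairs in which at least one word is a target word.

    reviews: iterable of review strings.
    targets: set of lowercase target words (e.g. known feature terms).
    """
    counts = Counter()
    for review in reviews:
        # Assumption: a crude tokenizer that keeps only runs of ASCII letters.
        tokens = re.findall(r"[a-z]+", review.lower())
        for pair in zip(tokens, tokens[1:]):
            if pair[0] in targets or pair[1] in targets:
                counts[pair] += 1
    return counts


# Hypothetical sample reviews, for illustration only.
sample_reviews = [
    "The battery life is great",
    "Battery life could be better",
    "Auto mode works well",
]
feature_counts = bigrams_with_targets(sample_reviews, {"battery", "mode"})
```

Frequent pairs such as ("battery", "life") would then be candidate features; a real system would also filter out stop words such as "the".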
14. ConceptNet methodology: the OUT relations PropertyOf and CapableOf, and the IN relations PartOf and UsedFor.
Feature value extraction: triplets (feature, attributeFeature, valueOfModifier).
Assigning polarity to feature attributes: SMO (Sequential Minimal Optimization) SVM (Support Vector Machine). The set of anchors contains the terms {featureName, happy, unsatisfied, nice, small, buy}. The 6-dimensional training vectors are v(j,i) = NGD(w_i, a_j), where a_j, with j ranging from 1 to 6, are the anchors and w_i, with i from 1 to 30, are the words from the positive and negative categories.

15. Summarization of feature polarity. The formulas can be summarized as:
Fpos(i) = #pos_feature_attributes(i) / #feature_attributes(i)
Fneg(i) = #neg_feature_attributes(i) / #feature_attributes(i)
The results shown are triplets of the form (feature, % positive opinions, % negative opinions).
Discussion and evaluation: three formulas are used for computing the system performance: System Accuracy (SA), Feature Identification Precision (FIP), and Feature Identification Recall (FIR).

16. The Normalized Google Distance (NGD) is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural-language sense tend to be "close" in units of Normalized Google Distance, while words with dissimilar meanings tend to be farther apart.
NGD(x, y) = [max{log f(x), log f(y)} - log f(x, y)] / [log N - min{log f(x), log f(y)}]
where N is the total number of web pages searched by Google multiplied by the average number of singleton search terms occurring on pages; f(x) and f(y) are the numbers of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur.
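As a rough illustration of the NGD formula and the 6-dimensional anchor vectors described above, the following sketch computes NGD from hit counts. The function names, the choice of "battery" for featureName, and all hit counts are assumptions made up for the example; real counts would have to come from a search engine.

```python
import math


def ngd(f_x, f_y, f_xy, n):
    """Normalized Google Distance computed from raw hit counts.

    f_x, f_y: hits for each term alone; f_xy: hits for both terms together;
    n: the normalising constant N. Assumes n is larger than f_x and f_y.
    """
    if min(f_x, f_y, f_xy) <= 0:
        # Assumption: terms that never occur together are infinitely far apart.
        return math.inf
    log_x, log_y = math.log(f_x), math.log(f_y)
    return (max(log_x, log_y) - math.log(f_xy)) / (math.log(n) - min(log_x, log_y))


def anchor_vector(word, anchors, hits, pair_hits, n):
    """One training vector: the NGD between `word` and each anchor term."""
    return [ngd(hits[word], hits[a], pair_hits[(word, a)], n) for a in anchors]


# Anchors from the slide, with featureName instantiated as "battery" (assumption).
ANCHORS = ["battery", "happy", "unsatisfied", "nice", "small", "buy"]

# Made-up hit counts, for illustration only.
N = 10**9
hits = {"excellent": 5_000_000, **{a: 2_000_000 for a in ANCHORS}}
pair_hits = {("excellent", a): 100_000 for a in ANCHORS}
pair_hits[("excellent", "happy")] = 800_000  # pretend these co-occur more often

vector = anchor_vector("excellent", ANCHORS, hits, pair_hits, N)
```

With these invented counts, the entry for "happy" is smaller than the others, which is the kind of signal the SVM would be trained on. Vectors like this, one per word from the positive and negative categories, would form the training set.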
17. Once the product category is determined, the product-specific features and feature attributes are extracted using WordNet for English and EuroWordNet for Spanish. The specific product features are then determined using ConceptNet.

18. Specialised tools for anaphora resolution: JavaRAP for English and SUPAR (Slot Unification Parser for Anaphora Resolution) for Spanish. A Named Entity Recognizer spots names of products, brands, and shops. LingPipe is used to split the text into sentences and to identify the named entities being referred to.

19. Sentence parsing tools: Minipar (English) and FreeLing (Spanish). To assign polarity to each identified attribute of the product, the following are used sequentially: Sequential Minimal Optimization (SMO), Support Vector Machine (SVM), and Normalized Google Distance (NGD).

20. SVM and NGD scores use a set of anchors that must be established beforehand, and the choice of anchors remains largely subjective. The informal language style customers use when writing their reviews sometimes makes it impossible to identify words and dependencies in phrases.

21. Currently it is possible to review consumer comments in two languages; the system can be extended to include other languages. It can also be extended to extract information from images and photos posted by other users. It can also be used for suggestive selling: the user provides his criteria for buying the product and how important each factor is to him, and the system gives suggestions accordingly.

22. A Feature Dependent Method for Opinion Mining and Classification, by Alexandra BALAHUR (DLSI, Univ. Alicante, Alicante, Spain) and Andrés MONTOYO (DLSI, Univ. Alicante, Alicante, Spain)
http://en.wikipedia.org/wiki/Sequential_minimal_optimization
http://en.wikipedia.org/wiki/Normalized_Google_distance
http://research.microsoft.com/en-us/groups/nlp/
http://en.wikipedia.org/wiki/Natural_language_processing
http://wordnet.princeton.edu/
http://conceptnet5.media.mit.edu/
http://web.media.mit.edu/~hugo/publications/papers/BTTJ-ConceptNet.pdf
http://www.acronymfinder.com/Slot-Unification-Parser-for-Anaphora-Resolution(computer-science)-(SUPAR).html
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8911&rep=rep1&type=pdf