Association of Socio-Emotional Factors: Religious Affiliation, Depression to Suicidal Tendency among Adolescent Girls
Suman Rachapudi and Sesha Sai
Spears School of Business, Oklahoma State University, Stillwater, OK 74078

Introduction
Suicidal tendency among adolescent girls is one of the biggest challenges for present-day society [1]. The main goal of this project is to find the association of various socio-emotional factors with suicidal tendencies among adolescent girls in the United States. The data were collected from the National Longitudinal Study of Adolescent Health to explore social behavior among adolescents. Models built via multiple regression are used to find the association between depression, religious affiliation, and suicidal tendency. ANOVA and chi-square tests are conducted to confirm the association of religious affiliation and depression with suicidal tendency.

Figure 1. EM Process flow

Figure 2. Result of Text Parsing Node – Zipf Plot

Zipf Plot: The Zipf plot ranks words by their frequency. Zipf’s law states that the frequency of a word is inversely proportional to its rank, so the second-ranked word occurs roughly half as often as the first.

Figure 3. Result of Text Filter Node – Terms Table

Concept Links: The concept link diagram shows the association of a particular word with other words.

Figure 4. Concept Link – “Reject”

The concept link in Figure 4 shows the words that most frequently occur with the word “Reject” and the strength of association between them: Pinholes, UT (Under Tolerance), Dents, Transverse scratches, and Gouges. The conclusion is that most of the materials were rejected because of these defects. A more in-depth examination of patterns in the associations can answer questions such as “Why are Gouges occurring?”

“Gouge” is strongly associated with the word “Weld”. Weld is a defect that occurs when the material is not welded properly or that is otherwise related to a welding issue.
Similarly, in-depth analysis can be performed to find the root cause of a defect.

Figure 5. Concept Link – “Gouge”

Figure 7. Result of Text Topic Node – Topics

The words in Topic 3 appeared together in the highest number of documents (16,093), followed by the words of Topic 2, which appeared in 12,132 documents. The words “cut”, “yield”, “length”, “id”, and “mark” define the theme of Topic 3. The materials assigned to Topic 3 have the following defects: cut issues, desired length not met, marks on the surfaces, and inner diameter under tolerance. Similarly, the words “reject”, “gouge”, “cut”, “weld”, and “UT” define the theme of Topic 2. The materials in Topic 2 were rejected because of the following defects: gouges, welding issues, and under tolerances. The conclusion is that most of the comments were assigned to Topics 2, 3, and 10, so most of the materials in Plant 30 have the defects present in those topics.

Figure 8. Results of Text Cluster Node

The Text Cluster node discovered six major themes, and each document (comment) was assigned to one of them. 36% of the comments were assigned to Cluster 1, but Cluster 1 has no descriptive terms and no theme. This happened because none of the comments in Cluster 1 contain any of the words in the Start list. This can be improved by removing some of the comments.

Figure 9. Results of Text Cluster Node after removing some comments

55% of the comments were assigned to Clusters 1 and 2. The materials in Cluster 1 have the following defects: bad coil, light wall, scratches, and under tolerances, whereas the materials in Cluster 2 have welding issues, crooked materials, dents, and gouges. The conclusion is that most of the materials in Plant 30 have these defects.

Figure 6. Concept Link – “Tensile”

Tensile: This defect is usually called tensile failure. Tensile refers to the capability of the material to stretch.
Tensile failure is very strongly associated with the words “heat”, “furnace”, and “power outage”. If the temperature in the furnace is not constant, the material loses tensile strength. During a power outage, the furnace cannot operate at a constant temperature, which reduces the material’s tensile strength.

Methodology

Analysis
SAS® Enterprise Guide 5.1 was used to analyze the data.

Text Parsing
The Text Parsing node parses the entire data set to identify the unique terms (words) present in the text. To speed up the run, some parsing properties were set to ‘No’: Detect Different Parts of Speech, Noun Groups, and Find Entities. Based on these settings, only the required key words were identified. After the terms were identified, a term-by-document matrix was created with terms as rows and documents (comments) as columns. Such a matrix is usually very sparse. To reduce its size, a Start list was used.

Text Filter
The Text Filter node assigns weights to the words based on their frequencies. A built-in algorithm filters out unnecessary words that have a low weight. The algorithm assigns more weight to medium- and low-frequency words because documents can be classified more easily based on those words. Check Spelling was set to ‘Yes’. The default English dictionary was customized as needed. The cells of the term-by-document matrix contain the frequency of the term in that particular document. The SAS default weighting was applied to the term frequencies.
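The term-by-document matrix and term weighting described under Text Parsing and Text Filter can be illustrated with a minimal Python sketch. It is not the SAS implementation. The comments and Start list below are hypothetical, and because the poster does not name the default weighting, the sketch assumes log frequency weighting combined with entropy term weights, one common scheme that gives less weight to terms spread evenly across all documents.

```python
import math
from collections import Counter

# Hypothetical inspection comments and Start list (not taken from the poster's data).
comments = [
    "reject gouge near weld",
    "reject dents and pinholes",
    "reject gouge weld issue",
    "tensile failure after furnace power outage",
]
start_list = {"reject", "gouge", "weld", "dents", "pinholes", "tensile", "furnace"}


def term_by_document(docs, start_terms):
    """Return {term: [count in doc 0, count in doc 1, ...]} for Start-list terms only."""
    matrix = {term: [0] * len(docs) for term in sorted(start_terms)}
    for j, doc in enumerate(docs):
        for token, count in Counter(doc.lower().split()).items():
            if token in matrix:
                matrix[token][j] = count
    return matrix


def entropy_weight(row):
    """Assumed entropy weight: 1 for a term confined to one document,
    0 for a term spread evenly over all documents."""
    n_docs = len(row)
    total = sum(row)
    if total == 0 or n_docs < 2:
        return 0.0
    ent = sum((f / total) * math.log2(f / total) for f in row if f > 0)
    return 1.0 + ent / math.log2(n_docs)


def weighted_matrix(matrix):
    """Scale log2(frequency + 1) in each cell by the term's entropy weight."""
    return {
        term: [entropy_weight(row) * math.log2(f + 1) for f in row]
        for term, row in matrix.items()
    }


tdm = term_by_document(comments, start_list)
weights = {term: entropy_weight(row) for term, row in tdm.items()}
weighted = weighted_matrix(tdm)
```

In this toy data, "reject" occurs in three of the four comments and receives a lower weight than "gouge" (two comments) or "tensile" (one comment), which mirrors the statement that medium- and low-frequency words get more weight. Words outside the Start list, such as "near" or "issue", never enter the matrix.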



Text Topic
The Text Topic node extracts topics, which are groups of terms that summarize the document collection. Multi-term topics were used, and after some research the number of multi-term topics was set to 10. The node assigns a document cutoff to each document and a term cutoff to each topic and, based on these threshold values, checks whether the association between the terms in the topic is strong enough.
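The cutoff logic described for the Text Topic node can be sketched in Python as follows. This is an illustration under stated assumptions, not the node's actual scoring rule: the topic names reuse terms from the poster, but every weight and cutoff value is hypothetical, both cutoffs are stored on the topic for simplicity, and a document's score is assumed to be the sum of the weights of its terms that pass the term cutoff.

```python
from collections import Counter

# Hypothetical topics: the weights and cutoffs are illustrative, not values from the poster.
topics = {
    "Topic 2": {
        "weights": {"reject": 0.6, "gouge": 0.5, "cut": 0.3, "weld": 0.4, "ut": 0.3, "mark": 0.05},
        "term_cutoff": 0.1,
        "doc_cutoff": 0.8,
    },
    "Topic 3": {
        "weights": {"cut": 0.6, "yield": 0.4, "length": 0.5, "id": 0.3, "mark": 0.3, "weld": 0.05},
        "term_cutoff": 0.1,
        "doc_cutoff": 0.8,
    },
}


def topic_score(tokens, topic):
    """Sum the weights of the document's terms that meet the topic's term cutoff."""
    counts = Counter(tokens)
    return sum(
        weight * counts[term]
        for term, weight in topic["weights"].items()
        if weight >= topic["term_cutoff"]
    )


def assign_topics(comment, topics):
    """Return every topic whose score for this comment meets its document cutoff."""
    tokens = comment.lower().split()
    return [
        name
        for name, topic in topics.items()
        if topic_score(tokens, topic) >= topic["doc_cutoff"]
    ]
```

With these made-up values, `assign_topics("reject gouge cut length mark", topics)` returns both topics, `assign_topics("reject gouge weld", topics)` returns only "Topic 2", and `assign_topics("weld mark", topics)` returns an empty list, so a comment can belong to zero, one, or several topics.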

Text Cluster
The Text Cluster node assigns each document to exactly one cluster, whereas the Text Topic node assigns a document to zero or more topics. Singular Value Decomposition (SVD) was used to overcome the curse of dimensionality. The Max SVD Dimensions property was set to 35, the Descriptive Terms property to 8, and the maximum number of clusters to 8. The Expectation-Maximization clustering algorithm was used.
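The Expectation-Maximization step can be sketched in Python on a toy example. This is a simplified illustration, not the Text Cluster node's algorithm: the SVD step is not performed here, the two-dimensional points are hypothetical stand-ins for SVD-reduced document coordinates, and the mixture is assumed to be spherical Gaussians with a fixed variance. The function names and parameters are invented for the sketch.

```python
import math


def responsibilities(point, means, mix, var):
    """E-step for one point: posterior probability of each cluster."""
    d2 = [sum((x - m) ** 2 for x, m in zip(point, mean)) for mean in means]
    base = min(d2)  # subtract the smallest distance for numerical stability
    dens = [p * math.exp(-(d - base) / (2 * var)) for p, d in zip(mix, d2)]
    total = sum(dens)
    return [v / total for v in dens]


def em_cluster(points, init_means, var=0.05, n_iter=25):
    """EM for a mixture of spherical Gaussians with fixed variance `var`.
    Returns one hard label per point plus the fitted cluster means."""
    means = [list(m) for m in init_means]
    k, n, dim = len(means), len(points), len(points[0])
    mix = [1.0 / k] * k
    for _ in range(n_iter):
        resp = [responsibilities(p, means, mix, var) for p in points]  # E-step
        for c in range(k):  # M-step: update each cluster's mean and mixing weight
            weight = sum(r[c] for r in resp)
            if weight > 0:
                means[c] = [
                    sum(r[c] * p[d] for r, p in zip(resp, points)) / weight
                    for d in range(dim)
                ]
            mix[c] = weight / n
    resp = [responsibilities(p, means, mix, var) for p in points]
    # Each document receives exactly one cluster: the most probable one.
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return labels, means


# Hypothetical reduced coordinates for six comments (two well-separated groups).
points = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.0), (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]
labels, means = em_cluster(points, init_means=[points[0], points[3]])
```

The final `max` over the responsibilities is what turns the soft EM probabilities into the single cluster per document that the node reports. Initializing the means from two chosen points keeps the toy run deterministic.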