Research project MAI 2 - Final presentation - Group n. 4

S. Deckers - J. Hermans - A. Ludermann - D. Di Mitri - J. Rutten - D. Soemers


Outline

● Our data
● Visualisation
● Analysing keywords
● Ontology
● Cluster articles
● Predicting citations
● Analysing raw materials
● Conclusion
● Improvements

Our data

[Figures]

Visualisations

Task 5: “Visualising the articles in a relevant context of time and geographical location in 2D or 3D”

Task 5 - Results

[Figure]

Analysing Keywords

Task 1: “Determining combinations of keywords that are specific for each year, country, journal, and subject category”

Task 6: “Extracting (combinations of) keywords from abstracts and titles”

Task 1

• TF-IDF as feature extraction method.
• Treat objects of interest and their keywords as documents.
• Extract relevant keywords by making use of a threshold.
• Fast

Fetching model

Generic document processor

Combination model
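The Task 1 idea can be sketched in a few lines of Python. This is a minimal illustration, not the group's implementation: the function name `tfidf_keywords`, the toy keyword lists, and the threshold value are all assumptions made for the example. Each object of interest (here a year) is treated as one document made of its keywords, and only keywords whose TF-IDF weight reaches the threshold are kept.

```python
import math
from collections import Counter


def tfidf_keywords(docs, threshold):
    """Return, per document id, the keywords whose TF-IDF weight reaches the threshold."""
    n_docs = len(docs)
    # Document frequency: in how many documents each keyword occurs.
    df = Counter()
    for keywords in docs.values():
        df.update(set(keywords))

    selected = {}
    for doc_id, keywords in docs.items():
        counts = Counter(keywords)
        total = sum(counts.values())
        weights = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
        # Keep only keywords above the threshold, strongest first.
        selected[doc_id] = sorted(
            (w for w, score in weights.items() if score >= threshold),
            key=lambda w: -weights[w],
        )
    return selected


# Hypothetical toy data: each year is one "document" built from its articles' keywords.
keywords_per_year = {
    "1998": ["nanotube", "carbon", "film", "nanotube"],
    "1999": ["quantum dot", "film", "carbon"],
    "2000": ["quantum dot", "film", "self-assembly", "self-assembly"],
}
specific = tfidf_keywords(keywords_per_year, threshold=0.2)
```

A keyword that appears in every year (such as "film" in the toy data) gets an IDF of zero and is never selected, which is the behaviour one wants when looking for year-specific terms.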

Task 6

LDA (treat sentences as documents). TF-IDF seemed too naïve.

1. Preprocessing of abstracts (tokenization, stemming, removal of stopwords to reduce dimensionality).
2. Construct vector space and word mapping for every article abstract.
3. Apply LDA (k = 1) on the vector space to obtain a distribution over words.
4. Use the word mapping (index -> word) to extract the relevant words.
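Steps 1, 2 and 4 can be illustrated with a small pure-Python sketch. Everything named here (`STOPWORDS`, `crude_stem`, `build_vector_space`, `top_words`) is hypothetical, the stopword list is deliberately tiny, and the suffix stripping is only a stand-in for a real stemmer. Step 3 is not LDA in this sketch: with a single topic the fitted word distribution should be close to the normalised word counts, so those counts are used as a placeholder for the library call.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real list would be much longer).
STOPWORDS = {"the", "of", "and", "a", "in", "were", "by", "on", "is", "to"}


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Step 1: tokenize, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_vector_space(sentences):
    """Step 2: return (word_mapping, vectors), i.e. index -> word and one count vector per sentence."""
    token_lists = [preprocess(s) for s in sentences]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    index_of = {word: i for i, word in enumerate(vocabulary)}
    word_mapping = dict(enumerate(vocabulary))
    vectors = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[index_of[token]] = count
        vectors.append(row)
    return word_mapping, vectors


def top_words(word_mapping, vectors, n):
    """Steps 3-4 (placeholder): normalised counts stand in for the single-topic distribution."""
    totals = [sum(column) for column in zip(*vectors)]
    norm = sum(totals)
    distribution = [t / norm for t in totals]
    ranked = sorted(range(len(distribution)), key=lambda i: (-distribution[i], i))
    return [word_mapping[i] for i in ranked[:n]]


sentences = [
    "The catalysts were prepared by oxidation",
    "The catalysts increased the conversion of carbon dioxide",
]
word_mapping, vectors = build_vector_space(sentences)
keywords = top_words(word_mapping, vectors, 1)
```

In a real pipeline the placeholder in `top_words` would be replaced by the topic-word distribution returned by an LDA implementation fitted on `vectors`.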

Ontology

Task 2: “Specifying an application independent ontology of publications.”

Task 7: “Defining ontology of the domain of nanotechnology which should be linked to the ontology of publications made in the first block.”

Task 8: “Automatically generating ontology for the publication data. Compare this ontology with the one you defined yourselves. Fill the ontology with data from the articles.”

Task 2: Result

[Figure]

Task 7: Result

[Figure]

Task 8: Approach

• Ontology Learning
  • Automatic or semi-automatic creation of ontologies
  • Requires text (or other data)
  • Often requires human supervision / corrections
• Approach
  • Accept single words from user input
  • Allow choice of different senses of word
  • Automatically generate related words

Cluster Articles

Task 4: “Learning article dendrograms and interpreting the dendrogram clusters”

• Approach
• Analysing splitting at the root

Task 4: Approach

• Sample 8,000 articles from database
• Top-down hierarchical clustering
• K-Means on each level with K = 2
• Stop splitting when cluster small enough or dense enough
• Repeat N times and compare results
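The top-down procedure above can be sketched as a recursive bisecting K-Means. This is a minimal sketch under stated assumptions: the slides do not say how "small enough" or "dense enough" were measured, so `min_size` and `max_spread` (mean squared distance to the cluster centre) are hypothetical stopping criteria, and the six 2-D points are toy data.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]


def two_means(points, rng, iterations=20):
    """Plain K-Means with K = 2; returns the two groups of points."""
    centres = rng.sample(points, 2)
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            d0 = squared_distance(p, centres[0])
            d1 = squared_distance(p, centres[1])
            groups[0 if d0 <= d1 else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. identical starting centres
        centres = [mean(groups[0]), mean(groups[1])]
    return groups


def build_dendrogram(points, rng, min_size=2, max_spread=0.5):
    """Split top-down until a cluster is small enough or dense enough."""
    centre = mean(points)
    spread = sum(squared_distance(p, centre) for p in points) / len(points)
    if len(points) <= min_size or spread <= max_spread:
        return {"size": len(points), "children": []}
    left, right = two_means(points, rng)
    if not left or not right:
        return {"size": len(points), "children": []}
    return {
        "size": len(points),
        "children": [
            build_dendrogram(left, rng, min_size, max_spread),
            build_dendrogram(right, rng, min_size, max_spread),
        ],
    }


# Toy data: two well-separated groups of points.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
# "Repeat N times and compare results": one tree per random seed.
trees = [build_dendrogram(points, random.Random(seed)) for seed in range(3)]
```

Because K-Means starts from random centres, repeated runs can produce different trees on real data; comparing the root splits across runs is one way to see which splits are stable.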

Task 4: Dendrogram

[Figure]

Task 4: Analysing split at root

• Year Features
  • 1998, 1999, 2000, 2001, 2002 (all 4x)
• Country Features
  • USA (4x), Japan (2x), Germany (1x), Peoples R. China (1x)
• Subject Features
  • Physics, Condensed Matter (4x)
  • Physics, Applied (3x)
  • Chemistry, Physical (3x)
  • Materials Science, Multidisciplinary (2x)

Predicting citations

Task 3: “Learning models that predict the citations of articles.”
Task 9: “Predicting the most cited authors.”

k-Nearest Neighbor classification

Initial approach (1)

• k-Nearest Neighbor, with k = 1 (results are sufficient)
• Considered attributes:
  • Cited patents
  • Publication year
  • Countries
  • Subject category
  • Author affiliation origin
• Instance representation using a boolean array
• Cosine similarity
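A minimal sketch of this initial approach, assuming each instance is a list of booleans (one flag per attribute value) paired with a class label. The function names and the five-feature layout in the example are hypothetical; the real feature space described later in the deck has roughly 14,000 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length boolean vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    norm = math.sqrt(sum(a) * sum(b))
    return shared / norm if norm else 0.0


def classify_1nn(query, training):
    """Return the label of the most similar training instance (linear search, k = 1)."""
    best_label, best_score = None, -1.0
    for vector, label in training:
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical feature order: [year=1998, year=1999, country=USA, country=Japan, subject=Physics, Applied]
training = [
    ([True, False, True, False, True], 2),
    ([False, True, False, True, False], 0),
]
query = [True, False, True, False, False]
predicted = classify_1nn(query, training)
```

For boolean vectors the dot product is just the number of shared `True` flags, and each norm is the square root of the number of `True` flags in that vector.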

Initial approach (2)

Classification using four classes:
• 0: no citations
• 1-20: low number of citations
• 21-100: medium number of citations
• 101 and more: high number of citations
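The four classes map directly onto a small binning function. The name `citation_class` is an assumption; the class indices 0 to 3 follow the numbering used on the later results slides.

```python
def citation_class(citations):
    """Map a citation count to one of the four classes from the slide."""
    if citations < 0:
        raise ValueError("citation count cannot be negative")
    if citations == 0:
        return 0  # no citations
    if citations <= 20:
        return 1  # low
    if citations <= 100:
        return 2  # medium
    return 3      # high


classes = [citation_class(c) for c in (0, 1, 20, 21, 100, 101)]
```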

Problem!

• 189,508 data instances (valid data entries)
• ~14,000-dimensional space
• A bool is equivalent to a byte (the smallest addressable memory element)

~14 kB for every instance!
~2.7 GB to contain the complete dataset!

Solution

• Use the boolean nature of the instance representation!
• Address and modify bits using bitmasks.

~14 kB reduced to ~1.7 kB
~2.7 GB reduced to ~332 MB
Memory consumption reduced by a factor of 8.
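The bitmask idea and the memory figures can be checked with a short sketch. The `BitVector` class is a hypothetical illustration, not the group's BitSet code; the instance and dimension counts are the ones quoted on the slides, with the dimension taken as exactly 14,000 for the arithmetic.

```python
class BitVector:
    """Fixed-length boolean vector packed eight flags per byte."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.data = bytearray((n_bits + 7) // 8)

    def set(self, index, value=True):
        mask = 1 << (index % 8)  # bitmask selecting one bit inside its byte
        if value:
            self.data[index // 8] |= mask
        else:
            self.data[index // 8] &= ~mask & 0xFF

    def get(self, index):
        return bool(self.data[index // 8] & (1 << (index % 8)))


N_INSTANCES = 189_508
N_DIMENSIONS = 14_000  # approximate figure from the slide

# One byte per boolean: 14,000 bytes per instance.
bytes_unpacked = N_INSTANCES * N_DIMENSIONS
# One bit per boolean: 1,750 bytes per instance.
bytes_packed = N_INSTANCES * len(BitVector(N_DIMENSIONS).data)
```

With these numbers `bytes_unpacked` is 2,653,112,000 (about 2.7 GB) and `bytes_packed` is 331,639,000 (about 332 MB), which matches the factor of 8 stated above.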

Additional optimizations

Bit representation allows us to make more efficient use of the CPU’s ALU.

Optimization of Cosine Similarity.
Increase in classification performance using linear search.

Original BitSet implementation
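One plausible reading of the cosine-similarity optimization is that, on bit-packed vectors, the dot product becomes a bitwise AND followed by a population count. The sketch below shows that idea using Python integers as bitsets; it is an assumption about the technique, and the helper names `pack`, `popcount` and `cosine_similarity_bits` are hypothetical.

```python
import math


def pack(flags):
    """Pack a list of booleans into one integer, bit i holding flags[i]."""
    value = 0
    for i, flag in enumerate(flags):
        if flag:
            value |= 1 << i
    return value


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def cosine_similarity_bits(a, b):
    """Cosine similarity of two bit-packed boolean vectors."""
    shared = popcount(a & b)  # dot product: flags set in both vectors
    norm = math.sqrt(popcount(a) * popcount(b))
    return shared / norm if norm else 0.0


a = pack([True, False, True, False, True])
b = pack([True, False, True, False, False])
similarity = cosine_similarity_bits(a, b)
```

In a compiled language the AND processes a whole machine word of flags per instruction, which is where the gain over a byte-per-flag loop would come from.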

Task 3 - Results

• 10-fold cross-validation

• Avg. accuracy Class 0: 0.7908
• Avg. accuracy Class 1: 0.9943
• Avg. accuracy Class 2: 0.9823
• Avg. accuracy Class 3: 0.8175

• Total avg. accuracy: 0.8963
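For readers unfamiliar with the evaluation setup, here is a minimal sketch of a 10-fold split and a per-class score. The slides do not define "accuracy per class"; the sketch assumes it means the fraction of instances of a class that were predicted correctly, and the interleaved fold assignment is likewise an assumption.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train, test) index lists; fold i holds every k-th instance starting at i."""
    for fold in range(k):
        test = list(range(fold, n_instances, k))
        held_out = set(test)
        train = [i for i in range(n_instances) if i not in held_out]
        yield train, test


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of instances of each true class that were predicted correctly."""
    correct, total = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (truth == guess)
    return {label: correct[label] / total[label] for label in total}


folds = list(k_fold_indices(25, k=10))
scores = per_class_accuracy([0, 0, 1, 3], [0, 1, 1, 3])
```

Each instance lands in exactly one test fold, so averaging the per-fold scores uses every instance once for testing.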

Task 9 - Results

Represent an author by his / her articles (instances), since an author cannot be uniquely identified.

Search for Class 3 instances.

Avg. accuracy for Class 3 classification: 0.7377

Analysing raw materials

Task 10: “Determining new substitutes of expensive raw materials”

Task 10: Determining new substitutes of rare raw materials

● Rare earth elements
○ group of 17 chemical elements

● 1. Find relevant documents
○ Abstracts that mention Rare Earth elements in some form

● 2. Analyse these documents for trends/patterns

Task 10: Finding Relevant Documents

● Regular Expressions
○ Can detect different ways of writing Rare Earths

● Full names
○ Yttrium / yttrium

● Chemical Formulae
○ Zr-Ce / YBa2Cu3O7+Ni / YSi1.7

● Some false positives
○ ZYMV-S (Zucchini Yellow Mosaic Virus)
○ especially for Yttrium
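A regular-expression matcher along these lines might look as follows. The patterns are assumptions made for illustration, not the group's expressions: full names are matched case-insensitively, and any whitespace-delimited token that looks like a chain of element symbols is scanned for rare earth symbols. The sketch also reproduces the kind of false positive mentioned above, since the single-letter symbol Y matches inside "ZYMV-S".

```python
import re

# The 17 rare earth elements: symbol -> full name.
RARE_EARTHS = {
    "Sc": "scandium", "Y": "yttrium", "La": "lanthanum", "Ce": "cerium",
    "Pr": "praseodymium", "Nd": "neodymium", "Pm": "promethium",
    "Sm": "samarium", "Eu": "europium", "Gd": "gadolinium", "Tb": "terbium",
    "Dy": "dysprosium", "Ho": "holmium", "Er": "erbium", "Tm": "thulium",
    "Yb": "ytterbium", "Lu": "lutetium",
}
NAME_TO_SYMBOL = {name: symbol for symbol, name in RARE_EARTHS.items()}

NAME_PATTERN = re.compile(r"\b(" + "|".join(NAME_TO_SYMBOL) + r")\b", re.IGNORECASE)
# One element-like unit: symbol plus optional count such as "2" or "1.7".
ELEMENT = r"[A-Z][a-z]?(?:\d+(?:\.\d+)?)?"
# A formula-like token: units optionally joined by "-" or "+".
FORMULA_PATTERN = re.compile(rf"{ELEMENT}(?:[-+]?{ELEMENT})*")
SYMBOL_PATTERN = re.compile(r"[A-Z][a-z]?")


def find_rare_earths(text):
    """Return the rare earth symbols mentioned by name or inside a formula-like token."""
    found = {NAME_TO_SYMBOL[m.lower()] for m in NAME_PATTERN.findall(text)}
    for token in text.split():
        token = token.strip(".,;:()")
        if FORMULA_PATTERN.fullmatch(token):
            found.update(s for s in SYMBOL_PATTERN.findall(token) if s in RARE_EARTHS)
    return found


hits = find_rare_earths("Yttrium doped films and Zr-Ce alloys")
false_positive = find_rare_earths("ZYMV-S virus")
```

The formula pattern does not check that every unit is a real element symbol, which keeps it short but is exactly why abbreviations made of capital letters slip through.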

Task 10 - TF-IDF approach 1/3

Task description
Use TF-IDF to order the 190,692 publications according to the similarity of their abstract with the Wikipedia article “Rare earth element”

Background knowledge on Rare earth elements

Task 10: TF-IDF approach 2/3

[Diagram: the query document and the corpus documents (Doc 0001.txt … Doc 192K.txt) each pass through text preprocessing, giving a query vector (n. query terms) and a TF-IDF index (n. docs x n. terms).]

Linear kernel: s = A · b^T
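The linear-kernel scoring can be sketched in pure Python, reading A as the TF-IDF index (one row per document) and b as the query vector over the same terms; that reading of the symbols is an assumption, as are the function names, the smoothed IDF formula, and the three toy documents. Rows are L2-normalised, so the linear kernel here equals cosine similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_row(tokens, vocabulary, idf):
    """TF-IDF vector over the vocabulary, L2-normalised; unknown tokens are ignored."""
    counts = Counter(tokens)
    row = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row


def build_index(documents):
    """Return (vocabulary, idf, A) where A holds one TF-IDF row per document."""
    token_lists = [tokenize(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(documents)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    # Smoothed IDF (an assumption; other variants work as well).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    A = [tfidf_row(tokens, vocabulary, idf) for tokens in token_lists]
    return vocabulary, idf, A


def rank(documents, query):
    """Return (document indices ordered by score, scores s = A · b^T)."""
    vocabulary, idf, A = build_index(documents)
    b = tfidf_row(tokenize(query), vocabulary, idf)
    s = [sum(a * q for a, q in zip(row, b)) for row in A]
    order = sorted(range(len(documents)), key=lambda i: -s[i])
    return order, s


# Hypothetical toy corpus and query.
documents = [
    "rare earth element oxides in catalysts",
    "zucchini yellow mosaic virus",
    "carbon nanotube films",
]
order, s = rank(documents, "rare earth element")
```

At the scale quoted on the slide a sparse matrix representation would be needed; the dense lists here are only meant to make the matrix-vector product visible.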

Task 10: TF-IDF approach 3/3

Example: first result, doc id 20350

The nano-grained Ni/ZrO2 catalysts containing rare earth element oxides were prepared by oxidation-reduction pretreatment of amorphous Ni-(40-x) at% Zr-x at% rare earth element (Y, Ce and Sm; x=1 - 10) alloy precursors. The conversion of carbon dioxide on the catalysts containing 1 at% rare earth elements was almost the same as that on the rare earth element-free catalyst, but the addition of 5 at% or more rare earth elements increased remarkably the conversion at 473 K. In contrast to the formation of monoclinic and tetragonal ZrO2 during pretreatment of amorphous Ni-Zr alloys containing 1 at% rare earth elements, tetragonal ZrO2, which is generally stable only at high temperatures, was predominantly formed during the pretreatment of the catalysts containing 5 at% or more rare earth elements. The surface area of the catalysts increased with the content of rare earth element. Thus, the increase in the surface area and stabilization of tetragonal ZrO2 seem to be responsible for the improvement of catalytic activity of the Ni-Zr alloy-derived catalysts by the addition of rare earth elements.

Task 10: Removing False Positives

● Compute similarity to Wikipedia page on Rare Earth elements
○ TF-IDF vectors

● Reject documents with similarity score below threshold

● Conservative threshold (0.005)
○ filters some false positives
○ excludes few (if any) true positives
○ manually determined
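The filtering step itself is a one-line comparison once similarity scores exist. In this sketch the scores and document ids are made up for illustration; only the threshold value 0.005 comes from the slide, and treating a score exactly at the threshold as accepted is an assumption.

```python
THRESHOLD = 0.005  # value from the slide, determined manually by the authors


def reject_below(scores, threshold=THRESHOLD):
    """Keep ids of regex-matched documents whose similarity score reaches the threshold."""
    return [doc_id for doc_id, score in scores.items() if score >= threshold]


# Hypothetical scores for three regex-matched documents.
scores = {"doc_a": 0.31, "doc_b": 0.001, "doc_c": 0.005}
kept = reject_below(scores)
```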

Task 10: Analysis

[Figure]

● Top 3 countries
○ Saudi Arabia (15.74%), Slovenia (12.59%), Romania (9.13%)

● Top subject categories per Rare Earth element
● Rare Earth element trends over the years

● See report for detailed results

Task 10: Rare Earth substitution

● Search articles that address substitution
● Lucene to search within RE abstracts (11,430)
● Search for “substitut*”, “replace*” or “alternative” (955)
● Filtered by sentences containing a chemical formula (841)
● Found no article that directly addresses substitution
● but some refer, e.g., to alternative methods or to substitution as a chemical reaction
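The two filtering steps can be approximated without Lucene. The sketch below swaps in Python regular expressions for the wildcard queries, so it is a stand-in and not the Lucene setup used in the project; the formula pattern (two or more element-like units in a row) and the function names are assumptions.

```python
import re

# Regex equivalents of the wildcard terms "substitut*", "replace*" and "alternative".
TERMS = re.compile(r"\b(?:substitut\w*|replace\w*|alternative)\b", re.IGNORECASE)
# Crude formula detector: at least two element-like units, e.g. "ZrO2" or "LaMnO3".
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*(?:\.\d+)?){2,}\b")


def sentences_of(abstract):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def candidate_sentences(abstract):
    """Sentences that contain a search term and something that looks like a formula."""
    return [
        s for s in sentences_of(abstract)
        if TERMS.search(s) and FORMULA.search(s)
    ]


# Hypothetical abstract text for illustration.
abstract = (
    "La was substituted for Sr in LaMnO3 films. "
    "An alternative method was used. "
    "The ZrO2 phase was stable."
)
candidates = candidate_sentences(abstract)
```

As the slide notes, hits of this kind often describe substitution as a chemical reaction (as in the first toy sentence) and not the replacement of a scarce raw material, so the output still needs manual review. The formula detector also accepts all-capital abbreviations such as "DNA".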

Wrapping up

• TASK 9:
  • Represent authors by collection of their articles
  • k-Nearest Neighbor classification
  • high accuracy
• TASK 10:
  • two approaches to find Rare Earth articles:
    • similarity to Wikipedia article with TF-IDF
    • regular expressions
  • substitution: search for abstracts that address RE substitution directly

Conclusions

● variety of techniques for more insight
● ontologies and visualization
● most popular topics for years or countries
● predicting number of citations

Assistance for decision making, e.g. which research areas to invest in

Improvements

• Improve RE substitution results by Machine Learning techniques
  • Need annotated data
• More advanced Machine Learning techniques for ontology learning, e.g. clustering

Thank you for your attention.

Questions?