
5th International Conference on Future-Oriented Technology Analysis (FTA) - Engage today to shape tomorrow Brussels, 27-28 November 2014

THEME 3: CUTTING EDGE FTA APPROACHES


SCIENCE, TECHNOLOGY & INNOVATION TEXTUAL DATA-ORIENTED TOPIC ANALYSIS AND FORECASTING:

METHODOLOGY AND A CASE STUDY

Yi Zhang1, 2, *, Guangquan Zhang2, Alan L. Porter3, Donghua Zhu1, Jie Lu2

1School of Management and Economics, Beijing Institute of Technology, Beijing, P. R. China

2Decision Systems & e-Service Intelligence research Lab, Centre for Quantum Computation & Intelligent Systems, Faculty

of Engineering and Information Technology, University of Technology Sydney, Australia

3Technology Policy and Assessment Centre, Georgia Institute of Technology, Atlanta, USA

*Corresponding Email Address: [email protected]

Abstract

Not only the external quantities but also the underlying topics of current Science, Technology & Innovation (ST&I) are changing all the time, and the cumulative, or even disruptive, innovation they induce is likely to heavily influence the whole of society in the near future. To address and predict these changes, this paper proposes an analytic method (1) to cluster associated terms and phrases into meaningful technological topics and (2) to identify changing topical emphases; we carry these results forward into mechanisms to forecast prospective developments via Technology Roadmapping approaches. Furthermore, an empirical case study of Award data from the United States National Science Foundation Division of Computer and Communication Foundations is performed to demonstrate the proposed method, and the resulting knowledge could hold interest for R&D management and science policy in practice.

Keywords: Text Mining; Topic Analysis; Text Clustering; Technological Forecasting; Big Data

Introduction

The coming of the Big Data age brings great opportunities and challenges for human beings and modern society, making it possible to explore more potential information from massive and varied data sources. Meanwhile, the dynamic development of Science, Technology & Innovation (ST&I) has been considered one of the most important features of today's open innovation systems. In this context, both national R&D management and industry have started to highlight these trends in the competition for global leadership. However, not only the external quantities but also the underlying topics are changing all the time, and the cumulative, or even disruptive, innovation they induce is likely to heavily influence the whole of society in the near future.

As a valuable instrument for addressing these concerns, text mining affords effective approaches to understanding vast textual databases and engages with semantic tools to deal with real-world problems. It is ST&I resources, evolving with academic publications, patents, ST&I program proposals, etc., that make it possible to describe previous scientific dynamics and efforts, discover innovation capabilities, and forecast probable evolution trends in the near future (Porter and Detampel 1995; Zhang et al. 2013). Currently, ST&I text analysis focuses on emerging topics by combining quantitative and qualitative methodologies and emphasizing automatic knowledge-based systems and bibliometric approaches.


On the one hand, clustering analysis is widely used for topic generation. As described by Jain (2009), the purpose of clustering analysis is to explore the potential grouping of a set of patterns, points, or objects. Analogously, text clustering concentrates on textual data with its statistical properties and the semantic connections of phrases and terms. Normally, clustering algorithms seek to calculate the similarity between documents and to reduce the rank by grouping a large number of items into a meaningfully small number of factors (Chen et al. 2013; Zhang et al. 2014). On the other hand, understanding topics usually connects with time series and forecasting requests, described as Emerging Trend Detection (ETD) problems (Kontostathis et al. 2004), evolving with (1) Technology Opportunity Analysis (Porter and Detampel 1995; Porter and Cunningham 2004) and continuing studies on automated ST&I intelligence extraction, visualisation, and future-oriented analysis (Zhu and Porter 2002; Zhang et al. 2013; Huang et al. 2014), and (2) Topic Detection and Tracking (Allan et al. 1998) and its related algorithmic research on Topic Identification (Small et al. 2014), Topic Detection (Cataldi et al. 2010; Dai et al. 2010), and Concept Drift (Lu et al. 2014).

However, previous studies lack a macro scope connecting algorithm research with real-world requirements: they either concentrate only on the design and refinement of textual clustering and classification approaches, or focus only on the problem itself and ignore possible quantitative efforts for improvement. Motivated by these concerns, this paper develops a data-driven, but adaptive, methodology for topic analysis and forecasting. In the first step, we introduce a K-Means-based clustering approach for semi-supervised learning on semi-labelled ST&I records, which includes several selection models between 1) phrases and words, 2) normal Term Frequency (TF) and Term Frequency Inverse Document Frequency (TFIDF), and 3) feature combinations. Furthermore, we apply Technology Roadmapping approaches for foresight studies, which combine quantitative evidence with expert knowledge, introduce a visual model to present the innovation trend in a specified period, and address concerns for forecasting discussions. Moreover, a case study focusing on the United States (US) National Science Foundation (NSF) Awards is presented in this paper.

This paper is organized as follows: in the "Methodological Approach" section, we present the detailed research method for ST&I textual data oriented topic analysis and forecasting. The section "Results, Discussion and Implications" follows, taking the US NSF Awards from 2009 to 2013 in the Division of Computer and Communication Foundations as a case study. This part identifies topics with clustering approaches, illustrates the development trend visually, and engages expert knowledge for foresight understanding. Finally, we conclude our current research and put forward possible directions for future work.

Methodological Approach

This study proposes and develops a data pre-processing approach, a K-Means-based clustering analysis approach, and a trend analysis approach, and uses NSF Award data as a case study. As mentioned, our methodology seeks to define an ST&I textual data-driven, but adaptive, method for topic analysis and forecasting. The general research framework is given in Fig. 1.


Figure 1. Research Framework for ST&I Textual Data Oriented Topic Analysis and Forecasting

Step 1 Raw Data Retrieval

Normally, ST&I textual data has common fields (e.g., Title, Abstract, etc.) and special ones (e.g., International Patent Classification in patent data, Program Element/Program Reference in NSF Awards data, etc.), and in this step, our purpose is to remove the meaningless data, and to retrieve the needed fields from the raw records.

Step 2 Feature Extraction

In our previous study, we developed a Term Clumping process for technical intelligence that aims to retrieve core terms (words and phrases) from ST&I resources by performing term cleaning, consolidation, and clustering steps (Zhang et al. 2014). In this paper, we introduce these steps for feature extraction (core term retrieval); however, the purpose of Term Clumping in our approach is to remove common terms, not to retrieve the exact core terms, which is the usual goal of Term Clumping. Therefore, the Term Clumping process should be modified to fit the specific domain and goals. Moreover, in this step we generate a Term-Record Matrix to calculate further similarities.

[Figure 1 is a flowchart of the framework: Raw Data Retrieval (e.g., Title, Abstract, special fields) feeds Term Clumping Processing, which yields Core Terms (Words and Phrases) and a Term-Record Matrix for Feature Extraction; Topic Analysis applies a data-driven K-Means-based clustering approach with a Feature Selection and Weighting Model (Words or Phrases, TF or TFIDF, Feature Combination), a K Local Optimum Model, a Cluster Validation Model, and a labelled Training Set to produce Topics; Forecasting then performs Topic Understanding and Foresight via Technology Roadmapping, engaging Expert Knowledge and External Factors (e.g., policy, technology).]

Step 3 Topic Analysis

In this step, we set up a training set of labelled data for machine learning and propose a data-driven K-Means-based clustering approach. Several aiding models are added as described below:

1) Cluster Validation Model

We compose the cluster validation model with Recall, Precision, and F Measure, referring to the common performance measures in information retrieval, which are defined as follows:

Recall = (Number of Relevant Records Clustered to the Category) / (Total Number of Relevant Records of the Category)

Precision = (Number of Relevant Records Clustered to the Category) / (Total Number of Records Clustered to the Category)

F Measure = (2 × Recall × Precision) / (Recall + Precision)

Generally, Recall denotes the fraction of the relevant records that are successfully retrieved, Precision indicates the fraction of retrieved records that are relevant, and the F Measure combines both as their harmonic mean. Since the Recall value for the whole dataset is meaningless (all records have been clustered), we only calculate the Total Precision to evaluate the number of correctly grouped records overall. In addition, we calculate the Recall, Precision, and F Measure for each Category and use the Average F Measure as another main target value.

Average F Measure = (Σ F Measure of each Category) / (Total Number of Categories)

Total Precision = (Number of Records Clustered to the Correct Category) / (Total Number of Records in the Training Set)

Moreover, it is necessary to mention that, in the Cluster Validation Model, we label all records grouped with a Centroid via the real category of that Centroid. Therefore, the selection of the Centroid is a key issue that influences the cluster validation process.
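As a minimal sketch (not the authors' implementation), the validation measures above can be computed from true and predicted category labels; all names here are illustrative:

```python
def cluster_validation(true_labels, predicted_labels, categories):
    """Per-category Recall, Precision, F Measure, plus the Average F
    Measure and Total Precision defined above. Labels are any hashable
    category identifiers; the two lists are aligned record-by-record."""
    scores = {}
    f_sum = 0.0
    for c in categories:
        # Records of category c that the clustering also placed in c.
        relevant_clustered = sum(1 for t, p in zip(true_labels, predicted_labels)
                                 if t == c and p == c)
        total_relevant = sum(1 for t in true_labels if t == c)
        total_clustered = sum(1 for p in predicted_labels if p == c)
        recall = relevant_clustered / total_relevant if total_relevant else 0.0
        precision = relevant_clustered / total_clustered if total_clustered else 0.0
        f = (2 * recall * precision / (recall + precision)
             if recall + precision else 0.0)
        scores[c] = (recall, precision, f)
        f_sum += f
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    average_f = f_sum / len(categories)
    total_precision = correct / len(true_labels)
    return scores, average_f, total_precision
```

For instance, with true labels ["A", "A", "B", "B"] and predictions ["A", "B", "B", "B"], Total Precision is 3/4 and category B scores Recall 1.0, Precision 2/3, F Measure 0.8.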

2) K Local Optimum Model

The traditional K-Means algorithm needs the K value to be set manually, and this value heavily affects the clustering results (Jain 2010). In our approach, aiming to reduce this influence and to find the best K value within a specified interval, we place the cluster validation model inside a loop over that interval and choose the best K value in the interval based on its F Measure.

The main concept of K Means is described as follows:

A. Initialization: Select the top K records with the highest Euclid Length as the Centroids of the K clusters;

Let tf_in be the frequency of term t_i in Record D_n

Record-Term Vector: V(D_n) = {tf_1n, tf_2n, ..., tf_(i-1)n, tf_in}

Euclid Length of Record D_n: ELEN(D_n) = 1 / √(Σ_i tf_in²)


B. Record Assignment: Classify each record to the Centroid with the highest Similarity value;

Let V(D_n) and V(D_m) be the Record-Term Vectors of Record D_n and Centroid D_m

Similarity Value: S(D_n, D_m) = Cos(V(D_n), V(D_m))

C. Centroid Refinement: Calculate the Similarity between each record and its cluster, and set the record with the highest Similarity Value as the Centroid of that cluster;

Let Cluster C = {D_1, D_2, ..., D_(l-1), D_l}

Similarity Value: S(D_n, C) = (Σ_(k=1)^l S(D_n, D_k)) / l

D. New & Old Centroid Comparison: If all new Centroids are the same as the old ones, the loop ends; otherwise, return to Step B.
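Steps A-D can be sketched as below. This is an illustrative reading, not the authors' code: records are dense term-frequency vectors, assignment and refinement use cosine similarity, and, as a simplifying assumption, the initial centroids are the records with the largest Euclidean norms rather than the exact ELEN ranking above:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense term-frequency vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans_records(vectors, k, max_iter=100):
    """K-Means variant of Steps A-D: centroids are actual records,
    assignment uses cosine similarity, and refinement picks the record
    with the highest mean similarity to its cluster members."""
    # A. Initialization: seed with the k records of largest Euclidean norm.
    order = sorted(range(len(vectors)),
                   key=lambda i: math.sqrt(sum(x * x for x in vectors[i])),
                   reverse=True)
    centroids = order[:k]
    for _ in range(max_iter):
        # B. Record assignment: each record joins its most similar centroid.
        clusters = {c: [] for c in centroids}
        for i, v in enumerate(vectors):
            best = max(centroids, key=lambda c: cosine(v, vectors[c]))
            clusters[best].append(i)
        # C. Centroid refinement: member with highest average similarity.
        new_centroids = []
        for c, members in clusters.items():
            best = max(members, default=c,
                       key=lambda i: sum(cosine(vectors[i], vectors[j])
                                         for j in members) / len(members))
            new_centroids.append(best)
        # D. Stop when the centroid set no longer changes.
        if set(new_centroids) == set(centroids):
            return clusters
        centroids = new_centroids
    return clusters
```

Because centroids are actual records (closer to k-medoids than classical K-Means), the "real category of the Centroid" labelling used in the Cluster Validation Model is directly available.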

3) Feature Selection and Weighting Model

In this step, considering the specified fields of the NSF Award data in our case study, we use the NSF Awards as the sample to present our method. Title and Abstract (described as Narration in NSF Awards) are the most common fields used in text analysis, and, for NSF Award data, we also introduce the Program Element (PE) Code and Program Reference (PR) Code into our study. In NSF Awards, one record is classified with at most 2 PE codes and at least 1 PR code, both of which are also composed of semantic terms. However, whereas these codes sometimes help our approach to explore relations between similar records, at other times they mislead the machine in other directions. For example, one or two PR codes may describe the setting of an empirical study, which can be quite different from the main concept. In this condition, we use an automatic way, aided by the cluster validation model, to assemble the best combination of Title terms, Narration terms, PE codes, and PR codes. Six assembled sets are compared in this model:

#1 Narration + Title Terms

#2 Narration + Title Terms + PE Code

#3 Narration + Title Terms + PE/PR Code

#4 Narration + Weighted Title Terms

#5 Narration + Weighted (Title Terms + PE Code)

#6 Narration + Weighted (Title Terms + PE/PR Code)

We treat these 4 kinds of terms separately and introduce a weighting model into #4, #5, and #6 to calculate similarities. In the first 3 assembled sets, we calculate the similarity for Narration Terms, Title Terms, PE Code, and PR Code respectively, and use the mean as the final similarity value of the assembled set. In the last 3 assembled sets, with the help of the weighting model, the inverse ratio of the shared-term counts is applied. Taking #4 as a sample, we derive the weights below:

V(D_n) = V_N(D_n) + V_T(D_n)

where V_N(D_n) is the Term-Record Vector with only Narration Terms, and V_T(D_n) is the one with only Title Terms.

Let T_N = |V_N(D_n) ∩ V_N(D_m)| and T_T = |V_T(D_n) ∩ V_T(D_m)|

ω_N = T_T / (T_N + T_T),  ω_T = T_N / (T_N + T_T)


Weighted Similarity Value: S_w(D_n, D_m) = ω_N × S(V_N(D_n), V_N(D_m)) + ω_T × S(V_T(D_n), V_T(D_m))
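The inverse-ratio weighting for assembled set #4 can be sketched as follows; this is an illustrative snippet (function names and the zero-overlap fallback are our assumptions, not from the paper), with records represented as lists of terms:

```python
from collections import Counter
import math

def cos_sim(terms_a, terms_b):
    # Cosine similarity over term-frequency vectors built from term lists.
    ca, cb = Counter(terms_a), Counter(terms_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def weighted_similarity(narr_n, narr_m, title_n, title_m):
    """Assembled set #4: similarity of two records from Narration and
    Title terms, weighted by the inverse ratio of shared-term counts
    (omega_N uses T_T, omega_T uses T_N, as in the formulas above)."""
    t_n = len(set(narr_n) & set(narr_m))    # T_N: shared narration terms
    t_t = len(set(title_n) & set(title_m))  # T_T: shared title terms
    if t_n + t_t == 0:
        return 0.0  # no overlap in either field (edge-case assumption)
    w_n = t_t / (t_n + t_t)
    w_t = t_n / (t_n + t_t)
    return w_n * cos_sim(narr_n, narr_m) + w_t * cos_sim(title_n, title_m)
```

Note the effect of the inverse ratio: the field whose overlap is larger contributes the smaller weight, which dampens the dominance of the much longer Narration field.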

At the same time, in this model we also compare two other questions that always attract researchers' interest in text analysis: the clustering accuracy of words versus phrases, and of normal term frequency (TF) versus the TFIDF value.

Intuitively, a phrase is more specific and should create a more accurate cluster, since phrases have much stronger relations with each other than individual words do. However, phrases appear much less frequently, leading to less overlap between records, and thus might be detrimental to a similarity measure. Interestingly, the Topic Models approach, which uses hierarchical Bayesian analysis to explore latent semantic groups in a collection of documents (Blei and Lafferty 2006; Blei 2012), handles text clustering tasks with a "Word - Topic - Record" model and only pays attention to words. Yau et al. (2014) applied Topic Models to both words and phrases derived from ST&I publications, and the comparison indicated better results with words. Moreover, Gretarsson et al. (2012) proposed a Topic Model approach, "TopicNets," for visual analysis of large amounts of textual data; they selected NSF grant proposals awarded to University of California campuses as one data sample. Thus, this comparison also represents an interest explored in this paper.

We apply normal term frequency and the TFIDF value with the K-Means approach and compare their accuracy on the NSF Award data. Considering the various kinds of TFIDF approaches, we only address the classical formula (Boyack et al. 2011), described below:

TFIDF = TF × IDF = (Frequency of Term t_i in Record D_j / Total Instances of Terms in Record D_j) × log(Total Record Number in the Set / Total Number of Records with Term t_i)
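The classical formula can be illustrated directly (the base of the logarithm is not specified in the source; the natural log is assumed here, and records are represented as lists of terms):

```python
import math

def tfidf(term, record, records):
    """Classical TFIDF as in the formula above: term frequency within
    the record, times the log of the inverse document frequency over
    the whole record set."""
    tf = record.count(term) / len(record)       # fraction of term instances
    df = sum(1 for r in records if term in r)   # records containing the term
    return tf * math.log(len(records) / df) if df else 0.0
```

A term appearing in every record gets log(N/N) = 0, which is exactly the down-weighting of common terms discussed in the results section.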

Our data-driven clustering approach comprises the models above, and the clusters, identified as topics, are generated at the end of this step.

Step 4 Forecasting

In past years we have contributed to a semi-automatic Technology Roadmapping composition model (Zhang et al. 2013), but those efforts were based on terms. In this paper, we use topics in place of terms and locate them in the time series as topic trends. We also engage expert knowledge and understanding of external factors, e.g., policy, technology development status, etc. In this part especially, quantitative results are considered only as objective evidence for expert engagement, and qualitative judgments take the more important role in the forecasting studies. The general steps of this section are outlined below:

1) To sort out the generated topics by year and to remove distinct duplicate topics manually;

2) To send the topic list to domain experts and ask for assessments, 1 for “interesting topic at that time,” 0 for “not interesting at that time,” and 0.5 for “not sure”;

3) To calculate the marks of each topic and obtain the ranking list;

4) With the help of experts, to remove low-ranked items and meaningless topics, consolidate similar topics, and to classify topics into their appropriate “technology development levels”;

5) To locate topics on the visual maps and to address the understanding gained regarding relations, development trends, and foresights.
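Steps 1)-3) of this procedure can be sketched as below. The topic names and the fixed filtering threshold are illustrative assumptions: in the paper, low-ranked and meaningless topics are removed with expert help rather than by an automatic cutoff.

```python
def rank_topics(assessments, threshold=0.5):
    """Aggregate expert scores per topic (1 = interesting at that time,
    0.5 = not sure, 0 = not interesting), then rank by mean mark and
    drop topics below a hypothetical threshold.
    `assessments` maps topic name -> list of expert scores."""
    marks = {topic: sum(scores) / len(scores)
             for topic, scores in assessments.items()}
    ranking = sorted(marks.items(), key=lambda kv: kv[1], reverse=True)
    return [(topic, mark) for topic, mark in ranking if mark >= threshold]
```

The surviving ranked list would then be consolidated, classified into technology development levels, and placed on the roadmap with expert input.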

Results, Discussion and Implications

Data


In the book Lee Kuan Yew: The Grand Master's Insights on China, the United States, and the World, the founding father of modern Singapore noted that "America's creativity, resilience and innovative spirit will allow it to confront its core problems, overcome them, and regain competitiveness" (Allison et al. 2013). Researchers and institutions are trying to evaluate the status of global innovation competition, and so far there is no clear conclusion. However, it is widely accepted that the United States (US) currently is, and for some time will remain, the world's leading country, owing to its powerful capability to produce innovation.

As the most important US government agency for funding research and education in most fields of science and engineering, the US National Science Foundation accounts for about one-fourth of federal support to academic institutions for basic research. It receives approximately 40,000 proposals each year for research, education, and training projects, approximately 11,000 of which are granted as awards (NSF Website, see Reference). In other words, understanding the NSF Award data, which covers highly innovative basic research that is often several years ahead of other regions, can be considered an express way to reveal how the innovation evolution pathways of the US work. Such a research approach gets at the core of the world's innovation and research forefront, and the resulting knowledge could strongly support R&D management plans and science policy both in the US and in other countries.

The NSF Award database is open access, and all data can be downloaded from the NSF's website (NSF Website, see Reference). Normally, these awards are classified by specific Award Type and division. As Standard Grants represent the most meaningful and the largest part of the available data, we concentrate on them. Moreover, most NSF Award data is labelled by its Program Type, while a smaller part is unlabelled. However, the Program Type is sometimes a very broad classification (e.g., Collaborative Research, Early Concept Grants for Exploratory Research, etc.) and sometimes very specific (e.g., Cyber Physical System, Information Integration and Informatics, etc.). Statistically, less than half of the NSF Award data is labelled in detail or with any kind of "usable" classification, while the rest has common or meaningless labels or no label at all. Thus, we treat the NSF Award data as semi-labelled.

As mentioned above, the NSF funds more than 10 thousand proposals per year, and the open-access data online goes back to 1959. Considering our background, social networks, and, especially, the purpose of this paper, we only choose awards related to Computer Science, from which 12,915 records under the Division of Computer and Communication Foundations, with an Organization Code between 5010000 and 5090000, are selected. Since one of the main motivations for topic analysis is to address the innovation possibilities in NSF Award data, we remove awards granting travel support, summer school support, and other proposals asking for education support, and finally arrive at 9,274 records. We then apply Term Clumping steps (Zhang et al. 2014) for core term retrieval; the result of each step is given in Table 1. However, we do not apply the clustering steps within Term Clumping, i.e., Term Cluster Analysis and Combine Terms Network, which would reduce the number of similar terms and increase the difficulty of seeking similar pairs.

Table 1. Steps of Term Clumping Processing

Step                                                             # N. Terms*   # T. Terms*
1  9274 Records, with 9274 Titles and 8975 Narrations                 -             -
2  Natural Language Processing via VantagePoint (see Reference)    254992        17859
3  Basic Cleaning with thesaurus                                   214172        16208
4  Fuzzy Matching                                                  184767        15309
5  Pruning (remove terms appearing in only one record)              42819         2470
6  Extra Fuzzy Matching                                             40179         2395
7  Computer Science based Common Term Cleaning                      38487         2311

*N. = Narration, T. = Title

Before further processing, we first construct the training set. As the NSF Award data is semi-labelled, we screen all 1,124 records from 2009, choose 10 labels associated with 588 records, and set up the training set, which includes 135 Title Terms and 2,746 Narration Terms. We also import 56 PE codes and 64 PR codes for these 588 records. A Term-Record Matrix is generated for the text clustering calculation.
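A Term-Record Matrix of this kind can be built from tokenized records as a simple sketch (raw term frequencies; illustrative, not the authors' code):

```python
def term_record_matrix(records):
    """Build a Term-Record Matrix: rows are terms, columns are records,
    and entries are raw term frequencies. `records` is a list of
    term lists, one per record."""
    terms = sorted({t for r in records for t in r})  # vocabulary
    index = {t: i for i, t in enumerate(terms)}      # term -> row index
    matrix = [[0] * len(records) for _ in terms]
    for j, record in enumerate(records):
        for t in record:
            matrix[index[t]][j] += 1
    return terms, matrix
```

The columns of this matrix are the Record-Term vectors V(D_n) used by the clustering approach, and TFIDF re-weighting can be applied to the same structure.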

Topic Analysis

Based on the training set, in the K Local Optimum Model, and considering the trade-off in the number of clusters (fewer topics make the results easier to understand, while more topics lead to greater accuracy), we set the interval of the K value to [10, 20]. We list the maximum and mean of the Average F Measure and Total Precision of the 6 assembled sets with Word & TF, Phrase & TF, and Phrase & TFIDF in Table 2, and also compare the accuracy of the TF and TFIDF values across the 6 assembled sets in Figure 2 (1 denotes normal term frequency and 2 denotes the TFIDF value; e.g., #1-1 means Feature Combination #1 with normal term frequency).

Table 2. Max and Avg. Values of Average F Measure and Total Precision of the 6 Assembled Sets with Word & TF, Phrase & TF, and Phrase & TFIDF

                                    #1        #2        #3        #4        #5        #6
WORD & TF
  Average F Measure   Max      0.908948  0.915470  0.915470  0.905554  0.935272  0.935961
                      Avg.     0.888425  0.888192  0.888192  0.857989  0.896751  0.922976
  Total Precision     Max      0.865417  0.851789  0.851789  0.863714  0.870528  0.870528
                      Avg.     0.792318  0.798204  0.798204  0.705901  0.774353  0.838625
PHRASE & TF
  Average F Measure   Max      0.960813  0.932940  0.928113  0.960813  0.971387  0.971387
                      Avg.     0.924594  0.902324  0.868127  0.921359  0.952589  0.939672
  Total Precision     Max      0.957411  0.890971  0.890971  0.957411  0.969336  0.969336
                      Avg.     0.845594  0.844045  0.766610  0.828868  0.910640  0.899799
PHRASE & TFIDF
  Average F Measure   Max      0.948351  0.886208  0.909903  0.952885  0.979774  0.979774
                      Avg.     0.914352  0.855526  0.861222  0.914686  0.955184  0.955934
  Total Precision     Max      0.943782  0.887564  0.906303  0.948893  0.977853  0.977853
                      Avg.     0.828713  0.840948  0.840019  0.783646  0.922100  0.931237


Figure 2. Average F Measure (L) and Total Precision (R) of 6 Assembles with TF and TFIDF

Generally, the results with the TFIDF value are less accurate than those with TF in #1, #2, #3, and #4, but the inverse holds in #5 and #6, where TFIDF's highest Average F Measure and Total Precision reach the peak of the whole result set. One possible explanation is that TFIDF analysis introduces document frequency into the feature space rather than term frequency alone. This helps to reduce the weighting of common terms that appear in many records and to increase the weighting of special ones. That is to say, TFIDF strengthens the relations between genuinely similar records and increases the accuracy of the clustering analysis.

Comparing the performance of the 6 assembled sets of feature combinations, as shown in Figure 2, #5 and #6 are the best, #1 and #4 are at a passable level, and #2 and #3 obtain the worst results. We try to explore the reasons behind these differences and outline some of our deductions below:

1) The PE and PR codes can be treated like the keywords of publications, which are special and meaningful, but far fewer terms originate from them than from the narration and title. Thus, there is no obvious difference between #2 and #3, or between #5 and #6;

2) Undoubtedly, the total amount of narration terms is much greater than that of the title terms, but, since the weighting model counts the terms shared between two records, T_N shows no significant difference from T_T, which might explain why #1 and #4 have similar results.

3) We have mentioned that the PE code acts as the main keyword for a proposal, while the PR code contains much noise, which can obfuscate the relations between proposals. For example, it is common to add one or two terms describing the empirical study, e.g., "Earthquake Engineering," "Gene And Drug Delivery," etc., or to use some "general" terms to emphasize the research purpose, e.g., "Science, Math, Eng & Tech Education," "Science Of Science Policy," etc. Considering this, together with the reason described in point 1), this should explain why #2 and #5 are slightly ahead of #3 and #6, respectively.

4) To explore why #5 (weighted PE code) is better than #4 (without PE code), and #4 is better than #2 (non-weighted PE code), we run "Feature Combination #5.1", which uses a "direct ratio" to weight the PE code. For #5.1 the highest F Measure is 0.91436 and the highest Total Precision is 0.873935, both worse than #2. A reasonable explanation is that, compared with the PE/PR code, narration terms are more misleading in the clustering analysis; the direct ratio enlarges this negative impact while the inverse ratio weakens it.

5) Notably, although the engagement of TFIDF helps to reach a better target value in some sets, the comparison of the 6 assembled sets suggests that TFIDF also weakens the distinction between the PE/PR code and narration terms, since "#2 and #3" and "#5 and #6" lie very close to each other once TFIDF is taken into consideration. At the same time, running test samples on the semi-labelled data, we notice that results derived from the TFIDF value can lead to a "large" cluster, in which 90% of the records are grouped together. Our concern is that, although TFIDF introduces document frequency to balance the influence of term frequency and to highlight the speciality of terms, it also shrinks the distance between special terms and common terms. Records with many non-zero values in the Term-Record Matrix thus become similar to many other records, the function selects these records as centroids more easily, and "large" clusters are generated. By comparison, normal term frequency seems to avoid the "large" cluster trouble.
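The "direct ratio" and "inverse ratio" weightings discussed in point 4 are not defined explicitly in this section; one plausible reading, sketched here purely as an assumption (with hypothetical term counts), is that a feature group's weight scales with (direct) or against (inverse) the number of terms it contributes, so the inverse ratio boosts the small but meaningful PE/PR code group:

```python
def group_weights(term_counts, mode="inverse"):
    """Assign a normalised weight to each feature group.
    direct : weight proportional to the group's share of all terms
    inverse: weight proportional to the reciprocal of that share,
             so small groups (e.g. the PE code) count more per term."""
    if mode == "direct":
        raw = {g: float(c) for g, c in term_counts.items()}
    else:
        raw = {g: 1.0 / c for g, c in term_counts.items()}
    norm = sum(raw.values())
    return {g: w / norm for g, w in raw.items()}

# Hypothetical term counts for one proposal's feature groups
counts = {"narration": 500, "title": 60, "pe_pr_code": 5}
inverse = group_weights(counts, "inverse")   # PE/PR code dominates
direct = group_weights(counts, "direct")     # narration dominates
```

On this reading, the result in point 4 follows naturally: the direct ratio lets the noisy narration group dominate the similarity computation, while the inverse ratio suppresses it.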

In Table 2, we also use normal TF for words to run the clustering approach again. Compared with the results derived from phrases, the results of Word & TF are not as good as those from Phrase with TF or TFIDF. Therefore, phrases clearly work better than words for record clustering on the NSF Award data.

Based on the results and analysis of the above experiments, we choose "Phrases," "normal TF," "Feature Combination #5 (narration terms, with weighted title terms and PE/PR code)," and "K = 18" as the most suitable K-Means clustering configuration for the NSF proposal data.
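As an illustrative reconstruction (not the authors' code), the chosen configuration can be sketched with a plain K-Means over a TF term-record matrix; a toy matrix with K = 2 is used here for brevity rather than the paper's K = 18:

```python
import random

def kmeans(vectors, k, iters=50, seed=0):
    """Basic K-Means on dense TF vectors with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest centroid for each record
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Toy term-record matrix: rows = records, columns = TF of three phrases
matrix = [[3, 0, 0], [2, 1, 0],   # records dominated by phrase 1
          [0, 0, 3], [0, 1, 2]]   # records dominated by phrase 3
labels = kmeans(matrix, k=2)      # the two pairs land in separate clusters
```

In the paper's setting, each row would hold the weighted TF values of one proposal's phrases, with the title and PE/PR-code features scaled by the inverse ratio before clustering.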

Forecasting

We apply our method to the NSF Award datasets from 2009 to 2013, approximately 1,000 records each year and 4,847 in total. After the K-Means clustering, we obtain 90 topics and retain 83 for further processing (removing 7 duplicate topics). As mentioned in the Methodology section, we seek to combine quantitative and qualitative methods for forecasting studies and treat our auto-generated results as objective evidence for decision-making. Thus, in this case, we engage experts on computer-related subjects for topic confirmation and modification. We invite 9 experts (4 researchers, e.g., Senior Lecturer, Lecturer, or Senior Researcher, who have focused on computer-related studies for more than 10 years, and 5 PhD candidates) from the School of Software, University of Technology Sydney, Australia. Drawing on their research experience and deep academic understanding of the development of computer science, they help us to confirm whether the topics generated by the text clustering analysis are interesting, and to consolidate similar topics.

In particular, considering the scope of knowledge, we use an inverse ratio to weight the 4-researcher group and the 5-PhD-candidate group, and then remove all topics below the rank value 0.5, since 0.5 is defined as "not sure" and values below it incline toward "not interesting." This leaves 54 topics in total. At the same time, referring to the hybrid composing model of Technology Roadmapping (Zhang et al. 2013) and considering the special situation of computer science, 2 experts from the 5-PhD-candidate group also help us to classify these 54 topics into three levels of technology development phases: Basic Research, Assistant Instrument, and System and Product. We list part of the final topics (all 11 topics of 2009, and 8 topics of 2013) in Table 3.
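The exact aggregation of expert judgements is not spelled out in the paper; assuming scores in [0, 1] (1 = interesting, 0.5 = not sure, 0 = not interesting) and group weights inversely proportional to group size, a plausible sketch of the rank value is:

```python
def topic_rank(scores_by_group):
    """Weighted mean of each group's average score, with group weights
    inversely proportional to group size (an assumed reading of the
    paper's 'inverse ratio', so each researcher counts more per head
    than each PhD candidate)."""
    inv = {g: 1.0 / len(s) for g, s in scores_by_group.items()}
    norm = sum(inv.values())
    return sum((inv[g] / norm) * (sum(s) / len(s))
               for g, s in scores_by_group.items())

# Hypothetical votes: 4 researchers and 5 PhD candidates on one topic
rank = topic_rank({
    "researchers": [1.0, 1.0, 0.5, 1.0],
    "phd_candidates": [0.5, 0.5, 1.0, 0.5, 0.5],
})
keep = rank >= 0.5   # topics ranking below 0.5 are discarded
```

Any topic whose aggregated rank falls below the "not sure" threshold of 0.5 would then be dropped, leaving the 54 topics reported above.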

It is necessary to mention that the assessment of the technology development phase level is subjective; aiming to reduce this bias, we allow a 0.5 adjustment value between adjacent levels, e.g., 2.5 means the topic is based on "Assistant Instrument" but is also close to "System and Product." This adjustment is reflected in the location of the topic on the Y axis. In addition, in 2013 only, considering the significance of the topic "Big Data," we consolidate 2 related topics and package them as one large "Big Data" topic.

The visual Technology Roadmapping for trend analysis is shown in Figure 3.

Table 3. Interesting Topics of 2009 and 2013 Selected by Experts

Year | Topic | Topic Description | Rank | Level
2009 | Adaptive Grasping | Adaptive Grasping, Automatic Speech Recognition, Empirical Mechanism Design, Hierarchical Visual Categorization | 0.9111 | 2.0
2009 | Behavior Modeling | Behavior Modeling, Checkpoint, Citizen Science, Dynamic Environments | 0.8667 | 1.5
2009 | Online Social Networks | HCC, Large Scale, Online Social Networks, Applications, Measurement | 0.8417 | 2.0
2009 | High Performance Computer | High Performance Computer, MRI R2, MRI R2 Consortium, Applications, Consortium | 0.8167 | 3.0
2009 | Human Centered Computing | BPC DP, Brain Computer Interface, HCC Small, Mainstream | 0.7278 | 2.0
2009 | Bayesian Model Computation | Bayesian Model Computation, CPS, Graphical Models, High Dimensional Data Sets, III/EAGER | 0.7222 | 1.0
2009 | Cyber Physical System | CPS, Cyber Physical System, MRI R2, Service Attacks, Wireless Network, Large Scale Data Centers | 0.7083 | 2.5
2009 | Medical Device Coordination | Medical Device Coordination, CPS, CSR, Data Centers | 0.6833 | 2.5
2009 | Hidden Web Databases | Hidden Web Databases, Sensitive Aggregates, Urgent Challenge, Suppressing | 0.6389 | 1.0
2009 | ad Hoc Wireless Networks | ad Hoc Wireless Networks, Cooperative Beam Forming, Cross Layer Optimization, Data Centers, Mobile Devices | 0.6139 | 2.0
2009 | Biological Networks | Biological Networks, CIF, Communication Networks, CPS, Cryptography | 0.5889 | 2.0
2013 | Big Data | AF, Big Data, CIF, Mathematical Problems, Joint Source Channel Codes, Large Scale Neural Networks | 1.0000 | 2.5
2013 | Robotic Intelligence | RI, RUI, High End Computer Users, SI2 SSE, Software Needs | 1.0000 | 3.0
2013 | Supporting Knowledge Discovery | GV, Scientific Visualization Language, Supporting Knowledge Discovery, correlated dynamics, cortical visual processing | 0.9111 | 2.0
2013 | NSF Smart Health | Health Influences, Learning Fine, Social Media, Twitter Health, NSF Smart Health | 0.8611 | 3.0
2013 | Large Scale Hydrodynamic Brownian Simulations | CDS&E, Large Scale Hydrodynamic Brownian Simulations, Parallel Structured Adaptive Mesh Refinement Calculations | 0.7278 | 2.5
2013 | Automatic Graphical Analysis | Automatic Graphical Analysis, Intuition Wall, Matrix Free Algorithms, Program New Computer Architectures, SHF, Electrical thermal Co Design | 0.6389 | 2.0
2013 | Virtual Organization | Long Tail Sciences, Virtual Organization, Human Centered Computation, Scientific Software Development, Scientists | 0.5250 | 2.0
2013 | Asynchronous Learning Experiences | Asynchronous Learning Experiences, EXP, RUI, Earthquake, Learning via Architectural Design | 0.5000 | 2.5

Discussion and Implications

Considering the hottest topics at the current time, it is interesting and promising to explore the "Big Data" issues in this case in more detail. On March 29, 2012, the Obama Administration announced the "Big Data Research and Development Initiative" (White House 2012, see References) to improve the ability to extract knowledge and insights from large and complex collections of digital data, and to help accelerate the pace of discovery in science and engineering, strengthen national security, and transform teaching and learning. Six US federal departments and agencies committed more than $200 million to launch the initiative; the NSF is one of them. As shown in Figure 4, "Big Data" appears to have been a hot topic in 2013, having evolved with various kinds of "new" techniques and concepts, but it is also easy to link these new ones to their original ideas (as marked in the solid box in Figure 3). Clearly, "Large Scale" related concepts, algorithms, and systems have been generated since 2009, e.g., "L-Scale Data Centre (2009)," "Solving Large System and Parallel Strategy (2011)," "Large Asynchronous Multi Channel Audio Corpora (2012)," etc. In other words, "Big Data" is not a complete invention but an evolution of previous techniques and a solution for real-world problems: almost all of its components can be traced back. Looking forward, however, is more valuable. Concerning the general technology development pathway and the leading position of Big Data, we leave several comments here for the purpose of forecasting studies.

1) Outwardly, "Trustworthy Cyber Space (2010 and 2012)" and "Privacy Preserving Architecture and Digital Privacy (2011)" seem to have no direct relation to Big Data, especially its techniques; but in May 2014 the White House released another report, "Big Data: Seizing Opportunities, Preserving Values" (White House 2014, see References), which addressed the relations between government, citizens, businesses, and consumers and focused on how the public and private sectors can maximize the benefits of big data while minimizing its risks. Cyber security is clearly considered such a risk in the Big Data Age. Therefore, it is reasonable to expect that, in the near future, "privacy in Big Data" will be a major concern for both government and citizens, not only in the policy and legal domains but also in privacy-protecting techniques.

2) Another set of topics that attracts our attention is "Open Data Repository Intermediary (2011)," "Supporting Knowledge Discovery (2013)," and "Virtual Organization (2013)," together with the real-application-related topics, e.g., "Medical Device Coordination (2009)," "Cyber Infrastructure (2010)," "Wireless Camera Networks (2012)," "RFID System (2012)," etc. As the most powerful competitor of the US, China has named the "Internet of Things" one of its top 5 emerging industries, first announced in the speech "Let Science and Technology Lead China's Sustainable Development" by then Premier Wen (2009). Similarly, in the 2014 White House report mentioned above, the "Internet of Things" is highlighted as the ability of devices to communicate with each other using embedded sensors that are linked through wired and wireless networks, which is also connected with Big Data. Thus, the "Internet of Things," including its related techniques and cyber security issues, is bound to be another hot research topic in the coming decades.

3) People have long imagined intelligent robots. Although these topics are not new and have appeared several times in Figure 3, e.g., "Next Generation Robotics (2011)," "Robotic Intelligence (2012)," and "Robotics Engineering (2013)," we still hold a positive outlook on robotics techniques, which should be able to gain more intelligence from Big Data and upgrade into a smarter form.

4) As part of the Obama Administration's "Big Data" program, the NSF started its "NSF Smart Health and Wellbeing" program in 2012, which "seeks improvements in safe, effective, efficient, equitable, and patient-centered health and wellness services through innovations in computer and information science and engineering" (NSF Website, see References). As a precursor of this kind of research, "Medical Device Coordination (2009)" can be considered the foundational construction, and "NSF Smart Health" rose rapidly to become a hot topic in 2013. With the push of the NSF program and the pull of enormous wellbeing demands in modern society, the application of computer techniques in health and wellness services is set to be an emerging industry for a long time to come.

Conclusion and Further Study

In the Big Data Age, it has become common to shift from traditional method-driven research to data-driven empirical study, and this paper can be considered such an attempt. We focus on NSF Awards, propose a clustering approach for topic retrieval, and then engage expert knowledge to identify developmental patterns. The combination of quantitative and qualitative methods provides a promising approach to forecasting potential coming advances.

We anticipate further study along three directions: 1) to continue to improve our clustering algorithm by comparing it with other text clustering approaches, and to make it more operable, adaptable, and effective; 2) to extend the empirical study to cover multiple data sources, e.g., publications and patents; and 3) to extend our scope to broader innovation processes. We believe this work should hold interest for government, industry, and researchers.


Figure 3. Technology Roadmapping for Computer Science (based on NSF Award Data)


References

Allan, J., Carbonell, J., Doddington, G.,Yamron, J., Yang, Y. Topic Detection and Tracking Pilot Study Final Report, in Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.

Allison, G., Blackwill, R. D., Wyne, A. Lee Kuan Yew: The Grand Master’s Insights on China, the United States, and the World. The MIT Press, USA, 2013.

Big Data is a Big Deal, Office of Science and Technology Policy, White House, March 29, 2012. http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal (accessed October 24, 2014)

Big Data: Seizing Opportunities, Preserving Values, White House, May 2014. http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf (accessed October 24, 2014)

Blei, D. M. Probabilistic topic models. Communications of the ACM, 2012, 55(4): 77-84.

Blei, D. M., Lafferty, J. D. Dynamic topic models, Proceedings of the 23rd international conference on Machine learning. ACM, 2006: 113-120.

Boyack, K. W., Newman, D., Duhon, R. J., Klavans, R., Patek, M., et al. Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE, 2011, 6(3): e18029. doi:10.1371/journal.pone.0018029.

Cataldi, M., Di Caro, L., Schifanella, C. Emerging topic detection on twitter based on temporal and social terms evaluation. Proceedings of the Tenth International Workshop on Multimedia Data Mining, ACM, 2010: 4.

Chen, H., Zhang, G., Lu, J. A Time-Series-Based Technology Intelligence Framework by Trend Prediction Functionality, in 2013 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2013: 3477-3482.

Dai, X., Chen, Q., Wang, X., Xu, J. Online topic detection and tracking of financial news based on hierarchical clustering. Machine Learning and Cybernetics (ICMLC), 2010 International Conference of IEEE, 2010, 6: 3341-3346.

Gretarsson, B., O’donovan, J., Bostandjiev, S., Höllerer, T., Asuncion, A., Newman, D., Smyth, P. Topicnets: Visual analysis of large text corpora with topic modeling. ACM Transactions on Intelligent Systems and Technology (TIST), 2012, 3(2), 23.

Huang, L., Zhang, Y., Guo, Y., Zhu, D., Porter, A. L. Four dimensional Science and Technology planning: A new approach based on bibliometrics and technology roadmapping. Technological Forecasting and Social Change, 2014, 81: 39-48.

Jain, A. K. Data Clustering: 50 Years beyond K-Means. Pattern Recognition Letters, 2010, 31(8): 651-666.

Kontostathis, A., Galitsky, L. M., Pottenger, W. M. A survey of emerging trend detection in textual data mining. Survey of Text Mining. Springer, New York, 2004: 185-224.

Lu, N., Zhang, G., Lu, J. Concept drift detection via competence models. Artificial Intelligence, 2014, 209: 11-28.

Porter, A. L., Detampel, M. J. Technology opportunities analysis. Technological Forecasting and Social Change, 1995, 49(3): 237-255.

Porter, A. L., Cunningham, S. W. Tech mining: exploiting new technologies for competitive advantage. John Wiley & Sons, New York, USA, 2004.

Small, H., Boyack, K. W., Klavans, R. Identifying emerging topics in science and technology. Research Policy, 2014, 43(8): 1450-1467.

United States National Science Foundation, http://www.nsf.gov/ , (accessed October 12, 2014)

VantagePoint, www.theVantagePoint.com, (accessed October 12, 2014)

Wen, J. Let Science and Technology Lead China's Sustainable Development, 2009. http://www.chinanews.com/gn/news/2009/11-23/1979809.shtml (accessed October 24, 2014)

Yau, C., Porter, A., Newman, N. Clustering scientific documents with topic modelling. Scientometrics, 2014, 100: 767-786.


Zhang, Y., Guo, Y., Wang, X., Zhu, D., Porter, A. L. A hybrid visualisation model for technology roadmapping: bibliometrics, qualitative methodology and empirical study. Technology Analysis & Strategic Management, 2013, 25(6): 707-724.

Zhang, Y., Porter, A. L., Hu, Z., Guo, Y., Newman, N. "Term clumping" for technical intelligence: A case study on dye-sensitized solar cells. Technological Forecasting and Social Change, 2014, 85: 26-39.

Zhu, D., Porter, A. L. Automated extraction and visualization of information for technological intelligence and forecasting. Technological Forecasting and Social Change, 2002, 69: 495-506.