III-CTX-Small: ScholarMatch: Technology to Discover and Relate to Scholars around the Globe

1 Introduction

An undergraduate asks: “Which scholars have the most relevant publications for my paper on nineteenth-century Japanese social history?” A graduate student asks, “Which faculty might be interested in reading my draft research on Latin American revolutions?” A community college teacher asks, “What scholars in my city might be willing to do a guest lecture on modern labor practices in the Middle East?” An early-career researcher asks, “Who is doing research on sexuality outside my field of specialization in U.S. History?” A tenure committee asks, “Who would be an appropriate evaluator of a colleague who works on pre-colonial Nahuatl-speaking cultures?”

Our project, ScholarMatch, will allow users to easily answer all of these questions. ScholarMatch will be a user-friendly portal that connects social science and humanities scholars, students, and teachers via their shared intellectual interests in historical topics. ScholarMatch puts academic databases to a new use through state-of-the-art text mining techniques. We use topic modeling to create a semantic map of the subjects covered in more than 800,000 abstracts of historical publications over the past two decades. This topic representation allows us to create webs of scholars whose works fall into similar categories. Users can view lists of scholars who focus on particular thematic areas; investigate the combination of subjects on which individual scholars work; or track how various scholars relate to one another. ScholarMatch goes beyond keyword search to organize scholars by their overall subject interests. Rather than connecting users to individual pieces of scholarship on particular issues, ScholarMatch connects people based on how closely their constellations of interests match.

Moreover, ScholarMatch is not just a passive information provider. ScholarMatch users will be able to upload their own abstracts, research papers, syllabi, lecture outlines -- or even just a page of ideas -- for on-the-spot topic modeling that will show how their work relates to that of other scholars. We envision that ScholarMatch will be useful to a large segment of the academic community, such as undergraduates writing papers, graduate students learning a field, researchers looking for collaborators, college teachers developing lesson plans, or institutions seeking expert reviewers.

We focus the project on the discipline of History, a field that straddles the boundaries between Humanities and the Social Sciences. It is Block’s field of expertise, and we already have access to major historical abstract databases. We match the expertise of a humanities scholar with the innovative skills of two computer scientists, both of whom already have significant experience in interdisciplinary work. These combined skills allow us to turn an enormous amount of text data -- in this case, more than 800,000 abstract entries plus thousands of CVs and homepages harvested from the web -- into a means to connect scholars around the globe.

We will produce the semantic web subject index that runs ScholarMatch with topic modeling, a relatively new text mining technique. The topic model is an unsupervised statistical language model that automatically learns a set of topics that together describe a collection in its entirety, while simultaneously organizing each document in the collection by topic. Topic modeling is an ideal way to extract and summarize structured information from unstructured data sources [9, 19-22].

We will run the topic model on more than two decades of abstracts (1985-2007) from America: History and Life and Historical Abstracts. AHL and HA include articles from several thousand journals published world-wide, thousands of book and media reviews selected from one hundred key journals, and entries for dissertations and masters’ theses. These data sets have already been collected.

In the first year, the ScholarMatch system will be based on writings from published scholars. Users will be able to browse the system and to have their writing temporarily topic-modeled as part of their search queries. In the second year, we will integrate an additional community and functionality: unpublished scholars and anyone not already in the system will be able to have their own writings permanently topic modeled to create a page for themselves in ScholarMatch. This second community will grow organically, driven by word-of-mouth and publicity campaigns. We expect students and teachers to be a significant contingent of this group. In the third year, the project team will conduct extensive assessments of the impact of ScholarMatch on diverse user groups and make improvements to the ScholarMatch system.

ScholarMatch benefits diverse populations of users in multiple ways. Its innovative text mining technology transforms how teachers, researchers and students can search and understand thematic fields of scholarship. By being publicly available, ScholarMatch can increase involvement of traditionally underserved communities who often lack access to digital or scholarly resources. Furthermore, ScholarMatch bridges gaps between teachers and researchers. Rather than a traditional database that revolves around published scholarship, ScholarMatch encourages teachers to upload teaching content for topic modeling. This will connect them to other teachers and researchers in their areas of interest. Likewise, students can upload their own writings and directly compare themselves to those whose books and articles they have read. Thus, ScholarMatch will promote many forms of collaboration by connecting diverse members of the academic community. Finally, ScholarMatch levels the playing field between new and established scholars by providing a resource for anyone searching for experts in a particular subfield, without having to rely on “old boy networks” or institutional affiliations.

By building a system that allows users to find and relate to scholars around the world, ScholarMatch facilitates engagement between communities of researchers, teachers, and students in new and productive ways. ScholarMatch allows anyone who can write a few paragraphs to become an interactive participant in the social web of scholarly work.

2 Intellectual Merit

ScholarMatch is uniquely situated to make contributions to multiple disciplines. It has significant potential benefits for humanities and social science scholars: it breaks down barriers between published scholars and other members of the academic community; it allows for improved search and discovery; it promotes collaboration; it allows one to summarize, track, and see who is working in sub-fields; and it provides increased networking opportunities to people who may lack institutional resources or access to influential leaders in their field.

ScholarMatch focuses on two areas of scientific research. The first area, computer science, relates to the text mining and modeling challenges required to make a working ScholarMatch system. We will investigate new topic models that better model the type of heterogeneous corpora collected for the ScholarMatch system. The second area, informatics, relates to assessing the impact of ScholarMatch. In particular, we will assess the impact and usefulness of this advanced text mining technology on students, teachers at community colleges, early career researchers, and established researchers. This will allow us to focus particularly on the ways that ScholarMatch helps overcome divisions between research and teaching, and between new and established scholars.

2.1 Novelty of ScholarMatch

ScholarMatch will be novel in its wholesale application of text mining technology to fields far removed from computer science. Unlike computer science and physics, which are served by academic literature digital libraries such as CiteSeer, Rexa and arXiv (all of which are NSF-funded), humanities fields are far less likely to have broad-based technological resources that take advantage of just-developing computer science text mining techniques.

ScholarMatch moves beyond keyword or full-text search to topically categorize entire corpora. Rather than a search engine focused on finding individual pieces of scholarship, ScholarMatch provides users with overviews of thematic fields. Evaluating the total published scholarship (and, for year-two self-uploaded users, the teaching and research interests) available in a given field allows users to identify others with in-depth knowledge of a given topical constellation. ScholarMatch’s focus on scholars over publications builds on the increasing popularity of social relationship websites like Facebook and MySpace to provide a format understood by the younger generation of scholars and students.

Most currently available digital library or abstract search systems are still traditional read-only search systems. Because these systems contain, and mostly benefit, published scholars and active researchers, they can inadvertently reinstitute academic separations and institutional hierarchies between teaching and research.

In contrast, ScholarMatch is not a static database to be queried; it is an interactive system to connect all members of academic communities. For example:

An undergraduate uploads a historic document to find scholars whose work will be useful for the research paper he is writing.

A community college teacher uploads her lecture notes to see who might best be able to answer a student’s conceptual question about her latest lecture.

An M.A. student uploads his research paper to see which Ph.D. programs have scholars that best relate to his research interests.

A conference committee uploads panel proposals to fill slots for commentators and chairs.

A new Assistant Professor uploads her research summary to find likely collaborators outside of people she knows within her regional and chronologic subfields.

A Search Committee uploads their job requirements (e.g., fields of scholarship) to find up-and-coming scholars who match their needs.

Thus, ScholarMatch users will not only be able to investigate relationships among thousands of scholars, they will be able to place themselves into that scholarly community.

2.2 Broader Impacts

ScholarMatch:

Integrates Research and Teaching: ScholarMatch uses technology to transform how both teachers and students integrate research into the educational process. Because topic modeling works well with unstructured text, community college teachers who don’t produce their own research might still be active users of ScholarMatch by uploading syllabi, lecture outlines, or even just course catalog summaries. ScholarMatch can provide direct connections between teaching topics and scholarship focused on that topical area. Undergraduate researchers and graduate students can use ScholarMatch to better understand the total scholarship in any given subfield.

Promotes Collaboration: ScholarMatch helps users find potential collaborators with similar interests, explains the connection by topic, and provides contact information for these scholars. Unlike in fields where scholars regularly publish with multiple authors, it can be less obvious to humanities and social science scholars who viable collaborators might be. Historians are increasingly emphasizing the value of collaborative research, and ScholarMatch can help scholars looking for conference panelists, invited speakers, or research and writing partnerships. Teachers can also use ScholarMatch to find other educators with overlapping pedagogical interests.

Levels Academic Playing Field: ScholarMatch effectively performs a double-blind classification of anyone’s work. Furthermore, the topic model characterizes authors according to the subject of what they write, not the amount that they write, thereby leveling the playing field between new and established scholars. Because the abstract databases we use include dissertations and M.A. theses, early-career scholars will immediately be part of this scholarly community. ScholarMatch gives all interested users equal access to an array of networking opportunities, regardless of their personal connections, institutional affiliations, or geographic location. Thus, we see ScholarMatch as an excellent tool for leveling the academic playing field.

Reaches out to Traditionally-Underserved Communities: Because ScholarMatch will be freely available, users in academic communities that have been traditionally underserved (e.g. community colleges, underfunded B.A.-only institutions) can use ScholarMatch to increase their involvement in academic research and scholarship. Members of institutions without access to (increasingly expensive) proprietary academic databases will be able to use ScholarMatch to link to scholars around the world who work on topics of interest to them. ScholarMatch’s outreach and assessment efforts will include a specific focus on members of community colleges and traditionally underrepresented racial and ethnic groups.

Transforms Learning Experiences: By drawing students into scholarship, ScholarMatch will promote learning and discovery, even for those who are geographically isolated or at an educational institution without experts in their fields of interest. Uploading their own writings will make students part of an academic community. ScholarMatch allows students to directly compare their own interests to those of the scholars whose books and articles they have read.

2.3 Qualifications of Team

The PIs’ combined interdisciplinary expertise, proven record of successful topic modeling systems, and extensive experience with assessment position our team well to undertake this multi-disciplinary project.

Dr. David Newman has experience in probabilistic language modeling and building software systems. He has already built prototype versions of several components of this system. In 2006, he built the Calit2 browser (http://yarra.ics.uci.edu/calit2), which uses topic modeling to compare UC Irvine and UC San Diego researchers based on their publications. In 2007, Newman built the UC Irvine topic modeler (http://yarra.ics.uci.edu/sam), a demonstration topic modeling system for an outreach workshop for Cyber-Infrastructure for Humanities, Arts and Social Sciences. The UC Irvine topic modeler allowed workshop participants to upload their own writings for on-the-spot topic modeling. Thus, Newman has created prototypes of two crucial features of the ScholarMatch system. He has also already completed a trial topic modeling run of 40,000 abstracts from America: History and Life, one of the two databases that will be the basis of ScholarMatch.

Dr. Sharon Block, an historian, is the domain expert on this project who will provide advice about the interpretability of learned topics and the accuracy of topic matches made by ScholarMatch. Block has already collaborated with computer scientists (including Newman) on several projects and has published on innovative technological approaches to humanities research [3, 19]. A scholar of gender, race and sexuality, Block also has a strong background in working with underrepresented minorities and women. Finally, Block has been a consultant to digital document providers, serves on several digital humanities publisher advisory boards, and has conducted usability evaluations for the past four years. Block’s historical expertise, commitment to increasing diversity, experience with user interface evaluations, and proven track record as a collaborator on topic modeling projects will provide a useful skill set for ScholarMatch.

Dr. Bonnie Nardi is an Anthropologist by training, and is currently a Professor in a School of Information and Computer Science. Having expanded her initial academic training into a professional appointment in an Informatics department, Nardi has extensive experience working across traditional disciplinary boundaries. Equally importantly, Nardi is an expert on user testing and assessment. She will conceptualize, supervise and be responsible for answering the assessment research questions in this project.

3 Constituencies of Users

Multiple constituencies of users will benefit from finding and relating to scholars around the world. In this project, we will focus on two overarching communities: published scholars and unpublished scholars. Published scholars are in the system because they appeared in one of the databases (AHL or HA) over the past two decades. Unpublished scholars will create their own entry in the system and upload their own text content. These two communities apply to both users of the system and people in the system. By creating a system where these two groups can effectively interact with each other, we bridge the divides between researchers, teachers, and students.

ScholarMatch aims to assess the impact of this technology on various groups of users. As such, we separately break out four different constituencies of users for assessment: (1) undergraduate students; (2) community college teachers; (3) early career researchers (which include graduate students and scholars within five years of the Ph.D.); (4) established researchers. While we believe that other users (e.g., high-school students and intellectually curious members of the general public) will find the proposed technology useful, we focus on these broad groups because we anticipate that they will most directly benefit from ScholarMatch. Furthermore, these groups are more easily identifiable for user testing and assessment. We envision these groups using ScholarMatch in the following ways:

Group 1: Undergraduate students (likely unpublished scholars): Undergraduate students are increasingly comfortable relying on online resources and social networking websites like Facebook and MySpace. ScholarMatch builds on both these interests by allowing students to not only search for scholars online, but to place their own writings alongside published scholars. For instance, students doing a research project with historic documents or other primary sources might upload text from those documents to see whose published scholarship might best help them analyze these sources. ScholarMatch also promotes networking for students at institutions without experts in their field(s) of interest. A ScholarMatch analysis can allow a student working in an isolated or underserved institution to reach out to published scholars working on their precise area of interest. Finally, undergraduates can quickly see an overview of major scholars in an entire field to get a sense of whose work they should be reading to understand a thematic area.

Group 2. Community college teachers (some published and some unpublished scholars): Community colleges are more likely to employ faculty whose primary focus is teaching rather than research. Because topic modeling works well with unstructured text, community college teachers who don’t produce their own research might still be active members of ScholarMatch by uploading syllabi, lecture outlines, or even just course catalog summaries. This breaks down barriers between teaching and research and provides a means for teachers to effectively connect with one another based on their teaching areas.

Group 3. Early career scholars, including graduate students (some published and some unpublished scholars): Because our databases include M.A. and Ph.D. theses, graduate students and other early career scholars will automatically be part of the ScholarMatch database. Early career researchers can use ScholarMatch to find readers for their article and book manuscript drafts, locate panel participants for annual meetings, or find collaborators for anthologies, special journal issues or topically-based conferences. For early career scholars who did not have well-connected mentors at their graduate institutions, this ability to place themselves appropriately within a topical field of scholarship will allow them to more effectively advance their careers. Moreover, all early career scholars can benefit from access to the wider social network of academic specialists provided by ScholarMatch.

Group 4: Established researchers (published scholars): Established researchers can also take advantage of the networking opportunities and comprehensive picture of scholarly subject areas that ScholarMatch and topic modeling provide. Interdisciplinary work is becoming increasingly important in the humanities and social sciences, and finding collaborators outside one’s subfield is challenging even for established researchers. ScholarMatch will provide an accessible means for published scholars to see how their work relates to other users in ways that they might not have realized. ScholarMatch will also be of use to departments looking for appropriate tenure evaluators, editors looking for manuscript reviewers, or search committees looking for proven candidates in a particular subfield.

4 Technical Background Work

In year one, we will use multiple sources of data to build phase one of ScholarMatch. We have already downloaded over two decades of abstracts (1985-2007) from America: History and Life and Historical Abstracts. AHL and HA include abstracts from several thousand journals published world-wide, thousands of book and media reviews selected from one hundred key journals, and entries for dissertations and masters’ theses. Our downloads of AHL and HA, which include over 800,000 abstracts, will ensure that this project has sufficient starting content.

The second source in year one is publicly available content from the web, including scholars’ webpages and CVs harvested from those webpages. Starting with the list of over 1000 US and Canadian history department websites maintained by the American Historical Association, we will use semi-automated systems to harvest webpages and CVs that are found under department websites. Newman has already developed and deployed directed crawlers for this task.

The next step is to combine these two sources to create a single database. Citations in AHL and HA contain various pieces of information including: author(s), title, abstract, date, and journal or publication. Information such as the author’s email, affiliation and headshot image will be extracted either directly from department webpages, or from CVs or webpages located under department webpages. Since our focus is to develop new ways to match users with scholars, not new information extraction systems, we will leverage other work that has centered on the time-consuming tasks of information extraction and entity resolution [24, 30]. We expect that entity resolution, i.e. matching scholars listed in the 800,000 abstracts to scholars found from web crawls of history departments, will not be a major issue given that we will be starting from a relatively rich set of initial information (e.g., name, email, affiliation, title, list of publications, and text from abstracts). This is in contrast to the more usual and difficult case of trying to resolve between entities based on their name but not knowing their affiliation or list of published papers.
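To illustrate why rich starting information makes entity resolution easier, the following is a minimal sketch, not the method of the cited systems [24, 30]. It assumes a hypothetical record layout (`author`, `title`, `name`, `publications`) and made-up sample data, and it treats an abstract record and a harvested web record as the same scholar only when the normalized name agrees and the abstract's title also appears on the scholar's publication list.

```python
import re

def name_key(name):
    """Reduce 'Last, First M.' or 'First M. Last' to (last name, first initial).
    Assumes a non-empty name; real data would need more careful handling."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        first, last = parts[0], parts[-1]
    return last.lower(), first[:1].lower()

def title_key(title):
    """Lowercase a title and collapse punctuation so minor variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def same_scholar(abstract_record, web_record):
    """Match only if the names agree AND the abstract's title is on the CV."""
    if name_key(abstract_record["author"]) != name_key(web_record["name"]):
        return False
    cv_titles = {title_key(t) for t in web_record["publications"]}
    return title_key(abstract_record["title"]) in cv_titles

# Hypothetical records, invented for illustration only.
abstract = {"author": "Doe, Jane Q.", "title": "Labor and Land in the Rural West"}
web_page = {"name": "Jane Q. Doe",
            "publications": ["Labor and land in the rural West.", "Another Title"]}
other_page = {"name": "John Doe", "publications": ["An Unrelated Study"]}

matched = same_scholar(abstract, web_page)
not_matched = same_scholar(abstract, other_page)
```

Requiring a shared publication title in addition to the name is the design choice that makes the name match safe: two scholars with the same surname and initial are unlikely to list the same paper.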

For each scholar, we will have a rich set of attributes including: name, email, affiliation, title, list of links to publications, headshot (where possible), abstracts (from AHL and HA), other text (e.g. if found using Google Scholar), and co-authors. We will use statistical topic modeling to model the collection of abstracts and additional publications, and automatically summarize the text [2, 4-6, 8, 12, 29]. The topic model is an unsupervised Bayesian statistical language model that is based on the idea that each document is made up of one or more topics, where each topic is a thematic distribution over words. More specifically, the topic model is a mixture model, where each document is represented as a finite mixture of topics, i.e., each document is a multinomial distribution over topics, where each topic is a multinomial distribution over words. The model automatically learns topics by finding groups of words that tend to co-occur in documents. While not all writing has equal impact on or importance to a discipline, the topic model produces sensible topics without knowing this information because it learns from a huge amount of data.
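The mixture structure described above can be made concrete with a small sketch. The topics, words, and probabilities below are hypothetical toy values, not output from a model trained on AHL or HA; the point is only that the probability of a word in a document is the sum, over topics, of the topic's weight in the document times the word's probability in the topic.

```python
# Hypothetical toy model: two topics over a five-word vocabulary.
# Each topic is a multinomial distribution over words (values sum to 1).
topic_word = {
    "religion": {"church": 0.5, "religious": 0.3, "mission": 0.2},
    "politics": {"election": 0.5, "party": 0.3, "religious": 0.1, "church": 0.1},
}

def word_probability(word, doc_topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )

# A document is a multinomial distribution over topics (values sum to 1).
doc_topics = {"religion": 0.75, "politics": 0.25}

p_church = word_probability("church", doc_topics)

# Summing over the whole vocabulary should give 1, as for any distribution.
vocabulary = {word for words in topic_word.values() for word in words}
total = sum(word_probability(word, doc_topics) for word in vocabulary)
```

In an actual topic model both sets of distributions are learned from the corpus rather than written by hand; this sketch only shows how they combine once learned.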

In year two we will allow unpublished scholars and teachers not already in the system to create a page for themselves in ScholarMatch. We will then allow both published and unpublished scholars to upload writings and edit personal information. Users will access the system using email (as a unique login) and password, thereby eliminating any entity resolution issues. Content for unpublished scholars will be completely user-contributed, and as such, will grow organically. The project team may add additional textual material for individual scholars by ongoing searches of Google Scholar to locate, harvest, and add publicly available writing. Note that we are not relying on this additional text content for the success of ScholarMatch or our proposed assessment research.

To help make concrete the idea of automatically learned topics, we list below selected topics from our preliminary topic model run on a subset of 40,000 America: History and Life abstracts. Each topic lists words in order of likelihood (most likely word first), with an ellipsis indicating that there is no hard cutoff to the number of words in a topic. Each list clearly conveys a specific theme, which is encapsulated by the human-assigned label in square brackets at the start of each topic.

[RACE AND ETHNICITY] american identity cultural culture racial ethnic race immigrant group social community class ...

[U.S. WEST] texas land california mexican farm farmer mexico agricultural new-mexico rural arizona west ...

[METHODOLOGY] social historian historical studies theory understanding cultural scholar concept field approach context ...

[FILM AND MEDIA] film newspaper media press television coverage american popular public show theater hollywood ...

[INTERNATIONAL RELATIONS & COLD WAR] united-states american international cold-war relation china foreign-policy chinese war policy foreign government ...

[NATIONALISM] political public american nation view rhetoric national freedom moral debate liberty independence ...

Table 1. Selected topics learned by the topic model. Bracketed phrases are the domain-expert label given to the topic, and the lists of words are produced by the topic model.

4.6% Social and Cultural History
3.7% International Relations & Cold War
3.2% Environmental History
3.1% Politics and Elections
3.0% Archaeology and Prehistory
2.9% Business and Economics
2.9% Immigration
2.6% Nationalism
2.6% Education
2.5% Women, Gender and Sexuality
2.4% Urban Studies
2.4% Civil War
2.4% Race and Ethnicity
2.4% Legal and Constitutional History
2.3% Twentieth-Century Military History
2.3% African American History
2.3% Science and Technology
2.2% Film and Media
2.2% Death and Violent Crime
2.1% Religious Studies
2.1% US West
2.1% Museums and Historical Societies
2.0% Colonial America
2.0% Native American History

Table 2. Sample topic labels from a preliminary Topic Model of America: History and Life, 2001-2006

We focus on topics because they will be the basis for characterizing publications, scholars, and uploaded writings. By aggregating topic counts we can measure the relative prevalence of topics. For example, Table 2 shows (for the sample topic model run of AHL) that 4.6% of the content (abstracts) in America: History and Life is about SOCIAL AND CULTURAL HISTORY, while less than half that share (2.0%) is about NATIVE AMERICAN HISTORY. Alternatively, by aggregating topic counts over a single scholar, we see in Table 3 that Sharon Block’s published work is almost equally about CRIME AND VIOLENCE and WOMEN, GENDER & SEXUALITY. Given that much of Block’s work has been on the history of rape, this suggests that the topic model is accurately representing researchers’ topical specialties. A user interested in Block’s profile could then use ScholarMatch to find other scholars (or publications) in WOMEN, GENDER & SEXUALITY, or even find scholars who best match Block’s particular mix of topics.
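Aggregating topic counts, as described above, amounts to summing per-abstract counts and normalizing. The sketch below assumes a hypothetical data layout in which each abstract carries the number of its words assigned to each topic; the author names and counts are invented for illustration and are not taken from AHL.

```python
from collections import Counter

# Hypothetical per-abstract topic counts (words assigned to each topic).
abstracts = [
    {"author": "Scholar A", "topics": {"CRIME AND VIOLENCE": 30, "WOMEN AND GENDER": 20}},
    {"author": "Scholar A", "topics": {"WOMEN AND GENDER": 10, "COLONIAL AMERICA": 40}},
    {"author": "Scholar B", "topics": {"US WEST": 60, "COLONIAL AMERICA": 40}},
]

def topic_shares(records):
    """Sum topic counts over the given abstracts and normalize to fractions."""
    totals = Counter()
    for record in records:
        totals.update(record["topics"])
    grand_total = sum(totals.values())
    return {topic: count / grand_total for topic, count in totals.items()}

# Prevalence over the whole collection (the kind of figure shown in Table 2).
corpus_shares = topic_shares(abstracts)

# Prevalence over one scholar's abstracts (the kind of figure shown in Table 3).
scholar_a_shares = topic_shares(
    [record for record in abstracts if record["author"] == "Scholar A"]
)
```

The same function serves both views because the only difference between a collection-level and a scholar-level breakdown is which abstracts are included in the sum.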

15% Crime and Violence
14% Women, Gender & Sexuality
10% Teaching and Methodology
10% Colonial America
6% African American History
5% Media and Print Culture
4% Legal History

Table 3. Topical Breakdown of Scholarship by Sample Published Scholar, Sharon Block

[Figure 1: two line plots; x-axis: year, 2001 to 2006; y-axis: % of abstracts]

Panel 1 (y-axis range 1 to 2.5): [Religion] church religious catholic baptist christian mission irish protestant ...

Panel 2 (y-axis range 2.5 to 4.5): [Politics and Elections] political election party presidential campaign voter state ...

Figure 1. Topic trends. From 2001 to 2006, there was a decrease in the relative amount published in the History of Religion and an increase for Political and Election History.

Once a topic model is learned, the underlying probabilistic framework allows great flexibility in the range of queries users can pose. We can answer simple queries such as: who are the most prolific researchers in a particular topic, or what publications are most relevant to a given topic. We can also answer more complex queries such as which scholars are most similar (topically) to a given scholar, and what departments have scholars in a particular field (e.g., Latin American Studies) most similar in aggregate to those at another institution. Users will also be able to track topic trends using the topic counts, and determine, as Figure 1 shows, which subject areas are increasing or decreasing in popularity over time.
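A query such as "which scholars are most similar to a given scholar" can be sketched as a ranking over topic profiles. The proposal does not name a similarity measure, so the cosine similarity used here is an assumption, and the scholar names and topic proportions are hypothetical.

```python
import math

# Hypothetical topic profiles: each scholar's share of writing per topic.
profiles = {
    "Scholar A": {"GENDER": 0.6, "CRIME": 0.4},
    "Scholar B": {"GENDER": 0.5, "CRIME": 0.3, "LAW": 0.2},
    "Scholar C": {"US WEST": 0.9, "LAW": 0.1},
}

def cosine(p, q):
    """Cosine similarity between two sparse topic vectors (dicts)."""
    dot = sum(value * q.get(topic, 0.0) for topic, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)

def most_similar(name, all_profiles):
    """Rank every other scholar by similarity to the named scholar."""
    target = all_profiles[name]
    scores = [
        (other, cosine(target, profile))
        for other, profile in all_profiles.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("Scholar A", profiles)
```

Here Scholar B ranks first because the two profiles share their dominant topics, while Scholar C shares none and scores zero. Other measures over probability distributions could be substituted without changing the structure of the query.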

Figure 2 shows an example of the list of research topics and related faculty that were automatically inferred for a particular faculty member by the Calit2 topic-browsing system built by Newman. Newman’s system crawled the Web pages for 460 faculty members, downloaded 12,000 publications, extracted bags of words, and ran a statistical topic model with 200 topics. The resulting topics were manually named and a browser was built that automatically constructs ranked lists of topics by faculty members, allowing a user to navigate among faculty and topics in an intuitive manner. The system is online at http://yarra.calit2.uci.edu/calit2. This system demonstrates that topic modeling is a sensible basis for finding, based on published writing by individuals, scholars who relate to one another.

home | researchers | research topics

RUIZ, VICKI LYNN

HISTORY
SCHOOL OF HUMANITIES
UCI
email: [email protected]
URL: http://www.humanities.uci.edu/history/faculty/ruiz/
(6 papers collected)

Research topics:

(45%) [ gender and race ] women black gender american white occupation job female rape housing race ...
(27%) [ politics and society ] political social policy economic china law government national legal ...
(12%) [ education and strategy ] student game question action player strategy experience learning team ...

Related researchers (UCSD,UCI) :

(1.0) GARCIA BEDOLLA, LISA
(0.9) FRANK, ROSS
(0.9) BLOCK, SHARON B.
(0.9) BASOLO, MARY V.
(0.9) COHEN, PHILIP
(0.8) HUFFMAN, MATTHEW L.
(0.8) PETERSILIA, JOAN R.
(0.7) WIENER, JONATHAN M.

Figure 2: Calit2 browser example of a faculty profile automatically derived using topic models. Ruiz publishes in gender and race, politics and society, and education and strategy. A list of related researchers is also displayed. (Because this prototype focused primarily on scientists and social scientists, the research topics are broader than they will be on ScholarMatch.)

5 Managing and Sustaining the ScholarMatch Website

Here we describe some of the planned functionality and detail some of the management and sustainability issues for ScholarMatch over the three-year grant period.

Year 1

Functionality:
- Published Scholar pages: displays contact info, headshot, publications, topics, related scholars
- Topic pages: displays list of scholars who publish in this topic, display time trends for topic
- Users can keyword search database of 10,000+ historians (searches name, titles of publications, abstracts)
- Users can upload writing for temporary topic modeling to find related topics and scholars with similar topical interests
- Advanced search will allow more flexible queries (e.g. group by department or subfield, graph trends)

Data Sources:
- Historical Abstracts
- American History and Life
- Harvested CVs and webpages
- Google Scholar

Years 2 and 3

Functionality (Year one functionality plus):
- Allows unpublished scholars to create own page in ScholarMatch (access via login and password)
- Allows all scholars to permanently topic model uploaded materials
- Allows all scholars to upload and add to list of own writings
- Allows published scholars to correct/edit/populate contact information, affiliation, headshot (access via login and password)
- Allow users to search and browse published scholars and/or unpublished scholars
- Other advanced search and browse features

Data Sources:
- User-contributed writings
- Harvested CVs and webpages
- Google Scholar

Table 4. Rollout of data sources and functionality for the ScholarMatch system over the three-year period.

ScholarMatch will not keep or display any abstract or full-text content, which will avoid licensing, copyright, or plagiarism issues. Instead, ScholarMatch will maintain a list of links to publications and citations for each published and unpublished scholar and, via topic modeling, a topical description of each person in the system. Uploaded writing will be digested by the system and topic modeled, but not stored or redisplayed.

The ScholarMatch system will need several safeguards against spamming and other inappropriate use. ScholarMatch will have a system administrator (Newman), who will control who is in the system and will be able to disable the display of particular pages if necessary. A language filter may also be used to prevent misuse of ScholarMatch. Finally, topic modeling may be used to separate relevant content from uploaded spam: by topic modeling uploaded writing, the system can assess whether a particular piece of writing is likely to be spam, and the sysadmin can be notified of exceptions or uncertain cases. Users will always be able to email the sysadmin with concerns or issues, or for help and information.

Beyond the duration of the project, we anticipate that ScholarMatch will need only minimal resources to guarantee a sustained, working system. A comparable successful model is the UC Irvine Machine Learning Archive (www.ics.uci.edu/~mlearn), which has been a highly trafficked working site for over a decade and is supported with minimal effort.

6 Research Questions

This research aims to link technological innovation in text mining and knowledge discovery with an assessment of real-world use. We see significant research questions in both text mining and assessment.

In the area of text mining, research questions include:
- How can accurate models of a scholar’s interests be inferred from multiple heterogeneous and unstructured text sources?
- How accurately can we model the topical similarity between a single user-uploaded text document and a huge collection of text documents?
- How can we jointly model topics across two or more disparate text collections (edited prose from published scholars, and uploaded text content from unpublished scholars)?

In terms of assessing impact on education, learning and teaching, research questions include:
- What are the specific ways in which ScholarMatch is useful across the four core constituencies of users (undergraduates, community college teachers, early career researchers, and established researchers)?
- Can the technology increase the participation of traditionally underrepresented and underserved groups?
- In what ways can the technology be extended to meet the needs of the constituencies?

6.1 Text-Mining Research Questions

Advancing Browsing, Querying and Interest-Matching Capabilities

As discussed earlier, topic models provide a useful statistical framework for linking words across documents. Based on prior work with modeling portions of the CiteSeer digital library, we estimate that 1,000 topics will be needed to adequately represent the full breadth of historical research in our 800,000 abstracts. The statistical topic models will be stored in a SQL database, with a topic-word table for each topic, author-topic distributions for each scholar, and document-topic distributions for each abstract. The distributions can be efficiently stored as sparse sets of counts from word-topic assignments from multiple Gibbs samples [25, 26]. The sparse counts can be smoothed and normalized for inference purposes at query time.
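The smoothing and normalization step can be illustrated with a minimal sketch. It assumes a symmetric Dirichlet-style smoothing constant alpha, which is one common choice and not necessarily the one ScholarMatch will use; the function name and the example counts are hypothetical.

```python
def smooth_topic_counts(sparse_counts, num_topics, alpha=0.1):
    """Turn sparse topic counts into a full probability distribution.

    sparse_counts maps topic id -> number of word tokens assigned to that
    topic (e.g. summed over Gibbs samples); zero-count topics are omitted.
    alpha is an assumed symmetric smoothing constant.
    """
    total = sum(sparse_counts.values()) + alpha * num_topics
    return [(sparse_counts.get(t, 0) + alpha) / total
            for t in range(num_topics)]


# Hypothetical scholar: 45 tokens in topic 2 and 27 in topic 0, 4 topics total.
theta = smooth_topic_counts({2: 45, 0: 27}, num_topics=4, alpha=0.5)
```

Only the nonzero counts need to be stored; topics absent from the sparse set still receive a small nonzero probability at query time.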

The research task is to implement this efficient, flexible probabilistic query processing in ScholarMatch and to assess the accuracy of the query results. Successful topic models will allow a user to interactively browse and discover scholars and sub-disciplines in history, and to answer questions beyond the standard functionality of current online resources or databases:

- On what overall combination of topics does Martha Hodes at New York University work?
- What department’s faculty are the most topically similar to Yale’s faculty in modern Chinese history?
- What universities have the most active researchers on post-colonial settings?
- What scholars’ research profiles best match that of Founding Father biographer, Drew McCoy?

Because the topic model is a probabilistic model (in effect a very large Bayesian network defined over words, documents, topics, and authors) queries such as these can be formulated as probabilistic inference, namely the calculation of conditional probabilities for events of interest given conditioning information. Thus, for example, finding universities doing research on post-colonial histories requires that we first infer an aggregate topic distribution for the set of researchers at each university. In turn, finding the topic distribution for a set of researchers amounts to calculating a mixture of topic distributions over documents authored by members of the set – this mixture distribution can be quickly computed by summing appropriate counts in the document-topic matrix. The universities can then be ranked by how much probability mass they put on topics related to the topic of interest (i.e. post-colonial histories).
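The university-ranking example above can be sketched as follows. This is an illustration under simplifying assumptions: each document is attributed to a single university, the smoothing constant alpha is assumed, and all names and counts are hypothetical.

```python
from collections import defaultdict


def rank_universities(doc_topic_counts, doc_university, topics_of_interest,
                      num_topics, alpha=0.1):
    """Rank universities by probability mass on the topics of interest.

    doc_topic_counts: {doc_id: {topic_id: count}} sparse document-topic counts.
    doc_university:   {doc_id: university}, a hypothetical mapping derived
                      from the affiliation of each document's author.
    """
    # Sum document-topic counts over all documents from each university.
    totals = defaultdict(lambda: defaultdict(float))
    for doc_id, counts in doc_topic_counts.items():
        university = doc_university[doc_id]
        for topic, n in counts.items():
            totals[university][topic] += n

    scores = {}
    for university, counts in totals.items():
        denom = sum(counts.values()) + alpha * num_topics
        # Smoothed probability mass placed on the chosen topics.
        scores[university] = sum((counts.get(t, 0.0) + alpha) / denom
                                 for t in topics_of_interest)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


doc_topic_counts = {
    "d1": {0: 30, 1: 10},
    "d2": {0: 5, 2: 35},
    "d3": {1: 40},
}
doc_university = {"d1": "Univ A", "d2": "Univ A", "d3": "Univ B"}
ranking = rank_universities(doc_topic_counts, doc_university,
                            topics_of_interest=[0], num_topics=3)
```

Here topic 0 stands in for the topic of interest; in practice several related topics could be passed in `topics_of_interest` and their mass summed.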

From the viewpoint of the user interface the underlying probability calculations will be hidden. Real-time query-answering takes place by parsing high-level user queries in the ScholarMatch system, sending the appropriate SQL queries to the database to generate the relevant sets of counts, and then aggregating, smoothing, and ranking the results.
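A minimal sketch of this query path is shown below, using Python's built-in sqlite3 module as a stand-in for the production database. The table name, column names, smoothing constant, and data are all hypothetical; the actual ScholarMatch schema is not specified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.execute(
    "CREATE TABLE author_topic (author TEXT, topic INTEGER, n_tokens INTEGER)")
conn.executemany("INSERT INTO author_topic VALUES (?, ?, ?)", [
    ("scholar_a", 7, 120), ("scholar_a", 3, 40),
    ("scholar_b", 7, 15), ("scholar_b", 5, 90),
    ("scholar_c", 7, 60),
])


def top_scholars_for_topic(conn, topic, num_topics, alpha=0.1, limit=10):
    """Rank scholars by smoothed P(topic | scholar) from stored counts."""
    # SQL generates the relevant sets of counts for every scholar.
    rows = conn.execute("""
        SELECT author,
               SUM(CASE WHEN topic = ? THEN n_tokens ELSE 0 END) AS in_topic,
               SUM(n_tokens) AS total
        FROM author_topic
        GROUP BY author
    """, (topic,)).fetchall()
    # Smoothing and ranking happen outside the database.
    scored = [(author, (in_topic + alpha) / (total + alpha * num_topics))
              for author, in_topic, total in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


results = top_scholars_for_topic(conn, topic=7, num_topics=10)
```

Keeping the database responsible only for counting, and doing the smoothing and ranking in application code, is one possible division of labor; other splits are equally plausible.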

Interactive Real-Time Topic Modeling

A central feature of ScholarMatch that, to our knowledge, does not exist in any other digital library system will be the ability to instantaneously topic model writing uploaded by users. The learned topical representation of the uploaded writing will then be used to locate scholars or others in the system with similar topical interests. Developing language models to support this functionality will allow us to explore a range of questions:

- How can we make the topic match reliable when the document being uploaded is short or noisy (e.g. a draft paragraph about a research idea)?
- How can we improve inference about intended topics of uploaded text?
- How do we model text that contains vocabulary not yet seen by ScholarMatch?

We will investigate and develop specific models that address these questions in the context of ScholarMatch. For example, developing a topic model that incorporates background knowledge (say, from an ontology or Wikipedia) into topic extraction and inference may help with uploaded documents that contain a significant amount of new vocabulary. In this case, the background knowledge will help the topic model make sense of these unseen terms. Developing these probabilistic models will also require significant usability research to measure improvements as perceived and reported by real users.
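As a baseline for the questions above, an uploaded text can be "folded in" to an already learned topic model. The sketch below uses a simple EM fold-in with the topics held fixed, which is a stand-in technique rather than the models this research will develop. It sets unseen vocabulary aside instead of modeling it, and the topics, scholar profiles, and smoothing constant are hypothetical.

```python
import math
from collections import Counter


def fold_in(tokens, topic_word, alpha=0.1, iterations=50):
    """Estimate a topic mixture for new text while keeping topics fixed.

    topic_word: list of {word: P(word | topic)} dicts, one per learned topic.
    Returns (theta, unseen), where unseen lists out-of-vocabulary tokens.
    """
    vocab = {w for dist in topic_word for w, p in dist.items() if p > 0}
    counts = Counter(t for t in tokens if t in vocab)
    unseen = [t for t in tokens if t not in vocab]
    k = len(topic_word)
    theta = [1.0 / k] * k
    n_words = sum(counts.values())
    for _ in range(iterations):
        expected = [0.0] * k
        for word, c in counts.items():
            # E-step: responsibility of each topic for this word.
            weights = [theta[t] * topic_word[t].get(word, 0.0)
                       for t in range(k)]
            z = sum(weights)
            for t in range(k):
                expected[t] += c * weights[t] / z
        # M-step with an assumed symmetric smoothing constant alpha.
        theta = [(expected[t] + alpha) / (n_words + alpha * k)
                 for t in range(k)]
    return theta, unseen


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


topic_word = [
    {"labor": 0.5, "union": 0.3, "strike": 0.2},
    {"gender": 0.6, "women": 0.3, "labor": 0.1},
]
tokens = "women labor union strike labor blockchain".split()
theta, unseen = fold_in(tokens, topic_word)

scholars = {"scholar_x": [0.9, 0.1], "scholar_y": [0.2, 0.8]}
best = max(scholars, key=lambda name: cosine(theta, scholars[name]))
```

With a very short upload, the smoothing constant pulls the estimate toward a uniform mixture, which is one reason short or noisy documents are harder to match reliably.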

Investigating New Algorithms for Multi-Source Topic Models

The standard topic modeling framework is intended for documents from a single corpus, e.g., a set of scientific abstracts from CiteSeer. In contrast, we will have multiple heterogeneous sets of text. Each individual might have text from publication abstracts, from web pages and CVs, and from self-uploaded writings. We will investigate new algorithms for multi-source topic modeling, namely, inferring topic models using multiple sets of text-based information.

A baseline strategy will be to first learn a topic model from published abstracts (the most reliable source) in the standard manner and "freeze" the resulting topic representation. We can then use Gibbs sampling to infer how to topically represent each scholar using both the abstracts and additional words associated with that scholar from crawled or contributed content. We will compare this with a more ambitious approach where the topic models are learned from multiple sources in a single pass, using a hierarchical graphical model (e.g., similar in spirit to that of [28]). We will explore techniques that allow the different sources to have different weights in terms of their reliability and relevance to the technical topics, including weights assigned manually based on human heuristics as well as probabilistic source weights that are inferred during the topic learning. These different approaches will be quantitatively compared using standard language modeling measures such as perplexity scores on test documents.
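For reference, perplexity on test documents can be computed as sketched below. The sketch assumes that a topic mixture has already been inferred for each test document and that topics are stored as word-probability dictionaries; the function name, the probability floor, and the toy data are hypothetical.

```python
import math


def perplexity(test_docs, thetas, topic_word, floor=1e-12):
    """Per-token perplexity of test documents under a topic model.

    test_docs:  list of token lists.
    thetas:     one topic mixture per test document (assumed already inferred).
    topic_word: list of {word: P(word | topic)} dicts.
    floor:      assumed lower bound that avoids log(0) for unseen words.
    """
    log_lik = 0.0
    n_tokens = 0
    for tokens, theta in zip(test_docs, thetas):
        for word in tokens:
            # P(word | doc) is a mixture of the topic-word probabilities.
            p = sum(theta[t] * topic_word[t].get(word, 0.0)
                    for t in range(len(theta)))
            log_lik += math.log(max(p, floor))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Sanity check: one topic that is uniform over four words gives perplexity 4.
uniform_topic = [{"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}]
pp = perplexity([["a", "b", "c"]], [[1.0]], uniform_topic)
```

Lower perplexity means the model assigns higher probability to the held-out text, so competing multi-source approaches can be compared on the same test documents.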

We will extend this multi-source model development in year two, when we open ScholarMatch to unpublished/additional scholars. After these additional scholars have uploaded a relatively large collection of text documents, we will have a substantial body of new text. This new corpus will be categorized by title, e.g. undergraduate, graduate, teacher, researcher, independent scholar, etc. To have a unifying basis for what will certainly be many different levels and kinds of writing, we will develop topic models that learn a single set of topics over multiple sources (which typically have different writing styles, and perhaps levels of sophistication). The challenge is to link the topic extraction across these multiple heterogeneous corpora that differ in content and style. We would like to “borrow strength” from the collection from published scholars, without this corpus (initially) taking over all representational resources. Developing these models will allow us to explore a wide range of questions:

- Does the respective size of the various corpora affect the learned topics?
- How can we design a model so that the more reliable content (published scholars) is not swamped by user-contributed content, which may become progressively large, noisy and unrelated?
- Can we develop models that accurately predict from which group (undergraduate, graduate, teacher, established researcher) a particular document comes? This information would be extremely useful so that users can specify with which group(s) of scholars in the system they wish to be matched.

6.2 Assessment Research

One of our main goals is to make ScholarMatch a usable, user-friendly and useful system. Accordingly, we will heavily emphasize assessment research and make ongoing improvements to the user interface and search capabilities to best meet the needs of a diverse community of users. Nardi has performed qualitative and quantitative assessments of Internet technologies such as instant messaging, video, blogging, gaming, and a system she and colleagues designed, ContactMap [13-18, 31].

For each of our four constituencies (undergraduates, community college teachers, early career scholars and established researchers), we will identify users and deploy and test the technology. (Our focus will not be on usability except inasmuch as it is necessary for a testable prototype.) We will select an array of users with attention to gender, racial and geographic diversity, and specifically address questions of how ScholarMatch can increase access for diverse segments of the academic community. We will also select users from across disciplinary subfields to ensure representativeness in that respect as well.

We will use qualitative and quantitative methods, following the methodologies set forth in [1, 11, 23]. The assessment will occur in three phases and will address four key questions:
- Is this system better than a conventional keyword search?
- What value do users find in the system?
- How can the system be extended and improved to more effectively meet user needs?
- How would a system such as ScholarMatch, which lets users find and relate to scholars around the world, change practice for undergraduates, community college teachers, early career scholars, and established researchers?

Our approach is a variant of “patchwork prototyping” [10] in which high-fidelity prototypes are tested with users for “requirements gathering which is not purely need-based, but also opportunity- and creativity-based” [7]. Given the nature of the system we are developing, iteration is likely to be somewhat slower than under ideal patchwork prototyping conditions. But we subscribe to the general philosophy of involving users with functioning systems to which they can respond with reflective comments and feedback, rather than relying only on paper prototypes or other typical participatory design methods [7]. Floyd et al. point out that when users can provide meaningful feedback in a timely manner based on real usage, the result is more likely to be useful guidance that can be vital in improving the technology for effective deployment.

Phase One of Assessment

Phase One of the assessment will comprise qualitative assessments from users. We want to know what the experience of using different technologies (e.g., Google, proprietary databases, and the ScholarMatch system) is like for connecting with scholars, what users like and dislike about each, how they would extend our technology if they could, whether they see the technologies as complementary or one as clearly better than the other, what changes to our technology they would recommend, and what actions they took as a result of using the technologies. We will not attempt to impose strict experimental conditions but will simply ask users to use both systems in their own fashion. We will make observations of actual use and interview users about their experiences, asking them to conduct some inquiries with the systems.

Each user will participate in an in-depth audiotaped semi-structured interview after using the technologies for at least two weeks. This trial will enable us to understand users’ experiences with a keyword search engine and with our technology. We will provide technical support to make sure that participants can use our technology with ease.

In Phase One we will deploy the technology with local users. All of the user communities are readily available to us locally. We can work with local undergraduate students and historians at the many higher education institutions in southern California.

Phase Two

In Phase Two we will modify the technology based on the qualitative assessment from Phase One and gather quantitative data from a larger sample. We will advertise the technology in relevant places for each of our communities to attract users, such as appropriate listservs and forums. For some constituencies we will personally contact potential users to make sure we have an adequate sample.

The survey instrument will be designed in accordance with what is learned in the qualitative assessment to reflect users’ interests, concerns, and vocabulary. The surveys will be tailored to each constituency. Analysis will focus on simple descriptive statistics. We will then present these findings to our constituencies for discussion. This interaction is another means of getting feedback from users and engaging them in the design process. Such users will include Phase One users and others who take our survey. In Phase Two we will expand our user base to geographically diverse institutions as well as different categories of institutions (e.g., liberal arts colleges, community colleges, doctoral-granting universities).

Phase Three

In Phase Three we will identify a small group of users in each constituency who wish to continue using the technology. We will interview them on a regular basis to see how use of the system changes their practice. Such interviews will take place in various media (e.g., phone, face to face, email) to get a wide range of users (rather than just local users whom we can interview face to face). Such opportunistic methods are characteristic of ethnography, a field in which Nardi trained as an anthropologist [27]. We believe we can best get detailed understandings of users’ experiences with the proposed technology through the range of methods from in-depth qualitative interviews to large-scale surveying.

The question regarding changing practice with use of the proposed system is the most difficult to answer. Within the time frame of the proposed research we can provide preliminary answers regarding how our technology would change practice. We will pay careful attention to the actions taken as a result of the use of the technology. Phase Three will allow us to follow some users over an extended period of time to gain deeper understandings about the technology and its impact on different kinds of users.

Approval of the use of human subjects will be obtained from the University of California, Irvine Institutional Review Board before the work commences.

7 Timeline

Months 0-6
- Text Mining: Run topic model on collection of 800,000 historical abstracts. Extract list of published scholars. Crawl 1000 history department webpages to find faculty members’ webpages and CVs.
- ScholarMatch Development: Design database backend and basic user interface. Develop PHP/MySQL browser with read-only (no upload) functionality.
- Assessment: Preliminary evaluations of ScholarMatch, starting with early career researchers and tenured history professors (Block’s colleagues).
- Dissemination: None.

Months 6-12
- Text Mining: Perform information extraction on faculty webpages and CVs to extract name, email, affiliation, headshot, and list of publications. Perform initial join (using entity resolution) of these scholars and published scholars from 800,000 abstracts. Search and harvest from Google Scholar.
- ScholarMatch Development: Develop upload functionality where users can upload text and perform instant topic modeling.
- Assessment: Preliminary evaluations of ScholarMatch, starting with graduate students (graduate students of Block’s colleagues).
- Dissemination: None.

Months 12-18
- Text Mining: Develop multi-source topic models, and models for real-time topic modeling of uploaded documents (which may be short, noisy, or contain new vocabulary). Search and harvest from Google Scholar.
- ScholarMatch Development: Open system to allow other/unpublished scholars to create pages for themselves in ScholarMatch. Implement authentication (login using email and password).
- Assessment: Preliminary research via interviews with users in the four constituencies. Assess initial implementation with the four constituencies locally.
- Dissemination: Write and submit papers on core text mining and assessment research.

Months 18-24
- Text Mining: Test predictive power and accuracy of topic models under development.
- ScholarMatch Development: Revise ScholarMatch functionality based on assessment findings.
- Assessment: Design survey instrument. Conduct online survey; identify geographically and institutionally diverse users for in-depth study.
- Dissemination: None.

Months 24-30
- Text Mining: Search and harvest from Google Scholar. Perform additional crawls to find email, affiliation and headshots for all people in ScholarMatch.
- ScholarMatch Development: Continue to improve ScholarMatch.
- Assessment: Follow users for in-depth study; analyze results.
- Dissemination: Write and submit papers on new topic models developed; write and submit papers on user response to technology.

Months 30-36
- Text Mining: None.
- ScholarMatch Development: Prepare for ongoing operation of ScholarMatch.
- Assessment: Complete assessment research.
- Dissemination: Write and submit papers on full system, with accompanying assessment results.

Table 5. Timeline of Text Mining, Development, Assessment and Dissemination activities.
