Recommending Targeted Strangers for Answering Questions on Social Media — Jalal Mahmud, Michelle Zhou, Nimrod Megiddo, Jeffrey Nichols, Clemens Drews, IBM Research – Almaden, San Jose, CA


Page 1: Response modeling-iui-2013-talk

Recommending Targeted Strangers for Answering Questions on Social Media

Jalal Mahmud, Michelle Zhou, Nimrod Megiddo, Jeffrey Nichols, Clemens Drews
IBM Research – Almaden, San Jose, CA

Page 2: Response modeling-iui-2013-talk

The Buzz of the Crowd

Hundreds of millions of people express themselves on social media daily:
- Location-based information
- Status updates
- Sentiment about products or services
- 400+ million tweets daily; 3.2 billion Facebook likes and comments daily

The buzz of the crowd creates a unique opportunity for building a new type of crowd-powered information collection system. Such a system will actively identify and engage the right people at the right time on social media to elicit desired information, e.g.:
- current wait time at a restaurant
- airport security wait time
- information in an emergency situation

In our initial experiment, in which we manually selected strangers based on their ability to answer questions, we achieved a 42% response rate [Nichols et al. 2012].

Page 3: Response modeling-iui-2013-talk

Our System - qCrowd

- Monitors the Twitter stream to identify relevant posts
- Evaluates the authors of identified posts and recommends a subset of people to engage
- Generates questions and sends them to each selected person
- Analyzes received responses and synthesizes the answers together

How do we identify strangers who are willing, able, and ready to provide the requested information? The ability to provide information is domain dependent, such as being at a location or having knowledge about a product/service, so we use a set of rules to determine ability.

Page 4: Response modeling-iui-2013-talk

Key Contributions

- Features: a set of features that are likely to impact one's willingness and readiness to respond.
- Prediction of Response Likelihood: a statistical model to infer the contribution of each feature to one's willingness and readiness, which is used to predict one's likelihood to respond.
- Recommendation Algorithm: an algorithm that automatically selects a set of targeted strangers to maximize the overall response rate of an information request.
- Effectiveness: demonstrated effectiveness in real-world scenarios, with insights for building a new class of crowd-powered intelligent information collection systems.

Page 5: Response modeling-iui-2013-talk

Outline
- Background: Buzz of the Crowd
- Our System: qCrowd
- Key Contributions
- Active Engagement and Data Collection
- Baselines
- Features
- Statistical Model & Recommendation Algorithm
- Evaluation
- Summary and Future Work

Page 6: Response modeling-iui-2013-talk

Active Engagement and Data Collection

TSA-tracker Question Datasets: our first two datasets were obtained while collecting location-based information (airport security check wait times) via Twitter.
- @bbx If you went through security at JFK, can you reply with your wait time? Info will be used to help other travelers.

Product Question Dataset: collected by asking people on Twitter who had described their product/service experience.
- @johnny Trying to learn about tablets...sounds like you have Galaxy Tab 10.1. How fast is it?

Domain          # of Questions   # of Responses   Response Rate
TSA-tracker-1   589              245              42%
TSA-tracker-2   409              134              33%
Product         1540             474              31%

Page 7: Response modeling-iui-2013-talk

Baseline: Asking Random Strangers

Sent questions to random people on Twitter:
- @needy Doing a research about your local public safety. Would you be willing to answer a related question?
- @john Doing a survey about your local school system. Would you be willing to answer a related question?
- @dolly Collecting local weather data for a research. Would you tell us what your local weather was last week?

Domain          # of Questions   # of Responses   Response Rate
Weather         187              7                3.7%
Public Safety   178              6                3.4%
Education       101              3                3.0%

It is ineffective to ask random strangers on social media without considering their willingness, ability, or readiness to answer.

Page 8: Response modeling-iui-2013-talk

Baseline: Crowd as Human Operator

We crowd-sourced a human operator's task to test a crowd's ability to identify the right targets, and to learn what criteria a crowd would use to identify targeted strangers. We conducted two surveys on CrowdFlower, a crowd-sourcing platform:
- Willingness Survey: asked each participant to predict whether a displayed Twitter user would be willing to respond to a given question, assuming that the user has the ability to answer.
- Readiness Survey: asked each participant to predict how soon the person would respond, assuming that s/he is willing to respond.

Participants were also required to explain their predictions.

Page 9: Response modeling-iui-2013-talk

Willingness Survey

- Randomly picked 200 users from each of our datasets.
- Recruited 100 participants from CrowdFlower; each participant was given 2 randomly selected users to judge.
- Participants were asked to predict whether the displayed Twitter user would respond; predictions were compared with the user's actual behavior (responded/not responded).

Correctness:
- 29% correct when only a user's tweets were displayed.
- 38% correct when the complete Twitter profile was displayed.
- The task of selecting users for question asking is thus difficult for the crowd as well.

Top Predictors:
- Past responsiveness and interaction behavior (57.6%): "The user seems extremely social, both asking questions and replying to others."
- Profile information (10.45%): "Because him being a social media guy and his tagline saying 'we should hang out'."
- General tweeting activity (10.45%): "This user tweets a lot, seems very chatty."
- Personality (7.4%): "I think he won't respond. Doesn't seem to be very friendly."
- Retweeting behavior (6%): "No. Most of the tweets are retweets instead of anything personal."

Page 10: Response modeling-iui-2013-talk

Readiness Survey

Participants judged how soon a person would respond to an information request, assuming that the person would respond.
- Used a multiple-choice question with varied time windows as choices.
- Randomly selected 100 people from our collected datasets.
- Recruited 50 participants on CrowdFlower; each was given two randomly chosen people and their Twitter handles.

Computed prediction correctness by comparing with ground truth. For example, if a participant predicted that person X would respond within an hour, but the response was not received in time, the prediction is counted as incorrect.

Correctness: 58% correct predictions.
Top Predictors:
- Promptness of response: 30%.
- Activeness and steadiness of Twitter usage: 25%.

Page 11: Response modeling-iui-2013-talk

Key Features for Selection of Strangers: Responsiveness Features

We hypothesize that one's willingness to respond to questions is related to one's past response behavior.

Responsiveness Feature   Computation
Mean Response Time       Avg(T), where T denotes previous response times
Median Response Time     Med(T)
Mode Response Time       Mod(T)
Max Response Time        Max(T)
Min Response Time        Min(T)
Past Response Rate       NR/ND, where NR is the number of the user's responses and ND is the number of direct questions the user was asked on Twitter
Proactiveness            NR/NI, where NI is the number of indirect questions the user was asked on Twitter
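As a minimal sketch (not the authors' implementation), the responsiveness features above could be computed from counts extracted from a user's timeline; the function and field names here are hypothetical:

```python
from statistics import mean, median, mode

def responsiveness_features(response_times, n_responses, n_direct, n_indirect):
    """Compute the responsiveness features from past behavior.

    response_times: elapsed times (e.g., minutes) of the user's previous
    responses (the set T in the table above).
    n_responses (NR), n_direct (ND), n_indirect (NI): counts as defined above.
    """
    return {
        "mean_response_time": mean(response_times),
        "median_response_time": median(response_times),
        "mode_response_time": mode(response_times),
        "max_response_time": max(response_times),
        "min_response_time": min(response_times),
        # Guard against users who were never asked a question.
        "past_response_rate": n_responses / n_direct if n_direct else 0.0,
        "proactiveness": n_responses / n_indirect if n_indirect else 0.0,
    }
```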

Page 12: Response modeling-iui-2013-talk

Key Features for Selection of Strangers: Profile, Activity, and Retweet Features

Profile Features
- CountSocialWords: count of the following phrases in the description field of the user profile: {"social", "social media", "social network", "social networking", "friend", "tweet", "twitter", "tweeting", "tweets", "tell", "telling", "talk", "talking", "communication", "communicator"}. Adopted from the LIWC "social process" category and from observing words related to modern social network activity. The intuition is that a user who has such words in her profile will be more active and engaging than others, and hence may be more likely to respond.

Activity Features
- MsgCount: number of status messages.
- DailyMsgCount: number of status messages per day.

Retweet Features
- RetweetRatio: ratio of the total number of retweets to the total number of tweets.
- DailyRetweetCount: ratio of the total number of retweets to the total number of days since the account was created.

Page 13: Response modeling-iui-2013-talk

Key Features for Selection of Strangers: Personality Features

Personality traits such as Friendliness and Extraversion are intuitively related to one's willingness to respond to questions. Previous researchers have shown that word usage in one's writings, such as blogs and essays, is related to one's personality.

- LIWC (68 features), e.g., Communication [admit, advice, affair*, apolog*, ...]. Let g be a LIWC category, Ng the number of occurrences of words in that category in one's tweets, and N the total number of words in his/her tweets. The score for category g is then Ng/N.
- Big Five (5 features), e.g., Extraversion. Computed using correlations with LIWC features as reported by previous researchers (e.g., Yarkoni et al.).
- Big Five Facets (30 features), e.g., Friendliness, Anxiety. Computed using correlations with LIWC features as reported by previous researchers (e.g., Yarkoni et al.).
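A minimal sketch of the Ng/N category score, assuming a category is given as a plain word list with LIWC-style "*" prefix patterns (the function name and inputs are illustrative, not the LIWC tooling itself):

```python
import re

def liwc_category_score(tweets, category_words):
    """Score = Ng / N: occurrences of the category's words over total words.

    Words ending in '*' (e.g., 'apolog*') match any word with that prefix,
    following the LIWC dictionary convention.
    """
    words = [w.lower() for t in tweets for w in re.findall(r"[A-Za-z']+", t)]
    if not words:
        return 0.0

    def matches(word):
        return any(word.startswith(p[:-1]) if p.endswith("*") else word == p
                   for p in category_words)

    n_g = sum(1 for w in words if matches(w))
    return n_g / len(words)
```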

Page 14: Response modeling-iui-2013-talk

Key Features for Selection of Strangers: Readiness Features

Even if a person is willing to respond to questions, he/she may not be ready to respond at the time of questioning. Since one's readiness is highly context dependent (e.g., the mobile device used to send answers is running out of battery) and often difficult to capture computationally, we use several features to approximate one's readiness:

Readiness Feature                 Computation
Tweeting Likelihood of the Day    TD/N, where TD is the number of tweets the user sent on day D and N is the total number of tweets
Tweeting Likelihood of the Hour   TH/N, where TH is the number of tweets the user sent during hour H
Tweeting Steadiness               1/σ, where σ is the standard deviation of the elapsed time between consecutive tweets, computed from the user's most recent K tweets (K set, for example, to 20)
Tweeting Inactivity               TQ − TL, where TQ is the time the question was sent and TL is the time the user last tweeted
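The four readiness features could be approximated from raw tweet timestamps roughly as follows; this is a sketch with hypothetical names, as the paper does not publish code:

```python
import datetime
from statistics import pstdev

def readiness_features(tweet_times, question_time, day, hour, k=20):
    """Approximate readiness from a user's tweet timestamps (unix seconds).

    tweet_times must be sorted ascending; `day` (0=Monday) and `hour` select
    the likelihood-of-day/hour features for when the question is sent.
    """
    n = len(tweet_times)
    stamps = [datetime.datetime.fromtimestamp(t, datetime.timezone.utc)
              for t in tweet_times]
    t_day = sum(1 for s in stamps if s.weekday() == day)
    t_hour = sum(1 for s in stamps if s.hour == hour)
    # Steadiness: 1/sigma over gaps between the most recent k tweets.
    recent = tweet_times[-k:]
    gaps = [b - a for a, b in zip(recent, recent[1:])]
    sigma = pstdev(gaps) if len(gaps) > 1 else 0.0
    return {
        "likelihood_of_day": t_day / n,
        "likelihood_of_hour": t_hour / n,
        "tweeting_steadiness": 1.0 / sigma if sigma else float("inf"),
        "tweeting_inactivity": question_time - tweet_times[-1],
    }
```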

Page 15: Response modeling-iui-2013-talk

Feature Analysis: Significant Features

- Statistical significance was tested using a Chi-square test with Bonferroni correction.
- TSA-tracker-1 dataset: 42 significant features (FDR 2.8%).
- Product dataset: 31 significant features (FDR 4.2%).
- TSA-tracker-2 dataset: 11 significant features (FDR 11.2%).

Top-4 features, found through extensive experiments:

Feature                          Feature Type
Communication                    personality
Past Response Rate               responsiveness
Tweeting Inactivity              readiness
Tweeting Likelihood of the Day   readiness

Top ten statistically significant features per dataset:
- TSA-tracker-1: Past Response Rate, Tweet Inactivity, Negative Emotions, Cautiousness, Depression, Excitement-Seeking, DailyMsgCount, Intellect, Communication, Immoderation
- TSA-tracker-2: Prepositions, Past, Exclusion, Sensation, Past Response Rate, Space, Tweeting Steadiness, Achievement-striving, Agreeableness, CountSocialWords
- Product: Mode Response Time, Tweet Inactivity, Activity Level, Depression, Present, Cautiousness, Positive Emotion, Excitement-Seeking, DailyMsgCount, Past Response Rate

Page 16: Response modeling-iui-2013-talk

Statistical Model & Recommendation

Statistical Model: once features are computed, we train statistical models such as Support Vector Machines and Logistic Regression to predict the likelihood of response.

Baseline selection methods:
- Binary Classification: classify a person as responder or non-responder and send questions to people classified as responders.
- Top-K Selection: rank people according to the probabilities computed by the statistical model and select the top K to send questions.

Our Recommendation Algorithm: automatically selects a subset of people from the set of available people, with the goal of maximizing the response rate.
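The paper trains SVM and Logistic Regression models; as a self-contained illustration of the prediction step only, here is a from-scratch logistic model over a numeric feature vector (a sketch, not the authors' training pipeline):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=500):
    """Plain stochastic-gradient-descent logistic regression.

    X: list of feature vectors; y: 1 if the person responded, else 0.
    Returns (weights, bias).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def response_probability(w, b, x):
    """Predicted likelihood that the person with features x responds."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```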

Page 17: Response modeling-iui-2013-talk

Recommendation Algorithm

- Ranks people in the training set in order of non-decreasing predicted probability.
- Finds the interval with the maximum response rate among all interval subsets in that linear order.
- Computes the best subinterval in the test set from the best subinterval in the training set using simple linear projection.
- Can apply various constraints on selecting a minimum/maximum/exact number of people, e.g., select at-least/at-most/exactly K% of people.

Page 18: Response modeling-iui-2013-talk

Evaluation

Evaluating the Prediction Model (5-fold cross-validation experiments):

            TSA-tracker-1      TSA-tracker-2      Product
            SVM     Logistic   SVM     Logistic   SVM     Logistic
Precision   0.62    0.60       0.52    0.51       0.67    0.654
Recall      0.63    0.61       0.53    0.55       0.71    0.62
F1          0.625   0.606      0.525   0.53       0.689   0.625
AUC         0.657   0.599      0.592   0.514      0.716   0.55

AUC = Area Under the ROC Curve; F1 = harmonic mean of precision and recall.
Our models are 60-70% correct in making a prediction.
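For reference, the reported metrics follow the standard definitions; a minimal computation from confusion counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (harmonic mean) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```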

Page 19: Response modeling-iui-2013-talk

Evaluating the Recommendation Algorithm

Comparison of average response rates using different approaches:

                        TSA-tracker-1   TSA-tracker-2   Product
Baseline                42%             33%             31%
Binary Classification   62%             52%             67%
Top-K Selection         61%             54%             67%
Our Algorithm           67%             56%             69%

- Baseline is the response rate achieved by a human operator during data collection.
- For our algorithm, we used "asking at least K% of people from the original set" as a constraint and computed the response rates with varied K (e.g., K = 5%, ..., 90%) to find the respective optimal intervals.
- For comparison, we computed the response rates achieved by simple binary classification (where the response rate is the precision of the predictive model) and by simply selecting the top K (e.g., K = 5%, ..., 90%) people by their computed probabilities.

Page 20: Response modeling-iui-2013-talk

Recall of Recommendation

Response rate and recall for our algorithm with fixed size (Product data, all features, SVM model):

Selecting K% of people   Response Rate   Recall
25%                      76%             37%
50%                      68%             64%
75%                      53%             82%
100%                     31%             100%

There is a trade-off between response rate and recommendation recall, which captures the ratio of actual responders our algorithm identifies for sending questions.

Page 21: Response modeling-iui-2013-talk

Use of Different Feature Sets

Comparison of average response rates (selecting at least 5% of people, SVM model):

Feature Set                   TSA-tracker-1   TSA-tracker-2   Product
All                           0.79            0.72            0.78
Significant                   0.83            0.75            0.82
Top-10 Significant            0.83            0.74            0.81
Top-4 Features                0.82            0.73            0.83
Common Significant Features   0.81            0.72            0.82

Page 22: Response modeling-iui-2013-talk

Live Experiments

Used Twitter's Search API and a set of rules to find 500 users who mentioned in their tweets that they were at a US airport.
- Randomly asked 100 users for the security wait time.
- Used our algorithm to identify 100 users for questioning from the remaining 400 users.
- Used the SVM-based model with the identified significant features.
- Waited 48 hours for responses.
The same process was repeated for sending product questions.

Live Experiment   Random Selection   Our Algorithm
TSA-Tracker-1     29%                66%
Product           26%                60%

Large improvement of the response rate in a live setting.

Page 23: Response modeling-iui-2013-talk

Summary & Future Work

We focused on modeling users' willingness and readiness to answer questions. We can predict one's likelihood of responding to questions, and we identified a subset of features with significant predictive power. Our experiments, including a live one in a real-world setting, demonstrated our approach's effectiveness in maximizing the response rate.

Future Work
- Applicability
  - Apply to other social media platforms
  - Apply to other information collection applications
- Handling skew in the user base
  - Identify inactive users similar to active users in terms of personality
- Modeling the fitness of a stranger to engage
  - Develop a model for receiving high-quality responses
  - Model dutifulness and trustworthiness of users
- Handling complex situations
  - Incorporate various costs/benefits, which might change over time
  - Develop a model to maximize expected net benefit
- Handling unexpected answers
  - Incorporate voluntary responses from people on social media
  - Grow the pool of potential targets
- Protecting privacy
  - Tune the selection algorithm to exclude people who are concerned about privacy

Page 24: Response modeling-iui-2013-talk

Questions?