Our approach to Kaggle's Expedia competition, which involved position ranking for hotel searches. We took a pseudo-classification approach: calculate the probability of a click for each hotel, then rank the search results accordingly.
04/11/2023
Personalize Expedia Hotel Searches
Optimize Hotel Ranks to Maximize Purchase
Team members: Ambuj Agarwal, Lanqiu Mei, Yunlu Gao, Yuqian Liu
Introduction Preprocessing Models Results Improvement
Background
• Expedia is the largest online travel agency
• Accurately matching customers with hotel inventory is important in the highly competitive market
Yearly revenue of Expedia: $4,800,000,000
1% increase in conversion rate: $48,000,000
500 more Data Scientists, better models!
Data
• Searching and purchase data
• Hotel characteristics, location attractiveness, competitive OTA information
User: Visitor_location, Visitor_history_rate
Hotel: Prop_id, Prop_starrating, Prop_review_score, Prop_brand_bool, Promotion_flag, Location_score1, Location_score2, Price_usd, prop_log_historical_price
Search: Srch_id, Site_id, Date_time, Promotion_flag, Adult_count, Children_count, Room_count, Saturday_night, query_affinity_score, orig_destination_distance, Position
Competitors: Comp_rate, Comp_inv, Comp_rate_%_diff (1-8)
Result: Click_bool, Booking_bool, Gross_book_usd
Data
• 2.19 GB of training data
• 54 variables
• 400,000 unique searches
• 10 million training data points
• 6 million test data points
• NA values and informative missing data
• Imbalanced classes problem
  • Click: 4.49%
  • Booking: 2.78%
Solution
• Ranking problem
• Converted to a pseudo-classification problem
• Overall ranking problem within each search
• List-wise approach
• Use click_bool as the target variable instead of booking_bool
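The pseudo-classification idea can be sketched as follows: a classifier assigns each hotel in a search a click probability, and the hotels are then sorted by that probability within the search. This is a minimal sketch, not the team's code; the function name `rank_by_click_probability` and the sample probabilities are hypothetical, and any classifier that outputs a probability could supply the scores.

```python
from collections import defaultdict

def rank_by_click_probability(rows):
    """Group scored rows by search and order hotels by descending click probability.

    rows: iterable of (srch_id, prop_id, p_click) tuples, where p_click is
    the predicted click probability (hypothetical values in this example).
    """
    by_search = defaultdict(list)
    for srch_id, prop_id, p_click in rows:
        by_search[srch_id].append((p_click, prop_id))
    return {
        srch_id: [prop_id for _, prop_id in sorted(items, key=lambda t: -t[0])]
        for srch_id, items in by_search.items()
    }

# Hypothetical scored rows for two searches.
scored = [
    (1, 101, 0.02), (1, 102, 0.30), (1, 103, 0.11),
    (2, 201, 0.05), (2, 202, 0.01),
]
ranking = rank_by_click_probability(scored)
print(ranking)
```

Sorting happens independently for each srch_id, so probabilities never need to be comparable across searches.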
Preprocessing – Variables Not Considered
Variables with missing values    % Missing
comp1_rate                       98%
comp1_inv                        98%
comp1_rate_percent_diff          98%
visitor_hist_starrating          95%
visitor_hist_adr_usd             95%
srch_query_affinity_score        93%
gross_bookings_usd               97%

ID variables: prop_id, srch_id

Imputation for variables with less than 30% missing
• prop_review_score: random imputation
• prop_location_score2: uniform distribution imputation

54 variables -> 19 variables
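The two imputation strategies named on this slide could look roughly like the sketch below. The slide does not define them, so both readings are assumptions: "random imputation" is taken to mean drawing a replacement from the observed values, and "uniform distribution imputation" is taken to mean drawing from a uniform distribution between the observed minimum and maximum. The function names and sample values are hypothetical.

```python
import random

def random_imputation(values, rng):
    """Replace each None with a value drawn at random from the observed entries.

    Assumed reading of "random imputation"; the slide gives no definition.
    """
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

def uniform_imputation(values, rng):
    """Replace each None with a draw from Uniform(min observed, max observed).

    Assumed reading of "uniform distribution imputation".
    """
    observed = [v for v in values if v is not None]
    low, high = min(observed), max(observed)
    return [v if v is not None else rng.uniform(low, high) for v in values]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
review_scores = [4.0, None, 3.5, 5.0, None]          # hypothetical prop_review_score values
location_scores = [0.12, 0.40, None, 0.05, None]     # hypothetical prop_location_score2 values
filled_reviews = random_imputation(review_scores, rng)
filled_locations = uniform_imputation(location_scores, rng)
print(filled_reviews, filled_locations)
```

Observed values are left untouched; only the missing entries are filled.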
Preprocessing – Dummy coding
Variables Dummied                Dummy By
site_id                          Majority
visitor_location_country_id      Majority
prop_country_id                  Majority
prop_starrating                  By Star
srch_length_of_stay              Dummy 1 and 2
srch_adults_count                Dummy 1 to 4
srch_room_count                  Dummy 1 and 2

19 variables -> 30 variables
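A rough sketch of the two dummy-coding styles in the table above. The interpretation is an assumption: "Majority" is read as a single indicator for whether a row carries the most frequent value of the variable, and "Dummy 1 and 2" as one indicator per listed level. The helper names `majority_dummy` and `level_dummies` and the sample values are hypothetical.

```python
from collections import Counter

def majority_dummy(values):
    """Return 1 where the value equals the most frequent value, else 0.

    Assumed reading of dummy coding "by majority".
    """
    majority, _ = Counter(values).most_common(1)[0]
    return [1 if v == majority else 0 for v in values]

def level_dummies(values, levels):
    """Return one 0/1 indicator column per requested level."""
    return {level: [1 if v == level else 0 for v in values] for level in levels}

site_ids = [5, 5, 14, 5, 32]      # hypothetical site_id values
stay_lengths = [1, 2, 3, 1]       # hypothetical srch_length_of_stay values

site_is_majority = majority_dummy(site_ids)
stay_dummies = level_dummies(stay_lengths, levels=[1, 2])
print(site_is_majority, stay_dummies)
```

Collapsing a high-cardinality ID such as a country code into one majority indicator keeps the variable count small, which fits the modest growth from 19 to 30 variables.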
Preprocessing – Sampling
• Randomly take 10% of the training dataset
  • Limited by computational power
  • Learning curve
  • 40,000 unique search IDs
  • 1 million rows
• The philosophy of ensembles
  • List-wise ensemble method
  • Build different models to decrease variance without sacrificing bias
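One way to take such a sample is to draw search IDs rather than individual rows, so that every sampled search keeps its full hotel list. That the team sampled at the search level is an assumption, inferred from the 40,000-of-400,000 figure; the function name `sample_searches` and the toy data are hypothetical.

```python
import random

def sample_searches(rows, fraction, seed=0):
    """Keep every row belonging to a randomly chosen fraction of search IDs."""
    rng = random.Random(seed)
    ids = sorted({row["srch_id"] for row in rows})
    n_keep = max(1, round(len(ids) * fraction))
    keep = set(rng.sample(ids, n_keep))
    return [row for row in rows if row["srch_id"] in keep]

# Toy data: 100 searches with 3 hotels each.
rows = [{"srch_id": s, "prop_id": p} for s in range(100) for p in range(3)]
sampled = sample_searches(rows, fraction=0.1)
print(len(sampled), "rows from", len({r["srch_id"] for r in sampled}), "searches")
```

Sampling whole searches matters for a ranking task, because a search with some of its hotels removed no longer reflects the list the user actually saw.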
Modeling Method
[Diagram: 40,000 unique IDs are split into 39,000 unique IDs and 1,000 unique IDs. The 39,000 IDs are divided into 39 samples of 1,000 IDs each, and each sample produces click probabilities.]
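The splitting scheme in the diagram can be sketched as below: hold out 1,000 search IDs, cut the remaining 39,000 into 39 chunks of 1,000, and fit one model per chunk. How the 39 sets of click probabilities were combined is not stated on the slide; averaging them is an assumption made here for illustration, and the names `split_ids` and `average_probabilities` are hypothetical.

```python
import random
from statistics import mean

def split_ids(ids, holdout_size, chunk_size, seed=0):
    """Shuffle IDs, set aside a hold-out set, and chunk the remainder."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    holdout = shuffled[:holdout_size]
    rest = shuffled[holdout_size:]
    chunks = [rest[i:i + chunk_size] for i in range(0, len(rest), chunk_size)]
    return holdout, chunks

def average_probabilities(per_model_probs):
    """Average click probabilities across models, row by row.

    per_model_probs: one list of probabilities per model, aligned by row.
    Averaging is an assumed combination rule, not confirmed by the slides.
    """
    return [mean(column) for column in zip(*per_model_probs)]

holdout, chunks = split_ids(range(40_000), holdout_size=1_000, chunk_size=1_000)
print(len(holdout), len(chunks))

# Hypothetical predictions from two models on the same two hold-out rows.
combined = average_probabilities([[0.2, 0.4], [0.4, 0.8]])
print(combined)
```

Each chunk is small enough to train on limited hardware, and combining many such models is the variance-reduction argument the sampling slide makes.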
Model families
• Trees: C4.5 (rpart), Bagging (C4.5), Random Forest, Gradient Boosted Trees, C5.0
• Discriminant Analysis: Linear, Quadratic, Flexible
• Logistic Regression: Logistic, Ridge, LASSO
• Support Vector Machine
• Artificial Neural Network
Normalized Discounted Cumulative Gain (NDCG)
• Measures the performance of a recommendation system based on the graded relevance of the recommended entities
• NDCG ranges from 0 to 1
  • 1 -> ideal ranking
  • Current score: 0.300 (Expedia Algorithm), 0.54075 (Kaggle Winner)
• Scoring
  • 1 for Click
  • 5 for Booking
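A minimal sketch of how NDCG can be computed for one search, using the relevance grades above (5 for a booking, 1 for a click) plus an assumed grade of 0 for hotels with no interaction. The gain 2^rel - 1 with a log2 position discount is one common formulation; whether it matches the competition's exact metric, including any rank cutoff, is an assumption.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevances listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, rel in enumerate(relevances)
    )

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance of each hotel in the order the model ranked it:
# 5 = booked, 1 = clicked, 0 = no interaction (assumed grade).
perfect = ndcg([5, 1, 0])   # booked hotel ranked first
worse = ndcg([0, 1, 5])     # booked hotel ranked last
print(perfect, worse)
```

Because the booking grade enters as 2^5 - 1 = 31 against 2^1 - 1 = 1 for a click, placing a booked hotel near the top dominates the score.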
NDCG Scores of Models
Limitation and Future Improvement
• Other imputation methods
  • KNN (time-consuming and memory-limited)
  • Impute using the most highly correlated variable
• Utilize all the data
  • Computation and memory constraints
  • Use the same trick, but at the country level
  • Storing objects and predicting (time and memory constraints)
• Compute some self-defined variables
• Ensemble across different families of models
• Performance on a larger test dataset
• Models with booking_bool
• Possible to build different models for non-missing rows and use a gated ensemble
Questions?