Post Booking Score (PBS) is a binary customer satisfaction metric gathered from the user after a hotel is booked on Hotwire.com as shown in figure 1.
Fig. 1 – Post booking score is the user response to the question “How do you feel about your deal?”
The user responds with a smiley face if they are happy with the deal purchased, or with a sad face otherwise. The score is collected immediately after a user books a hotel via a hot-rate deal on Hotwire.com, so it captures the user's immediate reaction to the quality of the hotel deal they got and their satisfaction with the booking process.
The goal of this analysis is to answer the following question: can Post Booking Score data be used to improve the Hotwire customer experience? This blog post will outline the process of:
- Performing exploratory analysis on this metric
- Identifying features correlated with PBS which could be used to produce a machine learning model to predict PBS
- Building a utility model to predict the probability of a positive PBS and incorporate it into Hotwire’s hotel scoring algorithm
250,000 post booking scores were randomly sampled from the four-month period between March 1, 2016 and July 1, 2016. The sample covers tens of thousands of unique hotels, and the overall PBS is 70% positive on average.
All plots are generated using Bayesian methodology (shown in the legend of the plots as Bayesian mean and 95% credible bounds). Because each PBS observation is a binary outcome (positive or negative), the positive rate within a bin can be modeled with a beta distribution. Some intuition around the beta distribution in the context of batting averages can be found here. In the context of PBS, this means we have prior knowledge about what the post booking score should be for a hotel if the feature under investigation is uncorrelated with PBS. We will model this belief with the beta conjugate prior.
Alpha and beta values are chosen to reflect the prior knowledge that the overall mean PBS is 70%; that is, α/(α+β) = 0.7. How strongly we believe this can be modeled by the magnitude of the prior, α+β. A magnitude of 5 was chosen for this analysis, which gives a prior of α = 3.5 and β = 1.5.
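A minimal sketch of that prior construction (the variable names are mine; the mean of 0.7 and magnitude of 5 are the values quoted above):

```python
# Derive the beta prior parameters from a prior mean and a magnitude.
prior_mean = 0.7  # overall average PBS from the data
magnitude = 5     # strength of the prior belief, equal to alpha + beta

alpha_prior = prior_mean * magnitude  # 3.5
beta_prior = magnitude - alpha_prior  # 1.5

# The prior mean round-trips: alpha / (alpha + beta) == 0.7
```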
In the plots shown below, PBS data was grouped into bins. For discrete values, precise values were passed in (such as 1-5 at 0.5 intervals for hotel star rating). For continuous values (like purchase savings) bin size and number were calculated using the Freedman-Diaconis rule. The mean Bayesian Post Booking Score and credible interval for each bin is plotted for each feature below.
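As a sketch, NumPy implements the Freedman-Diaconis rule directly via `np.histogram_bin_edges`; the gamma-distributed data below is a synthetic stand-in for a continuous feature like purchase savings:

```python
import numpy as np

rng = np.random.default_rng(0)
savings = rng.gamma(shape=2.0, scale=40.0, size=10_000)  # synthetic stand-in

# Freedman-Diaconis bin width: 2 * IQR / n ** (1/3); bins="fd" applies it.
edges = np.histogram_bin_edges(savings, bins="fd")

# Assign each observation to a bin for per-bin PBS aggregation.
bin_index = np.digitize(savings, edges[1:-1])
```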
Relative to the amount of data at hand, the magnitude of the prior is fairly weak and should be quickly overwhelmed by the data for plots with a small number of bins. The arithmetic mean PBS score for each bin is also shown. The credible bounds are generated by sampling the beta distribution with the updated alpha and beta values 2000 times, then taking the 2.5th percentile of the sampled data as the lower bound and the 97.5th percentile as the upper bound for each bin. For a Python implementation, NumPy offers a method to draw samples from a given beta distribution whose API is described here.
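A sketch of the per-bin posterior update and sampled credible bounds described above (the bin counts are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
ALPHA_PRIOR, BETA_PRIOR = 3.5, 1.5

def bayesian_bin_summary(positives, total, n_samples=2000):
    """Posterior mean and 95% credible bounds for one bin of PBS data."""
    a = ALPHA_PRIOR + positives            # positive responses update alpha
    b = BETA_PRIOR + (total - positives)   # negative responses update beta
    draws = rng.beta(a, b, size=n_samples) # sample the posterior
    lower, upper = np.percentile(draws, [2.5, 97.5])
    return a / (a + b), lower, upper

# e.g. a bin with 140 positive responses out of 200 bookings
mean, lower, upper = bayesian_bin_summary(positives=140, total=200)
```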
Only data in the 1st through 99th percentile range is included; the data outside this range are considered outliers and removed.
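The percentile clipping can be sketched with `np.percentile` (the thresholds come from the text; the heavy-tailed data is synthetic):

```python
import numpy as np

def clip_outliers(values, lower_pct=1, upper_pct=99):
    """Drop observations outside the 1st-99th percentile range."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return values[(values >= lo) & (values <= hi)]

rng = np.random.default_rng(1)
raw = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # heavy-tailed stand-in
kept = clip_outliers(raw)
```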
First, several features were selected to analyze for correlation with PBS. Each graph plots the feature in question along the X axis and the proportion of positive PBS responses within each bin along the Y axis. The credible interval widens when the data does not align with the prior belief or when a bin is sparse.
Fig. 2 – Discount savings percent vs. average PBS score
Discount savings percentage (DSP) is the percent off the hotel deal the user receives. This is a deal quality metric which states the obvious: users are happier with higher discounts.
Fig. 3 – Recommendation percentage vs. average PBS Score
Recommendation percentage is a hotel quality metric calculated for each hotel based on user reviews submitted after their stay. This is the first form of user feedback used at Hotwire, and is the main driver in the purchase likelihood model as well. It’s good to see that immediate feedback to the hotel deals correlates so strongly with post stay survey responses.
Fig. 4 – Hotel star rating vs. average PBS score
Hotel rating is the hotel star rating on a scale of 1-5. The credible interval for 1-star hotels is so wide because very little 1-star inventory is sold. Hotel star rating is a good driver of PBS.
Fig. 5 – Distance from neighborhood center searched for to booked hotel vs. average PBS score
There appears to be no correlation between PBS and the distance from the hotel to the center of the neighborhood that contains it. This is interesting because distance is a major driver of the purchase likelihood model and serves as an offline quality metric for model evaluation. One possible explanation for the discrepancy is that Hotwire's neighborhoods are small enough that knowing where the hotel lies within the neighborhood does not influence the customer's immediate reaction; that is, the customer gains no new information about the neighborhood after the hotel deal is revealed. A visualization of the neighborhood size is shown below:
Fig. 6 – Hotwire neighborhood map for San Francisco
Fig. 7 – Neighborhood quality vs. average PBS score
Neighborhood quality is the score given to a neighborhood by our local inventory suppliers and does not appear to influence PBS – perhaps for a similar reason that hotel distance does.
Fig. 8 – Purchase savings vs. average PBS score
The purchase savings is the total amount of money the user saved on a single booking. This metric is not normalized to nights stayed or number of guests per room. It’s interesting to see that the vast majority of gains in PBS occur for savings under a few hundred dollars.
This feature was not included in the PBS model in favor of the normalized feature, discount savings percentage. Normalized versions of this feature would likely be more useful and produce less dramatic graphs. Purchase savings per night booked or purchase savings per night booked per hotel star would be interesting features to investigate.
Next, models were tested with the objective of predicting post booking score from the features recommendation percentage, discount savings percentage, hotel star rating, neighborhood quality, and distance from neighborhood center to hotel. Logistic regression, boosted decision trees, random forests, and support vector machine classifiers were all evaluated. Boosted decision trees overfit the data, as did random forests to a lesser degree. Support vector machines with a linear kernel provided a reasonable solution but had excessive training time. Given the implementation time constraints of the project, logistic regression was the simplest and most effective method for modeling the linear trends shown in the plots above.
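A sketch of such a fit with scikit-learn, using synthetic stand-ins for the five features (the data-generating relationship below is assumed for illustration only and is not taken from the analysis):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5_000
X = np.column_stack([
    rng.uniform(0, 100, n),   # recommendation percentage
    rng.uniform(0, 60, n),    # discount savings percentage
    rng.integers(1, 6, n),    # hotel star rating
    rng.uniform(0, 10, n),    # neighborhood quality
    rng.uniform(0, 5, n),     # distance to neighborhood center
])
# Hypothetical linear relationship driving positive/negative PBS.
logit = -1.5 + 0.02 * X[:, 0] + 0.03 * X[:, 1] + 0.1 * X[:, 2]
y = rng.random(n) < 1 / (1 + np.exp(-logit))

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
p_positive_pbs = model.predict_proba(X)[:, 1]  # P(positive PBS) per booking
```

Scaling the features before fitting keeps the coefficients on a comparable footing, which matters for the importance table below.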
Variable importance for the logistic regression model, implemented with the scikit-learn Python package, is as follows:
Table 1 – feature importance for logistic regression model
| Feature | Importance |
| --- | --- |
| Discount Savings Percent | 0.166 |
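One plausible way to produce such a table (an assumption on my part, since the post does not specify the method) is to normalize the absolute coefficients of a logistic regression fit on standardized features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = [
    "Recommendation Percent", "Discount Savings Percent",
    "Hotel Star Rating", "Neighborhood Quality", "Distance to Center",
]

# Synthetic data where discount savings percentage drives the outcome.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2_000, 5))
y = rng.random(2_000) < 0.3 + 0.4 * X[:, 1]

model = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
weights = np.abs(model.coef_[0])
importance = dict(zip(feature_names, weights / weights.sum()))
```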
It is interesting to compare logistic regression with boosted decision trees as a way of visualizing overfitting. Note the smooth, linear surface produced by plotting recommendation percentage and discount savings percentage (DSP) against probability of positive PBS with logistic regression, and compare it to the monstrosity produced by a BDT model with 1000 estimators, a learning rate of 0.1, and a max depth of 5. The surfaces are generated by holding all features except the two shown on the X and Y axes constant at their 25th/50th/75th percentiles.
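A sketch of how such a surface can be generated (the model and data here are synthetic stand-ins, not the production model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(2_000, 5))  # synthetic stand-in features
y = rng.random(2_000) < 1 / (1 + np.exp(-(0.02 * X[:, 0] + 0.03 * X[:, 1] - 3)))
model = LogisticRegression().fit(X, y)

def prediction_surface(model, X, i, j, percentile=50, steps=50):
    """P(positive PBS) over a grid of features i and j,
    with all other features held at the given percentile."""
    base = np.percentile(X, percentile, axis=0)
    gi, gj = np.meshgrid(
        np.linspace(X[:, i].min(), X[:, i].max(), steps),
        np.linspace(X[:, j].min(), X[:, j].max(), steps),
    )
    grid = np.tile(base, (gi.size, 1))
    grid[:, i], grid[:, j] = gi.ravel(), gj.ravel()
    return model.predict_proba(grid)[:, 1].reshape(gi.shape)

surface = prediction_surface(model, X, i=0, j=1)
```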
Fig. 9 – Logistic regression model
Fig. 10 – Boosted decision tree overfitting PBS data
It is apparent that logistic regression captures the linear trends in the data much better than BDTs do for the PBS dataset. Additional tuning of BDT hyperparameters should lower the impact of overfitting, but in the meantime logistic regression produced satisfactory models.
The output of the model can be interpreted as the probability of a given hotel deal resulting in a positive post booking score. The model will be integrated into the hotel sorting algorithm as follows: the probability of a given hotel deal being purchased, multiplied by the probability of that purchase resulting in a positive post booking score, gives the final score to sort by.
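That scoring rule is simply a product of the two model outputs. A sketch (the hotel names and probabilities are invented):

```python
def final_sort_score(p_purchase, p_positive_pbs):
    """Score to sort by: P(purchase) * P(positive PBS | purchase)."""
    return p_purchase * p_positive_pbs

deals = [
    {"hotel": "A", "p_purchase": 0.30, "p_positive_pbs": 0.90},  # ~0.27
    {"hotel": "B", "p_purchase": 0.40, "p_positive_pbs": 0.60},  # ~0.24
]
ranked = sorted(
    deals,
    key=lambda d: final_sort_score(d["p_purchase"], d["p_positive_pbs"]),
    reverse=True,
)
```

Note that a cheaper-to-book hotel (higher purchase likelihood) can still rank below a pricier one whose expected satisfaction is much higher.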
This model was tested against the previous Hotwire hotel sort control model and resulted in a large increase in hotel quality based on offline metrics. Deployment of this utility model will result in higher quality hotels being sorted to the top. Customer satisfaction metrics are expected to also increase since customer feedback is being directly incorporated into the scoring model.
One downside of this kind of model is that higher quality hotels come at a higher price point. Consequently, the average displayed price for the top-ranked hotels increased as well. This may hurt customers whose only concern is booking the lowest-cost hotel regardless of quality. In the future, more normalized features could be added to the model to balance out this downside, or the responsibility could be shifted to the base purchase-likelihood model that the utility model is combined with.
This project was completed over a three month period by Hotwire data science intern Jared Rondeau.