At Hotwire we build machine learning models for various purposes, such as to power our hotel sort or to drive our hotel pricing strategy. We spend our time substituting data-driven machine learning approaches for legacy human curated set of business rules that are difficult to scale as well as to incrementally and systematically improve. However, not all is always without issues with machine learning models as they bring their own unique set of challenges. One of the main difficulties that we have to deal with on daily basis is to understand what changes to expect from a new “black-box” machine learning model. It is quite easy to see that weights in neural network model have changed or that the trees in random forest model started picking up different attributes higher in the trees, but what impact is this going to have on the customer? And most importantly, how did the behavior of the new machine learning model changed from our baseline model and does it intuitively and logically match our intuition and expectation? To answer these questions, this blog post demonstrates a simple technique for visualizing the difference in behavior of two black-box machine learning models by comparing their partial dependence plots. The example used to describe this method is understanding the incremental change in a new boosted-decision-tree algorithm trained to sort hotel search results.
There are many approaches to building a model for sorting hotel search results on a travel site such as Hotwire. One of the methods that we use is to build a purchase likelihood model that estimates the probability of purchase for each search result. The final sort order displayed to the user of the website is then produced by ordering the search results in decreasing order based on the predicted purchase likelihood. This way, results that we believe are most likely to be purchased by the user will be placed higher in the sort order, thus increasing the chance that customer will find relevant inventory and eventually purchase.
To build such a model, we parse through our historical click-stream data and we look for results that customers previously purchased and for results that customers did not purchase. We then collect attributes for the displayed results such as star ratings, recommendation percentage, distance from searched area or location popularity and combine these with features describing customer search such as number of children, number of days to arrival or whether this searched time interval includes weekdays or weekends. Finally, we label each of these feature vectors with Boolean label that expresses whether specific results was purchased or not by the user. Such data set can be used to train a machine learning model that predicts the probability of the purchase event given the input features.
One of the popular models that we use is the Boosted Decision Tree (BDT). BDT is an ensemble model composed of weak decision tree models. At each training stage of the boosting process the algorithm selects random subset of data points as well as random subset of features and trains a decision tree. This decision tree is fitted to the residual error of the model from the previous iterations, thus boosting the accuracy of the overall model.
A typical BDT model with hundreds of trees is very powerful machine learning system capable of predicting complex non-linear relationships, however, sometimes the price we have to pay for that is not knowing exactly what is going on inside such a model. As mentioned earlier, when we build new sort algorithms with different features and hyper-parameter configurations using BDT and we are considering them for an AB test with a live production traffic, it is always crucial to ensure that the new algorithm will be “well behaved”, that customer experience will not be impacted and that the entire system will function within the boundaries of the business and the desired user experience.
Comparing two models using partial dependence plots
When evaluating the performance of the new BDT sort model we are of course looking at multitude of statistical metrics such as Normal Discounted Cumulative Gain , Expected Reciprocal Rank  or the mean recommendation percentage in the top positions of the sorted search results. But in addition, to these quantitative metrics we are also using simple yet very informative technique to visualize the difference in the response of the new model when compared to the previous baseline model. Understanding these difference can confirm that the new model behaves within the bounds or it can alert us that something is not working as intended.
For simple models, such as linear or logistic regression the difference in behavior could be easily understood by studying the changes in model coefficients, for example their polarity and magnitude. However, for more complex models such as BDT, no such simple technique exists. To assists in such analysis, the method of partial dependence plots was proposed by Friedman in 2001 . Implementation of this method could be found in popular data science libraries such as in Scikit-Learn in Python  or in gmb package in R .
For clarity, the partial dependence method can be described as follows. Let’s suppose that we have a dataset with N data points x, where each data point consists of k dimensional feature vector. In addition, each data point is labelled with output label y. Furthermore, let’s assume that we have trained an arbitrary machine learning model that maps the input data point x to an output label prediction via function f as follows:
The main idea of the partial dependence method is then to iterate over the k features in the input vector space, for each of them discretize the range of feature values and then for each such feature value average the model response over all values of the remaining k-1 input features. In summary we can compute the partial dependence for the jth feature as follows:
While, plotting the partial dependence for a specific model is quite informative, we find it very useful to plot together the partial dependence plots of the baseline and the new candidate BDT sort model. By comparing the behavior of the partial dependence in these plots we easily gain understanding of how the specific model is behaving as well as what has changed in the response of the new model relative to the control model.
As an example, the figure below shows the comparison of dependence plots for a baseline BDT model and for a new BDT model with refactored feature extraction pipeline. Four interesting model features were selected for this example as follows:
- Price * Distance – The product between price of the hotel and its distance from center of the searched area (e.g. downtown)
- Location Popularity Given Searched City – Popularity index expressing whether customer is interested in staying within specific location given the city they searched for (e.g. what is the chance that customer wants to stay in downtown San Francisco given that they searched for Oakland, CA)
- Displayed Savings Percentage – The relative difference between the retail rate and the Hot Rate price
- Global Location Popularity – Popularity index that expresses the number of transactions in given area normalized by the maximum number of transactions in any location across the entire US market.
From the plotted partial dependence comparisons we can make the following observations:
- The Price * Distance feature appears to match our intuition that hotels located closer to the searched area as well as hotels that are cheaper should be more likely purchased by the user. From the plot it appears that the new model puts more weight on this feature.
- The Location Popularity Given Searched City feature plot shows a significant difference between the two BDT models. While this feature was of high importance to the baseline model, the new model appears not to use this feature at all. After investigating further, this observation lead to uncovering of a bug with incorrectly matching location ids that caused this feature not to behave as intended.
- The Displayed Savings Percentage feature shows minimal difference between both models and it seems to match our intuition where the higher savings Hotwire offers when compared to the retail prices the more likely customer is going to purchase given hotel on our site.
- The Global Location Popularity feature shows substantial difference between both models and it does not really match our intuition in either case. While one would expect the purchase likelihood to increase as the location popularity increases, there seems to be bias towards inputs with lower popularity values. After further investigation it was determined that due to the normalization of the number of transaction in given location by the overall maximum number of transactions, the feature became biased towards markets with higher production and it became rather noisy for smaller markets. This observation led to a redesign of the popularity feature extraction method.
As the example above demonstrates, observations 1 and 3 confirmed our intuition while observations 2 and 4 pointed us to serious bugs and inconsistencies in our implementation that should not make it to the final model that will be tested in production. It is possible, that without looking at the partial dependence plots, these issues would remain hidden and they would negatively affect the experience of our customers as well as the company business metrics.
 K. Jarvelin, J. Kekalainen, “Cumulated gain-based evaluation of IR techniques,” in ACM Transactions on Information Systems 20(4), pp. 422-446, 2002.
 O. Chapelle, Y. Zhang, D. Metzler, P. Grinspan, “Expected Reciprocal Rank for Graded Relevance,” in Proc of CIKM, Hong-Kong, 2009.
 J. H. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine,” in Annals of Statistics 29: 1189–1232, 2001.
 Scikit-Learn: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html