Team members: Feng Qian, Sophie Zhao, Yizhou Wang

Recommendation systems can be a vital competitive edge for service providers such as Spotify, which grows its business mainly through user subscriptions. Accurate recommendations help improve the user experience and strengthen customer loyalty.

Traditional recommendation methods include modeling the user-item interaction with supervised learning such as classification, memory-based content filtering from user history, and many more. These approaches overlook the dependency across consecutive time steps. Inspired by the progress of reinforcement learning in other domains, such as playing Atari games, we apply a state-of-the-art model, Deep Deterministic Policy Gradient (DDPG), to model music recommendation as a sequential decision process. In this setup, the action of the DDPG learner is a song selected from a huge pool. By representing each song with a set of continuous features and thereby expanding the action space from discrete to continuous, our agent scales to a much larger number of candidate songs while maintaining satisfactory recommendation accuracy and diversity.

Data

We use the “Music Streaming Sessions Dataset” (MSSD), originally published for a competition by Spotify. The dataset contains both listening session data and a lookup table for song features.

Data Structure

The figure above illustrates the structure of our data. We have two data files: one with rows of sessions and the other with rows of track features. Tracks are songs, and a session is the sequence of tracks that a single user listens to. The maximum session length is capped at 20 by Spotify.

Within a session, user actions between and within tracks are recorded. Actions between tracks include skip_very_briefly, skip_briefly, mostly_played_before_skip, and no_skip, indicating how a user transitions from one track to the next. Actions within a track include move_forward, move_backward, and no_move, describing user behavior while listening, as well as no_pause_before_play, short_pause_before_play, and long_pause_before_play, describing pausing behavior. These actions can be interpreted as user preferences on each track, so they can be mapped to rewards.

We also have feature data on each track, provided by Spotify through either manual or automatic means. These features include acousticness, beat_strength, danceability, etc., and take values in $[0,1]$. We can incorporate these features into our model.

Data Sampling

MSSD is a huge dataset with more than 20 million songs and 17 million sessions, comprising around 600 GB of data. To make this dataset manageable for our model, we take a subset of songs and sessions, as explained below.

To select a subset of MSSD, we first sample the session data and then sample the corresponding track data. We decided to sample 100 thousand sessions from the original data. Since the data is divided into 10 zip files of similar size, we sample 10 thousand sessions from each zip file. Each zip file is comprised of $N$ data files, so we sample $10{,}000/N$ sessions from each data file. Since each data file contains a different number of sessions, we sample from each data file with a different probability. For example, if there are $M$ sessions in data file $f_1$, we accept each session in $f_1$ independently with probability $\frac{10{,}000/N}{M} = \frac{10{,}000}{MN}$.
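As a rough illustration, this acceptance sampling can be done with a streaming pass over each data file. The sketch below is a minimal Python version; the file paths, the session_id column name, and the two-pass layout are assumptions rather than our actual pipeline.

```python
import csv
import random

TARGET_SESSIONS = 100_000      # total sessions to keep
NUM_ZIP_FILES = 10             # MSSD ships as 10 similarly sized archives
per_zip_target = TARGET_SESSIONS // NUM_ZIP_FILES   # 10,000 sessions per archive

def sample_sessions(csv_paths, per_zip_target):
    """Accept each session with probability (per_zip_target / N) / M, where N is
    the number of data files in the archive and M the session count of a file."""
    n_files = len(csv_paths)
    kept_rows = []
    for path in csv_paths:
        # First pass: count the distinct sessions in this data file (M).
        with open(path) as f:
            session_ids = {row["session_id"] for row in csv.DictReader(f)}
        p_accept = (per_zip_target / n_files) / len(session_ids)
        accepted = {sid for sid in session_ids if random.random() < p_accept}
        # Second pass: keep every row belonging to an accepted session.
        with open(path) as f:
            kept_rows.extend(row for row in csv.DictReader(f)
                             if row["session_id"] in accepted)
    return kept_rows

# e.g. rows = sample_sessions(paths_in_first_zip, per_zip_target)
```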

After sampling the sessions, we need to find all tracks that appear in the sampled data. This would be easy if the data were small: we could use a set data structure to collect the set of track ids. However, since the data is too big to hold in memory, we use a database (MySQL on a MacBook Air 2014) and let it maintain the index structure for us. With the database, we can easily look up the tracks we need.
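Below is a minimal sketch of the same idea, using SQLite in place of the MySQL instance described above; the database file, table, and column names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect("mssd_sample.db")
conn.execute("CREATE TABLE IF NOT EXISTS sampled_tracks (track_id TEXT PRIMARY KEY)")

def register_tracks(track_ids):
    """Insert track ids seen in the sampled sessions; the PRIMARY KEY index
    deduplicates on disk, so nothing has to fit in application memory."""
    conn.executemany(
        "INSERT OR IGNORE INTO sampled_tracks (track_id) VALUES (?)",
        ((tid,) for tid in track_ids),
    )
    conn.commit()

# After all sessions are processed, stream the distinct track ids back out and
# join them against the track-feature file.
distinct_ids = [row[0] for row in conn.execute("SELECT track_id FROM sampled_tracks")]
```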

We sampled uniformly from more than 300 data files and reduced the data to 106,375 sessions (376 MB) and 281,185 tracks (167 MB).

Model

A reinforcement learning model has the following components: agent, environment, state, reward function, value function, and policy. To simplify the problem, we assume a hypothetical user whose experience is pooled from all the actual users. Our recommender model is the agent of the system: it proposes songs to this hypothetical user, who skips or does not skip each recommendation. The user behaves as the system's environment, responding to each recommendation depending on the state of the system. User feedback determines our reward, namely a score of one only if the user does not skip. The agent's action is the recommended song. Our state is defined as the song features and corresponding user reactions of the past 5 steps, excluding the current step; the feedback and action together therefore give us the next state. The goal of the agent is to learn a policy that maximizes the accumulated reward over 15 steps. We set the prediction length to 15 to avoid a cold start, i.e., predicting when there isn't sufficient history, given that the original session length is 20 and the history length is 5.

More formally, the mathematical definition is as follows:

$t$: current time

$X_t$: skip/not-skip decision at time $t$

$a_t$: features of recommended song at time $t$.

$O_t$: raw states, or observed inputs

$S_t$: transformed states

$O_t = [a_{t-5:t-1},\,X_{t-5:t-1}]$

$X_{t-5:t-1}= [X_{t-5},X_{t-4},\dots,X_{t-1}]$, $X_t \in \{0,1\}$, with 1 meaning the user did not skip. These are the skipping decisions of the past 5 time steps.

$a_{t-5:t-1}$ is defined similarly; $a_t \in R^p$, where $p$ is the number of song features.

In matrix form, $O_t = \begin{bmatrix} a_{t-5} & X_{t-5} \\ \vdots & \vdots \\ a_{t-1} & X_{t-1} \end{bmatrix}$.

For this study, we select 20 song features to describe $a_t$; together with the scalar $X_t$, this gives $O_t \in R^{5\times 21}$. The input state we actually use is $S_t = \phi(O_t)$, where $\phi$ is a mapping from inputs of length 5 along the time dimension to length 1. We transform the observations using an autoencoder with an RNN, which compresses varying-length inputs along the time dimension into a fixed-length representation. This means we can easily extend our model to compress the full user history since the start of a listening session (of varying length) into a fixed-length input; that is, $O_t$ would be defined by $X_{1:t-1}$ and $a_{1:t-1}$ instead of the past 5 steps, $X_{t-5:t-1}$ and $a_{t-5:t-1}$.

Autoencoder

Overview

Our data first goes through an autoencoder with two parts: a numerical compressor and a time compressor. Each data point is originally 5 $\times$ 21, corresponding to 20 song features for each of 5 songs plus the observed user action (skip/no-skip). We compress the 20 song features with a feedforward autoencoder, which transforms the input into 5 $\times$ 8. We concatenate the user response at the end to make this 5 $\times$ 9. We then feed this latent representation into an LSTM autoencoder that compresses along the time dimension into a single vector of dimension $1\times 9$.
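The sketch below shows what such a numerical compressor could look like in PyTorch. The 20-to-8 bottleneck and the concatenation of the skip flag follow the description above; the hidden layer sizes and activations are our own assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class NumericalCompressor(nn.Module):
    """Feedforward autoencoder that maps 20 song features to an 8-dim latent."""
    def __init__(self, n_features=20, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 16), nn.ReLU(),
            nn.Linear(16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.ReLU(),
            nn.Linear(16, n_features), nn.Sigmoid(),  # features lie in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# One observation window: 5 tracks x 20 features, plus the binary skip flags.
features = torch.rand(5, 20)                     # hypothetical song features
skips = torch.randint(0, 2, (5, 1)).float()      # 1 = not skipped
_, z = NumericalCompressor()(features)           # 5 x 8 latent
window = torch.cat([z, skips], dim=1)            # 5 x 9, input to the time compressor
```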

Implementation

During training, we first build an autoencoder and train it on a song feature dataset.

We then fix the encoding and decoding layers and connect them with the LSTM time compressor. The encoding layers are responsible for pre-processing the numerical features, and the decoding layers for expanding the output of the time decoder back to full feature length.

The time compressor compresses the 5 $\times$ 9 input into 1 $\times$ 9 and then decodes it back into 5 $\times$ 9. To accommodate the mixed-type data, we use two time decoders: one with an MSE loss to recover the numerical part, and the other with a cross-entropy loss to recover the binary user skipping behavior. The final loss is a linear combination of the two losses, and the assigned weights trade off model performance on the two tasks. Despite using two decoders, both share a single length-9 latent representation as input. During prediction, one additional step is that the numerical decoder takes the output of the time decoder and restores the original dimension of the numerical features.
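A condensed PyTorch sketch of this dual-decoder time compressor is shown below. The LSTM layer sizes, the 0.5 loss weight, and the random toy batch are assumptions for illustration; only the 5 $\times$ 9 to 1 $\times$ 9 bottleneck and the two loss heads follow the description above.

```python
import torch
import torch.nn as nn

class TimeCompressor(nn.Module):
    """Compresses a 5 x 9 window into a 1 x 9 latent, then decodes it with two
    heads: one for the 8 numerical dims (MSE), one for the skip flag (BCE)."""
    def __init__(self, input_dim=9, seq_len=5):
        super().__init__()
        self.seq_len = seq_len
        self.encoder = nn.LSTM(input_dim, input_dim, batch_first=True)
        self.decoder_num = nn.LSTM(input_dim, input_dim - 1, batch_first=True)
        self.decoder_skip = nn.LSTM(input_dim, 1, batch_first=True)

    def forward(self, x):                          # x: (batch, 5, 9)
        _, (h, _) = self.encoder(x)
        z = h[-1]                                  # (batch, 9) shared latent
        z_rep = z.unsqueeze(1).repeat(1, self.seq_len, 1)
        num_hat, _ = self.decoder_num(z_rep)       # (batch, 5, 8)
        skip_logit, _ = self.decoder_skip(z_rep)   # (batch, 5, 1)
        return num_hat, skip_logit, z

# Toy batch: 8-dim song latents plus a binary skip flag per step.
num = torch.rand(32, 5, 8)
skip = torch.randint(0, 2, (32, 5, 1)).float()
x = torch.cat([num, skip], dim=2)

num_hat, skip_logit, z = TimeCompressor()(x)
loss = (nn.MSELoss()(num_hat, num)
        + 0.5 * nn.BCEWithLogitsLoss()(skip_logit, skip))  # weighted combination
```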

Possible Extension

As an extension, the compressor can take both the song features and the binary user response as input, rather than the numerical features only; we refer to this variant as the “combinational compressor” to distinguish it from the “numerical compressor”. By relieving the time compressor of the burden of compressing the categorical response, this structure could further improve performance. In this project, we have not yet connected the combinational compressor to the time compressor.

Result

Table 1: Autoencoder results

Our numerical features are normalized to lie between 0 and 1. The MSE for reconstructing the 5 $\times$ 20 features from the 1 $\times$ 9 latent is 0.0111, about 1% of the feature range. The accuracy of reconstructing the binary user behavior is 88.89%. The MSE from training the numerical compressor alone is 0.0016. All statistics are calculated on the test set. The last two rows report the experimental combinational compressor, which recovers the binary user skipping behavior with an accuracy of 100% on the test set, at a slightly increased numerical reconstruction MSE of 0.0064.

Reinforcement Learning

Framework

  • State: the latent vector of the time compressor. The state, of length 9, contains the information of the previous five songs and the corresponding user responses.
  • Action: the latent vector of the numerical compressor. The action is the recommended song; to reduce the dimension, we use the latent representation of its song features, of length 8.
  • Reward: the sum of the not-skipping probability and the diversity of the recommendation, $r_t = \hat{p}(\text{not skip} \mid s_t, a_t) + I_{\{d(a_t, a_{t-1}) > \epsilon\}}$, where $I_{\{\cdot\}}$ refers to the indicator function and $d(a_t, a_{t-1})$ is the distance between the current and previous action, so diversity is rewarded when this distance exceeds a threshold $\epsilon$ (see the sketch after this list).
  • Agent: the policy function and the Q-function. Policy function: $\mu(s) = a$; Q-function: $Q(s,a)$.
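A small sketch of this reward, treating the diversity threshold as a tunable constant (0.4 times the action standard deviation, following the Result section; exactly how that standard deviation is computed is an assumption here) and the not-skip probability as supplied by the environment model below:

```python
import numpy as np

DIVERSITY_FACTOR = 0.4   # threshold factor used in our experiments

def reward(not_skip_prob, action, prev_action, action_std):
    """Reward = estimated not-skip probability + binary diversity bonus."""
    distance = np.linalg.norm(action - prev_action)
    diversity_bonus = float(distance > DIVERSITY_FACTOR * action_std)
    return not_skip_prob + diversity_bonus
```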

Environment

Given a state-action pair, the environment is supposed to return the reward. However, in this problem we cannot observe real-time user responses, so we use the existing data to estimate the skip/not-skip behavior.

Let the current state and action be $(s, a)$. The data consists of tuples $\mathbf{w}_i = (s_i, a_i, x_i)$, where $x_i$ denotes whether the song was skipped. The not-skipping probability is estimated as follows:

  1. Find logged tuples whose state is close to the current state.
  2. From those selected in step 1, choose the tuples $\mathbf{w}_i$ for which the distance between $a_i$ and $a$ is below a threshold. Let this subset be $W$.
  3. Estimate the not-skipping probability using a weighted average of the labels in $X = \{x_i: (s_i, a_i, x_i) \in W\}$, as sketched below.
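The following is a minimal nearest-neighbour version of these three steps; the neighbourhood size $k$, the distance threshold, and the inverse-distance weighting are our own choices, not prescribed above.

```python
import numpy as np

def estimate_not_skip_prob(s, a, states, actions, skips,
                           k=50, action_threshold=0.5):
    """Estimate P(not skip | s, a) from logged (state, action, skip) tuples.
    states: (n, 9), actions: (n, 8), skips: (n,) with 1 = not skipped."""
    # Step 1: the k logged tuples whose state is closest to the query state.
    state_dist = np.linalg.norm(states - s, axis=1)
    nearest = np.argsort(state_dist)[:k]
    # Step 2: among those, keep tuples whose action is within the threshold.
    action_dist = np.linalg.norm(actions[nearest] - a, axis=1)
    mask = action_dist < action_threshold
    if not mask.any():
        return float(skips[nearest].mean())   # fall back to the state neighbours
    # Step 3: weighted average of the labels (closer actions weigh more).
    weights = 1.0 / (action_dist[mask] + 1e-6)
    return float(np.average(skips[nearest][mask], weights=weights))
```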

Deep Deterministic Policy Gradient

DDPG, short for Deep Deterministic Policy Gradient, is a model-free, off-policy actor-critic algorithm that combines DPG with DQN. The original DQN works in discrete action spaces; DDPG extends it to continuous action spaces with the actor-critic framework while learning a deterministic policy.

There are four neural networks in this algorithm: the actor, the critic, the actor target, and the critic target. The actor network learns the policy function, while the critic network approximates the Q-function.

Soft update. The target networks are employed to stabilize training. Specifically, DDPG performs soft updates on the parameters of both target networks with $\tau \ll 1$: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, where $\theta, \theta'$ are the weights of the original network and the target network, respectively. In this way, the target network values are constrained to change slowly, unlike the design in DQN, where the target network stays frozen for some period of time.
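In code, the soft update is a short loop over the network parameters; a minimal PyTorch sketch (the value of $\tau$ is illustrative):

```python
def soft_update(target_net, source_net, tau=0.001):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    for t_param, s_param in zip(target_net.parameters(), source_net.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)
```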

Hard update. In contrast to the slowly moving targets, the actor (policy function) and critic (Q-function) networks themselves are trained directly with gradient updates, as described next.

Actor update. The return of a policy $\mu_\theta(s)$ is defined as

$J(\theta) = E_s[Q(s, \mu_\theta(s))].$

Taking the gradient of $J(\theta)$ with respect to $\theta$ yields

$\nabla_\theta J(\theta) = E_s\!\left[\nabla_a Q(s, a \mid \theta^Q)\big|_{a=\mu_\theta(s \mid \theta^\mu)} \, \nabla_\theta \mu_\theta(s \mid \theta^\mu)\right]$

Using gradient ascent, we can find the best $\theta$ that produces the highest return.

Critic update. The loss function is the mean squared error $L = E_{s,a,r,s'}\!\left[(y - Q(s,a \mid \theta^Q))^2\right]$, where the target value is evaluated using the target networks: $y = r + \gamma Q(s', \mu(s' \mid \theta^{\mu'}) \mid \theta^{Q'})$.
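A condensed PyTorch sketch of one actor-critic update is given below. It assumes `actor`, `critic`, `actor_target`, and `critic_target` modules (with the critic taking a state-action pair) and their optimizers already exist, reuses the `soft_update` helper above, and picks the discount factor arbitrarily.

```python
import torch

GAMMA = 0.99   # discount factor; our own choice for this sketch

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt):
    s, a, r, s_next = batch                  # tensors sampled from replay memory

    # Critic update: regress Q(s, a) onto the bootstrapped target y.
    with torch.no_grad():
        y = r + GAMMA * critic_target(s_next, actor_target(s_next))
    critic_loss = torch.mean((y - critic(s, a)) ** 2)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: gradient ascent on J(theta) = E[Q(s, mu(s))].
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Slowly track the trained networks with the target networks.
    soft_update(critic_target, critic)
    soft_update(actor_target, actor)
```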

Experience Replay. We continually update a replay memory table of transitions $(s_t, a_t, r_t, s_{t+1})$ as steps are performed. The actor and critic networks are trained on random mini-batches of transitions from the replay memory, instead of consecutive samples. This reduces correlations between samples and allows more efficient use of previous experience. The length of the memory table in our model is set to 1000.
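A minimal replay memory matching this description (the batch size is our own choice):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-length memory of (s, a, r, s_next) transitions."""
    def __init__(self, capacity=1000):        # 1000 matches the length used above
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the end

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=64):          # random mini-batch, not consecutive
        return random.sample(self.buffer, batch_size)
```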

Result

Since our model requires information about the previous five songs, the agent makes 15 recommendations for each session of length 20; we therefore run 15 steps in each episode.

The maximum score the recommender system can accumulate from our truncated music sessions is 15. Not-skip behaviour makes up 34% of the whole dataset, which translates to a benchmark score of around 5. As the left plot shows, after training on merely a few hundred episodes, our agent reaches a score of around 11, a much better performance than the benchmark. On the right is the diversity score: if the distance between the current and previous action exceeds a threshold (0.4 times the standard deviation), the agent receives a diversity score of 1, and 0 otherwise. This is computed at each step, so the maximum is also 15. From the plot, we can see that at the beginning our agent tends to recommend similar songs, but after around 300 episodes it learns to take the diversity of its recommendations into account.

Conclusion

In this project, we successfully use reinforcement learning to capture the user-song interaction and the time dependency between current and past decisions. Future research can build on our work by relaxing many of the simplifying assumptions. First, instead of assuming a single hypothetical user, perform customer stratification and incorporate that variable into the model. Second, instead of truncating the history at five steps, take into account the full user history since the start of the session. In addition, one can test whether the experimental combinational compressor outperforms the current model by connecting it to the time compressor and the downstream DDPG agent. In terms of training, we trained the different model components separately due to constraints on computational resources and time; training the model end to end could yield a latent representation better suited to the reinforcement learning task. Finally, possibly the most significant modification on top of our research is to provide the reinforcement learning agent with a real environment. Our current dataset is heavily biased toward skipping behavior (a 34% non-skipping rate) and may not reflect realistic customer behavior. Therefore, instead of simulating an environment from the 600 GB dataset, a better way to make the model production-ready is to recruit participants and observe how they react to the recommendations. This way, the agent can fully explore the recommendation space and make more trustworthy recommendations.

Finally, we would like to express our most sincere gratitude to: Javier Zazo, our mentor, an expert in reinforcement learning, and a great friend who has always given us the valuable suggestions and encouragement we needed to proceed; Pavlos Protopapas, the head of the 297R research project and the instructor who made this unforgettable research journey possible; and Aparna Kumar, Spotify senior researcher and data scientist, and a great supporting manager to report to.

This research was conducted as part of the IACS 297R research project.