Factorization Machines

by Weidong Liang

Beijing, 2014.06


Introduction

The Factorization Machine model has the following form: $$ \hat y (x) = \omega_0 + \sum_{i=1}^{n}{\omega_i x_i} + \sum_{i=1}^{n} \sum_{j=i+1}^{n}{ \langle v_i , v_j \rangle x_i x_j } $$ where \( \omega_0 \in \mathbb{R} \), \( \omega \in \mathbb{R}^n \) and \( V \in \mathbb{R}^{n \times k} \). Direct evaluation of \( \hat y \) using the above requires \( O(k n^2) \) time.
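
As a point of reference, here is a minimal NumPy sketch of this direct \( O(k n^2) \) evaluation; the parameter names (w0, w, V) mirror the symbols above, and the function name is an assumption of this sketch:

    import numpy as np

    def fm_predict_naive(w0, w, V, x):
        """Direct FM evaluation: loops over all O(n^2) feature pairs, O(k n^2) total."""
        n = len(x)
        y_hat = w0 + np.dot(w, x)                             # bias + linear terms
        for i in range(n):
            for j in range(i + 1, n):
                y_hat += np.dot(V[i], V[j]) * x[i] * x[j]     # <v_i, v_j> x_i x_j
        return y_hat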

The model can be efficiently computed using $$ \hat y (x) = \omega_0 + \sum_{i=1}^{n}{\omega_i x_i} + \frac{1}{2} \sum_{f=1}^{k}\left( \left(\sum_{i=1}^{n}{v_{i, f} x_i } \right)^2 - \sum_{i=1}^{n}{ v_{i, f}^2 x_i^2 } \right) $$ which requires only \( O(k n) \) time.
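
The same prediction with the \( O(k n) \) reformulation, continuing the sketch above (and reusing its hypothetical names):

    def fm_predict(w0, w, V, x):
        """FM evaluation via the O(k n) reformulation."""
        s = V.T @ x                      # s[f] = sum_i v_{i,f} x_i
        s2 = (V ** 2).T @ (x ** 2)       # sum_i v_{i,f}^2 x_i^2
        return w0 + np.dot(w, x) + 0.5 * np.sum(s ** 2 - s2)

For random w0, w, V, x, np.allclose(fm_predict_naive(w0, w, V, x), fm_predict(w0, w, V, x)) should hold.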


Learning FM

Two common loss functions and their gradients with respect to \( \hat y \) are the squared loss for regression, $$ l(\hat y, y) = (\hat y - y)^2, \qquad \frac{\partial l}{\partial \hat y} = 2 (\hat y - y), $$ and the logit loss for binary classification with \( y \in \{-1, +1\} \), $$ l(\hat y, y) = -\ln \sigma(\hat y \, y), \qquad \frac{\partial l}{\partial \hat y} = (\sigma(\hat y \, y) - 1) \, y. $$ By the chain rule, \( \frac{\partial l}{\partial \theta} = \frac{\partial l}{\partial \hat y} \frac{\partial \hat y}{\partial \theta} \) for every model parameter \( \theta \).
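
As a small illustration, the two loss derivatives with respect to \( \hat y \) (the multipliers combined with \( \partial \hat y / \partial \theta \) by the chain rule) can be written as follows; the function names are mine:

    import numpy as np

    def dloss_dyhat_squared(y_hat, y):
        """d/d y_hat of the squared loss (y_hat - y)^2."""
        return 2.0 * (y_hat - y)

    def dloss_dyhat_logit(y_hat, y):
        """d/d y_hat of -ln sigma(y_hat * y), for labels y in {-1, +1}."""
        sigma = 1.0 / (1.0 + np.exp(-y_hat * y))
        return (sigma - 1.0) * y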

Learning FM with Stochastic Gradient Descent

Deriving the gradient of \( \hat y (x) \) with respect to the parameters, we have: $$ \frac{\partial \hat y}{\partial \omega_0} = 1 $$ $$ \frac{\partial \hat y}{\partial \omega_i} = x_i $$ $$ \frac{\partial \hat y}{\partial v_{i, f}} = \frac{\partial }{\partial v_{i, f}} \frac{1}{2} \sum_{f'=1}^{k}\left( \left(\sum_{j=1}^{n}{v_{j, f'} x_j } \right)^2 - \sum_{j=1}^{n}{ v_{j, f'}^2 x_j^2 } \right) = x_i \left(\sum_{j=1}^{n}{v_{j,f} x_j} - v_{i,f} x_i \right) $$
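
Note that the per-factor sums \( \sum_j v_{j,f} x_j \) computed during prediction can be reused for the factor gradients; a small sketch, continuing the NumPy code above (function name is mine):

    def fm_grad_V(V, x):
        """Matrix of partials d y_hat / d v_{i,f} = x_i * (sum_j v_{j,f} x_j - v_{i,f} x_i)."""
        s = V.T @ x                                # s[f] = sum_j v_{j,f} x_j
        return x[:, None] * (s[None, :] - V * x[:, None])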

Integrating the above gradients into stochastic gradient descent for the regularized optimization problem: $$ \mathrm{OptReg}(S, \lambda) := \operatorname*{argmin}_{\Theta} \left( \sum_{(x, y) \in S}{l(\hat y(x \mid \Theta), y)} + \sum_{\theta \in \Theta}{\lambda_{\theta} \theta^2 } \right) $$

procedure SolveOptReg(S, \( \lambda \))

  1. \( \omega_0 := 0 \)
  2. \( w := (0, 0, ..., 0) \)
  3. \( V \sim N(0, \sigma) \)
  4. repeat until stopping criterion is met
    1. for \( (x, y) \in S \) do
      1. \( \partial_0 := \frac{\partial}{\partial \omega_0}(l(\hat y(x | \Theta), y )) \)
      2. \( \omega_0 := \omega_0 - \alpha (2 \lambda_0 \omega_0 + \partial_0) \)
      3. for \( i \in \{1, ..., n\} \) and \( x_i \ne 0 \) do
        • \( \partial_i := \frac{\partial}{\partial w_i}(l(\hat y(x | \Theta), y )) \)
        • \( w_i := w_i - \alpha(2 \lambda_w w_i + \partial_i) \)
        • for \( f \in \{1, ..., k\} \) do
          • \( \partial_{i, f} := \frac{\partial}{\partial v_{i, f}}(l(\hat y(x | \Theta), y )) \)
          • \( v_{i, f} := v_{i, f} - \alpha(2 \lambda_f v_{i, f} + \partial_{i, f}) \)
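
Below is a minimal NumPy sketch of one pass of SolveOptReg for the squared loss on dense inputs, reusing fm_predict, fm_grad_V and dloss_dyhat_squared from above; the variable names and the choice of squared loss are assumptions, not the only option:

    import numpy as np

    def sgd_epoch(S, w0, w, V, alpha, lam0, lam_w, lam_f):
        """One pass of SolveOptReg over S = [(x, y), ...] with squared loss."""
        for x, y in S:
            y_hat = fm_predict(w0, w, V, x)          # O(k n) prediction
            mult = dloss_dyhat_squared(y_hat, y)     # d l / d y_hat
            grad_V = fm_grad_V(V, x)                 # d y_hat / d v_{i,f}
            nz = np.nonzero(x)[0]                    # only coordinates with x_i != 0
            w0 = w0 - alpha * (2 * lam0 * w0 + mult * 1.0)
            w[nz] -= alpha * (2 * lam_w * w[nz] + mult * x[nz])
            V[nz] -= alpha * (2 * lam_f * V[nz] + mult * grad_V[nz])
        return w0, w, V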

Learning FM using SGD with Adaptive Regularization

The data set is split into training and validation sets, \( S = S_T \cup S_V \), with \( S_T \) used for learning \( \Theta \) and \( S_V \) for tuning \( \lambda \). This can be done by alternating between improving \( \Theta \) while \( \lambda \) is fixed and improving \( \lambda \) while \( \Theta \) is fixed; the first part is the same as the SGD procedure above, and we derive the second part below:

Instead of evaluating $$ \lambda^* \mid \Theta^t := \operatorname*{argmin}_{\lambda \in \mathbb{R}_+} \sum_{(x, y) \in S_V} l(\hat y(x \mid \Theta^t), y), $$ whose right-hand side does not depend on \( \lambda \) and therefore has a zero gradient with respect to it, the main idea is to make the dependence of \( \hat y \) on \( \lambda \) explicit through the next parameter values: $$ \hat y(x \mid \Theta^{t+1}) = \omega_0^{t+1} + \sum_{i=1}^{n}{\omega_i^{t+1} x_i} + \sum_{i=1}^n \sum_{j=i+1}^n \langle v_i^{t+1}, v_j^{t+1} \rangle x_i x_j. $$ Combining the above with the SGD update $$ \theta^{t+1} = \theta^t - \alpha \left( \frac{\partial}{\partial \theta^t} l(\hat y (x \mid \Theta ^ t), y) + 2 \lambda \theta ^ t \right), $$ we have $$ \hat y(x \mid \Theta^{t+1}) = \left\{ \omega_0^t - \alpha \left( \frac{\partial}{\partial \omega_0^t} l(\hat y (x \mid \Theta ^ t), y) + 2 \lambda_0 \omega_0 ^ t \right) \right\} + \sum_{i = 1}^n x_i \left\{ \omega_i^t - \alpha \left( \frac{\partial}{\partial \omega_i^t} l(\hat y (x \mid \Theta ^ t), y) + 2 \lambda_w \omega_i ^ t \right) \right\} + \sum_{i = 1}^n \sum_{j=i+1}^n \sum_{f=1}^{k} x_i \left\{ v_{i,f}^t - \alpha \left( \frac{\partial}{\partial v_{i,f}^t} l(\hat y (x \mid \Theta ^ t), y) + 2 \lambda_f v_{i,f}^t \right) \right\} x_j \left\{ v_{j,f}^t - \alpha \left( \frac{\partial}{\partial v_{j,f}^t} l(\hat y (x \mid \Theta ^ t), y) + 2 \lambda_f v_{j,f}^t \right) \right\} \qquad (\text{Eq.}*) $$ It follows that we can instead solve $$ \lambda^* \mid \Theta^t := \operatorname*{argmin}_{\lambda \in \mathbb{R}_+} \sum_{(x, y) \in S_V} l(\hat y(x \mid \Theta^{t+1}), y), $$ which answers the question: "What is the best value of \( \lambda \) such that the next update on \( \Theta \) generates the smallest error on the validation set?"

The SGD-update for \( \lambda \) given a case \( (x,y) \in S_V \) is $$ \lambda^{t+1} = \lambda^t - \alpha \frac{\partial}{\partial \lambda} l(\hat y(x|\Theta^{t+1}), y) $$

Similarly to the above, the derivatives of the two loss functions with respect to \( \lambda \) are: $$ \frac{\partial}{\partial \lambda}\left( \hat y(x \mid \Theta^{t+1}) - y \right)^2 = 2\left(\hat y(x \mid \Theta^{t+1}) - y\right) \frac{\partial}{\partial \lambda}\hat y(x \mid \Theta^{t+1}) $$ $$ \frac{\partial}{\partial \lambda}\left( -\ln \sigma(\hat y(x \mid \Theta^{t+1}) \, y) \right) = \left(\sigma(\hat y(x \mid \Theta^{t+1}) \, y) - 1\right) y \, \frac{\partial}{\partial \lambda}\hat y(x \mid \Theta^{t+1}) $$

Differentiating the future-model expression (Eq.*), with \( \lambda_0, \lambda_w, \lambda_f \) corresponding to the regularization coefficients for \( \omega_0, \omega_i, v_{i, f} \) respectively, we have: $$ \frac{\partial}{\partial \lambda_0}\hat y(x \mid \Theta^{t+1}) = -2 \alpha \omega_0^t $$ $$ \frac{\partial}{\partial \lambda_w}\hat y(x \mid \Theta^{t+1}) = -2 \alpha \sum_{i=1}^n{\omega_i^t x_i} $$ $$ \frac{\partial}{\partial \lambda_f}\hat y(x \mid \Theta^{t+1}) = -2 \alpha \left[ \sum_{i=1}^n{x_i v_{i,f}^{t+1}} \sum_{j=1}^n{x_j v_{j,f}^t} - \sum_{j=1}^n{x_j^2 v_{j,f}^{t+1} v_{j,f}^t} \right] $$ which depends on both the current \( v_{i,f}^t \) and the future \( v_{i,f}^{t+1} \).
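
Continuing the NumPy sketch, these three derivatives for one case can be computed as follows (V_next stands for the future factors \( v^{t+1} \); all names are mine):

    def dyhat_dlambda(alpha, w0, w, V, V_next, x):
        """Partials of y_hat(x | Theta^{t+1}) w.r.t. lambda_0, lambda_w and each lambda_f."""
        d_l0 = -2 * alpha * w0
        d_lw = -2 * alpha * np.dot(w, x)
        s_next = V_next.T @ x                      # sum_i x_i v_{i,f}^{t+1}
        s_cur = V.T @ x                            # sum_j x_j v_{j,f}^t
        d_lf = -2 * alpha * (s_next * s_cur - (x ** 2) @ (V_next * V))
        return d_l0, d_lw, d_lf                    # d_lf is a length-k vector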

To enable faster calculation, instead of computing \( v_{i,f}^{t+1} \) exactly from a training case \( (x,y) \in S_T \), it can be approximated by $$ \tilde v_{i,f}^{t+1} := v_{i,f}^t - \alpha \left( \partial_{v_{i,f}} + 2 \lambda_f v_{i,f}^t \right) \approx v_{i,f}^{t+1} $$ where \( \partial_{v_{i,f}} \) is the gradient already computed in the preceding \( \Theta \)-step.
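
In code this amounts to reusing the cached gradient from the last \( \Theta \)-step; a tiny sketch (grad_V_cached denotes that cached \( \partial_{v_{i,f}} \), and the name is mine):

    def approx_future_V(V, grad_V_cached, alpha, lam_f):
        """Approximate V^{t+1} from the gradient cached during the preceding Theta-step."""
        return V - alpha * (grad_V_cached + 2 * lam_f * V)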

procedure SolveOptAdaptiveReg( \(S_T, S_V \) )

  1. \( \omega_0 := 0 \)
  2. \( w := (0, ..., 0) \)
  3. \( V \sim N(0, \sigma) \)
  4. \( \lambda := (0, ..., 0) \)
  5. \( \partial := (0, ..., 0) \)
  6. repeat until stopping criterion is met
    1. for \( (x, y) \in S_T \) do
      1. \( \partial_0 := \frac{\partial}{\partial \omega_0} l(\hat y(x|\Theta), y) \)
      2. \( \omega_0 := \omega_0 - \alpha (2 \lambda_0 \omega_0 + \partial_0 ) \)
      3. for \( i \in \{1, ..., n\} \) and \( x_i \ne 0 \) do
        1. \( \partial_i := \frac{\partial}{\partial \omega_i} l(\hat y(x|\Theta), y) \)
        2. \( \omega_i := \omega_i - \alpha ( 2 \lambda_w \omega_i + \partial_i) \)
        3. for \( f \in \{1, ..., k\} \) do
          1. \( \partial_{i,f} := \frac{\partial}{\partial v_{i,f}} l(\hat y(x|\Theta), y) \)
          2. \( v_{i,f} := v_{i,f} - \alpha (2 \lambda_f v_{i,f} + \partial_{i,f} ) \)
      4. \( (x', y') \sim S_V \)
      5. \( \lambda_0 := max(0, \lambda_0 - \alpha \frac{\partial}{\partial \lambda_0} l(\hat y(x'|\tilde \Theta), y')) \)
      6. \( \lambda_w := max(0, \lambda_w - \alpha \frac{\partial}{\partial \lambda_w} l(\hat y(x'|\tilde \Theta), y')) \)
      7. for \( f \in \{1, ..., k\} \) do
        1. \( \lambda_f := max(0, \lambda_f - \alpha \frac{\partial}{\partial \lambda_f} l(\hat y(x'|\tilde \Theta), y')) \)
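
The \( \lambda \)-step of this procedure for the squared loss could look as follows, reusing fm_predict, dloss_dyhat_squared, dyhat_dlambda and approx_future_V sketched above; every name here is an assumption of this sketch:

    def lambda_step(lam0, lam_w, lam_f, alpha, w0, w, V, grad_cache, xv, yv):
        """One adaptive-regularization update of (lambda_0, lambda_w, lambda_f) on a validation case."""
        d0, dw, dV = grad_cache                           # d l / d theta cached in the last Theta-step
        # approximated future parameters Theta~^{t+1}
        w0_next = w0 - alpha * (2 * lam0 * w0 + d0)
        w_next = w - alpha * (2 * lam_w * w + dw)
        V_next = approx_future_V(V, dV, alpha, lam_f)
        # chain rule: d l / d lambda = (d l / d y_hat) * (d y_hat / d lambda)
        mult = dloss_dyhat_squared(fm_predict(w0_next, w_next, V_next, xv), yv)
        d_l0, d_lw, d_lf = dyhat_dlambda(alpha, w0, w, V, V_next, xv)
        lam0 = max(0.0, lam0 - alpha * mult * d_l0)
        lam_w = max(0.0, lam_w - alpha * mult * d_lw)
        lam_f = np.maximum(0.0, lam_f - alpha * mult * d_lf)
        return lam0, lam_w, lam_f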

Learning FM with Alternating Least Squares

Learning FM with Markov Chain Monte Carlo

Simulation


Algorithm


Implementation


Note


Reference

