11

Can anyone explain how field-aware factorization machines (FFM) compare to standard Factorization Machines (FM)?

Standard: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

"Field Aware": http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf

josh
B_Miner

3 Answers

3

Suppose (prior to one-hot-encoding) you have predictors/fields from a set $Z$ (say movie genre, user gender, and user race). Suppose further, each predictor $z \in Z$ can take on one of $k_z$ values. After one-hot encoding, you will have a new set of binary features $X$ of size $K := \sum_{z \in Z}k_z$.

In a model with all interactions, you must estimate a matrix of interaction coefficients $Q$, which has $K\times (K+1) / 2$ unique terms.

The factorization machine puts structure on the matrix $Q$, and assumes that $Q \equiv W^{T} W$, where $W$ is of dimension $l \times K$, with $1\le l \le K$ some number specified by the user. We estimate $W$ instead of $Q$.
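As a quick sanity check, here is a minimal NumPy sketch (with hypothetical sizes $K=6$, $l=2$) showing that the implied interaction matrix $Q = W^{T}W$ is symmetric and has rank at most $l$, so its $K(K+1)/2$ entries are pinned down by only $K \times l$ free parameters:

```python
import numpy as np

# Hypothetical sizes: K one-hot features, rank l chosen by the user.
K, l = 6, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(l, K))   # the l x K factor matrix we actually estimate

Q = W.T @ W                   # implied K x K interaction matrix
assert Q.shape == (K, K)
assert np.allclose(Q, Q.T)                # symmetric
assert np.linalg.matrix_rank(Q) <= l      # rank at most l
```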

The field-aware factorization machine also puts structure on $Q$, but it first partitions $Q$ into blocks indexed by pairs of the original fields. If $q_{z_i, z_j}$ denotes the $(z_i, z_j)$ block of $Q$, the FFM assumes that $q_{z_i,z_j}$ equals the corresponding block of $W_j^{T} W_i$, where each $W_i$ is of dimension $l \times K$. As with the FM, we estimate the matrices $W_i$ instead of $Q$.

The FM factorization of $Q$ has $K \times l$ parameters. The field-aware FM has $K \times l \times |Z|$ parameters. A model with all interactions has $K \times (K+1)/2$ parameters.
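To make the counts concrete, here is a small sketch with made-up field sizes (the genre/gender/race example above). Note that for such a tiny $K$ the FFM actually carries more parameters than the full-interaction model; the factorized counts only pay off when $K$ is large relative to $l$ and $l\,|Z|$:

```python
# Parameter counts for the three models, using hypothetical field sizes.
field_sizes = {"genre": 10, "gender": 2, "race": 5}
l = 4                                # user-chosen rank

K = sum(field_sizes.values())        # total one-hot features (here 17)
full = K * (K + 1) // 2              # all pairwise interactions
fm = K * l                           # FM: one l-vector per feature
ffm = K * l * len(field_sizes)       # FFM: one l-vector per feature per field

print(full, fm, ffm)                 # 153 68 204
```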

kalu
2

It seems like you're asking for a high-level description. In the slides linked in your original post, there's a direct comparison of FM (slide 11) vs FFM (slide 12).

As a quick example, if you're learning about users and movies, FM might have the following factor:

w_{user_1}*w_{movie_1}*... + w_{user_1}*w_{genre_1}*...

FFM would have:

w_{user_1, movies}*w_{movie_1, users}*... + w_{user_1, genres}*w_{genre_1, users}*...

The key difference is that in FM, the w_{user_1} coefficient is the same in both terms--there's a single notion of the user. In FFM, you learn a separate w_{user_1} for each context, e.g. whether it's interacting with movies or genres. Note that it isn't learned separately for each particular movie or genre, but for movies and genres generally. That is, it separately learns the context of the user for each type of interaction.

Also note that w_{movie_1} went to w_{movie_1, users} since that term is interacting with w_{user_1}, a user.
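Here is a minimal sketch of that lookup difference (feature names, fields, and sizes are made up): FM stores one latent vector per one-hot feature, while FFM stores one latent vector per (feature, partner-field) pair.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 3
features = ["user_1", "movie_1", "genre_1"]
field_of = {"user_1": "users", "movie_1": "movies", "genre_1": "genres"}

# FM: one latent vector per feature, reused in every interaction.
v_fm = {f: rng.normal(size=l) for f in features}

# FFM: one latent vector per (feature, opposing field) pair.
v_ffm = {(f, g): rng.normal(size=l)
         for f in features for g in set(field_of.values())
         if g != field_of[f]}

def fm_pair(a, b):
    return v_fm[a] @ v_fm[b]

def ffm_pair(a, b):
    # Each feature uses the copy learned for the *other* feature's field.
    return v_ffm[(a, field_of[b])] @ v_ffm[(b, field_of[a])]

# FM reuses the same user_1 vector against movies and genres;
# FFM learns a separate user_1 vector for each partner field.
assert not np.array_equal(v_ffm[("user_1", "movies")],
                          v_ffm[("user_1", "genres")])
```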

ZakJ
1

Standard factorization machines have fields too. The "novelty" here seems to be the use of GBDT features and the application of the hashing tricks. Not to great effect, it seems: check out the minute range in performance on the last slide.

Emre
  • According to the authors, there is indeed a field-aware characteristic to the model, relative to the standard implementation - it is stated in the Kaggle forums. I was just not able to follow what it meant and what the difference actually was. – B_Miner Oct 21 '14 at 13:51
  • http://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10555/3-idiots-solution/55932#post55932 – B_Miner Oct 21 '14 at 14:12
  • Based on slide 14, it appears that they based their solution on [this](https://kaggle2.blob.core.windows.net/competitions/kddcup2012/2748/media/Opera.pdf) paper (_[Ensemble of Collaborative Filtering and Feature Engineered Models for Click Through Rate Prediction](https://kaggle2.blob.core.windows.net/competitions/kddcup2012/2748/media/OperaSlides.pdf)_). – Emre Oct 22 '14 at 00:20