We want to solve a regression problem of the form "given two objects $x$ and $y$, predict their score (think about it as a similarity) $w(x,y)$". We have 2 types of features:
- For each object, we have about 1000 numerical features, mainly of the following types: 1) historical score information, e.g. the historical mean of $w(x,\cdot)$ up to the point at which the feature is used; 2) 0/1 indicators of whether object $x$ has a particular attribute; etc.
- For each object, we also have a text describing it (the description is not reliable, but it is still useful).
Clearly, when predicting a score for a pair $(x,y)$, we can use features for both $x$ and $y$.
We are currently using the following setup (I omit validation/testing):
- For texts, we compute their BERT embeddings and then produce a feature based on the similarity between the embedding vectors (e.g. the cosine similarity between them); a rough sketch of this pipeline is given right after this list.
- We split the dataset into a fine-tuning dataset and a training dataset. The fine-tuning dataset may be empty, in which case no fine-tuning is done.
- Using the fine-tuning dataset, we fine-tune BERT embeddings.
- Using the training dataset, we train decision trees to predict the scores.
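For concreteness, the pipeline looks roughly like this. It is a simplified sketch, not the exact code: mean pooling and `bert-base-uncased` are only for illustration, `HistGradientBoostingRegressor` stands in for whatever tree implementation we actually use, and `texts_x`, `texts_y`, `features_x`, `features_y`, `w_xy` are placeholder arrays for the pair data:

```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import HistGradientBoostingRegressor

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(texts):
    """Mean-pooled BERT embeddings (768-dim), one row per text."""
    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state       # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)   # (batch, tokens, 1)
    return (hidden * mask).sum(1) / mask.sum(1)    # (batch, 768)

# Text-similarity feature for each pair (x, y).
cos_sim = F.cosine_similarity(embed(texts_x), embed(texts_y), dim=1).numpy()

# ~1000 numerical features per object, plus the similarity feature.
X_train = np.hstack([features_x, features_y, cos_sim[:, None]])
trees = HistGradientBoostingRegressor()
trees.fit(X_train, w_xy)   # w_xy = observed pair scores on the training split
```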
We compare the following approaches:
- Without BERT features.
- Using BERT features, but without fine-tuning. This gives a reasonable improvement in prediction accuracy.
- Using BERT features, with fine-tuning. The additional improvement is very small (although predictions based on the BERT features alone do improve, of course).
Question: Is there something simple I'm missing in this approach? For example, are there better ways to use the texts, other ways to use the embeddings, or better models than decision trees?
I have tried several things, without success. The approaches I expected to provide improvements are the following:
First approach: fine-tune the embeddings to predict the difference between $w(x,y)$ and the mean $w(x,\cdot)$. The motivation is that we already have a "mean $w(x,\cdot)$" feature, which serves as a baseline for object $x$, so what we are really interested in is the deviation from this baseline.
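Schematically, this fine-tuning step looks as follows. It is a minimal sketch assuming a bi-encoder with a small linear head on the concatenated embeddings; `tokenizer` and `bert` are as in the sketch above, and `finetune_loader` is a placeholder for an iterator over the fine-tuning split:

```python
import torch
import torch.nn as nn

def embed_trainable(texts):
    # Same mean pooling as above, but without torch.no_grad(),
    # so that gradients flow back into BERT during fine-tuning.
    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

head = nn.Linear(768 * 2, 1)   # small regression head on top of the pair embedding
opt = torch.optim.AdamW(list(bert.parameters()) + list(head.parameters()), lr=2e-5)

bert.train()
for texts_x, texts_y, w_xy, mean_w_x in finetune_loader:
    pred = head(torch.cat([embed_trainable(texts_x),
                           embed_trainable(texts_y)], dim=1)).squeeze(1)
    # Target is the deviation from the per-object baseline, not the raw score.
    loss = nn.functional.mse_loss(pred, w_xy - mean_w_x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```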
Second approach: use a neural network instead of decision trees. Namely, I use a few dense layers to turn the embedding vectors into features, like this:
```python
nn.Sequential(
    nn.Linear(768 * 2, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.BatchNorm1d(500), nn.ReLU(),
    nn.Linear(500, 100), nn.BatchNorm1d(100), nn.ReLU(),
    nn.Linear(100, 10), nn.BatchNorm1d(10), nn.ReLU(),
)
```

After that, I combine these new $10$ features with the $2000$ features I already have and use a similar architecture on top of them:
```python
nn.Sequential(
    nn.Linear(10 + n_features, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.BatchNorm1d(500), nn.ReLU(),
    nn.Linear(500, 100), nn.BatchNorm1d(100), nn.ReLU(),
    nn.Linear(100, 1),
)
```
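For completeness, here is roughly how the two blocks are wired together (a sketch with illustrative names; the training loop and data loading are omitted):

```python
import torch
import torch.nn as nn

class PairRegressor(nn.Module):
    """Wiring of the two blocks above: `text_net` is the 768*2 -> 10 block,
    `head` is the (10 + n_features) -> 1 block."""
    def __init__(self, text_net: nn.Sequential, head: nn.Sequential):
        super().__init__()
        self.text_net = text_net
        self.head = head

    def forward(self, emb_x, emb_y, tabular):
        # Map the concatenated BERT embeddings to 10 learned text features,
        # then concatenate with the ~2000 tabular features and regress the score.
        text_feats = self.text_net(torch.cat([emb_x, emb_y], dim=1))
        return self.head(torch.cat([text_feats, tabular], dim=1)).squeeze(1)
```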
But as a result, my predictions are much worse than with decision trees. Are there better architectures suited to my case?