For a research project, I'm planning to use an LSTM to learn from sequences of KG entities. However, I have little experience using LSTMs or RNNs in general. During planning, a few questions concerning feature engineering have come up.
Let me give you some context:
My initial data will be a collection of $n$ texts.
From these texts, I will extract $n$ sequences of entities of variable length using a DBPedia or Wikidata tagger. Consequently, I'll have $n$ sequences of KG entities that somehow correspond to their textual counterparts.
Most LSTM implementations I've seen take only one type of feature as input. However, as we're dealing with knowledge graphs, we have access to more types of information. I'm wondering what would be a good strategy to use more than just one type of feature.
Objective
Given a sequence of seen entities, I want the model to predict the continuation of that sequence. A set of truncated sequences from the corpus will be held out: the beginnings will serve as prompts, and the endings as ground truth for evaluation.
I'm also interested in the model's prediction probabilities when predicting following entities for one single entity given as a prompt.
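As I understand it, those prediction probabilities would just be the softmax distribution over the entity vocabulary at the output layer. A toy illustration with made-up logits over a 7-entity vocabulary:

```python
import torch

# Illustrative only: logits over a 7-entity vocabulary for the next step,
# turned into a probability distribution with softmax.
logits = torch.tensor([2.0, 0.5, -1.0, 0.0, 1.0, -0.5, 0.3])
probs = torch.softmax(logits, dim=-1)
# probs sums to 1; probs.argmax() gives the most likely next entity (0-based index).
```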
Assumptions
I assume that diverse types of features will help the model make good predictions. Specifically, I want the model to learn not only from entity sequences but also from KG 'metadata' like associated RDF classes or pre-computed embedding vectors.
Features
Feature 1: Numerical vocabulary features
The simplest case I can think of is to create an ordered set from all extracted entities.
For example, if the extracted entities from all my documents were [U2, rock, post-punk, yen, Bono, revolutionary, guitar] (in reality there will probably be a few thousand more), I'd create this ordered set representing my vocabulary:
{1: http://dbpedia.org/resource/U2, 2: http://dbpedia.org/resource/Rock_music, 3: http://dbpedia.org/resource/Post-punk, 4: http://dbpedia.org/resource/Japanese_yen, 5: http://dbpedia.org/resource/Bono, 6: http://dbpedia.org/resource/Revolutionary, 7: http://dbpedia.org/resource/Acoustic_guitar}
The training data for the LSTM would then be sequences of integers such as
training_data = [
# Datapoint 1
[[1, 2, 3, 4, 5, 6, 7]], #document 1
# Datapoint 2
[[5, 3, 3, 1, 6]], #document 2
# Datapoint 3
[[2, 4, 5, 7, 1, 6, 2, 1, 7]], #document 3
...]
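To make this concrete, here's a minimal sketch of how I'd build the vocabulary and encode sequences (the URIs and the reserved padding index 0 are my own assumptions, not requirements):

```python
# Minimal sketch: build an entity vocabulary and encode sequences as integer IDs.
# IDs are 1-based so that 0 can later serve as a padding index.
entities = [
    "http://dbpedia.org/resource/U2",
    "http://dbpedia.org/resource/Rock_music",
    "http://dbpedia.org/resource/Post-punk",
]

entity2id = {uri: i + 1 for i, uri in enumerate(sorted(set(entities)))}

def encode(sequence):
    """Map a sequence of entity URIs to a list of integer IDs."""
    return [entity2id[uri] for uri in sequence]
```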
Feature 2: Numerical class features
I want to include additional information about RDF classes. Similar to the approach in Feature 1, I could create an ordered set containing all possible classes. The difference, however, is that each entity belongs to one or more classes.
If all classes extracted were
{1: owl:Thing, 2: dbo:MusicGenre, 3: dbo:Agent, 4: dbo:Person, 5: dbo:PersonFunction}
I would create an additional data structure for each data point, this time containing class information. The notation below maps each sequence position to the classes of the entity at that position, i.e. {position: [classes]}. My training data could then look something like this:
training_data = [
# Datapoint 1
[
[1, 2, 3, 4, 5, 6, 7], # feature 1
{1: [1,2,4], 2: [2,3,4,5], ..., 7: [3,5]} # feature 2
],
# Datapoint 2
[
[5, 3, 3, 1, 6], # feature 1
{1: [2,3,4], 2: [1,2,4,5], ..., 5: [3,5]} # feature 2
],
# Datapoint 3
[
[2, 4, 5, 7, 1, 6, 2, 1, 7], # feature 1
{1: [1,2,4], 2: [1,2,3,5], ..., 9: [2,3]} # feature 2
],
...]
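Since each entity can have several classes, one way I could feed this to a neural model is a multi-hot vector per sequence position. A sketch using the class IDs from datapoint 1 above (the helper name `multi_hot` is just my placeholder):

```python
# Minimal sketch: encode "one or more classes per entity" as multi-hot vectors.
# Class IDs are 1-based, matching the ordered set of classes above.
NUM_CLASSES = 5

def multi_hot(class_ids, num_classes=NUM_CLASSES):
    """Turn a list of 1-based class IDs into a fixed-size multi-hot vector."""
    vec = [0.0] * num_classes
    for c in class_ids:
        vec[c - 1] = 1.0
    return vec

# One multi-hot vector per sequence position of datapoint 1:
entity_classes = [[1, 2, 4], [2, 3, 4, 5]]
class_features = [multi_hot(c) for c in entity_classes]
# class_features[0] == [1.0, 1.0, 0.0, 1.0, 0.0]
```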
Feature 3: RDF2Vec embeddings
Each KG entity from a collection of entities can be mapped into a low-dimensional space using tools like RDF2Vec. I'm not sure whether to use this feature or not as its latent semantic content might interfere with my research question, but it is an option.
Embedding features, in this case, are vectors of length 200:
embedding_vector = tensor([5.9035e-01, 2.6974e-01, 8.6569e-01, 8.9759e-01, 9.3032e-01,
                           ...,  # 190 values omitted for brevity
                           9.8463e-01, 6.3303e-01, 4.8519e-01, 7.6163e-01, 3.3821e-01])
If I included this in my training data, it would look something like this:
training_data = [
# Datapoint 1
[
[1, 2, 3, 4, 5, 6, 7], # feature 1
{1: [1,2,4], 2: [2,3,4,5], ..., 7: [3,5]}, # feature 2
[7 embedding vectors], # feature 3
],
# Datapoint 2
[
[5, 3, 3, 1, 6], # feature 1
{1: [2,3,4], 2: [1,2,4,5], ..., 5: [3,5]}, # feature 2
[5 embedding vectors], # feature 3
],
# Datapoint 3
[
[2, 4, 5, 7, 1, 6, 2, 1, 7], # feature 1
{1: [1,2,4], 2: [1,2,3,5], ..., 9: [2,3]}, # feature 2
[9 embedding vectors], # feature 3
],
...]
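From what I've read, one common strategy (and the one I'm leaning towards) for combining these feature types is to concatenate the per-step feature vectors before feeding them to the LSTM. A sketch with made-up sizes and dummy data (vocab of 8 entities with 0 as padding, 5 classes, 200-dim RDF2Vec vectors):

```python
import torch
import torch.nn as nn

# Sketch: combine the three feature types per time step by concatenation.
# All sizes below are illustrative assumptions.
VOCAB_SIZE, NUM_CLASSES, RDF2VEC_DIM, ENT_EMB_DIM = 8, 5, 200, 32

entity_emb = nn.Embedding(VOCAB_SIZE, ENT_EMB_DIM, padding_idx=0)
lstm = nn.LSTM(ENT_EMB_DIM + NUM_CLASSES + RDF2VEC_DIM, hidden_size=64, batch_first=True)

# One document of length 7 (batch of 1), with dummy class and embedding features.
entity_ids = torch.tensor([[1, 2, 3, 4, 5, 6, 7]])          # feature 1
class_multi_hot = torch.rand(1, 7, NUM_CLASSES).round()     # feature 2 (dummy)
rdf2vec = torch.rand(1, 7, RDF2VEC_DIM)                     # feature 3 (dummy)

# Concatenate along the feature dimension, then run the LSTM.
x = torch.cat([entity_emb(entity_ids), class_multi_hot, rdf2vec], dim=-1)
output, (h, c) = lstm(x)  # output: (1, 7, 64)
```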
Questions
My training data will consist of lists of variable length and matrices/tensors. How do I best feed this data to the model? In any case, I'm only interested in predicting entities. Training only on Feature 1 could serve as a baseline that I compare against combinations of features, e.g. Features 1+2, 1+3, or 1+2+3.
Based on what I've read so far, I think I'm going to use padding and masking. However, I'm not sure what my features should ultimately look like.
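My current understanding of padding in PyTorch is roughly the following, using `pad_sequence` and `pack_padded_sequence` (using 0 as the padding index is my own assumption):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Sketch: pad variable-length ID sequences and pack them so the LSTM
# ignores the padded positions (0 is the assumed padding index).
seqs = [torch.tensor([1, 2, 3, 4, 5, 6, 7]),
        torch.tensor([5, 3, 3, 1, 6])]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape (2, 7)
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
# `packed` can be fed to nn.LSTM directly; a cross-entropy loss on the
# predictions can additionally use ignore_index=0 to mask the padding.
```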
I appreciate any kind of feedback. Thanks for sharing your thoughts!