For a research project, I'm planning to use an LSTM to learn from sequences of KG entities. However, I have little experience using LSTMs or RNNs in general. During planning, a few questions concerning feature engineering have come up.
Let me give you some context:
My initial data will be a collection of $n$ texts.
From these texts, I will extract $n$ sequences of entities of variable length using a DBPedia or Wikidata tagger. Consequently, I'll have $n$ sequences of KG entities that somehow correspond to their textual counterparts.
Most LSTM implementations I've seen take only one type of feature as input. However, as we're dealing with knowledge graphs, we have access to more types of information. I'm wondering what would be a good strategy to use more than just one type of feature.
Objective
Given a sequence of seen entities, I want the model to predict the continuation of that sequence. A set of truncated sequences from the corpus will be held out: the beginnings will serve as prompts, and the endings as ground truth for evaluation.
I'm also interested in the model's prediction probabilities when predicting following entities for one single entity given as a prompt.
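As I understand it, those prediction probabilities would just be the softmax distribution over the entity vocabulary at the output layer. A toy illustration with made-up logits over a 7-entity vocabulary:

```python
import torch

# Illustrative only: logits over a 7-entity vocabulary for the next step,
# turned into a probability distribution with softmax.
logits = torch.tensor([2.0, 0.5, -1.0, 0.0, 1.0, -0.5, 0.3])
probs = torch.softmax(logits, dim=-1)
# probs sums to 1; probs.argmax() gives the most likely next entity (0-based index).
```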
Assumptions
I assume that diverse types of features will help the model make good predictions. Specifically, I want the model to learn not only from entity sequences but also from KG 'metadata' like associated RDF classes or pre-computed embedding vectors.
Features
Feature 1: Numerical vocabulary features
The simplest case I can think of is to create an ordered set from all extracted entities.
For example, if the extracted entities from all my documents were [U2, rock, post-punk, yen, Bono, revolutionary, guitar] (in reality there will probably be a few thousand more), I'd create this ordered set representing my vocabulary:
{1: http://dbpedia.org/resource/U2, 2: http://dbpedia.org/resource/Rock_music, 3: http://dbpedia.org/resource/Post-punk, 4: http://dbpedia.org/resource/Japanese_yen, 5: http://dbpedia.org/resource/Bono, 6: http://dbpedia.org/resource/Revolutionary, 7: http://dbpedia.org/resource/Acoustic_guitar}
The training data for the LSTM would then be sequences of integers such as
training_data = [
# Datapoint 1
[[1, 2, 3, 4, 5, 6, 7]], #document 1
# Datapoint 2
[[5, 3, 3, 1, 6]], #document 2
# Datapoint 3
[[2, 4, 5, 7, 1, 6, 2, 1, 7]], #document 3
...]
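To make this concrete, here's a minimal sketch of how I'd build the vocabulary and encode sequences (the URIs and the reserved padding index 0 are my own assumptions, not requirements):

```python
# Minimal sketch: build an entity vocabulary and encode sequences as integer IDs.
# IDs are 1-based so that 0 can later serve as a padding index.
entities = [
    "http://dbpedia.org/resource/U2",
    "http://dbpedia.org/resource/Rock_music",
    "http://dbpedia.org/resource/Post-punk",
]

entity2id = {uri: i + 1 for i, uri in enumerate(sorted(set(entities)))}

def encode(sequence):
    """Map a sequence of entity URIs to a list of integer IDs."""
    return [entity2id[uri] for uri in sequence]
```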
Feature 2: Numerical class features
I want to include additional information about RDF classes. Similar to the approach in Feature 1, I could create an ordered set containing all possible classes. The difference, however, is that each entity belongs to one or more classes.
If all classes extracted were
{1: owl:Thing, 2: dbo:MusicGenre, 3: dbo:Agent, 4: dbo:Person, 5: dbo:PersonFunction}
I would create an additional data structure for each data point, this time containing class information. The notation below maps each sequence position to the classes of the entity at that position, i.e. {position: [classes]}. My training data could then look something like this:
training_data = [
# Datapoint 1
[
[1, 2, 3, 4, 5, 6, 7], # feature 1
{1: [1,2,4], 2: [2,3,4,5], ..., 7: [3,5]} # feature 2
],
# Datapoint 2
[
[5, 3, 3, 1, 6], # feature 1
{1: [2,3,4], 2: [1,2,4,5], ..., 5: [3,5]} # feature 2
],
# Datapoint 3
[
[2, 4, 5, 7, 1, 6, 2, 1, 7], # feature 1
{1: [1,2,4], 2: [1,2,3,5], ..., 9: [2,3]} # feature 2
],
...]
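Since each entity can have several classes, one way I could feed this to a neural model is a multi-hot vector per sequence position. A sketch using the class IDs from datapoint 1 above (the helper name `multi_hot` is just my placeholder):

```python
# Minimal sketch: encode "one or more classes per entity" as multi-hot vectors.
# Class IDs are 1-based, matching the ordered set of classes above.
NUM_CLASSES = 5

def multi_hot(class_ids, num_classes=NUM_CLASSES):
    """Turn a list of 1-based class IDs into a fixed-size multi-hot vector."""
    vec = [0.0] * num_classes
    for c in class_ids:
        vec[c - 1] = 1.0
    return vec

# One multi-hot vector per sequence position of datapoint 1:
entity_classes = [[1, 2, 4], [2, 3, 4, 5]]
class_features = [multi_hot(c) for c in entity_classes]
# class_features[0] == [1.0, 1.0, 0.0, 1.0, 0.0]
```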
Feature 3: RDF2Vec embeddings
Each KG entity from a collection of entities can be mapped into a low-dimensional space using tools like RDF2Vec. I'm not sure whether to use this feature or not as its latent semantic content might interfere with my research question, but it is an option.
Embedding features, in this case, are vectors of length 200:
embedding_vector = tensor([5.9035e-01, 2.6974e-01, 8.6569e-01, 8.9759e-01, 9.3032e-01,
                           ...,  # 190 values omitted for brevity
                           9.8463e-01, 6.3303e-01, 4.8519e-01, 7.6163e-01, 3.3821e-01])
If I included this in my training data, it would look something like this:
training_data = [
# Datapoint 1
[
[1, 2, 3, 4, 5, 6, 7], # feature 1
{1: [1,2,4], 2: [2,3,4,5], ..., 7: [3,5]}, # feature 2
[7 embedding vectors], # feature 3
],
# Datapoint 2
[
[5, 3, 3, 1, 6], # feature 1
{1: [2,3,4], 2: [1,2,4,5], ..., 5: [3,5]}, # feature 2
[5 embedding vectors], # feature 3
],
# Datapoint 3
[
[2, 4, 5, 7, 1, 6, 2, 1, 7], # feature 1
{1: [1,2,4], 2: [1,2,3,5], ..., 9: [2,3]}, # feature 2
[9 embedding vectors], # feature 3
],
...]
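From what I've read, one common strategy (and the one I'm leaning towards) for combining these feature types is to concatenate the per-step feature vectors before feeding them to the LSTM. A sketch with made-up sizes and dummy data (vocab of 8 entities with 0 as padding, 5 classes, 200-dim RDF2Vec vectors):

```python
import torch
import torch.nn as nn

# Sketch: combine the three feature types per time step by concatenation.
# All sizes below are illustrative assumptions.
VOCAB_SIZE, NUM_CLASSES, RDF2VEC_DIM, ENT_EMB_DIM = 8, 5, 200, 32

entity_emb = nn.Embedding(VOCAB_SIZE, ENT_EMB_DIM, padding_idx=0)
lstm = nn.LSTM(ENT_EMB_DIM + NUM_CLASSES + RDF2VEC_DIM, hidden_size=64, batch_first=True)

# One document of length 7 (batch of 1), with dummy class and embedding features.
entity_ids = torch.tensor([[1, 2, 3, 4, 5, 6, 7]])          # feature 1
class_multi_hot = torch.rand(1, 7, NUM_CLASSES).round()     # feature 2 (dummy)
rdf2vec = torch.rand(1, 7, RDF2VEC_DIM)                     # feature 3 (dummy)

# Concatenate along the feature dimension, then run the LSTM.
x = torch.cat([entity_emb(entity_ids), class_multi_hot, rdf2vec], dim=-1)
output, (h, c) = lstm(x)  # output: (1, 7, 64)
```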
Questions
My training data will consist of lists of variable length and matrices/tensors. How do I best feed this data to the model? In any case, I'm only interested in predicting entities. Training only on Feature 1 could serve as a baseline that I compare against combinations of features, e.g. Features 1+2, 1+3, or 1+2+3.
Based on what I've read so far, I think I'm going to use padding and masking. However, I'm not sure what my features should ultimately look like.
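My current understanding of padding in PyTorch is roughly the following, using `pad_sequence` and `pack_padded_sequence` (using 0 as the padding index is my own assumption):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Sketch: pad variable-length ID sequences and pack them so the LSTM
# ignores the padded positions (0 is the assumed padding index).
seqs = [torch.tensor([1, 2, 3, 4, 5, 6, 7]),
        torch.tensor([5, 3, 3, 1, 6])]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape (2, 7)
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
# `packed` can be fed to nn.LSTM directly; a cross-entropy loss on the
# predictions can additionally use ignore_index=0 to mask the padding.
```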
I appreciate any kind of feedback. Thanks for sharing your thoughts!