
I want to preface my question by saying that the situation I describe may not be applicable to Kohonen self-organising maps (SOMs), due to a lack of understanding on my part, so I apologise if that is the case. If so, I would greatly appreciate any suggestions on alternative methods for comparing similarities between my given input data.

I am trying to create a self-organising map to compare the similarity between the n-gram ordered sets of tokens that can be extracted from a given input sequence. For example, if the tokens are:

[A, B, C]

And my input is:

A, B, B, C, A

Then the resulting 1-gram ordered set would be:

A, B, C
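To make the construction concrete, here is a minimal Python sketch of how such an ordered set can be built (the function name is just illustrative):

```python
# Minimal sketch: the 1-gram ordered set is just the unique tokens, sorted.
def ordered_unigram_set(sequence):
    """Return the unique tokens of `sequence` in alphabetical order."""
    return sorted(set(sequence))

print(ordered_unigram_set(["A", "B", "B", "C", "A"]))  # ['A', 'B', 'C']
```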

I will try to highlight my question using a simple example followed by some more realistic cases I am dealing with.

If I have the collection of tokens:

[A, B, C, D, E, F]

The input data contains K variable-length strings made from these tokens.

A, B, A, B, C, D
A, A, A, B, B, E, F
A, B, C
E, F, E, E, F, D, C, A, A, A, A

I then generate the unique ordered set of tokens found in each of these sequences, using the alphabetical order of the tokens.

A, B, C, D
A, B, E, F
A, B, C
A, C, D, E, F

The above would be the 1-gram ordered sets for these sequences. However, I need to be able to represent the N-gram ordered sets.

E.g. for 2-grams the ordered set of tokens for each sequence would be:

AB, BA, BC, CD
AA, AB, BB, BE, EF
AB, BC
AA, CA, DC, EE, EF, FD, FE

And for 3-grams:

ABA, ABC, BAB, BCD
AAA, AAB, ABB, BBE, BEF
ABC
AAA, CAA, DCA, EEF, EFD, EFE, FDC, FEE

The 4-grams would only produce 3 ordered sets rather than 4, as the sequence A, B, C only has 3 tokens.
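For reference, here is a rough sketch of how these N-gram ordered sets can be generated (the helper name and the joining of n-grams into strings are just my own illustration):

```python
def ordered_ngram_set(sequence, n):
    """Return the alphabetically ordered set of n-grams occurring in `sequence`."""
    # A sequence shorter than n yields no n-grams (hence only 3 sets for 4-grams above).
    ngrams = {"".join(sequence[i:i + n]) for i in range(len(sequence) - n + 1)}
    return sorted(ngrams)

sequences = [
    ["A", "B", "A", "B", "C", "D"],
    ["A", "A", "A", "B", "B", "E", "F"],
    ["A", "B", "C"],
    ["E", "F", "E", "E", "F", "D", "C", "A", "A", "A", "A"],
]

for seq in sequences:
    print(ordered_ngram_set(seq, 2))
# ['AB', 'BA', 'BC', 'CD']
# ['AA', 'AB', 'BB', 'BE', 'EF']
# ['AB', 'BC']
# ['AA', 'CA', 'DC', 'EE', 'EF', 'FD', 'FE']
```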

My initial thought was to have the set of attributes be every possible n-gram over the token set (i.e. all permutations with repetition of length N).

For 1-grams this would be:

A B C D E F

And for 2-grams this would be:

AA AB AC AD AE AF BA BB BC BD BE BF CA CB CC CD CE CF DA DB DC DD DE DF EA EB EC ED EE EF FA FB FC FD FE FF

Unfortunately, this very quickly gets out of hand, even when just going from 1-grams to 2-grams, since the number of attributes grows as T^N (for T tokens and N-gram length N). This seems completely impractical, especially given that the majority of the resulting matrix consists of 0's wherever an attribute does not occur in a given sequence.
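For completeness, this is roughly what I was attempting (a binary presence/absence encoding over every possible n-gram; I believe this is the part that does not scale):

```python
from itertools import product

tokens = ["A", "B", "C", "D", "E", "F"]
n = 2

# All T^N possible n-grams become attributes: 6^2 = 36 here,
# but with ~150 tokens it is already 150^2 = 22,500 for 2-grams.
attributes = ["".join(p) for p in product(tokens, repeat=n)]

def to_binary_vector(ngram_set, attributes):
    """One (mostly zero) row of the attribute matrix: 1 if the n-gram occurs."""
    present = set(ngram_set)
    return [1 if a in present else 0 for a in attributes]

row = to_binary_vector(["AB", "BA", "BC", "CD"], attributes)
print(len(attributes), sum(row))  # 36 attributes, only 4 of them non-zero
```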

The problem is only made worse as the collection of tokens I have is closer to 150 rather than the 6 used in this example.

Each ordered set is also highly variable in length, ranging from 1 up to the number of possible n-gram permutations for the given token set.

The other issue is that even if I were to restrict the attributes to only the n-grams that actually occur in the training sequences, there is no guarantee that test data would conform to this set of attributes.

I am guessing that I will need to somehow convert these ordered sets into a more manageable set of attributes, but I am unsure what would be a viable solution, as each set of tokens seems to carry an equal importance weighting.

I am also assuming that this will be difficult to generalise and that each N would require a separately trained model? (One for 1-grams, another for 2-grams... up to the maximum N-gram chosen?)

  • Welcome to DataScienceSE. Can you please clarify what is the goal? What kind of insight do you hope to find from using SOM on these combinations of n-grams? – Erwan Dec 21 '20 at 09:50
  • Hi Erwan, thank you for welcoming me :) My goal is to find similarities / overlaps between these n-gram token sequences (a signature of sorts) and to evaluate how accuracy changes for different N with variable-length samples. Without machine learning I can calculate the similarity between these sets fairly naïvely by comparing each against every other. My current corpora A and B have 50k and 30k samples respectively, and comparing them this way only yields a percentage overlap between each pair, but I wanted to explore using a SOM for visualising the high dimensionality so I can then explore clustering. – Cookies Dec 21 '20 at 12:09
  • I understand that there is some expert knowledge behind this, but I can't wrap my head around why you need to calculate all the combinations of tokens (n-grams) based on the ordered set. You consider these sets of n-grams as the signatures that need to be compared, right? And these sets of n-grams are completely determined from the unique ordered sets that you calculate at the beginning, so I feel like there could be some simpler way to compare these objects. There are some methods for grouping sets of n-grams for efficient comparison, but I'm not sure that's what you need here. – Erwan Dec 21 '20 at 14:13
  • Hmm you might be right, that is something I think would definitely be worth exploring. Would it be possible to provide some resources / links to some methods that can be used for grouping K samples where each sample is a set of n-grams? – Cookies Dec 21 '20 at 18:57
  • I was thinking about [record linkage](https://en.wikipedia.org/wiki/Record_linkage), which is about matching similar strings (usually names and/or addresses). One of the methods is blocking: grouping strings which have a minimum number of characters/n-grams in common, this way only pairs which are in the same group ("block") need to be compared. But I doubt it's relevant here, at least it would give the same result if you do it directly on the original ordered set so there's no point computing all the "n-grams". My impression is that you just need a custom similarity measure for these objects. – Erwan Dec 21 '20 at 20:22
  • Thank you for the resource; unfortunately I'm not too sure it achieves exactly what I am looking for. I was using the Jaccard similarity for my set comparisons, and after digging a bit deeper I found w-shingling/k-shingles, which seems to work a bit like a bag-of-words model. I was wondering if you might be able to help clarify that I have understood this correctly: do you know if using w-shingling and minhashing can generate a fixed-length feature vector for S given rows of data? And if so, could I pass new unseen data to a model, or would I need to calculate the minhash for each new data row? – Cookies Dec 22 '20 at 13:13
  • Whether with n-grams or w-shingles or anything similar to bag-of-words, the features are always built as vectors over the full vocabulary: the representation can be exactly like one hot encoding (boolean value) or it can also contain numeric values for representing frequency or a weight. It means that yes, you can have a fixed length vector, but be careful not to have too many features and/or not enough instances because that could cause overfitting. It also means that the full vocabulary must be determined from the training set (it's the same for any categorical feature btw), so any new ... – Erwan Dec 22 '20 at 15:03
  • ... 'element' in the test set is ignored or replaced with a special value representing 'unknown'. – Erwan Dec 22 '20 at 15:04
  • That makes sense, but I am assuming that if most of the vocabulary of possible sequences is covered, then the percentage similarity will be "close enough" to acknowledge some similarity whilst significantly reducing the number of features. For example, most permutations of the possible tokens simply never occur: in a test for 5-grams, I found that across 50k sets there are only ~9k distinct permutations (significantly fewer than the ~76 billion otherwise possible)! I'll explore minhash and LSH with this reduced set to check if the estimated Jaccard similarity yields feasible results (a rough sketch of this idea is included after this comment thread). – Cookies Dec 25 '20 at 14:25
  • Also, I wanted to check whether hashing the shingles with some hash function "h" would work, e.g. if the set was S = { [A,B,C], [B,B,B], ... }, it could also be represented as S = { h([A,B,C]), h([B,B,B]) }. Or would it be a better idea to extract the universal set mapping of permutations, M, to some index type capable of representing n-permutations, like M = { [A,B,C] -> 0, [B,B,B] -> 1, ... }? Or would using something like xxhash/murmur suffice due to the relatively low chance of hash collisions? Cheers and happy holidays :) – Cookies Dec 25 '20 at 14:35
  • To be honest I'm a bit lost with what you want to do and I'm not really knowledgeable about the methods that you mention, so I don't think I can help I'm afraid! Don't hesitate to ask another question about these specific methods, maybe somebody else could help. – Erwan Dec 25 '20 at 23:40
  • Not to worry :) Thank you for all the help and assistance you've provided so far! I think I have a few potential avenues to explore, so I shall see how they pan out, but it might be worth posting a new, more specific question. Reflecting on my initial question, I feel it makes a few assumptions that I cannot justify without first processing my data into more relevant parts. – Cookies Dec 26 '20 at 12:56
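Below is a rough, purely illustrative sketch of the Jaccard/MinHash idea discussed in these comments; the choice of md5 with a random salt per hash function and 128 signature positions are assumptions for illustration only, not a recommendation:

```python
import hashlib
import random

def jaccard(set_a, set_b):
    """Exact Jaccard similarity between two n-gram sets."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def minhash_signature(ngram_set, num_hashes=128, seed=42):
    """Fixed-length MinHash signature: the minimum hash per salted hash function.

    Assumes `ngram_set` is non-empty; empty sets would need special handling.
    """
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [
        min(int(hashlib.md5(f"{salt}:{g}".encode()).hexdigest(), 16) for g in ngram_set)
        for salt in salts
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

s1 = {"AB", "BA", "BC", "CD"}
s2 = {"AB", "BC"}
print(jaccard(s1, s2))  # 0.5
print(estimated_jaccard(minhash_signature(s1), minhash_signature(s2)))  # roughly 0.5
```

The signature gives a fixed-length vector per sample regardless of how many n-grams it contains, and new unseen rows only need their own signature computed with the same seeded hash functions.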

0 Answers