I'm very new to PyTorch and am taking a course on Udemy. There is something I find hard to understand, and I would like an explanation in simpler words than what I can find in the documentation.
One of the lessons is about solving a regression problem with an ANN in PyTorch.
The original dataframe contains categorical and continuous data.
We create embedding sizes only for the categorical data columns, using a size parameter that is calculated with a formula I don't understand:
# set embedding sizes
cat_cols=['sex','marital-status','education']
cat_szs=[len(df[col].cat.categories) for col in cat_cols]
#why 50???
emb_szs=[(size, min(50, (size+1)//2)) for size in cat_szs]
emb_szs
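To make the numbers concrete, here is a minimal sketch of the same formula in plain Python. It assumes the three columns have 14, 2 and 6 categories (hard-coded here instead of read from df), and the helper name emb_size is made up for illustration:

```python
# Category counts are hard-coded here instead of being read from a DataFrame.
cat_szs = [14, 2, 6]

def emb_size(n_categories, cap=50):
    """Half the number of categories (rounded up), but never more than cap."""
    return min(cap, (n_categories + 1) // 2)

# Same list comprehension as in the lesson, using the helper above.
emb_szs = [(size, emb_size(size)) for size in cat_szs]
print(emb_szs)  # [(14, 7), (2, 1), (6, 3)]

# The cap of 50 only has an effect for columns with many categories.
print(emb_size(1000))  # 50
```

So for these three columns the min(50, ...) part never kicks in; it would only limit the second number for a column with roughly 100 or more categories.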
I don't understand the following:
1. cat_szs is the number of unique values (categories) in each column. What is the embedding size calculation, and why do we need it?
2. How do I interpret the embedding size values?
[(14, 7), (2, 1), (6, 3)] I don't understand what the 7, 1, 3 stand for. I know they come from the formula, but what do they mean?
3. Why is an embedding needed? I understand it is a way to represent categorical data and that it is always needed for categorical data. Am I correct?
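For context on question 2, here is one possible reading of a pair such as (14, 7), sketched in plain Python rather than PyTorch. It assumes the pairs are later used as (number of categories, length of the vector per category), for example as the two arguments of torch.nn.Embedding; the model code is not shown in this post, so that is an assumption. The names table and category_code are made up for illustration:

```python
import random

random.seed(0)

num_categories, emb_dim = 14, 7  # the pair (14, 7) from emb_szs

# A lookup table with one row per category and emb_dim numbers per row.
# In a trained model these numbers would be learned; here they are random.
table = [[random.gauss(0, 1) for _ in range(emb_dim)]
         for _ in range(num_categories)]

# Integer code of one category value (hypothetical, e.g. from df[col].cat.codes).
category_code = 3

# "Embedding" a category is just picking its row from the table.
vector = table[category_code]

print(len(table), len(vector))  # 14 7
```

Under this reading, the first number is how many rows the table has and the second is how many numbers are used to describe each category.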