Converting non-numeric data values into equivalent rank scores

Question

Consider a data-frame similar to the one shown (the actual data-frame is much larger)

ID EDUCATION   OCCUPATION      BINARY_VAR
1  Undergrad   Student              1
2  Grad        Business Owner       1
3  Undergrad   Unemployed           0
4  PhD         Other                1

The final objective is to apply PRIDIT scoring to individual profiles (ID) based on discrete "rank" scores for the individual values in each cell. These ranks can be thought of as indicator variables which will be used collectively to rate any $ID_i$.

So for example, the ranks could signify the possibility of some $ID_i$ committing fraud:

1 : Low
2 : Medium
3 : High

The variable BINARY_VAR is something like a "training variable", or rather a "predictor variable", such that

$Var = 0:$ Fraud

$Var = 1:$ Non-fraud

By this reasoning, an unemployed Undergraduate would be a Rank 3 profile.

In order to apply PRIDIT, I must first convert the non-numeric variables into scores or levels.

The way it is currently being done is by applying correspondence analysis to each column against BINARY_VAR and then calculating the distance of the column contribution scores from the row contribution score for non-fraud.

Row and column scores look something like this (respectively):

            CONTR
0           1.654
1           98.346
------------------------------
                  CONTR
Undergraduate     2.803602e-04
Graduate          3.147824e+00
PhD               9.176451e+00
Other             1.179664e+01

The obtained distance (supposedly) gives the required score for the level, which is written back to the data-frame as a rank (higher value resulting in a higher rank).
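For illustration, the write-back step described above could be sketched as follows. The question does not say how the distance is defined, so this assumes the plain absolute difference between each column contribution and the non-fraud row contribution; the helper name `rank_levels` is hypothetical and ties are not handled.

```python
# Contribution scores copied from the output shown in the question.
row_contr = {0: 1.654, 1: 98.346}  # BINARY_VAR levels (1 = non-fraud)
col_contr = {
    "Undergraduate": 2.803602e-04,
    "Graduate": 3.147824e+00,
    "PhD": 9.176451e+00,
    "Other": 1.179664e+01,
}

def rank_levels(col_contr, reference):
    """Rank levels by |score - reference|; a larger distance gets a higher rank.

    Assumption: "distance" is the absolute difference of two scalar scores.
    """
    distance = {level: abs(score - reference) for level, score in col_contr.items()}
    ordered = sorted(distance, key=distance.get)  # smallest distance first
    return {level: rank for rank, level in enumerate(ordered, start=1)}

# Distance is measured from the non-fraud row score, as described in the question.
level_rank = rank_levels(col_contr, reference=row_contr[1])

# Writing the ranks back is then a lookup per cell.
column = ["Undergraduate", "Graduate", "Undergraduate", "PhD"]
ranked_column = [level_rank[value] for value in column]
```

Under that assumption, the level closest to the non-fraud score ("Other") gets rank 1 and the furthest ("Undergraduate") gets rank 4.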

My main concerns about this technique are:

The data-frame is really large, and resources are limited - it is a computationally expensive method.
It involves a lot of steps, and the result of the scoring cannot really be verified (can it?).

My questions are:

Does this technique seem viable?
What are better ways to assign "ranks" to non-numeric variables?

Diego · Answer 1 · 2016-07-28T22:24:28.543

1

To me the approach looks overcomplicated. If you're not limited to that one algorithm, use one-hot encoding and try out various classifiers. Many of them can predict a probability, which you can then use to calculate the ranks.
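A minimal sketch of that idea, using only the standard library so it runs anywhere. The hand-rolled logistic regression merely stands in for "various classifiers" (in practice a library classifier would be the natural choice), and the helper names and the 1/3 and 2/3 cutoffs for Low/Medium/High are assumptions made for illustration.

```python
import math

# Toy rows mirroring the question's table: (EDUCATION, OCCUPATION, BINARY_VAR).
rows = [
    ("Undergrad", "Student", 1),
    ("Grad", "Business Owner", 1),
    ("Undergrad", "Unemployed", 0),
    ("PhD", "Other", 1),
]

def one_hot_columns(rows, n_categorical=2):
    """Collect one (column index, level) pair per distinct category level."""
    return sorted({(j, row[j]) for row in rows for j in range(n_categorical)})

def encode(row, columns):
    """Turn a row into a 0/1 vector with one slot per (column, level) pair."""
    return [1.0 if row[j] == level else 0.0 for j, level in columns]

def predict_proba(x, w, b):
    """Logistic model: probability that the target equals 1."""
    z = sum(wk * xk for wk, xk in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the mean log-loss (illustrative only)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for k, xk in enumerate(xi):
                grad_w[k] += err * xk
            grad_b += err
        w = [wk - lr * g / len(X) for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def to_rank(p_fraud, cutoffs=(1 / 3, 2 / 3)):
    """Map a fraud probability to rank 1 (Low), 2 (Medium) or 3 (High)."""
    return 1 + sum(p_fraud >= c for c in cutoffs)

columns = one_hot_columns(rows)
X = [encode(r, columns) for r in rows]
y = [1.0 if r[2] == 0 else 0.0 for r in rows]  # target 1 = fraud (BINARY_VAR == 0)
w, b = fit_logistic(X, y)
ranks = [to_rank(predict_proba(x, w, b)) for x in X]
```

On this four-row toy table the unemployed undergraduate ends up with rank 3 and the other profiles with rank 1, matching the reasoning in the question. With real data the model should of course be evaluated on held-out records rather than on the rows it was fitted to.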

edited Jul 28 '16 at 22:24

answered Jul 28 '16 at 13:45


I _am_ doing that, this is an additional measure. Thanks, though. – yad Jul 28 '16 at 18:28
If you use an ML classifier such as SGD from scikit-learn, it will create a model that contains your "ranks" once it is trained. There is no need to calculate them manually. So maybe I misunderstood your endeavor and you want to redesign the classifier algorithm itself? – Diego Jul 28 '16 at 22:31

score 0 · Answer 2 · answered Jun 27 '16 at 09:06

Sorry cannot comment.

According to the description of PRIDIT ("ordinal categorical variables with different categories of possible response values, or continuous variables"), the model should support both continuous variables (ranks) and categories of possible response values (non-numeric data). I don't know your implementation, but the model should be able to handle categorical variables. In other words, you do not need to convert the non-numeric data into continuous variables to fit the model; at most, you need to assign each category an integer.
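Assigning integers to categories can be as small as the sketch below. The function name `integer_codes` and the ordering Undergrad < Grad < PhD are assumptions for illustration; whether an ordering is meaningful for a given column (OCCUPATION, for instance) is a modelling decision the code cannot make.

```python
def integer_codes(values, order=None):
    """Map each category to an integer code, starting at 1.

    `order` lists the levels from lowest to highest; if it is omitted,
    the levels are simply sorted alphabetically.
    """
    levels = list(order) if order is not None else sorted(set(values))
    mapping = {level: code for code, level in enumerate(levels, start=1)}
    return [mapping[v] for v in values], mapping

education = ["Undergrad", "Grad", "Undergrad", "PhD"]
codes, mapping = integer_codes(education, order=["Undergrad", "Grad", "PhD"])
```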

Does this technique seem viable?

This technique embeds the categories into a one-dimensional vector.
Pros: it is easy to show the relationship between a category and the predicted value.
Cons: it does not work well if the result depends on combinations of categories.

What are better ways to assign "ranks" to non-numeric variables?

I recommend keeping the categories as they are. Categorical input (as one-hot vectors) is supported by many current models, and keeping it avoids losing information.

PS: you do not have to use the whole data frame for the analysis; a sufficiently large random sample of records usually gives comparable results. In your case, you could draw the sample so that each unique category is represented by an equal number of records.
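One way to draw such a balanced sample, sketched with the standard library only. The function name `balanced_sample`, the dict-per-record layout and the fixed seed are assumptions for illustration; a level with fewer than `per_level` records simply contributes all of them.

```python
import random
from collections import defaultdict

def balanced_sample(rows, key, per_level, seed=0):
    """Draw up to `per_level` random rows for each distinct value of row[key]."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for level in sorted(groups):
        members = groups[level]
        sample.extend(rng.sample(members, min(per_level, len(members))))
    return sample

# Toy data: an unbalanced EDUCATION column.
records = (
    [{"ID": i, "EDUCATION": "Undergrad"} for i in range(10)]
    + [{"ID": 10 + i, "EDUCATION": "Grad"} for i in range(3)]
    + [{"ID": 13 + i, "EDUCATION": "PhD"} for i in range(5)]
)
subset = balanced_sample(records, key="EDUCATION", per_level=3)
```

Balancing on one column does not balance the others, so with several categorical columns the grouping key would need to be chosen with care.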
