Consider a data-frame similar to the one shown below (the actual data-frame is much larger):
ID  EDUCATION  OCCUPATION      BINARY_VAR
1   Undergrad  Student         1
2   Grad       Business Owner  1
3   Undergrad  Unemployed      0
4   PhD        Other           1
The final objective is to apply PRIDIT scoring to individual profiles (ID), based on discrete "rank" scores assigned to the value in each cell. These ranks can be thought of as indicator variables that are used collectively to rate any $ID_i$.
So, for example, the ranks could signify the likelihood of some $ID_i$ committing fraud:
1 : Low
2 : Medium
3 : High
The variable BINARY_VAR is something like a "training variable", or rather a "predictor variable", such that
$Var = 0$: Fraud
$Var = 1$: Non-fraud
By this reasoning, an unemployed Undergraduate would be a Rank 3 profile.
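For reference, the PRIDIT step that these ranks would feed into can be sketched roughly as follows. This is a minimal sketch, assuming the usual formulation as I understand it: each rank column is converted to RIDIT scores (proportion of profiles in lower categories minus proportion in higher categories), and the profile score is the projection onto the first principal component of those scores. The function names `ridit_scores` and `pridit`, the column names, and the rank values in the toy frame are all hypothetical.

```python
import numpy as np
import pandas as pd

def ridit_scores(ranks: pd.Series) -> pd.Series:
    """RIDIT-style score per category: P(lower category) - P(higher category)."""
    p = ranks.value_counts(normalize=True).sort_index()  # category proportions
    below = p.cumsum() - p        # proportion strictly below each category
    above = 1.0 - p.cumsum()      # proportion strictly above each category
    return ranks.map(below - above)

def pridit(rank_frame: pd.DataFrame):
    """Score each row on the first principal component of the RIDIT matrix."""
    F = rank_frame.apply(ridit_scores).to_numpy(dtype=float)
    # eigh returns eigenvalues in ascending order, so the last column of
    # eigvecs belongs to the largest eigenvalue of F'F.
    _, eigvecs = np.linalg.eigh(F.T @ F)
    w = eigvecs[:, -1]            # sign is arbitrary; flip it if needed
    scores = pd.Series(F @ w, index=rank_frame.index)
    weights = pd.Series(w, index=rank_frame.columns)
    return scores, weights

# Made-up ranks for the four example IDs (illustration only).
ranks = pd.DataFrame(
    {"EDUCATION_RANK": [3, 1, 3, 1], "OCCUPATION_RANK": [2, 1, 3, 1]},
    index=[1, 2, 3, 4],
)
scores, weights = pridit(ranks)
```

Because the sign of an eigenvector is arbitrary, the direction of `scores` has to be checked against a profile whose status is known before it is interpreted.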
In order to apply PRIDIT, I must first convert the non-numeric variables into scores or levels.
This is currently done by applying correspondence analysis to each column against BINARY_VAR, and then calculating the distance of each column contribution score from the row contribution score for non-fraud.
Row and column scores look something like this (respectively):
    CONTR
0   1.654
1  98.346
------------------------------
                      CONTR
Undergraduate  2.803602e-04
Graduate       3.147824e+00
PhD            9.176451e+00
Other          1.179664e+01
The obtained distance (supposedly) gives the required score for the level, which is written back to the data-frame as a rank (higher value resulting in a higher rank).
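The procedure described above can be sketched in Python as follows. This is only one reading of it, under stated assumptions: the contributions are taken as percentage contributions to the first correspondence-analysis axis, the "distance" is the absolute difference between a column contribution and the non-fraud row contribution, and ties share a rank. The helper `ca_contributions` and the column `EDUCATION_RANK` are hypothetical names, and the toy data is just the EDUCATION column of the four example rows.

```python
import numpy as np
import pandas as pd

def ca_contributions(table: pd.DataFrame):
    """Percentage contributions of rows and columns to the first CA axis."""
    P = table.to_numpy(dtype=float)
    P = P / P.sum()                          # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    expected = np.outer(r, c)                # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    # Squared entries of the first singular vectors sum to 1, so multiplying
    # by 100 gives percentage contributions to the first axis.
    row_contr = pd.Series(100 * U[:, 0] ** 2, index=table.index)
    col_contr = pd.Series(100 * Vt[0] ** 2, index=table.columns)
    return row_contr, col_contr

# Toy data: EDUCATION and BINARY_VAR from the four example rows.
toy = pd.DataFrame({
    "EDUCATION": ["Undergrad", "Grad", "Undergrad", "PhD"],
    "BINARY_VAR": [1, 1, 0, 1],
})
table = pd.crosstab(toy["BINARY_VAR"], toy["EDUCATION"])
row_contr, col_contr = ca_contributions(table)

# Assumed "distance": absolute gap to the non-fraud (BINARY_VAR = 1) row score.
distance = (col_contr - row_contr.loc[1]).abs().round(6)
level_rank = distance.rank(method="dense").astype(int)  # larger distance, higher rank

# Write the rank of each level back to the data-frame.
toy["EDUCATION_RANK"] = toy["EDUCATION"].map(level_rank)
```

In this sketch the SVD runs on the small contingency table (two rows by the number of levels) rather than on the full data-frame, so most of the work is the single counting pass done by `pd.crosstab`.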
My main concerns about this technique are:
- The data-frame is really large and resources are limited, and this is a computationally expensive method.
- It involves a lot of steps, and the result of the scoring cannot really be verified (can it?).
My questions are:
- Does this technique seem viable?
- What are better ways to assign "ranks" to non-numeric variables?