Converting an array to a true/false matrix

Question

I have a data set where each record is a JSON document with a label and an array of signals. The signals vary from record to record:

{
    "label":"bad",
    "id": "0009",
    "signals":["high_debt_ratio", "no_job"] 
},

{
    "label":"good",
     "id": "0002",
    "signals":["high_debt_ratio", "great_credit", "no_id_match"] 
},

{
    "label":"good",
    "id": "0003",
    "signals":["low_debt_ratio", "great_credit"] 
},

{
    "label":"bad",
    "id": "0001",
    "signals":["high_risk_loc", "high_debt_ratio", "no_job", "no_id_match"] 
}

I want to convert this to a matrix that looks like this:

id	label	high_risk_loc	high_debt_ratio	no_job	great_credit	no_id_match	low_debt_ratio
0009	bad	false	true	true	false	false	false
0002	good	false	true	false	true	true	false
0003	good	false	false	false	true	false	true
0001	bad	true	true	true	false	true	false

I created a function, but this seems like a common thing to do. Is there a Python library (pandas, scikit-learn, etc.) that does this for you? I'd rather use something from a package, but I'm not sure what to search for.

score 1 · Answer 1 · answered Mar 23 '21 at 11:43

First, read your JSON data into a DataFrame with pandas' json_normalize:

import pandas as pd

# records is your list of JSON documents (dicts)
df = pd.json_normalize(records)

Your DataFrame will look like this:

  label    id                                            signals
0   bad  0009                          [high_debt_ratio, no_job]
1  good  0002       [high_debt_ratio, great_credit, no_id_match]
2  good  0003                     [low_debt_ratio, great_credit]
3   bad  0001  [high_risk_loc, high_debt_ratio, no_job, no_id...

Next, collect the unique values in the signals column and loop over them. For each value, add a column that is True where the record's signals list contains it and False otherwise. Finally, drop the signals column:

for i in list(set(df.signals.sum())):
    df[i] = df.signals.apply(lambda x: i in x)
df.drop('signals',axis=1,inplace=True)
print(df)

Output:
  label    id  low_debt_ratio  great_credit  high_debt_ratio  no_id_match  high_risk_loc  no_job
0   bad  0009           False         False             True        False          False    True
1  good  0002           False          True             True         True          False   False
2  good  0003            True          True            False        False          False   False
3   bad  0001           False         False             True         True           True    True
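
The loop above builds the list of unique signals with df.signals.sum(), which concatenates the lists one at a time. As a hedged alternative, here is a minimal sketch of the same idea using explode and get_dummies. The variable names (records, exploded, flags, result) are illustrative assumptions, not part of the original answer, and the data is the sample from the question.

```python
import pandas as pd

# Sample records copied from the question
records = [
    {"label": "bad", "id": "0009", "signals": ["high_debt_ratio", "no_job"]},
    {"label": "good", "id": "0002", "signals": ["high_debt_ratio", "great_credit", "no_id_match"]},
    {"label": "good", "id": "0003", "signals": ["low_debt_ratio", "great_credit"]},
    {"label": "bad", "id": "0001", "signals": ["high_risk_loc", "high_debt_ratio", "no_job", "no_id_match"]},
]
df = pd.DataFrame(records)

# One row per (record, signal) pair; the original row index is repeated
exploded = df["signals"].explode()

# One-hot encode each signal, then collapse back to one row per record
flags = pd.get_dummies(exploded).groupby(level=0).max().astype(bool)

# Replace the list column with the boolean columns
result = df.drop(columns="signals").join(flags)
print(result)
```

Unlike the loop, this sketch orders the signal columns alphabetically, since get_dummies sorts them.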

score 0 · Answer 2 · answered Mar 23 '21 at 22:57

You can try this one-line solution:

import pandas as pd

d = pd.json_normalize([{
    "label":"bad",
    "id": "0009",
    "signals":["high_debt_ratio", "no_job"] 
},

{
    "label":"good",
     "id": "0002",
    "signals":["high_debt_ratio", "great_credit", "no_id_match"] 
},

{
    "label":"good",
    "id": "0003",
    "signals":["low_debt_ratio", "great_credit"] 
},

{
    "label":"bad",
    "id": "0001",
    "signals":["high_risk_loc", "high_debt_ratio", "no_job", "no_id_match"] 
}])

d.merge(d.signals.apply(lambda x: "|".join(x)).str.get_dummies(sep = "|"), left_index = True, right_index= True)
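
Note that str.get_dummies returns 0/1 integers and that the merge keeps the original signals column. If you want the true/false layout from the question, a minimal sketch along the same lines might look like this. It assumes no signal name contains the "|" separator, and the names dummies and result are illustrative only.

```python
import pandas as pd

# Two of the sample records from the question
d = pd.DataFrame({
    "label": ["bad", "good"],
    "id": ["0009", "0002"],
    "signals": [
        ["high_debt_ratio", "no_job"],
        ["high_debt_ratio", "great_credit", "no_id_match"],
    ],
})

# Join each list into one "|"-delimited string, one-hot encode it,
# and convert the 0/1 integers to booleans
dummies = d["signals"].str.join("|").str.get_dummies(sep="|").astype(bool)

# Drop the list column and attach the boolean columns by index
result = d.drop(columns="signals").join(dummies)
print(result)
```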


score 0 · Answer 3 · answered May 10 '21 at 02:13

Something more intuitive using sklearn's MultiLabelBinarizer, though slightly more verbose:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# df is assumed to be a DataFrame with a 'signals' column of lists
mlb = MultiLabelBinarizer()
mlb.fit(df['signals'])
new_col_names = mlb.classes_

# New DataFrame containing 0/1 values of the signals
signals_df = pd.DataFrame(mlb.transform(df['signals']), columns=new_col_names)

# Concatenate with original DataFrame
pd.concat( [df, signals_df], axis=1 )
