I have a dataframe of about 200 features and 1M rows on which I can train a RidgeCV model and get an R2 of about 0.01.
I'd like to scale the training up to 5M or 10M rows, but that won't fit in memory for me, so I'm looking at out-of-core techniques and read about partial_fit with SGDRegressor. RidgeCV with the following code 'trains' in about 20s and gives a 0.01 R2:
# Control -- test with RidgeCV
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Target and feature matrices for the full 1M rows
y_all = df_targets[hyper['target_name']].values
x_all = df_targets[features].values

regmodel = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]))
regmodel.fit(x_all, y_all)

# In-sample R2 on the same 1M rows
level_2_r2_score = regmodel.score(x_all, y_all)
level_2_r2_score
This gives an R2 of 0.010599955481100931.
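(For comparison, the in-memory SGD baseline I'm ultimately trying to reproduce with partial_fit would look something like the sketch below -- same df_targets / features / hyper variables as above, and the SGDRegressor settings here are just defaults, not tuned:)
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One-shot SGD fit on the same 1M in-memory rows, scaling inside the pipeline
sgd_baseline = make_pipeline(StandardScaler(),
                             SGDRegressor(loss='squared_error', penalty='l2',
                                          alpha=1e-4, max_iter=1000, tol=1e-3))
sgd_baseline.fit(x_all, y_all)
sgd_baseline.score(x_all, y_all)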
I tried to formulate the equivalent training using SGDRegressor with partial_fit and mini-batches of 32, like so:
# NumPy version of the partial_fit / train loop
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# Initialize SGDRegressor and StandardScaler
sgd = SGDRegressor(verbose=0,
                   max_iter=1,
                   tol=1e-3,
                   loss='squared_error',
                   penalty='l2',
                   alpha=0.1,
                   learning_rate='optimal',
                   fit_intercept=False)
scaler = StandardScaler()

CHUNK_SIZE = 32
epoch_loops = 20000
r2s = []

# Train/validation split of df_targets (last 100k rows held out)
df_train = df_targets.iloc[:-100000]
df_validation = df_targets.iloc[-100000:]
x_validation = df_validation[features].values
y_validation = df_validation[hyper['target_name']].values

# Fit the scaler on all rows, then transform both splits
scaler.fit(df_targets[features].values)
x_validation_scaled = scaler.transform(x_validation)
x_train = scaler.transform(df_train[features].values)
y_train = df_train[hyper['target_name']].values

print('x_train.shape = ', x_train.shape)
print('y_train.shape = ', y_train.shape)
print('x_validation.shape = ', x_validation.shape)
print('y_validation.shape = ', y_validation.shape)

for j in range(epoch_loops):
    # Shuffle the rows of x_train and y_train together
    x_train, y_train = shuffle(x_train, y_train)
    # Iterate over the shuffled training set in CHUNK_SIZE mini-batches
    for i in range(0, len(x_train), CHUNK_SIZE):
        x_chunk = x_train[i:i+CHUNK_SIZE]
        y_chunk = y_train[i:i+CHUNK_SIZE]
        sgd.partial_fit(x_chunk, y_chunk)
    # Evaluate R2 on the held-out rows after each epoch
    r2 = sgd.score(x_validation, y_validation)
    print('EPOCH {} ===> r2 = {}'.format(j, r2))
    r2s.append(r2)
But 33 minutes and 112 epochs later, I'm still at an R2 of about -6*10**22 -- i.e. nowhere near what RidgeCV can do.
My hope was to show that I can get the same R2 using partial_fit on the 1M rows, and then scale that up to 5M or 10M rows to see if the R2 improves... Is that not possible?
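For reference, the fully out-of-core pattern I'm ultimately aiming at for the 5M-10M row case would look roughly like the sketch below (just a sketch -- 'data.csv', the chunk size and n_epochs are placeholders, and I'd reuse the same features / hyper['target_name'] columns as above):
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
sgd = SGDRegressor(loss='squared_error', penalty='l2')

# Pass 1: accumulate the scaling statistics chunk by chunk,
# never holding all rows in memory at once
for chunk in pd.read_csv('data.csv', chunksize=100000):
    scaler.partial_fit(chunk[features].values)

# Passes 2..n: scale each chunk and update the model incrementally
n_epochs = 5
for epoch in range(n_epochs):
    for chunk in pd.read_csv('data.csv', chunksize=100000):
        x_chunk = scaler.transform(chunk[features].values)
        y_chunk = chunk[hyper['target_name']].values
        sgd.partial_fit(x_chunk, y_chunk)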