7

I work in physics. We have lots of experimental runs, with each run yielding a result, `y`, and some parameters that should predict the result, `x`. Over time, we have found more and more parameters to record. So our data looks like the following:

Year 1 data: (2000 runs)
    parameters: x1,x2,x3                target: y
Year 2 data: (2000 runs)
    parameters: x1,x2,x3,x4,x5          target: y
Year 3 data: (2000 runs)
    parameters: x1,x2,x3,x4,x5,x6,x7    target: y

How does one build a regression model that incorporates the additional information we recorded, without throwing away what it "learned" about the older parameters?

Should I:

  • just set x4, x5, etc. to 0 or -1 when I'm not using them?
  • completely ignore x4,x5,x6,x7 and only use x1,x2,x3?
  • add another parameter that is simply the number of parameters?
  • train separate models for each year, and combine them somehow?
  • "weight" the parameters, so as to ignore them if I set the weight to 0?
  • make three different models, using x1,x2,x3, x4,x5, and x6,x7 parameters, and then interpolate somehow?
  • make a custom "imputer" to guesstimate the missing parameters (using the available parameters)?

I have tried imputation using mean and median, but neither works very well because the parameters are not independent, but rather fairly correlated.

JoseOrtiz3
  • I'd set up a neural network and disconnect the neurons corresponding to the missing inputs as necessary. – Emre Jul 14 '16 at 19:20
  • impute the missing vals? – Brandon Loudermilk Jul 14 '16 at 20:12
  • Could impute the missing values... I would have to make my own imputer, though; the imputer for scikit-learn is not very smart (mean, median, etc.). Not sure I want to just replace every missing value with the average. – JoseOrtiz3 Jul 14 '16 at 20:44
  • 3
    What kind of model are you intending to make? When used to predict, does it need to work with the limited parameters of your early experiments, or will the final model assume that users always input all the parameters you have determined are interesting? – Neil Slater Jul 15 '16 at 06:46
  • The model needs to predict `y` well across both the historical data and the new data. One of the parameters, `x1`, is a time, and I need to model `y` as a function of this time parameter, y(x1), and fit an exponential to it for *each* run. y(x1) is the goal, but to improve y(x1)'s predicted form, I hope to incorporate more than just `x1,x2` by using `x3,x4,...` as these new measurement channels come into use. As the experiment continues, new measurement channels will probably be added. – JoseOrtiz3 Jul 15 '16 at 19:13
  • My worry about imputing the missing values is that it might dilute the information when the values are not missing. – JoseOrtiz3 Jul 15 '16 at 21:36
  • If you are looking at "pointwise estimates" of partially observed random variables, imputing the missing values with the mean might be a valid approach. If the data are independently and identically distributed throughout Years 1, 2, ..., the mean is least likely to introduce bias. Alternatively, you could model the problem with a probabilistic graphical model, in which case you can marginalize out the missing variables. That kind of model also takes the spread of each variable into account when predicting y, so missing variables with high variance might not dilute the information. – abhnj Jul 16 '16 at 01:07
  • Tried it...I guess my data is such that mean doesn't actually work very well. I'm going to need a better imputation method. – JoseOrtiz3 Aug 01 '16 at 19:06

3 Answers

2

One simple idea, no imputation needed: build a model using the parameters that have always existed, then each time a new set of parameters gets added, use them to model the residual of the previous model. At prediction time, you sum the contributions of all the models that apply to the parameters you happen to have. (If effects tend to multiply rather than add, you could do this in log space.)
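
A minimal sketch of that idea in Python with scikit-learn, assuming per-year pandas DataFrames `year1`, `year2`, `year3` and using `Ridge` as a stand-in learner (the names and the choice of regressor are my assumptions, not part of the answer):

```python
# Sketch only: `year1`, `year2`, `year3` are hypothetical per-year DataFrames,
# each holding its recorded parameters plus a "y" column.
import pandas as pd
from sklearn.linear_model import Ridge

base_cols = ["x1", "x2", "x3"]

# Model A: all runs, using only the parameters recorded in every year.
all_runs = pd.concat([year1, year2, year3])
model_a = Ridge().fit(all_runs[base_cols], all_runs["y"])

# Model B: years 2-3 only, predicting model A's residual from the year-2 parameters.
later = pd.concat([year2, year3])
resid_a = later["y"] - model_a.predict(later[base_cols])
model_b = Ridge().fit(later[["x4", "x5"]], resid_a)

# Model C: year 3 only, predicting whatever is still left over.
resid_b = (year3["y"]
           - model_a.predict(year3[base_cols])
           - model_b.predict(year3[["x4", "x5"]]))
model_c = Ridge().fit(year3[["x6", "x7"]], resid_b)

def predict(runs):
    # Sum the contributions of every model whose inputs are available.
    pred = model_a.predict(runs[base_cols])
    if {"x4", "x5"}.issubset(runs.columns):
        pred = pred + model_b.predict(runs[["x4", "x5"]])
    if {"x6", "x7"}.issubset(runs.columns):
        pred = pred + model_c.predict(runs[["x6", "x7"]])
    return pred
```

The same structure works with boosted trees or any other regressor in place of `Ridge`.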

Ken Arnold
  • I didn't try this, but this sounds very similar to boosting (e.g. Boosted Decision Trees). Boosting has only impressed me. – JoseOrtiz3 Nov 03 '18 at 22:26
1

If the old variables and the new variables are highly correlated, then you could do a more advanced form of imputation: build a model for each new input that predicts it from the old inputs. Such a model would probably be pretty good at predicting the new inputs because, as you said, the inputs are strongly correlated. Then split your data across the years so that you have an equal proportion of old and new records in your training, validation, and test sets.
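
As a rough sketch of that approach, assuming pandas DataFrames `old_runs` (x1-x3 only) and `new_runs` (which also has x4 and x5), plus a hypothetical `year` column to stratify the split on; the random forest is my choice of per-feature model, not something the answer specifies:

```python
# Sketch only: one regression model per new input, trained on the runs
# where that input was actually recorded.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

old_cols = ["x1", "x2", "x3"]

imputers = {}
for col in ["x4", "x5"]:
    imputers[col] = RandomForestRegressor(n_estimators=200, random_state=0)
    imputers[col].fit(new_runs[old_cols], new_runs[col])

# Fill in the missing columns for the old runs, then pool all years.
old_filled = old_runs.copy()
for col, model in imputers.items():
    old_filled[col] = model.predict(old_filled[old_cols])
pooled = pd.concat([old_filled, new_runs], ignore_index=True)

# Stratify on year so each split mixes old and new records in equal proportion
# (a further split of `train` would give the validation set).
train, test = train_test_split(pooled, test_size=0.2,
                               stratify=pooled["year"], random_state=0)
```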

Ryan Zotti
0

I would multiply impute the values for x4, x5, x6, and x7. For the number of imputations, compute the percentage of fields missing across the whole dataset and round up to the nearest integer. Don't use mean or median imputation; use PROC MI in SAS or the equivalent. I would guess that because your data are monotone-missing, you could use a MONOTONE statement. This is probably the most conservative approach, since excluding information, whether variables or observations, opens you up to bias.
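
PROC MI is SAS-specific; since scikit-learn came up in the comments, here is a rough multiple-imputation sketch in Python using `IterativeImputer` with posterior sampling as a stand-in (not an exact equivalent of PROC MI's monotone methods). It assumes a pooled DataFrame `all_runs` with NaN where x4-x7 were not recorded; the name and the linear model are placeholders:

```python
# Sketch only: fit one model per imputed dataset, then average predictions.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

feature_cols = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
n_imputations = 5  # e.g. the rounded-up percentage of missing fields

fits = []
for i in range(n_imputations):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_completed = imputer.fit_transform(all_runs[feature_cols])
    fits.append((imputer, LinearRegression().fit(X_completed, all_runs["y"])))

def predict(X_new):
    # X_new holds the seven parameter columns, with NaN where missing.
    # Average over the imputations (Rubin-style pooling of the variances
    # is omitted here for brevity).
    return np.mean([m.predict(imp.transform(X_new)) for imp, m in fits], axis=0)
```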

The Baron