
I've fit a Poisson GLM to a data set in which one of the variables is a categorical variable for the year a customer bought a product from my company, ranging from 1999 to 2012. The estimated coefficients for the year levels show a linear trend as the year of sale increases.

Is there any problem with trying to improve predictions for 2013 and maybe 2014 by extrapolating to get the coefficients for those years?

Sean Owen
JenSCDC

2 Answers


If you suspect your response is linear with year, then put year in as a numeric term in your model rather than a categorical.

Extrapolation is then perfectly valid based on the usual assumptions of the GLM family. Make sure you correctly get the errors on your extrapolated estimates.
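
As a rough illustration of the numeric-year approach, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the sales records are simulated, the model has year as its only covariate, and the helper names (`fit_poisson_trend`, `predict`, `draw_poisson`) are hypothetical. It fits log(mu) = b0 + b1 * (year - 2005) by Newton-Raphson and then extrapolates to 2013 and 2014 with an approximate 95% interval built on the log scale.

```python
import math
import random

BASE_YEAR = 2005  # centre the year so intercept and slope are well conditioned


def draw_poisson(rng, mu):
    """Draw one Poisson(mu) variate (Knuth's multiplication method)."""
    limit, k, p = math.exp(-mu), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def fit_poisson_trend(xs, ys, iters=25):
    """Fit log(mu) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, cov), where cov = (var_b0, cov_b0_b1, var_b1) is the
    inverse Fisher information evaluated at the last iterate before the
    final (negligible) update.
    """
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    cov = (0.0, 0.0, 0.0)
    for _ in range(iters):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)
            u0 += y - mu            # score for the intercept
            u1 += (y - mu) * x      # score for the slope
            i00 += mu               # Fisher information entries
            i01 += mu * x
            i11 += mu * x * x
        det = i00 * i11 - i01 * i01
        cov = (i11 / det, -i01 / det, i00 / det)
        b0 += cov[0] * u0 + cov[1] * u1
        b1 += cov[1] * u0 + cov[2] * u1
    return b0, b1, cov


def predict(year, b0, b1, cov, z=1.96):
    """Expected count for `year` with an approximate interval for the mean."""
    x = year - BASE_YEAR
    eta = b0 + b1 * x
    se = math.sqrt(cov[0] + 2 * x * cov[1] + x * x * cov[2])
    return math.exp(eta), math.exp(eta - z * se), math.exp(eta + z * se)


# Simulated records (not real data): 200 per year, true slope 0.05 per year.
rng = random.Random(0)
xs, ys = [], []
for year in range(1999, 2013):
    for _ in range(200):
        xs.append(year - BASE_YEAR)
        ys.append(draw_poisson(rng, math.exp(0.5 + 0.05 * (year - BASE_YEAR))))

b0, b1, cov = fit_poisson_trend(xs, ys)
print(f"per-year change on the log scale: {b1:.4f}")
for year in (2013, 2014):
    mean, lo, hi = predict(year, b0, b1, cov)
    print(f"{year}: expected count {mean:.3f} (approx. 95% interval {lo:.3f} to {hi:.3f})")
```

The interval here is for the expected count, not for an individual record, and it widens as the year moves further from the observed range. It also takes the linear trend as given, so it says nothing about the risk that the trend changes after 2012.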

Just extrapolating the parameters from a categorical variable is wrong for a number of reasons. The first one I can think of is that there may be more observations in some years than in others, so any linear extrapolation needs to weight those years' estimates more heavily. Just eyeballing a line - or even fitting an unweighted line to the coefficients - won't do this.
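
To make the weighting point concrete, here is a small sketch with made-up per-year summaries (the record counts, totals, and the `line_slope` helper are all hypothetical). Two sparse early years have noisy rates; an unweighted line through the per-year log rates is pulled toward them, while weighting each year by its total count (roughly the inverse variance of a Poisson log rate) is not.

```python
import math

# Hypothetical per-year summaries: (year, number of records, total count).
# Most years have 400 records on a true trend of 0.05 per year (log scale);
# 1999 and 2000 have only 5 records each and noisy totals.
TRUE_SLOPE = 0.05
data = [(1999, 5, 12), (2000, 5, 11)]
data += [
    (yr, 400, round(400 * math.exp(0.3 + TRUE_SLOPE * (yr - 1999))))
    for yr in range(2001, 2013)
]


def line_slope(xs, ys, ws):
    """Slope of the weighted least-squares line through the points (xs, ys)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    return sxy / sxx


years = [yr for yr, _, _ in data]
# Per-year log rate: what a categorical year coefficient estimates,
# up to a constant shared by all years.
log_rates = [math.log(total / n) for _, n, total in data]

# Every year counts equally, however few records it has.
unweighted = line_slope(years, log_rates, [1.0] * len(data))
# Var(log rate) is roughly 1/total for Poisson counts, so weight by total.
weighted = line_slope(years, log_rates, [float(total) for _, _, total in data])

print(f"unweighted slope: {unweighted:.4f}")
print(f"weighted slope:   {weighted:.4f}")
```

Fitting the GLM with year as a numeric term handles this weighting as part of the likelihood, so there is no need to do it by hand; the weighted line above is only an approximation to that fit.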

Spacedman
  • Hmm... it never occurred to me to make year a continuous variable. In retrospect it seems obvious. – JenSCDC Aug 26 '14 at 18:51

I believe that this is a case for applying time series analysis, in particular time series forecasting (http://en.wikipedia.org/wiki/Time_series). Consider the following resources on time series regression:

Aleksandr Blekh
  • The reason I'm using regression is that I need the per year rate of change for reasons I'd rather not get into right now. – JenSCDC Aug 23 '14 at 20:24
  • @AndyBlankertz: I just updated my answer. – Aleksandr Blekh Aug 23 '14 at 20:33
  • Thanks. I'd love to delve into the resources, but I'm time-limited: the report I'm working on is due on Friday. I also have some slack in statistical rigor, because the target audience is Management :) Hopefully next week. – JenSCDC Aug 23 '14 at 20:44
  • @AndyBlankertz: You're welcome. I understand, as I'm not a statistician myself :-). But I'm trying to learn wherever and whenever I can. – Aleksandr Blekh Aug 23 '14 at 20:54
  • This isn't time series analysis (unless you throw away loads of data). I think the data records are individual sales records with a year attached as a covariate. Time series analysis is used when the variable of interest (eg *total sales*) has a unique time point. You could compute total sales within years and do time series analysis, but that would mean losing all the other information from each sales record (eg item purchased, buyer age etc). Regression is the right thing here. – Spacedman Aug 26 '14 at 08:51
  • @Spacedman: The term I've emphasized in my answer is **time series regression**. Thus, in my view, it could be considered as a **special case** of either of the two approaches, depending on the **perspective**. – Aleksandr Blekh Aug 26 '14 at 09:12
  • All I'm saying is that individual sales records data are not time series data. So you can't treat them like time series data. So reading about fitting AR(1) models and time series regression approaches is a waste of the OP's time here when all they have to do is convert year to numeric and run the model again. My concern now is wondering exactly what the OP means by "per-year rate of change", which may imply something more than a linear term in year is required (some kind of smoother or polynomial term perhaps). – Spacedman Aug 26 '14 at 09:40
  • @Spacedman: I see. Thank you for the clarification. However, my initial impression was that for this particular task, the OP is only interested in future values of a **single aggregate** *outcome variable* (keeping the model's **full information** for *regression analysis*). That would be the case for *time series forecasting*, wouldn't it? Perhaps, I misunderstood the question. – Aleksandr Blekh Aug 26 '14 at 10:01