
I have a large dataset and am exploring the effects of several independent variables on a dependent variable. Because the dependent variable is a count, I am using Poisson regression.

However, the range of the dependent variable is very large. I was therefore thinking of splitting it into groups (low, medium, and high values) and fitting a separate Poisson regression to each group. My question is: does this make sense? That is, is it valid to group observations by the dependent variable and then run the analysis separately for each group?

One open question is what counts as a low, medium, or high value. I have tried some clustering algorithms and plan to use one of them (k-means seems fine) to divide the dependent variable into three groups.
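To make the grouping step concrete, here is a minimal sketch, assuming plain one-dimensional k-means (Lloyd's algorithm) with three clusters. The function name `kmeans_1d` and the toy counts are hypothetical and only mimic the skew of my variable; they are not my actual code or data.

```python
def kmeans_1d(values, k=3, iters=100):
    """Plain Lloyd's algorithm on scalar values; returns (centers, labels)."""
    lo, hi = min(values), max(values)
    # Spread the initial centers evenly over the observed range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Hypothetical toy counts: mostly 0-3, with a few large values.
counts = [0] * 40 + [1] * 30 + [2] * 10 + [3] * 8 + [5, 6, 8, 10, 12, 15, 40, 60, 90, 160]
centers, labels = kmeans_1d(counts)

for j, center in enumerate(centers):
    members = [v for v, lab in zip(counts, labels) if lab == j]
    print(f"group {j}: center={center:.2f}, n={len(members)}, "
          f"min={min(members)}, max={max(members)}")
```

On data this skewed, the cut points may end up being driven by the few large values rather than by the bulk of the observations, which is part of why I am unsure about the approach.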

My initial analysis shows that the effects of the independent variables differ between groups, but I am not sure whether this is the correct way to do the analysis.

Can you please comment?

Thank you,

EDIT: There are 454270 samples in the dataset. The dependent variable ranges from 0 to 160 (mean 1.293136, standard deviation 2.681311), and 90% of the values lie between 0 and 3. The histogram of the variable is as follows:

[Image: histogram of the dependent variable]
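As a rough check on the summary statistics above, the following sketch computes the marginal variance-to-mean ratio. It uses only the two reported numbers and ignores the covariates, so it is a hint about overdispersion rather than a formal test.

```python
# Summary statistics as reported above.
mean = 1.293136
std = 2.681311

variance = std ** 2

# For a Poisson variable the variance equals the mean, so a ratio well
# above 1 suggests overdispersion. This is a marginal check only:
# covariates in the regression may account for part of the extra variance.
dispersion_ratio = variance / mean

print(f"variance        = {variance:.3f}")
print(f"variance / mean = {dispersion_ratio:.2f}")
```

The ratio comes out well above 1 (roughly 5.6), which fits with the negative binomial model describing the data better than the Poisson model.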

A more detailed histogram of the dependent variable and two scatter plots of the IVs against the DV follow (there are other independent variables as well, but they are control variables):

[Image: more detailed histogram of the dependent variable]

[Images: two scatter plots of independent variables against the dependent variable]

tempx
  • Can you elaborate on "the range of your dependent variable is too large"? Are you meaning to say that the variance of your response is larger than what a Poisson model can explain? Rather than discretizing your response (which is arbitrary like you mentioned, and also a possible inefficient use of your data) perhaps consider something like negative binomial regression instead which allows for overdispersion. – aranglol Jan 23 '21 at 23:25
  • @aranglol It is not that Poisson cannot explain it. I also checked the negative binomial distribution, and the statistics indicate that it is more suitable than the Poisson distribution. My dependent variable is highly skewed and ranges from 0 to 160, but more than 3/4 of the values are 0 or 1, the next group is between 2 and 10, and very few values are bigger than 10. That is why I considered grouping the dependent variable. – tempx Jan 24 '21 at 03:29
  • Possible to include a scatter plot of your data? Or is it too large? – WBM Mar 07 '21 at 17:37
  • @WBM added the histogram and some statistics. – tempx Mar 09 '21 at 05:25
  • I was hoping for a scatter plot between your dependent and independent variable, as it might help me visualise the problem. Also, in the above histogram do `plt.xlim(0,20)` or so. – WBM Mar 09 '21 at 08:13
  • @WBM added two scatter plots and truncated the histogram to [0,20]. I am not sure if I have to also add the other scatter plots since other variables are used as control variables. – tempx Mar 09 '21 at 15:13
