Sample Size for Poisson Regression

Recently, I was tasked with a sample size calculation for a study in which the outcome is to be modeled using a Poisson regression (i.e. a generalized linear model). For quick and simple calculations of this nature, I often use PASS, a statistical software package dedicated to power/sample size calculations. However, I noticed that in the "Poisson Regression" menu for PASS, one of the options is specifying the distribution of the PREDICTOR variable ($X$). I was rather confused by this, since regression models don't make assumptions about the distribution of the covariates, so I checked the online documentation (here). On page 870-2 of that document, they give a formula for sample size calculations and state, "The variance [of the regression parameter estimate $\hat{\beta}$] for the non-null case depends on the underlying distribution of $X$."

I was taught rather unequivocally that regression models do NOT make ANY assumptions about the distribution of the covariates themselves, and that even assumptions about the distribution of the regression parameter estimates are purely inferential (i.e. the parameter estimates themselves will be unbiased regardless of the underlying distribution of the covariate, assuming a properly specified model, but distributional assumptions about those estimates are useful for building confidence intervals, etc.). Nor does it seem straightforward to me to draw a direct relationship between the distribution of a covariate and the variance of the corresponding parameter estimate, since the latter should be driven by the functional relationship between the covariate and the outcome.

The PASS documentation quoted above cites a 1991 Biometrika paper called "Sample Size for Poisson Regression", which I investigated in an effort to better understand what is going on here. That paper is available online here (I don't believe there is a paywall, but I'm on an institutional network at the moment, so I may be wrong about that). As with the PASS documentation, this paper talks extensively about the maximum likelihood estimates of the regression parameters as functions of the distribution of the covariates. At the bottom of page 1, they write the likelihood function of the Poisson model (using their notation) as
$$L(\beta_0,\beta) = \prod_{i=1}^{n} f_X(x_i)\,f_T(t_i)\,\frac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i!}, \qquad \lambda_i = t_i\exp(\beta_0+\beta^T x_i).$$

This is, of course, not the usual likelihood function for a Poisson model, which would typically not include $f_X(x)$ or $f_T(t)$ (where $f_X$ is the distribution of the covariates $X$, and $f_T$ is the distribution of the exposure times, i.e. the 'offset' term in a Poisson model). Granted, the paper notes that they are treating $X$ and $T$ as random variables, but I am struggling to understand why they do this, since it is so radically different from the traditional approach to estimating regression parameters by maximum likelihood. This paper further cites a 1989 JASA paper that makes similar calculations for the case of logistic regression, again with the variance of the parameters expressed as a function of the distribution of the covariates themselves (and with a similar expression for the likelihood that includes some $f_X(x)$). Now THIS paper (available here) also includes a table (Table 1, on page 28) that seems to parameterize the distribution of $X$ in terms of the regression parameters! I am having a very difficult time understanding this.
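To be explicit about what I mean by the "usual" likelihood: treating the $x_i$ and $t_i$ as fixed constants rather than random variables, I would write the conditional likelihood as
$$L(\beta_0,\beta) = \prod_{i=1}^{n} \frac{\lambda_i^{y_i}\exp(-\lambda_i)}{y_i!}, \qquad \lambda_i = t_i\exp(\beta_0+\beta^T x_i).$$
Since $f_X$ and $f_T$ do not involve $(\beta_0,\beta)$, those factors drop out of the score equations anyway, so as far as I can tell the point estimates are identical under either formulation, which makes the role of the covariate distribution even less obvious to me.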
I have always learned that regression models do not make assumptions about the distribution of the covariates, or even about the distribution of the regression parameters, yet these methods for sample size calculations under logistic and Poisson regression (both of the papers I linked to are fairly well cited) explicitly make such assumptions. Can anybody shed any light on this subject?

1) Do we ever need to make assumptions about the distribution of covariates in a regression model? Am I simply incorrect in believing we never need to make these assumptions, especially if we have a correctly specified regression model?

2) If we DON'T need to make these assumptions, then what is the utility of doing so for the purposes of these sample size calculations?

I will note further that using the formulas in these papers produces RADICALLY different sample size estimates for the same effect sizes, based only on changing the assumed distribution of the covariates. I don't see how estimates using this method can be valid if we answer "NO" to question 1. A simulation sketch illustrating the phenomenon follows below.
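To make the discrepancy concrete, here is a small simulation sketch (Python with numpy and statsmodels; the sample size, coefficient values, and covariate distributions are illustrative choices of mine, not values from either paper) that compares the empirical variability of $\hat{\beta}_1$ when $X$ is standard normal versus a two-point $\pm 1$ covariate with the same mean and variance:

```python
# Sketch: does the sampling variability of the Poisson-regression slope depend
# on the distribution of X, holding Var(X) and the true coefficients fixed?
# (Illustrative values only; not taken from the PASS documentation or the papers.)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, reps = 200, 500
beta0, beta1 = 0.5, 0.7          # hypothetical true intercept and slope

def slope_sd(draw_x):
    """Fit a Poisson GLM `reps` times and return the SD of the estimated slopes."""
    slopes = []
    for _ in range(reps):
        x = draw_x(n)
        y = rng.poisson(np.exp(beta0 + beta1 * x))      # exposure time t_i = 1
        fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
        slopes.append(fit.params[1])
    return np.std(slopes)

# Two covariate distributions with the same mean (0) and variance (1)
normal_x   = lambda size: rng.normal(0.0, 1.0, size)
twopoint_x = lambda size: 2.0 * rng.binomial(1, 0.5, size) - 1.0   # +/- 1, each w.p. 1/2

print("SD of beta1-hat, X ~ Normal(0, 1):", slope_sd(normal_x))
print("SD of beta1-hat, X in {-1, +1}:   ", slope_sd(twopoint_x))
```

The two covariates are deliberately standardized to the same mean and variance, so any difference in the spread of $\hat{\beta}_1$ can only come from the shape of the distribution of $X$ (and therefore from the distribution of the fitted means $\exp(\beta_0+\beta_1 x_i)$), not from the usual "effect size" inputs.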

asked May 15, 2017 at 21:15 by Ryan Simmons

Continuing to research this question, it seems the issue might have something to do with the fact that, in the case of a non-linear link function, the covariance of the regression parameters does not have a closed form. I will try to find some good references on this subject in the GLM literature to see if they shed any light, but the properties of the link function may be the key reason such an assumption is needed for a Poisson/binomial regression but not, say, a traditional linear regression.

Commented May 15, 2017 at 22:12

I think you are heading down the wrong road with the issue of a link function. Simply log transforming an outcome in a linear regression is an instance of a GLM with gaussian variance and a log link. GLMs also specify a mean-variance relationship, which partially explains how the actual predicted values of the outcome affect the power above and beyond the "effect size" (as coined by Cohen). Poisson regression, for instance, uses a log link as well, but also a Poisson variance, so that $Var(Y) = E(Y)$.
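To spell this out in standard GLM notation (a sketch in my own notation, not a quote from either paper): the Fisher information for a GLM is $I(\beta) = \tilde{X}^T W \tilde{X}$ with working weights $w_i = (\partial\mu_i/\partial\eta_i)^2 / Var(Y_i)$. For Poisson regression with a log link, $\partial\mu_i/\partial\eta_i = \mu_i$ and $Var(Y_i) = \mu_i$, so $w_i = \mu_i = t_i\exp(\beta_0+\beta^T x_i)$ and
$$I(\beta_0,\beta) = \sum_{i=1}^{n} t_i\exp(\beta_0+\beta^T x_i)\,\tilde{x}_i\tilde{x}_i^T, \qquad \tilde{x}_i = (1, x_i^T)^T.$$
The weights, and hence $Var(\hat{\beta}) \approx I(\beta_0,\beta)^{-1}$, depend on the fitted means themselves; before any data are collected the $x_i$ are unknown, so evaluating this information at the planning stage requires averaging over some assumed distribution for $X$, which appears to be where $f_X$ enters these sample size formulas.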