Identifying risk factors associated with a binary endpoint is a common problem in biostatistics. One might use regression techniques to determine a set of associations, and then use them for approximate inference or prediction. However, developing such a model can be a fluid and dynamic process. Typically, one relies on an experienced analyst to iterate over a range of model specifications and covariate functional forms to arrive at a reasonable model. Unfortunately, in the big-data era, this model exploration phase can be time consuming, especially when analyses are conducted on a typical corporate workstation. To speed up model development, we propose a novel subsampling scheme that enables rapid model exploration using a flexible yet complex model setup (GLMMs with additive smoothing splines).
First, we reframe a binary-response prospective cohort study as a case-control-type design, and demonstrate that, by using our knowledge of the sampling fractions, we can approximate the model estimates that would be obtained from the full dataset in only a fraction of the time. We then extend this idea to derive cluster-specific sampling fractions and thereby incorporate cluster variation easily into the analysis. To demonstrate the approach, we present the results of a simulation study, and discuss two very simple case and control selection strategies for implementation. Importantly, we demonstrate that previously computationally prohibitive analyses can be conducted in a timely manner on a typical corporate workstation, and show that cluster variation and group-specific non-linear effects exist for common risk factors for adverse reactions to blood donation.