Business Administration - Research Publications

  • Item
    Mitigating spatial confounding by explicitly correlating Gaussian random fields
    Marques, I ; Kneib, T ; Klein, N (Wiley, 2022-08-01)
    Spatial models are used in a variety of research areas, such as environmental sciences, epidemiology, or physics. A common phenomenon in such spatial regression models is spatial confounding. This phenomenon is observed when spatially indexed covariates modeling the mean of the response are correlated with a spatial random effect included in the model, for example, as a proxy of unobserved spatial confounders. As a result, estimates for regression coefficients of the covariates can be severely biased and interpretation of these is no longer valid. Recent literature has shown that typical solutions for reducing spatial confounding can lead to misleading and counterintuitive results. In this article, we develop a computationally efficient spatial model that explicitly correlates a Gaussian random field for the covariate of interest with the Gaussian random field in the main model equation and integrates novel prior structures to reduce spatial confounding. Starting from the univariate case, we extend our prior structure also to the case of multiple spatially confounded covariates. In simulation studies, we show that our novel model flexibly detects and reduces spatial confounding in spatial datasets, and it performs better than typically used methods such as restricted spatial regression. These results are promising for any applied researcher who wishes to interpret covariate effects in spatial regression models. As a real data illustration, we study the effect of elevation and temperature on the mean of monthly precipitation in Germany.
  • Item
    Variational inference and sparsity in high-dimensional deep Gaussian mixture models
    Kock, L ; Klein, N ; Nott, DJ (SPRINGER, 2022-10)
    Gaussian mixture models are a popular tool for model-based clustering, and mixtures of factor analyzers are Gaussian mixture models having parsimonious factor covariance structure for mixture components. There are several recent extensions of mixture of factor analyzers to deep mixtures, where the Gaussian model for the latent factors is replaced by a mixture of factor analyzers. This construction can be iterated to obtain a model with many layers. These deep models are challenging to fit, and we consider Bayesian inference using sparsity priors to further regularize the estimation. A scalable natural gradient variational inference algorithm is developed for fitting the model, and we suggest computationally efficient approaches to the architecture choice using overfitted mixtures where unnecessary components drop out in the estimation. In a number of simulated and two real examples, we demonstrate the versatility of our approach for high-dimensional problems, and demonstrate that the use of sparsity inducing priors can be helpful for obtaining improved clustering results.
  • Item
    A non-stationary model for spatially dependent circular response data based on wrapped Gaussian processes
    Marques, I ; Kneib, T ; Klein, N (Springer Science and Business Media LLC, 2022-10-01)
    Circular data can be found across many areas of science, for instance meteorology (e.g., wind directions), ecology (e.g., animal movement directions), or medicine (e.g., seasonality in disease onset). The special nature of these data means that conventional methods for non-periodic data are no longer valid. In this paper, we consider wrapped Gaussian processes and introduce a spatial model for circular data that allows for non-stationarity in the mean and the covariance structure of Gaussian random fields. We use the empirical equivalence between Gaussian random fields and Gaussian Markov random fields, which allows us to considerably reduce computational complexity by exploiting the sparseness of the precision matrix of the associated Gaussian Markov random field. Furthermore, we develop tunable priors, inspired by the penalized complexity prior framework, that shrink the model toward a less flexible base model with stationary mean and covariance function. Posterior estimation is done via Markov chain Monte Carlo simulation. The performance of the model is evaluated in a simulation study. Finally, the model is applied to analyzing wind directions in Germany.
  • Item
    Is age at menopause decreasing? - The consequences of not completing the generational cohort.
    Martins, R ; Sousa, BD ; Kneib, T ; Hohberg, M ; Klein, N ; Duarte, E ; Rodrigues, V (Springer Science and Business Media LLC, 2022-07-11)
    BACKGROUND: Due to contradictory results in current research, whether age at menopause is increasing or decreasing in Western countries remains an open question, yet worth studying as later ages at menopause are likely to be related to an increased risk of breast cancer. Using data from breast cancer screening programs to study the temporal trend of age at menopause is difficult since especially younger women in the same generational cohort have often not yet reached menopause. Deleting these younger women from a breast cancer risk analysis may bias the results. The aim of this study is therefore to recover missing menopause ages as a covariate by comparing methods for handling missing data. Additionally, the study makes a contribution to understanding the evolution of age at menopause for several generations born in Portugal between 1920 and 1970. METHODS: Data from a breast cancer screening program in Portugal including 278,282 women aged 45-69 and collected between 1990 and 2010 are used to compare two approaches to imputing age at menopause: (i) a multiple imputation methodology based on a truncated distribution but ignoring the mechanism of missingness; (ii) a copula-based multiple imputation method that simultaneously handles the age at menopause and the missing mechanism. The linear predictors considered in both cases have a semiparametric additive structure accommodating linear and non-linear effects defined via splines or Markov random field smoothers in the case of spatial variables. RESULTS: Both imputation methods unveiled an increasing trend of age at menopause when viewed as a function of the birth year for the youngest generation. This trend is hidden if we model only women with an observed age at menopause. CONCLUSION: When studying age at menopause, missing ages must be recovered with an adequate procedure for incomplete data. Imputing these missing ages avoids excluding the younger generation cohort of the screening program in breast cancer risk analyses and hence reduces the bias stemming from this exclusion. In addition, imputing the not yet observed ages at menopause for mostly younger women is also crucial when studying the time trend of age at menopause; otherwise the analysis will be biased.
  • Item
    Using Background Knowledge from Preceding Studies for Building a Random Forest Prediction Model: A Plasmode Simulation Study
    Hafermann, L ; Klein, N ; Rauch, G ; Kammer, M ; Heinze, G (MDPI, 2022-06)
    There is an increasing interest in machine learning (ML) algorithms for predicting patient outcomes, as these methods are designed to automatically discover complex data patterns. For example, the random forest (RF) algorithm is designed to identify relevant predictor variables out of a large set of candidates. In addition, researchers may also use external information for variable selection to improve model interpretability and variable selection accuracy, and thereby prediction quality. However, it is unclear to what extent, if at all, RF and ML methods may benefit from external information. In this paper, we examine the usefulness of external information from prior variable selection studies that used traditional statistical modeling approaches such as the Lasso, or suboptimal methods such as univariate selection. We conducted a plasmode simulation study based on subsampling a data set from a pharmacoepidemiologic study with nearly 200,000 individuals, two binary outcomes and 1152 candidate predictor (mainly sparse binary) variables. When the scope of candidate predictors was reduced based on external knowledge, RF models achieved better calibration, that is, better agreement of predictions and observed outcome rates. However, prediction quality measured by cross-entropy, AUROC or the Brier score did not improve. We recommend appraising the methodological quality of studies that serve as an external information source for future prediction model development.
  • Item
    bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond)
    Umlauf, N ; Klein, N ; Simon, T ; Zeileis, A (JOURNAL STATISTICAL SOFTWARE, 2021-11)
  • Item
    Review of guidance papers on regression modeling in statistical series of medical journals.
    Wallisch, C ; Bach, P ; Hafermann, L ; Klein, N ; Sauerbrei, W ; Steyerberg, EW ; Heinze, G ; Rauch, G ; topic group 2 of the STRATOS initiative ; Mathes, T (Public Library of Science (PLoS), 2022)
    Although regression models play a central role in the analysis of medical research projects, there still exist many misconceptions about various aspects of modeling, leading to faulty analyses. Indeed, the rapidly developing statistical methodology and its recent advances in regression modeling do not seem to be adequately reflected in many medical publications. This problem of knowledge transfer from statistical research to application was identified by some medical journals, which have published series of statistical tutorials and (shorter) papers mainly addressing medical researchers. The aim of this review was to assess the current level of knowledge with regard to regression modeling contained in such statistical papers. We searched for target series by a request to international statistical experts. We identified 23 series including 57 topic-relevant articles. Within each article, two independent raters analyzed the content by investigating 44 predefined aspects of regression modeling. We assessed to what extent the aspects were explained and if examples, software advice, and recommendations for or against specific methods were given. Most series (21/23) included at least one article on multivariable regression. Logistic regression was the most frequently described regression type (19/23), followed by linear regression (18/23), Cox regression and survival models (12/23) and Poisson regression (3/23). Most general aspects of regression modeling, e.g. model assumptions, reporting and interpretation of regression results, were covered. We did not find many misconceptions or misleading recommendations, but we identified relevant gaps, in particular with respect to addressing nonlinear effects of continuous predictors, model specification and variable selection. Specific recommendations on software were rarely given. Statistical guidance should be developed for nonlinear effects, model specification and variable selection to better support medical researchers who perform or interpret regression analyses.
  • Item
    Statistical model building: Background "knowledge" based on inappropriate preselection causes misspecification
    Hafermann, L ; Becher, H ; Herrmann, C ; Klein, N ; Heinze, G ; Rauch, G (BMC, 2021-09-29)
    BACKGROUND: Statistical model building requires selection of variables for a model depending on the model's aim. In descriptive and explanatory models, a common recommendation often met in the literature is to include all variables in the model which are assumed or known to be associated with the outcome, independent of their identification with data-driven selection procedures. An open question is how reliable this assumed "background knowledge" truly is. In fact, "known" predictors might be findings from preceding studies which may also have employed inappropriate model building strategies. METHODS: We conducted a simulation study assessing the influence of treating variables as "known predictors" in model building when in fact this knowledge resulting from preceding studies might be insufficient. Within randomly generated preceding study data sets, model building with variable selection was conducted. A variable was subsequently considered as a "known" predictor if a predefined number of preceding studies identified it as relevant. RESULTS: Even if several preceding studies identified a variable as a "true" predictor, this classification is often false positive. Moreover, variables not identified might still be truly predictive. This especially holds true if the preceding studies employed inappropriate selection methods such as univariable selection. CONCLUSIONS: The source of "background knowledge" should be evaluated with care. Knowledge derived from preceding studies can cause misspecification.
  • Item
    Editorial "Joint modeling of longitudinal and time-to-event data and beyond"
    Suarez, CC ; Klein, N ; Kneib, T ; Molenberghs, G ; Rizopoulos, D (WILEY, 2017-11)
  • Item
    Studying the relationship between a woman's reproductive lifespan and age at menarche using a Bayesian multivariate structured additive distributional regression model
    Duarte, E ; de Sousa, B ; Cadarso-Suarez, C ; Klein, N ; Kneib, T ; Rodrigues, V (WILEY, 2017-11)
    Studies addressing breast cancer risk factors have been looking at trends relative to age at menarche and menopause. These studies point to a downward trend of age at menarche and an upward trend for age at menopause, meaning an increase of a woman's reproductive lifespan cycle. In addition to studying the effect of the year of birth on the expectation of age at menarche and a woman's reproductive lifespan, it is important to understand how a woman's cohort affects the correlation between these two variables. Since the behavior of age at menarche and menopause may vary with the geographic location of a woman's residence, the spatial effect of the municipality where a woman resides needs to be considered. Thus, a Bayesian multivariate structured additive distributional regression model is proposed in order to analyze how a woman's municipality and year of birth affect her age at menarche, her lifespan cycle, and the correlation of the two. The data consist of 212,517 postmenopausal women, born between 1920 and 1965, who attended the breast cancer screening program in the central region of Portugal.