A config-driven R package for reproducible analysis of tabular biomedical data

Motivation

tabularTools demonstrates a clean, testable software engineering approach to tabular data analysis in biomedical research, with a focus on

  • explicit configuration with YAML
  • reproducible preprocessing using dplyr/tidyverse
  • extensible modeling APIs
  • parallel-safe execution
  • unit-tested components
Core Functions

# Read the YAML config file
read_config()

# Read the data file specified in the config
read_data()

# Validate that the data is well-formed tabular data
validate_data()

# Preprocess the data based on config definitions
preprocess_data()

# Fit the defined models
fit_models()

# Evaluate model results
evaluate_results()

# Create visualizations based on the evaluations
visualize_results()

Example Usage

library(tabularTools)
library(future)

# Enable parallel execution
plan(multisession, workers = 4)

cfg   <- read_config("example/config.yaml")
data  <- read_data(cfg)

validate_data(data, cfg)

pdata <- preprocess_data(cfg, data)

models <- fit_models(pdata, cfg)

# Inspect a fitted model
summary(models$logistic$`0_vs_1`$model)
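The pipeline can then be completed with the evaluation and visualization steps listed above. The argument order shown here is an illustrative assumption, chosen to match the other calls; check the function documentation for the exact signatures.

```r
# Evaluate the fitted models against the contrasts defined in the config
results <- evaluate_results(models, cfg)

# Generate the plots enabled in the config (roc_curve, coefficient_plot)
visualize_results(results, cfg)
```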

User Configuration

Analysis is driven by a YAML configuration file:

data:
  file: "heart_disease_uci.csv"
  
analysis:
  outcome: num
  predictors:
    - age
    - sex
    - chol
    - cp
    - trestbps
    - fbs
    - restecg
    - thalch
    - exang
    - oldpeak
    - slope
    - ca
    - thal
  models:
    - logistic
    - svm
  contrasts:
    - [0, 1]
    - [0, 2]
    - [0, 3]
    - [0, 4]

preprocessing:
  scale_numeric: true
  impute_missing: median  # options: "drop", "median", or "none"

visualization:
  roc_curve: true
  coefficient_plot: true

report:
  title: "Heart Disease Analysis"
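For illustration, a configuration file like the one above can be parsed with the yaml package. This is a minimal sketch of what read_config() might do internally, under the assumption that it returns the parsed YAML as a nested list; it is not the package's actual implementation.

```r
library(yaml)

# Parse the YAML file into a nested R list
cfg <- yaml::read_yaml("example/config.yaml")

# Fields are then available as list elements, e.g.:
cfg$data$file            # the data file path
cfg$analysis$outcome     # the outcome column name
cfg$analysis$predictors  # character vector of predictor names
```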

Repository structure

  • R/ - source code
  • tests/ - testthat unit tests
  • vignettes/ - sample Quarto documents

Status

This package is under active development and is intended to demonstrate how R's software development tooling (packaging, unit testing, configuration) can simplify the management of complex tabular data analyses.