I recently read a new review of the decrease in cervical cancer following introduction of the HPV vaccine. I am loosely basing my model off of their findings. However, this model does not account for the difference before and after introduction, so it is really just a model of change over time without accounting for vaccination.
I started by establishing my explanatory and response variables.
# establish explanatory variable: years before and after the introduction of the HPV vaccine
year <- 1992:2021
# establish response variable: number of cervical cancer deaths
deaths <- rnorm(30, mean=39, sd=15)
I first created a simple model of years vs cervical cancer deaths.
data_frame <- data.frame(year, deaths)
str(data_frame)
## 'data.frame': 30 obs. of 2 variables:
## $ year : int 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 ...
## $ deaths: num 51.9 50 32.6 29 56.6 ...
To analyze this data, I used a linear model. Since it is randomly generated and distributed, there is little to no association between year and mortality.
# fitting a linear model
model <- lm(deaths~year)
summary(model)
##
## Call:
## lm(formula = deaths ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.105 -8.361 -2.483 8.872 35.602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -552.0010 539.3016 -1.024 0.315
## year 0.2975 0.2688 1.107 0.278
##
## Residual standard error: 12.74 on 28 degrees of freedom
## Multiple R-squared: 0.04193, Adjusted R-squared: 0.007718
## F-statistic: 1.226 on 1 and 28 DF, p-value: 0.2777
# visualize dataset
plot(year, deaths)
To test the effect of different sample sizes, I adjusted the total number of deaths that occurred over the study period. When the overall number of deaths is greater, the correlation is much stronger.
year <- 1992:2021
# testing a large sample size
total_deaths_large <- seq(from = 400, to = 50, length.out = 30)
for(i in 1:30) {
large_sample_size <- data.frame(year = year, deaths = total_deaths_large)
plot(large_sample_size[,1], large_sample_size[,2], xlab = "year", ylab = "deaths")
}
# testing a small sample size
total_deaths_small <- seq(from = 100, to = 10, length.out = 30)
for(i in 1:30) {
small_sample_size <- data.frame(year = year, deaths = total_deaths_small)
plot(small_sample_size[,1], small_sample_size[,2], xlab = "year", ylab = "deaths", ylim = c(0, 400))
}
To test different effect sizes, I added noise using random normal values. When the standard deviation of the noise variable is low, the effect is very strong and the plot is still very linear. When the standard deviation of the noise variable is high, the variation is much stronger and the effect is less distinct.
year <- 1992:2021
total_deaths <- seq(from = 400, to = 50, length.out = 30)
noise_small <- rnorm(30, mean = 0, sd = 5)
# testing a strong effect size (minimal noise)
for(i in 1:30) {
large_effect_size <- data.frame(year = year, deaths = total_deaths + noise_small)
plot(large_effect_size[,1], large_effect_size[,2], xlab = "year", ylab = "deaths")
}
# testing a weak effect size (a lot of noise)
noise_large <- rnorm(30, mean = 0, sd= 50)
for(i in 1:30) {
small_effect_size <- data.frame(year = year, deaths = total_deaths + noise_large)
plot(small_effect_size[,1], small_effect_size[,2], xlab = "year", ylab = "deaths")
}
In the study, the actual standard deviation was about 15 and there were only 55 deaths in the first year.
year <- 1992:2021
sample <- seq(from = 55, to = 12, length.out = 30)
noise <- rnorm(30, mean = 0, sd = 15)
results <- data.frame(year = year, deaths = rep(NA, 30))
for(i in 1:30) {
results[,2] <- sample + noise
plot(results[,1], results[,2], xlab = "year", ylab = "deaths")
}
model <- lm(deaths~year, data=results)
summary(model)
##
## Call:
## lm(formula = deaths ~ year, data = results)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.668 -12.438 4.441 13.905 29.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3502.4872 722.1788 4.850 4.18e-05 ***
## year -1.7267 0.3599 -4.797 4.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.06 on 28 degrees of freedom
## Multiple R-squared: 0.4512, Adjusted R-squared: 0.4315
## F-statistic: 23.02 on 1 and 28 DF, p-value: 4.825e-05