Title: | Functions and Data for a Course on Modern Regression and Classification |
---|---|
Description: | Functions and data are provided that support a course that emphasizes statistical issues of inference and generalizability. The functions are designed to make it straightforward to illustrate the use of cross-validation, the training/test approach, simulation, and model-based estimates of accuracy. Methods considered are Generalized Additive Modeling, Linear and Quadratic Discriminant Analysis, Tree-based methods, and Random Forests. |
Authors: | John Maindonald |
Maintainer: | John Maindonald <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.62.6 |
Built: | 2025-02-12 05:12:40 UTC |
Source: | https://github.com/jhmaindonald/gamclass |
For purposes of this package, modern regression extends to include classification and multivariate exploration. A strong focus is on methods described in Wood (2017) <doi:10.1201/9781315370279>
Functions are mostly designed to facilitate a variety of cross-validation and bootstrap calculations.
John Maindonald
Maintainer: [email protected]
Venables, W N, & Ripley, B D (2013). Modern applied statistics with S-PLUS. Springer Science & Business Media.
Wood, S N (2017) Generalized Additive Models: An Introduction with R (2nd edition). Chapman and Hall/CRC.
https://github.com/jhmaindonald/gamclass
This is designed for adding horizontal lines that show predicted
values to a plot of observed values versus x-values,
in rpart
regression. Where predicted values change
between two successive x-values lines are extended to the midway
point. This reflects the way that predict.rpart
handles predictions for new data.
addhlines(x, y, ...)
addhlines(x, y, ...)
x |
Vector of predictor variable values. |
y |
Vector of predicted values. |
... |
Additional graphics parameters, for passing through to the
|
Lines are added to the current graph.
John Maindonald
x <- c(34, 18, 45, 18, 27, 24, 34, 20, 24, 28, 21, 18) y <- c(14, 11, 12, 9, 4, 11, 6, 9, 4, 10, 9, 2) hat <- c(10.5, 7.75, 10.5, 7.75, 7, 7, 10.5, 7.75, 7, 10.5, 7, 7.75) plot(x, y) addhlines(x, hat, lwd=2, col="gray") ## The function is currently defined as function(x,y, ...){ ordx <- order(x) xo <- x[ordx] yo <- y[ordx] breaks <- diff(yo)!=0 xh <- c(xo[1],0.5*(xo[c(FALSE,breaks)]+xo[c(breaks, FALSE)])) yh <- yo[c(TRUE, breaks)] y3 <- x3 <- numeric(3*length(xh)-1) loc1 <- seq(from=1, to=length(x3), by=3) x3[loc1] <- xh x3[loc1+1]<- c(xh[-1], max(x)) x3[loc1[-length(loc1)]+2] <- NA y3[loc1[-length(loc1)]+2] <- NA y3[loc1] <- yh y3[loc1+1] <- yh lines(x3,y3, ...) }
x <- c(34, 18, 45, 18, 27, 24, 34, 20, 24, 28, 21, 18) y <- c(14, 11, 12, 9, 4, 11, 6, 9, 4, 10, 9, 2) hat <- c(10.5, 7.75, 10.5, 7.75, 7, 7, 10.5, 7.75, 7, 10.5, 7, 7.75) plot(x, y) addhlines(x, hat, lwd=2, col="gray") ## The function is currently defined as function(x,y, ...){ ordx <- order(x) xo <- x[ordx] yo <- y[ordx] breaks <- diff(yo)!=0 xh <- c(xo[1],0.5*(xo[c(FALSE,breaks)]+xo[c(breaks, FALSE)])) yh <- yo[c(TRUE, breaks)] y3 <- x3 <- numeric(3*length(xh)-1) loc1 <- seq(from=1, to=length(x3), by=3) x3[loc1] <- xh x3[loc1+1]<- c(xh[-1], max(x)) x3[loc1[-length(loc1)]+2] <- NA y3[loc1[-length(loc1)]+2] <- NA y3[loc1] <- yh y3[loc1+1] <- yh lines(x3,y3, ...) }
Aircraft Crash Data
data(airAccs)
data(airAccs)
A data frame with 5666 observations on the following 7 variables.
Date
Date of Accident
location
Location of accident
operator
Aircraft operator
planeType
Aircraft type
Dead
Number of deaths
Aboard
Number aboard
Ground
Deaths on ground
For details of inclusion criteria, see https://www.planecrashinfo.com/database.htm
https://www.planecrashinfo.com/database.htm
https://www.planecrashinfo.com/reference.htm
data(airAccs) str(airAccs)
data(airAccs) str(airAccs)
Australian regional temperature data, Australian regional rainfall
data, and Annual SOI, are given for the years 1900-2018. The regional
rainfall and temperature data are area-weighted averages for the
respective regions. The Southern Oscillation Index (SOI) is the
difference in barometric pressure at sea level between Tahiti and Darwin.
Data through to 2021, including also the Indian Ocean Dipole, is available in the file DAAG::bomregions2021
.
data("bomregions2018")
data("bomregions2018")
This data frame contains the following columns:
Year
Southeastern region average temperature (degrees C)
Southern temperature
Eastern temperature
Northern temperature
Southwestern temperature
temperature
temperature
temperature
temperature
temperature
temperature
temperature
Murray-Darling basin temperature
Australian average temperature, area-weighted mean
Southeast Australian annual rainfall (mm)
Southern rainfall
Eastern rainfall
Northern rainfall
Southwest rainfall
Queensland rainfall
NSW rainfall
Northern Territory rainfall
South Australian rainfall
Tasmanian rainfall
Victorian rainfall
West Australian rainfall
Murray-Darling basin rainfall
Australian average rainfall, area weighted
Annual average Southern Oscillation Index
Annual average sunspot counts
Moana Loa CO2 concentrations, from 1959
Moana Loa CO2 concentrations, 1900 to 1978
CO2 concentrations, composite series
Annual average Dipole Mode Index, for the Indian Ocean Dipole
Australian Bureau of Meteorology web pages:
http://www.bom.gov.au/climate/change/index.shtml
The SOI data are from http://www.bom.gov.au/climate/enso/#tabs=SOI.
The CO2 series co2law
, for Law Dome ice core data. is from
https://data.ess-dive.lbl.gov/portals/CDIAC.
The CO2 series co2mlo
is from Dr. Pieter Tans, NOAA/ESRL
(https://gml.noaa.gov/ccgg/trends/)
The series CO2
is a composite series, obtained by adding 0.46 to
he Law data for 1900 to 1958, then following this with the Moana Loa
data that is avaiable from 1959. The addition of 0.46 is designed so
that the averages from the two series agree for the period 1959 to
1968
Sunspot data is from http://www.sidc.be/silso/datafiles
D.M. Etheridge, L.P. Steele, R.L. Langenfelds, R.J. Francey, J.-M. Barnola and V.I. Morgan, 1998, Historical CO2 records from the Law Dome DE08, DE08-2, and DSS ice cores, in Trends: A Compendium of Data on Global Change, on line at Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, U.S. Department of Energy, Oak Ridge, Tenn., U.S.A.
Lavery, B., Joung, G. and Nicholls, N. 1997. An extended high-quality historical rainfall dataset for Australia. Australian Meteorological Magazine, 46, 27-38.
Nicholls, N., Lavery, B., Frederiksen, C.\ and Drosdowsky, W. 1996. Recent apparent changes in relationships between the El Nino – southern oscillation and Australian rainfall and temperature. Geophysical Research Letters 23: 3357-3360.
SIDC-team, World Data Center for the Sunspot Index, Royal Observatory of Belgium, Monthly Report on the International Sunspot Number, online catalogue of the sunspot index: http://www.sidc.be/silso/datafiles, 1900-2011
plot(ts(bomregions2018[, c("mdbRain","SOI")], start=1900), panel=function(y,...)panel.smooth(bomregions2018$Year, y,...)) avrain <- bomregions2018[,"mdbRain"] xbomsoi <- with(bomregions2018, data.frame(Year=Year, SOI=SOI, cuberootRain=avrain^0.33)) xbomsoi$trendSOI <- lowess(xbomsoi$SOI, f=0.1)$y xbomsoi$trendRain <- lowess(xbomsoi$cuberootRain, f=0.1)$y xbomsoi$detrendRain <- with(xbomsoi, cuberootRain - trendRain + mean(trendRain)) xbomsoi$detrendSOI <- with(xbomsoi, SOI - trendSOI + mean(trendSOI)) ## Plot time series avrain and SOI: ts object xbomsoi plot(ts(xbomsoi[, c("cuberootRain","SOI")], start=1900), panel=function(y,...)panel.smooth(xbomsoi$Year, y,...), xlab = "Year", main="", ylim=list(c(250, 800),c(-20,25))) par(mfrow=c(1,2)) rainpos <- pretty(xbomsoi$cuberootRain^3, 6) plot(cuberootRain ~ SOI, data = xbomsoi, ylab = "Rainfall (cube root scale)", yaxt="n") axis(2, at = rainpos^0.33, labels=paste(rainpos)) mtext(side = 3, line = 0.8, "A", adj = -0.025) with(xbomsoi, lines(lowess(cuberootRain ~ SOI, f=0.75))) plot(detrendRain ~ detrendSOI, data = xbomsoi, xlab="Detrended SOI", ylab = "Detrended rainfall", yaxt="n") axis(2, at = rainpos^0.33, labels=paste(rainpos)) with(xbomsoi, lines(lowess(detrendRain ~ detrendSOI, f=0.75))) mtext(side = 3, line = 0.8, "B", adj = -0.025) par(mfrow=c(1,1))
plot(ts(bomregions2018[, c("mdbRain","SOI")], start=1900), panel=function(y,...)panel.smooth(bomregions2018$Year, y,...)) avrain <- bomregions2018[,"mdbRain"] xbomsoi <- with(bomregions2018, data.frame(Year=Year, SOI=SOI, cuberootRain=avrain^0.33)) xbomsoi$trendSOI <- lowess(xbomsoi$SOI, f=0.1)$y xbomsoi$trendRain <- lowess(xbomsoi$cuberootRain, f=0.1)$y xbomsoi$detrendRain <- with(xbomsoi, cuberootRain - trendRain + mean(trendRain)) xbomsoi$detrendSOI <- with(xbomsoi, SOI - trendSOI + mean(trendSOI)) ## Plot time series avrain and SOI: ts object xbomsoi plot(ts(xbomsoi[, c("cuberootRain","SOI")], start=1900), panel=function(y,...)panel.smooth(xbomsoi$Year, y,...), xlab = "Year", main="", ylim=list(c(250, 800),c(-20,25))) par(mfrow=c(1,2)) rainpos <- pretty(xbomsoi$cuberootRain^3, 6) plot(cuberootRain ~ SOI, data = xbomsoi, ylab = "Rainfall (cube root scale)", yaxt="n") axis(2, at = rainpos^0.33, labels=paste(rainpos)) mtext(side = 3, line = 0.8, "A", adj = -0.025) with(xbomsoi, lines(lowess(cuberootRain ~ SOI, f=0.75))) plot(detrendRain ~ detrendSOI, data = xbomsoi, xlab="Detrended SOI", ylab = "Detrended rainfall", yaxt="n") axis(2, at = rainpos^0.33, labels=paste(rainpos)) with(xbomsoi, lines(lowess(detrendRain ~ detrendSOI, f=0.75))) mtext(side = 3, line = 0.8, "B", adj = -0.025) par(mfrow=c(1,1))
The data consist of observations on three variables for each of 212 men in a sample of Cardiff enumeration districts.
bronchitis
bronchitis
A data.frame of 212 obs of 3 variables:
cig
numeric, the number of cigarettes per day
poll
numeric, the smoke level in the locality
r
integer, 1= respondent suffered from chronic bronchitis
rfac
factor, with levels abs
(r
=0), and abs
(r
=0)
See p.224 in SMIR
This copy of the dataset was copied from version 0.02 of the SMIR package, which in turn obtained it from Jones (1975).
Jones, K. (1975), A geographical contribution to the aetiology of chronic bronchitis, Unpublished BSc dissertation, University of Southampton. Published in Wrigley, N. (1976). Introduction to the use of logit models in geography, Geo.Abstracts Ltd, CATMOG 10, University of East Anglia, Norwich.
Murray Aitkin, Brian Francis, John Hinde and Ross Darnell (2009). SMIR: Companion to Statistical Modelling in R (SMIR). Oxford University Press.
data(bronchit)
data(bronchit)
y
, for all possible splits on values of x
Each point of separation between successve values of x
is used
in turn to create two groups of observations. The between group sum
of squares for y
is calculated for each such split.
bssBYcut(x, y, data)
bssBYcut(x, y, data)
x |
Variable (numeric) used to define splits. Observations with |
y |
Variable for which BSS values are to be calculated. |
data |
Data frame with columns |
Data frame with columns:
xOrd |
Cut points for splits. |
comp2 |
Between groups sum of squares |
J H Maindonald
xy <- bssBYcut(weight, height, women) with(xy, xy[which.max(bss), ]) ## The function is currently defined as function (x, y, data) { xnam <- deparse(substitute(x)) ynam <- deparse(substitute(y)) xv <- data[, xnam] yv <- data[, ynam] sumss <- function(x, y, cut) { av <- mean(y) left <- x < cut sum(left) * (mean(y[left]) - av)^2 + sum(!left) * (mean(y[!left]) - av)^2 } xOrd <- unique(sort(xv))[-1] bss <- numeric(length(xOrd)) for (i in 1:length(xOrd)) { bss[i] <- sumss(xv, yv, xOrd[i]) } list(xOrd = xOrd, bss = bss) }
xy <- bssBYcut(weight, height, women) with(xy, xy[which.max(bss), ]) ## The function is currently defined as function (x, y, data) { xnam <- deparse(substitute(x)) ynam <- deparse(substitute(y)) xv <- data[, xnam] yv <- data[, ynam] sumss <- function(x, y, cut) { av <- mean(y) left <- x < cut sum(left) * (mean(y[left]) - av)^2 + sum(!left) * (mean(y[!left]) - av)^2 } xOrd <- unique(sort(xv))[-1] bss <- numeric(length(xOrd)) for (i in 1:length(xOrd)) { bss[i] <- sumss(xv, yv, xOrd[i]) } list(xOrd = xOrd, bss = bss) }
Compare, between models, probabilities that the models assign to membership in the correct group or class. Probabilites should be estimated from cross-validation or from bootstrap out-of-bag data or preferably for test data that are completely separate from the data used to dervive the model.
compareModels(groups, estprobs = list(lda = NULL, rf = NULL), gpnames = NULL, robust = TRUE, print = TRUE)
compareModels(groups, estprobs = list(lda = NULL, rf = NULL), gpnames = NULL, robust = TRUE, print = TRUE)
groups |
Factor that specifies the groups |
estprobs |
List whose elements (with names that identify the models) are matrices that give for each observation (row) estimated probabilities of membership for each of the groups (columns). |
gpnames |
Character: names for groups, if different from
|
robust |
Logical, |
print |
Logical. Should results be printed? |
The estimated probabilities are compared directly, under normal distribution assumptions. An effect is fitted for each observation, plus an effect for the method. Comparison on a logit scale may sometimes be preferable. An option to allow this is scheduled for incorporation in a later version.
modelAVS |
Average accuracies for models |
modelSE |
Approximate average SE for comparing models |
gpAVS |
Average accuracies for groups |
gpSE |
Approximate average SE for comparing groups |
obsEff |
Effects assigned to individual observations |
The analysis estimates effects due to model and group (gp
),
after accounting for differences between observations.
John Maindonald
library(MASS) library(DAAG) library(randomForest) ldahat <- lda(species ~ length+breadth, data=cuckoos, CV=TRUE)$posterior qdahat <- qda(species ~ length+breadth, data=cuckoos, CV=TRUE)$posterior rfhat <- predict(randomForest(species ~ length+breadth, data=cuckoos), type="prob") compareModels(groups=cuckoos$species, estprobs=list(lda=ldahat, qda=qdahat, rf=rfhat), robust=FALSE)
library(MASS) library(DAAG) library(randomForest) ldahat <- lda(species ~ length+breadth, data=cuckoos, CV=TRUE)$posterior qdahat <- qda(species ~ length+breadth, data=cuckoos, CV=TRUE)$posterior rfhat <- predict(randomForest(species ~ length+breadth, data=cuckoos), type="prob") compareModels(groups=cuckoos$species, estprobs=list(lda=ldahat, qda=qdahat, rf=rfhat), robust=FALSE)
Given actual and predicted group assignments, give the confusion matrix
confusion(actual, predicted, gpnames = NULL, rowcol=c("actual", "predicted"), printit = c("overall","confusion"), prior = NULL, digits=3)
confusion(actual, predicted, gpnames = NULL, rowcol=c("actual", "predicted"), printit = c("overall","confusion"), prior = NULL, digits=3)
actual |
Actual (prior) group assigments |
predicted |
Predicted group assigments. |
gpnames |
Names for groups, if different from |
rowcol |
For predicted categories to appear as rows,
specify |
printit |
Character vector. Print |
prior |
Prior probabilities for groups, if different from the relative group frequencies |
digits |
Number of decimal digits to display in printed output |
Predicted group assignments should be estimated from cross-validation or from bootstrap out-of-bag data. Better still, work with assignments for test data that are completely separate from the data used to dervive the model.
A list with elements overall (overall accuracy), confusion (confusion matrix) and prior (prior used for calculation of overall accuracy)
John H Maindonald
Maindonald and Braun: 'Data Analysis and Graphics Using R', 3rd edition 2010, Section 12.2.2
library(MASS) library(DAAG) cl <- lda(species ~ length+breadth, data=cuckoos, CV=TRUE)$class confusion(cl, cuckoos$species) ## The function is currently defined as function (actual, predicted, gpnames = NULL, rowcol = c("actual", "predicted"), printit = c("overall","confusion"), prior = NULL, digits = 3) { if (is.null(gpnames)) gpnames <- levels(actual) if (is.logical(printit)){ if(printit)printit <- c("overall","confusion") else printit <- "" } tab <- table(actual, predicted) acctab <- t(apply(tab, 1, function(x) x/sum(x))) dimnames(acctab) <- list(Actual = gpnames, `Predicted (cv)` = gpnames) if (is.null(prior)) { relnum <- table(actual) prior <- relnum/sum(relnum) acc <- sum(tab[row(tab) == col(tab)])/sum(tab) } else { acc <- sum(prior * diag(acctab)) } names(prior) <- gpnames if ("overall"%in%printit) { cat("Overall accuracy =", round(acc, digits), "\n") if(is.null(prior)){ cat("This assumes the following prior frequencies:", "\n") print(round(prior, digits)) } } if ("confusion"%in%printit) { cat("\nConfusion matrix", "\n") print(round(acctab, digits)) } invisible(list(overall=acc, confusion=acctab, prior=prior)) }
library(MASS) library(DAAG) cl <- lda(species ~ length+breadth, data=cuckoos, CV=TRUE)$class confusion(cl, cuckoos$species) ## The function is currently defined as function (actual, predicted, gpnames = NULL, rowcol = c("actual", "predicted"), printit = c("overall","confusion"), prior = NULL, digits = 3) { if (is.null(gpnames)) gpnames <- levels(actual) if (is.logical(printit)){ if(printit)printit <- c("overall","confusion") else printit <- "" } tab <- table(actual, predicted) acctab <- t(apply(tab, 1, function(x) x/sum(x))) dimnames(acctab) <- list(Actual = gpnames, `Predicted (cv)` = gpnames) if (is.null(prior)) { relnum <- table(actual) prior <- relnum/sum(relnum) acc <- sum(tab[row(tab) == col(tab)])/sum(tab) } else { acc <- sum(prior * diag(acctab)) } names(prior) <- gpnames if ("overall"%in%printit) { cat("Overall accuracy =", round(acc, digits), "\n") if(is.null(prior)){ cat("This assumes the following prior frequencies:", "\n") print(round(prior, digits)) } } if ("confusion"%in%printit) { cat("\nConfusion matrix", "\n") print(round(acctab, digits)) } invisible(list(overall=acc, confusion=acctab, prior=prior)) }
P-values were calculated for each of 3072 genes, for data that compared expression values between post-settlement coral larvae and pre-settlement coral larvae.
data("coralPval")
data("coralPval")
The format is: num [1:3072, 1] 8.60e-01 3.35e-08 3.96e-01 2.79e-01 6.36e-01 ...
t-statistics, and hence p-values, were derived from five replicate two-colour micro-array slides. Details are in a vignette that accompanies the DAAGbio package.
See the ?DAAGbio::coralRG
Grasso, L. C.; Maindonald, J.; Rudd, S.; Hayward, D. C.; Saint, R.; Miller, D. J.; and Ball, E. E., 2008. Microarray analysis identifies candidate genes for key roles in coral development. BMC Genomics, 9:540.
## From p-values, calculate Benjamini-Hochberg false discrimination rates fdr <- p.adjust(gamclass::coralPval, method='BH') ## Number of genes identified as differentially expressed for FDR = 0.01 sum(fdr<=0.01)
## From p-values, calculate Benjamini-Hochberg false discrimination rates fdr <- p.adjust(gamclass::coralPval, method='BH') ## Number of genes identified as differentially expressed for FDR = 0.01 sum(fdr<=0.01)
Measurements made beteween 1675 and 1972
cvalues
cvalues
A data frame with 9 observations on the following 3 variables.
Year
Year of measurement
speed
estimated speed in meters per second
error
measurement error, as estimated by experimenter(s)
https://en.wikipedia.org/wiki/Speed_of_light accessed 2011/12/22
data(cvalues)
data(cvalues)
This function adapts cross-validation to work with clustered categorical outcome data. For example, there may be multiple observations on individuals (clusters). It requires a fitting function that accepts a model formula.
CVcluster(formula, id, data, na.action=na.omit, nfold = 15, FUN = MASS::lda, predictFUN=function(x, newdata, ...)predict(x, newdata, ...)$class, printit = TRUE, cvparts = NULL, seed = 29)
CVcluster(formula, id, data, na.action=na.omit, nfold = 15, FUN = MASS::lda, predictFUN=function(x, newdata, ...)predict(x, newdata, ...)$class, printit = TRUE, cvparts = NULL, seed = 29)
formula |
Model formula |
id |
numeric, identifies clusters |
data |
data frame that supplies the data |
na.action |
|
nfold |
Number of cross-validation folds |
FUN |
|
predictFUN |
|
printit |
Should summary information be printed? |
cvparts |
Use, if required, to specify the precise folds used for the cross-validation. The comparison between different models will be more accurate if the same folds are used. |
seed |
Set seed, if required, so that results are exactly reproducible |
class |
Predicted values from cross-validation |
CVaccuracy |
Cross-validation estimate of accuracy |
confusion |
Confusion matrix |
John Maindonald
https://maths-people.anu.edu.au/~johnm/nzsr/taws.html
if(requireNamespace('mlbench')&requireNamespace('MASS')){ data('Vowel',package='mlbench') acc <- CVcluster(formula=Class ~., id = V1, data = Vowel, nfold = 15, FUN = MASS::lda, predictFUN=function(x, newdata, ...)predict(x, newdata, ...)$class, printit = TRUE, cvparts = NULL, seed = 29) }
if(requireNamespace('mlbench')&requireNamespace('MASS')){ data('Vowel',package='mlbench') acc <- CVcluster(formula=Class ~., id = V1, data = Vowel, nfold = 15, FUN = MASS::lda, predictFUN=function(x, newdata, ...)predict(x, newdata, ...)$class, printit = TRUE, cvparts = NULL, seed = 29) }
The cross-validation estimate of accuracy is sufficiently independent of the available model fitting criteria (including Generalized Cross-validation) that it provides a useful check on the extent of downward bias in the estimated standard error of residual.
CVgam(formula, data, nfold = 10, debug.level = 0, method = "GCV.Cp", printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
CVgam(formula, data, nfold = 10, debug.level = 0, method = "GCV.Cp", printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
formula |
Model formula, for passing to the |
data |
data frame that supplies the data |
nfold |
Number of cross-validation folds |
debug.level |
See |
method |
Fit method for GAM model. See |
printit |
Should summary information be printed? |
cvparts |
Use, if required, to specify the precise folds used for the cross-validation. The comparison between different models will be more accurate if the same folds are used. |
gamma |
See |
seed |
Set seed, if required, so that results are exactly reproducible |
fitted |
fitted values |
resid |
residuals |
cvscale |
scale parameter from cross-validation |
scale.gam |
scale parameter from function |
The scale parameter from cross-validation is the error mean square)
John Maindonald
https://maths-people.anu.edu.au/~johnm/nzsr/taws.html
if(require(sp)){ library(mgcv) data(meuse) meuse$ffreq <- factor(meuse$ffreq) CVgam(formula=log(zinc)~s(elev) + s(dist) + ffreq + soil, data = meuse, nfold = 10, debug.level = 0, method = "GCV.Cp", printit = TRUE, cvparts = NULL, gamma = 1, seed = 29) }
if(require(sp)){ library(mgcv) data(meuse) meuse$ffreq <- factor(meuse$ffreq) CVgam(formula=log(zinc)~s(elev) + s(dist) + ffreq + soil, data = meuse, nfold = 10, debug.level = 0, method = "GCV.Cp", printit = TRUE, cvparts = NULL, gamma = 1, seed = 29) }
For example, dates may be dates of plane crashes. For purposes of analysis, this function tabulates number of crash events per event of time, for each successive specified event.
eventCounts(data, dateCol="Date", from = NULL, to = NULL, by = "1 month", categoryCol=NULL, takeOnly=NULL, prefix="n_")
eventCounts(data, dateCol="Date", from = NULL, to = NULL, by = "1 month", categoryCol=NULL, takeOnly=NULL, prefix="n_")
data |
Data frame that should include any columns whose names appear in other function arguments. |
dateCol |
Name of column that holds vector of dates |
from |
Starting date. If |
to |
Final date, for which numbers of events are to be tallied. If
|
by |
Time event to be used; e.g. "1 day", or "1 week", or "4 weeks", or "1 month", or "1 quarter", or "1 year", or "10 years". |
categoryCol |
If not |
takeOnly |
If not |
prefix |
If |
A data frame, with columns Date
(the first day of the
event for which events are given), and other column(s) that
hols counts of events.
John Maindonald
crashDate <- as.Date(c("1908-09-17","1912-07-12","1913-08-06", "1913-09-09","1913-10-17")) df <- data.frame(date=crashDate) byYears <- eventCounts(data=df, dateCol="date", from=as.Date("1908-01-01"), by="1 year")
crashDate <- as.Date(c("1908-09-17","1912-07-12","1913-08-06", "1913-09-09","1913-10-17")) df <- data.frame(date=crashDate) byYears <- eventCounts(data=df, dateCol="date", from=as.Date("1908-01-01"), by="1 year")
Data are from the US FARS (Fatality Analysis Recording System) archive that is intended to include every accident in which there was at least one fatality. Data are limited to vehicles where the front seat passenger seat was occupied. Values are given for selected variables only.
FARS
FARS
A data frame with 134332 observations on the following 18 variables.
caseid
a character vector. “state:casenum:vnum”
state
a numeric vector. See the FARS website for details
age
a numeric vector; 998=not reported; 999=not known. Cases
with age
< 16 have been omitted
airbag
a numeric vector
injury
a numeric vector; 4 indicates death. Blanks, unknown, and “Died prior to accident” have been omitted
Restraint
a numeric vector
sex
1=male, 2=female, 9=unknown
inimpact
a numeric vector; direction of initial impact. Categories 1 to 12 describe clock positions, so that 1,11, and 12 relate to near frontal impacts; 0 is not a collision; 13: top; 14: undercarriage. 18, introduced in 2005 has been omitted, as have 404 values in additional categories for 2010. 99 denotes a missing value.
modelyr
a numeric vector
airbagAvail
a factor with levels no
yes
NA-code
airbagDeploy
a factor with levels no
yes
NA-code
D_injury
a numeric vector
D_airbagAvail
a factor with levels no
yes
NA-code
D_airbagDeploy
a factor with levels no
yes
NA-code
D_Restraint
a factor with levels no
yes
NA-code
year
year of accident
Data is for automabiles where the right passenger seat was occupied, with one observation for each such passenger. Observations for vehicles where the most harmful event was a fire or explosion or immersion or gas inhalation, or where someone fell or jumped from the vehicle, are omitted. Data are limited to vehicle body types 1 to 19,48,49,61, or 62. This excludes large trucks, pickup trucks, vans and buses. The 2009 and 2010 data does not include information on whether airbags were installed.
The papers given as references demonstrate the use of Fatal Accident Recording System data to assess the effectiveness of airbags (even differences between different types of airbags) and seatbelts. Useful results can be obtained by matching driver mortality, with and without airbags, to mortality rates for right front seat passengers in cars without passenger airbags.
http://www-fars.nhtsa.dot.gov/Main/index.aspx
https://maths-people.anu.edu.au/~johnm/nzsr/taws.html
Olson CM, Cummings P, Rivara FP. 2006. Association of first- and second-generation air bags with front occupant death in car crashes: a matched cohort study. Am J Epidemiol 164:161-169
Cummings, P; McKnight, B, 2010. Accounting for vehicle, crash, and occupant characteristics in traffic crash studies. Injury Prevention 16: 363-366
Braver, ER; Shardell, M; Teoh, ER, 2010. How have changes in air bag designs affected frontal crash mortality? Ann Epidemiol 20:499-510.
data(FARS)
data(FARS)
Data are included on variables that may be relevant to assessing airbag and seatbelt effectiveness in preventing fatal injury.
fars2007 fars2008
fars2007 fars2008
A data frame with 24179 observations on the following 24 variables.
state
a numeric vector
casenum
a numeric vector
vnum
a numeric vector
pnum
a numeric vector
lightcond
a numeric vector
numfatal
a numeric vector
age
a numeric vector
airbag
a numeric vector
injury
a numeric vector
ptype
a numeric vector
restraint
a numeric vector
seatpos
a numeric vector
sex
a numeric vector
body
a numeric vector
inimpact
A numeric vector; numbers 1 to 12 give clockface directions of initial impact. Values in these datasets are limited to 11, 12 and 1; i.e., near frontal impact
mhevent
a numeric vector
numoccs
a numeric vector
travspd
a numeric vector
modelyr
a numeric vector
Data is for automabiles where a passenger seat was occupied, with one observation for each such passenger.
http://www-fars.nhtsa.dot.gov/Main/index.aspx
https://maths-people.anu.edu.au/~johnm/nzsr/taws.html
Olson CM, Cummings P, Rivara FP. 2006. Association of first- and second-generation air bags with front occupant death in car crashes: a matched cohort study. Am J Epidemiol 164:161-169
Cummings, P; McKnight, B, 2010. Accounting for vehicle, crash, and occupant characteristics in traffic crash studies. Injury Prevention 16: 363-366
Braver, ER; Shardell, M; Teoh, ER, 2010. How have changes in air bag designs affected frontal crash mortality? Ann Epidemiol 20:499-510.
data(fars2007) str(fars2007)
data(fars2007) str(fars2007)
Safety devices may be airbags or seatbelts. For airbags, alternatives are to use ‘airbag installed’ or ‘airbag deployed’ as the criterion. Ratio of driver deaths to passenger deaths are calculated for driver with device and for driver without device, in both cases for passenger without device.
data("frontDeaths")
data("frontDeaths")
The format is: List of 3 $ airbagAvail : num [1:13, 1:2, 1:4] 1068 1120 1089 1033 940 ... ..- attr(*, "dimnames")=List of 3 .. ..$ years : chr [1:13] "1998" "1999" "2000" "2001" ... .. ..$ D_airbagAvail: chr [1:2] "no" "yes" .. ..$ injury : chr [1:4] "P_injury" "D_injury" "tot" "prop" $ airbagDeploy: num [1:13, 1:2, 1:4] 1133 1226 1196 1151 1091 ... ..- attr(*, "dimnames")=List of 3 .. ..$ years : chr [1:13] "1998" "1999" "2000" "2001" ... .. ..$ D_airbagAvail: chr [1:2] "no" "yes" .. ..$ injury : chr [1:4] "P_injury" "D_injury" "tot" "prop" $ restraint : num [1:13, 1:2, 1:4] 780 783 735 714 741 645 634 561 558 494 ... ..- attr(*, "dimnames")=List of 3 .. ..$ years : chr [1:13] "1998" "1999" "2000" "2001" ... .. ..$ D_airbagAvail: chr [1:2] "no" "yes" .. ..$ injury : chr [1:4] "P_injury" "D_injury" "tot" "prop"
See FARS
data(frontDeaths) ## maybe str(frontDeaths) ; plot(frontDeaths) ...
data(frontDeaths) ## maybe str(frontDeaths) ; plot(frontDeaths) ...
Fit model using gam()
from mgcv, then use random forest
regression with residuals. Check perfomance of this hybrid model
for predictions to newdata
, if supplied.
gamRF(formlist, yvar, data, newdata = NULL, rfVars, method = "GCV.Cp", printit = TRUE, seed = NULL)
gamRF(formlist, yvar, data, newdata = NULL, rfVars, method = "GCV.Cp", printit = TRUE, seed = NULL)
formlist |
List of rght hand sides of formulae for GAM models. |
yvar |
Character string holding y-variable name. |
data |
Data |
newdata |
Optionally, supply test data. |
rfVars |
Names of explanatory variables for the |
method |
Smoothing parameter estimation method for use of |
printit |
Should a summary of results (error rates) be printed? |
seed |
Set a seed to make result repeatable. |
A vector of test data accuracies for the hybrid models (one for each
element of formlist
), plus test error mean square and OOB error
mean square for the use of randomForest()
.
The best results are typically obtained when a relatively low degree of freedom GAM model is used. It seems advisable to use those variables for the GAM fit that seem likely to be similar in their effect irrespective of geographic location.
John Maindonald <[email protected]>
J. Li, A. D. Heap, A. Potter and J. J. Daniell. 2011. Application of Machine Learning Methods to Spatial Interpolation of Environmental Variables. Environmental Modelling and Software 26: 1647-1656. DOI: 10.1016/j.envsoft.2011.07.004.
if(length(find.package("sp", quiet=TRUE))>0){ data("meuse", package="sp") meuse <- within(meuse, {levels(soil) <- c("1","2","2") ffreq <- as.numeric(ffreq) loglead <- log(lead)} ) form <- ~ dist + elev + ffreq + soil rfVars <- c("dist", "elev", "soil", "ffreq", "x", "y") ## Select 90 out of 155 rows sub <- sample(1:nrow(meuse), 90) meuseOut <- meuse[-sub,] meuseIn <- meuse[sub,] gamRF(formlist=list("lm"=form), yvar="loglead", rfVars=rfVars, data=meuseIn, newdata=meuseOut) } ## The function is currently defined as function (formlist, yvar, data, newdata = NULL, rfVars, method = "GCV.Cp", printit = TRUE, seed = NULL) { if(!is.null(seed))set.seed(seed) errRate <- numeric(length(formlist)+2) names(errRate) <- c(names(formlist), "rfTest", "rfOOB") ytrain <- data[, yvar] xtrain <- data[, rfVars] xtest <- newdata[, rfVars] ytest = newdata[, yvar] res.rf <- randomForest(x = xtrain, y = ytrain, xtest=xtest, ytest=ytest) errRate["rfOOB"] <- mean(res.rf$mse) errRate["rfTest"] <- mean(res.rf$test$mse) GAMhat <- numeric(nrow(data)) for(nam in names(formlist)){ form <- as.formula(paste(c(yvar, paste(formlist[[nam]])), collapse=" ")) train.gam <- gam(form, data = data, method = method) res <- resid(train.gam) cvGAMms <- sum(res^2)/length(res) if (!all(rfVars %in% names(newdata))) { missNam <- rfVars[!(rfVars %in% names(newdata))] stop(paste("The following were not found in 'newdata':", paste(missNam, collapse = ", "))) } GAMtesthat <- predict(train.gam, newdata = newdata) GAMtestres <- ytest - GAMtesthat Gres.rf <- randomForest(x = xtrain, y = res, xtest = xtest, ytest = GAMtestres) errRate[nam] <- mean(Gres.rf$test$mse) } if (printit) print(round(errRate, 4)) invisible(errRate) }
if(length(find.package("sp", quiet=TRUE))>0){ data("meuse", package="sp") meuse <- within(meuse, {levels(soil) <- c("1","2","2") ffreq <- as.numeric(ffreq) loglead <- log(lead)} ) form <- ~ dist + elev + ffreq + soil rfVars <- c("dist", "elev", "soil", "ffreq", "x", "y") ## Select 90 out of 155 rows sub <- sample(1:nrow(meuse), 90) meuseOut <- meuse[-sub,] meuseIn <- meuse[sub,] gamRF(formlist=list("lm"=form), yvar="loglead", rfVars=rfVars, data=meuseIn, newdata=meuseOut) } ## The function is currently defined as function (formlist, yvar, data, newdata = NULL, rfVars, method = "GCV.Cp", printit = TRUE, seed = NULL) { if(!is.null(seed))set.seed(seed) errRate <- numeric(length(formlist)+2) names(errRate) <- c(names(formlist), "rfTest", "rfOOB") ytrain <- data[, yvar] xtrain <- data[, rfVars] xtest <- newdata[, rfVars] ytest = newdata[, yvar] res.rf <- randomForest(x = xtrain, y = ytrain, xtest=xtest, ytest=ytest) errRate["rfOOB"] <- mean(res.rf$mse) errRate["rfTest"] <- mean(res.rf$test$mse) GAMhat <- numeric(nrow(data)) for(nam in names(formlist)){ form <- as.formula(paste(c(yvar, paste(formlist[[nam]])), collapse=" ")) train.gam <- gam(form, data = data, method = method) res <- resid(train.gam) cvGAMms <- sum(res^2)/length(res) if (!all(rfVars %in% names(newdata))) { missNam <- rfVars[!(rfVars %in% names(newdata))] stop(paste("The following were not found in 'newdata':", paste(missNam, collapse = ", "))) } GAMtesthat <- predict(train.gam, newdata = newdata) GAMtestres <- ytest - GAMtesthat Gres.rf <- randomForest(x = xtrain, y = res, xtest = xtest, ytest = GAMtestres) errRate[nam] <- mean(Gres.rf$test$mse) } if (printit) print(round(errRate, 4)) invisible(errRate) }
See website for details of data attributes
german
german
A data frame with 1000 observations on the following 21 variables.
V1
a factor with levels A11
A12
A13
A14
V2
a numeric vector
V3
a factor with levels A30
A31
A32
A33
A34
V4
a factor with levels A40
A41
A410
A42
A43
A44
A45
A46
A48
A49
V5
a numeric vector
V6
a factor with levels A61
A62
A63
A64
A65
V7
a factor with levels A71
A72
A73
A74
A75
V8
a numeric vector
V9
a factor with levels A91
A92
A93
A94
V10
a factor with levels A101
A102
A103
V11
a numeric vector
V12
a factor with levels A121
A122
A123
A124
V13
a numeric vector
V14
a factor with levels A141
A142
A143
V15
a factor with levels A151
A152
A153
V16
a numeric vector
V17
a factor with levels A171
A172
A173
A174
V18
a factor with levels good
bad
V19
a factor with levels A191
A192
V20
a factor with levels A201
A202
V21
a numeric vector
700 good and 300 bad credits with 20 predictor variables. Data from 1973 to 1975. Stratified sample from actual credits with bad credits heavily oversampled. A cost matrix can be used.
http://archive.ics.uci.edu/datasets
Grömping, U. (2019). South German Credit Data: Correcting a Widely Used Data Set. Report 4/2019, Reports in Mathematics, Physics and Chemistry, Department II, Beuth University of Applied Sciences Berlin.
data(german)
data(german)
Heights, in meters, are for the lakes Erie, Michigan/Huron, Ontario and St Clair
data(greatLakesM)
data(greatLakesM)
The format is: 'data.frame': 1212 obs. of 7 variables: $ month : Factor w/ 12 levels "apr","aug","dec",..: 5 4 8 1 9 7 6 2 12 11 ... $ year : int 1918 1918 1918 1918 1918 1918 1918 1918 1918 1918 ... $ Superior : num 183 183 183 183 183 ... $ Michigan.Huron: num 177 177 177 177 177 ... $ St..Clair : num 175 175 175 175 175 ... $ Erie : num 174 174 174 174 174 ... $ Ontario : num 74.7 74.7 74.9 75.1 75.1 ...
For more details, go to the website that is the source of the data.
data(greatLakesM) mErie <- ts(greatLakesM[,'Erie'], start=1918, frequency=12) greatLakes <- aggregate(greatLakesM[,-(1:2)], by=list(greatLakesM$year), FUN=mean) names(greatLakes)[1] <- 'year' ## maybe str(greatLakesM)
data(greatLakesM) mErie <- ts(greatLakesM[,'Erie'], start=1918, frequency=12) greatLakes <- aggregate(greatLakesM[,-(1:2)], by=list(greatLakesM$year), FUN=mean) names(greatLakes)[1] <- 'year' ## maybe str(greatLakesM)
Given an lda model object, calculate training set error, leave-one-out cross-validation error, and test set error.
ldaErr(train.lda, train, test, group = "type")
ldaErr(train.lda, train, test, group = "type")
train.lda |
Fitted lda model object. |
train |
Training set data frame. |
test |
Test set data frame. |
group |
Factor that identifies groups in training data. |
Vector that holds leave-one-out, training, and test error rates
## Not run: data(spam, package='kernlab') spam[,-58] <- scale(spam[,-58]) nr <- sample(1:nrow(spam)) spam01 <- spam[nr[1:3601],] ## Use for training, spam2 <- spam[nr[3602:4601],] ## Test spam01.lda <- lda(type~., data=spam01) ldaRates <- ldaErr(train.lda=spam01.lda, train=spam01, test=spam2, group="type") ## End(Not run)
## Not run: data(spam, package='kernlab') spam[,-58] <- scale(spam[,-58]) nr <- sample(1:nrow(spam)) spam01 <- spam[nr[1:3601],] ## Use for training, spam2 <- spam[nr[3602:4601],] ## Test spam01.lda <- lda(type~., data=spam01) ldaRates <- ldaErr(train.lda=spam01.lda, train=spam01, test=spam2, group="type") ## End(Not run)
GISS (Goddard Institute for Space Studies) Land-Ocean Temperature Index (LOTI) data for the years 1880 to 2019, giving anomalies in 0.01 degrees Celsius, from the 1951 - 1980 average.
loti
loti
A data frame with 140 observations on the following 19 variables.
Year
a numeric vector
Jan
a numeric vector
Feb
a numeric vector
Mar
a numeric vector
Apr
a numeric vector
May
a numeric vector
Jun
a numeric vector
Jul
a numeric vector
Aug
a numeric vector
Sep
a numeric vector
Oct
a numeric vector
Nov
a numeric vector
Dec
a numeric vector
JtoD
Jan-Dec averages
D.N
Dec-Nov averages
DJF
Dec-Jan-Feb averages
MAM
Mar-Apr-May
JJA
Jun-Jul-Aug
SON
Sept-Oct-Nov
JtoD2011
January to December average, from data accessed in 2011
Data are the Combined Land-Surface Air and Sea-Surface Water Temperature
Anomalies (Land-Ocean Temperature Index, LOTI), in 0.01 degrees Celsius, from
https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.txt
Data in the column JtoD2011
was accessed 2011-09-06.
Also available is a CSV file, with anomalies in degrees Celsius.
GISTEMP Team, 2020: GISS Surface Temperature Analysis (GISTEMP), version 4. NASA Goddard Institute for Space Studies. Dataset accessed 2020-11-13 at https://data.giss.nasa.gov/gistemp/.
data(loti) plot(JtoD ~ Year, data=loti) ## Add 11 point moving average ma11 <- filter(loti$JtoD, rep(1,11)/11, sides=2) lines(loti$Year, ma11)
data(loti) plot(JtoD ~ Year, data=loti) ## Add 11 point moving average ma11 <- filter(loti$JtoD, rep(1,11)/11, sides=2) lines(loti$Year, ma11)
Devices may be airbags or seatbelts. For airbags, alternatives are to use “airbag installed” or “airbag deployed” as the criterion. The plot shows, for each of the specified features, the ratio of driver death rate (or other outcome, e.g., death or injury) with feature, to rate without feature, in both cases for passenger without feature.
plotFars(tabDeaths=gamclass::frontDeaths, statistics = c("airbagAvail", "airbagDeploy", "restraint"))
plotFars(tabDeaths=gamclass::frontDeaths, statistics = c("airbagAvail", "airbagDeploy", "restraint"))
tabDeaths |
List, containing (as a minimum) three-dimensional arrays
with the names specified in the argument |
statistics |
Vector of character: names of the sublists, which contain information on the deathrates |
The name injury
is used, with frontDeaths
or sideDeaths
or rearDeaths
or otherDeaths
as the first argument, to refer to
deaths. The function tabFarsDeaths
allows the option of returning an
object, suitable for using as first argument, that treats injury
as
death or serious injury.
A graphics object is returned
Note that the “airbag deployed” statistic is not a useful measure of airbag effectiveness. At its most effective, the airbag will deploy only when the accident is sufficiently serious that deployment will reduce the risk of serious injury and/or accident. The with/without deployment comparison compares, in part, serious accidents with less serious accidents.
John Maindonald
The four list elements are for four positions of initial impact.
Each list element is a 13 by 3 years
by “safety device” matrix
that gives the proportion, for that device in year, of the total over
years
data("relDeaths")
data("relDeaths")
The format is: List of 4 $ front: num [1:13, 1:3] 0.559 0.548 0.544 0.577 0.574 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:13] "1998" "1999" "2000" "2001" ... .. ..$ : chr [1:3] "airbagAvail" "airbagDeploy" "restraint" $ side : num [1:13, 1:3] 0.36 0.366 0.367 0.35 0.348 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:13] "1998" "1999" "2000" "2001" ... .. ..$ : chr [1:3] "airbagAvail" "airbagDeploy" "restraint" $ rear : num [1:13, 1:3] 0.0507 0.0558 0.0575 0.0498 0.0522 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:13] "1998" "1999" "2000" "2001" ... .. ..$ : chr [1:3] "airbagAvail" "airbagDeploy" "restraint" $ other: num [1:13, 1:3] 0.0312 0.0304 0.0313 0.0237 0.0254 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:13] "1998" "1999" "2000" "2001" ... .. ..$ : chr [1:3] "airbagAvail" "airbagDeploy" "restraint"
data(relDeaths) ## maybe str(relDeaths) ; plot(relDeaths) ...
data(relDeaths) ## maybe str(relDeaths) ; plot(relDeaths) ...
This function adapts random forests to work (albeit clumsily and inefficiently) with clustered categorical outcome data. For example, there may be multiple observations on individuals (clusters). Predictions are made fof the OOB (out of bag) clusters
RFcluster(formula, id, data, nfold = 15, ntree=500, progress=TRUE, printit = TRUE, seed = 29)
RFcluster(formula, id, data, nfold = 15, ntree=500, progress=TRUE, printit = TRUE, seed = 29)
formula |
Model formula |
id |
numeric, identifies clusters |
data |
data frame that supplies the data |
nfold |
numeric, number of folds |
ntree |
numeric, number of trees (number of bootstrap samples) |
progress |
Print information on progress of calculations |
printit |
Print summary information on accuracy |
seed |
Set seed, if required, so that results are exactly reproducible |
Bootstrap samples are taken of observations in the in-bag clusters. Predictions are made for all observations in the OOB clusters.
class |
Predicted values from cross-validation |
OOBaccuracy |
Cross-validation estimate of accuracy |
confusion |
Confusion matrix |
John Maindonald
https://maths-people.anu.edu.au/~johnm/nzsr/taws.html
## Not run: library(mlbench) library(randomForest) data(Vowel) RFcluster(formula=Class ~., id = V1, data = Vowel, nfold = 15, ntree=500, progress=TRUE, printit = TRUE, seed = 29) ## End(Not run)
## Not run: library(mlbench) library(randomForest) data(Vowel) RFcluster(formula=Class ~., id = V1, data = Vowel, nfold = 15, ntree=500, progress=TRUE, printit = TRUE, seed = 29) ## End(Not run)
Given an randomForest model object, calculate training set error, out-of-bag (OOB) error, and test set error.
rfErr(train.rf, train, test, group = "type")
rfErr(train.rf, train, test, group = "type")
train.rf |
Fitted randomForest model object. |
train |
Training set data frame. |
test |
Test set data frame. |
group |
Factor that identifies groups |
Vector that holds training set error, out-of-bag (OOB) error, and test set error rates.
## Not run: data(spam, package='kernlab') spam[,-58] <- scale(spam[,-58]) nr <- sample(1:nrow(spam)) spam01 <- spam[nr[1:3601],] ## Use for training, spam2 <- spam[nr[3602:4601],] ## Test spam01.rf <- randomForest(type ~ ., data=spam01) rfRates <- rfErr(train.rf=spam01.rf, train=spam01, test=spam2, group='type') ## End(Not run)
## Not run: data(spam, package='kernlab') spam[,-58] <- scale(spam[,-58]) nr <- sample(1:nrow(spam)) spam01 <- spam[nr[1:3601],] ## Use for training, spam2 <- spam[nr[3602:4601],] ## Test spam01.rf <- randomForest(type ~ ., data=spam01) rfRates <- rfErr(train.rf=spam01.rf, train=spam01, test=spam2, group='type') ## End(Not run)
Given an rpart model object, calculate training set error, 10-fold cross-validation error, and test set error.
rpartErr(train.rp, train, test, group = "type")
rpartErr(train.rp, train, test, group = "type")
train.rp |
Fitted lda model object. |
train |
Training set data frame. |
test |
Test set data frame. |
group |
Factor that identifies groups |
Vector that holds training set error, 10-fold cross-validation error, and test set error rates.
## Not run: data(spam, package='kernlab') spam[,-58] <- scale(spam[,-58]) nr <- sample(1:nrow(spam)) spam01 <- spam[nr[1:3601],] ## Use for training, ## if holdout not needed spam2 <- spam[nr[3602:4601],] ## Test spam01.rp <- rpart(type~., data=spam01, cp=0.0001) rpRates <- rpartErr(train.rp=spam01.rp, train=spam01, test=spam2, group='type') ## End(Not run)
## Not run: data(spam, package='kernlab') spam[,-58] <- scale(spam[,-58]) nr <- sample(1:nrow(spam)) spam01 <- spam[nr[1:3601],] ## Use for training, ## if holdout not needed spam2 <- spam[nr[3602:4601],] ## Test spam01.rp <- rpart(type~., data=spam01, cp=0.0001) rpRates <- rpartErr(train.rp=spam01.rp, train=spam01, test=spam2, group='type') ## End(Not run)
Derive parameter estimates and standard errors by simulation, or by bootstrap resampling.
simreg(formula, data, nsim = 1000) bootreg(formula, data, nboot = 1000)
simreg(formula, data, nsim = 1000) bootreg(formula, data, nboot = 1000)
formula |
Model formula |
data |
Data frame from which names in formula can be taken |
nsim |
Number of repeats of the simulation ( |
nboot |
Number of bootstrap resamples ( |
Matrix of coefficients from repeated simulations, or from bootstrap
resamples. For simreg
there is one row for each repeat
of the simulation. For bootreg
there is one row for each
resample.
Note that bootreg
uses the simplest
possible form of bootstrap. For any except very large datasets,
standard errors may be substantial under-estimates
John Maindonald
https://maths-people.anu.edu.au/~johnm/nzsr/taws.html
xy <- data.frame(x=rnorm(100), y=rnorm(100)) simcoef <- simreg(formula = y~x, data = xy, nsim = 100) bootcoef <- bootreg(formula = y~x, data = xy, nboot = 100)
xy <- data.frame(x=rnorm(100), y=rnorm(100)) simcoef <- simreg(formula = y~x, data = xy, nsim = 100) bootcoef <- bootreg(formula = y~x, data = xy, nboot = 100)
Fars
dataset.
Safety devices may be airbags or seatbelts. For airbags, alternatives are to use ‘airbag installed’ or ‘airbag deployed’ as the criterion. Ratio of driver deaths to passenger deaths are calculated for driver with device and for driver without device, in both cases for passenger without device, and the ratio of these ratios calculated.
tabFarsDead(dset=gamclass::FARS, fatal = 4, restrict=expression(age>=16&age<998&inimpact%in%c(11,12,1)), statistics = c("airbagAvail", "airbagDeploy", "Restraint"))
tabFarsDead(dset=gamclass::FARS, fatal = 4, restrict=expression(age>=16&age<998&inimpact%in%c(11,12,1)), statistics = c("airbagAvail", "airbagDeploy", "Restraint"))
dset |
data frame containing data |
fatal |
numeric: 4 for fatal injury, or |
statistics |
Vector of character: ratio of rates variables that will be tabulated |
restrict |
Expression restricting values as specified |
Note that the ‘airbag deployed’ statistic is not a useful measure of airbag effectiveness. At its most effective, the airbag will deploy only when the accident is sufficiently serious that deployment will reduce the risk of serious injury and/or accident. The with/without deployment comparison compares, in part, serious accidents with less serious accidents.
A list with elements
airbagAvail |
a multiway table with margins |
airbagDeploy |
a multiway table with margins |
Restraint |
a multiway table with margins |
John Maindonald
tabDeaths <- tabFarsDead()
tabDeaths <- tabFarsDead()