How does one express a linear model where observations can belong to multiple categories and the number of categories is large?
For example, using time dummies as the categories, here is a problem that is easy to set up since the number of categories (time periods) is small and known:
# Illustrative 0/1 rows: three observations, two day dummies
tmp <- "day 1, day 2
0,1
1,0
1,1"
periods <- read.csv(text = tmp)
y <- rnorm(3)
print(lm(y ~ day.1 + day.2 + 0, data=periods))
Now suppose that instead of two days there were 100. Would I need to create a formula like the following?
y ~ day.1 + day.2 + ... + day.100 + 0
Presumably such a formula would have to be created programmatically. This seems inelegant and un-R-like.
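For reference, building such a formula programmatically is short in base R. The following is a minimal sketch, assuming the dummy columns follow the `day.1`, `day.2`, ... naming pattern used above (the names `day.names` and `f` are illustrative). It uses `reformulate()` from the stats package:

```r
# Hypothetical column names following the day.1, day.2 pattern above
day.names <- paste0("day.", 1:100)

# reformulate() builds a formula from a character vector of term labels;
# intercept = FALSE drops the intercept, like "+ 0" in the hand-written formula
f <- reformulate(day.names, response = "y", intercept = FALSE)
```

The resulting `f` can be passed to `lm()` like any hand-written formula, provided the data frame contains columns with those names.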
What is the right R way to tackle this? For example, aside from the formula problem, is there a better way to create the dummies than creating a matrix of 1s and 0s (as I did above)? For the sake of concreteness, say that the actual data consist, for each observation, of a start and end date (so that `tmp` would contain a 1 in each column between start and end).
Based on the answer of @jlhoward, here is a larger example:
num.observations <- 1000
# Manually create 100 columns of dummies; data.frame() names them X1, ..., X100
periods <- data.frame(1*matrix(runif(num.observations*100) > 0.5, nrow = num.observations))
y <- rnorm(num.observations)
print(summary(lm(y ~ ., data = periods)))
It illustrates the manual creation of a data frame of dummies (1s and 0s). I would be interested in learning whether there is a more R-like way of dealing with this "multiple dummies per observation" issue.
You can use the `.` notation to include all variables other than the response in a formula, and `-1` to remove the intercept. Also, put everything in your data frame; don't make `y` a separate vector.
set.seed(1) # for reproducibility
df <- data.frame(y=rnorm(3), read.csv(text=tmp))
fit.1 <- lm(y ~ day.1 + day.2 + 0, df)
fit.2 <- lm(y ~ -1 + ., df)
identical(coef(fit.1), coef(fit.2))
# [1] TRUE
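For the start/end-date data described in the question, the dummies do not have to be typed out by hand. The following is a minimal sketch, assuming each observation carries an integer start and end period. The names `num.periods`, `start`, `end`, and `X` are illustrative and the data are simulated, not taken from the question. It builds the 0/1 matrix with `outer()` and passes the matrix itself as a single term in the formula:

```r
set.seed(1)  # for reproducibility
num.observations <- 1000
num.periods <- 100

# Hypothetical data: each observation is active from period start to period end
start <- sample(num.periods, num.observations, replace = TRUE)
end   <- pmin(start + sample(0:20, num.observations, replace = TRUE), num.periods)

# X[i, j] is 1 when period j lies within [start[i], end[i]], else 0
X <- 1 * (outer(start, 1:num.periods, "<=") & outer(end, 1:num.periods, ">="))
colnames(X) <- paste0("day.", 1:num.periods)

y <- rnorm(num.observations)

# A matrix on the right-hand side contributes one coefficient per column
fit <- lm(y ~ X - 1)
```

With a matrix term, the coefficient names are the term name followed by the column name (here `Xday.1`, `Xday.2`, and so on), and the formula stays the same length however many periods there are.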