I have a data frame that looks a bit like this:
df <- structure(list(group = structure(c(1L, 1L, 2L, 2L), .Label = c("1",
"2"), class = "factor"), text = structure(c(2L, 1L, 4L, 3L), .Label = c("hello hi four five",
"hi hello one two three", "one three four five", "one two three"
), class = "factor")), .Names = c("group", "text"), row.names = c(NA,
-4L), class = "data.frame")
  group                   text
1     1 hi hello one two three
2     1     hello hi four five
3     2          one two three
4     2    one three four five
Now I want to summarise this data frame by retrieving the top (maximum 10) bigrams per group.
Something like this (make_bigrams is an imaginary function):
df <- group_by(df, group)
The result should be something like this:
1 1 hi_hello, hi_one, hi_two_etc.
2 2 one_three, one_two, etc.
I tried functions like the tokenizer of RWeka, but none did what I intended. Does anyone have an idea? Many thanks in advance!
Here is something you could do for bigrams (i.e. "contiguous sub-sequences of length n", where n = 2):
library(tm)     # for corpus and dtm; loads NLP
library(dplyr)
library(tidyr)

df$text <- as.character(df$text)
## numbering documents (as character, to match the column names of the matrix)
df$doc <- as.character(seq_len(nrow(df)))
## VCorpus rather than Corpus: in recent tm versions Corpus() can return a
## SimpleCorpus, which ignores custom tokenizers
corpus <- VCorpus(VectorSource(df$text))

# function source: tm.r-forge.r-project.org/faq.html#Bigrams
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = "_"), use.names = FALSE)

## create a Term Document Matrix of bigrams
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

## Let's find the most frequent for each group
as.data.frame(as.matrix(tdm)) %>%                     # transform to df
  tibble::rownames_to_column("rowname") %>%           # we need the words
  gather(doc, value, -rowname) %>%                    # convert to long form
  filter(value != 0) %>%                              # remove bigrams not in document
  left_join(df[, c("doc", "group")], by = "doc") %>%  # match doc number with group number
  group_by(group, rowname) %>%                        # grouping
  summarise(n = sum(value)) %>%                       # find out the number of bigrams by group
  arrange(desc(n)) %>%                                # sort the data by most frequently found bigrams
  slice(1:10) %>%                                     # select only the 10 most frequent in each group
  summarise(most_frequent_bigrams = paste(rowname, collapse = ", "))  # format this to a single string
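If you would rather avoid the tm dependency, the same idea can be sketched in base R. This is only a rough alternative, not the approach above: `bigrams_of` and `make_bigrams` are hypothetical helper names (the latter borrowed from the imaginary function in the question), tokens are assumed to be separated by whitespace, and ties between equally frequent bigrams are returned in no particular guaranteed order.

```r
# Hypothetical helper: contiguous bigrams of a single string, joined with "_".
# Assumes words are separated by whitespace; text is lower-cased first.
bigrams_of <- function(x) {
  w <- strsplit(tolower(x), "\\s+")[[1]]
  w <- w[nzchar(w)]                       # drop empty tokens
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1), sep = "_")
}

# Hypothetical helper: the `n` most frequent bigrams over several texts,
# collapsed into one comma-separated string.
make_bigrams <- function(texts, n = 10) {
  counts <- table(unlist(lapply(as.character(texts), bigrams_of)))
  counts <- sort(counts, decreasing = TRUE)
  paste(names(head(counts, n)), collapse = ", ")
}

# Same example data as in the question, with text stored as character
df <- data.frame(
  group = factor(c(1, 1, 2, 2)),
  text = c("hi hello one two three", "hello hi four five",
           "one two three", "one three four five"),
  stringsAsFactors = FALSE
)

# One row per group, holding that group's top bigrams
result <- aggregate(text ~ group, data = df, FUN = make_bigrams)
names(result)[2] <- "most_frequent_bigrams"
print(result)
```

Because `make_bigrams` takes a character vector and returns a single string, it should also work inside `summarise()` after `group_by(df, group)`, which is the shape the question asked for.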