
r - how to summarise (dplyr) columns by retrieving top 10 bigrams (ngrams) per group?

Question:

I have a data frame that looks a bit like this:

df <- structure(list(group = structure(c(1L, 1L, 2L, 2L), .Label = c("1",
"2"), class = "factor"), text = structure(c(2L, 1L, 4L, 3L), .Label = c("hello hi four five",
"hi hello one two three", "one three four five", "one two three"
), class = "factor")), .Names = c("group", "text"), row.names = c(NA,
-4L), class = "data.frame")

df
  group                   text
1     1 hi hello one two three
2     1     hello hi four five
3     2          one two three
4     2    one three four five

Now I want to summarise this data frame by retrieving the top (at most 10) bigrams per group.

Something like this (make_bigrams is an imaginary function):

df <- group_by(df, group)
summarise(df, make_bigrams(text))

The result should be something like this:

  group text
1     1 hi_hello, hi_one, hi_two, etc.
2     2 one_three, one_two, etc.

I tried functions like the RWeka tokenizer, but none did what I intended. Does anyone have an idea? Many thanks in advance!

Answer:

Here is something you could do for bigrams (i.e. "contiguous sub-sequences of length n" where n = 2, according to ?NLP::ngrams).

library(tm)     # for corpus and tdm; loads NLP
library(dplyr)
library(tidyr)

df$text <- as.character(df$text)

## number the documents (as character, to match the tdm column names)
df$doc <- as.character(seq_len(nrow(df)))

## VCorpus rather than Corpus: recent tm versions build a SimpleCorpus from
## Corpus(VectorSource(...)), which ignores custom tokenizers
corpus <- VCorpus(VectorSource(df$text))
# function source: tm.r-forge.r-project.org/faq.html#Bigrams
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = "_"), use.names = FALSE)

## create a Term Document Matrix of bigrams
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

## Let's find the most frequent for each group
as.data.frame(as.matrix(tdm)) %>%                    # transform to df
  tibble::rownames_to_column("rowname") %>%          # we need the bigrams as a column
  gather(doc, value, -rowname) %>%                   # convert to long form
  filter(value != 0) %>%                             # remove bigrams not in document
  left_join(df[, c("doc", "group")], by = "doc") %>% # match doc number with group number
  group_by(group, rowname) %>%                       # grouping
  summarise(n = sum(value)) %>%                      # count each bigram within its group
  arrange(desc(n)) %>%                               # sort by most frequently found bigrams
  slice(1:10) %>%                                    # keep only the 10 most frequent in each group
  summarize(most_frequent_bigrams = paste(rowname, collapse = ", ")) # format as a single string
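
For comparison, here is a minimal base R sketch of the same idea without tm, in case the corpus machinery is more than you need. It is only an illustration, not the tm approach above: make_bigrams and top_bigrams are hypothetical helpers (the first borrows the name of the imaginary function from the question), and it assumes words are separated by spaces and that "bigram" means a contiguous pair, as in ?NLP::ngrams.

```r
# Hypothetical helper (not from tm or NLP): contiguous bigrams of one string.
make_bigrams <- function(x) {
  w <- strsplit(x, " +")[[1]]               # split on runs of spaces
  if (length(w) < 2) return(character(0))   # a single word has no bigram
  paste(head(w, -1), tail(w, -1), sep = "_")
}

# Hypothetical helper: the (at most) n most frequent bigrams of a character
# vector, collapsed into one comma-separated string.
top_bigrams <- function(text, n = 10) {
  counts <- sort(table(unlist(lapply(text, make_bigrams))), decreasing = TRUE)
  paste(names(head(counts, n)), collapse = ", ")
}

# Same data as in the question, with text as character rather than factor.
df <- data.frame(
  group = c("1", "1", "2", "2"),
  text  = c("hi hello one two three", "hello hi four five",
            "one two three", "one three four five"),
  stringsAsFactors = FALSE
)

# One string of top bigrams per group, as a named character vector.
res <- sapply(split(df$text, df$group), top_bigrams)
res
```

Because top_bigrams returns a single string, it should also fit the form asked for in the question, i.e. summarise(group_by(df, group), text = top_bigrams(text)), provided text is character rather than factor. In this toy data every bigram occurs only once per group, so the order within each string reflects tie-breaking rather than a meaningful ranking.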