I have a large list of sentences that were parsed using the Stanford parser. For example, the sentence "Now you can be entertained" has the following tree:
(ADVP (RB Now))
(NP (PRP you))
(VP (MD can)
(VP (VB be)
(VP (VBN entertained))))
I am using the set of sentence trees to induce a grammar using nltk:
# ... for each sentence tree t, add its productions to allProductions
allProductions += t.productions()
# Induce the grammar
S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)
Now I would like to use grammar to generate new, random sentences. My hope is that, since the grammar was learned from a specific set of input examples, the generated sentences will be semantically similar to those examples. Can I do this in nltk?
If I can't use nltk to do this, do any other tools exist that can take the (possibly reformatted) grammar and generate sentences?
In NLTK 2.0 you can use nltk.parse.generate to generate all possible sentences for a given grammar.
This code defines a function which should generate a single sentence based on the production rules in a (P)CFG.
# This example uses choice to choose from possible expansions
from random import choice
# This function is based on _generate_all() in nltk.parse.generate
from nltk.grammar import Nonterminal

def generate_sample(grammar, items=None):
    # Start from the grammar's start symbol unless a sequence of symbols is given
    if items is None:
        items = [grammar.start()]
    frags = []
    for item in items:
        if isinstance(item, Nonterminal):
            # This is where we need to make our changes:
            # expand one randomly chosen production instead of all of them
            chosen_expansion = choice(grammar.productions(lhs=item))
            frags.extend(generate_sample(grammar, chosen_expansion.rhs()))
        else:
            # Terminals are emitted as-is
            frags.append(item)
    return frags
To make use of the weights in your PCFG, you'll want to use a better sampling method than choice(), which implicitly assumes all expansions of the current node are equiprobable.
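One way to respect the weights is to draw each expansion in proportion to its rule probability. The following is only a minimal sketch, not NLTK's own method: the names sample_sentence, rng and max_depth are made up for illustration, and it assumes the grammar object behaves like an nltk PCFG (start(), productions(lhs=...), and productions exposing rhs() and prob()) with terminals stored as plain strings.

```python
import random

def sample_sentence(grammar, symbol=None, rng=random, max_depth=50):
    """Sample one sentence (a list of tokens) from a PCFG, top-down.

    Assumption: `grammar` offers start() and productions(lhs=...), and each
    production offers rhs() and prob(), as an nltk PCFG does.
    Assumption: terminals are plain strings; anything else is a nonterminal.
    """
    if symbol is None:
        symbol = grammar.start()
    if isinstance(symbol, str):
        # Terminal: emit the word itself
        return [symbol]
    if max_depth <= 0:
        # Guard against recursive rules that keep expanding
        raise RecursionError("expansion exceeded max_depth")
    productions = grammar.productions(lhs=symbol)
    weights = [p.prob() for p in productions]
    # Weighted draw: each rule is picked in proportion to its probability
    chosen = rng.choices(productions, weights=weights, k=1)[0]
    tokens = []
    for child in chosen.rhs():
        tokens.extend(sample_sentence(grammar, child, rng, max_depth - 1))
    return tokens
```

With the grammar induced in the question, a call along the lines of " ".join(sample_sentence(grammar)) would give one random sentence. The max_depth guard is a design choice for grammars with recursive rules (such as VP expanding to another VP), where an unlucky run could otherwise recurse for a long time.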
First of all, if you generate random sentences, they may be grammatically correct, but they will probably lose their sense.
(It sounds to me a bit like what those MIT students did with their SCIgen program, which auto-generates scientific papers. Very interesting, btw.)
Anyway, I have never done it myself, but it seems possible with nltk.bigrams; you may want to have a look there, under Generating Random Text with Bigrams.
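To give a rough idea of the bigram approach, here is a small sketch using only the standard library. It is not the code from the NLTK book: build_bigram_model and generate_text are hypothetical names, the token list is a toy example, and the pairs produced by zip(tokens, tokens[1:]) are the same consecutive pairs that nltk.bigrams(tokens) would yield.

```python
import random
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Map each word to a Counter of the words observed right after it."""
    model = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        model[first][second] += 1
    return model

def generate_text(model, seed_word, length=10, rng=random):
    """Random walk over the bigram counts, starting from seed_word."""
    words = [seed_word]
    for _ in range(length - 1):
        followers = model.get(words[-1])
        if not followers:
            # Dead end: this word was never seen with a successor
            break
        candidates = list(followers)
        counts = [followers[w] for w in candidates]
        # Pick the next word in proportion to how often it followed the last one
        words.append(rng.choices(candidates, weights=counts, k=1)[0])
    return words

# Toy corpus; in practice the tokens would come from your own sentences
tokens = "now you can be entertained and now you can be happy".split()
model = build_bigram_model(tokens)
print(" ".join(generate_text(model, "now", length=6)))
```

Text generated this way only reflects which words tend to follow each other, so it ignores the tree structure that the induced grammar captures.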
You can also generate all subtrees of a current tree, though I'm not sure that is what you want either.
With an nltk Text object you can call 'generate()' on it, which will "Print random text, generated using a trigram language model." See http://nltk.org/_modules/nltk/text.html