I'm having difficulty getting the CoreNLP system to correctly find where one sentence ends and another begins in a corpus of poetry.
The reasons why it's struggling:
This is a particularly tricky one
(The system thought the first sentence ended at the "." at the beginning of the second stanza)
Given the lack of capitals and punctuation to go on, I thought I would try -tokenizeNLs to see if that improved things, but it went overboard and cut off any sentence that ran between blank lines (of which there are a few).
These sentences often end at the end of a line, but not always. What would be slick is if the system could treat each line ending as a candidate sentence break and weigh how likely it is to be a real endpoint, but I don't know how I would implement that.
Is there an elegant way to do this? Or an alternative?
Thanks in advance!
(expected sentence output here)
This would be a neat project! I don't think anyone is working on it in our group at the moment, but I see no reason why we wouldn't incorporate a patch if you make one. The biggest challenge I see is that our sentence splitter is currently entirely rule-based, and therefore these sorts of "soft" decisions are relatively hard to incorporate.
A possible solution for your case could be to use language model "end of sentence" probabilities (three options, in no particular order: https://kheafield.com/code/kenlm/, https://code.google.com/p/berkeleylm/, http://www.speech.sri.com/projects/srilm/). Any line end whose end-of-sentence probability is sufficiently high could then be split off as the start of a new sentence.
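To make the idea concrete, here is a rough sketch in Python of what "split at line ends with a high end-of-sentence probability" might look like. It is only an illustration. Instead of a real language model, it uses a toy estimate of P(end of sentence | last word on the line), counted from a handful of already-split sentences with add-alpha smoothing. The function names (`train_eos_model`, `eos_probability`, `split_on_line_ends`), the smoothing and the threshold are all made up for this example, and none of it is CoreNLP or KenLM API. In practice you would swap `eos_probability` for a score from one of the toolkits above.

```python
from collections import Counter


def train_eos_model(sentences):
    """Count how often each word occurs at all, and how often it ends a sentence.

    `sentences` is a list of strings that are already split correctly
    (hypothetical training data, e.g. a few hand-annotated poems).
    """
    total = Counter()
    final = Counter()
    for sent in sentences:
        words = sent.lower().split()
        if not words:
            continue
        total.update(words)
        final[words[-1]] += 1
    return total, final


def eos_probability(word, total, final, alpha=1.0):
    """Add-alpha smoothed estimate of P(end of sentence | word).

    A toy stand-in for the end-of-sentence probability a real
    language model would provide.
    """
    word = word.lower()
    return (final[word] + alpha) / (total[word] + 2 * alpha)


def split_on_line_ends(lines, total, final, threshold=0.5):
    """Treat each line end as a candidate break; split only if the score is high."""
    sentences, current = [], []
    for line in lines:
        words = line.split()
        if not words:
            continue  # a blank line (stanza break) is not a forced sentence break
        current.extend(words)
        if eos_probability(words[-1], total, final) >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))  # keep any trailing material
    return sentences
```

Only line ends are ever considered as break points here, so a sentence that ends mid-line would still need the existing rule-based splitter. The threshold would have to be tuned on a few poems you have split by hand.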