Getting Started with Keyword Extraction


Recently, I have surveyed some keyword extraction tools, papers, and documents, and I am recording them here as a starting point for keyword extraction. According to Wikipedia, keyword extraction is defined like this:

Keyword extraction is tasked with the automatic identification of terms that best describe the subject of a document.

Key phrases, key terms, key segments or just keywords are the terminology used for defining the terms that represent the most relevant information contained in the document. Although the terminology is different, the function is the same: characterization of the topic discussed in a document. The keyword extraction task is an important problem in Text Mining, Information Retrieval and Natural Language Processing.

1. RAKE (A Python implementation of the Rapid Automatic Keyword Extraction)

Starting with RAKE, a Python implementation of Rapid Automatic Keyword Extraction, I followed the document "NLP keyword extraction tutorial with RAKE and Maui". As the document says:

A typical keyword extraction algorithm has three main components:

Candidate selection: Here, we extract all possible words, phrases, terms or concepts (depending on the task) that can potentially be keywords.

Properties calculation: For each candidate, we need to calculate properties that indicate that it may be a keyword. For example, a candidate appearing in the title of a book is a likely keyword.

Scoring and selecting keywords: All candidates can be scored by either combining the properties into a formula, or using a machine learning technique to determine probability of a candidate being a keyword. A score or probability threshold, or a limit on the number of keywords, is then used to select the final set of keywords.
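To make these three steps concrete, here is a minimal sketch of such a pipeline. The frequency-based scoring is a deliberately naive stand-in (RAKE's actual score uses word degree and frequency), and the helper names extract_candidates, score_candidates and top_keywords are my own illustration, not any library's API:

import re
from collections import Counter

def extract_candidates(text, stopwords):
    # Candidate selection: split on stopwords and punctuation, keeping
    # contiguous runs of content words as candidate phrases.
    words = re.split(r"[^a-zA-Z]+", text.lower())
    candidates, phrase = [], []
    for w in words:
        if w and w not in stopwords:
            phrase.append(w)
        elif phrase:
            candidates.append(" ".join(phrase))
            phrase = []
    if phrase:
        candidates.append(" ".join(phrase))
    return candidates

def score_candidates(candidates):
    # Properties calculation: here the only property is frequency.
    return Counter(candidates)

def top_keywords(text, stopwords, n=5):
    # Scoring and selection: rank candidates and keep the top n.
    scores = score_candidates(extract_candidates(text, stopwords))
    return [c for c, _ in scores.most_common(n)]

print(top_keywords("the cat sat on the mat with the cat",
                   stopwords={"the", "on", "with"}))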

RAKE follows these three steps strictly and has a good design structure for keyword extraction. Following the document's example Rake tutorial, I tested RAKE in my Mac OS environment step by step:

git clone https://github.com/zelandiya/RAKE-tutorial
cd RAKE-tutorial/

Then launch IPython in the RAKE-tutorial directory and test it:

ipython

Python 2.7.6 (default, Jun 3 2014, 07:43:23)
Type "copyright", "credits" or "license" for more information.

IPython 3.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import rake

In [2]: import operator

In [3]: rake_object = rake.Rake("SmartStoplist.txt", 3, 3, 1)

In [4]: text = "Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models. In this course you will study mathematical and computational models of language, and the application of these models to key problems in natural language processing. The course has a focus on machine learning methods, which are widely used in modern NLP systems: we will cover formalisms such as hidden Markov models, probabilistic context-free grammars, log-linear models, and statistical models for machine translation. The curriculum closely follows a course currently taught by Professor Collins at Columbia University, and previously taught at MIT."

In [5]: keywords = rake_object.run(text)

In [6]: print "keywords: ", keywords
keywords: [('nlp include automatic', 8.25), ('transform unstructured text', 8.0), ('hidden markov models', 8.0), ('structure formal models', 8.0), ('natural language phenomena', 7.916666666666666), ('natural language processing', 7.916666666666666), ('modern nlp systems', 7.75), ('machine learning methods', 7.75), ('natural language', 4.916666666666666), ('dialogue systems', 4.5), ('nlp technologies', 4.25), ('study mathematical', 4.0), ('information extraction', 4.0), ('electronic form', 4.0), ('vast amount', 4.0), ('speech data', 4.0), ('scientific viewpoint', 4.0), ('columbia university', 4.0), ('free grammars', 4.0), ('cover formalisms', 4.0), ('dramatic impact', 4.0), ('design algorithms', 4.0), ('flexible ways', 4.0), ('key problems', 4.0), ('linguistic data', 4.0), ('probabilistic context', 4.0), ('people access', 4.0), ('linear models', 4.0), ('curriculum closely', 4.0), ('professor collins', 4.0), ('statistical models', 4.0), ('computational models', 4.0), ('people interact', 3.666666666666667), ('previously taught', 3.5), ('application areas', 3.333333333333333), ('machine translation', 3.25), ('nlp', 2.25), ('language', 2.1666666666666665), ('text', 2.0), ('models', 2.0), ('machine', 1.75), ('interact', 1.6666666666666667), ('taught', 1.5), ('translation', 1.5), ('application', 1.3333333333333333), ('focus', 1.0), ('human', 1.0), ('goal', 1.0), ('structured', 1.0), ('languages', 1.0), ('widely', 1.0), ('mit', 1.0), ('log', 1.0), ('representations', 1.0), ('database', 1.0), ('browsed', 1.0), ('computers', 1.0), ('deals', 1.0), ('searched', 1.0), ('implement', 1.0)]

Here I set the rake_object with the following parameters for my initial use:

rake_object = rake.Rake("SmartStoplist.txt", 3, 3, 1)

Each word has at least 3 characters

Each phrase has at most 3 words

Each keyword appears in the text at least 1 time

The test text is from the Coursera Natural Language Processing course introduction, and the keywords from the rake_object are shown in the session output above.


This result does not look great for this document, so I modified the parameters with the following settings and got another result:

# Each keyword appears in the text at least 2 times
In [8]: rake_object = rake.Rake("SmartStoplist.txt", 3, 3, 2)

In [9]: keywords = rake_object.run(text)

In [10]: print "keywords: ", keywords
keywords: [('natural language processing', 7.916666666666666), ('statistical models', 4.0), ('computational models', 4.0), ('people interact', 3.666666666666667), ('language', 2.1666666666666665), ('models', 2.0), ('machine', 1.75), ('application', 1.3333333333333333)]

The key point for RAKE is the parameter setting, and RAKE provides a method to select proper parameters based on training data. As the document summarizes, RAKE is very easy to use for getting started with keyword extraction, but it seems to lack a few things:

To summarize, RAKE is a simple keyword extraction library which focuses on finding multi-word phrases containing frequent words. Its strengths are its simplicity and the ease of use, whereas its weaknesses are its limited accuracy, the parameter configuration requirement, and the fact that it throws away many valid phrases and doesn’t normalize candidates.
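The last weakness, the lack of candidate normalization, is easy to patch on top of RAKE's output. A small sketch, assuming NLTK's PorterStemmer is available, that merges candidates sharing the same stemmed form and keeps the best-scoring surface form (normalize_keywords is a hypothetical helper of mine, not part of RAKE):

from nltk.stem.porter import PorterStemmer

def normalize_keywords(keywords):
    # keywords: (phrase, score) pairs as returned by rake_object.run()
    stemmer = PorterStemmer()
    best = {}
    for phrase, score in keywords:
        key = " ".join(stemmer.stem(w) for w in phrase.split())
        # Keep only the highest-scoring surface form per stemmed phrase,
        # so e.g. "statistical models" and "statistical model" merge.
        if key not in best or score > best[key][1]:
            best[key] = (phrase, score)
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)

normalized = normalize_keywords(keywords)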

Related Paper: Automatic keyword extraction from individual documents

2. Implementing the RAKE Algorithm with NLTK

This article implements the RAKE keyword extraction algorithm based on NLTK, for example using NLTK's sent_tokenize method to replace the original implementation in RAKE, and it gets the same results as RAKE.
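The swap itself is small; a sketch of replacing RAKE's regex-based sentence splitter with NLTK's pre-trained Punkt tokenizer (the function name split_sentences mirrors the helper in the RAKE tutorial code, which is my assumption about the article's exact approach):

import nltk
# nltk.download('punkt')  # needed once for the Punkt sentence tokenizer

def split_sentences(text):
    # Drop-in replacement for RAKE's regex-based sentence splitting.
    return nltk.sent_tokenize(text)

print(split_sentences("NLP deals with text. It also deals with speech data."))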

3. Intro to Automatic Keyphrase Extraction by Burton DeWilde

I strongly recommend this document to anyone getting started with keyword or keyphrase extraction. It introduces keyword extraction step by step, dividing it into candidate identification and keyphrase selection with unsupervised and supervised methods, with Python code examples. Running the same test code with a helper corpus from Coursera, I got these top-5 keywords from the score_keyphrases_by_tfidf method:

nlp 0.403572219961

way people interact 0.269048146641

models 0.169318964644

application areas 0.13452407332

application of computational models 0.13452407332
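The idea behind score_keyphrases_by_tfidf is simple enough to approximate in a few lines. A rough sketch using scikit-learn's TfidfVectorizer rather than the post's gensim-based, noun-phrase-aware code, so treat it as the gist only:

from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms_by_tfidf(docs, doc_index, n=5):
    # Fit tf-idf over the whole corpus, then rank one document's terms.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    tfidf = vectorizer.fit_transform(docs)
    vocab = vectorizer.vocabulary_        # term -> column index
    terms = sorted(vocab, key=vocab.get)  # column index -> term
    weights = tfidf[doc_index].toarray().ravel()
    ranked = sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)
    return ranked[:n]

# docs would be the Coursera course descriptions, with the NLP text among them:
# print(top_terms_by_tfidf(docs, doc_index=0))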

And this is the result of the TextRank method, score_keyphrases_by_textrank:

[('models', 0.0619816545693813), ('nlp', 0.04454783455509914), ('language', 0.0334375900800485), ('machine', 0.029867774676152762), ('course', 0.026254400322149735), ('application', 0.024863645805797824)]
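score_keyphrases_by_textrank builds a word graph and ranks the nodes with PageRank. A compressed sketch of the same idea using networkx and a plain co-occurrence window, reusing the course text from the RAKE session above; the post's real implementation also filters candidates by part-of-speech tags:

import itertools
import networkx as nx

def textrank_words(words, window=2, n=6):
    # Words are nodes; co-occurrence within the window adds an edge.
    graph = nx.Graph()
    for i in range(len(words) - window + 1):
        for u, v in itertools.combinations(words[i:i + window], 2):
            if u != v:
                graph.add_edge(u, v)
    # Rank the nodes with PageRank and return the top n.
    ranks = nx.pagerank(graph)
    return sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:n]

words = [w.strip(".,;:()").lower() for w in text.split()]
print(textrank_words([w for w in words if w.isalpha()]))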

But for me, the most interesting method is supervised keyword extraction; I will train a model for course-text keyword extraction later, so stay tuned.
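Supervised keyphrase extraction boils down to binary classification over candidates. A bare-bones sketch of that setup with scikit-learn's LogisticRegression; the feature set and the two hand-labeled examples below are purely illustrative:

from sklearn.linear_model import LogisticRegression

def candidate_features(candidate, doc):
    # Toy features: phrase length in words, raw frequency, and the
    # relative position of the first occurrence in the document.
    return [len(candidate.split()),
            doc.count(candidate),
            float(doc.find(candidate)) / max(len(doc), 1)]

# Real training data would label every candidate in a corpus as
# keyphrase (1) or not (0), e.g. using the datasets in section 4.
X = [candidate_features(c, text) for c in ("natural language processing",
                                           "flexible ways")]
y = [1, 0]  # hypothetical labels

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([candidate_features("machine translation", text)]))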

4. Automatic Keyphrase Extraction Data

Keyword or keyphrase extraction data is very valuable. Following a pointer from the document "Intro to Automatic Keyphrase Extraction", I found the AutomaticKeyphraseExtraction data on GitHub; the following is the description of the data:

DESCRIPTION

This repository contains the datasets for automatic keyphrase extraction task.

FILES

* 500N-KPCrowd.zip data from Marujo:LREC2012 (News articles annotated using AMT)

* 110-PT-BN-KP.zip data from Marujo:Interspeech2011 (non-English AKE corpus)

* MAUI.tar.gz data from University of Waikato (KEA, MAUI systems)

* Wan2008.tar.gz data from Wan:2008

* Schutz2008.tar.gz data from Schutz:2008 (only answer sets and readme are provided. the papers are available at ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz)

* Nguyen2007.zip data from Nguyen:2007

* Hulth2003.tar.gz data from Hulth:2003

5. KEA – Keyphrase Extraction Algorithm

We cannot ignore the KEA algorithm for keyword or keyphrase extraction:

Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing. For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author. In libraries professional indexers select keyphrases from a controlled vocabulary (also called Subject Headings) according to defined cataloguing rules. On the Internet, digital libraries, or any depositories of data (flickr, del.icio.us, blog articles etc.) also use keyphrases (or here called content tags or content labels) to organize and provide a thematic access to their data.

KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.

KEA is implemented in Java and is platform independent. It is an open-source software distributed under the GNU General Public License.

Google code project: https://code.google.com/p/kea-algorithm/

Related Paper: KEA: Practical Automatic Keyphrase Extraction

Following the Kea-5.0-Readme.txt, I compiled KEA and tested it with an additional nlp.txt in $KEAHome/testdocs/en/test, which includes the test text from the Coursera NLP course. After running "java -Xmx526M TestKea":

Creating the model…

— Loading the Index…

— Building the Vocabulary index from SKOS file

— Reading the Documents…

Extracting keyphrases from test documents…

— Loading the Index…

— Building the Vocabulary index from SKOS file

— Extracting Keyphrases…

I got an nlp.key file which contains the extracted keywords for nlp.txt:

Models

Computers

Data

Processing

Social groups

Methods

Uses

The result does not seem great, but it gives me a very good hint on how to extract course-text keywords.

6. kea-service: KEA 5.0 (keyphrase extraction software), modified to be an XML-RPC service

This is an XML-RPC service for KEA; with the kea-service server running, you can use code in any language to communicate with KEA.

A related doc by the author also provides a general introduction: KEA KEYPHRASE EXTRACTION AS AN XML-RPC SERVICE (CODE RELEASE)
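Calling the service from Python then needs only the standard library. A sketch assuming the server runs locally on port 8000 and exposes a method named extract_keyphrases; the actual host, port, and method name come from the kea-service setup, so treat these as placeholders:

try:
    from xmlrpc.client import ServerProxy  # Python 3
except ImportError:
    from xmlrpclib import ServerProxy      # Python 2

# Placeholder endpoint and method name; check the kea-service README
# for the real values.
server = ServerProxy("http://localhost:8000")
print(server.extract_keyphrases("Natural language processing deals with text."))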

7. topia.termextract

This package determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.
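Basic usage is only a few lines. A sketch following the package's PyPI documentation (note the library targets Python 2):

from topia.termextract import extract

extractor = extract.TermExtractor()
# Returns (term, occurrences, strength) tuples, where strength is the
# number of words in the term.
terms = extractor("Natural language processing deals with the application "
                  "of computational models to text or speech data.")
print(terms)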

8. tagger: A Python module for extracting relevant tags from text documents.

Here is the document that led me to tagger and topia.termextract: 3 Open Source Tools for Auto-Generating Tags for Content

9. Reference Papers

1) Automatic Keyphrase Extraction: A Survey of the State of the Art

2) SGRank: Combining Statistical and Graphical Methods to Improve the State of the Art in Unsupervised Keyphrase Extraction

3) “Without the Clutter of Unimportant Words”: Descriptive Keyphrases for Text Visualization

4) Automatic glossary extraction: beyond terminology identification

5) Unsupervised Approaches for Automatic Keyword Extraction Using Meeting Transcripts

Posted by TextMiner

