# Gensim官方教程翻译（三）——相似度查询（Similarity Queries）

>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)12

>>> from gensim import corpora, models, similarities
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, "From strings to vectors"
>>> print(corpus)
MmCorpus(9 documents, 12 features, 28 non-zero entries)12345

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

>>> doc = "Human computer interaction"
>>> vec_bow = dictionary.doc2bow(doc.lower().split())
>>> vec_lsi = lsi[vec_bow] # convert the query to LSI space
>>> print(vec_lsi)
[(0, -0.461821), (1, 0.070028)]12345

>>> index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it1

> 警告：similarities.MatrixSimilarity类仅仅适合能将所有的向量都在内存中的情况。例如，如果一个百万文档级的语料库使用该类，可能需要2G内存与256维LSI空间。
> 如果没有足够的内存，你可以使用similarities.Similarity类。该类的操作只需要固定大小的内存，因为他将索引切分为多个文件（称为碎片）存储到硬盘上了。它实际上使用了similarities.MatrixSimilarity和similarities.SparseMatrixSimilarity两个类，因此它也是比较快的，虽然看起来更加复杂了。

>>> index.save('/tmp/deerwester.index')

>>> sims = index[vec_lsi] # perform a similarity query against the corpus
>>> print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945),
(5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]1234

>>> sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print(sims) # print sorted (document number, similarity score) 2-tuples
[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees1234567891011

（出处结果的注释是我加的，为了方便观察。）

Gensim是一个比较成熟的工具包，很多公司、个人已经成功将该工具应用于快速原型和生产。但是也并不意味着他是完美的。

Gensim并没有野心成为一个能包含自然语言处理（NLP）（或者仅仅是机器学习）领域的一切的框架。他的任务只是帮助NLP实践者能轻松地尝试流行的主题模型算法应用于大型数据集，并且帮助研究人员设计新算法原型。

[1] http://blog.csdn.net/questionfish/article/details/46746947