Gensim

http://radimrehurek.com/gensim/

Installation

Testing whether the fast version is installed:

>>> from gensim.models import word2vec
>>> assert word2vec.FAST_VERSION > -1

Models

Phrases

This model detects multi-word phrases that can be grouped into a single token, such as new_york_times. It can be used as a preprocessing step for word2vec or doc2vec models.

>>> import gensim
>>> from gensim.models import Word2Vec
>>> bigram_transformer = gensim.models.Phrases(sentences)
>>> model = Word2Vec(bigram_transformer[sentences], size=100, ...)
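To make the idea of phrase detection concrete, here is a minimal pure-Python sketch of bigram scoring in the spirit of the word2vec phrase heuristic: pairs that co-occur more often than their individual counts would suggest get a high score and are joined with an underscore. This is not gensim's implementation. The function names, the toy corpus, the threshold, and the vocabulary-size scaling factor are all assumptions made for illustration.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs; a higher score suggests a phrase."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), count_ab in bigrams.items():
        # Discount rare pairs by min_count, then normalise by the unigram counts.
        scores[(a, b)] = (count_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores


def merge_phrases(sentence, scores, threshold):
    """Join adjacent words whose score exceeds the threshold with '_'."""
    merged = []
    i = 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2  # both words were consumed by the phrase
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Hypothetical toy corpus, already tokenised.
toy_sentences = [
    ["i", "read", "the", "new", "york", "times"],
    ["new", "york", "is", "big"],
    ["the", "times", "are", "new"],
]
toy_scores = score_bigrams(toy_sentences)
toy_merged = merge_phrases(toy_sentences[1], toy_scores, threshold=1.0)
```

A single pass only joins pairs. A three-word token such as new_york_times would need a second pass over the already merged sentences.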

word2vec

A Vocab object describes a word: its frequency (count) and other properties (e.g. sample_int, which is used for sampling).

Let V be the size of the vocabulary and N be the dimension of the hidden layer (the vector dimension).

model.syn0: a $V \times N$ matrix. model.syn0[wordindex] returns the vector of the word with that index.
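The relationship between vocabulary entries and rows of the weight matrix can be sketched without gensim. The example below is a simplified stand-in, not the library's data structures: VocabEntry, build_vocab, init_matrix, and the toy corpus are hypothetical names, and ordering words by descending frequency and using small random initial weights are assumptions of this sketch.

```python
import random
from collections import Counter, namedtuple

# Hypothetical stand-in for a vocabulary entry: row index plus raw frequency.
VocabEntry = namedtuple("VocabEntry", ["index", "count"])


def build_vocab(sentences):
    """Map each word to an entry; more frequent words get lower indexes."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {
        word: VocabEntry(index=i, count=c)
        for i, (word, c) in enumerate(counts.most_common())
    }


def init_matrix(V, N, seed=0):
    """Build a V x N matrix (list of rows) of small random weights."""
    rng = random.Random(seed)
    return [[(rng.random() - 0.5) / N for _ in range(N)] for _ in range(V)]


toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
vocab = build_vocab(toy_corpus)
V, N = len(vocab), 4
syn0 = init_matrix(V, N)

# Looking up a word vector: word -> integer index -> row of the matrix.
cat_vector = syn0[vocab["cat"].index]
```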

doc2vec

Doc2Vec class

_do_train_job(self, job, alpha, inits): job is just a batch of sentences.

DocvecsArray

The document vectors are stored in this object.

indexed_doctags(self, doctag_tokens): given doctag_tokens (a list of document tags), returns the tuple (integer indexes, self.doctag_syn0, self.doctag_syn0_lockf, doctag_tokens).
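The shape of that return value is easier to see in a toy version. The class below is a hypothetical, heavily simplified stand-in for a document-vector store and is not gensim's DocvecsArray. Only the method name and the four-part return tuple follow the description above; skipping unknown tags and using a lock factor of 1.0 per vector are assumptions of this sketch.

```python
class ToyDocvecs:
    """Hypothetical, simplified stand-in for a document-vector store."""

    def __init__(self, doctags, vector_size):
        # Map each document tag to the row that holds its vector.
        self.doctags = {tag: i for i, tag in enumerate(doctags)}
        # One vector per document, zero-initialised for simplicity.
        self.doctag_syn0 = [[0.0] * vector_size for _ in doctags]
        # One lock factor per document (assumed: 1.0 means fully trainable).
        self.doctag_syn0_lockf = [1.0] * len(doctags)

    def indexed_doctags(self, doctag_tokens):
        """Return (integer indexes, vectors, lock factors, original tags)."""
        indexes = [self.doctags[t] for t in doctag_tokens if t in self.doctags]
        return indexes, self.doctag_syn0, self.doctag_syn0_lockf, doctag_tokens


store = ToyDocvecs(["doc_a", "doc_b", "doc_c"], vector_size=3)
indexes, vectors, lockf, tokens = store.indexed_doctags(["doc_c", "missing", "doc_a"])
```

Returning the backing arrays alongside the indexes lets a training routine update the selected rows in place rather than working on copies.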