I've always wondered about doing this in non-flat spaces. Like if I add the "7100 miles west" vector to the "California" point, I get Turkmenistan. If I add "7100 miles west" again, I get back near "California." Similarly, adding the "not" vector twice might get you back where you started in a word embedding. Anyone know if anyone is working on this? It could be tricky because "7100 miles west" lives in the tangent space to the space "California" lives in, but that in itself could be an interesting thing to study in the context of words.
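One crude way to check the flat-space version of this empirically (a minimal sketch, assuming a pretrained gensim model; using an antonym-pair offset as a stand-in "not" vector is my own assumption, there is no canonical "not" direction):

```python
import gensim.downloader as api

# Stand-in "not" direction: the offset between an antonym pair (an assumption).
kv = api.load("glove-wiki-gigaword-100")
not_vec = kv["unhappy"] - kv["happy"]

once = kv["happy"] + not_vec    # should land somewhere near "unhappy"
twice = once + not_vec          # in a flat Euclidean space this just keeps drifting

print(kv.similar_by_vector(once, topn=3))
print(kv.similar_by_vector(twice, topn=3))
```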
Word2vec is generally the standard implementation of neural word embeddings. There are other algorithms as well, such as GloVe[1], document embeddings[2], and backpropagation-based methods[3]. Facebook also came out with a paper recently that beats word2vec[4].
Neural word embeddings are a neat way of representing concepts. I see a great future for automated feature engineering with text in deep learning, joining what is already happening with audio and images.
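For anyone who hasn't tried the word2vec route, it's pretty painless with gensim; a minimal training sketch (the bundled `common_texts` toy corpus is just a placeholder for a real one):

```python
from gensim.test.utils import common_texts   # tiny toy corpus bundled with gensim
from gensim.models import Word2Vec

# Swap common_texts for your own list of token lists; min_count=1 only because
# the toy corpus is so small.
model = Word2Vec(common_texts, vector_size=50, window=5, min_count=1, workers=4)
print(model.wv.most_similar("computer", topn=3))

# With a large corpus the familiar analogy arithmetic starts to work, e.g.
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```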
It's my first time seeing the package, but looking over the docs, it looks like it implements LSA. The major difference here is that word2vec dramatically outperforms LSA on a variety of tasks (http://datascience.stackexchange.com/questions/678/what-are-...). In my experience, the vector representations from LSA can be underwhelming and perform poorly. I can't comment on the Random Projection and Reflective Random Indexing techniques that SemanticVectors implements.
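For a rough sense of what the LSA side of that comparison looks like, here's a sketch of LSA-style word vectors via scikit-learn (a stand-in, not the SemanticVectors implementation):

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Term-document matrix -> truncated SVD; rows of components_.T are LSA term vectors.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data
tfidf = TfidfVectorizer(stop_words="english", min_df=5)
X = tfidf.fit_transform(docs)                     # (n_docs, n_terms)

svd = TruncatedSVD(n_components=100)
word_vecs = svd.fit(X).components_.T              # (n_terms, 100)
vocab = tfidf.get_feature_names_out()

# Nearest neighbours of a term under LSA, to eyeball against word2vec's.
idx = list(vocab).index("car")
sims = cosine_similarity(word_vecs[idx:idx + 1], word_vecs)[0]
print([vocab[i] for i in np.argsort(-sims)[:10]])
```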
Sorry, I should have specifically mentioned how it differs from random indexing/projection. I was immediately reminded of a similar inference example using random indexing/projection.
Good observation -- I missed that (obviously). They seem to be using data from the word2vec project, so I would guess that it is intentional rather than a lack of cleaning.
I don't understand how the item matching is working. Do they have textual descriptions of each item (including colors and patterns), or are they somehow building vectors for the images and then doing cross-modal vector calculations?
If it's the first option, then generating those descriptions seems like an important thing to mention.
If it's the second, then it's a pretty significant result! I've seen some papers that indicate some possibilities in that area, but never anything working as well as this.
The item vectors are generated from the text: customers and stylists write text about, say, the stripes or the maternity fit, and word2vec associates this with the item in question. How that happens is covered in the next section on summarizing documents (a 'document' here is the collection of all text about an item); there's a toy sketch below.
So we don't do any fancy deep learning from the images themselves, although this is on the horizon :)
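In spirit, though, it's something like this toy sketch (pretrained GloVe vectors, averaged word vectors, and the made-up item texts are all stand-ins, not our actual pipeline):

```python
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")   # stand-in for a model trained on our own text

# A "document" is all the text ever written about an item; here the item vector
# is just the average of its word vectors (a stand-in for the summarization step).
item_texts = {
    "item_123": "love the stripes but it runs small",
    "item_456": "great maternity fit and soft fabric",
}

def item_vector(text):
    words = [w for w in text.lower().split() if w in kv]
    return np.mean([kv[w] for w in words], axis=0)

item_vecs = {item: item_vector(text) for item, text in item_texts.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Like item_456, but with stripes": nudge its vector and re-rank the inventory.
query = item_vecs["item_456"] + kv["stripes"]
print(sorted(item_texts, key=lambda i: cosine(query, item_vecs[i]), reverse=True))
```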
It's just a play on the phrase 'a picture is worth a thousand words' :) The word vectors themselves contain sophisticated relationships that seem almost miraculous; we're definitely not negative on them.
> If the gender axis is more positive, then it's more feminine; more negative, more masculine.
Reminds me of my Java textbook: the example was to model some employee[1], and its gender was a `boolean`: `false` for man, `true` for woman. Of course, that was just an intermediate example before they showed off the `enum` solution.
[1] Because hey, Java OOP + a CRUD business application == a match made in heaven for an example, apparently.
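Incidentally, that 'gender axis' from the quoted line can be computed directly; a quick sketch, with pretrained GloVe vectors standing in for the article's model:

```python
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")
gender_axis = kv["she"] - kv["he"]           # crude stand-in for the article's axis

for word in ["queen", "king", "actress", "actor"]:
    score = float(kv[word] @ gender_axis)    # positive ~ feminine, negative ~ masculine
    print(f"{word:>8s} {score:+.2f}")
```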
pork = pig + meat
So 2015 is the year of the ram/sheep/goat; in Chinese, these literally mean
ram = male ∪ caprinae
sheep = wool ∪ caprinae
goat = mountain ∪ caprinae
Basically, word composition is pretty common in analytic languages like Chinese, but it's kind of a new idea in fusional languages like English.
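You can ask an English model for the same compositions as vector arithmetic; a quick sketch with pretrained GloVe vectors (whether 'pork' or 'goat' actually comes out on top depends on the model):

```python
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")

# pork ~ pig + meat, goat ~ mountain + sheep (the Chinese compounds themselves
# would need a Chinese-trained model)
print(kv.most_similar(positive=["pig", "meat"], topn=5))
print(kv.most_similar(positive=["sheep", "mountain"], topn=5))
```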