The item vectors are generated from text: customers and stylists write about an item, mentioning the stripes, say, or maternity, and word2vec associates those words with the item in question. How that happens is covered in the next section on summarizing documents (a 'document' here is the collection of all text about an item).
So we don't do any fancy deep learning from the images themselves, although this is on the horizon :)
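To make the idea of a text-derived item vector concrete, here is a minimal sketch. It assumes the simplest possible summary, averaging the word vectors of every known word written about an item, which may well differ from the method described in the next section. The tiny `WORD_VECTORS` table and the names `item_vector` and `cosine` are made up for illustration; real vectors would come from a trained word2vec model.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-written for illustration.
# In practice these would be looked up in a trained word2vec model.
WORD_VECTORS = {
    "stripes": [1.0, 0.0, 0.0],
    "striped": [0.9, 0.1, 0.0],
    "maternity": [0.0, 1.0, 0.0],
    "stretchy": [0.0, 0.8, 0.2],
}


def item_vector(texts, word_vectors):
    """Average the vectors of all known words in the texts about one item.

    Assumption: a plain average stands in for the document summary.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for text in texts:
        for word in text.lower().split():
            vec = word_vectors.get(word)
            if vec is None:
                continue  # skip words that have no vector
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    if count == 0:
        return total  # no known words: return the zero vector
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# The 'document' for each item is the collection of all text about it.
striped_top = item_vector(["Love the stripes", "striped and cute"], WORD_VECTORS)
maternity_dress = item_vector(["great maternity fit", "so stretchy"], WORD_VECTORS)
```

With these toy inputs, `striped_top` ends up much closer (by cosine similarity) to the vector for "stripes" than to the one for "maternity", without any image ever being looked at.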