Evaluation-Framework

An evaluation framework for comparing graph embedding techniques.

Semantic Analogies

Datasets used as gold standard

| Dataset | Structure | Size | Source |
|---|---|---|---|
| Capitals and countries | ca1 co1 ca2 co2 | 505 | Word2Vec |
| Currency (and countries) | cu1 co1 cu2 co2 | 866 | Word2Vec |
| Cities and states | ci1 st1 ci2 st2 | 2,467 | Word2Vec |
| (All) capitals and countries | ca1 co1 ca2 co2 | 4,523 | Word2Vec |

Model

The task takes quadruplets (v_1, v_2, v_3, v_4) and uses the first three vectors to predict the fourth. Among all vectors in the embedding space, the one nearest to the predicted vector is retrieved, where closeness is measured by the dot product.

  def default_analogy_function(a, b, c):
      return b - a + c

The vector returned by the function (the predicted vector) is compared against the top_k most similar vectors. If the actual fourth vector is among them, the answer is counted as correct.

The analogy function to compute the predicted vector and the top_k value can be customised.
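The retrieval step described above can be sketched as follows. This is a minimal NumPy sketch, not the framework's actual API: the function names, the index-based quadruplets, and the use of unnormalised dot-product similarity are assumptions for illustration.

```python
import numpy as np

def default_analogy_function(a, b, c):
    # Predict the fourth vector from the first three: b - a + c.
    return b - a + c

def is_analogy_correct(vectors, quadruplet,
                       analogy_function=default_analogy_function, top_k=2):
    """Check whether the fourth vector of a quadruplet is among the
    top_k vectors most similar (by dot product) to the predicted one.

    vectors    -- (n, d) matrix, one embedding per row
    quadruplet -- (i1, i2, i3, i4) row indices into `vectors`
    """
    i1, i2, i3, i4 = quadruplet
    predicted = analogy_function(vectors[i1], vectors[i2], vectors[i3])
    scores = vectors @ predicted              # dot-product similarity to every vector
    top = np.argsort(scores)[::-1][:top_k]    # indices of the top_k most similar vectors
    return i4 in top

# Toy example: v4 = v2 - v1 + v3 holds exactly, so the analogy is found.
vectors = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 2.]])
print(is_analogy_correct(vectors, (0, 1, 2, 3), top_k=2))  # True
```

Both the analogy function and top_k are parameters here, mirroring the customisation points mentioned above.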

Output of the evaluation

| Metric | Range | Optimum |
|---|---|---|
| Accuracy | [0, 1] | Highest |
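As a worked example of the metric (plain Python, not the framework's code): accuracy is the fraction of quadruplets answered correctly, so it lies in [0, 1] and higher is better.

```python
def accuracy(num_correct, num_quadruplets):
    # Fraction of correctly answered analogy quadruplets; in [0, 1], higher is better.
    return num_correct / num_quadruplets

# e.g. 400 correct answers out of the 505 Capitals-and-countries quadruplets
print(accuracy(400, 505))  # ~0.79
```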