An NLP application on SOMs

Using self-organizing maps to represent word vectors on a 2D map while preserving the topology of the original input space

— 10 min read

This project aims to represent some of the most common English words on a 2D topological map using a self-organizing map (SOM), also known as a Kohonen network. These networks use competitive learning to map data points from a high-dimensional feature space onto a lower-dimensional one while preserving their topology, which makes them useful for visualising high-dimensional data. The GitHub repo with all the code can be found here.
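To make the mechanism concrete, a minimal SOM training loop could look like the sketch below. This is not the repository's implementation: the function name train_som, the linear decay schedules, and the data array (one 300-dimensional row per word) are assumptions made purely for illustration.

```python
# Minimal SOM sketch (hypothetical, not the repo's code). `data` is assumed to
# be an (n_words, 300) NumPy array of word vectors.
import numpy as np

def train_som(data, grid_size=40, epochs=500, lr0=0.2, radius0=40, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    # One weight vector per grid unit, initialised from a standard Gaussian.
    weights = rng.normal(0.0, 1.0, size=(grid_size, grid_size, dim))
    # Grid coordinates of every unit, used for Manhattan distances on the map.
    coords = np.stack(np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                                  indexing="ij"), axis=-1)

    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 1.0)  # shrinking neighbourhood
        for x in data[rng.permutation(len(data))]:
            # Competitive step: find the best-matching unit (closest weights).
            bmu = np.unravel_index(
                np.argmin(np.linalg.norm(weights - x, axis=-1)),
                (grid_size, grid_size))
            # Cooperative step: units within the neighbourhood move towards x.
            in_range = np.abs(coords - np.array(bmu)).sum(axis=-1) <= radius
            weights[in_range] += lr * (x - weights[in_range])
    return weights
```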

Experiments

In the experiments of this project, pre-trained word vectors were used as input vectors, and a 2D uniform grid of varying size was used as the output space. Lists of the most common words in different domains were found on the internet and are described below for each experiment. More details on how the data files are formatted can be found in the data directory of the GitHub repository.

The English language

In the first experiment, the pre-trained 300-dimensional GloVe word vectors (glove.6B.300d.txt) made available by Stanford University were used. The output space consisted of a 40x40 uniform grid. To determine the 2000 most commonly used words in the English language, the list google-10000-english-no-swears.txt from this repository was used; the words in that list were determined by n-gram frequency analysis of Google's Trillion Word Corpus. These 2000 words formed the input space during training and plotting. Each unit (vertex) in the grid was labelled with the label of its closest data point (by Euclidean distance).
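As a rough sketch of how the input vectors might be assembled (the actual loading code lives in the repository's data directory), the GloVe file and the word list could be parsed as below. The helper names, the paths, and the plain whitespace parsing are assumptions.

```python
# Hedged sketch: build the (~2000, 300) input matrix from the files named above.
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def load_word_list(path="google-10000-english-no-swears.txt", n=2000):
    """Take the first n entries of the frequency-ordered word list."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f][:n]

glove = load_glove()
words = [w for w in load_word_list() if w in glove]  # keep only words with a vector
data = np.stack([glove[w] for w in words])           # input matrix for the SOM
```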

Training Parameters used:

| Parameter | Value |
| --- | --- |
| Words | 2000 |
| Epochs | 500 |
| Learning rate | 0.2 |
| Unit initialisation | Gaussian(0, 1) |
| Grid size | 40x40 |
| Initial neighborhood range | 40 (Manhattan distance) |
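Continuing the hypothetical sketches above, training with the parameters from the table and then labelling each grid unit with its nearest word (Euclidean distance) could look roughly like this:

```python
# Hypothetical continuation of the earlier sketches: train, then label units.
weights = train_som(data, grid_size=40, epochs=500, lr0=0.2, radius0=40)

labels = np.empty((40, 40), dtype=object)
for i in range(40):
    for j in range(40):
        # Euclidean distance from this unit's weights to every word vector.
        d = np.linalg.norm(data - weights[i, j], axis=1)
        labels[i, j] = words[int(np.argmin(d))]
```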

The resulting plot can be seen here:

[Figure: Topological map of words in the English language]

As seen in the results, similar words are positioned close to each other.

The Swedish language

A similar experiment was done for the Swedish language. The Swedish word vectors are 300-dimensional and were trained by Facebook Research; they can be found in their GitHub repository and are licensed under the Creative Commons Attribution-ShareAlike 3.0 License. The 600 most frequently used words in the Swedish language were included, and the source used to identify them can be found here. Each unit (vertex) in the grid was labelled with the label of its closest data point (by Euclidean distance).
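One practical difference from the English experiment is the file format: fastText .vec files start with a header line giving the vocabulary size and dimension. A hedged loading sketch follows; the file name is an assumption, as Facebook Research distributes the Swedish vectors under more than one name.

```python
# Hedged sketch for the Swedish fastText vectors; the path is an assumption.
import numpy as np

def load_fasttext_vec(path="cc.sv.300.vec"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "<vocab_size> <dimension>" header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```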

Training Parameters used:

| Parameter | Value |
| --- | --- |
| Words | 600 |
| Epochs | 500 |
| Learning rate | 0.2 |
| Unit initialisation | Gaussian(0, 1) |
| Grid size | 20x20 |
| Initial neighborhood range | 20 (Manhattan distance) |

The resulting plot can be seen here:

[Figure: Topological map of words in the Swedish language]

As seen in the results, similar words are positioned close to each other.