A Neural Probabilistic Language Model

The Problem:

The fundamental problem in probabilistic language modeling is that modeling the joint distribution of a large number of discrete variables requires a number of free parameters that grows exponentially with the number of variables. This is the 'Curse of Dimensionality'. It motivates modeling with continuous variables, where generalization is easier to achieve: the learned function is locally smooth, so every training point (n-gram sequence) carries significant information about a combinatorial number of neighboring points.

The Solution:

- The paper presents an effective and computationally efficient probabilistic modeling approach that overcomes the curse of dimensionality. It also generalizes to sequences never observed in the training data.
- A neural network model is developed whose parameter set includes a vector representation of each word together with the parameters of the probability function.
- The objective of the model is to find the parameters that minimize the perplexity of the training dataset (equivalently, maximize the training data's log-likelihood).
- The model eventually learns a distributed representation of each word along with the probability function of a sequence, expressed as a function of those distributed representations.
- The neural model has a hidden layer with tanh activation, and the output layer is a softmax layer.
- The output of the model, for each input of (n-1) previous word indices, is a probability for each of the |V| words in the vocabulary.
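The bullets above can be sketched as a forward pass in NumPy. This is a minimal illustration with made-up sizes, not the paper's full model (it omits the optional direct word-to-output connections); `C`, `H`, `U` follow the paper's naming for the embedding, hidden, and output weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, h, n = 10, 4, 8, 3                   # vocab size, embedding dim, hidden units, n-gram order
C = rng.standard_normal((V, m))            # one m-dimensional feature vector per word
H = rng.standard_normal((h, (n - 1) * m))  # hidden-layer weights
d = np.zeros(h)                            # hidden bias
U = rng.standard_normal((V, h))            # hidden-to-output weights
b = np.zeros(V)                            # output bias

def predict(context_ids):
    """P(w_t | previous n-1 words): one probability per vocabulary word."""
    x = C[context_ids].reshape(-1)         # look up and concatenate the (n-1) embeddings
    a = np.tanh(H @ x + d)                 # tanh hidden layer
    logits = U @ a + b                     # one score per word in the vocabulary
    e = np.exp(logits - logits.max())      # softmax, shifted for numerical stability
    return e / e.sum()

p = predict([3, 7])                        # |V| probabilities summing to 1
```

Training would adjust `C`, `H`, `d`, `U`, and `b` jointly by gradient ascent on the log-likelihood, so the word features and the probability function are learned together.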

The model generalizes by taking advantage of word order, and by recognizing that temporally closer words in the word sequence are statistically more dependent.

What does this ultimately mean in the context of what has been discussed? What problem is this solving? The language model proposed makes dimensionality less of a curse and more of an inconvenience. That is to say, computational and memory complexity scale linearly with vocabulary size and context length, not exponentially. It improves upon past efforts by learning a feature vector for each word to represent similarity, and also learning a probability function for how words connect via a neural network.
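The linear-versus-exponential claim can be made concrete with a quick parameter count. The sizes below are illustrative choices on the order of the paper's experiments, not exact figures from it:

```python
# Illustrative sizes: a ~17,000-word vocabulary and a context of n-1 = 4 words.
V, n = 17_000, 5
m, h = 30, 100   # embedding and hidden sizes (same order of magnitude as the paper's)

# A full n-gram table needs one free parameter per possible n-word sequence.
ngram_params = V ** n

# The neural model's parameters grow linearly in V and in (n-1):
# embeddings + hidden weights + hidden bias + output weights + output bias.
neural_params = V * m + h * (n - 1) * m + h + V * h + V

print(f"n-gram table: {ngram_params:.2e} parameters")
print(f"neural model: {neural_params:,} parameters")
```

A few million parameters versus roughly 10^21: the table-based model is hopeless at this scale, while the neural model stays tractable.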

source: https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-26_at_2.17.37_PM.png

- Three input nodes make up the foundation at the bottom, each fed by the index of a context word from the text under study.
- The layer in the middle labeled tanh represents the hidden layer. Tanh, an activation function known as the hyperbolic tangent, is sigmoidal (s-shaped) and helps reduce the chance of the model getting "stuck" when assigning values to the language being processed. How is this? In the system this research team sets up, strongly negative inputs get assigned values very close to -1, strongly positive inputs values very close to +1, and only near-zero inputs are mapped to near-zero outputs.
- The uppermost layer is the output — the softmax function. It is used to bring our range of values into the probabilistic realm (each in the interval from 0 to 1, with all vector components summing to 1).
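Both nonlinearities described above are easy to check numerically. A pure-Python sketch:

```python
import math

def softmax(scores):
    """Map arbitrary real scores to a probability distribution."""
    mx = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# tanh saturates: strongly negative inputs map near -1, strongly positive
# inputs near +1, and only near-zero inputs map to near-zero outputs.
print(math.tanh(-5), math.tanh(0.0), math.tanh(5))   # ≈ -0.9999, 0.0, 0.9999

# softmax squashes a vector of scores into values in (0, 1) that sum to 1.
probs = softmax([2.0, 1.0, -1.0])
print(probs)
```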