Source code: Language-Adversarial Training
Language-Adversarial Training addresses the cross-lingual model transfer problem by learning a language-invariant hidden feature space, which leads to better cross-lingual generalization.
This project is a TensorFlow implementation of the language-adversarial training approach in the context of cross-lingual text classification (CLTC).
We classify a product review into five categories corresponding to its star rating.
LAN has two branches. There are four main components in the network:
- Embedding averager EA that maps the input sequence x to word embeddings and averages them.
- Joint Feature extractor F that maps the averaged embeddings to a fixed-length feature vector in the shared feature space.
- Sentiment classifier P that predicts the label for x given the feature representation F(x).
- Language discriminator Q that also takes F(x) but predicts a scalar score indicating whether x is from SOURCE (1) or TARGET (-1).
We adopt the Deep Averaging Network (DAN) for the combination (EA + F). DAN takes the arithmetic mean of the word vectors as input and passes it through several fully-connected layers, ending in a softmax for classification.
In LAN, EA computes the arithmetic mean of the word vectors in the input sequence, and F passes that average through a feed-forward network with ReLU nonlinearities. The activations of the last layer in F are taken as the extracted features for the input and are passed on to P and Q. The sentiment classifier P and the language discriminator Q are standard feed-forward networks: P has a softmax layer on top for text classification, and Q ends with a tanh layer of output width 1 that assigns a language identification score (1 for source, -1 for target).
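For orientation, here is a minimal tf.keras sketch of how the four components could be laid out; the vocabulary size, embedding dimension, layer widths and layer names are illustrative assumptions rather than the exact values used in the notebooks.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, Sequential

vocab_size, emb_dim, feat_dim = 20000, 300, 128   # illustrative sizes

# EA: embed the padded token ids and average the word vectors of each sequence.
tokens = layers.Input(shape=(None,), dtype="int32", name="padded_tokens")
lengths = layers.Input(shape=(1,), dtype="float32", name="actual_lengths")
embedded = layers.Embedding(vocab_size, emb_dim)(tokens)
# Simple sum-and-divide average; a length-aware Averaging layer is sketched later.
averaged = layers.Lambda(lambda t: tf.reduce_sum(t[0], axis=1) / t[1])([embedded, lengths])
EA = Model([tokens, lengths], averaged, name="EA")

# F: joint feature extractor, a feed-forward network with ReLU nonlinearities.
F = Sequential([layers.Dense(feat_dim, activation="relu"),
                layers.Dense(feat_dim, activation="relu")], name="F")

# P: sentiment classifier ending in a 5-way softmax (star ratings 1-5).
P = Sequential([layers.Dense(feat_dim, activation="relu"),
                layers.Dense(5, activation="softmax")], name="P")

# Q: language discriminator ending in a width-1 tanh layer (1 = source, -1 = target).
Q = Sequential([layers.Dense(feat_dim, activation="relu"),
                layers.Dense(1, activation="tanh")], name="Q")
```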
Download the files and upload them to the Google Drive folder 'Colab Notebooks':
The working directory should be 'My Drive/Colab Notebooks/Work/'. Copy the files extract_data.ipynb and LAN_v3.ipynb into this folder.
Inside it, the following directories get created after running extract_data.ipynb (you can also create them yourself if you prefer):
'My Drive/Colab Notebooks/Work/Amazon reviews/'
'My Drive/Colab Notebooks/Work/Amazon reviews/train'
'My Drive/Colab Notebooks/Work/Amazon reviews/dev'
'My Drive/Colab Notebooks/Work/Amazon reviews/test'
'My Drive/Colab Notebooks/Work/bwe/'
'My Drive/Colab Notebooks/Work/bwe/vectors'
Next, run LAN_v3.ipynb.
If the files are copied to some other folder instead, you may have to modify the top cells in both notebooks accordingly. After running the first cells, click the generated link and enter the authorization code for the same Drive from which the files are being run.
In extract_data.ipynb, you can select which language datasets and word embeddings to download.
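The first cells referred to above typically mount Google Drive with the standard Colab helper; a minimal sketch (the mount point and working-folder variable are assumptions):

```python
# Mount Google Drive inside the Colab runtime; running this prints a link,
# and pasting the authorization code completes the mount.
from google.colab import drive
drive.mount('/content/drive')

# The working folder then lives under the mounted path.
WORK_DIR = '/content/drive/My Drive/Colab Notebooks/Work/'
```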
- Python 3.7
- Tensorflow 2.3.0
- Numpy
- Regex
- Requests
- Tqdm
- JSON
- The basic architecture involves an Embeddings layer, an Averaging layer, a Feature Extractor model, a Sentiment Classifier model and a Language Detector model.
- The overall model has two inputs: the padded reviews converted to sequences, and the corresponding lengths of the reviews just before padding. The lengths are required for averaging the embeddings through the averaging layer (see the wiring sketch after this list).
- Input1 : Padded sequences
- Input2 : Actual sequence lengths
- Output1 : Predicted labels/star ratings
- Output2 : Predicted language
- Labels : Actual star ratings (for source language training)
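Continuing the sketch above, the two inputs and two outputs might be wired together roughly as follows (the exact wiring in LAN_v3.ipynb may differ):

```python
from tensorflow.keras import Model

# Shared feature path feeding both output heads.
features = F(EA([tokens, lengths]))

star_pred = P(features)   # Output1: predicted star rating (softmax over 5 classes)
lang_pred = Q(features)   # Output2: predicted language score (tanh, +1 source / -1 target)

LAN = Model(inputs=[tokens, lengths], outputs=[star_pred, lang_pred], name="LAN")
```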
- The Embeddings and Averaging layers are kept as a separate model (EA) for training purposes.
- The inputs are first passed through this model, which converts the padded sequences to embeddings and then averages them to produce averaged embeddings (similar to a bag-of-words approach) as outputs.
- This is the base model for feature extraction; during training this base is shared by both branches. Thus, during sentiment classification the weights are updated from the averaged labeled reviews, while during language classification they are updated from language-labeled data, which does not have to consist of reviews.
- The motive is to take advantage of language detection in sentiment classification through the shared feature extraction.
- Averaged embeddings are given as inputs, and the outputs of the final Dense layer are the features we're referring to.
- This is the main objective model: classification/labeling of reviews. In this repository, it is trained on labeled source-language (English) reviews.
- The features are given as inputs and the outputs are the label predictions.
- Loss is evaluated as Sparse Categorical Crossentropy loss.
- This model is for the adversarial training. It is assumed that the sentiment classifier can't be trained on the target language due to the lack of labeled target reviews. The target-language features are therefore learned through this model: training this branch of LAN updates the already trained feature extractor weights to adjust for target-language features.
- The inputs are the same as for the sentiment classifier, i.e., the features extracted by the feature extractor. The outputs are binary labels from a tanh layer, with +1 representing source-language and -1 representing target-language data.
- The loss is evaluated as Hinge loss.
- During training of this branch, the computed gradients are reversed (multiplied by -lambda) before updating the model weights. This is done so that the learned features are invariant between the source and target languages (a sketch of such an update follows this list).
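A minimal sketch of one way this reversed update could look, assuming EAFQ is the combined EA + F + Q model and that the batch tensors and label conventions follow the list above (all variable names here are assumptions):

```python
import tensorflow as tf

hinge = tf.keras.losses.Hinge()
opt = tf.keras.optimizers.Adam()
lambda_ = 0.1                                        # illustrative reversal weight
f_vars = {v.ref() for v in F.trainable_variables}    # variables of the feature extractor

with tf.GradientTape() as tape:
    scores = EAFQ([padded_batch, length_batch], training=True)
    loss = hinge(lang_labels, scores)                # language labels: +1 source, -1 target

grads = tape.gradient(loss, EAFQ.trainable_variables)
# Reverse (multiply by -lambda_) the gradients flowing into the feature extractor,
# so it is pushed toward language-invariant features while Q still learns to discriminate.
grads = [(-lambda_ * g) if v.ref() in f_vars else g
         for g, v in zip(grads, EAFQ.trainable_variables)]
opt.apply_gradients(zip(grads, EAFQ.trainable_variables))
```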
The notebook has been divided into several subsections, so that it can be easily adapted to run on a local machine as well.
(Only when running the notebook through Drive)
Options here refers to the CLI arguments that are passed through the sys.argv variable in Python. Since execution through the notebook doesn't include CLI arguments, the sys.argv list needs to be updated explicitly. This section also creates the necessary folders if they don't already exist.
Some commonly used utility and debugging functions are defined here. In later versions, the saving and loading of models is also defined, which is needed to create checkpoints: not all the models can be saved and imported directly, so LAN as a whole is saved and loaded and the component models have to be reconstructed from it. (This is another reason for splitting the feature extractor into two parts.)
The Amazon Reviews class definition, which reads the downloaded data from the datapath specified in the Options section.
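For example, a cell near the top might set sys.argv explicitly before the Options are parsed; the flag names below are purely hypothetical.

```python
import sys

# Simulate command-line arguments inside the notebook (hypothetical flags;
# use whatever options the Options section actually parses).
sys.argv = ['LAN_v3', '--src_lang', 'en', '--tgt_lang', 'fr', '--epochs', '10']
```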
This section defines the Averaging layer in this version.
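As an illustration, such an averaging layer might be written as a Keras Layer subclass along these lines (the class name and details are assumptions); implementing it as a proper layer with get_config makes it straightforward to reload checkpoints via custom_objects later.

```python
import tensorflow as tf

class Averaging(tf.keras.layers.Layer):
    """Averages word embeddings over the true (pre-padding) length of each sequence."""

    def call(self, inputs):
        embeddings, lengths = inputs                     # (batch, T, dim), (batch, 1)
        max_len = tf.shape(embeddings)[1]
        mask = tf.sequence_mask(tf.cast(lengths[:, 0], tf.int32),
                                maxlen=max_len, dtype=embeddings.dtype)
        summed = tf.reduce_sum(embeddings * mask[:, :, tf.newaxis], axis=1)
        return summed / lengths                          # divide by the actual lengths

    def get_config(self):
        return super().get_config()                      # no extra hyperparameters to store
```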
This section creates the models described above, with their inputs and outputs, and their combinations for training. The fundamental models are named EA, F, P and Q respectively. These are combined to produce the models named EAF, EAFP, EAFQ and EAFPQ/LAN.
The training is divided into 2 major parts:
The embedding layer weights (i.e., the weights of the EA model) are kept untrainable (constant) for this part, so that only the F, P and Q weights get updated.
- The EAFP or sentiment classifier branch is trained on source-language review data, thus updating F and P model weights.
- The EAFQ or language detector branch is trained on source and target data, updating F and Q models.
The embedding layer weights are also made trainable, so that the weights of EA are updated during training as well.
- The EAFP or sentiment classifier branch is trained on source-language review data, thus updating EA, F and P model weights.
- The EAFQ or language detector branch is trained on source and target data, updating EA, F and Q models.
- The overall LAN is trained, updating the weights of EA, F, P as well as Q at the same time. For this, the sentiment labels need to be zipped with the language labels and shuffled, so that the loss can combine the Sparse Categorical Crossentropy loss for sentiment classification with the Hinge loss for language detection (see the sketch after this list).
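A highly simplified compile/fit version of this schedule might look as follows, assuming the models are built as sketched earlier and that the data arrays (source reviews, language-labeled data, and the zipped label arrays) are already prepared; all data variable names are assumptions.

```python
import tensorflow as tf

# Part 1: keep the embeddings (EA) frozen and train the two branches alternately.
EA.trainable = False
EAFP.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy())
EAFP.fit([src_padded, src_lengths], src_star_labels, epochs=1)

EAFQ.compile(optimizer="adam", loss=tf.keras.losses.Hinge())
EAFQ.fit([all_padded, all_lengths], language_labels, epochs=1)    # +1 source, -1 target

# Part 2: unfreeze the embeddings, retrain both branches the same way, then train
# the full LAN jointly with both losses at once.
EA.trainable = True
LAN.compile(optimizer="adam",
            loss=[tf.keras.losses.SparseCategoricalCrossentropy(),
                  tf.keras.losses.Hinge()])
LAN.fit([joint_padded, joint_lengths], [joint_star_labels, joint_language_labels],
        epochs=1)
```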
The model is evaluated on (unseen) target data.
The model may need to be saved and reloaded at many points to guard against unexplained crashes while running.
Saving LAN saves all the component models within a single combined model (just like the usual saving described in the TF documentation).
- Load the model as a single LAN model.
- Look at LAN structure with LAN.summary() function.
- The summary should show three sequential models at the bottom. These are F, P and Q (in top-to-bottom order in the v3 implementation).
- These layers and models can be accessed with LAN.layers and stored in (_, E, _, A, F, P, Q) in this order.
- There are two inputs, for the padded review sequences and the corresponding actual lengths. These have to be used as inputs when rebuilding the other composite models, i.e., EA, EAF, EAFP, EAFQ. The inputs can be accessed with LAN.inputs and stored as (input1, input2).
The Utils section in v5 has these two functions implemented as load_models and save_models.
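A sketch of this reload-and-rebuild procedure, using a placeholder checkpoint path and assuming the Averaging layer class from the sketch above; the unpacking follows the (_, E, _, A, F, P, Q) order noted for v3.

```python
import tensorflow as tf

# Load the whole LAN model; custom layers (e.g. the Averaging layer) are passed
# through custom_objects so Keras can rebuild them.
LAN = tf.keras.models.load_model("path/to/LAN_checkpoint",
                                 custom_objects={"Averaging": Averaging})
LAN.summary()                                  # the last three entries should be F, P and Q

# Recover the component layers/models and the two inputs.
_, E, _, A, F, P, Q = LAN.layers
input1, input2 = LAN.inputs                    # padded sequences, actual lengths

# Reconstruct the composite models from the shared graph.
EA   = tf.keras.Model([input1, input2], A([E(input1), input2]))
EAF  = tf.keras.Model([input1, input2], F(EA([input1, input2])))
EAFP = tf.keras.Model([input1, input2], P(EAF([input1, input2])))
EAFQ = tf.keras.Model([input1, input2], Q(EAF([input1, input2])))
```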
More detailed descriptions of running instructions can be found in folders in the repository.
Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification