CNNs and RNNs have different architectures and are designed to solve different problems.
Images have a lot of pixels and therefore a lot of features. Dropping some of those features doesn't change much about what the image conveys, and CNNs are designed to do exactly that reduction (via convolution and pooling).
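A minimal sketch of that reduction, in pure Python (no framework): 2x2 max pooling halves each spatial dimension, keeping only the strongest value in each window, so the feature count drops by 4x while the dominant activations survive.

```python
def max_pool_2x2(image):
    """image: list of rows of numbers, with even dimensions.
    Returns the image downsampled by taking the max of each 2x2 window."""
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            window = [image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1]]
            row.append(max(window))
        pooled.append(row)
    return pooled

# A made-up 4x4 "image" shrinks to 2x2 but keeps its strongest responses.
img = [[1, 3, 2, 0],
       [4, 2, 1, 1],
       [0, 1, 5, 6],
       [2, 2, 7, 8]]
print(max_pool_2x2(img))  # [[4, 2], [2, 8]]
```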
NLP tasks are context driven. The farther a word is from the current word in a sentence, the less it usually contributes to the current word's meaning. RNNs/LSTMs/Transformers are designed to maintain that kind of memory over distance, which makes these architectures better suited to NLP-style scenarios. Attention serves the same objective by putting extra focus on specific words (and attention can be used with CNNs too, of course).
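A toy sketch of that memory idea (not a real RNN, just the recurrence pattern with made-up numbers): a hidden state is updated word by word, so earlier words still influence it, but with a decay below 1 their contribution shrinks with distance, which is the "memory based on distance" behaviour described above.

```python
def rnn_like_memory(word_signals, decay=0.5):
    """Fold a sequence into one hidden state, letting old context fade."""
    h = 0.0
    for x in word_signals:
        h = decay * h + x  # previous context decays, new word is added
    return h

# The first word's signal (1.0) has been scaled by decay**3 = 0.125
# by the time the fourth word is read; a recent word keeps full weight.
print(rnn_like_memory([1.0, 0.0, 0.0, 0.0]))  # 0.125
print(rnn_like_memory([0.0, 0.0, 0.0, 1.0]))  # 1.0
```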
Now for the original question: can a CNN be used in place of an RNN? Yes, but in that case you have to control how much memory the model has yourself (via the CNN's stride, window size, etc.), based on the sentences you want to understand or translate.
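To illustrate that trade-off, here is a minimal 1D convolution over a token sequence (the token values are stand-ins for word embeddings): the context a position sees is exactly the kernel's window size that you picked, and nothing beyond it.

```python
def conv1d(sequence, kernel):
    """Slide a fixed-size kernel over the sequence (stride 1, no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

tokens = [1, 2, 3, 4, 5]         # toy stand-ins for word embeddings
print(conv1d(tokens, [1, 1]))     # window of 2 words: [3, 5, 7, 9]
print(conv1d(tokens, [1, 1, 1]))  # window of 3 words: [6, 9, 12]
```

Widening the window (or stacking/dilating layers) grows the receptive field, but unlike an RNN's hidden state it never extends past what you explicitly configured.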
Simply put, RNNs are just better suited than CNNs to this problem, so the community has put more effort into them.