CNNs and RNNs have different architectures and are designed to solve different problems.
Images have a lot of pixels and therefore a lot of features. Dropping some of those features doesn't change much about what the image conveys, and CNNs are designed to do exactly that reduction (via convolution and pooling).
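A minimal sketch of that reduction, in pure Python (no framework): 2x2 max pooling halves each spatial dimension, keeping only the strongest value in each window, so the feature count drops by 4x while the dominant activations survive.

```python
def max_pool_2x2(image):
    """image: list of rows of numbers, with even dimensions.
    Returns the image downsampled by taking the max of each 2x2 window."""
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            window = [image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1]]
            row.append(max(window))
        pooled.append(row)
    return pooled

# A made-up 4x4 "image" shrinks to 2x2 but keeps its strongest responses.
img = [[1, 3, 2, 0],
       [4, 2, 1, 1],
       [0, 1, 5, 6],
       [2, 2, 7, 8]]
print(max_pool_2x2(img))  # [[4, 2], [2, 8]]
```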
NLP tasks are context driven. The farther a word is from the current word in a sentence, the less it usually contributes to the current word's meaning. RNNs/LSTMs/Transformers are designed to maintain that kind of memory over distance, which makes these architectures better suited to NLP-style scenarios. Attention serves the same objective by putting extra focus on specific words (and attention can be used with CNNs too, of course).
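A toy sketch of that memory idea (not a real RNN, just the recurrence pattern with made-up numbers): a hidden state is updated word by word, so earlier words still influence it, but with a decay below 1 their contribution shrinks with distance, which is the "memory based on distance" behaviour described above.

```python
def rnn_like_memory(word_signals, decay=0.5):
    """Fold a sequence into one hidden state, letting old context fade."""
    h = 0.0
    for x in word_signals:
        h = decay * h + x  # previous context decays, new word is added
    return h

# The first word's signal (1.0) has been scaled by decay**3 = 0.125
# by the time the fourth word is read; a recent word keeps full weight.
print(rnn_like_memory([1.0, 0.0, 0.0, 0.0]))  # 0.125
print(rnn_like_memory([0.0, 0.0, 0.0, 1.0]))  # 1.0
```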
Now for the original question: can a CNN be used in place of an RNN? Yes, but in that case you have to control how much memory the model has yourself (via the CNN's stride, window size, etc.), based on the sentences you want to understand or translate.
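To illustrate that trade-off, here is a minimal 1D convolution over a token sequence (the token values are stand-ins for word embeddings): the context a position sees is exactly the kernel's window size that you picked, and nothing beyond it.

```python
def conv1d(sequence, kernel):
    """Slide a fixed-size kernel over the sequence (stride 1, no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

tokens = [1, 2, 3, 4, 5]         # toy stand-ins for word embeddings
print(conv1d(tokens, [1, 1]))     # window of 2 words: [3, 5, 7, 9]
print(conv1d(tokens, [1, 1, 1]))  # window of 3 words: [6, 9, 12]
```

Widening the window (or stacking/dilating layers) grows the receptive field, but unlike an RNN's hidden state it never extends past what you explicitly configured.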
Simply put, RNNs are just better suited than CNNs to this problem, so the community has put more effort into them.