Leland's answer is exactly correct regarding why an autoencoder wouldn't be useful. Let me expand upon that point:
Autoencoders and other dimensionality reduction techniques attempt to keep objects that are "close" together in your high-dimensional space close in the lower-dimensional space as well. Often the notion of closeness the autoencoder learns yields a compressed representation that is useful for downstream tasks, but sometimes it does not.
Your categories most likely do exhibit structure in the context of your game: some categories behave similarly to others in particular situations. So dimensionality reduction could certainly be both possible and useful.
However, the problem is that your current representation of the categories exhibits no such structure. Every one-hot encoded category is exactly the same distance from every other category (the Euclidean distance between any two one-hot vectors is $\sqrt{2}$), so an autoencoder cannot possibly learn anything useful about how categories behave relative to one another.
This is where an embedding layer, as mentioned by ncasas in the comments, can come to the rescue. An embedding layer takes one-hot encoded categories (or equivalently, integer category indices) and learns a lower-dimensional embedding optimized for the task at hand. In other words, it is trained at the same time as the rest of your architecture, so it learns the embedding that best minimizes the error at the output of your network, which represents your end goal.
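Here is a minimal sketch of that idea in PyTorch (the layer sizes here are placeholder assumptions, not values from your problem):

```python
import torch
import torch.nn as nn

class CategoryModel(nn.Module):
    def __init__(self, n_categories=50, embed_dim=8):
        super().__init__()
        # A learned lookup table: one embed_dim-sized vector per category.
        self.embed = nn.Embedding(n_categories, embed_dim)
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, category_ids):
        # category_ids: integer indices, the dense equivalent of one-hot input.
        return self.head(self.embed(category_ids))
```

Because `embed.weight` receives gradients from the same task loss as everything else, the geometry of the learned vectors ends up reflecting how categories behave in your game rather than some arbitrary notion of closeness.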
Unfortunately, as you mentioned, your architecture doesn't quite support this, since you need pre-existing embeddings that you'd like to concatenate. If I am interpreting this correctly, you are saying you'd like to pass multiple categories in as part of a single input to the network. In that case, you can simply use a multi-hot encoding as input to an embedding layer, as sketched below.
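Multiplying a multi-hot vector by the embedding matrix is the same as summing the embedding rows of the active categories, so you can skip the explicit multi-hot vector entirely. A sketch using PyTorch's `nn.EmbeddingBag`, which computes that pooled sum directly (the category counts and indices are made up):

```python
import torch
import torch.nn as nn

embed = nn.EmbeddingBag(num_embeddings=50, embedding_dim=8, mode="sum")

# Two examples: categories {3, 17} and {5, 9, 41}, packed flat with offsets.
flat_ids = torch.tensor([3, 17, 5, 9, 41])
offsets = torch.tensor([0, 2])          # where each example starts in flat_ids
pooled = embed(flat_ids, offsets)       # shape: (2, 8), one vector per example
```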
If I am misinterpreting this and you really do need pretrained embeddings, then consider training them yourself! If you can find a simpler prediction task that takes the same categories as input, you can learn an embedding on that task and then reuse it for your more complex task. Reusing embeddings across related tasks is a well-tested method, and you can read more about it in the recent fast.ai post on categorical embeddings.
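If you go that route, transferring the learned embedding can be as simple as copying weights between models. A sketch under the assumption that both models expose a compatible `nn.Embedding` attribute named `embed` (`aux_model` and `main_model` are hypothetical names, not from your setup):

```python
import torch

# Hypothetical: aux_model was trained on the simpler task; main_model is the
# untrained model for the real task, with an embedding of matching shape.
with torch.no_grad():
    main_model.embed.weight.copy_(aux_model.embed.weight)
main_model.embed.weight.requires_grad = False  # freeze, or leave True to fine-tune
```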
To generate training data for the simpler task, consider simulating some simple behaviors in your environment that are good differentiators of how different monster classes behave. For example, you could have a group of $k$ monsters compete at some task, where the goal is to predict who will win. Since the number of possible groups grows as $\binom{n}{k}$ for $n$ monster classes, even a moderate $k$ should let you generate a large, non-repeating data set.
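What that data generation loop might look like, assuming a hypothetical, game-specific `simulate_contest` function that runs a group through the task and reports the winner:

```python
import random

def make_dataset(n_examples, n_categories, k):
    data = []
    for _ in range(n_examples):
        # Draw k distinct monster classes for this contest.
        contestants = random.sample(range(n_categories), k)
        # Hypothetical game-specific simulation; returns the winner's
        # position within the group, which serves as the label.
        winner = simulate_contest(contestants)
        data.append((contestants, winner))
    return data
```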
Once you have learned an embedding for individual monsters, you can take the average of the relevant embeddings as the input to your network for the original task. This method is used in *Deep Neural Networks for YouTube Recommendations*, and probably elsewhere too.
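Pooling the pretrained vectors is then a one-liner; `pretrained_embed` below is assumed to be the `nn.Embedding` learned on the simpler task (this is equivalent to `nn.EmbeddingBag` with `mode="mean"`):

```python
import torch

# Monsters present in one example of the original task (made-up indices).
monster_ids = torch.tensor([3, 17, 41])
vectors = pretrained_embed(monster_ids)   # (3, embed_dim) lookup
group_vector = vectors.mean(dim=0)        # (embed_dim,) fixed-size network input
```

Averaging keeps the input size fixed no matter how many monsters appear in a group, which is exactly why the YouTube paper pools variable-length histories this way.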