I am currently working on a prototype of an application that should be able to interact with user interfaces.
Now, every user interface has some common elements, like buttons, scrollbars, input fields, etc.
I would like to use machine learning to "interpret" such user interfaces, so that I can later input a user interface as an image, for example, and let the prototype "try out" the interface, meaning: clicking buttons, using scrollbars, typing some text into input fields, etc.
I know that this would have to be done using Image Recognition, since there are many different UIs.
I am specifically interested in websites, Adobe Reader with an open PDF (which in turn can be a form, etc.), and Word with an open document (which again can contain forms, etc.).
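To make it more concrete, here is a rough sketch of the kind of pipeline I have in mind (Python, using pyautogui for the actual clicking/typing). The detection step is just a placeholder, since that is exactly the part I'm asking about, and the element types and bounding-box format are my own assumptions:

```python
import pyautogui  # pip install pyautogui -- cross-platform mouse/keyboard control


def detect_ui_elements(screenshot):
    """Placeholder for the ML part I'm asking about: some image-recognition
    model that takes a screenshot and returns detected elements, e.g. a list
    of dicts like {"type": "button", "bbox": (x, y, w, h)}."""
    raise NotImplementedError("This is the component I'm looking for research/tools on")


def try_out_interface():
    screenshot = pyautogui.screenshot()  # grab the current screen
    for element in detect_ui_elements(screenshot):
        x, y, w, h = element["bbox"]
        cx, cy = x + w // 2, y + h // 2  # centre of the detected element
        if element["type"] == "button":
            pyautogui.click(cx, cy)       # "try out" the button
        elif element["type"] == "input_field":
            pyautogui.click(cx, cy)
            pyautogui.write("test input")  # type some placeholder text
        elif element["type"] == "scrollbar":
            pyautogui.moveTo(cx, cy)
            pyautogui.scroll(-200)         # scroll down a bit
```

The interaction part seems straightforward; it's really `detect_ui_elements` (recognising buttons, scrollbars, input fields, etc. from an image) that I don't know how to approach.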
Now my main question is whether there is already some research going on in this field that I can use, or even an existing tool for parts of the process.
Any help is appreciated :)