I am currently working on a prototype of an application that should be able to interact with user interfaces.
Now, every user interface has some common elements, like buttons, scrollbars, input fields, etc.
I would like to use machine learning to "interpret" such user interfaces, so that I can later input a user interface as an image, for example, and let the prototype "try out" the interface, meaning: clicking buttons, using scrollbars, typing some text into input fields, etc.
I know that this would have to be done using Image Recognition, since there are many different UIs.
I am specifically interested in websites, Adobe Reader with an open PDF (which in turn can be a form, etc.), and Word with an open document (which again can contain forms, etc.).
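To make it more concrete, here is a rough sketch of the kind of pipeline I have in mind (Python, using pyautogui for the actual clicking/typing). The detection step is just a placeholder, since that is exactly the part I'm asking about, and the element types and bounding-box format are my own assumptions:

```python
import pyautogui  # pip install pyautogui -- cross-platform mouse/keyboard control


def detect_ui_elements(screenshot):
    """Placeholder for the ML part I'm asking about: some image-recognition
    model that takes a screenshot and returns detected elements, e.g. a list
    of dicts like {"type": "button", "bbox": (x, y, w, h)}."""
    raise NotImplementedError("This is the component I'm looking for research/tools on")


def try_out_interface():
    screenshot = pyautogui.screenshot()  # grab the current screen
    for element in detect_ui_elements(screenshot):
        x, y, w, h = element["bbox"]
        cx, cy = x + w // 2, y + h // 2  # centre of the detected element
        if element["type"] == "button":
            pyautogui.click(cx, cy)       # "try out" the button
        elif element["type"] == "input_field":
            pyautogui.click(cx, cy)
            pyautogui.write("test input")  # type some placeholder text
        elif element["type"] == "scrollbar":
            pyautogui.moveTo(cx, cy)
            pyautogui.scroll(-200)         # scroll down a bit
```

The interaction part seems straightforward; it's really `detect_ui_elements` (recognising buttons, scrollbars, input fields, etc. from an image) that I don't know how to approach.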
Now my main question is whether there is already some research going on in this field that I can use, or even an existing tool for parts of the process.
Any help is appreciated :)