Identify given patterns in unstructured data like text files

Question

I wasn't sure whether to ask this here or on Stack Overflow, but since I am looking for research papers and algorithms and not only code, I decided to ask here.

When I have a text, I can manually write a regex to extract everything I want from the file. What I am looking for is an algorithm, or some research, that lets you highlight (as input) several occurrences of the same kind of repeated data in the text file, train on them, and then identify all the other occurrences that satisfy the same conditions as the ones you marked.

For example, let's say I have a text with several titles, each preceded by \n\n and followed by \n\n\n. That is easy with a regex, but I want to do it dynamically.
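To make the manual case concrete, here is a minimal sketch of the hand-written approach in Python. The sample text and the names `text`, `TITLE_RE` and `titles` are made up for illustration; the only thing taken from the example is the delimiter convention (\n\n before a title, \n\n\n after it).

```python
import re

# Hypothetical sample text: titles are preceded by "\n\n" and followed by "\n\n\n".
text = (
    "Intro paragraph.\n\n"
    "First Title\n\n\n"
    "Body of the first section.\n\n"
    "Second Title\n\n\n"
    "Body of the second section.\n"
)

# A title is a run of non-newline characters with a blank line before it
# and two blank lines after it. Lookarounds keep the delimiters out of the match.
TITLE_RE = re.compile(r"(?<=\n\n)([^\n]+)(?=\n\n\n)")

titles = TITLE_RE.findall(text)
print(titles)
```

The body paragraphs are not matched because they are followed by only two newlines, not three.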

One idea is to build an algorithm that takes examples and generates a regex automatically. But I am not aware of any research like this, and maybe there are other techniques that could achieve the same thing.
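As a rough illustration of that idea (not an established algorithm, just a naive sketch), one could generalise the characters surrounding the highlighted spans: take the longest suffix shared by all left contexts and the longest prefix shared by all right contexts, and use them as delimiters. The function names, the `window` parameter and the sample text below are all hypothetical.

```python
import re

def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

def induce_regex(text, spans, window=5):
    """Build a regex from highlighted (start, end) spans using their shared context."""
    lefts = [text[max(0, start - window):start] for start, _ in spans]
    rights = [text[end:end + window] for _, end in spans]
    # Longest common suffix = reversed common prefix of the reversed strings.
    left = common_prefix([l[::-1] for l in lefts])[::-1]
    right = common_prefix(rights)
    if not left or not right:
        raise ValueError("examples share no common delimiter within the window")
    # Lookarounds keep the delimiters out of the captured value;
    # "." does not match a newline, so each match stays on one line.
    return re.compile(f"(?<={re.escape(left)})(.+?)(?={re.escape(right)})")

def span_of(text, needle):
    """Simulate a user highlighting the first occurrence of `needle`."""
    start = text.index(needle)
    return (start, start + len(needle))

sample = (
    "Preface text\n\n"
    "Alpha\n\n\n"
    "Some body text.\n\n"
    "Beta\n\n\n"
    "More body text!\n\n"
    "Gamma\n\n\n"
    "Final body text?\n"
)

# The user highlights two titles; the third is found by the induced pattern.
pattern = induce_regex(sample, [span_of(sample, "Alpha"), span_of(sample, "Beta")])
found = pattern.findall(sample)
print(found)
```

With a single example the shared context is simply the whole window, so the pattern overfits; each additional highlighted example can only shorten the common delimiters, which makes the pattern more general. A real system would also need to generalise over character classes and handle negative examples, which this sketch does not attempt.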

Any ideas?

With only one example, it is hard to get a good grip on what the patterns you want to create regexes for might look like. Maybe more information would help? Is it, for instance, only blank lines that define the elements to search for? — image_doctor, Sep 01 '15 at 17:35
That is the problem, @image_doctor. It is not something specific; it could be anything that has been given as input. — Tasos, Sep 01 '15 at 17:52
Are there constraints on what constitutes well-formatted input? How do you know what you have to create a regex for? :) — image_doctor, Sep 01 '15 at 17:58
I was thinking about it like this: a text contains 100,000 occurrences of the same structure for a specific kind of data. A user marks a few of these occurrences. The algorithm creates a regex that can find the other 99,999. With each occurrence the user marks, the regex is improved. — Tasos, Sep 01 '15 at 18:19
Sequence analysis may be relevant, if what the user selects can be broken down into a sensible set of tokens. — image_doctor, Sep 01 '15 at 18:27

score 0 · Accepted Answer · answered Sep 02 '15 at 00:22

That is exactly what the Trifacta product does (in addition to other features). It uses the Wrangle language, a DSL (domain-specific language) designed for data manipulation. There is a much earlier research project called Wrangler from the same people. The Wrangler papers might give you ideas.
