Automatic annotation of medical text data

Question

I have a dataset of 30,000 whole-genome sequence analyses. Each sample comes with a free-text description that does not follow any fixed format. I want to annotate each sample with the disease and the particular tissue that characterize it.

I'm not into text mining, so I don't know which tools I can use. Any suggestions?

So you have some kind of free text? **Any chance of an example or two?** Do you know what the set of diseases you are looking for is? Does the actual gene sequence (GCCATATA etc) enter into this? — Spacedman, Apr 11 '16 at 06:54
I am not familiar with genome analysis. What is a "sample"? Is it a single record? And what is the "tissue"? — Pieter, Aug 12 '16 at 22:56

score 2 · Answer 1 · edited May 30 '17 at 14:50

You could use linear regression on the genome sequence to predict the occurrence of words in the description. More specifically:

Use dummy variables to encode the genome sequence.
Use stemming to map different inflections of the same word to a single token.
Use a bag-of-words representation for the words in the description.
Scale the word counts $w_i$, for example with $\log(w_i+1)$ or the more advanced TF-IDF.
Since you have a lot of independent variables (maybe more than the number of records?), you should regularize the model. Lasso is a good choice if you want a sparse model; use ridge regression if you want a zero-centered Gaussian prior that shrinks the coefficients without setting them exactly to zero.
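The encoding and scaling steps above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the toy sequences and descriptions are invented, the helper names `one_hot`, `bag_of_words` and `log_scaled` are hypothetical, and stemming is left out (a stemmer would be applied to each token before counting).

```python
import math
from collections import Counter

def one_hot(sequence, alphabet="ACGT"):
    """Encode each base as dummy (0/1) variables, one per alphabet symbol."""
    return [1.0 if base == symbol else 0.0
            for base in sequence for symbol in alphabet]

def bag_of_words(text):
    """Lowercase, split on whitespace and count tokens (no stemming here)."""
    return Counter(text.lower().split())

def log_scaled(counts, vocabulary):
    """Return log(w_i + 1) for every word, in a fixed vocabulary order."""
    return [math.log(counts[word] + 1) for word in vocabulary]

# Toy data, invented for illustration only.
sequences = ["ACGT", "TTGA"]
descriptions = ["liver tumor liver biopsy", "healthy lung tissue"]

X = [one_hot(s) for s in sequences]            # independent variables
bags = [bag_of_words(d) for d in descriptions]
vocabulary = sorted(set().union(*bags))        # shared word order for all samples
Y = [log_scaled(b, vocabulary) for b in bags]  # one regression target per word
```

Each row of `X` holds the dummy variables for one sample and each row of `Y` holds the scaled word counts that the regularized regression would be trained to predict.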

You can use this method to predict which words are typical for a gene sequence and, hence, to characterize the sequence.

You could use the intermediate result of the linear model to see what tissue is important for the prediction. The important ones are the dummy variables that are "on" and have high coefficients. Because you have multiple outputs, you could simplify this by using only the coefficients of the top-n most likely words.
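A minimal sketch of that coefficient inspection, assuming NumPy is available and using closed-form ridge regression without an intercept. The toy matrices and the helper names `fit_ridge` and `important_features` are hypothetical. Lasso is not shown because it has no closed-form solution; a library such as scikit-learn would typically be used for it.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge: W = (X^T X + alpha * I)^(-1) X^T Y, one column per word."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def important_features(W, x, n_words=1, k=3):
    """Pick the top-n predicted words, then rank the 'on' dummy variables
    by their summed coefficients for those words."""
    scores = x @ W                                   # predicted scaled counts per word
    top_words = np.argsort(scores)[::-1][:n_words]
    contribution = (W[:, top_words] * x[:, None]).sum(axis=1)
    active = np.flatnonzero(x)                       # dummy variables that are "on"
    ranked = active[np.argsort(contribution[active])[::-1][:k]]
    return top_words.tolist(), ranked.tolist()

# Toy data, invented for illustration: 4 samples, 3 dummy variables, 2 words.
X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 0]], dtype=float)

W = fit_ridge(X, Y, alpha=1.0)
words, features = important_features(W, X[0])
```

Here `words` holds the indices of the most likely words for the first sample and `features` the indices of the active dummy variables that contribute most to them.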

score 0 · Answer 2 · answered Apr 14 '16 at 12:13

Is all the information you want for each sequence contained in the text that's attached to it? If so, just compare a list of diseases and a list of tissues against every text. Lists of diseases can be found e.g. on the CDC website.
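A minimal sketch of that list comparison, assuming plain case-insensitive whole-word matching is good enough. The term lists and the description below are invented examples, and `annotate` is a hypothetical helper; real lists would come from a curated source.

```python
import re

def annotate(text, terms):
    """Return the terms that occur in text as whole words, ignoring case."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found

# Hypothetical term lists and description, for illustration only.
diseases = ["melanoma", "breast cancer"]
tissues = ["skin", "breast", "liver"]
description = "Biopsy of Skin lesion, patient diagnosed with melanoma"

annotation = {
    "disease": annotate(description, diseases),
    "tissue": annotate(description, tissues),
}
```

Exact matching like this misses synonyms, abbreviations and misspellings, so descriptions with an empty result are worth reviewing by hand.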

Sharon
