Questions tagged [text-mining]

Text mining is a subset of data mining concerned with extracting information from text data by recognizing patterns. The goal is often to classify a given document automatically into one of several categories, and to improve that classification over time, which makes it an example of machine learning. Spam filters for email are one example of this type of text mining.

Text mining is the process of deriving high-quality information from unstructured (textual) information. Possible applications of text mining include:

  • Comments in survey responses
  • Customer messages, emails, complaints, etc.
  • Investigating competitors by crawling their web sites


572 questions
39
votes
5 answers

What are some standard ways of computing the distance between documents?

When I say "document", I have in mind web pages like Wikipedia articles and news stories. I prefer answers giving either vanilla lexical distance metrics or state-of-the-art semantic distance metrics, with stronger preference for the latter.
Matt
  • 811
  • 1
  • 7
  • 12
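
For the lexical end of the spectrum mentioned in this question, one common baseline is cosine distance over bag-of-words counts. The following is a minimal sketch using only the Python standard library; it is not taken from the answers to the question, and the tokenizer and the function names `tokenize` and `cosine_distance` are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs; a deliberately naive tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_distance(doc_a, doc_b):
    # Bag-of-words term counts for each document.
    a, b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    # Dot product over the terms the two documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty documents
    return 1.0 - dot / (norm_a * norm_b)
```

Documents with identical term counts get a distance close to 0, and documents with no terms in common get 1. Semantic metrics would replace the raw counts with learned representations, which this sketch does not attempt.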
31
votes
4 answers

What algorithms should I use to perform job classification based on resume data?

Note that I am doing everything in R. The problem is as follows: I have a list of resumes (CVs). Some candidates have prior work experience and some don't. The goal is this: based on the text of their CVs, I want to classify…
user1769197
  • 431
  • 1
  • 5
  • 5
31
votes
3 answers

General approach to extract key text from sentence (nlp)

Given a sentence like "Complimentary gym access for two for the length of stay ($12 value per person per day)", what general approach can I take to identify the word "gym" or the phrase "gym access"?
William Falcon
  • 421
  • 1
  • 6
  • 7
30
votes
3 answers

What is the difference between text classification and topic models?

I know the difference between clustering and classification in machine learning, but I don't understand the difference between text classification and topic modeling for documents. Can I use topic modeling over documents to identify a topic? Can I…
Ali
  • 361
  • 1
  • 4
  • 6
29
votes
1 answer

What is Hellinger Distance and when to use it?

I am interested in knowing what really happens in Hellinger distance (in simple terms). I am also interested in knowing which types of problems Hellinger distance can be used for. What are the benefits of using Hellinger distance?
Smith Volka
  • 665
  • 2
  • 6
  • 13
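
In simple terms, for two discrete probability distributions (for example, normalized word-frequency vectors of two documents), the Hellinger distance is the Euclidean distance between their element-wise square roots, scaled by 1/sqrt(2) so that it lies between 0 and 1. The following is a minimal sketch of that definition; the function name `hellinger` is illustrative, and the inputs are assumed to be non-negative sequences of equal length that each sum to 1.

```python
import math


def hellinger(p, q):
    # p and q are assumed to be discrete probability distributions
    # over the same outcomes (non-negative, each summing to 1).
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    # The 1/sqrt(2) factor bounds the result to the interval [0, 1].
    return math.sqrt(squared) / math.sqrt(2)
```

Identical distributions give 0, and distributions with no overlapping mass give 1. Unlike the Kullback-Leibler divergence, the measure is symmetric and stays finite when one distribution assigns zero probability to an outcome.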
24
votes
3 answers

Keyword/phrase extraction from Text using Deep Learning libraries

Perhaps this is too broad, but I am looking for references on how to use deep learning in a text summarization task. I have already implemented text summarization using standard word-frequency approaches and sentence-ranking, but I'd like to explore…
23
votes
3 answers

How to grow a list of related words based on initial keywords?

I recently saw a cool feature that was once available in Google Sheets: you start by writing a few related keywords in consecutive cells, say: "blue", "green", "yellow", and it automatically generates similar keywords (in this case, other colors).…
nassimhddd
  • 587
  • 4
  • 12
23
votes
5 answers

How to annotate text documents with meta-data?

Having a lot of text documents (in natural language, unstructured), what are the possible ways of annotating them with some semantic meta-data? For example, consider a short document: I saw the company's manager last day. To be able to extract…
Amir Ali Akbari
  • 1,393
  • 3
  • 13
  • 25
22
votes
2 answers

Doc2Vec - How to label the paragraphs (gensim)

I am wondering how to label (tag) sentences / paragraphs / documents with doc2vec in gensim - from a practical standpoint. Do you need to have each sentence / paragraph / document with its own unique label (e.g. "Sent_123")? This seems useful if…
B_Miner
  • 702
  • 1
  • 7
  • 20
20
votes
2 answers

Extract most informative parts of text from documents

Are there any articles or discussions about extracting the parts of a text that hold the most information about the current document? For example, I have a large corpus of documents from the same domain. There are parts of text that hold the key…
MaticDiba
  • 651
  • 1
  • 6
  • 10
20
votes
4 answers

What is the difference between a hashing vectorizer and a tfidf vectorizer

I'm converting a corpus of text documents into word vectors for each document. I've tried this using a TfidfVectorizer and a HashingVectorizer. I understand that a HashingVectorizer does not take into consideration the IDF scores like a…
Minu
  • 795
  • 2
  • 8
  • 18
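
The contrast raised in this question can be illustrated without any library. The following is a rough sketch of the two ideas only and does not reproduce scikit-learn's implementations (which use a different hash function, smoothing, and normalization); the function names, the `n_features` value, and the toy corpus are assumptions made for illustration.

```python
import math
import zlib
from collections import Counter


def hashing_vector(tokens, n_features=8):
    # Stateless: each token is mapped to a bucket by a hash function,
    # so no vocabulary is stored and collisions are possible.
    vec = [0] * n_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_features] += 1
    return vec


def tfidf_vectors(corpus):
    # Stateful: needs a pass over the whole corpus to learn the
    # vocabulary and the document frequency of each term.
    vocab = sorted({tok for doc in corpus for tok in doc})
    n_docs = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) + 1.0 for tok in vocab}
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


corpus = [["text", "mining", "text"], ["data", "mining"]]  # toy, pre-tokenized
vocab, vectors = tfidf_vectors(corpus)
hashed = [hashing_vector(doc) for doc in corpus]
```

The hashed vectors have a fixed length regardless of the corpus and cannot be mapped back to terms, while the TF-IDF vectors have one column per vocabulary term and down-weight terms that appear in many documents.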
17
votes
1 answer

Algorithms for text clustering

I have the problem of clustering a huge number of sentences into groups by their meanings. This is similar to the problem where you have lots of sentences and want to group them by their meanings. What algorithms are suggested to do this? I don't know…
Maxim Galushka
  • 303
  • 1
  • 2
  • 7
16
votes
4 answers

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled. So far I've found different solutions but I think that they are quite old and not very efficient. I'm sure some better methods exist, so…
Stéphanie C
  • 281
  • 1
  • 2
  • 5
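
As a simple baseline for the kind of matching described in this question, addresses can be normalized and then compared with an edit-based similarity ratio. The following is a minimal sketch using Python's standard `difflib` module; it is not one of the solutions the asker refers to, and the abbreviation table and function names are illustrative assumptions (a real table would need care, since "St" can also mean "Saint").

```python
import re
from difflib import SequenceMatcher

# Illustrative abbreviation table; an assumption, not an exhaustive list.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def normalize(address):
    # Lowercase, drop punctuation, and expand known abbreviations.
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

For example, `similarity("12 Main St.", "12 main street")` is 1.0 after normalization, and a misspelling such as "12 Main Stret" still scores close to 1. Choosing a threshold for declaring a match is left to the application.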
15
votes
4 answers

Alternatives to TF-IDF and Cosine Similarity when comparing documents of differing formats

I've been working on a small, personal project which takes a user's job skills and suggests the most suitable career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows: 1) Process…
Richard Knoche
  • 151
  • 1
  • 1
  • 3
14
votes
1 answer

Recognize a grammar in a sequence of fuzzy tokens

I have text documents which mainly contain lists of Items. Each Item is a group of several tokens of different types: FirstName, LastName, BirthDate, PhoneNumber, City, Occupation, etc. A token is a group of words. Items can lie on several…