Questions tagged [text-mining]

Refers to a subset of data mining concerned with extracting information from text by recognizing patterns. The goal of text mining is often to classify a given document automatically into one of a number of categories, and to improve this performance dynamically, making it an example of machine learning. Spam filters for email are one example of this type of text mining.

Text mining is the process of deriving high-quality information from unstructured (textual) data. Possible applications of text mining include:

Comments in survey responses
Customer messages, emails, complaints, etc.
Investigating competitors by crawling their websites

What are some standard ways of computing the distance between documents?

When I say "document", I have in mind web pages like Wikipedia articles and news stories. I prefer answers giving either vanilla lexical distance metrics or state-of-the-art semantic distance metrics, with stronger preference for the latter.

asked Jul 05 '14 at 16:10

Matt
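
As a rough illustration of the "vanilla lexical distance" this question mentions, the sketch below computes cosine distance between raw term-count vectors. The function names `tokenize` and `cosine_distance` are hypothetical, the tokenizer is deliberately naive, and no weighting such as TF-IDF is applied; it is a baseline under those assumptions, not a recommended metric.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_distance(doc_a, doc_b):
    """Return 1 - cosine similarity of the two documents' term-count vectors."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))
    # Dot product over the terms the two documents share.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty document is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Because it only compares surface tokens, two documents that use different words for the same idea come out as far apart, which is the gap semantic metrics try to close.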

4 answers

What algorithms should I use to perform job classification based on resume data?

Note that I am doing everything in R. The problem goes as follows: I have a list of resumes (CVs). Some candidates have prior work experience and some don't. The goal is this: based on the text of their CVs, I want to classify…

machine-learning classification nlp text-mining

asked Jul 03 '14 at 16:11

user1769197

3 answers

General approach to extract key text from sentence (nlp)

Given a sentence like "Complimentary gym access for two for the length of stay ($12 value per person per day)", what general approach can I take to identify the word "gym" or the phrase "gym access"?

machine-learning nlp text-mining data-cleaning

asked Mar 13 '15 at 16:41

William Falcon

3 answers

What is the difference between text classification and topic models?

I know the difference between clustering and classification in machine learning, but I don't understand the difference between text classification and topic modeling for documents. Can I use topic modeling over documents to identify a topic? Can I…

classification text-mining topic-model

asked Aug 12 '14 at 03:50

Ali

1 answer

What is Hellinger Distance and when to use it?

I am interested in knowing what really happens in Hellinger distance (in simple terms). I am also interested in knowing what types of problems Hellinger distance can be used for. What are the benefits of using Hellinger distance?

machine-learning data-mining text-mining distance

asked Aug 31 '17 at 02:11

Smith Volka
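
For reference on the Hellinger question above: for two discrete probability distributions P and Q over the same outcomes, the Hellinger distance is the Euclidean distance between their element-wise square roots, scaled by 1/sqrt(2) so that it lies between 0 (identical) and 1 (no shared support). A minimal sketch follows; the function name `hellinger` is hypothetical, and the inputs are assumed to be non-negative and to sum to 1. In a text-mining setting they might be, for example, normalized term frequencies or per-document topic proportions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    # Sum of squared differences between the square roots of the probabilities.
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric and stays finite when one distribution assigns zero probability to an outcome.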

3 answers

Keyword/phrase extraction from Text using Deep Learning libraries

Perhaps this is too broad, but I am looking for references on how to use deep learning in a text summarization task. I have already implemented text summarization using standard word-frequency approaches and sentence-ranking, but I'd like to explore…

neural-network text-mining deep-learning beginner tensorflow

asked Feb 03 '16 at 10:56

shanky_thebearer

3 answers

How to grow a list of related words based on initial keywords?

I recently saw a cool feature that was once available in Google Sheets: you start by writing a few related keywords in consecutive cells, say: "blue", "green", "yellow", and it automatically generates similar keywords (in this case, other colors).…

nlp text-mining freebase

asked Jun 17 '14 at 06:05

nassimhddd

5 answers

How to annotate text documents with meta-data?

Given a large number of text documents (in natural language, unstructured), what are the possible ways of annotating them with semantic meta-data? For example, consider a short document: I saw the company's manager last day. To be able to extract…

nlp metadata data-cleaning text-mining

asked May 29 '14 at 20:11

Amir Ali Akbari

2 answers

Doc2Vec - How to label the paragraphs (gensim)

I am wondering how to label (tag) sentences / paragraphs / documents with doc2vec in gensim - from a practical standpoint. Do you need to have each sentence / paragraph / document with its own unique label (e.g. "Sent_123")? This seems useful if…

machine-learning text-mining word-embeddings word2vec

asked Feb 12 '16 at 02:22

B_Miner

2 answers

Extract most informative parts of text from documents

Are there any articles or discussions about extracting the part of a text that holds the most information about the current document? For example, I have a large corpus of documents from the same domain. There are parts of text that hold the key…

nlp text-mining

asked Dec 08 '14 at 14:51

MaticDiba

4 answers

What is the difference between a hashing vectorizer and a tfidf vectorizer?

I'm converting a corpus of text documents into word vectors for each document. I've tried this using a TfidfVectorizer and a HashingVectorizer. I understand that a HashingVectorizer does not take into consideration the IDF scores like a…

nlp scikit-learn text-mining tfidf hashingvectorizer

asked Aug 14 '17 at 16:42

Minu
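
To make the contrast in this question concrete, here is a small standard-library sketch of the two ideas. It is not scikit-learn's implementation: the helper names (`hashing_vector`, `fit_idf`, `tfidf_vector`), the use of CRC32 as the hash, the bucket count, and the unsmoothed IDF formula are all assumptions chosen for illustration. The point it shows is that hashing needs no fitting step and stores no vocabulary, while TF-IDF must first see the whole corpus to learn document frequencies.

```python
import math
import zlib
from collections import Counter


def hashing_vector(tokens, n_buckets=8):
    """Stateless: each token maps to a bucket via a hash; no vocabulary is stored."""
    vec = [0] * n_buckets
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec


def fit_idf(corpus):
    """Stateful: needs a pass over the whole corpus to learn vocabulary and IDF."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    # Plain log(N / df); library implementations typically add smoothing.
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Weight term counts by the learned IDF; tokens unseen during fitting are dropped."""
    counts = Counter(tokens)
    return {tok: n * idf[tok] for tok, n in counts.items() if tok in idf}


# Toy corpus of pre-tokenized documents (made up for this example).
corpus = [["spam", "offer", "now"], ["meeting", "now"], ["spam", "spam", "offer"]]
idf = fit_idf(corpus)
```

The trade-off visible even in this toy version: the hashed vector has a fixed size and can be built one document at a time, but different tokens may collide in a bucket and bucket indices cannot be mapped back to words.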

1 answer

Algorithms for text clustering

I have a problem of clustering a huge number of sentences into groups by their meanings. This is similar to the problem where you have lots of sentences and want to group them by their meanings. What algorithms are suggested to do this? I don't know…

clustering text-mining algorithms scikit-learn

asked Aug 15 '14 at 13:10

Maxim Galushka

4 answers

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled. So far I've found different solutions but I think that they are quite old and not very efficient. I'm sure some better methods exist, so…

text-mining data-cleaning

asked Mar 21 '16 at 12:01

Stéphanie C
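
One simple baseline for the address-matching question above is to normalize both strings and then compare them with an edit-based similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the `ABBREVIATIONS` table and the function names `normalize` and `address_similarity` are hypothetical, and a real system would likely need locale-specific rules and a tuned threshold.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be far larger and locale-specific.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "apt": "apartment"}


def normalize(address):
    """Lowercase, strip punctuation, and expand a few common abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def address_similarity(a, b):
    """Similarity ratio in [0, 1] between the normalized address strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

Normalizing first handles format differences such as "St." versus "Street", while the ratio tolerates small misspellings; pairs scoring above a chosen cutoff would be treated as candidate matches.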

4 answers

Alternatives to TF-IDF and Cosine Similarity when comparing documents of differing formats

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows: 1) Process…

nlp text-mining similarity cosine-distance

asked Jan 02 '17 at 20:41

Richard Knoche

1 answer

Recognize a grammar in a sequence of fuzzy tokens

I have text documents which contain mainly lists of Items. Each Item is a group of several tokens of different types: FirstName, LastName, BirthDate, PhoneNumber, City, Occupation, etc. A token is a group of words. Items can lie on several…

data-mining clustering text-mining time-series correlation

asked Aug 08 '16 at 13:01

OoDeLally
