Questions tagged [text-classification]

For questions about text classification, the task of assigning predefined categories (or classes) to free-text documents.

290 questions
67
votes
4 answers

What is the purpose of the [CLS] token and why is its encoding output important?

I am reading this article on how to use BERT by Jay Alammar and I understand things up until: For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything…
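As a rough illustration of the "slice of the cube" mentioned in the excerpt: assuming the encoder output is shaped (batch, sequence length, hidden size) and that [CLS] sits at position 0 of every sequence, selecting it is a plain index. The sketch below uses nested Python lists with made-up numbers in place of real BERT output, and `select_cls` is a hypothetical helper, not part of any library.

```python
# Toy stand-in for BERT output: 2 sentences, 3 token positions, hidden size 4.
# Assumption: position 0 of each sentence holds the [CLS] vector.
output = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.1, 3.2, 3.3], [4.0, 4.1, 4.2, 4.3]],
]


def select_cls(batch_output):
    """Keep the position-0 vector of every sequence and drop the rest."""
    return [sequence[0] for sequence in batch_output]


cls_vectors = select_cls(output)
print(len(cls_vectors), len(cls_vectors[0]))
```

With an array or tensor library the same selection is typically written as `output[:, 0, :]`; the resulting one-vector-per-sentence matrix is what a downstream classifier would consume.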
9
votes
2 answers

Effect of Stop-Word Removal on Transformers for Text Classification

The domain here is essentially topic classification, so not necessarily a problem where stop-words have an impact on the analysis (as opposed to, say, sentiment analysis where structure can affect meaning). With respect to the positional encoding…
6
votes
1 answer

Using trainable=True in Keras Embedding gave better performance

The author of Keras [1] suggests using trainable=False with the Keras embedding layer to prevent the weights from being updated during training. But in my experience, I have always obtained better performance (lower error in regression)…
6
votes
1 answer

How to use the NDCG metric for binary relevance

I am working on a ranking problem to predict the single right document for a user query, and I use the NDCG metric to evaluate the model. Given the details: Queries (Q), Result Document (D), Relevance score. But the relevance score is a…
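For reference, a minimal sketch of NDCG under the binary-relevance setting named in the title, assuming each ranked document is labelled 1 (relevant) or 0 (not relevant) and using the common `rel / log2(rank + 1)` discount. The function names `dcg` and `ndcg` are illustrative, not taken from the question.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# One relevant document (1) among otherwise irrelevant ones (0).
print(ndcg([1, 0, 0]))  # relevant document ranked first
print(ndcg([0, 0, 1]))  # relevant document ranked third
```

With exactly one relevant document per query, this reduces to 1 / log2(rank + 1), where rank is the position of that document in the returned list.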
5
votes
1 answer

Text classification based on n-grams and similarity

I have tried to cluster a hundred texts using k-means clustering. I would like to consider other algorithms to group texts based on their content and try to spot news items not related to the others (different topic). I would like to know if there is some…
4
votes
1 answer

How to preprocess with NLP a big dataset for text classification

TL;DR I've never done NLP before and I feel like I'm not doing it the right way. I'd like to know if I've really been doing things wrong from the beginning or if there's still hope of fixing the problems mentioned later. Some basic info: I'm trying…
4
votes
1 answer

How to include categorical fields to enhance a text classification

I have a question about how to add more categorical fields to a classification problem. My dataset initially had 4 fields: Date Text Short_Mex Username Label 01/01/2020 I…
4
votes
0 answers

Bag of words: Prediction on new (out-of-sample) data

I'm working with a bag of words in R: library(tm); corpus = VCorpus(textsource); dtm = DocumentTermMatrix(corpus); dtm = as.matrix(dtm). I use the matrix dtm to train a lasso model. Now I want to predict new (unseen) text. The problem is that I need…
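One common way to handle out-of-sample text with a bag of words is to fix the vocabulary on the training documents and count only those words in new text. The sketch below shows that idea in plain Python rather than R's tm package; `fit_vocabulary` and `transform` are hypothetical helper names, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def fit_vocabulary(train_docs):
    """Collect the sorted set of words seen in the training documents."""
    return sorted({word for doc in train_docs for word in doc.lower().split()})


def transform(docs, vocabulary):
    """Count vocabulary words per document; words outside it are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows


train_docs = ["the cat sat", "the dog barked"]
vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)

# New text is mapped onto the training columns; "loudly" is unseen and dropped.
new_matrix = transform(["the dog sat loudly"], vocabulary)
print(vocabulary)
print(new_matrix)
```

Because the new matrix has the same columns in the same order as the training matrix, a model fitted on the training matrix can score it directly.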
Peter
4
votes
2 answers

What are the exact differences between Word Embedding and Word Vectorization?

I am learning NLP. I have tried to figure out the exact difference between Word Embedding and Word Vectorization. However, it seems that some articles use these terms interchangeably. But I think there must be some difference. In…
Nahid
3
votes
2 answers

Over-sampling: is my model over-fitting?

I would like to ask how to judge whether the following results are good or not:
OVER-SAMPLING
     precision  recall  f1-score  support
0.0  1.00       0.85    0.92      873
1.0  0.87       …
3
votes
1 answer

Predictive output with your own model built

I would like to better understand how to create a machine learning algorithm from scratch using my own model, built on boolean values (for example, # of words in a text, # of punctuation marks, # of capital letters, and so on), to determine if…
3
votes
2 answers

Is there any way to plot a ROC curve for an ensemble hard voting classifier?

I am working on a multi-class text classification problem using ensemble learning. I chose hard voting as the ensemble technique. I tried to plot a ROC curve for my ensemble method, but it didn't work, showing the…
3
votes
1 answer

Use a genetic algorithm for feature selection in text classification

How can I apply a genetic algorithm for feature selection in text classification in Python? I need to use a GA to select the most relevant features in text classification.
3
votes
1 answer

Text vectorizer that captures feature offset in the text?

I'm using sklearn's TfidfVectorizer to extract features from text for text classification. I believe the information I need tends to be at the beginning of the document, so I would like to somehow capture the offset of each feature per document…
3
votes
1 answer

Doubt on scope of text classification problem

I have a dataset that describes sellers who sell various brands. I need to identify the source of each seller (where he bought the brands he is selling). (Dimension of dataset: 11,29,490 rows and 2 columns: Seller and brand) e.g.:…