Most Popular

1500 questions
8
votes
2 answers

Text similarity with sentence embeddings

I'm trying to calculate the similarity between texts of various lengths. My current approach is the following: using Universal Sentence Encoder, I convert the text to a set of vectors. I average these vectors to create the final feature vector. I compare…
8
votes
4 answers

How to learn spam email detection?

I want to learn how a spam email detector is built. I'm not trying to build a commercial product; it'll be a serious learning exercise for me. Therefore, I'm looking for resources, such as existing projects, source code, articles, papers, etc., that I…
ABCD
8
votes
2 answers

Is There a Way to Re-Calibrate Predicted Probabilities After Using Class Weights?

I have classification data with far more negative instances than positive instances. I have used class weights in my models and have achieved the discrimination I want but the predicted probabilities from the models do not match the actual…
8
votes
2 answers

Time-series prediction: Model & data assumptions in AI/ML models vs conventional models

I was wondering if there is a good paper out there that discusses model and data assumptions in AI/ML approaches. For example, if you look at Time Series Modelling (Estimation or Prediction) with Linear models or (G)ARCH/ARMA processes, there…
8
votes
4 answers

Why is there a difference between predicting on Validation set and Test set?

I have an XGBoost model trying to predict if a currency will go up or down next period (5 min). I have a dataset from 2004 to 2018. I randomly split the data into 95% train and 5% validation, and the accuracy on the validation set is up to 55%. When…
DBSE
8
votes
1 answer

Complex Chunking with NLTK

I am trying to figure out how to use NLTK's cascading chunker as per Chapter 7 of the NLTK book. Unfortunately, I'm running into a few issues when performing non-trivial chunking measures. Let's start with this phrase: "adventure movies between 2000…
grill
8
votes
1 answer

Gensim LDA model: return keywords based on relevance (λ - lambda) value

I am using the gensim library for topic modeling, more specifically LDA. I created my corpus, my dictionary, and my LDA model. With the help of the pyLDAvis library I visualized the results. When I print the words with the highest probability on…
8
votes
1 answer

Which classification algorithms to try for classifying text data into 300 categories

I have 40,000 rows of text data from the health care domain. The data has one column for the text (2-5 sentences) and one column for its category. I want to classify it into 300 categories. Some categories are independent while some are somewhat related.…
Alok Nayak
8
votes
2 answers

How to use Graph Neural Network to predict relationships between nodes with pytorch_geometric?

Let's say I have a partly connected graph that represents members of many unrelated communities. I would like to predict the possible friendships between members of the same community: on a sliding scale from 0 to 10, how likely would they like…
Soerendip
8
votes
5 answers

What is the state of the art in question generation with NLP?

I was trying out various projects available for question generation on GitHub, namely NQG, question-generation, and a lot of others, but I don't see good results from them: either they have very bad question formation or the questions generated are…
Sundeep Pidugu
8
votes
2 answers

Why is taking the gradient of the average error in SGD not correct, but rather the average of the gradients of single errors?

I am a little confused about taking averages in cost functions and SGD. So far I always thought in SGD you would compute the average error for a batch and then backpropagate it. But then I was told in a comment on this question that that was wrong.…
8
votes
1 answer

How does class_weight work in Decision Tree

The scikit-learn implementation of DecisionTreeClassifier has a class_weight parameter. As per the documentation: Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. and The…
Supratim Haldar
8
votes
2 answers

Which classification algorithms are negatively affected by class imbalances?

I've seen a few posts and papers floating around the web (mostly those related to over/undersampling, SMOTE, and cost-sensitive training) that, when discussing class imbalance, specify that certain algorithms are negatively impacted by class…
8
votes
3 answers

Isolation forest sklearn contamination param

I am working on an unsupervised anomaly detection task on time series data using an isolation forest algorithm. I am developing it in Python, more specifically using scikit-learn. I found a lot of examples on this, but what is not very clear is how to…
8
votes
4 answers

What is the term for when a model acts on the thing being modeled and thus changes the concept?

I'm trying to see if there is a conventional term for this concept to help me in my literature research and writing. When a machine learning model causes an action to be taken in the real world that affects future instances, what is that called? …
jsmith54