Most Popular

1500 questions
31
votes
1 answer

What is an LB score in machine learning?

I was going through an article on the Kaggle blog. The author repeatedly mentions 'LB score' and 'LB fit' as metrics for the effectiveness of machine learning (along with the cross-validation (CV) score). Searching for the meaning of 'LB', I spent…
user345394
  • 505
  • 1
  • 4
  • 8
31
votes
3 answers

Neural Network for Multiple Output Regression

I have a dataset containing 34 input columns and 8 output columns. One way to solve the problem is to take the 34 inputs and build an individual regression model for each output column. I am wondering if this problem can be solved using just one model…
sjishan
  • 411
  • 1
  • 4
  • 6
31
votes
8 answers

How to count the number of missing values in each row in a Pandas dataframe?

How can I get the number of missing values in each row of a Pandas dataframe? I would like to split the dataframe into separate dataframes, each holding the rows that have the same number of missing values. Any suggestions?
Kaggle
  • 2,877
  • 5
  • 13
  • 8
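
One way the per-row count and the split described above might look, as a minimal sketch. The column names `x`, `y`, `z` and the sample values are made up for illustration; `isna`, `sum(axis=1)` and `groupby` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame; the values are illustrative only.
df = pd.DataFrame(
    {
        "x": [1, np.nan, 4, 7],
        "y": [2, 5, np.nan, 8],
        "z": [3, np.nan, 9, np.nan],
    }
)

# Count the missing values in each row.
missing_per_row = df.isna().sum(axis=1)

# Split into one sub-frame per distinct missing count.
groups = {int(n): part for n, part in df.groupby(missing_per_row)}
```

Here `groups[1]` would hold the rows with exactly one missing value, and so on for each count that occurs.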
30
votes
3 answers

What is the difference between text classification and topic models?

I know the difference between clustering and classification in machine learning, but I don't understand the difference between text classification and topic modeling for documents. Can I use topic modeling over documents to identify a topic? Can I…
Ali
  • 361
  • 1
  • 4
  • 6
30
votes
8 answers

Purpose of visualizing high dimensional data?

There are many techniques for visualizing high-dimensional datasets, such as t-SNE, Isomap, PCA, supervised PCA, etc. And we go through the motions of projecting the data down to a 2D or 3D space, so we have "pretty pictures". Some of these…
hlin117
  • 675
  • 1
  • 8
  • 11
30
votes
3 answers

What is a better input for Word2Vec?

This is more of a general NLP question. What is the appropriate input for training a word embedding, namely Word2Vec? Should each sentence of an article be a separate document in the corpus? Or should each article be a document in said…
wacax
  • 3,370
  • 4
  • 22
  • 45
30
votes
2 answers

How to interpret classification report of scikit-learn?

As you can see, it is a binary classification with LinearSVC. Class 1 has a higher precision than class 0 (+7%), but class 0 has a higher recall than class 1 (+11%). How would you interpret this? And two other questions: what does…
user77241
30
votes
7 answers

Can machine learning learn a function like finding maximum from a list?

I have an input which is a list, and the output is the maximum of the elements of the input list. Can machine learning learn a function that always selects the maximum of the elements present in the input? This might seem like a pretty…
user78739
  • 309
  • 1
  • 3
  • 3
30
votes
2 answers

How to feed LSTM with different input array sizes?

If I want to write an LSTM network and feed it inputs of different array sizes, how can I do that? For example, I want to take voice messages or text messages in a different language and translate them. So the first input may be "hello" but the…
user3486308
  • 1,260
  • 5
  • 16
  • 27
30
votes
6 answers

What is the reason behind taking log transformation of few continuous variables?

I have been working on a classification problem and have read many people's code and tutorials. One thing I've noticed is that many people take np.log or log of continuous variables like loan_amount or applicant_income, etc. I just want to understand…
Sai Kumar
  • 601
  • 1
  • 8
  • 14
30
votes
1 answer

How is a splitting point chosen for continuous variables in decision trees?

I have two questions related to decision trees: If we have a continuous attribute, how do we choose the splitting value? Example: Age=(20,29,50,40....) Imagine that we have a continuous attribute $f$ that takes values in $\mathbb{R}$. How can I write an…
WALID BELRHALMIA
  • 411
  • 1
  • 4
  • 5
30
votes
4 answers

Is pandas now faster than data.table?

Here is the GitHub link to the most recent data.table benchmark. The data.table benchmarks have not been updated since 2014. I heard somewhere that Pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used…
xiaodai
  • 620
  • 1
  • 5
  • 12
30
votes
3 answers

Why do we convert skewed data into a normal distribution?

I was going through a solution to the House Prices competition on Kaggle (Human Analog's Kernel on House Prices: Advanced Regression Techniques) and came across this part: # Transform the skewed numeric features by taking log(feature + 1). # This…
30
votes
6 answers

How to fill missing values based on other columns in a Pandas dataframe?

Suppose I have a 5×3 data frame in which the third column contains missing values:
1 2 3
4 5 NaN
7 8 9
3 2 NaN
5 6 NaN
I want to fill each missing value using the rule first column × second column:
1 2 3
4 5 20 <-- 4*5
7 8 9
3 2 6 <-- 3*2
5 6 30…
KyL
  • 419
  • 1
  • 4
  • 5
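
The rule in the question above (missing third-column value = first column × second column) could be sketched like this. The column names `a`, `b`, `c` are assumptions, since the question does not name its columns; `fillna` with a Series argument is a standard pandas call that fills only the missing positions.

```python
import numpy as np
import pandas as pd

# The 5x3 frame from the question; column names are hypothetical.
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, np.nan], [7, 8, 9], [3, 2, np.nan], [5, 6, np.nan]],
    columns=["a", "b", "c"],
)

# Fill only the missing entries of "c" with a * b; existing values are kept.
df["c"] = df["c"].fillna(df["a"] * df["b"])
```

Rows that already have a value in the third column are left unchanged, because `fillna` only touches the NaN positions.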
30
votes
3 answers

How to get p-values and confidence intervals in LogisticRegression with sklearn?

I am building a multinomial logistic regression with sklearn (LogisticRegression). But after it finishes, how can I get p-values and confidence intervals for my model? It appears that sklearn only provides the coefficients and intercept. Thank you a…
hminle
  • 401
  • 1
  • 4
  • 4