Questions tagged [kaggle]

Kaggle is an online community for data scientists and machine learning practitioners owned by Google. With Kaggle, users can find and publish machine learning datasets, use kernels to construct models, participate in online forums, take courses, and (most famously) participate in data science competitions.

119 questions
30
votes
3 answers

Why do we convert skewed data into a normal distribution

I was going through a solution of the Housing prices competition on Kaggle (Human Analog's Kernel on House Prices: Advanced Regression Techniques) and came across this part: # Transform the skewed numeric features by taking log(feature + 1). # This…
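The transform quoted in the excerpt, log(feature + 1), can be sketched in plain Python as below. This is only an illustration: the sample values and the `skewness` helper are made up here and do not come from the kernel being discussed.

```python
import math

def skewness(values):
    """Population (Fisher-Pearson) skewness of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed feature; not real competition data.
feature = [0, 1, 2, 2, 3, 4, 5, 8, 40, 200]

# log(feature + 1); log1p is safe for zeros because log1p(0) == 0.
transformed = [math.log1p(v) for v in feature]

print("skew before:", round(skewness(feature), 2))
print("skew after: ", round(skewness(transformed), 2))
```

On this toy sample the long right tail is compressed, so the skewness drops after the transform. Whether that helps a given model is a separate question.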
22
votes
1 answer

Lightgbm vs xgboost vs catboost

I've seen that in Kaggle competitions people are using LightGBM where they used to use XGBoost. My question is: when would you rather use XGBoost instead of LightGBM? What about CatBoost?
David Masip
  • 5,981
  • 2
  • 23
  • 61
22
votes
3 answers

How to perform feature engineering on unknown features?

I am participating in a Kaggle competition. The dataset has around 100 features and all are unknown (in terms of what they actually represent). Basically they are just numbers. People are performing a lot of feature engineering on these features. I…
12
votes
1 answer

Hashing Trick - what actually happens

When ML algorithms, e.g. Vowpal Wabbit or some of the factorization machines winning click-through rate competitions (Kaggle), mention that features are 'hashed', what does that actually mean for the model? Let's say there is a variable that…
B_Miner
  • 702
  • 1
  • 7
  • 20
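As a rough sketch of what "hashing" features can mean, the snippet below maps string features to positions in a fixed-length vector. It is an assumption-laden illustration: the `hash_features` helper, the MD5 hash, the bucket count, and the example row are all chosen here for demonstration, and real tools such as Vowpal Wabbit use their own hash functions and details.

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Map string features to a fixed-length count vector via hashing.

    Hypothetical helper for illustration only.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        # Use a stable hash (Python's built-in hash() is salted per process).
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % n_buckets
        vec[idx] += 1  # distinct tokens may collide in the same bucket
    return vec

# Made-up categorical row, encoded as "name=value" strings.
row = ["site=example.com", "device=mobile", "ad_id=12345"]
vec = hash_features(row)
print(vec)
```

The model then learns one weight per bucket rather than one per distinct feature value, so no dictionary of feature names is needed, at the cost of occasional collisions.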
11
votes
3 answers

Why does Gradient Boosting regression predict negative values when there are no negative y-values in my training set?

As I increase the number of trees in scikit-learn's GradientBoostingRegressor, I get more negative predictions, even though there are no negative values in my training or testing set. I have about 10 features, most of which are binary. Some of the…
user2592989
  • 219
  • 1
  • 2
  • 6
11
votes
4 answers

Why is SMOTE not used in prize-winning Kaggle solutions?

Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method to tackle imbalanced datasets. There are many papers with a lot of citations out there claiming that it is used to boost accuracy in unbalanced data scenarios. But then, when I…
Carlos Mougan
  • 6,011
  • 2
  • 15
  • 45
6
votes
3 answers

Kaggle notebook vs Google Colab

What are the major differences between a Kaggle notebook and a Google Colab notebook? To work on a dataset my first step is to start a Kaggle notebook, but then I can't help thinking what the advantage of using a Colab notebook instead could be. I know few…
ashraf
  • 61
  • 1
  • 2
6
votes
3 answers

Should I perform cross validation only on the training set?

I am working with a dataset that I downloaded from Kaggle. The data set is already divided into two CSVs for Train and Test. I built a model using the training set because I imported the train CSV into a Jupyter Notebook. I predicted using the Train…
Amit Yadav
  • 63
  • 1
  • 5
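One common setup, sketched here under assumptions and not necessarily what the answers recommend, is to run k-fold cross-validation only within the rows of the training CSV and leave the test CSV untouched. The `k_fold_indices` helper and the row count below are hypothetical, and the sketch does not shuffle the rows.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs that split range(n_samples) into k folds.

    Hypothetical helper; folds are contiguous and unshuffled.
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

n_train_rows = 10  # assumed number of rows in the training CSV
folds = list(k_fold_indices(n_train_rows, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each training row is used for validation exactly once, and no test-set row ever enters the loop.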
6
votes
3 answers

How can I fill NaN values in a Pandas DataFrame in Python?

I am trying to learn data analysis and machine learning by trying out some problems. I found a competition "House prices" which is actually a playground competition. Since I am very new to this field, I got confused after exploring the data. The…
Ahmed Dhanani
  • 163
  • 1
  • 1
  • 5
4
votes
1 answer

AUC ROC metric on a Kaggle competition

I am trying to learn data modeling by working on a dataset from a Kaggle competition. As the competition was closed 2 years back, I am asking my question here. The competition uses AUC-ROC as the evaluation metric. This is a classification problem…
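For readers unfamiliar with the metric, AUC-ROC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way; the `roc_auc` helper and the toy labels and scores are illustrative assumptions, not data from the competition in question.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy binary labels and predicted scores (made up).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of examples, so it is only suitable for small inputs. Because only the ordering of the scores matters, the metric is computed from predicted probabilities or scores rather than hard class labels.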
4
votes
1 answer

How to automate the encoding process?

I am working on the Boston challenge hosted on Kaggle and I'm still refining my features. Looking at the dataset, I realize that some columns need to be encoded in binary, some encoded in decimals (ranking them out of a scale of n) and some need to…
4
votes
4 answers

Import data from Google Drive to Kaggle Kernel

I want to import a CSV file from Google Drive. I tried using the link in the add dataset tab, but it is taking something else as "Open". Please see the image.
Monukumar
  • 41
  • 1
  • 1
  • 2
4
votes
1 answer

Hyperparameter tuning for stacked models

I'm reading the following Kaggle post to learn how to incorporate model stacking http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/ in ML models. The structure behind constructing the 5 folds and creating out of…
4
votes
1 answer

Avoid hardware limitation while competing in Kaggle?

I've learned machine learning via textbooks and examples, which don't delve into the engineering challenges of working with "big-ish" data like Kaggle's. As a specific example, I'm working on the New York taxi trip challenge. It's a regression task…
Heisenberg
  • 149
  • 4
4
votes
3 answers

Can you recommend a machine learning challenge that is suitable for novices?

I am looking for a challenge that is suitable for a group of novices who want to learn the basics of data science and machine learning. The challenge should match the following criteria: is based on a real application or is at least realistic; has a…
clstaudt
  • 129
  • 6