Questions tagged [distribution]

167 questions
10
votes
2 answers

How to check whether the distributions of the training set and testing set are similar

I have been taking part in a Kaggle competition and have found that the distributions of the training set and testing set are different, so I am wondering how to check whether the two distributions are similar. And…
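One common way to approach this kind of check is a per-feature two-sample Kolmogorov-Smirnov test. The sketch below is only an illustration, not something taken from the question or its answers: the function name `compare_feature_distributions`, the `alpha` threshold, and the assumption that both sets are numeric 2-D NumPy arrays with matching columns are all hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp


def compare_feature_distributions(train, test, alpha=0.05):
    """Run a two-sample KS test on each column of train vs. test.

    train, test: 2-D arrays (rows = samples, columns = features).
    Returns a list of (column_index, ks_statistic, p_value, flagged),
    where flagged is True when p_value < alpha.
    """
    results = []
    for j in range(train.shape[1]):
        res = ks_2samp(train[:, j], test[:, j])
        stat, p_value = float(res.statistic), float(res.pvalue)
        results.append((j, stat, p_value, p_value < alpha))
    return results


# Illustrative data: column 0 is identical in both sets, column 1 is shifted.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))
test = train.copy()
test[:, 1] += 3.0

for col, stat, p_value, flagged in compare_feature_distributions(train, test):
    print(f"feature {col}: KS={stat:.3f}, p={p_value:.3g}, differs={flagged}")
```

A flagged column only says that the marginal distribution of that feature differs; it does not look at joint structure across features, and with many features some columns will be flagged by chance.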
7
votes
3 answers

xgboost: Is there a way to perform regression on rates/percentages data?

I have a dependent variable, $Y$, that is made up of rates/percentages data, so each value is between $0$ and $1$. I was attracted to the xgboost library because it allows focusing in on specific subsets of the data in training itself, but I am…
Coolio2654
  • 280
  • 3
  • 10
7
votes
3 answers

Which outlier detection can detect these outliers?

I have a vector and want to detect outliers in it. The following figure shows the distribution of the vector. Red points are outliers. Blue points are normal points. Yellow points are also normal. I need an outlier detection method (a…
7
votes
2 answers

Plotting different values in pandas histogram with different colors

I am working on a dataset. The dataset consists of 16 different features, each having values belonging to the set (0, 1, 2). In order to check the distribution of values in each column, I used the pandas.DataFrame.hist() method, which gave me a…
enterML
  • 3,011
  • 9
  • 26
  • 38
6
votes
4 answers

Regression: How to deal with positive skewness in continuous target variable

I'm working on a regression problem. My aim is to "learn" the distribution of a continuous target $y$ as well as possible to make predictions. My model looks like: $$y_i=\beta X_i + u_i.$$ $y$ is right skewed (positive skewness) and consists of…
Peter
  • 7,277
  • 5
  • 18
  • 47
6
votes
1 answer

How can I plot/display a dataset or an image distribution?

I want to view the distribution of a specific image or a dataset, and see if they are different. Does simply writing something like: # mydataset.shape = (50k,32,32,3) plt.hist(mydataset.reshape(-1)) do the trick? Or should I be doing something…
Hossein
  • 535
  • 6
  • 14
6
votes
1 answer

How to estimate the mutual information numerically?

Suppose I have a sample $\{z_i\}_{i\in[0,N]} = \{(x_i,y_i)\}_{i\in[0,N]}$ which comes from a probability distribution $p_z(z)$. How can I use it to estimate the mutual information between $X$ and $Y$? $MI(X,Y) = \int_Y \int_X …
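As a rough illustration of one numerical approach, here is a minimal sketch of a histogram-based plug-in estimator: bin the samples, estimate the joint and marginal probabilities from the counts, and sum $p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$ over the non-empty bins. This is a swapped-in simple technique, not necessarily what the answers recommend. The function name `mutual_information_hist` and the default of 10 bins are assumptions, and the estimate is biased and sensitive to the bin count.

```python
import numpy as np


def mutual_information_hist(x, y, bins=10):
    """Plug-in estimate of MI(X, Y) in nats from a 2-D histogram."""
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint_counts / joint_counts.sum()   # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of Y, shape (1, bins)
    nonzero = p_xy > 0                         # skip empty bins (0 * log 0 = 0)
    ratio = p_xy[nonzero] / (p_x @ p_y)[nonzero]
    return float(np.sum(p_xy[nonzero] * np.log(ratio)))


# Illustrative data: y_dep is a copy of x, y_ind is drawn independently.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y_dep = x.copy()
y_ind = rng.normal(size=5000)

print("dependent:  ", mutual_information_hist(x, y_dep))
print("independent:", mutual_information_hist(x, y_ind))
```

For independent variables the estimate is close to, but slightly above, zero because of the finite-sample bias of the plug-in estimator.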
5
votes
3 answers

Boxplots or violinplots?

This is quite a general question, perhaps somewhat opinion-based. In most papers people use boxplots to visualize a certain distribution, yet violinplots are able to give more information. Violinplots are made by performing a kernel density…
Archie
  • 863
  • 7
  • 20
5
votes
2 answers

Working with Data which is not Normal/Gaussian

What happens if my data/feature is not normal? Can I still use machine learning algorithms to utilize such data for predictions? I noticed that in many data science courses there is always a strong assumption that the data are normal/Gaussian. I have…
Newbie01
  • 53
  • 1
  • 4
4
votes
2 answers

Is it possible to train probabilistic model to return several distributions?

I have nonlinear data from a function y(x), which is, let's say, parabolic. At some points of x there are several y's (look at the picture). Is it possible to train a probabilistic model to return several distributions (when needed), i.e. several means…
4
votes
2 answers

Why do seaborn.dist and pyplot.hist generate two different looking histograms on the same data?

I'm looking at telecom customers data. Two of the variables I'm looking at currently are: Monthly Charges - The total amount charged to the customer monthly. Is Senior Citizen - Whether the customer is a senior citizen. I'm trying to plot two…
helloworld
  • 43
  • 4
4
votes
3 answers

How to predict whether or not a customer will renew

I have a dataset of customer contracts that specify a start date and if applicable an end date. Each month a customer is up for renewal. Below is an example of how the data is organized in excel: ID Customer Start Date Customer Drop Date 1 …
Geometric
  • 41
  • 2
4
votes
2 answers

Standard Deviation for Z-scores

I have a set of data that I'm trying to generate a z-score with. I know I need standard deviation as part of my calculations. I am using the formula of: $\sigma = \sqrt{p * n * (1-p)}$ My data is binary - the value can either go up or down. However,…
I_Play_With_Data
  • 2,079
  • 2
  • 16
  • 39
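For reference, the formula quoted in the excerpt, $\sigma = \sqrt{p \cdot n \cdot (1-p)}$, is the standard deviation of a binomial count with $n$ trials and success probability $p$. A minimal sketch of the corresponding z-score follows, assuming the observed quantity is the number of "up" outcomes in $n$ trials. The function name `binomial_z_score` and the example numbers are hypothetical, and the excerpt is cut off before the asker's actual problem, so this may not be the case they are asking about.

```python
import math


def binomial_z_score(successes, n, p):
    """z-score of an observed success count under a Binomial(n, p) model."""
    if n <= 0 or not 0 < p < 1:
        raise ValueError("need n > 0 and 0 < p < 1 for a non-zero sigma")
    mean = n * p                         # expected number of successes
    sigma = math.sqrt(p * n * (1 - p))   # binomial standard deviation
    return (successes - mean) / sigma


# Example: 60 "up" moves out of 100 when p = 0.5 is assumed.
# mean = 50, sigma = 5, so z = (60 - 50) / 5
print(binomial_z_score(60, 100, 0.5))  # prints 2.0
```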
4
votes
3 answers

Transform a skewed distribution into a Gaussian distribution

I have a skewed distribution that looks like this: How can I transform it to a Gaussian distribution? The values represent ranks, so modifying the values does not cause information loss as long as the order of values remains the same. I'm doing…
Atte Juvonen
  • 323
  • 2
  • 5
  • 8
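Since the excerpt says only the order of the values matters, one option is a rank-based inverse normal transform: replace each value by its rank, map the rank to a probability strictly inside (0, 1), and apply the standard normal quantile function. The sketch below uses only the standard library. The function name `rank_gauss`, the `(rank - 0.5) / n` offset, and the average-rank treatment of ties are assumptions made for illustration, not details from the question.

```python
from statistics import NormalDist


def rank_gauss(values):
    """Map values to standard-normal quantiles of their (average) ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n lies strictly between 0 and 1, so inv_cdf is defined.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


skewed = [1, 2, 4, 100, 1000]
print(rank_gauss(skewed))
```

The transform preserves the ordering of the inputs, and the output depends only on the ranks, so any monotone rescaling of the input gives the same result.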
4
votes
2 answers

Analysis of probability distribution of each feature and Machine Learning

I know that probability distributions are used for hypothesis testing, confidence level construction, etc., and that they definitely have many roles in statistical analysis. However, it is not obvious to me now how probability distributions come in handy…
Student
  • 419
  • 2
  • 9