Questions tagged [statistics]

Statistics is a scientific approach to inductive inference and prediction based on probabilistic models of the data. By extension, it covers the design of experiments and surveys to gather data for this purpose.

1110 questions
132
votes
1 answer

How to get the correlation between two categorical variables, and between a categorical variable and a continuous variable?

I am building a regression model and need to calculate the following to check for correlations: (1) the correlation between two multi-level categorical variables; (2) the correlation between a multi-level categorical variable and a continuous variable; (3) VIF (variance…
GeorgeOfTheRF
  • 2,018
  • 5
  • 17
  • 20
60
votes
5 answers

Neural networks: which cost function to use?

I am using TensorFlow for experiments, mainly with neural networks. Although I have done quite a few experiments by now (XOR problem, MNIST, some regression stuff, ...), I struggle with choosing the "correct" cost function for specific problems because…
46
votes
12 answers

Data Science in C (or C++)

I'm an R language programmer. I'm also in the group of people who are considered Data Scientists but who come from academic disciplines other than CS. This works out well in my role as a Data Scientist, however, by starting my career in R and only…
Hack-R
  • 1,919
  • 1
  • 21
  • 34
38
votes
3 answers

Calculation and Visualization of Correlation Matrix with Pandas

I have a pandas data frame with several entries, and I want to calculate the correlation between the incomes of certain types of stores. There are a number of stores with income data, classification of area of activity (theater, clothing stores, food ...)…
gdlm
  • 535
  • 1
  • 6
  • 9
29
votes
4 answers

Books about the "Science" in Data Science?

What are the books about the science and mathematics behind data science? It feels like so many "data science" books are programming tutorials and don't touch things like data generating processes and statistical inference. I can already code, what…
Anton
  • 399
  • 4
  • 5
29
votes
10 answers

Any Online R console?

I am looking for an online console for the R language: I write the code, and the server executes it and returns the output, similar to the DataCamp website.
Gotham
  • 291
  • 1
  • 3
  • 3
26
votes
7 answers

Is Python a viable language to do statistical analysis in?

I originally came from R, but Python seems to be the more common language these days. Ideally, I would do all my coding in Python as the syntax is easier and I've had more real life experience using it - and switching back and forth is a pain. Out…
confused
  • 488
  • 4
  • 10
23
votes
4 answers

What statistical model should I use to analyze the likelihood that a single event influenced longitudinal data

I am trying to find a formula, method, or model to use to analyze the likelihood that a specific event influenced some longitudinal data. I am having difficulty figuring out what to search for on Google. Here is an example scenario: Imagine you own a…
Peter Kirby
  • 333
  • 1
  • 4
17
votes
2 answers

High-dimensional data: What are useful techniques to know?

Due to various curses of dimensionality, the accuracy and speed of many of the common predictive techniques degrade on high dimensional data. What are some of the most useful techniques/tricks/heuristics that help deal with high-dimensional data…
ASX
  • 451
  • 2
  • 4
  • 7
17
votes
5 answers

Beginner math books for Machine Learning

I'm a computer science engineer with no background in statistics or advanced math. I'm studying the book Python Machine Learning by Raschka and Mirjalili, but when I tried to understand the math behind machine learning, I wasn't able to understand…
16
votes
3 answers

Overfitting in Linear Regression

I'm just getting started with machine learning and I have trouble understanding how overfitting can happen in a linear regression model. Considering we use only 2 feature variables to train a model, how can a flat plane possibly be overfitted to a…
16
votes
1 answer

How many features to sample using Random Forests

The Wikipedia page which quotes "The Elements of Statistical Learning" says: Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p}\rfloor$ features are used in each split. I understand that this is a fairly good educated…
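For reference, the rule of thumb quoted in this excerpt, $\lfloor \sqrt{p}\rfloor$ features per split, is easy to compute. The snippet below is only a minimal sketch of that arithmetic; the function name `features_per_split` is made up for illustration and is not taken from the question or from any library.

```python
import math


def features_per_split(p: int) -> int:
    """Return floor(sqrt(p)) for a problem with p features.

    Hypothetical helper: it only evaluates the quoted rule of thumb
    and does not train or configure a random forest.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    # math.isqrt computes the integer floor of the square root exactly.
    return math.isqrt(p)


if __name__ == "__main__":
    for p in (4, 10, 100):
        print(f"p={p}: {features_per_split(p)} features per split")
```

Using `math.isqrt` rather than `int(math.sqrt(p))` avoids floating-point rounding for very large `p`, although for realistic feature counts either would do.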
15
votes
2 answers

Analyzing A/B test results which are not normally distributed, using independent t-test

I have a set of results from an A/B test (one control group, one feature group) which do not fit a Normal Distribution. In fact the distribution resembles more closely the Landau Distribution. I believe the independent t-test requires that the…
teebszet
  • 253
  • 2
  • 6
15
votes
4 answers

How to specify important attributes?

Assume a set of loosely structured data (e.g. Web tables/Linked Open Data), composed of many data sources. There is no common schema followed by the data and each source can use synonym attributes to describe the values (e.g. "nationality" vs…
vefthym
  • 503
  • 6
  • 13
14
votes
3 answers

When are p-values deceptive?

What are the data conditions that we should watch out for, where p-values may not be the best way of deciding statistical significance? Are there specific problem types that fall into this category?
user179
  • 143
  • 1
  • 4