Questions tagged [data-mining]

Conceptually speaking, data-mining can be thought of as one item (or set of skills and applications) in the toolkit of the data scientist.

More specifically, data-mining is an activity that seeks patterns in large, complex data sets. It usually emphasizes algorithmic techniques, but may also involve any set of related skills, applications, or methodologies with that goal.

In US-English colloquial speech, data-mining and data-collection are often used interchangeably.

However, a main difference between these two related activities is intentionality.

Definition inspired mostly by the contributions of @statsRus to Data Science.SE

1180 questions
200 votes
13 answers

K-Means clustering for mixed numeric and categorical data

My data set contains a number of numeric attributes and one categorical. Say, NumericAttr1, NumericAttr2, ..., NumericAttrN, CategoricalAttr, where CategoricalAttr takes one of three possible values: CategoricalAttrValue1, CategoricalAttrValue2 or…
IgorS
73 votes
7 answers

Open source Anomaly Detection in Python

Problem Background: I am working on a project that involves log files similar to those found in the IT monitoring space (to the best of my understanding of the IT space). These log files are time-series data, organized into hundreds/thousands of rows of…
ximiki
71 votes
2 answers

Are Support Vector Machines still considered "state of the art" in their niche?

This question is in response to a comment I saw on another question. The comment concerned the Machine Learning course syllabus on Coursera, and was along the lines of "SVMs are not used so much nowadays". I have only just finished the relevant…
Neil Slater
40 votes
4 answers

Why do we need XGBoost and Random Forest?

I wasn't clear on a couple of concepts: XGBoost converts weak learners to strong learners. What's the advantage of doing this? Why combine many weak learners instead of just using a single tree? Random Forest uses various samples from the tree to create…
39 votes
5 answers

What are some standard ways of computing the distance between documents?

When I say "document", I have in mind web pages like Wikipedia articles and news stories. I prefer answers giving either vanilla lexical distance metrics or state-of-the-art semantic distance metrics, with stronger preference for the latter.
Matt
37 votes
4 answers

Meaning of latent features?

I am learning about matrix factorization for recommender systems, and I keep seeing the term "latent features", but I am unable to understand what it means. I know what a feature is but I don't understand the idea of latent…
Jack Twain
36 votes
6 answers

How to do SVD and PCA with big data?

I have a large set of data (about 8GB). I would like to use machine learning to analyze it. So, I think that I should use SVD then PCA to reduce the data dimensionality for efficiency. However, MATLAB and Octave cannot load such a large…
David S.
34 votes
6 answers

Gini coefficient vs Gini impurity - decision trees

This question concerns building decision trees. According to Wikipedia, 'Gini coefficient' should not be confused with 'Gini impurity'. However, both measures can be used when building a decision tree - these can support our choices when splitting the…
Damien
29 votes
1 answer

What is Hellinger Distance and when to use it?

I would like to know, in simple terms, how Hellinger Distance works. I would also like to know what types of problems Hellinger Distance can be used for. What are the benefits of using Hellinger Distance?
Smith Volka
29 votes
1 answer

Word2Vec vs. Sentence2Vec vs. Doc2Vec

I recently came across the terms Word2Vec, Sentence2Vec and Doc2Vec and am somewhat confused, as I am new to vector semantics. Can someone please explain the differences between these methods in simple words? What are the most suitable tasks for each…
28 votes
3 answers

Why are NLP and Machine Learning communities interested in deep learning?

I hope you can help me, as I have some questions on this topic. I'm new to the field of deep learning, and while I have done some tutorials, I can't relate the concepts to one another or tell them apart.
27 votes
2 answers

How to deal with time series which change in seasonality or other patterns?

Background: I'm working on a time series data set of energy meter readings. The length of the series varies by meter - for some I have several years, others only a few months, etc. Many display significant seasonality, and often multiple layers -…
Jo Douglass
26 votes
4 answers

Is Data Science the Same as Data Mining?

I am sure data science, as it will be discussed in this forum, has several synonyms or at least related fields where large data is analyzed. My particular question is in regard to Data Mining. I took a graduate class in Data Mining a few years back. …
demongolem
23 votes
4 answers

What statistical model should I use to analyze the likelihood that a single event influenced longitudinal data

I am trying to find a formula, method, or model to use to analyze the likelihood that a specific event influenced some longitudinal data. I am having difficulty figuring out what to search for on Google. Here is an example scenario: Imagine you own a…
Peter Kirby
20 votes
4 answers

K-means: What are some good ways to choose an efficient set of initial centroids?

When a random initialization of centroids is used, different runs of K-means produce different total SSEs, so the choice of initial centroids is crucial to the performance of the algorithm. What are some effective approaches toward solving this problem? Recent approaches are…
ngub05