Questions tagged [dimensionality-reduction]

Dimensionality reduction refers to techniques for reducing many variables to a smaller number while keeping as much information as possible. One prominent method is principal component analysis (PCA).

297 questions
What is dimensionality reduction? What is the difference between feature selection and extraction?
70 votes · 11 answers · asked by alvas
From Wikipedia: dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction. What is the difference between feature…

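The distinction this question asks about can be made concrete with a minimal scikit-learn sketch (the iris data is just a stand-in): selection keeps a subset of the original columns, while extraction builds new columns from combinations of all of them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Feature selection: keep 2 of the original 4 columns (still interpretable).
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new columns as linear combinations of all 4.
X_ext = PCA(n_components=2).fit_transform(X)

print(X_sel.shape, X_ext.shape)  # (150, 2) (150, 2)
```
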
How to do SVD and PCA with big data?
36 votes · 6 answers · asked by David S.
I have a large set of data (about 8 GB). I would like to use machine learning to analyze it, so I think I should use SVD and then PCA to reduce the dimensionality of the data for efficiency. However, MATLAB and Octave cannot load such a large…

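For data too large to load at once, one common workaround (not necessarily what the answers recommend) is incremental PCA, which consumes the matrix in mini-batches. A minimal scikit-learn sketch, with random chunks standing in for blocks streamed from disk:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=10)

# Feed the data in chunks so the full matrix never has to fit in memory;
# each chunk here stands in for a block read from disk.
for _ in range(20):
    chunk = rng.standard_normal((1000, 100))
    ipca.partial_fit(chunk)

reduced = ipca.transform(rng.standard_normal((5, 100)))
print(reduced.shape)  # (5, 10)
```
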
Purpose of visualizing high dimensional data?
30 votes · 8 answers · asked by hlin117
There are many techniques for visualizing high-dimensional datasets, such as t-SNE, Isomap, PCA, supervised PCA, etc. We go through the motions of projecting the data down to a 2D or 3D space so we have "pretty pictures". Some of these…

Machine learning techniques for estimating users' age based on Facebook sites they like
28 votes · 6 answers
I have a database from my Facebook application and I am trying to use machine learning to estimate users' age based on which Facebook sites they like. There are three crucial characteristics of my database: the age distribution in my training set…

Improving the speed of the t-SNE implementation in Python for huge data
28 votes · 5 answers · asked by chmodsss
I would like to do dimensionality reduction on nearly 1 million vectors, each with 200 dimensions (doc2vec). I am using the TSNE implementation from the sklearn.manifold module, and the major problem is time complexity. Even with method="barnes_hut",…

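A widely used speed-up at this scale, sketched below with synthetic stand-in data, is to first reduce the vectors to about 50 dimensions with PCA and then run Barnes-Hut t-SNE on the result; the sizes and parameters here are illustrative, not tuned:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 200))  # stand-in for the doc2vec vectors

# Step 1: PCA down to 50 dimensions; this alone removes much of the cost.
X50 = PCA(n_components=50).fit_transform(X)

# Step 2: Barnes-Hut t-SNE on the reduced vectors.
emb = TSNE(n_components=2, method="barnes_hut", init="pca",
           random_state=0).fit_transform(X50)
print(emb.shape)  # (500, 2)
```
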
Nearest neighbors search for very high dimensional data
23 votes · 3 answers
I have a big sparse matrix of users and the items they like (on the order of 1M users and 100K items, with a very low level of sparsity). I'm exploring ways in which I could perform kNN search on it. Given the size of my dataset and some initial tests, I…

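For a quick exact baseline on sparse user-item data, scikit-learn's brute-force NearestNeighbors with cosine distance operates directly on CSR matrices; a minimal sketch with a small random sparse matrix standing in for the real one (at 1M x 100K scale, approximate methods such as LSH would be the next step):

```python
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Random sparse user-item matrix standing in for the real 1M x 100K one.
X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)

# Brute-force search with cosine distance works directly on CSR input.
nn = NearestNeighbors(n_neighbors=5, metric="cosine", algorithm="brute").fit(X)
distances, indices = nn.kneighbors(X[:3])
print(indices.shape)  # (3, 5)
```
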
Dimensionality and Manifold
22 votes · 4 answers · asked by alvas
A commonly heard sentence in unsupervised machine learning is: "High dimensional inputs typically live on or near a low dimensional manifold." What is a dimension? What is a manifold? What is the difference? Can you give an example to describe…

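The swiss roll is the standard concrete example of the quoted sentence: each point has 3 coordinates (the dimension of the ambient space), but all points lie on a rolled-up 2D sheet (the manifold). A minimal sketch that recovers the sheet's two coordinates with Isomap:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Each point has 3 coordinates, but all points lie on a rolled-up 2D sheet.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Isomap recovers 2 coordinates that parameterize the sheet itself.
X2 = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(X.shape, X2.shape)  # (1000, 3) (1000, 2)
```
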
Feature selection vs. feature extraction: which to use when?
21 votes · 5 answers
Feature extraction and feature selection both essentially reduce the dimensionality of the data, but feature extraction also makes the data more separable, if I am right. Which technique would be preferred over the other, and when? I was thinking,…

Why are autoencoders for dimension reduction symmetrical?
20 votes · 3 answers · asked by dcl
I'm not an expert in autoencoders or neural networks by any means, so forgive me if this is a silly question. For the purpose of dimension reduction, or of visualizing clusters in high-dimensional data, we can use an autoencoder to create a (lossy) 2…

Are t-SNE dimensions meaningful?
20 votes · 1 answer · asked by Nitro
Are there any meanings for the dimensions of a t-SNE embedding? With PCA we have this sense of linearly transformed variance maximization, but for t-SNE is there intuition besides just the space we define for mapping and minimization of the…

One-hot encoding alternatives for large categorical values
18 votes · 4 answers
I have a data frame with large categorical values (over 1600 categories). Is there any way I can find alternatives so that I don't end up with over 1600 columns? I found this interesting link, but they are converting to class/object, which I don't want. I…

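One standard alternative to 1600+ one-hot columns, sketched below with hypothetical category strings, is the hashing trick: each categorical value is hashed into a fixed-width vector, trading exact column identity for bounded width:

```python
from sklearn.feature_extraction import FeatureHasher

# Hash each categorical value into a fixed-width vector instead of
# one column per category; the "city=..." strings here are hypothetical.
hasher = FeatureHasher(n_features=32, input_type="string")
rows = [["city=London", "color=red"], ["city=Paris", "color=blue"]]
X = hasher.transform(rows)
print(X.shape)  # (2, 32)
```
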
High-dimensional data: What are useful techniques to know?
17 votes · 2 answers · asked by ASX
Due to the various curses of dimensionality, the accuracy and speed of many common predictive techniques degrade on high-dimensional data. What are some of the most useful techniques/tricks/heuristics that help deal with high-dimensional data…

Can closer points be considered more similar in t-SNE visualization?
15 votes · 1 answer · asked by Javierfdr
I understand from Hinton's paper that t-SNE does a good job of keeping local similarities and a decent job of preserving global structure (clusterization). However, I'm not clear on whether points appearing closer in a 2D t-SNE visualization can be assumed…

Efficient dimensionality reduction for large dataset
14 votes · 2 answers · asked by timleathart
I have a dataset with ~1M rows and ~500K sparse features. I want to reduce the dimensionality to somewhere on the order of 1K-5K dense features. sklearn.decomposition.PCA doesn't work on sparse data, and I've tried using…

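scikit-learn's TruncatedSVD is the usual substitute for PCA on sparse input, since it never centers or densifies the matrix; a minimal sketch with a small random CSR matrix standing in for the 1M x 500K one:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Small random CSR matrix standing in for the 1M x 500K sparse features.
X = sp.random(2000, 1000, density=0.001, format="csr", random_state=0)

# TruncatedSVD accepts sparse input and returns dense reduced features.
svd = TruncatedSVD(n_components=50, random_state=0)
X_dense = svd.fit_transform(X)
print(X_dense.shape)  # (2000, 50)
```
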
Reducing the dimensionality of word embeddings
10 votes · 2 answers · asked by Franck Dernoncourt
I trained word embeddings with 300 dimensions. Now I would like to have word embeddings with 50 dimensions: is it better to retrain the word embeddings with 50 dimensions, or can I use some dimensionality reduction method to scale the word…

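If retraining is off the table, a simple baseline (not necessarily the best-performing option) is to run PCA on the embedding matrix itself; a minimal sketch with a random matrix standing in for the trained vocabulary x 300 embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
emb300 = rng.standard_normal((5000, 300))  # stand-in: vocab x 300 embeddings

# Project every word vector from 300 down to 50 dimensions.
emb50 = PCA(n_components=50).fit_transform(emb300)
print(emb50.shape)  # (5000, 50)
```
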