Highest Voted 'distributed' Questions - Data Science Stack Exchange

34 votes, 5 answers

What are the use cases for Apache Spark vs Hadoop

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to map-reduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for…

asked Jun 17 '14 at 20:48 by idclark (521 reputation)

23 votes, 3 answers

Nearest neighbors search for very high dimensional data

I have a big sparse matrix of users and items they like (on the order of 1M users and 100K items, with a very low level of sparsity). I'm exploring ways in which I could perform kNN search on it. Given the size of my dataset and some initial tests I…

machine-learning distributed map-reduce dimensionality-reduction

asked Aug 14 '14 at 00:50 by cjauvin (451 reputation)

16 votes, 3 answers

Parallel and distributed computing

What is (are) the difference(s) between parallel and distributed computing? When it comes to scalability and efficiency, it is very common to see solutions dealing with computations in clusters of machines, and sometimes it is referred to as a…

definitions parallel distributed

asked May 15 '14 at 04:59 by Rubens (4,097 reputation)

14 votes, 4 answers

Looking for example infrastructure stacks/workflows/pipelines

I'm trying to understand how all the "big data" components play together in a real-world use case, e.g. hadoop, mongodb/nosql, storm, kafka, ... I know that this is quite a wide range of tools used for different types, but I'd like to get to know…

machine-learning bigdata efficiency scalability distributed

asked Jun 17 '14 at 10:37 by chrshmmmr (143 reputation)

13 votes, 4 answers

Large Graphs: NetworkX distributed alternative

I have built some implementations using NetworkX (a Python graph module) native algorithms, in which I output some attributes that I use for classification purposes. I want to scale it to a distributed environment. I have seen many approaches like…

machine-learning graphs distributed

asked Aug 16 '16 at 12:18 by 20-roso (670 reputation)

12 votes, 2 answers

Tradeoffs between Storm and Hadoop (MapReduce)

Can someone kindly tell me about the trade-offs involved when choosing between Storm and MapReduce in a Hadoop cluster for data processing? Of course, aside from the obvious one, that Hadoop (processing via MapReduce in a Hadoop cluster) is a batch…

bigdata efficiency apache-hadoop distributed

asked Jun 01 '14 at 10:25 by mbbce (347 reputation)

10 votes, 2 answers

What is the difference between PyTorch's DataParallel and DistributedDataParallel?

I am going through this imagenet example. In line 88, the module DistributedDataParallel is used. When I searched for it in the docs, I didn't find anything. However, I found the documentation for DataParallel. So, I would like to know…

gpu distributed pytorch

asked Aug 11 '17 at 17:50 by Dawny33 (8,226 reputation)

9 votes, 1 answer

What is meant by Distributed for a gradient boosting library?

I am checking out the XGBoost documentation, and it states that XGBoost is an optimized distributed gradient boosting library. What is meant by distributed? Have a nice day

xgboost distributed boosting

asked Nov 15 '18 at 14:24 by Tommaso Bendinelli (275 reputation)

8 votes, 3 answers

How to compare experiments run over different infrastructures

I'm developing a distributed algorithm, and to improve efficiency, it relies both on the number of disks (one per machine) and on an efficient load-balancing strategy. With more disks, we're able to reduce the time spent on I/O; and with an…

bigdata efficiency performance scalability distributed

asked Jun 15 '14 at 00:00 by Rubens (4,097 reputation)

8 votes, 2 answers

Understanding how distributed PCA works

As part of a big data analysis project I'm working on, I need to perform PCA on some data using a cloud computing system. In my case, I'm using Amazon EMR for the job, and Spark in particular. Leaving the "How-to-perform-PCA-in-spark" question aside, I…

data-mining bigdata apache-spark pca distributed

asked Apr 19 '17 at 08:58 by Adiel (183 reputation)

7 votes, 2 answers

Python distributed machine learning

I occasionally train neural nets for my research, and they usually take quite a long time to run (especially when I'm working on my laptop). I'm looking for a way to build the model on any computer and send it up to a server for training and have it…

machine-learning python neural-network distributed

asked Nov 15 '15 at 15:06 by Simon (1,071 reputation)

6 votes, 1 answer

How to make k-means distributed?

After setting up a 2-node Hadoop cluster, understanding Hadoop and Python and based on this naive implementation, I ended up with this code: def kmeans(data, k, c=None): if c is not None: centroids = c else: centroids = [] …

python apache-hadoop k-means map-reduce distributed

asked Feb 06 '16 at 02:38 by gsamaras (291 reputation)
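
One common way to distribute k-means, sketched here as an illustration and not taken from the question's code: each worker assigns its share of the points to the nearest centroid and emits per-centroid partial sums and counts (the map step), and a single merge turns those partials into new centroids (the reduce step). All function names and the toy data below are assumptions, and the partitions are processed in a plain loop standing in for Hadoop workers.

```python
from collections import defaultdict


def closest(point, centroids):
    """Return the index of the centroid nearest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points, centroids):
    """Map step: per-partition partial sums and counts, keyed by centroid index."""
    dim = len(centroids[0])
    partial = defaultdict(lambda: ([0.0] * dim, 0))
    for point in points:
        idx = closest(point, centroids)
        sums, count = partial[idx]
        partial[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return dict(partial)


def reduce_partials(partials, centroids):
    """Reduce step: merge the partials from all partitions into new centroids."""
    dim = len(centroids[0])
    totals = defaultdict(lambda: ([0.0] * dim, 0))
    for partial in partials:
        for idx, (sums, count) in partial.items():
            t_sums, t_count = totals[idx]
            totals[idx] = ([a + b for a, b in zip(t_sums, sums)], t_count + count)
    new_centroids = []
    for idx, old in enumerate(centroids):
        if idx in totals:
            sums, count = totals[idx]
            new_centroids.append([s / count for s in sums])
        else:
            # A centroid that attracted no points is left where it was.
            new_centroids.append(list(old))
    return new_centroids


def kmeans_iteration(partitions, centroids):
    """One k-means iteration; the list comprehension stands in for parallel workers."""
    partials = [map_partition(part, centroids) for part in partitions]
    return reduce_partials(partials, centroids)


# Toy data: two partitions, as if the points lived on two nodes.
partitions = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[10.0, 10.0], [10.0, 12.0]],
]
centroids = [[1.0, 1.0], [9.0, 9.0]]
new_centroids = kmeans_iteration(partitions, centroids)
```

Only the small per-centroid sums and counts cross the network in this scheme, not the points themselves, which is why the map/reduce split tends to suit k-means. In practice the iteration would be repeated until the centroids stop moving.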

5 votes, 2 answers

How to speed up message passing between computing nodes

I'm developing a distributed application, and as it's been designed, there'll be a great load of communication during the processing. Since the communication is already spread as much as possible across the entire process, I'm wondering if there any…

efficiency distributed performance

asked Jun 18 '14 at 14:36 by Rubens (4,097 reputation)

5 votes, 2 answers

Distributed PCA or an equivalent

We normally have fairly large datasets to model on, just to give you an idea: over 1M features (sparse, average population of features is around 12%); over 60M rows. A lot of modeling algorithms and tools don't scale to such wide datasets. So…

dimensionality-reduction pca distributed matrix-factorisation

asked Jul 11 '18 at 21:22 by Tagar (198 reputation)

5 votes, 1 answer

How to decide which test of normality to use

Given a data set with features that you want to check for normality, one feature at a time without a multivariate normal test, how do you decide which test of normality to use? For example, using the Python module scipy I could either use:…

statistics distributed

asked Jul 18 '16 at 15:49 by FakeBrain (109 reputation)
