Questions tagged [distributed]
38 questions
34
votes
5 answers
What are the use cases for Apache Spark vs Hadoop
With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to map-reduce solutions. With that advancement, what are the use cases for Apache Spark vs Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for…
idclark
- 521
- 1
- 5
- 6
23
votes
3 answers
Nearest neighbors search for very high dimensional data
I have a big sparse matrix of users and items they like (on the order of 1M users and 100K items, with a very low density). I'm exploring ways in which I could perform kNN search on it. Given the size of my dataset and some initial tests I…
cjauvin
- 451
- 3
- 7
16
votes
3 answers
Parallel and distributed computing
What is(are) the difference(s) between parallel and distributed computing? When it comes to scalability and efficiency, it is very common to see solutions dealing with computations in clusters of machines, and sometimes it is referred to as a…
Rubens
- 4,097
- 5
- 23
- 42
14
votes
4 answers
Looking for example infrastructure stacks/workflows/pipelines
I'm trying to understand how all the "big data" components play together in a real-world use case, e.g. hadoop, mongodb/nosql, storm, kafka, ... I know that this is quite a wide range of tools used for different types, but I'd like to get to know…
chrshmmmr
- 143
- 7
13
votes
4 answers
Large Graphs: NetworkX distributed alternative
I have built some implementations using the native algorithms of NetworkX (a Python graph module), which output some attributes that I use for classification purposes.
I want to scale it to a distributed environment. I have seen many approaches like…
20-roso
- 670
- 1
- 5
- 15
12
votes
2 answers
Tradeoffs between Storm and Hadoop (MapReduce)
Can someone kindly tell me about the trade-offs involved when choosing between Storm and MapReduce in a Hadoop cluster for data processing? Of course, aside from the obvious one, that Hadoop (processing via MapReduce in a Hadoop cluster) is a batch…
mbbce
- 347
- 2
- 8
10
votes
2 answers
What is the difference between PyTorch's DataParallel and DistributedDataParallel?
I am going through this imagenet example.
In line 88, the module DistributedDataParallel is used. When I searched for it in the docs, I didn't find anything. However, I found the documentation for DataParallel.
So, I would like to know…
Dawny33
- 8,226
- 12
- 47
- 104
9
votes
1 answer
What is meant by Distributed for a gradient boosting library?
I am checking out XGBoost documentation and it's stated that XGBoost is an optimized distributed gradient boosting library.
What is meant by distributed?
Have a nice day
Tommaso Bendinelli
- 275
- 1
- 8
8
votes
3 answers
How to compare experiments run over different infrastructures
I'm developing a distributed algorithm, and to improve efficiency, it relies both on the number of disks (one per machine), and on an efficient load-balancing strategy. With more disks, we're able to reduce the time spent on I/O; and with an…
Rubens
- 4,097
- 5
- 23
- 42
8
votes
2 answers
Understanding how distributed PCA works
As part of a big data analysis project I'm working on,
I need to perform PCA on some data using a cloud computing system.
In my case, I'm using Amazon EMR for the job, and Spark in particular.
Leaving the "How-to-perform-PCA-in-spark" question aside, I…
Adiel
- 183
- 3
7
votes
2 answers
Python distributed machine learning
I occasionally train neural nets for my research, and they usually take quite a long time to run (especially when I'm working on my laptop).
I'm looking for a way to build the model on any computer and send it up to a server for training and have it…
Simon
- 1,071
- 2
- 10
- 28
6
votes
1 answer
How to make k-means distributed?
After setting up a two-node Hadoop cluster, understanding Hadoop and Python and based on this naive implementation, I ended up with this code:
def kmeans(data, k, c=None):
    if c is not None:
        centroids = c
    else:
        centroids = []
    …
gsamaras
- 291
- 6
- 15
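For readers skimming this listing, the general idea behind distributing k-means can be sketched in a map/reduce style. This is a minimal single-process illustration, not the asker's code and not a Hadoop API: the function names (`map_partition`, `reduce_partials`, `distributed_kmeans`) and the representation of partitions as in-memory lists are assumptions made here. Each node would compute per-centroid partial sums over its own data, and a reducer would merge them into new centroids.

```python
def nearest(point, centroids):
    # Index of the centroid with the smallest squared Euclidean distance.
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points, centroids):
    # "Map" step (hypothetical helper), run independently on each partition:
    # returns {centroid index: (coordinate sums, point count)}.
    partial = {}
    for point in points:
        idx = nearest(point, centroids)
        sums, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_partials(partials, centroids):
    # "Reduce" step (hypothetical helper): merge the partial sums from all
    # partitions and recompute each centroid as the mean of its points.
    merged = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            msums, mcount = merged.get(idx, ([0.0] * len(sums), 0))
            merged[idx] = ([a + b for a, b in zip(msums, sums)], mcount + count)
    # A centroid with no assigned points keeps its previous position.
    return [
        [s / merged[i][1] for s in merged[i][0]] if i in merged else list(c)
        for i, c in enumerate(centroids)
    ]


def distributed_kmeans(partitions, centroids, iterations=10):
    # In a real cluster the map calls would run on separate nodes;
    # here they run sequentially to keep the sketch self-contained.
    for _ in range(iterations):
        partials = [map_partition(part, centroids) for part in partitions]
        centroids = reduce_partials(partials, centroids)
    return centroids


partitions = [
    [(0.0, 0.0), (0.0, 2.0)],      # data held by a first node
    [(10.0, 10.0), (10.0, 12.0)],  # data held by a second node
]
result = distributed_kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)])
```

Only the small per-centroid sums and counts cross the network in this scheme, never the raw points, which is the main reason the algorithm parallelizes well.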
5
votes
2 answers
How to speed up message passing between computing nodes
I'm developing a distributed application, and as it's been designed, there'll be a great load of communication during the processing. Since the communication is already spread across the entire process as much as possible, I'm wondering if there are any…
Rubens
- 4,097
- 5
- 23
- 42
5
votes
2 answers
Distributed PCA or an equivalent
We normally have fairly large datasets to model on, just to give you an idea:
over 1M features (sparse, average population of features is around 12%);
over 60M rows.
A lot of modeling algorithms and tools don't scale to such wide datasets.
So…
Tagar
- 198
- 1
- 12
5
votes
1 answer
How to decide which test of normality to use
Given a data set with features that you want to check for normality, one feature at a time without a multivariate normal test, how do you decide which test of normality to use? For example, using the Python module scipy I could either use:…
FakeBrain
- 109
- 1
- 6