What technologies are fastest at performing joins on large datasets?

Question

By "large", I mean in the range of 100 million to 10 billion rows.

I'm currently using both Hadoop MapReduce and Amazon Redshift. MapReduce has been a little disappointing here. Redshift works very well if the data is distributed well for the given query.

Are there other technologies that I should be looking at here? If so, what are the trade-offs?

I think this might be a bit broad, and maybe better suited for StackOverflow. — Sean Owen, Nov 11 '14 at 11:29
MapReduce is not really a technology for joining. Do you mean Hive? Impala is the closest analog to Redshift and things like Teradata — Sean Owen, Jun 14 '15 at 09:39
consider [monetdb](http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html) — Anthony Damico, Jun 14 '15 at 12:44

score 4 · Answer 1 · answered Nov 10 '14 at 02:15

More important than the technology is the type of join you are using. For instance, if the join keys are sorted, you can use sort-merge joins and choose a good join order to get better performance.
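To make the sort-merge idea concrete, here is a minimal single-machine sketch in Python. It assumes both inputs are lists of `(key, value)` pairs already sorted by key; the function name `sort_merge_join` and the sample data are made up for illustration and do not come from any particular engine.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are already sorted by key."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key is smaller: advance the left cursor
        elif lk > rk:
            j += 1  # right key is smaller: advance the right cursor
        else:
            # Keys match: find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing of the two runs.
            for a in range(i, i_end):
                for b in range(j, j_end):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i_end, j_end
    return out


# Hypothetical sample data, sorted by key.
left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = sort_merge_join(left, right)
```

Because each input is scanned once, the merge phase is linear in the input sizes plus the output size, which is why pre-sorted keys help so much.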

That being said, you can use in-memory solutions for the fastest joins, as long as the intermediate results do not exceed your cluster's memory. Look at Spark SQL or MemSQL, for instance.

score 0 · Answer 2 · answered Dec 05 '14 at 20:13

If you are willing to pay for a vendor solution, Teradata is designed to solve the problem of large-scale joins with low latency.

Brian Spiering


If price is no object you could check out Netezza... – rawkintrevo Dec 05 '14 at 23:27

score 0 · Answer 3 · answered Jun 14 '15 at 08:56

I would say Teradata should be a good choice. However, when you perform a join, you have to choose the join keys carefully.
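One way to see why the choice of join keys matters is to estimate how many rows an inner join will produce from the key frequencies alone: each key contributes (rows on the left) × (rows on the right). The sketch below is only an illustration of that arithmetic; `estimate_join_rows` and the sample keys are hypothetical and not tied to Teradata or any other product.

```python
from collections import Counter


def estimate_join_rows(left_keys, right_keys):
    """Return (total output rows, rows per key) for an inner join on these keys."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Only keys present on both sides produce output rows.
    per_key = {
        key: left_counts[key] * right_counts[key]
        for key in left_counts
        if key in right_counts
    }
    return sum(per_key.values()), per_key


# Hypothetical low-cardinality key where one value dominates.
left_keys = ["us", "us", "us", "de"]
right_keys = ["us", "us", "fr", "de"]
total_rows, rows_per_key = estimate_join_rows(left_keys, right_keys)
```

A key where a few values account for most of the output tends to blow up intermediate results and, on a distributed system, to overload whichever node holds those values.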

Paul

