5

By "large", I mean in the range of 100 million to 10 billion rows.

I'm currently using both Hadoop MapReduce and Amazon Redshift. MapReduce has been a little disappointing here. Redshift works very well if the data is distributed well for the given query.

Are there other technologies that I should be looking at here? If so, what are the trade-offs?

Bill

3 Answers

4

More important than the technology is the type of join you are using. For instance, if both inputs are already sorted on the join keys, you can use a sort-merge join, and choosing a good join order can improve performance further.
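To make the sort-merge idea concrete, here is a minimal single-machine sketch in Python. It assumes both inputs are lists of `(key, value)` pairs already sorted by key; the function name `sort_merge_join` and the sample data are made up for illustration, and a real engine would stream from disk rather than hold lists in memory.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs, each already sorted by key."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key is too small, advance left
        elif lk > rk:
            j += 1  # right key is too small, advance right
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing within the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


# Hypothetical sample data, sorted by key.
orders = [(1, "o-a"), (2, "o-b"), (2, "o-c"), (4, "o-d")]
customers = [(2, "alice"), (3, "bob"), (4, "carol")]
joined = sort_merge_join(orders, customers)
print(joined)
```

Each input is scanned once, which is why pre-sorted data makes this join cheap: the sorting cost has already been paid.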

That being said, in-memory solutions give the fastest joins, provided your intermediate results are small enough to fit in your cluster's memory. Look at Spark SQL or MemSQL, for instance.
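As a rough illustration of why memory is the limiting factor, here is a sketch of a plain in-memory hash join in Python. This is not how Spark SQL or MemSQL implement joins internally; it only shows the general pattern where one side (the "build" side) has to fit in memory while the other side is streamed past it. The names `hash_join`, `small`, and `large` are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows):
    """Inner-join (key, value) pairs; build_rows must fit in memory."""
    # Build phase: index the smaller input by join key.
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    # Probe phase: stream the larger input and look up matches.
    for key, value in probe_rows:
        for build_value in table.get(key, ()):
            yield (key, build_value, value)


# Hypothetical sample data: a small dimension table and a larger fact table.
small = [(1, "US"), (2, "DE")]
large = [(2, "order-1"), (1, "order-2"), (3, "order-3"), (2, "order-4")]
result = list(hash_join(small, large))
print(result)
```

If the build side does not fit in memory, the engine has to spill or repartition, and most of the speed advantage goes away.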

Pirooz
0

If you are willing to pay for a vendor solution, Teradata is designed to solve the problem of large-scale joins with low latency.

Brian Spiering
0

I would say Teradata should be a good choice. However, when you join, you have to choose the join keys carefully.
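One way to read "choose the join keys carefully" is to check how evenly a candidate key spreads rows across nodes, since a distributed join is only as fast as its most loaded partition. The sketch below is an assumption about what matters here, not Teradata's actual hashing scheme: it buckets integer keys with `hash(key) % num_partitions`, and the names `partition_sizes` and `skew_ratio` are made up for illustration.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition under simple hash bucketing."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 is perfectly even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical key columns: one dominated by a single hot value, one evenly spread.
skewed_keys = [0] * 90 + list(range(1, 11))
even_keys = list(range(100))

skewed_sizes = partition_sizes(skewed_keys, 4)
even_sizes = partition_sizes(even_keys, 4)
print(skewed_sizes, skew_ratio(skewed_sizes))
print(even_sizes, skew_ratio(even_sizes))
```

A ratio well above 1.0 suggests that one node would do most of the work if you joined on that key.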

Paul