6

(Note: I pulled this question from the list of questions in Area51, but I believe it is self-explanatory. I also believe I understand its general intent, so I should be able to field any follow-up questions that come up.)

Which Big Data technology stack is most suitable for processing tweets, extracting and expanding URLs, and pushing only new links into a third-party system?

blunders
  • 1
    About how many tweets per "run" are we talking about? – Johnny000 May 15 '14 at 07:02
  • It's hard to say without a rough volume: 1,000 tweets, 100,000, the full firehose, etc. – MCP_infiltrator May 15 '14 at 13:01
  • 1
    @Johnny000: [500 million Tweets a day](https://blog.twitter.com/2013/new-tweets-per-second-record-and-how); my understanding is that Twitter limits streams to vendors based on trust/need, but to ensure the solution covers current daily averages, it should account for the maximum volume, or whatever maximum you can reference from a reliable source other than yourself. – blunders May 15 '14 at 13:24
  • 1
    Here's more information: it appears the "firehose" data feed is pricey, so I'm guessing it would be more relevant to limit input to the volume produced by the [Twitter Streaming APIs](https://dev.twitter.com/docs/api/streaming) via the Public Stream; a [guide to processing the data is here](https://dev.twitter.com/docs/streaming-apis/processing#Scaling). – blunders May 15 '14 at 20:48
  • The Public Stream is a random sample of the "firehose" and appears to be 1% of its total volume, so I estimate the feed at 5 million tweets per day, with spikes that might reach 1432 tweets per second. It appears the spikes must be accounted for; otherwise the feed gets disconnected. – blunders May 15 '14 at 20:49
  • 1
    I'd suggest Apache Kafka as a message store and any stream processing solution of your choice, such as Apache Camel or Twitter Storm – Konstantin V. Salikhov May 18 '14 at 11:31
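
To put the volume estimates from the comments side by side, here is a rough back-of-the-envelope calculation. It uses only the figures quoted above (500 million tweets a day, a 1% sample, spikes of 1432 tweets per second); the variable names are illustrative, and the 1% sampling rate is the commenter's estimate rather than a confirmed figure.

```python
# Back-of-the-envelope sizing from the figures quoted in the comments.
SECONDS_PER_DAY = 24 * 60 * 60

firehose_per_day = 500_000_000   # quoted daily firehose volume
sample_percent = 1               # assumed Public Stream sampling rate
peak_per_second = 1432           # quoted spike rate for the sampled stream

# Daily volume of the sampled stream (integer arithmetic, no rounding issues).
sample_per_day = firehose_per_day * sample_percent // 100

# Average rate the pipeline must sustain, and how far a spike exceeds it.
average_per_second = sample_per_day / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second

print(f"sampled stream: {sample_per_day:,} tweets/day")
print(f"average rate:   {average_per_second:.1f} tweets/s")
print(f"peak / average: {peak_to_average:.1f}x")
```

Under these assumptions the average load is modest, but a spike is more than an order of magnitude above it, which is why a buffering layer in front of the processing stage matters more than raw throughput.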

1 Answer

3

I'd suggest Apache Kafka as a message store and any stream processing solution of your choice, such as Apache Camel or Twitter Storm
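
Whichever stream processor is chosen, the per-tweet logic from the question (extract URLs, expand them, push only new links) might look roughly like the sketch below. It is a minimal, framework-free illustration, not a Kafka or Storm implementation: `extract_urls`, `new_links`, `fake_expand`, and the redirect table are hypothetical names, the URL regex is deliberately simple, and the in-memory `seen` set stands in for whatever shared store a real deployment would need. Expansion is passed in as a function so the example runs without network access.

```python
import re
from typing import Callable, Iterable, Iterator, Set

# Deliberately simple URL matcher; real tweets may need a stricter pattern.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text: str) -> list:
    """Return the URLs found in a tweet, minus trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(text)]


def new_links(
    tweets: Iterable[str],
    expand: Callable[[str], str],
    seen: Set[str],
) -> Iterator[str]:
    """Yield each expanded URL the first time it is seen."""
    for text in tweets:
        for short_url in extract_urls(text):
            full_url = expand(short_url)
            if full_url not in seen:
                seen.add(full_url)
                yield full_url


# Hypothetical stand-in for following HTTP redirects of shortened links.
FAKE_REDIRECTS = {
    "http://t.co/abc": "https://example.com/article",
    "http://t.co/xyz": "https://example.com/article",
}


def fake_expand(url: str) -> str:
    return FAKE_REDIRECTS.get(url, url)


tweets = [
    "Read this http://t.co/abc now!",
    "Same story: http://t.co/xyz",
    "Something else http://example.org/page.",
]

# Only links not seen before would be pushed to the third-party system.
pushed = list(new_links(tweets, fake_expand, set()))
print(pushed)
```

In a Kafka-based setup, `tweets` would presumably be messages consumed from a topic and the `seen` set would live in an external store shared by all workers, since deduplicating on the expanded URL only works if every worker checks the same state.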