
I am very new to data science, but I have a use case that I want to solve.

I want to build a data synchronization scheduler that tracks how much data is synced on each scheduled trigger and automatically adjusts the next schedule.

For example:

Suppose I have 3 jobs to execute. Currently each of them runs at a fixed interval (say, 5 minutes), but I want the interval to be chosen automatically.

  • At 10 AM, job 1 ran and fetched 10 entries.
  • At 10 AM, job 2 ran and fetched 100 entries.
  • At 10 AM, job 3 ran and fetched 200 entries.

In this scenario, job 1 received a smaller stream of data than jobs 2 and 3. The auto-scheduler should then adjust the intervals and recommend the next executions at:

  • Job 1: maybe a 10-minute interval
  • Job 2: maybe a 5-minute interval
  • Job 3: maybe a 2-minute interval

The scheduler should also train itself on time-based historical data. For instance, job 3 might receive more data at 10 AM but less at 1 PM, when job 1 might have more. The scheduler would then automatically make the interval around 1 PM shorter for job 1 than for job 3.

Can you suggest an algorithm I can follow to support this case? Even guidance on how to approach it with ML would help me a lot.

abhishek

1 Answer


I want to build a data synchronization scheduler that tracks how much data is synced on each scheduled trigger and automatically adjusts the next schedule.

This doesn't immediately strike me as a "use machine learning" problem, to be honest. If you just want the scheduler to set the gap before the next run based on the number of records processed in the current batch, that's a simple, deterministic formula. You could have the default gap be 20 minutes and then do something like:

next gap = 20 * (20 / n)

Where n is the number of records processed in the last run. That would mean in this case:

  • At 10 AM, job 1 ran and fetched 10 entries.
  • At 10 AM, job 2 ran and fetched 100 entries.
  • At 10 AM, job 3 ran and fetched 200 entries.

The next job 1 would be scheduled after 20 * (20 / 10) = 40 minutes, the next job 2 after 20 * (20 / 100) = 4 minutes, and the next job 3 after 20 * (20 / 200) = 2 minutes.
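As a rough sketch, the 20 * (20 / n) rule used in the worked numbers could be written as a small Python function. The function name, the parameter names, and the clamping bounds (`min_gap`, `max_gap`) are my own assumptions for illustration; the clamp is only there so that an empty or very small batch does not push the next run arbitrarily far out.

```python
def next_gap_minutes(n_records, base_gap=20.0, base_records=20.0,
                     min_gap=1.0, max_gap=60.0):
    """Minutes to wait before the next run.

    Implements base_gap * (base_records / n_records), then clamps the
    result to [min_gap, max_gap]. The clamp bounds are assumed values,
    not part of the formula itself.
    """
    if n_records <= 0:
        # Nothing was processed: fall back to the longest allowed gap.
        return max_gap
    gap = base_gap * base_records / n_records
    return max(min_gap, min(max_gap, gap))


# Record counts from the 10 AM run in the question.
last_run = {"job 1": 10, "job 2": 100, "job 3": 200}
for job, n in last_run.items():
    print(job, next_gap_minutes(n), "minutes")
```

Tuning `base_gap` and `base_records` per job would be one way to keep a high-volume but low-priority job from being scheduled too aggressively.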

If you really wanted to use ML for it, I'd suggest using a time series forecasting algorithm like ARIMA or Prophet to predict the number of records each job will have to process over, say, the next hour, and setting the next run time based on that.
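To make the "forecast, then schedule" idea concrete without pulling in a forecasting library, here is a minimal stand-in. It is not ARIMA or Prophet: it simply averages past volume per hour of day and uses that as the forecast. All names, the `target_batch` size, and the sample history are hypothetical; a real forecaster would replace `hourly_profile`.

```python
from collections import defaultdict
from statistics import mean


def hourly_profile(history):
    """Average records per hour of day.

    history: iterable of (hour_of_day, records_in_that_hour) pairs.
    This hour-of-day mean is a naive placeholder for a real forecast.
    """
    buckets = defaultdict(list)
    for hour, records in history:
        buckets[hour].append(records)
    return {hour: mean(values) for hour, values in buckets.items()}


def gap_for_hour(profile, hour, target_batch=100,
                 min_gap=1.0, max_gap=60.0):
    """Gap in minutes so each run handles roughly target_batch records."""
    expected_per_hour = profile.get(hour)
    if not expected_per_hour:
        # No history (or zero volume) for this hour: use the longest gap.
        return max_gap
    gap = 60.0 * target_batch / expected_per_hour
    return max(min_gap, min(max_gap, gap))


# Made-up history for one job: busy at 10 AM, quiet at 1 PM.
job3_history = [(10, 2400), (10, 2600), (13, 100), (13, 140)]
profile = hourly_profile(job3_history)
print(gap_for_hour(profile, 10))  # short gap during the busy hour
print(gap_for_hour(profile, 13))  # long gap during the quiet hour
```

The point of the design is that the gap is chosen from the expected volume for the upcoming hour rather than from the last batch, so a recurring spike is anticipated instead of chased.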

Dan Scally
  • Thanks Dan, that helped me a lot. It seems I have been overthinking this. I am trying out the algorithms too, to see whether they help. Otherwise, the formula you provided looks perfect. I will also keep this open for a few more days to see if there are any other perspectives. Thank you so much anyway. – abhishek Aug 05 '19 at 13:34
  • @abhishek no problem :) – Dan Scally Aug 05 '19 at 13:35
  • One more thought: I know the formula might work without ML, but there is one thing it might not take into account. For example, at 10 AM I might be offering some extra credit, because of which orders come in more often at that time every day; the formula would not account for this. On the other hand, there are processes, like the inventory update job or the product update job, where the schedule might not matter much even though the data volume is big. I am trying out the algorithm you provided, though. – abhishek Aug 05 '19 at 13:48
  • @abhishek Fair enough. If it's important that the schedule preempt big spikes in input that you know will happen at particular times of day, then a model like Prophet will forecast those spikes very nicely. – Dan Scally Aug 05 '19 at 13:55