
I am trying to implement a Deep Q-Network (DQN) model for dynamic pricing in logistics. I can define:

  1. State space: origin, destination, type of shipment, customer, type of product, commodity of the shipment, availability of capacity, etc. (a possible numeric encoding of these features is sketched after this list).

  2. Action space: the price itself, which can range from 0 to infinity; the agent needs to determine this price.

  3. Reward signal: rewards can be based on similar offers made to other customers, seasonality, and the remaining capacity.
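
Most of these state features are categorical, so before they can be fed to a neural network they need to be turned into a fixed-length numeric vector. Below is a minimal sketch of one possible encoding; the category values, the one-hot scheme, and the capacity normalisation are illustrative assumptions, not something fixed by the problem.

    # Illustrative state encoding: one-hot blocks for the categorical fields plus a
    # normalised remaining-capacity feature. All category lists are placeholders.
    ORIGINS        = ["AMS", "FRA", "SIN"]
    DESTINATIONS   = ["JFK", "DXB", "NRT"]
    SHIPMENT_TYPES = ["general", "perishable", "dangerous"]
    TOTAL_CAPACITY_KG = 10_000

    def one_hot(value, categories):
        return [1.0 if value == c else 0.0 for c in categories]

    def encode_state(origin, destination, shipment_type, remaining_kg):
        # Normalising the capacity keeps all inputs on a similar scale.
        return (one_hot(origin, ORIGINS)
                + one_hot(destination, DESTINATIONS)
                + one_hot(shipment_type, SHIPMENT_TYPES)
                + [remaining_kg / TOTAL_CAPACITY_KG])

    state_vector = encode_state("AMS", "JFK", "perishable", 7_500)  # length 10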

I am planning to use a multi-layer perceptron that takes the encoded state as input and outputs the price.
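
One caveat: a DQN estimates a value for each member of a discrete action set, so the MLP cannot output a continuous price directly. A common workaround, assumed here purely for illustration, is to discretize the price into a grid of levels around the typical "similar offer" price and let the network output one Q-value per level. A minimal sketch, assuming PyTorch (the layer sizes and price grid are placeholders):

    import torch
    import torch.nn as nn

    N_STATE_FEATURES = 10                        # length of the encoded state vector
    PRICE_LEVELS = [1.5, 2.0, 2.5, 3.0, 3.5]     # illustrative $/kg grid around the similar offer

    class QNetwork(nn.Module):
        def __init__(self, n_features, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, 64),
                nn.ReLU(),
                nn.Linear(64, 64),
                nn.ReLU(),
                nn.Linear(64, n_actions),        # one Q-value per discrete price level
            )

        def forward(self, state):
            return self.net(state)

    q_net = QNetwork(N_STATE_FEATURES, len(PRICE_LEVELS))
    state = torch.zeros(1, N_STATE_FEATURES)     # placeholder encoded state
    best_price = PRICE_LEVELS[q_net(state).argmax(dim=1).item()]  # greedy price choice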

I am not sure how to define a reward function. Please help me define a mathematical formula for the reward function, given that the price is the action.

-- UPDATE --

The part of the state that evolves over time is the remaining capacity. Suppose there are 10,000 kg of capacity at the initial time step; over time the capacity decreases as bookings are accepted, and when the capacity is exhausted and no more shipments can be taken, the episode ends.
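
To make that concrete, here is a minimal sketch of the episode dynamics as described above; the customer acceptance model and the reward line are placeholders (defining the reward properly is exactly what this question asks about), not part of the problem statement.

    import random

    class CapacityEpisode:
        """One container / departure: the remaining capacity is the evolving state."""

        def __init__(self, total_capacity_kg=10_000):
            self.remaining_kg = total_capacity_kg

        def step(self, offered_price_per_kg, requested_kg, acceptance_prob):
            # Placeholder acceptance model: in reality acceptance would depend on
            # the offered price relative to similar offers, seasonality, etc.
            accepted = (random.random() < acceptance_prob
                        and requested_kg <= self.remaining_kg)
            # Placeholder reward: revenue from an accepted booking, 0 otherwise.
            reward = offered_price_per_kg * requested_kg if accepted else 0.0
            if accepted:
                self.remaining_kg -= requested_kg
            # Episode ends when capacity is exhausted (or, per the comments below,
            # when the 14-day booking horizon runs out).
            done = self.remaining_kg <= 0
            return self.remaining_kg, reward, done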

The agent has to find an optimal price based on reward factors such as similar offers for the same origin, destination and shipment type, seasonality, and the remaining capacity (discussed further in the comments below).

  • The way to define a reward is to start with your goals and how you measure success of the agent. Could you add those? Also, you don't seem to have a state space that needs reinforcement learning. It looks more like a contextual bandit problem. Could you please identify any state variables that evolve over time, and what the time steps are? If each time step is a new, unrelated customer etc., then this is not really RL, although repeats of the same customer might be handled as RL. – Neil Slater Mar 15 '19 at 12:12
  • Hi, I have updated the question. Kindly take a look into it. – Karthik Rajkumar Mar 15 '19 at 12:56
  • Thanks, that explains well how this maps to RL. However, I am still not sure what the goals are. Will it simply be the total price sold at, or profit? Profit seems more likely the true goal; presumably you need to account for the current mix of destinations and route plan if this is a single container which must tour all the destinations in its itinerary? – Neil Slater Mar 15 '19 at 13:38
  • For example, the similar price offered for the same origin, destination and shipment type is $2.50 per kilo; based on that similar offer we can increase or decrease our price so the customer will accept the offer we provide. Take seasonality: at festival times we can increase the price as there will be more demand, or if capacity decreases and only a few kilos are left to fill, we can increase the price. – Karthik Rajkumar Mar 15 '19 at 13:41
  • As well as capacity filling being the end of the episode, is this time limited? If you have an infinite number of customers lined up, then you can just set a very high price and wait to make a huge profit. But reality is not like that, and once you accept your first customer, you will have limited opportunities to fill the rest of the capacity or be in breach of contract, etc. – Neil Slater Mar 15 '19 at 13:42
  • It is not time limited. Usually we take bookings from 14 days before departure. – Karthik Rajkumar Mar 15 '19 at 13:44
  • And we cannot come up with some arbitrarily large number, as it would become irrelevant. So the similar offer plays an important role: the price will be around that range, and after that we can increase or decrease it based on other factors. – Karthik Rajkumar Mar 15 '19 at 13:46
  • @NeilSlater The price is bounded within a range based on the similar offer provided or a competitor's price. So we don't give a price which is too high, but we optimise the price by factors like seasonality and availability of space. For this I need to come up with a reward function. Kindly help in this regard. – Karthik Rajkumar Mar 15 '19 at 14:30
  • 14 days looks very much like a time limit to me. I *am* helping - that is what this conversation is - you are further away from a valid question than you think, and these questions help clarify what you need (as opposed to what you *think* that you need). The same logic applies to price as I already suggested - something is preventing you setting the highest price in the range, and that thing is (I guess) that your eventual reward depends strongly on customers actually accepting your offer, combined with a limited number of customer opportunities. The episode may end with a half full container – Neil Slater Mar 15 '19 at 14:34
  • Sorry for the misunderstanding, Neil; I'll explain how it works. We have a period called the booking horizon, which is departure minus 14 days, and the capacity will be huge. During this period we can accept bookings, and as departure approaches the percentage booked increases. At this point in time we do not have an RL model for dynamic pricing; it was done manually, through some sort of intuition and by looking at influencing factors, and we have that data. That is what I meant: the similar offer can be taken from the already available data. Pardon me if I didn't answer the question you asked. – Karthik Rajkumar Mar 15 '19 at 14:48
  • Yes, I agree, Neil: the eventual reward must depend strongly on customers accepting the offer. But during training we will not have such data, so we cannot construct a reward function with this data point, can we? – Karthik Rajkumar Mar 15 '19 at 15:01
  • You have historical data? Also, that is the main point of using RL here - it can deal with delayed rewards. Your most likely reward signal is going to be profit per container/episode. If I get to writing an answer I will be pointing out the details of that - by asking for a "reward function" that applies per action you are showing some misunderstanding of how RL typically works. However, there's a lot of flexibility and I don't work in logistics, so I am not 100% sure. – Neil Slater Mar 15 '19 at 15:05
  • That's fine, Neil, thanks for your time and effort. One question: is this use case a valid RL problem? – Karthik Rajkumar Mar 15 '19 at 15:11
  • Yes this looks like a valid RL problem to me. I think you just need to check the difference between *reward* and *return* (or *utility*) - the latter is typically a sum or mean over the former. RL gives you a mechanism to estimate the *utility* of an action, and this is typically what you seek to learn by using some RL algorithm. Whilst the main reward function should be something you define at the outset, like "the profit in dollars, after costs", even if that might only be known 2 weeks after an episode ends. – Neil Slater Mar 15 '19 at 15:18
  • Oh, correct. I somehow got confused between return and reward. Exactly. Thanks, Neil. – Karthik Rajkumar Mar 15 '19 at 15:21
  • Quick question - have you normalised your state? – ArchanaR Dec 28 '22 at 12:36
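
To make the reward/return distinction from the comments above concrete, here is a minimal sketch following Neil Slater's suggestion of profit per container/episode as the underlying goal; the per-kilo handling cost and the profit-based reward are illustrative assumptions only.

    # Per-step reward: profit from one booking decision (0 if the offer is rejected).
    # cost_per_kg is an illustrative placeholder, not a real figure.
    def step_reward(accepted, price_per_kg, weight_kg, cost_per_kg=1.0):
        return (price_per_kg - cost_per_kg) * weight_kg if accepted else 0.0

    # Return (utility) of an episode: the discounted sum of per-step rewards.
    # With gamma = 1 this is simply the total profit for the container, which is
    # what the DQN's Q-values estimate and what the agent tries to maximise.
    def episode_return(step_rewards, gamma=1.0):
        return sum((gamma ** t) * r for t, r in enumerate(step_rewards))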

0 Answers