
Imagine there are multiple locations of interest from which water samples are gathered manually. Each sample is immediately analyzed, converted to a numerical value (a real number), and fed into a database. These values are correlated such that some geographically nearby samples produce values more similar to each other than to samples from other bodies of water. For example, changes in a lake will affect samples taken downstream, but not others (this is only a general rule, though). Temporal distance has a similar effect, of course.

The sampling frequency varies over time and by location of interest: some locations are sampled multiple times a day, while others might go for days without being sampled. Similarly, more sampling is done during the day than at night, for example. Meanwhile the body of water lives on and the numerical value of a hypothetical sample keeps evolving; there is just no one there to sample it.

The problem has two points of view:

  1. Given a new sample and its numerical value $X_a$ from location $A$, and all preceding samples from all locations, what is the probability that $X_a$ is the true value and not a measurement error (e.g. instrument contamination), a typo, or anything else (e.g. some lazy sampler just guessing the next value to get their reward)? (A sketch of the kind of check I have in mind follows this list.)
  2. Given that some time has passed without any new samples, what is the probability that the last value from each location is equal to the true value at that location if a sample were taken now?
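
To make the first question concrete, here is a minimal sketch of the kind of check I have in mind. It assumes that whatever model is recommended can produce a Gaussian predictive distribution for the new sample given all preceding samples, and treats errors as a second, much broader component; the mixture weight and error width are made-up numbers for illustration only.

```python
# Hypothetical check for question 1: is a new sample the "true value" or an error?
# Assumes a Gaussian predictive distribution N(mu, sigma^2) from some model
# conditioned on all preceding samples; p_error and error_scale are illustrative.
from scipy.stats import norm

def prob_true_value(x_a, mu, sigma, p_error=0.05, error_scale=10.0):
    """Posterior probability that x_a came from the 'true value' component
    rather than a deliberately broad 'error' component (typo, contamination, guessing)."""
    lik_true = norm.pdf(x_a, loc=mu, scale=sigma)                 # density under the predictive distribution
    lik_error = norm.pdf(x_a, loc=mu, scale=error_scale * sigma)  # density under the broad error distribution
    post_true = (1.0 - p_error) * lik_true
    post_error = p_error * lik_error
    return post_true / (post_true + post_error)                   # Bayes' rule over the two components

print(prob_true_value(x_a=7.2, mu=7.0, sigma=0.3))   # plausible sample -> close to 1
print(prob_true_value(x_a=12.0, mu=7.0, sigma=0.3))  # implausible sample -> close to 0
```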

In my view there is a slight difference between the two cases: in the first case, if the value is estimated to be reliable enough (i.e. it has a high enough probability of being the true value), it cannot be ignored in future estimations, while in the second case we are talking about hypothetical samples that can be discarded once a real sample arrives.

The primary question is: what methods would best handle this kind of problem? I've noticed that LSTM networks are used a lot and Kalman filters also pop up here and there, but what bothers me about those approaches is that I don't feel they properly address the increase in uncertainty as the temporal distance to the last sample grows.
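
To be explicit about what I mean by growing uncertainty, here is a toy sketch of the behaviour I would want from such a method: the unobserved true value at one location is modelled as a continuous-time random walk, so the predictive variance grows with the length of the gap before each new sample. The process and measurement variances are placeholders I made up, not estimates from data.

```python
# Toy random-walk state-space model with Kalman-style predict/update steps.
# The point of interest: the predictive variance grows linearly with the time
# elapsed since the previous sample. process_var and meas_var are placeholders.
import numpy as np

def predict(mu, var, dt, process_var=0.01):
    """Propagate the state estimate over a gap of dt hours with no sample."""
    return mu, var + process_var * dt          # variance grows with the gap

def update(mu, var, x, meas_var=0.05):
    """Standard Kalman update with a new measurement x."""
    k = var / (var + meas_var)                 # Kalman gain
    return mu + k * (x - mu), (1.0 - k) * var

# Irregularly spaced samples: (hours since the previous sample, measured value)
samples = [(0.0, 7.0), (6.0, 7.1), (72.0, 6.8)]  # note the three-day gap
mu, var = samples[0][1], 0.05
for dt, x in samples[1:]:
    mu, var = predict(mu, var, dt)
    print(f"gap={dt:5.1f} h  predictive std before update: {np.sqrt(var):.3f}")
    mu, var = update(mu, var, x)
```

If Kalman filters (or LSTMs) are normally set up to behave like this with irregular sampling intervals, then that is exactly the part I may have misunderstood, which leads to the next question.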

A secondary question is: if I've misunderstood the applicability of, e.g., Kalman filters to this kind of problem, would you be so kind as to explain to me why they could in fact be a good fit for it?

Selecting the best method sometimes also depends on the amount of data, so in this case you can assume a few thousand locations and a few hundred thousand measurements spanning a few years.

Last, if I could ask you to prioritize non-NN methods, especially Bayesian methods, I'd doubly appreciate your effort.

Laurimann
  • Did you have a look into Kriging / Gaussian Processes? It sounds like a good fit. The sample size is quite high, so you might need a powerful computer and some patience while waiting for the results – Broele Apr 23 '23 at 14:05
  • @Broele This question was posted originally to find a more lightweight alternative to Kriging / GP. I am unable to come up with an alternative method that would make as much sense as GP. If you would like to earn an answer, could you create a formal answer that goes a bit into the details of how you'd recommend modelling the problem with Kriging/GP using Python (e.g. one model to cover all locations or one model per location, what libraries to use), and I'll accept it (a rough sketch of what I mean follows these comments). – Laurimann Apr 24 '23 at 05:45
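
For concreteness, here is a rough per-location sketch along the lines of the Kriging/GP suggestion above. The choice of scikit-learn and the kernel settings are assumptions for illustration, not recommendations from the thread; a joint spatio-temporal model would instead put all locations into a single GP with an additional spatial kernel over location coordinates.

```python
# One independent Gaussian process per location, over time only.
# Library choice (scikit-learn) and kernel hyperparameters are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Times in days since the start, and the measured values, for ONE location
t = np.array([0.0, 0.3, 1.1, 4.0, 4.2, 9.5]).reshape(-1, 1)
y = np.array([7.0, 7.1, 6.9, 6.5, 6.6, 7.2])

kernel = 1.0 * RBF(length_scale=2.0) + WhiteKernel(noise_level=0.05)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t, y)

# Predictive mean and std at later query times; the std widens as the query
# moves away from the last sample, i.e. uncertainty grows with temporal distance.
t_query = np.array([[10.0], [20.0]])
mean, std = gp.predict(t_query, return_std=True)
print(mean, std)
```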

0 Answers