10

I am trying to detect outliers in the gas consumption of some Dutch buildings by building a neural network model. The results are very bad, and I can't find the reason.

I am not an expert, so I would like to ask what I can improve and what I am doing wrong. This is the complete description: https://github.com/denadai2/Gas-consumption-outliers.

The neural network is a feedforward network trained with backpropagation. As described here, I split the data into a "small" dataset of 41,000 rows and 9 features, and I tried adding more features.

I trained the networks, but the result is an RMSE of 14.14, so the model can't predict gas consumption well, and consequently I can't run a good outlier detection mechanism on top of it. I see that some papers, even when they predict daily or hourly electric power consumption, report errors like MSE = 0.01.

What can I improve? What am I doing wrong? Could you have a look at my description?

VividD
marcodena
  • 2
    What do you mean, bad results? Describe your process, your results, and how they differ from what you expected, instead of only linking to the git repository. Otherwise this discussion will be of no use to anyone. – Air Jun 18 '14 at 23:08
  • That's also true :D. I added the description to the page: "The result is an RMSE of 14.14, so the model can't predict gas consumption well, and consequently I can't run a good outlier detection mechanism. I see that some papers, even when they predict daily or hourly electric power consumption, report errors like MSE = 0.01." – marcodena Jun 18 '14 at 23:25
  • 1
    @marcodena This is a Q&A site, and others need to know what you're trying to solve so that they can understand the answers and, hopefully, use them for their own problems. That's what AirThomas meant, and it is also why it would be nice if you could describe what you're doing and what exactly you think is wrong. If the link to your GitHub page changes, the link here will be invalid, and others won't be able to understand what the problem is. Please take a minute to make your question self-contained. Thanks. – Rubens Jun 19 '14 at 00:35
  • @Rubens you are right, but as you see in the github link, the description is more than 6 A4 pages long, so here it would be very difficult to explain everything. I'll try. – marcodena Jun 19 '14 at 08:50
  • 1
    When you find that your problem takes a very long time to explain, that is when it's *most important* to spend the time to explain your question to others, explicitly and with plenty of details and discussion of your research/attempts. Often during that process you will find some or all of the answers yourself. Not only is that a great feeling, but if what you find is useful to others, you can still post the question you spent so much time on, *and* the answer(s) you came up with. – Air Jun 19 '14 at 23:53
  • 1
    Just a clarification, when you mention that "in some papers they have errors like MSE = 0.01", do you refer to the same dataset you are using? Or is it a different dataset altogether? – insys Jun 21 '14 at 17:29
  • @AirThomas that's why I created a repository with a full explanation, with graphs etc. – marcodena Jun 23 '14 at 09:38
  • @insys no.. different :) – marcodena Jun 23 '14 at 09:38
  • Project updated! – marcodena Jun 30 '14 at 23:01

4 Answers

8

Just an idea - your data is highly seasonal: daily and weekly cycles are quite perceptible. So first of all, try to decompose your variables (gas and electricity consumption, temperature, and solar radiation). Here is a nice tutorial on time series decomposition for R.

After obtaining the trend and seasonal components, the most interesting part begins. It's just an assumption, but I think the gas and electricity consumption variables would be quite predictable by means of time series analysis (e.g., an ARIMA model). From my point of view, the most exciting part here is to try to predict the residuals after decomposition, using the available data (temperature anomalies, solar radiation, wind speed). I suppose these residuals would be the outliers you are looking for. I hope you find this useful.
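
To make the idea concrete, here is a minimal sketch of a classical additive decomposition followed by a robust threshold on the residuals. It uses only the Python standard library rather than the R tutorial's functions. The weekly pattern, the trend, the injected spike at index 30, and the helper names `decompose_additive` and `flag_outliers` are all made up for illustration, and the sketch only handles an odd period (7 days here).

```python
import random
from statistics import mean, median

def decompose_additive(series, period):
    """Classical additive decomposition for an odd period:
    series = trend + seasonal + residual."""
    n, half = len(series), period // 2
    # Trend: centered moving average over one full cycle (undefined at the edges).
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = mean(series[t - half:t + half + 1])
    # Seasonal: typical detrended value at each position in the cycle.
    # The median keeps a single spike from distorting the seasonal profile.
    profile = []
    for p in range(period):
        vals = [series[t] - trend[t] for t in range(p, n, period) if trend[t] is not None]
        profile.append(median(vals))
    seasonal = [profile[t % period] for t in range(n)]
    residual = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                for t in range(n)]
    return trend, seasonal, residual

def flag_outliers(residual, threshold=3.5):
    """Indices whose residual is far from the median, measured in robust (MAD) units."""
    valid = [r for r in residual if r is not None]
    center = median(valid)
    scale = 1.4826 * median(abs(r - center) for r in valid)
    return [t for t, r in enumerate(residual)
            if r is not None and abs(r - center) > threshold * scale]

# Synthetic daily series: slow trend + weekly pattern + noise (all hypothetical).
rng = random.Random(0)
weekly = [5, 4, 4, 3, 3, -9, -10]
series = [100 + 0.2 * t + weekly[t % 7] + rng.gauss(0, 1) for t in range(70)]
series[30] += 25  # injected anomaly

trend, seasonal, residual = decompose_additive(series, period=7)
flagged = flag_outliers(residual)
print(flagged)
```

The flagged list should contain index 30. Note that the moving average smears the spike over its neighbours, so points close to a large anomaly can end up flagged as well; a robust trend estimate would reduce that effect.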

sobach
3

In your training notebook you present results for training with 20 epochs. Have you tried varying that parameter, to see if it affects your performance? This is an important parameter for back-propagation.

For estimating your model parameters, as user tomaskazemekas pointed out, plotting Learning Curves is a very good approach. In addition to that, you could also create a plot using a model parameter (e.g. training epochs or hidden layer size) vs. Training and Validation error. This will allow you to understand the bias/variance tradeoff, and help you pick a good value for your parameters. Some info can be found here. Naturally, it is a good idea to keep a small percentage of your data for a (third) Test set.
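
As a rough sketch of such a plot's underlying numbers, the snippet below computes training and validation RMSE for several values of one model parameter. To keep it short and dependency-free it swaps the neural network for a k-nearest-neighbours regressor, where `k` plays the role of the complexity knob (analogous to hidden layer size or number of epochs). The data are synthetic and the function names are my own.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, points, k):
    """RMSE of the k-NN model built on `train`, evaluated on `points`."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in points]
    return math.sqrt(sum(errors) / len(errors))

# Synthetic data: a noisy sine wave, split into interleaved train/validation sets.
rng = random.Random(1)
data = [(i / 200, math.sin(2 * math.pi * i / 200) + rng.gauss(0, 0.3)) for i in range(200)]
train, valid = data[0::2], data[1::2]

curve = {}
for k in [1, 3, 5, 9, 15, 31, 61]:
    curve[k] = (rmse(train, train, k), rmse(train, valid, k))
    print(f"k={k:2d}  train RMSE={curve[k][0]:.3f}  validation RMSE={curve[k][1]:.3f}")

# The value with the lowest validation error is the candidate to keep.
best_k = min(curve, key=lambda k: curve[k][1])
print("best k on validation:", best_k)
```

A very small `k` gives zero training error but a worse validation error (high variance), while a very large `k` makes both errors grow (high bias). The same reading applies when the x-axis is the number of epochs or hidden units of a network.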

As a side note, it seems that increasing the number of neurons in your model shows no significant improvement in your RMSE. This suggests that you could also try a simpler model, i.e. one with fewer neurons, and see how it behaves.

In fact, I would suggest (if you haven't done so already) trying a simple model with few or no hyperparameters first, e.g. linear regression, and comparing your results with the literature, just as a sanity check.
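
One possible way to set up such a baseline on a time series is to regress each value on its previous value (a lag feature) and compare against the naive "same as the last observation" forecast. The sketch below does this with a hand-written least-squares fit on a synthetic autocorrelated series; the series, the 400/100 split, and the function names are assumptions for illustration, not your data.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

# Synthetic autocorrelated series standing in for consumption readings.
rng = random.Random(2)
series = [20.0]
for _ in range(499):
    series.append(20 + 0.8 * (series[-1] - 20) + rng.gauss(0, 2))

# Lag feature: predict y[t] from y[t-1]. Split chronologically, never randomly.
lagged, target = series[:-1], series[1:]
split = 400
intercept, slope = fit_line(lagged[:split], target[:split])

test_pred = [intercept + slope * x for x in lagged[split:]]
print("linear regression RMSE:", rmse(test_pred, target[split:]))
print("persistence RMSE:      ", rmse(lagged[split:], target[split:]))
```

If a neural network cannot beat these two numbers on the same held-out period, the problem is more likely in the features or the training setup than in the network size. Keep in mind that RMSE values are only comparable between papers when the target is on the same scale; an MSE of 0.01 on normalized data says little about an RMSE of 14.14 in raw units.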

insys
  • I added some graphs, after having improved the model A LOT. The new steps are on GitHub. May I ask how I can apply linear regression to a time series problem? :( – marcodena Jun 30 '14 at 22:59
2

The main problem here is that even before attempting to apply anomaly detection algorithms, you are not getting good enough predictions of gas consumption using neural networks.

If the main goal here is to reach the stage when anomaly detection algorithms could be used and you state that you have access to examples of successful application of linear regression for this problem, this approach could be more productive. One of the principles of successful machine learning application is that several different algorithms can be tried out before final selection based on results.

If you choose to tune your neural network's performance, you can plot learning curves that show the effect of changing different hyperparameters on the error rate. Hyperparameters that can be modified are:

  • number of features
  • order of the polynomial
  • regularization parameter
  • number of layers in the network

The best settings can be selected by their performance on the cross-validation set.
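
A minimal sketch of that selection step, assuming the hyperparameter being tuned is a regularization parameter: the example uses a one-feature ridge regression (closed form, no intercept) in place of the network, synthetic data, and 5 interleaved folds. All names and values here are hypothetical.

```python
import random

def ridge_fit(xs, ys, lam):
    """Closed-form ridge solution for y = w * x (single feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, folds=5):
    """Mean validation MSE over `folds` interleaved folds."""
    total = 0.0
    for f in range(folds):
        tr = [i for i in range(len(xs)) if i % folds != f]
        va = [i for i in range(len(xs)) if i % folds == f]
        w = ridge_fit([xs[i] for i in tr], [ys[i] for i in tr], lam)
        total += sum((w * xs[i] - ys[i]) ** 2 for i in va) / len(va)
    return total / folds

# Synthetic, independent samples: y = 2x + noise.
rng = random.Random(3)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

# Score every candidate on held-out folds and keep the best one.
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(xs, ys, lam) for lam in lambdas}
best_lambda = min(scores, key=scores.get)
for lam in lambdas:
    print(f"lambda={lam:<6}  CV MSE={scores[lam]:.4f}")
print("selected lambda:", best_lambda)
```

Interleaved folds are fine for independent samples like these. For consumption time series, contiguous or forward-chaining folds are the safer choice, since neighbouring readings are strongly correlated and would leak information between training and validation.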

tomaskazemekas
2

I did not see your neural network model in your notebooks. Can you point out which library you are using, how many layers you have, and what type of neural network it is?

In your notebooks, it seems you are training the neural network on the noisy dataset that still contains the outliers. I think you should train it on a dataset without any outliers, so that you can then use the distance between each observation and the network's prediction to label the observation as an outlier or not.
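
Here is a small sketch of that workflow: fit on clean data only, estimate the typical residual size there, then flag new observations whose distance from the prediction is too large. A straight-line fit of gas consumption against outdoor temperature stands in for the neural network; the data, the 3-sigma cutoff, and the function names are assumptions for illustration.

```python
import random
from statistics import pstdev

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Clean training data (synthetic): consumption falls as temperature rises.
rng = random.Random(4)
clean_temp = [rng.uniform(-5, 20) for _ in range(200)]
clean_gas = [50 - 2 * t + rng.gauss(0, 1) for t in clean_temp]

# Fit on clean data only, then measure the residual spread on that same data.
intercept, slope = fit_line(clean_temp, clean_gas)
clean_residuals = [g - (intercept + slope * t) for t, g in zip(clean_temp, clean_gas)]
sigma = pstdev(clean_residuals)

def is_outlier(temp, gas, n_sigma=3.0):
    """True if the observation is more than n_sigma residual deviations from the prediction."""
    return abs(gas - (intercept + slope * temp)) > n_sigma * sigma

# New (temperature, consumption) readings; the third one is deliberately off.
new_obs = [(5.0, 40.5), (10.0, 29.0), (0.0, 80.0), (15.0, 20.8)]
flagged = [i for i, (t, g) in enumerate(new_obs) if is_outlier(t, g)]
print(flagged)
```

The important design point is that `sigma` comes from data known to be clean. If outliers are present during fitting, they inflate both the model error and the threshold, which makes them harder to detect afterwards.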

I wrote a couple of things on outlier detection in time-series signals. Your data is highly seasonal, as sobach mentioned, and you could use an FFT (first link above) to get the overall trend in the signal. Once you have the frequency components of the gas consumption, you could look at the high-frequency components to find the outliers.
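
A toy version of the frequency-domain idea, written with a naive O(n²) discrete Fourier transform from the standard library instead of a real FFT routine so that it runs anywhere. The signal (four days of hourly values with a daily cycle), the spike at index 50, and the cutoff of 6 cycles are all invented for the example.

```python
import cmath
import math

def dft(values):
    """Naive discrete Fourier transform (fine for short signals)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
            for k in range(n)]

def low_pass(values, cutoff):
    """Keep only frequency bins up to `cutoff` cycles per record, then invert."""
    n = len(values)
    spectrum = dft(values)
    # Bin k and bin n - k describe the same frequency, so both are kept or dropped.
    kept = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(kept)).real / n
            for t in range(n)]

# Four days of hourly readings with a daily cycle and one injected spike.
n = 96
signal = [10 + 3 * math.sin(2 * math.pi * t / 24) for t in range(n)]
signal[50] += 8

smooth = low_pass(signal, cutoff=6)          # slow, seasonal part of the signal
residual = [s - m for s, m in zip(signal, smooth)]  # what the high frequencies carry
suspect = max(range(n), key=lambda t: abs(residual[t]))
print(suspect, round(residual[suspect], 2))
```

The low-frequency reconstruction captures the level and the daily cycle, so the largest residual sits at index 50, where the spike was injected. On real data the cutoff has to be chosen so that the known cycles (daily, weekly) fall below it.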

Also, if you insist on using a neural network for seasonal data, you may want to check out recurrent neural networks, as they can incorporate past observations better than a vanilla neural network and may provide better results for the data that you have.

Bugra