Publication date: 15 June 2018
Source: Building and Environment, Volume 138
Author(s): Luis M. Candanedo, Véronique Feldheim, Dominique Deramaix
Whenever long-term monitoring of a building is attempted, specific sensors or the whole monitoring system are likely to experience prolonged failures, creating important gaps in one or more variables of special interest. Such long gaps may not be adequately addressed by simple linear interpolation. Using only the available data for descriptive statistics would produce results that are biased towards the season of measurement, while discarding the incomplete data would waste a significant amount of the time and effort invested in the research study. A workaround that reduces the bias problem is to predict the missing data from other measured variables using machine-learning techniques. Some questions that follow are: How much data is necessary to train a regression model? What is the expected error of such a prediction? What is the best model for such a task? This paper addresses the problem of completing a data set of interior temperatures in a passive house using different monitored predictors such as exterior temperature, humidity, wind speed, visibility, pressure and electrical energy use inside the building. Two regression models, multiple linear regression and random forest, are compared using learning curves for the training and testing sets to visualize the so-called bias-variance trade-off. The learning curves help to answer the questions of optimal training sample size, model selection and expected error. Finally, descriptive statistics such as the median, maximum, minimum and average room temperatures are presented before and after completing the data sets.
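As an illustration of what a learning curve looks like in practice, the following minimal sketch fits a model on progressively larger training subsets and reports the training and testing error for each size. It is not the authors' implementation: the data are synthetic, the model is a single-predictor least-squares line standing in for the multiple linear regression and random forest models of the paper, and every name and coefficient (`fit_line`, `rmse`, `learning_curve`, the 0.15 slope, the 0.5 °C noise level) is a hypothetical choice made for the example.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmse(model, xs, ys):
    """Root-mean-square error of the fitted line on the given points."""
    a, b = model
    squared = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared / len(xs))


def learning_curve(train, test, sizes):
    """Return (n, train RMSE, test RMSE) for each training size n."""
    test_x, test_y = zip(*test)
    rows = []
    for n in sizes:
        xs, ys = zip(*train[:n])          # first n training samples
        model = fit_line(xs, ys)
        rows.append((n, rmse(model, xs, ys), rmse(model, test_x, test_y)))
    return rows


# Synthetic stand-in data (assumption): interior temperature depends
# linearly on exterior temperature plus Gaussian noise of 0.5 degC.
rng = random.Random(0)
data = []
for _ in range(600):
    t_ext = rng.uniform(-5.0, 30.0)
    t_int = 20.0 + 0.15 * t_ext + rng.gauss(0.0, 0.5)
    data.append((t_ext, t_int))

train, test = data[:400], data[400:]
curve = learning_curve(train, test, [10, 25, 50, 100, 200, 400])

for n, train_rmse, test_rmse in curve:
    print(f"n={n:4d}  train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")
```

Reading the two error columns side by side is the point of the exercise: the training size beyond which the testing error stops improving suggests how much data is needed, and the level at which it settles gives an estimate of the expected prediction error. With real monitoring data, one such curve would be produced per candidate model so that the curves can be compared.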