Introduction
With the rapid development of data-driven computational algorithms and the growing trove of publicly available building attribute and energy data, there are now ample opportunities to apply machine learning techniques to the assessment of building energy performance. This research compares machine learning methods such as Artificial Neural Networks and Random Forests in terms of their predictive performance for lighting, heating, and cooling energy use. It also aims to identify the most influential variables and quantify their potential in building energy management. Combined with the abundance of reported data and advances in sensing technology, these methods could greatly enhance the accuracy and robustness of energy models.
Data
The microdata used in this study come from the 2012 Commercial Buildings Energy Consumption Survey (CBECS) of the U.S. Energy Information Administration (EIA). The dataset contains records for 6,720 commercial buildings in the United States and is designed to be a statistically representative sample. The input variables include building physical characteristics (such as location, area, and construction material), operational characteristics, and occupancy patterns. The output variables are the energy use intensities (EUI) of electricity and fuels for different building end uses (such as lighting).
Methods
CBECS data are first pre-processed to remove missing or extreme values and to eliminate correlation among predictors. A k-fold cross-validation method is used to compare the predictive performance of the different algorithms. The dataset is randomly divided into k sets of equal size. A statistical model is fitted using k-1 sets as the training set, and the remaining set is used as the testing set; this process repeats until each of the k sets has served as the testing set once. To present a thorough comparison, both a linear regression model and a Lasso (least absolute shrinkage and selection operator) regression model are constructed, followed by a multilayer perceptron (Artificial Neural Network) and a tree bagging method (Random Forest). The study also performs feature selection among all the input variables, based on both correlation and information gain, as well as by training models on different subsets of features and keeping the subset that minimizes the prediction error.
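The k-fold procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: it uses only the Python standard library, synthetic data in place of CBECS, and two stand-in models (a mean-only baseline and a one-variable least-squares line) instead of the Lasso, neural network, and Random Forest models. The function names `k_fold_indices`, `fit_mean`, `fit_ols`, and `cross_val_rmse` are hypothetical.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_ols(xs, ys):
    """One-variable ordinary least squares: y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def cross_val_rmse(fit, xs, ys, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, repeat for every fold."""
    sq_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        sq_errors += [(model(xs[i]) - ys[i]) ** 2 for i in test_idx]
    return mean(sq_errors) ** 0.5

# Synthetic data (an assumption, not CBECS): y depends linearly on x plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

rmse_mean = cross_val_rmse(fit_mean, xs, ys)
rmse_ols = cross_val_rmse(fit_ols, xs, ys)
print(f"baseline RMSE: {rmse_mean:.2f}, OLS RMSE: {rmse_ols:.2f}")
```

Because both models are scored on the same folds (same `seed`), the resulting error estimates are directly comparable, which is the property the comparison of algorithms relies on.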
Findings
The main objective of the study is to compare the predictive performance of multiple statistical machine learning methods on building energy survey data, and thereby to recommend which model should be adopted when constructing an energy estimation model. The best Random Forest model has a prediction rate of 60%, exceeding the adjusted R² results for multiple regression models run on the same set of explanatory variables. A second finding concerns the importance of different input variables: both individual and clustered feature importance are quantified and compared to shed light on the relationships between physical and operational parameters and the energy use intensity of commercial buildings. Further work will apply these algorithms to data from opportunistic lighting, temperature, and occupancy sensors in the City of Boston, USA, as well as to energy data reported to the City for all medium and large commercial buildings under the Building Energy Disclosure and Reporting Ordinance (BERDO).
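As a rough illustration of how individual input variables might be ranked, the sketch below scores each feature by its absolute Pearson correlation with the target, which is one of the filter criteria named in Methods. It is not the Random Forest importance measure reported here, and everything in it is assumed for illustration: the data are synthetic, and the feature names (`operating_hours`, `floor_area`, `year_built`) and the helpers `pearson` and `rank_features` are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def rank_features(features, target):
    """Return (name, |correlation|) pairs sorted from strongest to weakest."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Synthetic buildings (an assumption, not CBECS records).
rng = random.Random(0)
n = 300
features = {
    "operating_hours": [rng.uniform(40, 168) for _ in range(n)],
    "floor_area": [rng.uniform(1e3, 1e5) for _ in range(n)],
    "year_built": [rng.uniform(1900, 2012) for _ in range(n)],
}
# Synthetic EUI: driven mostly by hours, partly by area, not at all by year.
eui = [
    0.5 * h + 3e-4 * a + rng.gauss(0, 5)
    for h, a in zip(features["operating_hours"], features["floor_area"])
]

ranking = rank_features(features, eui)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A correlation filter of this kind only captures linear, one-variable-at-a-time relationships, which is one reason to compare it against model-based importance measures.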
• Open source data, big data, data mining and industrial ecology
• Infrastructure systems, the built environment, and smart and connected infrastructure