
## Table of Contents

Abstract

Acknowledgements

Table of Contents

List of Figures

List of Tables

Glossary

1 Introduction

2 Categorizing Energy Models

2.1 Building Type

2.2 Energy Type

2.3 Occupant Behavior

2.4 Prediction Model Type for Energy Consumption

3 Data-Driven Methods for the Prediction of Heating Energy Consumption

3.1 Common Characteristics of Data-Driven Methods

3.2 Data-Driven Methods in Academic Literature

3.2.1 Least Squares Regression

3.2.2 Artificial Neural Network

3.2.3 Genetic Algorithm

3.2.4 Support Vector Machines

3.2.5 Quantile Regression

3.3 Summary of Data-Driven Methods for Prediction of Heating Energy Consumption

4 An Introduction to Copula-Based Quantile Regression

4.1 Pair-Copula Construction

4.2 Bivariate Copula Families and Parameter Estimation

4.3 D-Vine Copula-Based Quantile Regression

4.4 Model Evaluation

5 Description of the Underlying Data Set

5.1 Pre-Processing of the Data

5.2 Data Understanding

6 Empirical Results

6.1 Performance of Point Estimation Methods

6.2 D-Vine Copula Fitting Results

7 Practical Implications of the Introduced Quantile Regression

7.1 Prediction of Quantiles for Post-Retrofit Energy Consumption

7.2 Quantile-Based Analysis of the Rebound Effect and Performance Gap

7.3 Rebound Effect Similarity Analysis

8 Conclusion

References

Appendix

## Abstract

Energetic retrofitting of residential buildings is poised to play an important role in the achievement of ambitious global climate targets. A prerequisite for purposeful policy-making and private investments is the accurate prediction of energy consumption. Building energy models are mostly based on engineering methods quantifying theoretical energy consumption. However, a performance gap between predicted and actual consumption has been identified in the literature. Data-driven methods using historical data can potentially overcome this issue. The D-vine copula-based quantile regression model used in this study achieved very good fitting results based on a representative data set comprising 25,000 German households. The findings suggest that quantile regression increases transparency by analyzing the entire distribution of heating energy consumption for individual building characteristics. More specifically, the analyses reveal the following exemplary insights. First, for different levels of energy efficiency, the rebound effect exhibits cyclical behavior and varies significantly across quantiles. Second, very energy-conscious and energy-wasteful households are prone to more extreme rebound effects. Third, with regard to the performance gap, heating energy demand of inefficient buildings is systematically underestimated, while it is overestimated for efficient buildings.

Keywords: Data-Driven Heating Energy Analysis; Energetic Retrofitting; Quantile Regression; D-Vine Copula; Rebound Effect; Performance Gap

## Acknowledgements

First and foremost, I want to thank Jannick Töppel and Timm Tränkler for giving me the opportunity to work on this interesting topic. Their excellent supervision was always of great help, and the thesis benefited from our fruitful and enjoyable discussions.

Grateful acknowledgement is also due to the Ministry of the Environment, Climate Protection and the Energy Sector Baden-Wuerttemberg for their support of the Trafo BW project "c. HANGE (BWT17004)", which made this thesis possible.

Lastly and most importantly, I want to express my deep gratitude to my parents. This journey would not have been possible without their unprecedented support throughout my life. They have been an indispensable source of energy and constituted the foundation of my personal development.

## List of Figures

Figure 1: Categorization of data-driven methods for predicting heating energy consumption

Figure 2: D-vine with four variables, three trees, and the six edges representing pair-copulas

Figure 3: Exemplary continuous convolution of building age class

Figure 4: Average monthly temperatures (°C) based on DWD (2018)

Figure 5: Boxplots of two exemplary retrofit levels for building age class 4 and 8

Figure 6: Non-monotonic relationship between specific heat load and building age class

Figure 7: First three trees of the final D-vine

Figure 8: Density of bivariate copula between specific heat load and building age class

Figure 9: Contour plot of bivariate copula between specific heat load and building age class

Figure 10: Consecutive retrofit path for building age class 4

Figure 11: Consecutive retrofit path for building age class 7

Figure 12: Quantile-based rebound effect between eight consecutive energetic retrofitting levels for exemplary building age class 4 (1958-1968)

Figure 13: Quantile-based performance gap of the eight consecutive energetic retrofitting levels for exemplary building age class 4 (1958-1968)

Figure 14: Contour plots of all bivariate copulas used to construct the six trees

## List of Tables

Table 1: Characterization of the presented building energy model

Table 2: Overview of the reviewed literature including performance measures

Table 3: Summary of D-vine copula characteristics for empirical analysis

Table 4: German building stock and coverage by the used data set

Table 5: Mapping of energetic retrofit variables to building components

Table 6: Two exemplary retrofit levels for building age class 4 and 8

Table 7: Fitting results of ANN and MLR

Table 8: Variables used for construction of first D-vine

Table 9: Variables used for the construction of the final D-vine

Table 10: Average coverage error for exemplary quantiles

Table 11: Characteristics of the nine energetic retrofit levels based on building components

Table 12: Weighted Euclidean distance matrix for rebound effect similarity analysis

Table 13: Overview of the variables and their characteristics

## Glossary

[Figure not included in this excerpt]

## 1 Introduction

The residential building sector is one of the biggest energy consumers worldwide (IEA, 2017). Therein, one- and two-family houses are of particular interest, as they account for 60% of the total heating-related CO2 emissions of the residential building sector in the European Union (Petersdorff et al., 2006). However, while research and policy-makers have focused on the energy efficiency potential of residential buildings, the resulting investment volume in retrofit measures is rather disappointing. The lack of energy efficiency gains can be attributed to the so-called performance gap, a state in which the realized energy savings fall short of the predicted energy savings (Ahn et al., 2017; Cali et al., 2016). It is not to be confused with the energy efficiency gap, which refers to a slower-than-optimal diffusion of seemingly cost-effective energy efficiency technology (Jaffe and Stavins, 1994). In the academic literature, the existence of the energy efficiency gap is mostly explained by diverse market barriers and the concepts of behavioral economics (Brown, 2001; Weber, 1997). The performance gap, on the other hand, is unrelated to the decision-making of economic agents. It stems in large part from the application of deficient prediction methods and the use of often inappropriate data (de Wilde, 2014).

One reason for the discrepancy between realized and predicted energy savings is that data-driven building energy models are mostly fitted on synthetic data generated by engineering simulation software. Over- and underestimation of synthetic energy consumption thereby occurs frequently, as the simulation software neglects the variance in actual energy consumption (Ahmad and Culp, 2006). This means that the input data used for training data-driven prediction models is itself biased. Consequently, even the most accurate prediction could not reflect actual energy consumption. A potential remedy for this bias is the strict usage of historical data, which inherently contains variance in energy consumption, such as that caused by occupant behavior. At the same time, however, the prediction method must be capable of explaining the variance by capturing the dependence structure between heating consumption and its drivers. The rebound effect is a well-researched example of a phenomenon inducing this variance. It describes an increase in the demand for energy services by the occupants after the implementation of an energy efficiency measure (Greening et al., 2000). Thus, efficiency gains can be partly used up, fully used up, or even exceeded by the increasing demand. As a consequence, from an investment perspective, payback periods and underlying risks are incorrectly assessed, despite their critical nature for private and commercial decision-making (Mills et al., 2006). Such uncertainties hinder investments to a great extent. Further, national and international energy policy targets, like those formalized in the Paris Agreement of 2015, may not be met. Hence, the promised benefits resulting from a reduction of greenhouse gas emissions are unrealistic and deprive environmental policy of its credibility (Popescu and Ungureanu, 2013).

Appropriate data-driven approaches based on historical energy consumption and detailed building information can potentially overcome these issues (Amasyali and El-Gohary, 2018; Mathew et al., 2005). For instance, Mathew et al. (2005) propose a data-based actuarial approach to price energy efficiency projects. Such comprehensive data is still scarce, but is likely to become more common with the emerging trend of smart meter installation across residential buildings (Granderson et al., 2016). This, in particular, calls for suitable prediction methods to be in place. The rise of advanced statistical and machine learning methods, paired with increasing data availability, offers plentiful opportunities for the analysis and prediction of building energy consumption (Hong et al., 2014; Zhou and Yang, 2016). Yet, commonly applied point estimation methods, such as artificial neural networks, exhibit major drawbacks whenever the variance in the data is high (Geman et al., 1992). Quantile regression approaches are capable of meeting this challenge by estimating arbitrary intervals instead of single values. Recently, vine copula-based quantile regression has shown good results in various disciplines such as hydrology (Favre et al., 2004), engineering (Ren et al., 2015), and finance (Hu, 2006), but has so far not been applied to analyze residential building energy consumption. The methodology makes no strict assumptions about the shape of conditional quantiles. Instead, it offers great flexibility by separating marginal and dependence modeling (Kraus and Czado, 2017).

As literature based on historical data is scarce, a novel and representative real-world data set comprising 25,000 one- and two-family houses in Germany, along with various technical and household variables, is introduced. Subsequently, a D-vine copula method to predict conditional quantiles of heating energy consumption after retrofitting is presented and implemented. To the best of our knowledge, we are the first to apply this novel method to residential heating energy consumption and among the few that use historical data. In comparison to previous works, D-vine copula quantile regression provides full information on the distribution of heating energy consumption conditioned on a set of building parameters, such as building age, different building envelope insulations, windows, and heating systems. As consumption sensitivities vary across quantiles, this approach appears to be suitable for the analysis of the rebound effect and performance gap at different parts of the distribution. This is an important contribution to the literature, as it allows decision-makers to better comprehend the efficacy of different energetic retrofit measures depending on household and building properties.

The remainder of this thesis is organized as follows. Section 2 presents a concise categorization of building energy models. Section 3 presents existing data-driven methods used for the prediction of heating energy consumption in the residential sector. Next, Section 4 elaborates on vine copula-based quantile regression. This is followed by a description of the data employed in Section 5. Section 6 presents the empirical results and Section 7 provides the practical implications and contribution of the quantile regression approach introduced. Finally, the conclusions and limitations of this thesis are discussed in Section 8.

## 2 Categorizing Energy Models

The overarching objective of building energy models is the mapping of energy consumption onto a building’s physical, operational and occupant properties. The multitude of building energy models can be attributed to the interdisciplinary nature of the domain. It includes not only engineering and architectural sciences, but also disciplines like business sciences and statistics. This results in a heterogeneous domain with diverse solution approaches, tools and objectives (Swan and Ugursal, 2009). For instance, an energy engineer seeks to obtain a building’s energy consumption profile by reproducing its actual thermal behavior. In contrast, a business economist’s approach would disregard any detailed thermal dynamics and rather focus on serving the investor’s perspective by determining the payback periods of competing retrofit measures. However, regardless of the objective at hand, an accurate mapping is critical, as it allows for uncertainty reduction and enables the prediction of future consumption levels. To provide a better understanding of the distinguishing features of building energy models and to justify the choice of reviewed approaches, this section outlines their diversity, basic concepts and objectives. We categorize building energy models according to the following features: 1) building type, 2) energy type, 3) occupant behavior and 4) prediction model type.

### 2.1 Building Type

A seemingly obvious way to differentiate building energy models is their suitability for either commercial, residential, education and research, or other buildings (Wang and Srinivasan, 2017). As opposed to residential buildings, commercial and educational facilities often have sophisticated heating, ventilation and air conditioning (HVAC) systems, where thermostatic control is generally a “set and forget” setting. However, this distinction is not always stressed in the academic literature. This leads to the assumption that some models are suitable for any class, despite differing occupancy patterns during the day and the week. Further, energy supply and type may differ depending on building size and use case.

### 2.2 Energy Type

While the majority of the literature focuses on electricity consumption (Aydinalp et al., 2002; Jain et al., 2014), other works focus on heating (Catalina et al., 2008, 2013; Kissock et al., 2003) or cooling (Yokoyama et al., 2009). When analyzing energetic retrofitting, a special focus lies on heating or cooling demand, depending on the geographic area. As mentioned before, this work concerns itself with the assessment of energetic retrofitting for the residential sector in Germany, where heating energy consumption is prevalent. As a consequence, methods not specifically intended for this purpose are excluded from the following review. Yet, models which were designed to forecast other energy types may be applicable to heating energy consumption through adaptations. There are often interactions between energy types, such as those between space heating and hot water production, which hamper load allocation (Aydinalp et al., 2004). For the sake of simplicity, we do not separate energy for space heating from other loads, like hot water production and cooking.

### 2.3 Occupant Behavior

In addition to building characteristics, occupant behavior is a major driver of energy consumption through different occupancy patterns, ventilation rates, and thermostat set-points (Gaetani et al., 2018). For that reason, numerous studies specifically investigate the effect of human interactions on energy consumption (de Meester et al., 2013; Guerra Santin et al., 2009; Haas et al., 1998; Sorgato et al., 2016; Virote and Neves-Silva, 2012; Yu et al., 2011). For example, Guerra Santin et al. (2009) consider a Dutch residential building stock and show that occupant characteristics and behavior significantly affect energy use. Conversely, other studies examine the building in isolation to comprehend the effect of other factors such as weather variables (Hemsath and Alagheband Bandhosseini, 2015; Kissock et al., 2003; Sun et al., 2016). Having said that, the suitability of a particular energetic retrofit measure is always determined by the interplay between two components: the building’s characteristics and its occupants. Ioannou and Itard (2015) found that decreasing the importance of one component increases the importance of the other, and vice versa. However, if information on occupants’ behavior is missing, or unknown as in the case of new buildings, it can only be estimated based on reference buildings and experience. This leaves considerable room for uncertainty. Thus, occupant behavior can contribute significantly to the above-mentioned uncertainty about future energy consumption (Hong et al., 2016). For that reason, sound building energy modelling for retrofit purposes should take a holistic view of the building and include the occupant as a major driver.

The difference between predicted energy performance and actual metered energy use is discussed in the literature as the performance gap for domestic buildings (Galvin, 2014) and non-domestic buildings (Menezes et al., 2012). A major driver of this performance gap is occupant behavior, which includes occupant presence (Ahn et al., 2017) and the rebound effect (Galvin, 2014). Three different types of rebound effects are generally distinguished in the literature: (1) the direct rebound effect, an increase in the demand for an energy service due to reduced costs resulting from energy efficiency measures, (2) the indirect rebound effect, an increase in the demand for other services and goods resulting from the energy efficiency improvements, and (3) economy-wide rebound effects, which arise from the rebalancing of supply and demand of goods. Analogous to Galvin (2015), the analyses of data-driven building energy models in Section 7.2 focus on the direct rebound effect.

### 2.4 Prediction Model Type for Energy Consumption

The prediction model type of building energy models can be distinguished based on their reliance on physical laws or data (Coakley et al., 2014; Foucquier et al., 2013; Fumo, 2014)^{1}. Data-driven approaches predict future energy consumption based on either historical or synthetic data. While historical data reflects past real-world energy consumption, synthetic data is generated, for example, by engineering software tools. The advantage of historical data is that it implicitly contains individual characteristics, such as occupant behavior, exposure to solar radiation and wind, and building component degradation. Having said that, the unavailability of historical energy consumption forces one to rely on material and physical characteristics, such as insulation thickness and thermal transmittance values. Law-driven models use such variables as inputs for thermal calculations to simulate synthetic energy consumption data for buildings. These models are also referred to as engineering or white-box models, because they use physical equations with a clear dependence structure between inputs and outputs (Coakley et al., 2014; Foucquier et al., 2013; Li and Wen, 2014). Yet, in the case of existing buildings, material characteristics are often unknown or have changed due to degradation over time. Accordingly, law-driven models are the generic approach for new constructions, while data-driven methods can only be used for existing buildings, or for fictional buildings for which data was simulated (Wang et al., 2012).

However, both approaches have distinctive disadvantages. Law-driven models tailor the simulation to an individual building and, thus, require thorough calibration for accurate results. This is generally a time-consuming procedure and requires a high level of expertise on the part of the modeler (Coakley et al., 2014; Zhao and Magoules, 2012). The disadvantages of data-driven models are twofold. First, they require a sufficiently large training data set to provide robust results (Li et al., 2014). Second, data analysis methods are not all-purpose tools and need to be selected carefully based on the underlying data set and use case. For example, non-linear dependencies, homoscedasticity, and multicollinearity must be considered during the model selection stage. Thus, in the following we outline the properties of data-driven modelling approaches and subsequently introduce and compare existing data-driven methods for the residential heating energy field.

## 3 Data-Driven Methods for the Prediction of Heating Energy Consumption

In this section, we first shed light on some common characteristics that most data-driven methods share. This is followed by a review of data-driven methods by other authors, which were applied in the residential heating energy domain. We then summarize the findings and methods found in academic literature and point out several interesting insights.

### 3.1 Common Characteristics of Data-Driven Methods

Comprehensive data is needed when it comes to choosing between retrofit options to suit the individual building. This holds especially true when buildings are partially heated due to individual occupancy patterns. For instance, the installation of wall insulation in unheated rooms is likely to be uneconomic, which has to be taken into account during the decision-making process. The emerging trend of smart meter and device installation is therefore helpful, as it facilitates data collection and contributes to the availability of detailed information on energy consumption (Granderson et al., 2016). This growing availability of data, coupled with widely accessible computation power, is also one explanation for the rising interest in data-driven methods, considering that these methods stand and fall with data quantity (Wang and Srinivasan, 2017). Consequently, large data sets, like the one we will introduce in Section 5, have recently attracted a lot of attention, while calling for adequate methods to be applied. However, despite new opportunities, data-driven methods must overcome many obstacles, as clear patterns fade and dependence structures become more complex with data set size.

A common concept of data-driven methods is the assessment of retrofit measures based on the estimated efficiency gains they can provide. These can be defined as the difference between heating energy consumption with the retrofit measure implemented and in its absence (Heo and Zavala, 2012). This leads to a distinctive problem of the field, as the pre-retrofit energy consumption becomes unobservable once the retrofit measure is implemented. Therefore, it is necessary to estimate the pre-retrofit energy consumption for the post-retrofit period (Heo and Zavala, 2012). With a proper pre-retrofit model at hand, independent variables that impact energy consumption are altered in order to express the physical properties of the retrofit measure. Re-running the prediction model with the altered variables provides the predicted post-retrofit energy consumption. By subtracting the predicted post-retrofit energy consumption from the predicted pre-retrofit consumption, one obtains the estimated savings. For representative results, it is essential to compare both scenarios based on the same remaining explanatory variables, like weather conditions, occupancy parameters and other physical properties that are not subject to change (Heo et al., 2012; Heo and Zavala, 2012).
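The savings estimation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any of the cited studies: `toy_consumption_model`, its coefficients, and the feature names `wall_u_value` and `heating_degree_days` are hypothetical stand-ins for a fitted pre-retrofit model and its inputs.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def toy_consumption_model(x: Features) -> float:
    """Hypothetical stand-in for a fitted pre-retrofit model (kWh/m² per year).

    The coefficients are invented for illustration only.
    """
    return 20.0 + 50.0 * x["wall_u_value"] + 0.03 * x["heating_degree_days"]


def estimated_savings(
    model: Callable[[Features], float],
    building: Features,
    retrofit_changes: Features,
) -> float:
    """Predicted pre-retrofit minus predicted post-retrofit consumption.

    Only the variables named in `retrofit_changes` are altered; all other
    explanatory variables (weather, occupancy, ...) are held fixed.
    """
    unknown = set(retrofit_changes) - set(building)
    if unknown:
        raise KeyError(f"retrofit alters unknown variables: {sorted(unknown)}")
    post_retrofit = {**building, **retrofit_changes}
    return model(building) - model(post_retrofit)


building = {"wall_u_value": 1.4, "heating_degree_days": 3000.0}
# Hypothetical wall insulation lowering the U-value from 1.4 to 0.3 W/(m²K).
savings = estimated_savings(toy_consumption_model, building, {"wall_u_value": 0.3})
print(f"estimated savings: {savings:.1f} kWh/m² per year")
```

Passing the model as a callable keeps the sketch independent of the prediction method; any of the methods reviewed in Section 3.2 could, in principle, take the place of the toy function.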

### 3.2 Data-Driven Methods in Academic Literature

Based on the categorization of building energy models introduced above, in the following we review literature that focuses on the specifications depicted in Table 1 and exclude hybrid and law-driven methods.

Table 1: Characterization of the presented building energy model.

[Figure not included in this excerpt]

For reviews on data-driven methods for the entire spectrum of building energy models, we refer to Amasyali and El-Gohary (2018) as well as Wang and Srinivasan (2017). More general reviews also include law-driven and hybrid models (Foucquier et al., 2013; Fumo, 2014; Swan and Ugursal, 2009; Zhao and Magoules, 2012).

[Figure not included in this excerpt]

Figure 1: Categorization of data-driven methods for predicting heating energy consumption.

As depicted in Figure 1, a distinction is made between regression methods and machine learning methods, following ASHRAE (2009). First, the focus is on common point estimation methods that calculate a single value for the predicted energy consumption. Next, interval estimation is introduced, which determines an interval expected to contain the true value with a given probability or confidence (Neyman, 1937). Interval estimation particularly overcomes the disadvantages of point estimation with regard to heteroscedasticity. For example, Sardianou (2008) showed that the impact of input variables on heating consumption varies across quantiles. As such, White’s heteroscedasticity adjustment was applied to correct the estimated standard errors for linear regression. Thus, Section 3.2.5 will elaborate on quantile regression as a common interval estimation method.

#### 3.2.1 Least Squares Regression

Least squares regression is the category most commonly associated with regression and can capture linear or nonlinear relationships. Based on our research, linear approaches are prevalent in the residential heating energy domain, which is why only these are discussed. Briefly, regression approaches estimate the relationship between a dependent variable and one predictor (simple linear regression, SLR) or multiple predictors (multiple linear regression, MLR). Once the relationship is estimated as a function using the least squares algorithm, it is straightforward to examine changes in the dependent variable when altering one of the predictors. By plugging in new values of the independent variables, the regression function can also be used for prediction purposes.
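As a minimal sketch of the fit-then-predict workflow described above, the following uses NumPy's least squares solver on synthetic toy data. The data-generating coefficients, the two predictors (loosely thought of as an envelope U-value and heating degree days) and all function names are assumptions made for illustration; they are not taken from the reviewed studies.

```python
import numpy as np


def fit_mlr(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least squares fit of y = b0 + X @ b; returns [b0, b1, ..., bp]."""
    design = np.column_stack([np.ones(X.shape[0]), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def predict_mlr(beta: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Plug (new) predictor values into the fitted regression function."""
    return beta[0] + X @ beta[1:]


def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Coefficient of determination: share of variance explained by the fit."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic toy data (invented): two predictors plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.uniform(low=[0.2, 2500.0], high=[1.8, 4000.0], size=(200, 2))
y = 20.0 + 50.0 * X[:, 0] + 0.03 * X[:, 1] + rng.normal(0.0, 5.0, size=200)

beta = fit_mlr(X, y)
r2 = r_squared(y, predict_mlr(beta, X))

# Prediction for a new, hypothetical building.
x_new = np.array([[0.3, 3000.0]])
y_new = predict_mlr(beta, x_new)
print(beta, r2, y_new)
```

Note that an R² computed on the training data, as here, says little about out-of-sample accuracy; a held-out test set would be needed for that.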

Douthitt (1989) uses an MLR model, wherein space heating fuel demand is regressed against a set of climatic, economic, occupant and building parameters for the residential sector in Canada. More specifically, the independent variables include average daytime internal design temperature, present and historic fuel price, substitute fuel price, historic fuel consumption, internal heat gains, number of adults and children, overall heating unit efficiency, number of floors, the total area of roof, walls, windows and doors, and the thermal resistance thereof. For a small data set, the model explains between 37 and 79% of the variability (R²), depending on the energy source. Kissock et al. (2003) develop a three-parameter change point model based on MLR to estimate savings from retrofit measures in residences and small commercial buildings. They assume that outdoor temperature is the only predictor variable for heating energy use, which increases linearly if the outdoor temperature drops below a threshold. Therefore, the authors incorporate the heating degree day (HDD) approach, while energy use for hot water and cooking is accounted for by a year-round constant. For large commercial buildings and those with complex HVAC systems, the model is extended to a four- and five-parameter model, respectively. A two-stage grid search algorithm is presented to find the best-fit change point(s), which excels in accuracy, simplicity and robustness when compared to competing algorithms. Catalina et al. (2008) use an MLR model for single-family homes and regress the monthly heating demand against a set of independent variables, such as building shape factor, envelope U-value, window-to-floor area ratio, building time constant and climate. The authors achieve an R² of 99.77%. Catalina et al. (2013) simplify their MLR model by aggregating the independent variables.
By using the building global heat loss coefficient, south equivalent surface, and the difference between the indoor set point temperature and sol-air temperature, they show that the model is still capable of predicting heating energy demand with an R² of 97.44%. The results show that both models can accurately predict heating demand for the residential sector, even for a set of differing input weather files used to validate applicability. However, their good prediction results are based not on historical real-world data, but on simulated energy consumption. This synthetic data does not represent real-world variance, but rather reflects the physical equations of the applied simulation program.
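To make the idea of a three-parameter heating change point model more concrete, the sketch below fits a base load, a slope and a change point temperature, where energy use rises linearly once the outdoor temperature drops below the change point. It is an illustrative reading of the approach, not a reproduction of the algorithm of Kissock et al. (2003): the coarse-then-refined grid, the function names and the noise-free toy data are all assumptions.

```python
import numpy as np


def fit_given_change_point(temp, energy, t_cp):
    """OLS fit of energy = b0 + b1 * max(t_cp - temp, 0) for a fixed t_cp."""
    heating_term = np.maximum(t_cp - temp, 0.0)
    design = np.column_stack([np.ones_like(heating_term), heating_term])
    beta, *_ = np.linalg.lstsq(design, energy, rcond=None)
    sse = float(np.sum((energy - design @ beta) ** 2))
    return beta, sse


def fit_three_parameter_heating(temp, energy, n_grid=11):
    """Two-stage grid search over the change point (coarse, then refined)."""
    lo, hi = float(np.min(temp)), float(np.max(temp))
    best = None
    for _stage in range(2):
        grid = np.linspace(lo, hi, n_grid)
        fits = [(t_cp, *fit_given_change_point(temp, energy, t_cp)) for t_cp in grid]
        t_best, beta, sse = min(fits, key=lambda fit: fit[2])
        best = {
            "base_load": float(beta[0]),      # year-round constant
            "slope": float(beta[1]),          # energy per degree below t_cp
            "change_point": float(t_best),
            "sse": sse,
        }
        step = grid[1] - grid[0]
        lo, hi = t_best - step, t_best + step  # refine around the coarse optimum
    return best


# Noise-free toy data (invented): base load 10, slope 3, change point 16 °C.
temp = np.linspace(-5.0, 25.0, 61)
energy = 10.0 + 3.0 * np.maximum(16.0 - temp, 0.0)
result = fit_three_parameter_heating(temp, energy)
print(result)
```

For a fixed change point the model is linear in the remaining two parameters, which is why an ordinary least squares fit can be nested inside the grid search.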

The ease of implementation along with the general popularity of linear regression models is the main explanation for their wide use in the building energy field. However, MLR has two concomitant disadvantages. First, it cannot handle nonlinear relationships. Second, in the presence of multicollinearity, i.e. strong correlation between independent variables, the regression coefficients become unstable and lack causal interpretability, and significant predictor variables may even be excluded (Farrar and Glauber, 1967; Graham, 2003; Kraha et al., 2012).
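One widely used diagnostic for multicollinearity, which the reviewed studies are not claimed to have applied, is the variance inflation factor (VIF): each predictor is regressed on all others, and VIF = 1 / (1 - R²) of that auxiliary regression. The sketch below is a minimal NumPy version on invented data; the variable names `wall_area`, `window_area` and `outdoor_temp` are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF per column: 1 / (1 - R²) from regressing it on the other columns."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r2 = 1.0 - np.sum(residual ** 2) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Invented toy data: window area is constructed to track wall area closely,
# while outdoor temperature is drawn independently of both.
rng = np.random.default_rng(0)
wall_area = rng.normal(150.0, 30.0, size=500)
window_area = 0.2 * wall_area + rng.normal(0.0, 1.0, size=500)
outdoor_temp = rng.normal(9.0, 2.0, size=500)

vifs = variance_inflation_factors(
    np.column_stack([wall_area, window_area, outdoor_temp])
)
print(dict(zip(["wall_area", "window_area", "outdoor_temp"], vifs.round(2))))
```

A common rule of thumb treats values above roughly 10 as a sign of problematic collinearity, although the threshold is a convention rather than a strict criterion.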

#### 3.2.2 Artificial Neural Network

The first subcategory of machine learning methods is the artificial neural network (ANN), which is generally well suited to handling nonlinear relationships (Lek et al., 1996; Somers and Casal, 2009). In the vein of information transmission, the underlying concept is inspired by biological neural networks composed of a large number of interconnected processing elements (neurons) tied together with weighted connections that are analogous to synapses (activation functions). Briefly, an ANN consists of three types of layers: input, hidden and output layers. Each layer contains a fixed number of interconnected neurons, which in turn use their activation function to process all incoming information. The incoming informational content must be high enough to pass a predefined threshold in order for the neuron to forward the processed information to the next neuron, hence the name activation function. Input and output neurons represent the predictor and dependent variables, respectively, whereas the neurons of the hidden layers do not represent any variable, but rather process incoming information and pass it to connected neurons within the network. The training process of an ANN seeks to minimize an error metric measured at the output neuron by identifying the optimal number of layers, neurons and connection weights (Bishop, 1995; Franklin, 2005; Haykin, 1998). According to the literature, the advantages of ANNs are their speed of computation, the ability to capture nonlinear dependence patterns and to gradually improve with additional data. Moreover, the training process does not require specialist knowledge in the building energy field and thus promotes ease of use. However, a substantial disadvantage is the lack of interpretability of weights due to the parallel nature of ANNs (Zhang et al., 1998).
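The layer structure and training process outlined above can be illustrated with a small feedforward network written directly in NumPy. This is a sketch under several assumptions that are not taken from the reviewed studies: a single hidden layer with eight neurons, a smooth `tanh` activation instead of a hard threshold, mean squared error as the error metric, plain gradient descent on the weights only (the layer sizes are fixed by hand), and invented toy data.

```python
import numpy as np


def init_params(rng, n_in, n_hidden):
    """Random connection weights and zero biases for one hidden layer."""
    return {
        "W1": rng.normal(0.0, 0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }


def forward(params, X):
    """Input layer -> hidden layer (tanh activation) -> single output neuron."""
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    output = hidden @ params["W2"] + params["b2"]
    return hidden, output[:, 0]


def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))


def train_step(params, X, y, lr):
    """One gradient descent step on the MSE via backpropagation."""
    n = len(y)
    hidden, y_hat = forward(params, X)
    d_out = (2.0 / n) * (y_hat - y)[:, None]                   # dL/d(output)
    d_hidden = (d_out @ params["W2"].T) * (1.0 - hidden ** 2)  # tanh derivative
    grads = {
        "W2": hidden.T @ d_out,
        "b2": d_out.sum(axis=0),
        "W1": X.T @ d_hidden,
        "b1": d_hidden.sum(axis=0),
    }
    for name, grad in grads.items():
        params[name] -= lr * grad


# Invented nonlinear toy problem with two standardized inputs.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

params = init_params(rng, n_in=2, n_hidden=8)
initial_loss = mse(y, forward(params, X)[1])
for _ in range(1000):
    train_step(params, X, y, lr=0.02)
final_loss = mse(y, forward(params, X)[1])
print(f"MSE before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The fitted weight matrices `W1` and `W2` have no direct physical meaning, which illustrates the interpretability issue raised above.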

ANNs are a popular machine learning technique in the heating energy field (Chou and Bui, 2014). Kalogirou et al. (1997) trained an ANN based on synthetic data to predict the required heating load for heterogeneous spaces with only scarce information on inputs. Instead of specific physical properties, they use the areas of windows, walls, partitions and floors, and the design room temperature as continuous inputs, along with the type of windows, walls and roof as categorical inputs. They achieved a prediction accuracy (R²) of around 90%. Aydinalp et al. (2004) continue their work on ANN applications for building energy modeling in the Canadian residential sector. They develop two networks, one for space heating and one for domestic hot-water heating energy consumption. Both ANNs are trained on an individual and comprehensive set of inputs. For the space heating network, 28 input neurons are used, encompassing dwelling characteristics, system and equipment properties, indoor and outdoor temperatures and socio-economic characteristics of the households. For the hot-water network, 18 input neurons are used, which can be categorized as heating system and equipment properties, consumption patterns, weather effects and socio-economic characteristics of the households. The hidden layer structure is also more complex than in the preceding models, as the optimal number for the space heating and hot-water heating networks is two and 10, respectively. For model selection, the paper relies on the fraction of variance (Anstett and Kreider, 1993), which was falsely introduced as R², leading to confusion with the coefficient of determination (Nagelkerke, 1991) and thereby suggesting better-than-actual results. The authors found that the ANN’s ability to estimate savings strongly depends on the representation of the respective household in the training set.
Ekici and Aksoy (2009) use a backpropagation ANN with an ordinary three-layer design, where building transparency ratio, orientation and insulation thickness represent the three neurons of the input layer. They emphasize the model’s ability to disregard weather and thermo-physical properties of the materials without sacrificing prediction accuracy. The authors achieve a prediction accuracy of 94.8 to 98.5% (1 - relative error) for three different buildings, but use synthetic data based on the explicit finite difference approach of transient state one-dimensional heat conduction. Cheng-wen and Jian (2010) use a backpropagation ANN with 18 building envelope performance parameters, along with heating degree days (HDD) and cooling degree days (CDD), assigned to the 20 input neurons. They predict heating and cooling energy as the two output neurons for a large residential complex and achieve a prediction rate of 96% (1 - mean absolute error). The authors stress the model’s applicability to different climate zones as a result of including weather parameters in the form of HDD and CDD. However, again only synthetic data is used for training and evaluating the model. Popescu and Ungureanu (2013) apply cluster analysis in combination with a feedforward ANN on historical billing data to predict heating energy consumption for residential apartments in Romania. In a first step, apartments are classified according to the grouping trends of their inhabitants’ consumption to obtain homogeneous groups, for which ANNs are capable of delivering more accurate results. Two competing ANNs are set up. The first network uses monthly average outdoor temperature, heat transfer rate through the envelope and wall next to staircase, calculated heat flow rate due to infiltration/natural ventilation, and total heat gains for the five input neurons.
The second network’s eight input neurons, in contrast, represent monthly average outdoor temperature, heat transfer rate through the envelope and wall next to staircase, heat flow rate due to infiltration/natural ventilation, solar gain through transparent elements, internal gains, income level and occupants per room. Both ANNs have a single-hidden-layer design comprising seven hidden neurons. The authors found that the second model delivers better prediction results, with a correlation coefficient of R = 0.83. They attribute its superiority to the inclusion of both technical and occupant parameters, as opposed to only technical parameters in the first model. Naji et al. (2016) apply the Extreme Learning Machine (ELM) method, which is a single-layer feedforward neural network, to simulate heating and cooling energy use for a two-storey residential building. The simulation is carried out 180 times, each run representing a unique set of insulation K-values and insulation thickness values. The prediction results are compared with those obtained from ANN and Genetic Programming (GP). For all scenarios under consideration, the prediction accuracy in terms of R² was well above 95%; furthermore, the findings indicate that prediction accuracy can be enhanced by using the ELM approach.

The main performance measures of all reviewed studies are remarkable. However, most of these ANN approaches used synthetic data obtained from building simulation programs, which is a plausible explanation for the outstanding results. Real-world variation in energy consumption due to changing weather, building component degradation, equipment performance and occupant behavior is not inherently present in synthetic data. Thus, we question the meaningfulness of a high value of a performance metric such as R² if the simulated energy consumption is far off the true consumption. Moreover, in this case, the model itself is trained on flawed data, and even perfect prediction accuracy would mislead decision-makers. Therefore, using true energy consumption as input for the model is crucial to achieve results that reliably represent reality. To conclude, current literature on ANNs for building energy modeling is mainly based on synthetic data retrieved from engineering software. The results imply that ANNs are capable of replicating the underlying engineering models.

#### 3.2.3 Genetic Algorithm

Genetic algorithms (GA), as the second subcategory of machine learning methods, are a stochastic optimization method inspired by the theory of natural selection. In an iterative optimization process, possible solutions are generated randomly and evaluated with an underlying objective function. Based on the results, the most promising solutions are selected and stochastically adjusted via crossover and mutation, leading to new solutions for the next iteration. One of the major disadvantages of genetic algorithms is that they are very slow, especially in the case of many input variables.
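The iterative loop of random generation, evaluation, selection, crossover and mutation can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any reviewed study: the function `genetic_minimize`, its parameter values, and the two-dimensional toy objective are hypothetical and chosen only for demonstration.

```python
import random


def genetic_minimize(objective, bounds, pop_size=30, generations=60,
                     mutation_rate=0.2, seed=0):
    """Minimize `objective` over box constraints with a simple genetic algorithm."""
    rng = random.Random(seed)
    # Step 1: generate the initial candidate solutions randomly.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: evaluate and keep the most promising half (selection).
        population.sort(key=objective)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Step 3: uniform crossover, each gene comes from one parent.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Step 4: mutation, a random perturbation clipped to the bounds.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            children.append(child)
        # Parents survive unchanged, so the best solution never gets worse.
        population = parents + children
    return min(population, key=objective)


def objective(x):
    """Hypothetical toy cost surface with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


bounds = [(-5.0, 5.0), (-5.0, 5.0)]
best = genetic_minimize(objective, bounds)
```

Every generation requires `pop_size` evaluations of the objective function. When each evaluation is a full building simulation, this cost explains why genetic algorithms are considered slow.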

To the best of our knowledge, only a few applications of GA address residential heating, even though it appears to be common in the non-residential field, for example for the optimal design of building energy systems (Ooka and Komamura, 2009), the development of energy input estimation equations for the residential-commercial sector (Ozturk et al., 2004), and electricity energy estimation (Ozturk et al., 2005). Aras (2008) designs a GA to estimate the input parameters of a multiple nonlinear regression (MNLR), wherein the short-term residential natural gas demand is predicted as a function of time, HDD, and consumer price index. It takes the algorithm 810 iterations to obtain the optimal set of input variables, for which the RMSE of MNLR is smallest. However, the author fails to provide explicit numbers on this performance metric. Ferdyn-Grygierek and Grygierek (2017) aim at optimizing heating life cycle costs (LCC) of one-family houses as a function of four types of windows, their size, building orientation, insulation of outer wall, roof and ground floor, and infiltration. GAs are used for selecting optimal design parameters that lead to the largest energy reduction in the building performance simulation program EnergyPlus. For the seven analyzed cases, heating energy consumption can be reduced substantially by coupling GA with a building performance simulation program like EnergyPlus. The authors derive how design parameters should be chosen to achieve the largest reduction.

#### 3.2.4 Support Vector Machines

The third subcategory of machine learning methods are support vector machines (SVM), which are effective models in the sense that they only require small training sets to solve nonlinear problems through supervised learning algorithms. Fields of application are classification and regression problems, where they are frequently referred to as support vector classification and support vector regression (Burges, 1998; Cristianini and Shawe-Taylor, 2000). Common disadvantages of SVMs are the lack of transparency of results and their often reported fitting complexity (Moulin et al., 2004).

Only a few studies apply SVM in the residential heating sector, while other energy types are well represented (Foucquier et al., 2013; Wang and Srinivasan, 2017; Zhao and Magoules, 2012). Literature suggests that for district heating systems SVMs are a common approach for operation optimization (Al-Shammari et al., 2016; Protic et al., 2015) and design (Izadyar et al., 2015). Chou and Bui (2014) compare numerous machine learning methods as individual models and in combination (ensemble models) to simulate heating and cooling based on synthetic data. They examine the performance of support vector regression (SVR), ANN, classification and regression tree (CART), chi-squared automatic interaction detector, general linear regression, and ensemble inference model. The latter ranks the aforementioned models and combines the best performing ones into an ensemble model. The findings indicate that SVR outperforms the other models in heating energy prediction, while an ensemble model of ANN and SVR achieves the best results for cooling energy prediction.

#### 3.2.5 Quantile Regression

Koenker and Bassett (1978) introduced quantile regression (QR) as a method to estimate conditional quantiles. The method enables the calculation of arbitrary quantiles, such as percentiles and the median, which can be used for an in-depth relationship analysis. Quantile regression also offers the following advantages over the least squares regression introduced above:

(1) It describes an entire conditional distribution based on realized observations and not only the conditional mean. Knowing the entire distribution provides significantly more comprehensive information as it captures the overall variation, heavy tails, skewness, and kurtosis, and enables the calculation of confidence intervals.

(2) It allows for varying slopes at different quantiles, and hence accounts for heteroscedasticity.

(3) It is robust against outliers.

(4) It enables the calculation of risk measures (such as VaR or CVaR) that represent changes in conditional quantiles for a fixed set of predictor variables (Koenker and Bassett, 1978; Koenker and Hallock, 2001). For instance, the 10th percentile of the dependent variable may be positively influenced by increasing a predictor variable, while the 90th percentile is negatively influenced by the exact same increase. In such cases, a least squares regression approach could result in a coefficient close to zero for that predictor. This is because only the expected mean value is considered and therefore, potentially differing effects along the distribution are implicitly ignored.
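The properties listed above follow from the check (or pinball) loss that quantile regression minimizes in place of the squared error. As a minimal sketch, the example below fits only a constant rather than a regression line: the constant that minimizes the pinball loss at level `tau` is the empirical `tau`-quantile. The consumption values are made up for illustration and include one deliberate outlier; the helper names are hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average check (pinball) loss at quantile level tau in (0, 1).

    Under-predictions are weighted by tau, over-predictions by (1 - tau).
    """
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant_quantile(y, tau):
    """Among the observed values, return the constant with the lowest pinball loss."""
    candidates = sorted(set(y))
    return min(candidates, key=lambda q: pinball_loss(y, [q] * len(y), tau))


# Illustrative, made-up heating consumption values with one extreme outlier.
consumption = [80, 95, 100, 110, 120, 130, 145, 160, 175, 190, 900]

q10 = best_constant_quantile(consumption, 0.10)
q50 = best_constant_quantile(consumption, 0.50)
q90 = best_constant_quantile(consumption, 0.90)
mean = sum(consumption) / len(consumption)
```

In this toy data set the single outlier pulls the mean above the 90th percentile, while the estimated quantiles remain at observed, typical values. A full quantile regression replaces the constant with a function of the predictor variables and minimizes the same loss.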

Extensive literature exists for a more detailed look at the topic (Cade and Noon, 2003; Koenker and Bassett, 1978; Koenker, 2005; Koenker and Hallock, 2001). Quantile regression has been successfully applied across scientific disciplines, including economics (Machado and Mata, 2005), finance (Baur and Schulze, 2003), medicine (Boucai et al., 2011; Rosner et al., 2008), geology (Wasko and Sharma, 2014), and in the energy sector for electricity load forecasting (Boucai et al., 2011; Rosner et al., 2008). In the residential heating domain, quantile regression can potentially be used to analyze the effects of varying building, occupant, and weather parameters on energy consumption distributions and provide useful insights on their complex interdependencies.

Kaza (2010) applies quantile regression to a national-level data set for the United States in order to analyze the effects that various variables have on specific tiers of energy consumers. The predictor variables represent housing size, household size, age of house, neighborhood density, income, average energy price, ownership status (own property, rental, no pay), housing type and the climate in the form of HDD and CDD. First, the data set is clustered with respect to the quantiles of energy use for each energy type (heating, cooling, and others); afterwards, a model is fitted on each of the five clusters. Herein, only the heating model is reviewed. The analysis is conducted by testing the effect of each variable for statistical significance and by drawing conclusions for parts of the distribution. The author found that neither household size nor ownership affects heating, whereas rural households consume less than urban and suburban households, which is particularly interesting if the urban heat island effect holds. Housing size and age matter, but for housing age the effect is more pronounced for high heating energy users, which in turn implies that energy-conscious households compensate for the effect of poorly insulated houses. Heating energy price has the expected effect: energy consumption decreases with increasing price. However, the effect is stronger for the lower and middle of the distribution, but not so much at the tails. Climate has significant effects, yet the author stresses that microclimate, with its differing vegetation cover, wind patterns and albedo, has an even larger effect on heating energy but cannot be inferred from the data. In general, the effects at the tails of the consumption distribution are substantially different from those of linear regression. Valenzuela et al. 
(2014) use quantile regression to examine the effects that different variables have on the 10th, 50th and 90th percentiles of energy consumption among one-family houses in an urban area in Texas. The independent variables can be categorized into two groups: demographic and housing unit characteristics. Demographic variables include the number of adults as an ordinal variable and three dummy variables, namely head of household aged 65 years or older, marital status, and home-office worker. Housing units are described by their size in square meters, number of bedrooms and the year they were built. Moreover, six dummy variables are used that characterize number of stories, foundation type, outer wall type, and the presence of a pool, fireplace, and central cooling/heating, respectively. Since there was no way to observe whether the heating system ran on natural gas or an alternative energy source, the authors constructed an indicator variable by analyzing the Spearman correlation coefficients between HDD and natural gas consumption, and between HDD and electricity consumption. If the correlation was high for natural gas and low for electricity consumption, they inferred that the household was using natural gas as its primary heating energy source. However, no clear separation was possible, since energy consumption was highest in July and August due to cooling demand. The authors conclude that different energy conservation strategies are needed, since the effect of variables varies across the three quantiles and across months, which can be regarded as proxy variables for predominant heating or cooling energy use. In a similar context, Huang (2015) analyzes the effects of socio-economic and dwelling characteristics on quantiles of electricity consumption distributions and finds differing effects across quantiles. He concludes that policy-makers should consider the changes in future demographic structures to develop effective energy conservation programs.
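The correlation-based indicator for the primary heating fuel described for Valenzuela et al. (2014) can be sketched as follows. The thresholds `high` and `low`, the function names, and the monthly values are assumptions made for illustration only; the source does not report the cut-offs the authors actually used.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def heats_with_gas(hdd, gas, electricity, high=0.7, low=0.3):
    """Hypothetical indicator: gas tracks HDD closely, electricity does not."""
    return spearman(hdd, gas) >= high and spearman(hdd, electricity) <= low


# Made-up monthly values for one household (January to December).
hdd = [520, 430, 310, 150, 40, 0, 0, 0, 30, 160, 330, 480]
gas = [140, 120, 90, 50, 20, 10, 9, 10, 18, 55, 95, 130]
electricity = [600, 580, 560, 620, 800, 1050, 1200, 1180, 900, 640, 570, 590]

uses_gas_heating = heats_with_gas(hdd, gas, electricity)
```

In this made-up example, electricity use peaks in summer and therefore correlates negatively with HDD, so the household is flagged as heating with gas. Real households with mixed patterns would fall between the two thresholds, which is consistent with the separation difficulty noted above.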

To conclude, all the presented quantile regression models for heating energy consumption apply simple linear quantile regression, and thus do not account for complex dependencies between the variables. However, Bernard and Czado (2015) criticize the extremely restrictive assumptions on the shape of the linear quantile regression and suggest models based on adequate assumptions about the shape of conditional quantiles. More recently, innovative quantile modeling approaches, such as vine copula-based quantile regression (Kraus and Czado, 2017), quantile regression forests (Meinshausen, 2006), and SVM-based quantile estimation (Hwang and Shim, 2005) with less restrictive shape assumptions have emerged. In particular, Kraus and Czado’s (2017) vine copula-based quantile regression makes no precise assumptions about the shape of the conditional quantiles and accounts for multicollinearity by modeling the dependence structure of the input variables of a model in detail.

**[...]**

^{1} Additionally, hybrid or grey-box modelling approaches combine law- and data-driven methods.

- B.Sc. Rochus Niemierko (Author), 2018, A D-Vine Copula-Based Quantile Regression Approach for the Prediction of Heating Energy Consumption. Using Historical Data for German Households, Munich, GRIN Verlag, https://www.grin.com/document/498767
