# Estimating marginal effects of key factors that influence wholesale electricity demand and price distributions in Texas via quantile variable selection methods

## Tahir Ekin, Paul Damien and Jay Zarnikau

#### Abstract

Understanding the key drivers of prices and energy consumption is an important issue, which is complicated because the distributions of prices and consumption are asymmetric and fat-tailed. That is, the sets of relevant covariates can vary depending on the segment of interest in the conditional distributions of price and demand. Using a large data set from the Electric Reliability Council of Texas, this study uses quantile regressions and attendant variable selection methods to choose the most important factors that influence demand and price distributions; subsequently, the marginal effects of these factors are studied. Among the many findings, two critical ones are that the marginal effects of the covariates change throughout the distributions of demand and price, and that the number of relevant variables selected using mean regressions generally exceeds the number selected using quantile regressions. Related consequences for maintaining a reliable electricity market are discussed.

## 1 Introduction

### 1.1 Research motivation

Electricity tends to have the greatest price volatility of any commodity traded in a wholesale market. Since electricity cannot yet be economically stored in large quantities, prices in organized wholesale electricity markets change as system operators strive to match supply and demand by dispatching resources with varying marginal costs in real time to maintain reliability. The cost of generating and transmitting electricity fluctuates due to unanticipated changes in end-user demand, the availability of generating units to provide supply, transmission bottlenecks and many other factors. This volatility can impose significant costs and risks upon the buyers and sellers of electricity, and a variety of physical and financial hedging strategies have been developed to manage these risks. The variability in demand that contributes to price fluctuations may be the result of changes in the weather, changes in production levels at industrial facilities, a response to electricity prices, demand–response actions by load-serving entities or other factors. In addition to contributing to price volatility, demand fluctuations impose costs on an electricity market by necessitating an infrastructure sized to handle peak loads, investments in peak generation capacity and operating reserves.

A typical approach to modeling demand or prices in an electricity market is to formulate a simple linear or log-linear regression to explain the fifteen-minute or hourly prices or demand using a set of explanatory variables. To explain day-ahead market prices, such variables might include the projected level of demand, the price of the fuel associated with the generation source likely to be on-the-margin (eg, natural gas) and the expected levels of generation from baseload generation sources (eg, wind energy, solar energy and nuclear power plants). To explain real-time electricity prices, similar explanatory variables might be considered. Alternatively, one might apply the day-ahead market price along with errors in the projections used to explain the day-ahead prices. To explain electricity demand, weather and temporal variables (eg, time-of-day, month-of-year, residential-appliance-use patterns) are often applied; however, the demand side of electricity markets is becoming increasingly sensitive to sharp price fluctuations.

A limitation of simple regression models is their inability to recognize that various explanatory variables affect different parts of the distributions of prices and demand differently. They might provide acceptable explanatory power “on average” but may be poor at modeling the entire distributions of price and demand. A change in the generation level from a baseload power plant may have little impact on prices if prices are low or moderate, but it could result in a large jump or decline in prices if prices are already high. An additional cooling-degree hour may have little impact on demand if demand is low, but could have a large impact on demand during a heat wave. A change in price from USD20/MWh to USD30/MWh is unlikely to elicit any demand response, but a spike in prices to USD2000/MWh will elicit a response from the demand side of an electricity market. Thus, the relationships may, in fact, be highly nonlinear (as exemplified by a merit order curve) and different variables may have varying impacts on different levels of prices or demand.¹

¹ The bid stacks and merit order curves may change every five minutes in the Electric Reliability Council of Texas (ERCOT) market. The US Energy Information Administration website provides a “generic” merit order curve displaying the nonlinear relationship between demand and prices (http://www.eia.gov/todayinenergy/detail.php?id=7590). Though this curve is not ERCOT specific, and is based on the marginal variable cost of resources (rather than offers), it nonetheless exemplifies the nonlinearity in the data.

### 1.2 Methodology overview

We shall model energy consumption and prices for each of the major zones (also called regions) in Texas: Houston, North, South and West. For the moment, without loss of generality, consider the following standard conditional mean regression equation for demand $y$ for the $i$th region, where for illustration we assume just one independent variable, say the day-ahead market (DAM) price, denoted by $x$:

 $y=\alpha+\beta x+e.$ (1.1)

The error, $e$, is normally distributed with mean $0$ and unknown standard deviation $\sigma$. The slope, $\beta$, is the sensitivity of demand to DAM price. To motivate the need for quantile regressions, consider the following limitations of (1.1).

### 1.3 Location shift model

The slope measures the impact of DAM price on the conditional mean of demand. Hence, under this assumption, the marginal effect of price is just a “location shift”, ie, the impact at the mean value of demand will be the same as the impact for the entire distribution of demand (Heckman et al 1997), and it has no effect on the scale or shape of the demand distribution. In the empirical analysis, it will be shown that this assumption is implausible for the demand data used in this paper. It is also implausible if we replace the left-hand-side variable $y$ by price, ie, conditional mean regression models for both price and consumption of electricity in Texas are inadequate.
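To see what a pure location shift rules out, consider the following illustrative location-scale extension of (1.1). The symbols $\gamma$, $\delta$ and $q_{\tau}$ are introduced here for illustration only and are not part of the models estimated in this paper. Suppose

 $y=\alpha+\beta x+(\gamma+\delta x)e,\quad\gamma+\delta x>0,$

where $e$ is independent of $x$ and has $\tau$th quantile $q_{\tau}$. The conditional $\tau$th quantile of $y$ is then

 $Q_{\tau}(y\mid x)=(\alpha+\gamma q_{\tau})+(\beta+\delta q_{\tau})x.$

When $\delta=0$, the slope equals $\beta$ at every $\tau$, which is the location shift implied by (1.1). When $\delta\neq 0$, the marginal effect $\beta+\delta q_{\tau}$ changes across quantiles, a feature that a conditional mean regression cannot reveal.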

### 1.4 Complex features of demand distributions

Earlier, we noted energy economics reasons why covariates are likely to affect demand and price differently throughout their respective distributions. Thus, in our simple example above, it is possible that DAM prices (and other covariates) may influence the demand distribution in different ways. In addition, the statistical distribution of demand (and price) is leptokurtic (fat-tailed), skewed and influenced by outliers.

### 1.5 Varying relationships between demand and covariates

A static and linear relationship between demand and covariates such as DAM price is typically assumed in (1.1). However, such an assumption can be inconsistent with the entire demand and/or price distribution. In this regard, it is worth recalling a useful insight from Mosteller and Tukey (1977): “What the regression curve does is give a grand summary for the averages of the distributions corresponding to the set of $X$’s. We could go further and compute several different regression curves corresponding to the various percentage points of the distributions and thus get a more complete picture of the set. Ordinarily this is not done, and so regression often gives a rather incomplete picture. Just as the mean gives an incomplete picture of a single distribution, so the regression curve gives a corresponding incomplete picture for a set of distributions.” In addition, Koenker and Hallock (2001) state that “quantile regression is fast becoming a comprehensive strategy for completing the regression picture”. Koenker and Hallock studied income distributions conditioned on factors such as age, sex, years of education and poverty level. Each additional year of education will have a large effect in lower-income groups, but little or no effect in upper-income groups. This type of nonlinear response, based on various settings of the independent variables, leads to better explanatory and predictive inferences if we model the distribution of the response variable via quantile regressions.
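The idea of fitting “various percentage points” can be made concrete with the check (pinball) loss that underlies quantile regression. The following minimal Python sketch is not part of the original analysis; the price values are hypothetical and chosen only to mimic a right-skewed series with one spike. It finds the constant that minimizes the check loss at two quantile levels and compares it with the mean.

```python
def check_loss(u, tau):
    """Check (pinball) loss: tau*u for u >= 0 and (tau - 1)*u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(c, ys, tau):
    """Total check loss when every observation is predicted by the constant c."""
    return sum(check_loss(y - c, tau) for y in ys)


def best_constant(ys, tau):
    """The loss is convex and piecewise linear, so a minimizer lies at a data point."""
    return min(sorted(ys), key=lambda c: total_loss(c, ys, tau))


# Hypothetical, right-skewed "prices" with one spike (illustrative values only).
prices = [18.0, 21.0, 22.0, 24.0, 25.0, 27.0, 30.0, 45.0, 300.0]

median_fit = best_constant(prices, 0.50)
q75_fit = best_constant(prices, 0.75)
mean_fit = sum(prices) / len(prices)

print(median_fit, q75_fit, round(mean_fit, 2))
```

In this toy series the single spike pulls the mean above eight of the nine observations, whereas the fitted median and 75th percentile remain among the typical values. This is the sense in which quantile summaries complete the picture given by the mean.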

### 1.6 Variable selection

Unlike the above mean regression example, in reality, multiple factors influence demand. Likewise, a model for price would also include many covariates. Variable selection is a difficult problem. If there are $p$ independent variables, then there are $2^{p}$ possible models, ie, each distinct subset of the variables constitutes a model. Clearly, even for a modest $p$ it would be impossible to sift through all possible models to arrive at the “best” model. Note that “best” does not mean the “true” model; the latter is a theoretical artifact that is seldom, if ever, attained in practice. Also, “best” here does not mean different functional forms involving the covariates. However, we can still try to identify the “best” possible subset of variables that are most relevant to model the dependent variable of interest. Several criteria – such as Akaike, Bayesian and deviance information criteria, and forward selection – have been proposed to define such a “best” subset. Model selection in quantile regression has two interesting features. First, in the special case of the median regression it can be seen as a way of achieving robustness in variable selection. Second, in many economics data sets (including the one in this study) heterogeneity exists due to either heteroscedastic variance or covariate effects that are influenced not just by the location of the data distribution but also by higher-order moments such as skew and kurtosis. Thus, the sets of relevant covariates can vary depending on where we are on the conditional distribution of the dependent variable given covariate information. At the outset, we note that in this paper our focus is on in-sample variable selection.
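The growth of the model space can be checked directly. The short Python sketch below is an illustration rather than part of the study: it enumerates every subset of $p$ covariates for $p=7$, $9$ and $13$, the numbers of candidate covariates in the RTM price, demand and DAM price models described in Section 2, and reports the closed-form count for a larger, hypothetical $p=40$.

```python
from itertools import combinations


def count_candidate_models(p):
    """Count every subset of p covariates by explicit enumeration."""
    return sum(1 for k in range(p + 1) for _ in combinations(range(p), k))


for p in (7, 9, 13):
    print(p, count_candidate_models(p))

# Closed form for a larger, hypothetical p, where enumeration is impractical.
print(40, 2 ** 40)
```

Each of these candidate models would have to be fitted separately for every zone and every quantile level of interest, which is one reason a penalized search is attractive in place of exhaustive enumeration.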

One of the most robust methods to handle variable selection in quantile regressions is the well-known Bayesian information criterion (BIC) (see Schwarz 1978). In the popular linear mean regression model, the BIC has been successful because it is known that the best subset selection with the BIC identifies the true model consistently (Nishii 1984). Modified forms of the BIC within a quantile regression framework that account for large model spaces and large sample sizes are known to have sound statistical properties (see Machado 1993; Wu and Liu 2009; Wang et al 2012). Many of these improvements to the BIC are predicated on developing robust shrinkage estimation methods, such as the least absolute shrinkage and selection operator (Lasso) and smoothly clipped absolute deviation (SCAD) (see Wang et al 2007a, b, 2012; Lee et al 2014).

In the energy literature, several researchers have contributed to the use of quantile regressions (see, for example, Bessa et al (2012), He et al (2016), Maciejowska et al (2016), Hagfors et al (2016a, b), Cabrera and Schulz (2017), Li et al (2017), Lebotsa et al (2018), Taylor (2019), and the many additional references therein). Most of these studies consider forecasting electricity demand or prices; in the United Kingdom, for example, Hagfors et al (2016a) examine marginal effects for electricity price data but not for demand. None of the studies attempts variable selection at every quantile using BIC. Our goal is not forecasting; it is, first, to understand the impact of key covariates on the distributions of both day-ahead and real-time prices as well as electricity demand in Texas, and second, to employ variable selection methods to the various quantiles of these distributions. Identifying the most important variables would help practitioners to better understand the changing marginal effects of relevant – and differing – variables across these distributions. For instance, it is possible that the effect of natural gas prices on DAM prices is significantly different than its effect on real-time market (RTM) prices; moreover, within the DAM price distribution, natural gas prices could have differing impacts. Consider another reason why certain variables may be more relevant than others: at times, binding transmission constraints among the different zones in Texas lead to a divergence in zonal prices, and the generation resources within each zone then play a greater role in determining the prices within that zone. For example, nondispatchable wind generation that is concentrated in West Texas will occasionally set the market-clearing price in the West zone, when transmission limits prevent the export of wind generation to neighboring zones. Finally, we look at whether we should use mean or median regressions to model demand and price of electricity.

### 1.7 Paper findings

We explore the impact of key covariates on the distributions using hourly data for the years 2015–17 from the Electric Reliability Council of Texas (ERCOT) market. Divided into four major zones – Houston, North, South and West – ERCOT serves 85% of the electrical needs of the state with the largest electricity consumption in the United States; it accounts for about 8% of the nation’s total electricity generation, and is repeatedly cited as North America’s most successful attempt to introduce competition into both the generation and retail segments of the power industry (Treadway 2015).² The main findings are the following. First, we show that median (more generally, quantile) rather than mean regressions are better suited to model energy consumption and price distributions in ERCOT. Second, for each of the four zones, all the key marginal effects vary throughout the distributions of demand and price. Third, under both the demand and pricing models, a larger set of variables generally tends to be selected using a mean regression than is obtained using a median regression. Thus, even from a parsimony perspective, median regressions are better. Fourth, for the demand models in the four zones, the same set of independent variables are deemed relevant by the BIC procedure. This is a robust finding. Fifth, for the four DAM price models, almost the same set of variables are relevant, providing yet another robust conclusion regarding the “best” model. Finally, for the four RTM pricing models, both the mean and median regressions yield the same set of relevant independent variables; that is, the “best” models coincide. However, the median regression maximum likelihood estimation (MLE) values are less biased than the corresponding ordinary least squares (OLS) estimates for these independent variables.

² Strictly speaking, there are eight ERCOT zones, wherein South encompasses three smaller subzones (Austin, LCRA, San Antonio) and North includes a tiny area (Rayburn). From a modeling perspective, there is no loss in generality by focusing on the four major zones, which account for close to 85% of ERCOT’s load.

### 1.8 Overview of the paper

Section 2 describes the data and variables used in the study. The quantile regression model and the variable selection process are detailed in Section 3. Section 4 provides a comprehensive empirical analysis for the Houston zone; the results are similar for the remaining three zones in Texas. A brief discussion is given in Section 5.

## 2 Data and variables

This section describes the data used in the analytic models, including geographical scope and sample period. All data is publicly available. ERCOT regularly posts data pertaining to prices, demand and generation by fuel type on its website (http://www.ercot.com). Additional archived historical data was obtained from ERCOT through data requests. The Henry Hub natural gas price data was obtained from the US Energy Information Administration’s website.

### 2.1 Geographical scope and graphical insights

See Figure 1 for a map of ERCOT. The North and Houston zones account for about 37% and 27%, respectively, of ERCOT market energy sales, while the South and West zones contribute 12% and 9%. Further, these four zones account for nearly all of the state’s retail competition, and most of the competitive generation resides within these zones. The input mix to electricity production in Texas is shown in Figure 2. In the sample period considered, Texas witnessed a rise in wind generation and a decline in coal generation. This is consistent with the state’s policy to reduce the negative impact of energy sources that could harm the environment. West Texas has the largest wind-generation operations in the state due to favorable weather conditions for that energy source.

Figure 3 shows the system-wide demand in MWh by hour, while Figure 4 shows the average energy consumption in MWh by month for the entire sample period. In Texas, these demands typically peak in the summer months, particularly in the afternoons. In Figure 3, the latter feature is evidenced by the rising demand during daylight hours till it reaches a maximum in the afternoon before dropping off. From Figure 4, it is clear that the average load is largest in the summer months.

Figures 5 and 6 provide the distribution of prices (as box plots) by hour and month, respectively. What is striking is that, while most prices are in the USD25 to USD30 range (25th to 75th percentiles), there are several instances of very high prices. This is one of the reasons why the quantile regression methodology and the associated variable selection methods detailed in this paper are useful. The variables that impact these price distributions at the extreme quantiles are likely to be different than those in the middle.

Overall, this study uses a very rich and large database to better understand the key variables that impact DAM and RTM prices and wholesale energy demand. Price and demand in the ERCOT market have been analyzed in a variety of prior studies using many of the same variables and data sources employed here, including Woo et al (2011, 2012), Zarnikau et al (2014, 2016, 2019a) and Tsai and Eryilmaz (2018). The sets of independent variables used to model demand and price in this paper are based on these prior studies.

Note that we do not model price and demand simultaneously in this paper. In a related study, Damien et al (2019) find that for some regions in Texas the simultaneous relationship does not hold. They also show that a different approach to modeling simultaneous relationships may be viable in general. To the best of our knowledge, there is no econometrics or statistics literature showing an easy way of modeling simultaneous quantile regression equations.

### 2.2 Sample period and variables

The sample period is from January 1, 2015 to December 31, 2017. In this time frame, the data were analyzed at the hourly load level; that is, for each hour in a twenty-four-hour cycle, complete data on all the variables used in the analysis was employed, leading to a very large data set for each of the four zones. For each region, we consider three models (models 1, 2 and 3) in which wholesale demand (also called energy consumption or load in MWh), DAM price (USD/MWh) and RTM price (USD/MWh) are the dependent variables, respectively. Thus, twelve quantile regressions are executed; note that this also nets twelve median regressions, since we select the median (also called the 50th percentile) in our analysis as one of the quantiles. In addition, we also implement appropriate mean regressions, using OLS, to provide a comparative analysis.

Tables 1–3 contain the summary statistics, by zone, for the load, DAM price and RTM price variables, respectively. It is evident that all three distributions are asymmetric. Given that the excess kurtosis of a normal distribution is zero, the kurtosis values for the price distributions suggest highly leptokurtic (heavy-tailed) shapes, while the load distribution has tails that are slightly heavier than the normal for three of the four zones (South has lighter tails than the normal distribution). Note also that the prices are very large, especially DAM prices. This further suggests that modeling the entire distribution of prices could give useful insights. Indeed, estimates of marginal effects from mean regressions are likely to be biased since the conditional expectations of the response variable in such models will be stretched in the direction of the asymmetry.
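For readers who wish to reproduce shape statistics of this kind, the following Python sketch computes skewness and excess kurtosis from population moments. It is an illustration under stated assumptions: the function name `shape_summary` and both data series are hypothetical, and the paper does not state which estimator was used for the summary tables.

```python
def shape_summary(xs):
    """Return (mean, skewness, excess kurtosis) using population moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # zero for a normal distribution
    return mean, skewness, excess_kurtosis


# Hypothetical series: one symmetric, one with a single price spike.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
spiky_prices = [18.0, 21.0, 22.0, 24.0, 25.0, 27.0, 30.0, 45.0, 300.0]

print(shape_summary(symmetric))
print(shape_summary(spiky_prices))
```

A positive skewness together with a positive excess kurtosis, as produced by the spiky series, is the pattern described above for the price distributions.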

#### Model 1: wholesale energy consumption (or demand) equations

Consider the following nine variables for each of the four energy consumption models.

Temperature (Fahrenheit):

this records the temperatures at a major city within each zone. Clearly, hot or cold temperatures increase electricity demand. The relationship between temperature and demand tends to be nonlinear; sometimes, depending on the level of demand, it could result in a spike. It is nonlinear because both very hot and very cold temperatures lead to an increase in demand. Temperature indirectly affects prices; extreme temperatures increase demand, which increases prices (see also Figures 3 and 4).

Transmission price:

this cost is typically a response of load-serving entities and large industrial energy consumers, based on contributions to system peak demand in four summer months (also called four coincident peaks (4CP)). ERCOT’s staff analysis suggests demand could potentially fall by over 1000 MW during a 4CP period.³ In reality, since transmission price is based on the four highest demand readings, it is not a DAM or RTM phenomenon; as such it cannot be calculated until the end of a summer. This makes it difficult to know which fifteen-minute intervals to use in the calculation until each month is complete. Typically, we would expect the slope coefficient to be negative, as transmission prices are charged to large industrial energy consumers and load-serving entities during 4CPs. However, based on the preceding description, there is considerable uncertainty in this transmission price data and we can expect significant fluctuation in parameter estimates.

³ See “Analysis of load reductions associated with 4-CP transmission charges and price responsive load/retail DR: Raish’s presentation to the ERCOT Demand Side Working Group”. URL: http://www.ercot.com/calendar/2017/3/24/115556-DSWG.

Summer dummy:

June, July and August are coded as 1 in the binary representation of this monthly variable; other months are coded as 0 (see also Figure 4).

Hour dummy:

this binary variable measures the impact of extra demand during the peak hours of 16:00 to 18:00 each day (see also Figure 3). Generally, this variable’s marginal effect would be positive. Household consumption of energy tends to increase during these hours, when people return home. But, depending on the zonal temperatures (especially in the West and North), we can expect instances where a negative effect might result.

Lagged DAM price:

if DAM prices are high on a given day, this could lead to a reduction in demand the following day. A similar argument can be made for RTM prices.

Lagged load:

concurrent days of high demand are encapsulated via a one-day lag; the marginal effect here is likely to be positive.

Moving average of RTM price:

two-hour moving averages of RTM prices are used in the demand model for the reasons given in Section 1. In the short term, the rate of change due to this variable can be positive or negative, depending on the zonal and system-wide increase in demand requirements. Some industrial customers may cut back production even when RTM prices reach, say, USD300, whereas others might react only if prices spike to very high levels.

Price dummy for spike at USD300:

some mid-sized industrial customers tend to scale back production if RTM prices exceed USD300. This binary variable captures the impact of such customers.

Price dummy for spike at USD2000:

since we also include lagged load as an independent variable in the model, it would be prudent not to over fit by using too many dummy variables. However, the price distributions shown in the summary tables include some extremely large values. These occur so infrequently that it would be useful to study their marginal effect on energy consumption. The cutoff for this tail value is set at USD2000. We hope to capture the downward push on demand, if any, via this dummy variable.

Both of the price dummy variables above were selected after carefully exploring the graphical and tabular summaries. Our list of possible explanatory variables, including these dummy variables, is consistent with those used in the prior studies cited earlier.
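The coding of the binary and moving-average regressors described above can be sketched as follows. This Python fragment is an illustration, not the authors’ data-preparation code. The function and field names are hypothetical. The peak window “16:00 to 18:00” is assumed to mean the hours beginning at 16:00 and 17:00, and both spike dummies are assumed to be triggered by the concurrent RTM price strictly exceeding the cutoff.

```python
from datetime import datetime


def demand_features(timestamp, rtm_price, prev_rtm_price):
    """Build illustrative Model 1 regressors for one hourly record."""
    return {
        # June, July and August are coded as 1.
        "summer": 1 if timestamp.month in (6, 7, 8) else 0,
        # Assumption: the peak window covers the hours beginning 16:00 and 17:00.
        "peak_hour": 1 if timestamp.hour in (16, 17) else 0,
        # Two-hour moving average of the RTM price.
        "rtm_ma2": (rtm_price + prev_rtm_price) / 2.0,
        # Assumption: both dummies use the concurrent RTM price and a strict inequality.
        "spike_300": 1 if rtm_price > 300 else 0,
        "spike_2000": 1 if rtm_price > 2000 else 0,
    }


# Hypothetical record: 16:00 on a July afternoon with a moderate price spike.
example = demand_features(datetime(2016, 7, 15, 16), rtm_price=350.0, prev_rtm_price=250.0)
print(example)
```

Note that, coded this way, an hour with a price above USD2000 sets both spike dummies to 1; whether the original study coded the two dummies as nested or mutually exclusive is not stated.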

#### Model 2: DAM price equations

Supply offers begin at 06:00 and end at 10:00 on the previous day for ERCOT’s day-ahead price determination by 13:30. Prices for the current day are set around 18:00, after ERCOT completes the unit commitments necessary for grid reliability. As actual wind generation and total system demand on the current day are unknown on the previous day, the DAM prices for energy depend on ERCOT’s day-ahead forecasts of wind generation, hourly loads and expectations regarding the availability of generating capacity of various fossil fuels. To model these day-ahead prices for the four zones, we consider the following thirteen independent variables.

Natural gas price (NGP):

this is the price of natural gas at the Henry Hub in Louisiana in US dollars per million British thermal units (USD/MMBtu). The daily settlement price is applied to all hours of the day.

Wind-powered forecast (wind):

short-term wind power forecast and wind-powered generation resource production potential are used to compute this forecast in MWh.

System-wide nuclear generation (NUC):

in MWh.

Ancillary service prices:

REGUP-P, RRS-P and NSPIN-P (USD/MW) correspond, respectively, to regulation up, responsive reserves and nonspinning reserves. These are the prices (set in the DAM) of various operating reserves. This is generation capacity or demand response that can provide a cushion when load forecasts are inaccurate or generation resources deviate from expected levels.

Ancillary service quantity:

REGUP-Q, RRS-Q and NSPIN-Q correspond, respectively, to regulation up, responsive reserves and nonspinning reserves; these are the quantities of various operating reserves.

DAM prices from other zones:

prices will be equal in the absence of transmission constraints. But, as consumption rises and constraints on interzonal transmission of electricity emerge, prices will diverge. In each price equation, this will net three unique variables.

Day-ahead load forecast:

the demand projection (in MWh) released by ERCOT for each zone at 06:00 on the day prior to the operating day, which coincides with the start of the DAM.

#### Model 3: RTM price equations

For the real-time prices in each of the four zones, the following seven variables are considered.

1. The DAM price in each zone.
2. The load forecasting error (LFE) in each zone, provided by ERCOT.
3. Other regions’ total load forecasting error (ORTLFE); this is a simple calculation using the original data.
4. RTM prices from other zones. Like DAM prices, RTM prices will be equal in the absence of transmission constraints. But, as demand rises and constraints on interzonal transmission of electricity emerge, prices will diverge. In each price equation, this will net three unique variables.
5. Wind-generation forecasting error (WFE) in each zone.

## 3 Linear quantile regression model and its Bayesian information criterion

Following Koenker and Bassett (1978, 1982), Koenker and d’Orey (1987) and Damette and Delacote (2012), and omitting the time subscript as it is not needed in the estimation process, the linear quantile regression model for the $i$th observation is given by

 $y_{i}=\bm{X}_{i}^{\mathrm{T}}\bm{\beta}^{*}+e_{i},\quad i=1,\dots,n,$ (3.1)

where the $e_{i}$ are independent and identically distributed, $P(e\leq 0\mid\bm{X}=\bm{x})=\tau$ for almost every $\bm{x}$, $\bm{X}_{i}=(X_{i1},\dots,X_{ip})^{\mathrm{T}}$ and $\bm{\beta}^{*}=(\beta^{*}_{1},\dots,\beta^{*}_{p})^{\mathrm{T}}$. The aim is to consider only $d^{*}$ independent variables among the $X_{ij}$s, which implies that $p-d^{*}$ of the coefficients are zero in (3.1). The choice of the error distribution is predicated on the data. In the first instance, when we model the demand, DAM and RTM price distributions using all of their respective covariates, we do not assign any parametric model for the error term (see Koenker and Bassett 1978). In other words, we allow the distribution of the dependent variable to be nonparametric. This allows us to assess the type of relationship each covariate has with the dependent variables under models 1–3 without making any distributional assumptions about the data. Once this is done, we pose the following question: under models 1–3, which independent variables are most relevant? To implement this variable-selection step, we use a special case of (3.1), namely median regression, given that the data is highly asymmetric and heavy tailed (see Tables 1–3). To employ the BIC metric, following Lee et al (2014), we assume that each $e_{i}$ follows an asymmetric Laplace distribution (ALD) whose density function is given by

 $f(e)=\tau(1-\tau)\sigma^{-1}\exp\bigg(-\frac{\rho_{\tau}(e)}{2\sigma}\bigg),$ (3.2)

where $\rho_{\tau}(e)=e(2\tau-2I(e<0))$, $e_{i}$ is independent of $\bm{X}_{i}$ and $I$ is the indicator function. Thus, the conditional $\tau$-quantile of $y_{i}$ given $\bm{X}_{i}=\bm{x}_{i}$ is $\bm{x}_{i}^{\mathrm{T}}\bm{\beta}^{*}$. Note that the ALD is a very flexible family suited to the application at hand, in light of the summary statistics shown in Tables 1–3.
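The property that makes the ALD suitable for quantile regression, namely $P(e\leq 0)=\tau$, can be verified numerically. The Python sketch below is an illustration only. It codes the density (3.2) with the check function scaled as in the text and integrates it over the negative half-line with a midpoint rule. The function names, the integration limits and the choice $\sigma=1$ are assumptions made for this example.

```python
import math


def rho(e, tau):
    """Check loss as scaled in the text: e * (2*tau - 2*I(e < 0))."""
    return e * (2.0 * tau - 2.0 * (1.0 if e < 0 else 0.0))


def ald_pdf(e, tau, sigma):
    """Asymmetric Laplace density of (3.2)."""
    return tau * (1.0 - tau) / sigma * math.exp(-rho(e, tau) / (2.0 * sigma))


def prob_nonpositive(tau, sigma, lower=-100.0, steps=100_000):
    """Midpoint-rule approximation of P(e <= 0); the tail below `lower` is ignored."""
    h = (0.0 - lower) / steps
    return h * sum(ald_pdf(lower + (i + 0.5) * h, tau, sigma) for i in range(steps))


for tau in (0.25, 0.50, 0.75):
    print(tau, prob_nonpositive(tau, sigma=1.0))
```

Up to numerical error, each printed probability should agree with the quantile level $\tau$, which is why the location parameter of the ALD coincides with the conditional $\tau$-quantile.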

Adapting the notation from Lee et al (2014), let $S=\{j_{1},\dots,j_{d}\}\subset\{1,\dots,p\}$ denote a candidate model corresponding to the independent variables $X_{j_{1}},\dots,X_{j_{d}}$: define $\bm{X}_{S}=(X_{j_{1}},\dots,X_{j_{d}})^{\mathrm{T}}$, let $|S|$ be the cardinality $d$ of $S$ and let $(\hat{\beta}_{S},\hat{\sigma})$ be the maximum likelihood estimator of $(\beta_{S},\sigma)$. Then, the BIC for a linear quantile regression is given by

 $\textrm{BIC}(S)=\log\bigg(\sum^{n}_{i=1}\rho_{\tau}(y_{i}-X_{iS}^{\mathrm{T}}\hat{\beta}_{S})\bigg)+|S|\frac{\log n}{2n}C_{n},$ (3.3)

where $C_{n}$ is a positive constant that diverges to infinity as $n$ increases. Lee et al (2014) study the theoretical properties of (3.3) and show that it includes other forms of BIC as special cases; importantly, they show that this generalized BIC consistently identifies the true model in high-dimensional quantile regression models.
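A small numerical sketch may help to fix ideas about how the criterion trades fit against model size. The Python code below is an illustration, not the procedure used in the paper. It assumes that the penalty in (3.3) is $|S|(\log n/2n)C_{n}$, as in Lee et al (2014), sets $C_{n}=1$ for simplicity, and counts only the slope (not the intercept) in $|S|$. The data are hypothetical. The median regressions are solved exactly by exploiting the fact that, for a line, an optimal fit passes through two observations.

```python
import math
from itertools import combinations


def rho(e, tau):
    """Check loss in the scaling of (3.2): e * (2*tau - 2*I(e < 0))."""
    return e * (2.0 * tau - 2.0 * (1.0 if e < 0 else 0.0))


def bic(residuals, tau, size, c_n=1.0):
    """log(sum of check losses) + size * log(n) / (2n) * c_n."""
    n = len(residuals)
    fit = math.log(sum(rho(e, tau) for e in residuals))
    return fit + size * math.log(n) / (2.0 * n) * c_n


def residuals_constant(ys, tau):
    """Best constant fit; a minimizer of the check loss lies at a data point."""
    c = min(ys, key=lambda v: sum(rho(y - v, tau) for y in ys))
    return [y - c for y in ys]


def residuals_line(xs, ys, tau):
    """Best line fit; an optimal line passes through two observations,
    so a small problem can be solved by trying every pair."""
    best_loss, best_res = None, None
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        if x1 == x2:
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        res = [y - intercept - slope * x for x, y in zip(xs, ys)]
        loss = sum(rho(e, tau) for e in res)
        if best_loss is None or loss < best_loss:
            best_loss, best_res = loss, res
    return best_res


# Hypothetical data in which y depends strongly on x.
xs = [float(i) for i in range(10)]
noise = [0.3, -0.2, 0.1, -0.4, 0.5, -0.1, 0.2, -0.3, 0.4, -0.5]
ys = [2.0 * x + d for x, d in zip(xs, noise)]

bic_without_x = bic(residuals_constant(ys, 0.5), 0.5, size=0)
bic_with_x = bic(residuals_line(xs, ys, 0.5), 0.5, size=1)
print(bic_without_x, bic_with_x)
```

Because the covariate explains most of the variation in this toy example, the reduction in the log of the total check loss far outweighs the penalty for the extra variable, and the model that includes $x$ attains the smaller BIC.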

### 3.1 Variable selection using $\textbf{BIC}(S)$

In the absence of variable selection, assuming the ALD for the error term, we estimate the regression parameters by minimizing the weighted absolute values of the residuals at every quantile level. This is a slight variation from the original estimator proposed by Koenker and Bassett (1978). In other words, the choice of the ALD model for the error does not change the optimization problem developed by Koenker and Bassett. What changes is the likelihood function, which is predicated on the asymmetric Laplace density. Extending this parameter estimation to variable selection, via the BIC, is straightforward; it proceeds by simply introducing an extra penalty parameter. To this end, consider the following.

Let

 $\hat{\beta}_{\lambda}=(\hat{\beta}_{\lambda,1},\dots,\hat{\beta}_{\lambda,p})^{\mathrm{T}}.$

The process of selecting the best subset model proceeds by choosing $\lambda>0$ as follows:

 $\hat{\lambda}=\operatorname*{arg\,min}_{\lambda}\biggl\{\log\biggl(\sum_{i=1}^{n}\rho_{\tau}(y_{i}-X_{iS}^{\mathrm{T}}\hat{\beta}_{\lambda})\biggr)+|\hat{S}_{\lambda}|\frac{\log n}{2n}C_{n}\biggr\}.$ (3.4)

The selected subset $\hat{S}_{\lambda}\equiv\{j\colon\hat{\beta}_{\lambda,j}\neq 0,\,1\leq j\leq p\}$ then comprises those variables that are useful at the $\tau$th quantile of the distribution of the dependent variable $y$. All that remains is to choose an estimator $\hat{\beta}_{\lambda}$. In practice, the two most widely used ones are the Lasso and SCAD estimators. Software packages (such as R) offer both alternatives. Simulation studies generally show that when the number of independent variables is moderate, say fewer than 20, both methods yield comparable conclusions. That is, the “best” models are generally the same (see Tibshirani (1996) and Fan and Li (2001) for additional discussion of this point).

For the application considered here, we let $\lambda$ vary between 0.01 and 1. If a particular independent variable’s coefficient estimate is less than 0.001, it is set to zero, which means it is not useful in the model. We use the Lasso procedure in the R package rqPen (Sherwood and Maidman 2017). When the code converges – usually in a matter of minutes – for the optimal $\lambda$, the module outputs the corresponding estimates for $\hat{\beta}_{\lambda}$ and the BIC values at each of the selected quantiles. We select $\tau\in\{0.01,0.05,0.25,0.50,0.75,0.95,0.99\}$ to better capture the impact of the covariates at various positions along the entire distributions of price and demand, including the tails of the respective distributions. In particular, $\tau=0.50$ yields the median regression.
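The selection rule in (3.4) can be sketched in a few lines of Python. This is an illustration only, not the authors' R code: the coefficient path `coef_path` is a hypothetical stand-in for the output of a penalized quantile regression fitter, the toy data are made up, the 0.001 threshold is assumed to apply to the absolute value of each coefficient, the intercept is counted in $|\hat{S}_{\lambda}|$, and $C_{n}$ is fixed at 1 purely for the example.

```python
import math

def check_loss(e, tau):
    """rho_tau(e) = e * (2*tau - 2*I(e < 0)), as defined in the text."""
    return e * (2.0 * tau - 2.0 * (1.0 if e < 0 else 0.0))

def selected_subset(beta, tol=1e-3):
    """Indices kept in the model; coefficients below tol in magnitude count as zero."""
    return [j for j, b in enumerate(beta) if abs(b) >= tol]

def bic(X, y, beta, tau, c_n, tol=1e-3):
    """Goodness-of-fit term plus |S| * log(n) / (2n) * C_n, evaluated for one lambda."""
    n = len(y)
    S = selected_subset(beta, tol)
    total = sum(
        check_loss(yi - sum(xi[j] * beta[j] for j in S), tau)
        for xi, yi in zip(X, y)
    )
    return math.log(total) + len(S) * math.log(n) / (2.0 * n) * c_n

def choose_lambda(X, y, coef_path, tau, c_n, tol=1e-3):
    """Return the lambda on the path with the smallest BIC."""
    return min(coef_path, key=lambda lam: bic(X, y, coef_path[lam], tau, c_n, tol))

# Toy design: intercept, one relevant covariate, one irrelevant covariate (made-up values).
X = [[1, 0.0, 1.0], [1, 1.0, -1.0], [1, 2.0, 0.5],
     [1, 3.0, -0.5], [1, 4.0, 1.5], [1, 5.0, -1.5]]
y = [1.1, 3.1, 4.9, 6.9, 9.1, 10.9]

# Hypothetical coefficient estimates at three penalty levels in [0.01, 1].
coef_path = {
    0.01: [1.0, 2.0, 0.02],
    0.5: [1.0, 2.0, 0.0004],
    1.0: [1.2, 0.0, 0.0],
}

best_lambda = choose_lambda(X, y, coef_path, tau=0.5, c_n=1.0)
print(best_lambda, selected_subset(coef_path[best_lambda]))
```

In this toy setting the middle penalty level has the smallest BIC: it drops the irrelevant covariate at little cost in fit, whereas the largest penalty also removes the relevant one and the fit deteriorates sharply.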

## 4 Empirical analysis

To better focus the results, we provide copious details for the Houston region using energy consumption (also called load demanded) as the dependent variable; this is what was labeled as model 1. Since the analyses for the DAM and RTM prices as dependent variables proceed similarly, we only report key summaries for them. Once this is accomplished, we summarize the results for the demand, DAM and RTM price models for the remaining three regions. Collectively, these provide inferences for most of ERCOT’s service area, depicted in Figure 1.

### 4.1 Quantile and mean regressions for Houston demand

Where appropriate, we will employ the following terminology: the quantile curves of the regression coefficients with a long tail at the lower quantiles are called “floor effects”, whereas those with a long tail at the upper quantiles are called “ceiling effects”. The phrases “floor effect” and “ceiling effect” have stylized meanings in different areas of research. Here, we use them to delineate sharp differences in how the demand at the opposite tails of the demand distributions responds to the various covariates. In Figure 7, the horizontal axis shows the quantiles of the dependent variable and the vertical axis represents parameter estimates. The black dots are the point estimates at each of the seven quantiles (the $\tau$ values) that lie on the dashed quantile curves. The gray-shaded areas are the 95% confidence bands for these functional estimates. The solid black horizontal line is the mean regression estimate, and the dashed red lines are the 95% confidence intervals for the mean. We now discuss each of the panels in Figure 7, starting with the following important observations about all the panels.

• For each independent variable, the quantile regression coefficient estimates almost always lie outside the corresponding mean regression 95% confidence intervals. This implies that the “location shift” interpretation of the marginal effects of these variables on the demand for electricity in Houston is implausible.

• In the interests of space we do not show similar demand quantile plots for other regions, or the quantile plots for the eight pricing models (four for DAM and four for RTM). Barring a few exceptions, most of the plots are similar to Figure 7 and are available from the authors on request. Thus, a key empirical claim made at the outset is validated.

• Methodologically, mean regressions are ill-suited to studying the impact of covariates on demand and price in ERCOT as a whole. This underscores the quote by Mosteller and Tukey (1977) given in Section 1.

Let us now examine each of the panels in Figure 7.

Intercept:

the nonnormality of the distribution of demand in Houston is evident, as the quantile curve is not a line through the origin. The intercept is also the conditional quantile function for some representative case of the various covariates.

Temperature:

the ceiling and floor effects of temperature on demand are different. The marginal effect of temperature can vary anywhere from roughly 85 MWh in the lowest quantiles to 10 MWh in the uppermost quantiles of the demand distribution, depicting a convex relationship between temperature and demand. In contrast, the linear mean effect is fixed at roughly 20 MWh throughout the distribution of demand.

Transmission:

all else being fixed, we would expect this slope coefficient to be negative, as transmission prices are charged to large industrial energy consumers and load-serving entities during 4CPs. However, as noted in Section 2, there is considerable uncertainty in these transmission data, and we can expect significant fluctuation in parameter estimates. And indeed this is the case. The disparity in the transmission-price effect between the lower and upper tails is significant: the estimates change from USD2500 in the lower tails to $-$USD500 or so in the upper tails. The mean estimates severely underestimate these costs in the range of the floor effects.

Summer:

here, the floor effect is negative, whereas the ceiling effect is positive. Summer months (June, July and August) have little effect compared with non-summer months when demand is low, while they tend to have a substantial positive impact when demand is high. This is consistent with reality, as surges in demand are associated with heat waves in the summer months in Texas.

Hour:

the hour-of-day (16:00 to 18:00) impact on Houston’s demand distribution is akin to that of the summer-month dummy variable.

DAM price:

DAM prices are known a day ahead. Houston’s DAM price effect on demand varies from negative (lower tails) to positive (upper tails). That is, for each USD1/MWh change in DAM prices known on day $t-1$, demand could decrease by roughly 10 MWh or increase by as much as 30 MWh on day $t$. Note that the mean estimate of this marginal effect is clearly biased, especially in the upper tails.

Lagged load:

demand on day $t$ is an increasing function of demand on day $t-1$ up to the 80th percentile, after which it declines. This parabolic relationship is completely missed by the static, linear mean estimate.

2hrMA RTM:

this variable is day $t$’s two-hour moving average of RTM prices in Houston. The quantile curve is increasing and convex: the ceiling effect of these prices is significantly larger than the floor effect. That is, RTM prices have a stronger positive impact on demand when demand is high than when demand is low.

PriceSpike300:

this indicator variable captures the impact of DAM prices exceeding USD300. In the lower tails of the demand distribution, the marginal effect of DAM prices exceeding USD300 is positive; that is, demand increases. But the impact is negative in the upper tails. Note, however, that, unlike the other independent variables, this covariate has very wide confidence bands for the quantile and least-squares estimates, suggesting considerable uncertainty in its impact on demand.

PriceSpike2000:

this indicator variable captures the impact of DAM prices exceeding USD2000. The marginal effect of DAM prices exceeding USD2000 is slightly positive in the lower tails of the demand distribution, and is negative in the upper tails. That is, consistent with expectations, very large price spikes are negatively associated with very high demand; conversely, very large price spikes are positively, but weakly, related to demand when demand is low. The static mean regression estimate of this dummy variable, like the others, fails to capture such insights.

We next discuss the MLE and OLS estimates for the coefficients given in Table 4; these are the empirical values corresponding to the plots in Figure 7.

Barring the two price-spike dummy variables, the OLS estimates for all the other coefficients are significant at least at the 0.01 level of significance. Likewise, the quantile regression coefficients for these dummy variables are not significant, except for the price spike USD2000 dummy variable, which is significant ($p<0.05$) at the two uppermost quantiles. Most of the quantile regression coefficient estimates for the other variables are significant. This is consistent with the plots. Thus, the empirical summaries further validate our empirical claim that to better understand the marginal effects of key variables on demand for electricity in ERCOT, it is critical not to rely on conditional mean regressions.

Given the above, which of the nine independent variables we have chosen to work with to explain Houston’s demand are really relevant? To answer this question, we could use a variable selection procedure at every quantile shown in Table 4. But, from a practical perspective, it may suffice to work with just one quantile, namely the 50th percentile (or median). It is evident from the plots that the mean generally over- or underestimates the marginal effects as it gets pulled in the direction of the asymmetry of the demand distribution. However, the median – as a measure of central tendency – does not. Hence, in the next subsection, we only discuss variable selection at the median of the demand distribution. (We also carried out this variable selection at each of the chosen quantiles detailed in Table 4 and Figure 7. Since the overall conclusions are qualitatively similar, the results are omitted due to space restrictions and are available from the authors on request.) For comparison, we also execute a standard stepwise variable selection at the mean.
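The contrast between the mean and the median under asymmetry can be illustrated with a small Python sketch. The sample below is hypothetical (a right-skewed set of values with a single spike) and is not drawn from the ERCOT data; the helper names are likewise assumptions made for the example.

```python
import statistics

def check_loss(e, tau):
    """rho_tau(e) = e * (2*tau - 2*I(e < 0)); at tau = 0.5 this is |e|."""
    return e * (2.0 * tau - 2.0 * (1.0 if e < 0 else 0.0))

def total_loss(sample, q, tau):
    """Summed check loss when q is used as the tau-quantile of the sample."""
    return sum(check_loss(v - q, tau) for v in sample)

# Hypothetical right-skewed sample with one extreme value.
sample = [20, 22, 25, 27, 30, 33, 38, 45, 300]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)

# The sample value that minimizes the check loss at tau = 0.5.
best_at_median = min(sample, key=lambda q: total_loss(sample, q, 0.5))

print(mean_value, median_value, best_at_median)
```

Here the single spike pulls the mean well above most of the observations, while the median, which is also the minimizer of the check loss at $\tau=0.5$, stays in the bulk of the sample.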

#### 4.1.1 Variable selection for Houston demand model

Using the BIC, we employ variable selection for the Houston demand models. The variables selected under the median (50th percentile) and mean regressions are summarized in Table 5. Note that, of the nine variables, the mean regression selects six covariates, whereas the median regression selects four; three variables are common to both the mean and median models – temperature, lagged load and 2hrMA RTM price. Interestingly, the variables that are discarded by the median regression coincide with those that were not significant and/or had large variances from the analysis in Section 4.1. Finally, lagged load (demand) is significant and relevant under the median model but not the mean model. The latter model, in turn, selects the summer, hour and transmission dummy variables. These dummy variables are precisely the ones that represent the tails of the demand distribution; hence, the mean is more sensitive to them than the median. Indeed, by construction, lagged load, to some extent, captures the impact of these dummy variables. Thus, in addition to being parsimonious, the variables chosen under the median variable selection procedure reflect the changing marginal effects on the entire conditional distribution of demand better than the variables chosen under the conditional mean alone.

#### 4.1.2 Variable selection for Houston DAM price model

Recall that there are thirteen independent variables in each of the four regions’ DAM price models. Using the analytic process described for the Houston demand model in Sections 3 and 4.1.1, consider the variables for Houston’s DAM equation selected under the median and mean regressions, which are summarized in Table 6. Five variables are left out in the median regression, while only two are deemed irrelevant in the mean variable selection process.

#### 4.1.3 Variable selection for Houston RTM price model

Recall that there are seven independent variables in each of the four regions’ RTM price models. Using the same process as before, consider the variables for Houston’s RTM equation, which are selected under the median and mean regressions summarized in Table 7. Unlike the demand and DAM price models for Houston, for RTM prices, the same set of independent variables is selected by both the mean and median regressions. However, even in this instance, the median maximum likelihood estimates of the parameters are better than the corresponding mean estimates, since the latter are adversely influenced by extreme values and fat tails. Indeed, the MLE and OLS estimates are markedly different, which implies that the marginal effects of these covariates under the median and mean models are substantially different.

#### 4.1.4 Variable selection for North, South and West’s demand and price models

We mimic the Houston analyses for North, South and West’s demand and price models. For brevity, we omit the details. Here, we summarize the main conclusions.

1. Using the median regression variable selection process, the set of independent variables in Houston is also relevant for the other zones. This robust result is very encouraging. However, this is untrue for the mean regression variable selection across the four zones.

2. Figure 7 clearly shows the nonlinear relationships between Houston’s demand and its covariates. The plots for the other regions are similar.

3. DAM price models: like the demand models for the four regions, here there is considerable overlap in the variables that are selected for each of the four regions’ DAM pricing models. South is the only region that shows some minor differences in terms of which variables impact its DAM price distribution. The mean regression models once again select more variables than the median regressions for all these three zones’ DAM models – a less parsimonious outcome that is not surprising given the marginal effects are static under the mean regression approach.

4. RTM price models: recall from above that, in the RTM pricing model for Houston, the same set of variables was chosen by both the median and mean regression variable selection processes. This is generally true for the remaining three regions’ RTM pricing models: of the seven independent variables, the DAM price in the region of interest, and the three appropriate RTM prices in the regions that appear as independent variables, are always relevant (barring a couple of instances) under both the median and mean regression variable selection procedures.

## 5 Discussion

An understanding of the variables impacting price formation in restructured electricity markets is of great interest to electricity generators, retailers, utilities and policymakers, while an understanding of the determinants of the demand for electricity is of critical importance to system operators and planners striving to maintain reliability. Often linear or log-linear regression models are used to explain price patterns and to explain and forecast demand. Yet, the extreme volatility in electricity prices presents challenges. The variables that might explain a spike in energy prices can differ from those that might explain patterns at lower price levels. Similarly, weather variables might be important determinants of the level of demand when demand is high but may have limited explanatory power when demand is relatively low. Using quantile regressions and variable selection methods on a large data set from the ERCOT market, we identified the strongest determinants of price and demand at different levels, or percentiles, of these variables.

We found that quantile regressions are better suited than mean regressions to model the distributions of real-time and day-ahead prices as well as electricity demand in four of the largest ERCOT zones (Houston, North, South and West); these regions serve 85% of the electrical needs of the state with the largest electricity consumption in the United States. Using the BIC metric, a quantile variable selection method showed that the same four of the nine variables for the demand models are most relevant in all four regions – a robust and parsimonious finding. For the day-ahead prices, eight out of thirteen independent variables are relevant; barring the South region, this too is a robust result. For the real-time price distribution, four of the seven covariates are relevant for all regions; three out of these four are the same. In contrast, for the demand and price models, the corresponding mean regression variable selections typically yield larger subsets of relevant variables – a less parsimonious outcome; importantly, given the nonnormal nature of the price and demand distributions, the marginal impacts of the key drivers of price and demand from the mean regressions over- or underestimate the relationships. This was further confirmed by examining appropriate quantile functional plots that depict highly nonlinear relationships between the covariates and the response variables. Finally, since it is impossible to know the “true” model, the quantile regression variable selection approach used in this paper could, we hope, help guide practitioners in the direction of the variables that are most important to modeling demand in ERCOT’s four largest zones. In addition to sound contextual insights, ERCOT needs to gather data on fewer variables to better understand price and demand fluctuations, leading to more efficient use of time and resources.

It would be interesting to use quantile regressions to study the impacts of wind and solar generation on wholesale market prices. Zarnikau et al (2016) study the effects of wind-generation development on day-ahead and real-time electricity market prices in Texas using standard methods. Only recently has there been sufficient solar energy development in Texas to enable any analysis of its impact on prices. An initial quantification of the impact of solar energy development on prices is provided in Zarnikau et al (2019b).

Another area for future study is to use the variable selection methodology in this paper to select variables for predicting the distributions at particular hours, rather than by zone, as in this study. Variables that impact wholesale consumption in the morning hours might be different from those in the evening hours.

## Declaration of interest

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

## References

• Bessa, R., Miranda, V., Botterud, A., Zhou, Z., and Wang, J. (2012). Time-adaptive quantile-copula for wind power probabilistic forecasting. Renewable Energy 40(1), 29–39 (https://doi.org/10.1016/j.renene.2011.08.015).
• Cabrera, B., and Schulz, F. (2017). Forecasting generalized quantiles of electricity demand: a functional data approach. Journal of the American Statistical Association 112, 127–136 (https://doi.org/10.1080/01621459.2016.1219259).
• Damette, O., and Delacote, P. (2012). On the economic factors of deforestation: what can we learn from quantile analysis? Economic Modelling 29(6), 2427–2434 (https://doi.org/10.1016/j.econmod.2012.06.015).
• Damien, P., Fuentes-Garcia, R., Mena, R. H., and Zarnikau, J. (2019). Impacts of day-ahead versus real-time market prices on wholesale electricity demand in Texas. Energy Economics 81, 259–272 (https://doi.org/10.1016/j.eneco.2019.04.008).
• Fan, J., and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360 (https://doi.org/10.1198/016214501753382273).
• Hagfors, L. I., Bunn, D., Kristoffersen, E., Staver, T., and Westgaard, S. (2016a). Modeling the UK electricity price distributions using quantile regression. Energy 102, 231–243 (https://doi.org/10.1016/j.energy.2016.02.025).
• Hagfors, L. I., Kamperud, H., Paraschiv, F., Prokopczuk, M., Sator, A., and Westgaard, S. (2016b). Prediction of extreme price occurrences in the German day-ahead electricity market. Quantitative Finance 16(12), 1929–1948.
• He, Y., Xu, Q., Wan, J., and Yang, S. (2016). Short-term power load probability density forecasting based on quantile regression neural network and triangle kernel function. Energy 114, 498–512 (https://doi.org/10.1016/j.energy.2016.08.023).
• Heckman, J., Smith, J., and Clements, N. (1997). Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts. Review of Economic Studies 64(4), 487–535 (https://doi.org/10.2307/2971729).
• Koenker, R., and Bassett, G., Jr. (1978). Regression quantiles. Econometrica 46, 33–50 (https://doi.org/10.2307/1913643).
• Koenker, R., and Bassett, G., Jr. (1982). Robust tests for heteroscedasticity based on regression quantiles. Econometrica 50, 43–61 (https://doi.org/10.2307/1912528).
• Koenker, R., and d’Orey, V. (1987). Algorithm AS 229: computing regression quantiles. Applied Statistics 36(3), 383–393 (https://doi.org/10.2307/2347802).
• Koenker, R., and Hallock, K. (2001). Quantile regression. The Journal of Economic Perspectives 15, 143–156 (https://doi.org/10.1257/jep.15.4.143).
• Lebotsa, M., Sigauke, C., Bere, A., Fildes, R., and Boylan, J. (2018). Short-term electricity demand using partially linear additive quantile regression with an application to the unit commitment problem. Applied Energy 222, 104–118 (https://doi.org/10.1016/j.apenergy.2018.03.155).
• Lee, E., Noh, H., and Park, B. (2014). Model selection via Bayesian information criterion for quantile regression models. Journal of the American Statistical Association 109, 216–229 (https://doi.org/10.1080/01621459.2013.836975).
• Li, Z., Hurn, A., and Clements, A. (2017). Forecasting quantiles of day-ahead electricity load. Energy Economics 67, 60–71 (https://doi.org/10.1016/j.eneco.2017.08.002).
• Machado, J. (1993). Robust model selection and $M$-estimation. Econometric Theory 9(3), 478–493 (https://doi.org/10.1017/S0266466600007775).
• Maciejowska, K., Nowotarski, J., and Weron, R. (2016). Probabilistic forecasting of electricity spot prices using factor quantile regression averaging. International Journal of Forecasting 32(3), 957–965 (https://doi.org/10.1016/j.ijforecast.2014.12.004).
• Mosteller, F., and Tukey, J. (1977). Data Analysis and Regression: A Second Course in Statistics. Addison Wesley, Boston, MA.
• Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Annals of Statistics 12(2), 758–765 (https://doi.org/10.1214/aos/1176346522).
• Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6(2), 461–464 (https://doi.org/10.1214/aos/1176344136).
• Sherwood, B., and Maidman, A. (2017). rqPen: penalized quantile regression. R package, Version 2.0.
• Taylor, J. (2019). Forecasting value at risk and expected shortfall using a semiparametric approach based on the asymmetric Laplace distribution. Journal of Business and Economic Statistics 37(1), 121–133 (https://doi.org/10.1080/07350015.2017.1281815).
• Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society B 58(1), 267–288 (https://doi.org/10.1111/j.2517-6161.1996.tb02080.x).
• Treadway, N. (2015). The annual baseline assessment of choice in Canada and the United States (ABACCUS). Report, Distributed Energy Financial Group. URL: https://bit.ly/38mS4RQ.
• Tsai, C., and Eryilmaz, D. (2018). Effect of wind generation on ERCOT nodal prices. Energy Economics 76, 21–33 (https://doi.org/10.1016/j.eneco.2018.09.021).
• Wang, H., Li, G., and Jiang, G. (2007a). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business and Economic Statistics 25(3), 347–355 (https://doi.org/10.1198/073500106000000251).
• Wang, H., Li, R., and Tsai, C. (2007b). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94(3), 553–568 (https://doi.org/10.1093/biomet/asm053).
• Wang, L., Wu, Y., and Li, R. (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association 107, 214–222 (https://doi.org/10.1080/01621459.2012.656014).
• Woo, C.-K., Zarnikau, J., Moore, J., and Horowitz, I. (2011). Wind generation and zonal-market price divergence: evidence from Texas. Energy Policy 39(7), 3928–3938 (https://doi.org/10.1016/j.enpol.2010.11.046).
• Woo, C.-K., Horowitz, I., Horii, B., Orans, R., and Zarnikau, J. (2012). Blowing in the wind: vanishing payoffs of a tolling agreement for natural-gas-fired generation of electricity in Texas. Energy Journal 33(1), 207–229 (https://doi.org/10.5547/ISSN0195-6574-EJ-Vol33-No1-8).
• Wu, Y., and Liu, Y. (2009). Variable selection in quantile regression. Statistica Sinica 19, 801–817.
• Zarnikau, J., Woo, C.-K., Gillett, C., Ho, T., Zhu, S., and Leung, E. (2014). Day-ahead forward premiums in the Texas electricity market. Journal of Energy Markets 8(2), 1–20 (https://doi.org/10.21314/JEM.2015.126).
• Zarnikau, J., Woo, C.-K., and Zhu, S. (2016). Zonal merit-order effects of wind generation development on day-ahead and real time electricity market prices in Texas. Journal of Energy Markets 9(4), 17–47 (https://doi.org/10.21314/JEM.2016.153).
• Zarnikau, J., Woo, C.-K., Zhu, S., and Tsai, C. (2019a). Market price behavior of wholesale electricity products: Texas. Energy Policy 125, 418–428 (https://doi.org/10.1016/j.enpol.2018.10.043).
• Zarnikau, J., Woo, C.-K., Zhu, S., and Tsai, C.-H. (2019b). Will Texas’ operating reserve demand curve likely provide adequate investment incentive for natural-gas-fired generation? Journal of Energy Policy, forthcoming.
