Introduction
This article gives basic guidelines for calculating the mathematical formula that defines the reference consumption (baseline) of a facility. It reviews the available variables, the reference period and how to obtain the formula.
External variables
The error and uncertainty of the regression depend on the data that is available. Obtaining additional data can add cost to the analysis.
Below is a matrix with some of the influential external variables that can affect energy consumption. Each project can be influenced by other variables, but these are the most common. The user must identify them based on the facility and their experience.
Reference period
Establish the period of time that covers a representative cycle of activity in the facility. This determines the minimum period of data that must be collected before defining the reference consumption (baseline). Some examples:
Data preprocessing
Once the variables to be correlated with energy consumption have been established, they can be inserted in the platform and prepared for use with statistical software (Excel, Minitab, Stata, Matlab, etc.) in order to generate the formula.
There can be interactions between variables, and a variable may only be representative when it is squared or cubed. It is recommended to prepare additional data sets with the products of variables and with their powers.
Anomalous or unrepresentative data from the installation should be purged at this point.
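As an illustration of the purge step, the following minimal sketch drops readings that fall outside the Tukey fences (1.5 times the interquartile range). The rule, the function name `purge_outliers` and the sample values are assumptions chosen for the example; the article does not prescribe a specific criterion, and any removal should be checked against what is known about the installation.

```python
from statistics import quantiles


def purge_outliers(values, k=1.5):
    """Keep only readings inside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the readings
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical daily consumption in kWh; the last reading is clearly anomalous.
daily_kwh = [10, 11, 12, 11, 10, 12, 11, 100]
clean = purge_outliers(daily_kwh)
print(clean)
```

In a real data set the whole row for that day (consumption and external variables) would be removed, not only the consumption value.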
Obtaining the mathematical formula
The mathematical formula is the result of a linear or nonlinear statistical regression. Each software package has its own process for obtaining it, so follow the tutorials of the chosen tool to get the formula.
Each project has its own mathematical formula; there are no "standard" formulas for a type of installation.
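To show what such a regression computes in the simplest case, here is a minimal sketch of an ordinary least squares fit with a single predictor. The function name `fit_line` and the data are made up for illustration; a real baseline usually needs several predictors and a statistical package.

```python
def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical daily cooling degree days and consumption in kWh.
cdd = [0.0, 1.0, 2.0, 3.0]
kwh = [10.0, 12.0, 14.0, 16.0]
intercept, slope = fit_line(cdd, kwh)
print(f"kWh = {intercept:.1f} + {slope:.1f} * CDD")
```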
Statistical error or uncertainty
Every statistical process involves some error. The procedure selected to calculate the formula gives a coefficient of determination, which indicates how much of the variation in consumption is explained by the formula.
Normally, the indicator is R^2.
For example, if the result is R^2 = 98%, the formula explains 98% of the variation in consumption, so about 2% remains unexplained and is treated as the error or uncertainty.
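The usual definition of R^2 is one minus the ratio between the residual sum of squares and the total sum of squares. The following sketch computes it from measured and predicted consumption; the function name and the numbers are illustrative only.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot for measured vs. predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical measured and predicted consumption.
measured = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(r_squared(measured, predicted))
```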
Calculate a formula using Automatic Baseline Calculator
You can now easily calculate and evaluate different baseline formulas with the market app "Automatic Baseline Calculator". You can find more information about this app here.
You can install this app using the application market.
Example of manual baseline calculation using Minitab
This section details how to calculate the theoretical consumption of a location based on some external parameters.
The location
Activity: Service station, with a coffee shop, shop and restaurant.
Location: Barcelona
Data Requirements
In the case of a service station, the representative period of activity could be a week or a year. The consumption pattern probably repeats weekly, so a few weeks of data would be enough. On the other hand, if the station is strongly affected by seasonal climate, it is advisable to have one year of data.
As consumption data, at least the main consumption of the location is needed if no submetering is available.
As external data or consumption variables, it would be ideal to have degree days (CDD, HDD) and a parameter indicating the occupancy of the service station (tickets or sales).
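If degree days are not supplied directly, one common simplified method derives them from the daily mean temperature and a base temperature. The sketch below assumes that method; the base temperatures (18 °C for heating, 21 °C for cooling) and the function name are assumptions for illustration and should be replaced by the values agreed for the project.

```python
def degree_days(mean_temp_c, heating_base_c=18.0, cooling_base_c=21.0):
    """Return (HDD, CDD) for one day from its mean outdoor temperature.

    The base temperatures are illustrative defaults, not project values.
    """
    hdd = max(0.0, heating_base_c - mean_temp_c)
    cdd = max(0.0, mean_temp_c - cooling_base_c)
    return hdd, cdd


print(degree_days(10.0))  # a cold day only produces heating degree days
print(degree_days(25.0))  # a hot day only produces cooling degree days
```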
Available data
Response variable: the main consumption of the installation from January to December (1 year), at hourly frequency
Explanatory variables:
 Heating and cooling degree days (calculated from a virtual meteo station) (HDD, CDD)
 Daily sales (S)
 Daily tickets (T)
Interaction between variables: It is worth generating new data sets by transforming existing variables and studying whether they correlate with consumption. For this example, squared and cubed degree-day variables and squared tickets (T) and sales (S) variables are created, as well as the products Sales*Heating degree days and Sales*Cooling degree days:
 HDD², HDD³
 CDD², CDD³
 T², S²
 S*HDD, S*CDD
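The derived columns listed above can be generated in any spreadsheet or script. A minimal sketch follows, assuming each day is held as a dictionary with the keys HDD, CDD, S and T; the function name and the sample values are hypothetical.

```python
def add_derived_variables(day):
    """Return a copy of one daily record with the transformed variables added."""
    hdd, cdd, s, t = day["HDD"], day["CDD"], day["S"], day["T"]
    return {
        **day,
        "HDD^2": hdd ** 2,
        "HDD^3": hdd ** 3,
        "CDD^2": cdd ** 2,
        "CDD^3": cdd ** 3,
        "T^2": t ** 2,
        "S^2": s ** 2,
        "S*HDD": s * hdd,
        "S*CDD": s * cdd,
    }


# Hypothetical record for one winter day.
row = add_derived_variables({"HDD": 2, "CDD": 0, "S": 100, "T": 10})
print(row)
```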
Resolution or baseline frequency
The resolution (frequency) of the baseline is limited by the resolution of the available variables. In this case, degree days, tickets and sales have "Daily" resolution, so this will be the resolution of the formula.
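Because the consumption in this example is recorded hourly while the predictors are daily, the consumption has to be aggregated to daily totals before running the regression. A minimal sketch, assuming the readings are available as (timestamp, kWh) pairs; the function name, dates and values are made up.

```python
from collections import defaultdict
from datetime import datetime


def hourly_to_daily(readings):
    """Sum (timestamp, kWh) hourly readings into one total per calendar day."""
    daily = defaultdict(float)
    for timestamp, kwh in readings:
        daily[timestamp.date()] += kwh
    return dict(sorted(daily.items()))


# Hypothetical hourly readings spanning two days.
hourly = [
    (datetime(2023, 1, 1, 0), 12.0),
    (datetime(2023, 1, 1, 1), 11.5),
    (datetime(2023, 1, 2, 0), 13.0),
]
daily = hourly_to_daily(hourly)
print(daily)
```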
Calculating the formula with Minitab
Minitab is a statistical software package that can calculate linear regression formulas. Other software can run similar analyses.
After starting the software, enter the data in the columns as shown in the following figure:
Note: In the image above there is an error with the first HDD^2 values, as they are HDD^3.
The tool used to calculate the formula is called "Regression". A more exhaustive and more accurate analysis can be run with the "Regression step by step" tool.
Click on "Statistics" > "Regression" > "Regression..."
As "Response", select "Main [kWh]" and as "Predictors" the rest of the variables except "Date", which is not needed to calculate the formula. Variables are selected by double-clicking on them.
Note: To view the residuals graphically, activate the "four in one" option in "Graphics". Plotting the residuals (errors) helps to detect outliers, which can be removed from the analysis.
Click on "Accept" and Minitab will automatically calculate the equation based on the selected predictors, together with the coefficient of determination (Rcuad). Then it is time to refine the result.
Refining the equation: statistical P value
The first estimated equation is not necessarily the best. In fact, the model includes a set of interactions and transformations that may not correlate with consumption.
One way to tell whether a variable correlates is to look at its P value. As a general rule of thumb, if it is greater than 0.05, it can be assumed that the variable does not contribute significantly to the model.
Following this rule, the first model has up to 7 variables that do not correlate:
Not all of these variables should be eliminated at once, because they are related to one another. Remove them one by one, from highest to lowest P value, and check the P values obtained at each iteration.
Moreover, a variable that has been removed may show a P value below 0.05 when it is reintroduced later.
Therefore, the refinement process does not have a single outcome, and a very large number of combinations can exist. We must decide when the model is good enough for our purpose and stop iterating.
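The one-by-one removal described above is often called backward elimination. The following sketch shows the idea under stated assumptions: it is not Minitab's procedure, all function names are hypothetical, the data are synthetic, and the P values use a normal approximation instead of the t distribution a statistical package would use (reasonable only for large samples). It also never reintroduces a removed variable.

```python
from statistics import NormalDist


def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_p_values(columns, y):
    """Fit y on the named columns plus a constant; return (coefs, p_values)."""
    names = ["const"] + list(columns)
    n, k = len(y), len(names)
    rows = [[1.0] + [columns[name][i] for name in columns] for i in range(n)]
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    p_values = {}
    for j, name in enumerate(names):
        z = beta[j] / (sigma2 * inv[j][j]) ** 0.5
        # Assumption: normal approximation to the t distribution.
        p_values[name] = 2 * (1 - NormalDist().cdf(abs(z)))
    return dict(zip(names, beta)), p_values


def backward_elimination(columns, y, threshold=0.05):
    """Drop the predictor with the highest P value until all are <= threshold."""
    kept = dict(columns)
    while kept:
        coefs, p_values = ols_p_values(kept, y)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] <= threshold:
            return coefs, p_values
        del kept[worst]
    return ols_p_values(kept, y)


# Synthetic data: y depends on x1 only; x2 is an unrelated candidate predictor.
x1 = [float(i % 20) for i in range(60)]
x2 = [float((i * 7) % 11) for i in range(60)]
noise = [((i * 13) % 17 - 8) / 8 for i in range(60)]
y = [3 + 2 * a + e for a, e in zip(x1, noise)]
coefs, p_values = backward_elimination({"x1": x1, "x2": x2}, y)
print(coefs)
```

The threshold of 0.05 follows the rule of thumb given above; each removal should still be reviewed by hand, since the final choice of model depends on the purpose of the analysis.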
This could be a result:
The formula has been simplified, using only cooling degree days (CDD) and sales (S) as predictors. The formula is a nonlinear third-degree polynomial: it contains the cubic CDD variable and the interaction of CDD with sales (S), indicating that more cooling demand is related to more sales.
The R^2 value is good (94.4%), which indicates an error in savings of about 5.6%.
Outliers
The residual chart shows whether the model contains outliers, which lower the R^2 value and increase the error.
When this happens, an agreement could be reached with the client to delete this data from the model.
Once the formula has been calculated, it is time to insert it into a Measure & Verification project in the platform.
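Besides inspecting the chart, candidate outliers can be listed numerically. This sketch flags the days whose residual lies more than three standard deviations from the mean residual; the threshold, the function name and the data are assumptions for illustration, and flagged days should still be reviewed before anything is deleted.

```python
from statistics import mean, pstdev


def flag_outliers(actual, predicted, threshold=3.0):
    """Return the indices whose residual is more than `threshold` standard
    deviations away from the mean residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    centre = mean(residuals)
    spread = pstdev(residuals)
    return [i for i, e in enumerate(residuals)
            if abs(e - centre) > threshold * spread]


# Hypothetical data: the model predicts 100 kWh every day; day 7 is anomalous.
predicted = [100.0] * 20
actual = [100 + (0.1 if i % 2 == 0 else -0.1) for i in range(20)]
actual[7] = 110.0
flagged = flag_outliers(actual, predicted)
print(flagged)
```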