There is no doubt that knowing the future can be a great advantage. Someone who can anticipate what life has in store is better placed to achieve what he or she sets out to do. Although knowing the future completely is impossible, we can try to predict what is to come. We can anticipate the weather for the day from what the sky looks like when we wake up and, based on this, decide what to wear and whether we will need an umbrella. We are not always right, and on many occasions the day doesn’t go as expected, but in general our instinct is sound and the days spoiled by a bad choice are the exception.

In the daily life of many businesses, there are also situations in which they must anticipate future events and make decisions in order to obtain the highest possible profit. For example, the manager of a clothing store can anticipate future sales by using sales data from previous years in order to ensure stock. Additional information, such as temperature or sales trends in other stores, can also be added to make the approximation more accurate. This manager is using time series without being aware of it: the available information is ordered in time, and patterns are extracted from that ordering. An example is sales grouped by day and ordered chronologically.

This example of daily sales represents a small part of what this manager might be interested in predicting about the future. The broader goal is to use the available data, arranged as time series, to get the most benefit for the business.

**Introduction to time series forecasting**

Continuing with the same example, our objective is to be able to use all the information contained in this data (product sales historical data, temperature historical data, holiday calendar and sales of competitors or stores in the same sector) to predict future sales of the different products on a daily basis. As we mentioned before, if we consider each of the available data sources grouped at a daily level and ordered chronologically, we can say that we are working with time series.

A time series is a succession of observations of a variable made at regular time intervals and ordered chronologically.

In this case, the time series associated with product sales would consist of the sales of each item measured day by day and ordered from the first day for which a historical record is available to the current day. We can define similar series for the other sources: the average temperature of each day, a daily indicator of whether the day is a holiday, and the daily sales of competitors or similar stores.
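As a minimal sketch of this idea, assuming hypothetical sales records given as (date, amount) pairs, the daily series could be built as follows. The function name `daily_series` and the sample records are illustrative, and filling days without sales with zero is an assumption made here to keep the intervals regular.

```python
from collections import defaultdict
from datetime import date, timedelta

def daily_series(records):
    """Group (date, amount) records by day and return a chronologically
    ordered list of (date, total), one entry per calendar day."""
    totals = defaultdict(float)
    for day, amount in records:
        totals[day] += amount

    series = []
    current, end = min(totals), max(totals)
    while current <= end:
        # Assumption: a day with no records counts as zero sales.
        series.append((current, totals.get(current, 0.0)))
        current += timedelta(days=1)
    return series

# Hypothetical, unordered sales records.
records = [
    (date(2023, 1, 3), 20.0),
    (date(2023, 1, 1), 15.0),
    (date(2023, 1, 1), 5.0),
]
series = daily_series(records)
```

The result has one observation per day, in chronological order, which matches the definition of a time series given above.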

Once it is clear how to consider our historical data in the form of a time series, we must set the prediction horizon. This horizon is defined as the number of forward terms of the target time series to be estimated. In the case of the clothing store we are considering, the manager should ask if it is enough to know the sales of the store on the next day or if it is necessary to estimate more than one day. In the first case, the prediction horizon is one day, but in case the manager wants to estimate more than one day, we will talk about a multi-horizon problem of ‘n’ days.
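One way to make the horizon concrete is to cut the series into training samples made of known past terms and the future terms to be estimated. The sketch below assumes a plain list of daily sales; `make_windows`, `n_inputs` and the sample values are hypothetical names and data chosen for illustration.

```python
def make_windows(series, n_inputs, horizon):
    """Split a series into (past, future) pairs.

    past   -> the n_inputs known terms used as model input
    future -> the next `horizon` terms the model must estimate
    """
    samples = []
    last_start = len(series) - n_inputs - horizon
    for start in range(last_start + 1):
        split = start + n_inputs
        samples.append((series[start:split], series[split:split + horizon]))
    return samples

sales = [10, 12, 9, 14, 15, 11, 13]

# Horizon of one day: each target is a single value.
one_day = make_windows(sales, n_inputs=3, horizon=1)

# Multi-horizon problem of n = 3 days: each target is a short sequence.
multi = make_windows(sales, n_inputs=3, horizon=3)
```

With the same history, a longer horizon leaves fewer complete samples and asks the model for more values per sample.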

This difference between a shorter and a longer prediction horizon is very relevant when choosing the estimation method. One-day-ahead forecasting can be handled by a large number of algorithms or models, whereas multi-horizon problems can be considerably harder. The reason is that the further an estimated term lies from the moment of estimation, the weaker its relationship with the latest known values. In a series containing the temperature of a city, the temperatures of two consecutive days are more closely related than the temperatures of days in different weeks. That is, assuming that tomorrow’s temperature will equal today’s is a better estimate than using today’s value for the temperature a week from now, even though it falls on the same day of the week. It is therefore essential to decide whether the problem needs to be posed as a multi-horizon prediction problem and, if so, to know which techniques or models work best to solve it.
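The temperature argument can be checked with a small experiment. The sketch below uses a synthetic temperature curve (a smooth yearly cycle, an assumption made purely for illustration and not real data) and measures the error of the "same as today" estimate at two horizons.

```python
import math

def persistence_mae(series, horizon):
    """Mean absolute error of predicting series[t + horizon] with series[t]."""
    errors = [
        abs(series[t + horizon] - series[t])
        for t in range(len(series) - horizon)
    ]
    return sum(errors) / len(errors)

# Synthetic daily temperatures over two years: 15 degrees on average,
# with a seasonal swing of 10 degrees (illustrative values only).
temps = [15 + 10 * math.sin(2 * math.pi * day / 365) for day in range(730)]

mae_1 = persistence_mae(temps, horizon=1)  # tomorrow = today
mae_7 = persistence_mae(temps, horizon=7)  # next week = today
```

On this synthetic series the one-week error is several times larger than the one-day error, which is the effect described above.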

To pursue this example further, we must first enter the fascinating world of neural networks.

**Neural networks for prediction problems**

Neural networks are a mathematical approximation of how neurons in the human brain function. Those neurons transmit information to each other through electrical impulses. A particular neuron receives the impulses generated by its neighboring neurons and, depending on the amount of impulse received, either activates and emits a response or remains inactive. The signal coming from the closest neurons is of greater intensity than the one coming from more distant ones.

This behavior was transferred to a mathematical model by means of the perceptron. The perceptron is a computational unit that receives different input signals (impulses from neighboring neurons), computes a weighted sum of these signals (signal intensity) and applies a function to the result of this sum before issuing its response (activation or inactivation of the neuron). We can consider the perceptron as the first artificial neuron model proposed. The coefficients of the weighted sum are known as ‘perceptron weights’ and the function that processes the result of this sum is known as the ‘activation function’. There are different types of activation functions, and non-linear ones are what allow networks built from these units to solve non-linear problems.
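The computation performed by a single perceptron can be sketched in a few lines. The sigmoid activation and the numeric weights below are assumptions chosen for illustration; the text does not prescribe a particular activation function.

```python
import math

def sigmoid(x):
    """A common non-linear activation that maps any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def perceptron(inputs, weights, activation=sigmoid):
    """Weighted sum of the input signals followed by an activation function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Two input signals with illustrative weights: the first input excites
# the neuron, the second one inhibits it.
output = perceptron([1.0, 0.5], weights=[0.8, -0.4])
```

Here the weighted sum is positive, so the sigmoid returns a value above 0.5, which can be read as the neuron leaning towards activation.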

Over time, new types of neurons and neural architectures have been defined in order to solve different types of problems. One example is the multilayer perceptron, in which perceptron-type neurons are arranged in layers connected in sequence, so that information flows from an input layer to an output layer.

To handle problems such as time series, **recurrent neurons** were designed. This type of neuron, in addition to returning the result of its activation as output, feeds that result back as part of its own input for the next step. That is, to make the prediction at time ‘t’, it considers the output that it generated at time ‘t-1’, and its output at time ‘t’ is in turn used at time ‘t+1’. The strength of this type of neuron lies in sequential problems such as time series or translation, where both the current element of the sequence and the preceding elements must be taken into consideration.
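A minimal sketch of this feedback loop, assuming a single scalar neuron with a tanh activation and arbitrary, untrained weights (`w_input` and `w_recurrent` are illustrative names):

```python
import math

def recurrent_step(x_t, previous_output, w_input, w_recurrent):
    """Combine the current input with the neuron's own previous output."""
    return math.tanh(w_input * x_t + w_recurrent * previous_output)

def run_recurrent(sequence, w_input=0.5, w_recurrent=0.9):
    """Process a sequence term by term, feeding each output back in."""
    output = 0.0  # no previous output before the first term
    outputs = []
    for x_t in sequence:
        output = recurrent_step(x_t, output, w_input, w_recurrent)
        outputs.append(output)
    return outputs

# The two sequences differ only in their first term.
outputs_a = run_recurrent([1.0, 0.0, 0.0])
outputs_b = run_recurrent([0.0, 0.0, 0.0])
```

The last outputs of the two runs differ even though the last inputs are identical, because the first term of the sequence keeps influencing later steps through the fed-back output.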

**Recurrent neural networks for multi-horizon forecasting**

A specific type of recurrent neuron is the LSTM (long short-term memory). In addition to taking as input its own output from the previous instant, an LSTM neuron maintains an additional state that acts as a memory of what it has processed so far. This state is updated according to the input the neuron receives, so that the effect of an input is available not only at the following instant but also further into the future. For this reason, this type of neuron is said to have a memory capable of forgetting information that is not useful and storing information that helps future predictions.
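The forgetting and storing behavior can be illustrated with one step of a scalar LSTM cell in its standard gated form. The parameter values below are hand-picked for illustration rather than learned, and the dictionary layout of `params` is a hypothetical convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name to (input weight, recurrent weight, bias).
    Returns the new output h_t and the new memory state c_t.
    """
    def gate(name, squash):
        w_x, w_h, bias = params[name]
        return squash(w_x * x_t + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)          # share of old memory to keep
    write = gate("input", sigmoid)            # share of new information to store
    candidate = gate("candidate", math.tanh)  # new information itself
    expose = gate("output", sigmoid)          # share of memory shown as output

    c_t = forget * c_prev + write * candidate
    h_t = expose * math.tanh(c_t)
    return h_t, c_t

# Illustrative parameters: keep the old memory, ignore the new input.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h_t, c_t = lstm_step(x_t=5.0, h_prev=0.1, c_prev=0.7, params=params)
```

With the forget gate almost fully open and the input gate almost closed, the memory state stays very close to its previous value of 0.7 regardless of the new input. In a trained network the gates learn when to behave this way.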

For time series processing in a multi-horizon problem, a good solution is to use LSTM-type neurons together with an encoder/decoder technique. The idea is to use a first LSTM network to encode the information contained in the last ‘n’ known terms of the series, and to feed this encoding to a second LSTM network that uses it to predict the next terms of the series. This type of architecture therefore chains two models based on artificial neurons to make the final prediction.
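The data flow of the encoder/decoder idea can be sketched as follows. To keep the example short, a simple tanh recurrent cell stands in for the LSTM, and all weights are arbitrary and untrained, so the numbers produced are meaningless as forecasts; only the structure is of interest.

```python
import math

def cell(x, state, w_x=0.5, w_state=0.8):
    """Simplified recurrent cell used here in place of an LSTM."""
    return math.tanh(w_x * x + w_state * state)

def encode(past_values):
    """Compress the last n known terms into a single state."""
    state = 0.0
    for x in past_values:
        state = cell(x, state)
    return state

def decode(state, horizon, w_out=1.0):
    """Unroll the decoder from the encoded state, one step per future term."""
    predictions = []
    previous = 0.0  # no prediction exists before the first future step
    for _ in range(horizon):
        state = cell(previous, state)
        previous = w_out * state
        predictions.append(previous)
    return predictions

encoded = encode([0.2, 0.4, 0.1])      # last n = 3 known terms
forecast = decode(encoded, horizon=3)  # next 3 terms of the series
```

The only link between the two networks is the encoded state, which is why the quality of that encoding matters so much in this design.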

Another significant advance in neural networks that is applied to time series problems is the attention layer. These layers let the network weigh which parts of the input information are most important. They are very useful in translation problems, as they allow the network to focus on the words that give context to the sentence and so find the best translation for the word being processed.
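A minimal sketch of how an attention layer assigns importance, assuming scalar queries, keys and values for brevity (real layers work with vectors and learned projections, which are omitted here):

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scores = [query * key for key in keys]
    weights = softmax(scores)
    context = sum(w * v for w, v in zip(weights, values))
    return context, weights

# Three input positions; the second key matches the query best.
context, weights = attend(query=2.0, keys=[0.1, 1.5, 0.3],
                          values=[10.0, 20.0, 30.0])
```

The second position receives the largest weight, so the resulting context is dominated by its value: the layer has "focused" on that part of the input.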

An example of a neural architecture that combines different types of artificial neurons for solving multi-horizon time series problems is Google’s Temporal Fusion Transformer (TFT). Let’s see how it works.

**Google’s TFT neural network explained**

As introduced in the previous section, we are going to look at Google’s Temporal Fusion Transformer (TFT) neural architecture. This neural architecture combines the use of LSTM type neurons with an encoding and decoding approach together with different attention layers.

**Network input and encoding/decoding block**

To explain the operation of this architecture we start from the input it receives. One of the main strengths or advantages of this architecture is the combination of temporal information known in the past, temporal information known both in the past and in the future, and static information. In the example of the clothing store that we mentioned at the beginning when we talked about the manager being able to anticipate sales based on his experience, he will probably also use this distinction between the types of information without being aware of it. He will use the characteristics of the different items (size, color, type of article) as ‘static information’, the sales of each product in the last few weeks as ‘temporal information known in the past’, and the sales of the same products in the same month but in previous years as ‘temporal information known both in the past and in the future’. Each of these types of information plays an important role in the functioning of the architecture, as it will in the mind of the manager.
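The three kinds of input can be made explicit with a small data structure. This is a hypothetical container written for illustration, not the interface of any particular TFT implementation; the field names and the `is_consistent` check are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ForecastInput:
    # Does not change over time (size, color, type of article).
    static: Dict[str, str]
    # Known only up to the moment of prediction (recent sales).
    past_observed: Dict[str, List[float]]
    # Known for past and future steps alike (holiday calendar).
    known_future: Dict[str, List[float]]

    def is_consistent(self, lookback: int, horizon: int) -> bool:
        """Past-only series cover the lookback window; series known in
        advance cover the lookback window plus the horizon."""
        past_ok = all(len(v) == lookback for v in self.past_observed.values())
        future_ok = all(
            len(v) == lookback + horizon for v in self.known_future.values()
        )
        return past_ok and future_ok

example = ForecastInput(
    static={"size": "M", "color": "blue", "type": "gloves"},
    past_observed={"sales": [3.0, 5.0, 4.0]},
    known_future={"is_holiday": [0.0, 0.0, 1.0, 0.0, 1.0]},
)
ok = example.is_consistent(lookback=3, horizon=2)
```

Keeping the three groups separate mirrors how the architecture routes them: each group enters the network at a different point.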

To get an idea of how the market is behaving, and in particular how well or poorly the different articles are selling, the manager uses the ‘temporal information known in the past’. In the neural architecture, this reasoning corresponds to the encoding step, which processes the terms of the time series prior to the moment of prediction.

This first intuition of how the market is behaving is probably combined with the information that the manager has from previous years’ sales, together with future information that may be available, such as the presence or absence of holidays, or weather forecasts. In the neural architecture we would be in the decoder part, which receives the information encoded by the encoder, together with the information of the future terms of the series for which both the past and the future are known.

Finally, we can think about how the manager is able to differentiate between the different types of products for which he is estimating sales, and to know for each of them which aspects of the available temporal information should be taken into account. A sales peak for winter gloves in the middle of August will not be as important as a sales peak for the same product at the beginning of winter. This is why static information plays a very important role in this type of problem: it can provide more information than one might initially think. In the neural architecture, this process of using the static information to adapt the prediction to each type of product is carried out by a dedicated encoder for the static variables. Its output is used by both the temporal encoder and the temporal decoder when assigning importance to the variables they receive, and as the starting point of the temporal encoder.

**Attention block**

The next block in the neural architecture combines different attention techniques capable of detecting different patterns in the time series that cause its operation to vary depending on the input received. In the case of the clothing store, this block would simulate the manager’s behavior by looking at different aspects of the available information depending on: the product for which he is trying to predict sales, the time of the year or the future time for which he is trying to anticipate. This attention block uses both the information coming from the LSTM neurons part of the previous block, and from the coding that has arisen when dealing with static information. In this way, the behavior of the architecture depends both on the static information it is processing and on the different time series it is currently using.

As a result of this attention block, the architecture returns, for each future instant to be estimated, an encoded representation that is used in the final output block.

**Output generation block**

Another great strength of this architecture is the possibility of using quantile prediction as the output method. Rather than a single expected value, this type of prediction estimates the median or other chosen quantiles of the variable to be predicted. From the store manager’s point of view, this kind of forecast does not try to predict the exact or most likely future sales figure; instead, it provides the median of future sales together with a pessimistic and an optimistic estimate. For the pessimistic estimate, the manager assumes an unfavorable scenario in which sales are unusually low. In mathematical terms, the manager is estimating a low percentile of the distribution of the variable to be predicted (sales). Conversely, a very optimistic estimate, in which the manager expects to sell more than usual, corresponds to a high percentile. A mid-percentile estimate (50th percentile) reflects neither optimistic nor pessimistic behavior and places the manager somewhere in between.

The architecture we are mentioning allows us to define as output several of these percentiles, so that, with the same trained model, we can obtain a future estimate for different levels of “optimism”. With this model, the manager would not obtain a single estimate of future sales, but for each article of clothing and moment in the future, he would obtain a range of possible sales that would allow him to make a more complete decision.
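A common way to train a model to output a given percentile is the quantile (pinball) loss, which penalizes under- and over-estimation asymmetrically. The text above does not name the loss, so treat the sketch below as an assumption about how such outputs are typically scored; the sales figures are made up.

```python
def pinball_loss(actual, predicted, quantile):
    """Quantile loss for a single prediction.

    For a high quantile, predicting too low is penalized heavily;
    for a low quantile, predicting too high is penalized heavily.
    """
    error = actual - predicted
    return max(quantile * error, (quantile - 1.0) * error)

def mean_quantile_loss(actual, predictions_by_quantile):
    """Average the pinball loss over every quantile the model outputs."""
    losses = [
        pinball_loss(actual, predicted, quantile)
        for quantile, predicted in predictions_by_quantile.items()
    ]
    return sum(losses) / len(losses)

actual_sales = 100.0

# The same under-estimate (80 units) scored as two different percentiles.
as_pessimistic = pinball_loss(actual_sales, 80.0, quantile=0.1)
as_optimistic = pinball_loss(actual_sales, 80.0, quantile=0.9)

# One model output with pessimistic, median and optimistic estimates.
forecast = {0.1: 80.0, 0.5: 98.0, 0.9: 125.0}
overall = mean_quantile_loss(actual_sales, forecast)
```

Predicting 80 when 100 units were sold costs little if the prediction was meant as the 10th percentile, and much more if it was meant as the 90th. Minimizing this loss for several quantiles at once pushes each output towards its own level of "optimism".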

**Conclusion**

Predictive models can be really valuable for anticipating events and making better business decisions. Some examples are demand forecasting and the prediction of electricity prices in the energy market. In all these cases we speak of time series forecasting, since the information used to make the prediction is arranged in time.

Although different statistical techniques or machine learning algorithms can be used, recurrent neural networks are a common choice for this type of problem. One of the most complete neural architectures to date for solving it is Google’s Temporal Fusion Transformer (TFT).

The advantages of this network are:

- The combination of static variables, time series known only in the past, and time series known in both the past and the future.
- The use of layers of attention that allow the same model to be used for the prediction of different time series.
- The use of quantile regression as model output, allowing different levels of optimism in the predictions.