# 3 Input data preparation¶

In Sec. 2.1.1 (Solution of the mean–variance model) we mentioned that the expected return vector \(\mu\) and the covariance matrix \(\Sigma\) of the securities needs to be estimated. In this chapter we will discuss this topic through a more general approach.

Portfolio optimization problems are inherently stochastic because they contain uncertain forward-looking quantities, most often the return of securities \(R\). In the case of the basic mean–variance optimization problems (2.1)-(2.3) the simple form of the objective and constraint functions make it possible to isolate uncertainty from the decision variables and wrap it into the parameters. This allows us to separate stochastic optimization into parameter estimation and deterministic optimization steps.

In many cases, for instance those involving different risk measures, however, it is not possible to estimate the optimization inputs as a separate step. These objectives or constraints can be a non-trivial function of portfolio returns, and thus we have to explicitly model the randomness. Therefore more generally our goal is to estimate the *distribution of returns* over the investment time period. The simplest way to do that is by using the concept of *scenario*.

## 3.1 Scenarios¶

A scenario is a possible realization \(\mathbf{z}_k\) of the \(N\) dimensional random vector \(Z\) of a quantity corresponding to a given time period [MWOS15]. Thus we can also see a set of scenarios as a discretization of the distribution of \(Z\).

We denote the number of scenarios by \(T\). The probability of scenario \(\mathbf{z}_k\) occurring will be \(p_k\), with \(\sum_{k=1}^T p_k = 1\). In practice when the scenarios come from historical data or are obtained via computer simulation, all scenarios are assumed to have the same probability \(1/T\).

We commonly work with scenarios of the security returns \(R\). Scenario \(k\) of security returns will be \(\mathrm{r}_k\). Assuming that \(p_k = 1/T\), we can arrange the scenarios as columns of the \(N\times T\) matrix \(\mathbf{R}\).

### 3.1.1 Scenario generation¶

Both the quality and the quantity of scenarios is important to get reliable optimal portfolios. Scenarios must be representative and also there must be enough of them to accurately model a random variable. Too few scenarios could lead to approximation errors, while too many scenarios could result in excessive computational cost. Next we discuss some common ways of generating scenarios.

### 3.1.2 Historical data¶

If we can assume that past realizations of a random quantity represents possible future outcomes, we can simply use historical data to create scenarios. Observations are then collected from a predetermined time window, with a given frequency. Each scenario would correspond to a vector of simultaneous observations for all \(N\) securities.

Using historical data is a simple non-parametric approach, but it could be inaccurate or insufficient in some cases:

The underlying assumption about past realizations representing the future might not be realistic, if markets fundamentally change in the observed time period, or unexpected extreme events happen. Such changes can happen at least in every several years, often making the collection of enough representative historical data impossible.

It can represent the tails of the distribution inaccurately because of low number of samples corresponding to the low probability tail events. Also these extreme values can highly depend on the given historical sample, and can fluctuate a lot when the sample changes.

It depends on the observations being independent and identically distributed. Without this assumption, for example if the data exhibits autocorrelation, volatility clustering, etc., the distribution estimated based on historical scenarios will be inaccurate.

### 3.1.3 Monte Carlo simulation¶

With this approach we can generate a large number of scenarios according to a specified distribution. It can be useful in cases when there is a lack of historical scenarios available in the desired timeframe, but we have a model of the relevant probability distribution. This allows us to generate scenarios through a parametric approach. It has the following steps:

Select a (multivariate) probability distribution that explains all interesting features of the quantity of interest, e. g. fat tails and tail-dependence of return.

Estimate the distribution parameters using historical (independent and identically distributed) data samples.

Generate scenarios by sampling from the fitted distribution.

Later we will see an example of Monte Carlo scenario generation in Sec. 5.4.1 (Single factor model).

## 3.2 Modeling the distribution of returns¶

In this section we describe how to estimate the distribution of returns over the time period of investment. It can be in the form of the estimated expected returns and covariance matrix of securities, or in the form of a set of return scenarios.

The simplest case is when we have sufficient amount of return data for the desired time period, then we can estimate their distribution easily and proceed with the optimization. However, such data is seldom available, especially if the time period is very long, for reasons mentioned in Sec. 3.1.2 (Historical data). Therefore often we need to use higher frequency (shorter timeframe) data to estimate the distribution, then project the distribution to the longer timeframe of interest. This is typically not straightforward to do. We will describe this procedure first for stocks, then also outline the general method for other securities.

### 3.2.1 Notions of returns¶

There are two different notions of returns. Let \(P_{t_0}\) and \(P_{t_1}\) be the prices of a security at the beginning and end of a time period.

#### 3.2.1.1 Linear return¶

Also called *simple return*, calculated as

This return definition is the one we used so far in this book, the return whose distribution we would like to compute over the time period of investment.

We can aggregate linear returns across securities, meaning that the linear return of a portfolio is the average of the linear returns of its components: \(R^\mathrm{lin}_\mathbf{x} = \sum_i x_iR^\mathrm{lin}_i\). This is a property we need to be able to validly compute portfolio return and portfolio risk. However, it is not easy to scale linear return from one time period to another, for example to compute monthly return from daily return.

#### 3.2.1.2 Logarithmic return¶

Also called *continuously compounded return*, calculated as

We can aggregate logarithmic returns across time, meaning that the total logarithmic return over \(K\) time periods is the sum of all \(K\) single-period logarithmic returns: \(R^\mathrm{log}_K = \mathrm{log}\left(P_{t_K}/P_{t_0}\right) = \sum_{k=1}^{K} \mathrm{log}\left(P_{t_k}/P_{t_{k-1}}\right)\). This property makes it very easy to scale logarithmic return from one time period to another. However, we cannot average logarithmic returns the same way as linear returns, therefore they are unsuitable for computation of portfolio return and portfolio risk.

#### 3.2.1.3 Relationship between returns¶

The two notions of return are related by

We must not confuse them, especially on longer investment time periods [Meu10a]. Logarithmic returns are suitable for projecting from shorter to longer time periods, while linear returns are suitable for computing portfolio level quantities.

### 3.2.2 Data preparation for stocks¶

Because both returns have properties necessary for portfolio optimization, the data preparation procedure for a stock portfolio is the following:

We start with logarithmic return data over the time period of estimation (e. g. one day if the data has daily frequency).

We estimate the distribution of logarithmic returns on this time period. If we assume that logarithmic returns are normally distributed, then here we estimate their mean vector and covariance matrix.

Using the property of aggregation across time we scale this distribution to the time period of investment (e. g. one year). We can scale the mean and covariance of logarithmic returns by the “square-root rule” 1. This means that if \(h\) is the time period of investment and \(\tau\) is the time period of estimation, then

\[\EMean_h = \frac{h}{\tau}\EMean_\tau,\quad \ECov_h = \frac{h}{\tau}\ECov_\tau.\]Finally, we convert it into a distribution of linear returns over the period of investment. Then we can use the property of cross-sectional aggregation to compute portfolio return and portfolio risk. We show an example working with the assumption of normal distribution for logarithmic returns in Sec. 3.3 (Example).

### 3.2.3 Data preparation in general¶

We can generalize the above procedure to other types of securities; see in detail in [Meu05]. The role of the logarithmic return in case of stocks is generalized by the concept of *market invariant*. A market invariant is a quantity that determines the security price and is repeating (approximately) identically across time. Thus we can model it as a sequence of independent and identically distributed random variables. The steps will be the following:

Identify the market invariants. As we have seen, the invariant for the stock market is the logarithmic return. However, for other markets it is not necessarily the return, e. g. for the fixed-income market the invariant is the change in yield to maturity.

Estimate the joint distribution of the market invariant over the time period of estimation. Apart from historical data on the invariants, any additional information can be used. We can use parametric assumptions, approximate through moments, apply factor models, shrinkage estimators, etc., or simply use the empirical distribution, i. e., a set of scenarios.

Project the distribution of invariants to the time period of investment. Here we will assume that the invariants are additive across time, i. e., the invariant corresponding to a larger time interval is the sum of invariants corresponding to smaller intervals (the property that logarithmic returns have). Then the projection is easy to do through the characteristic function (see in [Meu05]), or we can also scale the moments by applying the “square-root rule” (see in [Meu10b]).

Map the distribution of invariants into the distribution of security prices at the investment horizon through a pricing function. Then compute the distribution of linear returns from the distribution of prices. If we have a set of scenarios, we can simply apply the pricing function on each of them.

This process will be followed in the case studies section in an example.

### 3.2.4 Limitations of linear returns¶

While it is common to use linear returns in portfolio optimization, it is not the most general approach. Under the hood, the goal of maximizing investment returns rests on two assumptions:

Our goal is to maximize final wealth \(W = \mathbf{v}^\mathsf{T}P_h\), where \(\mathbf{v}\) is the vector of security holdings in terms of quantity and \(P_h\) is the vector of security prices at \(t=h\).

The initial wealth \(w_0 = \mathbf{v}^\mathsf{T}\mathbf{p}_0\) at time \(t=0\) is nonzero.

These make it possible to define portfolio returns and portfolio weights as

This shows that we cannot measure returns in general when the initial investment value \(w_0\) is zero, e. g. for market-neutral strategies. A workaround is to introduce suitable reference values \(b_i\) and change the above formula such that \(x_i=v_ib_i/\sum_i v_i b_i\) and \(R_i = (P_{h,i} - p_{0,i})/b_i\); see in [CJPT18] for details.

### 3.2.5 Performance assessment¶

We advise to have two separate sets of scenario data.

The first one is the

*in-sample data*, which we use for running the optimization and building the portfolio. We assume it known at the time of portfolio construction.The other data set is the

*out-of-sample data*, which is for assessing the performance of the optimal portfolio. Out-of sample data is assumed to become available after the portfolio has been built.

## 3.3 Example¶

In this example we will show a case with real data, that demonstrates the steps in Sec. 3.2.3 (Data preparation in general) and leads to the expected return vector and the covariance matrix we have seen in Sec. 2.4 (Example).

### 3.3.1 Problem statement¶

Suppose that we wish to invest in the U.S. stock market with a long-only strategy. We have chosen a set of eight stocks and we plan to re-invest any dividends. We start to invest at time \(t=0\) and the time period of investment will be \(h=1\) year, ending at time \(t=1\).

We also have a pre-existing allocation \(\mathbf{v}_0\) measured in number of shares. We can express this allocation also in fractions of total dollar value \(\mathbf{v}_0^\mathsf{T}\mathbf{p}_0\), by \(\mathbf{x}_0 = \mathbf{v}_0 \circ \mathbf{p}_0 / \mathbf{v}_0^\mathsf{T}\mathbf{p}_0\).

Our goal is to derive a representation of the distribution of one year linear returns at time \(t=1\), and use it to solve the portfolio optimization problem. Here we will represent the distribution with the estimate \(\EMean\) of the expected one year linear returns \(\mu = \mathbb{E}(R)\) and with the estimate \(\ECov\) of the covariance of one year linear returns \(\Sigma = \mathrm{Cov}(R)\), at the time point \(t=1\).

### 3.3.2 Data collection¶

We obtain a time-series of daily stock prices from data providers, for a five year period (specifically from 2016-03-21 until 2021-03-18), adjusted for splits and dividends. During this time period the market was generally bullish, with only short downturn periods. This will be reflected in the estimated mean return vector and in the covariance structure. See a deeper analysis in Sec. 5.4.1 (Single factor model).

We choose an estimation time period of one week as a balance between having enough independent observations and the homogeneity of the data series. Thus \(\tau = 1/52\) year.

At this point we assume that the price data is available in the pandas DataFrame `df_prices`

. The graphs of the price time-series can be seen on Fig. 3.1:

### 3.3.3 Estimation of moments¶

Now we will go over the steps in Sec. 3.2.3 (Data preparation in general).

#### 3.3.3.1 Market invariants¶

For stocks, both linear and logarithmic returns are market invariants, however it is easier to project logarithmic returns to longer time periods. It also has an approximately symmetrical distribution, which makes it easier to model. Therefore we resample each price time series to weekly frequency, and obtain a series of approximately 250 non-overlapping observations:

```
df_weekly_prices = df_prices.resample('W').last()
```

Next we compute weekly logarithmic returns from the weekly prices, followed by basic handling of missing values:

```
df_weekly_log_returns =
np.log(df_weekly_prices) - np.log(df_weekly_prices.shift(1))
df_weekly_log_returns = df_weekly_log_returns.dropna(how='all')
df_weekly_log_returns = df_weekly_log_returns.fillna(0)
```

#### 3.3.3.2 Distribution of invariants¶

To estimate the distribution of market invariants, in this example we choose a parametric approach and fit the weekly logarithmic returns to a multivariate normal distribution. This requires the estimation of the distribution parameters \(\mu_\tau^\mathrm{log}\) and \(\Sigma_\tau^\mathrm{log}\). For simplicity we use the sample statistics denoted by \(\EMean_\tau^\mathrm{log}\) and \(\ECov_\tau^\mathrm{log}\). The “log” superscript indicates that these statistics correspond to the logarithmic returns. In code this looks like

```
return_array = df_weekly_log_returns.to_numpy()
m_weekly_log = np.mean(return_array, axis=0)
S_weekly_log = np.cov(return_array.transpose())
```

#### 3.3.3.3 Projection of invariants¶

We project the distribution of the weekly logarithmic returns represented by \(\EMean_\tau^\mathrm{log}\) and \(\ECov_\tau^\mathrm{log}\) to the one year investment horizon. Because the logarithmic returns are additive across time, the projected distribution will also be normal with parameters \(\EMean_h^\mathrm{log} = \frac{h}{\tau}\EMean_\tau^\mathrm{log}\) and \(\ECov_h^\mathrm{log} = \frac{h}{\tau}\ECov_\tau^\mathrm{log}\).

```
m_log = 52 * m_weekly_log
S_log = 52 * m_weekly_log
```

#### 3.3.3.4 Distribution of linear returns¶

To obtain the distribution of linear returns at the investment horizon \(h\), we first derive the distribution of security prices at the investment horizon. Using the characteristic function of the normal distribution, and the pricing function \(P_h = \mathbf{p}_0\mathrm{exp}\left(R^\mathrm{log}\right)\), we get

In code, this will look like

```
p_0 = df_weekly_prices.iloc[0].to_numpy()
m_P = p_0 * np.exp(m_log + 1/2*np.diag(S_log))
S_P = np.outer(m_P, m_P) * (np.exp(S_log) - 1)
```

Then the estimated moments of the linear return is easy to get by \(\EMean = \frac{1}{\mathbf{p}_0} \circ \mathbb{E}(P_h) - \mathbf{1}\) and \(\ECov = \frac{1}{\mathbf{p}_0\mathbf{p}_0^\mathsf{T}} \circ \mathrm{Cov}(P_h)\), where \(\circ\) denotes the elementwise product and division by a vector or a matrix is also done elementwise.

```
m = 1 / p_0 * m_P - 1
S = 1 / np.outer(p_0, p_0) * S_P
```

Notice that we could have computed the distribution of linear returns from the distribution of logarithmic returns directly is this case, using the simple relationship (3.1). However in general for different securities, especially for derivatives we derive the distribution of linear returns from the distribution of prices. This case study is designed to demonstrate the general procedure. See details in [Meu05, Meu11].

Also in particular for stocks we could have started with linear returns as market invariants, and model their distribution. Projecting it to the investment horizon, however, would have been much more complicated. There exist no scaling formulas for the linear return distribution or its moments as simple as the ones for logarithmic returns.

Footnotes

- 1
Originally the square-root rule for stocks states that we can annualize the standard deviation \(\sigma_\tau\) of \(\tau\)-day logarithmic returns by \(\sigma_\mathrm{ann} = \sigma_\tau\sqrt{252/\tau}\), where \(\sigma_\mathrm{ann}\) is the annualized standard deviation, and \(252\) is the number of business days in a year.