About Us(CRI)

This page explains what EOF analysis is and how customers can use it. It describes one of the data analysis services we offer at CRI. Please click the "Data Analysis" button above to see the other types of data analysis we offer.


What is EOF analysis?
In short, EOF (Empirical Orthogonal Function) analysis is principal component analysis (PCA) applied to a group of time series. We explain it in more detail with some examples below; the term itself is mostly jargon used by oceanographers and atmospheric scientists. We usually use EOF analysis to extract the coherent variations that dominate a group of time series so that we can analyze them further. As in other typical applications of PCA, EOF is also used to generate index time series from a group of time series.

Why should I bother computing EOF?
The advantage of EOF analysis is that, when the original time series are reasonably coherent with each other, we need to analyze only a few (usually just one or two) new time series generated, or rather extracted, by EOF from the group of original time series. This is because these few new time series contain the majority of the variations in the original data, which might consist of hundreds of time series. We describe this topic in more detail later with some examples, but the point here is that EOF can considerably reduce the effort and cost of analyzing data. If the original time series are utterly uncorrelated with each other, however, EOF does not do any good.

Types of EOF
Several variants of EOF analysis have been applied in the past. Most of them are computed in the time domain. However, as with spectral analysis, it is possible to apply EOF in the frequency domain. Looking at time series in the time domain is convenient if you want to know the sequence (or history) of events. Looking at them in the frequency domain might be more convenient if you want to know how often events occur. Typical frequency domain analyses are power spectral analysis and coherency analysis. These are the frequency domain counterparts of an analysis based on a time series plot and of a correlation analysis, respectively. Please note that the basic unit of frequency is the inverse of time. Frequency domain EOF is occasionally called complex EOF, but this is somewhat confusing because there are other types of EOF that use complex numbers, as described below.

1 The EOF; non-complex and for time domain analysis
1-1 What is it?

The simplest EOF analysis is the time domain EOF analysis, which is basically the computation of the eigenvectors and eigenvalues of a covariance or correlation matrix computed from a group of original time series. If you are familiar with PCA, you have probably noticed already that the computation of EOF is no different from that of PCA. This type of EOF does not use complex numbers. It is best suited to cases where the coherent variations among the original time series have no time lag among them, or where the time lag is constant in time and known. In the latter case, we need to shift the data before computing EOF. This topic is described in more detail later.

We can extract the coherent variations in a group of original time series and create a new group of time series by combining the original time series with the eigenvectors. One important property of these new time series is that they are statistically uncorrelated with each other. In other words, EOF separates the coherent variations mixed together in the original time series into several components, and these components are not correlated with each other. We might therefore be able to identify the cause of the variations of each component separately. The magnitude of each eigenvalue shows how important the corresponding new time series is. The eigenvectors give similar information, but for each individual original time series.
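The computation described above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration, not the code we use in our service: the function name eof and the three synthetic time series are assumptions made for this example, and a covariance matrix is used rather than a correlation matrix.

```python
import numpy as np

def eof(data):
    """EOF of `data`, shape (n_times, n_series), from its covariance matrix.

    Returns (explained, eigenvectors, components), ordered from mode 1 upward.
    """
    anomalies = data - data.mean(axis=0)            # remove the average of each series
    covariance = np.cov(anomalies, rowvar=False)    # shape (n_series, n_series)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]           # largest eigenvalue first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    components = anomalies @ eigenvectors           # new time series, one column per mode
    explained = eigenvalues / eigenvalues.sum()     # fraction of variance per mode
    return explained, eigenvectors, components

# Synthetic stand-in for real records: one common signal seen with
# different amplitudes and signs at three "depths", plus a little noise.
rng = np.random.default_rng(0)
t = np.arange(500)
common = np.sin(2 * np.pi * t / 50)
data = np.column_stack([1.0 * common, -0.5 * common, 0.25 * common])
data = data + 0.1 * rng.standard_normal(data.shape)

explained, eigenvectors, components = eof(data)
mode_correlation = np.corrcoef(components[:, 0], components[:, 1])[0, 1]
print("fraction of variance per mode:", np.round(explained, 3))
print("correlation between mode 1 and mode 2:", mode_correlation)
```

In this sketch the columns of eigenvectors play the role of the eigenvectors discussed above, components holds the new time series, and explained is each eigenvalue divided by the sum of all eigenvalues. The correlation between the first two components comes out as zero to within numerical error.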

1-2 Example
Figure 1a shows time series plots of current meter records at several selected depths at 147E on the equator (north of Papua New Guinea). If you are not particularly interested in oceanographic data, you could think of these as whatever time series you are interested in (such as sales records) at different locations instead of depths. The vertical scale on the upper left is for the record at 50m; all other records are shifted downward to make the figure easier to read.


The zero level of each record is shown as a horizontal black line. We downloaded these data from NOAA, U.S.A. (http://www.pmel.noaa.gov/tao) and applied a band-pass filter to remove variations with periods longer than 150 days or shorter than 20 days. If you are not interested in ocean currents, you could think of these as some kind of time series, such as sales records, obtained at different locations, such as different provinces, or as a group of time series of different kinds at the same or different locations.
Figure 1a suggests that the variations of the current at 100m are inversely correlated with those at 50m, while those at 200m might be inversely correlated with those at 100m. There are data at 23 depths after bad data are discarded, but studying the relations among those 23 time series one pair at a time like this is an overwhelming task, at least for us. So we applied EOF to extract the common variations. Figure 2a shows time series plots of the two major components extracted from the original time series. Let us call these components mode 1 (blue line) and mode 2 (red line), following oceanographic tradition. Although we admit that they look somewhat correlated, their correlation coefficient is actually zero (yes, we actually computed it).
Figure 2b shows how much of the variation (called variance in statistical terminology) in the original time series (all 23 of them) these components explain. The vertical axis of this figure is the percentage and the horizontal axis is the mode number (there are 23 modes). Mode 1 explains 53.5% and mode 2 explains 14.3% of the variance present in the group of original time series. This is the information we obtain from the eigenvalues. The variance explained decreases progressively as the mode number increases. Thus, we can concentrate our analysis on a few lower modes and ignore the many higher modes without losing too much information.
In this case, if we can successfully analyze mode 1 and mode 2, we have in practice analyzed 67.8% of the variance contained in the original 23 time series. Analyzing just two time series instead of the original 23 considerably reduces the effort and cost, and this is why EOF is a useful tool.

Figure 2c, which shows the eigenvectors of mode 1 and mode 2, tells us how the amplitude of the variations plotted in Figure 2a changes with depth. Mode 1 is positive at 50m, negative at 100m and positive at 200m. This means that the variations at 100m look like mirror images of those at 50m and at 200m, except that their amplitudes differ. This pattern matches our previous description of Figure 1. The amplitudes of the mode 1 variations at 100m and at 200m are about a half (0.44 to be precise) and about a quarter (0.22) of that at 50m, respectively. Here is another advantage of using EOF: we now know quantitatively how the amplitude shown in Figure 2a varies with depth. Information like this helps us to find out what caused these variations. Even if we do not need to know the cause of these modes, knowing their amplitude at different depths quantitatively rather than qualitatively is still useful.

So what is mode 1, which contains 53.5% of the variance of the original time series? Figure 3a shows a time series plot of mode 1 (blue line) and the east-west wind speed (red line) at 147.5E on the equator, obtained from NOAA, U.S.A. (http://www.cdc.noaa.gov/cdc/reanalysis).

The correlation coefficient takes a value between -1 and 1. If the value is -1, the two variables are perfectly correlated but vary in opposite directions (like a mirror image). If the value is zero, they are not correlated at all. If the value is 1, they are perfectly correlated and vary in the same direction. Figure 3a suggests that the mode 1 variations of the ocean current at this location are related to the variations of the wind speed. The variations of the ocean current lag behind those of the wind. The correlation coefficient reaches its maximum, 0.68, when we shift the wind data to the right by 9 days. This correlation is statistically significant (we computed the 95% confidence interval assuming an effective sampling interval of 10 days, since the shortest period passed by the band-pass filter we applied is 20 days).
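Searching for the shift that maximizes the correlation can be sketched as follows. This is an illustration under stated assumptions, not our actual procedure: the function lagged_correlation and the synthetic "wind" and "current" series are hypothetical, one sample is taken to be one day, and the lag of 9 samples is built into the synthetic data on purpose.

```python
import numpy as np

def lagged_correlation(driver, response, max_lag):
    """Correlation between driver[t] and response[t + lag] for lag = 0..max_lag."""
    n = len(driver)
    values = []
    for lag in range(max_lag + 1):
        # Pair each driver sample with the response sample `lag` steps later.
        r = np.corrcoef(driver[:n - lag], response[lag:])[0, 1]
        values.append(r)
    return np.array(values)

# Synthetic stand-in data: a smoothed random series plays the role of the wind,
# and the "current" is the same series delayed by 9 samples plus a little noise.
rng = np.random.default_rng(1)
true_lag = 9
raw = np.convolve(rng.standard_normal(700), np.ones(10) / 10, mode="valid")
wind = raw[true_lag:]
current = raw[:-true_lag] + 0.05 * rng.standard_normal(len(raw) - true_lag)

correlations = lagged_correlation(wind, current, max_lag=30)
best_lag = int(np.argmax(correlations))
print("lag with the highest correlation:", best_lag)
```

The sketch only locates the lag. Judging whether the maximum is statistically significant additionally requires an estimate of the effective number of independent samples, as noted above.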

Figure 2c indicates that the wind accelerates the ocean current in the downwind direction near the surface and at depths below 160m, whereas at depths between about 80m and 160m the current is accelerated against the wind direction.

We will stop our analysis at this point since this is not a scientific paper, but we would like to mention that we published results of more detailed analysis applied to the older data at the same location in a scientific journal.


1-3 Some cautions on using EOF

EOF is just a numerical computation based on statistical theory, not a magic wand that automatically extracts useful information from a group of complicated time series. Users should pay attention to the limitations of EOF.
For example,
(1) This type of EOF cannot deal with phase or time lag among time series data.
To demonstrate this point, we created a new data set consisting of 5 time series. In the first experiment, all 5 time series are exactly the same: the record at 50m. Theoretically, the eigenvalue (expressed as a fraction of the total variance) is one for mode 1 and zero for all other modes. In other words, only one mode exists, and that mode contains all the variations at 50m. Our computation shows that the sum of the eigenvalues of mode 2 through mode 5 is 0.00000000000000013. This is not exactly zero but is negligibly small; numerical computations usually carry small errors of this kind. Thus, we only need to look at mode 1.

In the data set for the second experiment, the first time series is the record at 50m as before. The second time series is the same as the first except that it is shifted forward (to the right on a time series plot) by 5 days. We shifted the third one by another 5 days, 10 days in total, and treated the fourth and fifth time series in the same manner. Figure 4a shows what this data set looks like.

Figure 4b shows time series plots of mode 1 and mode 2. Figure 4c shows the eigenvalues for this experiment. Mode 1 contains only about 60% of the variance of the input data set and mode 2 contains about 34% of it. Figure 4d shows the eigenvectors of mode 1 and mode 2. Clearly, mode 2 is no longer negligible, and mode 1 has a different amplitude at different "depths", even though all the time series are exactly the same except for the time shift among them. You might also notice that the time series plots of mode 1 and mode 2 look suspiciously similar.

Again, the correlation coefficient between them is zero in principle, but it becomes 0.81 if we shift the mode 2 time series to the left by 13.3. Zero correlation is guaranteed only if we do not shift the resulting time series at all. To avoid a result like this, if we know that the variations in the time series have a time lag among them, we have to adjust the data set before computing EOF, as described before. Alternatively, if we are not sure how far we need to shift the original data, we might apply time domain complex EOF or frequency domain EOF instead. We describe these methods later on this page.
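The two experiments can be imitated with synthetic data. The sketch below is not a reproduction of Figures 4a to 4d: the base signal is a made-up sum of three sinusoids with periods inside the 20 to 150 day pass band, one sample is assumed to be one day, and the helper explained_variance is a name chosen for this example.

```python
import numpy as np

def explained_variance(data):
    """Fraction of variance per mode for `data` of shape (n_times, n_series)."""
    anomalies = data - data.mean(axis=0)
    eigenvalues = np.linalg.eigvalsh(np.cov(anomalies, rowvar=False))[::-1]
    return eigenvalues / eigenvalues.sum()

n, step, n_series = 1200, 5, 5           # samples, shift per series, number of series
max_shift = step * (n_series - 1)        # the last series is shifted by 20 samples
time = np.arange(-max_shift, n)
# Hypothetical base signal standing in for the record at 50m.
base = (np.sin(2 * np.pi * time / 30)
        + 0.7 * np.sin(2 * np.pi * time / 50 + 1.0)
        + 0.5 * np.sin(2 * np.pi * time / 80 + 2.0))

# Experiment 1: five identical copies of the base signal.
identical = np.column_stack([base[max_shift:max_shift + n]] * n_series)

# Experiment 2: series k is the base signal delayed by 5 * k samples.
shifted = np.column_stack(
    [base[max_shift - step * k: max_shift - step * k + n] for k in range(n_series)]
)

explained_identical = explained_variance(identical)
explained_shifted = explained_variance(shifted)
print("identical copies:", np.round(explained_identical, 3))
print("shifted copies:  ", np.round(explained_shifted, 3))
```

With identical copies, everything beyond mode 1 is numerical noise. With shifted copies, a substantial part of the variance moves into mode 2 and higher, even though every series carries the same signal.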

(2) Eigenvectors and eigenvalues are supposed to be reasonably constant
EOF produces only one set of eigenvalues (Figure 2b) and eigenvectors (Figure 2c). If these are not constant in time, the result of EOF might become hard to interpret or, at worst, meaningless. In our example there is actually a good physical reason to believe that the eigenvectors might not be constant in time. So we created a time series of the mode 1 eigenvector in the following manner. First, we computed EOF for the initial 91.25-day (1/4-year) segment of the data. Then we computed another EOF for another 91.25-day segment whose start date was shifted forward in time by half of 91.25 days. We repeated this procedure until we reached the end of the data. Figure 5, the result of this computation, shows how the eigenvector of mode 1 changes in time. At the beginning of the data the eigenvector has a two-layer structure, negative near the surface and positive below. It then changes to a three-layer structure from the beginning of 2003, and this three-layer structure continues throughout the rest of the record. For this reason, in our example (Figures 2 and 3) we discarded the data prior to February 2003 and started the computation from that month.
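A sliding-window computation of this kind might look like the sketch below. Several things here are assumptions for illustration only: the window of 91 samples and step of 45 samples approximate 91.25 days and half of that for daily data, the three synthetic series change their vertical structure halfway through, and the sign of each eigenvector (which is arbitrary) is fixed by making its first element non-negative.

```python
import numpy as np

def leading_eigenvector(segment):
    """Mode 1 eigenvector of one data segment, shape (n_times, n_series)."""
    anomalies = segment - segment.mean(axis=0)
    _, eigenvectors = np.linalg.eigh(np.cov(anomalies, rowvar=False))
    vector = eigenvectors[:, -1]          # eigh sorts ascending, so mode 1 is last
    # The sign of an eigenvector is arbitrary; fix it so windows are comparable.
    return vector if vector[0] >= 0 else -vector

def sliding_mode1(data, window, step):
    """Mode 1 eigenvector for each window, shape (n_windows, n_series)."""
    starts = range(0, len(data) - window + 1, step)
    return np.array([leading_eigenvector(data[s:s + window]) for s in starts])

# Synthetic stand-in: the third series flips sign relative to the first
# halfway through the record, imitating a change in vertical structure.
rng = np.random.default_rng(2)
n = 730
signal = np.sin(2 * np.pi * np.arange(n) / 40)
first_half = np.arange(n)[:, None] < n // 2
pattern = np.where(first_half, [1.0, -1.0, -1.0], [1.0, -1.0, 1.0])
data = signal[:, None] * pattern + 0.1 * rng.standard_normal((n, 3))

vectors = sliding_mode1(data, window=91, step=45)
print("first window:", np.round(vectors[0], 2))
print("last window: ", np.round(vectors[-1], 2))
```

Plotting the rows of vectors against the window start dates gives a picture comparable in spirit to Figure 5. A change in the sign pattern from one window to the next is the signal that a single EOF over the whole record would mix two different structures.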
(3) We might need to do some pre-processing before computing EOF
It is often better to apply some pre-processing to the data before computing EOF. In our example, we know that there are strong tidal signals in the data. We also know that there are variations with periods of half a year and one year. We are not interested in these variations. In addition, coherency function analysis gives us an idea of the frequencies at which the wind has a strong influence on the ocean currents. Based on this prior knowledge, we applied a band-pass filter to our data before computing EOF.

Another important point is that a single external factor might influence our data through several different mechanisms (or routes). The responses caused by the same factor through these different mechanisms might not be proportional to each other. For example, certain mechanisms might damp variations of shorter period while others amplify them. As a result, the time series produced by EOF might become difficult to interpret. In our example, the wind affects ocean currents on the equator in several different ways. We have a theoretical reason to believe that the mechanism by which the wind affects the ocean current near the surface, where the eigenvector is positive, differs from the mechanism at work where the eigenvector is negative. One method we can try in a case like this is to remove some of the data from the data set. So we re-calculated EOF using only the data between 40m and 80m (5 time series). We might say that we "filtered" our data, passing only those at depths between 40m and 80m. Figure 3b, the result of this re-calculation, shows shorter-period variations such as the "dual-bump" features more clearly than Figure 3a does.

If we mix time series with different units, we usually need to adjust their amplitudes unless we use a correlation matrix to compute EOF. This process is called weighting, and it is often done by multiplying each time series by a different constant. We usually remove the average, and often the trend, from each time series before computing EOF. Using a correlation matrix to compute EOF is equivalent to dividing each input time series by the square root of its variance (its standard deviation) and then computing EOF with a covariance matrix. By doing so, all the time series have equal importance (weight) in the EOF computation.
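The equivalence between the correlation matrix and the covariance matrix of standardized data is easy to check numerically. In the sketch below the three series and their scale factors (1, 100 and 0.01, standing in for different units) are made up for illustration, and mode1 is a helper name chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
common = np.sin(2 * np.pi * np.arange(n) / 40)
scales = np.array([1.0, 100.0, 0.01])     # hypothetical units for the three series
data = (common[:, None] + 0.3 * rng.standard_normal((n, 3))) * scales

anomalies = data - data.mean(axis=0)                       # remove the average
standardized = anomalies / anomalies.std(axis=0, ddof=1)   # divide by sqrt(variance)

correlation_matrix = np.corrcoef(data, rowvar=False)
covariance_of_standardized = np.cov(standardized, rowvar=False)

def mode1(matrix):
    """Eigenvector belonging to the largest eigenvalue of a symmetric matrix."""
    _, eigenvectors = np.linalg.eigh(matrix)
    return eigenvectors[:, -1]

vec_cov = mode1(np.cov(anomalies, rowvar=False))   # unweighted covariance EOF
vec_corr = mode1(correlation_matrix)               # correlation EOF
print("mode 1, covariance matrix: ", np.round(vec_cov, 3))
print("mode 1, correlation matrix:", np.round(vec_corr, 3))
```

Without weighting, the mode 1 eigenvector is dominated by the series with the largest numerical amplitude. With the correlation matrix, the three series contribute with comparable weight.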

Finally, EOF might not be able to separate variations caused by different factors, especially when those variations are correlated for whatever reason. In the ocean, we have daily variations caused by tidal motions. Near the surface, we also have daily variations caused by solar heating during the daytime and radiative cooling during the night. The periods of these variations are not exactly the same, but if the data have a limited length, EOF is quite unlikely to separate the effects of these two factors at all.

(4) The result of EOF might be meaningless.
Let us generate random numbers and create a new set of time series from them. By definition, these data contain no meaning. However, if we apply EOF, it still gives us eigenvalues, eigenvectors and time series, and they are just as meaningless as the input data. While this is a rather extreme example, there is no guarantee in general that EOF will extract any useful information. There may be cases where we cannot relate the time series generated by EOF to any known variations. In our previous example EOF gives us 23 components, but it is highly doubtful whether the variations of mode 3 and higher carry any useful information. If you are interested in extracting rather weak variations, you should try to amplify them by applying appropriate filters and by removing unnecessary time series from your data set before computing EOF.
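The random-number experiment can be tried directly. In this sketch the number of series (23, as in our example) and the record length (400 samples) are arbitrary choices, and the random numbers are independent Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(4)
n_times, n_series = 400, 23
noise = rng.standard_normal((n_times, n_series))   # meaningless input by construction

anomalies = noise - noise.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(anomalies, rowvar=False))[::-1]
explained = eigenvalues / eigenvalues.sum()

print("mode 1 fraction of variance:", round(float(explained[0]), 3))
print("equal share per mode would be:", round(1 / n_series, 3))
```

EOF still returns a full set of modes, and mode 1 always explains somewhat more than an equal share of the variance simply because the modes are sorted by eigenvalue. A spectrum that is this flat, with no mode standing out, is a hint that the modes may carry no real information.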
