Implementation Guide for the Generic EnKF Package

Written by Olwijn Leeuwenburgh (KNMI) and Laurent Bertino (NERSC) during their collaboration within the ENACT project

General recommendations

We advise using the EnKF to perform the analysis step only. A separate program is needed to integrate the model ensemble forward (one can write a script or a Fortran control program that alternately performs a model integration and an analysis step and keeps track of the time parameter), as in the sketch below.
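
For illustration, here is a minimal sketch of such a control program. The executable names model.x and EnKF.x and the fixed number of cycles are assumptions made for this example; they are not part of the package.

   ! Sketch of a control program that alternates model integrations and
   ! analysis steps.  The executable names "model.x" and "EnKF.x" and the
   ! number of cycles are assumptions for illustration only.
   program enkf_driver
      implicit none
      integer :: icycle, stat
      integer, parameter :: ncycles = 10   ! number of assimilation cycles

      do icycle = 1, ncycles
         ! Propagate every ensemble member to the next analysis time.
         call execute_command_line('./model.x', exitstat=stat)
         if (stat /= 0) stop 'model integration failed'

         ! Perform the EnKF analysis on the forecast ensemble.
         call execute_command_line('./EnKF.x', exitstat=stat)
         if (stat /= 0) stop 'analysis step failed'
      end do
   end program enkf_driver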

The EnKF will need as input files the forecast ensemble and the measurement files. It will produce as output file the analysed ensemble, ensembleA.uf, containing all restart members for the next model propagation. As for the forecast file, the members should preferably be written to a single binary file on different records; a sketch of this convention follows.
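
A convenient way to follow this convention in Fortran is direct-access unformatted I/O with one member per record, as in this minimal sketch. The state dimension, ensemble size and the use of random numbers as placeholder members are assumptions for illustration.

   ! Sketch: write all ensemble members to one binary file, one member
   ! per record, using direct-access unformatted I/O.  ndim and nrens
   ! are assumptions for illustration.
   program write_ensemble
      implicit none
      integer, parameter :: ndim  = 100000   ! state vector dimension
      integer, parameter :: nrens = 100      ! ensemble size
      real, allocatable :: A(:,:)            ! A(ndim,nrens): the ensemble
      integer :: j, reclen

      allocate(A(ndim,nrens))
      call random_number(A)                  ! placeholder for real members

      inquire(iolength=reclen) A(:,1)        ! record length of one member
      open(10, file='ensembleA.uf', form='unformatted', &
           access='direct', recl=reclen)
      do j = 1, nrens
         write(10, rec=j) A(:,j)             ! member j on record j
      end do
      close(10)
   end program write_ensemble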

The makefile is written for the IBM and some compiling options may need to be changed. It is advisable, however, to stick to the file-naming and other conventions followed here. mksource.sh is used to create a list of source files, and mkdepend.pl then diagnoses the dependencies between the routines. Use 'make' to compile and create the EnKF executable.

Use of the generic routines

The following points, discussed in the sections below, should be considered when adapting the code to make it suitable for one's own model:

Application-oriented routines

Only for those who want to do exactly the same as we do! These routines can be found under Code/NERSC_implementation/. People who start data assimilation from scratch can easily adapt them to their own needs.

Setup of an EnKF assimilation experiment

The model

Programming issue

The model should be able to restart several times from the same time step; usually it is just a matter of reading the forcing fields at the right place. Each of the members can then run with its own random perturbations, independently from the other members; they will only meet at analysis time. Remember to reset the random numbers to a different seed every time you restart the executable, as in the sketch below.
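
One way to guarantee a different seed at every restart is to combine the system clock with the member index. This is a minimal sketch; the routine name and the way the member index reaches the executable are assumptions.

   ! Sketch: give each member, at each restart, a different random seed
   ! by combining the system clock with the member index.  The routine
   ! name and how the member index is obtained are assumptions.
   subroutine set_member_seed(member)
      implicit none
      integer, intent(in) :: member
      integer :: n, i, iclock
      integer, allocatable :: seed(:)

      call random_seed(size=n)
      allocate(seed(n))
      call system_clock(count=iclock)
      do i = 1, n
         seed(i) = iclock + 37*member + 17*i   ! differs per member and restart
      end do
      call random_seed(put=seed)
      deallocate(seed)
   end subroutine set_member_seed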

Choosing the model error

That is where science really starts. Feel free to randomize wherever you think the model has its main error. The ensemble variance or spread will indicate whether you have touched a sensitive parameter or not. The assimilation results (in particular the reduction in ensemble spread for each variable) will then show you where the model builds the multivariate links between model variables and measurements. For example, in the TOPAZ experiment the ECMWF atmospheric forcing fields are perturbed by adding 2D random fields (m_pseudo2D.F90 was used for this purpose), and each ensemble member runs in HYCOM with its own Richardson mixing parameter (constant in time). The model error can also be made time-correlated; remember then to store its value and read it again at restart time, as in the sketch below.
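
A common way to make the model error time-correlated is a first-order autoregressive (red noise) update between forcing times, q_new = alpha*q_old + sqrt(1 - alpha**2)*w. This is a minimal sketch; the field dimensions and alpha are assumptions, the white noise stands in for a smooth pseudo-random field (such as one from m_pseudo2D.F90), and q must be saved with the restart file so it can be read back on restart.

   ! Sketch: variance-preserving AR(1) update of a model-error
   ! perturbation field q.  Dimensions and alpha are assumptions; the
   ! uniform noise below is a crude placeholder for a smooth random field.
   subroutine update_model_error(q, nx, ny, alpha)
      implicit none
      integer, intent(in)    :: nx, ny
      real,    intent(in)    :: alpha        ! 0 < alpha < 1: time correlation
      real,    intent(inout) :: q(nx,ny)     ! perturbation, kept across restarts
      real :: w(nx,ny)

      call random_number(w)
      w = w - 0.5                            ! crude zero-mean placeholder noise
      q = alpha*q + sqrt(1.0 - alpha**2)*w   ! red-noise update of the error
   end subroutine update_model_error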

The measurements

They do not need to be part of the state vector. If the measurements relate only to diagnostic variables (like sea surface heights in HYCOM), you can store these diagnostic variables for every member in a separate auxiliary file (named model_ssh.uf, for example) and read it before the analysis to compute the matrix S = HA' and the innovations D' = D - HA. Measurement errors can be space-correlated if the measurements have been interpolated/regridded beforehand; consider using m_pseudo2D.F90 to simulate the random measurement error fields in that case. (A Cholesky decomposition of the covariance matrix can also be used if there are fewer than about 1000 irregularly spaced measurements.) A sketch of the computation of S and the innovations follows.
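
This minimal sketch reads the auxiliary diagnostics file (one member per record, as above) and builds S = HA' and the innovations D' = D - HA. It assumes a nearest-grid-point observation operator given by an index array, assumes the matrix D of perturbed measurements has already been built, and all names are illustrative.

   ! Sketch: build S = HA' and the innovations D' = D - HA from an
   ! auxiliary file holding the observed quantity (e.g. SSH) for every
   ! member.  The nearest-grid-point operator (iobs) and all names are
   ! assumptions.
   subroutine obs_space_quantities(nobs, nrens, ngrid, iobs, D, S, Dprime)
      implicit none
      integer, intent(in)  :: nobs, nrens, ngrid
      integer, intent(in)  :: iobs(nobs)        ! grid index of each measurement
      real,    intent(in)  :: D(nobs,nrens)     ! perturbed measurements
      real,    intent(out) :: S(nobs,nrens)     ! S = HA'
      real,    intent(out) :: Dprime(nobs,nrens)
      real :: ssh(ngrid), HA(nobs,nrens), mean(nobs)
      integer :: j, m, reclen

      inquire(iolength=reclen) ssh
      open(11, file='model_ssh.uf', form='unformatted', &
           access='direct', recl=reclen)
      do j = 1, nrens
         read(11, rec=j) ssh                    ! diagnostic field of member j
         do m = 1, nobs
            HA(m,j) = ssh(iobs(m))              ! measurement operator H
         end do
      end do
      close(11)

      mean = sum(HA, dim=2)/real(nrens)
      do j = 1, nrens
         S(:,j)      = HA(:,j) - mean           ! ensemble anomalies in obs space
         Dprime(:,j) = D(:,j) - HA(:,j)         ! innovations D' = D - HA
      end do
   end subroutine obs_space_quantities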

The initial error

It is difficult to tell what the initial state error at a given time is! People with successful OI experience may use their OI covariance matrix to draw random simulations of the initial members (see the sketch below). In HYCOM, a convenient (and efficient) way of simulating the initial error is to perturb the depths of the interfaces between isopycnic layers (see the program NERSC_implementation/HYCOM_IO/mkensemble.F90). We then integrate the ensemble one or two weeks forward with perturbations in the forcing fields to let it build up the adequate multivariate covariances. Another strategy is to record a climatological ensemble during a model spinup, storing the restart files during the adequate season so that you can restart from each of them. One should then make sure to avoid having a bunch of members VERY different from all the others. Which leads us to the next remark ...
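
For the OI route, a minimal sketch: draw initial members x_j = xmean + L z_j, where C = L L^T is the Cholesky factorisation of the OI covariance matrix and z_j is a vector of independent N(0,1) draws. LAPACK's spotrf is assumed available; the Box-Muller sampler and all names are illustrative.

   ! Sketch: sample initial members from an OI covariance matrix C via
   ! its Cholesky factor.  Uses LAPACK spotrf; names are assumptions.
   subroutine sample_initial_ensemble(n, nrens, C, xmean, X)
      implicit none
      integer, intent(in)    :: n, nrens
      real,    intent(inout) :: C(n,n)      ! covariance; overwritten by factor L
      real,    intent(in)    :: xmean(n)
      real,    intent(out)   :: X(n,nrens)
      real    :: z(n), u1, u2
      integer :: i, j, k, info
      real, parameter :: pi = 3.1415927

      call spotrf('L', n, C, n, info)       ! C := L (lower triangle)
      if (info /= 0) stop 'covariance not positive definite'

      do j = 1, nrens
         do i = 1, n                        ! Box-Muller N(0,1) draws
            call random_number(u1); call random_number(u2)
            z(i) = sqrt(-2.0*log(max(u1,1.0e-10)))*cos(2.0*pi*u2)
         end do
         do i = 1, n                        ! x_j = xmean + L z_j
            X(i,j) = xmean(i)
            do k = 1, i
               X(i,j) = X(i,j) + C(i,k)*z(k)
            end do
         end do
      end do
   end subroutine sample_initial_ensemble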

Statistician's remarks

Gaussianity

In the EnKF the assimilation is performed by linear estimation based on an implicit Gaussian assumption. This means that all measurements and state variables should be satisfactorily approximated as Gaussian. This is often the case if the variability is rather small compared to the actual value of the variables. If it is not, have a look at the paper by Bertino et al. (2003) and try a Gaussian-transformed model.
To use a log-transformation, for example (for positive variables with exceptionally large values), set A = log(A) and d = log(d) at the beginning, specify measurement errors as a ratio of their actual value, think hard about the observation matrix [can I say log(d) = H log(A)?], and use the same routines as provided here. Back-transform the ensemble at the end, as in the sketch below.
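
A minimal sketch of this wrapper follows. The call to the package's analysis routine is left as a comment because its exact interface is not reproduced here; the routine name and arguments are assumptions.

   ! Sketch: Gaussian anamorphosis with a log transform around the
   ! standard linear analysis.  All variables must be strictly positive;
   ! the analysis call is an assumption and is left commented out.
   subroutine log_transformed_analysis(A, ndim, nrens, d, nobs)
      implicit none
      integer, intent(in)    :: ndim, nrens, nobs
      real,    intent(inout) :: A(ndim,nrens)   ! positive state variables
      real,    intent(inout) :: d(nobs)         ! positive measurements

      A = log(A)                  ! transform state and measurements
      d = log(d)
      ! Measurement error std.dev. should now be a relative error, and
      ! the measurement operator must be (approximately) linear in
      ! log space.
      ! call analysis(A, d, ...)  ! the usual linear analysis (interface assumed)
      A = exp(A)                  ! back-transform the analysed ensemble
   end subroutine log_transformed_analysis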

Avoid the ugly little ducks

The enemy of the EnKF is the outlying member, because it has a strong impact on the empirical (ensemble-based) covariances. Imagine an ensemble in which one member (or a bunch of members) is very distant from the others: it will draw all the correlations to itself, as if all the other members were only one. So if you do not want 100 members to behave like 2, make sure the ensemble is a homogeneous population; avoid unusual members, small minorities, or partitions within the ensemble. (Laurent is talking statistics here; humanly speaking he appreciates the company of unusual people and minorities.)

About the number of ensemble members

How few ensemble members should I use?

Well ... don't believe too much in miracles or in "saturation effects". If your problem is really interesting, then 10 members won't be smart enough to tell you all about the trickiest multivariate covariances. We advise 100 members to get sound, reproducible results; 100 is the ensemble size that was used during the DIADEM and TOPAZ real-time experiments. See the paper by Natvik (2002) for how well 20, 40, 60, 80 and 100 members behave with a coupled physical-ecological model of the North Atlantic. If you can afford more than 100, then go for more! But remember that 1000 members will not help you if the model has serious biases: then all 1000 members will miss the solution and you will feel like you are beating a dead horse.
