Implementation Guide for the Generic EnKF Package.
Written by Olwijn Leeuwenburgh (KNMI) and Laurent Bertino (NERSC)
during their collaboration within the ENACT project
General recommendations
We advise using the EnKF to perform the analysis step only. A separate
program is needed to integrate the model ensemble forward (one can write
a script or a Fortran control program that alternately performs a model
integration and an analysis and takes care of the time parameter).
The EnKF will need as input files:
- The forecast ensemble file produced by the model.
  This file can be written as a binary file with every member on a
  separate record, preceded by a character string that identifies
  the model and its version. TIP: if the propagation of one member fails
  for some reason, the record header string will not be recognized
  and you can avoid processing the file.
- The observations file. We advise storing, together with each data value,
  all the information that relates the observation to the model state
  (see the sketch after this list):
  - An identifier ('SST', 'SLA', 'TEM', 'TB' ...)
  - The measurement error variance.
  - The coordinates x,y,z of the observation.
  - The nearest model grid cell (i,j).
  - Some coefficients for interpolating the model values to the
    observation location (the 4 bilinear interpolation coefficients,
    for example).
  - If the observations are given on larger averaging cells (or "support")
    than the model mesh, the number of model cells covered by each
    observation should also be given. This seems unwieldy, but consider
    for example Reynolds SST data or El Nino indices.
- The topography/bathymetry, so that computations are restricted to wet
  points only.
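As an illustration, here is a minimal sketch of such an observation record
as a Fortran derived type (the type and field names are placeholders, not
necessarily those used in the package):

   module mod_obs_sketch
      implicit none
      type observation
         character(len=3) :: id          ! identifier: 'SST', 'SLA', 'TEM', ...
         real             :: d           ! the data value itself
         real             :: var         ! measurement error variance
         real             :: x, y, z     ! coordinates of the observation
         integer          :: ipiv, jpiv  ! nearest model grid cell (i,j)
         real             :: a(4)        ! bilinear interpolation coefficients
         integer          :: ns          ! support: number of model cells covered
      end type observation
   end module mod_obs_sketch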
The EnKF will produce as output the analysed ensemble file ensembleA.uf,
containing all restart members for the next model propagation.
As with the forecast file, the members should preferably be written to a
single binary file on separate records.
The makefile is written for the IBM and some compiler options may need to
be changed. We advise, however, sticking to the file-naming and other
conventions followed. mksource.sh is used to create a list of source
files; mkdepend.pl then diagnoses the dependencies between the routines.
Use 'make' to compile and create the EnKF executable.
Use of the generic routines
The following points should be considered when adapting the
code to make it suitable for one's own model:
- 1) Specify the model dimensions nx, ny and nz in mod_dimensions.F90.
- 2) Identify the prognostic variables of the model. If
     necessary, modify the type declarations, interfaces,
     subroutines and functions in mod_states.F90 (mod_states2D.F90).
     A sketch of both modules follows this list.
- 3) m_pseudo2D.F90 contains calls to FFT routines for CRAY, IBM,
     DEC and SGI machines. An experimental version of this routine
     using a Numerical Recipes routine is also available.
- 4) Reset the random seed to a different integer each time you run
     the EnKF executable; this avoids trouble due to "extremely
     time-correlated measurement errors".
     m_set_random_seed2.F90 is given in NERSC_implementation/Random/.
- 5) An identifier (cident) is used to check for the proper model
     ensemble files and whether the files contain single or double
     precision data. This identifier may need to be changed (e.g. in
     m_read_ensemble.F90).
- 6) The expected forecast ensemble file is 'ensembleF.uf'.
- 7) The restart file written by the model should be in the format
expected by the EnKF.
- 8) analysis.F90 uses double or single precision BLAS routines dgemm
or sgemm.
- 9) analysis.F90 is deliberately not a module m_analysis.F90. This keeps
     the compiler from complaining about the implicit conversion from the
     type(states) array A(:) to the real-valued array A(:,:), which is
     harmless as long as the state vector contains only floating-point
     values (no integers, no character values ...).
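To make items 1) and 2) concrete, here is a hedged sketch of what the two
modules might look like for a simple three-variable model (grid sizes and
variable names are placeholders):

   module mod_dimensions
      implicit none
      integer, parameter :: nx = 100, ny = 80, nz = 20   ! your model grid
   end module mod_dimensions

   module mod_states
      use mod_dimensions
      implicit none
      type states
         real :: u(nx,ny,nz)   ! one component per prognostic variable
         real :: v(nx,ny,nz)
         real :: t(nx,ny,nz)
      end type states
      ! mod_states.F90 also declares interfaces (overloaded +, -, *, ...)
      ! so that whole ensemble members can be manipulated like arrays.
   end module mod_states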
Application-oriented routines
Only for those who want to do exactly the same as we do!
These routines can be found under Code/ and NERSC_implementation/.
People who start data assimilation from scratch can easily adapt them
to their own needs.
- 1) The routine m_consistency_A.F90 (in HYCOM_IO/) checks the model
     state ensemble for extreme or invalid values. Appropriate min/max
     values should be set there.
- 2) Both gridded fields and profile data can be assimilated. Some
     types of data are already accounted for (see e.g. the list in
     m_Generate_element.F90 in Obs_operator/).
- 3) m_modstate_point.F90 (in Obs_operator/) identifies the model state
     variables corresponding to the different types of data and may thus
     be model dependent (it is the core of the H matrix; a sketch of the
     interpolation follows this list).
- 4) m_obs_pert.F90 (in Random/) determines the grid size on which
     random 2D fields are generated; select the range of the 2D
     perturbations there.
- 5) mkensemble.F90 generates an initial ensemble for HYCOM coupled to
an ice model.
- 6) Both local and global analyses use almost the same routines; in our
     implementation they are selected at run time when reading the file
     assimilation.in.
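As an illustration of point 3), here is a hedged sketch of the bilinear
interpolation at the heart of the observation operator H (the corner
ordering is a convention, and boundary handling is omitted):

   ! model counterpart of one observation by bilinear interpolation
   real function interp_obs(field, nx, ny, i, j, a)
      implicit none
      integer, intent(in) :: nx, ny       ! grid dimensions
      integer, intent(in) :: i, j         ! nearest model grid cell
      real,    intent(in) :: field(nx,ny) ! one model 2D field (e.g. SST)
      real,    intent(in) :: a(4)         ! precomputed bilinear weights
      interp_obs = a(1)*field(i,j)     + a(2)*field(i+1,j) &
                 + a(3)*field(i+1,j+1) + a(4)*field(i,j+1)
   end function interp_obs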
Setup of an EnKF assimilation experiment
The model
Programming issue
The model should be able to restart several times from the same time step;
usually it is just a matter of reading the forcing fields at the right
place. Each member can then run with its own random perturbations,
independently of the other members. They only meet at analysis time.
Remember to reset the random numbers to a different seed every time
you restart the executable (see the sketch below).
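A standard-Fortran way of doing this (a generic analogue of the package's
m_set_random_seed2.F90, not the routine itself) is to derive the seed from
the system clock:

   program seed_sketch
      implicit none
      integer :: n, clock, i
      integer, allocatable :: seed(:)
      call random_seed(size=n)
      allocate(seed(n))
      call system_clock(count=clock)             ! differs at every run
      seed = clock + 37 * (/ (i, i = 0, n-1) /)
      call random_seed(put=seed)
   end program seed_sketch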
Choosing the model error
That is where the science really starts. Feel free to randomize wherever
you think the model has its main error. The ensemble variance, or spread,
will indicate whether you have touched a sensitive parameter or not. The
assimilation results (in particular the reduction in ensemble spread for
each variable) will then show you where the model builds the multivariate
links between model variables and measurements.
For example, in the TOPAZ experiment the ECMWF atmospheric forcing fields
are perturbed by adding 2D random fields (m_pseudo2D.F90 was used for this
purpose) and each ensemble member runs in HYCOM with its own Richardson
mixing parameter (constant in time).
The model error can be made time-correlated; in that case, remember to
store its value and read it again at restart time (see the sketch below).
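A minimal sketch of such a time-correlated (red-noise) model error, here as
a scalar AR(1) process; alpha and the way the perturbation enters the
forcing are up to you:

   program ar1_sketch
      implicit none
      real, parameter :: alpha = 0.95      ! correlation between time steps
      real, parameter :: pi = 3.1415927
      integer :: k
      real :: q, w, u1, u2
      q = 0.0
      do k = 1, 100
         call random_number(u1); call random_number(u2)
         w = sqrt(-2.0*log(max(u1,1.0e-12)))*cos(2.0*pi*u2) ! N(0,1) draw
         q = alpha*q + sqrt(1.0 - alpha**2)*w  ! variance stays constant
         ! ... add q (or a 2D field built the same way) to the forcing ...
      end do
      ! on restart: write q to file and read it back to keep the correlation
   end program ar1_sketch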
The measurements
They do not need to be part of the state vector. If the measurements relate
only to diagnostic variables (like sea surface height in HYCOM), then you
can store these diagnostic variables for every member in a separate
auxiliary file (named model_ssh.uf, for example) and read it before the
analysis to compute the matrix S=HA' and the innovations D=D-S.
Measurement errors can be correlated in space if the measurements have been
interpolated/regridded beforehand. Consider using m_pseudo2D.F90 to simulate
the random measurement error fields if this is the case.
(A Cholesky decomposition of the covariance matrix can also be used if there
are fewer than about 1000 irregularly spaced measurements; see the sketch
below.)
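A hedged sketch of the Cholesky approach (link with LAPACK/BLAS; the
exponential correlation model is only a stand-in for your own):

   program chol_noise_sketch
      implicit none
      integer, parameter :: n = 100             ! number of measurements
      real(8), parameter :: pi = 3.141592653589793d0
      real(8) :: R(n,n), z(n), u1(n), u2(n)
      integer :: info, i, j
      do j = 1, n                               ! build the error covariance
         do i = 1, n
            R(i,j) = exp(-abs(i-j)/10.0d0)      ! stand-in correlation model
         end do
      end do
      call dpotrf('L', n, R, n, info)           ! R now holds its Cholesky factor L
      if (info /= 0) stop 'covariance not positive definite'
      call random_number(u1); call random_number(u2)
      u1 = max(u1, 1.0d-12)
      z = sqrt(-2.0d0*log(u1))*cos(2.0d0*pi*u2) ! standard normal draws
      call dtrmv('L', 'N', 'N', n, R, n, z, 1)  ! z := L z, correlated errors
   end program chol_noise_sketch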
The initial error
It is difficult to tell what the initial state error is at a given time!
People who have had success with OI may use their OI covariance matrix to
draw random simulations of the initial members.
In HYCOM, a convenient (and efficient) way of simulating the initial error
is to perturb the depths of the interfaces between isopycnic layers
(see the program NERSC_implementation/HYCOM_IO/mkensemble.F90).
We then integrate the ensemble one or two weeks forward with perturbations
in the forcing fields to let it build the adequate multivariate covariances.
Another strategy is to record a climatological ensemble during a model
spinup, storing the restart files during the adequate season so that you
can restart from each of them. One should then make sure to avoid having a
bunch of members VERY different from all the others, which leads us to the
next remark ...
Statistician's remarks
Gaussianity
In the EnKF the assimilation is performed by linear estimation, based on
an implicit Gaussian assumption.
This means that all measurements and state variables should be
satisfactorily approximated as Gaussian. This is often the case when the
variability is small compared to the actual value of the variables. If it
is not, have a look at the paper by Bertino et al. (2003) and try a
Gaussian-transformed model.
To use a log-transformation, for example (for positive variables with
exceptionally large values):
set A = log(A) and d = log(d) at the beginning,
specify measurement errors as a ratio of the actual values,
think hard about the observation matrix [can I say log(d) = H log(A)?],
and use the same routines as provided here.
Back-transform the ensemble at the end (a sketch follows).
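A hedged sketch of that workflow (the analysis call itself is elided and
the array shapes are illustrative):

   program log_transform_sketch
      implicit none
      integer, parameter :: ndim = 1000, nrens = 100, nrobs = 50
      real :: A(ndim,nrens)       ! forecast ensemble, strictly positive
      real :: d(nrobs)            ! measurements, strictly positive
      A = 1.0; d = 1.0            ! stand-ins for the real input
      A = log(A)                  ! forward transform of the ensemble
      d = log(d)                  ! idem; error variances become relative
      ! call analysis(A, d, ...)  ! the usual linear analysis, in
                                  ! transformed space
      A = exp(A)                  ! back-transform the analysed ensemble
   end program log_transform_sketch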
Avoid the ugly little ducks
The enemy of the EnKF is the outlying member, because it has a strong
impact on the empirical (ensemble-based) covariances.
Imagine an ensemble in which one member (or a bunch of members) is very
distant from the others: it will draw all the correlations to itself, as
if all the other members were only one. So if you do not want 100 members
to behave like 2, make sure the ensemble is a homogeneous population;
avoid unusual members, small minorities or partitions within the ensemble.
(Laurent is talking statistics here; humanly speaking, he appreciates the
company of unusual people and minorities.)
About the number of ensemble members
How few ensemble members should I use?
Well ... Don't believe too much in miracles or in "saturation effects".
If your problem is really interesting, then 10 members won't be smart
enough to tell you all about the trickiest multivariate covariances.
We advise 100 members to get sound, reproducible results; 100 is the
ensemble size that was used during the DIADEM and TOPAZ real-time
experiments.
See the paper by Natvik (2002) for how well 20, 40, 60, 80 and 100 members
behave with a coupled physical-ecological model of the North Atlantic.
If you can afford more than 100, then go for more! But remember that 1000
members will not help you if the model has serious biases: all 1000
members will then miss the solution and you will feel like beating a dead
horse.