\subsection{FLASH}
The FLASH code \cite{FLASH,Dubey2009} has been under development for
nearly two decades at the Flash Center at the University of Chicago.
The code was originally developed to simulate thermonuclear runaways
in astrophysics, such as novae and supernovae. It was created as an
amalgamation of three legacy codes: Prometheus for shock
hydrodynamics, PARAMESH for adaptive mesh refinement (AMR), and a
locally developed equation-of-state and nuclear-burn code. It has
steadily evolved into well-architected, extensible software with a
user base in over half a dozen scientific communities. FLASH has been
applied to a variety of problems including supernovae, X-ray bursts,
galaxy clusters, stellar structure, fluid instabilities, turbulence,
design and analysis of laser experiments, and nuclear reactor rods.
It supports an Eulerian mesh combined with a Lagrangian framework to
cover a large class of applications. Physics capabilities include
compressible hydrodynamics and magnetohydrodynamics solvers, nuclear
burning, various forms of equations of state, radiation, laser drive,
fluid-structure interactions, and many more.
\subsubsection{Code Design}
\label{sec:FLASHdesign}
From the outset, FLASH was required to be composable because the
simulations of interest needed capabilities in different permutations
and combinations. For example, most simulations needed compressible
hydrodynamics, but with different equations of state. Some needed to
include self-gravity while others did not. An obvious solution was to
use an object-oriented programming model with common APIs and
specializations to account for the different models. However, the
physics capabilities were mostly legacy F77 implementations, and
rewriting the code in an object-oriented language was not an option.
A compromise was found by exploiting the Unix directory structure for
inheritance, where the top-level directory of a unit defined its API
and the subdirectories implemented it. Meta-information about the
role of a particular directory level in the object-oriented framework
was encoded in a very limited domain-specific language (the
configuration DSL). The meta-information also included state and
runtime variable requirements, dependencies on other code units, etc.
A {\em setup tool} parsed this information to configure a consistent
{\em application}, and interpreted the configuration DSL to implement
inheritance using the directory structure. For more details about
FLASH's object-oriented framework see \cite{Dubey2009}.
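As an illustration, a unit directory carries its meta-information in
a {\tt Config} file. The following sketch is indicative of the
configuration DSL rather than an exact reproduction of its syntax;
the unit and variable names are hypothetical:
\begin{verbatim}
# Config file for a hypothetical physics unit
REQUIRES Driver              # this unit depends on the Driver unit
REQUIRES physics/Eos         # ... and on an equation of state
VARIABLE dens                # state variables the unit needs
VARIABLE pres
PARAMETER cfl REAL 0.8       # runtime parameter with a default value
\end{verbatim}
The setup tool walks the directory path selected for an application,
collects these declarations, resolves the dependencies, and generates
a consistent build.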
FLASH is designed with separation of concerns as an objective, which
is achieved by separating the infrastructural components from the
physics. The abstraction that permits this approach is well known in
scientific codes: the physical domain is decomposed into rectangular
blocks surrounded by halo cells copied over from the neighboring
blocks, so that to a physics operator a single block is
indistinguishable from the whole domain. Another necessary aspect of
the abstraction is that none of the physics modules own the state
variables; they are owned by the infrastructure that decomposes the
domain into blocks. A further separation of concerns takes place
within the infrastructure units: parallelism is isolated from the
bulk of the code. Parallel operations such as ghost-cell fill,
refluxing, or regridding have minimal interleaving with the state
updates applied by the physics operators. To distance the solvers
from the parallel constructs, the required parallel operations are
exposed through an API whose functions are implemented in a subunit.
The implementation of numerical algorithms for physics operators is
sequential, interspersed with calls to the parallel API as needed.
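The resulting structure can be sketched schematically. The following
is a minimal, self-contained Python illustration (FLASH itself is
F90; the names {\tt Grid}, {\tt fill\_guard\_cells}, and
{\tt advance} are hypothetical, not FLASH's actual interfaces):
\begin{verbatim}
import numpy as np

class Grid:
    """Infrastructure: owns the state, decomposed into blocks."""
    def __init__(self, nblocks, ncells, nguard=1):
        self.ng = nguard
        self.blocks = [np.zeros(ncells + 2 * nguard)
                       for _ in range(nblocks)]

    def fill_guard_cells(self):
        # Stand-in for the parallel halo exchange behind the API:
        # copy edge values between neighboring blocks (serially here).
        ng = self.ng
        for left, right in zip(self.blocks[:-1], self.blocks[1:]):
            right[:ng] = left[-2 * ng:-ng]
            left[-ng:] = right[ng:2 * ng]

def advance(grid, dt):
    """Physics operator: sequential, with parallelism behind the API."""
    grid.fill_guard_cells()        # parallel operation via the API
    for u in grid.blocks:          # each block looks like a domain
        # Purely sequential stencil update of the interior cells;
        # guard cells supply neighbor data at block boundaries.
        u[1:-1] += dt * (u[2:] - 2.0 * u[1:-1] + u[:-2])
\end{verbatim}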
Minimization of data movement is achieved by letting the state be
completely owned by the infrastructure modules. The dominant
infrastructure module is the {\em Eulerian} mesh, owned and managed
by the {\em Grid} unit. The physics modules query the {\em Grid} unit
for the bounds and extent of the block they are operating on, and get
a pointer to the physical data. This arrangement works in most cases,
but gets tricky where the data access pattern does not conform to the
underlying mesh. An example is any physics dealing with Lagrangian
entities (LEs). They need a different data structure, and their data
movement is dissimilar from that of the mesh. Additionally, the LEs
interact with the mesh, so maintaining physical proximity to the
corresponding mesh cells is important in their distribution. This is
an example of unavoidable lateral interaction between modules. In
order to advance, LEs need to get field quantities from the mesh and
then determine their new locations internally. They may need to apply
near- and far-field forces, pass some information along to the mesh,
or be redistributed after advancing in time. FLASH solves this
conundrum by keeping the LE data structure extremely simple and
passing arguments by reference in the APIs. The LEs are attached to
the block in the mesh that contains the overlapping cell; an LE
leaves its block when its location no longer overlaps with the block.
Migration to a new block is an operation independent of everything
else that happens to the LEs. In FLASH parlance this is the
Lagrangian framework (see \cite{Dubey2012} for more details). The
combination of {\em Eulerian} and {\em Lagrangian} frameworks that
interoperate well with one another has succeeded in largely meeting
the performance-critical data management needs of the code.
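A minimal sketch of the Lagrangian-framework idea, again in schematic
Python with hypothetical names rather than FLASH's actual interfaces:
\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class Block:
    lo: float              # lower spatial bound of the block (1-d)
    hi: float              # upper spatial bound of the block
    les: list = field(default_factory=list)   # attached LEs

    def contains(self, x):
        return self.lo <= x < self.hi

def advance_and_migrate(blocks, dt):
    # Advance: each LE gets a field quantity (here a stand-in
    # velocity that would be interpolated from its block's mesh
    # data) and computes its new position.
    for b in blocks:
        for le in b.les:
            le["x"] += le["v"] * dt
    # Migrate: an LE leaves its block when its position no longer
    # overlaps it; migration is independent of everything else.
    for b in blocks:
        moved = [le for le in b.les if not b.contains(le["x"])]
        b.les = [le for le in b.les if b.contains(le["x"])]
        for le in moved:
            for dest in blocks:
                if dest.contains(le["x"]):
                    dest.les.append(le)
                    break
\end{verbatim}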
\subsubsection{Verification and Validation}
\label{sec:FLASHvandv}
FLASH instituted a rigorous verification program early in its
lifecycle. The earliest versions of FLASH were subjected to
a battery of standard hydrodynamics verification tests
\cite{Fryxell2000}. These verification tests were then used to set up
an automated regression test suite run on a few local
workstations. Since then the test suite has evolved into a varied
collection of tests that aims to provide comprehensive coverage for
verifying correct operation of the code
\cite{Dubey2013,Calder2005}. Because FLASH is in a constant state of
production and development, verification of its correctness on a
regular basis is a critical requirement. The testing is complicated
both by the variety of environments in which FLASH is run, and the
sheer number of ways in which it can be configured.

Testing is an area where standard practices do not adequately meet
the needs of the code. Many multi-physics codes have legacy
components that are written in early versions of Fortran. Contrary to
popular belief, a great deal of new development continues in Fortran
because it is still the best HPC language in which to express
mathematical algorithms. All of the solver code in FLASH is written
in F90, so the popular unit-test harnesses are not available for use.
Small-scale unit tests can only be devised for infrastructural code,
because all the physics has to interact with the mesh. Also, because
regular testing became a part of the FLASH development process long
before software engineering practices were formally incorporated,
FLASH's designation of tests only loosely follows the standard
definitions. A unit test in FLASH can rely on other parts of the
code, as long as the feature being tested is isolated. For example,
testing for correct filling of halo cells uses a lot of AMR code that
has little to do with the halo filling, but it is termed a unit test
in FLASH parlance because it exclusively tests a single limited
functionality; we refer to such tests as {\em FLASH-unit-tests} from
here on. The dominant form of regular testing is integration testing,
where more than one code capability is combined to configure an
executable. The results of the run are compared against pre-approved
results to verify that changes are within a specified acceptable
range. Because of the large space of possible valid and useful
combinations, selection of tests is challenging. FLASH's methodology
for test design and selection is described below; for more details
see \cite{Dubey2015}.

FLASH's testsuite consists of about a dozen {\em FLASH-unit-tests} and
$85$ multi-purpose composite tests. A composite test runs in two
stages, and has a comparison {\em benchmark} for each stage. A {\em
benchmark} in the FLASH testsuite is a full-state checkpoint.
Composite test benchmarks are checkpoints at two distinct timesteps
$M$ and $N$, where $M < N$. Both are analyzed and approved as
producing expected results by a domain expert. If there is a
legitimate cause for a change in the results (i.e., an improved
method), the benchmarks have to be re-approved. The first stage of a
composite test verifies that no errors have been introduced into the
covered code units and their interactions by comparing against the
benchmark at time $M$. The second stage restarts the execution from
the checkpoint at $M$ and runs up to time $N$. This stage of the test
verifies that the covered code can correctly restart from a
checkpoint without any loss of state information. The FLASH testsuite
configures and builds each test every time it is run; build and
configuration tests are therefore built in. Progress of a test is
reported at every stage, the final stage being the outcome of the
comparison. Because FLASH has many available application
configurations whose union provides very good code coverage, the task
of building a test suite is simplified to selection among existing
applications. In many applications the real challenge is picking
parameters that exercise the targeted features without running for
too long. We use a matrix to ensure maximum coverage, where
infrastructure features are placed along the rows and physics modules
along the columns. For each selected test all the covered features
are marked off in the matrix; marks made by the same test in two or
more places in the same row or column represent interoperability
among the corresponding entities. The following order is used for
filling the matrix (a sketch of the testing logic follows the list):
\begin{itemize}
\item unit tests
\item setups used in science production runs
\item setups known to be sensitive to perturbations
\item simplest and fastest setups that fill the remaining gaps.
\end{itemize}
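To make the preceding description concrete, the following minimal
Python sketch shows the two-stage composite-test logic and the
marking of the coverage matrix. The names and the comparison scheme
are hypothetical; the real testsuite machinery differs:
\begin{verbatim}
def compare(state, benchmark, tol):
    # Stand-in for checkpoint comparison within an accepted range.
    return all(abs(state[k] - benchmark[k]) <= tol for k in benchmark)

def run_composite_test(simulate, restart, bench_m, bench_n, m, n,
                       tol=1e-12):
    state_m = simulate(steps=m)              # stage 1: run to step M
    if not compare(state_m, bench_m, tol):
        return "stage 1 failed"
    state_n = restart(state_m, steps=n - m)  # stage 2: restart M -> N
    if not compare(state_n, bench_n, tol):
        return "stage 2 failed (state lost on restart?)"
    return "passed"

def mark_coverage(matrix, test, infra_features, physics_modules):
    # Mark every (infrastructure, physics) pair a test exercises;
    # shared marks in a row or column indicate interoperability.
    for row in infra_features:
        for col in physics_modules:
            matrix.setdefault((row, col), set()).add(test)
\end{verbatim}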
FLASH's testing can be broadly classified into three categories:
daily testing, to verify ongoing correctness of the code; more
targeted testing related to science production runs; and porting to
and testing on new platforms. Daily testing is performed on multiple
combinations of platforms and software stacks, and uses the
methodology described above. In preparing for a production schedule,
testing is a combination of scaling tests, cost-estimation tests, and
searches for potential trouble spots. Scientists and developers work
closely to devise meaningful weak-scaling tests (which can be
difficult because of non-linearity and adaptive mesh refinement), and
tests that can exercise the vulnerable code sections without
overwhelming the test suite resources. Smaller-scale sample
production runs are also performed on the target platform to help
make informed estimates of the CPU hours and disk space needed to
complete the simulation. For more details on simulation planning see
\cite{Dubey2013}. For porting the code to a new platform, a
successful production run from the past is used as a benchmark for
exercising the code, along with a subset of the standard test suite.

FLASH has had some opportunities for validation against
experiments. For example, FLASH could model a variety of laboratory
experiments involving fluid instabilities
\cite{Dimonte2004,Kane2001}. These efforts allowed researchers to
probe the validity of models and code modules, and also served to
bolster the experimental efforts by creating realistic simulation
capabilities for use in experimental design. The newer high-energy
density physics (HEDP) initiative involving FLASH is directed at
simulation-based validation and design of experiments at the major
laser facilities in the US and Europe. Other forms of validation have
been convergence tests for the flame model that is used for supernova
simulations, and validation of various numerical algorithms against
analytical solutions of known problems. For example, the Sedov
problem \cite{sedov}, which is seeded by a pressure spike at the
center that sends a spherical shock wave out into the domain, has a
known analytical solution and is used to validate the hydrodynamics
in the code. There are several other similar examples where a simple
problem can help validate a code capability through a known
analytical solution.
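For reference, in standard notation the self-similar Sedov solution
gives the radius of the spherical shock front at time $t$ as
\[
  R_s(t) = \xi_0 \left(\frac{E\,t^2}{\rho_0}\right)^{1/5},
\]
where $E$ is the energy of the initial pressure spike, $\rho_0$ is
the ambient density, and $\xi_0$ is a dimensionless constant that
depends on the ratio of specific heats; comparing the simulated shock
position against this curve is the essence of the test.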
\subsubsection{Software Process}
\label{sec:FLASHSoftwareProcess}
The software process of FLASH has evolved organically with the growth
of the code. For instance, the first version had no clear design
document, the second version had loosely implied design guidance,
whereas the third version documented the whole design process. The
third version also published the developer's guide, which is a
straight adaptation of the design document. Because of multiple
developers with different production targets, a version-control
repository was introduced early in the code's life cycle. The
repository has been SVN since 2003, though its branching system has
been used in some very unorthodox ways to meet the peculiar needs of
the Flash Center. Unlike most software projects, where branches are
kept for somewhat isolated development purposes, FLASH also uses
branches to manage multiple ongoing projects. This particular need
arose when there were four different streams of physics capabilities
being added to the code. All projects needed some code from the
trunk, but the code being added was mostly exclusive to the
individual project. It was important that the branches stay more or
less in sync with the trunk and that the new code be tested
regularly. This was accomplished by turning the trunk into
essentially a merge area, with a schedule of merges from individual
branches, and an intermediate branch for forward merges. The path
was: tagged trunk $\rightarrow$ forward branch $\rightarrow$ project
branches $\rightarrow$ merge into trunk $\rightarrow$ tag trunk when
stabilized. Note that the forward branch was never allowed a backward
merge, to avoid one project inadvertently breaking the code of
another. For the same reason the project branches never did a forward
merge directly from the trunk.

One of the biggest challenges in managing a code like FLASH occurs
during major version changes, when the infrastructure of the code
undergoes deep changes. FLASH has undergone two such changes. The
first transition took the approach of keeping the development branch
synchronized with the main branch at all times, and an effort was
made to keep the new version backward compatible with the old
version. During and after the process the team realized many
shortcomings of this approach. One was that the code needed deeper
structural changes than were possible under this approach. Also, the
attempt to keep the development and production versions in sync
placed undue demands on the developers of the new version, leading to
inefficient use of their time. The adoption of the new version was
delayed because keeping up with the ongoing modifications to the
older version (needed by the scientists to do their work) turned the
completion of the transition into a moving target.

Because of the lessons learned, the second transition took a
completely different approach and was much more successful. The
infrastructural backbone/framework for the new version was built in
isolation from the old version, in a new repository. The framework
design leveraged the knowledge the developers had gained about the
idiosyncrasies of the solvers in earlier versions, and focused on the
needs of the future version. There was no attempt at backward
compatibility with the framework of the previous version. Once the
framework was thoroughly tested, the physics modules were
transitioned. Here the emphasis was on transitioning all the
capabilities needed for one project at the same time, starting with
the most stable modules. Once a module was moved to the new version
it was effectively frozen in the old version (the reason for
selecting the most stable and mature code sections first). Any
modification after that point had to be made simultaneously in the
new version as well. Though this sounds like a lot of duplicated
effort, in reality such instances were rare. This version transition
was adopted by the scientists very quickly.

FLASH's documentation takes a comprehensive approach, with a
user's guide, a developer's guide, robodoc API documentation, inline
documentation, and online resources. Each type of documentation
serves a different purpose and is indispensable to the developers and
users of the code. There are scripts in place that look for
violations of coding standards and documentation requirements. The
user's guide documents the mathematical formulation and algorithms
used, along with instructions on using the various code components.
It also includes examples of relevant applications explaining the use
of each code module. The developer's guide specifies the design
principles and coding standards, with an extensive example of the
module architecture. Each function in the API is required to have a
robodoc header explaining its inputs and outputs, its purpose, and
any special features. Except for third-party software, every
non-trivial function in the code is required to have sufficient
inline documentation that a non-expert can understand and maintain
it.

FLASH effectively has two release versions: internal, which
is close to the agile model, and general, which happens no more than
twice a year. An internal release amounts to tagging a stable version
in the repository for the internal users of the code; this signals to
the users that a forward merge into their production branch is safe.
General releases have a more rigorous process, which makes them more
expensive and therefore infrequent. For a general release the code
undergoes pruning, checking for compliance with coding and
documentation standards, and more stringent than usual testing. The
dual model ensures that the quality of code and documentation is
maintained without unduly straining the team's resources, while
near-continuous code improvement is still possible for ongoing
projects.
\subsubsection{Policies}
\label{sec:FLASHPolicies}
In any project, policies regarding attributions, contributions, and
licensing matter. In scientific domains, intellectual property rights
and interdisciplinary interactions are additional, equally important
policy areas. Some of these policy requirements are a direct
consequence of the strong gatekeeping regimes that the majority of
publicly distributed scientific software follows. Many arguments are
advanced for the dominance of this model in the domain; the most
compelling one relates to maintaining the quality of the software.
Recall that the developers in this domain are typically not trained
in software engineering, and software quality control varies greatly
between individuals and/or groups of developers. Because of the
tight, and sometimes lateral, coupling between the functionalities of
code modules, a lower-quality component introduced into the code base
can have a disproportionate impact on the overall reliability of the
output produced by the code. Strong gatekeeping is therefore
desirable, and that implies having policies in place for accepting
contributions. FLASH again differentiates between internal and
external contributors in this regard. The internal contributors are
required to meet quality requirements such as coding standards,
documentation, and code verification in all of their development.
Internal audit processes minimize the possibility of poorly written
and tested code getting into a release; these audits also include a
periodic pruning to ensure that bad or redundant code gets
eliminated.

The external contributors are required to work with a member of the
internal team to include their code in the released version. The
minimum required from them is: (1) code that meets the coding
standards and has been used, or will be used, for results reported in
a peer-reviewed publication; (2) at least one test that can be
included in the test suite for nightly testing; (3) documentation for
the user's guide, robodoc documentation for any API functions, and
inline documentation explaining the flow of control; and finally (4)
a commitment to answer questions on the users' mailing list. The
contributors can negotiate the terms of release; a code section can
be excluded from the release for a mutually agreed period of time to
enable the contributor to complete their research and publish their
work before the code becomes public. This policy frees potential
contributors from the necessity of maintaining their code
independently, while still retaining control over their software
until the agreed-upon release time. As a useful side effect, their
code remains in sync with the developments in the main branch between
releases.

There is another model of external contribution to FLASH that is
without any intervention from the core gatekeeping team. In this
model anyone can stage any FLASH-compatible code on a site that they
host. The code has no endorsement from the distributing entity, the
Flash Center, which does not take any responsibility for its quality.
The Flash Center maintains a list of externally hosted `as-is' code
sites; support for these code sections is entirely the responsibility
of the hosting site.

The attribution practices in computational science and engineering are
somewhat ad hoc. For many developers, the only metric of importance
is the scientific publications that result from using the software.
When a team is dominated by such developers, proper attribution for
code development is not given enough importance or thought. Other
teams also employ computing professionals whose career growth depends
upon their software artifacts and the publications describing their
algorithms and artifacts; FLASH falls into the latter category. All
contributors' names are included in the author list of the user's
guide, and the release notes explicitly mention new external
contributions and their contributors, if any, for that release.
Internal contributors rely upon software-related publications for
their attribution. This policy usually works well; one exception has
been citations skewed in favor of early developers. Users of FLASH
cite Fryxell et al.\ \cite{Fryxell2000}, published in 2000, which
does not include any of the later code contributors in its author
list, so the later authors are deprived of legitimate citations for
their work. Many major long-running software projects have this
problem, which is peculiar to the academic world where these codes
reside and are used.
%\subsubsection{User Support}
%\label{sec:FLASHUsers}