Modeling the Propagation of a Virus in MPI
This blog posts covers the implementation of the actor pattern on a population simulation for propagating a virus across actors in MPI. Let’s think of a population of squirrels living in a fixed two-dimensional space. These squirrels, we set the number at the start of the simulation, move randomly through this grid and while some of them have been injected the virus at the simulation start, others are healthy (not holding the virus). Moreover, squirrels reproduce, can get infected, live on for a bit after infection, and then semi-randomly die.
Getting started with the code in MPI, we describe our implementation design, the necessary communication involved, elaborate on how to ensure correctness and present performance considerations.
Design
Implementing the actor pattern for the virus propagation simulation we chose C++
as implementation language to leverage its object-orientation for
different actor types and actor instantiations. Figure
2 illustrates the
OOP design with distinct classes for Squirrel
s, GridCells
, Clock
s
and Helper
s. Different squirrels and gridcells are represented by
individually instantiated Squirrel
and GridCell
objects. For the
simulation we instantiate only one object of Clock
though in general
there might be use cases when more than one Clock
instantiation could
be required. The Helper
class is presented as static as it holds
adjacent globally available enum
, wraps functions for initalising
individual actors and only counts the number of currently alive and
infected squirrels.
Initalisation
Preparing the simulation and setting up the actors, the master (MPI-rank
zero and processPool
master) initialises all actors including the
helper, the clock, the gridcells and the number of squirrels that have
been defined by user input or by the default values. Initialisation of
an actor comprises calling the process pool’s startWorker()
and
sending the required information to the newly started worker. The newly
started worker initialises an object of the respective class after
having received all information through initial communication. The
initial communication to inform the actor about the necessary
information is visualised in figure
1. Initial communication is
implemented as synchronous and blocking send and blocking receiving to
ensure that actors go to work when they are fully set up for it. As
communication is blocking, the order of sending and receiving has to be
met on both sides. All actors receive their actor type as first
message. Depending on this actor type, they receive further information
about actor ranks that they will communicate with while performing their
task (during simulation).
The squirrel receives an ordered list of all gridcell ranks (as vector that maps the gridcells’ ID to their ranks). A gridcell might be initalised on rank 12 but reflects gridcell ID 3. Gridcell IDs range from 0 to 15 for 16 gridcells. Squirrels initialised at program start are equipped with an infection status depending on the user defined parameters i.e. the number of infected squirrels. Squirrels initialised during simulation is always sent a negative infection status, thus newly born squirrels during simulation are born healthy. If the squirrel is born during simulation, the newly born squirrel receives its parent’s position. Squirrels initalised at program start, begin the simulation at position tuple (0,0).
Besides actor type and the clock’s rank, the gridcell receives its CELL ID that ranges from 0 to 15 (for 16 gridcells in the simulation). The gridcell includes its cell ID in its monthly print statement. The helper and the clock receive the same cell rank list like the squirrels. The helper additionally receives the clock’s rank and the clock receives the helper’s rank.
Communication during simulation
Actors communicate by sending and receiving messages through OpenMPI.
Actors use message tags that are centrally defined in the Helper
headerfile helper.hpp
. The actor pattern loosely handles the send and
receive of messages. It does not specifically guarantee messages to be
sent and received for service. It mainly follows a fire-and-forget
principle that we implement by using MPI_Bsend()
or buffered send.
Generally, actors act asynchronously. This may be reflected in squirrels
taking different number of stpng per month. For correctness and
validation we invoke two special cases that compromise on the actor
patterns asynchronism. With exceptions of communication for
initialisation described in section
2.1 and for
synchronous and blocking handshakes between squirrel and gridcells
communication bases on buffered sends. The way these actors communicate
is rather dynamic and unpredictable: We know squirrels interact with
gridcells but the actual squirrel actor on rank X that communicates with
the gridcell on rank Y is unpredictable in rank (or source and
destination) and in the number of interactions. Actors also randomly
appear and disappear i.e. squirrels being born and squirrels dying.
The communication between actors during the simulation and while
performing their respective task is depicted in Figure
3. Upon arrival in a gridcell, the squirrel
sends its binary health status to the gridcell and receives the
gridcell’s most current populationInflux
value and infectionLevel
in
a handshake like send-and-receive manner. Detailed elaboration on why
this handshake is necessary is presented in section
2.3. The clock sends End of Month (EoM)
messages to gridcells and the helper. Upon receiving an EoM message, the
gridcells print their populationInflux
and infectionLevel
values and
update these properties thereafter. Receiving the EoM message, the
helper prints the current number of squirrels alive and infected.
Squirrels send notificatios to the helper when they give birth and
reproduce, when they get infected and when they die. The helper
instantiates a new squirrel upon learning about an existing squirrel’s
reproduction.
In general, each actor maintains its own state. Squirrels hold
information about their current position, its infection status, the
lately visited cells’ infectionLevel
s and populationInflux
values to
reliably calculate the probabilities for giving birth to a new squirrel,
for getting infected with the Parapoxvirus and for their demise.
Logically, squirrels are independent. Each squirrel might take any
number of stpng per month. The implementation details about the
boundaries of how many stpng a squirrel may take during a month are
presented in section
2.4. Across squirrels,
individuals might take different numbers of stpng per month. A squirrel
sends its binary infection status to the corresponding gridcell as soon
as it arrives in the cell. After having received the squirrel’s message,
the gridcell instantly responds by sending its current
populationInflux
value and infectionLevel
to the newly arrived
squirrel. A squirrel making a step and arriving at the same cell as it
was at before counts as new squirrel arriving at this cell. For two
subsequent stpng into the same cell, this squirrel contributes twice to
the cell’s properties. Thus, it increases the cell’s populationInflux
value and the infectionLevel
(if it is infected) also for the second
step (if a squirrel steps and ends up in the same cell, this counts as a squirrel arriving in the cell i.e. every hop can be treated in the same way,
as a squirrel arriving in a cell).
Gridcells aggregate the individual messages by squirrels for indicating the arriving squirrel’s infection status. Thus, gridcells maintain their individual populationInflux value, their infectionLevel and update these properties after receiving a message from clock indicating the end of a month.
The clock simulates the time sphere in this setting. It iterates over the number of months that the simulation lasts and infuses time-awareness in gridcell and the helper actors. After each month the clock notifies each gridcell and the helper about the end of the month. The squirrels are not aware of time, do not know time or season and therefore do not receive any messages by the clock.
The helper accumulates and manages the number of squirrels in the environment. After each month it outputs the number of currently alive and infected squirrels.
Figure 3 presents actors as objects of the
respective classes and their interactions. For the Parapoxvirus
simulation we identify three entities that are implemented as classes in
C++: Squirrel
, GridCell
and Clock
. The implementation invokes one
Clock
object, 16 GridCell
objects representing the environments in
cells and one Squirrel
object for each squirrel in the population.
Objects interact by sending messages from a sender to a receiver. Any
object can be a sender or receiver.
When the clock reaches the end of its month-simulation loop, it sends
End of Simulation (EoS) messages to gridcells and the helper.
Receiving EoS messages, the helper reutrns from its act() method and
calls the process pool’s shutdownPool()
. While the clock and helper
already returned from their respective act() functions, the squirrels
receive the notification sent by the pool master to stop working
(testing through shouldWorkerStop()
. All actors call workerSleep()
after finishing their task (respectively receiving the EoS message). The
process pool master already shut down the pool. Therefore, all
awaiting/sleeping processes are freed and proceed to call
processPoolFinalise()
and subsequently MPI_Finalize()
. As the
squirrels receive the shutdown command later by the pool than the other
actors receive the EoS message by the clock, the squirrels keep living
and taking stpng for a short period. The handshake between gridcells and
squirrels are implemented as blocking communications, the gridcells keep
spinning in their while loop to clean up squirrel messages that could
arrive late (meaning after the simulation’s actual ending). The
gridcells keep on listening to messages for a defined time after the
simulation, potentially respond to squirrel messages and then shut down.
Correctness and validation
A squirrel stpng into a gridcell, informs the girdcell about its arrival
(with a message including its binary infection status) and then makes
the next step after having received the response message by the cell it
just stepped into (including populationInflux
value and
infectionLevel
). We implemented this blocking and synchronous
handshake pattern for communication between squirrels and gridcells in
favour of correctness over performance. The synchronous handshake
generally contradicts the actor pattern’s asynchronism but compromises
on correctness. If a squirrel does not wait for the gridcells response
including populationInflux
and infectionLevel
it will not be able to
act reliably when taking the next step arriving at another gridcell. The
squirrel needs to know whether it got infected at the last arrival at
the lately visited gridcell. Moving on without being aware of the
outcome of the latest "Did I just get infected?" test falls short of
the simulation’s overall reliability.
The impact of randomness in the decisions whether a squirrel dies, gets infected or gives birth (we use a random number generator to tell us which of these options to simulate) could be mitigated by running the simulation for a high number of repetitions and averaging the outcome. We ran the simulation for 100 times with identical parameters across runs on the Cirrus backend, averaged numbers of squirrels alive and infected and compared it to results from individual runs. We observed varying counts but also an overall trend that was similar across all runs. An even more reliable validation test could extend the number of repetitions to 1000 or even more repetitions. The question is what sample size is necessary or sufficient to represent a general probability or derive a universally valid distribution. Quantitatively reporting standard deviations could help in validating simulation output.
Squirrels start living and take stpng as soon as they are instantiated as objects. Each squirrel step impacts the squirrel’s overall proneness of reproduction, infection and death. Ensuring the squirrel starts performing its task when the clock starts to simulate months could be implemented by barriers. For our implementation we focused on an instantiation order to reflect the simulation’s structure. First we create the clock and helper objects and subsequently the gridcells followed by the squirrels. Nevertheless, in theory it is not possible to fully guarantee that these squirrels start working after all other actor types. Contrary, it might be legit to let squirrels randomly work on their task and capture time-awareness through the clock for each end of a month for any given point in the squirrel population’s phase of existence. If the squirrels’ actions were as random as suggested the population would have not shown the same trend throughout all test runs. Opposed, it is doubtable whether the population numbers would show any trend at all if squirrel action’s were totally random. Basing our assumptions on the random number generator’s seed that is different on all squirrel actors, we conclude that at least a repeated test with high numbers of runs like our initial test with 100 runs for identical parameters should show similar trend across all runs.
Performance considerations
Regarding the number of stpng that a squirrel takes per month and the impact on the population’s overall evolution, we defined a month as lasting 50ms. We tried unrestricted squirrel movement and temporal restrictions on squirrel stpng. Figure 4 shows three test runs with different configurations concerning squirrel movement frequency. The graph shows the population numbers for squirrels taking one step per 1ms, one step per 2ms and one step per 3 ms. In any of these configurations no squirrel takes more than 50 stpng per month. 50 stpng is the lower boundary for a squirrel to die after infection. Hence, we ensure to observe the evolution from state infected to dead month by month. In previous tests, squirrels took more than 50 stpng per month leading to a whole population’s demise in the first month. Preventing squirrels from getting infected and dying in the same month, we set the number of squirrel stpng per month below 50 by defining a temporal month length of 50ms and squirrels taking stpng every 2ms.
The underlying network that enables communication between processes impacts overall program performance. Running tests on the Cirrus frontend led to different output than on the backend. We implemented one actor per process that enables a high degree of parallelism but could also result in higher load balancing overhead. Squirrels are centrally invoked by the helper that manages bookkeeping to stay within a limit of 200 concurrent squirrel processes.
With increasing squirrel actors sending messages to gridcells and receiving responses from gridcells the network workload rises. The gridcells or the network could become the bottleneck. Receiving extraordinary high numbers of messages from squirrels the gridcells need more time to send responses. The handshake between gridcells and squirrels is blocking and synchronous. Long waiting phases keep the squirrels idle waiting for the response, thus potentially further decreasing the number of stpng per month and per squirrel. Currently, only one helper actor accumulates the messages by squirrels. Increasing the number of event-like messages for reproduction, infection and death, the helper could become the bottleneck. This problem could be solved by implementing a second helper actor on a separate process that shares work with the existing helper. Besides the additional overhead in sharing work, adding new helpers could reduce the proneness of becoming a bottleneck in communication and message handling. If the helper is not able to service messages, new born squirrels are either never initialised or only set up with delay.
Helpers for implementing the actor pattern
With the Helper
class we provide a reusable skeleton to structure
actor pattern related code and separate it from problem specific code
like the squirrel population implementation. For the initialisation
procedures for setting up the actors (including initial communication)
we decided to separate them from other classes. For instance, the actors
of class Squirrel
are initialised with
Helper::squirrel_actor_init()
. With camelCase we indicate objects
and their methods and with underscore_case we indicate static
functions, e.g. those wrapped in Helper
. The classes currently contain
act()
methods indicating necessary functions for the actual actor
pattern implementation. The communication part including calls to MPI
directives such as MPI_Bsend()
are closely coupled with the actual
implementation of the simulation code and therefore reside in the
problem classes. Contrary, the adjacent init
functions are not part of
the problem related code and are therefore logically separated in the
static Helper
class.
Listing debug_flags defines five flags to enable debugging information in the respective classes. Setting a specific flag to 1 will print helpful information during initialisation and simulation.
#define MAIN_DEBUG 0
#define HELPER_DEBUG 0
#define SQUIRREL_DEBUG 0
#define GRIDCELL_DEBUG 0
#define CLOCK_DEBUG 0
Maintaining uniform message tags to be used across actors and actor
types, we introduce global enum
that are packed in the helper.hpp
headerfile for convenient import (#include "../include/helper.hpp"
) in
the problem code. Listing
message_tags shows the available tag definitions of the
simulation code. For instance to tell the actor type to a newly
instantiated gridcell:
// use enum to declare global state
// MPI message tag
// to distinguish different message types
// based on tags
enum message_tag {
WHAT_AM_I, // actor type
ALL_CELLS_ID, // list of cell ranks
INFECTION_STATUS, // binary state
CLOCK_ID, // clock rank
CELL_ID, // cell rank
HELPER_ID, // helper rank
SQUIRREL_POSITION,
SQUIRREL_TO_GRIDCELL,
GRIDCELL_TO_SQUIRREL,
END_OF_MONTH,
END_OF_SIMULATION
};
// to distinguish indices of
// wrapper arrays used for sending
// popInflux and InfLvl from cell to squirrel
enum information_on_arrival {
POPULATION_INFLUX,
INFECTION_LEVEL
};
// to distinguish types of actors on a process
enum what_am_i {
HELPER,
CLOCK,
SQUIRREL,
GRIDCELL,
MASTER
};
enum squirrel_status {
ALIVE,
HEALTHY,
REPRODUCED1, // announcing new born, then directly:
REPRODUCED2, // send parent position
INFECTED,
DIED
};
enum squirrel_position {
X,
Y
};
Listing using_message_tag on line 2 shows how to access the
definition of the actor type GRIDCELL
. On line 3, the message tag
WHAT_AM_I
is used to indicate that this message’s value represents an
actor type.
Using message tags with defined enum
in helper.hpp
:
// tell actor what type it is (either clock, grid cell or squirrel)
int actor_type = what_am_i::GRIDCELL;
MPI_Ssend ( &actor_type, 1, MPI_INT, process_id, message_tag::WHAT_AM_I, comm );
Summary and what’s next
From the initial idea of modeling the propagation of a virus, we’ve decided to implement actors i.e. clock
, grid cell
and squirrel
(the original virus host) as MPI processes with squirrels continuously moving through the 2D grid environment and infecting other hosts as they share the same grid cell or location. We’ve chosen parameters to eventually infect all hosts ending the simulation when all hosts are deceased. In the section Performance Considerations
, we showed how the choice of the simulation clock parameters impact the remaining life time of squirrel actors.
Next, we could think about more abstractions between our application logic with actor classes and the enabling library, here MPI. Could we use abstractions for send and receive logic so we could easily plug-in another parallel implementation?