Mark Klaisoongnoen, PhD Candidate

Modeling the Propagation of a Virus in MPI

This blog posts covers the implementation of the actor pattern on a population simulation for propagating a virus across actors in MPI. Let’s think of a population of squirrels living in a fixed two-dimensional space. These squirrels, we set the number at the start of the simulation, move randomly through this grid and while some of them have been injected the virus at the simulation start, others are healthy (not holding the virus). Moreover, squirrels reproduce, can get infected, live on for a bit after infection, and then semi-randomly die.

Getting started with the code in MPI, we describe our implementation design, the necessary communication involved, elaborate on how to ensure correctness and present performance considerations.

Design

Implementing the actor pattern for the virus propagation simulation we chose C++ as implementation language to leverage its object-orientation for different actor types and actor instantiations. Figure 2 illustrates the OOP design with distinct classes for Squirrels, GridCells, Clocks and Helpers. Different squirrels and gridcells are represented by individually instantiated Squirrel and GridCell objects. For the simulation we instantiate only one object of Clock though in general there might be use cases when more than one Clock instantiation could be required. The Helper class is presented as static as it holds adjacent globally available enum, wraps functions for initalising individual actors and only counts the number of currently alive and infected squirrels.

Initital communication at before simulation start

Initalisation

Preparing the simulation and setting up the actors, the master (MPI-rank zero and processPool master) initialises all actors including the helper, the clock, the gridcells and the number of squirrels that have been defined by user input or by the default values. Initialisation of an actor comprises calling the process pool’s startWorker() and sending the required information to the newly started worker. The newly started worker initialises an object of the respective class after having received all information through initial communication. The initial communication to inform the actor about the necessary information is visualised in figure 1. Initial communication is implemented as synchronous and blocking send and blocking receiving to ensure that actors go to work when they are fully set up for it. As communication is blocking, the order of sending and receiving has to be met on both sides. All actors receive their actor type as first message. Depending on this actor type, they receive further information about actor ranks that they will communicate with while performing their task (during simulation).

The squirrel receives an ordered list of all gridcell ranks (as vector that maps the gridcells’ ID to their ranks). A gridcell might be initalised on rank 12 but reflects gridcell ID 3. Gridcell IDs range from 0 to 15 for 16 gridcells. Squirrels initialised at program start are equipped with an infection status depending on the user defined parameters i.e. the number of infected squirrels. Squirrels initialised during simulation is always sent a negative infection status, thus newly born squirrels during simulation are born healthy. If the squirrel is born during simulation, the newly born squirrel receives its parent’s position. Squirrels initalised at program start, begin the simulation at position tuple (0,0).

Besides actor type and the clock’s rank, the gridcell receives its CELL ID that ranges from 0 to 15 (for 16 gridcells in the simulation). The gridcell includes its cell ID in its monthly print statement. The helper and the clock receive the same cell rank list like the squirrels. The helper additionally receives the clock’s rank and the clock receives the helper’s rank.

Communication during simulation

Actors communicate by sending and receiving messages through OpenMPI. Actors use message tags that are centrally defined in the Helper headerfile helper.hpp. The actor pattern loosely handles the send and receive of messages. It does not specifically guarantee messages to be sent and received for service. It mainly follows a fire-and-forget principle that we implement by using MPI_Bsend() or buffered send. Generally, actors act asynchronously. This may be reflected in squirrels taking different number of stpng per month. For correctness and validation we invoke two special cases that compromise on the actor patterns asynchronism. With exceptions of communication for initialisation described in section 2.1 and for synchronous and blocking handshakes between squirrel and gridcells communication bases on buffered sends. The way these actors communicate is rather dynamic and unpredictable: We know squirrels interact with gridcells but the actual squirrel actor on rank X that communicates with the gridcell on rank Y is unpredictable in rank (or source and destination) and in the number of interactions. Actors also randomly appear and disappear i.e. squirrels being born and squirrels dying.

The communication between actors during the simulation and while performing their respective task is depicted in Figure 3. Upon arrival in a gridcell, the squirrel sends its binary health status to the gridcell and receives the gridcell’s most current populationInflux value and infectionLevel in a handshake like send-and-receive manner. Detailed elaboration on why this handshake is necessary is presented in section 2.3. The clock sends End of Month (EoM) messages to gridcells and the helper. Upon receiving an EoM message, the gridcells print their populationInflux and infectionLevel values and update these properties thereafter. Receiving the EoM message, the helper prints the current number of squirrels alive and infected. Squirrels send notificatios to the helper when they give birth and reproduce, when they get infected and when they die. The helper instantiates a new squirrel upon learning about an existing squirrel’s reproduction.

Class UML

In general, each actor maintains its own state. Squirrels hold information about their current position, its infection status, the lately visited cells’ infectionLevels and populationInflux values to reliably calculate the probabilities for giving birth to a new squirrel, for getting infected with the Parapoxvirus and for their demise. Logically, squirrels are independent. Each squirrel might take any number of stpng per month. The implementation details about the boundaries of how many stpng a squirrel may take during a month are presented in section 2.4. Across squirrels, individuals might take different numbers of stpng per month. A squirrel sends its binary infection status to the corresponding gridcell as soon as it arrives in the cell. After having received the squirrel’s message, the gridcell instantly responds by sending its current populationInflux value and infectionLevel to the newly arrived squirrel. A squirrel making a step and arriving at the same cell as it was at before counts as new squirrel arriving at this cell. For two subsequent stpng into the same cell, this squirrel contributes twice to the cell’s properties. Thus, it increases the cell’s populationInflux value and the infectionLevel (if it is infected) also for the second step (if a squirrel steps and ends up in the same cell, this counts as a squirrel arriving in the cell i.e. every hop can be treated in the same way, as a squirrel arriving in a cell).

Gridcells aggregate the individual messages by squirrels for indicating the arriving squirrel’s infection status. Thus, gridcells maintain their individual populationInflux value, their infectionLevel and update these properties after receiving a message from clock indicating the end of a month.

The clock simulates the time sphere in this setting. It iterates over the number of months that the simulation lasts and infuses time-awareness in gridcell and the helper actors. After each month the clock notifies each gridcell and the helper about the end of the month. The squirrels are not aware of time, do not know time or season and therefore do not receive any messages by the clock.

The helper accumulates and manages the number of squirrels in the environment. After each month it outputs the number of currently alive and infected squirrels.

Actors and communications while performing tasks

Figure 3 presents actors as objects of the respective classes and their interactions. For the Parapoxvirus simulation we identify three entities that are implemented as classes in C++: Squirrel, GridCell and Clock. The implementation invokes one Clock object, 16 GridCell objects representing the environments in cells and one Squirrel object for each squirrel in the population. Objects interact by sending messages from a sender to a receiver. Any object can be a sender or receiver.

When the clock reaches the end of its month-simulation loop, it sends End of Simulation (EoS) messages to gridcells and the helper. Receiving EoS messages, the helper reutrns from its act() method and calls the process pool’s shutdownPool(). While the clock and helper already returned from their respective act() functions, the squirrels receive the notification sent by the pool master to stop working (testing through shouldWorkerStop(). All actors call workerSleep() after finishing their task (respectively receiving the EoS message). The process pool master already shut down the pool. Therefore, all awaiting/sleeping processes are freed and proceed to call processPoolFinalise() and subsequently MPI_Finalize(). As the squirrels receive the shutdown command later by the pool than the other actors receive the EoS message by the clock, the squirrels keep living and taking stpng for a short period. The handshake between gridcells and squirrels are implemented as blocking communications, the gridcells keep spinning in their while loop to clean up squirrel messages that could arrive late (meaning after the simulation’s actual ending). The gridcells keep on listening to messages for a defined time after the simulation, potentially respond to squirrel messages and then shut down.

Correctness and validation

A squirrel stpng into a gridcell, informs the girdcell about its arrival (with a message including its binary infection status) and then makes the next step after having received the response message by the cell it just stepped into (including populationInflux value and infectionLevel). We implemented this blocking and synchronous handshake pattern for communication between squirrels and gridcells in favour of correctness over performance. The synchronous handshake generally contradicts the actor pattern’s asynchronism but compromises on correctness. If a squirrel does not wait for the gridcells response including populationInflux and infectionLevel it will not be able to act reliably when taking the next step arriving at another gridcell. The squirrel needs to know whether it got infected at the last arrival at the lately visited gridcell. Moving on without being aware of the outcome of the latest "Did I just get infected?" test falls short of the simulation’s overall reliability.

The impact of randomness in the decisions whether a squirrel dies, gets infected or gives birth (we use a random number generator to tell us which of these options to simulate) could be mitigated by running the simulation for a high number of repetitions and averaging the outcome. We ran the simulation for 100 times with identical parameters across runs on the Cirrus backend, averaged numbers of squirrels alive and infected and compared it to results from individual runs. We observed varying counts but also an overall trend that was similar across all runs. An even more reliable validation test could extend the number of repetitions to 1000 or even more repetitions. The question is what sample size is necessary or sufficient to represent a general probability or derive a universally valid distribution. Quantitatively reporting standard deviations could help in validating simulation output.

Squirrels start living and take stpng as soon as they are instantiated as objects. Each squirrel step impacts the squirrel’s overall proneness of reproduction, infection and death. Ensuring the squirrel starts performing its task when the clock starts to simulate months could be implemented by barriers. For our implementation we focused on an instantiation order to reflect the simulation’s structure. First we create the clock and helper objects and subsequently the gridcells followed by the squirrels. Nevertheless, in theory it is not possible to fully guarantee that these squirrels start working after all other actor types. Contrary, it might be legit to let squirrels randomly work on their task and capture time-awareness through the clock for each end of a month for any given point in the squirrel population’s phase of existence. If the squirrels’ actions were as random as suggested the population would have not shown the same trend throughout all test runs. Opposed, it is doubtable whether the population numbers would show any trend at all if squirrel action’s were totally random. Basing our assumptions on the random number generator’s seed that is different on all squirrel actors, we conclude that at least a repeated test with high numbers of runs like our initial test with 100 runs for identical parameters should show similar trend across all runs.

Performance considerations

Regarding the number of stpng that a squirrel takes per month and the impact on the population’s overall evolution, we defined a month as lasting 50ms. We tried unrestricted squirrel movement and temporal restrictions on squirrel stpng. Figure 4 shows three test runs with different configurations concerning squirrel movement frequency. The graph shows the population numbers for squirrels taking one step per 1ms, one step per 2ms and one step per 3 ms. In any of these configurations no squirrel takes more than 50 stpng per month. 50 stpng is the lower boundary for a squirrel to die after infection. Hence, we ensure to observe the evolution from state infected to dead month by month. In previous tests, squirrels took more than 50 stpng per month leading to a whole population’s demise in the first month. Preventing squirrels from getting infected and dying in the same month, we set the number of squirrel stpng per month below 50 by defining a temporal month length of 50ms and squirrels taking stpng every 2ms.

Squirrel distribution and varying configuration

The underlying network that enables communication between processes impacts overall program performance. Running tests on the Cirrus frontend led to different output than on the backend. We implemented one actor per process that enables a high degree of parallelism but could also result in higher load balancing overhead. Squirrels are centrally invoked by the helper that manages bookkeeping to stay within a limit of 200 concurrent squirrel processes.

With increasing squirrel actors sending messages to gridcells and receiving responses from gridcells the network workload rises. The gridcells or the network could become the bottleneck. Receiving extraordinary high numbers of messages from squirrels the gridcells need more time to send responses. The handshake between gridcells and squirrels is blocking and synchronous. Long waiting phases keep the squirrels idle waiting for the response, thus potentially further decreasing the number of stpng per month and per squirrel. Currently, only one helper actor accumulates the messages by squirrels. Increasing the number of event-like messages for reproduction, infection and death, the helper could become the bottleneck. This problem could be solved by implementing a second helper actor on a separate process that shares work with the existing helper. Besides the additional overhead in sharing work, adding new helpers could reduce the proneness of becoming a bottleneck in communication and message handling. If the helper is not able to service messages, new born squirrels are either never initialised or only set up with delay.

Helpers for implementing the actor pattern

With the Helper class we provide a reusable skeleton to structure actor pattern related code and separate it from problem specific code like the squirrel population implementation. For the initialisation procedures for setting up the actors (including initial communication) we decided to separate them from other classes. For instance, the actors of class Squirrel are initialised with Helper::squirrel_actor_init(). With camelCase we indicate objects and their methods and with underscore_case we indicate static functions, e.g. those wrapped in Helper. The classes currently contain act() methods indicating necessary functions for the actual actor pattern implementation. The communication part including calls to MPI directives such as MPI_Bsend() are closely coupled with the actual implementation of the simulation code and therefore reside in the problem classes. Contrary, the adjacent init functions are not part of the problem related code and are therefore logically separated in the static Helper class.

Listing debug_flags defines five flags to enable debugging information in the respective classes. Setting a specific flag to 1 will print helpful information during initialisation and simulation.

Debug falgs in helper.hpp:

        #define MAIN_DEBUG 0
        #define HELPER_DEBUG 0
        #define SQUIRREL_DEBUG 0
        #define GRIDCELL_DEBUG 0
        #define CLOCK_DEBUG 0

Maintaining uniform message tags to be used across actors and actor types, we introduce global enum that are packed in the helper.hpp headerfile for convenient import (#include "../include/helper.hpp") in the problem code. Listing message_tags shows the available tag definitions of the simulation code. For instance to tell the actor type to a newly instantiated gridcell:

Message tags in helper.hpp

    // use enum to declare global state
    // MPI message tag
    // to distinguish different message types
    // based on tags
    enum message_tag {
        WHAT_AM_I, // actor type
        ALL_CELLS_ID, // list of cell ranks
        INFECTION_STATUS, // binary state
        CLOCK_ID, // clock rank
        CELL_ID, // cell rank
        HELPER_ID, // helper rank
        SQUIRREL_POSITION,
        SQUIRREL_TO_GRIDCELL,
        GRIDCELL_TO_SQUIRREL,
        END_OF_MONTH,
        END_OF_SIMULATION
    };

    // to distinguish indices of
    // wrapper arrays used for sending
    // popInflux and InfLvl from cell to squirrel
    enum information_on_arrival {
        POPULATION_INFLUX,
        INFECTION_LEVEL
    };

    // to distinguish types of actors on a process
    enum what_am_i {
        HELPER,
        CLOCK,
        SQUIRREL,
        GRIDCELL,
        MASTER
    };

    enum squirrel_status {
        ALIVE,
        HEALTHY,
        REPRODUCED1, // announcing new born, then directly:
        REPRODUCED2, // send parent position
        INFECTED,
        DIED
    };

    enum squirrel_position {
        X,
        Y
    };

Listing using_message_tag on line 2 shows how to access the definition of the actor type GRIDCELL. On line 3, the message tag WHAT_AM_I is used to indicate that this message’s value represents an actor type.

Using message tags with defined enum in helper.hpp:

    // tell actor what type it is (either clock, grid cell or squirrel)
    int actor_type = what_am_i::GRIDCELL;
    MPI_Ssend ( &actor_type, 1, MPI_INT, process_id, message_tag::WHAT_AM_I, comm );

Summary and what’s next

From the initial idea of modeling the propagation of a virus, we’ve decided to implement actors i.e. clock, grid cell and squirrel (the original virus host) as MPI processes with squirrels continuously moving through the 2D grid environment and infecting other hosts as they share the same grid cell or location. We’ve chosen parameters to eventually infect all hosts ending the simulation when all hosts are deceased. In the section Performance Considerations, we showed how the choice of the simulation clock parameters impact the remaining life time of squirrel actors.

Next, we could think about more abstractions between our application logic with actor classes and the enabling library, here MPI. Could we use abstractions for send and receive logic so we could easily plug-in another parallel implementation?