Simulating Datacenter Fabrics

Overview

Background

Cloud vendors, including Amazon AWS, Microsoft Azure and Google Cloud, make their datacenter network accessible for clients by abstracting out the complexities of the underlying interconnect. For the client, the datacenter network is one large fabric that magically connects all of their rented instances as in Figure 1.

An actual data center fabric, however, looks much more like Figure 2 below. The process of designing and deploying a data center fabric, and its accompanying network stack, costs upwards of a billion dollars. Hence, computer scientists evaluate a potential design’s performance using a network simulator prior to deploying it in an actual datacenter.

Problem

Network simulators are usually serial. This proves to be a major issue for data center frameworks which tend to be massive in scale with multiple highly interconnected layers. Furthermore for large data centers, which are extremely expensive, network issues like network congestion can result in additional high costs and service issues if unaccounted for. Therefore these simulations are important as they help highlight issues such as network bottlenecks in different network topologies. These simulations also help to optimize network topologies and better balance traffic. We will use parallelization to deal with the compute-intensive aspect of traversing network nodes (i.e. servers, switches, and links) that comprise these simulations.

To get an intuition for why parallelizing a network is non-trivial, consider the following naive way of dividing the problem.

Approach 1: Simulate network devices in parallel. This involves writing a simulator for a network switch and running parallel simulations for all the switches in figure 2 on separate threads. The problem with this approach is that there is a dependence between a switch and every other switch in the network. As an example, you cannot simulate the load on any of the switches at the top of figure 2 (core) without knowing how much data they are being sent by the switches underneath them (aggregation). The same goes for any other set of switches one can think of.

Other approaches, such as trying to simulate network traffic in parallel, have similar pitfalls.

Solution

For our project, we used multi-processing for the purposes of traffic generation and for reading traces, traffic matrices, and network topologies. Furthermore, we used multi-threading for the execution of pruned/isolated network clusters on separate nodes. Finally shared memory is used to aggregate the results at the end as well as to share traffic information between nodes.

Design

Overview of Phases

The design of our project consisted of 3 phases.

Phase 1: Fake traffic generation for network flow.
In order to generate fake traffic, we require user input in the form of the CDF of the traffic as well as the interarrival times. With the CDF, we use random sampling on the inverse CDF in order to simulate the load of the time. Finally, we use the Poisson Process in order to get the time of the next flow.

Phase 2: Fake network creation with switches and service/time delay for simulation.
For this phase, we implemented a M/M/1 Queue that simulated the service time of the switches. This queue is then processed in a FIFO manner. While this queue is processed, we add the link delay between switches as the network simulation runs. This queue will run through the network topology created in the next phase.

Phase 3: Evaluation of network topologies using simulation.
We realize that network packets don't always interact with every single part of the network. Thus we can parallelize the network through the creation of isolated network clusters. Thus in this phase, we prune the entire network into network clusters that are isolated from one another (i.e. packets don't move from one network cluster to another). In this way, we can parallelize flows through each of these network clusters and speedup the performance of the simulation.

Phase 1 and Phase 2 are combined into the "Traffic Generation" component of our project and Phase 3 is the "Network Simulation" component of our project.
At the end of our project, we produce a flow completion time CDF like the one shown below (which was generated from a "fork" topology of one sender and two receivers).

Network Topology

For our network simulation, we chose to create a fat tree topology. A graphical representation can be found below.

Using our test.topo.dot file (fat_tree.topo found on the GitHub repository under tests/), we generated our own fat tree network topology shown below.

In the diagram, the network instances are labeled as "h#" for hosts/servers and "s#" for switches. The lines connecting the network instances are the links in the network.

Data Used

Examples of network flow size traces can be found publicly from CAIDA (Center for Applied Internet Data Analysis) [4], GitHub [5], and Google [3].
More information on the use of the Poisson Process for network load generation can be found on Alizadeh et. al (MIT) [1], and Harchol-Balter et. al (Carnegie Mellon) [2].

Infrastructure

The main 2 parallel infrastructures used are MPI and OpenMP. Specifically OpenMP was used for the Traffic Generation (Phase 1 + 2) in order to speed up the process of network creation, previously a serial task. MPI was used for Network Simulation (Phase 3) as a way for the pruned network clusters on separate nodes to communicate with one another.

Specification

Machine Specifications

Traffic generation experiments: Intel Core i7, 16 GB DDR 4 Memory, 6 Cores, 1 Processor, 256 KB L2, 9 MB L3 Cache.
Flow completion experiments: Intel Core i5, 16 GB DDR 4 Memory, 4 Cores, 1 Processor, 256 KB L2, 6 MB L3 Cache.

Software Specifications

OpenMP 5.0
OpenMPI 4.0.3

Results

Traffic Generation

We used OpenMP to parallelize the traffic generation code on an increasing amount of threads alongside increasing simulation times (in ms) which requires more data to be generated. These simulation times increased in a exponential manner from 0.1 to 1 and eventually to 10000.

Bar plots for the traffic generation phases can be found below (left is on a linear scale and right is on a logarithmic scale). Furthermore, these bar plots display the error margins.

From the data, it is evident that paralleization of the traffic generation provided little to no improvement when the simulation time (to be generated) is 0.1 ms. Furthermore, the improvements that result from the parallelization of traffic generation for simulation times of 1, 10, and 100 ms are minor and are only visible on a logarithmic scale. Greater improvements from paralleization result from traffic generation greater that 100 ms.

Furthermore, it can be seen that in most cases more threads improves completion time. The exceptions to this are when the simulation times are 0.1, 1, and even 10 ms (when increasing from 4 to 5 threads), although the deviations are minor in scale.

To better visualize the speedup with an increasing number of threads, we plotted the speedup from an increasing number of threads alongside an increasing simulation time.

From the line plot, speedup with 1 thread is constant, as it should be. Furthermore, speedup increases alongside an increasing number of threads (although it should be noted that speedup is relatively the same for any number of threads when simulation time is less than or equal to 10 ms). The rate of speedup growth is higher for runs with more threads. Finally, the rate of speedup growth seems to approach zero for the generation of traffic data for simulation times above 1000 ms.

Network Simulation

We used MPI to parallelize the network simulation code on 1 MPI node and 4 MPI nodes for increasing simulation times that were generated in the Traffic Generation aspect of our code.

Bar plots for the network simulation phase can be found below.

From the bar plot, it is evident that, in most cases, parallelization with MPI significantly improves the completion time of the simulation. The exceptions for this are when the simulation time is 1 ms, where completion time of the parallelized run is slower, and when the simulation time is 5 ms, where the completion time of the parallelized run is only slightly faster.

To better visualize the speedup with more MPI nodes, we plotted the speedup from using 4 MPI nodes alongside increasing simulation time.

From the line plot above, it can be seen that speedup grows linearly with problem size but declines slightly with simulation times greater than 25 ms. It should also be noted that for simulation times between 1 ms and 5 ms, speedup is less than one, meaning that the parallel version is slower than the serial version.

Conclusion

Challenges

The main challenges we faced were:

Ensuring that network clusters are correctly pruned and isolated.
Creating small yet comprehensive data structures to hold what is needed for the simulation
Combining every single phase (Traffic Generation + Network Simulation) to create a seamless application
Optimizing the parallelized performance of the nodes for the network simulation.

Overheads

The main overheads of our project were:

Communication: Multiple objects in our code (i.e. switches, links, and queues) needed to communicate with one another to relay information and pass the network packet. Furthermore, communication was needed between the MPI nodes in order to communicate results and times.
I/O: The need to read and write large files for the random sampling of the CDF, the generation of the fat tree network topology, and reading the flow sizes for the simulation.
Sequential Operations: The queue used by the server is serial by nature and cannot be parallelized.
Load Balancing: Time is needed in order to prune and prepare the isolated network clusters that can then be used for parallelization.

Conclusions

Parallelization of our Traffic Generation code significantly benefitted it performance although the rate of its speedup increases started to decline after simulation times over 1000 ms. It should be noted, however, that parallelization only helped Traffic Generations of simulation times over 100 ms due to the added overheads that comes with parallelization. Thus, we suggest that parallelization by increasing threads with OpenMP is a viable way of generating the large traffic flows needed for network simulations. One caveat, however, is that speedup starts to remain constant after simulation times of 1000 ms which could be addressed with more threads or other improvements in the code (see Future Work below).

Parallelization of our Network Simulation with just 4 MPI nodes resulted in massive improvements in performance, although speedup started to decline after simulation times of 30 ms. It should also be noted that (like the parallelization of the traffic generation), parallelization provided little benefits and was, at times, detrimental to completion time for simulation times at or below 5 ms. Thus, we suggest that parallelization through the use of MPI nodes is viable way to reduce the completion time of large network simulations. The benefits/speedups of this parallelization could be potentially greater with the use of more MPI nodes (see Future Work below).

Future Work

In the future, we would like to investigate whether an increasing number of MPI nodes will further speedup the simulation. Furthermore, we would like to look into the posibility of GPU computing as a way of improving our traffic generation as well as network simulation code. Additionally, we can find methods to deal with overheads that hinder the speed of some aspects of our project.

Works Cited

Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. 2013. PFabric: minimal near-optimal datacenter transport. In Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM (SIGCOMM ’13). Association for Computing Machinery, New York, NY, USA, 435–446. DOI:https://doi.org/10.1145/2486001.2486031
Mor Harchol-Balter. 2013. Performance Modeling and Design of Computer Systems: Queueing Theory in Action (1st. ed.). Cambridge University Press, USA.

Google Cluster Data (v3) - John Wilkes et. al. Google. Sunnyvale, CA.

The CAIDA Anonymized Internet Traces (2019). caida.org.

Empirical Traffic Generator. Mohammad Alizadeh et al. (Cisco).