When I started CS380P (Parallel Computing) at UT Austin, I expected a tough assignment. What I did not expect was to get genuinely obsessed with a physics simulation.
The problem was deceptively simple: implement a parallel N-body gravitational simulator using the Barnes–Hut algorithm with MPI. I built a correct sequential Barnes–Hut tree code first, then parallelized it with MPI, then kept going until I had an interactive web demo that lets you see the quadtree evolve in real time.
This post is a guided walkthrough of what I built, how Barnes–Hut turns an impossible-looking problem into something tractable, what MPI does to a tree-based algorithm, and what the accuracy versus speed knob (θ) really means when you can watch it.
Why I wrote this
I have a soft spot for problems where a single clean idea changes the whole game. N-body is one of those problems. The physics is simple, but the computational cost is brutal. You get a direct confrontation with complexity, approximation, numerical stability, and communication overhead, all in one project.
By the end of the assignment, I could not let it stay a command-line program. Barnes–Hut is a visual algorithm. The quadtree is geometry. The simulation is emergent behavior. If you can see it, the intuition becomes permanent.
Problem context
This started as a CS380P lab with a clear checklist: implement Barnes–Hut with a quadtree, parallelize with MPI, clamp close-distance forces with an rlimit, and use a fixed 2D domain ([0,4] × [0,4]). The program reads bodies from a text file, runs for a user-provided number of steps, writes the final state, and prints a single elapsed-time number in seconds. I kept those constraints, then layered on extra instrumentation and a separate web demo.

Quick background: what is the N-body problem?
An N-body gravitational simulation tracks N particles (stars, planets, dark matter particles, atoms) that attract each other. At every time step you want the net force on each body. The straightforward approach compares every pair, which is exact, but it scales as O(N²).
With N bodies there are N(N-1)/2 unique pairs. For 10 bodies, that is 45 pair interactions. For 1,000 bodies, it is 499,500. For 100,000 bodies, it is 4,999,950,000 — nearly five billion pairwise force evaluations per step. Even before you care about rendering, I/O, and integration, this is already too slow for interactive work.
The core question becomes: how do we compute a good approximation to the gravitational field without touching every pair?
Small aside: if you have watched 3 Body Problem, you already know the vibe :)
The Barnes–Hut idea: distance makes things simpler
Barnes–Hut is built on a physical intuition: when a group of masses is far away, you do not need to model each mass individually. You can approximate the entire group as a single mass located at the group’s center of mass. For a far-away observer, the fine details barely matter.
Barnes–Hut turns that intuition into an algorithm by organizing space hierarchically. Nearby structure gets represented with detail. Far-away structure gets represented in aggregates.
In practice this transforms the computational cost from O(N²) to roughly O(N log N) for typical distributions, while keeping the error controllable.
Quadtree construction
My implementation is 2D, so I use a quadtree. (In 3D you use an octree and the same ideas apply.)
Start with a square simulation domain that contains all bodies. If a region contains more than one body, subdivide it into four quadrants: NW, NE, SW, SE. Recurse until each leaf contains at most one body.
Each node represents a spatial cell. It stores the cell bounds and, after aggregation, the total mass and center of mass of everything in that cell.

A crucial detail is what you do with bodies that lie on boundaries. You need a deterministic rule so every point maps to exactly one child cell. Without that, you get ambiguous insertion and subtle correctness bugs.
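Here is a minimal sketch of one such rule, with hypothetical cell fields; a half-open convention (ties always go to the same side) is enough to make the mapping unambiguous:

```c
// Illustrative sketch of a deterministic boundary rule: points exactly on a
// midline always go to the same side, so every position maps to one child.
typedef struct { double x_min, x_max, y_min, y_max; } Cell;

// Returns the child index: 0 = SW, 1 = SE, 2 = NW, 3 = NE.
int quadrant_of(const Cell *cell, double x, double y) {
    double mid_x = 0.5 * (cell->x_min + cell->x_max);
    double mid_y = 0.5 * (cell->y_min + cell->y_max);
    int east  = (x >= mid_x);   // ties on the vertical midline go east
    int north = (y >= mid_y);   // ties on the horizontal midline go north
    return 2 * north + east;
}
```

The exact convention does not matter; what matters is that the same rule is applied everywhere bodies are inserted.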
Computing centers of mass
Once the quadtree is built, you compute aggregate values bottom-up.
For a leaf node that contains a single body, the node’s mass is that body’s mass, and the node’s center of mass is the body’s position.
For an internal node, the mass is the sum of child masses, and the center of mass is the mass-weighted average of child centers:

M = Σ m_child
com = (Σ m_child · com_child) / M
This is the data that makes the approximation possible. The whole algorithm depends on being able to replace a spatially distributed cluster with a single “equivalent” mass.
```c
// Build tree (replicated on each rank)
root = create_node(domain_min, domain_max, domain_min, domain_max);
for (int i = 0; i < n; i++) {
    if (bodies[i].mass >= 0) {   // skip bodies flagged as lost
        insert_body(root, i, bodies);
    }
}
compute_mass_distribution(root, bodies);
```
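For reference, here is a minimal sketch of what the compute_mass_distribution pass does; the Body and Node layouts are my illustration for this post, not necessarily the repo's exact structs:

```c
// Illustrative Body/Node layout (field names are assumptions, not the repo's).
typedef struct { double x, y, vx, vy, mass; } Body;

typedef struct Node {
    struct Node *children[4];   // NULL when a child cell is absent
    int    body_index;          // index of the single body in a leaf, -1 otherwise
    double size;                // cell width (used later by the MAC test)
    double mass, com_x, com_y;  // aggregate mass and center of mass
} Node;

// Bottom-up aggregation: leaves copy their body, internal nodes sum children.
void compute_mass_distribution(Node *node, const Body *bodies) {
    if (node == NULL) return;
    if (node->body_index >= 0) {                     // leaf holding one body
        node->mass  = bodies[node->body_index].mass;
        node->com_x = bodies[node->body_index].x;
        node->com_y = bodies[node->body_index].y;
        return;
    }
    node->mass = 0.0;
    node->com_x = 0.0;
    node->com_y = 0.0;
    for (int q = 0; q < 4; q++) {
        Node *child = node->children[q];
        if (child == NULL) continue;
        compute_mass_distribution(child, bodies);
        node->mass  += child->mass;
        node->com_x += child->com_x * child->mass;   // mass-weighted sum
        node->com_y += child->com_y * child->mass;
    }
    if (node->mass > 0.0) {
        node->com_x /= node->mass;                   // divide to get the average
        node->com_y /= node->mass;
    }
}
```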
Force computation: the MAC and the θ trade-off
When computing the force on a body, we traverse the tree. At each node we decide whether the entire cell can be approximated as a single body.
The classic Barnes–Hut Multipole Acceptance Criterion (MAC) is:
If (s / d) < θ, accept the approximation.
Here, s is the cell width, d is the distance from the body to the node’s center of mass, and θ is a user-defined threshold.
If the cell is far enough relative to its size, we treat it as one mass at the center of mass and add its contribution to the force. If it is not far enough, we recurse into the children and repeat.
θ is the main knob that controls the speed versus accuracy trade-off.
When θ is small, you recurse more, compute more interactions, and get higher accuracy.
When θ is large, you accept more approximations, compute fewer interactions, and run faster, but the error increases.
In my experiments, θ = 0.5 is a strong default for educational demos because it is visibly faster than conservative settings, while still producing stable-looking dynamics.
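Here is a sketch of the traversal with the MAC test, reusing the illustrative Node and Body layout from the earlier sketch; the real code also clamps very small distances with rlimit, which I cover in the physics section below:

```c
#include <math.h>

// Sketch of the Barnes–Hut force traversal for one body (illustrative names).
void accumulate_force(const Node *node, const Body *bodies, int i,
                      double theta, double G, double *fx, double *fy) {
    if (node == NULL || node->mass == 0.0) return;
    if (node->body_index == i) return;               // never pull on yourself

    double dx = node->com_x - bodies[i].x;
    double dy = node->com_y - bodies[i].y;
    double d  = sqrt(dx * dx + dy * dy);

    int is_leaf = (node->body_index >= 0);
    if (is_leaf || (node->size / d) < theta) {
        // MAC accepted (or single body): one interaction for the whole cell.
        double f = G * bodies[i].mass * node->mass / (d * d);
        *fx += f * dx / d;
        *fy += f * dy / d;
    } else {
        // MAC rejected: recurse into the four children.
        for (int q = 0; q < 4; q++) {
            accumulate_force(node->children[q], bodies, i, theta, G, fx, fy);
        }
    }
}
```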


To make the θ trade-off tangible, I've added an interactive visualization to the demo.
When you enable the quadtree view and click on any body, the simulation highlights how the Barnes-Hut algorithm computes forces specifically for that selected body (marked with a yellow ring). The quadtree nodes are color-coded to show the algorithm's decisions:
Green boxes represent nodes where the MAC is satisfied (s/d < θ), meaning all bodies within are approximated as a single mass at the center of mass; this is where the O(N log N) speedup comes from.
Red boxes indicate nodes that are too close to approximate safely, forcing the algorithm to recurse deeper and eventually compute individual body-to-body interactions.
Blue boxes show the rest of the quadtree structure. Try adjusting θ from 0.3 to 1.0 while observing a selected body: you'll see more red boxes (recursion) at lower values, transitioning to more green boxes (approximation) at higher values.
This visual feedback makes it immediately clear why smaller θ values are slower but more accurate, while larger values sacrifice precision for performance. The white/blue glowing bodies you see are all the other masses creating gravitational forces on your selected body; watching how they get grouped or computed individually reveals the elegance of the Barnes–Hut approximation in action.
Physics: forces, softening, and stability
The gravitational force magnitude between two bodies is:

F = G · m1 · m2 / r²
In a simulation, the tricky part is what happens when r gets very small. The 1/r² singularity can cause enormous accelerations, NaNs, and eventually the entire simulation blowing up.
A common approach is gravitational softening. Instead of using r² directly, you add a small ε²:

F = G · m1 · m2 / (r² + ε²)
In my assignment implementation I used a simple minimum-distance clamp (rlimit). If two bodies get closer than rlimit, I replace r with rlimit for the force computation. Conceptually, this is a coarse softening that prevents singular behavior and keeps the numeric pipeline stable.
You can treat ε or rlimit as a resolution scale. You are saying: below this distance, I no longer trust the simulation to resolve close encounters faithfully.
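A minimal sketch of the clamp, applied wherever an interaction is accepted (the function and parameter names here are illustrative):

```c
#include <math.h>

// Sketch of a single accepted interaction with the rlimit clamp (illustrative).
// Adds the force a mass m_other at (ox, oy) exerts on a body of mass m at (x, y).
void add_clamped_force(double x, double y, double m,
                       double ox, double oy, double m_other,
                       double G, double rlimit,
                       double *fx, double *fy) {
    double dx = ox - x;
    double dy = oy - y;
    double r  = sqrt(dx * dx + dy * dy);
    if (r == 0.0) return;                        // coincident points: nothing sensible to add

    double r_eff = (r < rlimit) ? rlimit : r;    // below rlimit, pretend the pair sits at rlimit
    double f = G * m * m_other / (r_eff * r_eff);

    *fx += f * dx / r;                           // direction still from the true separation
    *fy += f * dy / r;
}
```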
Time integration: why I avoid Euler
Once you compute the net force on a body, you convert it to acceleration:

a = F / m
Then you update velocity and position over a time step dt.
Euler integration is easy to implement, but for Hamiltonian systems like gravity it tends to drift in energy and can turn stable orbits into spirals.
A better choice is a symplectic integrator such as leapfrog (often written as kick-drift-kick) or the closely related velocity Verlet form. The key property is that energy errors tend to stay bounded instead of drifting, so it behaves much better over long horizons.
Here is the velocity Verlet update. In the code I use the single-acceleration form for simplicity (no second force evaluation); the two-acceleration form is more accurate but costs an extra force pass:
Update position using the current velocity and acceleration:

x(t + dt) = x(t) + v(t)·dt + ½·a(t)·dt²
Compute new acceleration a(t+dt) from forces at the new positions.
Update velocity using the average acceleration:

v(t + dt) = v(t) + ½·(a(t) + a(t + dt))·dt
```c
// Velocity Verlet-style update (single-acceleration form)
ax = fx / m;
ay = fy / m;

x += vx * dt + 0.5 * ax * dt * dt;
y += vy * dt + 0.5 * ay * dt * dt;

vx += ax * dt;
vy += ay * dt;
```
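For comparison, the two-acceleration form mentioned above looks roughly like this; it needs a second force pass per step, which is why the assignment code skips it:

```c
// Sketch of the full velocity Verlet step (two force evaluations per step).
ax = fx / m;                          // a(t) from the current forces
ay = fy / m;

x += vx * dt + 0.5 * ax * dt * dt;    // drift using a(t)
y += vy * dt + 0.5 * ay * dt * dt;

/* ...rebuild the tree and recompute fx, fy at the new positions... */

ax_new = fx / m;                      // a(t + dt) from the new forces
ay_new = fy / m;

vx += 0.5 * (ax + ax_new) * dt;       // kick with the average acceleration
vy += 0.5 * (ay + ay_new) * dt;
```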
Making it parallel with MPI
MPI stands for Message Passing Interface. It is a standardized programming model and library interface for building parallel programs where many separate processes cooperate by explicitly sending messages to each other. In practice, you launch multiple processes (called ranks) that each run the same program, then use MPI calls like broadcast, gather, and reduction to share data and synchronize work across ranks.
Read more: https://www.mpi-forum.org
My CS380P assignment specifically required using MPICH, which is an implementation of the MPI standard. MPICH provides the MPI headers and runtime you link against, plus the tooling you use to compile and run MPI programs on a machine or cluster.
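If you have not used MPI before, a minimal program looks like the sketch below; the file name and process count in the build/run comment are just an example:

```c
// Minimal MPI skeleton: every rank runs this same program.
// Build and run (example):  mpicc hello.c -o hello && mpiexec -n 4 ./hello
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes are there?

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```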
Barnes–Hut is tree-based, and tree traversal is data dependent. That makes parallelization less straightforward than parallelizing a simple all-pairs kernel.
The core issue is that to compute forces for any body, you need access to the global tree. So you either distribute the tree (hard), or you replicate it (simple).
Replicated quadtree strategy
I chose replicated trees because it is conceptually clean and easy to reason about.
At each time step, every MPI rank builds the same quadtree from the global list of bodies. Then each rank computes forces for a disjoint subset of bodies. Finally, we communicate the results back to rank 0, update positions on rank 0, and broadcast the updated bodies to all ranks for the next step.
This is a compute-heavy, communication-light pattern for sufficiently large N. For small N, the communication cost becomes dominant.
Communication pattern
There are two correct ways to combine per-rank results, depending on what you compute.
If each rank computes forces only for its own bodies, you are producing disjoint chunks of a force array. The most direct collective is a gather-style operation (Gather or Gatherv) to assemble the full force array on rank 0.
If each rank computes partial contributions to the same bodies and those contributions must be summed, then Reduce makes sense.
In my implementation I still use Reduce, but with a full-sized force array where unowned entries are left at zero. The SUM reduction turns those sparse per-rank arrays into a complete force array on rank 0.
Example loop:
```
for each timestep:
    build tree from all bodies (on all ranks)
    compute forces for my assigned bodies
    reduce SUM forces to rank 0
    if rank 0:
        update positions and velocities
    broadcast updated bodies to all ranks
```
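In MPI terms, that loop maps onto a handful of collectives. Here is a simplified sketch of one step, reusing the illustrative types from earlier; build_tree, accumulate_force, and integrate are stand-ins for the real functions in the repo:

```c
#include <mpi.h>
#include <string.h>

// Sketch of one step of the replicated-tree scheme. local_forces and
// global_forces are full-sized 2*n arrays of (fx, fy) pairs.
void simulation_step(Body *bodies, int n, double theta, double G, double dt,
                     double *local_forces, double *global_forces,
                     int rank, int size) {
    // Static block partitioning of body indices across ranks.
    int chunk = (n + size - 1) / size;
    int start = rank * chunk;
    int end   = (start + chunk < n) ? start + chunk : n;

    Node *root = build_tree(bodies, n);              // same tree on every rank

    // Fill only my slice; the rest stays zero, so MPI_SUM on rank 0
    // assembles the complete force array.
    memset(local_forces, 0, 2 * n * sizeof(double));
    for (int i = start; i < end; i++) {
        accumulate_force(root, bodies, i, theta, G,
                         &local_forces[2 * i], &local_forces[2 * i + 1]);
    }

    MPI_Reduce(local_forces, global_forces, 2 * n,
               MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        integrate(bodies, global_forces, n, dt);     // only rank 0 moves bodies
    }
    // The illustrative Body is 5 packed doubles, so broadcast a flat buffer.
    MPI_Bcast(bodies, 5 * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}
```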
Load balancing
A static partitioning by body index is simple and works well for uniform distributions. For clustered distributions, some ranks may traverse deeper parts of the tree more often, which creates imbalance.
A dynamic strategy can improve this, but it complicates communication and scheduling. For the assignment constraints and input sizes, static partitioning was a good trade.
Results and scaling
I measured runtime per step across different N and different MPI process counts on an Apple M4 Max (16 cores: 12 performance and 4 efficiency).

Here is an example strong scaling table from my runs:
| Bodies | 1 proc | 2 procs | 4 procs | 8 procs | Speedup (8 vs 1) |
|---|---|---|---|---|---|
| 100 | 0.03 ms | 0.04 ms | 0.03 ms | 0.04 ms | 0.64x |
| 1,000 | 0.70 ms | 0.49 ms | 0.41 ms | 0.33 ms | 2.11x |
| 10,000 | 16.02 ms | 9.48 ms | 6.25 ms | 4.78 ms | 3.36x |
| 100,000 | 274.60 ms | 177.49 ms | 124.93 ms | 95.30 ms | 2.88x |
I measured runtime per step across N = 100, 1,000, 10,000, and 100,000 with 1, 2, 4, and 8 MPI processes. As expected, the smallest case (N=100) is dominated by MPI overhead and doesn’t scale. For larger N, compute dominates and speedup improves, but on my laptop the best observed speedup at N=100,000 was ~2.9× on 8 processes, indicating non‑trivial overhead from tree rebuilds and global collectives. This is still useful for educational purposes, but it’s not near‑linear scaling, which is exactly what you’d expect from a replicated‑tree design on a single machine.
Baselines
If you want to make the performance story even tighter, include two baselines.
First, an O(N²) sequential baseline for small N. This makes the asymptotic benefit of Barnes–Hut obvious.
Second, a sequential Barnes–Hut baseline. This isolates the MPI overhead and gives a clean strong-scaling picture.

Takeaway: the naive all‑pairs baseline bends upward much faster with N, while Barnes–Hut grows more gently. The gap widening with N is the practical reason the tree approximation is worth the added complexity.
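For reference, the all-pairs baseline is just the classic double loop; this is a sketch reusing the illustrative Body struct from earlier, not the exact code in the repo:

```c
#include <math.h>

// O(N^2) reference kernel. The same rlimit clamp keeps the comparison
// with the tree code fair.
void all_pairs_forces(const Body *bodies, int n, double G, double rlimit,
                      double *fx, double *fy) {
    for (int i = 0; i < n; i++) { fx[i] = 0.0; fy[i] = 0.0; }

    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double r  = sqrt(dx * dx + dy * dy);
            if (r == 0.0) continue;

            double r_eff = (r < rlimit) ? rlimit : r;
            double f = G * bodies[i].mass * bodies[j].mass / (r_eff * r_eff);
            double ux = dx / r, uy = dy / r;          // unit vector from i toward j

            fx[i] += f * ux;  fy[i] += f * uy;        // Newton's third law: the
            fx[j] -= f * ux;  fy[j] -= f * uy;        // reaction on j is opposite
        }
    }
}
```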
Accuracy measurement
When I say “accuracy,” I mean accuracy relative to an all-pairs reference computed at the same positions and time step. A simple, defensible choice is a relative force error metric per body, then averaged over bodies and over a small number of steps.
I define the metric precisely in Appendix B.
Beyond the assignment
The C + MPI version is fast, but it is not visible. You can log trajectories and plot them later, but that does not teach the algorithm.
So I built an interactive web demo that runs Barnes–Hut in the browser. It is not meant to beat the MPI code. It is meant to teach.
The demo implements the same core components: quadtree, MAC, force accumulation, and a stable integrator. It adds real-time rendering, sliders for θ and dt, and an overlay that draws the quadtree so you can see exactly why the algorithm becomes fast.
I know the video lags a bit; sorry about that!
Node‑visit heatmap
This plot shows where the tree traversal actually spends work. Each body is colored by how many quadtree nodes it visits while computing its force; brighter regions mean deeper traversal and more computation. In clustered inputs, the hot spots sit at dense cores and along cluster edges, where the MAC fails more often and the algorithm must recurse. The colder regions are sparse space where large cells are accepted early. This makes the core point of Barnes–Hut visible: the cost is driven by spatial density, not just N, and θ directly controls how much of the tree each body touches.
What the heatmap shows
X and Y axes: actual simulation coordinates in the 2D domain ([0,4] × [0,4]).
Color: how many quadtree nodes are visited when computing forces for bodies that are located in that area.
Interpretation: brighter = more traversal work (denser regions or complex boundaries), darker = fewer node visits (sparser space where the MAC accepts big cells early).
What dark vs bright means. Darker means less computation per body in that region. Think of the heatmap as work spent, not mass density:
If a region is dark, bodies there usually accept large tree cells early (MAC passes), so they visit fewer nodes.
If a region is bright, bodies there trigger more recursion (MAC fails more often), usually because nearby mass is dense or uneven.
Bodies can still pass through dark regions; they just don’t force the algorithm to traverse as much of the tree there. If you want a pure density map, that would be a different plot (body count per cell).
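Producing the heatmap only requires a small addition to the traversal: count how many nodes each body examines. A sketch of that instrumentation, reusing the earlier illustrative types; in practice the counter is threaded through the real force routine rather than run as a separate pass:

```c
#include <math.h>

// Sketch: count how many quadtree nodes the MAC traversal examines for body i.
// The per-body totals are what the heatmap colors; the force math is unchanged.
long count_node_visits(const Node *node, const Body *bodies, int i, double theta) {
    if (node == NULL || node->mass == 0.0) return 0;

    double dx = node->com_x - bodies[i].x;
    double dy = node->com_y - bodies[i].y;
    double d  = sqrt(dx * dx + dy * dy);

    int is_leaf = (node->body_index >= 0);
    if (is_leaf || (node->size / d) < theta) {
        return 1;                    // accepted (or leaf): traversal stops here
    }

    long visits = 1;                 // this internal node was examined
    for (int q = 0; q < 4; q++) {
        visits += count_node_visits(node->children[q], bodies, i, theta);
    }
    return visits;
}
```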

What I learned
Barnes–Hut made me appreciate approximation as an engineering tool. You are not giving up accuracy. You are shaping where accuracy matters. For many scientific simulations, the uncertainties in inputs dominate the last decimal place of the force computation.
I also learned that data structures are the real algorithm. The quadtree is not an implementation detail. It is the reason the runtime stops exploding.
MPI taught the most painful, most useful lesson: communication is expensive. You can have a perfect parallel compute kernel and still lose to broadcast and gather overhead at small N.
Numerical stability forced discipline. Softening or clamping is not optional if you want robustness. The integrator choice is not a preference. It changes whether your simulation behaves like physics or like a numerical artifact.
Finally, profiling kept me honest. The code that feels slow is not always the code that is slow. Measuring first made the optimization work straightforward.
Limitations
This project is educational-first. The 2D quadtree is simpler to visualize, but real astrophysical simulations are 3D and need an octree. The MPI strategy uses a replicated tree, which is simple and often effective, but it is not memory optimal at very large N or at scale across many nodes.
The MAC is the classic s/d < θ criterion. There are more sophisticated criteria and higher-order multipole expansions that improve accuracy for the same cost. I chose the simplest version because the goal is understanding.
The browser demo is not designed for 100k bodies. It is tuned for interactive teaching, roughly in the 100 to 500 body range.
Future directions
A 3D extension is the natural next step. An octree is conceptually the same, but visualization pushes you toward WebGL or a 3D engine.
GPU acceleration is also compelling. The force computation per body is parallel, but tree traversal is irregular. A practical hybrid approach is to build the tree on the CPU and offload force evaluation with a GPU-friendly tree representation.
Adaptive time stepping is a high leverage improvement. Bodies near massive objects need smaller dt for stability, while distant bodies can take larger steps. Done carefully, it improves both accuracy and speed.
Finally, it would be fun to include algorithm comparisons in the demo. Showing O(N²) next to Barnes–Hut, and then a faster multipole-style method, would make the algorithmic trade-offs visceral.
Glossary
Barnes–Hut: A hierarchical approximation algorithm for N-body interactions that uses a tree to aggregate far-away masses.
Quadtree: A 2D spatial tree that recursively subdivides space into four quadrants.
Octree: The 3D analogue of a quadtree, subdividing into eight children.
MAC: Multipole Acceptance Criterion, the rule that decides when a tree cell is “far enough” to approximate.
θ (theta): The MAC threshold controlling the speed versus accuracy trade-off.
Softening: A modification to the force law that removes the near-zero singularity and stabilizes close encounters.
Symplectic integrator: A time integration method that preserves geometric structure of Hamiltonian systems, typically showing better long-term energy behavior.
Strong scaling: Fixed problem size N, increasing the number of processes P.
Weak scaling: Increasing N proportionally with P to keep per-process work roughly constant.
References
Barnes, J., & Hut, P. (1986). A hierarchical O(N log N) force-calculation algorithm. Nature.
Singh, J. P., et al. (1995). Load balancing and data locality in adaptive hierarchical N-body methods. Journal of Parallel and Distributed Computing.
Warren, M. S., & Salmon, J. K. (1993). A parallel hashed oct-tree N-body algorithm. SC’93.
Github repo: https://github.com/Montekkundan/barnes-hut-nbody-simulation
Demo: https://lab.montek.dev/experiments/12.barnes-hut
Appendices
Appendix A: Pseudocode
This appendix includes the algorithms in a way that you can map directly to the code.
A.1 Quadtree insertion
```
insert(node, body):
    if body outside node bounds: return
    if node is leaf:
        if empty: node.body = body
        else:
            subdivide()
            insert(old_body)
            insert(body)
    else:
        child = quadrant(body)
        insert(child, body)
```
A.2 Center of mass aggregation
```
compute_mass(node):
    if leaf and has body:
        node.mass = body.mass
        node.com = body.pos
    else:
        node.mass = 0
        node.com = (0, 0)
        for child in node.children:
            compute_mass(child)
            node.mass += child.mass
            node.com += child.com * child.mass
        if node.mass > 0:
            node.com /= node.mass
```
A.3 Force computation with MAC
```
force(node, body):
    if node.mass == 0: return
    d = distance(body.pos, node.com)
    if node is leaf:
        if node.body != body: addForce(node.com, node.mass)
    else if node.size / d < theta:
        addForce(node.com, node.mass)
    else:
        for child in node.children: force(child, body)
```
A.4 MPI step
```
for each step:
    broadcast bodies to all ranks
    build quadtree on each rank
    compute forces for my body range
    reduce SUM forces to rank 0
    rank 0 integrates and marks lost bodies
```
Appendix B: Accuracy metric
To quantify accuracy, I compare Barnes–Hut forces to an all-pairs reference at the same positions.
For each body i, define the relative force error:
e_i = ||F_i^(BH) - F_i^(ref)|| / (||F_i^(ref)|| + δ)
δ is a small constant to prevent division by zero when a reference force is near zero.
Then define summary metrics that are easy to report:
Mean relative error over bodies:
E_mean = (1/N) Σ e_i
Max relative error over bodies:
E_max = max_i e_i
If you want a more stable estimate, average these over T steps after a short warmup.
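A sketch of how these summaries can be computed from two flat force arrays (variable names are illustrative):

```c
#include <math.h>

// Sketch: per-body relative force error against the all-pairs reference,
// plus the mean and max summaries defined above. fbh and fref hold (fx, fy)
// pairs for the Barnes–Hut and reference forces respectively.
void force_error(const double *fbh, const double *fref, int n, double delta,
                 double *e_mean, double *e_max) {
    double sum = 0.0, worst = 0.0;
    for (int i = 0; i < n; i++) {
        double dx = fbh[2 * i]     - fref[2 * i];
        double dy = fbh[2 * i + 1] - fref[2 * i + 1];
        double ref_norm = sqrt(fref[2 * i] * fref[2 * i] +
                               fref[2 * i + 1] * fref[2 * i + 1]);
        double e_i = sqrt(dx * dx + dy * dy) / (ref_norm + delta);
        sum += e_i;
        if (e_i > worst) worst = e_i;
    }
    *e_mean = sum / n;
    *e_max  = worst;
}
```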



