
Lately I have had the opportunity to speak about game development at quite a few conferences, and I always try to sneak in a few minutes about data-oriented design. I feel it is very important, and in this post I will explain why.

The first thing a programmer must understand is that memory is slow, and the way you code affects how efficiently it is used. In most programs the CPU sits idle waiting for memory, because the data is laid out in a way that has nothing to do with the order in which it is operated on.

The easiest way to see that is with an example. See this code:

char data[1000000]; // One Million bytes
unsigned int sum = 0;

for ( int i = 0; i < 1000000; ++i )
{
  sum += data[ i ];
}

An array of one million bytes is declared and iterated over byte by byte. Now let's change things a little to gain some understanding of the hardware.

char data[16000000]; // Sixteen Million bytes
unsigned int sum = 0;

for ( int i = 0; i < 16000000; i += 16 )
{
  sum += data[ i ];
}

The array is changed to contain sixteen million bytes, and we iterate over one million of them, stepping 16 bytes at a time. It seems like there should be no effect on performance, as the code translates to the same number of instructions, but that is not the case. Here is the difference graph, on a logarithmic scale.

Benchmark Graph

The average difference in performance is 5x, and it is consistent whether iterating over 1,000 bytes or a million, sometimes growing to 7x. This is a serious difference that must be accounted for in programs.

Note: The benchmark was run on multiple machines, including a desktop with an Intel 5930K 3.50 GHz CPU, a MacBook Pro Retina laptop with a 2.6 GHz Intel i7 CPU, and Android Nexus 5 and Nexus 6 devices. The differences in results were consistent across all of them.

Also note that if you wish to replicate this test, make sure to build in release mode. Depending on the compiler and how the array is initialized, parts of it may already be in the cache, so you might need to ensure it is evicted before timing the loop.
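If you want to replicate the measurement, a minimal timing harness can look like this. It is only a sketch using std::chrono; the function names here are mine, not from the original benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sum `count` elements from `data`, stepping `stride` bytes between reads.
unsigned int SumWithStride( const unsigned char* data, std::size_t count, std::size_t stride )
{
  unsigned int sum = 0;
  for ( std::size_t i = 0; i < count * stride; i += stride )
  {
    sum += data[ i ];
  }
  return sum;
}

// Time a single call in microseconds.
template < typename Fn >
long long TimeMicroseconds( Fn&& fn )
{
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count();
}
```

Allocate a 16 MB buffer, then compare `TimeMicroseconds` for `SumWithStride( data, 1000000, 1 )` against `SumWithStride( data, 1000000, 16 )`; both perform the same one million additions, only the access pattern differs.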

Explanation

What happens in the example is quite simple. The program tells the CPU to run an instruction on part of the array, but that data isn't yet in the CPU's small, fast local memory, called the cache. The CPU waits while a fixed-size block of memory, a cache line, is copied into the cache, starting from the requested address of data[ i ]. For this example assume an L1 cache line of 16 bytes, meaning 16 bytes are copied per fill (common hardware today uses 64-byte lines; the principle is the same). This event is called a cache miss.

In the first code example, the program next operates on the following byte, which has already been copied into the L1 cache, so it continues without another cache miss; the same holds for the 14 bytes after that. Only after consuming the 16 bytes loaded by the first cache miss does the loop miss again, and the CPU waits while the next 16 bytes are copied into the L1.

In the second code sample the loop skips 16 bytes at a time, but that does not change how the hardware works. The cache still copies 16 consecutive bytes on each cache miss, which means the loop triggers a cache miss on every iteration!

Cache misses

The CPU waiting for data due to a cache miss every iteration makes the code run 5-7 times slower!

Note: Variables aren't the only thing affected by the cache. Functions also live in memory, so using inheritance and calling virtual functions can trigger cache misses too!

Real World Example

Now that we know what's going on in the hardware, let's see how we can put it into real world practice.

When we develop we tend to think about our end goal, and we are taught to break it down into objects and systems. We create objects to describe different functionalities and give them methods to do the work; this is Object-Oriented Programming (OOP). In game development, a physics object is a common example.

In many games the purpose of a physics object is to describe how an entity interacts with the world. This end goal brings a common design of a base object that includes basic data such as position, velocity, mass and drag. It is also common to add methods to manipulate the object, such as ApplyForce and Simulate.

class PhysicsObject
{
public:
  void ApplyForce( const Vec3& force );
  void Simulate( const float delta_time );

private:
  Vec3 mPosition;
  Vec3 mVelocity;

  float mMass;
  float mDrag;
};

It is also common to extend PhysicsObject to describe the object's shape for collision detection. The class is inherited by SphereObject, BoxObject, CapsuleObject and others, each child adding variables that describe its shape.

class BoxObject : public PhysicsObject
{
private:
  Vec3 mExtents;
};

However this added data brings us back to our code sample from before and we suddenly find ourselves losing more performance in code sections that worked well before. For example:

void PhysicsObject::Simulate( const float delta_time )
{
  mPosition += mVelocity * delta_time; // Keeping it simple!
}

// Somewhere else...
{
  // ..
  std::vector< PhysicsObject* > objects;
  const float delta_time = 1.0f / 60.0f;

  for ( auto it : objects )
  {
    it->Simulate( delta_time );
  }
}

The Simulate function uses only the position and velocity variables; all the other members sit between one object's hot data and the next, taking up valuable space in the cache and forcing a cache miss at every iteration.

Structures of Arrays

The solution is to design systems in a way that makes sense when transforming the data. Instead of using arrays of structures (AoS) use structures of arrays (SoA) tightly packed and optimized for CPU consumption.

class PhysicsSystem
{
public:
  void SimulateObjects( const float delta_time );

private:
  size_t mNumObjects;
  std::vector< Vec3 > mPositions;
  std::vector< Vec3 > mVelocities;

  // ...
};

void PhysicsSystem::SimulateObjects( const float delta_time )
{
  for ( size_t i = 0; i < mNumObjects; ++i )
  {
    mPositions[ i ] += mVelocities[ i ] * delta_time;
  }
}

Note that you can still use arrays of structures where it makes sense, when you know you are most likely going to use all of the structure's data at the same time. In fact, a Vec3 is such a structure. You might want to create a structure of the attributes used when calculating acceleration and velocity from an applied physical force.

struct PhysicsAttributes
{
  float mMass;
  float mDrag;
  // ...
};
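As a sketch of how such a grouped structure might be consumed inside the system's update, assuming forces have been accumulated into their own array. The Vec3f stand-in, the function name and the simple drag model here are my assumptions, not from the article.

```cpp
#include <cstddef>
#include <vector>

// Minimal stand-in for the article's Vec3.
struct Vec3f { float x, y, z; };

struct PhysicsAttributes
{
  float mMass;
  float mDrag;
};

// Integrate velocities from accumulated forces, reading each object's
// attributes from their own tightly packed array.
void IntegrateVelocities( const std::vector< Vec3f >& forces,
                          const std::vector< PhysicsAttributes >& attributes,
                          std::vector< Vec3f >& velocities,
                          const float delta_time )
{
  for ( std::size_t i = 0; i < velocities.size(); ++i )
  {
    const float inv_mass = 1.0f / attributes[ i ].mMass;
    const float damping  = 1.0f - attributes[ i ].mDrag * delta_time;

    velocities[ i ].x = ( velocities[ i ].x + forces[ i ].x * inv_mass * delta_time ) * damping;
    velocities[ i ].y = ( velocities[ i ].y + forces[ i ].y * inv_mass * delta_time ) * damping;
    velocities[ i ].z = ( velocities[ i ].z + forces[ i ].z * inv_mass * delta_time ) * damping;
  }
}
```

The loop touches only the three arrays it actually needs, so every cache line pulled in is fully used.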

Final Words

When designing your systems, be aware that virtual functions can also cause cache misses, because code resides in memory too. Creating a parent GameObject whose base methods such as Update are inherited and overridden can have a harmful effect on performance, even if everything else is designed properly. You can read more about this in my post Misconceptions of Component-Based Entity Systems.
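To make the warning concrete, here is a minimal sketch of that pattern (the GameObject and Crate names are hypothetical). Every iteration chases a pointer to the object and then a pointer through its vtable, two chances for a cache miss per entity, on top of the objects being scattered across the heap.

```cpp
#include <memory>
#include <vector>

// The pattern being warned against: a base object with a virtual Update
// that every entity overrides.
class GameObject
{
public:
  virtual ~GameObject() = default;
  virtual void Update( float delta_time ) = 0;
};

class Crate : public GameObject
{
public:
  void Update( float delta_time ) override { mAge += delta_time; }
  float mAge = 0.0f;
};

// Each call dereferences the object pointer and its vtable pointer;
// neither the objects nor the code they dispatch to are packed together.
void UpdateAll( std::vector< std::unique_ptr< GameObject > >& objects, float delta_time )
{
  for ( auto& object : objects )
  {
    object->Update( delta_time );
  }
}
```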

Data-oriented design also makes multi-threading your systems much easier: it is a lot simpler to parallelize a for loop when you know exactly what will happen to the data and where. I recommend trying Intel Threading Building Blocks, which provides a good work-stealing task scheduler and lambda-based loop parallelization.
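The idea can be sketched without TBB as well, by splitting an SoA loop into disjoint index ranges, one per standard thread. This is only a simplified stand-in for what parallel_for does, without work stealing or load balancing.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Split the index range into contiguous chunks, one per worker thread.
void SimulatePositionsParallel( std::vector< float >& positions,
                                const std::vector< float >& velocities,
                                const float delta_time,
                                const std::size_t num_threads )
{
  std::vector< std::thread > workers;
  const std::size_t count = positions.size();
  const std::size_t chunk = ( count + num_threads - 1 ) / num_threads;

  for ( std::size_t t = 0; t < num_threads; ++t )
  {
    const std::size_t begin = t * chunk;
    const std::size_t end   = ( begin + chunk < count ) ? begin + chunk : count;
    if ( begin >= end ) break; // more threads than work

    // Each thread writes a disjoint slice, so no synchronization is needed.
    workers.emplace_back( [&, begin, end]
    {
      for ( std::size_t i = begin; i < end; ++i )
      {
        positions[ i ] += velocities[ i ] * delta_time;
      }
    } );
  }

  for ( auto& worker : workers )
  {
    worker.join();
  }
}
```

Because the transform is known up front and the slices never overlap, the loop parallelizes trivially; the equivalent over a vector of heap-allocated objects with virtual Simulate calls would be much harder to reason about.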

Furthermore, I didn't go into extensive detail about the hardware and the different levels of the cache. Beyond the L1 it is typical to have an L2 cache, larger but slower, which acts as an intermediate between RAM and the L1.

My call to action for this post:

  • Know your data and where it is being transformed, design accordingly.
  • Design for the many, not the one. You will most likely have many objects manipulated the same way, and rarely or never a single object.

References

DICE's Introduction to Data Oriented Design
Adventures in data-oriented design – Part 1: Mesh data
Mike Acton's Typical C++ Bullshit reviewing OgreNode.cpp from Ogre Engine
On Mike Acton’s review of OgreNode.cpp
Data-Oriented Design (Or Why You Might Be Shooting Yourself in The Foot With OOP)
