Open Source Game Development
Published On: Wednesday, June 27, 2007 | Last Modified On: Wednesday, June 27, 2007
Introduction
A 3D game engine is a complex collection of code. Anyone entering into game development would have to spend at least a year developing a game engine or purchase a pricey game engine to utilize. Of course, another option would be to use an open source engine, but game developers have often shied away from these due to their lack of features and reliability. However, these days there are several open source engines (or low-cost commercial engines) that have a rich set of features and offer stability.
Open source engines, however, do not necessarily have the performance of their more expensive commercial counterparts as they do not always take advantage of the latest features available on the CPU and GPU. The intent of this paper is to go over a few of the most common open source game engines and show how Intel® tools and technologies can bring goodness to open source game development by getting the best possible performance out of these engines.
Game Engine Block Diagram
The block diagram below shows a typical single-player 3D engine and illustrates the complexity of modern game engines. It shows the various subsystems and the dependencies between them. The "tools" portion of the engine (level editors, geometry and animation exporters, scripted event generators, etc.) has been left out for the sake of simplicity.
Figure 1. Block Diagram of a Modern 3D Game Engine
Open Source Game Engines
There are several open source engines available on the Internet. This paper focuses on two of them: OGRE (the Object-Oriented Graphics Rendering Engine) and the Quake* 3 game engine.
The following is a short list of some of the open source game engines that are available for use:
· OGRE, the Object-Oriented Graphics Rendering Engine (www.ogre3d.org*)
· Quake 3 (www.idsoftware.com/business/techdownloads/*)
· Crystal Space* (www.crystalspace3d.org*)
· Irrlicht Engine (http://irrlicht.sourceforge.net*)
The following is a short list of freely available 3D engines that are not open source and may charge a minimal fee for commercial use:
· Ca3D Engine (CA3DE) (www.ca3d-engine.de*)
· Power Render 3D Engine (www.powerrender.com*)
· Torque Game Engine (www.garagegames.com*)
Optimization Activities
Optimizing software for performance need not be a daunting task. There are several techniques that can be implemented to increase the performance of a game engine. While detailed descriptions of all of these items are beyond the scope of this paper, each has several papers written about it, which can be downloaded from http://developer.intel.com. Check the Additional Resources section for links to some of these papers.
Profiling
Profiling is an important step when it comes to optimizing your code. Without profiling, you wouldn’t know what portion of your code to target. A helpful tool for profiling application time and other events is the Intel® VTune™ Performance Analyzer. Among other things, VTune™ can show you the execution breakdown of all the modules on the entire system. By first creating a profile, you won't end up spending time optimizing a section of code that is barely executed and instead target the more heavily executed portion of your code.
SIMD Optimizations
Single Instruction Multiple Data, or SIMD, means applying one instruction to multiple pieces of data at once. Intel's implementation of SIMD is called Streaming SIMD Extensions (SSE). The set of instructions can operate on integer, single-precision floating-point, and double-precision floating-point data. Since these instructions operate on more than one piece of data at a time, they can execute certain algorithms much more quickly than standard single-data instructions.
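As a minimal sketch of the idea, the function below adds two float arrays four elements at a time using SSE intrinsics. The function name addArraysSse is hypothetical, and the sketch assumes the element count is a multiple of four; it uses unaligned loads and stores so the arrays need no special alignment.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Adds two float arrays four elements at a time.
// Assumption: count is a multiple of 4 (no scalar tail loop is shown).
void addArraysSse(const float* a, const float* b, float* out, int count) {
    for (int i = 0; i < count; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);             // load 4 floats from b
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 additions in one instruction
    }
}
```

A scalar loop would perform one addition per iteration; here each iteration performs four.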
Threading
As seen in the previous block diagram, 3D game engines are very complex and stress every aspect of the system. When running, they tend to grow in complexity with the amount of compute power thrown at them. Industry trends in performance have moved from pure frequency to multi-core/multi-processor machines. Intel has already shipped millions of processors with Hyper-Threading Technology—the latest iteration is the dual-core processor (two processor cores in one physical package).
To take advantage of multiple cores, games need to be designed for parallelism. For a given problem domain, concurrency can usually be expressed in terms of:
· The tasks that need to get done (Task Parallelism)
· The data that gets worked on (Data Parallelism)
· The flow of data (Pipelined Parallelism)
Task Parallelism (Functional Decomposition)
A game engine is composed of many complex subsystems. Some of these subsystems can be executed in parallel on multiple threads, giving us Task Parallelism. In most games, these subsystems are often very tightly coupled with many interrelated dependencies. Care must be taken to architect the engine to reduce dependencies amongst the various subsystems, thereby reducing the synchronization needed and maximizing the benefit of parallel execution.
Data Parallelism (Data Decomposition)
Parallelism can also be achieved by data decomposition, in which the same independent operations are applied to different subsections of the data in parallel. Compute-intensive loops, such as visibility determination or game physics/simulation solvers, are ideal candidates for this kind of decomposition. This technique can be applied to different subsystems over time, giving fork-join parallelism.
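A rough sketch of fork-join data decomposition is shown here using standard C++ threads (a substitute for the Win32 or OpenMP mechanisms discussed later in this paper). The function name scaleInParallel and the scaling operation are illustrative assumptions; each thread works on its own disjoint slice, so no synchronization is needed until the join.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Applies the same independent operation (here: scaling) to disjoint slices
// of the data, one slice per thread (fork), then waits for all of them (join).
void scaleInParallel(std::vector<float>& data, float factor, unsigned threadCount) {
    if (threadCount == 0) threadCount = 1;
    const std::size_t chunk = (data.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&data, factor, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] *= factor;
        });
    }
    for (std::thread& w : workers) w.join();
}
```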
Figure 2. Data Flow Parallelism (Producer-Consumer Decomposition)
Often data dependencies exist between tasks that can't be eliminated. For example, the I/O subsystem needs to load world data before it can be processed. One way to work around the data dependency is to decompose the problem using the "Producer-Consumer" approach. The output of the producer (I/O thread) becomes the input of the consumer (compute thread). The consumer thread has to wait until the producer thread has started generating output, so the threads execute serially until the first data is loaded and then run in parallel. This approach is often referred to as Producer-Consumer Decomposition or Pipelined Parallelism.
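The producer-consumer hand-off can be sketched with a small blocking queue. This is an illustration only: it uses standard C++ threads rather than a specific engine's I/O system, and Chunk, ChunkQueue, and runPipeline are hypothetical names standing in for loaded world data and its processing.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in for a block of loaded world data.
struct Chunk { int id; };

class ChunkQueue {
public:
    void push(Chunk c) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(c); }
        ready_.notify_one();
    }
    void close() {  // producer signals that no more data will arrive
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        ready_.notify_all();
    }
    // Blocks until a chunk is available; returns false once closed and drained.
    bool pop(Chunk& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Chunk> queue_;
    bool closed_ = false;
};

// The producer "loads" chunkCount chunks; the consumer sums their ids as a
// placeholder for real processing. Returns the consumer's result.
int runPipeline(int chunkCount) {
    ChunkQueue queue;
    int processedSum = 0;
    std::thread producer([&] {
        for (int i = 1; i <= chunkCount; ++i) queue.push(Chunk{i});
        queue.close();
    });
    std::thread consumer([&] {
        Chunk c{};
        while (queue.pop(c)) processedSum += c.id;
    });
    producer.join();
    consumer.join();
    return processedSum;
}
```

The consumer blocks only while the queue is empty, so once the first chunk arrives both threads can make progress at the same time.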
Figure 3. Threading models for expressing parallelism
Automatic Parallelization
The simplest way to parallelize loops (data decomposition) is to let the compiler do the analysis and parallelization for you by compiling the source with the Intel® C++ Compiler (version 7.0 or greater) and the /Qparallel option. The switch enables the feature, and a few simple guidelines increase the likelihood that the compiler will identify and parallelize a loop: do not modify the loop counter within the loop, avoid branching, and do not access global state from within the loop. Use the -Qpar_report3 switch to have the compiler generate a report on which loops were successfully parallelized and the dependencies that prevent parallelization of others. (-Qpar_report[n] allows varying degrees of reporting.)
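To make the guidelines concrete, here is a sketch of two loops. The function names are hypothetical. The first loop follows the guidelines above; the second carries a dependency from one iteration to the next, which is the kind of obstacle the parallelization report describes. Whether a particular compiler version actually parallelizes either loop is an assumption that should be checked against the report output.

```cpp
#include <cstddef>

// Friendly to automatic parallelization: each iteration reads in[i] and writes
// out[i] only, touches no global state, has no branches, and the loop counter
// is changed only by the loop header.
void scaleAndBias(const float* in, float* out, std::size_t count,
                  float scale, float bias) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = in[i] * scale + bias;
    }
}

// Harder to parallelize automatically: 'running' carries a value from one
// iteration to the next, so iterations cannot simply run independently.
void prefixSum(const float* in, float* out, std::size_t count) {
    float running = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        running += in[i];
        out[i] = running;
    }
}
```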
Compiler-Directed Parallelism with OpenMP*
Automatic parallelization is limited in its ability to generate parallel code. The compiler can do a much better job with user input. This is accomplished with OpenMP* pragmas. The original serial code is left intact and a few #pragmas are sprinkled in. For example, to parallelize a for loop, all you have to do is add '#pragma omp parallel for' above it. The Intel C++ Compiler 7.0 (or greater) and Microsoft Visual Studio* 2005 support OpenMP. The pragmas are ignored by default, so to enable OpenMP-based parallelism with the Intel compiler, the source must be compiled with the /Qopenmp option.
    // Serial version
    for ( i=start; i <= end; i++ ) {
        // do some compute intensive work here
    }

    // Parallel version
    #pragma omp parallel for
    for ( i=start; i <= end; i++ ) {
        // do some compute intensive work here
    }
OpenMP automatically runs multiple copies of the loop body in parallel using a pool of threads, each of which works on a different iteration of the loop. In this example, all the variables except the loop iteration variable are shared. Often threads will need private copies of certain variables in order to avoid data races. Identification of these variables is a hard task to do by inspection, but becomes much easier when building the source project with the Intel® C++ Compiler with the /Qtcheck and /Qopenmp options. After this is done, running the application through Intel® Thread Checker will identify the variables and areas with probable data race conditions that may need to be made private.
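The shared-versus-private distinction can be sketched as follows. The function name sumSquaredDistances is hypothetical. The temporary dx is declared inside the loop body, which gives each thread its own copy, and the accumulator uses an OpenMP reduction clause so that threads do not race on it. If the code is compiled without OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <vector>

// Sums squared distances from 'origin'.
// - dx is declared inside the loop, so every thread gets a private copy.
// - total is combined across threads by the reduction clause instead of
//   being updated as an unprotected shared variable.
double sumSquaredDistances(const std::vector<double>& xs, double origin) {
    double total = 0.0;
    const int count = static_cast<int>(xs.size());
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < count; ++i) {
        const double dx = xs[i] - origin;
        total += dx * dx;
    }
    return total;
}
```

Had dx been declared outside the loop, it would be shared by default and would need a private(dx) clause to avoid a data race.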
Threading with Libraries
Thread libraries (e.g. Win32* threads or POSIX threads, Pthreads) can be used to drive explicitly defined threads. Thread libraries are flexible enough to address most kinds of parallelism discussed above. They are more invasive and time consuming to use, but provide the programmer with flexibility and direct control. The Intel® Thread Checker can be used with programs threaded with the Win32 or Pthreads APIs to provide insight into potential problems like data races and deadlocks. Creating threads is expensive; therefore, if using the threading libraries, it is probably worthwhile to create a pool of threads during initialization, and set events to wake up threads to perform their tasks when necessary. When the threads finish their tasks, they can go back to a resting state, releasing hardware resources for other tasks.
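The pooling idea can be sketched as below. This is an assumption-laden illustration: it uses standard C++ threads and a condition variable in place of Win32 threads and events, and the ThreadPool class and its submit method are hypothetical names, not part of any engine discussed here.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Threads are created once, sleep while idle, and are woken to run tasks.
class ThreadPool {
public:
    explicit ThreadPool(unsigned threadCount) {
        for (unsigned i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    // Finishes any queued tasks, then stops and joins the workers.
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(mutex_); stopping_ = true; }
        wake_.notify_all();
        for (std::thread& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mutex_); tasks_.push(std::move(task)); }
        wake_.notify_one();  // wake one resting thread
    }
private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Resting state: the thread blocks here and uses no CPU time.
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};
```

The cost of thread creation is paid once in the constructor; submitting a task afterwards only involves a queue push and a wake-up.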
Intel® Thread Profiler can be used to analyze and maximize performance of applications using OpenMP and Win32 threads.
Quake* 3
Profiling
So, where do we begin threading? Profiling, Profiling, Profiling. A good understanding of application behavior is needed before we can decide what type of parallelism is best suited for a particular subsystem.
Sampling the application with the Intel® VTune™ Performance Analyzer will give us a detailed analysis of the amount of CPU time being used by each module and function. This information provides useful insight into doing a functional decomposition, meaning it will help us decide what portions of the engine we would want to run on multiple threads in parallel to maximize performance.
Generating a call graph of the application running a specific workload using the Intel® VTune™ Performance Analyzer will give us a detailed analysis showing the self time (time being spent in a function) and total time (time spent in a function and all the functions called within it) at each node, along with the call hierarchy. This information, along with the sampling data, can provide useful insight into where it's best along the call tree to do a data decomposition.
Sampling Quake* 3 with the Intel VTune Performance Analyzer in timedemo mode, and ignoring the demo load time, we get the following data:
Thread | Process | Clockticks | Instructions Retired | %Clockticks |
thread74 | quake3.exe | 8,437 | 8,034 | 99.98 |
thread103 | quake3.exe | 2 | 1 | 0.02 |
This table shows that even though Quake 3 spawns two threads, most of the work (99.98%) is done by a single thread. Delving further into the modules executing on that thread, we can see that 51% of the time is being spent in the graphics driver and about 32% of the time is being spent in the engine:
Module | Process | Clocktick samples | %Clockticks | %Instructions Retired | CPI |
ati2dvag.dll | quake3.exe | 4,308 | 51.06 | 52.44 | 1.027 |
quake3.exe | quake3.exe | 2,719 | 32.23 | 36.05 | 0.894 |
atioglxx.dll | quake3.exe | 599 | 7.1 | 5.65 | 1.319 |
Other32 | quake3.exe | 352 | 4.17 | 2.36 | 1.853 |
ntdll.dll | quake3.exe | 129 | 1.53 | 1.56 | 1.032 |
hal.dll | quake3.exe | 163 | 1.93 | 0.71 | 2.86 |
ntoskrnl.exe | quake3.exe | 55 | 0.65 | 0.19 | 3.667 |
Threading
The best place to do a functional decomposition would be to move the renderer onto a separate thread. This involves moving all the qgl (Quake GL) calls over to a single thread, as the graphics drivers are not yet thread-safe and do not handle calls from shared contexts on different threads very well. The front-end prepares a frame and the back-end renders it on a separate thread. Some of this functionality is in place in the Quake 3 source base, but it is not enabled. Managing object lifetimes can also be challenging, as objects cannot be deleted until the back-end is done rendering them. Double-buffering the data needed every frame is necessary so that the front-end can prepare the next frame while the back-end renders the previous one.
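The double-buffering scheme might be sketched like this. It is not taken from the Quake 3 source: FrameData, prepareFrame, renderFrame, and runFrames are hypothetical names, the "rendering" just records frame numbers, and a thread is created per frame purely to keep the sketch short (a real engine would keep one persistent render thread).

```cpp
#include <thread>
#include <vector>

// Hypothetical per-frame data handed from the front-end to the back-end.
struct FrameData {
    int frameNumber = 0;
    std::vector<int> drawList;
};

// Front-end: fills a buffer with everything needed to draw frame n.
void prepareFrame(FrameData& frame, int n) {
    frame.frameNumber = n;
    frame.drawList.assign(3, n);
}

// Back-end: stand-in for issuing all graphics calls for one frame.
void renderFrame(const FrameData& frame, std::vector<int>& rendered) {
    rendered.push_back(frame.frameNumber);
}

// Renders frameCount frames; returns the frame numbers in render order.
std::vector<int> runFrames(int frameCount) {
    FrameData buffers[2];
    std::vector<int> rendered;
    if (frameCount <= 0) return rendered;
    prepareFrame(buffers[0], 0);
    for (int n = 0; n < frameCount; ++n) {
        const FrameData& current = buffers[n % 2];
        FrameData& next = buffers[(n + 1) % 2];
        // Back-end renders the current buffer on its own thread...
        std::thread backEnd([&] { renderFrame(current, rendered); });
        // ...while the front-end fills the other buffer for the next frame.
        if (n + 1 < frameCount) prepareFrame(next, n + 1);
        backEnd.join();  // frame boundary: the buffers swap roles after this
    }
    return rendered;
}
```

Because the two threads never touch the same buffer within a frame, no locking is needed apart from the synchronization at the frame boundary.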
Ogre* 3D
Profiling
For the data gathered below, we generated profiles for four of the demos available for Ogre*. These four demos highlight how applications can use Ogre differently and stress different parts of the engine. The following tables show the four different demo profiles:
By looking at these four profiles you can immediately see that they do indeed stress different parts of the engine. OgreOde focuses on the physics simulation which is located in OgreOde_Core.dll; DynTex does some very CPU-intensive calculations and thus most of its time is spent within its own module; Fresnel makes heavy use of the Ogre API; and CelShading causes the video card driver to maximize its processing.
This is why profiling an application is important. It will tell you what application areas to target, and prevents you from spending extra time optimizing a section of code that will barely be used.
Now that you have the profiles and know which items consume the most cycles, it's time to optimize them. The following section will discuss how to use the Intel® Streaming SIMD Extensions instructions to speed up calculations, and the Ogre threading section will point out how to thread some items to get speed benefits on Intel processors with Hyper-Threading Technology or Dual-Core Intel® Xeon® Processors.
SIMD Optimizations
Previously, we took profiles of the four Ogre demos and got vastly different results on which module was spending the most time executing on the CPU. This section describes how to convert the core CPU-intensive areas to SSE and SSE2. Of the four demos, DynTex and Fresnel would benefit the most from SIMD optimizations.
Let's take a look at DynTex. According to VTune™, most time is spent in the runStep function located in the DynTex.cpp file. There are several blocks of code in this function that could be converted to SSE, but we will focus on the code that calculates the chemical reaction, beginning at line 170 in the cpp file (shown in the code snippet below).
This block of code can be easily optimized using SSE2. All the data is lined up in an array that can be loaded four-at-a-time using SIMD. The following code snippets show a comparison between the original source, and how you might want to optimize the loop. Please take into account that the SIMD code is assuming properly aligned memory and a loop iteration count that is divisible by four.
    // Reaction (Grey-Scott)
    idx = reactorExtent+1;
    int U,V;

    for ( y=0; y < reactorExtent-2; y++ ) {
        for( x=0; x < reactorExtent-2; x++ ) {
            U = chemical[0][idx];
            V = chemical[1][idx];
            int UVV = MULT( MULT( U, V ), V );
            delta[0][idx] += -UVV + MULT( F, (1<<16)-U );
            delta[1][idx] += UVV - MULT( F+k, V );
            idx++;
        }
        idx += 2;
    }
Scalar C Code
    #include <emmintrin.h>   // SSE2 intrinsics

    // Low 32 bits of each of the four 32-bit products, built from SSE2 instructions.
    // Note: SSE4.1 later added an intrinsic with this same name; rename this
    // helper if the SSE4.1 header is also included.
    __m128i _mm_mullo_epi32( __m128i a, __m128i b )
    {
        __m128i t0;
        __m128i t1;

        // 64-bit products of elements 0 and 2, then of elements 1 and 3
        t0 = _mm_mul_epu32(a,b);
        t1 = _mm_mul_epu32( _mm_shuffle_epi32( a, 0xB1 ),
                            _mm_shuffle_epi32( b, 0xB1 ) );

        // Move the low halves of both products into the lower two elements
        t0 = _mm_shuffle_epi32( t0, 0xD8 );
        t1 = _mm_shuffle_epi32( t1, 0xD8 );

        // Interleave to restore the original element order
        return _mm_unpacklo_epi32( t0, t1 );
    }

    // 16.16 fixed-point multiply on four elements at once
    #define _MM_MULT( X, Y ) _mm_srai_epi32( _mm_mullo_epi32( X, Y ), 16 )

    // Reaction (Grey-Scott)
    idx = reactorExtent+1;
    const __m128i xNeg1 = _mm_set1_epi32( -1 );
    const __m128i x1 = _mm_set1_epi32( 1 );
    const __m128i x10000 = _mm_set1_epi32( 1 << 16 );
    __m128i xU, xV;
    __m128i xK = _mm_set1_epi32( k );
    __m128i xF = _mm_set1_epi32( F );

    for ( y=0; y < reactorExtent-2; y++ ) {
        for ( x=0; x < reactorExtent-2; x+=4 ) {
            __m128i xDelta0 = _mm_load_si128( (__m128i*)&delta[0][idx] );
            __m128i xDelta1 = _mm_load_si128( (__m128i*)&delta[1][idx] );
            xU = _mm_load_si128( (__m128i*)&chemical[0][idx] );
            xV = _mm_load_si128( (__m128i*)&chemical[1][idx] );
            __m128i xUVV = _MM_MULT( _MM_MULT( xU, xV ), xV );
            // -UVV computed as two's complement: (~UVV) + 1
            __m128i xNegUVV = _mm_add_epi32( _mm_xor_si128( xUVV, xNeg1 ), x1 );
            xDelta0 = _mm_add_epi32( xDelta0,
                          _mm_add_epi32( xNegUVV,
                              _MM_MULT( xF, _mm_sub_epi32( x10000, xU ) ) ) );
            xDelta1 = _mm_add_epi32( xDelta1,
                          _mm_sub_epi32( xUVV,
                              _MM_MULT( _mm_add_epi32( xF, xK ), xV ) ) );
            _mm_store_si128( (__m128i*)&delta[0][idx], xDelta0 );
            _mm_store_si128( (__m128i*)&delta[1][idx], xDelta1 );
            idx+=4;
        }
        idx += 4;
    }
SIMD Intrinsics Code
If you're new to SIMD programming, this code could seem somewhat complex. The SIMD instructions used operate on four integers at a time, which is why the loop counter x and the index idx are incremented in steps of four. The one curious item is why the statement idx+=2 in the outer for-loop became idx+=4. This is because the algorithm uses aligned loads and stores, which are faster but require addresses on a 16-byte boundary. The grids used in these calculations now need some buffer space at the end of each row so that the next row of loads starts on a 16-byte aligned boundary. All code that accesses these grids has to take this buffer space into account.
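The row padding described above can be sketched as a pair of small helpers. The names paddedStride, gridIndex, and isAligned16 are hypothetical, and the sketch assumes 32-bit integers (four per 16-byte register) and a grid whose first element is itself 16-byte aligned.

```cpp
#include <cstddef>
#include <cstdint>

// Number of 32-bit integers that fit in one 16-byte SSE register.
constexpr std::size_t kLanes = 4;

// Rounds a row width up to the next multiple of four elements, so that every
// row of the grid starts on a 16-byte boundary when the grid itself does.
constexpr std::size_t paddedStride(std::size_t width) {
    return (width + kLanes - 1) / kLanes * kLanes;
}

// Index of element (x, y) in a grid stored with padded rows. Any code that
// touches the grid must use the stride, not the logical width.
constexpr std::size_t gridIndex(std::size_t x, std::size_t y, std::size_t stride) {
    return y * stride + x;
}

// True if p is 16-byte aligned, as aligned SSE2 loads and stores require.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

For example, a row that is logically 126 elements wide would be stored with a stride of 128, leaving two unused elements at the end of each row.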
Of course, you could always try using the Intel® C++ Compiler instead of writing the SIMD code yourself. For blocks of code like this, the Intel compiler will typically catch it and vectorize it for you with SIMD (assuming you set the appropriate switch in the compiler). It will even cover all the cases of unaligned memory and an uneven (not divisible by four) number of iterations. However, with this code, the compiler will not be able to properly align the memory during allocation, causing most loads to be unaligned, so it would still be best to do this one by hand.
Fresnel is somewhat different in that the code using most of the CPU time is located in Ogre itself, within the OgreMain DLL. It is also floating-point intensive, so it will benefit from SSE, whereas DynTex was integer-intensive and used SSE2 instructions. The function that consumes the most time is softwareVertexBlend, located in the OgreMesh.cpp file.
A common problem with vectorizing 3D code is the order of the data. Consider a vertex in 3D coordinates. It would normally be stored as 3 floats (labeled x, y, z) contiguously in memory, so an array of vertices looks like: X, Y, Z, X, Y, Z, X, Y, Z. When vectorizing this, it is not possible to evenly load the coordinates into a SIMD register (which operates on 4 floats at a time). Ogre suffers from this vectorization problem.
Four possible ways to do SIMD operations on this code are:
· Fill a SIMD register to ¾ capacity
· Load the SIMD register with a mix of vertex information
· Transpose the data into the correct format
· Rearrange the data structures to be SIMD-friendly
The first is filling a SIMD register to ¾ capacity. This requires two loads, since there is no way to load 3 floats at a time: first load 1 float (e.g. X) and then load the other 2 floats (e.g. Y and Z). One slot of the register is left empty, so you get only ¾ of the potential processing power.
The second method operates on 4 items at a time, but each register holds a mixed bag of X, Y, and Z. Loading four consecutive vertices fills three registers as X|Y|Z|X, Y|Z|X|Y, and Z|X|Y|Z. You need to operate on four X, Y, Z vectors at a time to do this effectively, and it can pose a challenge depending on what kind of calculations you need to do on the data.
The third method loads four X, Y, Z vectors and then transposes the data so that you get X|X|X|X, Y|Y|Y|Y, and Z|Z|Z|Z in the SIMD registers. Again, you need four vectors to work with.
The final method is to re-order the data structures so that instead of being X, Y, Z, each structure holds an array of four values for each coordinate, as such:
    struct Vector3D
    {
        float x[4];
        float y[4];
        float z[4];
    };
While this does waste some space at the end for items that aren't a multiple of four vectors, it does speed up SIMD calculations quite significantly.
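The transposition method and the payoff of the SIMD-friendly layout can be sketched together. This is an illustration rather than Ogre code: Vec3x4, loadTransposed, and lengthSquared4 are hypothetical names, and the input is assumed to be four tightly packed X, Y, Z vertices (12 floats).

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Four vertices in SIMD-friendly form: one register per coordinate.
struct Vec3x4 {
    __m128 x, y, z;
};

// Loads four packed vertices (x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3) and
// transposes them into X|X|X|X, Y|Y|Y|Y, and Z|Z|Z|Z registers.
Vec3x4 loadTransposed(const float* xyz) {
    const __m128 a = _mm_loadu_ps(xyz + 0);  // x0 y0 z0 x1
    const __m128 b = _mm_loadu_ps(xyz + 4);  // y1 z1 x2 y2
    const __m128 c = _mm_loadu_ps(xyz + 8);  // z2 x3 y3 z3
    const __m128 x2y2x3y3 = _mm_shuffle_ps(b, c, _MM_SHUFFLE(2, 1, 3, 2));
    const __m128 y0z0y1z1 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 2, 1));
    Vec3x4 v;
    v.x = _mm_shuffle_ps(a, x2y2x3y3, _MM_SHUFFLE(2, 0, 3, 0));         // x0 x1 x2 x3
    v.y = _mm_shuffle_ps(y0z0y1z1, x2y2x3y3, _MM_SHUFFLE(3, 1, 2, 0));  // y0 y1 y2 y3
    v.z = _mm_shuffle_ps(y0z0y1z1, c, _MM_SHUFFLE(3, 0, 3, 1));         // z0 z1 z2 z3
    return v;
}

// Squared length of four vertices at once; every lane does useful work.
void lengthSquared4(const Vec3x4& v, float* out) {
    const __m128 xx = _mm_mul_ps(v.x, v.x);
    const __m128 yy = _mm_mul_ps(v.y, v.y);
    const __m128 zz = _mm_mul_ps(v.z, v.z);
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(xx, yy), zz));
}
```

If the data is already stored with four values per coordinate, the shuffles disappear and each coordinate can be loaded directly, which is where the speed-up of the re-ordered layout comes from.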
Threading OGRE3D
We are now going to focus on the other two demos that were profiled in the profiling section, OgreOde and CelShading. By threading the correct locations in OgreMain, a significant speed up can be seen in these two demos (assuming that there is more than 1 microprocessor present in the system). The two demos would require the threading to be done in two different locations. For the OgreOde demo it would be done in the Listener class (either within Ogre or the application) and for the CelShading demo the threading would have to take place in the rendering subsystem.
There are two possible ways to thread the OgreOde demo. Ogre has what is called a Listener class, which a program overrides and which the Ogre engine calls for certain events, such as pre-render and post-render. To make the threading more general-purpose, you would want to thread the area in Ogre where the dispatching of the calls is done. That can be found in the Ogre::Root class in the OgreRoot.cpp file in two functions: Root::_fireFrameStarted and Root::_fireFrameEnded. The other method would be to thread the calls to OgreOde within the demo itself.
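Threading the dispatch point could look roughly like the sketch below. It does not use Ogre's actual classes or signatures: Listener, CountingListener, and fireFrameStartedParallel are simplified, hypothetical stand-ins, and the sketch assumes the listeners do not share mutable state with one another. It also creates a thread per listener for brevity, where a pool of pre-created threads would avoid the creation cost.

```cpp
#include <thread>
#include <vector>

// Hypothetical, simplified listener interface (not Ogre's real one).
struct Listener {
    virtual ~Listener() = default;
    virtual void frameStarted(float timeSinceLastFrame) = 0;
};

// Example listener that just records what it was told.
struct CountingListener : Listener {
    int calls = 0;
    float lastDelta = 0.0f;
    void frameStarted(float timeSinceLastFrame) override {
        ++calls;
        lastDelta = timeSinceLastFrame;
    }
};

// Dispatches frameStarted to every listener concurrently and waits for all
// of them to finish before the frame continues.
void fireFrameStartedParallel(const std::vector<Listener*>& listeners, float dt) {
    std::vector<std::thread> threads;
    threads.reserve(listeners.size());
    for (Listener* listener : listeners) {
        threads.emplace_back([listener, dt] { listener->frameStarted(dt); });
    }
    for (std::thread& t : threads) t.join();
}
```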
The CelShading demo's performance (in this case) can be improved by threading the rendering. Threading the renderer has its own issues that need to be dealt with beyond those of threading the listener. The renderer can be found in the Ogre::RenderSystem class located in the OgreRenderSystem.h file. Upon examining the file you will notice that there are several pure virtual functions. These functions are implemented by plug-ins for the different 3D graphics APIs. It is here that you would need to insert another layer that creates a thread and the accompanying synchronization for making calls into the plug-in.
If you do decide to place the renderer on its own thread you need to realize that you can no longer make API calls directly into it. All commands will have to be issued through a synchronized access queue from the main code to the graphics API. Something to watch out for is to not delete the graphics resources that the render system is still using. All objects in Ogre have graphics resources associated with them that the graphics card uses for rendering. It is possible (and not uncommon) that all instances of a certain type of object have been deleted and therefore the corresponding graphics resource used to describe it is no longer needed. Since the rendering is happening concurrently, it still might be using that resource and therefore the resource cannot be deleted until after the end of the frame.
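One way to handle the resource lifetime problem is to defer deletions until the frame has finished. The sketch below is an assumption about how this might be organized, not Ogre's mechanism: DeferredReleaseQueue and its methods are hypothetical, and the release actions are plain callbacks standing in for freeing a graphics resource.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Collects release requests made during a frame and runs them only after the
// renderer has finished with that frame.
class DeferredReleaseQueue {
public:
    // Called by the main code instead of deleting a resource immediately.
    void requestRelease(std::function<void()> releaseFn) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(releaseFn));
    }

    // Called once the render thread has signaled the end of the frame.
    // Returns the number of resources released.
    std::size_t flushAtEndOfFrame() {
        std::vector<std::function<void()>> toRun;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            toRun.swap(pending_);
        }
        for (std::function<void()>& releaseFn : toRun) releaseFn();
        return toRun.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::function<void()>> pending_;
};
```

Requests made while a frame is in flight are only recorded; nothing is freed until flushAtEndOfFrame runs, at which point the renderer can no longer be using the resources.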
Conclusion
For some games, open source engines are a viable alternative to the expensive commercial game engines available on the market. In recent years they have gotten much more stable and provide a rich set of features that should satisfy all but the most demanding of games for the hard-core gamer. While they may be somewhat behind in performance, with the tools and technologies discussed in this paper those concerns should be readily alleviated. Intel® VTune™ will easily target potential areas for optimizations to eliminate time wasted on non-critical sections of code; Intel's SIMD instructions will speed-up execution of certain blocks of code by operating on more than one piece of data at a time; and threading will divide program execution to take advantage of the available processors in a system to quicken execution time.