23. Optimization and profiling

Once you have a large game with many large 3D models, you will probably start to wonder about its speed and memory usage.

1. Watch Frames Per Second

The main measure of your game speed is the Frames Per Second. The engine automatically keeps track of this for you. Use TCastleControl.Fps or TCastleWindow.Fps to get an instance of TFramesPerSecond, which holds two important numbers: TFramesPerSecond.FrameTime and TFramesPerSecond.RealTime. We will explain the difference between FrameTime and RealTime shortly.

How to show them? However you like:

  • If you use TCastleWindow, you can trivially enable TCastleWindow.FpsShowOnCaption to show FPS on your window caption.
  • You can show them on a Lazarus label or form caption. Just be sure not to update them too often: updating normal Lazarus controls all the time may slow your OpenGL context drastically. The same warning applies to writing them to the console with Writeln: don't call it too often, or your rendering will be slower. It's simplest to use a Lazarus TTimer to update the display only once per second or so. In fact, these properties show an average from the last second, so there is no reason to redraw them more often.
  • You can also simply display them on an OpenGL context (see the example about designing your own TGame2DControls in an earlier chapter).
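As a minimal sketch of the "don't print too often" advice, the helper below writes both values to the console at most once per second. The name ReportFpsOncePerSecond is hypothetical, and it is an assumption that TFramesPerSecond is declared in the CastleTimeUtils unit and that FrameTime and RealTime are floating-point properties in your engine version; call it from wherever your game updates.

```pascal
uses SysUtils, CastleTimeUtils; // CastleTimeUtils: assumed home of TFramesPerSecond

var
  LastReportMs: QWord = 0; // tick count (milliseconds) of the previous report

{ Hypothetical helper: prints both FPS values, but at most once per second,
  so the console output itself does not slow down rendering. }
procedure ReportFpsOncePerSecond(const Fps: TFramesPerSecond);
var
  NowMs: QWord;
begin
  NowMs := GetTickCount64;
  if NowMs - LastReportMs >= 1000 then
  begin
    LastReportMs := NowMs;
    Writeln(Format('FPS: frame time %f, real time %f',
      [Fps.FrameTime, Fps.RealTime]));
  end;
end;
```

For example, you could call ReportFpsOncePerSecond(Window.Fps) from your update handler.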

1.1. How to interpret Frames Per Second values?

There are two FPS values available: frame time and real time. Frame time is usually the larger one. Larger is better, of course: it means that you have smoother animation.

Use "real time" to measure your overall game speed. This is the actual number of frames per second that we managed to render. Caveats:

  • Make sure to turn off the "limit FPS" feature, to get the maximum number available. Use the view3dscene "Preferences -> Frames Per Second" menu item, or (in your own programs) change the LimitFPS global variable (if you use the CastleControl unit with Lazarus) or change Application.LimitFPS (if you use the CastleWindow unit). Set them to zero to disable the "limit FPS" feature.

  • Make sure to have an animation that constantly updates your screen. E.g. keep the camera moving, or have something animated on the screen, or set TCastleWindow.AutoRedisplay to true. Otherwise, we will not refresh the screen (there is no point in redrawing the same thing), and "real time" will drop to almost zero if you look at a static scene.

  • Note that the monitor will actually drop frames above its refresh rate, like 80. This may cause you to observe that above some limit, FPS are easier to gain by optimizations, which may lead you to a false judgement about which optimizations are more useful than others. To make a valid judgement about what is faster or slower, always compare two versions of your program in which only the relevant thing changed and nothing else.
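The measuring setup described by these caveats could look like the sketch below, assuming the CastleWindow unit and the properties named in this chapter. The program name is hypothetical, and scene loading is omitted, so the window only shows an empty view.

```pascal
program MeasureFps; // hypothetical program name
uses CastleWindow;
var
  Window: TCastleWindow;
begin
  Window := TCastleWindow.Create(Application);
  Window.FpsShowOnCaption := true; // show FPS on the window caption
  Window.AutoRedisplay := true;    // redraw constantly, even when nothing changes
  Application.LimitFPS := 0;       // zero disables the "limit FPS" feature
  Window.OpenAndRun;
end.
```

Remember to restore your usual LimitFPS value after measuring, since running without the limit keeps the CPU and GPU fully busy.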

"Frame time" measures how many frames we would get if we ignored the time spent outside OnRender events. Use "frame time"... with caution. But it's often useful to compare it with "real time" (with the LimitFPS feature turned off), as it may then tell you whether the bottleneck is in rendering or outside of rendering (like collision detection and creature AI). Caveats:

  • Modern GPUs work in parallel to the CPU. So "how much time CPU spent in OnRender" doesn't necessarily relate to "how much time GPU spent on performing your drawing commands".

So making your CPU busy with something else (like collisions, or waiting) shortens the time measured inside OnRender, and so improves the reported "frame time" value, while in fact the rendering time is the same: you're just not clogging your GPU. Which is a good thing, actually, if your game can spend this time on something useful like collisions. Just don't overestimate it: you didn't make rendering faster, but you managed to do useful work in the meantime.

For example: if you set LimitFPS to a small value, you may observe that "frame time" grows higher. Why? Because when the CPU is idle (which is often the case if LimitFPS is small), the GPU has free time to finish rendering the previous frame. So the GPU does the work for free, outside of OnRender time, while your CPU is busy with something else. On the other hand, when the CPU works on producing new frames, you have to wait inside OnRender until the previous frame finishes.

In other words, improvements to "frame time" must be taken with a grain of salt. We spend less time in OnRender event: this does not necessarily mean that we really render faster.

Still, often "frame time" does reflect the speed of GPU rendering.

If you turn off LimitFPS, and compare "frame time" with "real time", you can see how much time was spent outside OnRender. Usually, "frame time" will be close to "real time". If the gap is large, it may mean that you have a bottleneck in non-rendering code (like collision detection and creature AI).
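The comparison can be turned into simple arithmetic: if both values are frames per second, then 1000 / RealTime is the total time per frame in milliseconds, and 1000 / FrameTime is the part spent in OnRender. The following self-contained sketch (the function name MsOutsideRender is hypothetical) computes the difference, subject to the caveats about GPU parallelism described above.

```pascal
program FpsGap; // hypothetical program name
{$mode objfpc}
uses SysUtils;

{ Milliseconds per frame spent outside OnRender,
  given the two FPS values (both in frames per second). }
function MsOutsideRender(const FrameTimeFps, RealTimeFps: Single): Single;
begin
  Result := 1000 / RealTimeFps - 1000 / FrameTimeFps;
end;

begin
  { Example: "frame time" = 100 FPS, "real time" = 50 FPS.
    Total 20 ms per frame, of which 10 ms in OnRender, leaves 10 ms outside. }
  Writeln(Format('%.1f ms per frame outside OnRender',
    [MsOutsideRender(100, 50)]));
end.
```

A result close to zero suggests rendering dominates; a large result points at non-rendering code such as collision detection or creature AI.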

2. Making your games run fast

First of all, watch the number of vertexes and faces of the models you load. Use view3dscene menu item Help -> Scene Information for this.

Graphic effects dealing with dynamic and detailed lighting, like shadows or bump mapping, have a cost. So use them only if necessary. In case of static scenes, try to "bake" such lighting effects to regular textures (use e.g. Blender Bake functionality), instead of activating a costly runtime effect.

2.1. Backface culling

If the player can see the geometry faces only from one side, then backface culling should be on. This is the default case (X3D nodes like IndexedFaceSet have their solid field equal TRUE by default). It avoids useless drawing of the other side of the faces.

2.2. Textures

Optimize textures to increase the speed and lower GPU memory usage:

  • Use texture compression (makes GPU memory usage more efficient). You can do it very easily by using material properties and auto-compressing the textures using our build tool.
  • Scale down textures on low-end devices (desktops and mobiles). You can do it at loading, by using material properties and auto-downscaling the textures using our build tool, see TextureLoadingScale. Or you can do it at runtime, by GLTextureScale. Both of these approaches have their strengths, and can be combined.
  • Use texture atlases (try to reuse the whole X3D Appearance across many X3D shapes, if possible). This avoids texture switching when rendering, so the scene renders faster. When exporting from Spine, be sure to use atlases.
  • Use sprite sheets (TSprite class) instead of separate images (like TGLVideo2D class). This again avoids texture switching when rendering, making the scene render faster. It also lets you easily use any texture size (not necessarily a power of two) for the frame size, and still compress the whole sprite, so it cooperates well with texture compression.
  • Don't set TextureProperties.anisotropicDegree too high if not needed. anisotropicDegree should only be set to values > 1 when it makes a visual difference in your case.

2.3. Animations

There are some TCastleScene features that are usually turned on, but in some special cases may be avoided:

  • Do not enable ProcessEvents if the scene should remain static.
  • Do not add ssDynamicCollisions to Scene.Spatial if you don't need collisions more precise than against the scene bounding box.
  • Do not add ssRendering to Scene.Spatial if the scene is always small on the screen, and so it's usually either completely visible or invisible. ssRendering adds frustum culling per-shape.
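A sketch of how these three choices might look in code follows. The procedure names are hypothetical, and it is an assumption that the ssRendering and ssDynamicCollisions constants come from the CastleSceneCore unit in your engine version.

```pascal
uses CastleScene, CastleSceneCore; // CastleSceneCore: assumed home of the ssXxx constants

{ Hypothetical helper for a small, static decoration:
  no events, no per-shape culling octree, collisions only as a bounding box. }
procedure ConfigureSmallStaticScene(const Scene: TCastleScene);
begin
  Scene.ProcessEvents := false;
  Scene.Spatial := [];
end;

{ Hypothetical helper for a large animated level:
  events on, per-shape frustum culling, precise collisions. }
procedure ConfigureLevelScene(const Scene: TCastleScene);
begin
  Scene.ProcessEvents := true;
  Scene.Spatial := [ssRendering, ssDynamicCollisions];
end;
```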

Various techniques to optimize animations include:

  • If your model has animations but is often not visible (outside of view frustum), then consider using Scene.AnimateOnlyWhenVisible := true (see TCastleSceneCore.AnimateOnlyWhenVisible).

  • If the model is small, and not updating its animations every frame will not be noticeable, then consider setting Scene.AnimateSkipTicks to something larger than 0 (try 1 or 2). See TCastleSceneCore.AnimateSkipTicks.

  • For some games, globally setting OptimizeExtensiveTransformations := true improves the speed. This works best when you animate multiple Transform nodes within every X3D scene, and some of these animated Transform nodes are children of other animated Transform nodes. A typical example is a skeleton animation, for example from Spine, with a non-trivial bone hierarchy, and with multiple bones changing position and rotation every frame.

  • Consider using TCastlePrecalculatedAnimation to "bake" animation from events as a series of static scenes. This makes sense if your animation is from Spine or X3D exported from some software that understands X3D animations. (No point doing this if your animation is from KAnim or M3D, they are already "baked".) TODO: the API for doing this should use TNodeInterpolator, not deprecated TCastlePrecalculatedAnimation.

  • Watch out what you're changing in the X3D nodes. Most changes are fast, in particular the ones that can be achieved by sending X3D events (these changes are kind of "suggested by the X3D standard" to be optimized). But some changes, e.g. reorganizing the X3D node hierarchy, are very slow because they cause rebuilding of scene structures. So avoid making them during the game. To detect this, set LogSceneChanges := true and watch the log (see CastleLog docs and tutorial) for lines saying "ChangedAll" - these are costly rebuilds, avoid them during the game!
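The settings from this list could be combined roughly as below. This is only a sketch: the names TuneAnimatedScene and SetupAnimationDiagnostics are hypothetical, and it is an assumption that LogSceneChanges and OptimizeExtensiveTransformations are global variables in CastleSceneCore and that InitializeLog (from CastleLog) can be called without arguments in your engine version.

```pascal
uses CastleScene, CastleSceneCore, CastleLog;

{ Hypothetical helper for a small animated model that is often off-screen. }
procedure TuneAnimatedScene(const Scene: TCastleScene);
begin
  Scene.AnimateOnlyWhenVisible := true; // skip animation work outside the frustum
  Scene.AnimateSkipTicks := 1;          // try 1 or 2, as suggested above
end;

{ Hypothetical helper: call once at startup, before loading scenes. }
procedure SetupAnimationDiagnostics;
begin
  InitializeLog;                            // assumed: starts writing the log
  LogSceneChanges := true;                  // then look for "ChangedAll" lines
  OptimizeExtensiveTransformations := true; // helps with deep animated hierarchies
end;
```

Measure with and without OptimizeExtensiveTransformations, as it only helps some games.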

2.4. Create complex shapes, not trivial ones

Modern GPUs can "consume" a huge number of vertexes very fast, as long as they are provided to them in a single "batch" or "draw call".

In our engine, the "shape" is the unit of information we provide to the GPU. It is simply a VRML/X3D shape. In most cases, it also corresponds to the 3D object you design in your 3D modeler, e.g. a Blender 3D object is, in simple cases, exported to a single VRML/X3D shape. It may be split into a couple of shapes if you use different materials/textures on it, as VRML/X3D is a little more limited (and also more GPU friendly).

The general advice is to compromise:

  1. Do not make too many too trivial shapes. Do not make millions of shapes with only a few vertexes — each shape will be provided in a separate VBO to OpenGL, which isn't very efficient.

  2. Do not make too few shapes. Each shape is passed as a whole to OpenGL (splitting shape on the fly would cause unacceptable slowdown), and shapes may be culled using frustum culling or occlusion queries. By using only a few very large shapes, you make this culling worthless.

A rule of thumb is to keep your number of shapes in a scene between 100 and 1000. But that's really just a rule of thumb, different level designs will definitely have different considerations.

You can also look at the number of triangles in your shape. Only a few triangles for a shape is not optimal — we will waste resources by creating a lot of VBOs, each with only a few triangles (the engine cannot yet combine the shapes automatically). Instead, merge your shapes — to have hundreds or thousands of triangles in a single shape.

2.5. Do not instantiate too many TCastleScenes

You usually do not need to create too many TCastleScene instances.

  • To reduce memory usage, you can place the same TCastleScene (or TCastlePrecalculatedAnimation) instance many times within SceneManager.Items, usually wrapped in a different T3DTransform. The whole code is ready for such "multiple uses" of a single scene instance.

    For an example of this approach, see the frogger3d game (in particular, its main unit game.pas). The game adds hundreds of 3D objects to SceneManager.Items, but there are only three TCastleScene instances (player, cylinder and level).

  • To improve the speed, you can often combine many TCastleScene instances into one. To do this, load your 3D models to TX3DRootNode using Load3D, and then create a new single TX3DRootNode instance that will have many other nodes as children. That is, create one new TX3DRootNode to keep them all, and for each scene add its TX3DRootNode (wrapped in TTransformNode) to that single TX3DRootNode. This allows you to load multiple 3D files into a single TCastleScene, which may make stuff faster, because octrees (used for collision routines and frustum culling) will work well. Right now, we have an octree only inside each TCastleScene, so it's not optimal to have thousands of TCastleScene instances with collision detection.
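The second approach (combining many models into one scene) can be sketched as follows. The function name LoadCombined is hypothetical, and the two-argument Create('', '') constructor form and the FdChildren.Add call are assumptions about your engine version; setting a translation on each wrapper is left out.

```pascal
uses X3DNodes, X3DLoad, CastleScene;

{ Hypothetical helper: loads every URL with Load3D and returns one new
  TX3DRootNode that holds all of them, each wrapped in a TTransformNode. }
function LoadCombined(const URLs: array of string): TX3DRootNode;
var
  I: Integer;
  Wrapper: TTransformNode;
begin
  Result := TX3DRootNode.Create('', '');
  for I := Low(URLs) to High(URLs) do
  begin
    Wrapper := TTransformNode.Create('', '');
    // Set the wrapper's translation / rotation here to place the model.
    Wrapper.FdChildren.Add(Load3D(URLs[I]));
    Result.FdChildren.Add(Wrapper);
  end;
end;
```

Usage, with hypothetical file names: Scene.Load(LoadCombined(['level.x3d', 'tree.x3d']), true); the second argument makes the scene own (and later free) the combined root node.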

2.6. Collisions

We build an octree (looking at exact triangles in your 3D model) for precise collision detection with a level. For other objects, we use bounding volumes like boxes and spheres. This means that the number of shapes doesn't matter much for collision speed. However, the number of triangles still matters for the level.

Use the X3D Collision node to easily mark unneeded shapes as non-collidable or to provide a simpler "proxy" mesh to use for collisions with complicated objects. See demo_models/vrml_2/collisions_final.wrl inside our demo VRML/X3D models. It's really trivial in X3D, and we support it 100%. I just wish there was a way to easily set it from 3D modelers like Blender. Hopefully we'll get a better X3D exporter one day. Until then, you can hack the X3D source, it's quite easy actually. And thanks to the X3D Inline node, you can keep your auto-generated X3D content separated from hand-written X3D code: that's the reason for the xxx_final.x3dv and xxx.x3d pairs of files around the demo models.

You can adjust the parameters that control how the octree is created. You can set octree parameters in the VRML/X3D file or in ObjectPascal code. In practice, though, I usually find that the default values are really good.

2.7. Avoid loading (especially from disk!) during the game

Avoid any loading (from disk to normal memory, or from normal memory to GPU memory) once the game is running. Doing this during the game will inevitably cause a small stutter, which breaks the smoothness of the gameplay. Everything necessary should be loaded at the beginning, possibly while showing some "loading..." screen to the user. Use TCastleScene.PrepareResources to load everything referenced by your scenes to GPU.

Enable some (or all) of these flags to get extensive information in the log about all the loading that is happening:

  • LogTextureLoading
  • LogAllLoading
  • TextureMemoryProfiler.Enabled
  • LogRenderer (from CastleRenderer unit)

Beware: This is usually a lot of information, so you probably don't want to see it always. Dumping this information to the log will often cause a tremendous slowdown during the loading stage, so do not bother to measure your loading speed when any of these flags are turned on. Use these flags only to detect if something "fishy" is happening during the gameplay.

2.8. Consider using occlusion query

The engine allows you to easily define custom culling methods or use hardware occlusion query (see examples and docs). This may help a lot in large scenes (city or indoors).

3. Profile (measure speed and memory usage)

You can compile your application with the build tool using --mode=valgrind to get an executable ready to be tested with the magnificent Valgrind tool.

You can use any FPC tool to profile your code, for memory and speed. There's a small document about it in the engine sources, see castle_game_engine/doc/profiling_howto.txt. See also the FPC wiki about profiling.

4. Measure memory use and watch out for memory leaks

To detect memory leaks, it's easiest to compile with FPC options -gl -gh. At the program's exit, you will get a very useful report about the allocated and not freed memory blocks, with a stack trace to the allocation call. Consider adding this to your fpc.cfg file (see FPC documentation "Configuration file" to know where you can find your fpc.cfg file):

#IFDEF DEBUG
-gh
-gl
#ENDIF
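To see what the report looks like, you can try a deliberately leaky program such as the sketch below (the program and file names are hypothetical). Compiled with "fpc -gh -gl leakdemo.pas", it should report an unfreed memory block at exit, with a stack trace pointing at the line that created the list.

```pascal
program LeakDemo; // hypothetical name, save as leakdemo.pas
{$mode objfpc}
uses Classes;
var
  Leaked: TStringList;
begin
  Leaked := TStringList.Create;
  Leaked.Add('never freed');
  // Leaked.Free is deliberately missing here.
  // With -gh the heap tracer reports the unfreed blocks when the program exits,
  // and -gl adds source line information to the stack trace.
end.
```

Adding Leaked.Free before the final "end." should make the report show zero unfreed blocks.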

We do not have any engine-specific tool to measure memory usage or detect memory problems, as there are plenty of them available with FPC+Lazarus already. To simply see the memory usage, use the process monitor that comes with your OS. See also Lazarus units like LeakInfo.

You can use full-blown memory profilers like valgrind's massif with FPC code (see section "Profiling" on this page about valgrind).