An Introduction to the Metal API

Metal is a low-overhead, high-performance API for executing graphics and compute work on the GPU. The most common job of a GPU is to draw geometry, and the fundamental design principles of Metal aim to help applications draw geometry extremely fast.

Drawing geometry is done by executing draw calls on the GPU. A draw call is a collection of graphics commands and states that produces a visual result on screen; each draw call requires its own graphics state vector, meaning the application must explicitly specify the shaders, the graphics states, the data buffers, the textures and the render targets used to perform the drawing. In previous-generation graphics APIs such as OpenGL ES, changing state vectors is a really expensive operation because every API command must be translated into the corresponding hardware commands. The cost of this translation falls almost entirely on the CPU, and it must be paid before the GPU can start doing any work. The following picture shows a typical sequence of draw calls and the flow of execution from the application (the CPU side) to the GPU.

Metal is built around 6 key design principles:

Thinnest possible API: reduce the amount of code executed between the application and the GPU.
Designed to provide full support for all modern GPU hardware features.
Do expensive operations less frequently.
Provide predictable performance.
Provide explicit control over command submission.
Optimized for CPU behavior.

Almost all modern mobile games manage their CPU and GPU workload around a target frame rate; most of the time this target is 60 frames per second (fps), and other times it is 30 fps. The following picture shows a common case of a game that tries to balance the CPU and GPU workload while keeping a steady 30 fps: the CPU prepares the rendering commands for a certain frame and the GPU consumes these commands during the next frame.

When everything works as expected, this setup provides perfect, well-balanced parallelism, but this is an ideal situation: in real life the CPU often takes more time to generate rendering commands than the GPU takes to consume them, and this leaves the GPU idle for part of the frame. Looking in a bit more detail at the work the CPU must execute, we can split it into 2 parts: the time spent executing application logic and the time spent preparing rendering API commands; usually the latter takes the majority of the available frame time. As we can observe in the following picture, the CPU could not translate all the API commands inside the target frame time, and this can cause the GPU to skip a frame.
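To put numbers on this, here is a minimal Swift sketch of the frame budget arithmetic. The type and function names, and the millisecond figures, are made up for illustration; they are not measurements.

```swift
import Foundation

/// Frame time budget in milliseconds for a target frame rate.
func frameBudgetMs(targetFPS: Double) -> Double {
    return 1000.0 / targetFPS
}

/// CPU work for one frame, split the way the text describes it.
struct CPUFrameCost {
    var appLogicMs: Double        // game logic, physics, AI, ...
    var apiTranslationMs: Double  // time spent preparing rendering API commands

    var totalMs: Double { appLogicMs + apiTranslationMs }
}

/// True when the CPU finishes preparing the frame inside the budget.
func fitsBudget(_ cost: CPUFrameCost, targetFPS: Double) -> Bool {
    return cost.totalMs <= frameBudgetMs(targetFPS: targetFPS)
}

// Hypothetical costs: same application logic, different API overhead.
let heavy = CPUFrameCost(appLogicMs: 10, apiTranslationMs: 28)
let light = CPUFrameCost(appLogicMs: 10, apiTranslationMs: 8)

print(frameBudgetMs(targetFPS: 30))        // roughly 33.3 ms per frame
print(fitsBudget(heavy, targetFPS: 30))    // 38 ms of CPU work: misses the frame
print(fitsBudget(light, targetFPS: 30))    // 18 ms of CPU work: fits
```

With the same application logic, only the frame with the smaller API cost fits inside the 30 fps budget, which is the situation Metal tries to create.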

Metal focuses on the work of preparing rendering API commands and provides support to reduce it to the bare minimum; this frees up CPU time that can be used for other activities, and most of the time this additional time is used to generate more draw calls.

To better understand how the Metal API achieves this result, it’s important to understand why GPU programming is so expensive on the CPU.

There are 3 main reasons:

State validation: each time the application calls a rendering API, the implementation must verify that the call is performed in the right way: the application passes the right number and type of parameters, and the hardware context will move to a valid state once the call is completed. But there is more! Upon an API call, the implementation must also encode the API states into the corresponding hardware states and check them against other hardware states to figure out how to combine them all and move the global context into a new one.
Shader compilation: the source code of all shaders must be compiled to generate GPU machine code, and this usually happens at runtime. Often the state and the shader code are not described in exactly the form the hardware expects, so when the application changes certain states, the generated machine code may have to be recompiled.
GPU work submission: states and shader code can request resources that are not resident on the GPU side, so they must be moved in memory to a location where the GPU can access them.

Because of all this, what all games do is group together operations that share similar states and resources, with the intent of reducing the workload and improving efficiency. We usually refer to this process as batching commands…but batching commands requires running more logic on the CPU to create these batches. The end result is a constant balancing act: scheduling the right amount of CPU work to produce a workload that keeps the GPU busy for the entire frame, while completing all of it inside the target frame time.
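The batching idea can be sketched in a few lines of plain Swift. Everything here is hypothetical: the `DrawRequest` type and its integer IDs stand in for real pipeline and texture objects, and counting state changes is only a rough proxy for the real cost.

```swift
/// A hypothetical draw request; the IDs stand in for real state and resource objects.
struct DrawRequest {
    let pipelineID: Int
    let textureID: Int
    let meshID: Int
}

/// Counts how many pipeline or texture bindings are needed to issue the draws in order.
func stateChanges(in draws: [DrawRequest]) -> Int {
    var changes = 0
    var previous: DrawRequest? = nil
    for draw in draws {
        if let p = previous {
            if p.pipelineID != draw.pipelineID { changes += 1 }
            if p.textureID != draw.textureID { changes += 1 }
        } else {
            changes += 2  // the first draw must bind both pipeline and texture
        }
        previous = draw
    }
    return changes
}

/// Batching: reorder draws so that those sharing a pipeline, then a texture, are adjacent.
/// The sort itself is the extra CPU logic the text mentions.
func batched(_ draws: [DrawRequest]) -> [DrawRequest] {
    return draws.sorted { a, b in
        (a.pipelineID, a.textureID) < (b.pipelineID, b.textureID)
    }
}

let submitted = [
    DrawRequest(pipelineID: 1, textureID: 7, meshID: 0),
    DrawRequest(pipelineID: 2, textureID: 3, meshID: 1),
    DrawRequest(pipelineID: 1, textureID: 7, meshID: 2),
    DrawRequest(pipelineID: 2, textureID: 3, meshID: 3),
]

print(stateChanges(in: submitted))           // 8 bindings in submission order
print(stateChanges(in: batched(submitted)))  // 4 bindings after batching
```

The batched order halves the number of bindings in this toy case, but only after the CPU has paid for the sort.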

What makes Metal different comes down to one of its design principles: do expensive operations less frequently.

In all rendering APIs before Metal, OpenGL ES especially, state validation, shader compilation and GPU work submission all happened while a frame was being drawn, so the frame time was constrained by things not directly under the application’s control. Metal supports offline shader compilation and performs state validation when rendering objects are created, which leaves the application with only one worry during a frame: submitting work to the GPU.

To better understand all this, let’s look in detail at the objects that make up the Metal API; let’s walk through all of them:

The Device (MTLDevice): this is an abstraction of the physical GPU; it’s the thing that will consume the rendering and compute commands. It is also the go-to object for doing anything in Metal, as all the objects the application interacts with come from it.
The Command Queue (MTLCommandQueue): this object holds the command buffers waiting to run and lets the application control the order in which they are executed.
The Command Buffer (MTLCommandBuffer): this object stores translated hardware commands ready for consumption by the GPU.
The Command Encoder (MTLCommandEncoder): this object is responsible for translating rendering and compute commands into hardware commands.
The States: the state of the GPU is described by a series of state objects: the configuration of the framebuffers, the type of blending, the depth function and the different samplers to use when dealing with textures are all stored in objects.
The Code: this represents the source code of all the vertex and fragment shaders declared and used by the application.
The Resources: these are objects that store in memory the data representing resources such as vertex buffers, textures or sets of shader constants.
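As a rough sketch of how the first four objects relate in code, the Swift below creates a device, a queue, a command buffer and an encoder, then submits an empty command buffer. The function name is hypothetical, and a blit encoder is used only because it needs no render target; a real frame would normally use a render command encoder.

```swift
import Metal

/// Creates the core Metal objects and submits one empty command buffer.
/// Returns false when no Metal device is available on this machine.
func submitEmptyCommandBuffer() -> Bool {
    // The device is the abstraction of the GPU; everything else comes from it.
    guard let device = MTLCreateSystemDefaultDevice(),
          // Long-lived: usually created once at initialization.
          let queue = device.makeCommandQueue(),
          // Short-lived: typically at least one per frame.
          let commandBuffer = queue.makeCommandBuffer(),
          // The encoder writes hardware commands into the command buffer.
          let encoder = commandBuffer.makeBlitCommandEncoder()
    else { return false }

    // No commands are encoded in this sketch; a real app would encode work here.
    encoder.endEncoding()

    // Hand the command buffer to the GPU and wait so the status can be inspected.
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    return commandBuffer.status == .completed
}
```

Blocking with `waitUntilCompleted()` is only for the sake of the example; a game would normally keep the CPU working on the next frame instead.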

As described in the picture above, from an instance of MTLDevice the application creates an MTLCommandQueue object; typically an application creates one or more command queues at initialization and then keeps those queues around throughout its lifetime. Using an instance of MTLCommandQueue, the application creates one or more MTLCommandBuffer objects to store the hardware commands generated by an MTLCommandEncoder object. To generate the commands, it’s necessary to give the MTLCommandEncoder some information, and this is done by attaching various objects to it before using it. To create resource objects, Metal provides a mechanism built around data structures called descriptors. A descriptor allows the application to specify all the states required to create a certain resource object. The same concept applies to state objects: the API provides descriptors that the application must use to create them. In the diagram above it’s possible to see two of the most used state objects, the Render Pipeline State object and the Depth Stencil State object, created respectively with an MTLRenderPipelineDescriptor and an MTLDepthStencilDescriptor; these two objects allow the application to set up the various rendering states of the GPU. The last essential piece is the render pass, which is described by an MTLRenderPassDescriptor and defines where the application will output the geometry, that is, the render targets.
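The descriptor-to-state-object flow could look roughly like the Swift sketch below. The shader functions `demo_vertex` and `demo_fragment`, the `BakedStates` and `SetupError` types and the chosen pixel format are all assumptions made for illustration. The shaders are compiled from source at run time only to keep the example self-contained; an application following the offline-compilation approach would load a precompiled library instead.

```swift
import Metal

// Hypothetical shaders, compiled at run time only to keep the sketch self-contained.
let shaderSource = """
#include <metal_stdlib>
using namespace metal;
vertex float4 demo_vertex(uint vid [[vertex_id]]) {
    float2 p[3] = { float2(-1.0, -1.0), float2(0.0, 1.0), float2(1.0, -1.0) };
    return float4(p[vid], 0.0, 1.0);
}
fragment float4 demo_fragment() { return float4(1.0, 0.0, 0.0, 1.0); }
"""

enum SetupError: Error { case depthStencilUnavailable }

/// The immutable state objects produced from the descriptors.
struct BakedStates {
    let pipeline: MTLRenderPipelineState
    let depthStencil: MTLDepthStencilState
}

func makeStates(device: MTLDevice) throws -> BakedStates {
    let library = try device.makeLibrary(source: shaderSource, options: nil)

    // Descriptor: a mutable description of the pipeline we want.
    let pipelineDescriptor = MTLRenderPipelineDescriptor()
    pipelineDescriptor.vertexFunction = library.makeFunction(name: "demo_vertex")
    pipelineDescriptor.fragmentFunction = library.makeFunction(name: "demo_fragment")
    pipelineDescriptor.colorAttachments[0].pixelFormat = .bgra8Unorm

    let depthDescriptor = MTLDepthStencilDescriptor()
    depthDescriptor.depthCompareFunction = .less
    depthDescriptor.isDepthWriteEnabled = true

    // State objects: validation and compilation happen here, once, not per draw call.
    let pipeline = try device.makeRenderPipelineState(descriptor: pipelineDescriptor)
    guard let depthStencil = device.makeDepthStencilState(descriptor: depthDescriptor) else {
        throw SetupError.depthStencilUnavailable
    }
    return BakedStates(pipeline: pipeline, depthStencil: depthStencil)
}

/// The render pass descriptor names the render target and what to do with it.
func makeRenderPass(target: MTLTexture) -> MTLRenderPassDescriptor {
    let pass = MTLRenderPassDescriptor()
    pass.colorAttachments[0].texture = target
    pass.colorAttachments[0].loadAction = .clear
    pass.colorAttachments[0].clearColor = MTLClearColor(red: 0, green: 0, blue: 0, alpha: 1)
    pass.colorAttachments[0].storeAction = .store
    return pass
}
```

At draw time the application would only bind the already-built `pipeline` and `depthStencil` objects on a render command encoder created from the render pass descriptor.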

The reason Metal splits state between descriptors and state/resource objects is that once the application has created everything and declared all the different state combinations, Metal bakes them into a small number of state objects already translated to their hardware format, with shaders already compiled and state already validated; in this way the only work left to do is to submit and execute commands, meaning draw calls are performed really fast!

Commands Submission Model

As said, Command Encoders store API commands as hardware commands inside Command Buffer objects. Command Buffer objects are extremely lightweight, and applications usually create a large number of them during the execution of a frame. A single Command Buffer is meant to be encoded by one thread at a time, but the Command Queue is thread-safe, so it’s quite common to prepare several Command Buffers in parallel on multiple threads and, when they are all ready, submit them in the order the GPU should execute them. What makes Command Encoders so efficient is that they don’t just record work that the CPU must process again later before it can be executed; they generate hardware commands immediately, without deferred state validation…it’s like making a direct call to the GPU driver.

Resource Update Model

The resource model in Metal is designed to support a unified memory system, meaning the CPU and the GPU share the same storage space for exchanging data; this removes the need to perform implicit copies to let the GPU see data handled by the CPU and vice versa. Metal also provides an automatic cache coherency model and ensures that the CPU and GPU observe memory consistently at command buffer execution boundaries; the only thing required of the application is to make sure the rendering work is scheduled so that the CPU and GPU never write to the same chunk of memory at the same time. Delegating the scheduling and synchronization of the rendering work to the application frees the API from performing any internal or implicit synchronization, which produces a significant boost in performance.
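One common way for an application to take on this scheduling responsibility is to keep a small ring of buffers guarded by a semaphore, so the CPU never writes into a buffer the GPU may still be reading. The Swift below is a sketch of that pattern; the `FrameRing` and `FrameUniforms` types and the choice of three frames in flight are assumptions, not something the text prescribes, and the sketch relies on frames from a single queue completing in submission order.

```swift
import Metal
import Dispatch

/// Hypothetical per-frame data written by the CPU and read by the GPU.
struct FrameUniforms { var time: Float }

final class FrameRing {
    static let maxFramesInFlight = 3
    private let semaphore = DispatchSemaphore(value: FrameRing.maxFramesInFlight)
    private let buffers: [MTLBuffer]
    private(set) var frameIndex = 0

    init?(device: MTLDevice) {
        var created: [MTLBuffer] = []
        for _ in 0..<FrameRing.maxFramesInFlight {
            guard let buffer = device.makeBuffer(length: MemoryLayout<FrameUniforms>.stride,
                                                 options: .storageModeShared) else { return nil }
            created.append(buffer)
        }
        buffers = created
    }

    func encodeFrame(queue: MTLCommandQueue, time: Float) {
        // Block until the GPU has finished with the oldest in-flight frame.
        semaphore.wait()
        let buffer = buffers[frameIndex]
        frameIndex = (frameIndex + 1) % FrameRing.maxFramesInFlight

        // No in-flight command buffer uses this slot any more, so the CPU may write.
        buffer.contents().storeBytes(of: FrameUniforms(time: time), as: FrameUniforms.self)

        guard let commandBuffer = queue.makeCommandBuffer() else {
            semaphore.signal()
            return
        }
        // ... encode the passes that read `buffer` here ...

        // The command buffer boundary is where the slot is handed back to the CPU.
        let inFlight = semaphore
        commandBuffer.addCompletedHandler { _ in inFlight.signal() }
        commandBuffer.commit()
    }

    /// Waits for every in-flight frame; call before tearing the ring down.
    func drain() {
        for _ in 0..<FrameRing.maxFramesInFlight { semaphore.wait() }
        for _ in 0..<FrameRing.maxFramesInFlight { semaphore.signal() }
    }
}
```

The API itself takes no lock here; the semaphore is the application's own synchronization, which is exactly the trade the text describes.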

Talking of resources, the key concept to highlight is that the structure of a resource is immutable: once created, it can’t change; this avoids costly resource validation when the resource is used. Of course the content of a resource can change at any time, and because the model is built around the concept of a unified memory system and read/write synchronization is under the control of the application, there is no need for a lock API to access the data of a resource; in Metal, updating a resource only requires getting a pointer to its memory and reading or writing the data.
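A minimal Swift sketch of such an update, assuming a small buffer of floats in shared storage (the function name and the values written are made up for illustration):

```swift
import Metal

/// Creates a small shared buffer, writes to it through a plain pointer and reads it back.
/// Returns nil when no Metal device or buffer is available.
func demoBufferUpdate() -> [Float]? {
    guard let device = MTLCreateSystemDefaultDevice() else { return nil }
    let count = 4

    // The structure (length, storage mode) is fixed when the buffer is created ...
    guard let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                                         options: .storageModeShared) else { return nil }

    // ... but the contents are ordinary memory: no lock or map call is needed.
    let values = buffer.contents().bindMemory(to: Float.self, capacity: count)
    for i in 0..<count {
        values[i] = Float(i) * 0.5
    }

    // Read the same memory back through the same pointer.
    return Array(UnsafeBufferPointer(start: values, count: count))
}
```

Making sure the GPU is not using the buffer while the CPU writes to it remains the application's job.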