
Sound, Sensation, and Spatialization: A Postmortem of "Fixing Incus"

Originally published on 01/04/2016 on Gamasutra.com.

Introduction

    My brother and I were audiophiles before we even knew the word. Every garage sale we passed was an opportunity to find another amp, a new set of speakers, and a better equalizer. Our bedroom walls were covered with speakers, all strung in crude serial chains of wires twisted together by excited little fingers and driven off a ridiculous series of amps. Even our computer chair had a pair of speakers duct-taped to the back, attached to a 2x4 that doubled as a makeshift head rest. We were chasing after bigger and more-immersive sound, and where our hardware was wanting, we filled in with crude innovation, as we watched MIDI, wavetables, and stereo CD sound take center stage. Things were changing quickly though, and on one fateful day we received a peculiar CD in the mail: a demo for something called a Vivid 3D, a small box that would take a stereo input and upmix it into "SRS 3D audio". I can still remember the opening line when we loaded the disc... "Got the keys? Let's go!" It was unlike anything we'd heard before, and we struggled to understand how the audio could possibly leap away from the speakers like that. Was it an illusion? Was it magic? Would this be the technology that brought us closer to our goal of the ultimate audio experience? I didn’t know it then, but my fascination with the answers to those questions would ultimately drive my career in audio for years to come.

    The explosion of virtual and augmented reality technologies has brought with it a dramatic resurgence for 3D spatialized audio, creating opportunities to reexamine the rules and best practices of sound design for a new era. In March of last year, I was approached by a friend and former colleague about making a demo for VisiSonics, the makers of the RealSpace3D audio spatializer. The purpose of the demo would be simple: show off the capabilities of the company’s burgeoning audio engine in a scripted sequence that would thrill casual and hardcore gamers alike. What followed was an exceptional opportunity to work closely with a small team of dedicated engineers, academics, and scientists on a type of technology that has excited me since I was very young. This document is a postmortem on that project, to illustrate the opportunities and challenges of developing with 3D audio in this renaissance of interactive sound design.

Planning and Growth

    At the start of planning for Fixing Incus, we discussed creating a “blind”, audio-only demo in Unity. Without a graphic artist on our team, and for the sake of simplicity and time, we forwent graphics in favor of a more polished audio experience. Despite having had success with this approach in non-spatialized work in the past, I learned immediately on this project that it is difficult to prototype a 3D audio scene without actual geometry to use as a reference. I had originally planned to prepare a linear stereo demo that could be critiqued and then extrapolated into a more complex scene, so that we could iterate on a linear sequence much faster. While it was an attractive idea, I quickly realized that it would be very hard to get a feel for the pace and spacing of the demo without an actual scene to establish the distances traveled by sound sources and the relationships between them.

    Following the decision to pursue at least a rudimentary graphic presentation, my first iteration of the 3D set was intentionally crude, again to avoid putting too much effort into the visuals. I thought that if we weren’t going to be completely “blind”, perhaps we could obscure the visuals in a way that served the fiction. This first effort led to me grey-boxing characters and standing up some very primitive geometry while painting over everything with thick, dream-like post effects. I explained the concept to the team: “You’re a robot and your visual system is broken.” After demoing this early version at a local VR meetup to mixed feedback, we determined that this was not an adequate solution. The ambiguous flow of lights and shapes looked interesting, but it weakened the story being told. While many liked the immersion of the 3D audio, the scene overall was confusing. The lack of a cohesive visual presentation detracted from the audio substantially, and as a result, almost no one could connect what they were hearing with what they were seeing. Interestingly, gamers and game developers seemed far more comfortable with the abstract presentation than the journalists and VR enthusiasts in the group. Overall, it was clear that I had to find a solution to the graphics problem--we couldn’t let half-baked visuals distract and detract from the real star, the spatialized audio.

    At this point I dove into the deep end of Unity and started using the Asset Store extensively, combining groups of high-quality sci-fi assets to compose a believable scene: dark, harshly-lit, metallic hallways and futuristic humanoid figures with hoods that, by covering their faces, allowed us to avoid the task of lip-syncing. Rapidly, the scene began to take shape. I retrofitted the new scenery over my earlier rough blockout to preserve the relationships between 3D sounds, blocked out the rest of the movement, and finally was ready to begin the real creative work. With both the geometry and rendering materials settled, I could think about the kinds of sound effects I wanted to populate this small world with to show off the positioning capabilities in interesting ways, while also creating environmental effects to emphasize the feelings of space and reflection. Although I perhaps shouldn’t have been, I was surprised at the creative impetus spurred by working within the confines of a concrete visual scene. With the ambiguity of spacing, feel, and theme gone, I was free to put effort into a distinct tone for the sound design. For me, it drove home the importance of settling on a visual look early, since spatial sound design needs to consider not only the movement of objects as sources of sound, but also how they compose reverberant scenes that reflect the nature of the materials and geometry around them.

Exploration and Technique Development

    One of the first techniques I developed for this project was splitting multi-channel audio files into separate point-sources to create a sense of space, since the feeling of size becomes critically important in 3D audio design. The elevator in Fixing Incus, for example, needed to have a huge, groaning, electrically powerful feel as it lumbered down the shaft. A point-sourced mono sound felt thin and mushy, even with intricate details in the source audio. Wide 2D stereo fills are the usual solution, but that wouldn’t work in this case. Currently, multi-channel audio can’t be passed through a spatialized audio filter, and even a nicely detailed 2D stereo voice sounded flat and out-of-place since it had no sense of depth or movement. I determined that I would have to split the sound into not just two, but three separate point-sources for the largest pieces of machinery: left and right to create a sense of space, and a center channel to give it a strong core.

    In most cases the procedure was relatively simple. First, I made a high-fidelity, stereo-rich sound. Then I rendered the right and left channels into separate mono files. Finally, I made a third, related channel to sit in the middle, synchronized with the same timing but containing unique content. For example, if heavy wheels on the side of an elevator suddenly balk at running over a rusted rail, there should be something like a loosely shaking chain in this new center channel to unite the separate elements.

    The concern here, of course, is whether the separate left and right channels would phase against one another due to inevitable drift and the constantly-shifting ITDs (interaural time differences) rendered by the spatializer. In a dry scenario, this is certainly the case: the non-interleaved channels phase into one another and the resulting sound is bizarrely thin, phasey, and almost unrecognizable. However, containing the sounds inside a spatialized environmental reverb as part of the scene eliminated the phasing completely, to great effect. Instead of sounding phased and thin, the first-order reflections enrich the sound, giving it a huge, spacious, multi-dimensional feel beyond anything a 2D experience can provide. The sound source and inward-bouncing reflections coat the geometry with sound, like paint splattered over an invisible surface, giving it dimension and the feel of being physically present.

    I took this model one step further in the hangar sequence of Incus, when the armature comes down over the viewer’s head. For this audio group, two sounds are spaced far apart on either side in a rich, reverberant environment to give it a monstrously heavy feel as it approaches, but it’s the third channel in the center that brings the experience together. This is partly because the thick centering adds a focused mass, but it’s also the center channel’s deliberately shorter rolloff that makes its details fade in as a second layer, allowing the feel of the lift to go from heavy and ominous to focused and in-your-face. On many occasions, while watching people view the demo, I’ve noticed that even after the armature slows to a stop, the invasive proximity of the center channel makes people jump back in their seats. This method proved to be a powerful test of the viewer’s natural instinct to recoil when something gets loud very quickly as it approaches their face.
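    As a minimal sketch of the channel-splitting step (the file names here are hypothetical, and in practice this was done in a DAW rather than a script), the left/right preparation amounts to exporting each channel of the stereo design as its own mono file:

```python
# Illustrative sketch of splitting a stereo design into separate mono files so
# each half can be placed as its own 3D point-source. Assumes the soundfile package.
import soundfile as sf

audio, rate = sf.read("elevator_stereo.wav")   # hypothetical source; shape: (samples, 2)
left, right = audio[:, 0], audio[:, 1]

sf.write("elevator_L.wav", left, rate)         # placed on the left side of the machine
sf.write("elevator_R.wav", right, rate)        # placed on the right side of the machine

# The third, center "core" channel (e.g., the loosely shaking chain) is authored
# separately to the same timing, exported as another mono file, and positioned
# between the two in the scene.
```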

    For a situation that requires a life-like, organic feel while also feeling large and spacious, a similar approach can be taken with only two channels. Like the previous example, the sound is first designed with stereo in mind, but this time, care is taken to place completely unique content on each side. For the opening scene in the hallway of Incus, the blowing wind is made up of two channels born of the same source content, but one channel is slip-edited so that the two become completely unique samples. When combined in 3D space, they create a very convincing sound of wind blowing violently through the cracked door. The difference from the previous example is simply synchronous versus asynchronous content, and the choice between the two techniques depends on whether the sound designer wants to create a mechanical or an organic sound. Mechanics need to feel wide and heavy, yet synchronized and related, to box in the listener. Organics, on the other hand, need to feel light, free, and chaotic to create a sense of mixed space. In short, it can be thought of as “outside-in” sound design (linearly-synchronized channels that contribute to a solid core of sound) versus “inside-out” sound design (chaotic channels that are unique but related).
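    The slip-edit itself can be sketched in a few lines if the work is done offline with numpy and soundfile; the loop name and the size of the slide below are illustrative, by-ear values:

```python
# Sketch: decorrelate two wind channels born of the same source by sliding
# ("slip-editing") one of them in time. A circular shift keeps the loop seamless.
import numpy as np
import soundfile as sf

wind, rate = sf.read("wind_loop_stereo.wav")   # hypothetical looping stereo bed
left, right = wind[:, 0], wind[:, 1]

offset = int(rate * 3.7)                       # slide the right channel ~3.7 s (by ear)
right_slipped = np.roll(right, offset)

sf.write("wind_A.wav", left, rate)             # one 3D source near the cracked door
sf.write("wind_B.wav", right_slipped, rate)    # the other, now asynchronous, source
```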

An Extra Edge

    Another technique that became part of my regular process was using noise reduction, harmonic excitation, and light EQ’ing in the bottom end (the sub-500 Hz range) to craft sounds that would give the spatializer enough spectral width to process while avoiding garbage sound that would draw the viewer out of the scene. For example, a light hiss in a sound effect, while not desirable, isn’t an automatic deal-breaker that precludes its use in a game scene. In general, the fuzz of the audio effects collects uniformly across the given speaker matrix, and while discerning ears might pick it out, it sounds largely unremarkable. In spatialized sound design, however, that suddenly changes, because a noisy signal doesn’t get to drop evenly across a linear matrix. Instead, it floats at a definite point in space, hissing and wheezing angrily while suspended conspicuously for all to hear, and pulling players out of the experience with the subtlety of a sledgehammer. So when possible, the cleanest samples should be used and any disturbances denoised with a high-quality tool. In some cases, sound sources should also be “de-reverbed” to ensure they’re as dry as possible (I use iZotope RX for both of these tasks). Again, like the previous example, a little bit of inherent room or “air” in a source recording shouldn’t definitively eliminate its use, but caution should be taken to ensure it doesn’t confuse or conflict with the intended sense of space.
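    For readers without a dedicated restoration suite, the core idea of profile-based denoising can be sketched with basic spectral subtraction. This is a crude stand-in for a tool like RX, not a description of how that product works, and every file name and threshold below is an illustrative placeholder:

```python
# Crude spectral-subtraction denoise sketch: learn a noise profile from a quiet
# stretch at the head of the file, then subtract it from every analysis frame.
import numpy as np
import soundfile as sf
from scipy.signal import stft, istft

x, rate = sf.read("vo_take_noisy.wav")          # hypothetical noisy source
if x.ndim > 1:
    x = x.mean(axis=1)                          # mono is fine for this illustration

f, t, X = stft(x, fs=rate, nperseg=2048)
noise_profile = np.abs(X[:, :20]).mean(axis=1, keepdims=True)  # roughly the first half-second

mag, phase = np.abs(X), np.angle(X)
clean_mag = np.maximum(mag - 1.5 * noise_profile, 0.05 * mag)  # over-subtract, keep a floor
_, y = istft(clean_mag * np.exp(1j * phase), fs=rate, nperseg=2048)

sf.write("vo_take_denoised.wav", y, rate)
```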

    On the output stage of my DAW source projects, I used a harmonic exciter to add a uniform amount of extra sparkle, brightening the effects to what I would describe as a slightly over-real sound (I prefer tube-style excitation since it feels warm and slightly crystalline, but without digital fuzz). The purpose of this was to assist the RS3D spatializer: the higher frequency bands are of the utmost importance in helping listeners locate objects in 3D space, since they’re the most heavily affected during filtering and reflection, and adding that small amount of sparkle made sounds a little more “grippy”. Further, this process ensured that any sound source that got close enough to the listener had the detail we expect to hear when listening to something super-close. As I mentioned previously, this basically allowed sound effects in certain difficult positions to edge slightly toward “over-real” while the rest of the world felt simply “real”. Think of this principle like accenting the attack on a string instrument during editing: the personality is in the timbre, but it’s the articulate clicks that occasionally peer over the equalizing ridge that add the right amount of character.
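    The flavor of excitation being described can be roughly approximated in a few lines: isolate the top end, saturate it gently to generate new harmonics, then blend a whisper of it back under the dry signal. This is only a sketch of the concept, not the plug-in actually used, and the crossover and mix amounts are by-ear placeholders:

```python
# Minimal "exciter"-style brightener: saturate the high band and mix it back in quietly.
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

x, rate = sf.read("sfx_dry.wav")               # hypothetical source effect

hp = butter(4, 4000, btype="highpass", fs=rate, output="sos")
highs = sosfilt(hp, x, axis=0)                 # isolate the sparkle region
excited = np.tanh(4.0 * highs)                 # soft saturation adds upper harmonics

y = x + 0.08 * excited                         # a gentle amount of extra "grip"
sf.write("sfx_excited.wav", y / max(np.max(np.abs(y)), 1.0), rate)
```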

    All of these techniques were, unsurprisingly, critically important to the voice clips. In the pursuit of “presence”, few sounds are as powerful as the human voice at pulling listeners into an experience, and clean, dry, sparkling audio is all part of that experience in 3D space. However, it was a serendipitous discovery that led to one additional trick that added a startling amount of presence to the voice acting. When I’m not listening on monitors, I use a pair of Blue Mo-Fi headphones as my go-to cans. They have a built-in analog amplifier with both an “On” and an “On+” state, the latter of which adds a gentle amount of analog bass EQ. I was listening to the opening vocal sequence of Incus when I decided to engage the On+ mode. I was startled. Suddenly the voices went from feeling like mere sound effects to the uncanny sensation of a person actually speaking near me, almost like seeing a mannequin out of the corner of your eye before realizing it’s not a person. Going back and boosting the bass in the source edits helped preserve the effect. What I was finding was that while harmonic excitation helped the voice take a good step toward spatial presence, it was the bass boost that allowed me to feel the voice. This shouldn’t be a new concept to anyone who’s ever had a conversation at close range: as a speaker gets close enough, we start to feel the physical sensation of their voice resonating against our skin in a way that is uniquely human. Like the harmonic excitation, this was a simple little nudge over the line into hyper-reality that made a huge difference with close-proximity sounds.
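    The bass lift itself is nothing exotic. As a hedged sketch (approximating a low shelf by blending back a low-passed copy of the signal, with an illustrative corner frequency and gain), it looks like this:

```python
# Rough low-shelf approximation for close-proximity voice: add back a filtered
# copy of the lows so the listener starts to "feel" the speaker.
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

vo, rate = sf.read("vo_line_close.wav")        # hypothetical close-up voice line

lp = butter(2, 150, btype="lowpass", fs=rate, output="sos")
lows = sosfilt(lp, vo, axis=0)

boosted = vo + 0.4 * lows                      # roughly a +3 dB lift below ~150 Hz
sf.write("vo_line_boosted.wav", boosted / max(np.max(np.abs(boosted)), 1.0), rate)
```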

    Speaking of proximity, the use of environmental reverb can take a presentation from great to stunning in a few short moments. When I first began using the RealSpace 3D tools, they provided “shoebox” style reverb--an individual DSP that corresponds to a small, attached geometric space that follows the audio source around but largely doesn’t reflect the geometry of the actual environment. “Environmental” reverb, on the other hand, is placed as part of the scene, lining up against the walls and surfaces of the actual gamespace. The effect sounds pleasing to the ears, but it also adds to the ever-elusive “presence” of a scene. For example, in the opening sequence of Incus, the engineers’ voices become richer the nearer they get to the environmental walls. This is simple physics at work: as an object gets closer to a wall, the primary and secondary sound paths become more closely timed, a phenomenon that enriches the sound. With the accurately-modeled first-order reflections provided by the RS3D engine, this effect and the sensation it provokes are expressed in very impressive fashion. Further, as an object moves into a corner, the effect becomes even more pronounced. Environmental reverb not only makes primary sounds feel more like they’re part of a scene, but also illuminates objects with reflected sound that would otherwise remain silent. In this way, the walls speak to the listener, stating where they are in 3D space. Sounds as simple as footsteps then become like pulses of light that bathe a room with rays, speaking in short and fleeting phrases about the walls, floors, and ceilings, where they are, and what they’re made of. For this reason, the importance of good environmental reverb cannot be overstated--it rises to a level of importance nearly equal to the sound itself, speaking for objects that would otherwise feel invisible, transparent, and meaningless.
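    A tiny worked example makes the wall effect concrete. Treating the first reflection as coming from a mirror image of the source across the wall (the classic image-source idea; the positions below are invented for illustration), the gap between the direct and reflected arrivals collapses as the source approaches the wall:

```python
# Why sources "bloom" near walls: the first reflection behaves like a mirrored
# copy of the source behind the wall, so the closer the source sits to the wall,
# the smaller the delay between the direct and reflected paths.
SPEED_OF_SOUND = 343.0  # m/s, roughly, at room temperature

def reflection_lag_ms(source_to_wall_m, source_to_listener_m):
    """Delay of the first wall reflection behind the direct path (simple colinear case)."""
    direct = source_to_listener_m
    reflected = source_to_listener_m + 2.0 * source_to_wall_m   # path via the mirror image
    return (reflected - direct) / SPEED_OF_SOUND * 1000.0

for d in (2.0, 1.0, 0.25):
    print(f"source {d:4.2f} m from the wall -> reflection lags by {reflection_lag_ms(d, 3.0):.1f} ms")
# ~11.7 ms, ~5.8 ms, ~1.5 ms: within a few tens of milliseconds the ear fuses the
# two arrivals, and the shrinking gap is heard as the voice growing richer.
```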

Unexpected Inspiration

    Lastly, an exercise I found immensely useful throughout the development of Incus was the consistent, empirical study of the natural world through binaural recording. I was already recording a large number of natural ambiences, and after acquiring a pair of high-quality in-ear binaural microphones, I made a habit of recording my daily commute. Each morning I’d insert the microphones in my ears and press Record before walking out the door. This exercise illustrated for me just what a massive amount of information hits our ears constantly throughout the day, and how passively we hear it. The subtle electric squeal of train wheels rocketing down hardened steel rails before the train is even visible; the audible change in reflection in a train cabin as more commuters pack aboard the car; the surprisingly noticeable shadowing effect as the Doppler-shifting engine of a car drives briefly behind a park bench… I could continue for several more pages about what can be heard in the most mundane of sounds, and that’s exactly what this exercise accomplishes. Try it for yourself. There’s something about recording your daily world and playing it back in short order that causes the mind to click and whir, seemingly granting the visual part of the mind permission to step back while the memory is still fresh and allowing the auditory sense to explode in sensitivity. As an example, something as small as a mailbox could be heard in the subtle change in quality of my footsteps in the snow as it shadowed one sound and reflected another. This study provoked a type of active curiosity I had not experienced for many years, and I am certain it made me a better spatial artist as a result. It also made me painfully aware of just how much further we have to go.

    With that in mind, here are some observations I made throughout the development of Incus, during both Unity work and analyzing binaural recordings. This is not intended to be viewed as a list of scientific facts, but rather, a collection of anecdotal observations that I hope will be helpful to others constructing virtual spaces of the future:

  1. Quick and effective head-tracking is a non-negotiable component in the creation of virtual and/or augmented presence. Period.
  2. A convincing sense of space is created more fully by sending information to the listener from multiple angles and distances. The greater the amount of positional “noise”, the more pronounced the 3D effect can be.
  3. Sound attenuates dramatically sooner than expected as it moves away from a listener.
  4. Conversely, the sudden ramp up in volume created by auditory proximity can make the listener recoil or shiver, even when a source is not visible.
  5. There is a region of confusion that extends about two to three feet in front of a listener’s face where sound loses spatial clarity in the absence of a strong visual cue.
  6. Unlike standard recording practices, reverberant spaces are strongly preferable over dry ones in binaural recording, as the reflections contribute to a sense of positional space (see point 2).
  7. Bass cues, in the right context, stimulate a sense of motion (ex: wind blowing over ears, bumps from road joints in a moving car, etc.)
  8. Listeners are far more sensitive to geometric occlusion and obstruction than we expect, to the point where even the occlusion of room tone can inform one about objects in the immediate area.
  9. Footsteps and other body Foley are an untapped goldmine in emerging virtual markets. There is potential for new and incredibly detailed soundset libraries to allow for plug-n-play experiences that convince a user of bodily presence while also painting and illuminating virtual spaces with auditory light that has the side-benefit of slipping beneath the conscious radar.
  10. Attenuation curves have the sensation of changing with volume output and source direction.
  11. The conical attenuation of the human voice is tremendously descriptive, going so far as to be a vector that can blindly communicate social cues to a listener (e.g., hearing when a speaking person is facing directly at you versus in any other direction; directing attentional arousal and conversational turn, even in the presence of an ambiguous sentence context).

    The overall lesson to draw is that the smallest details within a 3D audio space can contribute dramatically toward creating a sense of presence, space, and comfort. While we’re a long way from having the tools and resources to create auditory scenes that rival the sophistication of the physically-based rendering enjoyed by our visual-art peers, deliberate and thoughtful use of auditory details can chisel away at a listener’s resistance toward accepting an audio scene as a lively and organic reality.

Conclusion

    In the end, it turns out that creating audio in true 3D isn’t terribly different from what we’re already used to doing in non-spatialized work. We still need to make good use of all parts of the auditory spectrum, we still need to be careful about the amount of resources we use, and we still need to make our mixes clear and uncluttered. That much doesn’t change. The main difference is that the margin of error is smaller, which is both a challenge and an opportunity. It means that we have to be more careful, detailed, and meaningful in how we make and place sound effects, but it also means that when we do our jobs properly, we can poke at the brains and hearts of players like never before. Our creations start to have real emotional presence thanks to the assortment of tools and techniques that are rapidly becoming more developed, mature, and available. Like my brother and me hunting for speakers at garage sales, we as audio artists can invest ourselves in collecting an assortment of resources that can be cobbled together into a truly unique experience, and move toward the next state of the art waiting right around the corner. Binaural, spatialized audio doesn’t have to look like a big, scary revolution. Instead, if we treat it as the next evolutionary step, we can make thoughtful decisions about how to adjust our processes and become comfortable with 3D audio artistry as we explore this new frontier and bring light to the dark corners of the sensory experience.


Raising Hell in Mobile Audio: The Sounds of Dungeon Keeper

Originally published on 01/08/2015 on AudioGANG.org.

Welcome back, Keeper.

    Does the phrase, “Mobile Audio” make you shiver with dread? When your target platform is a small, relatively underpowered device with a finite power source, no hardware acceleration, and not a lot of memory, face-melting HiDef audio doesn’t always feel within reach. Instead, it’s a battle to carry out our audio craft faithfully under the humblest of standards as departments ration already-tiny resource budgets. It shouldn’t come as a surprise, then, that the quality of sound design and music for mobile games tends to suffer. When it’s a struggle to play only a handful of sound effects and a single stream of music, it can feel like we’re edging away from compromise and toward giving up. When you download a Top 10 game and hear objectively poor sound coming out of the speakers, what message does that send about the necessity of high-quality sound design? If the competition can get away with a single sound effect for the entire UI and four bars of music, why should we shoot for anything more? These are difficult questions to answer at times when the casual market is exploding, but after several years of working with “core audience” titles, I can tell you that mobile audio, though challenging, is an opportunity to learn, grow, and remember why we love sound.

    While you're reading, take a look at the audio demo reel of Dungeon Keeper here: http://youtu.be/Hfl6V1rYGS0

First Dig

    Before we got too far into the prototyping process for Dungeon Keeper, my team and I decided to identify what we were shooting for. We agreed that we wouldn’t treat our audio as being destined for a mobile device; we were simply going to make the absolute best audio we could, and figure out how to cram it into a phone along the way. Our previous game, Ultima Forever, had been a case study in this endeavor. It was a PC MMO that later became a tablet game, and we managed to jam 4GB of audio and a two-hour soundtrack into a 200MB footprint, while still making the audio run smoothly on an iPad 2. We had learned a lot and gained confidence in that process, but Dungeon Keeper would pose an even greater challenge, with significantly smaller resource budgets despite our ambitious audio plans. If we were going to succeed, we needed to carve our goals in stone and stick to them no matter what. If we weren’t working on “mobile audio”, then we needed to aspire for “next-gen audio”.

    So, what exactly does next-gen audio mean? Here’s what we determined:

  1. Next-gen audio is detailed and dynamic. It’s full of organic layers that change over time and complement the game scene, with sound effects that inform us of the details of our surroundings.
  2. Next-gen audio is also full-spectrum and crisp. It moves elegantly from quiet, calm moments to ferociously loud ones. It covers the range from crystalline highs to booming lows, and it avoids the dull, lo-fi sound that audio compression frequently creates. Finally, it guides us along an emotional spectrum, taking us from moments of introspection and curiosity all the way to bombastic, gripping violence.
  3. Next-gen audio is impactful but intelligible. This is important with regard to spectrum, because without careful attention to dynamic mixing, we can end up with a mess of noise that doesn’t give us the opportunity to communicate clear messages to the player. Great audio can speak loudly, but only when the rest of the mix is capable of carving out space to accommodate it.
  4. Finally, next-gen audio is supported with strong tools. In our case, we chose Audiokinetic Wwise, a combination audio engine and authoring tool. It’s a fast, powerful, and informative set of tools, but most of all, it’s flexible, something we knew would be essential for the development of an ambitious mobile title.

All In the Details

    With the groundwork established, we began our work. Having previously played many hours of competitive city-building titles, one thing I wanted to focus on for Dungeon Keeper was making the dungeon seem as organic and life-like as possible. In other titles, I would work so hard on my creations but never feel really connected with my world due to the lack of ambient detail. For Dungeon Keeper, we wanted everything to feel alive, as if you were truly “The Hand of Evil”, hearing detailed sounds of digging out tunnels, the shouts and grunts of summoned minions, and the satisfying noise of ripping structures out of the ground to drop them somewhere else. The start of this was to give every building (or “room”, as we called them) a unique ambience. This included an interesting stereo loop, as well as a variety of one-shot accents played at random, giving the room ambiences the feel of being longer and more complex than they actually were. These sounds would be persistent and carefully attenuated so that at a distance, they wouldn’t cost CPU or memory until you zoomed in to examine your creations. Even then, the audio ranges were shaped as upward-facing cones to more tightly control the voice count. Further, if you chose to zoom into a single room using the info button, the ambience would shift its HDR window and flip from a 3D voice to a 2D voice, allowing the sound to fill the entire soundstage while gracefully attenuating the rest of the music and ambience.
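    The “short loop plus random accents” idea is simple enough to sketch outside the engine. In the game this was authored in Wwise rather than hand-coded, so the snippet below is only a plain-Python illustration of the scheduling logic, with invented accent names and timing ranges:

```python
# Illustrative scheduler: a short ambience loop plays continuously while one-shot
# accents fire at randomized intervals, so the room never repeats exactly.
import random

ROOM_ACCENTS = ["chains_rattle", "imp_mutter", "ember_pop", "stone_settle"]  # hypothetical

def schedule_accents(duration_s, min_gap_s=4.0, max_gap_s=12.0, seed=None):
    """Return (time, accent) pairs to lay over the looping ambience bed."""
    rng = random.Random(seed)
    events, t = [], rng.uniform(min_gap_s, max_gap_s)
    while t < duration_s:
        events.append((round(t, 1), rng.choice(ROOM_ACCENTS)))
        t += rng.uniform(min_gap_s, max_gap_s)
    return events

# A one-minute listen over a ten-second loop still sounds non-repetitive:
for time_s, accent in schedule_accents(60, seed=7):
    print(f"{time_s:5.1f} s  ->  one-shot '{accent}'")
```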

    To further complement the feeling of life and presence, we turned our attention to the minions. We first listed all their various body sizes and methods of locomotion and determined the smallest number of soundsets we could get away with, while using resource-cheap voice effects to increase variety. For example, we made light barefoot sounds for the Imps, heavy metallic clunks for the Necromancers, and leathery flapping for the Dragon Whelps. We then applied the sounds to animation notifies and carefully controlled their rolloffs and instance limits, again to provide fun amounts of detail when examined up close without pegging the CPU from further away. For the minion voices, we used a combination of sound effects and voice actors, depending on how exotic the creature was. Importantly, we used professional voice talent for the units’ various grunts, shouts, and barks even though they didn’t speak any actual lines. We worked with actors we had experience with, people we knew could jump immediately into character and bring presence to their voices in even the shortest of samples (a special shout-out to Steve Blum, Michael Donovan, David Lodge, and April Stewart, our immensely-talented creature actors, and Richard Ridings, the illustrious voice of Horny). We included not only the common voice samples like “spawn” and “death”, but also a variety of “pain” sounds that reflected which trap they were snagged by during combat, adding even more contextual reactivity and humor to their animations. I still love hearing the Trolls convulse uncontrollably when struck by a bug zapper or a Warlock screaming comically when he catches fire.

Music Fit for a Dungeon

    Next, we focused on the music. Mobile game or not, an “Interactive Soundtrack” was going to be our minimum specification, so we needed to find a way to make it blend seamlessly with the action while also speaking up when something interesting was happening. This again made us reflect on what we had learned from playing competing games, and on what we could do to lessen players’ frustrations. During the core mechanic of combat, games of this genre tend to be a mess of simultaneous actions that leave the player scrambling to pull the screen in every direction trying to figure out what’s happening. To help with this, we looked at what messages the music and UI sounds needed to emphasize so that the audio would extend the sensory experience beyond the edges of the screen and empower players with battlefield information (telemetry, of sorts). Here’s what we came up with:

  1. Combat has begun (after transitioning from a passive “scouting” state).
  2. A room has been destroyed.
  3. A spell is being cast.
  4. A high-powered “immortal” unit has spawned or been knocked out.
  5. The Dungeon Heart (the critical structure) is being attacked.
  6. The player is on the brink of winning or dominating with seconds left on the clock.
  7. The scenario was won or lost.

    Through this sequence, we created an emotional cadence, where segments and stingers were strung together in a musical sentence that accurately informed the player about what they should feel and notice. The exciting, segmented messages move from Exposition (the Scouting sequence), to Rising Action (the Attack sequence), to Climactic Plateau or Peak (the Heart-Attacked and/or Brink segments), to Resolution (the Win/Loss cadences), with informative critical events sprinkled throughout (the Destruction and Spell stingers). This all plays out through carefully-timed segment transitions and tempo-sync’d stingers, giving the music a smooth, continuous feel. The same care also went into designing and writing the Build Mode music, filling players with joy and excitement as they construct, upgrade, and maintain their dungeons. Like the combat music, there are distinct musical cues for each action that gradually build up the excitement and reward players for working in their lairs.
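    The “tempo-sync’d” part of those stingers boils down to simple quantization math. Wwise handles this natively, but as a standalone sketch with an illustrative tempo (the 140 BPM figure below is an assumption, not the game’s actual tempo), it looks like this:

```python
# Quantizing a stinger request to the next musical grid point so it lands on the
# beat instead of stepping on the underlying segment. Tempo and grid are examples.
import math

def next_sync_point(request_time_s, bpm=140.0, grid_beats=1):
    """Earliest grid-aligned time at or after the stinger request, in seconds."""
    grid_len = grid_beats * 60.0 / bpm
    return math.ceil(request_time_s / grid_len) * grid_len

# A "room destroyed" stinger requested at 10.30 s against a 140 BPM beat grid
# fires at ~10.71 s, right on the next beat.
print(round(next_sync_point(10.30, bpm=140.0, grid_beats=1), 2))
```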

Technical Considerations

    Lastly, on the technical side, we focused on how to make this massive amount of sound work on phones and tablets. The raw size of the audio was a half-gig (and that’s 16-bit sounds, not 24!), which included a complex music system and 700 lines of dialog. We used every compression format and optimization Wwise makes available and managed to compress the audio package down to a mere 30MB. Obviously we made liberal use of the Vorbis codec for dialog and music, but we also used it for much of the UI and sound effects. This may sound surprising, since audio designers are frequently cautioned against using too many compressed voices, as they gulp CPU cycles greedily. However, we found that Vorbis could be our first choice and not our last resort if we grouped compressed voices into their own submix, whereupon we could programmatically throttle the number of simultaneous Vorbis voices without worrying about where the CPU ceiling was. Making liberal use of file streaming was another key choice, allowing us to use only 3.2MB of audio memory at runtime (we were streaming from flash storage, which can of course take a lot more streaming punishment than a traditional hard disk without worry about thrash).

    Further, the HDR capabilities of Wwise were a critical component in making the whole thing work. By breaking up groups of sounds into HDR “bands”, we could paradoxically make the CPU usage drop when more voices tried to compete for audibility. Through this, we knew our CPU percentage would cap at under 12%, because even though as many as 80 voices might be playing virtually, only 10 to 15 would ever physically make it through to the speakers. This resulted in very predictable CPU loads while also providing an extremely tight mix that could speak clearly during the busiest of moments. We extended the usability of this scheme even further by implementing a Level-of-Detail system that controlled the complexity of the audio scene with a series of RTPCs to do things like change voice limits, turn DSPs on or off, pause specific submixes to free up channels, and then adjust the overall mix to compensate for the difference. This was primarily driven by a hardware poll at app startup that triggered the appropriate “platform” event, but Wwise made it easy enough that we could even change the LoD parameters during low-memory situations, dynamically adjusting the audio if device performance was suffering from external circumstances.
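    The counterintuitive “more voices, less CPU” behavior of HDR is easier to see with a toy model. The snippet below is not the Wwise API, just a conceptual sketch with made-up voice names and levels: only voices inside the window beneath the loudest sound stay physical, and everything else goes virtual at negligible cost:

```python
# Toy HDR window: keep only the voices within `window_db` of the loudest voice,
# capped at a physical-voice limit; the rest are tracked virtually (no DSP cost).
def hdr_physical_voices(voices, window_db=24.0, max_physical=15):
    """voices: list of (name, loudness_db). Returns the names that stay audible."""
    if not voices:
        return []
    ceiling = max(level for _, level in voices)      # the window rides on the loudest voice
    kept = [(name, level) for name, level in voices if level >= ceiling - window_db]
    kept.sort(key=lambda v: v[1], reverse=True)
    return [name for name, _ in kept[:max_physical]]

battle = [("heart_attacked", -6), ("spell_cast", -10), ("troll_grunt", -22),
          ("imp_dig", -34), ("distant_room_amb", -48), ("ui_tick", -52)]
print(hdr_physical_voices(battle))   # the quiet, distant voices never reach the speakers
```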

    The last thing we addressed on the technical side was a significant “first” for mobile games, and that was to add an output optimizer in the audio options panel. We provided three buttons to allow the user to select the best match to their output device:

  1. System speakers. This mix had the narrowest HDR window and the most compression, along with an EQ that skewed heavily toward the top end. Naturally, if a device can’t reproduce low frequencies, we shouldn’t use up precious digital headroom by leaving them in the mix. This allowed the game to be unusually loud and clear on even the most modest of device speakers.
  2. Small earphones. This was our target mix since we expected most players to be using the earphones that came with their device (the classic Apple earphones, in particular). The HDR window had a reasonable width and a moderate amount of compression to allow a wider dynamic range, but still a bit of restriction to keep the audio at a consistent level since small earphones tend to have poor isolation against competing noise.
  3. Large circumaural headphones. This was our most impressive mix, with a much wider HDR window and less-aggressive compression, and an EQ that gave a slight boost to the top and bottom ends. The result was a mix that had an enormous dynamic range and a sonic quality that was positively thrilling. This made it perfectly suited for reproduction on the more isolated output of circumaural headphones (or a HiFi surround system, for that matter).

    The final result of these technical features was the accomplishment of something we felt very strongly about: the commitment to making Dungeon Keeper sound spectacular no matter the capabilities of the player’s device. New, old, big, small, it didn’t matter. We wanted everyone to take part in the audio delight, and we feel we met that challenge thoroughly.

Closing Up Shop

    In the end, the accomplishments of Dungeon Keeper are bittersweet: it was a technical marvel and an incredibly enjoyable game, hampered by what was justifiably called out as an overly-aggressive economy, a saga that concluded with Mythic being shut down. We have to ask ourselves again, as we did in the beginning: was it all worth it? If best-in-class sound design isn’t required for a top-grossing game, and if it won’t save an otherwise struggling game, what good is it? Do we stop trying? Do we give up? My answer is still an emphatic “No.” Making audio for mobile games is hard. Damn hard. Even on traditional game hardware, audio has to fight for resource budgets and flounder at the butt-end of the development cycle. So when you make a mobile game, take away memory and hardware support, and try to do it all in less than a year, your mission as audio designers, engineers, and musicians gets exceedingly tough. There’s something to remember, though: the systems we played on as kids and teens didn’t have the kind of capabilities that consoles and PCs have now, but we were still able to have emotional experiences with them that stick with us even today. When we’re clear and purposeful about the art we create, we can use audio to drive emotional experiences without feeling limited by the capabilities of the platform. No other discipline can reach players faster or more effectively, which shines a bright light on our true opportunity. Art is about audience, and when our mobile audience is tens of millions instead of hundreds of thousands, we have an incredible chance to show the casual gamers of today what real emotional experiences can be and why they should keep playing our games.

    So keep trying. Do more. Shoot higher. Don’t settle for what’s good enough. Select the right tools, keep your passion high, and remember who you’re doing it all for. It’s the art, and the love of the craft, and mobile gaming is the clearest reminder of both the stakes and rewards of what’s ahead of us.

Onward, Keepers.