Sound, Sensation, and Spatialization: A Postmortem of "Fixing Incus"
Originally published on 01/04/2016 on Gamasutra.com.
Introduction
My brother and I were audiophiles before we even knew the word. Every garage sale we passed was an opportunity to find another amp, a new set of speakers, and a better equalizer. Our bedroom walls were covered with speakers, all strung in crude serial chains of wires twisted together by excited little fingers and driven off a ridiculous series of amps. Even our computer chair had a pair of speakers duct-taped to the back, attached to a 2x4 that doubled as a makeshift head rest. We were chasing after bigger and more-immersive sound, and where our hardware was wanting, we filled in with crude innovation, as we watched MIDI, wavetables, and stereo CD sound take center stage. Things were changing quickly though, and on one fateful day we received a peculiar CD in the mail: a demo for something called a Vivid 3D, a small box that would take a stereo input and upmix it into "SRS 3D audio". I can still remember the opening line when we loaded the disc... "Got the keys? Let's go!" It was unlike anything we'd heard before, and we struggled to understand how the audio could possibly leap away from the speakers like that. Was it an illusion? Was it magic? Would this be the technology that brought us closer to our goal of the ultimate audio experience? I didn’t know it then, but my fascination with the answers to those questions would ultimately drive my career in audio for years to come.
The explosion of virtual and augmented reality technologies has brought with it a dramatic resurgence for 3D spatialized audio, creating opportunities to reexamine the rules and best practices of sound design for a new era. In March of last year, I was approached by a friend and former colleague about making a demo for VisiSonics, the makers of the RealSpace3D audio spatializer. The purpose of the demo would be simple: show off the capabilities of the company’s burgeoning audio engine in a scripted sequence that would thrill casual and hardcore gamers alike. What followed was an exceptional opportunity to work closely with a small team of dedicated engineers, academics, and scientists on a type of technology that has excited me since I was very young. This document is a postmortem on that project, to illustrate the opportunities and challenges of developing with 3D audio in this renaissance of interactive sound design.
Planning and Growth
At the start of planning for Fixing Incus, we discussed creating a “blind”, audio-only demo in Unity. Without a graphic artist on our team, and for the sake of simplicity and time, we forwent graphics in favor of a more polished audio experience. Although this approach had served me well in non-spatialized projects in the past, on this project I learned immediately how difficult it is to prototype a 3D audio scene without actual geometry to use as a reference. I had originally planned to prepare a linear stereo demo that could be critiqued and then extrapolated into a more complex scene, letting us iterate on a linear sequence much faster. It was an attractive idea, but I quickly realized it would be very hard to judge the pace and spacing of the demo without an actual scene to establish the distances sound sources would travel and the relationships between them.
Following the decision to pursue at least a rudimentary graphic presentation, my first iteration of the 3D set was intentionally crude, again to avoid putting too much effort into the visuals. I thought that if we weren’t going to be completely “blind”, perhaps we could obscure the visuals in a way that served the fiction. This first effort had me grey-boxing characters and standing up some very primitive geometry while painting over everything with thick, dream-like post effects. I explained the concept to the team: “You’re a robot and your visual system is broken.” After demoing this early version at a local VR meetup to mixed feedback, we determined that this was not an adequate solution. The ambiguous flow of lights and shapes looked interesting, but it weakened the story being told. While many liked the immersion of the 3D audio, the scene overall was confusing; the lack of a cohesive visual presentation detracted substantially from the audio, and as a result almost no one could connect what they were hearing with what they were seeing. Interestingly, gamers and game developers seemed far more comfortable with the abstract presentation than the journalists and VR enthusiasts in the group. Overall, it was clear that I had to solve the graphics problem--we couldn’t let half-baked visuals distract and detract from the real star, the spatialized audio.
At this point I dove into the deep end of Unity and began using the Asset Store extensively, combining groups of high-quality sci-fi assets to compose a believable scene: dark, harshly-lit, metallic hallways and futuristic humanoid figures whose hoods covered their faces, sparing us the task of lip-syncing. Rapidly, the scene began to take shape. I retrofitted the new scenery over my old spike work to preserve the relationships between 3D sounds, blocked out the rest of the movement, and was finally ready to begin the real creative work. With both the geometry and rendering materials settled, I could think about the kinds of sound effects I wanted to populate this small world with: sounds that would show off the positioning capabilities in interesting ways, alongside environmental effects that emphasized the feelings of space and reflection. Although I perhaps shouldn’t have been, I was surprised by the creative impetus that came from working within the confines of a concrete visual scene. With the ambiguity of spacing, feel, and theme gone, I was free to put my effort into a distinct tone for the sound design. For me, it drove home the importance of settling on a visual look early, since spatial sound design must consider not only the movement of objects as sources of sound, but also how those objects compose reverberant scenes that reflect the nature of the materials and geometry around them.
Exploration and Technique Development
One of the first techniques I developed for this project was splitting multi-channel audio files into separate point sources to create a sense of space, because the feeling of size becomes critically important in 3D audio design. The elevator in Fixing Incus, for example, needed to have a huge, groaning, electrically powerful feel as it lumbered down the shaft. A point-sourced mono sound felt thin and mushed, even with intricate details in the source audio. Wide 2D stereo fills are the usual solution, but that wouldn’t work in this case: currently, multi-channel audio can’t be passed through a spatialized audio filter, and even a nicely detailed 2D stereo voice sounded flat and out-of-place, with no sense of depth or movement. I determined that I would have to split the sound into not just two, but three separate point sources for the largest pieces of machinery: left and right channels to create a sense of space, and a center channel to give it a strong core.

In most cases the procedure was relatively simple. First, I made a high-fidelity, stereo-rich sound. Then I rendered the right and left channels into separate mono files. Finally, I made a third, related channel to sit in the middle, synchronized with the timing of the sides but containing unique content. For example, if heavy wheels on the sides of an elevator suddenly balk at running over a rusted rail, something like a loosely shaking chain in this new center channel can unite the separate elements. The obvious concern is whether the separate left and right channels will phase against one another due to inevitable drift and the constantly-shifting ITDs (interaural time differences) rendered by the spatializer. In a dry scenario, this is certainly the case: the non-interleaved channels phase into one another and the resulting sound is bizarrely thin, phasey, and almost unrecognizable. However, placing the sounds inside a spatialized environmental reverb as part of the scene eliminated the phasing completely, to great effect. Instead of sounding phased and thin, the first-order reflections enrich the sound, giving it a huge, spacious, multi-dimensional feel beyond anything a 2D experience can provide. The sound source and inward-bouncing reflections coat the geometry with sound, like paint splattered over an invisible surface, giving it dimension and the feel of being physically present.

I took this model one step further in the hangar sequence of Incus, when the armature comes down over the viewer’s head. For this audio group, two sounds are spaced far apart on either side in a rich, reverberant environment to give the armature a monstrously heavy feel as it approaches, but it’s the third channel in the center that brings the experience together. This is partly because the thick centering adds a focused mass, but it’s also the center channel’s uniquely shorter rolloff that makes its details fade in over the two outer layers, allowing the feel of the lift to go from heavy and ominous to focused and in-your-face. On many occasions, while watching people view the demo, I’ve noticed that even after the armature slows to a stop, the invasive proximity of the center channel makes people jump back in their seats. This method proved to be a very powerful test of the viewer’s natural instinct to recoil when something gets loud very quickly as it approaches his or her face.
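For readers working in Unity, here is a minimal sketch of the splitting setup. On Incus the left and right renders were produced offline in a DAW; this version splits a stereo clip at runtime instead, and it uses Unity's stock AudioSource with its spatializer flag, since the actual RealSpace3D components are proprietary. The class name, anchor transforms, and parameter values are all hypothetical.

```csharp
using UnityEngine;

// Hypothetical sketch: split a stereo render into left/right mono clips so each
// half can live on its own spatialized point source, with a third, separately
// authored "center" clip anchoring the core of the sound.
public class StereoSplitSource : MonoBehaviour
{
    public AudioClip stereoClip;    // the wide, stereo-rich source render
    public AudioClip centerClip;    // unique center content, timed to the sides
    public Transform leftAnchor;    // e.g., left side of the elevator car
    public Transform rightAnchor;   // e.g., right side of the elevator car
    public Transform centerAnchor;  // the "core" between them

    void Start()
    {
        // Pull the interleaved stereo samples: [L0, R0, L1, R1, ...]
        var interleaved = new float[stereoClip.samples * stereoClip.channels];
        stereoClip.GetData(interleaved, 0);

        var left = new float[stereoClip.samples];
        var right = new float[stereoClip.samples];
        for (int i = 0; i < stereoClip.samples; i++)
        {
            left[i] = interleaved[i * 2];
            right[i] = interleaved[i * 2 + 1];
        }

        AudioClip leftClip = AudioClip.Create("Left", stereoClip.samples, 1, stereoClip.frequency, false);
        leftClip.SetData(left, 0);
        AudioClip rightClip = AudioClip.Create("Right", stereoClip.samples, 1, stereoClip.frequency, false);
        rightClip.SetData(right, 0);

        // Schedule all three sources on the DSP clock so the sides stay
        // sample-synchronized; drift between them is what invites phasing.
        double startTime = AudioSettings.dspTime + 0.1;
        Play(leftClip, leftAnchor, startTime);
        Play(rightClip, rightAnchor, startTime);
        Play(centerClip, centerAnchor, startTime);
    }

    void Play(AudioClip clip, Transform anchor, double startTime)
    {
        var src = anchor.gameObject.AddComponent<AudioSource>();
        src.clip = clip;
        src.spatialBlend = 1f;  // fully 3D
        src.spatialize = true;  // hand the source to the active spatializer plugin
        src.loop = true;
        src.PlayScheduled(startTime);
    }
}
```

Scheduling on the DSP clock rather than calling Play in sequence is what keeps the left and right channels locked together, which matters once an environmental reverb is relied on to mask the remaining phase interactions.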
For a situation that requires a life-like, organic feel while still feeling large and spacious, a similar approach can be taken with only two channels. As in the previous example, the sound is first designed with stereo in mind, but this time care is taken to place completely unique content on each side. For the opening scene in the hallway of Incus, the blowing wind is made up of two channels born of the same source content, but with one channel slip-edited so that the two become completely unique samples. When combined in 3D space, they create a very convincing sound of wind blowing violently through the cracked door. The difference between this and the previous example comes down to synchronous versus asynchronous content, and the choice between the techniques depends on whether the sound designer wants to create a mechanical or an organic sound. Mechanical sounds need to feel wide and heavy, yet synchronized and related, to box in the listener. Organic sounds, on the other hand, need to feel light, free, and chaotic to create a sense of mixed space. In short, it can be thought of as “outside-in” sound design (linearly-synchronized channels that contribute to a solid core of sound) versus “inside-out” sound design (chaotic channels that are unique but related).
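The slip-edit itself happened in my DAW, but the same decorrelation can be approximated at runtime by starting two spatialized copies of the same loop at different offsets. This is a hypothetical Unity sketch, not the Incus implementation; the transforms and the offset value are assumptions.

```csharp
using UnityEngine;

// Hypothetical sketch of the "inside-out" organic approach: two point sources
// built from the same wind recording, deliberately desynchronized so each side
// carries unique content. On Incus the slip-edit happened in the DAW; offset
// playback approximates it at runtime.
public class AsyncWindPair : MonoBehaviour
{
    public AudioClip windLoop;     // a single mono wind recording
    public Transform leftOfDoor;   // placement transforms are assumptions
    public Transform rightOfDoor;

    void Start()
    {
        PlayOffset(leftOfDoor, 0f);
        // Start the second source halfway into the loop so the two sides never
        // line up: chaotic and organic rather than synchronized and mechanical.
        PlayOffset(rightOfDoor, windLoop.length * 0.5f);
    }

    void PlayOffset(Transform anchor, float startTime)
    {
        var src = anchor.gameObject.AddComponent<AudioSource>();
        src.clip = windLoop;
        src.loop = true;
        src.spatialBlend = 1f;
        src.spatialize = true;
        src.time = startTime;  // the runtime stand-in for a slip-edit
        src.Play();
    }
}
```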
An Extra Edge
Another technique that became part of my regular process was using noise reduction, harmonic excitation, and light EQ’ing in the bottom end (the sub-500 Hz range) to craft sounds with enough spectral width for the spatializer to process, while avoiding garbage sound that would pull the viewer out of the scene. For example, a light hiss in a sound effect, while not desirable, isn’t an automatic deal-breaker that precludes its use in a game scene. In ordinary 2D work, the fuzz of the audio effects collects uniformly across the speaker matrix, and while discerning ears might pick it out, it sounds largely unremarkable. In spatialized sound design, that suddenly changes, because a noisy signal no longer spreads evenly across a linear matrix. Instead, it floats at a definite point in space, hissing and wheezing angrily while suspended conspicuously for all to hear, pulling players out of the experience with the subtlety of a sledgehammer. So when possible, the cleanest samples should be used and any disturbances denoised with a high-quality tool. In some cases, sound sources should also be “de-reverbed” to ensure they’re as dry as possible (I use iZotope RX for both of these tasks). Again, a little bit of inherent room or “air” in a source recording shouldn’t definitively eliminate its use, but caution should be taken to ensure it doesn’t confuse or conflict with the intended sense of space.
On the output stage of my DAW source projects, I used a harmonic exciter to add a uniform amount of extra sparkle, brightening the effects to what I would describe as a slightly over-real sound (I prefer tube-style excitation, which feels warm and slightly crystalline without digital fuzz). The purpose of this was to assist the RS3D spatializer: the higher frequency bands are of the utmost importance in helping listeners locate objects in 3D space, since they’re the most heavily affected during filtering and reflection, and adding that small amount of sparkle made the sounds a little more “grippy”. This process also ensured that any sound source that got close enough to the listener had the detail we expect to hear when something is super-close. As I mentioned previously, it allowed sound effects in certain difficult positions to edge slightly toward “over-real” while the rest of the world felt simply “real”. Think of this principle like accenting the attack on a string instrument during editing: the personality is in the timbre, but it’s the articulate clicks that occasionally peer over the equalizing ridge that add the right amount of character.
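In spirit, a tube-style exciter isolates the upper band, drives it through a soft saturator to generate harmonics, and mixes a little of the result back under the dry signal. The sketch below shows that principle on a raw mono buffer; it is a toy illustration with assumed coefficients, not the plugin used on Incus.

```csharp
using System;

// Toy sketch of tube-style harmonic excitation on a mono buffer in [-1, 1]:
// high-pass to isolate the upper band, soft-saturate it (which generates new
// harmonics), then mix a small amount back under the dry signal. All
// coefficients are illustrative assumptions.
static class Exciter
{
    public static void Excite(float[] samples, float sampleRate,
                              float cutoffHz = 3000f, float drive = 4f, float mix = 0.15f)
    {
        // One-pole high-pass isolates the band to be brightened.
        float rc = 1f / (2f * (float)Math.PI * cutoffHz);
        float dt = 1f / sampleRate;
        float alpha = rc / (rc + dt);

        float prevIn = 0f, prevHp = 0f;
        for (int i = 0; i < samples.Length; i++)
        {
            float x = samples[i];
            float hp = alpha * (prevHp + x - prevIn);
            prevIn = x;
            prevHp = hp;

            // tanh waveshaping: gentle, "warm" saturation of the high band.
            float excited = (float)Math.Tanh(hp * drive);
            samples[i] = x + mix * excited;
        }
    }
}
```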
All of these techniques were, unsurprisingly, critically important to the voice clips. In the pursuit of “presence”, few sounds pull listeners into an experience as powerfully as the human voice, and clean, dry, sparkling audio is all part of that experience in 3D space. However, it was a serendipitous discovery that led to one additional trick that added a startling amount of presence to the voice acting. When I’m not listening on monitors, I use a pair of Blue Mo-Fi headphones as my go-to cans. They have a built-in analog amplifier with both an “On” and an “On+” state, the latter of which adds a gentle amount of analog bass EQ. I was listening to the opening vocal sequence of Incus when I decided to engage the On+ mode, and I was startled. Suddenly the voices went from feeling like mere sound effects to the uncanny sensation of a person actually speaking near me, almost like seeing a mannequin out of the corner of your eye before realizing it’s not a person. Going back and boosting the bass in the source edits preserved the effect. What I found was that while harmonic excitation helped the voice take a good step toward spatial presence, it was the bass boost that allowed me to feel the voice. That shouldn’t be a new concept to anyone who’s ever had a conversation at close range: as a speaker gets closer, we start to feel the physical sensation of their voice resonating against our skin in a way that is uniquely human. Like the harmonic excitation, this was a simple little nudge over the line into hyper-reality that made a huge difference with close-proximity sounds.
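The boost itself is just a gentle low shelf applied in the source edits. As a hypothetical illustration in the same style as the exciter sketch, a one-pole low-pass can be scaled and summed back into the dry voice; the corner frequency and gain here are assumptions, not the values used on Incus.

```csharp
using System;

// Toy sketch of the close-proximity bass boost: a one-pole low-pass tracks the
// low band of the voice, and a scaled copy is summed back in, acting as a
// gentle low shelf. Corner frequency and gain are assumptions.
static class BassShelf
{
    public static void Boost(float[] samples, float sampleRate,
                             float cornerHz = 150f, float gainDb = 3f)
    {
        float dt = 1f / sampleRate;
        float rc = 1f / (2f * (float)Math.PI * cornerHz);
        float alpha = dt / (rc + dt);

        // Linear gain to add on top of the existing low band.
        float add = (float)Math.Pow(10f, gainDb / 20f) - 1f;

        float lp = 0f;
        for (int i = 0; i < samples.Length; i++)
        {
            lp += alpha * (samples[i] - lp);  // one-pole low-pass
            samples[i] += add * lp;           // shelve the lows upward
        }
    }
}
```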
Speaking of proximity, the use of environmental reverb can take a presentation from great to stunning in a few short moments. When I first began using the RealSpace 3D tools, they provided “shoebox”-style reverb--an individual DSP corresponding to a small, attached geometric space that follows the audio source around but largely doesn’t reflect the geometry of the actual environment. “Environmental” reverb, on the other hand, is placed as part of the scene, lining up against the walls and surfaces of the actual gamespace. The effect sounds pleasing to the ears, but it also adds to the ever-elusive “presence” of a scene. For example, in the opening sequence of Incus, the engineers’ voices become richer the nearer they get to the environmental walls. This is simple physics at work: as an object gets closer to a wall, the primary and secondary sound paths become more closely timed, which enriches the sound. With the accurately-modeled first-order reflections provided by the RS3D engine, this effect and the sensation it provokes are expressed in very impressive fashion, and as an object moves into a corner, the effect becomes even more pronounced. Environmental reverb not only makes primary sounds feel more like they’re part of a scene, but also illuminates objects with reflected sound that would otherwise remain silent. In this way, the walls speak to the listener, stating where they are in 3D space. Sounds as simple as footsteps become like pulses of light that bathe a room with rays, speaking in short and fleeting phrases about the walls, floors, and ceilings--where they are, and what they’re made of. For this reason, the importance of good environmental reverb cannot be overstated; it rises to a level of importance nearly equal to the sound itself, speaking for objects that would otherwise feel invisible, transparent, and meaningless.
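The RealSpace3D components themselves are proprietary, so as a rough Unity-native analogy, the contrast between the two models can be expressed with stock components: an AudioReverbFilter that travels with the source (“shoebox”) versus an AudioReverbZone anchored to the room (“environmental”). Note that Unity’s zone does not model the first-order reflections described above; this is only a structural sketch, and the preset and distance values are assumptions.

```csharp
using UnityEngine;

// Structural sketch only: Unity's stock components standing in for the two
// reverb models. An AudioReverbFilter travels with the source ("shoebox");
// an AudioReverbZone is anchored to the room itself ("environmental").
public class ReverbSetup : MonoBehaviour
{
    public AudioSource movingSource;  // e.g., an engineer's voice
    public BoxCollider roomBounds;    // assumed to match the hallway walls

    void Start()
    {
        // "Shoebox" model: the reverb DSP is attached to, and follows, the source.
        var shoebox = movingSource.gameObject.AddComponent<AudioReverbFilter>();
        shoebox.reverbPreset = AudioReverbPreset.Room;

        // Environmental model: the reverb lives in the scene, lined up with the
        // actual walls, so sources grow richer as they approach the surfaces.
        var zone = roomBounds.gameObject.AddComponent<AudioReverbZone>();
        zone.reverbPreset = AudioReverbPreset.Hallway;
        zone.minDistance = 2f;   // fully inside the room's reverb
        zone.maxDistance = 10f;  // blend-out boundary beyond the walls
    }
}
```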
Unexpected Inspiration
Lastly, an exercise I found immensely useful throughout the development of Incus was the consistent, empirical study of the natural world through binaural recording. I was already recording a large number of natural ambiences, and after acquiring a pair of high-quality in-ear binaural microphones, I made a habit of recording my daily commute. Each morning I’d insert the microphones in my ears and press Record before walking out the door. This exercise illustrated just what a massive amount of information hits our ears constantly throughout the day, and how passively we hear it. The subtle electric squeal of train wheels rocketing down hardened steel rails before the train is even visible; the audible change in reflection in a train cabin as more commuters pack aboard the car; the surprisingly noticeable shadowing effect as the Doppler-shifting engine of a car passes briefly behind a park bench … I could continue for several more pages about what can be heard in the most mundane of sounds, and that’s exactly what this exercise accomplishes. Try it for yourself. There’s something about recording your daily world and playing it back in short order that makes the mind click and whir, seemingly granting the visual part of the mind permission to step back while the memory is still fresh and allowing the auditory sense to explode in sensitivity. As an example, something as small as a mailbox could be heard through the subtle change in the quality of my footsteps in the snow as it shadowed one sound and reflected another. This study provoked a type of active curiosity in me that I had not experienced for many years, and I am certain it made me a better spatial artist as a result. It also made me painfully aware of just how much further we have to go.
With that in mind, here are some observations I made throughout the development of Incus, during both Unity work and analyzing binaural recordings. This is not intended to be viewed as a list of scientific facts, but rather, a collection of anecdotal observations that I hope will be helpful to others constructing virtual spaces of the future:
- Quick and effective head-tracking is a non-negotiable component in the creation of virtual and/or augmented presence. Period.
- A convincing sense of space is created more fully by sending information to the listener from multiple angles and distances. The greater the amount of positional “noise”, the more pronounced the 3D effect can be.
- Sound attenuates dramatically sooner than expected as it moves away from a listener (see the rolloff sketch after this list).
- Conversely, the sudden ramp up in volume created by auditory proximity can make the listener recoil or shiver, even when a source is not visible.
- There is a region of confusion that extends about two to three feet in front of a listener’s face where sound loses spatial clarity in the absence of a strong visual cue.
- Unlike standard recording practices, reverberant spaces are strongly preferable over dry ones in binaural recording, as the reflections contribute to a sense of positional space (see point 2).
- Bass cues, in the right context, stimulate a sense of motion (e.g., wind blowing over the ears, bumps from road joints in a moving car, etc.)
- Listeners are far more sensitive to geometric occlusion and obstruction than we expect, to the point where even the occlusion of room tone can inform one about objects in the immediate area.
- Footsteps and other body Foley are an untapped goldmine in emerging virtual markets. There is potential for new and incredibly detailed soundset libraries to allow for plug-n-play experiences that convince a user of bodily presence while also painting and illuminating virtual spaces with auditory light that has the side-benefit of slipping beneath the conscious radar.
- Perceived attenuation curves seem to change with output volume and source direction.
- The conical attenuation of the human voice is tremendously descriptive, to the point of acting as a vector that can blindly communicate social cues to a listener (e.g., hearing when a speaker is facing directly toward you rather than in any other direction, which directs attention and signals conversational turns even when the sentence itself is ambiguous).
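Acting on the attenuation observations above is straightforward in Unity: replace the default logarithmic rolloff with a custom curve that falls off sooner. The curve shape and distance below are assumptions chosen to illustrate the API, not measurements.

```csharp
using UnityEngine;

// Minimal sketch: replace Unity's default logarithmic rolloff with a custom
// curve that attenuates sooner, as the binaural recordings suggested.
[RequireComponent(typeof(AudioSource))]
public class SteepRolloff : MonoBehaviour
{
    void Start()
    {
        var src = GetComponent<AudioSource>();
        src.rolloffMode = AudioRolloffMode.Custom;
        src.maxDistance = 25f;

        // X axis: distance normalized to maxDistance; Y axis: volume multiplier.
        var curve = new AnimationCurve(
            new Keyframe(0.0f, 1.0f),
            new Keyframe(0.2f, 0.35f),  // fall off steeply in the first stretch
            new Keyframe(1.0f, 0.0f));
        src.SetCustomCurve(AudioSourceCurveType.CustomRolloff, curve);
    }
}
```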
The overall lesson is that the smallest details within a 3D audio space can contribute dramatically toward creating a sense of presence, space, and comfort. While we’re a long way from having the tools and resources to create auditory scenes that rival the sophistication of the physically-based rendering enjoyed by our peers in visual art, the deliberate and thoughtful use of auditory details can chisel away at a listener’s resistance to accepting an audio scene as a lively and organic reality.
Conclusion
In the end, it turns out that creating audio in true 3D isn’t terribly different from what we’re already used to doing in non-spatialized work. We still need to make good use of every part of the auditory spectrum, we still need to be careful about the resources we consume, and we still need to keep our mixes clear and uncluttered. That much doesn’t change. The main difference is that the margin of error is smaller, which is both a challenge and an opportunity. It means we have to be more careful, detailed, and meaningful in how we make and place sound effects, but it also means that when we do our jobs properly, we can poke at the brains and hearts of players like never before. Our creations start to have real emotional presence, thanks to an assortment of tools and techniques that are rapidly becoming more developed, mature, and available. Like my brother and me hunting for speakers at garage sales, we as audio artists can invest ourselves in collecting an assortment of resources that can be cobbled together to form a truly unique experience, and move toward the next state of the art waiting right around the corner. Binaural, spatialized audio doesn’t have to look like a big, scary revolution in audio. Instead, if we treat it as the next evolutionary step, we can make thoughtful decisions about how to adjust our processes and become comfortable with 3D audio artistry as we explore this new frontier and bring light to the dark corners of the sensory experience.