Diegesis refers to the story that is explicitly depicted on screen (e.g., the events that the characters themselves experience firsthand), as opposed to the larger story that is implied (e.g., historical contexts being mentioned, events that have led up to the present action, events happening elsewhere in the film’s universe but not shown on screen).

     In general, sounds in film include dialogue, singing, music, ambient noise, and sounds made by particular objects.

     Diegetic sounds come from a source within a film’s fictional universe that is directly related to the scene. These sounds are audible to the characters and film audiences. Diegetic sound helps to shape the space and time of the shot.

     On-screen sound emanates from characters or objects visible on screen, while off-screen sound comes from sources outside the camera’s frame. Here is an example of off-screen diegetic sound. A can drops outside the camera’s frame. A character within the frame hears it and turns their head. Their reaction to the sound of the can indicates that the sound is diegetic.

     Diegetic sound can be on- or off-screen. What makes a sound diegetic or non-diegetic is its relationship to the narrative on screen. 

     Diegetic sound can be recorded during filming, manufactured during post-production, or produced through a combination of the two. In other words, diegetic sounds (sound effects, music, dialogue, noises) always come from the world of the story, but they are not necessarily recorded while the camera is rolling. For example, Halle Bailey sang as Ariel and Daveed Diggs sang as Sebastian the crab on screen in Rob Marshall’s 2023 The Little Mermaid. However, these songs were recorded in professional studios off screen, as shown by the following Disney Studios documentary.


Subtitles and Diegetic Sound

     Diegetic sound can be confusing sometimes, as it is not always easy for audiences to hear everything. This is where subtitles come in. Subtitles are a heuristic device in foreign-language films and an accessibility tool for deaf or hard-of-hearing communities. However, subtitles, which offer both a transcript of dialogue and useful descriptions of the types of sound or music in a scene, have become increasingly popular with all viewers. According to Netflix, “40% of its global users have subtitles on all the time, and 80% switch them on at least once a month.” Captioning serves everyone; it is not just a prosthetic device for hearing impairment.

     For a myriad of reasons, able-bodied viewers turn on subtitles for screen works in languages they understand. One of the main reasons is inaudible dialogue caused by, ironically, advances in sound technology. In early filmmaking, actors articulated and projected towards fixed boom microphones. More recently, actors wear hidden, portable mics, which enable more intimate styles of delivery. As a result, actors tend to speak softly or even mumble, making dialogue inaudible.

     In the following video, Edward Vega at Vox explains why we all need subtitles now.

     The video above is accompanied by a short article on Vox’s website. You can access the article here.


     Sound design is an important element in Christopher Nolan’s science fiction film Inception (2010), which depicts the infiltration and manipulation of people’s subconscious through their dreams. Sounds and music, orchestrated by sound editor Richard King, transport audiences between imagined (dream) worlds and the embodied reality.

     Characters in Inception fall asleep to enter the dream world, and their non-sleeping companions play a pre-designated song, Edith Piaf’s “Non, je ne regrette rien,” to signal that it is time to wake up. The diegetic music reaches the characters inside their dreams.



     In the masked ball scene where Romeo meets Juliet for the first time in Baz Luhrmann’s William Shakespeare’s Romeo + Juliet (1996), a transition in sound design signals a change of mood to film audiences.

     As Romeo and Juliet discover and look at each other through a fish tank, the diegetic sound (music and chatter from the ongoing party) is turned down and drowned out by a romantic soundtrack. The intrusion of the nondiegetic music transports the couple and film audiences to a parallel universe that is disconnected from the party.

Romeo, dressed as a knight in shining armor, and Juliet, dressed as an angel with white wings, meet each other for the first time at a masked ball. They discover each other’s presence through an aquarium. Romeo stands to the left of the frame. The right of the frame is taken up entirely by the fish tank, with tropical fish swimming near Juliet’s face.


     An actor’s voice and accent are usually part of the diegesis of a scene. That voice can carry anxiety, joy, and a wide range of other emotions. Changes to the pitch, timbre, and cadence of a character’s voice can signify a turning point or a crisis.

     Gender is as important a factor as race in film sound. Toward the end of John Madden’s Shakespeare in Love (1998), Will Shakespeare and his fellow actors are ready to stage the premiere of Romeo and Juliet. Audiences have filled their playhouse. However, the company discovers a not-so-minor problem. The actor cast as Juliet is going through puberty, and his changing voice means he can no longer play the maiden.

     This clip is provided by Miramax’s official channel.

     Voices are not only racialized but often gendered as well, and the impression is further solidified by costumes, camerawork, and the mise-en-scène. It is particularly ironic that this film imposes a modern ideology of binary genders onto the early modern, gender-fluid world that it is supposed to portray. In Shakespeare’s time, women did not generally perform on the English professional stage. Characters of all genders were performed by all-male casts. Gender variance is a recurring motif in Shakespeare’s plays.


     One important, but often glossed over, element of diegetic sound is the accent. Actors, guided by accent coaches, can use a particular accent to fit the story. It is therefore useful to listen to, rather than simply watch, a film.

     Many people are used to seeing race. By paying attention to accents, we will notice that race and ethnicity are not only visible and palpable but also audible. Jennifer Lynn Stoever points out that “listening operates as an organ of racial discernment, categorization, and resistance.”

     Alexa Alice Joubin’s research shows that audiences often “actively and unconsciously listen for accents and other sonic registers” of characters. They listen attentively to more familiar accents, and filter out, selectively, unfamiliar ones.

     In other words, dominant listening habits visualize linguistic differences. It is a form of racialization.

     Take, for example, Spike Lee’s 2018 BlacKkKlansman, which follows Ron Stallworth (John David Washington), the first Black officer in Colorado Springs in 1972, as he infiltrates the local chapter of the Ku Klux Klan.

     In one scene, Stallworth speaks to David Duke (Topher Grace), Grand Wizard of the KKK, on the phone.

     Let us first listen to their phone conversation without any visuals.

     When we can only hear the voices and have no visual cues, how do we respond to the clip? Examine how we tend to “hear” or not hear racial identities. Here is a transcript of what they are saying.

Speaker 1:        It’s him.

Speaker 2:        Hey, Ron, I don’t share this with a lot of people, but …

Speaker 1:        Yeah. Well, I’m anxious to meet you.

Speaker 2:        Excited to meet you too, Ron.

Speaker 1:        Aren’t you ever concerned of some smart-ass n__ calling you pretending to be white?

Speaker 2:        No, I can always tell when I’m talking to a n__.

Speaker 1:        How so?

Speaker 2:        Take you for example, Ron. Okay.

Speaker 1:        Yes?

Speaker 2:        I mean, I can tell that you’re a pure Aryan white man from the way you pronounce certain words.

Speaker 1:        Can you gimme any examples?

Speaker 2:        Yeah. Take the word, uh, “are.” Pure Aryans like you or I would pronounce it correctly. N__ pronounces it “are-uh.” Did you ever notice that? It’s like, “Are-uh you gonna fry up that crispy fried chicken, soul brother?”

Speaker 1:        <laugh>. Wow. You are so white. Thank you for teaching me this lesson. If you had not brought it to my attention, I wouldn’t have noticed the difference between how we talk and how n__ talk.

Speaker 2:        Good. Good.

Speaker 1:        Yeah. I’d love to continue this conversation in Colorado Springs. It’s beautiful here, sir. God’s country.

Speaker 2:        Well, that’s what I hear, Ron. I look forward to meeting you and, uh, we’ll be talking real soon.

Speaker 1:        God bless white America.

     Now, let us watch this scene with audio and visual elements.


     Your Turn:     With both visuals and audio on, how have your reactions to the narrative changed? How do we see and/or hear race?

     Tips:     When describing sound and music, we can break our description down into three categories: the sound’s characteristics (such as pitch, volume, and quality), the source (whether it is diegetic or nondiegetic), and the type (musical, orchestral, vocal, dialogue, special sound effects).


Further Reading