News Release

Beyond words

Study reveals the hidden musical grammar of natural speech

Peer-Reviewed Publication

Weizmann Institute of Science

The AI revolution, which has begun to transform our lives over the past three years, rests on a fundamental linguistic principle underlying large language models such as ChatGPT: words in a natural language are not strung together at random; rather, there is a statistical structure that allows a model to guess the next word from the words that came before. Yet these models overlook a crucial dimension of human communication: content that is not conveyed by words. In a new study published today in Proceedings of the National Academy of Sciences, USA (PNAS), researchers from Prof. Elisha Moses’s lab at the Weizmann Institute of Science reveal that the melody of speech in spontaneous conversations in English functions as a distinct language, with a “vocabulary” of hundreds of basic melodies and even rules of syntax that can be used to predict the next melody in a sequence. The study lays the foundation for artificial intelligence that understands language beyond words.
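
That statistical structure can be illustrated with a toy bigram model that counts which word follows which and then guesses the most frequent continuation. This is a minimal sketch over an invented three-sentence corpus, not a description of how models like ChatGPT actually work:

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def guess_next(following, word):
    """Guess the most frequent continuation of `word` seen in training."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Invented toy corpus, for illustration only
corpus = [
    "the melody of speech carries meaning",
    "the melody of a question rises",
    "the melody of speech is structured",
]
model = train_bigrams(corpus)
print(guess_next(model, "melody"))  # prints "of"
```

Real language models condition on far longer contexts with neural networks, but the underlying bet is the same: the next unit is statistically predictable from what preceded it.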

The melody, or music, of speech, known by the linguistic term “prosody,” encompasses variations in pitch (intonation), loudness (for example, for emphasis), tempo and sound quality (such as a whisper or a creaky voice). This form of expression predates words in evolution: Recent studies reveal that both chimpanzees and whales incorporate complex prosodic structures in their communication. In human communication, prosody adds a nuanced layer of meaning beyond words. A brief pause, much like a comma, can change the meaning of a sentence (“Let’s eat Grandma” versus “Let’s eat, Grandma”), and the tempo of spoken text can generate suspense. Linguists specializing in prosody have traditionally studied literary texts and the ways in which prosody reflects historical change. As a result, despite prosody’s critical importance for the understanding of human language, its study remained a niche field for years, devoid of applications and filled with conflicting ideas about prosody’s structure and significance.

“Our study lays the foundation for an automated system to compile a ‘dictionary’ of prosody for every human language and for different speaker populations”

Prosody, however, is an inherent part of every conversation. It assigns linguistic function to words – for instance, whether they pose a question or state a fact – and reveals the speakers’ attitude toward what they say. In the new study, led by linguist Dr. Nadav Matalon and neuroscientist Dr. Eyal Weinreb from Moses’s lab in Weizmann’s Physics of Complex Systems Department, the researchers analyzed prosody as an unfamiliar language, aiming to deliver a data-driven explanation for the linguistic mystery of prosody’s structure and meaning. Rather than relying on literature, they used two massive collections of audio recordings of spontaneous conversations, one of telephone conversations between two participants and the other of face-to-face conversations in various locations, such as a kitchen or classroom.

The first task for the research team was to compile a dictionary of the short melodies that function as “words” in English-language prosody and to assign each of them a function and a meaning. “To understand why there is no prosodic dictionary yet, it’s worth remembering that there wasn’t even a comprehensive English dictionary until the nineteenth century,” Moses says. “When the University of Oxford was tasked with compiling one, it asked the public to help with the workload by sending quotes showing the historical changes in the meaning of words. One of the main contributors was a prisoner who spent more than 20 years reading books and sending quotes. In our study, instead of collecting information by ourselves over the course of decades, we analyzed massive collections of audio recordings, using AI.”

The melody of each person’s speech is unique, but the AI model found several hundred basic patterns that recur, with slight variations, in all spontaneous English conversations. While written words are sequences of letters, a prosodic “word” is a short melody – a brief sequence of sounds with varying pitch, lasting about a second on average. To work out the meaning of these “words,” Matalon sampled 20 basic melodic patterns and then listened to the recordings again. “We discovered that each pattern has several linguistic functions,” he explains. “For example, depending on the context, a pattern can define whether someone is asking a question or making a statement. However, each pattern typically conveys one specific attitude of the speaker – such as curiosity, surprise or confusion – toward what’s being said. One common prosodic ‘word’ is a sharp rise of the pitch followed by a quick drop. This pattern signals enthusiasm and, depending on the context, can express strong agreement or acknowledgment of receiving important new information.”
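
The study’s actual pipeline is not detailed here, but the idea of distilling a small inventory of recurring melodies can be sketched as clustering: represent each melody as a short, fixed-length sequence of pitch values and group similar sequences together. The tiny k-means below, run on invented pitch contours, is an illustrative assumption, not the authors’ method:

```python
def distance(a, b):
    """Squared Euclidean distance between two equal-length pitch contours."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_contour(contours):
    """Pointwise average of a group of contours."""
    n = len(contours)
    return [sum(c[i] for c in contours) / n for i in range(len(contours[0]))]

def kmeans(contours, k, iterations=10):
    """Tiny k-means; centroids start at the first k contours (deterministic)."""
    centroids = [list(c) for c in contours[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for c in contours:
            best = min(range(k), key=lambda j: distance(c, centroids[j]))
            groups[best].append(c)
        # Keep the old centroid if a cluster ends up empty
        centroids = [mean_contour(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids, groups

# Invented pitch contours in hertz: two rising and two falling melodies
contours = [
    [120, 140, 160, 180], [118, 142, 158, 182],  # rising
    [180, 160, 140, 120], [182, 158, 142, 118],  # falling
]
centroids, groups = kmeans(contours, k=2)
print(len(groups[0]), len(groups[1]))  # prints "2 2"
```

On real recordings, pitch contours would first have to be extracted and normalized for speaker and duration, and the number of clusters would correspond to the few hundred basic patterns the study reports.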

“Oxford’s first full English dictionary came out in the nineteenth century, with the public helping manage the workload – including a prisoner who contributed for 20 years”

Next, the researchers tried to identify syntactic rules governing the order of these prosodic patterns, rules that could eventually allow language models to understand and use prosody. “We noticed that there are patterns that tend to appear next to each other, in pairs, in spontaneous speech,” Weinreb explains. “It’s a simple statistical system, in which the correct choice of the next unit in a sequence depends solely on the previous one. This system works well for spontaneous conversation because it requires planning only a few seconds ahead, which is about as long as short-term memory lasts.” These pattern pairs, the researchers discovered, act as simple sentences, each expressing “one new idea”: a pair relates to a specific topic and adds a single piece of information about it – for example, referring to a fact mentioned in the conversation and providing positive feedback.
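
The pairing Weinreb describes is, statistically, a first-order dependency: the next pattern can be predicted from the previous one alone. A minimal sketch that counts adjacent pattern pairs, using invented pattern labels rather than the study’s actual inventory:

```python
from collections import Counter

def frequent_pairs(sequences, top=3):
    """Count adjacent pattern pairs across conversations; the most frequent
    pairs act like simple 'sentences' expressing one new idea."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(top)

# Invented prosodic pattern labels, standing in for the study's melodic units
conversations = [
    ["rise-fall", "low-flat", "rise", "rise-fall", "low-flat"],
    ["rise", "rise-fall", "low-flat", "fall"],
]
print(frequent_pairs(conversations, top=1))
# prints [(('rise-fall', 'low-flat'), 3)]
```

Because the dependency is only one step deep, a model that has seen enough conversations can guess the next melody from the current one – the prosodic analogue of guessing the next word.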

“Our study lays the foundation for the development of an automated system that will compile a ‘dictionary’ of prosody and identify its syntactic rules for every human language and for different speaker populations,” Moses says.

“Prosody can vary depending on social status, historical events and the age of the speakers, and these variations can even manifest themselves in literary works that carefully reflect spontaneous speech,” Matalon adds. “We analyzed audiobooks as part of the study and discovered that prosodic patterns are longer in scripted speech and that the simple paired syntax of spontaneous conversation disappears. There are other differences, too. It’s safe to assume that the aging process and the acquisition of language in childhood are also accompanied by quantifiable prosodic changes. Moreover, there is evidence that prosody is important in internal speech – the language of thought – and our approach can deepen the understanding of the prosody of robotic voices produced by speech-generating devices. The model we created promises to close the gaps that emerged over the centuries in research into expression beyond words.”

A major future application of an automated prosodic dictionary might be the development of AI capable of understanding and conveying messages through the melody of speech rather than words alone. “Imagine if Siri could understand from the melody of your voice how you feel about a certain subject, what’s important to you or whether you think you know better than her,” Weinreb adds, “and that she could adapt her response to make it sound enthusiastic or sad. We already have brain implants that convert neural activity into speech for people who can’t speak. If we can teach prosody to a computer model, we’ll be adding a significant layer of human expression that robotic systems currently lack.”

Science Numbers

While English speakers use thousands of words a day in spontaneous conversation, this study reveals that their speech is complemented by only 200 to 350 basic prosodic patterns.

Also participating in the study were Dr. Dominik Freche from Weizmann’s Physics of Complex Systems Department; Dr. Erez Volk from NeuraLight Inc., Tel Aviv; Dr. Tirza Biron from Weizmann’s Computer Science and Applied Mathematics Department; and Prof. David Biron from the University of Chicago.

 

