Say hello to J-Moshi, the first publicly available Japanese AI dialogue system that can speak and listen simultaneously
Nagoya University researchers have developed an AI chatbot that mimics natural conversation patterns
Nagoya University
image: The Higashinaka Lab is developing AI-human dialogue systems designed to work alongside human operators. As part of their research, a guide robot was deployed at Osaka’s NIFREL Aquarium to answer visitors’ questions about marine life. Human operators could step in to provide help with complex questions.
Credit: Higashinaka Lab, Nagoya University. Taken at NIFREL Aquarium, Osaka
How do you develop an AI system that mimics the way humans actually speak? Researchers at Nagoya University in Japan have taken a significant step toward achieving this. They have created J-Moshi, the first publicly available AI system specifically designed for Japanese conversational patterns.
J-Moshi captures the natural flow of Japanese conversation, which is marked by frequent short verbal responses known as "aizuchi" that speakers use to show they are actively listening and engaged. Responses such as “Sou desu ne” (that’s right) and “Naruhodo” (I see) occur far more often than comparable responses in English.
Conventional dialogue systems struggle with aizuchi because they cannot speak and listen at the same time, a capability that is especially important for natural-sounding Japanese dialogue. This is why J-Moshi has proved popular with Japanese speakers, who recognize and appreciate its natural conversational rhythm.
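To make the contrast concrete, here is a deliberately simplified, runnable Python toy, not Moshi's or J-Moshi's actual architecture (which is a neural model operating directly on audio streams): a half-duplex system would wait for the user's turn to end before producing anything, whereas a full-duplex one keeps consuming incoming speech and can slip in an aizuchi mid-turn. The chunked strings and timings below are invented purely for illustration.

```python
import queue
import threading
import time

incoming = queue.Queue()

def user_speaks():
    """Simulated user speech arriving in chunks (stand-in for streaming audio)."""
    for chunk in ["最近、", "名古屋に", "行ってきたんだ", "。"]:
        incoming.put(chunk)
        time.sleep(0.3)
    incoming.put(None)  # end of the user's turn

def system_listens_and_backchannels():
    """Full-duplex behaviour: keep listening while emitting an aizuchi mid-turn."""
    heard = []
    while (chunk := incoming.get()) is not None:
        heard.append(chunk)
        if len(heard) == 2:               # partway through the user's turn...
            print("system: うんうん")        # ...a backchannel, while still listening
    print("system: いいね！どうだった？")     # full response only after the turn ends

threading.Thread(target=user_speaks, daemon=True).start()
system_listens_and_backchannels()
```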
Building a Japanese Moshi model
The development team, led by researchers from the Higashinaka Laboratory at the Graduate School of Informatics, built J-Moshi by adapting the English-language Moshi model created by the non-profit laboratory Kyutai. The process took about four months and involved training the system using multiple Japanese speech datasets.
The largest of these was J-CHAT, a publicly available Japanese spoken dialogue dataset created and released by the University of Tokyo, containing approximately 67,000 hours of audio from podcasts and YouTube. The team also used smaller but higher-quality dialogue datasets, some collected within the lab and others dating back 20 to 30 years. To increase their training data, the researchers also converted written chat conversations into artificial speech using text-to-speech programs they developed for this purpose.
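As an illustration of that text-to-speech augmentation step, the sketch below renders a written two-party chat as a two-channel waveform, one channel per speaker, so it resembles spoken-dialogue training data. The function names, the 24 kHz sample rate, and the silent placeholder audio are assumptions made for illustration; the lab's actual TTS programs are not described in detail in this release.

```python
import numpy as np

SAMPLE_RATE = 24_000  # assumed; Moshi-style models commonly use 24 kHz audio

def synthesize_speech(text: str, speaker: str) -> np.ndarray:
    """Placeholder TTS: returns silence proportional to text length.
    In a real pipeline this would call a Japanese TTS model."""
    duration_s = max(0.5, 0.15 * len(text))
    return np.zeros(int(duration_s * SAMPLE_RATE), dtype=np.float32)

def dialogue_to_two_channels(turns: list[tuple[str, str]]) -> np.ndarray:
    """Render a written two-party chat as two aligned audio channels,
    one per speaker, with silence while the other speaker is talking."""
    channels = {"A": [], "B": []}
    for speaker, text in turns:
        audio = synthesize_speech(text, speaker)
        other = "B" if speaker == "A" else "A"
        channels[speaker].append(audio)
        channels[other].append(np.zeros_like(audio))
    return np.stack([np.concatenate(channels["A"]),
                     np.concatenate(channels["B"])])

if __name__ == "__main__":
    chat = [("A", "最近どう？"), ("B", "元気だよ、そっちは？")]
    print(dialogue_to_two_channels(chat).shape)  # (2, total_samples)
```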
In January 2025, J-Moshi gained significant attention when demonstration videos went viral on social media. Beyond its technical novelty, it has potential practical applications in language learning, such as helping non-native speakers practice and internalize natural Japanese conversation patterns. The research team is also exploring commercial applications in call centers, healthcare settings, and customer service, noting that adapting the system to specialized fields or industries is challenging because far less Japanese speech data is available than for English.
The research team's leader, Professor Ryuichiro Higashinaka, brings a unique perspective to academic AI research, having spent 19 years as a corporate researcher at NTT Corporation before joining Nagoya University five years ago. During his industry tenure, he worked on consumer dialogue systems and voice agents, including a project that realized the question-answering function of Shabette Concier, a voice-agent service from NTT DOCOMO. To further pursue research on human communication patterns, he set up his own lab at Nagoya University’s Graduate School of Informatics in 2020.
His 20-member lab now tackles challenges that bridge theoretical research and practical applications, from understanding conversational timing in Japanese to deploying AI guides in public spaces like aquariums.
“Technology like J-Moshi can be applied to systems that work with human operators. For example, our guide robots at the NIFREL Aquarium in Osaka can handle routine interactions independently and easily connect visitors to human operators for complex questions or when specialized assistance is needed," Professor Higashinaka said. “Our work is part of a national Cabinet Office Moonshot Project that aims to improve service quality through advanced AI-human collaboration systems.”
Opportunities and challenges for human-robot interactions
Professor Higashinaka explained the particular challenges facing Japanese AI research: “Japan suffers from a scarcity of speech resources, limiting researchers' ability to train AI dialogue systems. Privacy concerns also need to be considered.” This data shortage forced creative solutions, such as using speech-separation software to split the mixed voices in podcast recordings into the individual speaker tracks needed for training.
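The release does not name the team's separation tools. As one example of how such separation can be done with publicly available models, the sketch below uses SpeechBrain's pretrained Sepformer (the model name, import path, and 8 kHz sample rate follow SpeechBrain's public documentation); whether the lab used this particular model is an assumption.

```python
import torchaudio
from speechbrain.pretrained import SepformerSeparation  # speechbrain.inference.separation in newer versions

# Load a pretrained two-speaker separation model from the Hugging Face Hub.
model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix",
)

# Separate a mixed-voice podcast excerpt into individual speaker tracks.
est_sources = model.separate_file(path="podcast_excerpt.wav")  # (batch, time, n_speakers)

for i in range(est_sources.shape[-1]):
    torchaudio.save(f"speaker_{i}.wav", est_sources[:, :, i].detach().cpu(), 8000)
```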
Currently, dialogue systems have difficulty with complex social situations, especially when interpersonal relationships and physical surroundings must be taken into account. Face masks and hats can also impair performance because they cover important visual cues such as facial expressions. Testing at Osaka’s NIFREL Aquarium showed that the AI sometimes cannot handle a visitor's question and needs a human operator to step in and take over the conversation.
While J-Moshi represents a significant achievement in capturing natural Japanese conversational patterns, with overlapping speech and aizuchi interjections, these limitations mean it currently needs human backup for most practical applications. The researchers are working on technologies that make such handovers smoother, including dialogue summarization methods and dialogue breakdown detection systems that alert operators to potential problems so they can respond quickly.
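The release describes breakdown detection only at a high level. The toy sketch below illustrates the operator-alert idea with a stub scorer; the heuristic, threshold, and data structure are invented for illustration, and the lab's actual detector would be a trained model over the dialogue context.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    user_utterance: str
    system_response: str

def breakdown_probability(turn: Turn) -> float:
    """Stub detector: a real system would use a trained classifier
    (e.g., a fine-tuned language model) over the dialogue history."""
    vague_markers = {"わかりません", "すみません"}  # toy heuristic only
    return 0.9 if any(m in turn.system_response for m in vague_markers) else 0.1

def monitor(dialogue: list[Turn], threshold: float = 0.5) -> None:
    """Flag turns whose estimated breakdown probability exceeds the threshold."""
    for i, turn in enumerate(dialogue):
        if breakdown_probability(turn) >= threshold:
            print(f"[ALERT] possible breakdown at turn {i}: notify human operator")

monitor([Turn("この魚は何を食べますか？", "すみません、わかりません。")])
```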
The lab's broader research extends beyond J-Moshi and includes multiple methods for human-robot interaction. In collaboration with colleagues working on realistic humanoid robots, they are developing robot systems that coordinate speech, gestures, and movement for natural communication. These robots, including those manufactured by Unitree Robotics, represent the latest advances of AI in physical form, where dialogue systems must navigate not just conversational nuances but also physical presence and spatial awareness. The team regularly showcases their work during university open campus days, where the public can experience how AI dialogue systems are evolving firsthand.
Their paper on J-Moshi has been accepted at Interspeech, the largest international conference on speech science and technology. Professor Higashinaka and his team will present their J-Moshi research in Rotterdam, the Netherlands, in August 2025.
“In the near future, we will witness the emergence of systems capable of collaborating seamlessly with humans through natural speech and gestures. I aspire to create the foundational technologies that will be essential for such a transformative society,” Professor Higashinaka said.
LINKS:
- For more information on the Higashinaka lab’s research, please see here: https://www.ds.is.i.nagoya-u.ac.jp/en/home/
- Listen to audio of J-Moshi here: https://nu-dialogue.github.io/j-moshi/
- The codebase used for training J-Moshi is available here: https://github.com/nu-dialogue/moshi-finetune
- The paper “Towards a Japanese full-duplex spoken dialogue system” can be accessed here: https://arxiv.org/abs/2506.02979