Zoom, a communication platform, lets users install add-on apps that help with video conferencing. These apps are available from the Zoom App Marketplace, which launched in 2018; as of 2021, more than 1,000 apps were listed there. The marketplace offers a variety of tools to improve meeting productivity, including automatic note-taking and meeting timers.
Motivation: Zoom app development background
Among the listed apps, one of the tools that can help video conferencing the most is translation/interpretation. However, there are currently no apps listed that can translate multiple languages simultaneously. The closest you can get is the Closed Caption feature in the Zoom client, which automatically generates captions for one language without requiring any installation. However, this is not very helpful for people who speak different languages, as it provides the same monolingual closed captions for all participants.
XL8 is a global company with team members who speak different native languages. When they need to have a meeting, they usually use English. While working on machine translation technology using artificial intelligence, the team had an idea: "Wouldn't it be useful to develop an app that allows participants to communicate with each other in different languages during meetings?"
For example, in a meeting with Korean, English, and Japanese speakers, when someone speaks Korean, the speech is translated into English for the English speaker and into Japanese for the Japanese speaker, so everyone can simply speak in the language they prefer.
Developing a product that depends on a specific platform is risky from an engineering standpoint, but Zoom also has a large user base, which works to your advantage when it comes to deploying the app and getting people to use it.
Do Your Research: What Skills Do I Need?
Before we started developing the app in earnest, we listed the technologies we would need to implement it.
- An engine that can translate in real-time
- A programming language for developing the UI of the Zoom plugin app
- A way for the server to receive voice data from each participant in real time
As of July 23, 2018, we have an engine that can translate among 25 languages in both directions in real time. By bidirectional, we mean that every language can be translated into each of the other 24, regardless of whether Korean, English, or any other language is the source. People speaking up to 25 different languages can therefore be in the same meeting and still read the translated text in their own language.
Since a Zoom plugin app is a web application rendered in a web view inside the Zoom client, you can build it with modern web frameworks such as React or Vue.
The last item was the critical one for this real-time translation solution. According to the official Zoom documentation, you can invite a meeting bot, and once the bot has been granted recording permission, it can receive the raw voice data.
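To make the arithmetic concrete: if each of 25 languages can be translated into each of the other 24, there are 25 × 24 = 600 directed source-to-target combinations. The following is a minimal Python sketch of that count. The function name and the three language codes are illustrative assumptions, not the engine's actual language list.

```python
from itertools import permutations


def directed_pairs(languages):
    """Return every (source, target) combination where source != target."""
    return list(permutations(languages, 2))


# Hypothetical three-language subset, for illustration only.
sample = ["ko", "en", "ja"]
for source, target in directed_pairs(sample):
    print(f"{source} -> {target}")

# With 25 languages, each one can be translated into the other 24.
print(len(directed_pairs(range(25))))  # 25 * 24 = 600
```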
Designing: Optimal server configuration
Now that we had defined the tech specs we needed, we next designed how we would actually configure the server.
- XL8 real-time translation engine server
- One socket server to receive the speech sent by Zoom Bot
- A remote database server
- A backend server for data processing
- A front-end server for the website
While the structure is a little different now, this is an early version of the diagram we submitted for our application deployment review, and the overall process of how the app works is as follows.
- The meeting host launches the plugin app in the Zoom client, pre-specifies the language to be used in the meeting, and invites the Zoom bot.
- Once the bot joins, the voice data of everyone in the meeting is sent to our socket server.
- The socket server immediately sends the data to the XL8 translation server, and when it receives the translated data, it stores it in its database along with the IDs of the participants.
With this configuration, our socket server saves the voice to the database as translated text in real time, and the frontend server only needs to periodically fetch the data.
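The store-then-poll pattern described here can be sketched as follows. This is only an illustration under stated assumptions: `CaptionStore` is a hypothetical in-memory stand-in for the remote database, and the field names and sequence-number cursor are our own choices for the example, not the app's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Caption:
    seq: int              # position in the meeting log, starting at 1
    participant_id: str   # who spoke
    language: str         # language of the translated text
    text: str


@dataclass
class CaptionStore:
    """In-memory stand-in for the remote database (an assumption for this sketch)."""
    _rows: list = field(default_factory=list)

    def save(self, participant_id, language, text):
        """Called by the socket server when translated text arrives."""
        caption = Caption(len(self._rows) + 1, participant_id, language, text)
        self._rows.append(caption)
        return caption

    def fetch_since(self, last_seq, language):
        """Called by the frontend: captions in `language` newer than `last_seq`."""
        return [c for c in self._rows if c.seq > last_seq and c.language == language]


store = CaptionStore()
store.save("p1", "en", "Hello")
store.save("p1", "ja", "こんにちは")
store.save("p2", "en", "Nice to meet you")

# First poll from an English-speaking client, then a later poll with a cursor.
first = store.fetch_since(0, "en")
store.save("p1", "en", "Shall we start?")
later = store.fetch_since(first[-1].seq, "en")
print([c.text for c in later])
```

Keeping a cursor such as `last_seq` on the client means each periodic fetch returns only the captions the participant has not yet seen.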
Developing:
Front-end Side
The Zoom Marketplace currently supports several forms of app development. We chose Zoom Apps because we needed to create an app that would be available inside the Zoom client.
Because Zoom Apps runs inside the Zoom client, we can use the API to get some information that is accessible to the user (using the Zoom Apps SDK).
For example, on the front end the app can get the URL for joining the meeting, whether the current user is the host or a participant, and the user's ID. (Which APIs are available also depends on whether the user is the host or a participant.)
Next, we needed to design the UX of the app within these technical limitations. First of all, current translation engines don't automatically recognize which language a person is speaking, which means it's hard to determine from the voice alone whether someone is speaking English or Korean.
So we designed the app around two roles, host and participant. The host specifies in advance which languages will be used in the meeting and creates a kind of chat room. The rest of the participants then choose the language they want to speak and enter the room.
By storing each participant's language in a database in advance, the socket server would then be able to recognize which language the participant was speaking and know which language to translate it into.
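A minimal sketch of that lookup, assuming the stored data boils down to a mapping from participant ID to language. The function names are hypothetical, and `fake_translate` is a placeholder standing in for a call to the translation engine, not its real API.

```python
def target_languages(room_languages, speaker_id):
    """Given {participant_id: language}, return (source, targets) for a speaker.

    Targets are every other language present in the room, deduplicated.
    """
    source = room_languages[speaker_id]
    targets = sorted({lang for lang in room_languages.values() if lang != source})
    return source, targets


def fan_out(room_languages, speaker_id, text, translate):
    """Translate one utterance once per target language in the room."""
    source, targets = target_languages(room_languages, speaker_id)
    return {target: translate(text, source, target) for target in targets}


def fake_translate(text, source, target):
    # Placeholder for the real engine call; it only tags the text.
    return f"[{source}->{target}] {text}"


# Hypothetical room: languages chosen by participants before entering.
room = {"host": "ko", "alice": "en", "kenji": "ja", "bob": "en"}
print(fan_out(room, "host", "안녕하세요", fake_translate))
```

Deduplicating the targets means an utterance is translated once per language rather than once per participant, which matters when many participants share a language.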
Backend Side
This backend server is responsible for receiving information from the frontend, storing it in a database, and inviting the Zoom bot to join the meeting. One thing we had to pay special attention to during development is that, after a bot is invited to a meeting, the progress of its joining and recording is reported to this backend server in the form of webhooks.
After the meeting host sends a bot invite request, we just need to show the appropriate UI on the client for the bot status and make sure the service works for each step.
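One way to keep the client UI in step with those webhooks is a small status tracker. The sketch below is an assumption-laden illustration: the status names (`invited`, `joined`, `recording`, `left`, `failed`) and UI messages are hypothetical, since the actual webhook payloads are not shown in this post.

```python
# Hypothetical status names; the real webhook fields are not shown in this post.
PROGRESS = ["invited", "joined", "recording"]   # expected happy-path order
TERMINAL = {"left", "failed"}

UI_MESSAGES = {
    None: "Bot not invited yet.",
    "invited": "Inviting the bot...",
    "joined": "Bot joined. Waiting for recording permission...",
    "recording": "Translation is live.",
    "left": "The bot left the meeting.",
    "failed": "The bot could not join. Please try again.",
}


def next_status(current, incoming):
    """Return the status the UI should show after a webhook arrives."""
    if incoming in TERMINAL:
        return incoming
    if incoming not in PROGRESS:
        return current  # unknown status: keep what we have
    if current in PROGRESS and PROGRESS.index(incoming) < PROGRESS.index(current):
        return current  # stale or out-of-order webhook: ignore it
    return incoming


status = None
for event in ["invited", "joined", "recording", "joined"]:
    status = next_status(status, event)
    print(event, "->", UI_MESSAGES[status])
```

Ignoring stale events matters because webhooks can arrive late or out of order, and the UI should not step backwards from "recording" to "joined".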
Launching the app after development
Our real-time Zoom meeting interpreter app went from a small idea during a meeting to a full-fledged launch as EventCAT for Zoom, a service available to Zoom customers. In just two weeks since its official launch, the app, which translates multilingual meetings and seminars in real time, has been downloaded and is translating more than 100 hours of meetings per week for users around the world. Whether you're meeting with foreign coworkers, meeting with international clients, or hosting an online seminar, you can try it out for free today by clicking the link below.