The new DAMO Academy model is an enhancement over previous vision-LLMs as it can tackle two challenges in video understanding. Photo: Shutterstock

Alibaba reveals progress with large language model research as Chinese Big Tech firms continue to push for ChatGPT rival

  • A group of researchers from DAMO Academy have unveiled a new audiovisual language model called Video-LLaMA
  • The new DAMO Academy model is an enhancement over previous vision-LLMs as it can tackle two challenges in video understanding

Alibaba Group Holding’s in-house research unit is making progress with its own large language models (LLMs), as Chinese Big Tech companies continue to pile into the artificial intelligence (AI) space in an attempt to come up with a rival to OpenAI’s ChatGPT.

A group of researchers from DAMO Academy unveiled a new audiovisual language model called Video-LLaMA, which enables the system to understand both visual and auditory content in videos, in a research paper published last week on arXiv, an online repository for scientific papers.

The code has also been open-sourced by the researchers on the online developer community GitHub. Alibaba owns the South China Morning Post.

LLMs, which are trained through machine learning, underpin AI-powered chatbots like ChatGPT, allowing them to answer sophisticated queries and generate detailed writing, code and other content.


The new DAMO Academy model is an enhancement over previous vision-LLMs, as it can tackle two challenges in video understanding: capturing the temporal changes in visual scenes and integrating audiovisual signals, according to the three researchers, Zhang Hang, Li Xin and Bing Lidong.

In a case demonstrated by the researchers, when given a video of a man playing a saxophone on stage, the model was able to describe in text both the background sound of applause and the visual content of the video. By comparison, previous models such as MiniGPT-4 and LLaVA mainly focused on static image comprehension, the researchers said.

Meanwhile, the researchers noted that the model is still “an early-stage prototype” with several limitations, including difficulty handling long videos such as films and TV shows.

The move comes as a part of broader efforts by Alibaba, which is in the midst of its largest-ever corporate restructuring, to double down on its investment in the development and application of LLMs.

Alibaba’s cloud unit in April unveiled its own alternative to ChatGPT, Tongyi Qianwen, which is based on DAMO’s LLMs, making Alibaba one of the earliest Chinese companies to join the ChatGPT bandwagon, along with search engine giant Baidu, which launched its Ernie Bot in March. The service had received more than 200,000 applications for beta testing from corporate clients, Alibaba chairman and CEO Daniel Zhang Yong said in a conference call with analysts last month.

DAMO first introduced a large AI model called Tongyi last September, when deputy head Zhou Jingren unveiled it at the World AI Conference in Shanghai. He described it as a multimodal pre-trained language model that is able to process different types of inputs including text, images, audio, and video.

Alibaba has started to work with partners to develop industry-specific AI models, Zhang said. For instance, it is planning to launch cloud products and enterprise solutions based on its AI model, and integrate AI capabilities into various products, including its workplace collaboration tool DingTalk.
