As an early adopter of AI, Zhang Lu, the CEO of Soul, has always been keen on leveraging the technology's potential in social networking. Soul's team can be credited with developing several cutting-edge AI solutions that enhance digital interactions.
The latest offering from the extremely popular social networking platform came in the form of research on real-time, AI-driven portrait animation. The fact that the paper submitted by Soul Zhang Lu’s team was accepted at the Conference on Computer Vision and Pattern Recognition (CVPR) 2025 is in itself a testament to how groundbreaking the work is.
As one of the most prestigious conferences in artificial intelligence and computer vision, CVPR consistently attracts top-tier research. Be it industry leaders or researchers from top academic institutions, experts from all over the world are keen to showcase their work at the conference.
For instance, in 2025, more than 13,000 papers were submitted, of which a mere 2,878 were accepted. That's an acceptance rate of just 22.1%, which points to how rigorous the selection process is, as well as to the growing competition in the field.
So, the recognition from CVPR is undoubtedly a distinctive feather in the cap of Soul Zhang Lu's team. But this group of expert engineers is no stranger to such achievements. Soul's team also received recognition for their work at the 2024 ACM International Conference on Multimedia (ACM MM) and secured first place at the Multimodal Emotion Recognition Challenge (MER24).
The paper accepted by CVPR was titled "Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation". The research behind the paper centered on an autoregressive framework designed to make the generation of "talking-head" animations more efficient. The goal was to meet the steadily increasing demand for AI models that deliver human-like interactions in real time.
What makes the "Teller" framework a one-of-a-kind approach is the balance it strikes between output quality and efficiency. Traditional talking-head animation models typically demand significant computational resources, which translates into longer processing times and makes real-time use impractical.
In contrast, the model presented by Soul Zhang Lu's team makes use of an autoregressive motion generation framework. This model stays efficient without compromising the fluidity and authenticity of natural facial and body movements. The paper describes two primary components of this technology (a simplified sketch of how they fit together follows the list):
- Facial Motion Latent Generation (FMLG): By leveraging large-scale training data, FMLG improves the synchronization between audio and visual cues. This leads to more fluid and natural facial expressions in response to speech inputs.
- Efficient Temporal Module (ETM): Using a diffusion-based approach, this module accurately captures body dynamics. This enhances the realism of facial and body movements, and even the motion of accessories.
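For readers curious about what "autoregressive, frame-by-frame generation" looks like in practice, here is a minimal Python sketch. This is not Soul's implementation: the function names, dimensions, and update rules are placeholder assumptions that only mirror the structure described above, with an autoregressive facial-motion step standing in for FMLG, an iterative diffusion-style refinement standing in for ETM, and frames emitted as audio arrives rather than after the whole clip is processed.

```python
import numpy as np

FPS = 25          # assumed output frame rate
AUDIO_DIM = 64    # hypothetical per-frame audio feature size
MOTION_DIM = 32   # hypothetical facial-motion latent size

rng = np.random.default_rng(0)

# Stand-in weights; in the real system these would be learned parameters.
W_audio = rng.standard_normal((AUDIO_DIM, MOTION_DIM)) * 0.1
W_prev = rng.standard_normal((MOTION_DIM, MOTION_DIM)) * 0.1

def fmlg_step(audio_feat, prev_latent):
    """One autoregressive step: predict the next facial-motion latent
    from the current audio frame and the previously generated latent."""
    return np.tanh(audio_feat @ W_audio + prev_latent @ W_prev)

def etm_refine(latent, steps=4):
    """Toy stand-in for a diffusion-style module: start from noise and
    iteratively denoise toward the conditioning latent (illustration only)."""
    x = rng.standard_normal(MOTION_DIM)
    for t in range(steps, 0, -1):
        x = x + (latent - x) / t  # crude denoising update
    return x

def stream_animation(audio_frames):
    """Generate motion latents frame by frame, as a streaming model would,
    instead of waiting for the full audio clip before producing output."""
    prev = np.zeros(MOTION_DIM)
    for audio_feat in audio_frames:
        face = fmlg_step(audio_feat, prev)
        body = etm_refine(face)
        prev = face
        yield face, body  # downstream, a renderer would map latents to video frames

# Usage: one second of synthetic audio features, consumed as they "arrive".
audio = rng.standard_normal((FPS, AUDIO_DIM))
for i, (face, body) in enumerate(stream_animation(audio)):
    print(f"frame {i:02d}: face latent norm={np.linalg.norm(face):.2f}, "
          f"body latent norm={np.linalg.norm(body):.2f}")
```

The key property the sketch illustrates is that each frame depends only on audio received so far and the previous frame's output, which is what allows latency to stay constant per frame rather than scaling with clip length.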
In tests, Soul Zhang Lu's engineers found that this dual-module system enables AI-generated avatars to produce expressions and gestures that feel surprisingly human, in real time. It goes without saying that this degree of realism significantly improves the user experience in virtual interactions.
As mentioned earlier, Zhang Lu, the founder of Soul, was among the industry leaders who foresaw the potential of AI, particularly as it applies to social networking. In fact, when the technology was still in its nascent stages and the social platform was still trying to gain a foothold in the industry, the company was already gearing up to leverage the power of AI.
As early as 2016, when the app was just a couple of months old, Soul Zhang Lu began consistently investing in technological resources that would give the social networking platform an AI-driven edge. The first significant step in this direction came in the form of the self-developed Lingxi Engine, which was used to forge user connections based on mutual interests.
This was followed by rapid progress in the platform's AI capabilities, spanning speech- and text-based interaction as well as 3D virtual human modeling. A mere four years later, Soul was already on its way to harnessing the power of AI-generated content (AIGC). By 2020, the team was focused on using AI for intelligent dialogue systems and voice synthesis.
The launch of its proprietary AI model, Soul X, in 2023 shifted Soul Zhang Lu's AI ambitions into high gear. The homegrown model introduced features such as multilingual voice calls, speech synthesis, and AI-generated music to the platform.
The team’s recent breakthrough in the form of the “Teller” framework is another stride towards the goal of combining speech, vision, and natural language processing (NLP) to create AI-powered digital entities that can interact seamlessly with users in real time. The idea all along was to offer not just functional but also emotional companionship.
The company's vision for the future of socializing was explained succinctly by Soul App's Chief Technology Officer, Tao Ming, in a recent interview. He stated that human face-to-face conversations remain the most effective means of exchanging information, even in this digital age. As such, AI will need to replicate such interactions to provide digital experiences that are more emotionally engaging.
Simply put, Soul Zhang Lu envisions a future where AI avatars will be able to replicate real human expressions, making digital conversations feel more authentic. Soul's work in real-time video generation is expected to power applications such as:
- AI avatars capable of expressing emotions and responding dynamically to user interactions.
- AI-generated hosts and participants for interactive group experiences.
- Multilingual AI-driven video calls that enhance cross-cultural interactions.
Soul Zhang Lu believes that artificial intelligence should not be relegated to just the role of a conversation facilitator. Instead, the full potential of the technology should be put to use to create experiences that are emotionally fulfilling for the app’s users.