The Speech Processing Group (SPG) is dedicated to research on speech, audio, and natural language processing tasks. The group is part of the Applied Artificial Intelligence Laboratory (LIAA) at the Computer Science Institute (University of Buenos Aires-CONICET) and was created in 2014 by its former director, Dr. Agustín Gravano.
SPG is currently directed by Dr. Luciana Ferrer and is composed of researchers and Ph.D. students from the School of Exact and Natural Sciences and the School of Engineering of the University of Buenos Aires. We are funded by a number of awards and grants from international companies and institutions, including Google, Amazon, JPMorgan, and the European Union, as well as Argentina's FonCyT and CONICET.
In this area, we investigate and develop artificial intelligence (AI) systems to identify voice pathologies, supporting healthcare professionals in the accurate diagnosis of patients. Additionally, we explore the development of automatic speech recognition systems that transcribe spoken language into text, providing a crucial solution for individuals with communication difficulties due to vocal pathologies. Finally, we study the use of AI systems for monitoring patients' progress and offering them personalized exercises to optimize their rehabilitation.
This project seeks to establish the sensitivity, robustness, and anatomical-functional basis of automated speech markers in Spanish speakers with mild cognitive impairment (MCI), integrating acoustic, textual, neuropsychological, and neuroscientific measures with machine learning algorithms.
Another research area of the group is the extraction of metadata from the speech signal. The metadata of interest include the speaker's identity, emotion, age, or language, as well as properties such as the transmission channel. These systems can be used, for example, to analyze large datasets, to improve the naturalness of human-computer dialog, or to control user access to a system.
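As a hedged illustration (not a description of our actual systems), the sketch below shows one common building block of such metadata extractors: comparing fixed-length speaker embeddings with cosine similarity to decide whether two recordings come from the same speaker. The embedding extractor is left abstract, and the decision threshold is a hypothetical placeholder; in practice it is calibrated on held-out trials.

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def same_speaker(emb_a, emb_b, threshold=0.6):
    # Hypothetical threshold; real systems tune it for a target
    # operating point (e.g., a desired false-alarm rate).
    return cosine_score(emb_a, emb_b) >= threshold

# Toy usage with random vectors standing in for real embeddings
# produced by an embedding model such as an x-vector network.
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=256), rng.normal(size=256)
print(cosine_score(e1, e2), same_speaker(e1, e2))
```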
This research area explores how trust and teaming characteristics in human-computer interaction (HCI) influence user speech, and whether it is possible to detect the user's level of trust toward a virtual assistant (VA) and the quality of the user-VA team from the user's speech. The approach involves manipulating the VA's characteristics, analyzing their impact on user perception, speech, and task outcomes, and developing methods to predict trust and team quality from speech.
One of our research goals is to understand and model the extraordinary degree of coordination exhibited by human beings while holding a conversation, both at the temporal level and along other dimensions of speech. The ultimate goal is to incorporate this knowledge into spoken dialogue systems, aiming to improve their naturalness.
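As one simplified example of how such coordination can be quantified, the sketch below measures acoustic-prosodic synchrony as the Pearson correlation between turn-level mean pitch values of two interlocutors across adjacent turn pairs. The feature choice and the pairing scheme are illustrative assumptions, not our specific method.

```python
import numpy as np

def turn_synchrony(feat_a, feat_b):
    """Pearson correlation between paired turn-level features
    (e.g., mean pitch) of two interlocutors.

    feat_a[i] and feat_b[i] come from the i-th pair of adjacent
    turns: speaker A's turn and speaker B's reply.
    """
    a, b = np.asarray(feat_a, float), np.asarray(feat_b, float)
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

# Toy example: mean pitch (Hz) over five adjacent turn pairs.
print(turn_synchrony([210, 195, 220, 205, 230],
                     [180, 170, 190, 175, 200]))
```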
Large Language Models (LLMs) are pre-trained on unlabeled data, usually by predicting the next token. This approach, followed by one or more supervised fine-tuning stages, yields high-performance systems that can be applied to multiple tasks. However, these systems are usually not adapted or calibrated to a specific downstream task, so the posterior probabilities output by the model are not interpretable. This research area explores how LLMs can be adapted to provide interpretable scores, which can in turn be used to build more reliable NLP systems.
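One standard post-hoc technique in this spirit is temperature scaling: a single scalar is fit on held-out labeled data so that the softmax posteriors better reflect the model's actual accuracy. The sketch below is a minimal illustration under stated assumptions (toy random data, grid search instead of gradient-based fitting), not a description of our systems.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temp):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits / temp)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.1, 5.0, 100)):
    """Pick the temperature that minimizes held-out NLL."""
    return min(grid, key=lambda t: nll(logits, labels, t))

# Toy example: mostly-correct but overconfident logits, 3 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
logits = rng.normal(size=(500, 3))
logits[np.arange(500), labels] += 2.0  # make the model mostly right
logits *= 4.0                          # ...and overconfident
t = fit_temperature(logits, labels)
print(f"fitted temperature: {t:.2f}")  # > 1 indicates overconfidence
```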
In this area, we investigate systems for guiding second-language learners through their learning process. In particular, we develop systems that generate scores measuring the quality of phonetic or stress pronunciation in a phrase, word, syllable, or phone. These scores can then be used in computer programs designed to complement the process of learning a new language.
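A classical instance of such scores is the goodness of pronunciation (GOP) family, which evaluates how strongly an acoustic model supports the intended phone over its aligned frames. The sketch below is a deliberately simplified, hypothetical version; real systems rely on forced alignment and phone-dependent decision thresholds.

```python
import numpy as np

def gop_score(frame_posteriors: np.ndarray, target_phone: int) -> float:
    """Simplified goodness-of-pronunciation score.

    frame_posteriors: (num_frames, num_phones) posteriors from an
    acoustic model, restricted to the frames aligned to one phone.
    Returns the mean log-posterior of the intended phone; values
    near 0 suggest canonical pronunciation, large negative values
    suggest a mispronunciation.
    """
    return float(np.mean(np.log(frame_posteriors[:, target_phone] + 1e-12)))

# Toy example: 4 frames, 3 phones; the intended phone has index 1.
post = np.array([[0.1, 0.8, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.3, 0.5, 0.2],
                 [0.1, 0.6, 0.3]])
print(gop_score(post, target_phone=1))
```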
We study the problem of learning representations for speech and audio signals using self-supervision and other deep learning techniques. These representations are useful, especially for transfer learning, as they allow more efficient training on problems like speech emotion recognition, automatic speech recognition, acoustic event detection, and music genre classification. Moreover, we analyze the learned representations to understand what they are (and are not) encoding, using techniques such as representation similarity analysis and probing.
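To make the probing idea concrete, the sketch below trains a simple linear probe on frozen representations: if the probe recovers a property from the embeddings, the representation encodes it in a linearly accessible way. The features and labels here are random placeholders standing in for real embeddings and annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 utterances, 768-dim frozen embeddings,
# and a binary property (e.g., whether the utterance is a question).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
y = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The probe is deliberately simple: a linear classifier on top of
# frozen features, so its accuracy reflects the representation,
# not the capacity of the probe itself.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # ~0.5 on random data
```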