Facebook AI has introduced GSLM (Generative Spoken Language Model), a textless NLP model that breaks away from text-based language models such as BERT, RoBERTa, and GPT-3.  Because text-based models can generate realistic text from a given input, they are used for many NLP applications, including sentiment analysis, translation, information retrieval, inference, and summarization.  Until recently, however, such models have been practical only for languages with very large text data sets.  Facebook’s research team trained GSLM’s encoder and uLM (unit-based language model) on 6,000 hours of audio from Libri-Light and LibriSpeech.  The model learns directly from raw audio, without needing text or labels.

Facebook AI’s GSLM leverages state-of-the-art representation learning to work directly from raw audio signals, without massive text data sets.  The baseline GSLM model has three components: an encoder (which converts speech into discrete sound units), a unit-based language model (which is trained to predict the next unit based on the units it has seen before), and a decoder (which converts units back into speech).  With GSLM, companies can develop NLP models that capture the wide range of nuance found in spoken language.  The plan is to apply GSLM to casual and spontaneous speech data sets, where text-based methods are currently inadequate.
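
To make the three-stage design concrete, here is a deliberately tiny sketch of an encoder, a unit language model, and a decoder chained together.  It is not Facebook's implementation: the 1-D "frames", the fixed centroids, the bigram counting model, and every function and class name below are assumptions chosen only to illustrate the flow of data.  The real system uses learned speech representations and neural models at each stage.

```python
from collections import Counter, defaultdict


def encode(frames, centroids):
    """Toy encoder: map each 1-D feature frame to the index of its nearest
    centroid, yielding a sequence of discrete units (hypothetical stand-in
    for a learned speech encoder plus quantization)."""
    return [
        min(range(len(centroids)), key=lambda k: abs(frame - centroids[k]))
        for frame in frames
    ]


class BigramUnitLM:
    """Toy unit language model: predicts the next unit from the previous one
    using bigram counts (a stand-in for a neural unit-based language model)."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1

    def predict_next(self, prev):
        followers = self.counts.get(prev)
        if not followers:
            return prev  # unseen context: fall back to repeating the unit
        return followers.most_common(1)[0][0]

    def generate(self, prompt, n_units):
        """Extend a non-empty prompt of units by n_units greedy predictions."""
        out = list(prompt)
        for _ in range(n_units):
            out.append(self.predict_next(out[-1]))
        return out


def decode(units, centroids):
    """Toy decoder: turn each unit back into a representative frame
    (a stand-in for a unit-to-speech synthesizer)."""
    return [centroids[u] for u in units]


# Usage: "audio" in, units through the language model, "audio" out.
centroids = [0.0, 1.0, 2.0]                 # assumed, not learned here
frames = [0.1, 0.9, 2.2, 0.05, 1.1, 1.9]    # made-up feature values
units = encode(frames, centroids)           # [0, 1, 2, 0, 1, 2]

lm = BigramUnitLM()
lm.fit([units])
continuation = lm.generate(units[:2], n_units=3)   # [0, 1, 2, 0, 1]
print(decode(continuation, centroids))             # [0.0, 1.0, 2.0, 0.0, 1.0]
```

Note that no text appears anywhere in the pipeline: the language model sees only unit indices, which is the sense in which the approach is "textless."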