Authors: Jacob Andreas, Gašper Beguš, Michael M. Bronstein, Roee Diamant, Denley Delaney, Shane Gero, Shafi Goldwasser, David F. Gruber, Sarah de Haas, Peter Malkin, Nikolay Pavlov, Roger Payne, Giovanni Petri, Daniela Rus, Pratyusha Sharma, Dan Tchernov, Pernille Tønnesen, Antonio Torralba, Daniel Vogt, Robert J. Wood
Paper: https://www.sciencedirect.com/science/article/pii/S2589004222006642
Project site: https://www.projectceti.org
I recently met Tom Mustill, author of 'How to Speak Whale', and learned about the CETI (Cetacean Translation Initiative) project, which I had somehow overlooked before. It's an incredibly cool project! The name seems to be a nod to SETI. The goal of the project is to understand the communication of sperm whales. Watching 'Avatar 2' somehow made this project resonate even more with me 🙂
The project has a roadmap article from mid-last year (June 2022) on how the participants aim to achieve an understanding of whale communication (spoiler: with the help of ML and robots!). This work could become a template for understanding other creatures: sperm whales are a good model organism, with developed neuroanatomy, high cognitive abilities, a stable social structure, and a discrete click-based encoding.
To understand language-like communication, several questions need answering.
What are the articulatory and perceptual building blocks that can be reliably produced and recognized? This is analogous to phonetics and phonology.
What are the rules of composition of these primitives? This is akin to morphology and syntax.
What are the rules for interpreting and assigning meaning to these blocks? This relates to semantics.
And how does context influence meaning? This is the realm of pragmatics.
Ideally, we want a universal automated data-driven toolkit that can be applied to non-human communication. But such a toolkit does not yet exist.
The application of ML is greatly hindered by the lack of large datasets for these tasks, so data collection (Record) and processing (Process), including annotating clicks with whale IDs, form a significant part of the project. Another aspect is decoding with ML and building a whale communication model (Decode). The final important part, Encode & Playback, involves interactive experiments with whales and refining the whale language model.
Studying whales is difficult compared to terrestrial animals because of their vastly different ecology and environment. For instance, it wasn't known until 1957 that sperm whales produce sounds at all, and only in the 1970s was it understood that they use them for communication. Whales travel thousands of kilometers, live up to 70 years or more, move through an essentially three-dimensional environment, often in stable social groups, and care for their young for extended periods. Social learning is probably more important for them than individual or genetically determined behavior. Most of their communication is apparently unimodal, through acoustics; in the photic zone, vision also plays a role.
The sperm whale's bioacoustic system evolved as a sensory organ used for echolocation and prey detection. Each click a whale produces consists of several pulses: a powerful first pulse followed by pulses of decaying amplitude, the result of the initial pulse reverberating in the spermaceti organ in the whale's nose. This is somewhat reminiscent of the fundamental frequency and formants in human speech, though of course one should be careful with anthropocentrism.
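To make the multipulse structure concrete, here is a toy synthesis sketch: a click modeled as one sharp pulse plus delayed, decaying copies. The sampling rate, inter-pulse interval, and decay factor are illustrative assumptions of mine, not values from the paper.

```python
import numpy as np

def synthetic_click(sr: int = 96_000, ipi_s: float = 0.004,
                    n_pulses: int = 4, decay: float = 0.4) -> np.ndarray:
    """Toy model of a sperm whale click: one sharp pulse plus delayed,
    exponentially decaying copies, mimicking reverberation in the
    spermaceti organ. All parameter values here are illustrative."""
    length = int(sr * (ipi_s * n_pulses + 0.002))
    click = np.zeros(length)
    pulse = np.hanning(int(sr * 0.0002))  # ~0.2 ms pulse shape
    for k in range(n_pulses):
        start = int(k * ipi_s * sr)
        click[start:start + len(pulse)] += (decay ** k) * pulse
    return click
```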
Besides echolocation, this organ is also used for communication: short packets of clicks (<2 seconds) with characteristic inter-click intervals (ICIs) and distinguishable patterns, called codas. A coda consists of 2-40 omnidirectional clicks. Different groups of whales have their own dialects, usually with at least 20 distinct codas per clan. Codas appear to carry rich information about the identity of the caller, but the communicative functions of specific codas remain a mystery.
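Since a coda is essentially described by its click times, a minimal data sketch is easy to write down. The representation below, contrasting absolute ICIs with a duration-normalized rhythm, is my illustration, not CETI's actual format; the five-click "1+1+3"-style pattern in the example is modeled on coda names from the literature.

```python
from dataclasses import dataclass

@dataclass
class Coda:
    """A coda as a sequence of click times (seconds from the first click)."""
    click_times: list[float]

    def icis(self) -> list[float]:
        """Absolute inter-click intervals in seconds."""
        return [b - a for a, b in zip(self.click_times, self.click_times[1:])]

    def rhythm(self) -> list[float]:
        """ICIs normalized by total duration: a tempo-invariant rhythm,
        relevant if codas turn out to be defined by relative, not
        absolute, ICIs."""
        total = self.click_times[-1] - self.click_times[0]
        return [ici / total for ici in self.icis()]

# A hypothetical "1+1+3"-like coda: two spaced clicks, then three fast ones.
coda = Coda(click_times=[0.0, 0.25, 0.50, 0.60, 0.70])
print(coda.icis())    # [0.25, 0.25, ~0.1, ~0.1]
print(coda.rhythm())  # same pattern regardless of overall tempo
```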
In communication there is clear turn-taking: whales respond within 2 s of each other, and exchanges can be conducted at ranges from meters to kilometers. Individuals within a family share a dialect with at least 10 common codas (though there is also individual variability). It takes calves at least two years to start producing distinguishable codas.
For classical supervised ML methods, obtaining labeled corpora is quite expensive and challenging (although there are some successful applications of supervised learning for whales). Self-supervised representation learning methods seem promising, but proper SSL needs large corpora, even if unlabeled. For example, the Dominica Sperm Whale Project (DSWP) dataset contains fewer than 10⁴ codas, although it has been collected since 2005. That is still far from the datasets behind GPT-3, with its ~10¹¹ tokens (though tokens and codas are hard to compare directly). CETI intends to collect a dataset of about 10⁹ clicks (comparable in scale to BERT's training data). The estimate is based on nearly continuous monitoring of 50-400 whales, with an expected 400M to 4B recorded clicks per year.
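A quick sanity check of that estimate, using only the figures quoted above:

```python
# Back-of-envelope check: what do the roadmap's collection figures imply
# per whale? (The per-whale daily rate is derived here, not a value
# reported in the paper.)
for whales, clicks_per_year in [(50, 400e6), (400, 4e9)]:
    per_day = clicks_per_year / whales / 365
    print(f"{whales} whales at {clicks_per_year:.0e} clicks/yr "
          f"-> ~{per_day:,.0f} clicks per whale per day")
# Both ends of the range work out to roughly 20-30k clicks per whale
# per day, so the estimate scales almost linearly with whales monitored.
```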
The technical part of the project consists of several sections.
Data acquisition: Data must be collected as non-invasively as possible, and a bunch of modern technologies can help: air and sea drones (like artificial fish), tags on whales, networks of buoys, and floaters.
Data processing: Storage and smart preprocessing of this data are needed (click detection, preliminary annotation, synchronization of different types of signals). The resulting dataset will be a kind of "whale social network".
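As an illustration of the click-detection step, here is a naive energy-threshold detector. It is a minimal sketch, not CETI's pipeline: the frequency band, threshold, and debounce window are my assumptions, and a real detector would have to cope with noise, overlapping whales, and separating echolocation clicks from coda clicks.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def detect_clicks(audio: np.ndarray, sr: int,
                  band=(2_000, 20_000), thresh_db: float = 12.0,
                  min_gap_s: float = 0.02) -> np.ndarray:
    """Naive energy-based click detector. Band-pass the signal, use the
    rectified signal as a crude envelope, threshold it relative to the
    median noise floor, and debounce. Returns click times in seconds.
    All parameter values are illustrative assumptions."""
    sos = butter(4, band, btype="bandpass", fs=sr, output="sos")
    envelope = np.abs(sosfiltfilt(sos, audio))
    noise_floor = np.median(envelope) + 1e-12
    above = envelope > noise_floor * 10 ** (thresh_db / 20)
    # Rising edges of the above-threshold mask are candidate clicks.
    edges = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    keep, last = [], -np.inf
    for t in edges / sr:
        if t - last >= min_gap_s:  # drop detections closer than min_gap_s
            keep.append(t)
            last = t
    return np.asarray(keep)
```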
Decoding and encoding: building a whale communication model, structured around the blocks listed earlier.
Phonetics and phonology: many open questions, for example, are codas defined by absolute ICIs, do spectral features of codas carry information, what distributional restrictions exist, etc.
Morphology and syntax: understanding the grammar of communication, the rules for constructing codas, and whether there is recursion (so far not found in any animal communication). Here, as in phonetics, large datasets are crucial.
Semantics: understanding the meaning of all these vocalizations, which requires preserving all the relevant context for grounding the discovered morphemes.
Discourse and social communication: understanding the protocols of communication, when and who speaks. Predictive conversation models (in a sense, analogs of LLMs or chatbots) can be built on this.
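As a toy illustration of what a predictive conversation model could look like at its very simplest, here is a bigram predictor over coda types. This is my sketch, not anything from the paper; real models would condition on speaker identity, timing, and behavioral context, and the coda-type labels in the example are only illustrative.

```python
from collections import Counter, defaultdict

class CodaBigram:
    """Toy next-coda predictor: P(next coda type | current coda type).
    A stand-in for the LLM-style dialogue models the roadmap envisions;
    coda types here are just labels (e.g. "1+1+3", "5R1")."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, exchanges: list[list[str]]) -> None:
        for seq in exchanges:
            for cur, nxt in zip(seq, seq[1:]):
                self.counts[cur][nxt] += 1

    def predict(self, cur: str) -> str | None:
        followers = self.counts.get(cur)
        return followers.most_common(1)[0][0] if followers else None

# Hypothetical labeled exchanges (coda-type names are illustrative only).
model = CodaBigram()
model.fit([["1+1+3", "1+1+3", "5R1"], ["1+1+3", "5R1", "4R2"]])
print(model.predict("1+1+3"))  # -> "5R1", the most frequent follower
```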
Aspects of redundancy and fault tolerance in the communication are also important to understand. A separate big question is language acquisition: how calves pick up this language. There may be many valuable patterns here: the order in which codas are learned, what the most basic structural blocks are and their functions, etc. More data is needed for this as well.
Playback-based validation will be valuable, though it has its own complexities: do we know what to play back, can we adequately reproduce it in the real environment (not from a boat, but from an autonomous device), and can we recognize the response?
In summary, this is a large, complex, and multidimensional project, with significant engineering and infrastructure parts, but insanely interesting and, I'm sure, very useful. Understand whales, understand aliens? Or should we first understand octopuses?