African elephants address one another with
individually specific name-like calls
Michael A. Pardo, Kurt Fristrup, David S. Lolchuragi, Joyce H. Poole,
Petter Granli, Cynthia Moss, Iain Douglas-Hamilton & George Wittemyer
Personal names are a universal feature of human language, yet few analogues
exist in other species. While dolphins and parrots address conspecifics by
imitating the calls of the addressee, human names are not imitations of
the sounds typically made by the named individual. Labelling objects or
individuals without relying on imitation of the sounds made by the referent
radically expands the expressive power of language. Thus, if non-imitative
name analogues were found in other species, this could have important
implications for our understanding of language evolution. Here we present
evidence that wild African elephants address one another with individually
specic calls, probably without relying on imitation of the receiver. We
used machine learning to demonstrate that the receiver of a call could be
predicted from the call’s acoustic structure, regardless of how similar the
call was to the receiver’s vocalizations. Moreover, elephants differentially
responded to playbacks of calls originally addressed to them relative to calls
addressed to a different individual. Our findings offer evidence for individual
addressing of conspecifics in elephants. They further suggest that, unlike
other non-human animals, elephants probably do not rely on imitation of
the receiver’s calls to address one another.
A hallmark of spoken human language is the use of vocal labels: learned
sounds that refer to an object or individual (the ‘referent’). Many species
produce functionally referential calls for food and predators, but
the production of these calls is typically innate. Learned vocal labels
expand the expressive scope of communication by making it possible to
establish labels for new referents. Thus, they increase the sophistication
of cooperative behaviour and are central to humans’ ability to articulate
symbolic thought. Personal names are a type of vocal label that refers to
another individual. Names must involve vocal learning, as an individual
cannot be born knowing the names for all its future social affiliates.
Thus, non-human analogues of personal names are highly relevant
to understanding the evolution of language and complex cognition.
Most human words, including names, are arbitrary: they are not
imitations of sounds typically made by the referent or tied to its physical
properties. Arbitrariness is crucial to language because it enables
communication about referents that do not make any imitable sound.
However, clear evidence for arbitrary names in other species is lacking.
Bottlenose dolphins (Tursiops truncatus) and orange-fronted parakeets
(Eupsittula canicularis) address individual conspecifics by imitating the
receiver’s ‘signature’ call, a sound that is most commonly produced by
the receiver to broadcast their identity. While considered arbitrary
when used for self-identification, it may be argued that copied signature
calls used to address the call’s owner are iconic (non-arbitrary)
labels since they are an imitation of a sound most often produced by
the individual to whom the call refers. Non-imitative learned vocal
labelling may be more cognitively demanding than imitative labelling,
as it requires individuals to make an abstract connection between a
sound and referent. Evidence that arbitrary vocal labelling is not unique
to humans would expand the breadth of models for the evolution of
language and cognition.
Department of Fish, Wildlife, and Conservation Biology, Colorado State University, Fort Collins, CO, USA.
Department of Electronic and Computer
Save The Elephants, Nairobi, Kenya.
ElephantVoices, Sandefjord, Norway.
Amboseli Elephant Research Project, Nairobi, Kenya. e-mail:
Nature Ecoogy & Evoution
call pairs with same receiver, 179 pairs with different receivers, χ² = 13.0,
P = 0.0003, partial η² = 0.063) (Fig. 1 and Extended Data Table 4). This
indicates that rumbles contain information specific to the individual
receiver, not merely to the caller or to the type of relationship between
the caller and receiver (Table 1, hypothesis 1, prediction 2).
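Random forest proximity between two calls is conventionally defined as the fraction of trees in which both calls fall in the same terminal node. The sketch below computes that quantity from per-tree leaf assignments and compares mean rank-transformed proximity between pair types. The leaf indices, call metadata and helper names are hypothetical, and the sketch omits the controls for relationship type, behavioural context and recording date that the analysis described here included.

```python
from itertools import combinations

def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two calls land in the same terminal node."""
    same = sum(a == b for a, b in zip(leaves_a, leaves_b))
    return same / len(leaves_a)

def average_ranks(values):
    """Rank transform (1 = smallest) with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

# Hypothetical calls: (caller, receiver, leaf index in each of 5 trees).
calls = [
    ("M1", "R1", [0, 2, 1, 3, 0]),
    ("M1", "R1", [0, 2, 1, 3, 1]),
    ("M1", "R2", [1, 0, 1, 2, 2]),
    ("M1", "R2", [1, 0, 2, 2, 2]),
]

pairs = []
for (c1, r1, l1), (c2, r2, l2) in combinations(calls, 2):
    if c1 == c2:  # restrict to same-caller pairs
        pairs.append((r1 == r2, proximity(l1, l2)))

ranks = average_ranks([p for _, p in pairs])
same = [rk for (is_same, _), rk in zip(pairs, ranks) if is_same]
diff = [rk for (is_same, _), rk in zip(pairs, ranks) if not is_same]
mean_same = sum(same) / len(same)
mean_diff = sum(diff) / len(diff)
```

In this toy example the same-receiver pairs have the higher mean rank, which is the direction of the effect reported above.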
Vocal labels more likely in certain contexts and
age classes
For 87.4% of calls, receiver ID was predicted consistently correctly or
consistently incorrectly across >95% of random forest iterations. We
used logistic regression to assess factors influencing the probability
of correct classification. Contact (n = 138, 42.0% correct) and caregiv-
ing rumbles (n = 62, 46.8% correct) were more likely to be correctly
classified than greeting rumbles (n = 127, 3.9% correct) (care/contact:
P = 0.264, odds ratio 6.4; care/greeting: P = 0.014, odds ratio 48.9;
contact/greeting: P = 0.047, odds ratio 7.6) (Extended Data Table 5).
Calls from adult females (n = 274, 32.8% correct) were more likely to
be predicted correctly than calls from juveniles (n = 53, 3.8% correct)
(χ² = 6.5, P = 0.011, odds ratio 0.067). Calls that occurred later in the
bout were more likely to be predicted correctly (χ² = 3.8, P = 0.0498,
odds ratio 2.8), as were calls addressed to receivers with more total
calls in our dataset (χ² = 7.6, P = 0.006, odds ratio 1.4).
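For intuition about the odds ratios reported above, an unadjusted odds ratio can be computed directly from two proportions. This is only a worked illustration: the published values come from a logistic regression that includes other predictors, so the unadjusted figures computed here are not expected to match them. The function name is hypothetical.

```python
def odds_ratio(p_group, p_reference):
    """Unadjusted odds ratio of a group versus a reference, from two proportions."""
    odds_group = p_group / (1 - p_group)
    odds_reference = p_reference / (1 - p_reference)
    return odds_group / odds_reference

# Proportions correct taken from the text: juveniles 3.8%, adult females 32.8%.
juvenile_vs_adult = odds_ratio(0.038, 0.328)

# Contact rumbles (42.0% correct) versus greeting rumbles (3.9% correct).
contact_vs_greeting = odds_ratio(0.420, 0.039)
```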
No evidence for imitation of receiver in vocal labels
Elephants are not known to produce discrete ‘signature’ calls like dolphins
and parrots; instead, the caller specificity of elephant rumbles
is probably a product of voice characteristics. If elephants address
individual receivers by imitating the receiver’s voice, they should sound
more like the receiver when addressing her than when addressing other
individuals. Among the calls for which we had recordings of the receiver
and recordings of the caller addressing other individuals (n = 236),
59.7% were divergent from the receiver’s calls; that is, less similar to the
Elephants are among the few mammals capable of mimicking
novel sounds, although the function of this vocal learning ability is
unknown. The most common elephant call type is the rumble, a harmonically
rich, low-frequency sound that is individually distinct
and is produced across most behavioural contexts.
Contact rumbles are long-distance calls produced when the caller is
out of sight and more than ~50 m from one or more social affiliates and
attempting to reinitiate contact. Greeting rumbles are affiliative calls
produced when one individual approaches another to within touching
distance. Caregiving rumbles are affiliative calls produced by an adult
or adolescent female while suckling, comforting or rousing a calf.
In this Article, we analysed contact, greeting and caregiving rumbles
from female–offspring groups of wild African savannah elephants
(Loxodonta africana) to assess whether they contain individual vocal
labels. We investigated (1) if elephants address conspecifics using
receiver-specific vocal labels, (2) if the labels are imitative of the receiv-
er’s calls or arbitrary, (3) if different callers share the same label for
the same receiver and (4) if playbacks to the assumed receiver elicit
behavioural responses indicating label recognition (Table 1).
For contact calls, we defined the receiver as the only adult mem-
ber of the family group separated (>50 m) from the caller or the only
individual who responded to the call by vocalizing or approaching.
For greeting calls, the receiver was the individual who approached or
was approached by the caller. For caregiving calls, the receiver was
the calf being suckled, comforted or roused by the caller. We excluded
calls with uncertain or multiple recipients. Given the complexity of
elephant vocalizations, it was not clear what acoustic features were
optimal for capturing the relevant variation in the calls. Thus, we ran
models separately for two different sets of features measured on each
call (spectral and cepstral; Extended Data Fig. 1 and Extended Data
Table 1). The results reported in the text and figures are for the spectral
features (see tables for cepstral results, which were similar).
Calls were specific to individual receivers
We ran a random forest with sevenfold cross-validation to predict the
receiver of each rumble as a function of the acoustic features. Call structure
varied with the identity of the targeted receiver (Extended Data
Figs. 2 and 3), as expected if elephants vocally label other individuals.
Our model correctly identified the receiver for 27.5% of calls analysed,
a significantly greater proportion than achieved by models with ran-
domly permuted acoustic features (permutation test, mean ± standard
deviation (s.d.) accuracy for 10,000 permuted models: 8.0 ± 0.66%
correct, one-tailed P < 0.0001) (Fig. 1 and Extended Data Table 2). This
indicated that receivers of calls could be correctly identified from call
structure statistically significantly better than chance (Table 1, hypoth-
esis 1, prediction 1).
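The logic of this permutation test can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: a nearest-centroid classifier stands in for the random forest, leave-one-out evaluation stands in for sevenfold cross-validation, and the toy data, function names and number of permutations are all assumptions. Following the description of randomly permuted acoustic features, each null model shuffles the feature vectors across calls, which breaks any link between acoustics and receiver.

```python
import random

def nearest_centroid_accuracy(features, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (random forest stand-in)."""
    n = len(features)
    correct = 0
    for i in range(n):
        # Build per-label centroids from every call except call i.
        sums, counts = {}, {}
        for j in range(n):
            if j == i:
                continue
            lab = labels[j]
            if lab not in sums:
                sums[lab] = [0.0] * len(features[j])
                counts[lab] = 0
            sums[lab] = [s + x for s, x in zip(sums[lab], features[j])]
            counts[lab] += 1

        # Predict the label whose centroid is closest to the held-out call.
        def dist(lab):
            return sum((s / counts[lab] - x) ** 2 for s, x in zip(sums[lab], features[i]))

        predicted = min(sums, key=dist)
        correct += predicted == labels[i]
    return correct / n

def permutation_test(features, labels, n_perm=200, seed=0):
    """One-tailed P: share of permuted models at least as accurate as the observed model."""
    rng = random.Random(seed)
    observed = nearest_centroid_accuracy(features, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = features[:]
        rng.shuffle(shuffled)
        if nearest_centroid_accuracy(shuffled, labels) >= observed:
            hits += 1
    # Add-one correction so the estimated P is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 3 receivers, 8 calls each, 2 acoustic features per call.
rng = random.Random(1)
labels = [r for r in ("A", "B", "C") for _ in range(8)]
centres = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (0.0, 3.0)}
features = [[centres[l][0] + rng.gauss(0, 0.5), centres[l][1] + rng.gauss(0, 0.5)] for l in labels]
observed, p_value = permutation_test(features, labels)
```

Comparing the observed accuracy with the distribution of permuted accuracies, rather than with a nominal 1/k chance level, accounts for unequal numbers of calls per receiver.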
As caller ID and receiver ID were partially aliased in our dataset
(Supplementary Table 1), the random forest could theoretically use
acoustic cues to caller ID to predict receiver ID, even if the calls did
not contain any vocal label. To assess this possibility, we compared the
mean similarity of pairs of calls with the same caller and receiver to the
mean similarity of pairs of calls with the same caller and different receivers,
using proximity scores derived from the random forest as a metric
of call similarity. If the random forest relied entirely on cues to caller ID
to predict receiver ID, there should be no difference in proximity score
between ‘same caller/same receiver’ pairs and ‘same caller/different
receivers’ pairs. To control for the possibility that calls were specific to
the type of relationship between the caller and receiver rather than to
individual receivers, we categorized social relationship on the basis of
relatedness and age (Extended Data Table 3) and only considered pairs
of calls with the same type of relationship between caller and receiver.
Calls with the same caller and receiver were significantly more similar
(higher proximity scores) than calls with the same caller and different
receivers, even after controlling for social relationship, behavioural
context and recording date (rank-transformed linear model, n = 1,105
Table 1 | Hypotheses and predictions tested in this study and whether they were supported
Hypothesis | Prediction | Supported?
1. Elephants vocally label individual conspecifics | 1. Receiver ID can be predicted from call structure | Yes
1. Elephants vocally label individual conspecifics | 2. Calls with same caller and same receiver will be more similar than calls with same caller and different receivers, while controlling for caller–receiver relationship type | Yes
1. Elephants vocally label individual conspecifics | 3. Elephants will respond more strongly to playback of call originally addressed to them than to playback of call from same caller originally addressed to another individual | Yes
2. Vocal labels are arbitrary (not imitative of receiver’s calls) | 1. Receiver can be predicted from call structure regardless of whether calls are convergent or divergent from receiver’s calls relative to other calls by the same caller | Yes
2. Vocal labels are arbitrary (not imitative of receiver’s calls) | 2. Calls from caller A to receiver B will be no more similar to receiver B’s calls than calls from caller A to other receivers are to receiver B’s calls | Yes
3. Different callers use same label for same receiver | 1. Calls with different callers and same receiver will be more similar than calls with different callers and different receivers | Yes
3. Different callers use same label for same receiver | 2. Receiver ID can be predicted from call structure independently of caller ID | No
Nature Ecoogy & Evoution
receiver’s calls than typical for that caller. The random forest’s predic-
tion accuracy was significantly better than baseline expectations for
both convergent and divergent calls (Table 1, hypothesis 2, prediction 1)
(permutation test; convergent calls: 20.1% correct, permuted models
mean ± s.d. accuracy of 7.7 ± 1.3%, n = 95 calls, one-tailed P < 0.0001;
divergent calls: 32.6% correct, permuted models mean ± s.d. accuracy
of 17.9 ± 1.6%, n = 141 calls, one-tailed P < 0.0001) (Fig. 2 and Extended
Data Table 2).
Proximity scores for pairs of calls in which the receiver of one
call made the other call were marginally higher than for pairs in
which this was not the case, but this was not statistically significant
(rank-transformed linear model, n = 943 call pairs where receiver of
one call made the other call, 1,553 pairs where this was not the case,
χ² = 3.7, P = 0.056, partial η² = 0.001) (Fig. 2 and Extended Data Table 6).
This suggests that calls addressed to a given receiver were no more con-
vergent with the receiver’s calls than with calls from other individuals
(Table 1, hypothesis 2, prediction 2). Collectively, the evidence suggests
that vocal labelling in elephants probably does not rely on imitation
of the receiver’s calls. However, a definitive conclusion about the role
of imitation will require exhaustively sampling the vocal repertoire
of each caller.
Mixed evidence for shared labels across callers
In humans and bottlenose dolphins, different callers generally use
the same label for a given receiver. To determine if elephants do the
same, we further examined call proximity scores. Calls from differ-
ent callers to the same receiver were significantly more similar than
calls from different callers to different receivers (Table 1, hypothesis
3, prediction 1) (rank-transformed linear model, n = 693 call pairs with
same receiver, 7,522 pairs with different receivers, χ² = 10.7, two-tailed
P = 0.001, partial η² = 0.004) (Fig. 3 and Extended Data Table 7). This
suggests that there was some vocal convergence among different call-
ers addressing the same receiver.
We then ran a random forest structured to predict receiver ID
from different callers than the model was trained on (n = 437 calls)
(Table 1, hypothesis 3, prediction 2). This model correctly classified
1.1% of calls, no better than the corresponding models with randomly
permuted acoustic features (permutation test, mean ± s.d. accuracy of
permuted models 1.4 ± 0.33% correct, one-tailed P = 0.896) (Fig. 3 and
Extended Data Table 2). Therefore, the random forest was not able to
predict receiver ID independently of caller ID, suggesting convergence
across callers was weak.
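One way to test generalization across callers is to keep every call from a given caller–receiver combination in the same cross-validation fold, so that the model is always tested on combinations it has not seen during training. The sketch below shows such a fold assignment. The greedy balancing rule, the fold count and the example IDs are assumptions for illustration, not the implementation used in the study.

```python
from collections import defaultdict

def grouped_folds(calls, n_folds):
    """Assign each (caller, receiver) group of calls to a single fold.

    calls: list of (caller, receiver) tuples, one per call.
    Returns a list giving the fold index of each call.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(calls):
        groups[key].append(idx)
    fold_sizes = [0] * n_folds
    fold_of_call = [None] * len(calls)
    # Place the largest groups first, always into the currently smallest fold.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        fold = fold_sizes.index(min(fold_sizes))
        for idx in groups[key]:
            fold_of_call[idx] = fold
        fold_sizes[fold] += len(groups[key])
    return fold_of_call

# Hypothetical calls: callers C1 and C2 both address receiver R1.
calls = [("C1", "R1"), ("C1", "R1"), ("C2", "R1"), ("C2", "R2"), ("C3", "R2"), ("C1", "R1")]
folds = grouped_folds(calls, n_folds=3)
```

Here the calls from C1 to R1 and from C2 to R1 end up in different folds, so a model tested on one fold must recognize R1 from a caller whose calls to R1 it was not trained on.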
Playback confirms receiver recognition of vocal labels
To determine if elephants perceive and respond to the vocal labels
in calls addressed to them (Table 1, hypothesis 1, prediction 3), we
compared reactions of 17 wild elephants to playback of a call that was
originally addressed to them (test) relative to playback of a call from
the same caller that was originally addressed to a different individual
(control). By using test and control stimuli from the same caller, we
controlled for the possibility of the caller’s relationship to the subject
influencing the results. To control for the possibility that calls were spe-
cific to the type of relationship between the caller and receiver rather
than to the individual receiver, we included the type of relationship
between the caller and the original receiver as a factor in the analysis.
Further supporting the existence of vocal labels, subjects approached
the speaker more quickly (Cox regression, χ² = 6.8, P = 0.009, hazards
ratio 8.77), vocalized more quickly (Cox regression, χ² = 7.9, P = 0.005,
hazards ratio 7.45) and produced more vocalizations (Poisson regression,
χ² = 6.7, P = 0.009, rate ratio 2.41) in response to test playbacks
than control playbacks (Fig. 4 and Table 2). In trials where an approach
or vocalization occurred, the mean ± s.d. latency to the first approach
or vocalization was 99.7 ± 161.4 s.
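Latency data of this kind are right-censored: a subject that never approaches during the observation window contributes only a lower bound on its latency. A cumulative probability curve can be obtained as one minus a survival estimate. The sketch below uses the Kaplan–Meier estimator, which is an assumption for illustration (the text specifies only Cox regression), and the latencies and the 600 s observation window are invented.

```python
def cumulative_incidence(latencies, observed):
    """Kaplan-Meier estimate of the cumulative probability of an event over time.

    latencies: time to event, or time of censoring if no event occurred.
    observed:  True where the event (for example, an approach) occurred.
    Returns a list of (time, cumulative probability) at each event time.
    """
    survival = 1.0
    curve = []
    for t in sorted({lat for lat, obs in zip(latencies, observed) if obs}):
        at_risk = sum(lat >= t for lat in latencies)
        events = sum(lat == t and obs for lat, obs in zip(latencies, observed))
        survival *= 1 - events / at_risk
        curve.append((t, 1 - survival))
    return curve

# Hypothetical latencies (s) for five subjects; two never responded within 600 s.
latencies = [30, 120, 120, 600, 600]
observed = [True, True, True, False, False]
curve = cumulative_incidence(latencies, observed)
```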
Discussion and conclusions
Very few species are known to address conspecifics with vocal labels.
Our discovery of individual vocal labels in a species that diverged from
both the primate and cetacean lineages ~90–100 million years ago
provides an important opportunity to study the convergent evolution
of unusually sophisticated communication. Moreover, where
evidence for vocal labels has been found in non-human species, they are
either clearly imitative or of unknown structure. Our data suggest
that elephants may label conspecifics without relying on imitation of
the receiver’s calls, a phenomenon previously known to occur only in
human language. If further research supports the absence of receiver
imitation in elephant vocal labels, then investigating the social context,
acoustic structure and ontogeny of vocal labels in elephants may shed
light on why elephants and humans developed non-imitative vocal
labels in contrast to other species known to vocally label conspecifics.
Our results also have significant implications for elephant cognition,
as inventing or learning sounds to address one another suggests the
capacity for some degree of symbolic thought.
The existence of individual vocal labelling in elephants is sup-
ported by multiple lines of evidence that exclude simpler alternative
explanations. Receiver ID could be predicted from call structure sig-
nificantly better than chance. Moreover, analysis of random forest
proximity scores showed that calls from the same caller to the same
receiver were significantly more similar than calls from the same caller
to two different receivers who had the same type of relationship with
the caller. This ruled out the alternative explanations that call structure
predicted receiver ID because of the correlation between caller ID and
receiver ID in our dataset or that call structure reflected only the type
of relationship between caller and receiver and not the individual
[Fig. 1 axes. Left: classification accuracy. Right: rank-transformed proximity score by same-caller pair type (same caller, same receiver; same caller, different receivers).]
Fig. 1 | Evidence that calls are specific to individual receivers within a caller.
Left: the classification accuracy of a random forest predicting receiver ID from
acoustic features (red line) was significantly higher than the classification
accuracies of 10,000 models predicting receiver ID from randomized acoustic
features (black histogram) (n = 437 calls, permutation test, one-tailed
P < 0.0001). Cross-validation folds were stratified so that the model was
trained and tested on the same combinations of caller and receiver; thus, the
classification accuracy represents the receiver specificity of calls within a caller.
Right: calls with the same caller and same receiver were significantly more similar
(higher proximity score) than calls with the same caller and different receivers
who had the same type of relationship to the caller (n = 1,105 call pairs with same
receiver, 179 pairs with different receivers, ANOVA on ranks, χ² = 13.0, d.f. 1,
two-tailed P = 0.0003, partial η² = 0.063). Boxplot centre lines, medians; box limits,
25th and 75th quantiles; whiskers, 1.5× interquartile range.
Nature Ecoogy & Evoution
identity of the receiver. We also controlled for behavioural context and
recording date in the proximity score analysis, ensuring that receiver
specificity was not an artefact of context-related cues or autocorrela-
tion among calls from the same day. The results did not change when
two individuals that accounted for a disproportionate number of calls
in the dataset (M6 and M6.99) were excluded, indicating that our results
were not driven by a few highly influential individuals (Supplementary
Information). Most importantly, elephants responded more strongly
to playback of calls addressed to them than to playback of calls from
the same caller addressed to a different receiver, indicating that the
calls contained receiver-specific information that was salient to the
elephants. The difference in response to test and control trials was
often pronounced. For example, subject R26 vocalized eight times and
approached the speaker in response to the test playback but vocalized
only once and did not approach the speaker in response to the control
playback. Only one subject exhibited an unambiguously stronger
response to the control playback than to the test playback. These
results are particularly notable in that we could not be certain that all
playback stimuli contained vocal labels.
The social behaviour and ecology of elephants create an environ-
ment in which individual vocal labelling may be particularly advanta-
geous. Elephants maintain lifelong differentiated social bonds with
many individuals, and due to their fission–fusion social dynamics are
often separated from their closely bonded social partners. In contact
calls, where the caller and receiver are separated, vocal labels probably
allow elephants to attract the attention of a specific distant receiver.
In close-distance calls such as greeting and caregiving rumbles, vocal
labels may help strengthen social bonds, similar to the way in which
humans experience a positive affective response and increased willingness
to cooperate when someone remembers their name.
Our random forest model correctly predicted receiver ID for
slightly over a quarter of calls (albeit significantly better than ran-
dom), suggesting that vocal labels may not be necessary in all or even
most contexts. Indeed, both humans and bottlenose dolphins only use
individual vocal labels (that is, names or imitated signature whistles) in
a small percentage of utterances. We found that receiver ID was more
likely to be correctly predicted for contact and caregiving rumbles
than for greeting rumbles, which suggests that vocal labels may be
used more in the former two contexts. Vocally identifying the intended
receiver seems especially likely to be beneficial in contact calls, where
the caller and receiver are out of visual and tactile contact. It is some-
what surprising, however, that caregiving rumbles were more likely to
be correctly classified than greeting rumbles, as both are close-distance
affiliative calls. Perhaps labels are included in caregiving rumbles to
help calves learn the labels with which others address them or because
hearing the label is comforting for calves. Calls made by adult females
were also more likely to be correctly classified than calls made by
juveniles. This suggests that adult females may use vocal labels more
than calves, possibly because the behaviour takes years to develop.
Elephant rumbles are highly complex and simultaneously encode
multiple messages, including but not limited to caller identity, age,
sex, emotional state and behavioural context. The top acoustic
features for predicting receiver ID were not those that explained the
most variation in the calls (Supplementary Discussion), suggesting that
[Fig. 2 axes. Left: classification accuracy for convergent calls and divergent calls. Right: rank-transformed proximity score by imitation pair type (Call A receiver is Call B caller; Call A receiver not Call B caller).]
Fig. 2 | Evidence that vocal labelling probably did not rely on imitation of the
receiver’s calls. Random forest predicted receiver ID significantly better than
models with randomly permuted features both among calls that were identified
as convergent to the receiver’s calls (top left) (n = 95 calls, permutation test,
one-tailed P < 0.0001) and divergent from the receiver’s calls (bottom left)
(n = 141 calls, permutation test, one-tailed P < 0.0001). The red lines represent
classification accuracy of the original random forest model, and the black
histograms represent the distribution of classification accuracies of null models
with randomized acoustic features. Right: pairs of calls in which the receiver of
one call made the other call did not differ significantly in mean proximity score
from pairs of calls in which the receiver of one call did not make the other call
(n = 943 call pairs where receiver of one call made the other call, 1,553 pairs where
this was not the case, ANOVA on ranks, χ² = 3.7, d.f. 1, P = 0.056, partial η² = 0.001).
Boxplot centre lines, medians; box limits, 25th and 75th quantiles; whiskers, 1.5×
interquartile range.
Dierent callers
same receiver
Dierent callers
dierent receivers
Dierent caller pair type
Classification accuracy
Rank-transformed proximity score
0.01 0.02 0.03
Predicting receiver across
Fig. 3 | Mixed evidence that different callers use similar labels for the same
receiver. Left: pairs of calls with different callers and the same receiver were
significantly more similar (higher proximity score) than pairs of calls with
different callers and different receivers, indicating some convergence among
callers addressing the same receiver (n = 693 call pairs with same receiver,
7,522 pairs with different receivers, ANOVA on ranks, χ² = 10.7, d.f. 1, two-tailed
P = 0.001, partial η² = 0.004). Boxplot centre lines, medians; box limits, 25th and
75th quantiles; whiskers, 1.5× interquartile range. Right: classification accuracy
(red line) of random forest designed to predict receiver ID from acoustic features
independently of caller ID (all calls with the same caller and receiver allocated to
the same cross-validation fold) was not significantly different from classification
accuracies of models with randomized acoustic features (black histogram),
indicating that receiver ID could not be predicted independently of caller ID
(n = 437 calls, permutation test, one-tailed P = 0.896). The fact that elephant calls
contain multiple messages and are structurally highly complex may account for
the model’s poor generalization of receiver ID across callers.
Nature Ecoogy & Evoution
vocal labels account for only a small fraction of the total variation in
rumbles. This appears to contrast with human names, in which the vocal
label accounts for most of the acoustic variation in the signal, even
though information such as the identity, age, sex and emotional state
of the speaker is also encoded in the speaker’s voice characteristics.
Whereas human language conveys complex messages via sequential
encoding of information, elephants may rely more on simultaneous
encoding, packing more information into a single vocalization than
humans typically do.
The richness in the information content of elephant vocaliza-
tions makes it difficult to identify the specific acoustic parameters
that encode receiver ID, although the variable importance scores from
the random forest suggest possible candidate features (Supplementary
Discussion). Unlike dolphin and parrot signature calls, elephant
vocal labels cannot be discerned by visual inspection of the spectrogram
and are probably encoded by a complex and subtle interaction among
many acoustic parameters. As a result, we employed machine learning
in this analysis, but innovative approaches in signal processing may
be necessary to isolate the aspects of rumbles encoding vocal labels.
We found mixed support for the hypothesis that different callers
use the same label to address the same receiver. While the random
forest failed to predict receiver ID independently of caller ID, analysis
of proximity scores indicated at least some convergence among differ-
ent callers addressing the same receiver. It is possible that all callers
within a family group use the same label for the same receiver and the
poor performance of the random forest was due to limitations of our
data. The dense information content and high variability of rumbles
coupled with the small number of calls per receiver in our dataset may
have prevented the random forest from learning cues to receiver ID
that generalized across callers. Moreover, as the acoustic features we
extracted were based on the mel frequency scale, which was inspired by
human vocal tract models, it is possible that they provided peripheral
measures of the principal modes of label encoding. Acoustic features
more closely tailored to the properties of the elephant vocal tract might
result in a higher classification accuracy for receiver ID.
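For reference, one widely used definition of the mel scale maps a frequency f in Hz to 2595·log10(1 + f/700) mels. The sketch below implements that formula and its inverse. Whether the feature extraction in this study used this exact variant is not stated here, so treat the formula choice and the example frequencies as assumptions.

```python
import math

def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels using a common logarithmic formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Mel values for a few low frequencies (Hz), chosen only for illustration.
low_band = [hz_to_mel(f) for f in (10, 20, 50, 100)]
```

Because the constants in this formula were chosen to model human hearing, a scale tuned to a different species would change how finely the low-frequency range is resolved.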
Alternatively, it is possible that callers only partially share labels
for a given receiver. Such a system would greatly increase the number
of labels that elephants need to understand, although partial overlap
in the labels addressed to a given receiver could mitigate the difficulty
of this task. Nonetheless, partial convergence among labels might be
favoured if it is easier for receivers to learn to respond to multiple labels
than it is for callers to learn to produce the exact same label for a given
[Fig. 4 axes. Left: cumulative probability of approach versus seconds after playback. Centre: cumulative probability of call versus seconds after playback. Right: mean number of vocalizations for test and control playbacks.]
Fig. 4 | Response to playbacks of test stimuli (calls originally addressed to
the subject) versus control stimuli (calls from the same caller originally
addressed to a different individual). Left: subjects approached the speaker
more quickly (n = 17 individuals, Cox regression, χ² = 6.8, d.f. 1, two-tailed
P = 0.009, hazards ratio 8.77) in response to test playbacks than controls.
Centre: subjects vocalized more quickly in response to test playbacks than
controls (n = 17 individuals, Cox regression, χ² = 7.9, d.f. 1, two-tailed P = 0.005,
hazards ratio 7.45). Right: subjects produced more vocalizations in response to
test playbacks than controls (n = 17 individuals, Poisson generalized linear model,
χ² = 6.7, d.f. 1, two-tailed P = 0.009, rate ratio 2.41). The shaded areas in the left
and centre panels represent 95% confidence intervals around survival curves.
Boxplot centre line, median; box limits, 25th and 75th quantiles; whiskers, 1.5×
interquartile range; grey squares, location of outliers; black circles, all individual
data points. The median and the 25th quantile of the control box are both 0. No
corrections were done for multiple comparisons as the analyses presented in this
figure were three distinct models with different response variables.
Nature Ecoogy & Evoution
receiver. This seems possible, as modifying the structure of calls based
on auditory experience (vocal production learning) requires more spe-
cialized neural circuitry than modifying the context in which calls are
produced (usage learning). Spectacled parrotlets (Forpus conspicillatus)
and budgerigars (Melopsittacus undulatus) reportedly address
individual conspecifics with vocal labels that are not shared across callers,
although this could reflect imperfect imitation of the receiver’s
calls rather than discrete ‘nicknames’. Further work to identify how
vocal labels are encoded in elephant calls is necessary to determine to
what degree different callers use the same label for the same receiver.
Isolating the labels for individual elephants will allow investigation of
questions such as whether elephants understand the labels used by
third parties or even refer to third parties in their absence.
Both African and Asian elephants have a demonstrated capacity
for vocal mimicry in captivity, but no study has documented a function
of this ability in the wild. Depending on whether callers share labels
for the same receiver, vocal labelling in elephants could rely on either
vocal production learning or vocal innovation combined with usage
learning. However, given the evidence for partial convergence among
callers, it seems likely that production learning is involved. Dolphins
and parrots, which show evidence for individual vocal addressing
via imitation of the receiver, are adept vocal learners. Another vocal
learner, the Egyptian fruit bat (Rousettus aegyptiacus), produces calls
that are specific to individual receivers and may be vocal labels as well,
although it is currently unknown if the bats perceive this information.
Humans, dolphins, parrots, bats and elephants all form long-term
social bonds and live in groups with a high degree of fission–fusion
dynamics. A mechanism to direct communication to individual
conspecifics could be especially beneficial for animals that frequently
separate and rejoin with bonded social partners. This raises the possibil-
ity that social selection pressures creating a need to address individual
conspecifics may have led to multiple independent origins of vocal
production learning, a precursor for language.
The use of learned arbitrary labels is part of what gives human
language its uniquely broad range of expression. Our results suggesting
possible use of arbitrary vocal labels in elephants provide an
opportunity to investigate the selection pressures that may have led
to the evolution of this rare ability in two divergent lineages. Moreo-
ver, these findings raise intriguing questions about the complexity
of elephant social cognition, considering the potential relevance of
symbolic communication to their social decision-making.
Field recording
We collected audio recordings of wild female–calf groups in Amboseli
National Park, Kenya in 1986–1990 and 1997–2006 and Samburu and
Buffalo Springs National Reserves (hereafter, Samburu), Kenya in
November 2019 to March 2020 and June 2021 to April 2022. Both
populations have been continuously monitored for decades, and all
individuals can be identified by external ear morphology.
We recorded calls from a vehicle during daylight hours with
all-occurrence sampling using an Earthworks QTC1 microphone
(4 Hz to 40 kHz ± 1 dB) with a Nagra IV-SJ reel-to-reel tape recorder or
an HHB PDR 1000 DAT recorder in Amboseli, and an Earthworks QTC40
microphone (3 Hz to 40 kHz ± 1 dB) with a Sound Devices MixPre3 or
MixPre3-II digital recorder in Samburu. In Amboseli, recordings were made at
a 48 kHz sampling rate with 16 bits of amplitude resolution and stored
at 2 kHz; in Samburu, they were recorded and stored at 44.1 kHz with 24 or
32 bits of amplitude resolution.
When possible, we recorded for each call the identity of the caller,
the behavioural context and the identity of the receiver (criteria for
identifying receiver defined in the main text). The caller was identi-
fied using behavioural and contextual cues, such as an open mouth,
flapping ears or being the only individual of the right age class in the
immediate vicinity (calls made by young calves are audibly shorter
and higher pitched than adult calls). Behavioural observations were
recorded by a single observer at each field site (M.A.P. in Samburu,
J.H.P. in Amboseli). Since the observations at each field site were con-
ducted without accompanying video in most cases, there was no way
to calculate inter-observer reliability.
Scoring behavioural context
For this study, we only used rumbles produced in the contexts of ‘con-
tact calling’, ‘greeting’ and ‘caregiving’, as these are the contexts in
which vocal labelling seems most likely to be beneficial. We did not
include rumbles from other behavioural contexts as these typically
either involve multiple simultaneous receivers (for example, ‘let’s go’
rumbles) or occur in contexts where vocal labelling is less likely to be
necessary (for example, ‘begging’, ‘protest’, ‘oestrus’ and ‘musth’ rumbles).
Nonetheless, there was a great deal of variation in the precise
social context surrounding the production of each call and the age and
internal state of the callers. As elephant rumbles vary with behavioural
context, age and the emotional state of the caller, this contextual
heterogeneity of the recordings probably added substantial noise to
the data.
Table 2 | Results for type III analyses of deviance on playback experiment models
Variable (model) | Subject ID | (d.f. 1) | Relationship of caller to original receiver (d.f. 4) | Distance (d.f. 1) | dBC (d.f. 1) | Other adults (d.f. 1) | (d.f. 1) | exposure (d.f. 1)
Latency to approach (Cox) | 3.43 | RR 8.77 | | RR 0.79 | RR 1.38 | RR 3.13 | RR 4.62 | RR 0.88
Latency to vocalize (Cox) | 2.84 | RR 7.45 | | RR 0.87 | RR 0.96 | RR 3.25 | RR 2.02 | RR 0.91
Number of calls (Poisson) | | RR 2.41 | | RR 0.98 | RR 1.09 | RR 1.54 | RR 0.84 | RR 0.99
Latency to vigilance (Cox) | 0.02 | RR 2.07 | | RR 0.93 | RR 0.84 | RR 4.24 | RR 0.64 | RR 0.99
duration after–before (linear) | 9.95 | | | | | | |
Subject ID was included as a random effect in all models except the Poisson regression for number of calls, because it had a variance of 0 for this model. Values in the ‘Subject ID’ column
represent the square root of the variance explained by that random effect. Significant P values are in bold. Latency to vigilance exhibited a non-significant trend towards faster onset of vigilance
in response to test playbacks. In addition to the d.f., χ² statistic and two-tailed P value from the analysis of deviance, this table includes the hazard or rate ratios (RR) for the Cox and Poisson
models and the estimated slope parameters (β) for the linear model. Ratios and slopes are not shown for relationship of caller to original receiver, as this covariate had more than two levels.
Nature Ecology & Evolution
Following published methodology, we defined contact rumbles
as calls produced by or addressed to an individual who was separated
from the receiver by >~50 m and apparently attempting to reinitiate
contact. Our category of ‘greeting’ rumbles encompasses two different
categories distinguished by Poole: ‘little-greeting’ and ‘greeting’. Both
call types are produced when one individual approaches another in an
affiliative manner, but Poole’s ‘greeting rumbles’ are produced after a
greater period of separation than ‘little-greeting rumbles’, are more
likely to involve a face-to-face approach and typically involve greater
emotive behaviour such as temporal gland streaming and pirouetting
to stand in parallel. The context of ‘caregiving’ in our study is primarily
synonymous with ‘coo rumbles’ described by Poole, which are rumbles
produced by adult or adolescent females to a calf when gently touch-
ing or suckling the calf or in an apparent attempt to reassure a calf who
exhibited distress (for example, being pushed by another elephant,
being separated from its mother and so on). We also included in this
category two calls from adult females attempting to rouse a calf who
was sleeping when the group began to move off.
Scoring certainty of caller ID, behavioural context and
receiver ID
In Samburu, we recorded the certainty with which we knew caller ID,
behavioural context and receiver ID as 1 over the number of possible
options. For example, in cases where we thought the call was
plausibly addressed to a single individual but there were two possible
candidates for who the receiver was, we designated one of the two indi-
viduals as the putative receiver and assigned the certainty of receiver
ID a value of 0.5. In Amboseli, certainty of caller ID and behavioural
context were scored as ‘certain’, ‘fairly confident’, ‘educated guess’ or
‘no idea’. The certainty of receiver ID was not systematically recorded
in Amboseli, but sometimes the field notes specified that the receiver
ID was uncertain.
Call selection
For all analyses in this paper, we only used rumbles with the highest pos-
sible certainty for receiver ID (that is, certainty of 1 for Samburu calls,
no notes indicating uncertain receiver ID for Amboseli calls). We also
required rumbles to have the first two formants clearly visible in the
spectrogram with no significant overlap with other calls or loud sounds
in the same frequency range. This dataset consisted of 469 calls, 101
unique callers and 117 unique receivers, with 1–36 (median 2) calls per
caller, 1–40 (median 2) calls per receiver, 1–7 (median 2) receivers per
caller and 1–7 (median 1) callers per receiver (Supplementary Table 1).
There were 32 calls for which the receiver ID was certain but the
caller ID was not. We used these calls in the random forest model that
was used to generate the proximity score matrix and the conditional
inference forest used to calculate variable importance scores for pre-
dicting receiver ID, as caller ID was irrelevant to these models. However,
for all other analyses, including the linear mixed models with proximity
score as a response variable, we only used calls where the caller ID was
known for certain (certainty of 1 for Samburu, ‘certain’ for Amboseli).
For analyses that examined behavioural context (linear mixed
models, logistic regression), we required the certainty of behavioural
context to be 1 in Samburu or ‘certain’ in Amboseli. For analyses that did
not explicitly include behavioural context, we also included calls with
uncertain contexts as long as the only possible options were contact,
greeting or caregiving.
Call segmentation
In Amboseli, we wrote down the elapsed time on the recorder and
contextual information for each call heard in the field; in Samburu, we
recorded verbal annotations onto a second channel of the recorder in
real time using a Martel Stenomask, which isolated the sound of the
observer’s voice from the Earthworks microphone. We manually drew
a selection box around the spectrogram of each call in Raven Pro 1.5
(Cornell Lab of Ornithology, Ithaca, NY), with a buffer of approximately
1 s on either side of the call (Samburu (44.1 kHz sampling rate): Hann
window, 50% overlap, window 11,878 samples, Discrete Fourier Trans-
form 16,384 samples; Amboseli (2 kHz sampling rate): Hann window,
50% overlap, window 312 samples, Discrete Fourier Transform 512 sam-
ples). This automatically generated a selection table in .txt format with
the file name and start and end times of each selection box, to which
we added caller ID, receiver ID, behavioural context and the certainty
of each. We performed all further acoustic and statistical analyses in
R version 4.1.3 (ref. 39).
To determine the precise onset and offset of each call, we low-pass
filtered the calls (Butterworth filter, order 5, cut-off 490 Hz), downsam-
pled them to 2,000 Hz if not already at that sampling rate, applied a
high-pass filter (Butterworth filter, order 10, cut-off 30 Hz) and normal-
ized them to 70% of max amplitude and 16 bits of amplitude resolution
using the packages seewave and tuneR. We then used the function
segment() in the package soundgen to detect the onset and offset of
each call based on the amplitude envelope. We verified the automati-
cally detected start and end time for each call by visual inspection of
the amplitude envelope and spectrogram and manually adjusted the
times when necessary.
Acoustic measurements
We trimmed the original unfiltered sound clips to the automatically
detected start and end times, low-pass filtered the clips (Butterworth
filter, order 5, cut-off 800 Hz), downsampled them to 2,000 Hz if not
already at that sampling rate, applied a high-pass filter (Butterworth
filter, order 2, cut-off 4 Hz) and finally normalized them to 70% of the
max amplitude and 16 bits of amplitude resolution. For each call, we
measured the smoothed Hilbert amplitude envelope (moving average
window, window length 350 ms, overlap 90%) and two alternative sets
of features: normalized mel spectrogram and mel-frequency cepstral
coefficients (MFCCs).
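The windowing stated above (350 ms window, 90% overlap, 2,000 Hz signal) can be made concrete with a short Python sketch. This is an illustration rather than the R code used in the study: the function names are hypothetical, the envelope is assumed to have been computed already, and dropping the final partial window is an assumption.

```python
def frame_params(sample_rate=2000, window_s=0.350, overlap=0.90):
    """Convert a window length in seconds and an overlap fraction to samples."""
    window = round(window_s * sample_rate)   # samples per window
    hop = round(window * (1.0 - overlap))    # samples between window starts
    return window, hop


def moving_average(envelope, window, hop):
    """Mean of each full window; a trailing partial window is dropped (assumption)."""
    return [sum(envelope[s:s + window]) / window
            for s in range(0, len(envelope) - window + 1, hop)]
```

Under these settings each window spans 700 samples and successive windows start 70 samples apart.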
A mel spectrogram is similar to a traditional spectrogram (raster
plot with time on the x axis, frequency on the y axis, and amplitude indicated
by pixel darkness) but with frequency transformed to the logarithmic
mel scale. While the mel scale was designed to approximate
human hearing sensitivity, most other mammals, including elephants,
perceive frequency on a similar logarithmic scale. We calculated a mel
spectrogram for each call using the audspec() function of the tuneR
package (26 mel-frequency bands between 0 Hz and 500 Hz, 350 ms
Hamming window, 90% overlap). We then normalized the mel spectro-
gram by dividing the energy value in each cell of the spectrogram by
its column sum so that the energies would be a proportion of the total
energy in each time window, and logit-transformed these proportional
energies so the values would not be limited between 0 and 1. We also
calculated delta and delta–delta values for each mel spectral band,
with delta values being the differences between successive energy
values in the mel spectral band (that is, the change in energy over
time within a mel spectral band) and delta–delta values being the dif-
ferences between successive delta values (that is, the acceleration of
energy over time within a mel spectral band) (Extended Data Fig. 1). We
saved the vector of energies in each mel spectral band and their corre-
sponding delta and delta–delta values as acoustic contours for further
processing. While mel spectral bands have not previously been used as
acoustic features for analysing elephant calls, they describe more of the
variation in the call than commonly used features such as fundamental
frequency and formants, while remaining easily interpretable.
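The normalization and difference steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the spectrogram is assumed to be a list of bands each holding one energy per time window, and all energies are assumed to be strictly positive with at least two bands so that the logit is defined.

```python
import math


def normalize_and_logit(spec):
    """spec[band][frame] holds positive energies.

    Each value is divided by the sum of its time window (column) and the
    resulting proportion is logit-transformed.
    """
    n_bands, n_frames = len(spec), len(spec[0])
    out = [[0.0] * n_frames for _ in range(n_bands)]
    for t in range(n_frames):
        total = sum(spec[b][t] for b in range(n_bands))
        for b in range(n_bands):
            p = spec[b][t] / total                # proportion of window energy
            out[b][t] = math.log(p / (1.0 - p))   # logit
    return out


def delta(contour):
    """Differences between successive values of one contour."""
    return [b - a for a, b in zip(contour, contour[1:])]
```

Applying `delta` to a band's contour gives its delta values, and applying it again to the result gives the delta–delta values.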
We also calculated MFCCs for each call, which are less interpretable
than mel spectral bands but have been previously used successfully to
classify elephant vocalizations. MFCCs are calculated by applying
a discrete cosine transform to each time window of a mel spectro-
gram, with the coefficients of the discrete cosine transform being the
Nature Ecoogy & Evoution
cepstral coefficients. Each cepstral coefficient can be thought of as
representing the degree of modulation of the spectrum at a different
period, with lower numbered coefficients representing slower periods
of modulation. Since MFCCs are calculated for each time window of
the mel spectrogram, the output is a vector of values for each cepstral
coefficient. We calculated MFCCs using the melfcc() function in the
tuneR package, with a time window of 350 ms with 90% overlap, 40
mel-frequency bands between 0 Hz and 500 Hz, and a pre-emphasis
filter with a cut-off frequency of 10 Hz, and kept the first 12 coefficients
(12 vectors per call) for further processing. We also calculated delta
and delta–delta values for the first 12 cepstral coefficient contours.
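To illustrate how a cepstral contour arises from a mel spectrogram, the Python sketch below applies an unnormalized type-II discrete cosine transform to each time window and keeps the first 12 coefficients. It is a simplified illustration with hypothetical function names, and it assumes the input already holds log mel energies. The scaling and pre-emphasis applied by a particular library, including the one used in the study, may differ.

```python
import math


def dct2(frame, n_coef):
    """Unnormalized DCT-II of one time window, keeping the first n_coef terms."""
    n = len(frame)
    return [sum(frame[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n_coef)]


def cepstral_contours(log_mel, n_coef=12):
    """log_mel[band][frame] -> contours[coefficient][frame]."""
    n_frames = len(log_mel[0])
    frames = [[row[t] for row in log_mel] for t in range(n_frames)]
    coefs = [dct2(f, n_coef) for f in frames]
    return [[coefs[t][k] for t in range(n_frames)] for k in range(n_coef)]
```

A spectrally flat window puts all of its energy into coefficient 0, which is why higher coefficients are read as progressively faster modulations of the spectrum.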
Extraction of derived features from acoustic contours
We extracted derived acoustic features separately for the spectral
acoustic contours + amplitude envelope and the cepstral acoustic
contours + amplitude envelope. We rescaled each set of acoustic contours
by arranging them in a matrix with each contour in a separate row,
and then subtracting the column median from each value and dividing
the result by the column mean average deviation. We decorrelated
the contours with robust principal components analysis in the rpca
package in R, which separates the data into a low-rank matrix of robust
principal components without outliers, and a sparse matrix containing
the outlier values (λ = 0.00996). Robust principal component analysis
(PCA) has the advantage over standard PCA of being more resilient to
noisy data. We extracted four measurements from the sparse matrix to
use for statistical analysis: median, robust skewness and two measures
of spread: minimum extent and equivalent statistical extent. We also
calculated the means of the first n low-rank principal components
required to explain 99.9% of the variation (74 for spectral features, 12
for cepstral features).
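The rescaling step that precedes the robust PCA can be sketched in Python as below. This is an illustration only: the function name is hypothetical, and the deviation measure is assumed to be the mean absolute deviation about the column median, which is one reading of the description above. The robust PCA itself is not reproduced here.

```python
from statistics import median


def robust_rescale(matrix):
    """matrix[row][col], one contour per row.

    Each column is centred on its median and divided by its mean absolute
    deviation about that median (assumed definition).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        centre = median(col)
        spread = sum(abs(v - centre) for v in col) / n_rows
        for i in range(n_rows):
            out[i][j] = (matrix[i][j] - centre) / spread
    return out
```

Because both the centre and the spread are computed per column, columns that differ only by a scale factor map to identical rescaled values.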
We used multi-taper spectral estimation
to derive the frequency
spectra of the low-rank principal components that explained 99.9% of
the variation (treating each principal component as if it were a wave-
form) and calculated an F ratio for each point in each spectrum, test-
ing the null hypothesis that the spectral value in question could have
been derived from a random waveform. We calculated the mean of
the F ratios at each point across the aligned spectra and selected the
four largest peaks in the series of mean F ratios. We sorted these peaks
in order of increasing frequency and calculated the frequency and
magnitude of each peak.
We calculated the same metrics on spectra that were weighted
according to the proportion of variation that was explained by the
principal component from which the spectrum was derived. We mul-
tiplied the F ratios in each of the spectra by the proportion of variation
in the data explained by the principal component in question, summed
the weighted F ratios at each point in the aligned spectra and then
calculated the frequencies and magnitudes of the four largest peaks
in the summed F ratios, sorted in order of increasing frequency. The
final acoustic features used in our models are summarized in Extended
Data Table 1. We ran all subsequent statistical analyses separately for
the spectral and cepstral acoustic features.
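The peak selection described in the two preceding paragraphs can be sketched in Python as follows. The sketch assumes the F ratios have already been computed and aligned on a common frequency grid; the function names are hypothetical, and treating a peak as a simple local maximum is an assumption.

```python
def mean_series(spectra):
    """Point-wise mean across aligned spectra (one list per principal component)."""
    return [sum(col) / len(spectra) for col in zip(*spectra)]


def weighted_sum_series(spectra, weights):
    """Point-wise sum after weighting each spectrum by its share of variation."""
    return [sum(w * v for w, v in zip(weights, col)) for col in zip(*spectra)]


def largest_peaks(freqs, values, n_peaks=4):
    """(frequency, magnitude) of the n_peaks largest local maxima,
    returned in order of increasing frequency."""
    peaks = [i for i in range(1, len(values) - 1)
             if values[i - 1] < values[i] >= values[i + 1]]
    top = sorted(peaks, key=lambda i: values[i], reverse=True)[:n_peaks]
    return [(freqs[i], values[i]) for i in sorted(top)]
```

The same `largest_peaks` call serves both variants: it is applied once to the mean series and once to the weighted sum.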
Statistical analysis of acoustic data
Unless otherwise specified, all statistical tests were two-tailed and all
measurements were taken from distinct samples. The significance
level was set to 0.05 for all tests. We used partial η² as a measure of
effect size for linear models, calculated according to the formula
$\text{partial } \eta^2 = (\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}}) / \mathrm{SSE}_{\text{reduced}}$,
where $\mathrm{SSE}_{\text{full}}$ is the sum of the variances for all the
error terms (random effects and residual error) in the full model and
$\mathrm{SSE}_{\text{reduced}}$ is the sum of the variances for all the error terms in the same model
minus the fixed effect of interest. For all regression models, we calculated
P values for the fixed effects using type III analysis of deviance.
Are calls speciic to individual receivers (hypothesis 1)? We ran a sev-
enfold cross-validated random forest model in the R package ranger
to predict the identity of the receiver of each call (receiver ID) as a
function of the acoustic features (Table 1, hypothesis 1, prediction 1).
We stratified the cross-validation folds by caller ID and receiver ID to
ensure as even a distribution as possible of all caller–receiver dyads
across all folds. Thus, if calls contain acoustic cues to receiver ID, this
model was expected to predict receiver ID better than chance regard-
less of whether the label for a given receiver is shared across callers
(Table 1, hypothesis 1, prediction 1). We only used calls where caller ID
was known for certain (n = 437 calls). The model used 500 trees, 6 vari-
ables per node, 60% of observations per tree, a minimum node size of
1 and no maximum tree depth. To increase the stability of the model’s
classification accuracy, we ran the model 2,000 times and used the
mean classification accuracy across the 2,000 runs. To determine if
the model predicted receiver ID better than expected by chance, we
ran the model 10,000 times with the acoustic features randomly per-
muted and compared the classification accuracy of the original model
(averaged across 2,000 runs) with the null distribution of classification
accuracies generated by the 10,000 models with randomized acoustic
features (one-tailed permutation test).
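The logic of the one-tailed permutation test can be sketched in Python as below. This is not the authors' code: the function names are hypothetical, shuffling whole feature rows across calls is one possible way to permute the features, and the +1 correction in the P value is a common convention assumed here rather than a detail stated in the text.

```python
import random


def permute_features(features, rng):
    """Shuffle feature rows across calls, breaking any link to receiver ID."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    return shuffled


def permutation_p_value(observed, null_values):
    """One-tailed P: share of null accuracies at least as large as observed.

    The +1 terms count the observed value as a member of the null set
    (assumed convention).
    """
    n_ge = sum(1 for v in null_values if v >= observed)
    return (n_ge + 1) / (len(null_values) + 1)
```

In use, the classifier would be refitted on each permuted feature set to build `null_values`, and `observed` would be the mean accuracy of the unpermuted model.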
To disentangle the effects of caller ID and receiver ID on call struc-
ture, we compared the mean pairwise similarities between pairs of calls
with the same caller and receiver and pairs with the same caller and
different receivers (same caller pair type). As a metric of call similarity,
we extracted a proximity score for each pairwise combination of calls
from a random forest trained to predict receiver ID as a function of the
acoustic features on the full dataset (469 training observations, 8,000
trees, other hyperparameters same as above). The proximity score for
a given pair of calls was the proportion of trees in which both calls were
classified in the same terminal node, corrected for the size of each node,
and represented the degree of similarity between the two calls in terms
of the acoustic features most relevant to predicting receiver ID. If calls
are specific to individual receivers within a given caller, then pairs of
calls with the same caller and same receiver should be more similar
(have higher proximity scores) than pairs of calls with the same caller
and different receivers (Table 1, hypothesis 1, prediction 2).
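The core of a proximity score can be illustrated with the Python sketch below, which takes the terminal node reached by every call in every tree and returns the share of trees in which two calls coincide. The function name and input layout are hypothetical, and the correction for node size mentioned above is deliberately omitted because its form is not given here.

```python
def proximity(leaf_ids, i, j):
    """leaf_ids[tree][call] is the terminal node reached by each call.

    Returns the proportion of trees in which calls i and j end in the
    same terminal node (no node-size correction).
    """
    same = sum(1 for tree in leaf_ids if tree[i] == tree[j])
    return same / len(leaf_ids)
```

Computing this for every pair of calls yields a symmetric matrix with values between 0 and 1.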
Previous work has shown that elephants alter the structure of their
rumbles when interacting with more dominant conspecifics. To rule
out the possibility that calls were specific to the type of relationship
between caller and receiver rather than to individual receivers per se,
we restricted the analysis of same caller pair type to pairs of calls that
had the same type of relationship between caller and receiver. We
defined the caller–receiver relationship using 12 categories based on
sex, family group membership, relative age and mother–offspring
relationship, reflecting the fact that dominance in elephants is primarily
determined by age and that mother–calf bonds are the strongest
social bonds in elephants (Extended Data Table 3). As calls from
different behavioural contexts differ in acoustic structure, we categorized
each pair of calls according to whether the two calls had the
same or different behavioural contexts (‘same context’) and included
this variable as a factor in the analysis. We also included a binary factor
indicating whether the two calls were recorded on the same date, as
exploratory analyses indicated that calls recorded on the same date
were more similar than calls recorded on different dates. We only used
calls in this model for which the caller ID and behavioural context were
known for certain.
The proximity scores were highly skewed to the right, so
we rank-transformed them and ran a linear mixed model with
rank-transformed proximity score as the response variable and same
caller pair type, same context and same date as fixed effects. To account
for the fact that there were multiple call pairs with the same combina-
tion of callers and receivers, we included ‘pair ID’ (a unique identifier
for each caller–receiver–caller–receiver combination) as a random
effect. We excluded pair IDs with only one observation as it was not
possible to estimate within-class variability for these pair IDs (final
n = 1,284 call pairs).
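The two data-preparation steps in this paragraph, rank transformation and removal of pair IDs with a single observation, can be sketched in Python as follows. The function names and the dictionary layout are hypothetical, and giving tied scores their average rank is an assumption.

```python
from collections import Counter


def rank_transform(values):
    """Ranks starting at 1; tied values share their average rank (assumption)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1          # mean of positions start..end, 1-based
        for k in order[start:end + 1]:
            ranks[k] = avg
        start = end + 1
    return ranks


def drop_singleton_groups(rows, key):
    """Keep only rows whose value of `key` occurs more than once."""
    counts = Counter(r[key] for r in rows)
    return [r for r in rows if counts[r[key]] > 1]
```

The ranked scores would then serve as the response variable, with the retained pair IDs as the grouping factor for the random effect.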
Nature Ecoogy & Evoution
Which calls are most likely to contain vocal labels? Vocal labels might
be more likely to occur in certain behavioural contexts than others.
Similarly, callers may only use a vocal label in some of the calls within a
bout, as it would be redundant to include the same information in all the
calls. To assess whether behavioural context or position within a bout
influenced the likelihood of a call containing a vocal label, we calculated
the proportion of the 2,000 iterations of the random forest in which the
receiver ID was correctly predicted for each call (probability of correct
classification). We designated calls that were correctly predicted in
≥95% of iterations as ‘correct’ and calls that were correctly predicted in
≤5% of iterations as ‘incorrect’ and excluded all calls that did not meet
these criteria, as well as all calls with uncertain caller ID or behavioural
context, and receivers that occurred only once after applying the previ-
ous criteria (n = 327). Then, we ran a mixed-effects logistic regression
with prediction outcome (1 or 0) as the response, receiver ID as a random
effect, and behavioural context, caller age class, position within the bout
and the total number of calls addressed to the receiver in question as
fixed effects. The latter effect was included because receivers with more
calls in our dataset were expected to be predicted with greater accuracy,
as there were more training opportunities for the random forest to learn
them. Caller age class was defined as juvenile (<10 years old for females,
not yet dispersed from natal group for males) or adult (>10 years old for
females). There were no adult male callers in our dataset. We defined
a bout as calls produced by the same caller within the same sound file
with no more than 30 s between successive calls.
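The call labelling and bout definition above can be sketched in Python as follows. The function names and tuple layout are hypothetical, and measuring the 30 s gap from the end of one call to the start of the next is an assumption, since the text does not say which time points were used.

```python
def label_call(p_correct, hi=0.95, lo=0.05):
    """1 = 'correct', 0 = 'incorrect', None = excluded from the regression."""
    if p_correct >= hi:
        return 1
    if p_correct <= lo:
        return 0
    return None


def bout_positions(calls, max_gap=30.0):
    """calls: list of (caller, sound_file, start_s, end_s).

    Returns each call's 1-based position within its bout, where a bout is
    a run of calls by one caller in one file separated by gaps of at most
    max_gap seconds (end of one call to start of the next; assumption).
    """
    positions = [0] * len(calls)
    groups = {}
    for idx, (caller, sound_file, start, end) in enumerate(calls):
        groups.setdefault((caller, sound_file), []).append((start, end, idx))
    for members in groups.values():
        members.sort()
        prev_end, pos = None, 0
        for start, end, idx in members:
            in_bout = prev_end is not None and start - prev_end <= max_gap
            pos = pos + 1 if in_bout else 1
            positions[idx] = pos
            prev_end = end
    return positions
```

Position within the bout, as returned here, corresponds to one of the fixed effects in the logistic regression.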
Are vocal labels based on imitation of the receiver’s calls
(hypothesis 2)? To assess whether imitation of the receiver’s calls was
necessary for vocal labelling, we examined the calls in the dataset for
which we had at least one recording of the receiver’s calls and at least
one recording of the caller addressing someone other than the receiver
(n = 236 calls). For each of these calls, we calculated its mean proximity
score to all the calls made by the receiver (mean proximity to targeted
receiver). We also calculated the mean proximity score between the
same caller and receiver when the caller was addressing other individu-
als (mean proximity when targeting others). Calls in which the mean
proximity to targeted receiver was greater than the mean proximity
when targeting others were classified as ‘convergent’ (n = 95) and diver-
gent otherwise (n = 141). We then examined the proportion of conver-
gent and divergent calls that were classified correctly by the random
forest model with receiver ID and the acoustic features as input vari-
ables, and cross-validation folds stratified by caller ID and receiver ID.
If vocal labelling relies on imitation of the receiver’s calls, we expected
only the convergent calls to be classified correctly more often than by
the null model, but if imitation is not necessary for vocal labelling, we
expected both convergent and divergent calls to be classified correctly
more often than by the null model (Table 1, hypothesis 2, prediction 1).
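The convergent/divergent classification can be sketched in Python as below. This is an illustration with hypothetical names: each call is a dictionary with an id, a caller and a receiver, and `prox` is any function returning the proximity score of two call ids. As in the text, the call is assumed to have at least one recording of the receiver's calls and at least one call by the same caller to another individual.

```python
from statistics import mean


def classify_convergence(call, calls, prox):
    """Label one call 'convergent' or 'divergent' relative to its receiver."""
    # Calls produced by the receiver of the focal call.
    by_receiver = [c["id"] for c in calls if c["caller"] == call["receiver"]]
    # Calls by the same caller that were addressed to someone else.
    to_others = [c["id"] for c in calls
                 if c["caller"] == call["caller"]
                 and c["receiver"] != call["receiver"]]
    targeted = mean([prox(call["id"], r) for r in by_receiver])
    baseline = mean([prox(o, r) for o in to_others for r in by_receiver])
    return "convergent" if targeted > baseline else "divergent"
```

A call counts as convergent only when it is strictly closer to the receiver's calls than the caller's other calls are.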
If elephants imitate the calls of the receiver that they are addressing,
then callers should sound more like a given conspecific when they are
addressing her than when they are addressing someone else (Table 1,
hypothesis 2, prediction 2). To assess whether this was the case, we clas-
sified each pair of calls into one of two types (hereafter, ‘imitation pair
type’): pairs in which the receiver of one call was the caller of the other
call, and pairs in which this was not the case. We separately classified each
call pair according to whether the two calls had the same relationship
between caller and receiver (hereafter, ‘same relationship’). We also created
a categorical variable, caller dyad ID, which was an identifier for each
unique combination of callers that composed a call pair. We ran a linear
mixed model with rank-transformed proximity score as the response
variable, imitation pair type, same relationship, same context and same
date as fixed effects, and caller dyad ID and pair ID as random effects.
By including caller dyad ID as a random effect, we assessed the effect
of imitation pair type within a given pair of callers, that is, whether calls
from caller A to receiver B were more similar to receiver B’s calls than
calls from caller A addressed to other receivers were to receiver B’s calls.
We excluded pairs of calls with the same caller or receiver, uncertain caller
ID or behavioural context for either call, that were recorded from different
family groups, for which caller dyad ID did not occur with both levels of
imitation pair type, or for which pair ID occurred only once (n = 2,360 call
pairs). Pairs of calls from different family groups were excluded because
they comprised a small percentage of pairs where the receiver of one call
was the caller of the other, and because it is possible that different families
have different vocal signatures, which would influence call similarity.
Do different callers use the same label for the same receiver
(hypothesis 3)? If different callers use similar labels for the same
receiver, then pairs of calls with different callers and the same receiver
should be more similar than pairs of calls with different callers and dif-
ferent receivers (Table 1, hypothesis 3, prediction 1). To test whether this
was the case, we ran another linear mixed model with rank-transformed
proximity score as the response variable, different caller pair type (dif-
ferent callers/same receiver or different callers/different receivers),
same relationship and same context as fixed effects, and pair ID as a
random effect. As before, we excluded calls with uncertain caller ID
or behavioural context, pairs of calls recorded from different family
groups, and levels of pair ID that occurred only once (n = 8,215 call pairs).
To determine if receiver ID could be predicted independently of
caller ID, which would be possible only if callers use similar labels for
a given receiver (Table 1, hypothesis 3, prediction 2), we ran another
sevenfold cross-validated random forest model to predict receiver ID as
a function of the acoustic features but partitioned the cross-validation
folds such that all calls with the same caller and receiver were always
allocated to the same fold (observations and hyperparameters same as
first model). We averaged the classification accuracy of the model across
2,000 runs and compared this value with the distribution of classifica-
tion accuracies generated by 10,000 iterations of the same model with
the acoustic features randomly permuted (one-tailed permutation test).
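The fold partitioning used for this second model can be illustrated with the Python sketch below, which keeps all calls from one caller–receiver dyad in a single fold. The function name is hypothetical, and the greedy balancing rule (largest dyads first, each into the currently smallest fold) is an assumption; the text does not describe how fold sizes were balanced.

```python
from collections import defaultdict


def dyad_grouped_folds(calls, n_folds=7):
    """calls: list of (caller, receiver) tuples, one per call.

    Returns a fold index per call such that every call of a given
    caller-receiver dyad lands in the same fold.
    """
    groups = defaultdict(list)
    for idx, dyad in enumerate(calls):
        groups[dyad].append(idx)
    fold_of = [0] * len(calls)
    sizes = [0] * n_folds
    # Largest dyads first, each placed in the currently smallest fold.
    for dyad in sorted(groups, key=lambda d: (-len(groups[d]), d)):
        target = sizes.index(min(sizes))
        for idx in groups[dyad]:
            fold_of[idx] = target
        sizes[target] += len(groups[dyad])
    return fold_of
```

With this partitioning, a dyad in the held-out fold never appears in training, so a receiver can be predicted correctly only from calls that other callers addressed to it.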
Checking model assumptions. For all rank-transformed linear mixed
models, we checked the assumption of normality by visually examin-
ing histograms of the residuals. We checked the assumption of equal
variances by visually examining boxplots of all groups. The residuals
for all models exhibited only minor deviations from normality, with
the absolute values of skewness and excess kurtosis being less than 1
for all models. As linear models have been shown to be robust even to
severe deviations from normality with skewness as high as 2 and excess
kurtosis as high as 6 (a normal distribution has a skewness of 0 and
excess kurtosis of 0), we deemed the choice of model appropriate.
Boxplots indicated similar variances across groups.
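The skewness and excess kurtosis used in this check can be computed from model residuals as in the Python sketch below. The function name is hypothetical, and the use of simple moment-based (population) estimators is an assumption; statistical packages often apply small-sample corrections.

```python
def skewness_and_excess_kurtosis(residuals):
    """Moment-based skewness and excess kurtosis (both 0 for a normal distribution)."""
    n = len(residuals)
    m = sum(residuals) / n
    m2 = sum((r - m) ** 2 for r in residuals) / n
    m3 = sum((r - m) ** 3 for r in residuals) / n
    m4 = sum((r - m) ** 4 for r in residuals) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0
```

Residuals would pass the criterion stated above when the absolute values of both statistics are below 1.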
How are labels encoded in calls? To investigate which acoustic fea-
tures encode receiver ID and caller ID, we extracted variable importance
scores (Supplementary Table 2) from a conditional inference random
forest model in the R package ‘party’ trained on the full dataset to
predict the response variable in question (receiver ID or caller ID)
as a function of the acoustic features (469 training observations for
receiver ID, 437 for caller ID; 1,000 trees; all other hyperparameters
same as other random forests). We used a conditional inference forest
because, unlike traditional random forest, it is not biased towards correlated
variables. We only calculated variable importance scores for
the spectral features, as cepstral coefficients are difficult to interpret
intuitively. To assess the relative importance of the original acoustic
contours, we weighted the loadings of the acoustic contours on each
principal component by the variable importance score of the mean
of the principal component in question and then calculated the sum
of the absolute values of these weighted loadings for each acoustic
contour (Supplementary Table 3). Acoustic contours with a higher
sum of the absolute values of the weighted loadings were deemed
more important. This weighting process only considered the means
of low-rank principal components.
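The weighting of contour loadings by variable importance can be written compactly, as in the Python sketch below. The function name and the matrix layout (one row of loadings per acoustic contour, one column per principal component) are assumptions made for illustration.

```python
def contour_importance(loadings, pc_importance):
    """loadings[contour][pc]; pc_importance[pc] is the importance score of
    the mean of that principal component.

    Returns, per contour, the sum of absolute importance-weighted loadings.
    """
    return [sum(abs(load * weight) for load, weight in zip(row, pc_importance))
            for row in loadings]
```

Contours with larger returned values would be ranked as more important.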
Nature Ecoogy & Evoution
Playback experimental design
To determine if elephants respond more strongly to calls addressed to
them (Table 1, hypothesis 1, prediction 3), we played back rumbles with
known adult (>10-year-old) female callers and known receivers to 17
elephants (15 adult females, one 9-year-old female, one 9–10-year-old
male) in the Samburu study area. Fourteen subjects received one ‘test’
playback of a call that was originally addressed to them and one ‘control’
playback of a call from the same caller that was originally addressed to
another individual. One subject received two sets of test and control
playbacks from two different callers, one received only a test playback,
and one received only a control playback (Supplementary Table 4). Most
stimuli functioned as the test stimulus for one subject and the control
stimulus for another, but no stimulus was used as the same experimental
condition for more than one subject. The order of presentation was
balanced across subjects, and we waited at least 7 days (mean ± s.d.,
29.5 ± 27.1 days) between successive playbacks to the same subject.
Playback stimuli
Playback stimuli were recorded in Samburu and Buffalo Springs
between January 2020 and March 2022 from adult female callers. In
all but two cases, the playback stimuli were contact calls. In one case
we used a loud greeting call (similar in original amplitude to a typical
contact call but produced at a much closer distance), and in one case
we used a call that was produced in a similar context to contact calls
(caller and receiver >100 m apart and out of sight of each other) but was
lower in original amplitude than a typical contact call and was part of a
lengthy antiphonal exchange between two individuals and, therefore,
was probably a ‘cadenced rumble’. These non-contact calls were used
to complete a pair of test and control stimuli because we were unable
to obtain contact calls to two different receivers from the same caller.
Three playback stimuli were elicited by another playback, and we
assumed that the individual whose call was broadcast from the speaker
was the intended receiver of the call that was produced in response to
that playback. We identified the receiver of natural calls as the only
adult member of the family group who was separated from the caller
during the call or the only individual who responded to the call. In one
case, there were two adult females separated from the caller, and we
assumed the receiver was the older of the two females who was in the
lead and who rejoined the caller first. We note that there was no mecha-
nism to ensure the playback stimulus contained a vocal label, and it is
possible not all stimuli were labelled. We prepared all playback stimuli
in Audacity 3.0.2. Each stimulus consisted of a single rumble preceded
by 1 s of background noise with a fade-in and followed by 1 s of
background noise with a fade-out. In three cases, we applied a high-pass
(5 Hz cut-off, 6 dB roll-off) or low-pass filter (1,000 Hz cut-off, 6 dB
roll-off) to remove excessive noise.
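The stimulus layout described above (background noise with a fade-in, then the rumble, then background noise with a fade-out) might be sketched as follows. The stimuli were prepared in Audacity, so this is not the authors' procedure; the linear fade shape, the fade spanning the whole noise segment, the function name `assemble_stimulus` and the toy sample values are all assumptions made for illustration.

```python
def assemble_stimulus(noise_before, rumble, noise_after):
    """Concatenate faded-in noise, the rumble, and faded-out noise.

    Assumes linear fades that span each noise segment in full.
    """
    n_in, n_out = len(noise_before), len(noise_after)
    # Gain rises linearly from 0 across the leading noise segment.
    faded_in = [s * i / n_in for i, s in enumerate(noise_before)]
    # Gain falls linearly to 0 across the trailing noise segment.
    faded_out = [s * (n_out - 1 - i) / n_out for i, s in enumerate(noise_after)]
    return faded_in + list(rumble) + faded_out


# Hypothetical toy signal: 4 samples stand in for 1 s of background noise.
noise = [0.1, 0.1, 0.1, 0.1]
rumble = [0.5, -0.5, 0.5]
stimulus = assemble_stimulus(noise, rumble, noise)
```

The rumble itself is left untouched; only the surrounding noise is ramped, so the stimulus starts and ends at zero amplitude.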
Playback system and volume
We played back all stimuli as .wav files (uncompressed audio) from
an iPhone SE (Apple) attached to a QLXD1 wireless bodypack trans-
mitter (Shure) transmitting to a custom-built loudspeaker (Bag End
Loudspeakers). The cord connecting the playback device to the wire-
less transmitter had to be replaced three times over the course of the
experiment, each time changing the output level of the speaker. Thus,
depending on which cord was in use, we normalized the stimuli to −24,
−22.5 or −18 dB in Audacity 3.0.2 to ensure a functionally equivalent
normalization level across all trials.
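As a rough illustration of normalizing stimuli to a cord-specific level, the sketch below scales a signal so that its peak sits at a target level in dB relative to full scale. It assumes peak normalization; the function name `normalize_peak`, the cord labels and the assignment of levels to cords are hypothetical, since the text does not state which cord was paired with which level.

```python
def normalize_peak(samples, target_db):
    """Scale samples so the peak absolute amplitude equals target_db (dB re full scale)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # a silent signal cannot be normalized
    gain = 10 ** (target_db / 20) / peak
    return [s * gain for s in samples]


# Hypothetical mapping from cord in use to normalization level (dB).
CORD_TARGET_DB = {"cord_a": -24.0, "cord_b": -22.5, "cord_c": -18.0}

signal = [0.2, -0.8, 0.4]  # hypothetical samples in the range [-1, 1]
normalized = normalize_peak(signal, CORD_TARGET_DB["cord_a"])
new_peak = max(abs(s) for s in normalized)
```

Choosing a different file level for each cord offsets that cord's change in output level, so the sound pressure at the speaker stays comparable across trials.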
The speaker’s frequency response was flat from 10 Hz to 500 Hz up
to a given maximum output level (maximum output 89 dB sound pres-
sure level (SPL) at 10 Hz, 101 dB SPL at 20 Hz and 113 dB SPL at 40 Hz).
If the signal exceeded the maximum output at a given frequency, the
speaker automatically reduced the level of the frequencies in question
to avoid damage. Reported amplitudes for natural contact calls range
from 94 to 115 dB SPL (extrapolated value at 1 m from source). We
did not have access to an SPL meter with a flat frequency response
at low frequencies, but our playback stimuli ranged from 96.2 to
104.3 dBC (decibels with a C-weighting) at 1 m measured with a Protmex
PT6708 sound level meter (Protech International Group Co.) or 93.4
to 102.9 dB SPL at 1 m measured with the SoundMeter 10.5.8 iPhone
application (Faber Acoustical). Mean measured volume did not differ
between test and control stimuli (dBC: t-test, t = 0.03, P = 0.97; dB SPL:
t-test, t = 0.15, P = 0.88).
Playback trial protocol
We placed the speaker 40.2–59.0 m from the subject (mean 49.1 ± 4.2 m),
either on the ground in front of a tree or shrub and covered by cam-
ouflage netting or on the edge of the rear seat of a Toyota double cab
Landcruiser facing the door with all four doors and windows and both
roof hatches open. Rerecordings at 50 m revealed no obvious differ-
ence between sounds played with the speaker on the ground or inside
the vehicle. We conducted playbacks only when the original caller and
‘alternate receiver’ (the other subject receiving playbacks from the same
caller) were >180 m from and out of sight of the subject (>270 m from the
alternate receiver if she had not yet received all her playbacks). When
the original caller’s location was known (19/34 trials) the speaker was
placed in approximately the same direction relative to the subject as
the original caller. In the remaining trials, the caller could not be located
after searching a ~300 m radius around the subject. Trials were redone
after at least 7 days if the speaker malfunctioned, the subject moved her
head out of sight right before the playback started or we discovered after
the playback that the speaker was not in the correct location relative to
the subject and the original caller (Supplementary Table 4). During each
trial, we filmed the subject from inside the vehicle for at least 1 min before
the playback, then played the stimulus once and continued filming for at
least another 10 min. We also recorded audio with an Earthworks QTC40
microphone and Sound Devices MixPre3-II recorder. The observers
were blind to the playback condition (test or control) until all trials were
complete, and all videos and audio recordings were scored.
Statistical analysis of playback data
From the video and audio recordings of each playback trial, we meas-
ured the subject’s latency to approach the speaker, latency to vocalize,
number of calls produced within 10 min following the playback, latency
to vigilance and change in vigilance duration in the minute following
the playback compared with the minute preceding the playback. Laten-
cies were defined as the time from the start of the playback until the
behaviour of interest occurred and were censored when the subject
moved out of sight or at 10 min, whichever came first. Vigilance was
defined as lifting head above shoulder level, moving head from side
to side, holding ears away from body without flapping, or lifting trunk
while sniffing towards speaker. We ran a separate model for each
response variable with subject ID as a random effect and treatment
and the following covariates/factors as fixed effects: caller–original
receiver relationship (relationship between the caller and the original
receiver of the call; Extended Data Table 3), distance (distance in metres
between the speaker and the subject), dBC (amplitude of the playback
stimulus in dBC at 1 m), other adults (whether other adults were within
50 m of subject during playback), speaker location (whether speaker
was on ground or in vehicle) and cumulative playback exposure (cumu-
lative number of playbacks to which subject was exposed at distance
of 300 m or less, including trials that were redone and playbacks to
other subjects). We used Cox proportional hazards regression in the
coxme package for the latency variables, a generalized linear model
with a Poisson error distribution in the lme4 package for number of
calls, and a linear model for change in vigilance duration. We applied
analysis of deviance with type III sums of squares to each model to
calculate a two-tailed P value for each fixed effect. For the Poisson
regression modelling number of calls, the random effect of subject ID
had a variance of 0, resulting in a near singular fit, so we removed the
random effect from this model.
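The censoring rule for the latency variables can be illustrated with a short sketch that turns one trial into the (time, event observed) pair that a survival model such as Cox regression expects. The function name `censored_latency` and the example times are hypothetical; only the rule itself (censor when the subject moves out of sight or at 10 min, whichever comes first) comes from the text.

```python
def censored_latency(event_time, out_of_sight_time, max_time=600.0):
    """Return (time_s, event_observed) for one trial.

    event_time        -- seconds from playback start to the behaviour, or None
    out_of_sight_time -- seconds until the subject moved out of sight, or None
    max_time          -- observation cap in seconds (10 min)
    """
    if out_of_sight_time is None:
        censor_time = max_time
    else:
        censor_time = min(max_time, out_of_sight_time)
    if event_time is not None and event_time <= censor_time:
        return event_time, True
    # Behaviour not seen before observation ended: right-censored.
    return censor_time, False


# Hypothetical trials.
approached = censored_latency(42.0, None)         # behaviour seen at 42 s
never_responded = censored_latency(None, None)    # censored at 10 min
left_view_early = censored_latency(300.0, 120.0)  # censored at 120 s
```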
Nature Ecoogy & Evoution
For the Cox regression models, we checked the assumption of
proportional hazards with a Schoenfeld test, which tests the null
hypothesis that there is no relationship between the scaled Schoen-
feld residuals and time. This test was non-significant (P > 0.05) for all
models, indicating no violation of the proportional hazards assump-
tion. For the Poisson regression model, we checked for overdispersion
using the AER package in R. The dispersion parameter was estimated
to be 1.1, which did not differ significantly from the ideal value of 1
(P = 0.26), indicating that a Poisson distribution was appropriate. For
the linear regression model used to examine the change in vigilance
duration before versus after playbacks, visual inspection of the histo-
gram of the residuals indicated that the residuals were approximately
normally distributed. For treatment, distance, dBC, speaker location
and cumulative playback exposure, visual inspection of boxplots or
residual plots indicated approximate homoscedasticity. Relationship
of caller to original receiver and other adults were heteroscedastic.
However, regardless of whether these covariates were included, treat-
ment was not significant, so any potential issues with this model had
no bearing on the conclusions of our study.
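For readers unfamiliar with dispersion checks, the sketch below computes the Pearson dispersion statistic, one common way of judging whether count data fit a Poisson model (values near 1 suggest no overdispersion). It is not necessarily the estimator reported by the AER package, and the function name `pearson_dispersion` and the counts are hypothetical.

```python
def pearson_dispersion(counts, fitted_means, n_params):
    """Pearson chi-square statistic divided by the residual degrees of freedom."""
    residual_df = len(counts) - n_params
    chi_sq = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted_means))
    return chi_sq / residual_df


# Hypothetical data: an intercept-only Poisson model, where the fitted mean
# for every observation is simply the sample mean of the counts.
counts = [2, 0, 3, 1]
mean_count = sum(counts) / len(counts)
fitted = [mean_count] * len(counts)

dispersion = pearson_dispersion(counts, fitted, n_params=1)
```

For these toy counts the statistic is about 1.11, which happens to be close to the value of 1.1 reported above but is otherwise unrelated to the study's data.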
Reporting summary
Further information on research design is available in the Nature
Portfolio Reporting Summary linked to this article.
Data availability
Data are available at (ref. 60).
Code availability
Code is available at (ref. 61).
We thank the Oice of the President of Kenya, the Samburu, Isiolo
and Kajiado County governments, the Wildlife Research & Training
Institute of Kenya, and Kenya Wildlife Service for permission to
conduct ieldwork in Kenya. We thank Save The Elephants and the
Amboseli Trust for Elephants for logistical support in the ield,
J. M. Leshudukule, D. M. Letitiya and N. Njiraini for assistance with the
ieldwork, G. Pardo for blinding the playback stimuli and S. Pardo for
input on the statistical analyses. We thank J. Berger, W. Koenig and
A. Horn for comments on the manuscript. This project was funded
by a Postdoctoral Research Fellowship in Biology to M.A.P. from the
National Science Foundation (award no. 1907122) and grants to
J.H.P. and P.G. from the National Geographic Society, Care for the Wild,
and the Crystal Springs Foundation. Fieldwork was supported by Save
the Elephants.
Author contributions
M.A.P. conceived the study. M.A.P. and D.S.L. collected the data in
Samburu, and J.H.P. and P.G. collected the data in Amboseli. M.A.P.
and K.F. performed the statistical analysis, and M.A.P. created the
figures. M.A.P. drafted the manuscript, and K.F., J.H.P. and G.W. edited
it. C.M., I.D.-H. and G.W. provided resources and access to long-term
datasets, and G.W. supervised the study.
Competing interests
The authors declare no competing interests.
Additional information
Extended data is available for this paper at
Supplementary information The online version
contains supplementary material available at
Correspondence and requests for materials should be addressed to
Michael A. Pardo.
Peer review information Nature Ecology & Evolution thanks Kenna
Lehmann and the other, anonymous, reviewer(s) for their contribution
to the peer review of this work. Peer reviewer reports are available.
Reprints and permissions information is available at
Publisher’s note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds
exclusive rights to this article under a publishing agreement with
the author(s) or other rightsholder(s); author self-archiving of the
accepted manuscript version of this article is solely governed by the
terms of such publishing agreement and applicable law.
© The Author(s), under exclusive licence to Springer Nature Limited
Nature Ecoogy & Evoution
Extended Data Fig. 1 | Schematic illustrating how spectral acoustic features
were measured. First, a spectrogram was calculated by applying a Fast Fourier
Transform to the signal (Hamming window, 700 samples, 90% overlap). Then
a mel filter bank with 26 overlapping triangular filters between 0 and 500 Hz was
applied to each window of the spectrogram to produce a mel spectrogram. The
mel spectrogram was then normalized by dividing the energy value in each cell
by the total energy in that time window and these proportional energies were
logit-transformed so they would not be limited to between 0 and 1. As features for
the robust principal components analysis, we used the vector of energy in each of
the 26 mel frequency bands as well as the vectors of delta and delta-delta values
for each frequency band (representing the change and acceleration in energy
over time, respectively). In the spectrogram and mel spectrogram in this figure,
warmer colors indicate higher amplitudes (greater energy).
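The normalization, logit transform and delta steps in this caption might look roughly like the sketch below. It assumes strictly positive band energies and computes delta values as simple first differences between successive time windows, which the caption does not specify; the function names and the three-band toy frames (the study used 26 bands) are hypothetical.

```python
import math


def logit_proportions(frame):
    """Convert one time window of band energies to logit-transformed proportions."""
    total = sum(frame)
    proportions = [energy / total for energy in frame]
    # The logit maps proportions in (0, 1) onto the whole real line.
    return [math.log(p / (1 - p)) for p in proportions]


def delta(series):
    """First difference of a contour (change between successive windows)."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical mel spectrogram: three time windows, three frequency bands.
frames = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [2.0, 1.0, 1.0]]
features = [logit_proportions(frame) for frame in frames]

band0 = [window[0] for window in features]  # contour of the first band
band0_delta = delta(band0)                  # change in energy over time
band0_delta_delta = delta(band0_delta)      # acceleration in energy over time
```

Each differencing step shortens the contour by one window, so a real pipeline that needs contours of equal length would have to pad or trim them; that detail is left out here.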
Nature Ecoogy & Evoution
Extended Data Fig. 2 | Scatterplots illustrating the separation in 3D space
between calls from the same caller to different receivers. Axes are the first
three principal coordinates extracted from the proximity scores of a random
forest trained to predict receiver ID. Each plot represents a single caller, each
point is a single call, and receiver IDs are coded by both color and shape. This
figure only includes calls where caller ID was known for certain, where the call
was predicted correctly in at least 25% of random forest iterations, and where the
caller made at least two such calls each to at least two different receivers.
Nature Ecoogy & Evoution
Extended Data Fig. 3 | Scatterplot illustrating the clustering in 3D space
of calls from different callers to the same receiver. Axes are the first three
principal coordinates extracted from the proximity scores of a random forest
trained to predict receiver ID. Each shape represents a different receiver and each
color represents a different caller. This figure only includes calls where caller ID
was known for certain, where the call was predicted correctly in at least 25% of
random forest iterations, and where the receiver received at least one such call
each from at least two different callers.
Nature Ecoogy & Evoution
Extended Data Table 1 | Acoustic features used in the random forest models
All acoustic features were derived from either the sparse matrix or low-rank matrix of a robust principal components analysis performed on multiple acoustic contours of equal length that
were measured directly from the signal. For the spectral acoustic features, the acoustic contours were the Hilbert amplitude envelope, the vector of energies in each of the 26 bands of
a mel spectrogram, and the delta and delta-delta values of the mel spectral bands. For the cepstral acoustic features, the acoustic contours were the Hilbert amplitude envelope, the first 12
mel-frequency cepstral coefficients, and the delta and delta-delta values of the first 12 cepstral coefficients. The principal components analysis was performed on a matrix of all the contours
for each call stacked end-to-end.
Nature Ecoogy & Evoution
Extended Data Table 2 | Results of random forest models predicting receiver ID as a function of the acoustic features
All random forests had 500 trees, 6 variables per node, 60% of observations per tree, minimum node size = 1, no maximum tree depth, and 7-fold cross-validation. Classification accuracies
were averaged across 2,000 runs of the model to improve stability. To determine if the classification accuracy was higher than expected by chance, the model was run 10,000 times with
randomly permuted acoustic variables, and the original classification accuracy was compared to the distribution of classification accuracies for these 10,000 permuted models. P-values are
Nature Ecoogy & Evoution
Extended Data Table 3 | Deinitions of social relationship categories between caller and receiver
Categories were deined based on sex, age, and mother-offspring status, the most important factors inluencing dominance and bond strength within an elephant family group. Females were
deined as adults if ≥10 years old, and males were deined as adults if independent from their natal group. All non-adults under this deinition were classiied as juveniles. Six years was chosen
as the cutoff for different age classes because it is between 1-2x the average inter-birth interval, so a female ≥6 years older than another individual could have been that individual’s allomother.
Nature Ecoogy & Evoution
Extended Data Table 4 | Results for linear mixed model assessing whether calls are speciic to individual receivers or the
type of relationship between caller and receiver
Each observation was a pair of calls and the response variable was rank-transformed proximity score. Same Caller Pair Type = whether the two calls in a pair had the same caller and receiver
(reference level) or same caller and different receivers with the same type of relationship to the caller; Same Context = whether the two calls in a pair had the same behavioral context
(reference level = no); Same Date = whether the two calls in a pair were recorded on the same day (reference level = no); Pair ID = unique combination of callers and receivers (random effect).
Pairs of calls recorded from different groups and levels of Pair ID that only occurred once were excluded (n=1105 call pairs with same receiver, 179 with different receivers who had the same
type of relationship to the caller). P-values are two-tailed.
Nature Ecoogy & Evoution
Extended Data Table 5 | Results for mixed effects logistic regression modeling the probability of a call being correctly
Odds ratios, χ² statistics, degrees of freedom and two-tailed P-values are reported for fixed effects. Standard deviations (square root of the variance explained) are reported for the random effect. Odds
ratios for Context were calculated from the estimated marginal means. χ² statistics, degrees of freedom and two-tailed P-values were calculated from Type III Analysis of Deviance on the full
model. Receivers that only occurred once were excluded. The cepstral features model produced a warning message indicating convergence issues when Caller age class was included. Context: n=138
contact rumbles, 127 greeting rumbles, 62 caregiving rumbles. Caller age class: n=274 calls from adults, 53 calls from juveniles.
Nature Ecoogy & Evoution
Extended Data Table 6 | Results for linear mixed model assessing whether calls addressed to a receiver imitate the
receiver’s calls
Each observation was a pair of calls and the response variable was rank-transformed proximity score. Imitation Pair Type = whether the receiver of one call in a pair was the caller of the other
call (reference level = yes); Same Relationship = whether the callers of both calls in a pair had the same type of relationship to their respective receivers (reference level = no); Caller Dyad ID
= unique combination of callers (random effect). Same Context, Same Date, and Pair ID same as in Extended Data Table 4. Pairs of calls recorded from different groups, pairs with the same
caller or receiver, levels of Caller Dyad ID that only occurred with one level of Imitation Pair Type, and levels of Pair ID that only occurred once were excluded (n=943 call pairs where receiver
of one call was the caller of the other, 1553 where this was not the case). P-values are two-tailed.
Nature Ecoogy & Evoution
Extended Data Table 7 | Results for linear mixed model assessing whether different callers use similar labels for same
Each observation was a pair of calls and the response variable was rank-transformed proximity score. Different Caller Pair Type = whether the two calls in a pair had different callers and the
same receiver (reference level) or different callers and different receivers; Same Relationship, Same Context, Same Date, and Pair ID same as in Extended Data Tables 4 and 6. Pairs of calls
recorded from different groups and levels of Pair ID that only occurred once were excluded (n=693 call pairs with same receiver, 7522 with different receivers). P-values are two-tailed.
