Sapir–Whorf hypothesis (Linguistic Relativity Hypothesis)

Mia Belle Frothingham

Author, Researcher, Science Communicator

BA with minors in Psychology and Biology, MRes University of Edinburgh

Mia Belle Frothingham is a Harvard University graduate with a Bachelor of Arts in Sciences with minors in biology and psychology

Saul McLeod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul McLeod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.

Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

There are about seven thousand languages heard around the world – they all have different sounds, vocabularies, and structures. As you know, language plays a significant role in our lives.

But one intriguing question is – can it actually affect how we think?

It is widely thought that how one perceives the world is expressed in spoken words, and that those words correspond precisely to reality.

That is, perception and expression are understood to be synonymous, and it is assumed that speech is based on thoughts. On this view, what one says depends on how the world is encoded and decoded in the mind.

However, many believe the opposite.

That is, what one perceives depends on the spoken word. In other words, thought depends on language, not the other way around.

What Is The Sapir-Whorf Hypothesis?

Twentieth-century linguists Edward Sapir and Benjamin Lee Whorf are known for this very principle and its popularization. Their joint theory, known as the Sapir-Whorf Hypothesis or, more commonly, the Theory of Linguistic Relativity, holds great significance in all scopes of communication theories.

The Sapir-Whorf hypothesis states that the grammatical and verbal structure of a person’s language influences how they perceive the world. It emphasizes that language either determines or influences one’s thoughts.

The Sapir-Whorf hypothesis states that people experience the world based on the structure of their language, and that linguistic categories shape and limit cognitive processes. It proposes that differences in language affect thought, perception, and behavior, so speakers of different languages think and act differently.

For example, words carry different meanings across languages, and not every word has an exact one-to-one translation in another language.

Because of these small but crucial differences, using the wrong word within a particular language can have significant consequences.

The Sapir-Whorf hypothesis is sometimes called “linguistic relativity” or the “principle of linguistic relativity.” So while they have slightly different names, they refer to the same basic proposal about the relationship between language and thought.

How Language Influences Culture

Culture is defined by the values, norms, and beliefs of a society. Our culture can be considered a lens through which we experience the world and develop a shared meaning of what occurs around us.

The language that we create and use arises in response to cultural and societal needs. In other words, there is an apparent relationship between how we talk and how we perceive the world.

One crucial question that many intellectuals have asked is how our society’s language influences its culture.

Linguist and anthropologist Edward Sapir and his then-student Benjamin Whorf were interested in answering this question.

Together, they created the Sapir-Whorf hypothesis, which states that our thought processes predominantly determine how we look at the world.

Our language restricts our thought processes – our language shapes our reality. Simply, the language that we use shapes the way we think and how we see the world.

Since the Sapir-Whorf hypothesis theorizes that our language use shapes our perspective of the world, people who speak different languages have different views of the world.

In the 1930s, Benjamin Whorf was a Yale University graduate student studying with linguist Edward Sapir, who was considered the father of American linguistic anthropology.

Sapir was responsible for documenting and recording the cultures and languages of many Native American tribes, which were disappearing at an alarming rate. He and his predecessors were well aware of the close relationship between language and culture.

Anthropologists like Sapir need to learn the language of the culture they are studying to truly understand the worldview of its speakers. Whorf believed that the opposite is also true, that language affects culture by influencing how its speakers think.

His hypothesis proposed that the words and structures of a language influence how its speaker behaves and feels about the world and, ultimately, the culture itself.

Simply put, Whorf believed that you see the world differently from another person who speaks another language due to the specific language you speak.

Human beings do not live in the matter-of-fact world alone, nor alone in the world of social action as traditionally understood, but are very much at the mercy of the particular language which has become the medium of communication and expression for their society.

To a large extent, the real world is unconsciously built on the language habits of the group. We hear and see and otherwise experience broadly as we do because the language habits of our community predispose certain choices of interpretation.

Studies & Examples

The lexicon, or vocabulary, is the inventory of the items a culture talks about and has classified in order to understand the world around it and deal with it effectively.

For example, modern life is dictated for many by the need to travel by some vehicle: cars, buses, trucks, SUVs, trains, etc. We therefore have thousands of words for talking about them, including vehicle types, models, parts, and brands.

The most influential aspects of each culture are similarly reflected in the dictionary of its language. Among the societies living on the islands in the Pacific, fish have significant economic and cultural importance.

Therefore, this is reflected in the rich vocabulary that describes all aspects of the fish and the environments that islanders depend on for survival.

For example, there are over 1,000 fish species in Palau, and Palauan fishers knew, even long before biologists existed, details about the anatomy, behavior, growth patterns, and habitat of most of them – far more than modern biologists know today.

Whorf’s studies at Yale involved working with many Native American languages, including Hopi. He discovered that the Hopi language is quite different from English in many ways, especially regarding time.

Western cultures and languages view time as a flowing river that carries us continuously through the present, away from the past, and to the future.

Our grammar and system of verbs reflect this concept with particular tenses for past, present, and future.

We perceive this concept of time as universal in that all humans see it in the same way.

A speaker of Hopi, however, has very different ideas, and the structure of the language both reflects and shapes the way they think about time. According to Whorf, the Hopi language has no present, past, or future tense; instead, it divides the world into manifested and unmanifest domains.

The manifested domain consists of the physical universe, including the present and the immediate past and future; the unmanifest domain consists of the remote past and the future, as well as the world of dreams, thoughts, desires, and life forces.

Also, there are no words for minutes, hours, or days of the week. Native Hopi speakers often had great difficulty adapting to life in the English-speaking world when it came to being on time for their job or other affairs.

This is due to the simple fact that this was not how they had been conditioned to behave concerning time in their Hopi world, which followed the phases of the moon and the movements of the sun.

Today, it is widely believed that some aspects of perception are affected by language.

One big problem with the original Sapir-Whorf hypothesis derives from the idea that if a person’s language has no word for a specific concept, then that person would not understand that concept.

In fact, the idea that a mother tongue can restrict one’s understanding has been largely rejected. For example, German has a term, Schadenfreude, that means taking pleasure in another person’s unhappiness.

While there is no translatable equivalent in English, it just would not be accurate to say that English speakers have never experienced or would not be able to comprehend this emotion.

Just because there is no word for this in the English language does not mean English speakers are less equipped to feel or experience the meaning of the word.

There is also a “chicken and egg” problem with the theory.

Of course, languages are human creations, very much tools we invented and honed to suit our needs. Merely showing that speakers of diverse languages think differently does not tell us whether it is the language that shapes belief or the other way around.

Supporting Evidence

On the other hand, there is hard evidence that the language-associated habits we acquire play a role in how we view the world. And indeed, this is especially true for languages that attach genders to inanimate objects.

One study looked at how German and Spanish speakers describe objects based on the grammatical gender assigned to them in each language.

The results demonstrated that, in describing objects that are grammatically masculine in Spanish, Spanish speakers marked them as having more male characteristics, like “strong” and “long.” Similarly, these same objects, which are grammatically feminine in German, were described by German speakers with more feminine characteristics, like “beautiful” and “elegant.”

The findings imply that speakers of each language have developed preconceived notions of something being feminine or masculine, not due to the objects’ characteristics or appearances but because of how they are categorized in their native language.

It is important to remember that the Theory of Linguistic Relativity (Sapir-Whorf Hypothesis) is also open-ended. The theory is presented as a window through which we view the cognitive process, not as an absolute.

It is set forth to look at a phenomenon differently than one usually would. Furthermore, the Sapir-Whorf Hypothesis is very simple and logically sound. Understandably, one’s environment and culture will affect how messages are decoded.

Likewise, in studies done by the authors of the theory, many Native American tribes do not have a word for particular things because they do not exist in their lives. The logical simplicity of this idea of relativity makes it parsimonious.

The Sapir-Whorf Hypothesis makes intuitive sense. It can be used to explain numerous misunderstandings in everyday life. When a Pennsylvanian says “yuns,” it may not make any sense to a Californian, but when examined, it is just another word for “you all.”

The Linguistic Relativity Theory addresses this and suggests that it is all relative. This concept of relativity extends beyond dialect boundaries and into the world of language, from country to country and, consequently, from mind to mind.

Does language arise from thought, or does thought arise from language? The Sapir-Whorf Hypothesis very transparently presents a view of reality being expressed in language and thus forming in thought.

The principles set out in it show a reasonable and even simple idea of how one perceives the world, but the question is still arguable: thought then language or language then thought?

Modern Relevance

Regardless of its age, the Sapir-Whorf hypothesis, or the Linguistic Relativity Theory, has continued to force itself into linguistic conversations and even into pop culture.

The idea was recently revisited in the movie “Arrival,” a science fiction film that engagingly explores the ways in which an alien language can affect and alter human thinking.

And even if some of the most drastic claims of the theory have been debunked or argued against, the idea has continued its relevance, and that does say something about its importance.

Hypotheses, thoughts, and intellectual musings do not need to be totally accurate to remain in the public eye as long as they make us think and question the world – and the Sapir-Whorf Hypothesis does precisely that.

The theory does not only make us question linguistic theory and our own language but also our very existence and how our perceptions might shape what exists in this world.

There are generalities that we can expect every person to encounter in their day-to-day life – in relationships, love, work, sadness, and so on. But thinking about the more granular disparities experienced by those in diverse circumstances, linguistic or otherwise, helps us realize that there is more to the story than ours.

And beautifully, at the same time, the Sapir-Whorf Hypothesis reiterates the fact that we are more alike than we are different, regardless of the language we speak.

Isn’t it amazing that linguistic diversity reveals how ingenious and flexible the human mind is? Human minds have invented not one cognitive universe but, indeed, seven thousand!

Kay, P., & Kempton, W. (1984). What is the Sapir‐Whorf hypothesis?. American anthropologist, 86(1), 65-79.

Whorf, B. L. (1952). Language, mind, and reality. ETC: A review of general semantics, 167-188.

Whorf, B. L. (1997). The relation of habitual thought and behavior to language. In Sociolinguistics (pp. 443-463). Palgrave, London.

Whorf, B. L. (2012). Language, thought, and reality: Selected writings of Benjamin Lee Whorf. MIT press.

The Sapir-Whorf Hypothesis: How Language Influences How We Express Ourselves

What to Know About the Sapir-Whorf Hypothesis

The Sapir-Whorf Hypothesis, also known as linguistic relativity, refers to the idea that the language a person speaks can influence their worldview, thought, and even how they experience and understand the world.

While more extreme versions of the hypothesis have largely been discredited, a growing body of research has demonstrated that language can meaningfully shape how we understand the world around us and even ourselves.

Keep reading to learn more about linguistic relativity, including some real-world examples of how it shapes thoughts, emotions, and behavior.  

The hypothesis is named after anthropologist and linguist Edward Sapir and his student, Benjamin Lee Whorf. While the hypothesis is named after them both, the two never actually formally co-authored a coherent hypothesis together.

This Hypothesis Aims to Figure Out How Language and Culture Are Connected

Sapir was interested in charting the difference in language and cultural worldviews, including how language and culture influence each other. Whorf took this work on how language and culture shape each other a step further to explore how different languages might shape thought and behavior.

Since then, the concept has evolved into multiple variations, some more credible than others.

Linguistic Determinism Is an Extreme Version of the Hypothesis

Linguistic determinism, for example, is a more extreme version suggesting that a person’s perception and thought are limited to the language they speak. An early example of linguistic determinism comes from Whorf himself, who argued that the Hopi people in Arizona don’t conjugate verbs into past, present, and future tenses as English speakers do and that their words for units of time (like “day” or “hour”) were verbs rather than nouns.

From this, he concluded that the Hopi don’t view time as a physical object that can be counted out in minutes and hours the way English speakers do. Instead, Whorf argued, the Hopi view time as a formless process.

This was then taken by others to mean that the Hopi don’t have any concept of time—an extreme view that has since been repeatedly disproven.

There is some evidence for a more nuanced version of linguistic relativity, which suggests that the structure and vocabulary of the language you speak can influence how you understand the world around you. To understand this better, it helps to look at real-world examples of the effects language can have on thought and behavior.

Different Languages Express Colors Differently

Color is one of the most common examples of linguistic relativity. Most known languages have somewhere between two and twelve color terms, and the way colors are categorized varies widely. In English, for example, there are distinct categories for blue and green.

But in Korean, there is one word that encompasses both. This doesn’t mean Korean speakers can’t see blue; it just means blue is understood as a variant of green rather than a distinct color category all its own.

In Russian, meanwhile, the colors that English speakers would lump under the umbrella term of “blue” are further subdivided into two distinct color categories, “siniy” and “goluboy.” They roughly correspond to dark blue and light blue in English. But to Russian speakers, they are as distinct as orange and brown.

In one study comparing English and Russian speakers, participants were shown a color square and then asked to choose which of the two color squares below it was the closest in shade to the first square.

The test specifically focused on varying shades of blue ranging from “siniy” to “goluboy.” Russian speakers were not only faster at selecting the matching color square but were more accurate in their selections.

The Way Location Is Expressed Varies Across Languages

This same variation occurs in other areas of language. For example, in Guugu Yimithirr, a language spoken by Aboriginal Australians, spatial orientation is always described in absolute terms of cardinal directions. While an English speaker would say the laptop is “in front of” you, a Guugu Yimithirr speaker would say it was north, south, west, or east of you.

As a result, Guugu Yimithirr speakers have to be constantly attuned to cardinal directions because their language requires it (just as Russian speakers develop a more instinctive ability to discern between shades of what English speakers call blue because their language requires it).

So when you ask a Guugu Yimithirr speaker to tell you which way south is, they can point in the right direction without a moment’s hesitation. Meanwhile, most English speakers would struggle to accurately identify south without the help of a compass or taking a moment to recall grade school lessons about how to find it.

The concept of these cardinal directions exists in English, but English speakers aren’t required to think about or use them on a daily basis so it’s not as intuitive or ingrained in how they orient themselves in space.

Just as with other aspects of thought and perception, the vocabulary and grammatical structure we have for thinking about or talking about what we feel doesn’t create our feelings, but it does shape how we understand them and, to an extent, how we experience them.

Words Help Us Put a Name to Our Emotions

For example, the ability to detect displeasure from a person’s face is universal. But in a language that has the words “angry” and “sad,” you can further distinguish what kind of displeasure you observe in their facial expression. This doesn’t mean humans never experienced anger or sadness before words for them emerged. But they may have struggled to understand or explain the subtle differences between different dimensions of displeasure.

In one study of English speakers, toddlers were shown a picture of a person with an angry facial expression. Then, they were given a set of pictures of people displaying different expressions including happy, sad, surprised, scared, disgusted, or angry. Researchers asked them to put all the pictures that matched the first angry face picture into a box.

The two-year-olds in the experiment tended to place all faces except happy faces into the box. But four-year-olds were more selective, often leaving out sad or fearful faces as well as happy faces. This suggests that as our vocabulary for talking about emotions expands, so does our ability to understand and distinguish those emotions.

But some research suggests the influence is not limited to just developing a wider vocabulary for categorizing emotions. Language may “also help constitute emotion by cohering sensations into specific perceptions of ‘anger,’ ‘disgust,’ ‘fear,’ etc.,” said Dr. Harold Hong, a board-certified psychiatrist at New Waters Recovery in North Carolina.

Words for emotions, like words for colors, are an attempt to categorize a spectrum of sensations into a handful of distinct categories. And, like color, there’s no objective or hard rule on where the boundaries between emotions should be, which can lead to variation across languages in how emotions are categorized.

Emotions Are Categorized Differently in Different Languages

Just as different languages categorize color a little differently, researchers have also found differences in how emotions are categorized. In German, for example, there’s an emotion called “gemütlichkeit.”

While it’s usually translated as “cozy” or “friendly” in English, there really isn’t a direct translation. It refers to a particular kind of peace and sense of belonging that a person feels when surrounded by the people they love or feel connected to in a place they feel comfortable and free to be who they are.

You may have felt gemütlichkeit when staying up with your friends to joke and play games at a sleepover. You may feel it when you visit home for the holidays and spend your time eating, laughing, and reminiscing with your family in the house you grew up in.

In Japanese, the word “amae” is just as difficult to translate into English. Usually, it’s translated as "spoiled child" or "presumed indulgence," as in making a request and assuming it will be indulged. But both of those have strong negative connotations in English and amae is a positive emotion.

Instead of being spoiled or coddled, it’s referring to that particular kind of trust and assurance that comes with being nurtured by someone and knowing that you can ask for what you want without worrying whether the other person might feel resentful or burdened by your request.

You might have felt amae when your car broke down and you immediately called your mom to pick you up, without having to worry for even a second whether or not she would drop everything to help you.

Regardless of which languages you speak, though, you’re capable of feeling both of these emotions. “The lack of a word for an emotion in a language does not mean that its speakers don't experience that emotion,” Dr. Hong explained.

What This Means For You

“While having the words to describe emotions can help us better understand and regulate them, it is possible to experience and express those emotions without specific labels for them.” Without the words for these feelings, you can still feel them but you just might not be able to identify them as readily or clearly as someone who does have those words. 

Rhee S. Lexicalization patterns in color naming in Korean . In: Raffaelli I, Katunar D, Kerovec B, eds. Studies in Functional and Structural Linguistics. Vol 78. John Benjamins Publishing Company; 2019:109-128. Doi:10.1075/sfsl.78.06rhe

Winawer J, Witthoft N, Frank MC, Wu L, Wade AR, Boroditsky L. Russian blues reveal effects of language on color discrimination . Proc Natl Acad Sci USA. 2007;104(19):7780-7785.  10.1073/pnas.0701644104

Lindquist KA, MacCormack JK, Shablack H. The role of language in emotion: predictions from psychological constructionism . Front Psychol. 2015;6. Doi:10.3389/fpsyg.2015.00444

By Rachael Green Rachael is a New York-based writer and freelance writer for Verywell Mind, where she leverages her decades of personal experience with and research on mental illness—particularly ADHD and depression—to help readers better understand how their mind works and how to manage their mental health.

Sapir-Whorf Hypothesis: Examples, Definition, Criticisms

Gregory Paul C. (MA)

Gregory Paul C. is a licensed social studies educator and has been teaching the social sciences in some capacity for 13 years. He currently works at a university in an international liberal arts department, teaching cross-cultural studies in the Chuugoku Region of Japan. Additionally, he manages semester study abroad programs for Japanese students and prepares them for the challenges they may face living in various countries short term.

Chris Drew (PhD)

This article was peer-reviewed and edited by Chris Drew (PhD). The review process on Helpful Professor involves having a PhD level expert fact check, edit, and contribute to articles. Reviewers ensure all content reflects expert academic consensus and is backed up with reference to academic studies. Dr. Drew has published over 20 academic articles in scholarly journals. He is the former editor of the Journal of Learning Development in Higher Education and holds a PhD in Education from ACU.

Developed in 1929 by Edward Sapir, the Sapir-Whorf hypothesis (also known as linguistic relativity) states that a person’s perception of the world around them and how they experience the world is both determined and influenced by the language that they speak.

The theory proposes that differences in grammatical and verbal structures, and the nuanced distinctions in the meanings that are assigned to words, create a unique reality for the speaker. We also call this idea the linguistic determinism theory.

Sapir-Whorf Hypothesis Definition and Overview

Cibelli et al. (2016) reiterate the tenets of the hypothesis by stating:

“…our thoughts are shaped by our native language, and that speakers of different languages therefore think differently”(para. 1).

Kay & Kempton (1984) explain it a bit more succinctly. They explain that the hypothesis itself is based on the:

“…evolutionary view prevalent in 19th century anthropology based in both linguistic relativity and determinism” (pp. 66, 79).

Edward Sapir, an American linguist and anthropologist, taught at Yale University, where Benjamin Whorf studied with him in the 1930s.

Sapir & Whorf began to consider lexical and grammatical patterns and how these factored into the construction of different cultures’ views of the world around them.

For example, they compared how thoughts and behavior differed between English speakers and Hopi language speakers in regard to the concept of time, arguing that in the Hopi language, the absence of the future tense has significant relevance (Kay & Kempton, 1984, p. 78-79).

Whorf (2021), in his own words, asserts:

“Every language is a vast pattern-system, different from others, in which are culturally ordained the forms and categories by which the personality not only communicates, but also analyzes nature, notices or neglects types of relationship and phenomena, channels his reasoning, and builds the house of his consciousness” (p. 252).

10 Sapir-Whorf Hypothesis Examples

  • Constructions of food in language: A language may ascribe many words to explain the same concept, item, or food type. This shows that they perceive it as extremely important in their society, in comparison to a culture whose language only has one word for that same concept, item, or food.
  • Descriptions of color in language: Different cultures may visually perceive colors in different ways according to how the colors are described by the words in their language.
  • Constructions of gender in language: Many languages are “gendered”, creating word associations that pertain to the roles of men or women in society.
  • Perceptions of time in language: How the tenses are structured in a language may dictate how the people who speak that language perceive the concept of time.
  • Categorization in language: The ways concepts and items in a given culture are categorized (and what words are assigned to them) can affect the speaker’s perception of the world around them.
  • Politeness is encoded in language: Levels of politeness in a language and the pronoun combinations to express these levels differ between languages. How a language expresses politeness with words can dictate how its speakers perceive the world around them.
  • Indigenous words for snow: A popular example used to justify this hypothesis is the Inuit people, who have a multitude of ways to express the word snow. If you follow the reasoning of Sapir, it would suggest that the Inuit have a profoundly deeper understanding of snow than other cultures.
  • Use of idioms in language: An expression or well-known saying in one culture has an acute meaning implicitly understood by those that speak the particular language but is not understandable when expressed in another language.
  • Values are ingrained in language: Each country and culture has beliefs and values as a direct result of the language it uses.
  • Slang in language: The slang used by younger people evolves from generation to generation in all languages. Generational slang carries with it perceptions and ideas about the world that members of that generation share.

Two Ways Language Shapes Perception

1. Perception of Categories and Categorization

How concepts and items in a culture are categorized (and what words are assigned to them) can affect the speaker’s perception of the world around them.

Although the examples of this phenomenon are too numerous to cite, a clear example is the extremely contextual, nuanced, and hyper-categorized Japanese language.

In the English language, the concept of “you” and “I” is narrowed to these two forms. However, Japanese has numerous ways to express you and I, each having various levels of politeness and appropriateness in relation to age, gender, and stature in society.

Although pronouns are often left out of everyday conversation and inferred from context, misuse of a pronoun, or its omission where one is expected, can be perceived as rude or ill-mannered.

In other ways, the complexity of the categorical lexicons can often leave English speakers puzzled. This could take the form of classifications of differently shaped bowls and plates that serve different functions; traces of the ancient Japanese calendar from the 7th century, which divided the year into 72 micro-seasons; or any number of subdivided word sets that another language would cover with a single blanket term.

Masuda et al. (2017) give a clear example:

“People conceptualize objects along the lines drawn between existing categories in their native language. That is, if two concepts fall into the same linguistic category, the perception of similarity between these objects would be stronger than if the two concepts fall into different linguistic categories.”

They then go on to give the example of how Japanese vs English speakers might categorize an everyday object – the bell:

“For example, in Japanese, the kind of bell found in a bell tower generally corresponds to the word kane—a large bell—which is categorically different from a small bell, suzu. However, in English, these two objects are considered to belong within the same linguistic category, “bell.” Therefore, we might expect English speakers to perceive these two objects as being more similar than would Japanese speakers” (para. 5).

2. Perception of the Concept of Time

The way tenses are structured in a language may dictate how the people who speak that language perceive the concept of time.

One of Whorf’s most famous applications of the theory is to the language of the Hopi, a Native American tribe in Arizona.

He claimed, in an analysis that linguistic scholars have since vehemently disputed, that the Hopi have no general notion of time: that they cannot distinguish between past, present, and future because of the grammatical structures of their language.

As Engle (2016) asserts, Whorf believed that the Hopi language “encodes on ordinal value, rather than a passage of time”.

He concluded that “a day followed by a night is not so much a new day, but a return to daylight” (p. 96).

However, it is not only Hopi culture that has a different perception of time embedded in its language; Thai culture has a non-linear concept of time, and the Malagasy people of Madagascar believe that time is in motion around human beings, not that human beings are passing through time (Engle, 2016, p. 99).

Criticism of Sapir-Whorf Hypothesis

1. Language as context-dependent

Iwamoto (2005) expresses that the Sapir-Whorf hypothesis fails to recognize that language is used within context. Its purely decontextualized textual analysis of language is too one-dimensional and doesn’t consider how we actually use language:

“Whorf’s “neat and simplistic” linguistic relativism presupposes the idea that an entire language or entire societies or cultures are categorizable or typable in a straightforward, discrete, and total manner, ignoring other variables such as contextual and semantic factors.” (Iwamoto, 2005, p. 95)

2. Not universally applicable

Another criticism is that Sapir and Whorf’s hypothesis cannot be transferred or applied to all languages.

It is difficult to find empirical studies confirming that speakers of different languages do not perceive concepts in similar ways, even when their languages lack a comparable word or expression for a given concept.

3. Thoughts can be independent of language

Steven Pinker, one of Sapir and Whorf’s most emphatic critics, argues that language does not determine our thoughts and is not a cultural invention that creates perceptions; it is, in his opinion, a part of human biology (Meier & Pinker, 1995, pp. 611-612).

He suggests that the acquisition and development of sign language show that languages are instinctual, and therefore biological; he even goes so far as to say that “all speech is an illusion” (p. 613).

References

Cibelli, E., Xu, Y., Austerweil, J. L., Griffiths, T. L., & Regier, T. (2016). The Sapir-Whorf Hypothesis and Probabilistic Inference: Evidence from the Domain of Color. PLOS ONE, 11(7), e0158725. https://doi.org/10.1371/journal.pone.0158725

Engle, J. S. (2016). Of Hopis and Heptapods: The Return of Sapir-Whorf. ETC.: A Review of General Semantics, 73(1), 95. https://www.questia.com/library/journal/1G1-544562276/of-hopis-and-heptapods-the-return-of-sapir-whorf

Iwamoto, N. (2005). The Role of Language in Advancing Nationalism. Bulletin of the Institute of Humanities, 38, 91–113.

Meier, R. P., & Pinker, S. (1995). The Language Instinct: How the Mind Creates Language. Language, 71(3), 610. https://doi.org/10.2307/416234

Masuda, T., Ishii, K., Miwa, K., Rashid, M., Lee, H., & Mahdi, R. (2017). One Label or Two? Linguistic Influences on the Similarity Judgment of Objects between English and Japanese Speakers. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.01637

Kay, P., & Kempton, W. (1984). What Is the Sapir-Whorf Hypothesis? American Anthropologist, 86(1), 65–79. http://www.jstor.org/stable/679389

Whorf, B. L. (2021). Language, Thought, and Reality: Selected Writings of Benjamin Lee Whorf. Hassell Street Press.

Supplement to Philosophy of Linguistics

Whorfianism

Emergentists tend to follow Edward Sapir in taking an interest in interlinguistic and intralinguistic variation. Linguistic anthropologists have explicitly taken up the task of defending a famous claim associated with Sapir that connects linguistic variation to differences in thinking and cognition more generally. The claim is very often referred to as the Sapir-Whorf Hypothesis (though this is a largely infelicitous label, as we shall see).

This topic is closely related to various forms of relativism—epistemological, ontological, conceptual, and moral—and its general outlines are discussed elsewhere in this encyclopedia; see the section on language in the Summer 2015 archived version of the entry on relativism (§3.1). Cultural versions of moral relativism suggest that, given how much cultures differ, what is moral for you might depend on the culture you were brought up in. A somewhat analogous view would suggest that, given how much language structures differ, what is thinkable for you might depend on the language you use. (This is actually a kind of conceptual relativism, but it is generally called linguistic relativism, and we will continue that practice.)

Even a brief skim of the vast literature on the topic is not remotely plausible in this article; and the primary literature is in any case more often polemical than enlightening. It certainly holds no general answer to what science has discovered about the influences of language on thought. Here we offer just a limited discussion of the alleged hypothesis and the rhetoric used in discussing it, the vapid and not so vapid forms it takes, and the prospects for actually devising testable scientific hypotheses about the influence of language on thought.

Whorf himself did not offer a hypothesis. He presented his “new principle of linguistic relativity” (Whorf 1956: 214) as a fact discovered by linguistic analysis:

When linguists became able to examine critically and scientifically a large number of languages of widely different patterns, their base of reference was expanded; they experienced an interruption of phenomena hitherto held universal, and a whole new order of significances came into their ken. It was found that the background linguistic system (in other words, the grammar) of each language is not merely a reproducing instrument for voicing ideas but rather is itself the shaper of ideas, the program and guide for the individual’s mental activity, for his analysis of impressions, for his synthesis of his mental stock in trade. Formulation of ideas is not an independent process, strictly rational in the old sense, but is part of a particular grammar, and differs, from slightly to greatly, between different grammars. We dissect nature along lines laid down by our native languages. The categories and types that we isolate from the world of phenomena we do not find there because they stare every observer in the face; on the contrary, the world is presented in a kaleidoscopic flux of impressions which has to be organized by our minds—and this means largely by the linguistic systems in our minds. We cut nature up, organize it into concepts, and ascribe significances as we do, largely because we are parties to an agreement to organize it in this way—an agreement that holds throughout our speech community and is codified in the patterns of our language. The agreement is, of course, an implicit and unstated one, but its terms are absolutely obligatory ; we cannot talk at all except by subscribing to the organization and classification of data which the agreement decrees. (Whorf 1956: 212–214; emphasis in original)

Later, Whorf’s speculations about the “sensuously and operationally different” character of different snow types for “an Eskimo” (Whorf 1956: 216) developed into a familiar journalistic meme about the Inuit having dozens or scores or hundreds of words for snow; but few who repeat that urban legend recall Whorf’s emphasis on its being grammar, rather than lexicon, that cuts up and organizes nature for us.

In an article written in 1937, posthumously published in an academic journal (Whorf 1956: 87–101), Whorf clarifies what is most important about the effects of language on thought and world-view. He distinguishes ‘phenotypes’, which are overt grammatical categories typically indicated by morphemic markers, from what he called ‘cryptotypes’, which are covert grammatical categories, marked only implicitly by distributional patterns in a language that are not immediately apparent. In English, the past tense would be an example of a phenotype (it is marked by the -ed suffix in all regular verbs). Gender in personal names and common nouns would be an example of a cryptotype, not systematically marked by anything. In a cryptotype, “class membership of the word is not apparent until there is a question of using it or referring to it in one of these special types of sentence, and then we find that this word belongs to a class requiring some sort of distinctive treatment, which may even be the negative treatment of excluding that type of sentence” (p. 89).

Whorf’s point is the familiar one that linguistic structure is comprised, in part, of distributional patterns in language use that are not explicitly marked. What follows from this, according to Whorf, is not that the existing lexemes in a language (like its words for snow) comprise covert linguistic structure, but that patterns shared by word classes constitute linguistic structure. In ‘Language, mind, and reality’ (1942; published posthumously in Theosophist, a magazine published in India for the followers of the 19th-century spiritualist Helena Blavatsky) he wrote:

Because of the systematic, configurative nature of higher mind, the “patternment” aspect of language always overrides and controls the “lexation”…or name-giving aspect. Hence the meanings of specific words are less important than we fondly fancy. Sentences, not words, are the essence of speech, just as equations and functions, and not bare numbers, are the real meat of mathematics. We are all mistaken in our common belief that any word has an “exact meaning.” We have seen that the higher mind deals in symbols that have no fixed reference to anything, but are like blank checks, to be filled in as required, that stand for “any value” of a given variable, like …the x, y, z of algebra. (Whorf 1942: 258)

Whorf apparently thought that only personal and proper names have an exact meaning or reference (Whorf 1956: 259).

For Whorf, it was an unquestionable fact that language influences thought to some degree:

Actually, thinking is most mysterious, and by far the greatest light upon it that we have is thrown by the study of language. This study shows that the forms of a person’s thoughts are controlled by inexorable laws of pattern of which he is unconscious. These patterns are the unperceived intricate systematizations of his own language—shown readily enough by a candid comparison and contrast with other languages, especially those of a different linguistic family. His thinking itself is in a language—in English, in Sanskrit, in Chinese. [footnote omitted] And every language is a vast pattern-system, different from others, in which are culturally ordained the forms and categories by which the personality not only communicates, but analyzes nature, notices or neglects types of relationship and phenomena, channels his reasoning, and builds the house of his consciousness. (Whorf 1956: 252)

He seems to regard it as necessarily true that language affects thought, given

  • the fact that language must be used in order to think, and
  • the facts about language structure that linguistic analysis discovers.

He also seems to presume that the only structure and logic that thought has is grammatical structure. These views are not the ones that after Whorf’s death came to be known as ‘the Sapir-Whorf Hypothesis’ (a sobriquet due to Hoijer 1954). Nor are they what was called the ‘Whorf thesis’ by Brown and Lenneberg (1954) which was concerned with the relation of obligatory lexical distinctions and thought. Brown and Lenneberg (1954) investigated this question by looking at the relation of color terminology in a language and the classificatory abilities of the speakers of that language. The issue of the relation between obligatory lexical distinctions and thought is at the heart of what is now called ‘the Sapir-Whorf Hypothesis’ or ‘the Whorf Hypothesis’ or ‘Whorfianism’.

1. Banal Whorfianism

No one is going to be impressed with a claim that some aspect of your language may affect how you think in some way or other; that is neither a philosophical thesis nor a psychological hypothesis. So it is appropriate to set aside entirely the kind of so-called hypotheses that Steven Pinker presents in The Stuff of Thought (2007: 126–128) as “five banal versions of the Whorfian hypothesis”:

  • “Language affects thought because we get much of our knowledge through reading and conversation.”
  • “A sentence can frame an event, affecting the way people construe it.”
  • “The stock of words in a language reflects the kinds of things its speakers deal with in their lives and hence think about.”
  • “[I]f one uses the word language in a loose way to refer to meanings,… then language is thought.”
  • “When people think about an entity, among the many attributes they can think about is its name.”

These are just truisms, unrelated to any serious issue about linguistic relativism.

We should also set aside some methodological versions of linguistic relativism discussed in anthropology. It may be excellent advice to a budding anthropologist to be aware of linguistic diversity, and to be on the lookout for ways in which your language may affect your judgment of other cultures; but such advice does not constitute a hypothesis.

2. The so-called Sapir-Whorf hypothesis

The term “Sapir-Whorf Hypothesis” was coined by Harry Hoijer in his contribution (Hoijer 1954) to a conference on the work of Benjamin Lee Whorf in 1953. But anyone looking in Hoijer’s paper for a clear statement of the hypothesis will look in vain. Curiously, despite his stated intent “to review and clarify the Sapir-Whorf hypothesis” (1954: 93), Hoijer did not even attempt to state it. The closest he came was this:

The central idea of the Sapir-Whorf hypothesis is that language functions, not simply as a device for reporting experience, but also, and more significantly, as a way of defining experience for its speakers.

The claim that “language functions…as a way of defining experience” appears to be offered as a kind of vague metaphysical insight rather than either a statement of linguistic relativism or a testable hypothesis.

And if Hoijer seriously meant that what qualitative experiences a speaker can have are constituted by that speaker’s language, then surely the claim is false. There is no reason to doubt that non-linguistic sentient creatures like cats can experience (for example) pain or heat or hunger, so having a language is not a necessary condition for having experiences. And it is surely not sufficient either: a robot with a sophisticated natural language processing capacity could be designed without the capacity for conscious experience.

In short, it is a mystery what Hoijer meant by his “central idea”.

Vague remarks of the same loosely metaphysical sort have continued to be a feature of the literature down to the present. The statements made in some recent papers, even in respected refereed journals, contain non-sequiturs echoing some of the remarks of Sapir, Whorf, and Hoijer. And they come from both sides of the debate.

3. Anti-Whorfian rhetoric

Lila Gleitman is an Essentialist on the other side of the contemporary debate: she is against linguistic relativism, and against the broadly Whorfian work of Stephen Levinson’s group at the Max Planck Institute for Psycholinguistics. In the context of criticizing a particular research design, Li and Gleitman (2002) quote Whorf’s claim that “language is the factor that limits free plasticity and rigidifies channels of development”. But in the claim cited, Whorf seems to be talking about the psychological topic that holds universally of human conceptual development, not claiming that linguistic relativism is true.

Li and Gleitman then claim (p. 266) that such (Whorfian) views “have diminished considerably in academic favor” in part because of “the universalist position of Chomskian linguistics, with its potential for explaining the striking similarity of language learning in children all over the world.” But there is no clear conflict or even a conceptual connection between Whorf’s views about language placing limits on developmental plasticity, and Chomsky’s thesis of an innate universal architecture for syntax. In short, there is no reason why Chomsky’s I-languages could not be innately constrained, but (once acquired) cognitively and developmentally constraining.

For example, the supposedly deep linguistic universal of ‘recursion’ (Hauser et al. 2002) is surely quite independent of whether the inventory of color-name lexemes in your language influences the speed with which you can discriminate between color chips. And conversely, universal tendencies in color naming across languages (Kay and Regier 2006) do not show that color-naming differences among languages are without effect on categorical perception (Thierry et al. 2009).

4. Strong and weak Whorfianism

One of the first linguists to defend a general form of universalism against linguistic relativism, thus presupposing that they conflict, was Julia Penn (1972). She was also an early popularizer of the distinction between ‘strong’ and ‘weak’ formulations of the Sapir-Whorf Hypothesis (and an opponent of the ‘strong’ version).

‘Weak’ versions of Whorfianism state that language influences or defeasibly shapes thought. ‘Strong’ versions state that language determines thought, or fixes it in some way. The weak versions are commonly dismissed as banal (because of course there must be some influence), and the stronger versions as implausible.

The weak versions are considered banal because they are not adequately formulated as testable hypotheses that could conflict with relevant evidence about language and thought.

Why would the strong versions be thought implausible? For a language to make us think in a particular way, it might seem that it must at least temporarily prevent us from thinking in other ways, and thus make some thoughts not only inexpressible but unthinkable. If this were true, then strong Whorfianism would conflict with the Katzian effability claim. There would be thoughts that a person couldn’t think because of the language(s) they speak.

Some are fascinated by the idea that there are inaccessible thoughts; and the notion that learning a new language gives access to entirely new thoughts and concepts seems to be a staple of popular writing about the virtues of learning languages. But many scientists and philosophers intuitively rebel against violations of effability: thinking about concepts that no one has yet named is part of their job description.

The resolution lies in seeing that the language could affect certain aspects of our cognitive functioning without making certain thoughts unthinkable for us.

For example, Greek has separate terms for what we call light blue and dark blue, and no word meaning what ‘blue’ means in English: Greek forces a choice on this distinction. Experiments have shown (Thierry et al. 2009) that native speakers of Greek react faster when categorizing light blue and dark blue color chips—apparently a genuine effect of language on thought. But that does not make English speakers blind to the distinction, or imply that Greek speakers cannot grasp the idea of a hue falling somewhere between green and violet in the spectrum.

There is no general or global ineffability problem. There is, though, a peculiar aspect of strong Whorfian claims, giving them a local analog of ineffability: the content of such a claim cannot be expressed in any language it is true of. This does not make the claims self-undermining (as with the standard objections to relativism); it doesn’t even mean that they are untestable. They are somewhat anomalous, but nothing follows concerning the speakers of the language in question (except that they cannot state the hypothesis using the basic vocabulary and grammar that they ordinarily use).

If there were a true hypothesis about the limits that basic English vocabulary and constructions puts on what English speakers can think, the hypothesis would turn out to be inexpressible in English, using basic vocabulary and the usual repertoire of constructions. That might mean it would be hard for us to discuss it in an article in English unless we used terminological innovations or syntactic workarounds. But that doesn’t imply anything about English speakers’ ability to grasp concepts, or to develop new ways of expressing them by coining new words or elaborated syntax.

5. Constructing and evaluating Whorfian hypotheses

A number of considerations are relevant to formulating, testing, and evaluating Whorfian hypotheses.

Genuine hypotheses about the effects of language on thought will always have a duality: there will be a linguistic part and a non-linguistic one. The linguistic part will involve a claim that some feature is present in one language but absent in another.

Whorf himself saw that it was only obligatory features of languages that established “mental patterns” or “habitual thought” (Whorf 1956: 139), since if it were optional then the speaker could optionally do it one way or do it the other way. And so this would not be a case of “constraining the conceptual structure”. So we will likewise restrict our attention to obligatory features here.

Examples of relevant obligatory features would include lexical distinctions like the light vs. dark blue forced choice in Greek, or the forced choice between “in (fitting tightly)” vs. “in (fitting loosely)” in Korean. They also include grammatical distinctions like the forced choice in Spanish 2nd-person pronouns between informal/intimate and formal/distant (informal tú vs. formal usted in the singular; informal vosotros vs. formal ustedes in the plural), or the forced choice in Tamil 1st-person plural pronouns between inclusive (“we = me and you and perhaps others”) and exclusive (“we = me and others not including you”).

The non-linguistic part of a Whorfian hypothesis will contrast the psychological effects that habitually using the two languages has on their speakers. For example, one might conjecture that the habitual use of Spanish induces its speakers to be sensitive to the formal and informal character of the speaker’s relationship with their interlocutor while habitually using English does not.

So testing Whorfian hypotheses requires testing two independent hypotheses with the appropriate kinds of data. In consequence, evaluating them requires the expertise of both linguistics and psychology, and is a multidisciplinary enterprise. Clearly, the linguistic hypothesis may hold up where the psychological hypothesis does not, or conversely.

In addition, if linguists discovered that some linguistic feature was optional in two different languages, then even if psychological experiments showed differences between the two populations of speakers, this would not show linguistic determination or influence. The cognitive differences might depend on (say) cultural differences.

A further important consideration concerns the strength of the inducement relationship that a Whorfian hypothesis posits between a speaker’s language and their non-linguistic capacities. The claim that your language shapes or influences your cognition is quite different from the claim that your language makes certain kinds of cognition impossible (or obligatory) for you. The strength of any Whorfian hypothesis will vary depending on the kind of relationship being claimed, and the ease of revisability of that relation.

A testable Whorfian hypothesis will have a schematic form something like this:

  • Linguistic part: Feature F is obligatory in L1 but optional in L2.
  • Psychological part: Speaking a language with obligatory feature F bears relation R to the cognitive effect C.

The relation R might in principle be causation or determination, but it is important to see that it might merely be correlation, or slight favoring; and the non-linguistic cognitive effect C might be readily suppressible or revisable.
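As a rough illustration only, the two-part schema can be written down as a small data structure. The sketch below is in Python; the class names, the list of candidate relations, and the `is_strong` test are hypothetical choices made for this example and are not part of any published formalization. The Greek example reuses the light/dark blue case discussed above, and the relation and revisability values assigned to it are assumptions, not reported findings.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Candidate values for R. The ordering is an illustrative assumption."""
    CORRELATION = 1
    SLIGHT_FAVORING = 2
    CAUSATION = 3
    DETERMINATION = 4


@dataclass(frozen=True)
class WhorfianHypothesis:
    feature: str           # F
    obligatory_in: str     # L1
    optional_in: str       # L2
    relation: Relation     # R
    cognitive_effect: str  # C
    revisable: bool        # can C be readily suppressed or revised?

    def linguistic_part(self) -> str:
        return (f"{self.feature} is obligatory in {self.obligatory_in} "
                f"but optional in {self.optional_in}.")

    def psychological_part(self) -> str:
        return (f"Speaking a language with obligatory {self.feature} bears "
                f"relation {self.relation.name} to {self.cognitive_effect}.")

    def is_strong(self) -> bool:
        # Assumed reading of 'strong': determination with no revisability.
        return self.relation is Relation.DETERMINATION and not self.revisable


greek_blue = WhorfianHypothesis(
    feature="the light/dark blue lexical distinction",
    obligatory_in="Greek",
    optional_in="English",
    relation=Relation.SLIGHT_FAVORING,  # assumed for illustration
    cognitive_effect="faster categorization of light and dark blue color chips",
    revisable=True,                     # assumed for illustration
)

print(greek_blue.linguistic_part())
print(greek_blue.psychological_part())
print("strong form:", greek_blue.is_strong())
```

Keeping the two parts as separate methods mirrors the point that each must be tested independently: the linguistic claim may hold where the psychological claim fails, or conversely.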

Dan Slobin (1996) presents a view that competes with Whorfian hypotheses as standardly understood. He hypothesizes that when the speakers are using their cognitive abilities in the service of a linguistic ability (speaking, writing, translating, etc.), the language they are planning to use to express their thought will have a temporary online effect on how they express their thought. The claim is that as long as language users are thinking in order to frame their speech or writing or translation in some language, the mandatory features of that language will influence the way they think.

On Slobin’s view, these effects quickly attenuate as soon as the activity of thinking for speaking ends. For example, if a speaker is thinking for writing in Spanish, then Slobin’s hypothesis would predict that given the obligatory formal/informal 2nd-person pronoun distinction they would pay greater attention to the formal/informal character of their social relationships with their audience than if they were writing in English. But this effect is not permanent. As soon as they stop thinking for speaking, the effect of Spanish on their thought ends.

Slobin’s non-Whorfian linguistic relativist hypothesis raises the importance of psychological research on bilinguals or people who currently use two or more languages with a native or near-native facility. This is because one clear way to test Slobin-like hypotheses relative to Whorfian hypotheses would be to find out whether language-correlated non-linguistic cognitive differences between speakers hold for bilinguals only when they are thinking for speaking in one language, but not when they are thinking for speaking in some other language. If the relevant cognitive differences appeared and disappeared depending on which language speakers were planning to express themselves in, it would go some way to vindicate Slobin-like hypotheses over more traditional Whorfian hypotheses. Of course, one could alternately accept a broadening of Whorfian hypotheses to include Slobin-like evanescent effects. Either way, attention must be paid to the persistence and revisability of the linguistic effects.

Kousta et al. (2008) show that “for bilinguals there is intraspeaker relativity in semantic representations and, therefore, [grammatical] gender does not have a conceptual, non-linguistic effect” (843). Grammatical gender is obligatory in the languages in which it occurs and has been claimed by Whorfians to have persistent and enduring non-linguistic effects on representations of objects (Boroditsky et al. 2003). However, Kousta et al. support the claim that bilinguals’ semantic representations vary depending on which language they are using, and thus have transient effects. This suggests that although some semantic representations of objects may vary from language to language, their non-linguistic cognitive effects are transitory.

Some advocates of Whorfianism have held that if Whorfian hypotheses were true, then meaning would be globally and radically indeterminate. Thus, the truth of Whorfian hypotheses is equated with global linguistic relativism, a well-known self-undermining form of relativism. But as we have seen, not all Whorfian hypotheses are global hypotheses: they are about what is induced by particular linguistic features. And the associated non-linguistic perceptual and cognitive differences can be quite small, perhaps insignificant. For example, Thierry et al. (2009) provide evidence that an obligatory lexical distinction between light and dark blue affects Greek speakers’ color perception in the left hemisphere only. And the question of the degree to which this affects sensuous experience is not addressed.

The fact that Whorfian hypotheses need not be global linguistic relativist hypotheses means that they do not conflict with the claim that there are language universals. Structuralists of the first half of the 20th century tended to disfavor the idea of universals: Martin Joos’s characterization of structuralist linguistics as claiming that “languages can differ without limit as to either extent or direction” (Joos 1966, 228) has been much quoted in this connection. If the claim that languages can vary without limit were conjoined with the claim that languages have significant and permanent effects on the concepts and worldview of their speakers, a truly profound global linguistic relativism would result. But neither conjunct should be accepted. Joos’s remark is regarded by nearly all linguists today as overstated (and merely a caricature of the structuralists), and Whorfian hypotheses do not have to take a global or deterministic form.

John Lucy, a conscientious and conservative researcher of Whorfian hypotheses, has remarked:

We still know little about the connections between particular language patterns and mental life—let alone how they operate or how significant they are…a mere handful of empirical studies address the linguistic relativity proposal directly and nearly all are conceptually flawed. (Lucy 1996, 37)

Although further empirical studies on Whorfian hypotheses have been completed since Lucy published his 1996 review article, it is hard to find any that have satisfied the criteria of:

  • adequately utilizing both the relevant linguistic and psychological research,
  • focusing on obligatory rather than optional linguistic features,
  • stating hypotheses in a clear testable way, and
  • ruling out relevant competing Slobin-like hypotheses.

There is much important work yet to be done on testing the range of Whorfian hypotheses and other forms of linguistic conceptual relativism, and on understanding the significance of any Whorfian hypotheses that turn out to be well supported.

Copyright © 2024 by Barbara C. Scholz, Francis Jeffry Pelletier <francisp@ualberta.ca>, Geoffrey K. Pullum <pullum@gmail.com>, and Ryan Nefdt <ryan.nefdt@uct.ac.za>

The Stanford Encyclopedia of Philosophy is copyright © 2024 by The Metaphysics Research Lab , Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

Grammatical gender and linguistic relativity: A systematic review

  • Theoretical Review
  • Published: 19 August 2019
  • Volume 26, pages 1767–1786 (2019)

  • Steven Samuel (ORCID: orcid.org/0000-0001-7776-7427),
  • Geoff Cole &
  • Madeline J. Eacott

Many languages assign nouns to a grammatical gender class, such that “bed” might be assigned masculine gender in one language (e.g., Italian) but feminine gender in another (e.g., Spanish). In the context of research assessing the potential for language to influence thought (the linguistic relativity hypothesis), a number of scholars have investigated whether grammatical gender assignment “rubs off” on concepts themselves, such that Italian speakers might conceptualize beds as more masculine than Spanish speakers do. We systematically reviewed 43 pieces of empirical research examining grammatical gender and thought, which together tested 5,895 participants. We classified the findings in terms of their support for this hypothesis and assessed the results against parameters previously identified as potentially influencing outcomes. Overall, we found that support was strongly task- and context-dependent, and rested heavily on outcomes that have clear and equally viable alternative explanations. We also argue that it remains unclear whether grammatical gender is in fact a useful tool for investigating relativity.

Introduction

The Sapir–Whorf or linguistic relativity hypothesis (henceforth, simply “relativity”; Whorf, 1956) takes various forms, but at its heart it contends that the idiosyncrasies of the languages we speak influence the way we think about the world. In its strongest incarnation, linguistic determinism, thought is constrained by language, but few if any contemporary scholars take this view (Athanasopoulos, 2009). At the opposite extreme is the “universalist” position, in which thought is said to be independent of language (e.g., Pinker, 1994; see Lucy, 2016, for a historical overview). Although there is no agreement as to where between these two opposing views the truth is situated, the broad consensus is that neither extreme is correct (Gleitman & Papafragou, 2013). In the middle are a variety of standpoints, some of which are more limiting of the role of language than others. For example, thinking for language (i.e., thinking about speaking) might allow language to bias our attention toward the more linguistically describable aspects of what we perceive (Slobin, 1996), but this does not mean that language alters the underlying conceptual structures. Another standpoint is that language can influence our judgments about what we perceive, but not perception itself, which is cognitively impenetrable (see Firestone & Scholl, 2016, for debate; see also Pylyshyn, 1984). For many, our perceptions are perhaps best described as modulated or biased by language (Athanasopoulos, 2006; Dolscheid, Shayan, Majid, & Casasanto, 2013; Gilbert, Regier, Kay, & Ivry, 2006, 2008; Lupyan, 2012). Overall, investigating the ways in which language does and does not relate to thought now appears to be the prevailing approach (Lucy, 2016; Lupyan, 2012; Thierry, 2016).

Evidence for language influencing thought and judgments about perceptions comes from various domains. For example, performance on color discrimination/matching tasks has been shown to be more efficient when the stimuli have different labels in a participant’s language relative to when they are subsumed under one term (e.g., Roberson, Pak, & Hanley, 2008 ; Winawer et al., 2007 ). Differences between languages have also been demonstrated to affect how people think about object relations (Park & Ziegler, 2014 ) and objects themselves (Imai & Gentner, 1997 ) as well as broader, more abstract concepts such as quantity (Athanasopoulos, 2006 ), time (Boroditsky, 2001 ; Casasanto et al., 2004 ), and motion (Athanasopoulos & Bylund, 2013 ). Some of these studies have generated vigorous debate, perhaps most notably in the area of color perception (e.g., A. Brown, Lindsey, & Guckes, 2011 ; Franklin, Clifford, Williamson, & Davies, 2005 ; Witzel & Gegenfurtner, 2013 ). For some, aspects of grammar might provide a better tool by which relativity can be explored than vision-focused research; this is because grammar is unaffected by sensory input, obviating the need for language to “breach” psychophysical barriers. Additionally, for the relativity hypothesis to hold, it should not be limited to category labels but extend to features of syntax too (Sato & Athanasopoulos, 2018 ).

One aspect of grammar that has received attention in recent years is grammatical gender (Bassetti & Nicoladis, 2016; Bender, Beller, & Klauer, 2018). Unlike English, which has a semantic (or conceptual) gender system whereby the gender of a noun is dictated by the biological sex of its referent, many languages have a formal system by which all nouns are assigned to a grammatical gender category whether their referents have biological sex or not. For example, the English word “bed” has no gender, and is referred to with the pronoun “it,” but its Italian equivalent (il letto) takes the masculine gender. The consequence of a formal grammatical gender system is obligatory conformity or agreement with the syntactic rules of that class. This may involve marking for gender any definite or indefinite articles (“the,” “a/an”), plural markings, and case markings, as well as other forms of agreement. Different languages also assign different genders to the same objects; for example, in contrast to Italian, “bed” takes feminine grammatical gender in Spanish (la cama). Such differences are not at all exceptional; grammatical gender assignment in a given language is largely arbitrary (Corbett, 1991; Foundalis, 2002), and “escapes logic” (Boutonnet, Athanasopoulos, & Thierry, 2012). Indeed, as English and other languages with purely semantic gender show, a formal grammatical gender system is also unnecessary when it comes to the communicative function of a language.

The crucial exception to arbitrariness in grammatical gender assignment concerns words referring to entities, usually human, with actual biological sex. “Woman” is feminine and “man” masculine across German, French, Italian, and Spanish, and it is from these associations, presumably (see Footnote 1), that grammatical “gender” receives its name. Occasionally, the correlation between grammatical gender and biological sex is imperfect; witness German, which has a third, neuter gender category in addition to masculine and feminine, and which assigns this gender to the word for “girl” (das Mädchen, neuter). Nevertheless, even in German the vast majority of words referring to humans that have biological sex show a strong pattern of masculine with males and feminine with females. In this sense, languages with a formal gender system can be viewed as having a largely semantic gender system for human agents at least.

It is the combination of the arbitrariness of grammatical gender assignment and the link with biological sex that has attracted researchers to grammatical gender when examining relativity (e.g., Bassetti, 2007; Sato & Athanasopoulos, 2018). The question asked in such studies is whether grammatical gender assignment “rubs off” on the conceptual representations of inanimate objects that have no biological sex such that, for example, Italian speakers conceptualize beds as somehow more masculine than do speakers of a language with no formal grammatical gender system, or speakers of a gendered language that assigns a different gender to the same noun. Although historically seen as a quirk of grammar, grammatical gender is therefore regarded as being in a particularly strong position to inform long-standing philosophical, linguistic, and psychological debates around the universality or otherwise of human thought (Phillips & Boroditsky, 2003).

A variety of tasks have been employed to understand whether grammatical gender does influence concepts. The most common has been the voice choice task (16 different publications using this task were discovered in this review), in which participants are asked to assign a male or female voice to objects, with many finding that the sex of the voice and the grammatical gender of the target are indeed broadly consistent (e.g., Kurinski, Jambor, & Sera, 2016 ; Lambelet, 2016 ; Ramos & Roberson, 2011 ; Sera, Berge, & del Castillo Pintado, 1994 ; Sera et al., 2002 ). Similar results have been found when asking participants to assign a human name or a sex to an object (“sex assignment” tasks; e.g., Belacchi & Cubelli, 2012 ; Flaherty, 2001 ), and when asking participants to rate on a scale the similarity (“similarity task”) between pictures of male and female humans and objects (Phillips & Boroditsky, 2003 ). In object–name memory association tasks, participants are instructed to remember male and female names that substitute object names, such that “chair” might now be “Patricia,” and the results have sometimes shown that the ability to recall the human name is enhanced if it is congruent with the grammatical gender of the object in question (Boroditsky & Schmidt, 2000 ).

The tasks described above all involve judgments made without time pressure or measurement, but there is also some, albeit limited, evidence in favor of grammatical gender influencing concepts from speeded-response tasks. Converging evidence from different measurement types is important to generate the clearest picture possible of any effect. For example, in the extrinsic affective Simon task (EAST), participants respond to words using two keys; in one condition, each key is mapped to a color, and in another each key is mapped to a sex (male or female). Bender, Beller, and Klauer ( 2016b ) found that German speakers were faster to respond to the color of a word if that word was a gendered, personifiable noun (e.g., death, beauty) and the response key was also mapped to the sex implied by the target’s grammatical gender. However, they also found that the effect was weak or nonexistent for inanimate objects, and only held for personifiable nouns that had connotations of sex that were themselves congruent with the noun’s grammatical gender (e.g., “war” is masculine in connotation and gender, but “Spring” is feminine in connotation but masculine in gender).

The role of gender or sex in the tasks so far described has been clear, because participants are making judgments and responses with biological sex playing an obvious role in this process. Some researchers have preferred tasks in which the salience of sex and gender can be reduced or made more orthogonal to the participant’s conscious experience. For example, Konishi ( 1993 ) employed a property judgment paradigm, finding that German speakers rated masculine-gendered objects such as “key,” “table,” and “beach,” as more “potent” (a trait associated with masculinity) than concepts with feminine gender; Spanish speakers, for whom the same targets took the feminine gender, rated the same items as less potent. This type of task makes participants think about concepts without having to relate them to gender or sex directly; indeed, neither term need come up at all in the instructions for such experiments.

Other studies, often using the same or similar designs as those described above, have instead reported potentially important absences of evidence. For example, in a properties judgment task, Landor (2014) asked approximately 600 speakers of gendered languages to generate adjectives to describe inanimate objects, and then asked a different group of participants to assign a male or female voice to them. When these results were cross-referenced back to the original stimuli, no evidence of property judgments consistent with grammatical gender was found. Absences of evidence have also been reported in sex assignment tasks (Nicoladis & Foursha-Stevenson, 2012), similarity tasks (Degani, 2007), the EAST (Bender et al., 2018), and priming-type tasks (e.g., Degani, 2007; Samuel, Roehr-Brackin, & Roberson, 2016). The importance of these findings is that they might delineate important constraints on the relativity hypothesis, an outcome that, as we discussed earlier, would be consistent with the majority of theoretical standpoints in the contemporary literature.

Further clarifications of such constraints might come from results that do not conform to a binary “support or no-support” distinction. For example, a property judgment task by Semenuks, Phillips, Dalca, Kim, and Boroditsky ( 2017 ) found that grammatical gender influenced property judgments, but only in an analysis that left out participants’ first-choice adjectives, focusing on second- and third-choice adjectives alone. The authors suggest that the absence of an effect from the first adjective choice might explain the tendency for more negative findings from speeded-response tasks, but an alternative explanation is that participants fell back on grammatical gender as a strategy after they had made their initial and potentially most faithful descriptions (cf. Bender, Beller, & Klauer, 2016a ).

Other studies have shown support for relativity only for limited subsets of results rather than for the full range of predictions. For example, in the study by Konishi ( 1993 ) already described, German and Spanish speakers both showed evidence of perceiving objects with masculine gender as more “potent” than objects with feminine gender, with “potent” rated as a masculine trait; however, participants also rated “negative” as a masculine trait, but this property did not transfer to objects. Indeed, even the link with potency involved very small differences between masculine and feminine concepts. Similarly, in a sex assignment task, Nicoladis, Da Costa, and Foursha-Stevenson ( 2016 ) found that Russian–English bilingual preschoolers assigned male sex to masculine gender objects more frequently than they did to neuter gender objects, but not more frequently than they did to feminine gender objects; and in a voice choice task, decisions have been found to be consistent with masculine gender but not with feminine gender (Bassetti, 2007 ).

As we mentioned earlier, the prevailing standpoint is that neither a strong view of relativity nor a universalist view provides an accurate account of the relationship between language and thought, and as a result the absences of evidence and the mixed results hint at precisely the systematic constraints most researchers seek to reveal. It is our view that a systematic review of the empirical literature would be well-placed to begin the process of drawing conclusions concerning the strength, extent, and limitations of any such relationship.

Before describing the results of the review, we outline the six potential constraints or “parameters” that we have extracted from discussions in the literature. Assessment of the influence of these parameters, together with the effects of the large variety of tasks used in this field, runs through our review. Finally, we consider whether the literature supports the view that the effects of grammatical gender are primarily about language, or whether the effects attributed to relativity might instead be explained by statistical co-occurrences and associations that are carried by language, but are not linguistic themselves. Such questions have the potential to go to the heart of contemporary thought about the role of language in cognition (Lupyan, 2012 ; Thierry, 2016 ).

The salience of gender/sex in the task

In terms of methodology, many of the effects of grammatical gender reported in the literature have come from explicit judgments made when gender and/or sex is a salient context in the experiment (e.g., Phillips & Boroditsky, 2003; Saalbach, Imai, & Schalk, 2012; Sera et al., 1994; Sera et al., 2002). Indeed, the most common task in the field, voice choice, asks participants to assign a biological sex to an object. Given the nature of the question (i.e., there is no objectively correct answer), the participant might seek some rationale for the choices rather than assign at random. Under such conditions, the chances of grammatical gender being consciously recruited are increased, potentially undermining a conceptual change account. A number of scholars have raised this issue (e.g., Bender, Beller, & Klauer, 2011; Bender et al., 2018; Kousta, Vinson, & Vigliocco, 2008; Pavlidou & Alvanoudi, 2013; Ramos & Roberson, 2011), and Phillips and Boroditsky (2003) have suggested that it is difficult to resolve empirically. For instance, in Study 2 of Sera et al. (1994), reference to “masculine” and “feminine” was removed from the instructions from the previous experiment, but participants were nevertheless still required to assign male or female voices. Even when the role of gender/sex is less prominent in a task, it is often still high. For instance, the EAST (Bender et al., 2016a, 2016b) maps all responses onto the same keys, meaning that even when participants respond according to color, they do so using a key that has also been assigned to a biological sex. This is a necessary facet of a design that relies on mapping two concepts to the same response in order to test for any effect of the overlap. Any strategic use of grammatical gender, or simply its recruitment via strong associations with biological sex in a task, might undermine the case for a conceptual-change account of results, and it is clearly important to understand the extent to which the evidence in favor of relativity might rely on such possibilities, by means of comparing high- versus low-gender/sex salience in tasks.

The salience of language in the task

Another issue concerns the salience of language in the task and whether the design instead indexes language processes, as in “thinking for speaking” (Slobin, 1996 ). Indeed, it was an early requirement of language-and-thought research that effects of language should be evident on nonlinguistic tasks (R. Brown & Lenneberg, 1954 ). As with the salience of gender/sex, it is important to understand the extent to which the evidence for relativity might rest upon the recruitment of grammatical gender through language processing rather than any underlying conceptual change. We therefore classified research as being either high or low in terms of the salience of language in the task.

Testing participants in their gendered language

For some, testing that occurs in participants’ gendered language limits inferences to the effects of grammatical gender on concepts within that language, rather than on the concepts themselves (Boroditsky & Schmidt, 2000; Slobin, 1996); if one’s language creates a worldview in which “bed” is masculine, “bed” should be conceptualized as masculine not only when acting in the context of that particular language, but also when acting in a nongendered language, such as English. By definition, this argument suggests that research is best conducted on bilinguals. In favor of such a possibility, testing participants in a second, ungendered language context has also revealed influences of a first-language gender system (Boroditsky & Schmidt, 2000; Phillips & Boroditsky, 2003; Semenuks et al., 2017). To test this particular hypothesis, we classified experiments in terms of whether participants performed in their gendered or nongendered language.

Two-gender versus three-gender languages

Another issue concerns the precise nature of the grammatical gender system under investigation. For example, in Romance languages like Spanish, Italian, and French, most nouns that refer to humans carry grammatical gender that is consistent with the target’s biological sex. In German, however, the correlation is weaker, in part owing to its third, neuter gender, but also because German articles do not always differentiate between genders as a result of the German case system. This can result, for example, in even animates being labeled with a grammatical gender incongruent with their biological sex. The issue of two- versus three-gendered languages therefore concerns whether an influence might be strongest in speakers of languages with two gender classes that form a particularly “tight fit” with semantic gender (see Saalbach et al., 2012; Sera et al., 1994; Vigliocco, Vinson, Paganelli, & Dworzynski, 2005). Results with German speakers on voice choice tasks have indicated that grammatical gender does not influence decisions in the way that, for example, French or Spanish grammatical gender does (Sera et al., 2002). If this pattern were borne out in a broader review, it might suggest a statistical, correlative relationship between biological sex and grammatical gender in relativity (see Footnote 2). For Lucy (2016), the structural consequences of grammatical gender (case markings, adjective agreement, etc.) are too often overlooked and might explain some inconsistencies in results. We classified experiments in terms of whether participants spoke a two- or three-gendered language (see Footnote 3).

Effects with animate and inanimate targets

Another parameter concerns whether grammatical gender might influence the conceptualizations of animate but not inanimate targets (e.g., Vigliocco et al., 2005 ). For example, participants who speak a gendered language (German) showed a greater willingness to erroneously endorse sex-specific statements about animals if those statements were consistent with the animals’ grammatical gender than did speakers of an ungendered language (Japanese), but the same was not true of inanimate targets (Imai, Schalk, Saalbach, & Okada, 2014 ). In a series of priming experiments, Bender et al. ( 2011 ) asked German speakers to decide the gender of a target word after they had seen either (1) a definite article denoting gender ( der for masculine and die for feminine), (2) the words Mann (“man”) and Frau (“woman”), (3) the symbols for male and female, or (4) pictograms of a man or woman. They found that the congruent linguistic article primes sped up judgments relative to incongruent trials, for both animate and inanimate targets. However, of the other primes, only the Mann/Frau pictograms had an effect, and only on animate targets. The researchers concluded that the grammatical gender of objects does not appear to “seep” into the semantic content of inanimate nouns. A broadly similar pattern was found in the properties judgment task by Semenuks et al. ( 2017 ). Results like these have led some scholars to suggest that grammatical gender is only relevant to conceptualization when sex is a relevant property of the target (Ramos & Roberson, 2011 ; Vigliocco et al., 2005 ). If this is true, then relativity is subject to an important constraint, namely that grammatical gender only interacts with targets that have biological sex in the first place. We classified experiments in terms of whether participants responded to animate or inanimate targets.

Stronger effects in adults than children

Finally, studies with adults have been thought to provide more supporting evidence for relativity than studies with children, and particularly very young children. For example, only six out of 18 Spanish-speaking 3- to 5-year-olds freely sorted pictures of inanimate objects and male and female people into groups defined by grammatical gender (Martinez & Shatz, 1996 ), and only very limited effects of grammatical gender on object conceptualization, or indeed no effect at all, have been found in other studies with young children (Bassetti, 2007 ; Nicoladis & Foursha-Stevenson, 2012 ; Sera et al., 2002 ). Weaker effects at younger ages are consistent with the possibility that it is experience with grammatical gender that leads to biological sex connotations with objects, but also with the possibility that it is metalinguistic knowledge of grammatical gender acquired through formal instruction, rather than grammatical gender assignment itself, that might explain some positive results. We classified studies in terms of whether the participants were children (< 18 years old) or adults.

An existential question for relativity: What is “gendered” about grammatical gender?

A crucial question that underpins everything in this review concerns the nature of grammatical gender itself. It has been pointed out that the “gender” in grammatical gender is not intrinsic to language but is itself an arbitrary, human-made label (Bender et al., 2018). At some point in their lives, speakers of gendered languages usually learn that the names for the categories are “masculine” and “feminine,” but would they ever choose those names without formal instruction? As was highlighted by Foundalis (2002), “masculine” and “feminine” are in fact poor predictors of the majority of nouns in their class, and even the relationship between the meanings of the words gender and sex is stronger for speakers of nongendered languages, such as English; speakers of gendered languages would not usually use the translation equivalent for “gender” in grammatical gender to refer to biological sex at all.

Grammatical gender has usually been perceived as a particularly useful tool to study relativity because its assignment patterns are so arbitrary and have no psychological reality outside of language itself, but the same point could be leveled, with more detrimental consequences for the research paradigm, at the titles of the categories themselves, which are metalinguistic labels (they describe something about language) detached from the use of the grammar itself. Grammatical gender therefore appears to suffer from an identity problem that other tools used to investigate relativity do not. To illustrate, we might as a thought experiment replace the terms “masculine” and “feminine” with “plant” and “nonplant,” “sky” and “earth,” “Group 1” and “Group 2,” “x” and “y,” or any number of arbitrary labels, without interfering with the performative use of a gendered language itself. We might predict that if we told a cohort of Spanish speakers to go about their day “believing” that the titles of the classes had changed to “x” and “y,” they could simply continue to refer to la mesa (“the table,” feminine) and el libro (“the book,” masculine) with no real-world consequences. If, however, we asked the same cohort to randomly shuffle their color-word mappings for a day, such that they might need to use the Spanish word for “blue” where they would normally use the Spanish word for “green,” or to swap their spatial prepositions, such that “over” might become “under,” we would be asking them to violate the rules of language use, and we would expect there to be errors.

If the labels “masculine” and “feminine” are in a sense historical accidents, this has consequences for the use of these categories in the relativity paradigm, because effects that have previously been attributed to conceptual change might in fact be the result of simple statistical co-occurrences and associations between two groups of labels: the entirely arbitrary, human-made labels of metalinguistic grammatical classes, on the one hand, and the similarly arbitrary “gender” assignment to noun labels, on the other. This would be the equivalent, for example, of finding that English speakers conceive of the digits 3, 5, 7, and 9 as somehow “weirder” than 2, 4, 6, and 8, because the former are arbitrarily labeled as “odd.” Although language would be the vehicle of any statistical associations, the outcome becomes trivial in the context of classic views of relativity as engaging conceptual change.

What patterns would a statistical co-occurrence account predict? We would expect five out of the six parameters described to influence results. First, speakers of a two-gendered language should show stronger effects than speakers of three-gendered languages, because the relationship between human biological sex and grammatical gender is reinforced through greater repetition and a stronger gender/sex correlation in human animates, at least when compared to the three-gendered language most commonly used in the field (German). We would additionally predict that performing in a gendered language would give rise to such associations in a way that performing in a nongendered language like English might not. We would predict that the more salient the roles of both gender/sex and language in the experiment, the greater the opportunity for associations between biological sex and language to be engaged. Finally, we would expect the presence of animate targets to elicit biological sex information more than inanimate targets. A sixth possibility, albeit a theoretically more tenuous one, is that adults would have more strongly reinforced associations than children, owing to their greater quantitative experience of language. Interestingly, all of these parameter settings have already been cited in the literature as conducive to finding effects of grammatical gender, although these suggestions have not yet been supported by a systematic review. Assuming that the strong version of relativity is wrong (as we do), it then becomes important to decide whether the effects of grammatical gender are statistical and associative or an effect of language on the fabric of conceptual representation itself.
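The predictions above can be summarised as a small lookup, offered here only as an illustrative sketch. The dictionary keys, the string values, and the helper `favourable_settings` are hypothetical names introduced for this example; they are not taken from the review or its supplemental materials, and the example experiment is invented.

```python
# Settings under which a statistical co-occurrence account would predict
# stronger effects (five of the six parameters). Key names are hypothetical.
PREDICTED_STRONGER = {
    "gender_categories": 2,               # two-gendered rather than three-gendered
    "tested_in_gendered_language": True,  # task performed in the gendered language
    "gender_sex_salience": "high",
    "language_salience": "high",
    "target_type": "animate",
}

# The sixth prediction is described as theoretically more tenuous,
# so it is kept separate from the main five.
TENUOUS_PREDICTION = {"age_group": "adult"}


def favourable_settings(experiment):
    """Return the names of the parameters whose setting in `experiment`
    matches the setting predicted to strengthen effects."""
    return [
        name
        for name, predicted in PREDICTED_STRONGER.items()
        if experiment.get(name) == predicted
    ]


# Invented experiment, used only to show how the lookup works.
example = {
    "gender_categories": 3,
    "tested_in_gendered_language": True,
    "gender_sex_salience": "high",
    "language_salience": "low",
    "target_type": "inanimate",
    "age_group": "adult",
}
print(favourable_settings(example))
```

A count of matching settings is not a measure used in the review; it simply makes explicit which parameter values the co-occurrence account treats as favourable.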

The remit of the review

This review includes (1) empirical research with (2) human participants and (3) real languages and words (rather than languages and words invented for the purposes of experimentation), (4) that is either published or unpublished and (5) is reported in English (see Footnote 4). Studies were also required to test the influence of grammatical gender on at least some nonhuman targets (excluding, therefore, studies on grammatical gender and gender stereotyping of men and women).

Formal searches were conducted encompassing the years 1990 to 2018 using the search terms “grammatical” and “gender” together, once with “Whorf” and once with “relativity,” in Web of Science (all) and Google Scholar (first five pages). Additionally, we searched the EThOS PhD thesis bank using the terms “grammatical” and “gender” for the broadest possible range of results and the NDLTD thesis bank using “grammatical gender,” once with “Whorf” and once with “relativity.” All studies from a recent special issue of the International Journal of Bilingualism (volume 20, issue 1) were included where relevant. To ensure maximal representation of relevant data and to minimize the potential for skewed results owing to potential publication biases (e.g., De Bruin, Treccani, & Della Sala, 2015 ), we also emailed the corresponding author of every study revealed in the first stage of the review to request any unpublished or in-press results. Finally, a call for data was also issued on the ResearchGate website as part of a project linked to the present review. A further 13 pieces of empirical research not turned up by these searches were added, either because they were known to the authors or were offered/recommended to the authors during this contact phase. The full list of items included and excluded from the review can be found in the supplemental online materials ( SOM1 ).

Classifications by task

Owing to the heterogeneity of methodologies in the field, we opted to sort the individual experiments into eight task types: voice choice, properties judgments, EAST, sex assignment, priming, similarity judgment, association, and object–name memory association. These eight task types were chosen because they were different enough in methodology to be considered in their own right. To illustrate, we felt that the closest pair was voice choice/sex assignment, since both require explicit biological sex judgments about targets. However, voice choice tasks require the participant to imagine an object speaking, which might recruit thinking about language in a way that assigning biological sex alone might not.

We classified every experiment according to the six parameters previously described as being potentially important to the outcomes of research. We adopted the following approach to these classifications. Regarding language content, we asked whether language goes in to the task (e.g., the stimuli are words), or comes out of the task (e.g., choosing between words such as “male” or “female,” thinking of adjectives in property judgment tasks). Where neither occurs, or where any role of language is judged to be highly orthogonal, that study is classified as low language content. The same approach was taken to gender/sex content (gender or sex information should not go in to the task or be part of the response or process leading to the response). The other parameters (age, language, number of gender categories, and target type) were readily classifiable at face value. Full details of item classifications can be found in SOM2 .

Classifications of results

Given that the results of tasks are sometimes ambiguous in their support or otherwise of relativity, each individual study was classified in terms of one of three outcomes: support, mixed support, or no support. Studies classified as offering support showed an effect of grammatical gender on conceptualizations consistent with the hypothesis employed. Studies were classified as offering mixed support if they showed partial confirmation of an influence, such as an influence of one gender but not another (Bassetti, 2007), marginally significant effects (e.g., Semenuks et al., 2017), an influence in accuracy but not in response times (e.g., Bender et al., 2016a, 2016b), or an influence shown on one criterion but not another (e.g., Konishi, 1993). A study was classified as offering no support if there was no evidence for relativity that could be classified at least as mixed support. SOM2 lists these classifications for each experiment.

Adjustments for sample size

Given that the sample sizes varied widely from study to study and condition to condition (from 7 to 924), we report results taking sample size into account. This has the obvious benefit of weighting the pattern of results in favor of those studies that are most likely to be highest-powered and less susceptible to spurious effects. Given that the same participants sometimes performed multiple conditions or analyses, we also allowed multiple data points in the review from the same participants. For example, this review allowed for separate data points for inanimate and animate targets from the same group. In such cases, separate classifications are necessary to ascertain the effect of different parameters, such that the evidence from animate targets might be classified as offering support, but the results from inanimate targets be classified as no support. Full details can be found in SOM2 .

This review adopts a methodological approach somewhere between a vote-count system (owing to the classifications of support, mixed support, or no support) and a statistical meta-analysis (owing to sample-size adjustments) (cf. Samuel, Roehr-Brackin, Pak, & Kim, 2018 ). Given the heterogeneity of their research designs, languages tested, and so on, occasionally only very small clusters of studies could be considered to be using the same methods, and often these were studies that came from the same labs and sampled the same type of linguistic population. We therefore considered this approach the better way to provide an overview of the multiple ways in which the field has investigated the research question, and how different designs may culminate in different outcomes.
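
As a rough illustration of how a sample-weighted vote count differs from a plain vote count, the sketch below tallies classifications by number of samples rather than by number of studies. The rows are hypothetical and invented purely for illustration; they are not data from the review, and the function name is our own.

```python
from collections import defaultdict

# Hypothetical rows for illustration only: (line of data, outcome, number of samples).
# These are NOT data from the review.
rows = [
    ("study_a_inanimate", "no support", 40),
    ("study_a_animate", "support", 40),   # same participants, separate data point
    ("study_b", "mixed support", 20),
    ("study_c", "no support", 100),
]

def weighted_vote_count(rows):
    """Return the share of samples falling under each outcome classification."""
    totals = defaultdict(int)
    for _label, outcome, n_samples in rows:
        totals[outcome] += n_samples
    grand_total = sum(totals.values())
    return {outcome: n / grand_total for outcome, n in totals.items()}

shares = weighted_vote_count(rows)
for outcome, share in sorted(shares.items()):
    print(f"{outcome}: {share:.0%}")
```

In this toy case an unweighted vote count would put support at one line in four (25%), whereas weighting by samples puts it at 20%, because the largest line of data carries the most weight.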

What is not in the review

We excluded studies that looked not for relationships between targets and biological sex, but rather for relationships between objects of the same grammatical gender versus objects of different grammatical gender (e.g., Almutrafi, 2015, Exp. 2; Bobb & Mani, 2013; Boutonnet et al., 2012; Cubelli, Paolieri, Lotto, & Job, 2011; Kousta et al., 2008; Yorkston & De Mello, 2005). For example, it has been demonstrated that semantic category judgments about nouns belonging to the same grammatical gender are made more quickly than judgments about nouns from different grammatical genders (Cubelli et al., 2011), and that grammatical gender information is processed in semantic similarity tasks even when it is irrelevant and undetectable by behavioral measures (Boutonnet et al., 2012). Although such studies offer support for the processing of grammatical gender information when it is apparently task-irrelevant (a useful prerequisite for tasks in this review), and even show that objects of the same grammatical gender are perceived to be more similar than objects that are not, this is not the same as demonstrating that objects are conceptualized as more masculine or feminine as a function of their gender assignment. Such results might be explained in terms of an effect of membership in the same grammatical category, independently of biological sex information (cf. Cook, 2016). Because this review is entirely concerned with the relationship between targets and biological sex, such studies, though clearly interesting in their own right, were omitted.

There have also been studies assessing how and when grammatical gender is processed in language production and comprehension, including in bilinguals for whom the grammatical gender for the same object might be opposite in their two languages (Costa, Kovacic, Fedorenko, & Caramazza, 2003 ; Costa, Kovacic, Franck, & Caramazza, 2003 ). Again, such studies are not designed to look at whether object conceptualization has the potential to be influenced by the biological sex connotation of their grammatical gender assignment, and they were therefore excluded.

The review also does not include one study that does pertain to grammatical gender and biological sex, but in which a meaningful understanding of the number of participants is not possible. This was the study by Segel and Boroditsky ( 2011 ), in which depictions of personifications and allegories in thousands of works of art were classified retrospectively in terms of their gender congruency. This was notionally classified as support.

Finally, the remit of the review excluded studies that did not involve human participants but investigated the question of relativity through connectionist models (e.g., Dilkina, McClelland, & Boroditsky, 2007 ; Sera et al., 2002 ) or through training in artificial “languages” (Eberhard, Heilman, & Scheutz, 2005 ; Phillips & Boroditsky, 2003 , Exps. 4 and 5; Sera et al., 2002 , Exp. 4) or nonsense words (Konishi, 1994 ; Vuksanovic, Bjekic, & Radivojevic, 2015 ). This is because we felt that we should limit our scope to behavior more clearly grounded in the experience of real people with real grammatical gender categories.

Overall, the initial search revealed 99 individual pieces of research, with a further 13 added that were known to the authors but were not revealed by the formal search. After removing those items that did not provide empirical data, the review included 43 individual pieces of research, one of which was unpublished (Nicoladis, 2019 ), and three of which were doctoral or master’s theses (Almutrafi, 2015 ; Degani, 2007 ; Landor, 2014 ). The remaining pieces of research were published journal articles, conference proceedings (always of the Cognitive Science Society), or book chapters. After subdividing this research by task type and condition, these pieces of research resulted in 158 lines of data (split by differences in conditions within experiments), which together surveyed 5,895 participants in total.

As we described earlier, we then calculated the number of “samples,” which was 7,334. This number differs from the number of participants because it allows for the possibility that the same participant might have performed in multiple conditions. It is for this reason that the number of samples can be higher (but not lower) than the number of participants. We present all our results in the context of samples , rather than participants, in order to capture these important within-experiment differences.

Overall results

Across the review as a whole, the results from 32% of all samples were classified as offering support for relativity, 24% were classified as offering mixed support, and 43% as offering no support. With the exception of one particularly large study (Montefinese, Ambrosini, & Roivainen, 2019 , N = 924 and N = 105, total N = 1,029), there was no evidence that the results were driven by only a small cluster of highly powered studies (see Fig. 1 ). If this outlying study were removed, support would be at 38%, mixed support at 28%, and no support at 34%.
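
The effect of removing the outlier can be approximated from the figures given in the text. The sketch below is a minimal check, assuming that all 1,029 samples from Montefinese et al. (2019) were classified as no support and using the rounded percentages and the total of 7,334 samples reported in this review; because the inputs are rounded, the output can differ from the reported values by about a percentage point. The function name is our own.

```python
# Figures reported in the text (rounded percentages, so results are approximate).
TOTAL_SAMPLES = 7334
OUTLIER_SAMPLES = 924 + 105  # Montefinese et al. (2019); assumed all "no support"
reported = {"support": 0.32, "mixed support": 0.24, "no support": 0.43}

def shares_without_outlier(reported, total, outlier_n, outlier_class="no support"):
    """Recompute classification shares after dropping one study's samples."""
    # Convert reported shares back into approximate sample counts.
    counts = {outcome: share * total for outcome, share in reported.items()}
    # Remove the outlier's samples from the class it was assigned to.
    counts[outlier_class] -= outlier_n
    remaining = total - outlier_n
    return {outcome: n / remaining for outcome, n in counts.items()}

adjusted = shares_without_outlier(reported, TOTAL_SAMPLES, OUTLIER_SAMPLES)
for outcome, share in adjusted.items():
    print(f"{outcome}: {share:.1%}")
```

Under these assumptions the recomputed shares come out close to the 38%, 28%, and 34% reported above, which suggests the adjustment simply reflects the shrinking of the no-support pool and of the overall denominator.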

Figure 1. Number of samples (vertical axis) for each research item or line of the review (horizontal axis). The mean and median samples are displayed in the text box. The outlier is Montefinese et al. (2019) with a property judgment task (N = 924 Italian speakers).

In what follows, we first describe the results by task type. We then describe results as a function of task parameters. A full at-a-glance view of all the results by task type and parameter can be found in Figs. 2 and 3 .

Figure 2. Classifications of support, mixed support, and no support by task type. The total number of samples is shown on the y-axis.

Figure 3. Classifications of support, mixed support, and no support according to the task parameters. The total number of samples is shown on the y-axis.

Results by task type

Properties judgment

Full results for the properties judgment task are displayed in Table 1 . Properties judgments made up 37% of all samples in the review, making it the most commonly performed task in the literature (i.e. it comes with the highest number of samples, rather than the highest number of uses in the literature). Only 3% of samples were classified as offering support (Flaherty, 2001 ; Imai et al., 2014 ; Saalbach et al., 2012 ). A further 23% were classified as providing mixed support; reasons were the finding that results were more consistent with grammatical gender in one group than another (that spoke a different language), but apparently not more so than chance itself (Haertlé, 2017 ); results limited to one property but not another, despite evidence that both were linked to biological sex (Konishi, 1993 ); evidence to suggest an effect of the grammatical gender of a language the participants did not speak, with no direct comparison of this effect with the language they did speak (Sedlmeier, Tipandjan, & Jänchen, 2016 ); and effects limited to second- and third-choice, but not first-choice, adjectives (Semenuks et al., 2017 ). The remaining 75% of samples offered cases of no support at all (Flaherty, 2001 ; Imai et al., 2014 ; Landor, 2014 ; Mickan, Schiefke, & Stefanowitsch, 2014 ; Montefinese et al., 2019 ; Semenuks et al., 2017 ). It should be noted that the study by Montefinese et al. represents an extreme outlier. The results of this study were classified as no support. However, even if the results of this study were removed, the rate of support for properties judgment tasks would only rise to 4%. All properties judgment tasks were intrinsically high in language content, and all have so far been conducted with adult participants. Given the almost floor-level overall rate of support, an examination of the effect of different factor parameters was not conducted.

Voice choice

The full results for the voice choice task are displayed in Table 2. The voice choice paradigm, although the task most commonly used by researchers, is only the second most commonly performed by participants in the field, accounting for 28% of all samples. Of these, 64% were classified as offering support (Almutrafi, 2015; Athanasopoulos & Boutonnet, 2016; Beller et al., 2015; Bender et al., 2016a; Haertlé, 2017; Kurinski et al., 2016; Lambelet, 2016; Ramos & Roberson, 2011; Sera et al., 1994; Sera et al., 2002; Vernich, 2017; Vernich, Argus, & Kamandulytė-Merfeldienė, 2017). An additional 22% were classified as providing mixed support; these included results consistent with the hypothesis but limited to one of two genders (Bassetti, 2007), effects for limited subsets of targets (Beller et al., 2015; Bender et al., 2016a), and statistically marginal results (Bender et al., 2018). Mixed results also included cases in which the data suggested that voice choices were not more consistent with grammatical gender than chance levels (Forbes, Poulin-Dubois, Rivero, & Sera, 2008; Sera et al., 2002), and effects limited to native speakers but not learners (including advanced learners) of the same language (Kurinski & Sera, 2011). A minority of samples offered no support at all (12%) (Bassetti, 2007; Bender et al., 2018; Forbes et al., 2008; Sera et al., 1994; Sera et al., 2002).

All voice choice tasks were classified as having a high gender content and high language content. A greater rate of support for relativity was found when participants spoke a language with two genders (69%) instead of three (24%). There was also more support from adult samples (69%) than from children (39%). To illustrate these comparisons, the majority of no-support cases came from participants who were 5- to 6-year-olds (Sera et al., 1994; Sera et al., 2002), 9-year-olds (Bassetti, 2007), or adult speakers of German, a three-gendered language (Bender et al., 2018). In contrast, there was no evidence that voice choices were more consistent with grammatical gender when applied to animate (38%) than to inanimate objects (68%), and a lower rate of support was found when assigning voices in one’s gendered language (56%) relative to one’s ungendered language (73%). Some of these results might be skewed by the relative dearth of samples that were children, that performed the task in an ungendered-language context, that had a three-gendered language background, or that were tested with animate targets.

Sex assignment

Full results for the sex assignment task are displayed in Table 3 . Sex assignment tasks are almost identical to voice choice tasks, in that they are explicit assignments of male or female sex to animate and inanimate objects. Largely thanks to the study by Belacchi and Cubelli ( 2012 ), representing 412 samples (46% of all samples with this paradigm), sex assignment makes up 12% of the samples in the review.

Overall, 66% of samples were classified as offering support (Belacchi & Cubelli, 2012 ; Flaherty, 2001 ; Pavlidou & Alvanoudi, 2018 ; Sera et al., 1994 ), 26% as mixed support (Bender et al., 2016b ; Nicoladis, 2019 ; Nicoladis et al., 2016 ; Nicoladis & Foursha-Stevenson, 2012 ; Pavlidou & Alvanoudi, 2013 ), and 8% as no support (Flaherty, 2001 ; Nicoladis & Foursha-Stevenson, 2012 ).

Sex assignment tasks are intrinsically classified as high in gender/sex content. Although they need not also have a high language content, they were judged high for every study in this review. For animate targets, the rate of support was 100%, higher than for inanimate targets (33%). The rate of support was slightly higher in children (69%) than in adults (62%). The rate of support was 75% when participants performed in their gendered language, but zero in an ungendered language. Finally, the rate of support from two-gendered languages was high (83%), but from three-gendered languages it was low (23%). Again, some of these comparisons (with the probable exception of age) might be skewed by imbalances in the number of samples that were classified as performing under each parameter setting.

EAST

The full results for the EAST are displayed in Table 4. The EAST is a unique case in this review, because it has only been used by one core group of researchers (Bender et al., 2016a, 2016b, 2018), and only ever with adult speakers of German, which is a three-gender language. It comprises 9% of the samples. There is a good case that it might constitute a priming task, but given its singularity we felt it was best considered as a task category in its own right. Overall, only 11% of the samples were classified as offering support (Bender et al., 2016a, 2018), 56% were classified as offering mixed support (Bender et al., 2016a, 2016b, 2018), and 33% as offering no support (Bender et al., 2016a, 2018).

Variation in the results by parameter should be interpreted in the context of the low overall rate of support. The EAST uses a two-key response method, with one key clearly mapped to “male” and one to “female”, so gender content is always high, and since the stimuli are always words, language content is also high. In fact, the only possible parameter comparison that can be made with the EAST concerns inanimate and animate targets, which showed similar levels of support (10% vs. 11%, respectively).

Priming

The full results for priming tasks are displayed in Table 5. There are only five different pieces of research with priming experiments in the literature, comprising 8% of the samples in the review. Care must be taken when attempting to interpret parameter patterns from only a handful of studies, where settings can be entirely confounded with individual articles. Overall, priming offered a 34% support rate (Bender et al., 2011; Sato & Athanasopoulos, 2018) and a 66% no-support rate (Bender et al., 2011; Degani, 2007; Mickan et al., 2014; Samuel et al., 2016). There were no cases of mixed classifications. Given the small overall number of samples classified as support (198 samples in total), coming from only two articles, comparisons were unlikely to reveal any reliable patterns.

Similarity judgment

The full results for similarity tasks are displayed in Table 6. Similarity judgment tasks comprised only 3% of all samples. The pattern of results, indicating 44% support (Phillips & Boroditsky, 2003), 45% mixed support (Sedlmeier et al., 2016), and 11% no support (Degani, 2007), is entirely confounded with the three individual pieces of research to use the task. The small number of samples with this task makes it difficult to draw meaningful conclusions as to what might lead to the differences in results.

Association

The full results for association tasks are displayed in Table 7. Only two pieces of research have employed an association paradigm (Bender et al., 2018; Martinez & Shatz, 1996), which together comprise only 2% of the samples in the review. All the studies with this paradigm were classified as offering no support. Gender content and language content were always high, and participants were always tested on inanimate targets and in a gendered-language context.

Object–name memory association

The full results for object–name association tasks are displayed in Table 8 . Comprising just under 2% of samples in the review, object–name memory association paradigms form the smallest task-type classification in this review. The task has been used in three separate pieces of research. Of the samples, 36% came under support (Boroditsky & Schmidt, 2000 ), 31% mixed support (Kaushanskaya & Smith, 2016 ), and 33% no support (Pavlidou & Alvanoudi, 2013 ). Note that these differences are split entirely by publication. All these studies were classified as having a high gender content and high language content. All were conducted with adult participants, and all included inanimate targets.

Results by task parameters

The distribution of samples displayed in Fig. 3 points to a number of imbalances in the literature to date. The samples in this review were typically involved in experiments with a high language content (98%). The samples were usually adults (87%), performing in their gendered language (83%), with inanimate targets (76%), and with high sex/gender salience in the task (62%). Samples also usually spoke a language with two gender categories (57%). In other words, the average experiment incorporated five out of the six parameter settings that are usually considered most conducive to results in support of relativity (inanimate targets being the exception).

Changes in the rate of support as a function of task parameters are displayed in Table 9 . In the following sections we describe comparisons where it was possible to isolate one category from another. A study that puts speakers of two-gendered and three-gendered languages into the same group, for example, was excluded entirely rather than added to both categories.

Gender/sex content

Consistent with previous views of the literature, as well as with a statistical association account of grammatical gender effects, studies with a high gender/sex content showed a higher rate of support (51%) than did studies with low gender/sex content (2%). The voice choice, sex assignment, EAST, association, and object–name association task types were always classified as high in gender/sex. The samples classified as low came almost entirely from the 2,689 (96%) who performed property judgment tasks, which came with only a 3% support rate. However, only 98 samples from this paradigm were classified as high, rendering any more detailed comparisons unreliable.

Language content

Almost all of the research was classified as high in language content (98%). The almost complete absence of research classified as low in language content makes any attempt to draw conclusions about this parameter liable to mislead. Only the priming study by Sato and Athanasopoulos (2018) and the similarity study by Phillips and Boroditsky (2003), both of which were classified as support, were classified as low in language content. We return to this issue in our Discussion.

Gendered versus ungendered language

A slightly lower rate of support was found for studies performed in a gendered language (29%) than in an ungendered language (32%). This is not consistent with a statistical association account. When we compare performance in gendered and ungendered languages at the within-task level, we see that support is higher in a gendered language context than in an ungendered language context in the sex assignment task (75% vs. 0%) and the properties task, albeit in the latter case with very low rates (3% vs. 0%). Support is slightly higher in the ungendered language for voice choice (73% vs. 56%) and priming tasks (44% vs. 31%). It is also higher for similarity tasks (100% vs. 0%) and object–name tasks (54% vs. 0%). Although 83% of all samples performed in a gendered language context, meaning that comparisons were based on imbalanced sample sizes, the pattern of results within tasks suggests that there is no clear support for the hypothesis that grammatical gender is more likely to influence thought when participants perform in a gendered language.

Number of gender categories

The review showed higher rates of support from studies with two-gender languages (43%) than with three-gender languages (16%). This outcome is consistent with a statistical association account. Broken down by task type, greater support from two-gendered languages than from three-gendered languages came from the sex assignment tasks (83% vs. 23%), voice choice tasks (69% vs. 24%), similarity tasks (52% vs. 27%), priming tasks (46% vs. 32%), and object–name association tasks (42% vs. 30%). Only properties tasks (3% vs. 5%) reversed this pattern, albeit with negligible support rates in each category.

Animate versus inanimate targets

Consistent with a statistical association account, studies with animate targets showed a higher rate of support (50%) than did studies with inanimate targets (27%). Broken down by task, this pattern was true of sex assignment tasks (100% vs. 33%), priming tasks (78% vs. 8%), and properties tasks (10% vs. 0%). The results from the EAST were almost matched (11% vs. 10%). The reverse pattern was found for voice choice tasks (38% vs. 68%). Note that the great bulk of the positive results from inanimate targets comes from the voice choice task (90% of samples).

Adults versus children

In apparent contrast with some views expressed in the literature, research with children (55%) revealed a higher rate of support than research with adults (29%). However, this result is strongly weighted by task type. Overall, 87% of the samples came from adult participants, and the great bulk of the data from children came from voice choice and sex assignment tasks (92%). Given that the tasks that children performed provided the highest rates of support, and those performed almost exclusively by adults provided the lowest (e.g., properties judgments), it is difficult to know whether this outcome is the result of an age-related difference or a task-related difference. Looking at age-related performance at the level of the individual task, there is some evidence that adults show a greater influence of grammatical gender than children do: the rate of support is 30 percentage points higher in adults than in children in voice choice tasks (69% vs. 39%), although it is 7 percentage points lower in adults in sex assignment tasks (62% vs. 69%).

Other patterns

Two fifths of all the data in the review (40%) came from the voice choice and sex assignment tasks, the two paradigms that make the clearest demands on participants to consider targets in terms of biological sex. Since the potential for the strategic use of grammatical gender under such circumstances has been one of the most frequent issues brought up in the literature, we compared the rate of support from the review as a whole with the results with these two paradigms included or excluded. Overall support across the review drops from 32% to only 11% in their absence, and no support rises from 43% to 64%. In other words, when all the data are included, approximately one in three samples in the review provides support; when the data from voice choice and sex assignment are excluded, this rate drops to one in ten.
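
The drop in support can be roughly reconstructed from the task-level figures given under "Results by task type". The sketch below pools the support rate of each task, weighted by that task's share of all samples, with and without the two explicit sex-judgment tasks. It uses only the rounded percentages reported in the text (which sum to slightly more than 100%), so it is an approximation rather than the exact calculation behind the reported values; the function name is our own.

```python
# Task-level figures as reported in the text (rounded percentages).
# Each entry: task -> (share of all samples, rate of support within the task).
tasks = {
    "properties judgment": (0.37, 0.03),
    "voice choice": (0.28, 0.64),
    "sex assignment": (0.12, 0.66),
    "EAST": (0.09, 0.11),
    "priming": (0.08, 0.34),
    "similarity judgment": (0.03, 0.44),
    "association": (0.02, 0.00),
    "object-name memory association": (0.02, 0.36),  # "just under 2%" in the text
}

def pooled_support(tasks, exclude=()):
    """Sample-weighted support rate across tasks, optionally excluding some."""
    kept = {name: v for name, v in tasks.items() if name not in exclude}
    total_share = sum(share for share, _rate in kept.values())
    weighted = sum(share * rate for share, rate in kept.values())
    # Normalize by the retained share so the result is a rate among kept samples.
    return weighted / total_share

all_tasks = pooled_support(tasks)
without_explicit = pooled_support(tasks, exclude=("voice choice", "sex assignment"))
print(f"all tasks: {all_tasks:.1%}")
print(f"without voice choice and sex assignment: {without_explicit:.1%}")
```

Even with rounded inputs, the pooled rates land close to the reported 32% and 11%, which makes plain how much of the overall support rests on these two paradigms.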

Discussion

At its broadest, the review shows that the evidence for an influence of grammatical gender on conceptualizations is highly task- and context-dependent. We found that the voice choice and sex assignment tasks formed the backbone of support for relativity; when they were removed, the support rate dropped to only 11%. With them included, about a third (32%) of the data were classified as support, relative to a no-support rate of 43% and a mixed-support rate of 24%. If we consider the possibility that publication biases mean that fewer null results make it to publication, it may be that even this support rate is an overestimate.

The review provides support for a number of important constraints on the relativity hypothesis. For example, the rate of support is higher when the gender content of a task is high than when it is low, suggesting any influence might be at least partly contingent on the opportunity to strategically call upon grammatical gender. Results are also more likely to be classified as support when participants are processing animate rather than inanimate targets, which also suggests that any influence of language might be partly contingent on the immediacy of the overlap between grammatical gender and biological sex. This finding argues against a singular, uniform effect of gender category on all its members. Finally, results were more likely to offer support when the gendered language has two gender categories rather than three, a finding that is inconsistent with a straightforward account of grammatical gender classification per se influencing the conceptualizations of objects. The review initially appeared to reveal one misconception concerning age: there was actually a higher rate of support from samples of children than adults. Upon closer inspection, this outcome was closely bound to the fact that children performed those tasks that most consistently produced positive results for relativity, namely voice choice and sex assignment. However, not all the predicted biases were supported. We found no evidence that support was more common when participants performed a task in a gendered-language context, even though thinking-for-speaking accounts predict such a pattern. The only parameter that could not be meaningfully assessed at all concerned the salience of language in the task, an issue that we return to later in this Discussion.

What these parameter-related comparisons reveal is that much of the positive evidence for relativity comes from tasks and conditions that are particularly susceptible to alternative, strategy-based explanations. Overall, we therefore take the results of this review as imposing quite powerful constraints on the relativity hypothesis as seen through the lens of grammatical gender. Nevertheless, this conclusion itself comes with a caveat; we feel the review also points to a significant weakness in much of the relevant research’s ability to speak to the issue of relativity with clarity. This means that future research could either cement this rather negative conclusion, or overturn it through stronger designs that are less susceptible to confounds. As a result, we suggest that our review provides an interim rather than final pattern of results.

We focus our discussion on those areas that we feel need addressing in future work, and make suggestions as to some ways for experiments to deal with them. First, we describe why we feel much of the data speak only weakly to the question of relativity.

How do people solve the tasks?

Some have held that for relativity to be supported there should be a reasonable expectation that participants did not engage grammatical gender information strategically. This is most clearly an option in tasks in which judgments are about sex and are explicit, such as voice choice and sex assignment, but possibly also for similarity, association, and object–name memory associations. It is therefore important to distinguish between two means of arriving at decisions in many of the tasks in this review: one as a result of conceptual change, as usually hypothesized in relativity accounts, and another through metalinguistic knowledge. Metalinguistic knowledge refers to the influence of the knowledge of a formal property of language, such as grammatical gender, on judgments about objects. Although both processes are interesting, it is relativity that the research in this review was designed to investigate. The problem is that the voice choice and sex assignment tasks upon which the bulk of the support for relativity apparently rests cannot tell the two apart.

It might be argued that researchers can simply ask participants how they performed the task, and therefore be in a position to rule out a metalinguistic strategy as a result. Interestingly, the cases in which participants have been asked how they came to their decisions have thrown up mixed results (e.g., Almutrafi, 2015 ; Kurinski & Sera, 2011 ; Sato & Athanasopoulos, 2018 ). Particularly convincing evidence for a conscious metalinguistic strategy account of voice choice performance comes from a study in which 25 out of 30 participants later admitted to using grammatical gender to guide their responses (Almutrafi, 2015 ). However, although it has often been assumed to be so, there is also no a priori reason that the use of metalinguistic strategy need be a conscious process at all. Regardless of how it might occur, the use of metalinguistic knowledge undermines a reading of results as the outcome of conceptual change.

Some researchers have pointed out that participants rarely if ever respond in a manner that is 100% consistent with biological sex. This seems to argue against a metalinguistic strategy (conscious or otherwise). However, we cannot know whether participants had more than one strategy, some more universal (such as masculine for artifacts, feminine for natural kinds: Mullen, 1990), and some more idiosyncratic or personal (see for example participants’ justifications for their choices in Kurinski & Sera, 2011), with different strategies being brought to bear at different times throughout the task. The absence of a consistent, 100% effect of grammatical gender on task performance therefore does not preclude the possibility that metalinguistic strategies might account for some of the effects that were found.

Is there evidence to support one or the other account from the results of the review? Here again we run into the logical problem that we cannot tell processes apart. For example, the finding that more support came from speakers of two-gendered languages than from speakers of three-gendered languages might be seen as weakening metalinguistic strategy accounts, because there is no reason to believe Spanish speakers should prefer such strategies over German speakers, for example. However, the availability or attractiveness of metalinguistic strategies might, on the other hand, be enhanced where there is a neater one-to-one mapping of grammatical gender category and biological sex.

Another potential criticism of a metalinguistic rather than a relativity account is that in taking the former view, one subscribes to an intrinsically negative, prejudicial default that effectively renders relativity empirically impossible to support. Essentially, giving a metalinguistic alternative explanation equal weight might set the bar for actual conceptual change accounts impossibly high. However, the scope exists to tighten experimental design to guard against alternative, “killjoy” explanations; though this review suggests that, for the most part, when these measures are in place, the likelihood of finding a positive result declines.

In our view, the most convincing evidence of the potential for metalinguistic strategy accounts comes from the results of property judgment tasks. Since this task type does not incorporate a sex/gender prompt it keeps such information at arm’s length. The very low rate of only 3% support from this task type might therefore reflect the absence of such strategies in performance.

Overall, the voice choice and sex assignment paradigms are the most susceptible to alternative explanations, and further research using these tasks without serious modification is unlikely to reveal more about relativity. Since these tasks are the most common in the field, and similar objections can be raised against other task types such as similarity, object–name memory associations, and associations, the pool of information from which we can make the most meaningful inferences about the research question is likely to be small. It is for this reason that we feel the case for grammatical gender influencing concepts is currently difficult to weigh up and awaits future research (see also the Practical suggestions for future research section below).

Language on language or language on concepts?

The process of judging whether a task incorporates language is a difficult one. For example, in a voice choice or sex assignment task participants are sometimes only required to produce a single word: “male,” “female,” “boy,” and so forth. They might only need to circle a letter M or F on a sheet of paper. Does this constitute a language process that might inadvertently recruit grammatical gender itself? What of the role of the instructions of the task, which are linguistic, in formulating a linguistic process to arrive at a response? For almost all the tasks in the field, even language processing of the more conspicuous kind is unavoidable, such as when making judgments based on linguistic stimuli in the EAST, or thinking of adjectives to describe pictures in property judgments. For some, linguistic relativity research using solely behavioral measures (response times and related patterns) is always susceptible to linguistic processes (e.g., Gleitman & Papafragou, 2013), and it is for this reason that some now advocate primarily neurophysiological approaches (Thierry, 2016).

The argument that language needs to be controlled in relativity research is usually attributed to the thinking-for-speaking argument (Slobin, 1996), which was originally based on the idea that languages require speakers to attend to certain aspects of a scene, such as temporal and spatial details, depending on what information their language requires (see also Slobin, 2003). Slobin later also conceived of “thinking for comprehending,” in which the languages we speak also influence the way that we think about what we comprehend (Slobin, 2003). Such a view could mean that tasks that present participants with words will also be subject to the restrictions of thinking for speaking, as might tasks that use words about gender or sex in their instructions. This would likely encompass almost all the tasks in this review.

In the tasks in this review, participants did not need to produce the actual nouns for the target items themselves in their response. There are some data from the review that we can bring to bear on this question, though they are not conclusive. The thinking-for-speaking theory predicts that effects of language on thought might not extend to performance outside of the language in which such effects are sourced. Translated into grammatical gender research, effects should therefore be stronger when performing in a gendered than in a nongendered language context. This was not the case, though only by a very subtle margin (29% to 32%). However, given that 83% of the samples performed in a gendered language context, it is also possible that further data from research employing an ungendered context might lead to a change in this outcome, either in favor of or against thinking for speaking.

It is difficult to know where to draw the line between high and low language content. We classified all but two articles (Phillips & Boroditsky, 2003; Sato & Athanasopoulos, 2018), which involved only 133 samples in all, as being high in language content. It could therefore be argued that almost the entirety of the research in the review could have been testing for an influence of language on language. We ourselves do not make this claim; this is almost certainly too strong a conclusion to draw, given the heterogeneity of task designs. In its broadest sense, the philosophical debate around the involvement of language in behavior is beyond the remit of this review. More practically, however, we believe it difficult to argue that the tasks described in this review vary enough in their language content to allow for meaningful comparisons along this dimension.

Practical suggestions for future research

We divide our suggestions for future research into two sections, in order of importance. First, we point out that the results of the review support the possibility that there might be a fundamental flaw in the use of grammatical gender as a tool to speak to the question of relativity at all. Second, we make the case that if grammatical gender can provide an insight, then future tasks would benefit from an overhaul in order to better control for alternative explanations.

Returning to the question of what is “gendered” about grammatical gender?

The results of the review make the case that effects of grammatical gender are for the most part predicted by parameter settings that would be consistent with a statistical association account, at least as well as by a conceptual-change view of relativity. This is because most settings that promote the association between biological sex and grammatical gender enhance the probability that effects will be found.

That relativity is scaffolded by associations between language and thought, rather than language as thought, is a view that is partially consistent with contemporary thought in the field, such as the label-feedback hypothesis (Lupyan, 2012) and its offshoot the structural feedback hypothesis (Sato & Athanasopoulos, 2018). These theories contend that labels or grammatical information hone attention to associated features, which in turn feed back down to lower-level processes in a feedback loop. These effects can be upregulated or downregulated by the salience of the relevant linguistic information in the task. The results of the review, as well as results from studies in which participants are briefly trained in invented or real languages and come to behave in line with those languages (e.g., Boroditsky, 2001; Casasanto, 2010; Phillips & Boroditsky, 2003), suggest that this is indeed the case. However, a statistical association account departs from these feedback accounts in that the feedback accounts allow for some degree of change at the conceptual level, whereas a statistical association account does not. A statistical association account would predict that if an Italian speaker is processing the target “bed” in the context of gender/sex, for example, then the concept of masculinity might receive activation by an association rather than by any lasting conceptual rub-off. This would be similar, for example, to the statistical association between the concepts of sunshine and ice cream; we would be unlikely to conceive of sunshine as being similar to ice cream. Put simply, a statistical association account need not require conceptual change, especially long-lived change, to occur at all, and would therefore be incompatible with the spirit of relativity in any of its theoretical incarnations.

As we stated in our introduction, this review is not in a position to make such a distinction between relativity and its alternatives, in part because more data are required, but also because our review did not find enough unambiguous support for relativity to discriminate between the possibilities. For example, it is difficult to assess positive results from voice choice and sex assignment in the light of the hierarchical taxonomy of relativity accounts by Wolff and Holmes (2011), which ranges from the strongest form of relativity (“thought is language”) through to subtler effects, such as “language as spotlight,” when a nonrelativistic account is at least as likely an explanation for the results from such tasks. Instead, our review is in a position to weigh up the size of the problem in relating much grammatical gender research to relativity at all, in any of its forms. Of the five principal parameter settings that a statistical association account would predict, one (language content) was impossible to draw meaningful inferences from; three (target type, number of gender categories, and salience of sex/gender) resulted in higher overall rates of support; and only one (gendered language context) was equivocal. It is perhaps important to note that this last parameter did not run powerfully in the opposite direction to what a statistical association account would predict; there was only a −3% support rate difference, far smaller than the next smallest difference of +23%, which was instead in favor of the account. The sixth parameter, age, was also equivocal, but was less important for the account in any case. We therefore interpret these results as framing and underlining the case for a statistical association account, by which arbitrary labels for grammatical classes interact with arbitrary assignments of nouns to those classes, under conditions that facilitate their association. This is not to imply that we prefer such an account, or to rule out the possibility that multiple factors might have simultaneous and additive effects. It does imply, however, that grammatical gender is presently a foggier lens through which to inspect the case for relativity than the domains of categorical perception, space or time, to name a few.
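As a rough illustration only, the tally of parameter outcomes described above can be written out as a short Python sketch. The outcome labels and variable names are hypothetical conveniences; the 29% and 32% figures are taken from the earlier discussion of language context, and assigning 29% to the gendered context is an assumed reading that happens to reproduce the reported −3% difference.

```python
from collections import Counter

# Outcome of each principal parameter, as summarized in the paragraph above.
# The outcome labels themselves are shorthand introduced for this sketch.
parameter_outcomes = {
    "language content": "uninformative",
    "target type": "higher support",
    "number of gender categories": "higher support",
    "salience of sex/gender": "higher support",
    "gendered language context": "equivocal",
}

tally = Counter(parameter_outcomes.values())

# Support rates (%) by language context of testing. Mapping 29% to the
# gendered context and 32% to the nongendered context is an assumption.
support_rate = {"gendered context": 29, "nongendered context": 32}
context_difference = (
    support_rate["gendered context"] - support_rate["nongendered context"]
)

# Reported as +23%, the next smallest difference, in favor of the account.
next_smallest_difference = 23

print(tally)
print(context_difference, next_smallest_difference)
```

Laid out this way, the single parameter that runs against the account does so by a margin much smaller than any of the differences that run in its favor.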

As a first step, it would be useful to establish whether “masculine” and “feminine” are psychologically privileged “attractor” concepts that impose their status on other objects in their grammatical class, or whether these metalinguistic labels are themselves arbitrary. This is important because it would help researchers to understand whether the idea of a relationship between grammatical “gender” and biological sex has any psychological reality. If it does, then it becomes more likely that the members of a class are in some sense imbued with this conceptual relationship, and the case for conceptual-change accounts would be enhanced.

There is already a study that suggests a method by which to test this. In one experiment, not included in this review because it involved invented languages, native English speakers were taught “Gumbuzi,” an artificial language with two artificial grammatical “gender” groups, labeled “soupative” and “oosative” (Phillips & Boroditsky, 2003). Participants were taught ten items in each group, six of which were inanimate objects, and the remaining four were humans who were either all female or all male. After learning which items were assigned to which category, participants rated the similarity of human–object pairs both within and across the two groups. The results showed that pairs from the same group were rated as being more similar than items from different groups, leading the authors to conclude that there can be a causative (i.e., learned) relationship between grammatical gender and people’s conceptualizations of objects.

Since 40% of all the items in a group were humans of the same biological sex, the groups were strongly biased toward sex/gender, regardless of the labels “soupative” and “oosative.” It is also likely that the participants were aware of such things as grammatical gender categories through formal second language instruction in schools, knowledge of which they could have applied to the task. Additionally, the labels “oosative” and “soupative” fail to actually describe any of the items within each group; real grammatical gender categories are at least partially correlated with the biological sex of their members. Nevertheless, this study provides a template for a future study that might teach one group of participants that groups are called masculine and feminine, and another group that the groups are named after another, equally represented natural kind. If the arbitrary labeling of the classifiers themselves drives performance, then the results of similarity ratings should pattern in line with other labels at least as much as they would with masculine and feminine. This would suggest that any influence of grammatical gender is a human-made one that is independent of linguistic structure and lacking in psychological reality, undermining the notion of a conceptual relationship between grammatical gender and biological sex, and in turn favoring a statistical association relationship.
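The comparison at the heart of such a study could be sketched as follows. This is a minimal illustration under stated assumptions, not an analysis plan: every rating is invented, the condition names and the `grouping_effect` helper are hypothetical, and a real study would need inferential statistics rather than a simple comparison of means.

```python
from statistics import mean

# Hypothetical similarity ratings (1 = very dissimilar, 9 = very similar) for
# human-object pairs. Every number here is invented purely for illustration.
ratings = {
    "masculine/feminine labels": {
        "same_group": [6, 7, 5, 6],
        "different_group": [4, 3, 4, 5],
    },
    "alternative labels": {
        "same_group": [6, 6, 5, 7],
        "different_group": [4, 4, 3, 5],
    },
}


def grouping_effect(condition):
    """Mean same-group rating minus mean different-group rating."""
    return mean(condition["same_group"]) - mean(condition["different_group"])


effects = {label: grouping_effect(data) for label, data in ratings.items()}

# The arbitrary-label prediction: the alternative labels produce a grouping
# effect at least as large as the masculine/feminine labels do.
labels_look_arbitrary = (
    effects["alternative labels"] >= effects["masculine/feminine labels"]
)

print(effects)
print(labels_look_arbitrary)
```

In this invented data set the two labeling conditions yield the same grouping effect, which is the pattern that would favor a statistical association reading over a conceptual-change reading.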

Dulling Occam’s razor

If grammatical gender is not merely a cultural label, then we follow Ramos and Roberson (2011) and others who suggest that studies be conducted that aim to restrict both gender and language to as oblique a role as possible. Property judgment tasks do the former very well, but language remains fundamental, and in any case the evidence from these tasks is to date overwhelmingly negative. Instead, the priming tasks by Sato and Athanasopoulos (2018) would seem strong candidates for future investigations. In their first experiment, French–English bilingual participants were found to be slower to indicate whether two objects were associated with a male or female face when the grammatical gender of those objects was incongruent with the biological sex of the person. In a second experiment, participants matched one of two trait words (e.g., “charming,” “realistic”) to a now genderless face after being primed with a pair of objects, such as a tie and a spade. The results again pointed to an influence of the grammatical gender of the objects in French–English bilinguals’ choices. These studies, while not eliminating biological sex and language altogether, keep some distance between these and participants’ actual responses, because associations come from the task-irrelevant grammatical genders of objects that participants were presented with earlier. Future work might find a way of making this distance greater still, and include direct statistical comparisons between speakers of a gendered language and speakers of a nongendered language in order to establish more clearly that any effects are attributable to grammatical gender specifically (see Footnote 7).

Limitations

This review represents, to the best of our knowledge, the first systematic attempt to assess the literature in a quantitative manner. The heterogeneity of methods and their uneven representation in the literature presented us with a difficult decision: to group together research of different types, or to provide a finer-grained picture. We took the view that it was better in a first review to provide a nuanced picture that takes into account differences between, for example, instructing participants to assign a voice to an object or a sex to an object. This does have the drawback of making it harder to make reliable inferences based on less well-used tasks, in particular object–name memory associations, association tasks, and similarity tasks. On the other hand, it makes it easier for later work to be incorporated into future reviews.

A more difficult and subjective issue concerns the interpretation of results classified as offering mixed support. The argument for the inclusion of this category is to our minds quite compelling. If we take, for example, the finding that training in Spanish improves the rate at which voice choices are consistent with grammatical gender, but nevertheless fails to raise this rate above chance (Kurinski & Sera, 2011), neither an entirely cautious approach (i.e., this result finds no support for relativity) nor an endorsement (voice choices are consistent with grammatical gender) naturally follows, and the need for a third category becomes clear. To present as objective a view as possible, we have for the most part focused on the rate of support in the first instance, the rate of no support second, and mixed support only where it is necessary, such as in conditions in which there are few data on either side of this middle category.

In conclusion, our review showed that support for an influence of grammatical gender on concepts is strongly task- and context-dependent. Support also comes for the most part from tasks that are susceptible to clear alternative explanations. Perhaps most importantly, it needs to be empirically established that grammatical gender itself is not a cultural label but a concept with psychological reality before any influence can be reasonably attributed to truly linguistic processes.

Footnote 1: We return to this issue and discuss it in greater detail in the Discussion section.

Footnote 2: The potential for grammatical-gender effects on object conceptualization to be the result of statistical co-occurrences goes to the heart of a debate concerning whether grammatical gender is in fact a suitable tool for investigating relativity at all, and it is one we turn to in detail later in this section and in the Discussion.

Footnote 3: Not all two-gendered languages divide nouns into masculine and feminine; some (like Dutch, Swedish, and some Norwegian dialects) instead divide nouns into “gendered” and “neuter” categories. To our knowledge, only one study in our review included participants who spoke a language of this type: Bergensk, a language spoken in Norway (Beller, Brattebø, Lavik, Reigstad, & Bender, 2015). For the purposes of the review, we excluded the results of this study from our two- versus three-gender comparison. The sample size (N = 107) suggests that neither leaving this study in nor taking it out would have any meaningful effect on the patterns of results described.

Footnote 4: Only one study discovered by the search procedure was eventually excluded for the absence of an English translation (see SOM1).

Footnote 5: Additionally, there is at present no empirical evidence that participants assign the same biological sex to the same objects across both tasks.

Footnote 6: This study included 102 participants of varying levels of Spanish. Any effect of grammatical gender in the advanced learners in this experiment was confounded with the usually more reliable natural/artificial distinction (Mullen, 1990). Note that if this item were included as Support owing to the 26 native-speaking participants’ performance alone, it would only move these 26 samples (the number of native Spanish speakers) from mixed support to support.

Footnote 7: These studies also tested monolingual English speakers, but the pertinent analyses were performed within groups.

Almutrafi, F. (2015). Language and cognition: Effects of grammatical gender on the categorisation of objects (Unpublished doctoral thesis). Newcastle University, Newcastle upon Tyne, UK

Athanasopoulos, P. (2006). Effects of the grammatical representation of number on cognition in bilinguals. Bilingualism: Language and Cognition , 9 , 89–96.

Athanasopoulos, P. (2009). Cognitive representation of colour in bilinguals: The case of Greek blues. Bilingualism: Language and Cognition , 12 , 83–95.

Athanasopoulos, P., & Boutonnet, B. (2016). Learning grammatical gender in a second language changes categorization of inanimate objects: Replications and new evidence from English learners of L2 French. In R. Alonso (Ed.), Cross-linguistic influence in second language acquisition (pp. 173–192). Bristol, UK: Multilingual Matters.

Athanasopoulos, P., & Bylund, E. (2013). Does grammatical aspect affect motion event cognition? A cross-linguistic comparison of English and Swedish speakers. Cognitive Science , 37 , 286–309.

Bassetti, B. (2007). Bilingualism and thought: Grammatical gender and concepts of objects in Italian–German bilingual children. International Journal of Bilingualism , 11 , 251–273.

Bassetti, B., & Nicoladis, E. (2016). Research on grammatical gender and thought in early and emergent bilinguals. International Journal of Bilingualism , 20 , 3–16.

Belacchi, C., & Cubelli, R. (2012). Implicit knowledge of grammatical gender in preschool children. Journal of Psycholinguistic Research , 41 , 295–310.

Beller, S., Brattebø, K. F., Lavik, K. O., Reigstad, R. D., & Bender, A. (2015). Culture or language: What drives effects of grammatical gender? Cognitive Linguistics , 26 , 331–359.

Bender, A., Beller, S., & Klauer, K. C. (2011). Grammatical gender in German: A case for linguistic relativity? Quarterly Journal of Experimental Psychology , 64 , 1821–1835.

Bender, A., Beller, S., & Klauer, K. C. (2016a). Crossing grammar and biology for gender categorisations: Investigating the gender congruency effect in generic nouns for animates. Journal of Cognitive Psychology , 28 , 530–558.

Bender, A., Beller, S., & Klauer, K. C. (2016b). Lady Liberty and Godfather Death as candidates for linguistic relativity? Scrutinizing the gender congruency effect on personified allegories with explicit and implicit measures. Quarterly Journal of Experimental Psychology , 69 , 48–64.

Bender, A., Beller, S., & Klauer, K. C. (2018). Gender congruency from a neutral point of view: The roles of gender classes and conceptual connotations. Journal of Experimental Psychology: Learning, Memory, and Cognition , 44 , 1580–1608.

Bobb, S. C., & Mani, N. (2013). Categorizing with gender: Does implicit grammatical gender affect semantic processing in 24-month-old toddlers? Journal of Experimental Child Psychology , 115 , 297–308.

Boroditsky, L. (2001). Does language shape thought? Mandarin and English speakers’ conceptions of time. Cognitive Psychology , 43 , 1–22.

Boroditsky, L., & Schmidt, L. A. (2000). Sex, syntax, and semantics. In L. R. Gleitman & A. K. Joshi (Eds.), Proceedings of the Twenty-Second Annual Conference of the Cognitive Science Society (pp. 42–46). Mahwah, NJ: Erlbaum.

Boutonnet, B., Athanasopoulos, P., & Thierry, G. (2012). Unconscious effects of grammatical gender during object categorisation. Brain Research , 1479 , 72–79.

Brown, A., Lindsey, D. T., & Guckes, K. M. (2011). Color names, color categories, and color-cued visual search: Sometimes, color perception is not categorical. Journal of Vision , 11 (12), 2:1–21. https://doi.org/10.1167/11.12.2

Brown, R., & Lenneberg, E. (1954). A study in language and cognition. Journal of Abnormal and Social Psychology , 49 , 454–462.

Casasanto, D. (2010). Space for thinking. In V. Evans & P. Chilton (Eds.), Language, cognition and space: The state of the art and new directions (pp. 453–478). London, UK: Equinox.

Casasanto, D., Boroditsky, L., Phillips, W., Greene, J., Goswami, S., Bocanegra-Thiel, S., . . . Gil, D. (2004). How deep are effects of language on thought? Time estimation in speakers of English, Indonesian, Greek, and Spanish. In K. Forbus, D. Gentner, & T. Regier (Eds.), Proceedings of the 26th Annual Conference of the Cognitive Science Society (pp. 186–191). Mahwah, NJ: Erlbaum.

Cook, S. V. (2016). Gender matters: From L1 grammar to L2 semantics. Bilingualism: Language and Cognition , 1–19.

Corbett, G. G. (1991). Gender. Cambridge, UK: Cambridge University Press.

Costa, A., Kovacic, D., Fedorenko, E., & Caramazza, A. (2003). The gender congruency effect and the selection of freestanding and bound morphemes: Evidence from Croatian. Journal of Experimental Psychology: Learning, Memory, and Cognition , 29 , 1270–1282.

Costa, A., Kovacic, D., Franck, J., & Caramazza, A. (2003). On the autonomy of the grammatical gender systems of the two languages of a bilingual. Bilingualism: Language and Cognition , 6 , 181–200.

Cubelli, R., Paolieri, D., Lotto, L., & Job, R. (2011). The effect of grammatical gender on object categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition , 37 , 449–460.

De Bruin, A., Treccani, B., & Della Sala, S. (2015). Cognitive advantage in bilingualism: An example of publication bias? Psychological Science , 26 , 99–107.

Degani, T. (2007). The semantic role of gender: Grammatical and biological gender match effects in English and Spanish (Unpublished master’s thesis). University of Pittsburgh, Pittsburgh, PA.

Dilkina, K., McClelland, J. L., & Boroditsky, L. (2007, August). How language affects thought in a connectionist model . Paper presented at the 29th Annual Meeting of the Cognitive Science Society, Nashville, TN.

Dolscheid, S., Shayan, S., Majid, A., & Casasanto, D. (2013). The thickness of musical pitch: Psychophysical evidence for linguistic relativity. Psychological Science , 24 , 613–621.

Eberhard, K. M., Heilman, M., & Scheutz, M. (2005, July). An empirical and computational test of linguistic relativity . Paper presented at the 27th Annual Meeting of the Cognitive Science Society, Stresa, Italy.

Firestone, C., & Scholl, B. J. (2016). Cognition does not affect perception: Evaluating the evidence for “top-down” effects. Behavioral and Brain Sciences , 39 , e229. https://doi.org/10.1017/S0140525X15000965

Flaherty, M. (2001). How a language gender system creeps into perception. Journal of Cross-Cultural Psychology , 32 , 18–31.

Forbes, J. N., Poulin-Dubois, D., Rivero, M. R., & Sera, M. D. (2008). Grammatical gender affects bilinguals’ conceptual gender: Implications for linguistic relativity and decision making. Open Applied Linguistics Journal , 1 , 68–76.

Foundalis, H. E. (2002). Evolution of gender in Indo-European languages. In W. D. Gray & C. Schunn (Eds.), Proceedings of the Annual Meeting of the Cognitive Science Society (pp. 304–308). Mahwah, NJ: Erlbaum.

Franklin, A., Clifford, A., Williamson, E., & Davies, I. (2005). Color term knowledge does not affect categorical perception of color in toddlers. Journal of Experimental Child Psychology , 90 , 114–141.

Gilbert, A. L., Regier, T., Kay, P., & Ivry, R. B. (2006). Whorf hypothesis is supported in the right visual field but not the left. Proceedings of the National Academy of Sciences , 103 , 489–494.

Gilbert, A. L., Regier, T., Kay, P., & Ivry, R. B. (2008). Support for lateralization of the Whorf effect beyond the realm of color discrimination. Brain and Language , 105 , 91–98.

Gleitman, L., & Papafragou, A. (2013). Relations between language and thought. In D. Reisberg (Ed.), The Oxford handbook of cognitive psychology (pp. 504–523). New York, NY: Oxford University Press.

Haertlé, I. (2017). Does grammatical gender influence perception? A study of Polish and French speakers. Psychology of Language and Communication , 21 , 386–407.

Imai, M., & Gentner, D. (1997). A cross-linguistic study of early word meaning: Universal ontology and linguistic influence. Cognition , 62 , 169–200.

Imai, M., Schalk, L., Saalbach, H., & Okada, H. (2014). All giraffes have female-specific properties: Influence of grammatical gender on deductive reasoning about sex-specific properties in German speakers. Cognitive Science , 38 , 514–536.

Kaushanskaya, M., & Smith, S. (2016). Do grammatical-gender distinctions learned in the second language influence native-language lexical processing? International Journal of Bilingualism , 20 , 30–39.

Konishi, T. (1993). The semantics of grammatical gender: A cross-cultural study. Journal of Psycholinguistic Research , 22 , 519–534.

Konishi, T. (1994). The connotations of gender: A semantic differential study of German and Spanish. Word , 45 , 317–327.

Kousta, S.-T., Vinson, D. P., & Vigliocco, G. (2008). Investigating linguistic relativity through bilingualism: The case of grammatical gender. Journal of Experimental Psychology: Learning, Memory, and Cognition , 34 , 843–858.

Kurinski, E., Jambor, E., & Sera, M. D. (2016). Spanish grammatical gender: Its effects on categorization in native Hungarian speakers. International Journal of Bilingualism , 20 , 76–93.

Kurinski, E., & Sera, M. D. (2011). Does learning Spanish grammatical gender change English-speaking adults’ categorization of inanimate objects? Bilingualism: Language and Cognition , 14 , 203–220.

Lambelet, A. (2016). Second grammatical gender system and grammatical gender-linked connotations in adult emergent bilinguals with French as a second language. International Journal of Bilingualism , 20 , 62–75.

Landor, R. (2014). Grammatical categories and cognition across five languages: The case of grammatical gender and its potential effects on the conceptualisation of objects (Doctoral thesis). Griffith University, Brisbane, Australia.

Lucy, J. A. (2016). Recent advances in the study of linguistic relativity in historical context: A critical assessment. Language Learning , 66 , 487–515.

Lupyan, G. (2012). Linguistically modulated perception and cognition: The label-feedback hypothesis. Frontiers in Psychology , 3 , 54. https://doi.org/10.3389/fpsyg.2012.00054

Martinez, I. M., & Shatz, M. (1996). Linguistic influences on categorization in preschool children: A crosslinguistic study. Journal of Child Language , 23 , 529–545.

Mickan, A., Schiefke, M., & Stefanowitsch, A. (2014). Key is a llave is a Schlüssel: A failure to replicate an experiment from Boroditsky et al. 2003. Yearbook of the German Cognitive Linguistics Association , 2 , 39–50.

Montefinese, M., Ambrosini, E., & Roivainen, E. (2019). No grammatical gender effect on affective ratings: Evidence from Italian and German languages. Cognition and Emotion , 33 , 848–854. https://doi.org/10.1080/02699931.2018.1483322

Mullen, M. K. (1990). Children’s classifications of nature and artifact pictures into female and male categories. Sex Roles , 23 , 577–587.

Nicoladis, E. (2019). [Unpublished data].

Nicoladis, E., Da Costa, N., & Foursha-Stevenson, C. (2016). Discourse relativity in Russian-English bilingual preschoolers’ classification of objects by gender. International Journal of Bilingualism , 20 , 17–29.

Nicoladis, E., & Foursha-Stevenson, C. (2012). Language and culture effects on gender classification of objects. Journal of Cross-Cultural Psychology , 43 , 1095–1109.

Park, H. I., & Ziegler, N. (2014). Cognitive shift in the bilingual mind: Spatial concepts in Korean–English bilinguals. Bilingualism: Language and Cognition , 17 , 410–430.

Pavlidou, T.-S., & Alvanoudi, A. (2013). Grammatical gender and cognition. In N. Lavidas, T. Alexio, & A. Sougari (Eds.), Major trends in theoretical and applied linguistics (Vol. 2, pp. 109–124). London, UK: Versita.

Pavlidou, T.-S., & Alvanoudi, A. (2018). Conceptualizing the world as “female” or “male”: Further remarks on grammatical gender and speakers’ cognition. In N. Topintzi, N. Lavidas, & M. Moumtzi (Eds.), Selected papers on theoretical and applied linguistics from ISTAL23 . Thessaloniki, Greece: School of English, Aristotle University of Thessaloniki.

Phillips, W., & Boroditsky, L. (2003, August). Can quirks of grammar affect the way you think? Grammatical gender and object concepts . Paper presented at the 25th Annual Meeting of the Cognitive Science Society, Boston, MA.

Pinker, S. (1994). The language instinct . New York, NY, US: William Morrow & Co.

Pylyshyn, Z. W. (1984). Computation and cognition . Cambridge, MA: MIT Press.

Ramos, S., & Roberson, D. (2011). What constrains grammatical gender effects on semantic judgements? Evidence from Portuguese. Journal of Cognitive Psychology , 23 , 102–111.

Roberson, D., Pak, H., & Hanley, J. R. (2008). Categorical perception of colour in the left and right visual field is verbally mediated: Evidence from Korean. Cognition , 107 , 752–762.

Saalbach, H., Imai, M., & Schalk, L. (2012). Grammatical gender and inferences about biological properties in German-speaking children. Cognitive Science , 36 , 1251–1267.

Samuel, S., Roehr-Brackin, K., Pak, H., & Kim, H. (2018). Cultural effects rather than a bilingual advantage in cognition: A review and an empirical study. Cognitive Science , 42 , 2313–2341.

Samuel, S., Roehr-Brackin, K., & Roberson, D. (2016). “She says, he says”: Does the sex of an instructor interact with the grammatical gender of targets in a perspective-taking task? International Journal of Bilingualism , 20 , 40–61.

Sato, S., & Athanasopoulos, P. (2018). Grammatical gender affects gender perception: Evidence for the structural-feedback hypothesis. Cognition , 176 , 220–231.

Sedlmeier, P., Tipandjan, A., & Jänchen, A. (2016). How persistent are grammatical gender effects? The case of German and Tamil. Journal of Psycholinguistic Research , 45 , 317–336.

Segel, E., & Boroditsky, L. (2011). Grammar in art. Frontiers in Psychology , 1 , 244. https://doi.org/10.3389/fpsyg.2010.00244

Semenuks, A., Phillips, W., Dalca, I., Kim, C., & Boroditsky, L. (2017). Effects of grammatical gender on object description. In G. Gunzelmann, A. Howes, T. Tenbrink, & E. Davelaar (Eds.), Proceedings of the 39th Annual Meeting of the Cognitive Science Society (pp. 1060–1065). Austin, TX: Cognitive Science Society.

Sera, M. D., Berge, C. A. H., & del Castillo Pintado, J. (1994). Grammatical and conceptual forces in the attribution of gender by English and Spanish speakers. Cognitive Development , 9 , 261–292. https://doi.org/10.1016/0885-2014(94)90007-8


Sera, M. D., Elieff, C., Forbes, J., Burch, M. C., Rodríguez, W., & Dubois, D. P. (2002). When language affects cognition and when it does not: An analysis of grammatical gender and classification. Journal of Experimental Psychology: General , 131 , 377–397.

Slobin, D. I. (1996). From “thought and language” to “thinking for speaking.” In J. J. Gumperz & S. C. Levinson (Eds.), Rethinking linguistic relativity (pp. 70–96). Cambridge, UK: Cambridge University Press.

Slobin, D. I. (2003). Language and thought online: Cognitive consequences of linguistic relativity. In D. Gentner & S. Goldin-Meadow (Eds.), Language in mind: Advances in the study of language and thought (pp. 157–192). Cambridge, MA: MIT Press.

Thierry, G. (2016). Neurolinguistic relativity: How language flexes human perception and cognition. Language Learning , 66 , 690–713.


Vernich, L. (2017). Does learning a foreign language affect object categorization in native speakers of a language with grammatical gender? The case of Lithuanian speakers learning three languages with different types of gender systems (Italian, Russian and German). International Journal of Bilingualism , 23 , 417–436.

Vernich, L., Argus, R., & Kamandulytė-Merfeldienė, L. (2017). Extending research on the influence of grammatical gender on object classification: A cross-linguistic study comparing Estonian, Italian and Lithuanian native speakers. Eesti Rakenduslingvistika Ühingu Aastaraamat , 13 , 223–240.

Vigliocco, G., Vinson, D. P., Paganelli, F., & Dworzynski, K. (2005). Grammatical gender effects on cognition: Implications for language learning and language use. Journal of Experimental Psychology: General , 134 , 501–520.

Vuksanovic, J., Bjekic, J., & Radivojevic, N. (2015). Grammatical gender and mental representation of object: The case of musical instruments. Journal of Psycholinguistic Research , 44 , 383–397.

Whorf, B. L. (1956). Language, thought, and reality: Selected writings of Benjamin Lee Whorf. Cambridge, MA: MIT Press.

Winawer, J., Witthoft, N., Frank, M. C., Wu, L., Wade, A. R., & Boroditsky, L. (2007). Russian blues reveal effects of language on color discrimination. Proceedings of the National Academy of Sciences , 104 , 7780–7785.

Witzel, C., & Gegenfurtner, K. R. (2013). Categorical sensitivity to color differences. Journal of Vision , 13 (7), 1:1–33. https://doi.org/10.1167/13.7.1

Wolff, P., & Holmes, K. J. (2011). Linguistic relativity. Wiley Interdisciplinary Reviews: Cognitive Science , 2 , 253–265.

Yorkston, E., & De Mello, G. E. (2005). Linguistic gender marking and categorization. Journal of Consumer Research , 32 , 224–234.


Author information

Authors and affiliations.

Department of Psychology, University of Essex, Wivenhoe Park, Essex, UK

Steven Samuel, Geoff Cole & Madeline J. Eacott


Corresponding author

Correspondence to Steven Samuel.

Ethics declarations

Conflict of interest.

The authors report no conflicts of interest.

Open practices statement

Details about the review and the search mechanisms can be found in the supplemental online materials (SOM1 and SOM2).

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

(XLSX 15 kb)

(XLSX 32 kb)


About this article

Samuel, S., Cole, G. & Eacott, M.J. Grammatical gender and linguistic relativity: A systematic review. Psychon Bull Rev 26, 1767–1786 (2019). https://doi.org/10.3758/s13423-019-01652-3

Published: 19 August 2019

Issue Date: December 2019

DOI: https://doi.org/10.3758/s13423-019-01652-3


  • Grammatical gender
  • Linguistic relativity
  • Language and thought

The Sapir-Whorf Hypothesis Linguistic Theory

  • Ph.D., Rhetoric and English, University of Georgia
  • M.A., Modern English and American Literature, University of Leicester
  • B.A., English, State University of New York

The Sapir-Whorf hypothesis is the linguistic theory that the semantic structure of a language shapes or limits the ways in which a speaker forms conceptions of the world. It came about in 1929. The theory is named after the American anthropological linguist Edward Sapir (1884–1939) and his student Benjamin Whorf (1897–1941). It is also known as the theory of linguistic relativity, linguistic relativism, linguistic determinism, Whorfian hypothesis, and Whorfianism.

History of the Theory

The idea that a person's native language determines how he or she thinks was popular among behaviorists of the 1930s and on until cognitive psychology theories came about, beginning in the 1950s and increasing in influence in the 1960s. (Behaviorism taught that behavior is a result of external conditioning and doesn't take feelings, emotions, and thoughts into account as affecting behavior. Cognitive psychology studies mental processes such as creative thinking, problem-solving, and attention.)

Author Lera Boroditsky gave some background on ideas about the connections between languages and thought:

"The question of whether languages shape the way we think goes back centuries; Charlemagne proclaimed that 'to have a second language is to have a second soul.' But the idea went out of favor with scientists when  Noam Chomsky 's theories of language gained popularity in the 1960s and '70s. Dr. Chomsky proposed that there is a  universal grammar  for all human languages—essentially, that languages don't really differ from one another in significant ways...." ("Lost in Translation." "The Wall Street Journal," July 30, 2010)

The Sapir-Whorf hypothesis was taught in courses through the early 1970s and had become widely accepted as truth, but then it fell out of favor. By the 1990s, the Sapir-Whorf hypothesis was left for dead, author Steven Pinker wrote. "The cognitive revolution in psychology, which made the study of pure thought possible, and a number of studies showing meager effects of language on concepts, appeared to kill the concept in the 1990s... But recently it has been resurrected, and 'neo-Whorfianism' is now an active research topic in psycholinguistics." ("The Stuff of Thought." Viking, 2007)

Neo-Whorfianism is essentially a weaker version of the Sapir-Whorf hypothesis and says that language influences a speaker's view of the world but does not inescapably determine it.

The Theory's Flaws

One big problem with the original Sapir-Whorf hypothesis stems from the idea that if a person's language has no word for a particular concept, then that person would not be able to understand that concept, which is untrue. Language doesn't necessarily control humans' ability to reason or have an emotional response to something or some idea. For example, take the German word sturmfrei, which describes the feeling of having the whole house to yourself because your parents or roommates are away. Just because English doesn't have a single word for the idea doesn't mean that English speakers can't understand the concept.

There's also the "chicken and egg" problem with the theory. "Languages, of course, are human creations, tools we invent and hone to suit our needs," Boroditsky continued. "Simply showing that speakers of different languages think differently doesn't tell us whether it's language that shapes thought or the other way around."



Sapir Whorf Hypothesis

Sean O'Neill

The International Encyclopedia of Language and Social Interaction, 3 Volume Set. Karen Tracy (Editor), Cornelia Ilie (Associate Editor), Todd Sandel (Associate Editor)

The Sapir–Whorf hypothesis holds that language plays a powerful role in shaping human consciousness, affecting everything from private thought and perception to larger patterns of behavior in society—ultimately allowing members of any given speech community to arrive at a shared sense of social reality. This article starts with a brief consideration of the philosophical insights that inspired the Sapir–Whorf hypothesis or the “principle of linguistic relativity,” as it is more often known today. Toward the end of the article current empirical research is reviewed. This explores everything from human universals to the cross-cultural differences in the construction of gender, color, space, and other creative practices associated with language, such as storytelling, poetry, or song.

Related Papers

The Sapir-Whorf Linguistic Relativity Hypothesis provokes intellectual discussion about the strong impact language has on our perception of the world around us. This paper intends to enliven the still open questions raised by this hypothesis. This is done by considering some of Sapir's, Whorf's, and other scholars' works.

Language Sciences 25, 393-432

Cliff Goddard

Probably no contemporary linguist has published as profusely on the connections between semantics, culture, and cognition as Anna Wierzbicka. This paper explores the similarities and differences between her ‘‘natural semantic metalanguage’’ (NSM) approach and the linguistic theory of Benjamin Lee Whorf. It shows that while some work by Wierzbicka and colleagues can be seen as ‘‘neo-Whorfian’’, other aspects of the NSM program are ‘‘counter-Whorfian’’. Issues considered include the meaning of linguistic relativity, the nature of conceptual universals and the consequences for semantic methodology, the importance of polysemy, and the scale and locus of semantic variation between languages, particularly in relation to the domain of time. Examples are drawn primarily from English, Russian, and Hopi.

Aneta Pavlenko

Debates about linguistic relativity commonly focus on one question: Does language affect thought? This yes-or-no question does not do justice to the complexity of Whorf’s ideas and skirts several issues of great importance to Whorf. My first aim in this paper is to recover the arguments that got lost in translation of Whorf’s ideas into the Sapir-Whorf hypothesis. I will show that, for Whorf, languages were also one of the ways in which we think, scientists were not immune to language effects, and the key to advancement of Western science was multilingual awareness. My second aim is to draw on these insights to articulate a Whorfian agenda for the field of second language acquisition (SLA) that asks new questions about second language learning and cognition and expands the boundaries of the field and the scope, duration, and locations of SLA research.

Language Sciences

Keith Allan

In: Dov M. Gabbay, Paul Thagard and John Woods, editors, Philosophy of Linguistics. San Diego: North Holland, 2012, pp. 531-551. ISBN: 978-0-444-51747-0

William O Beeman

Anthropology and linguistics share a common intellectual origin in 19th Century scholarship. The impetus that prompted the earliest archaeologists to look for civilizational origins in Greece, early folklorists to look for the origins of culture in folktales and common memory, and the first armchair cultural anthropologist to look for the origins of human customs through comparison of groups of human beings also prompted the earliest linguistic inquiries. This essay traces the relationship between the development of anthropology and linguistics down to modern times including the development of sociolinguistics, ethnography of communication, culture and communication, pragmatics, metapragmatics and other mainstay topics in linguistic anthropology

Journal of Linguistic Anthropology

Saul Schwartz

A growing literature in linguistic anthropology critically examines the rhetoric of endangered language advocacy. A number of themes remain underexplored, however, including the invocation of "culture" to justify language preservation, the interests of communities without fluent heritage language speakers, and anthropology's contribution to potentially problematic advocacy tropes. Discourses like "language is the core of culture" and "when a language dies, a culture dies" are widespread in language activism even though they undermine communities' efforts to maintain distinctive cultural identities in the wake of language shift and put dormant language communities in a double bind. While Boasian anthropology contains anti-essentialist and counter-nationalist perspectives on language, culture, and race, some Herderian advocacy tropes are borrowed from the (also Boasian) tradition of linguistic relativity in its popular Whorfian iteration. Drawing on my research on Chiwere language politics, I identify two forms of agency available to endangered and dormant language communities: one form of agency resists language loss but accepts dominant ideologies of national difference that make heritage languages essential to indigenous cultural identities, while another form of agency accepts language loss but resists Herderian nationalist expectations that authentic indigenous communities speak their traditional languages. [advocacy rhetoric, dormant language communities, language and culture ideologies, agency, Chiwere]

Pauline Turner Strong

Anthony K. Webster

Aynura Faikovna

The Bilingual Mind

to appear in Semiotica

This article takes seriously Edward Sapir’s observation about poetry as an example of linguistic relativity. Taking my cue from Dwight Bolinger’s “word affinities,” this article reports on the ways sounds of poetry evoke and convoke imaginative possibilities through phonological iconicity. In working with Navajos in translating poetry, I have come to appreciate the sound suggestiveness of that poetry and the imaginative possibilities that are bound up in the sounds of Navajo. It seems that just such sound suggestiveness via phonological iconicity and the ways they orient our imaginations are a crucial locus for thinking through linguistic relativities.



Sapir-Whorf Hypothesis

  • Last Updated: Jul 22, 2023

The Sapir-Whorf Hypothesis, a seminal concept in the field of linguistic anthropology, posits a relationship between language, thought, and culture, emphasizing that our understanding and perception of reality are influenced by the language we use [1].


Historical Background

The hypothesis is named after two prominent linguists, Edward Sapir and Benjamin Lee Whorf. While neither actually articulated a formal theory, their individual writings provide the foundation of what we now understand as the Sapir-Whorf Hypothesis [2].

  • Edward Sapir (1884-1939), a linguist and anthropologist, proposed that the language we speak shapes the way we perceive and experience the world [3].
  • Benjamin Lee Whorf (1897-1941), a linguist, chemical engineer, and a student of Sapir, extended his mentor’s ideas, hypothesizing a direct link between the structure of a language and the thought processes of its speakers [4].

Two Forms of the Hypothesis

The hypothesis takes two primary forms: Strong (or linguistic determinism) and Weak (or linguistic relativity) [5].

  • Strong Form (Linguistic Determinism): The stronger form of the hypothesis, linguistic determinism, asserts that language controls thought and cultural norms, and therefore, individuals are incapable of understanding concepts their language does not support.
  • Weak Form (Linguistic Relativity): The weaker form, linguistic relativity, proposes that language merely influences thought and perception: different languages can lead to different cognitive habits, but no language prevents its speakers from understanding a given concept.

| Sapir-Whorf Hypothesis Form | Key Features |
| --- | --- |
| Strong Form (Linguistic Determinism) | Language controls thought and cultural norms. |
| Weak Form (Linguistic Relativity) | Language influences thought and perception but does not restrict understanding. |

Empirical Evidence

Research conducted over the years provides mixed evidence in support of the Sapir-Whorf Hypothesis.

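  • Supportive Studies: Research into color perception provides some support. For instance, Winawer et al. (2007) found that Russian speakers, whose language has separate basic terms for lighter and darker blue, were faster at discriminating two blues that fell into different Russian categories than two blues from the same category, whereas English speakers showed no such difference. (Berlin and Kay’s earlier cross-linguistic survey of color terms, often cited in this context, is generally read as evidence for universal constraints on color naming rather than for relativity.)
  • Contradictory Studies: Counter studies have shown that despite language differences, cognitive processes can remain similar. For example, a study of spatial cognition among speakers of different languages showed that, despite linguistic variation, spatial cognition remained relatively consistent.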

The Sapir-Whorf Hypothesis and Anthropology

From an anthropological perspective, the Sapir-Whorf Hypothesis is instrumental in exploring cultural diversity, since it suggests that understanding a culture’s language is key to understanding its worldview.

  • Cultural Understanding: The hypothesis emphasizes the importance of language in shaping cultural norms and values, enabling anthropologists to delve deeper into the intricacies of diverse cultures.
  • Interdisciplinary Research: The Sapir-Whorf Hypothesis has paved the way for interdisciplinary research, integrating linguistic anthropology with cognitive psychology, neurolinguistics, and artificial intelligence, to name a few.

Criticisms and Controversies

Despite its profound implications, the Sapir-Whorf Hypothesis has been subject to numerous criticisms and controversies.

  • Linguistic Determinism Critique: Critics argue that the strong form of the hypothesis, linguistic determinism, is overly restrictive, stating that it undermines the ability of individuals to perceive or conceive of things outside their linguistic framework.
  • Lack of Empirical Support: Many criticize the hypothesis for its lack of solid empirical evidence. Steven Pinker, a cognitive psychologist, contends that the Sapir-Whorf Hypothesis is a conventional belief rather than a robust scientific theory.
  • Contradictory Evidence: Other criticisms arise from studies presenting contradictory evidence, such as research showing similar cognitive processes across various linguistic groups, undermining the notion of linguistic relativity.

| Criticism of the Sapir-Whorf Hypothesis | Explanation |
| --- | --- |
| Linguistic Determinism Critique | Too restrictive, undermines cognitive flexibility |
| Lack of Empirical Support | Requires more solid scientific evidence |
| Contradictory Evidence | Findings of similar cognitive processes across diverse linguistic groups |

Influence on Other Disciplines

While the Sapir-Whorf Hypothesis has its origins in anthropology, its influence extends into various disciplines:

  • Psychology: In cognitive psychology, the hypothesis has fueled debates about whether language influences cognitive processes, like memory and perception.
  • Neurolinguistics: The hypothesis contributes to neurolinguistic studies investigating the brain’s role in language processing and perception.
  • Artificial Intelligence (AI): The hypothesis is sometimes invoked in natural language processing (NLP), the branch of AI concerned with processing and generating human language.

Future Directions

As we move forward, further research and interdisciplinary collaboration are necessary to clarify the precise nature of the relationship between language and thought. Future investigations may look into the following areas:

  • More empirical research is needed to validate or refute the hypothesis, focusing on various linguistic features and their cognitive and cultural implications.
  • Studies combining anthropology, linguistics, psychology, and neuroscience could offer valuable insights into how language affects cognitive processes and shapes cultural perspectives.
  • Exploring the impact of multilingualism on cognition could add another layer of complexity to our understanding of the Sapir-Whorf Hypothesis.

In summary, the Sapir-Whorf Hypothesis, despite criticisms and debates, continues to play an essential role in linguistic anthropology and other related fields. While further research is necessary to substantiate or disprove its assertions, the hypothesis undoubtedly contributes to our understanding of the intertwined nature of language, cognition, and culture.

[1] Whorf, B. L. (1956). Language, Thought, and Reality: Selected Writings of Benjamin Lee Whorf. https://ia801605.us.archive.org/12/items/languagethoughtr00whor/languagethoughtr00whor.pdf

[2] Sapir, E. (1929). The Status of Linguistics as a Science.

[3] Sapir, E. (1921). Language: An introduction to the study of speech.

[4] Whorf, B. L. (1941). The relation of habitual thought and behavior to language.

[5] Gumperz, J. J., & Levinson, S. C. (1996). Rethinking linguistic relativity.

Anthropologist Vasundhra - Author and Anthroholic

Vasundhra, an anthropologist, embarks on a captivating journey to decode the enigmatic tapestry of human society. Fueled by an insatiable curiosity, she unravels the intricacies of social phenomena, immersing herself in the lived experiences of diverse cultures. Armed with an unwavering passion for understanding the very essence of our existence, Vasundhra fearlessly navigates the labyrinth of genetic and social complexities that shape our collective identity. Her recent publication unveils the story of the Ancient DNA field, illuminating the pervasive global North-South divide. With an irresistible blend of eloquence and scientific rigor, Vasundhra effortlessly captivates audiences, transporting them to the frontiers of anthropological exploration.



Open Access

Peer-reviewed

Research Article

The Sapir-Whorf Hypothesis and Probabilistic Inference: Evidence from the Domain of Color

Contributed equally to this work with: Emily Cibelli, Yang Xu

Affiliation Department of Linguistics, Northwestern University, Evanston, IL 60208, United States of America

Affiliations Department of Linguistics, University of California, Berkeley, CA 94720, United States of America, Cognitive Science Program, University of California, Berkeley, CA 94720, United States of America

Affiliation Department of Psychology, University of Wisconsin, Madison, WI 53706, United States of America

Affiliations Cognitive Science Program, University of California, Berkeley, CA 94720, United States of America, Department of Psychology, University of California, Berkeley, CA 94720, United States of America


  • Emily Cibelli
  • Yang Xu
  • Joseph L. Austerweil
  • Thomas L. Griffiths
  • Terry Regier

PLOS

  • Published: July 19, 2016
  • https://doi.org/10.1371/journal.pone.0158725

16 Aug 2016: The PLOS ONE Staff (2016) Correction: The Sapir-Whorf Hypothesis and Probabilistic Inference: Evidence from the Domain of Color. PLOS ONE 11(8): e0161521. https://doi.org/10.1371/journal.pone.0161521


The Sapir-Whorf hypothesis holds that our thoughts are shaped by our native language, and that speakers of different languages therefore think differently. This hypothesis is controversial in part because it appears to deny the possibility of a universal groundwork for human cognition, and in part because some findings taken to support it have not reliably replicated. We argue that considering this hypothesis through the lens of probabilistic inference has the potential to resolve both issues, at least with respect to certain prominent findings in the domain of color cognition. We explore a probabilistic model that is grounded in a presumed universal perceptual color space and in language-specific categories over that space. The model predicts that categories will most clearly affect color memory when perceptual information is uncertain. In line with earlier studies, we show that this model accounts for language-consistent biases in color reconstruction from memory in English speakers, modulated by uncertainty. We also show, to our knowledge for the first time, that such a model accounts for influential existing data on cross-language differences in color discrimination from memory, both within and across categories. We suggest that these ideas may help to clarify the debate over the Sapir-Whorf hypothesis.
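The model's central prediction, that category effects grow as perceptual memory becomes less certain, can be illustrated with a minimal sketch. It assumes a one-dimensional hue axis, a single Gaussian color category, and Gaussian memory noise. The function name `reconstruct` and all numeric values are hypothetical illustrations and are not taken from the paper or its repository.

```python
def reconstruct(stimulus, category_mean, memory_var, category_var):
    """Expected reconstruction of a hue from memory.

    Combines a noisy memory trace centered on the stimulus (variance
    memory_var) with a Gaussian category prior (mean category_mean,
    variance category_var). The result is the posterior mean for the
    expected trace: a weighted average of stimulus and category mean.
    """
    # Weight on the remembered stimulus; it shrinks as memory noise grows.
    w = category_var / (category_var + memory_var)
    return w * stimulus + (1.0 - w) * category_mean


# Hypothetical values on an arbitrary hue scale.
stimulus, category_mean, category_var = 40.0, 50.0, 100.0

low_noise = reconstruct(stimulus, category_mean, memory_var=25.0,
                        category_var=category_var)
high_noise = reconstruct(stimulus, category_mean, memory_var=400.0,
                         category_var=category_var)

print(f"low memory noise:  {low_noise:.1f}")
print(f"high memory noise: {high_noise:.1f}")
```

Under these assumptions, the reconstruction stays close to the original stimulus when memory noise is low and is pulled most of the way toward the category center when memory noise is high, which is the qualitative pattern the abstract describes.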

Citation: Cibelli E, Xu Y, Austerweil JL, Griffiths TL, Regier T (2016) The Sapir-Whorf Hypothesis and Probabilistic Inference: Evidence from the Domain of Color. PLoS ONE 11(7): e0158725. https://doi.org/10.1371/journal.pone.0158725

Editor: Daniel Osorio, University of Sussex, UNITED KINGDOM

Received: October 26, 2015; Accepted: June 21, 2016; Published: July 19, 2016

Copyright: © 2016 Cibelli et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are available within the paper and/or at: https://github.com/yangxuch/probwhorfcolor (this GitHub repository is mentioned in the paper).

Funding: This research was supported by the National Science Foundation (www.nsf.gov) under grants DGE-1106400 (EC) and SBE-1041707 (YX, TR). Publication was made possible in part by support from the Berkeley Research Impact Initiative (BRII) sponsored by the UC Berkeley Library. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The Sapir-Whorf hypothesis [ 1 , 2 ] holds that our thoughts are shaped by our native language, and that speakers of different languages therefore think about the world in different ways. This proposal has been controversial for at least two reasons, both of which are well-exemplified in the semantic domain of color. The first source of controversy is that the hypothesis appears to undercut any possibility of a universal foundation for human cognition. This idea sits uneasily with the finding that variation in color naming across languages is constrained, such that certain patterns of color naming recur frequently across languages [ 3 – 5 ], suggesting some sort of underlying universal basis. The second source of controversy is that while some findings support the hypothesis, they do not always replicate reliably. Many studies have found that speakers of a given language remember and process color in a manner that reflects the color categories of their language [ 6 – 13 ]. Reinforcing the idea that language is implicated in these findings, it has been shown that the apparent effect of language on color cognition disappears when participants are given a verbal [ 7 ] (but not a visual) interference task [ 8 , 11 , 12 ]; this suggests that language may operate through on-line use of verbal representations that can be temporarily disabled. However, some of these findings have a mixed record of replication [ 14 – 17 ]. Thus, despite the substantial empirical evidence already available, the role of language in color cognition remains disputed.

An existing theoretical stance holds the potential to resolve both sources of controversy. On the one hand, it explains effects of language on cognition in a framework that retains a universal component, building on a proposal by Kay and Kempton [ 7 ]. On the other hand, it has the potential to explain when effects of language on color cognition will appear, and when they will not—and why. This existing stance is that of the “category adjustment” model of Huttenlocher and colleagues [ 18 , 19 ]. We adopt this stance, and cast color memory as inference under uncertainty, instantiated in a category adjustment model, following Bae et al. [ 20 ] and Persaud and Hemmer [ 21 ]. The model holds that color memory involves the probabilistic combination of evidence from two sources: a fine-grained representation of the particular color seen, and the language-specific category in which it fell (e.g. English green ). Both sources of evidence are represented in a universal perceptual color space, yet their combination yields language-specific bias patterns in memory, as illustrated in Fig 1 . The model predicts that such category effects will be strongest when fine-grained perceptual information is uncertain. It thus has the potential to explain the mixed pattern of replications of Whorfian effects in the literature: non-replications could be the result of high perceptual certainty.

A stimulus is encoded in two ways: (1) a fine-grained representation of the stimulus itself, shown as a (gray) distribution over stimulus space centered at the stimulus’ location in that space, and (2) the language-specific category (e.g. English “green”) in which the stimulus falls, shown as a separate (green) distribution over the same space, centered at the category prototype. The stimulus is reconstructed by combining these two sources of information through probabilistic inference, resulting in a reconstruction of the stimulus (black distribution) that is biased toward the category prototype. Adapted from Fig 11 of Bae et al. (2015) [ 20 ].

https://doi.org/10.1371/journal.pone.0158725.g001

In the category adjustment model, both the fine-grained representation of the stimulus and the category in which it falls are modeled as probability distributions over a universal perceptual color space. The fine-grained representation is veridical (unbiased) but inexact: its distribution is centered at the location in color space where the stimulus itself fell, and the variance of that distribution captures the observer’s uncertainty about the precise location of the stimulus in color space, with greater variance corresponding to greater uncertainty. Psychologically, such uncertainty might be caused by noise in perception itself, by memory decay over time, or by some other cause—and any increase in such uncertainty is modeled by a wider, flatter distribution for the fine-grained representation. The category distribution, in contrast, captures the information about stimulus location that is given by the named category in which the stimulus fell (e.g. green for an English-speaking observer). Because named color categories vary across languages, this category distribution is assumed to be language-specific—although the space over which it exists is universal. The model infers the original stimulus location by combining evidence from both of these distributions. As a result, the model tends to produce reconstructions of the stimulus that are biased away from the actual location of the stimulus and toward the prototype of the category in which it falls.
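For a single category, the combination described above has a familiar closed form if both distributions are taken to be Gaussian. The notation below is introduced here for illustration only and is an assumption, not the paper's own: s is the true stimulus location, m is the noisy fine-grained trace, and the category has prototype \mu_c and variance \sigma_c^2.

```latex
% Assumed notation (not from the paper):
%   s = true stimulus location, m = noisy fine-grained trace,
%   \mu_c, \sigma_c^2 = category prototype and variance,
%   \sigma_m^2 = variance (uncertainty) of the fine-grained trace.
\begin{aligned}
m \mid s &\sim \mathcal{N}(s,\ \sigma_m^2), \qquad
s \mid c \sim \mathcal{N}(\mu_c,\ \sigma_c^2), \\
\hat{s} = \mathbb{E}[s \mid m, c] &= w\,m + (1 - w)\,\mu_c, \qquad
w = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_m^2}, \\
\mathbb{E}[\hat{s} - s \mid s, c] &= (1 - w)\,(\mu_c - s).
\end{aligned}
```

Under these assumptions the expected bias always points from the stimulus toward the prototype, and its size grows as \sigma_m^2 grows, because w then shrinks toward zero.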

As illustrated in Fig 2, this pattern of bias pulls stimuli on opposite sides of a category boundary in opposite directions, producing enhanced distinctiveness for such stimuli. Such enhanced distinctiveness across a category boundary is the signature of categorical perception, or analogous category effects in memory. On this view, language-specific effects on memory can emerge from a largely universal substrate when one critical component of that substrate is language-specific: the category distribution.


Model reconstructions tend to be biased toward category prototypes, yielding enhanced distinctiveness for two stimuli that fall on different sides of a category boundary. Categories are shown as distributions in green and blue; stimuli are shown as vertical black lines; reconstruction bias patterns are shown as arrows.

https://doi.org/10.1371/journal.pone.0158725.g002

If supported, the category adjustment model holds the potential to clarify the debate over the Sapir-Whorf hypothesis in three ways. First, it would link that debate to independent principles of probabilistic inference. In so doing, it would underscore the potentially important role of uncertainty, whether originating in memory or perception, in framing the debate theoretically. Second, and relatedly, it would suggest a possible reason why effects of language on color memory and perception are sometimes found, and sometimes not [ 17 ]. Concretely, the model predicts that greater uncertainty in the fine-grained representation (induced for example through a memory delay, or noise in perception) will lead to greater influence of the category, and thus a stronger bias in reproduction. The mirror image of this prediction is that in situations of relatively high certainty in memory or perception, there will be little influence of the category, to the point that such an influence may not be empirically detectable. Third, the model suggests a way to think about the Sapir-Whorf hypothesis without jettisoning the important idea of a universal foundation for cognition.
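The second prediction can be illustrated numerically. The following is a minimal sketch, assuming a single Gaussian category and a Gaussian fine-grained trace; the function name and all numeric values are hypothetical and are not taken from the paper.

```python
def expected_bias(stimulus, prototype, category_var, memory_var):
    """Expected reconstruction bias under a one-category Gaussian model.

    Assumption: the fine-grained trace and the category are both Gaussian,
    so the reconstruction is a precision-weighted average of the two.
    """
    # Weight on the fine-grained trace; it shrinks as memory_var grows.
    w = category_var / (category_var + memory_var)
    reconstruction = w * stimulus + (1.0 - w) * prototype
    return reconstruction - stimulus


# Hypothetical values on a hue axis normalized to [0, 1].
stimulus, prototype, category_var = 0.40, 0.50, 0.01

# Bias for increasing uncertainty in the fine-grained representation.
biases = [expected_bias(stimulus, prototype, category_var, v)
          for v in (0.0001, 0.001, 0.01, 0.1)]
```

With these illustrative numbers the bias grows steadily with the uncertainty but never exceeds the distance between stimulus and prototype; at very low uncertainty it is close to zero, which is the situation in which a category effect might go undetected.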

Closely related ideas appear in the literature on probabilistic cue integration [ 22 – 25 ]. For example, Ernst and Banks [ 24 ] investigated perceptual integration of cues from vision and touch in judging the height of an object. They found that humans integrate visual and haptic cues in a statistically optimal fashion, modulated by cue certainty. The category adjustment model we explore here can be seen as a form of probabilistic cue integration in which one of the cues is a language-specific category.

The category adjustment model has been used to account for category effects in various domains, including spatial location [ 18 , 26 ], object size [ 19 , 27 ], and vowel perception [ 28 ]. The category adjustment model also bears similarities to other theoretical accounts of the Sapir-Whorf hypothesis that emphasize the importance of verbal codes [ 7 , 8 ], and the interplay of such codes with perceptual representations [ 29 – 31 ]. Prior research has linked such category effects to probabilistic inference, following the work of Huttenlocher and colleagues [ 18 , 19 ]. Roberson and colleagues [ 32 ] invoked the category adjustment model as a possible explanation for categorical perception of facial expressions, but did not explore a formal computational model; Goldstone [ 33 ] similarly referenced the category adjustment model with respect to category effects in the color domain. Persaud and Hemmer [ 21 , 34 ] explored bias in memory for color, and compared empirically obtained memory bias patterns from English speakers with results predicted by a formally specified category adjustment model, but did not link those results to the debate over the Sapir-Whorf hypothesis, and did not manipulate uncertainty. More recently, a subsequent paper by the same authors and colleagues [ 35 ] explored category-induced bias in speakers of another language, Tsimané, and did situate those results with respect to the Sapir-Whorf hypothesis, but again did not manipulate uncertainty. Most recently, Bae et al. [ 20 ] extensively documented bias in color memory in English speakers, modeled those results with a category-adjustment computational model, and did manipulate uncertainty—but did not explore these ideas relative to the Sapir-Whorf hypothesis, or to data from different languages.

In what follows, we first present data and computational simulations that support the recent finding that color memory in English speakers is well-predicted by a category adjustment model, with the strength of category effects modulated by uncertainty. We then show, to our knowledge for the first time, that a category adjustment model accounts for influential existing cross-language data on color that support the Sapir-Whorf hypothesis.

Results

In this section we provide general descriptions of our analyses and results. Full details are supplied in the section on Materials and Methods.

Study 1: Color reconstruction in English speakers

Our first study tests the core assumptions of the category adjustment model in English speakers. In doing so, it probes questions that were pursued by two studies that appeared recently, after this work had begun. Persaud and Hemmer [ 21 ] and Bae et al. [ 20 ] both showed that English speakers’ memory for a color tends to be biased toward the category prototype of the corresponding English color term, in line with a category adjustment model. Bae et al. [ 20 ] also showed that the amount of such bias increases when subjects must retain the stimulus in memory during a delay period, compared to when there is no such delay, as predicted by the principles of the category adjustment model. In our first study, we consider new evidence from English speakers that tests these questions, prior to considering speakers of different languages in our following studies.

English-speaking participants viewed a set of hues that varied in small steps from dark yellow to purple, with most hues corresponding to some variety of either green or blue. We collected two kinds of data from these participants: bias data and naming data. Bias data were based on participants’ non-linguistic reconstruction of particular colors seen. Specifically, for each hue seen, participants recreated that hue by selecting a color from a color wheel, either while the target was still visible ( Fig 3A : simultaneous condition), or from memory after a short delay ( Fig 3B : delayed condition). We refer to the resulting data as bias data, because we are interested in the extent to which participants’ reconstructions of the stimulus color are biased away from the original target stimulus. Afterwards, the same participants indicated how good an example of English green (as in Fig 3C ) and how good an example of English blue each hue was. We refer to these linguistic data as naming data.


Screenshots of example trials illustrating (A) simultaneous reconstruction, (B) delayed reconstruction, and (C) green goodness rating.

https://doi.org/10.1371/journal.pone.0158725.g003

Fig 4 shows both naming and bias data as a function of target hue. The top panel of the figure shows the naming data and also shows Gaussian functions corresponding to the English color terms green and blue that we fitted to the naming data. Bias data were collected for only a subset of the hues for which naming data were collected, and the shaded region in the top panel of Fig 4 shows that subset, relative to the full range of hues for naming data. We collected bias data only in this smaller range because we were interested specifically in bias induced by the two color terms blue and green , and colors outside the shaded region seemed to us to clearly show some influence of neighboring categories such as yellow and purple . The bottom panel of the figure shows the bias data, plotted relative to the prototypes (means) of the fitted Gaussian functions for green and blue . It can be seen that reconstruction bias appears to be stronger in the delayed than in the simultaneous condition, as predicted, and that—especially in the delayed condition—there is an inflection in the bias pattern between the two category prototypes, suggesting that bias may reflect the influence of each of the two categories. The smaller shaded region in this bottom panel denotes the subset of these hues that we subsequently analyzed statistically, and to which we fit models. We reduced the range of considered hues slightly further at this stage, to ensure that the range was well-centered with respect to the two relevant category prototypes, for green and blue , as determined by the naming data.
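As a rough illustration of how a prototype can be read off a Gaussian fitted to goodness ratings, the sketch below fits a scaled Gaussian by least squares using a simple grid search. The fitting procedure, the function name, and the synthetic ratings are all assumptions made for this example; the procedure actually used is the one given in Materials and Methods.

```python
import math


def fit_gaussian(hues, ratings, mu_grid, sigma_grid):
    """Least-squares fit of a * exp(-(x - mu)^2 / (2 sigma^2)) by grid search."""
    best = None
    for mu in mu_grid:
        for sigma in sigma_grid:
            g = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for x in hues]
            # For fixed mu and sigma, the best amplitude has a closed form.
            a = sum(y * gi for y, gi in zip(ratings, g)) / sum(gi * gi for gi in g)
            sse = sum((y - a * gi) ** 2 for y, gi in zip(ratings, g))
            if best is None or sse < best[0]:
                best = (sse, mu, sigma, a)
    _, mu, sigma, a = best
    return mu, sigma, a


# Synthetic goodness ratings (proportion of the maximum rating) on a hue
# axis normalized to [0, 1]; these are hypothetical, not the paper's data.
hues = [i / 20 for i in range(21)]
ratings = [0.9 * math.exp(-(x - 0.35) ** 2 / (2 * 0.1 ** 2)) for x in hues]

mu_grid = [i / 100 for i in range(101)]
sigma_grid = [i / 100 for i in range(5, 31)]

# The fitted mean plays the role of the category prototype.
prototype, spread, height = fit_gaussian(hues, ratings, mu_grid, sigma_grid)
```

A grid search is used only to keep the example dependency-free; a standard nonlinear least-squares routine would serve the same purpose.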


In both top and bottom panels, the horizontal axis denotes target hue, ranging from yellow on the left to purple on the right. Top panel (naming data): The solid green and blue curves show, for each target hue, the average goodness rating for English green and blue respectively, as a proportion of the maximum rating possible. The dashed green and blue curves show Gaussian functions fitted to the naming goodness data. The dotted vertical lines marked at the bottom with green and blue squares denote the prototypes for green and blue , determined as the means of the green and blue fitted Gaussian functions, respectively. The shaded region in the top panel shows the portion of the spectrum for which bias data were collected. Bottom panel (bias data): Solid curves denote, for each target hue, the average reconstruction bias for that hue, such that positive values denote reconstruction bias toward the purple (here, right) end of the spectrum, and negative values denote reconstruction bias toward the yellow (here, left) end of the spectrum. Units for the vertical axis are the same as for the horizontal axis, which is normalized to length 1.0. The black and red curves show bias under simultaneous and delayed response, respectively. Blue stars at the top of the bottom panel mark hues for which there was a significant difference in the magnitude of bias between simultaneous and delayed conditions. The shaded region in the bottom panel shows the portion of the data that was analyzed statistically, and to which models were fit. In both panels, error bars represent standard error of the mean.

https://doi.org/10.1371/journal.pone.0158725.g004

The absolute values (magnitudes) of the bias were analyzed using a 2 (condition: simultaneous vs. delayed) × 15 (hues) repeated measures analysis of variance. This analysis revealed significantly greater bias magnitude in the delayed than in the simultaneous condition. It also revealed that bias magnitude differed significantly as a function of hue, as well as a significant interaction between the factors of hue and condition. The blue stars in Fig 4 denote hues for which the difference in bias magnitude between the simultaneous and delayed conditions reached significance. The finding of greater bias magnitude in the delayed than in the simultaneous condition is consistent with the proposal that uncertainty is an important mediating factor in such category effects, as argued by Bae et al. [ 20 ]. It also suggests that some documented failures to find such category effects could in principle be attributable to high certainty, a possibility that can be explored by manipulating uncertainty.

We wished to test in a more targeted fashion to what extent these data are consistent with a category adjustment model in which a color is reconstructed based in part on English named color categories. To that end, we compared the performance of four models against these data; only one of these models considered both of the relevant English color categories, green and blue. As in Fig 1, each model contains a fine-grained but inexact representation of the perceived stimulus, and (for most models) a representation of one or more English color categories. Each model predicts the reconstruction of the target stimulus from its fine-grained representation of the target together with any category information. Category information in the model is specified by the naming data. Each model has a single free parameter, corresponding to the uncertainty of the fine-grained representation; this parameter is fit to bias data.

  • The null model is a baseline model that predicts hue reconstruction based only on the fine-grained representation of the stimulus, with no category component.
  • The 1-category (green) model predicts hue reconstruction based on the fine-grained representation of the stimulus, combined with a representation of only the green category, derived from the green naming data.
  • The 1-category (blue) model predicts hue reconstruction based on the fine-grained representation of the stimulus, combined with a representation of only the blue category, derived from the blue naming data.
  • The 2-category model predicts hue reconstruction based on the fine-grained representation of the stimulus, combined with representations of both the green and blue categories.

If reproduction bias reflects probabilistic inference from a fine-grained representation of the stimulus itself, together with any relevant category, we would expect the 2-category model to outperform the others. The other models have access either to no category information at all (null model), or to category information for only one of the two relevant color categories (either green or blue, but not both). The 2-category model in contrast combines fine-grained stimulus information with both of the relevant categories (green and blue); this model thus corresponds most closely to a full category adjustment model.
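The four models listed above can be sketched with one function whose behavior depends on how many categories it is given. This is a minimal sketch under stated assumptions: categories are Gaussian, category priors are equal, and the 2-category reconstruction averages the per-category estimates weighted by how well each category explains the trace. The names and numbers are hypothetical; the formal specification is the one in Materials and Methods.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def reconstruct(memory, memory_var, categories):
    """Posterior-mean reconstruction of a hue from a noisy memory trace.

    categories is a list of (prototype, variance) pairs: an empty list gives
    the null model, one pair a 1-category model, two pairs the 2-category model.
    """
    if not categories:
        return memory  # null model: no category pull at all
    weights, estimates = [], []
    for prototype, cat_var in categories:
        # How well this category explains the trace (equal priors assumed).
        weights.append(normal_pdf(memory, prototype, cat_var + memory_var))
        # Precision-weighted average of trace and prototype for this category.
        w = cat_var / (cat_var + memory_var)
        estimates.append(w * memory + (1 - w) * prototype)
    total = sum(weights)
    return sum(wt * est for wt, est in zip(weights, estimates)) / total


# Hypothetical categories on a hue axis normalized to [0, 1].
green, blue = (0.3, 0.01), (0.7, 0.01)
memory_var = 0.005
trace = 0.45  # a hue on the green side of the green-blue midpoint

null_model = reconstruct(trace, memory_var, [])
green_only = reconstruct(trace, memory_var, [green])
blue_only = reconstruct(trace, memory_var, [blue])
two_category = reconstruct(trace, memory_var, [green, blue])
```

In this sketch the 2-category reconstruction of a greenish hue is pulled toward the green prototype, but less strongly than in the green-only model, and a trace exactly midway between the two prototypes is not pulled in either direction. That is one way an inflection in bias between the prototypes can arise.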

Fig 5 redisplays the data from simultaneous and delayed reconstruction, this time with model fits overlaid. The panels in the left column show data from simultaneous reconstruction, fit by each of the four models, and the panels in the right column analogously show data and model fits from delayed reconstruction. Visually, it appears that in the case of delayed reconstruction, the 2-category model fits the data at least qualitatively better than competing models: it shows an inflection in bias as the empirical data do, although not as strongly. For simultaneous reconstruction, the 2-category model fit is also reasonable but visually not as clearly superior to the others (especially the null model) as in the delayed condition.


Left column: Bias from simultaneous reconstruction, fit by each of the four models. The empirical data (black lines with error bars) in these four panels are the same, and only the model fits (red lines) differ. Within each panel, the horizontal axis denotes target hue, and the vertical axis denotes reconstruction bias. The green and blue prototypes are indicated as vertical lines with green and blue squares at the bottom. Right column: delayed reconstruction, displayed analogously.

https://doi.org/10.1371/journal.pone.0158725.g005

Table 1 reports quantitative results of these model fits. The best fit is provided by the 2-category model, in both the simultaneous and delayed conditions, whether assessed by log likelihood (LL) or by mean squared error (MSE). In line with earlier studies [ 20 , 21 ], these findings demonstrate that a category adjustment model that assumes stimulus reconstruction is governed by relevant English color terms provides a reasonable fit to data on color reconstruction by English speakers. The category adjustment model fits well both when the category bias is relatively slight (simultaneous condition), and when the bias is stronger (delayed condition).
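To make the two fit measures concrete, the sketch below fits a single uncertainty parameter to bias data by grid search and then scores the fit by MSE and by log likelihood. It is a simplified illustration, not the paper's procedure: it uses a one-category model for brevity, assumes independent Gaussian residuals with a fixed standard deviation for the likelihood, and generates its "observed" data from the model itself. All names and values are hypothetical.

```python
import math


def predicted_bias(stimulus, prototype, category_var, memory_var):
    # Expected bias of a one-category Gaussian model (assumed form).
    w = category_var / (category_var + memory_var)
    return (1 - w) * (prototype - stimulus)


def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def gaussian_log_likelihood(observed, predicted, noise_sd):
    """Log likelihood of the observations under i.i.d. Gaussian residuals."""
    return sum(
        -0.5 * math.log(2 * math.pi * noise_sd ** 2)
        - (o - p) ** 2 / (2 * noise_sd ** 2)
        for o, p in zip(observed, predicted)
    )


def fit_memory_var(stimuli, observed, prototype, category_var, grid):
    """Grid search for the memory variance that minimizes MSE."""
    def mse_for(v):
        preds = [predicted_bias(s, prototype, category_var, v) for s in stimuli]
        return mean_squared_error(observed, preds)
    return min(grid, key=mse_for)


# Hypothetical bias data, generated from the model with memory_var = 0.004.
stimuli = [0.30, 0.35, 0.40, 0.45, 0.50]
observed = [predicted_bias(s, 0.40, 0.01, 0.004) for s in stimuli]

grid = [i / 1000 for i in range(1, 21)]
best_var = fit_memory_var(stimuli, observed, 0.40, 0.01, grid)

# Compare the fitted model with a null model that predicts no bias.
fitted = [predicted_bias(s, 0.40, 0.01, best_var) for s in stimuli]
null = [0.0] * len(stimuli)
ll_fitted = gaussian_log_likelihood(observed, fitted, noise_sd=0.01)
ll_null = gaussian_log_likelihood(observed, null, noise_sd=0.01)
```

Higher log likelihood and lower MSE both indicate a better fit, which matches the reading of Table 1 given in its caption.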


LL = log likelihood (higher is better). MSE = mean squared error (lower is better). The best value in each row is shown in bold.

https://doi.org/10.1371/journal.pone.0158725.t001

Study 2: Color discrimination across languages

The study above examined the categories of just one language, English, whereas the Sapir-Whorf hypothesis concerns cross-language differences in categorization, and their effect on cognition and perception. Empirical work concerning this hypothesis has not specifically emphasized bias in reconstruction, but there is a substantial amount of cross-language data of other sorts against which the category adjustment model can be assessed. One method that has been extensively used to explore the Sapir-Whorf hypothesis in the domain of color is a two-alternative forced choice (2AFC) task. In such a task, participants first are briefly shown a target color, and then shortly afterward are shown that same target color together with a different distractor color, and are asked to indicate which was the color originally seen. A general finding from such studies [ 8 – 10 ] is that participants exhibit enhanced discrimination for pairs of colors that would be named differently in their native language. For example, in such a 2AFC task, speakers of English show enhanced discrimination for colors from the different English categories green and blue , compared with colors from the same category (either both green or both blue ) [ 8 ]. In contrast, speakers of the Berinmo language, which has named color categories that differ from those of English, show enhanced discrimination across Berinmo category boundaries, and not across those of English [ 9 ]. Thus color discrimination in this task is enhanced at the boundaries of native language categories, suggesting an effect of those native language categories on the ability to discriminate colors from memory.

Considered informally, this qualitative pattern of results appears to be consistent with the category adjustment model, as suggested above in Fig 2. We wished to determine whether such a model would also provide a good quantitative account of such results, when assessed using the specific color stimuli and native-language naming patterns considered in the empirical studies just referenced.

We considered cross-language results from two previous studies by Debi Roberson and colleagues, one that compared color memory in speakers of English and Berinmo, a language of Papua New Guinea [ 9 ], and another that explored color memory in speakers of Himba, a language of Namibia [ 10 ]. Berinmo and Himba each have five basic color terms, in contrast with eleven in English. The Berinmo and Himba color category systems are similar to each other in broad outline, but nonetheless differ noticeably. Following these two previous studies, we considered the following pairs of categories in these three languages:

  • the English categories green and blue,
  • the Berinmo categories wor (covering roughly yellow, orange, and brown) and nol (covering roughly green, blue, and purple), and
  • the Himba categories dumbu (covering roughly yellow and beige) and burou (covering roughly green, blue, and purple).

These three pairs of categories are illustrated in Fig 6 , using naming data from Roberson et al. (2000) [ 9 ] and Roberson et al. (2005) [ 10 ]. It can be seen that the English green - blue distinction is quite different from the Berinmo wor - nol and the Himba dumbu - burou distinctions, which are similar but not identical to each other. The shaded regions in this figure indicate specific colors that were probed in discrimination tasks. The shaded (probed) region that straddles a category boundary in Berinmo and Himba falls entirely within the English category green , and the shaded (probed) region that straddles a category boundary in English falls entirely within the Berinmo category nol and the Himba category burou , according to naming data in Fig 1 of Roberson et al. (2005) [ 10 ]. The empirical discrimination data in Fig 7 are based on those probed colors [ 9 , 10 ], and show that in general, speakers of a language tend to exhibit greater discrimination for pairs of colors that cross a category boundary in their native language, consistent with the Sapir-Whorf hypothesis.


The English categories green and blue (top panel), the Berinmo categories wor and nol (middle panel), and the Himba categories dumbu and burou (bottom panel), plotted against a spectrum of hues that ranges from dark yellow at the left, through green, to blue at the right. Colored squares mark prototypes: the shared prototype for Berinmo wor and Himba dumbu , and the prototypes for English green and blue ; the color of each square approximates the color of the corresponding prototype. For each language, the dotted-and-dashed vertical lines denote the prototypes for the two categories from that language, and the dashed vertical line denotes the empirical boundary between these two categories. Black curves show the probability of assigning a given hue to each of the two native-language categories, according to the category component of a 2-category model fit to each language’s naming data. The shaded regions mark the ranges of colors probed in discrimination tasks; these two regions are centered at the English green - blue boundary and the Berinmo wor - nol boundary. Data are from Roberson et al. (2000) [ 9 ] and Roberson et al. (2005) [ 10 ].

https://doi.org/10.1371/journal.pone.0158725.g006


Top panels: Discrimination from memory by Berinmo and English speakers for pairs of colors across and within English and Berinmo color category boundaries. Empirical data are from Table 11 of Roberson et al. (2000:392). Empirical values show mean proportion correct 2AFC memory judgments, and error bars show standard error. Model values show mean model proportion correct 2AFC memory judgments after simulated reconstruction with native-language categories. Model results are range-matched to the corresponding empirical values, such that the minimum and maximum model values match the minimum and maximum mean values in the corresponding empirical dataset, and other model values are linearly interpolated. Bottom panels: Discrimination from memory by Himba and English speakers for pairs of colors across and within English and Himba color category boundaries, compared with model results based on native-language categories. Empirical data are from Table 6 of Roberson et al. (2005:400); no error bars are shown because standard error was not reported in that table.

https://doi.org/10.1371/journal.pone.0158725.g007
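The range matching described in the Fig 7 caption amounts to a linear rescaling of the model values. A minimal sketch follows; the function name and the example numbers are hypothetical.

```python
def range_match(model_values, empirical_values):
    """Linearly rescale model values so that their minimum and maximum
    equal the minimum and maximum of the empirical values."""
    m_lo, m_hi = min(model_values), max(model_values)
    e_lo, e_hi = min(empirical_values), max(empirical_values)
    if m_hi == m_lo:
        raise ValueError("model values are constant; cannot range-match")
    scale = (e_hi - e_lo) / (m_hi - m_lo)
    # Values between the extremes are linearly interpolated.
    return [e_lo + (v - m_lo) * scale for v in model_values]


# Hypothetical proportions correct, for illustration only.
model = [0.52, 0.60, 0.68]
empirical = [0.70, 0.74, 0.82]
matched = range_match(model, empirical)
```

Because the transformation is linear and increasing, it preserves the ordering of the model values and the relative spacing between them; only the overall scale and offset change.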

We sought to determine whether the 2-category model explored above could account for these data. To that end, for each language, we created a version of the 2-category model based on the naming data for that language. Thus, we created an English model in which the two categories were based on empirical naming data for green and blue , a Berinmo model in which the two categories were based on empirical naming data for wor and nol , and a Himba model in which the two categories were based on empirical naming data for dumbu and burou . The black curves in Fig 6 show the probability of assigning a given hue to each of the two native-language categories, according to the category component of a 2-category model fit to each language’s naming data. Given this category information, we simulated color reconstruction from memory for the specific colors considered in the empirical studies [ 9 , 10 ] (the colors in the shaded regions in Fig 6 ). We did so separately for the cases of English, Berinmo, and Himba, in each case fitting a model based on naming data for a given language to discrimination data from speakers of that language. As in Study 1, we fit the model parameter corresponding to the uncertainty of fine-grained perceptual representation to the empirical non-linguistic (here discrimination) data, and we used a single value for this parameter across all three language models. The model results are shown in Fig 7 , beside the empirical data to which they were fit. The models provide a reasonable match to the observed cross-language differences in discrimination. Specifically, the stimulus pairs for which empirical performance is best are those that cross a native-language boundary—and these are stimulus pairs for which the corresponding model response is strongest.
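One way to turn simulated reconstruction into a 2AFC prediction is sketched below. The decision rule is an assumption made for this example: the remembered target is reconstructed with a 2-category model, and the alternative closer to the reconstruction is chosen. Expected accuracy is computed by numerical integration over possible memory traces. Categories, stimuli, and parameter values are hypothetical; the actual simulation procedure is the one given in Materials and Methods.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def reconstruct(memory, memory_var, categories):
    """Posterior-mean reconstruction under a Gaussian mixture of categories."""
    weights, estimates = [], []
    for prototype, cat_var in categories:
        weights.append(normal_pdf(memory, prototype, cat_var + memory_var))
        w = cat_var / (cat_var + memory_var)
        estimates.append(w * memory + (1 - w) * prototype)
    return sum(a * b for a, b in zip(weights, estimates)) / sum(weights)


def p_correct_2afc(target, distractor, memory_var, categories, steps=4000):
    """Probability that the reconstruction lies nearer the target than the distractor."""
    sd = math.sqrt(memory_var)
    lo, width = target - 6 * sd, 12 * sd / steps
    total = 0.0
    for i in range(steps):
        m = lo + (i + 0.5) * width  # midpoint rule over memory traces
        r = reconstruct(m, memory_var, categories)
        if abs(r - target) < abs(r - distractor):
            total += normal_pdf(m, target, memory_var) * width
    return total


# Hypothetical two-category system with a boundary at 0.5.
categories = [(0.3, 0.01), (0.7, 0.01)]
memory_var = 0.005

# One cross-boundary pair and one within-category pair, equally far apart.
cross = p_correct_2afc(0.45, 0.55, memory_var, categories)
within = 0.5 * (p_correct_2afc(0.35, 0.45, memory_var, categories)
                + p_correct_2afc(0.45, 0.35, memory_var, categories))
```

Under this particular decision rule the cross-boundary advantage is modest in raw terms, which is the kind of situation in which rescaling model values to the empirical range makes the comparison easier to read.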

Although not shown in the figure, we also conducted a followup analysis to test whether the quality of these fits was attributable merely to model flexibility, or to a genuine fit between a language’s category system and patterns of discrimination from speakers of that language. We did this by switching which language’s model was fit to which language’s discrimination data. Specifically, we fit the model based on Berinmo naming to the discrimination data from English speakers (and vice versa), and fit the model based on Himba naming to the discrimination data from English speakers (and vice versa), again adjusting the model parameter corresponding to the uncertainty of the fine-grained perceptual representation to the empirical discrimination data. The results are summarized in Table 2 . It can be seen that the discrimination data are fit better by native-language models (that is, models with a category component originally fit to that language’s naming data) than by other-language models (that is, models with a category component originally fit to another language’s naming data). These results suggest that cross-language differences in discrimination may result from category-induced reconstruction bias under uncertainty, guided by native-language categories.


The best value in each row is shown in bold. Data are fit better by native-language models than by other-language models.

https://doi.org/10.1371/journal.pone.0158725.t002

Study 3: Within-category effects

Although many studies of categorical perception focus on pairs of stimuli that cross category boundaries, there is also evidence for category effects within categories. In a 2AFC study of categorical perception of facial expressions, Roberson and colleagues [ 32 ] found the behavioral signature of categorical perception (or more precisely in this case, categorical memory): superior discrimination for cross-category than for within-category pairs of stimuli. But in addition, they found an interesting category effect on within-category pairs, dependent on order of presentation. For each within-category pair they considered, one stimulus of the pair was always closer to the category prototype (the “good exemplar”) than the other (the “poor exemplar”). They found that 2AFC performance on within-category pairs was better when the target was the good exemplar (and the distractor was therefore the poor exemplar) than when the target was the poor exemplar (and the distractor was therefore the good exemplar)—even though the same stimuli were involved in the two cases. Moreover, performance in the former (good exemplar) case did not differ significantly from cross-category performance. Hanley and Roberson [ 36 ] subsequently reanalyzed data from a number of earlier studies that had used 2AFC tasks to explore cross-language differences in color naming and cognition, including those reviewed and modeled in the previous section. Across studies and across domains, including color, they found the same asymmetrical within-category effect originally documented for facial expressions.

This within-category pattern may be naturally explained in category-adjustment terms, as shown in Fig 8 , and as argued by Roberson and colleagues [ 32 ]. The central idea is that because the target is held in memory, it is subject to bias toward the prototype in memory, making discrimination of target from distractor either easier or harder depending on which of the two stimuli is the target. Although this connection with the category adjustment model has been made in the literature in general conceptual terms [ 32 ], followup studies have been theoretically focused elsewhere [ 31 , 36 ], and the idea has not to our knowledge been tested computationally using the specific stimuli and naming patterns involved in the empirical studies. We sought to do so.

The category adjustment model predicts: (top panel, good exemplar) easy within-category discrimination in a 2AFC task when the initially-presented target t is closer to the prototype than the distractor d is; (bottom panel, poor exemplar) difficult within-category discrimination with the same two stimuli when the initially-presented target t is farther from the prototype than the distractor d is. Category is shown as a distribution in blue; stimuli are shown as vertical black lines marked t and d; reconstruction bias patterns are shown as arrows.

https://doi.org/10.1371/journal.pone.0158725.g008
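The asymmetry described in Fig 8 can be illustrated with a small numeric sketch. It assumes that the remembered target is pulled toward the prototype by a fixed shrinkage weight, and it uses the distance between the reconstructed target and the distractor as a rough proxy for discriminability. The function names, the weight `w`, and the hue positions are hypothetical values chosen for illustration, not quantities taken from the studies.

```python
def reconstruct(m, prototype, w):
    """Shrink memory trace m toward the prototype; w is the weight kept on m (0..1)."""
    return w * m + (1.0 - w) * prototype


def separation(target, distractor, prototype, w):
    """Distance between the reconstructed target and the distractor (toy proxy)."""
    return abs(reconstruct(target, prototype, w) - distractor)


prototype = 0.0          # hypothetical prototype position on a 1-D hue axis
good, poor = 0.2, 0.3    # hypothetical good and poor exemplars of the category
w = 0.8                  # hypothetical weight on the fine-grained memory

baseline = abs(good - poor)                  # separation with no memory bias
ge = separation(good, poor, prototype, w)    # target is the good exemplar
pe = separation(poor, good, prototype, w)    # target is the poor exemplar
print(f"baseline={baseline:.2f}  GE={ge:.2f}  PE={pe:.2f}")
```

With the same two stimuli, the reconstructed target moves away from the distractor in the good-exemplar case and toward it in the poor-exemplar case, which is the qualitative pattern the figure describes.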

The empirical data in Fig 9 illustrate the within-category effect with published results on color discrimination by speakers of English, Berinmo, and Himba. In attempting to account for these data, we considered again the English, Berinmo, and Himba variants of the 2-category model first used in Study 2, and also retained from that study the parameter value corresponding to the uncertainty of the fine-grained perceptual representation, in the case of native-language models. We simulated reconstruction from memory of the specific colors examined in Study 2. Following the empirical analyses, this time we disaggregated the within-category stimulus pairs into those in which the target was a good exemplar of the category (i.e. the target was closer to the prototype than the distractor was), vs. those in which the target was a poor exemplar of the category (i.e. the target was farther from the prototype than the distractor was). The model results are shown in Fig 9 , and match the empirical data reasonably well, supporting the informal in-principle argument of Fig 8 with a more detailed quantitative analysis.

Across: stimulus pair crosses the native-language boundary; GE: within-category pair, target is the good exemplar; PE: within-category pair, target is the poor exemplar. Empirical data are from Figs 2 (English: 10-second retention interval), 3 (Berinmo), and 4 (Himba) of Hanley and Roberson [ 36 ]. Empirical values show mean proportion correct 2AFC memory judgments, and error bars show standard error. Model values show mean model proportion correct 2AFC memory judgments after simulated reconstruction using native-language categories, range-matched as in Fig 7 . English model compared with English data: 0.00002 MSE; Berinmo model compared with Berinmo data: 0.00055 MSE; Himba model compared with Himba data: 0.00087 MSE.

https://doi.org/10.1371/journal.pone.0158725.g009

Conclusions

We have argued that the debate over the Sapir-Whorf hypothesis may be clarified by viewing that hypothesis in terms of probabilistic inference. To that end, we have presented a probabilistic model of color memory, building on proposals in the literature. The model assumes both a universal color space and language-specific categorical partitionings of that space, and infers the originally perceived color from these two sources of evidence. The structure of this model maps naturally onto a prominent proposal in the literature that has to our knowledge not previously been formalized in these terms. In a classic early study of the effect of language on color cognition, Kay and Kempton [ 7 ] interpret Whorf [ 2 ] as follows:

Whorf […] suggests that he conceives of experience as having two tiers: one, a kind of rock bottom, inescapable seeing-things-as-they-are (or at least as human beings cannot help but see them), and a second, in which [the specific structures of a given language] cause us to classify things in ways that could be otherwise (and are otherwise for speakers of a different language).

Kay and Kempton argue that color cognition involves an interaction between these two tiers. The existence of a universal groundwork for color cognition helps to explain why there are constraints on color naming systems across languages [ 3 – 5 , 37 ]. At the same time, Kay and Kempton acknowledge a role for the language-specific tier in cognition, such that “there do appear to be incursions of linguistic categorization into apparently nonlinguistic processes of thinking” (p. 77). These two tiers map naturally onto the universal and language-specific components of the model we have explored here. This structure offers a straightforward way to think about effects of language on cognition while retaining the idea of a universal foundation underpinning human perception and cognition. Thus, this general approach, and our model as an instance of it, offer a possible resolution of one source of controversy surrounding the Sapir-Whorf hypothesis: taking that hypothesis seriously need not entail a wholesale rejection of important universal components of human cognition.

The approach proposed here also has the potential to resolve another source of controversy surrounding the Sapir-Whorf hypothesis: that some findings taken to support it do not replicate reliably (e.g. in the case of color: [ 15 – 17 ]). Framing the issue in terms of probabilistic inference touches this question by highlighting the theoretically central role of uncertainty , as in models of probabilistic cue integration [ 24 ]. We have seen stronger category-induced bias in color memory under conditions of greater delay and presumably therefore greater uncertainty (Study 1, and [ 20 ]). This suggests that in the inverse case of high certainty about the stimulus, any category effect could in principle be so small as to be empirically undetectable, a possibility that can be pursued by systematically manipulating uncertainty. Thus, the account advanced here casts the Sapir-Whorf hypothesis in formal terms that suggest targeted and quantitative followup tests. A related theoretical advantage of uncertainty is that it highlights an important level of generality: uncertainty could result from memory, as explored here, but it could also result from noise or ambiguity in perception itself, and on the view advanced here, the result should be the same.

The model we have proposed does not cover all aspects of language effects on color cognition. For example, there are documented priming effects [ 31 ] which do not appear to flow as naturally from this account as do the other effects we have explored above. However, the model does bring together disparate bodies of data in a simple framework, and links them to independent principles of probabilistic inference. Future research can usefully probe the generality and the limitations of the ideas we have explored here.

Materials and Methods

Code and data supporting the analyses reported here are available at https://github.com/yangxuch/probwhorfcolor.git .

The perception of stimulus S = s produces a fine-grained memory M , and a categorical code c specifying the category in which s fell. We wish to reconstruct the original stimulus S = s , given M and c .

https://doi.org/10.1371/journal.pone.0158725.g010
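As a minimal sketch of the reconstruction problem in the caption above, suppose the fine-grained memory M is the stimulus plus Gaussian noise, and the category c contributes a Gaussian distribution centered on its prototype. Both distributional choices are assumptions made here for illustration; the model equations themselves are not reproduced in this section. Under these assumptions the posterior mean of S is a precision-weighted average of the memory trace and the prototype. The names `reconstruct`, `sigma_m`, and `sigma_c` are hypothetical.

```python
def reconstruct(m, prototype, sigma_m, sigma_c):
    """Posterior mean of the stimulus given memory trace m and a Gaussian category.

    Assumes M ~ Normal(s, sigma_m**2) and S ~ Normal(prototype, sigma_c**2);
    both are illustrative assumptions, not the source's stated equations.
    """
    w = sigma_c**2 / (sigma_c**2 + sigma_m**2)  # weight on the fine-grained memory
    return w * m + (1.0 - w) * prototype


if __name__ == "__main__":
    m, prototype, sigma_c = 0.40, 0.50, 0.10  # hypothetical values on a 0-1 hue axis
    for sigma_m in (0.01, 0.05, 0.10):
        r = reconstruct(m, prototype, sigma_m, sigma_c)
        print(f"sigma_m={sigma_m:.2f}  reconstruction={r:.4f}  bias={r - m:+.4f}")
```

The bias toward the prototype grows as memory uncertainty `sigma_m` grows, which matches the role the text assigns to uncertainty.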

Null model.

1-category model.

2-category model.

Fitting models to data.

Participants.

Twenty subjects participated in the experiment, having been recruited at UC Berkeley. All subjects were at least 18 years of age, native English speakers, and reported normal or corrected-to-normal vision, and no colorblindness. All subjects received payment or course credit for participation.

Informed consent was obtained verbally; all subjects read an approved consent form and verbally acknowledged their willingness to participate in the study. Verbal consent was chosen because the primary risk to subjects in this study was for their names to be associated with their response; this approach allowed us to obtain consent and collect data without the need to store subjects’ names in any form. Once subjects acknowledged that they understood the procedures and agreed to participate by stating so to the experimenter, the experimenter recorded their consent by assigning them a subject number, which was anonymously linked to their data. All study procedures, including those involving consent, were overseen and approved by the UC Berkeley Committee for the Protection of Human Subjects.

Stimuli were selected by varying a set of hues centered around the blue - green boundary, holding saturation and lightness constant. Stimuli were defined in Munsell coordinate space, which is widely used in the literature we engage here (e.g. [ 9 , 10 ]). All stimuli were at lightness 6 and saturation 8. Hue varied from 5Y to 10P, in equal hue steps of 2.5. Colors were converted to xyY coordinate space following Table I(6.6.1) of Wyszecki and Stiles (1982) [ 38 ]. The colors were implemented in Matlab in xyY; the correspondence of these coordinate systems in the stimulus set, as well as approximate visualizations of the stimuli, are reported in Table 3 .

All stimuli were presented at lightness 6, saturation 8 in Munsell space.

https://doi.org/10.1371/journal.pone.0158725.t003

  • Full range: We collected naming data for green and blue relative to the full range, stimuli 1-27, for a total of 27 stimuli. We fit the category components of our models to naming data over this full range.
  • Medium range: We collected bias data for a subset of the full range, namely the medium range, stimuli 5-23, for a total of 19 stimuli. We considered this subset because we were interested in bias induced by the English color terms green and blue , and we had the impression, prior to collecting naming or bias data, that colors outside this medium range had some substantial element of the neighboring categories yellow and purple .
  • Focused range: Once we had naming data, we narrowed the range further based on those data, to the focused range, stimuli 5-19, for a total of 15 stimuli. The focused range extends between the (now empirically assessed) prototypes for green and blue , and also includes three of our stimulus hues on either side of these prototypes, yielding a range well-centered relative to those prototypes, as can be seen in the bottom panel of Fig 4 above. We considered this range in our statistical analyses, and in our modeling of bias patterns.

Experimental procedure.

The experiment consisted of four blocks. The first two blocks were reconstruction (bias) tasks: one simultaneous block and one delay block. In the simultaneous block ( Fig 3A ), the subject was shown a stimulus color as a colored square (labeled as “Original” in the figure), and was asked to recreate that color in a second colored square (labeled as “Target” in the figure) as accurately as possible by selecting a hue from a color wheel. The (“Original”) stimulus color remained on screen while the subject selected a response from the color wheel; navigation of the color wheel would change the color of the response (“Target”) square. The stimulus square and response square each covered 4.5 degrees of visual angle, and the color wheel covered 11.1 degrees of visual angle. Target colors were drawn from the medium range of stimuli (stimuli 5-23 of Table 3 ). The color wheel was constructed based on the full range of stimuli (stimuli 1-27 of Table 3 ), supplemented by interpolating 25 points evenly in xyY coordinates between each neighboring pair of the 27 stimuli of the full range, to create a finely discretized continuum from yellow to purple, with 677 possible responses. Each of the 19 target colors of the medium range was presented five times per block in random order, for a total of 95 trials per block.

The delay block ( Fig 3B ) was similar to the simultaneous block, except that the stimulus color was shown for 500 milliseconds and then disappeared, then a fixation cross was shown for 1000 milliseconds, after which the subject was asked to reconstruct the target color from memory, again using the color wheel to change the color of the response square. The one colored square shown in the final frame of Fig 3B is the response square that changed color under participant control. The order of the simultaneous block and delay block was counterbalanced by subject. Trials were presented with a 500 millisecond inter-trial interval.
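The count of 677 possible responses follows from the construction: 27 anchor stimuli plus 25 interpolated points in each of the 26 gaps between neighbors. The sketch below reproduces that arithmetic with NumPy. The anchor coordinates are made-up placeholders standing in for the xyY values of Table 3, and `build_wheel` is a hypothetical helper name.

```python
import numpy as np


def build_wheel(anchors, n_between=25):
    """Insert n_between evenly spaced points between each neighboring pair of anchors.

    anchors: array of shape (n, 3) holding xyY coordinates.
    Returns an array of shape (n + (n - 1) * n_between, 3).
    """
    anchors = np.asarray(anchors, dtype=float)
    pieces = []
    for a, b in zip(anchors[:-1], anchors[1:]):
        # n_between + 2 samples include both endpoints; drop b to avoid duplicates
        pieces.append(np.linspace(a, b, n_between + 2)[:-1])
    pieces.append(anchors[-1:])  # add the final anchor once
    return np.vstack(pieces)


# Placeholder anchors: 27 made-up xyY triples, NOT the values from Table 3
anchors = np.column_stack([
    np.linspace(0.2, 0.4, 27),
    np.linspace(0.3, 0.5, 27),
    np.full(27, 30.0),
])
wheel = build_wheel(anchors)
print(wheel.shape)  # (677, 3)
```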

Several steps were taken to ensure that responses made on the color wheel during the reconstruction blocks were not influenced by bias towards a particular spatial position. The position of the color wheel was randomly rotated up to 180 degrees from trial to trial. The starting position of the cursor was likewise randomly generated for each new trial. Finally, the extent of the spectrum was jittered one or two stimuli (2.5 or 5 hue steps) from trial to trial, which had the effect of shifting the spectrum slightly in the yellow or the purple direction from trial to trial. This was done to ensure that the blue - green boundary would not fall at a consistent distance from the spectrum endpoints on each trial.

The second two blocks were naming tasks. In each, subjects were shown each of the 27 stimuli of the full range five times in random order, for a total of 135 trials per block. On each trial, subjects were asked to rate how good an example of a given color name each stimulus was. In one block, the color name was green , in the other, the color name was blue ; order of blocks was counterbalanced by subject. To respond, subjects positioned a slider bar with endpoints “Not at all [green/blue]” and “Perfectly [green/blue]” to the desired position matching their judgment of each stimulus, as shown above in Fig 3C . Responses in the naming blocks were self-paced. Naming blocks always followed reconstruction blocks, to ensure that repeated exposure to the color terms green and blue did not bias responses during reconstruction.

The experiment was presented in Matlab version 7.11.0 (R2010b) using Psychtoolbox (version 3) [ 39 – 41 ]. The experiment was conducted in a dark, sound-attenuated booth on an LCD monitor that supported 24-bit color. The monitor had been characterized using a Minolta CS100 colorimeter. A chin rest was used to ensure that each subject viewed the screen from a constant position; when in position, the base of the subject’s chin was situated 30 cm from the screen.

As part of debriefing after testing was complete, each subject was asked to report any strategies they used during the delay block to help them remember the target color. Summaries of each response, as reported by the experimenter, are listed in Table 4 .

When subjects gave specific examples of color terms used as memory aids, they are reported here.

https://doi.org/10.1371/journal.pone.0158725.t004

Color spectrum.

We wished to consider our stimuli along a 1-dimensional spectrum such that distance between two colors on that spectrum approximates the perceptual difference between those colors. To this end, we first converted our stimuli to CIELAB color space. CIELAB is a 3-dimensional color space designed “in an attempt to provide coordinates for colored stimuli so that the distance between the coordinates of any two stimuli is predictive of the perceived color difference between them” (p. 202 of [ 42 ]). The conversion to CIELAB was done according to the equations on pp. 167-168 of Wyszecki and Stiles (1982) [ 38 ], assuming 2 degree observer and D65 illuminant. For each pair of neighboring colors in the set of 677 colors of our color wheel, we measured the distance (Δ E ) between these two colors in CIELAB space. We then arranged all colors along a 1-dimensional spectrum that was scaled to length 1, such that the distance between each pair of neighboring colors along that spectrum was proportional to the CIELAB Δ E distance between them. This CIELAB-based 1-dimensional spectrum was used for our analyses in Study 1, and an analogous spectrum for a different set of colors was used for our analyses in Studies 2 and 3.
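The procedure can be sketched as follows, using the standard CIE 1976 L*a*b* formulas with a D65 white point for the 2 degree observer. This is an illustrative reimplementation under those assumptions, not the code used in the study; the function names are hypothetical and the sample xyY values are placeholders. Y is assumed to be on the same 0-100 scale as the white point.

```python
import numpy as np

# D65 white point, CIE 1931 2-degree observer (XYZ with Y normalized to 100)
WHITE_D65 = np.array([95.047, 100.0, 108.883])


def xyY_to_XYZ(xyY):
    """Convert rows of (x, y, Y) chromaticity and luminance to tristimulus XYZ."""
    x, y, Y = np.asarray(xyY, dtype=float).T
    X = x * Y / y
    Z = (1.0 - x - y) * Y / y
    return np.column_stack([X, Y, Z])


def XYZ_to_Lab(XYZ, white=WHITE_D65):
    """CIE 1976 L*a*b* from XYZ, relative to the given white point."""
    t = np.asarray(XYZ, dtype=float) / white
    delta = 6.0 / 29.0
    f = np.where(t > delta**3, np.cbrt(t), t / (3 * delta**2) + 4.0 / 29.0)
    L = 116.0 * f[:, 1] - 16.0
    a = 500.0 * (f[:, 0] - f[:, 1])
    b = 200.0 * (f[:, 1] - f[:, 2])
    return np.column_stack([L, a, b])


def spectrum_positions(lab):
    """Positions in [0, 1] whose neighbor spacing is proportional to Delta E."""
    delta_e = np.linalg.norm(np.diff(lab, axis=0), axis=1)  # Euclidean Delta E
    cumulative = np.concatenate([[0.0], np.cumsum(delta_e)])
    return cumulative / cumulative[-1]


# Placeholder xyY rows, NOT the stimuli of Table 3
xyY = np.array([[0.40, 0.45, 30.0], [0.33, 0.42, 30.0],
                [0.27, 0.35, 30.0], [0.25, 0.28, 30.0]])
positions = spectrum_positions(XYZ_to_Lab(xyY_to_XYZ(xyY)))
print(positions)
```

The first position is 0 and the last is 1 by construction, so the spectrum has length 1 regardless of the absolute size of the color differences.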

Statistical analysis.

As a result of the experiment detailed above, we obtained bias data from 20 participants, for each of 19 hues (the medium range), for 5 trials per hue per participant, in each of the simultaneous and delayed conditions. For analysis purposes, we restricted attention to the focused range of stimuli (15 hues), in order to consider a region of the spectrum that is well-centered with respect to green and blue , as we are primarily interested in bias that may be induced by these two categories. We wished to determine whether the magnitude of the bias differed as a function of the simultaneous vs. delayed condition, whether the magnitude of the bias varied as a function of hue, and whether there was an interaction between these two factors. To answer those questions, we conducted a 2 (condition: simultaneous vs. delayed) × 15 (hues) repeated measures analysis of variance (ANOVA), in which the dependent measure was the absolute value of the reproduction bias (reproduced hue minus target hue), averaged across trials for a given participant at a given target hue in a given condition. The ANOVA included an error term to account for across-subject variability. We found a main effect of condition, with greater bias magnitude in the delayed than in the simultaneous condition [ F (1, 19) = 61.61, p < 0.0001], a main effect of hue [ F (14, 266) = 4.565, p < 0.0001], and an interaction of hue and condition [ F (14, 266) = 3.763, p < 0.0001]. All hue calculations were relative to the CIELAB-based spectrum detailed in the preceding section.
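The dependent measure and the degrees of freedom reported above can be checked with a short sketch. The responses below are synthetic placeholders, not the study's data. The code takes the mean signed bias across trials and then its absolute value; the wording could also be read as averaging the absolute bias across trials, so that ordering is an assumption.

```python
import numpy as np

n_subj, n_cond, n_hue, n_trial = 20, 2, 15, 5

# Synthetic placeholder responses on a 0-1 hue axis; NOT the study's data.
rng = np.random.default_rng(0)
target = np.linspace(0.2, 0.8, n_hue).reshape(1, 1, n_hue, 1)
reproduced = target + rng.normal(0.0, 0.02, size=(n_subj, n_cond, n_hue, n_trial))

# Signed bias per trial (reproduced minus target), averaged over trials,
# then made unsigned. Assumption: see the note on ordering above.
signed_bias = reproduced - target
bias_magnitude = np.abs(signed_bias.mean(axis=3))  # shape (20, 2, 15)

# Degrees of freedom for a fully within-subject 2 x 15 design.
df_condition = (n_cond - 1, (n_cond - 1) * (n_subj - 1))
df_hue = (n_hue - 1, (n_hue - 1) * (n_subj - 1))
df_interaction = ((n_cond - 1) * (n_hue - 1),
                  (n_cond - 1) * (n_hue - 1) * (n_subj - 1))
print(bias_magnitude.shape, df_condition, df_hue, df_interaction)
# (20, 2, 15) (1, 19) (14, 266) (14, 266)
```

These degrees of freedom agree with the reported F(1, 19) for condition and F(14, 266) for hue and for the interaction.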

We then conducted paired t-tests at each of the target hues, comparing each participant’s bias magnitude for that hue (averaged over trials) in the simultaneous condition vs. the delayed condition. Blue asterisks at the top of Fig 4 mark hues for which the paired t-test returned p < 0.05 when applying Bonferroni corrections for multiple comparisons.
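A sketch of this per-hue comparison, using `scipy.stats.ttest_rel` on synthetic placeholder data. The number of comparisons used for the Bonferroni correction is assumed here to equal the number of hues tested (15); the text does not state it explicitly.

```python
import numpy as np
from scipy import stats

n_subj, n_hue = 20, 15

# Synthetic placeholder bias magnitudes (subjects x hues); NOT the study's data.
rng = np.random.default_rng(1)
simultaneous = np.abs(rng.normal(0.01, 0.005, size=(n_subj, n_hue)))
delayed = simultaneous + np.abs(rng.normal(0.01, 0.005, size=(n_subj, n_hue)))

# One paired t-test per hue, pairing each subject's two conditions.
t_stat, p_raw = stats.ttest_rel(delayed, simultaneous, axis=0)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_bonf = np.minimum(p_raw * n_hue, 1.0)
significant = p_bonf < 0.05
print(significant)
```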

Modeling procedure.

We considered four models in accounting for color reconstruction in English speakers: the null model, a 1-category model for which the category was green , a 1-category model for which the category was blue , and a 2-category model based on both green and blue .

Empirical data.

The empirical data considered for this study were drawn from two sources: the study of 2AFC color discrimination by speakers of Berinmo and English in Experiment 6a of Roberson et al. (2000) [ 9 ], and the study of 2AFC color discrimination by speakers of Himba and English in Experiment 3b of Roberson et al. (2005) [ 10 ]. In both studies, two sets of color stimuli were considered, all at value (lightness) level 5, and chroma (saturation) level 8. Both sets varied in hue by increments of 2.5 Munsell hue steps. The first set of stimuli was centered at the English green - blue boundary (hue 7.5BG), and contained the following seven hues: 10G, 2.5BG, 5BG, 7.5BG, 10BG, 2.5B, 5B. The second set of stimuli was centered at the Berinmo wor - nol boundary (hue 5GY), and contained the following seven hues: 7.5Y, 10Y, 2.5GY, 5GY, 7.5GY, 10GY, 2.5G. Stimuli in the set that crossed an English category boundary all fell within a single category in Berinmo ( nol ) and in Himba ( burou ), and stimuli in the set that crossed a Berinmo category boundary also crossed a Himba category boundary ( dumbu - burou ) but all fell within a single category in English ( green ), according to naming data in Fig 1 of Roberson et al. (2005) [ 10 ]. Based on specifications in the original empirical studies [ 9 , 10 ], we took the pairs of stimuli probed to be those presented in Table 5 .

Any stimulus pair that includes a boundary color is considered to be a cross-category pair. All hues are at value (lightness) level 5, and chroma (saturation) level 8. 1s denotes a 1-step pair; 2s denotes a 2-step pair.

https://doi.org/10.1371/journal.pone.0158725.t005

Based on naming data in Fig 1 of Roberson et al. 2005 [ 10 ], we took the prototypes of the relevant color terms to be:

English green prototype = 10GY
English blue prototype = 10B
Berinmo wor prototype = 5Y
Berinmo nol prototype = 5G
Himba dumbu prototype = 5Y
Himba burou prototype = 10G

Fig 6 above shows a spectrum of hues ranging from the Berinmo wor prototype (5Y) to the English blue prototype (10B) in increments of 2.5 Munsell hue steps, categorized according to each of the three languages we consider here. These Munsell hues were converted to xyY and then to CIELAB as above, and the positions of the hues on the spectrum were adjusted so that the distance between each two neighboring hues in the spectrum is proportional to the CIELAB Δ E distance between them. We use this CIELAB-based spectrum for our analyses below. The two shaded regions on each spectrum in Fig 6 denote the two target sets of stimuli identified above.

The discrimination data we modeled were drawn from Table 11 of Roberson et al. (2000:392) [ 9 ] and Table 6 of Roberson et al. (2005:400) [ 10 ].

We considered three variants of the 2-category model: an English blue - green model, a Berinmo wor - nol model, and a Himba dumbu - burou model. As in Study 1, we fit each model to the data in two steps. For each language’s model, we first fit the category component of that model to naming data from that language. Because color naming differs across these languages, this resulted in three models with different category components. For each model, we then retained and fixed the resulting category parameter settings, and fit the single remaining parameter, corresponding to memory uncertainty, to discrimination data. We detail these two steps below.
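The two-step logic can be sketched on a toy problem. Everything below is an assumption made for illustration: the category is a single Gaussian fit by weighted moments, the predicted quantity is the distance between the reconstructed target and the distractor (a stand-in, since the mapping to proportion correct is not given in this section), and the "observed" values are generated from the toy model itself so that the search has a known answer. All function names are hypothetical.

```python
import numpy as np


def fit_category(hues, naming):
    """Step 1: Gaussian category from naming weights (weighted mean and sd)."""
    w = naming / naming.sum()
    mu = np.sum(w * hues)
    sd = np.sqrt(np.sum(w * (hues - mu) ** 2))
    return mu, sd


def predict(targets, distractors, mu, sd, sigma_m):
    """Toy observable: distance between reconstructed target and distractor."""
    w = sd**2 / (sd**2 + sigma_m**2)          # weight on the memory trace
    recon = w * targets + (1.0 - w) * mu      # shrink target toward prototype
    return np.abs(recon - distractors)


def fit_sigma_m(targets, distractors, observed, mu, sd, grid):
    """Step 2: category parameters fixed; grid-search the one remaining parameter."""
    mse = [np.mean((predict(targets, distractors, mu, sd, s) - observed) ** 2)
           for s in grid]
    best = int(np.argmin(mse))
    return float(grid[best]), float(mse[best])


# Synthetic placeholder naming data on a 0-1 hue axis; NOT empirical values.
hues = np.linspace(0.0, 1.0, 21)
naming = np.exp(-0.5 * ((hues - 0.4) / 0.15) ** 2)
mu, sd = fit_category(hues, naming)

# Synthetic "observations" generated from the toy model with a known sigma_m.
targets = np.array([0.30, 0.35, 0.45, 0.50])
distractors = np.array([0.35, 0.30, 0.50, 0.45])
true_sigma = 0.08
observed = predict(targets, distractors, mu, sd, true_sigma)

grid = np.round(np.arange(0.01, 0.21, 0.01), 2)
best_sigma, best_mse = fit_sigma_m(targets, distractors, observed, mu, sd, grid)
print(best_sigma, best_mse)
```

Because the category parameters are frozen after step 1, models for different languages differ only in their category component plus one fitted uncertainty value, which is what makes the native-language versus other-language comparison meaningful.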

The empirical data considered for this study are those of Figs 2 (English green/blue , 10 second delay), 3 (Berinmo wor/nol ), and 4 (Himba dumbu/burou ) of Hanley and Roberson (2011) [ 36 ]. These data were originally published by Roberson and Davidoff (2000) [ 8 ], Roberson et al. (2000) [ 9 ], and Roberson et al. (2005) [ 10 ], respectively. The Berinmo and Himba stimuli and data were the same as in our Study 2, but the English stimuli and data reanalyzed by Hanley and Roberson (2011) [ 36 ] Fig 2 were instead drawn from Table 1 of Roberson and Davidoff (2000) [ 8 ], reproduced here in Table 6 , and used for the English condition of this study. These stimuli for English were at lightness (value) level 4, rather than 5 as for the other two languages. We chose to ignore this difference for modeling purposes.

Any stimulus pair that includes a boundary color is considered to be a cross-category pair. All hues are at value (lightness) level 4, and chroma (saturation) level 8. 1s denotes a 1-step pair; 2s denotes a 2-step pair.

https://doi.org/10.1371/journal.pone.0158725.t006

All modeling procedures were identical to those of Study 2, with the exception that GE (target = good exemplar) and PE (target = poor exemplar) cases were disaggregated, and analyzed separately.

Acknowledgments

We thank Roland Baddeley, Paul Kay, Charles Kemp, Steven Piantadosi, and an anonymous reviewer for their comments.

Author Contributions

Conceived and designed the experiments: EC YX JLA TLG TR. Performed the experiments: EC YX JLA. Analyzed the data: EC YX JLA. Wrote the paper: TR EC YX.

References

  • 2. Whorf BL. Science and linguistics. In: Carroll JB, editor. Language, Thought, and Reality: Selected Writings of Benjamin Lee Whorf. MIT Press; 1956. p. 207–219.
  • 3. Berlin B, Kay P. Basic color terms: Their universality and evolution. University of California Press; 1969.
  • 5. Kay P, Berlin B, Maffi L, Merrifield WR, Cook R. The World Color Survey. CSLI Publications; 2009.
  • 21. Persaud K, Hemmer P. The influence of knowledge and expectations for color on episodic memory. In: Bello P, Guarini M, McShane M, Scassellati B, editors. Proceedings of the 36th Annual Meeting of the Cognitive Science Society. Cognitive Science Society; 2014. p. 1162–1167.
  • 22. Yuille AL, Bülthoff HH. Bayesian decision theory and psychophysics. In: Knill DC, Richards W, editors. Perception as Bayesian Inference. Cambridge University Press; 1996. p. 123–162.
  • 35. Hemmer P, Persaud K, Kidd C, Piantadosi S. Inferring the Tsimane’s use of color categories from recognition memory. In: Noelle DC, Dale R, Warlaumont AS, Yoshimi J, Matlock T, Jennings CD, et al., editors. Proceedings of the 37th Annual Meeting of the Cognitive Science Society. Cognitive Science Society; 2015. p. 896–901.
  • 38. Wyszecki G, Stiles WS. Color science: Concepts and methods, quantitative data and formulae. Wiley; 1982.
  • 42. Brainard DH. Color appearance and color difference specification. In: Shevell SK, editor. The science of color: Second edition. Elsevier; 2003. p. 191–216.
  • 43. Luce RD. Detection and recognition. In: Luce RD, Bush RR, Galanter E, editors. Handbook of mathematical psychology. Wiley; 1963. p. 103–189.

Sapir-Whorf hypothesis and Gender Identity: Empirical Studies?

The Sapir-Whorf hypothesis states, briefly put, that linguistic structures affect cognitive processes. I am interested in finding out how much is known about the development of gender identity from this point of view. That is: Do the gender-related structures of each language influence the way children develop such an identity? Ideally, I would like to know about recent empirical studies comparing the development of gender identity in, say, Semitic languages (heavily gendered), Indo-European languages (gendered) and Finno-Ugric languages (completely genderless, as far as I know). I am aware of some works in this field, and also of some other very interesting empirical works related to the Sapir-Whorf hypothesis, like Everett's field work on the Pirahã language or this work about absolute spatial description in the Guugu Yimithirr language, but I have not found anything similar (empirical) among recent studies on the construction of gender identity for speakers of language families that treat gender (very) differently.

One fascinating example (perhaps completely irrelevant in a linguistic discussion, but at least thought-provoking) that I have in mind is the existence of third-gender or inter-gender persons, such as the Mahu of Hawai'i and Tahiti or the fa'afafine of Samoa. They were common in pre-colonial Polynesian cultures, whose languages are precisely genderless. Could this, to some degree, be related to the language and to linguistic relativism? Has this been researched?

  • reference-request
  • sociolinguistics
  • sapir-whorf-hypothesis

Sir Cornflakes's user avatar

  • 1 Not an article, but there is a very insightful TED talk by Dr. Lera Boroditsky on the topic: ted.com/talks/… –  alephreish Commented Jun 23, 2022 at 14:09
  • 5 I've deleted a series of comments criticizing the use of the term "gender identity". Whatever the answer to the question might be, it's a known anthropological fact that various cultures across the world have various different ways of categorizing people (often based on sex), and that individual people align or conflict with these categories to different extents. Asking how or whether these categories interact with linguistic noun-class systems seems like a perfectly reasonable, and empirical, question to me. –  Draconis ♦ Commented Jul 21, 2022 at 19:30

Ok. Let's go with what linguists think of the Sapir-Whorf hypothesis: that it is false or has at most a minor effect on people's cognitive habits. J. Lyons in his book Semantics gives an example of how having separate words for certain hues can help a person better remember which of two objects (identical in all but color) they saw. It should be clear that having or not having separate words for different hues does not affect one's ability to distinguish them visually; it only slightly influences one's memory of the objects. (The experiment considered monolingual speakers of Zuñi and Zuñi-English bilinguals, with objects colored yellow and orange. The Zuñi language uses the same word for both colors.)

Honestly, I doubt very much that too many linguists believe that the process of identity construction has anything to do with the categories of the mother tongue. It seems that sociologists, social psychologists or anthropologists are more willing to consider that sort of thing than the vast majority of linguists.

Davius's user avatar

  • This represents one opinion, there is empirical evidence in favor of language shaping thought. A random article on the topic: web.stanford.edu/class/linguist156/Boroditsky_ea_2003.pdf –  alephreish Commented Jun 23, 2022 at 14:07
  • 2 @alephreish Maybe we can agree that the evidence is controversial, there are many studies denying the Sapir-Whorf hypothesis, especially inside linguistics: brill.com/view/journals/jocc/8/3-4/article-p335_8.xml . In particular, Boroditsky is not a linguist, she is a cognitive scientist particularly insistent that there is evidence in favor of linguistic relativism. I do not dispute her work, but it is outside the mainstream of linguistics. –  Davius Commented Jun 23, 2022 at 14:33
  • 2 Unfortunately, this does not even touch the core question on grammatical gender and gender identity. –  Sir Cornflakes Commented Jun 24, 2022 at 9:31
  • Linguistic relativism is about "speakers' worldview or cognition" as Wiki puts it, it's within the realm of cognitive science. –  alephreish Commented Jun 24, 2022 at 22:23

sapir whorf's hypothesis on language and gender


ULG's Language Services Blog

The Sapir-Whorf Hypothesis and Language's Effect on Cognition


This article was originally published in April 2017 and has been updated.

Language plays a big role in our lives. But can it affect the way we think?

So would say a dedicated Whorfian, arguing that the language we speak determines how we see the world and either restricts or bolsters our ability to understand it. The theory of linguistic relativity, often referred to as the Sapir-Whorf Hypothesis, has garnered controversy since its origins in the early 20th century, igniting both arguments and interest among prominent linguists.

It’s not only the theory itself that has been criticized, but also its name. Benjamin Lee Whorf and Edward Sapir were both well-known 20th-century linguists, but they never actually published a work together on their eponymous “hypothesis.”

There are disagreements about each of the respective linguists’ “true” theories on the matter, too. Although both Whorf and Sapir were at least fair-weather proponents of the idea that was eventually named after them, it’s hard to pinpoint the precise position each man took on the issue.

Myths and unanswered questions aside, we do know the Sapir-Whorf Hypothesis poses some engaging questions about the languages we speak and the way we perceive reality.

Determinism vs. Influence

The Sapir-Whorf Hypothesis posits that language either determines or influences one’s thought. In other words, people who speak different languages see the world differently, based on the language they use to describe it.

There are two differing strands of the hypothesis, “linguistic determinism” and “linguistic influence.” The former refers to the idea that a person’s language “determines” their thinking, either restricting or enabling the ideas they can form. For instance, if a language lacks a word for a certain concept, a linguistic determinist would infer that speakers of that language are not capable of understanding that concept.

Conversely, if a person spoke a language that had multiple words for one concept, linguistic determinists would argue that he or she must have a better understanding of what’s being described.

On the other hand, a proponent of linguistic influence would explain that the language a person speaks has an impact on the way they think, but doesn’t inhibit or control their ability to comprehend experiences.

Numerous Terms For a Single Concept

One of Whorf’s best-known arguments for linguistic determinism stems from his study of the Hopi, a Native American tribe from Arizona. Whorf believed that the tribe spoke without using phrases that referred to time, omitting past or future tenses. This lack of time terminology led him to believe that the tribe lived their lives without abiding by the concept of time at all.

However, it was later determined that Whorf’s claim that the Hopi speak without tense phrases was incorrect.

Another popular example of the Sapir-Whorf hypothesis comes from the observation that the Inuit have many different terms for snow. The thinking, then, was that the Inuit had a better understanding, or more refined perception, of snow thanks to the fact that they had numerous ways to describe it.

This claim, along with many other Whorfian ideas, has been discredited and argued against by linguists of all stripes. For one, it could be contended that English speakers also have a great many terms for snow (sleet, slush, flakes, flurries, etc.) – why wouldn’t Whorf attribute this same understanding of snow to speakers of European languages?

Great In Theory – What About Practice?

Today, many believe that some aspects of perception may be affected by language, but the idea that a mother tongue can restrict understanding has largely been rejected.

Take the word schadenfreude, for example. The term is German, and means to take pleasure in another’s unhappiness. There is no single-word equivalent in English, but it wouldn’t be true to say English speakers have never experienced or would not be able to comprehend this emotion.

So, the fact that there is no word for schadenfreude in the English language doesn’t mean English speakers are “restricted,” or less equipped in their ability to feel or experience what that word describes. This logic seems to undercut the deterministic version of Whorf’s theory.

On the other hand, there’s evidence that the language-associated habits we acquire do play a role in how we view the world. As Guy Deutscher points out in a 2010 New York Times Magazine article, this is especially true for languages that attach genders to inanimate objects.

Deutscher references a study that looked at how German and Spanish speakers view different objects based on their gender association in each respective language. The results showed that when describing things that are grammatically masculine in Spanish, speakers of that language rated them as having more manly characteristics.

These same items, which are grammatically feminine in German, were seen by German speakers as having more feminine characteristics.

The findings suggest speakers of each language have developed preconceived notions of something being more masculine or feminine, not due to the objects’ appearance or characteristics, but because of the way they categorize them in their native language.

Maintaining Relevance

Despite its age, the Sapir-Whorf hypothesis has continued to nudge itself into linguistic conversations, and even pop culture. The idea was just recently revisited in “Arrival,” a science fiction film that explores the ways in which an alien language affects human thinking.

And even if some of its most drastic claims have been debunked, the theory’s continued relevance says something about its importance. Ideas, theories and intellectual musings don’t need to be unequivocally true to remain in the public eye, but they do need to make us think to retain any sort of clout – and Sapir-Whorf has done just that.

The theory doesn’t only make us question linguistic theory, but also our existence itself, and how our perceptions might shape that existence.

There are generalities that we can expect everyone to encounter in their day-to-day life (relationships, work, love, sadness, etc.), but thinking about more granular disparities experienced by those in different circumstances, linguistic or otherwise, helps us to realize that there’s more to the story than ours.

And at the same time, the Sapir-Whorf Hypothesis reiterates the fact that we’re more alike than we are different, no matter what language we speak.



COMMENTS

  1. Sapir-Whorf hypothesis (Linguistic Relativity Hypothesis)

    Since the Sapir-Whorf hypothesis theorizes that our language use shapes our perspective of the world, people who speak different languages have different views of the world. In the 1920s, Benjamin Whorf was a Yale University graduate student studying with linguist Edward Sapir, who was considered the father of American linguistic anthropology.

  2. The Sapir-Whorf Hypothesis: How Language Influences How We Express

    The Sapir-Whorf Hypothesis, also known as linguistic relativity, refers to the idea that the language a person speaks can influence their worldview, thought, and even how they experience and understand the world. While more extreme versions of the hypothesis have largely been discredited, a growing body of research has demonstrated that ...

  3. Sapir-Whorf Hypothesis: Examples, Definition, Criticisms

    Developed in 1929 by Edward Sapir, the Sapir-Whorf hypothesis (also known as linguistic relativity) states that a person's perception of the world around them and how they experience the world is both determined and influenced by the language that they speak. The theory proposes that differences in grammatical and verbal structures, and the ...

  4. Whorfianism

    The central idea of the Sapir-Whorf hypothesis is that language functions, not simply as a device for reporting experience, but also, and more significantly, as a way of defining experience for its speakers. ... Grammatical gender is obligatory in the languages in which it occurs and has been claimed by Whorfians to have persistent and enduring ...

  5. Sapir-Whorf Hypothesis

    The Sapir-Whorf hypothesis, also known as the linguistic relativity hypothesis, refers to the proposal that the particular language one speaks influences the way one thinks about reality. ... Sapir also did pioneering work in language and gender, historical linguistics, psycholinguistics and in the study of a number of native American ...

  6. Grammatical gender and linguistic relativity: A systematic review

    The Sapir-Whorf or linguistic relativity hypothesis (henceforth, simply "relativity"; Whorf, 1956) takes various forms, but at its heart it contends that the idiosyncrasies of the languages we speak influence the way we think about the world. In its strongest incarnation, linguistic determinism, thought is constrained by language, but few if any contemporary scholars take this view ...

  7. Definition and History of the Sapir-Whorf Hypothesis

    The Sapir-Whorf hypothesis is the linguistic theory that the semantic structure of a language shapes or limits the ways in which a speaker forms conceptions of the world. It came about in 1929. The theory is named after the American anthropological linguist Edward Sapir (1884-1939) and his student Benjamin Whorf (1897-1941).

  8. Sapir-Whorf Hypothesis

    Abstract. The Sapir-Whorf hypothesis holds that language plays a powerful role in shaping human consciousness, affecting everything from private thought and perception to larger patterns of behavior in society—ultimately allowing members of any given speech community to arrive at a shared sense of social reality.

  9. The Sapir‐Whorf Hypothesis: A Preliminary History and a Bibliographical

    This article presents a historical overview of linguistic ideas in relation to the Sapir-Whorf Hypothesis. The source of the hypothesis is found in the writings of Wilhelm von Humboldt, and further development is found in the writings of Heymann Steinthal, Franz Boas, Edward Sapir, Benjamin Lee Whorf, Carl Voegelin, and Dell Hymes, among others.

  10. Linguistic Relativity and Sex/Gender Studies: Epistemological and

    Two studies have investigated one purported manifestation of the Sapir-Whorf hypothesis of linguistic relativity—the relationship between gender loading in ... role to language in determining an individual's view of reality. Finally, we raise a number of methodological issues that result from this perspective, and we suggest avenues for ...

  11. Sapir-Whorf Hypothesis

    The Sapir-Whorf hypothesis holds that language plays a powerful role in shaping human consciousness, ... This explores everything from human universals to the cross-cultural differences in the construction of gender, color, space, and other creative practices associated with language, such as storytelling, poetry, or song.

  12. (PDF) Sapir Whorf Hypothesis

    The Sapir-Whorf hypothesis holds that language plays a powerful role in shaping human consciousness, ... Language and the culture of gender: At the intersection of structure, usage, and ideology. In E. Mertz & R. Parmentier (Eds.), Semiotic mediation (pp. 220-259). New York, NY: Academic Press. Whorf, B. L. (1956). Language, thought, and ...

  13. Sapir-Whorf Hypothesis in Linguistic Anthropology

    The Sapir-Whorf Hypothesis and Anthropology. From an anthropological perspective, the Sapir-Whorf Hypothesis is instrumental in exploring cultural diversity, since it suggests that understanding a culture's language is key to understanding their world view. Cultural Understanding: The hypothesis emphasizes the importance of language in ...

  14. The Sapir-Whorf Hypothesis and Probabilistic Inference: Evidence ...

    The Sapir-Whorf hypothesis holds that our thoughts are shaped by our native language, and that speakers of different languages therefore think differently. This hypothesis is controversial in part because it appears to deny the possibility of a universal groundwork for human cognition, and in part because some findings taken to support it have not reliably replicated.

  15. PDF The Sapir-Whorf hypothesis and inference under uncertainty

    The Sapir-Whorf hypothesis holds that human thought is shaped by language, leading speakers of different languages to think differently. This hypothesis has sparked both enthusiasm and controversy, ... The Sapir-Whorf hypothesis holds that the semantic categories of one's native language influence thought, and that as a result speakers of ...

  16. Sapir-Whorf Hypothesis

    The Sapir-Whorf Hypothesis, also known as the linguistic relativity hypothesis, states that the language one knows affects how one thinks about the world. The hypothesis is most strongly associated with Benjamin Lee Whorf, a fire prevention engineer who became a scholar of language under the guidance of linguist and anthropologist Edward Sapir ...

  17. PDF The Sapir-Whorf Hypothesis Today

    Abstract—The Sapir-Whorf's Linguistic Relativity Hypothesis provokes intellectual discussion about the strong impact language has on our perception of the world around us. This paper intends to enliven the still open questions raised by this hypothesis. This is done by considering some of Sapir's, Whorf's, and other scholar's works.

  18. Linguistic relativity

    Linguistic relativity asserts that language influences worldview or cognition. One form of linguistic relativity, linguistic determinism, regards peoples' languages as determining and influencing the scope of cultural perceptions of their surrounding world. Several various colloquialisms refer to linguistic relativism: the Whorf hypothesis; the Sapir-Whorf hypothesis (/ s ə ˌ p ɪər ˈ ...

  19. Sapir-Whorf hypothesis and Gender Identity: Empirical Studies?

    The Sapir-Whorf hypothesis states, briefly put, that linguistic structures affect cognitive processes. I am interested in finding out how much is known about the development of gender identity from this point of view. That is: Do the gender-related structures of each language influence the way children develop such an identity? Ideally, I would like to know about recent empirical studies ...

  20. The Sapir Whorf Hypothesis and Language's Effect on Cognition

    The Sapir-Whorf Hypothesis posits that language either determines or influences one's thought. In other words, people who speak different languages see the world differently, based on the language they use to describe it. There are two differing strands of the hypothesis, "linguistic determinism" and "linguistic influence." The ...