15 Types of Research Methods

Chris Drew (PhD)

Dr. Chris Drew is the founder of the Helpful Professor. He holds a PhD in education and has published over 20 articles in scholarly journals. He is the former editor of the Journal of Learning Development in Higher Education.

Research methods refer to the strategies, tools, and techniques used to gather and analyze data in a structured way in order to answer a research question or investigate a hypothesis (Hammond & Wellington, 2020).

Generally, we place research methods into two categories: quantitative and qualitative. Each has its own strengths and weaknesses, which we can summarize as:

  • Quantitative research can achieve generalizability through scrupulous statistical analysis applied to large sample sizes.
  • Qualitative research achieves deep, detailed, and nuanced accounts of specific case studies, which are not generalizable.

Some researchers, aiming to make the most of both quantitative and qualitative research, employ mixed methods, applying both types of research method in a single study, such as by conducting a statistical survey alongside in-depth interviews to add context to the quantitative findings.

Below, I’ll outline 15 common research methods, including the pros, cons, and examples of each.

Types of Research Methods

Research methods can be broadly categorized into two types: quantitative and qualitative.

  • Quantitative methods involve systematic empirical investigation of observable phenomena via statistical, mathematical, or computational techniques (Schweigert, 2021). The strengths of this approach include its ability to produce reliable results that can be generalized to a larger population, although it can lack depth and detail.
  • Qualitative methods encompass techniques that are designed to provide a deep understanding of a complex issue, often in a specific context, through collection of non-numerical data (Tracy, 2019). This approach often provides rich, detailed insights but can be time-consuming and its findings may not be generalizable.

These can be further broken down into a range of specific research methods and designs:

Primarily Quantitative Methods:
  • Experimental Research
  • Surveys and Questionnaires
  • Longitudinal Studies
  • Cross-Sectional Studies
  • Correlational Research
  • Causal-Comparative Research
  • Meta-Analysis
  • Quasi-Experimental Design

Primarily Qualitative Methods:
  • Case Study
  • Ethnography
  • Phenomenology
  • Historical Research
  • Content Analysis
  • Grounded Theory
  • Action Research
  • Observational Research

Combining the two approaches above, mixed methods research blends elements of both qualitative and quantitative research methods, providing a comprehensive understanding of the research problem. We can further break these down into:

  • Sequential Explanatory Design (QUAN→QUAL): This methodology involves conducting quantitative analysis first, then supplementing it with a qualitative study.
  • Sequential Exploratory Design (QUAL→QUAN): This methodology goes in the other direction, starting with qualitative analysis and ending with quantitative analysis.

Let’s explore some methods and designs from both quantitative and qualitative traditions, starting with qualitative research methods.

Qualitative Research Methods

Qualitative research methods allow for the exploration of phenomena in their natural settings, providing detailed, descriptive responses and insights into individuals’ experiences and perceptions (Howitt, 2019).

These methods are useful when a detailed understanding of a phenomenon is sought.

1. Ethnographic Research

Ethnographic research emerged out of anthropological research, where anthropologists would enter into a setting for a sustained period of time, getting to know a cultural group and taking detailed observations.

Ethnographers would sometimes even act as participants in the group or culture, which many scholars argue is a weakness because it is a step away from achieving objectivity (Stokes & Wall, 2017).

In fact, in its most extreme version, ethnographers even conduct research on themselves, in a fascinating methodology called autoethnography.

The purpose is to understand the culture, social structure, and the behaviors of the group under study. It is often useful when researchers seek to understand shared cultural meanings and practices in their natural settings.

However, it can be time-consuming and may reflect researcher biases due to the immersion approach.

Pros of Ethnographic Research:
1. Provides deep cultural insights
2. Contextually relevant findings
3. Explores dynamic social processes

Cons of Ethnographic Research:
1. Time-consuming
2. Potential researcher bias
3. May compromise objectivity when the researcher participates

Example of Ethnography

Liquidated: An Ethnography of Wall Street by Karen Ho involves an anthropologist who embeds herself with Wall Street firms to study the culture of Wall Street bankers and how this culture affects the broader economy and world.

2. Phenomenological Research

Phenomenological research is a qualitative method focused on the study of individual experiences from the participant’s perspective (Tracy, 2019).

It focuses specifically on people’s experiences in relation to a specific social phenomenon (see here for examples of social phenomena).

This method is valuable when the goal is to understand how individuals perceive, experience, and make meaning of particular phenomena. However, because it is subjective and relies on participants’ self-reported thoughts and feelings, its findings may not be generalizable.

Pros of Phenomenological Research:
1. Provides rich, detailed data
2. Highlights personal experience and perceptions
3. Allows exploration of complex phenomena

Cons of Phenomenological Research:
1. Limited generalizability
2. Data collection can be time-consuming
3. Requires highly skilled researchers

Example of Phenomenological Research

A phenomenological approach to experiences with technology by Sebnem Cilesiz represents a good starting point for formulating a phenomenological study. With its focus on the ‘essence of experience’, this piece presents the methodological, reliability, validity, and data analysis techniques that phenomenologists use to explain how people experience technology in their everyday lives.

3. Historical Research

Historical research is a qualitative method involving the examination of past events to draw conclusions about the present or make predictions about the future (Stokes & Wall, 2017).

As you might expect, it’s common in the research branches of history departments in universities.

This approach is useful in studies that seek to understand the past to interpret present events or trends. However, it relies heavily on the availability and reliability of source materials, which may be limited.

Common data sources include cultural artifacts from both material and non-material culture , which are then examined, compared, contrasted, and contextualized to test hypotheses and generate theories.

Pros of Historical Research:
1. Provides rich insights into past events
2. Can help understand current events or trends
3. Allows the study of change over time

Cons of Historical Research:
1. Dependent on available sources
2. Potential bias in source materials
3. Difficult to replicate

Example of Historical Research

A historical research example might be a study examining the evolution of gender roles over the last century. This research might involve the analysis of historical newspapers, advertisements, letters, and company documents, as well as sociocultural contexts.

4. Content Analysis

Content analysis is a research method that involves systematic and objective coding and interpreting of text or media to identify patterns, themes, ideologies, or biases (Schweigert, 2021).

A content analysis is useful for analyzing communication patterns, helping to reveal how texts such as newspapers, films, political speeches, and other types of ‘content’ contain narratives and biases.

However, interpretations can be very subjective, which often requires scholars to engage in practices such as cross-comparing their coding with peers or external researchers.
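
Where coders cross-compare their work, agreement is often quantified with an inter-coder reliability statistic such as Cohen’s kappa. Below is a minimal illustrative sketch using scikit-learn; the two coders’ labels are invented for demonstration.

```python
# Inter-coder reliability check with Cohen's kappa (illustrative sketch).
# The labels below are invented: two coders categorizing the same six texts.
from sklearn.metrics import cohen_kappa_score

coder_a = ["bias", "neutral", "bias", "bias", "neutral", "bias"]
coder_b = ["bias", "neutral", "neutral", "bias", "neutral", "bias"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1.0 indicate strong agreement
```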

Content analysis can be further broken down into more specific methodologies such as semiotic analysis, multimodal analysis, and discourse analysis.

Pros of Content Analysis:
1. Unobtrusive data collection
2. Allows for large sample analysis
3. Replicable and reliable if done properly

Cons of Content Analysis:
1. Lacks contextual information
2. Potential coder bias
3. May overlook nuances

Example of Content Analysis

How is Islam Portrayed in Western Media? by Poorebrahim and Zarei (2013) employs a type of content analysis called critical discourse analysis (common in poststructuralist and critical theory research). The study combs through a corpus of western media texts to explore the language forms used in relation to Islam and Muslims, finding that they are overly stereotyped, which may represent anti-Islam bias or a failure to understand the Islamic world.

5. Grounded Theory Research

Grounded theory involves developing a theory during and after data collection, rather than beforehand.

This is in contrast to most academic research studies, which start with a hypothesis or theory and then test it through a study, often framed as a null hypothesis (disproving the theory) and an alternative hypothesis (supporting the theory).

Grounded theory is useful because it keeps an open mind about what the data might reveal. However, it can be time-consuming and requires rigorous data analysis (Tracy, 2019).

Pros of Grounded Theory Research:
1. Helps with theory development
2. Rigorous data analysis
3. Can fill gaps in existing theories

Cons of Grounded Theory Research:
1. Time-consuming
2. Requires iterative data collection and analysis
3. Requires skilled researchers

Grounded Theory Example

Developing a Leadership Identity by Komives et al. (2005) employs a grounded theory approach to develop a thesis based on the data rather than testing a hypothesis. The researchers studied the leadership identity of 13 college students taking on leadership roles. Based on their interviews, the researchers theorized that the students’ leadership identities shifted from a hierarchical view of leadership to one that embraced leadership as a collaborative concept.

6. Action Research

Action research is an approach which aims to solve real-world problems and bring about change within a setting. The study is designed to solve a specific problem – or in other words, to take action (Patten, 2017).

This approach can involve mixed methods, but is generally qualitative because it usually involves the study of a specific case study wherein the researcher works, e.g. a teacher studying their own classroom practice to seek ways they can improve.

Action research is very common in fields like education and nursing where practitioners identify areas for improvement then implement a study in order to find paths forward.

Pros of Action Research:
1. Addresses real-world problems and seeks to find solutions
2. Integrates research and action in an action-research cycle
3. Can bring about positive change in isolated instances, such as in a school or nursery setting

Cons of Action Research:
1. Time-consuming and often hard to fit into a practitioner’s already busy schedule
2. Requires collaboration between researcher, practitioner, and research participants
3. Complexity of managing dual roles (where the researcher is also often the practitioner)

Action Research Example

Using Digital Sandbox Gaming to Improve Creativity Within Boys’ Writing by Ellison and Drew was a research study one of my research students completed in his own classroom under my supervision. He implemented a digital game-based approach to literacy teaching with boys and interviewed his students to see if the use of games as stimuli for storytelling helped draw them into the learning experience.

7. Natural Observational Research

Observational research can also be quantitative (see: experimental research), but in naturalistic settings in the social sciences, researchers tend to employ qualitative data collection methods, such as interviews and field notes, to observe people in their day-to-day environments.

This approach involves the observation and detailed recording of behaviors in their natural settings (Howitt, 2019). It can provide rich, in-depth information, but the researcher’s presence might influence behavior.

While observational research has some overlaps with ethnography (especially in regard to data collection techniques), it tends not to be as sustained as ethnography, e.g. a researcher might do 5 observations, every second Monday, as opposed to being embedded in an environment.

Pros of Qualitative Observational Research:
1. Captures behavior in natural settings, allowing for interesting insights into authentic behaviors
2. Can provide rich, detailed data through the researcher’s vignettes
3. Non-invasive, because researchers observe natural activities rather than interfering with research participants

Cons of Qualitative Observational Research:
1. Researcher’s presence may influence behavior
2. Can be time-consuming
3. Requires skilled and trained observers

Observational Research Example

A researcher might use qualitative observational research to study the behaviors and interactions of children at a playground. The researcher would document the behaviors observed, such as the types of games played, levels of cooperation, and instances of conflict.

8. Case Study Research

Case study research is a qualitative method that involves a deep and thorough investigation of a single individual, group, or event in order to explore facets of that phenomenon that cannot be captured using other methods (Stokes & Wall, 2017).

Case study research is especially valuable in providing contextualized insights into specific issues, facilitating the application of abstract theories to real-world situations (Patten, 2017).

However, findings from a case study may not be generalizable due to the specific context and the limited number of cases studied (Walliman, 2021).

Pros of Case Study Research:
1. Provides detailed insights
2. Facilitates the study of complex phenomena
3. Can test or generate theories

Cons of Case Study Research:
1. Limited generalizability
2. Can be time-consuming
3. Subject to observer bias

See More: Case Study Advantages and Disadvantages

Example of a Case Study

Scholars conduct a detailed exploration of the implementation of a new teaching method within a classroom setting. The study focuses on how the teacher and students adapt to the new method, the challenges encountered, and the outcomes on student performance and engagement. While the study provides specific and detailed insights of the teaching method in that classroom, it cannot be generalized to other classrooms, as statistical significance has not been established through this qualitative approach.

Quantitative Research Methods

Quantitative research methods involve the systematic empirical investigation of observable phenomena via statistical, mathematical, or computational techniques (Pajo, 2022). The focus is on gathering numerical data and generalizing it across groups of people or to explain a particular phenomenon.

9. Experimental Research

Experimental research is a quantitative method where researchers manipulate one variable to determine its effect on another (Walliman, 2021).

This is common, for example, in high-school science labs, where students are asked to introduce a variable into a setting in order to examine its effect.

This type of research is useful in situations where researchers want to determine causal relationships between variables. However, experimental conditions may not reflect real-world conditions.

Pros of Experimental Research:
1. Allows for determination of causality
2. Allows for the study of phenomena in highly controlled environments to minimize research contamination
3. Can be replicated, so other researchers can test and verify the results

Cons of Experimental Research:
1. Might not reflect real-world conditions
2. Can be costly and time-consuming to create a controlled environment
3. Ethical concerns need to be addressed, as the research directly manipulates variables

Example of Experimental Research

A researcher may conduct an experiment to determine the effects of a new educational approach on student learning outcomes. Students would be randomly assigned to either the control group (traditional teaching method) or the experimental group (new educational approach).
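
To sketch the underlying logic (not any specific study’s procedure), the snippet below randomly assigns simulated students to a control or experimental group and compares group outcomes with an independent-samples t-test; all scores are invented.

```python
# Illustrative sketch of an experiment: random assignment, then a t-test.
# All participants and scores are simulated for demonstration only.
import random
from scipy import stats

random.seed(42)
students = [f"student_{i}" for i in range(40)]
random.shuffle(students)                        # random assignment
control, experimental = students[:20], students[20:]

# Simulated outcomes; in a real study these would be measured results
control_scores = [random.gauss(70, 8) for _ in control]
experimental_scores = [random.gauss(75, 8) for _ in experimental]

t_stat, p_value = stats.ttest_ind(experimental_scores, control_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```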

10. Surveys and Questionnaires

Surveys and questionnaires are quantitative methods that involve asking research participants structured and predefined questions to collect data about their attitudes, beliefs, behaviors, or characteristics (Patten, 2017).

Surveys are beneficial for collecting data from large samples, but they depend heavily on the honesty and accuracy of respondents.

They tend to be seen as more authoritative than their qualitative counterparts, semi-structured interviews, because the data is quantifiable (e.g. a questionnaire where information is presented on a scale from 1 to 10 can allow researchers to determine and compare statistical means, averages, and variations across sub-populations in the study).

Pros of Surveys and Questionnaires:
1. Data can be gathered from larger samples than is possible in qualitative research
2. The data is quantifiable, allowing for comparison across subpopulations
3. Can be cost-effective and time-efficient

Cons of Surveys and Questionnaires:
1. Heavy dependence on respondent honesty
2. Limited depth of response compared to qualitative approaches
3. Static, with no flexibility to explore responses (unlike semi-structured or unstructured interviewing)

Example of a Survey Study

A company might use a survey to gather data about employee job satisfaction across its offices worldwide. Employees would be asked to rate various aspects of their job satisfaction on a Likert scale. While this method provides a broad overview, it may lack the depth of understanding possible with other methods (Stokes & Wall, 2017).
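
As a rough sketch of how quantifiable survey data supports comparison across subpopulations, the snippet below computes mean Likert-scale satisfaction per office with pandas; the responses are invented for illustration.

```python
# Comparing mean Likert-scale satisfaction across subpopulations (sketch).
# The survey responses below are invented for demonstration.
import pandas as pd

responses = pd.DataFrame({
    "office": ["NY", "NY", "London", "London", "Tokyo", "Tokyo"],
    "satisfaction": [7, 8, 5, 6, 9, 8],   # ratings on a 1-10 scale
})

print(responses.groupby("office")["satisfaction"].agg(["mean", "std", "count"]))
```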

11. Longitudinal Studies

Longitudinal studies involve repeated observations of the same variables over extended periods (Howitt, 2019). These studies are valuable for tracking development and change but can be costly and time-consuming.

With multiple data points collected over extended periods, it’s possible to examine continuous changes within things like population dynamics or consumer behavior. This makes a detailed analysis of change possible.

[Figure: a visual representation of a longitudinal study, showing that data is collected from one sample at multiple points over time so researchers can examine how variables change.]

Perhaps the most relatable example of a longitudinal study is a national census, which is taken on the same day every few years, to gather comparative demographic data that can show how a nation is changing over time.

While longitudinal studies are commonly quantitative, there are also qualitative examples, such as the famous 7 Up study from the UK, which has followed 14 individuals, interviewing them every seven years to explore their development over their lives.

Pros of Longitudinal Studies:
1. Tracks changes over time, allowing for comparison of past to present events
2. Can identify sequences of events (though causality is often harder to determine)

Cons of Longitudinal Studies:
1. Almost by definition time-consuming, because time needs to pass between each data collection session
2. High risk of participant dropout over time as participants move on with their lives

Example of a Longitudinal Study

A national census, taken every few years, uses surveys to develop longitudinal data, which is then compared and analyzed to present accurate trends over time. Trends a census can reveal include changes in religiosity, values and attitudes on social issues, and much more.
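
To illustrate the structure of longitudinal data, here is a toy pandas sketch in which the same regions are measured at repeated time points, so change can be computed within each region; the figures are invented.

```python
# Toy longitudinal dataset: repeated measurements of the same regions,
# allowing change over time to be computed per region. Figures invented.
import pandas as pd

census = pd.DataFrame({
    "region":     ["A", "B", "A", "B", "A", "B"],
    "year":       [2010, 2010, 2015, 2015, 2020, 2020],
    "population": [1.00, 2.10, 1.08, 2.05, 1.15, 2.02],  # millions
})

census = census.sort_values(["region", "year"])
census["pct_change"] = census.groupby("region")["population"].pct_change() * 100
print(census)
```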

12. Cross-Sectional Studies

Cross-sectional studies are a quantitative research method that involves analyzing data from a population at a specific point in time (Patten, 2017). They provide a snapshot of a situation but cannot determine causality.

This design is used to measure and compare the prevalence of certain characteristics or outcomes in different groups within the sampled population.

[Figure: a visual representation of a cross-sectional study, showing that data is collected at a single point in time and groups can be compared within the sample.]

The major advantage of cross-sectional design is its ability to measure a wide range of variables simultaneously without needing to follow up with participants over time.

However, cross-sectional studies do have limitations. This design can show associations or correlations between variables, but it cannot establish cause-and-effect relationships, temporal sequence, or changes and trends over time.

Pros of Cross-Sectional Studies:
1. Quick and inexpensive, with no long-term commitment required
2. Good for descriptive analyses

Cons of Cross-Sectional Studies:
1. Cannot determine causality, because it is a simple snapshot with no time delay between data collection points
2. Does not allow researchers to follow up with research participants

Example of a Cross-Sectional Study

Our longitudinal study example of a national census also happens to contain cross-sectional design. One census is cross-sectional, displaying only data from one point in time. But when a census is taken once every few years, it becomes longitudinal, and so long as the data collection technique remains unchanged, identification of changes will be achievable, adding another time dimension on top of a basic cross-sectional study.
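
As a simple sketch, cross-sectional analyses often compare the prevalence of an outcome across groups captured at a single point in time, for instance with a chi-square test; the counts below are invented.

```python
# Chi-square test comparing outcome prevalence between two groups observed
# at one point in time (cross-sectional sketch; counts are invented).
from scipy.stats import chi2_contingency

#        outcome present, outcome absent
table = [[30, 70],   # group 1
         [45, 55]]   # group 2

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```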

13. Correlational Research

Correlational research is a quantitative method that seeks to determine if and to what degree a relationship exists between two or more quantifiable variables (Schweigert, 2021).

This approach provides a fast and easy way to form initial hypotheses based on positive or negative correlation trends observed within a dataset.

While correlational research can reveal relationships between variables, it cannot establish causality.

Methods used for data analysis may include statistical correlations such as Pearson’s or Spearman’s.

Pros of Correlational Research:
1. Reveals relationships between variables
2. Can use existing data
3. Can guide further experimental research

Cons of Correlational Research:
1. Cannot determine causality
2. May be affected by confounding variables
3. Correlation may be coincidental

Example of Correlational Research

A team of researchers is interested in studying the relationship between the amount of time students spend studying and their academic performance. They gather data from a high school, measuring the number of hours each student studies per week and their grade point averages (GPAs) at the end of the semester. Upon analyzing the data, they find a positive correlation, suggesting that students who spend more time studying tend to have higher GPAs.
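
Here is a minimal sketch of such an analysis, computing Pearson’s and Spearman’s coefficients with scipy; the hours and GPA values below are invented stand-ins for the study’s data.

```python
# Pearson and Spearman correlations between study hours and GPA (sketch).
# The data points are invented for illustration.
from scipy import stats

hours = [2, 5, 1, 8, 4, 7, 3, 6]
gpa   = [2.8, 3.4, 2.5, 3.9, 3.1, 3.7, 3.0, 3.5]

r, p_r = stats.pearsonr(hours, gpa)
rho, p_rho = stats.spearmanr(hours, gpa)
print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```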

14. Quasi-Experimental Design Research

Quasi-experimental design research is a quantitative research method that is similar to experimental design but lacks the element of random assignment to treatment or control.

Instead, quasi-experimental designs typically rely on certain other methods to control for extraneous variables.

The term ‘quasi-experimental’ implies that the experiment resembles a true experiment, but it is not exactly the same because it doesn’t meet all the criteria for a ‘true’ experiment, specifically in terms of control and random assignment.

Quasi-experimental design is useful when researchers want to study a causal hypothesis or relationship, but practical or ethical considerations prevent them from manipulating variables and randomly assigning participants to conditions.

Pros of Quasi-Experimental Design:
1. More feasible to implement than true experiments
2. Can be conducted in real-world settings, making the findings more applicable to the real world
3. Useful when it’s unethical or impossible to manipulate the independent variable or randomly assign participants

Cons of Quasi-Experimental Design:
1. Without random assignment, it’s harder to rule out confounding variables
2. The lack of random assignment may limit the internal validity of the study
3. More difficult to establish a cause-effect relationship due to the potential for confounding variables

Example of Quasi-Experimental Design

A researcher wants to study the impact of a new math tutoring program on student performance. However, ethical and practical constraints prevent random assignment to the “tutoring” and “no tutoring” groups. Instead, the researcher compares students who chose to receive tutoring (experimental group) to similar students who did not choose to receive tutoring (control group), controlling for other variables like grade level and previous math performance.
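
One common way to control for extraneous variables without random assignment is regression adjustment. The sketch below, with invented data (not the study’s actual analysis), estimates the tutoring effect while holding prior math performance constant, using statsmodels.

```python
# Regression adjustment in a quasi-experimental design (illustrative sketch).
# All data are invented; 'tutoring' is 1 for self-selected tutoring students.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "score":      [72, 85, 68, 90, 75, 88, 70, 83],
    "tutoring":   [0, 1, 0, 1, 0, 1, 0, 1],
    "prior_math": [70, 80, 65, 85, 72, 84, 69, 79],
})

# Estimate the tutoring effect while holding prior performance constant
model = smf.ols("score ~ tutoring + prior_math", data=df).fit()
print(model.params["tutoring"], model.pvalues["tutoring"])
```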

Related: Examples and Types of Random Assignment in Research

15. Meta-Analysis Research

Meta-analysis statistically combines the results of multiple studies on a specific topic to yield a more precise estimate of the effect size. It’s the gold standard of secondary research.

Meta-analysis is particularly useful when there are numerous studies on a topic, and there is a need to integrate the findings to draw more reliable conclusions.

Some meta-analyses can identify flaws or gaps in a corpus of research, which can make them highly influential in academic research despite involving no primary data collection.

However, they tend only to be feasible when there is a sizable corpus of high-quality and reliable studies into a phenomenon.

Pros of Meta-Analysis:
1. Increased statistical power: by combining data from multiple studies, meta-analysis increases the statistical power to detect effects
2. Greater precision: provides more precise estimates of effect sizes by reducing the influence of random error
3. Resolving discrepancies: can help resolve disagreements between different studies on a topic

Cons of Meta-Analysis:
1. Publication bias: studies with null or negative findings are less likely to be published, leading to an overestimation of effect sizes
2. Quality of studies: the reliability of a meta-analysis depends on the quality of the studies included
3. Heterogeneity: differences in study design, sample, or procedures can introduce heterogeneity, complicating interpretation of results

Example of a Meta-Analysis

The power of feedback revisited (Wisniewski, Zierer & Hattie, 2020) is a meta-analysis that examines 435 empirical studies on the effects of feedback on student learning. The authors use a random-effects model to ascertain whether there is a clear effect size across the literature, finding that feedback tends to impact cognitive and motor skill outcomes but has less of an effect on motivational and behavioral outcomes.
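
To make the pooling idea concrete, here is a simplified sketch of inverse-variance weighting, the building block of meta-analytic effect estimates. A full random-effects model, as used by Wisniewski et al., additionally estimates between-study variance; the effect sizes below are invented.

```python
# Fixed-effect (inverse-variance) pooling of study effect sizes (sketch).
# Effect sizes and variances are invented for illustration.
effects   = [0.40, 0.55, 0.30, 0.70]   # standardized effect size per study
variances = [0.02, 0.05, 0.01, 0.04]   # sampling variance per study

weights = [1 / v for v in variances]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se = (1 / sum(weights)) ** 0.5
print(f"pooled effect = {pooled:.2f}, "
      f"95% CI = [{pooled - 1.96*se:.2f}, {pooled + 1.96*se:.2f}]")
```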

Choosing a research method requires careful consideration of what you want to achieve, your research paradigm, and the methodology most suited to what you are studying. There are many more types of research methods than I have been able to present here. Generally, it’s recommended that you work with an experienced researcher or research supervisor to identify a suitable research method for the study at hand.

References

Hammond, M., & Wellington, J. (2020). Research methods: The key concepts. New York: Routledge.

Howitt, D. (2019). Introduction to qualitative research methods in psychology. London: Pearson UK.

Pajo, B. (2022). Introduction to research methods: A hands-on approach. New York: Sage Publications.

Patten, M. L. (2017). Understanding research methods: An overview of the essentials. New York: Sage.

Schweigert, W. A. (2021). Research methods in psychology: A handbook. Los Angeles: Waveland Press.

Stokes, P., & Wall, T. (2017). Research methods. New York: Bloomsbury Publishing.

Tracy, S. J. (2019). Qualitative research methods: Collecting evidence, crafting analysis, communicating impact. London: John Wiley & Sons.

Walliman, N. (2021). Research methods: The basics. London: Routledge.



Research methods: quantitative, qualitative, and more (overview)


About Research Methods

This guide provides an overview of research methods, how to choose and use them, and supports and resources at UC Berkeley. 

As Patten and Newhart note in the book Understanding Research Methods , "Research methods are the building blocks of the scientific enterprise. They are the "how" for building systematic knowledge. The accumulation of knowledge through research is by its nature a collective endeavor. Each well-designed study provides evidence that may support, amend, refute, or deepen the understanding of existing knowledge...Decisions are important throughout the practice of research and are designed to help researchers collect evidence that includes the full spectrum of the phenomenon under study, to maintain logical rules, and to mitigate or account for possible sources of bias. In many ways, learning research methods is learning how to see and make these decisions."

The choice of methods varies by discipline, by the kind of phenomenon being studied and the data being used to study it, by the technology available, and more.  This guide is an introduction, but if you don't see what you need here, always contact your subject librarian, and/or take a look to see if there's a library research guide that will answer your question. 

Suggestions for changes and additions to this guide are welcome! 

START HERE: SAGE Research Methods

Without question, the most comprehensive resource available from the library is SAGE Research Methods. An online guide to this one-stop collection is available, and some helpful links are below:

  • SAGE Research Methods
  • Little Green Books (Quantitative Methods)
  • Little Blue Books (Qualitative Methods)
  • Dictionaries and Encyclopedias
  • Case studies of real research projects
  • Sample datasets for hands-on practice
  • Streaming video: see methods come to life
  • Methodspace, a community for researchers
  • SAGE Research Methods Course Mapping

Library Data Services at UC Berkeley

Library Data Services Program and Digital Scholarship Services

The LDSP offers a variety of services and tools! From this link, check out pages for each of the following topics: discovering data, managing data, collecting data, GIS data, text data mining, publishing data, digital scholarship, open science, and the Research Data Management Program.

Be sure also to check out the visual guide to where to seek assistance on campus with any research question you may have!

Library GIS Services

Other Data Services at Berkeley

  • D-Lab: supports Berkeley faculty, staff, and graduate students with research in data-intensive social science, including a wide range of training and workshop offerings.
  • Dryad: a simple self-service tool for researchers to use in publishing their datasets. It provides tools for the effective publication of and access to research data.
  • Geospatial Innovation Facility (GIF): provides leadership and training across a broad array of integrated mapping technologies on campus.
  • Research Data Management: a UC Berkeley guide and consulting service for research data management issues.

General Research Methods Resources

Here are some general resources for assistance:

  • Assistance from ICPSR (must create an account to access): Getting Help with Data, and Resources for Students
  • Wiley Stats Ref, for background information on statistics topics
  • Survey Documentation and Analysis (SDA): a program for easy web-based analysis of survey data

Consultants

  • D-Lab/Data Science Discovery Consultants Request help with your research project from peer consultants.
  • Research data (RDM) consulting Meet with RDM consultants before designing the data security, storage, and sharing aspects of your qualitative project.
  • Statistics Department Consulting Services A service in which advanced graduate students, under faculty supervision, are available to consult during specified hours in the Fall and Spring semesters.

Related Resources

  • IRB / CPHS: qualitative research projects with human subjects often require that you go through an ethics review.
  • OURS (Office of Undergraduate Research and Scholarships): supports undergraduates who want to embark on research projects and assistantships. In particular, check out their "Getting Started in Research" workshops.
  • Sponsored Projects: works with researchers applying for major external grants.

Research Methods Help Guide


More Information

  • Qualitative vs Quantitative LibGuide by the Ebling Library, Health Sciences Learning Center at the University of Wisconsin-Madison.
  • Differences Between Qualitative and Quantitative Research Methods Table comparing qualitative and quantitative research methods, created by the Oak Ridge Institute for Science and Education.
  • Nursing Research: Quantitative and Qualitative Research Information provided by the University of Texas Arlington Libraries.


  • Types of Variables From the UF Biostatistics Open Learning Textbook.
  • Qualitative vs Quantitative Methods: Two Opposites that Make a Perfect Match Article discussing the different philosophies behind qualitative and quantitative methods, and an example of how to blend them in the health sciences.

Database Guides

  • ERIC Search Guide, by Ramces Marsilli
  • PsycINFO Guide, by Sarah J. Hammill

Studies can use quantitative data, qualitative data, or both types of data. Each approach has advantages and disadvantages. Explore the resources listed above for more information.

Of the available library databases, only ERIC (for education topics) and PsycINFO (for psychology topics) allow you to limit your results by the type of data a study uses.

Note: database limits are helpful but not perfect. Rely on your own judgment when determining if data match the type you are seeking.

Quantitative data

Numerical data. Quantitative variables can be continuous or discrete.

  • Continuous: the variable can, in theory, be any value within a certain range. Can be measured.
  • Discrete: the variable can only have certain values, usually whole numbers. Can be counted.
  • How to Analyze Quantitative Data

Qualitative data

Non-numerical data. Qualitative variables can be nominal or ordinal.

  • Nominal: the variable does not have a specific order.
  • Ordinal: the variable has a specific order.
  • How to Analyze Qualitative Data


Research Methods: What are research methods?


What are research methods?

Research methods are the strategies, processes or techniques utilized in the collection of data or evidence for analysis in order to uncover new information or create better understanding of a topic.

There are different types of research methods which use different tools for data collection.

Types of research

  • Qualitative Research
  • Quantitative Research
  • Mixed Methods Research

Qualitative Research gathers data about lived experiences, emotions or behaviours, and the meanings individuals attach to them. It assists in enabling researchers to gain a better understanding of complex concepts, social interactions or cultural phenomena. This type of research is useful in the exploration of how or why things have occurred, interpreting events and describing actions.

Quantitative Research gathers numerical data which can be ranked, measured or categorised through statistical analysis. It assists with uncovering patterns or relationships, and for making generalisations. This type of research is useful for finding out how many, how much, how often, or to what extent.

Mixed Methods Research integrates both Qualitative and Quantitative Research. It provides a holistic approach, combining and analysing statistical data alongside deeper, contextualised insights. Using Mixed Methods also enables triangulation, or verification, of the data from two or more sources.

Finding Mixed Methods research in the Databases 

“mixed model*” OR “mixed design*” OR “multiple method*” OR multimethod* OR triangulat*

Data collection tools

Techniques or tools used for gathering research data include:

Qualitative techniques or tools:
  • Interviews: these can be structured, semi-structured or unstructured in-depth sessions with the researcher and a participant.
  • Focus groups: several participants discussing a particular topic or a set of questions. Researchers can be facilitators or observers.
  • Observations: on-site, in-context or role-play options.
  • Document analysis: interrogation of correspondence (letters, diaries, emails etc) or reports.
  • Oral histories: remembrances or memories of experiences told to the researcher.

Quantitative techniques or tools:
  • Surveys or questionnaires: ask the same questions of large numbers of participants, or use Likert scales which measure opinions as numerical data.
  • Observation: counting the number of times a specific phenomenon occurs, or coding observational data to translate it into numbers.
  • Document screening: sourcing numerical data from financial reports or counting word occurrences.
  • Experiments: testing hypotheses in laboratories, testing cause and effect relationships through field experiments, or via quasi- or natural experiments.

SAGE research methods

  • SAGE Research Methods online: a research methods tool to help researchers gather full-text resources, design research projects, understand a particular method and write up their research. Includes access to collections of video, business cases and eBooks.

Choosing the Right Research Methodology: A Guide for Researchers


Choosing an optimal research methodology is crucial for the success of any research project. The methodology you select will determine the type of data you collect, how you collect it, and how you analyse it. Understanding the different types of research methods available along with their strengths and weaknesses, is thus imperative to make an informed decision.

Understanding different research methods:

There are several research methods available, depending on the type of study you are conducting, i.e., whether it is laboratory-based, clinical, epidemiological, or survey-based. Some common methodologies include qualitative research, quantitative research, experimental research, survey-based research, and action research. Each method can be selected and modified depending on the research hypotheses and objectives.

Qualitative vs quantitative research:

When deciding on a research methodology, one of the key factors to consider is whether your research will be qualitative or quantitative. Qualitative research is used to understand people’s experiences, concepts, thoughts, or behaviours. Quantitative research, on the contrary, deals with numbers, graphs, and charts, and is used to test or confirm hypotheses, assumptions, and theories.

Qualitative research methodology:

Qualitative research is often used to examine issues that are not well understood, and to gather additional insights on these topics. Qualitative research methods include open-ended survey questions, observations of behaviours described through words, and reviews of literature that has explored similar theories and ideas. These methods are used to understand how language is used in real-world situations, identify common themes or overarching ideas, and describe and interpret various texts. Data analysis for qualitative research typically includes discourse analysis, thematic analysis, and textual analysis. 

Quantitative research methodology:

The goal of quantitative research is to test hypotheses, confirm assumptions and theories, and determine cause-and-effect relationships. Quantitative research methods include experiments, close-ended survey questions, and countable and numbered observations. Data analysis for quantitative research relies heavily on statistical methods.

Analysing qualitative vs quantitative data:

The methods used for data analysis also differ for qualitative and quantitative research. As mentioned earlier, quantitative data is generally analysed using statistical methods and does not leave much room for speculation. It is more structured and follows a predetermined plan. In quantitative research, the researcher starts with a hypothesis and uses statistical methods to test it. In contrast, methods used for qualitative data analysis identify patterns and themes within the data, rather than providing statistical measures of it. It is an iterative process, where the researcher goes back and forth, trying to gauge the larger implications of the data through different perspectives and revising the analysis if required.

When to use qualitative vs quantitative research:

The choice between qualitative and quantitative research will depend on the gap that the research project aims to address, and specific objectives of the study. If the goal is to establish facts about a subject or topic, quantitative research is an appropriate choice. However, if the goal is to understand people’s experiences or perspectives, qualitative research may be more suitable. 

Conclusion:

In conclusion, an understanding of the different research methods available, their applicability, advantages, and disadvantages is essential for making an informed decision on the best methodology for your project. If you need any additional guidance on which research methodology to opt for, you can head over to Elsevier Author Services (EAS). EAS experts will guide you throughout the process and help you choose the perfect methodology for your research goals.


Types of Research Methods: Examples and Tips


What are research methods?

Research methods are the techniques and procedures used to collect and analyze data in order to answer research questions and test a research hypothesis. There are several different types of research methods, each with its own strengths and weaknesses.

Common Types of Research Methods

There are several main types of research methods that are employed in academic articles. The type of research method applied depends on the nature of the data to be collected and analyzed, as well as any restrictions or limitations that dictate the study’s resources and methodology. Surveying articles from your target journal and identifying the methods commonly used in these studies is also recommended before choosing a research method or methods.

Surveys are a type of research method that involve collecting data from a large number of people through questionnaires or interviews. Surveys are often used to gather information about attitudes, beliefs, and behaviors.
Experiments are a type of research method that involve manipulating one or more variables in order to observe the effect on another variable. Experiments are often used to test cause-and-effect relationships.
Case studies are a type of research method that involve an in-depth examination of a single individual, group, or event. Case studies are often used to gather detailed information about a specific phenomenon.
Observations are a type of research method that involve watching and recording the behavior of individuals or groups. Observations are often used to gather information about naturalistic behavior.
Content analysis is a type of research method that involves analyzing and interpreting written or spoken text. Content analysis is often used to analyze large amounts of data, such as news articles or social media posts.
Historical research is a type of research method that involves studying the past through the examination of primary and secondary sources, such as documents, artifacts, and photographs.

It’s important to note that research methods can be combined for a more complete understanding of a research question or hypothesis. For example, an experiment can be followed by a survey to gather more information about participants’ attitudes and behaviors.

Overall, the choice of research method depends on the research question, the type of data needed, and the resources available to the researcher.

Data Collection Methods

Data is information collected in order to answer research questions. The kind of data you choose to collect will depend on the nature of your research question and the aims of your study. There are a few main category distinctions of data a researcher can collect.

Quantitative vs qualitative data

Qualitative and quantitative data are two types of data that are often used in research studies. They are different in terms of their characteristics, how they are collected, and how they are analyzed.

Quantitative data is numerical and is collected through methods such as surveys, polls, and experiments. It is often used to measure and describe the characteristics of a large group of people or objects. This data can be analyzed using statistical methods to find patterns and trends.

Qualitative data, on the other hand, is non-numerical and is collected through methods such as interviews, observations, and focus groups. It is often used to understand the experiences, attitudes, and perceptions of individuals or small groups. This data is analyzed using methods such as content analysis, thematic analysis, and discourse analysis to identify patterns and themes.

Overall, quantitative data provides a more objective and generalizable understanding of a phenomenon, while qualitative data provides a more subjective and in-depth understanding. Both types of data are important and can be used together to gain a more comprehensive understanding of a topic.
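
As a toy illustration of the counting step that can support the content or thematic analysis mentioned above, the sketch below tallies predefined theme keywords across invented interview snippets; real qualitative analysis involves much richer interpretive coding.

```python
# Tallying predefined theme keywords in interview snippets (toy sketch).
# Transcripts and theme keywords are invented; matching is by substring.
from collections import Counter

transcripts = [
    "I felt supported by my manager and the team",
    "workload stress made the team meetings tense",
    "stress again this week, and very little support",
]
themes = {"support": ["support"], "stress": ["stress", "tense"]}

counts = Counter()
for text in transcripts:
    for theme, keywords in themes.items():
        counts[theme] += sum(text.lower().count(kw) for kw in keywords)
print(counts)  # e.g. Counter({'stress': 3, 'support': 2})
```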

Qualitative data:
  • Methods can be adjusted as the study progresses to answer different questions.
  • Can be conducted with a smaller study or sample size.
  • No statistical analysis or application to wider populations or phenomena.
  • Higher risk of research bias, as it is more difficult to standardize metrics.

Quantitative data:
  • Very systematic and specific in yielding data.
  • Knowledge generated is testable and reproducible.
  • Requires an understanding of statistics to analyze data.
  • Larger sample sizes are needed to yield relevant data.

You can also make use of both qualitative and quantitative research methods in your study.

Primary vs secondary data

Primary and secondary research are two different types of research methods that are used in the field of academia and market research. Both primary and secondary sources can be applied in most studies.

Primary research is research that is conducted by the individual or organization themselves. It involves collecting original data through methods such as surveys, interviews, or experiments. The data collected through primary research is specific to the research question and objectives, and is not typically available through other sources.

Secondary research, on the other hand, involves the use of existing data that has already been collected by someone else. This can include data from government reports, academic journals, or industry publications. The advantage of secondary research is that it is typically less time-consuming and less expensive than primary research, as the data has already been collected. However, the data may not be as specific or relevant to the research question and objectives.

The choice between using primary and secondary research will depend on the research question, study budget, and time constraints of the project, as well as the target journal to which you are submitting your manuscript.

Primary data:
  • Can more directly answer your research question.
  • Researcher has more control over the constraints and controls of the data.
  • Takes significant time and resources to collect.
  • Requires a strong understanding of how to collect data.

Secondary data:
  • Much more convenient and faster to access.
  • Data can be collected from various time frames and locations.
  • No ability to adjust or control how data is created.
  • Takes longer to process and verify as relevant data.

Experimental vs descriptive data collection

Experimental data is collected through a controlled experiment, in which the researcher manipulates one or more variables to observe the effect on another variable. The goal of experimental data is to determine cause-and-effect relationships. For example, in a study on the effectiveness of a new drug for treating a certain condition, the researchers would randomly assign participants to either a group that receives the drug or a group that receives a placebo, and then compare the outcomes between the two groups. The data collected in this study would be considered experimental data.

Descriptive data, on the other hand, is data that is collected through observation or surveys and is used to describe the characteristics of a population or phenomenon. The goal of descriptive data is to provide a snapshot of the current state of a certain population or phenomenon, rather than to determine cause-and-effect relationships. For example, in a study on the dietary habits of a certain population, the researchers would collect data on what types of food the participants typically eat and how often they eat them. This data would be considered descriptive data.

In summary, experimental data is collected through a controlled experiment to determine cause-and-effect relationships, while descriptive data is collected through observation or surveys to describe the characteristics of a population or phenomenon.

Descriptive data examples:

  • A survey that asks people about their favorite type of music
  • A census that counts the number of people living in a certain area
  • A poll that asks people about their political affiliation

Experimental data examples:

  • A study comparing the effectiveness of two different medications for treating a certain condition
  • An experiment measuring the effect of different levels of a certain chemical on plant growth
  • A clinical trial comparing the side effects of a new treatment to a standard treatment for a disease

Examples of Different Data Collection Methods

Method | Primary or secondary | Quantitative or qualitative | Purpose
Experiment | Primary | Quantitative | To test causal relationships.
Case study | Either | Either | To analyze a specific case in depth, often when you do not have the resources to perform a study with a large sample group.
Observation | Primary | Either | To analyze how a phenomenon functions in a natural state.
Literature review | Secondary | Either | To position your work in a body of research and/or uncover trends within a research topic.
Content analysis | Secondary | Either | To determine the presence of certain words, themes, or concepts within some given qualitative data, often text.

Prepare Your Manuscript With Professional Editing

To ensure your methods are accurately articulated in your study and that your work is free of language errors, consider receiving professional proofreading services. Wordvice specializes in paper editing and manuscript editing for any kind of academic document.

To receive free AI proofreading for academic writing in real time, try Wordvice AI and compare it to the leading AI writing assistant.


How To Choose Your Research Methodology

Qualitative vs quantitative vs mixed methods.

By: Derek Jansen (MBA). Expert Reviewed By: Dr Eunice Rautenbach | June 2021

Without a doubt, one of the most common questions we receive at Grad Coach is “How do I choose the right methodology for my research?”. It’s easy to see why: with so many options on the research design table, it’s easy to get intimidated, especially with all the complex lingo!

In this post, we’ll explain the three overarching types of research – qualitative, quantitative and mixed methods – and how you can go about choosing the best methodological approach for your research.

Overview: Choosing Your Methodology

  • Understanding the options: qualitative research, quantitative research, mixed methods-based research
  • Choosing a research methodology: nature of the research, research area norms, practicalities


1. Understanding the options

Before we jump into the question of how to choose a research methodology, it’s useful to take a step back to understand the three overarching types of research: qualitative, quantitative and mixed methods-based research. Each of these options takes a different methodological approach.

Qualitative research utilises data that is not numbers-based. In other words, qualitative research focuses on words, descriptions, concepts or ideas, while quantitative research makes use of numbers and statistics. Qualitative research investigates the “softer side” of things to explore and describe, while quantitative research focuses on the “hard numbers”, to measure differences between variables and the relationships between them.

Importantly, qualitative research methods are typically used to explore and gain a deeper understanding of the complexity of a situation – to draw a rich picture. In contrast to this, quantitative methods are usually used to confirm or test hypotheses. In other words, they have distinctly different purposes. The table below highlights a few of the key differences between qualitative and quantitative research.

Qualitative research | Quantitative research
Uses an inductive approach | Uses a deductive approach
Is used to build theories | Is used to test theories
Takes a subjective approach | Takes an objective approach
Adopts an open and flexible approach | Adopts a closed, highly planned approach
The researcher is close to the respondents | The researcher is disconnected from respondents
Interviews and focus groups are often used to collect word-based data | Surveys or laboratory equipment are often used to collect number-based data
Generally draws on small sample sizes | Generally requires large sample sizes
Uses qualitative data analysis techniques (e.g. content analysis, thematic analysis, etc.) | Uses statistical analysis techniques to make sense of the data

Mixed methods-based research, as you'd expect, attempts to bring these two types of research together, drawing on both qualitative and quantitative data. Quite often, mixed methods-based studies will use qualitative research to explore a situation and develop a potential model of understanding (this is called a conceptual framework), and then go on to use quantitative methods to test that model empirically.

In other words, while qualitative and quantitative methods (and the philosophies that underpin them) are completely different, they are not at odds with each other. It's not a competition of qualitative vs quantitative. On the contrary, they can be used together to develop a high-quality piece of research. Of course, this is easier said than done, so we usually recommend that first-time researchers stick to a single approach, unless the nature of their study truly warrants a mixed-methods approach.

The key takeaway here, and the reason we started by looking at the three options, is that it’s important to understand that each methodological approach has a different purpose – for example, to explore and understand situations (qualitative), to test and measure (quantitative) or to do both. They’re not simply alternative tools for the same job. 

Right – now that we’ve got that out of the way, let’s look at how you can go about choosing the right methodology for your research.


2. How to choose a research methodology

To choose the right research methodology for your dissertation or thesis, you need to consider three important factors. Based on these three factors, you can decide on your overarching approach – qualitative, quantitative or mixed methods. Once you've made that decision, you can flesh out the finer details of your methodology, such as the sampling, data collection methods and analysis techniques (we discuss these separately in other posts).

The three factors you need to consider are:

  • The nature of your research aims, objectives and research questions
  • The methodological approaches taken in the existing literature
  • Practicalities and constraints

Let’s take a look at each of these.

Factor #1: The nature of your research

As I mentioned earlier, each type of research (and therefore, research methodology), whether qualitative, quantitative or mixed, has a different purpose and helps solve a different type of question. So, it's logical that the key deciding factor in terms of which research methodology you adopt is the nature of your research aims, objectives and research questions.

But, what types of research exist?

Broadly speaking, research can fall into one of three categories:

  • Exploratory – getting a better understanding of an issue and potentially developing a theory regarding it
  • Confirmatory – confirming a potential theory or hypothesis by testing it empirically
  • A mix of both – building a potential theory or hypothesis and then testing it

As a rule of thumb, exploratory research tends to adopt a qualitative approach, whereas confirmatory research tends to use quantitative methods. This isn't set in stone, but it's a very useful heuristic. Naturally then, research that combines a mix of both, or is seeking to develop a theory from the ground up and then test that theory, would utilize a mixed-methods approach.

[Image Descriptor: Exploratory vs confirmatory research]

Let’s look at an example in action.

If your research aims were to understand the perspectives of war veterans regarding certain political matters, you’d likely adopt a qualitative methodology, making use of interviews to collect data and one or more qualitative data analysis methods to make sense of the data.

If, on the other hand, your research aims involved testing a set of hypotheses regarding the link between political leaning and income levels, you'd likely adopt a quantitative methodology, using numbers-based data from a survey to measure the links between variables and/or constructs.

So, the first (and most important) thing you need to consider when deciding which methodological approach to use for your research project is the nature of your research aims, objectives and research questions. Specifically, you need to assess whether your research leans in an exploratory or confirmatory direction or involves a mix of both.

The importance of achieving solid alignment between these three factors and your methodology can’t be overstated. If they’re misaligned, you’re going to be forcing a square peg into a round hole. In other words, you’ll be using the wrong tool for the job, and your research will become a disjointed mess.

If your research is a mix of both exploratory and confirmatory, but you have a tight word count limit, you may need to consider trimming down the scope a little and focusing on one or the other. One methodology executed well has a far better chance of earning marks than a poorly executed mixed methods approach. So, don’t try to be a hero, unless there is a very strong underpinning logic.


Factor #2: The disciplinary norms

Choosing the right methodology for your research also involves looking at the approaches used by other researchers in the field, and studies with similar research aims and objectives to yours. Oftentimes, within a discipline, there is a common methodological approach (or set of approaches) used in studies. While this doesn’t mean you should follow the herd “just because”, you should at least consider these approaches and evaluate their merit within your context.

A major benefit of reviewing the research methodologies used by similar studies in your field is that you can often piggyback on the data collection techniques that other (more experienced) researchers have developed. For example, if you're undertaking a quantitative study, you can often find tried and tested survey scales with high Cronbach's alphas. These are usually included in the appendices of journal articles, so you don't even have to contact the original authors. By using these, you'll save a lot of time and ensure that your study stands on the proverbial "shoulders of giants" by using high-quality measurement instruments.

Of course, when reviewing existing literature, keep point #1 front of mind. In other words, your methodology needs to align with your research aims, objectives and questions. Don’t fall into the trap of adopting the methodological “norm” of other studies just because it’s popular. Only adopt that which is relevant to your research.

Factor #3: Practicalities

When choosing a research methodology, there will always be a tension between doing what's theoretically best (i.e., the most scientifically rigorous research design) and doing what's practical, given your constraints. This is the nature of doing research and there are always trade-offs, as with anything else.

But what constraints, you ask?

When you’re evaluating your methodological options, you need to consider the following constraints:

  • Data access
  • Time
  • Money
  • Equipment and software
  • Your knowledge and skills

Let’s look at each of these.

Constraint #1: Data access

The first practical constraint you need to consider is your access to data. If you're going to be undertaking primary research, you need to think critically about the sample of respondents you realistically have access to. For example, if you plan to use in-person interviews, you need to ask yourself how many people you'll need to interview, whether they'll be agreeable to being interviewed, where they're located, and so on.

If you’re wanting to undertake a quantitative approach using surveys to collect data, you’ll need to consider how many responses you’ll require to achieve statistically significant results. For many statistical tests, a sample of a few hundred respondents is typically needed to develop convincing conclusions.

So, think carefully about what data you’ll need access to, how much data you’ll need and how you’ll collect it. The last thing you want is to spend a huge amount of time on your research only to find that you can’t get access to the required data.

Constraint #2: Time

The next constraint is time. If you're undertaking research as part of a PhD, you may have a fairly open-ended time limit, but this is unlikely to be the case for undergrad and Masters-level projects. So, pay attention to your timeline, as the data collection and analysis components of different methodologies have a major impact on time requirements. Also, keep in mind that these stages of the research often take a lot longer than originally anticipated.

Another practical implication of time limits is that it will directly impact which time horizon you can use – i.e. longitudinal vs cross-sectional. For example, if you've got a 6-month limit for your entire research project, it's quite unlikely that you'll be able to adopt a longitudinal time horizon.

Constraint #3: Money

As with so many things, money is another important constraint you'll need to consider when deciding on your research methodology. While some research designs will cost near zero to execute, others may require a substantial budget.

Some of the costs that may arise include:

  • Software costs – e.g. survey hosting services, analysis software, etc.
  • Promotion costs – e.g. advertising a survey to attract respondents
  • Incentive costs – e.g. providing a prize or cash payment incentive to attract respondents
  • Equipment rental costs – e.g. recording equipment, lab equipment, etc.
  • Travel costs
  • Food & beverages

These are just a handful of costs that can creep into your research budget. Like most projects, the actual costs tend to be higher than the estimates, so be sure to err on the conservative side and expect the unexpected. It’s critically important that you’re honest with yourself about these costs, or you could end up getting stuck midway through your project because you’ve run out of money.

[Image Descriptor: Budgeting for your research]

Constraint #4: Equipment & software

Another practical consideration is the hardware and/or software you’ll need in order to undertake your research. Of course, this variable will depend on the type of data you’re collecting and analysing. For example, you may need lab equipment to analyse substances, or you may need specific analysis software to analyse statistical data. So, be sure to think about what hardware and/or software you’ll need for each potential methodological approach, and whether you have access to these.

Constraint #5: Your knowledge and skillset

The final practical constraint is a big one. Naturally, the research process involves a lot of learning and development along the way, so you will accrue knowledge and skills as you progress. However, when considering your methodological options, you should still consider your current position on the ladder.

Some of the questions you should ask yourself are:

  • Am I more of a “numbers person” or a “words person”?
  • How much do I know about the analysis methods I’ll potentially use (e.g. statistical analysis)?
  • How much do I know about the software and/or hardware that I’ll potentially use?
  • How excited am I to learn new research skills and gain new knowledge?
  • How much time do I have to learn the things I need to learn?

Answering these questions honestly will provide you with another set of criteria against which you can evaluate the research methodology options you’ve shortlisted.

So, as you can see, there is a wide range of practicalities and constraints that you need to take into account when you’re deciding on a research methodology. These practicalities create a tension between the “ideal” methodology and the methodology that you can realistically pull off. This is perfectly normal, and it’s your job to find the option that presents the best set of trade-offs.

Recap: Choosing a methodology

In this post, we’ve discussed how to go about choosing a research methodology. The three major deciding factors we looked at were:

  • The nature of your research: whether it is exploratory, confirmatory, or a combination of both
  • Research area norms: the methodological approaches used in the existing literature
  • Practicalities: data access, time, money, hardware and software, and your knowledge and skillset


Research Methods In Psychology

Saul McLeod, PhD

Editor-in-Chief for Simply Psychology

BSc (Hons) Psychology, MRes, PhD, University of Manchester

Saul McLeod, PhD., is a qualified psychology teacher with over 18 years of experience in further and higher education. He has been published in peer-reviewed journals, including the Journal of Clinical Psychology.

Learn about our Editorial Process

Olivia Guy-Evans, MSc

Associate Editor for Simply Psychology

BSc (Hons) Psychology, MSc Psychology of Education

Olivia Guy-Evans is a writer and associate editor for Simply Psychology. She has previously worked in healthcare and educational sectors.

Research methods in psychology are systematic procedures used to observe, describe, predict, and explain behavior and mental processes. They include experiments, surveys, case studies, and naturalistic observations, ensuring data collection is objective and reliable to understand and explain psychological phenomena.


Hypotheses are statements that predict the results of a study and can be verified or disproved by investigation.

There are four types of hypotheses :
  • Null Hypotheses (H0) – these predict that no difference will be found in the results between the conditions. Typically these are written 'There will be no difference…'
  • Alternative Hypotheses (Ha or H1) – these predict that there will be a significant difference in the results between the two conditions. This is also known as the experimental hypothesis.
  • One-tailed (directional) hypotheses – these state the specific direction the researcher expects the results to move in, e.g. higher, lower, more, less. In a correlation study, the predicted direction of the correlation can be either positive or negative.
  • Two-tailed (non-directional) hypotheses – these state that a difference will be found between the conditions of the independent variable but do not state the direction of the difference or relationship. Typically these are written 'There will be a difference…'

All research has an alternative hypothesis (either a one-tailed or two-tailed) and a corresponding null hypothesis.

Once the research is conducted and results are found, psychologists must accept one hypothesis and reject the other. 

So, if a difference is found, the psychologist would accept the alternative hypothesis and reject the null. The opposite applies if no difference is found.

Sampling techniques

Sampling is the process of selecting a representative group from the population under study.

[Image Descriptor: Sample and target population]

A sample is the participants you select from a target population (the group you are interested in) to make generalizations about.

Representative means the extent to which a sample mirrors a researcher’s target population and reflects its characteristics.

Generalisability means the extent to which findings from a sample can be applied to the larger population from which the sample was drawn.

  • Volunteer sample: where participants pick themselves through newspaper adverts, noticeboards or online.
  • Opportunity sampling: also known as convenience sampling, uses people who are available at the time the study is carried out and willing to take part. It is based on convenience.
  • Random sampling: when every person in the target population has an equal chance of being selected. An example of random sampling would be picking names out of a hat.
  • Systematic sampling: when a system is used to select participants, such as picking every Nth person from all possible participants, where N = the size of the research population divided by the number of people needed for the sample (a minimal code sketch follows this list).
  • Stratified sampling: when you identify the subgroups and select participants in proportion to their occurrences.
  • Snowball sampling: when researchers find a few participants, and then ask them to find participants themselves, and so on.
  • Quota sampling: when researchers are told to ensure the sample fits certain quotas, for example they might be told to find 90 participants, with 30 of them being unemployed.
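
To make the systematic sampling arithmetic concrete, here is a minimal Python sketch. The participant list and the `systematic_sample` helper are hypothetical, invented purely for illustration.

```python
import random

def systematic_sample(population, sample_size):
    """Hypothetical helper: pick every Nth member of the population,
    where N = population size // required sample size."""
    n = len(population) // sample_size   # the sampling interval, N
    start = random.randrange(n)          # random starting point within the first interval
    return population[start::n][:sample_size]

participants = [f"Person {i}" for i in range(1, 101)]   # a made-up population of 100
print(systematic_sample(participants, sample_size=10))  # selects every 10th person
```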

Experiments always have an independent and dependent variable.

  • The independent variable is the one the experimenter manipulates (the thing that changes between the conditions the participants are placed into). It is assumed to have a direct effect on the dependent variable.
  • The dependent variable is the thing being measured, or the results of the experiment.


Operationalization of variables means making them measurable/quantifiable. We must use operationalization to ensure that variables are in a form that can be easily tested.

For instance, we can’t really measure ‘happiness’, but we can measure how many times a person smiles within a two-hour period. 

By operationalizing variables, we make it easy for someone else to replicate our research. Remember, this is important because we can check if our findings are reliable.

Extraneous variables are all variables which are not the independent variable but could affect the results of the experiment.

It can be a natural characteristic of the participant, such as intelligence levels, gender, or age for example, or it could be a situational feature of the environment such as lighting or noise.

Demand characteristics are a type of extraneous variable that occurs when participants work out the aims of the research study and begin to behave in a certain way as a result.

For example, in Milgram's research, critics argued that participants worked out that the shocks were not real, and they administered them as they thought this was what was required of them.

Extraneous variables must be controlled so that they do not affect (confound) the results.

Randomly allocating participants to their conditions or using a matched pairs experimental design can help to reduce participant variables. 

Situational variables are controlled by using standardized procedures, ensuring every participant in a given condition is treated in the same way.

Experimental Design

Experimental design refers to how participants are allocated to each condition of the independent variable, such as a control or experimental group.
  • Independent design (between-groups design): each participant is selected for only one group. With the independent design, the most common way of deciding which participants go into which group is by means of randomization.
  • Matched participants design: each participant is selected for only one group, but the participants in the two groups are matched for some relevant factor or factors (e.g. ability, sex, age).
  • Repeated measures design (within-groups design): each participant appears in both groups, so that there are exactly the same participants in each group.
  • The main problem with the repeated measures design is that there may well be order effects. Their experiences during the experiment may change the participants in various ways.
  • They may perform better when they appear in the second group because they have gained useful information about the experiment or about the task. On the other hand, they may perform less well on the second occasion because of tiredness or boredom.
  • Counterbalancing is the best way of preventing order effects from disrupting the findings of an experiment, and involves ensuring that each condition is equally likely to be used first and second by the participants.

If we wish to compare two groups with respect to a given independent variable, it is essential to make sure that the two groups do not differ in any other important way. 
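
As a concrete illustration of random allocation and counterbalancing, here is a minimal Python sketch. The participant pool and condition labels are hypothetical, made up purely to show the logic of the two designs described above.

```python
import random

participants = [f"P{i}" for i in range(1, 21)]  # hypothetical pool of 20 participants
random.shuffle(participants)                    # randomize the order before allocating
midpoint = len(participants) // 2

# Independent (between-groups) design: random allocation to one condition each
control_group = participants[:midpoint]
experimental_group = participants[midpoint:]

# Repeated measures design with counterbalancing: everyone completes both
# conditions, but half do condition A first and half do condition B first,
# so order effects are spread evenly across the two conditions.
order_a_first = participants[:midpoint]
order_b_first = participants[midpoint:]
```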

Experimental Methods

All experimental methods involve an IV (independent variable) and a DV (dependent variable).

  • Laboratory experiments are conducted in controlled settings: the researcher decides where the experiment will take place, at what time, with which participants, and in what circumstances, using a standardized procedure.
  • Field experiments are conducted in the everyday (natural) environment of the participants. The experimenter still manipulates the IV, but in a real-life setting. It may be possible to control extraneous variables, though such control is more difficult than in a lab experiment.
  • Natural experiments are when a naturally occurring IV is investigated that isn't deliberately manipulated; it exists anyway. Participants are not randomly allocated, and the natural event may only occur rarely.

Case Studies

Case studies are in-depth investigations of a person, group, event, or community. They use information from a range of sources, such as from the person concerned and also from their family and friends.

Many techniques may be used such as interviews, psychological tests, observations and experiments. Case studies are generally longitudinal: in other words, they follow the individual or group over an extended period of time. 

Case studies are widely used in psychology, and among the best-known are those carried out by Sigmund Freud. He conducted very detailed investigations into the private lives of his patients in an attempt to both understand and help them overcome their illnesses.

Case studies provide rich qualitative data and have high levels of ecological validity. However, it is difficult to generalize from individual cases as each one has unique characteristics.

Correlational Studies

Correlation means association; it is a measure of the extent to which two variables are related. One of the variables can be regarded as the predictor variable with the other one as the outcome variable.

Correlational studies typically involve obtaining two different measures from a group of participants, and then assessing the degree of association between the measures. 

The predictor variable can be seen as occurring before the outcome variable in some sense. It is called the predictor variable, because it forms the basis for predicting the value of the outcome variable.

Relationships between variables can be displayed on a graph or as a numerical score called a correlation coefficient.

[Image Descriptor: Scatter plots showing positive, negative, and no correlation]

  • If an increase in one variable tends to be associated with an increase in the other, then this is known as a positive correlation.
  • If an increase in one variable tends to be associated with a decrease in the other, then this is known as a negative correlation.
  • A zero correlation occurs when there is no relationship between variables.

After looking at the scattergraph, if we want to be sure that a significant relationship does exist between the two variables, a statistical test of correlation can be conducted, such as Spearman’s rho.

The test will give us a score, called a correlation coefficient. This is a value between -1 and +1, and the closer its absolute value is to 1, the stronger the relationship between the variables. The coefficient can be positive (e.g. 0.63) or negative (e.g. -0.63).
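
As a quick illustration of computing a correlation coefficient, here is a minimal Python sketch using SciPy's `spearmanr`. The paired scores are hypothetical, invented purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical paired measurements from ten participants,
# e.g. hours of revision and exam score
revision_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exam_scores    = [30, 35, 45, 55, 60, 70, 65, 85, 90, 95]

rho, p_value = spearmanr(revision_hours, exam_scores)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.4f}")
# rho near +1: strong positive correlation; near -1: strong negative;
# near 0: little or no relationship between the variables
```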

[Image Descriptor: Graphs showing strong, weak, and perfect positive correlations; strong, weak, and perfect negative correlations; and no correlation]

A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable. A correlation only shows if there is a relationship between variables.

Correlation does not prove causation, as a third variable may be involved.


Interview Methods

Interviews are commonly divided into two types: structured and unstructured.

In a structured interview, a fixed, predetermined set of questions is put to every participant in the same order and in the same way.

Responses are recorded on a questionnaire, and the researcher presets the order and wording of questions, and sometimes the range of alternative answers.

The interviewer stays within their role and maintains social distance from the interviewee.

In an unstructured interview, there are no set questions; the participant can raise whatever topics they feel are relevant and discuss them in their own way, with the interviewer posing follow-up questions in response to the participant's answers.

Unstructured interviews are most useful in qualitative research to analyze attitudes and values.

Though they rarely provide a valid basis for generalization, their main advantage is that they enable the researcher to probe social actors’ subjective point of view. 

Questionnaire Method

Questionnaires can be thought of as a kind of written interview. They can be carried out face to face, by telephone, or post.

The choice of questions is important because of the need to avoid bias or ambiguity in the questions, ‘leading’ the respondent or causing offense.

  • Open questions are designed to encourage a full, meaningful answer using the subject’s own knowledge and feelings. They provide insights into feelings, opinions, and understanding. Example: “How do you feel about that situation?”
  • Closed questions can be answered with a simple “yes” or “no” or specific information, limiting the depth of response. They are useful for gathering specific facts or confirming details. Example: “Do you feel anxious in crowds?”

Other practical advantages of questionnaires are that they are cheaper than face-to-face interviews and can be used to contact many respondents scattered over a wide area relatively quickly.

Observations

There are different types of observation methods:
  • Covert observation is where the researcher doesn't tell the participants they are being observed until after the study is complete. There could be ethical problems around deception and consent with this particular observation method.
  • Overt observation is where a researcher tells the participants they are being observed and what they are being observed for.
  • Controlled: behavior is observed under controlled laboratory conditions (e.g., Bandura's Bobo doll study).
  • Natural: Here, spontaneous behavior is recorded in a natural setting.
  • Participant: Here, the observer has direct contact with the group of people they are observing. The researcher becomes a member of the group they are researching.
  • Non-participant (aka "fly on the wall"): The researcher does not have direct contact with the people being observed; the observation of participants' behavior is from a distance.

Pilot Study

A pilot study is a small-scale preliminary study conducted in order to evaluate the feasibility of the key steps in a future, full-scale project.

A pilot study is an initial run-through of the procedures to be used in an investigation; it involves selecting a few people and trying out the study on them. It is possible to save time, and in some cases, money, by identifying any flaws in the procedures designed by the researcher.

A pilot study can help the researcher spot any ambiguities or confusion in the information given to participants, or problems with the task devised.

Sometimes the task is too hard, and the researcher may get a floor effect, because none of the participants can score at all or can complete the task – all performances are low.

The opposite effect is a ceiling effect, when the task is so easy that all achieve virtually full marks or top performances and are “hitting the ceiling”.

Research Design

In cross-sectional research, a researcher compares multiple segments of the population at the same time.

Sometimes, we want to see how people change over time, as in studies of human development and lifespan. Longitudinal research is a research design in which data-gathering is administered repeatedly over an extended period of time.

In cohort studies , the participants must share a common factor or characteristic such as age, demographic, or occupation. A cohort study is a type of longitudinal study in which researchers monitor and observe a chosen population over an extended period.

Triangulation means using more than one research method to improve the study’s validity.

Reliability

Reliability is a measure of consistency: if a particular measurement is repeated and the same result is obtained, then it is described as being reliable.

  • Test-retest reliability: assessing the same person on two different occasions, which shows the extent to which the test produces the same answers.
  • Inter-observer reliability: the extent to which there is an agreement between two or more observers.

Meta-Analysis

Meta-analysis is a statistical procedure used to combine and synthesize findings from multiple independent studies to estimate the average effect size for a particular research question.

Meta-analysis goes beyond traditional narrative reviews by using statistical methods to integrate the results of several studies, leading to a more objective appraisal of the evidence.

This is done by looking through various databases, and then decisions are made about what studies are to be included/excluded.

  • Strengths: Increases the conclusions' validity, as they're based on a wider range of studies.
  • Weaknesses : Research designs in studies can vary, so they are not truly comparable.

Peer Review

A researcher submits an article to a journal. The choice of the journal may be determined by the journal’s audience or prestige.

The journal selects two or more appropriate experts (psychologists working in a similar field) to peer review the article without payment. The peer reviewers assess: the methods and designs used, originality of the findings, the validity of the original research findings and its content, structure and language.

Feedback from the reviewer determines whether the article is accepted. The article may be: Accepted as it is, accepted with revisions, sent back to the author to revise and re-submit or rejected without the possibility of submission.

The editor makes the final decision whether to accept or reject the research report based on the reviewers' comments/recommendations.

Peer review is important because it prevents faulty data from entering the public domain, provides a way of checking the validity of findings and the quality of the methodology, and is used to assess the research rating of university departments.

Peer review may be an ideal; in practice, there are lots of problems. For example, it slows publication down and may prevent unusual, new work from being published. Some reviewers might use it as an opportunity to prevent competing researchers from publishing work.

Some people doubt whether peer review can really prevent the publication of fraudulent research.

The advent of the internet means that a lot more research and academic comment is being published without official peer review than before, though systems are evolving online that give everyone a chance to offer their opinions and police the quality of research.

Types of Data

  • Quantitative data is numerical data, e.g. reaction time or number of mistakes. It represents how much, how long, or how many there are of something. A tally of behavioral categories and closed questions in a questionnaire collect quantitative data.
  • Qualitative data is virtually any type of information that can be observed and recorded that is not numerical in nature and can be in the form of written or verbal communication. Open questions in questionnaires and accounts from observational studies collect qualitative data.
  • Primary data is first-hand data collected for the purpose of the investigation.
  • Secondary data is information that has been collected by someone other than the person who is conducting the research e.g. taken from journals, books or articles.

Validity means how well a piece of research actually measures what it sets out to, or how well it reflects the reality it claims to represent.

Validity is whether the observed effect is genuine and represents what is actually out there in the world.

  • Concurrent validity is the extent to which a psychological measure relates to an existing similar measure and obtains close results. For example, a new intelligence test compared to an established test.
  • Face validity: does the test measure what it's supposed to measure 'on the face of it'? This is assessed by 'eyeballing' the measuring instrument or by passing it to an expert to check.
  • Ecological validity is the extent to which findings from a research study can be generalized to other settings / real life.
  • Temporal validity is the extent to which findings from a research study can be generalized to other historical times.

Features of Science

  • Paradigm – A set of shared assumptions and agreed methods within a scientific discipline.
  • Paradigm shift – The result of a scientific revolution: a significant change in the dominant unifying theory within a scientific discipline.
  • Objectivity – When all sources of personal bias are minimised so as not to distort or influence the research process.
  • Empirical method – Scientific approaches that are based on the gathering of evidence through direct observation and experience.
  • Replicability – The extent to which scientific procedures and findings can be repeated by other researchers.
  • Falsifiability – The principle that a theory cannot be considered scientific unless it admits the possibility of being proved untrue.

Statistical Testing

A significant result is one where there is a low probability that chance factors were responsible for any observed difference, correlation, or association in the variables tested.

If our test is significant, we can reject our null hypothesis and accept our alternative hypothesis.

If our test is not significant, we can accept our null hypothesis and reject our alternative hypothesis. A null hypothesis is a statement of no effect.

In psychology, we use p < 0.05 (as it strikes a balance between the risks of making a Type I and a Type II error), but p < 0.01 is used in research where an error could cause harm, such as trials of a new drug.

A type I error is when the null hypothesis is rejected when it should have been accepted (happens when a lenient significance level is used, an error of optimism).

A type II error is when the null hypothesis is accepted when it should have been rejected (happens when a stringent significance level is used, an error of pessimism).
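
To make the decision rule concrete, here is a minimal Python sketch of an independent-samples t-test evaluated at the p < 0.05 level, using SciPy's `ttest_ind`. The reaction-time data are hypothetical, invented purely for illustration.

```python
from scipy.stats import ttest_ind

# Hypothetical reaction times (ms) for two independent conditions
condition_a = [512, 498, 530, 475, 520, 505, 490, 515]
condition_b = [470, 455, 480, 460, 475, 465, 450, 468]

t_stat, p_value = ttest_ind(condition_a, condition_b)

alpha = 0.05  # the conventional significance level in psychology
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: retain the null hypothesis")
```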

Ethical Issues

  • Informed consent is when participants are able to make an informed judgment about whether to take part. However, revealing the study's aims may cause participants to guess them and change their behavior.
  • To deal with this, researchers can gain presumptive consent or ask participants to formally indicate their agreement to participate, but this may invalidate the purpose of the study, and it is not guaranteed that the participants would understand.
  • Deception should only be used when it is approved by an ethics committee, as it involves deliberately misleading or withholding information. Participants should be fully debriefed after the study, but debriefing can't turn the clock back.
  • All participants should be informed at the beginning that they have the right to withdraw if they ever feel distressed or uncomfortable.
  • Withdrawal can cause bias, as the participants who stay tend to be more obedient, and some may not withdraw because they were given incentives or feel they would spoil the study. Researchers can offer the right to withdraw data after participation.
  • All participants should have protection from harm. The researcher should avoid risks greater than those experienced in everyday life, and they should stop the study if any harm is suspected. However, the harm may not be apparent at the time of the study.
  • Confidentiality concerns the communication of personal information. Researchers should not record any names, but use numbers or false names instead, though full anonymity may not be guaranteed, as it is sometimes possible to work out who the participants were.


Pfeiffer Library

Research Methodologies


What are research methodologies?



According to Dawson (2019), a research methodology is the primary principle that will guide your research. It becomes the general approach in conducting research on your topic and determines what research method you will use. A research methodology is different from a research method because research methods are the tools you use to gather your data (Dawson, 2019). You must consider several issues when it comes to selecting the most appropriate methodology for your topic. Issues might include research limitations and ethical dilemmas that might impact the quality of your research. Descriptions of each type of methodology are included below.

Quantitative research methodologies are meant to produce numerical statistics, often by using survey research to gather data (Dawson, 2019). This approach tends to reach a larger number of people in a shorter amount of time. According to Labaree (2020), there are three parts that make up a quantitative research methodology:

  • Sample population
  • How you will collect your data (this is the research method)
  • How you will analyze your data

Once you decide on a methodology, you can consider the method to which you will apply your methodology.

Qualitative research methodologies examine the behaviors, opinions, and experiences of individuals through methods of examination (Dawson, 2019). This type of approach typically requires fewer participants, but more time with each participant. It gives research subjects the opportunity to provide their own opinion on a certain topic.

Examples of Qualitative Research Methodologies

  • Action research:  This is when the researcher works with a group of people to improve something in a certain environment.  It is a common approach for research in organizational management, community development, education, and agriculture (Dawson, 2019).
  • Ethnography: The process of observing and describing cultural behaviors (Dawson, 2019). Researchers may immerse themselves in another culture to get an "inside look" at the group they are studying. It is often a time-consuming process because the researcher will do this for a long period of time. This can also be called "participant observation" (Dawson, 2019).
  • Feminist research:  The goal of this methodology is to study topics that have been dominated by male test subjects.  It aims to study females and compare the results to previous studies that used male participants (Dawson, 2019).
  • Grounded theory:  The process of developing a theory to describe a phenomenon strictly through the data results collected in a study.  It is different from other research methodologies where the researcher attempts to prove a hypothesis that they create before collecting data.  Popular research methods for this approach include focus groups and interviews (Dawson, 2019).

A mixed methodology allows you to implement the strengths of both qualitative and quantitative research methods.  In some cases, you may find that your research project would benefit from this.  This approach is beneficial because it allows each methodology to counteract the weaknesses of the other (Dawson, 2019).  You should consider this option carefully, as it can make your research complicated if not planned correctly.

What should you do to decide on a research methodology? The most logical way to determine your methodology is to decide whether you plan on conducting qualitative or quantitative research. You also have the option to implement a mixed methods approach. Looking back on Dawson's (2019) five "W's" on the previous page may help you with this process. You should also look for key words that indicate a specific type of research methodology in your hypothesis or proposal. Some words may lean more towards one methodology over another.

Quantitative Research Key Words

  • How satisfied

Qualitative Research Key Words

  • Experiences
  • Thoughts/Think
  • Relationship

Quantitative Data Analysis Guide: Methods, Examples & Uses


This guide will introduce the types of data analysis used in quantitative research, then discuss relevant examples and applications in the finance industry.


An Overview of Quantitative Data Analysis

What is quantitative data analysis and what is it for?

Quantitative data analysis is the process of interpreting meaning and extracting insights from numerical data, which involves mathematical calculations and statistical reviews to uncover patterns, trends, and relationships between variables.

Beyond academic and statistical research, this approach is particularly useful in the finance industry. Financial data, such as stock prices, interest rates, and economic indicators, can all be quantified with statistics and metrics to offer crucial insights for informed investment decisions. To illustrate this, here are some examples of what quantitative data is usually used for:

  • Measuring Differences between Groups: For instance, analyzing historical stock prices of different companies or asset classes can reveal which companies consistently outperform the market average.
  • Assessing Relationships between Variables: An investor could analyze the relationship between a company's price-to-earnings ratio (P/E ratio) and relevant factors, like industry performance, inflation rates, interest rates, etc., allowing them to predict future stock price growth.
  • Testing Hypotheses: For example, an investor might hypothesize that companies with strong ESG (Environment, Social, and Governance) practices outperform those without. By categorizing these companies into two groups (strong ESG vs. weak ESG practices), they can compare the average return on investment (ROI) between the groups while assessing relevant factors to find evidence for the hypothesis. 

Ultimately, quantitative data analysis helps investors navigate the complex financial landscape and pursue profitable opportunities.

Quantitative Data Analysis VS. Qualitative Data Analysis

Although quantitative data analysis is a powerful tool, it cannot be used to provide context for your research, so this is where qualitative analysis comes in. Qualitative analysis is another common research method that focuses on collecting and analyzing non-numerical data, like text, images, or audio recordings, to gain a deeper understanding of experiences, opinions, and motivations. Here's a table summarizing the key differences between the two approaches:

Aspect | Quantitative Data Analysis | Qualitative Data Analysis
Types of Data Used | Numerical data: numbers, percentages, etc. | Non-numerical data: text, images, audio, narratives, etc.
Perspective | More objective and less prone to bias | More subjective, as it may be influenced by the researcher's interpretation
Data Collection | Closed-ended questions, surveys, polls | Open-ended questions, interviews, observations
Data Analysis | Statistical methods, numbers, graphs, charts | Categorization, thematic analysis, verbal communication
Best Use Case | Measuring trends, comparing groups, testing hypotheses | Understanding user experience, exploring consumer motivations, uncovering new ideas

Due to their characteristics, quantitative analysis allows you to measure and compare large datasets, while qualitative analysis helps you understand the context behind the data. In some cases, researchers might even use both methods together for a more comprehensive understanding, but we'll mainly focus on quantitative analysis for this article.

The 2 Main Quantitative Data Analysis Methods

Once you have your data collected, you have to use descriptive statistics or inferential statistics analysis to draw summaries and conclusions from your raw numbers. 

As its name suggests, the purpose of descriptive statistics is to describe your sample. It provides the groundwork for understanding your data by focusing on the details and characteristics of the specific group you've collected data from.

On the other hand, inferential statistics act as bridges that connect your sample data to the broader population you’re truly interested in, helping you to draw conclusions in your research. Moreover, choosing the right inferential technique for your specific data and research questions is dependent on the initial insights from descriptive statistics, so both of these methods usually go hand-in-hand.

Descriptive Statistics Analysis

With sophisticated descriptive statistics, you can detect potential errors in your data by highlighting inconsistencies and outliers that might otherwise go unnoticed. Additionally, the characteristics revealed by descriptive statistics will help determine which inferential techniques are suitable for further analysis.

Measures in Descriptive Statistics

One of the key concepts in descriptive statistics is central tendency. It consists of the mean, median, and mode, telling you where most of your data points cluster:

  • Mean: It refers to the “average” and is calculated by adding all the values in your data set and dividing by the number of values.
  • Median: The middle value when your data is arranged in ascending or descending order. If you have an odd number of data points, the median is the exact middle value; with even numbers, it’s the average of the two middle values. 
  • Mode: This refers to the most frequently occurring value in your data set, indicating the most common response or observation. Some data can have multiple modes (bimodal) or no mode at all.

Another set of descriptive measures is dispersion, which involves range and standard deviation, revealing how spread out your data is relative to the central tendency measures:

  • Range: It refers to the difference between the highest and lowest values in your data set. 
  • Standard Deviation (SD): This tells you how the data is distributed within the range, revealing how much, on average, each data point deviates from the mean. Lower standard deviations indicate data points clustered closer to the mean, while higher standard deviations suggest a wider spread.

The shape of the distribution will then be measured through skewness. 

  • Skewness: A statistic that indicates whether your data leans to one side (positive or negative) or is symmetrical (normal distribution). A positive skew suggests more data points concentrated on the lower end, while a negative skew indicates more data points on the higher end.

While the core measures mentioned above are fundamental, there are additional descriptive statistics used in specific contexts, including percentiles and interquartile range.

  • Percentiles: This divides your data into 100 equal parts, revealing what percentage of data falls below a specific value. The 25th percentile (Q1) is the first quartile, the 50th percentile (Q2) is the median, and the 75th percentile (Q3) is the third quartile. Knowing these quartiles can help visualize the spread of your data.
  • Interquartile Range (IQR): This measures the difference between Q3 and Q1, representing the middle 50% of your data.
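
To ground these measures, here is a minimal Python sketch computing them with the standard library's statistics module. The price series is hypothetical, invented purely for illustration.

```python
import statistics

# Hypothetical closing prices for one stock over ten days
prices = [102.5, 101.0, 103.2, 99.8, 104.1, 102.5, 100.7, 105.3, 101.9, 102.5]

mean   = statistics.mean(prices)    # central tendency: the average price
median = statistics.median(prices)  # the middle value of the sorted prices
mode   = statistics.mode(prices)    # the most frequent value (102.5 here)

price_range = max(prices) - min(prices)  # dispersion: highest minus lowest
stdev = statistics.stdev(prices)         # sample standard deviation

q1, q2, q3 = statistics.quantiles(prices, n=4)  # quartiles; q2 equals the median
iqr = q3 - q1                                   # interquartile range: middle 50%

print(mean, median, mode, price_range, stdev, iqr)
```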

Example of Descriptive Quantitative Data Analysis 

Let's illustrate these concepts with a real-world example. Imagine a financial advisor analyzing a client's portfolio. They have data on the client's various holdings, including stock prices over the past year. With descriptive statistics, they can obtain the following information:

  • Central Tendency: The mean price for each stock reveals its average price over the year. The median price can further highlight if there were any significant price spikes or dips that skewed the mean.
  • Measures of Dispersion: The standard deviation for each stock indicates its price volatility. A high standard deviation suggests the stock’s price fluctuated considerably, while a low standard deviation implies a more stable price history. This helps the advisor assess each stock’s risk profile.
  • Shape of the Distribution: If data allows, analyzing skewness can be informative. A positive skew for a stock might suggest more frequent price drops, while a negative skew might indicate more frequent price increases.

By calculating these descriptive statistics, the advisor gains a quick understanding of the client’s portfolio performance and risk distribution. For instance, they could use correlation analysis to see if certain stock prices tend to move together, helping them identify expansion opportunities within the portfolio.

While descriptive statistics provide a foundational understanding, they should be followed by inferential analysis to uncover deeper insights that are crucial for making investment decisions.

Inferential Statistics Analysis

Inferential statistics analysis is particularly useful for hypothesis testing, as you can formulate predictions about group differences or potential relationships between variables, then use statistical tests to see if your sample data supports those hypotheses.

However, the power of inferential statistics hinges on one crucial factor: sample representativeness. If your sample doesn't accurately reflect the population, your predictions won't be very reliable.

Statistical Tests for Inferential Statistics

Here are some of the commonly used tests for inferential statistics in commerce and finance, which can also be integrated to most analysis software:

  • T-Tests: These compare the means of two groups to assess whether they're statistically different, helping you determine if the observed difference is just a quirk within the sample or a significant reflection of the population.
  • ANOVA (Analysis of Variance): While T-Tests handle comparisons between two groups, ANOVA focuses on comparisons across multiple groups, allowing you to identify potential variations and trends within the population.
  • Correlation Analysis: This technique tests the relationship between two variables, assessing if one variable increases or decreases with the other. However, it’s important to note that just because two financial variables are correlated and move together, doesn’t necessarily mean one directly influences the other.
  • Regression Analysis: Building on correlation, regression analysis goes a step further to verify the cause-and-effect relationships between the tested variables, allowing you to investigate if one variable actually influences the other.
  • Cross-Tabulation: This breaks down the relationship between two categorical variables by displaying the frequency counts in a table format, helping you to understand how different groups within your data set might behave. The categories in a cross-tabulation can be mutually exclusive or overlap with each other.
  • Trend Analysis: This examines how a variable in quantitative data changes over time, revealing upward or downward trends, as well as seasonal fluctuations. This can help you forecast future trends, and also lets you assess the effectiveness of the interventions in your marketing or investment strategy.
  • MaxDiff Analysis: This is also known as the “best-worst” method. It evaluates customer preferences by asking respondents to choose the most and least preferred options from a set of products or services, allowing stakeholders to optimize product development or marketing strategies.
  • Conjoint Analysis: Similar to MaxDiff, conjoint analysis gauges customer preferences, but it goes a step further by allowing researchers to see how changes in different product features (price, size, brand) influence overall preference.
  • TURF Analysis (Total Unduplicated Reach and Frequency Analysis): This assesses a marketing campaign’s reach and frequency of exposure in different channels, helping businesses identify the most efficient channels to reach target audiences.
  • Gap Analysis: This compares current performance metrics against established goals or benchmarks, using numerical data to represent the factors involved. This helps identify areas where performance falls short of expectations, serving as a springboard for developing strategies to bridge the gap and achieve those desired outcomes.
  • SWOT Analysis (Strengths, Weaknesses, Opportunities, and Threats): This uses ratings or rankings to represent an organization’s internal strengths and weaknesses, along with external opportunities and threats. Based on this analysis, organizations can create strategic plans to capitalize on opportunities while minimizing risks.
  • Text Analysis: This is an advanced method that uses specialized software to categorize and quantify themes, sentiment (positive, negative, neutral), and topics within textual data, allowing companies to obtain structured quantitative data from surveys, social media posts, or customer reviews.

Example of Inferential Quantitative Data Analysis

If you’re a financial analyst studying the historical performance of a particular stock, here are some predictions you can make with inferential statistics:

  • The Differences between Groups: You can conduct T-Tests to compare the average returns of stocks in the technology sector with those in the healthcare sector. It can help assess if the observed difference in returns between these two sectors is simply due to random chance or if it’s statistically significant due to a significant difference in their performance.
  • The Relationships between Variables: If you’re curious about the connection between a company’s price-to-earnings ratio (P/E ratios) and its future stock price movements, conducting correlation analysis can let you measure the strength and direction of this relationship. Is there a negative correlation, suggesting that higher P/E ratios might be associated with lower future stock prices? Or is there no significant correlation at all?
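
Here is a minimal Python sketch of both predictions, using SciPy's `ttest_ind` and `pearsonr`. All of the returns and P/E figures are synthetic, generated purely for illustration, so the numbers carry no real-world meaning.

```python
import numpy as np
from scipy.stats import ttest_ind, pearsonr

rng = np.random.default_rng(42)

# Hypothetical annual returns (%) for stocks in two sectors
tech_returns       = rng.normal(loc=12.0, scale=5.0, size=30)
healthcare_returns = rng.normal(loc=9.0, scale=4.0, size=30)

# Difference between groups: is the gap in mean returns significant?
t_stat, p_diff = ttest_ind(tech_returns, healthcare_returns)

# Relationship between variables: do P/E ratios move with future returns?
pe_ratios      = rng.uniform(8, 40, size=30)
future_returns = -0.2 * pe_ratios + rng.normal(10, 3, size=30)  # fabricated link
r, p_corr = pearsonr(pe_ratios, future_returns)

print(f"t-test: t = {t_stat:.2f}, p = {p_diff:.4f}")
print(f"correlation: r = {r:.2f}, p = {p_corr:.4f}")
```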

Understanding these inferential analysis techniques can help you uncover potential relationships and group differences that might not be readily apparent from descriptive statistics alone. Nonetheless, it’s important to remember that each technique has its own set of assumptions and limitations: some tests assume normally distributed data (parametric tests), while others make no such distributional assumptions (non-parametric tests).

Guide to Conduct Data Analysis in Quantitative Research

Now that we have discussed the types of data analysis techniques used in quantitative research, here’s a quick guide to help you choose the right method and grasp the essential steps of quantitative data analysis.

How to Choose the Right Quantitative Analysis Method?

Choosing between all these quantitative analysis methods may seem complicated, but if you consider the following two factors, you can choose the right technique:

Factor 1: Data Type

The data used in quantitative analysis can be categorized into two types, discrete data and continuous data, based on how they’re measured. They can be further differentiated by their measurement scale. The four main measurement scales are nominal, ordinal, interval, and ratio. Understanding the distinctions between them is essential for choosing appropriate statistical methods and interpreting the results of your quantitative data analysis accurately.

Discrete data , which is also known as attribute data, represents whole numbers that can be easily counted and separated into distinct categories. It is often visualized using bar charts or pie charts, making it easy to see the frequency of each value. In the financial world, examples of discrete quantitative data include:

  • The number of shares owned by an investor in a particular company
  • The number of customer transactions processed by a bank per day
  • Bond ratings (AAA, BBB, etc.) that represent discrete categories indicating the creditworthiness of a bond issuer
  • The number of customers with different account types (checking, savings, investment) as seen in the pie chart below:

[Pie chart: distribution of customers across account types (checking, savings, investment, salary)]

Discrete data usually use nominal or ordinal measurement scales, which can then be summarized by calculating, for example, the mode or median. Here are some examples:

  • Nominal: This scale categorizes data into distinct groups with no inherent order. For instance, bank account types are nominal data: they classify customers into independent categories, whether checking, savings, or investment accounts, with no ranking implied by these account types.
  • Ordinal: Ordinal data establishes a rank or order among categories. For example, investment risk ratings (low, medium, high) are ordered based on their perceived risk of loss, making them a type of ordinal data. (See the encoding sketch after this list.)
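For illustration, here is how the two scales might be encoded in R; the account types and risk labels are toy values:

```r
# Nominal: unordered categories
account_type <- factor(c("checking", "savings", "investment", "checking"))
table(account_type)  # the mode is simply the most frequent level

# Ordinal: categories with a meaningful rank order
risk_rating <- factor(c("low", "high", "medium", "low"),
                      levels = c("low", "medium", "high"),
                      ordered = TRUE)
median(as.integer(risk_rating))  # the median rank is meaningful for ordinal data
```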

Conversely, continuous data can take on any value and fluctuate over time. It is usually visualized using line graphs, effectively showcasing how the values can change within a specific time frame. Examples of continuous data in the financial industry include:

  • Interest rates set by central banks or offered by banks on loans and deposits
  • Currency exchange rates which also fluctuate constantly throughout the day
  • Daily trading volume of a particular stock over time
  • Stock prices that fluctuate throughout the day, as seen in the line graph below:

[Line chart: stock prices fluctuating throughout the day]

The measurement scale for continuous data is usually interval or ratio. Here is a breakdown of their differences:

  • Interval: This builds upon ordinal data by having consistent intervals between units, but its zero point doesn’t represent a complete absence of the variable. Take credit scores as an example: while the scale ranges from 300 to 850, a given difference between scores means the same thing anywhere on the scale, and a score of zero wouldn’t indicate an absence of credit history, only that no credit score is available.
  • Ratio: This scale has all the same characteristics of interval data but also has a true zero point, indicating a complete absence of the variable. Interest rates expressed as percentages are a classic example of ratio data. A 0% interest rate signifies the complete absence of any interest charged or earned, making it a true zero point.

Factor 2: Research Question

You also need to make sure that the analysis method aligns with your specific research questions. If you only need to understand the characteristics of your data set, descriptive statistics might be enough; if you want to analyze connections between variables, you will need inferential statistics as well.

How to Analyze Quantitative Data 

Step 1: Data Collection

Depending on your research question, you might choose to conduct surveys or interviews. Distributing online or paper surveys can reach a broad audience, while interviews allow for deeper exploration of specific topics. You can also choose to source existing datasets from government agencies or industry reports.

Step 2: Data Cleaning

Raw data might contain errors, inconsistencies, or missing values, so data cleaning has to be done meticulously to ensure accuracy and consistency. This might involve removing duplicates, correcting typos, and handling missing information.

Furthermore, you should also identify the nature of your variables and assign each an appropriate measurement scale, whether nominal, ordinal, interval, or ratio. This is important because it determines the types of descriptive statistics and analysis methods you can employ later. Once you categorize your data based on these measurement scales, you can arrange each category in a proper order and organize it in a format that is convenient for you.

Step 3: Data Analysis

Based on the measurement scales of your variables, calculate relevant descriptive statistics to summarize your data. This might include measures of central tendency (mean, median, mode) and dispersion (range, standard deviation, variance). With these statistics, you can identify the pattern within your raw data. 
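A minimal sketch of this step in R, assuming a simulated vector of daily returns:

```r
set.seed(9)
returns <- rnorm(250, mean = 0.0004, sd = 0.01)  # hypothetical daily returns

# Central tendency
mean(returns); median(returns)

# Dispersion
range(returns); sd(returns); var(returns)
quantile(returns, c(0.05, 0.95))  # a first look at tail behavior
```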

Then, these patterns can be analyzed further with inferential methods to test out the hypotheses you have developed. You may choose any of the statistical tests mentioned above, as long as they are compatible with the characteristics of your data.

Step 4: Data Interpretation and Communication

Now that you have the results from your statistical analysis, you may draw conclusions based on the findings and incorporate them into your business strategies. Additionally, you should also transform your findings into clear and shareable information to facilitate discussion among stakeholders. Visualization techniques like tables, charts, or graphs can make complex data more digestible so that you can communicate your findings efficiently. 

Useful Quantitative Data Analysis Tools and Software 

We’ve compiled some commonly used quantitative data analysis tools and software. Choosing the right one depends on your experience level, project needs, and budget. Here’s a brief comparison: 

Learning Curve | Best Suited For | Licensing
Easiest | Beginners & basic analysis | One-time purchase with Microsoft Office Suite
Easy | Social scientists & researchers | Paid commercial license
Easy | Students & researchers | Paid commercial license or student discounts
Moderate | Businesses & advanced research | Paid commercial license
Moderate | Researchers & statisticians | Paid commercial license
Moderate (coding optional) | Programmers & data scientists | Free & open-source
Steep (coding required) | Experienced users & programmers | Free & open-source
Steep (coding required) | Scientists & engineers | Paid commercial license
Steep (coding required) | Scientists & engineers | Paid commercial license

Quantitative Data in Finance and Investment

So how does this all affect the finance industry? Quantitative finance (or quant finance) has become a growing trend, with the quant fund market valued at $16,008.69 billion in 2023. This value is expected to increase at a compound annual growth rate of 10.09% and reach $31,365.94 billion by 2031, signifying its expanding role in the industry.

What is Quant Finance?

Quant finance is the process of using massive financial datasets and mathematical models to identify market behavior, financial trends, movements, and economic indicators, so that future trends can be predicted. These calculated probabilities can be leveraged to find potential investment opportunities and maximize returns while minimizing risks.

Common Quantitative Investment Strategies

There are several common quantitative strategies, each offering unique approaches to help stakeholders navigate the market:

1. Statistical Arbitrage

This strategy aims for high returns with low volatility. It employs sophisticated algorithms to identify minuscule price discrepancies across the market, then capitalizes on them at lightning speed, often generating short-term profits. However, its reliance on market efficiency makes it vulnerable to sudden market shifts that can disrupt its calculations. (A toy illustration of the underlying idea follows below.)
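The R sketch below illustrates only the core idea — trading a mean-reverting spread between two co-moving series — using simulated prices and an arbitrary two-standard-deviation entry threshold; production statistical-arbitrage systems are far more sophisticated:

```r
set.seed(7)
common <- cumsum(rnorm(500))               # shared driver of both series
a <- common + rnorm(500, sd = 0.5)         # price-like series A
b <- common + rnorm(500, sd = 0.5)         # price-like series B

spread <- a - b
z <- (spread - mean(spread)) / sd(spread)  # standardized spread

# Enter when the spread is stretched; otherwise hold
signal <- ifelse(z >  2, "short A / long B",
          ifelse(z < -2, "long A / short B", "hold"))
table(signal)
```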

2. Factor Investing 

This strategy identifies and invests in assets based on factors like value, momentum, or quality. By analyzing these factors in quantitative databases , investors can construct portfolios designed to outperform the broader market. Overall, this method offers diversification and potentially higher returns than passive investing, but its success relies on the historical validity of these factors, which can evolve over time.

3. Risk Parity

This approach prioritizes portfolio balance above all else. Instead of allocating assets based on their market value, risk parity distributes them based on their risk contribution to achieve a desired level of overall portfolio risk, regardless of individual asset volatility. Although it is efficient in managing risks while potentially offering positive returns, it is important to note that this strategy’s complex calculations can be sensitive to unexpected market events.

4. Machine Learning & Artificial Intelligence (AI)

Quant analysts are beginning to incorporate these cutting-edge technologies into their strategies. Machine learning algorithms act as data sifters, identifying complex patterns within massive datasets, whereas AI goes a step further, leveraging those insights to make investment decisions, essentially mimicking human-like decision-making with added adaptability. Despite hefty development and implementation costs, the potential for superior risk-adjusted returns and the ability to uncover hidden patterns make this strategy a valuable asset.

Pros and Cons of Quantitative Data Analysis

Advantages of Quantitative Data Analysis

Minimum Bias for Reliable Results

Quantitative data analysis relies on objective, numerical data. This minimizes bias and human error, allowing stakeholders to make investment decisions without emotional intuitions that can cloud judgment. In turn, this offers reliable and consistent results for investment strategies.

Precise Calculations for Data-Driven Decisions

Quantitative analysis generates precise numerical results through statistical methods. This allows accurate comparisons between investment options and even predictions of future market behavior, helping investors make informed decisions about where to allocate their capital while managing potential risks.

Generalizability for Broader Insights 

By analyzing large datasets and identifying patterns, stakeholders can generalize findings from quantitative analysis to broader populations, applying them to a wider range of investments for better portfolio construction and risk management.

Efficiency for Extensive Research

Quantitative research is better suited to analyzing large datasets efficiently, letting companies save valuable time and resources. The software used for quantitative analysis can automate the process of sifting through extensive financial data, facilitating quicker decision-making in a fast-paced financial environment.

Disadvantages of Quantitative Data Analysis

Limited Scope

By focusing on numerical data, quantitative analysis may provide a limited scope, as it can’t capture qualitative context such as emotions, motivations, or cultural factors. Although quantitative analysis provides a strong starting point, neglecting qualitative factors can lead to incomplete insights in the financial industry, impacting areas like customer relationship management and targeted marketing strategies.

Oversimplification 

Breaking down complex phenomena into numerical data could cause analysts to overlook the richness of the data, leading to the issue of oversimplification. Stakeholders who fail to understand the complexity of economic factors or market trends could face flawed investment decisions and missed opportunities.

Reliable Quantitative Data Solution 

In conclusion, quantitative data analysis offers deeper insight into market trends and patterns, empowering you to make well-informed financial decisions. However, collecting comprehensive data and analyzing it can be a complex task that may divert resources from core investment activities.

As a reliable provider, TEJ understands these concerns. Our TEJ Quantitative Investment Database offers high-quality financial and economic data for rigorous quantitative analysis. This data captures the true market conditions at specific points in time, enabling accurate backtesting of investment strategies.

Furthermore, TEJ offers diverse data sets that go beyond basic stock prices, encompassing various financial metrics, company risk attributes, and even broker trading information, all designed to empower your analysis and strategy development. Save resources and unlock the full potential of quantitative finance with TEJ’s data solutions today!

Open access | Published: 01 September 2024

Robust identification of perturbed cell types in single-cell RNA-seq data

Phillip B. Nicol, Danielle Paulson, Gege Qian, X. Shirley Liu, Rafael Irizarry & Avinash D. Sahu

Nature Communications, volume 15, Article number: 7610 (2024)


Subjects: Bioinformatics, Computational models, Data mining, Statistical methods, Transcriptomics

Single-cell transcriptomics has emerged as a powerful tool for understanding how different cells contribute to disease progression by identifying cell types that change across diseases or conditions. However, detecting changing cell types is challenging due to individual-to-individual and cohort-to-cohort variability, and naive approaches based on current computational tools lead to false-positive findings. To address this, we propose a computational tool, scDist , based on a mixed-effects model that provides a statistically rigorous and computationally efficient approach for detecting transcriptomic differences. By accurately recapitulating known immune cell relationships and mitigating false positives induced by individual and cohort variation, we demonstrate that scDist outperforms current methods in both simulated and real datasets, even with limited sample sizes. Through the analysis of COVID-19 and immunotherapy datasets, scDist uncovers transcriptomic perturbations in dendritic cells, plasmacytoid dendritic cells, and FCER1G+ NK cells, providing new insights into disease mechanisms and treatment responses. As single-cell datasets continue to expand, our faster and statistically rigorous method offers a robust and versatile tool for a wide range of research and clinical applications, enabling the investigation of cellular perturbations with implications for human health and disease.


Introduction

The advent of single-cell technologies has enabled measuring transcriptomic profiles at single-cell resolution, paving the way for the identification of subsets of cells with transcriptomic profiles that differ across conditions. These cutting-edge technologies empower researchers and clinicians to study human cell types impacted by drug treatments, infections like SARS-CoV-2, or diseases like cancer. To conduct such studies, scientists must compare single-cell RNA-seq (scRNA-seq) data between two or more groups or conditions, such as infected versus non-infected 1 , responders versus non-responders to treatment 2 , or treatment versus control in controlled experiments.

Two related but distinct classes of approaches exist for comparing conditions in single-cell data: differential abundance prediction and differential state analysis 3 . Differential abundance approaches, such as DA-seq, Milo, and Meld 4 , 5 , 6 , 7 , focus on identifying cell types with varying proportions between conditions. In contrast, differential state analysis seeks to detect predefined cell types with distinct transcriptomic profiles between conditions. In this study, we focus on the problem of differential state analysis.

Past differential state studies have relied on manual approaches involving visually inspecting data summaries to detect differences in scRNA data. Specifically, cells were clustered based on gene expression data and visualized using uniform manifold approximation and projection (UMAP) 8 . Cell types that appeared separated between the two conditions were identified as different 1 . Another common approach is to use the number of differentially expressed genes (DEGs) as a metric for transcriptomic perturbation. However, as noted by ref. 9 , the number of DEGs depends on the chosen significance level and can be confounded by the number of cells per cell type, because this influences the power of the corresponding statistical test. Additionally, this approach does not distinguish between genes with large and small (yet significant) effect sizes.

To overcome these limitations, Augur 9 uses a machine learning approach to quantify the cell-type specific separation between the two conditions. Specifically, Augur trains a classifier to predict condition labels from the expression data and then uses the area under the receiver operating characteristic (AUC) as a metric to rank cell types by their condition difference. However, Augur does not account for individual-to-individual variability (or pseudoreplication 10 ), which we show can confound the rankings of perturbed cell types.

In this study, we develop a statistical approach that quantifies transcriptomic shifts by estimating the distance (in gene expression space) between the condition means. This method, which we call scDist , introduces an interpretable metric for comparing different cell types while accounting for individual-to-individual and technical variability in scRNA-seq data using linear mixed-effects models. Furthermore, because transcriptomic profiles are high-dimensional, we develop an approximation for the between-group differences based on a low-dimensional embedding, which results in a computationally convenient implementation that is substantially faster than Augur . We demonstrate the benefits using a COVID-19 dataset, showing that scDist can recover biologically relevant between-group differences while also controlling for sample-level variability. Furthermore, we demonstrate the utility of scDist by jointly inferring information from five single-cell immunotherapy cohorts, revealing significant differences in a subpopulation of NK cells between immunotherapy responders and non-responders, which we validated in bulk transcriptomes from 789 patients. These results highlight the importance of accounting for individual-to-individual and technical variability for robust inference from single-cell data.

Not accounting for individual-to-individual variability leads to false positives

We used blood scRNA-seq from six healthy controls 1 (see Table 1), and randomly divided them into two groups of three, generating a negative control dataset in which no cell type should be detected as being different. We then applied Augur to these data. This procedure was repeated 20 times. Augur falsely identified several cell types as perturbed (Fig. 1A). Augur quantifies differences between conditions with an AUC summary statistic, related to the amount of transcriptional separation between the two groups (AUC = 0.5 represents no difference). Across the 20 negative control repeats, 93% of the AUCs (across all cell types) were >0.5, and red blood cells (RBCs) were identified as perturbed in all 20 trials (Fig. 1A). This false positive result was in part due to high across-individual variability in cell types such as RBCs (Fig. 1B).

Figure 1. (A) AUCs achieved by Augur on 20 random partitions of healthy controls (n = 6 total patients divided randomly into two groups of 3), with no expected cell type differences (dashed red line indicates the null value of 0.5). (B) Boxplot depicting the second PC score for red blood cells from healthy individuals, highlighting high across-individual variability (each box represents a different individual). The boxplots display the median and first/third quartiles. (C) AUCs achieved by Augur on simulated scRNA-seq data (10 individuals, 50 cells per individual) with no condition differences but varying patient-level variability (dashed red line indicates the ground truth value of no condition difference, AUC 0.5), illustrating the influence of individual-to-individual variability on false positive predictions. Points and error bands represent the mean ± 1 SD. Source data are provided as a Source Data file.

We confirmed that individual-to-individual variation underlies false positive predictions made by Augur using a simulation. We generated simulated scRNA-seq data with no condition-level difference and varying patient-level variability (Methods). As patient-level variability increased, differences estimated by Augur also increased, converging to the maximum possible AUC of 1 (Fig.  1 C): Augur falsely interpreted individual-to-individual variability as differences between conditions.

Augur recommends that unwanted variability should be removed in a pre-processing step using batch correction software. We applied Harmony 11 to the same dataset 1 , treating each patient as a batch. We then applied Augur to the resulting batch corrected PC scores and found that several cell types still had AUCs significantly above the null value of 0.5 (Fig.  S1 a). On simulated data, batch correction as a pre-processing step also leads to confounding individual-to-individual variability as condition difference (Fig.  S1 b).

A model-based distance metric controls for false positives

To account for individual-to-individual variability, we modeled the vector of normalized counts with a linear mixed-effects model. Mixed models have previously been shown to be successful at adjusting for this source of variability 10 . Specifically, for a given cell type, let z_ij be a length- G vector of normalized counts for cell i and sample j ( G is the number of genes). We then model

$$z_{ij} = \alpha + \beta x_j + \omega_j + \varepsilon_{ij}, \tag{1}$$

where α is a vector with entries α_g representing the baseline expression for gene g ; x_j is a binary indicator that is 0 if individual j is in the reference condition and 1 if in the alternative condition; β is a vector with entries β_g representing the difference between condition means for gene g ; ω_j is a random effect that represents the differences between individuals; and ε_ij is a random vector (of length G ) that accounts for other sources of variability. We assume that $\omega_j \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \tau^2 I)$, $\varepsilon_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2 I)$, and that the ω_j and ε_ij are independent of each other.

To obtain normalized counts, we recommend defining z_ij to be the vector of Pearson residuals obtained from fitting a Poisson or negative binomial GLM 12 ; this normalization procedure is implemented in the scTransform function 13 . However, our proposed approach can be used with other normalization methods for which the model is appropriate.

Note that in model (1), the means for the two conditions are α and α + β , respectively. Therefore, we quantify the difference in expression profile by taking the $\ell_2$-norm of the vector β :

$$D = \lVert \beta \rVert_2 = \sqrt{\sum_{g=1}^{G} \beta_g^2}.$$

Here, D can be interpreted as the Euclidean distance between condition means (Fig. 2A).

Figure 2. (A) scDist estimates the distance between condition means in high-dimensional gene expression space for each cell type. (B) To improve efficiency, scDist calculates the distance in a low-dimensional embedding space (derived from PCA) and employs a linear mixed-effects model to account for sample-level and other technical variability. This figure was created with BioRender.com and released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.

Because we expected the vector of condition differences β to be sparse, we improved computational efficiency by approximating D , using a singular value decomposition to find a K × G matrix U , with K much smaller than G , such that

$$D \approx D_K = \sqrt{\sum_{k=1}^{K} (U\beta)_k^2}.$$

With this approximation in place, we fitted model (1), replacing z_ij with U z_ij , to obtain estimates of (Uβ)_k . A challenge with estimating D_K is that the maximum likelihood estimator can have a significant upward bias when the number of patients is small (as is typically the case). For this reason, we employed a post-hoc Bayesian procedure to shrink $(U\beta)_k^2$ towards zero and compute a posterior distribution of D_K 14 . We also provide a statistical test for the null hypothesis that D_K = 0. We refer to the resulting procedure as scDist (Fig. 2B). Technical details are provided in Methods.

We applied scDist to the negative control dataset based on blood scRNA-seq from the six healthy controls used to show the large number of false positives reported by Augur (Fig. 1), and found that the false positive rate was controlled (Fig. 3A, B). We then applied scDist to the data from the simulation study and found that, unlike Augur , the resulting distance estimate does not grow with individual-to-individual variability (Fig. 3C). scDist also accurately estimated distances on fully simulated data (Fig. S2).

Figure 3. (A) Reanalysis of the data from Fig. 1A using distances calculated with scDist ; the dashed red line represents the ground truth value of 0. (B) As in (A), but for p-values computed with scDist ; values above the dashed red line represent p < 0.05. (C) Null simulation from Fig. 1C reanalyzed using distances calculated by scDist ; the dashed red line represents the ground truth value of 0. Points and error bands represent the mean ± 1 SD. (D) Dendrogram generated by hierarchical clustering based on distances between pairs of cell types estimated by scDist . In all panels the boxplots display the median and first/third quartiles. Source data are provided as a Source Data file.

The Euclidean distance D measures perturbation by taking the sum of squared differences across all genes. To show that this measure is biologically meaningful, we applied scDist to obtain estimated distances between pairs of known cell types in the above dataset and then applied hierarchical clustering to these distances. The resulting clustering is consistent with known relationships driven by cell lineages (Fig. 3D): the lymphoid cell types (T and NK cells) clustered together, with B cells further apart, while the myeloid cell types (DCs, monocytes, and neutrophils) were close to each other.

Though the scDist distance D assigns each gene an equal weight (unweighted), scDist includes an option to assign different weights w g to each gene (Methods). Weighting could be useful in situations where certain genes are known to contribute more to specific phenotypes. We conducted a simulation to study the impact of using the weighted distance. These simulations show that when a priori information is available, using the correct weighting leads to a slightly better estimation of the distance. However, incorrect weighting leads to significantly worse estimation compared to the unweighted distance (Fig.  S3 ). Therefore, the unweighted distance is recommended unless strong a priori information is available.

Challenges in cell type annotation are expected to impact scDist's interpretation, much as they do for other methods reliant on a priori cell type annotation 3 , 9 . Our simulations (see Methods) reveal scDist's vulnerability to false negatives when annotations are confounded by condition- or patient-specific factors. However, when clusters are annotated using data from which such differences have been removed, scDist's predictions become more reliable (Fig. S23). Thus, we recommend removing these confounders before annotation. Because potential issues can occur when the inter-condition distance exceeds the inter-cell-type distance, scDist provides a diagnostic plot (Fig. S6) to compare these two distances. scDist also incorporates an additional diagnostic feature (Fig. S24) to identify annotation issues, utilizing a cell-type tree to evaluate cell relationships at different hierarchical levels. Inconsistencies in scDist's output signal potential clustering or annotation errors.

Comparison to counting the number of DEGs

We also compared scDist to the approach of counting the number of differentially expressed genes (nDEG) on pseudobulk samples 3 . Given that the statistical power to detect DEGs is heavily reliant on sample size, we hypothesized that nDEG could become a misleading measure of perturbation in single-cell data with a large variance in the number of cells per cell type. To demonstrate this, we applied both methods to resampled COVID-19 data 1 where the number of cells per cell type was artificially varied between 100 and 10,000. nDEG was highly confounded by the number of cells (Fig.  4 A), whereas the scDist distance remained relatively constant despite the varying number of cells (Fig.  4 B). When the number of subsampled cells is small, the ranking of cell types (by perturbation) was preserved by scDist but not by nDEG (Fig.  S5 a–c). Additionally, scDist was over 60 times faster than nDEG since the latter requires testing all G genes as opposed to K   ≪   G PCs (Fig.  S4 ).

Figure 4. (A) Sampling with replacement from the COVID-19 dataset 1 to create datasets with a fixed number of cells per cell type, and then counting the number of differentially expressed genes (nDEG) for the CD14 monocytes. (B) The same analysis repeated with the scDist distance. (C) Comparing all pairs of cell types on the downsampled 15 dataset and applying hierarchical clustering to the pairwise perturbations; leaves corresponding to T cells are colored blue and the leaf corresponding to monocytes is colored red. (D) The same analysis using the scDist distances. Source data are provided as a Source Data file.

An additional limitation of nDEG is that it does not account for the magnitude of the differential expression. We illustrated this with a simple simulation showing that the number of DEGs between two cell types can be the same (or smaller) despite a larger transcriptomic perturbation in gene expression space (Fig. S7a, b). To demonstrate this on real data, we considered a dataset consisting of eight sorted immune cell types (originally from ref. 15 and combined by ref. 16 ), where scDist and nDEG were applied to all pairs of cell types and the perturbation estimates were visualized using hierarchical clustering. Although both nDEG and scDist performed well when the sample size was balanced across cell types (Fig. S8), nDEG provided inconsistent results when the CD14 monocytes were downsampled to create a heterogeneous cell-type size distribution. Specifically, scDist produced the expected result of clustering the T cells together, whereas nDEG placed the monocytes in the same cluster as B and T cells (Fig. 4C, D), despite the fact that these belong to different lineages. Thus, by taking into account the magnitude of the differential expression, scDist is able to produce results more in line with known biology.

We also considered varying the number of patients on simulated data with a known ground truth. Again, the nDEG (computed using a mixed model, as recommended by ref. 10 ) increases as the number of patients increases, whereas scDist remains relatively stable (Fig.  S9 a). Moreover, the correlation between the ground truth perturbation and scDist increases as the number of patients increases (Fig.  S9 b). Augur was also sensitive to the number of samples and had a lower correlation with the ground truth than both nDEG and scDist .

scDist detects cell types that are different in COVID-19 patient compared to controls

We applied scDist to a large COVID-19 dataset 17 consisting of 1.4 million cells of 64 types from 284 PBMC samples from 196 individuals (171 COVID-19 patients and 25 healthy donors). The large number of samples in this dataset permitted further evaluation of our approach using real data rather than simulations. Specifically, we defined true distances between the two groups by computing the sum of squared log fold changes (across all genes) on the entire dataset and then estimated the distance on random samples of five cases versus five controls. Because Augur does not estimate distances explicitly, we assessed the two methods' ability to accurately recapitulate the ranking of cell types based on the established ground truth distances. We found that scDist recovers the rankings better than Augur (Figs. 5A, S10). When the size of the subsample is increased to 15 patients per condition, the accuracy of scDist in recovering the ground truth rank and distance improves further (Fig. S25).

Figure 5. (A) Correlation between estimated ranks (based on subsamples of 5 cases and 5 controls) and true ranks for each method; points above the diagonal line indicate better agreement of scDist with the true ranking. (B) True distance vs. distances estimated with scDist (dashed line represents y = x); points and error bars represent the mean and 5th/95th percentiles. (C) AUC values achieved by Augur , where color represents likely true (blue) or false (orange) positive cell types. (D) Same as (C), but for distances estimated with scDist . (E) AUC values achieved by Augur against the cell number variation in subsampled datasets (of false positive cell types). (F) Same as (E), but for distances estimated with scDist . In all boxplots the median and first/third quartiles are reported. For (C)–(E), 1000 random subsamples were used. Source data are provided as a Source Data file.

To evaluate scDist 's accuracy further, we defined a new ground truth using the entire COVID-19 dataset, consisting of two groups: four cell types with differences between groups (true positives) and five cell types without differences (false positives) (Fig. S11, Methods). We generated 1000 random samples with only five individuals per cell type and estimated group differences using both Augur and scDist . Augur failed to accurately separate the two groups (Fig. 5C); the median difference estimates of all true positive cell types, except MKI67+ CD8+ T cells, were lower than the median estimates of all true negative cell types. In contrast, scDist 's estimates clearly separated the two groups (Fig. 5D).

Single-cell data can also exhibit dramatic sample-specific variation in the number of cells of specific cell types. This imbalance can arise from differences in collection strategies, biospecimen quality, and technical effects, and can impact the reliability of methods that do not account for sample-to-sample or individual-to-individual variation. We measured the variation in cell numbers within samples by calculating the ratio of the largest sample’s cell count to the total cell counts across all samples (Methods). Augur ’s predictions were negatively impacted by this cell number variation (Figs.  5 E, S12 ), indicating its increased susceptibility to false positives when sample-specific cell number variation was present (Fig.  1 C). In contrast, scDist ’s estimates were robust to sample-specific cell number variation in single-cell data (Fig.  5 F).

To further demonstrate the advantage of statistical inference in the presence of individual-to-individual variation, we analyzed the smaller COVID-19 dataset 1 with only 13 samples. The original study 1 discovered differences between cases and controls in CD14+ monocytes through extensive manual inspection. scDist identified this same group as the most significantly perturbed cell type. scDist also identified two cell types not considered in the original study, dendritic cells (DCs) and plasmacytoid dendritic cells (pDCs) ( p = 0.01 and p = 0.04, Fig. S13a), although pDCs did not remain significant after adjusting for multiple testing. We note that DCs induce anti-viral innate and adaptive responses through antigen presentation 18 . Our finding is consistent with studies reporting that DCs and pDCs are perturbed by COVID-19 infection 19 , 20 . In contrast, Augur identified RBCs, not CD14+ monocytes, as the most perturbed cell type (Fig. S14). Omitting the patient with the most RBCs markedly reduced Augur 's estimated perturbation between infected and control cases for RBCs (Fig. S14), further suggesting that Augur 's predictions are clouded by patient-level variability.

scDist enables the identification of genes underlying cell-specific across-condition differences

To identify transcriptomic alterations, scDist assigns an importance score to each gene based on its contribution to the overall perturbation (Methods). We assessed this importance score for CD14+ monocytes in the small COVID-19 dataset. In this cell type, scDist assigned the highest importance scores to the genes S100 calcium-binding protein A8 ( S100A8 ) and S100 calcium-binding protein A9 ( S100A9 ) ( p < 10 −3 , Fig. S13b). These genes are canonical markers of inflammation 21 that are upregulated during cytokine storms. Since patients with severe COVID-19 infections often experience cytokine storms, this result suggests that S100A8/A9 upregulation in CD14+ monocytes could be a marker of the cytokine storm 22 . These two genes were also reported to be upregulated in COVID-19 patients in the study of 284 samples 17 .

scDist identifies transcriptomic alterations associated with immunotherapy response

To demonstrate the real-world impact of scDist , we applied it to four published datasets used to understand patient responses to cancer immunotherapy in head and neck, bladder, and skin cancer 2 , 23 , 24 , 25 . We found that each individual dataset was underpowered to detect differences between responders and non-responders (Fig. S15). To potentially increase power, we combined the data from all cohorts (Fig. 6A). However, analyzing the combined data without accounting for cohort-specific variation led to false positives. For example, responder-versus-non-responder differences estimated by Augur were highly correlated between pre- and post-treatment samples (Fig. 6B), suggesting a confounding effect of cohort-specific variation. Furthermore, Augur predicted that most cell types were altered in both pre-treatment and post-treatment samples (AUC > 0.5 for 41 of 49 cell types in pre-treatment and 44 in post-treatment), which is potentially due to the confounding effect of cohort-specific variation.

Figure 6. (A) Study design: four scRNA-seq discovery cohorts (cited in order as shown 2 , 23 , 24 , 25 ) identify cell-type-specific differences and a differential gene signature between responders and non-responders; this signature was evaluated in validation cohorts of six bulk RNA-seq cohorts (cited in order as shown 32 , 33 , 34 , 35 , 36 , 37 , 38 ). (B) Pre-treatment and post-treatment sample differences estimated using Augur and scDist (Spearman correlation is reported on the plot); the error bars represent the 95% confidence interval for the fitted linear regression line. (C) Significance of the estimated differences ( scDist ). (D) Kaplan–Meier plots display the survival differences in anti-PD-1 therapy patients, categorized into low-risk and high-risk groups using the median value of the NK-2 signature; overall and progression-free survival are shown. (E) NK-2 signature levels in non-responders, partial responders, and responders (bulk RNA-seq cohorts). The boxplots report the median and first/third quartiles. Panel (A) was created with BioRender.com and released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license. Source data are provided as a Source Data file.

To account for cohort-specific variation, we ran scDist including an explanatory variable in the model (equation (18)) to account for cohort effects. With this approach, distance estimates were not significantly correlated between pre- and post-treatment (Fig. 6B); removing these variables re-established the correlation (Fig. S16). scDist predicted CD4+ T and CD8+ T cells as altered pre-treatment (Fig. S17a), and NK, CD8+ T, and B cells as altered post-treatment (Fig. S17b). Analysis of subtypes revealed that FCER1G+ NK cells (NK-2) were changed in both pre-treatment and post-treatment samples (Fig. 6C). To validate this finding, we generated an NK-2 signature differential between responders and non-responders (Fig. S18, Methods) and evaluated it in bulk RNA-seq immunotherapy cohorts comprising 789 patient samples (Fig. 6A). We scored each of the 789 patient samples using the NK-2 differential signature (Methods). The NK-2 signature scores were significantly associated with overall and progression-free survival (Fig. 6D) as well as radiology-based response (Fig. 6E). We similarly evaluated the top Augur prediction: a differential signature from plasma cells, the top cell type predicted by Augur , did not show an association with response or survival outcomes in the 789 bulk transcriptomes (Fig. S19, Methods).

scDist is computationally efficient

A key strength of the linear modeling framework used by scDist is that it is efficient on large datasets. For instance, on the COVID-19 dataset with 13 samples 1 , scDist completed the analysis in around 50 seconds, while Augur required 5 minutes. To better understand how runtime depends on the number of cells, we applied both methods to subsamples of the dataset that varied in size and observed that scDist was, on average, five-fold faster (Fig.  S20 ). scDist is also capable of scaling to millions of cells. On simulated data, scDist required approximately 10 minutes to fit a dataset with 1,000,000 cells (Fig.  S21 ). We also tested the sensitivity of scDist to the number of PCs used by comparing D K for various values of K . We observed that the estimated distances stabilize as K increases (Fig.  S22 ), justifying K  = 20 as a reasonable choice for most datasets.

Discussion

The identification of cell types influenced by infections, treatments, or biological conditions is crucial for understanding their impact on human health and disease. We present scDist , a statistically rigorous and computationally fast method for detecting cell-type-specific differences across multiple groups or conditions. By using a mixed-effects model, scDist estimates the difference between groups while quantifying the statistical uncertainty due to individual-to-individual variation and other sources of variability. We validated scDist through the unbiased recapitulation of known relationships between immune cells and demonstrated its effectiveness in mitigating false positives from patient-level and technical variation in both simulated and real datasets. Notably, scDist facilitates biological discoveries from scRNA cohorts even when the number of individuals is limited, a common occurrence in human scRNA-seq datasets. We also pointed out how the detection of cell-type-specific differences can be obscured by batch effects or other confounders, and how the linear model used by our approach permits accounting for these.

Since the same expression data is used for annotation and by scDist , there are potential issues associated with "double dipping." Our simulation highlighted this issue by showing that condition-specific effects can result in over-clustering and downward bias in the estimated distances (Methods, Fig. S23). Users can avoid these false negatives by using annotation approaches that control for patient- and condition-specific effects. scDist provides two diagnostic tools to help users identify potential issues in their annotation (Figs. S24 and S6), including one that estimates distances at multiple cluster resolutions (Fig. S24). Despite this, significant errors in clustering and annotation could cause unavoidable bias in scDist , and thus designing a cluster-free extension of scDist is an area for future work. Another point of sensitivity for scDist is the choice of the number of principal components used to estimate the distance. Although in practice we observed that the distance estimate is stable as the number of PCs varies between 20 and 50 (Fig. S22), an adaptive approach for selecting K could improve performance and maximize power. Finally, although Pearson residual-based normalized counts 12 , 13 are the recommended input for scDist , if the available data were normalized by another, sub-optimal approach, scDist 's performance could be affected. A future version could adapt the model and estimation procedure so that scDist can be applied directly to the counts, avoiding potential problems introduced by normalization.

We believe that scDist will have extensive utility, as the comparison of single-cell experiments between groups is a common task across a range of research and clinical applications. In this study, we have focused on examining discrete phenotypes, such as infected versus non-infected (in COVID-19 studies) and responders vs. non-responders to checkpoint inhibitors. However, the versatility of our framework allows for extension to experiments involving continuous phenotypes or conditions, such as height, survival, and exposure levels, to name a few. As single-cell datasets continue to grow in size and complexity, scDist will enable rigorous and reliable insights into cellular perturbations with implications for human health and disease.

Normalization

Our method takes as input a normalized count matrix (with corresponding cell type annotations). We recommend using scTransform 13 to normalize, although the method is compatible with any normalization approach. Let y_ijg be the UMI counts for gene 1 ≤ g ≤ G in cell i from sample j . scTransform fits the following model:

$$y_{ijg} \sim \mathrm{NB}(\mu_{ijg}, \theta_g), \qquad \log \mu_{ijg} = \beta_{g0} + \log r_{ij},$$

where r_ij is the total number of UMI counts for the particular cell. The normalized counts are given by the Pearson residuals of the above model:

$$z_{ijg} = \frac{y_{ijg} - \hat{\mu}_{ijg}}{\sqrt{\hat{\mu}_{ijg} + \hat{\mu}_{ijg}^2/\hat{\theta}_g}}.$$
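As a hedged sketch (not the scTransform implementation itself), the Pearson-residual idea can be reproduced for a single gene in a few lines of R, using a Poisson GLM with a log total-count offset; all values are simulated:

```r
set.seed(1)
total  <- rpois(300, lambda = 5000)          # r_ij: total UMI counts per cell
counts <- rpois(300, lambda = total * 1e-3)  # y_ijg: counts for one gene g

# Poisson GLM with log(total) as an offset, as described above
fit <- glm(counts ~ 1 + offset(log(total)), family = poisson)
z   <- residuals(fit, type = "pearson")      # z_ijg: normalized counts for gene g
summary(z)
```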

Distance in normalized expression space

In this section, we describe the inferential procedure of scDist for cases without additional covariates. However, the procedure can be generalized to the full model (18) with arbitrary covariates (design matrix) incorporating random and fixed effects, as well as nested-effect mixed models. For a given cell type, we model the G -dimensional vector of normalized counts as

$$z_{ij} = \alpha + \beta x_{ij} + \omega_j + \varepsilon_{ij}, \tag{6}$$

where $\alpha, \beta \in \mathbb{R}^G$, x_ij is a binary indicator of condition, $\omega_j \sim \mathcal{N}(0, \tau^2 I_G)$, and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2 I_G)$. The quantity of interest is the Euclidean distance between condition means α and α + β :

$$D = \lVert \beta \rVert_2.$$

If $U \in \mathbb{R}^{G \times G}$ is an orthonormal matrix, we can apply U to equation (6) to obtain the transformed model:

$$U z_{ij} = U\alpha + U\beta x_{ij} + U\omega_j + U\varepsilon_{ij}.$$

Since U is orthogonal, U ω_j and U ε_ij still have spherical normal distributions. We also have that

$$\lVert U\beta \rVert_2 = \lVert \beta \rVert_2 = D.$$

This means that the distance in the transformed model is the same as in the original model. As mentioned earlier, our goal is to find U such that

$$D^2 \approx D_K^2 = \sum_{k=1}^{K} (U\beta)_k^2,$$

with K ≪ G .

Let $Z \in \mathbb{R}^{n \times G}$ be the matrix with rows z_ij (where n is the total number of cells). Intuitively, we want to choose a U such that the projection of z_ij onto the first K rows of U ($u_1, \ldots, u_K \in \mathbb{R}^G$) minimizes the reconstruction error

$$\sum_{i,j} \Big\lVert z_{ij} - \mu - \sum_{k=1}^{K} v_{ik} u_k \Big\rVert^2,$$

where $\mu \in \mathbb{R}^G$ is a shift vector and $(v_{ik}) \in \mathbb{R}^{n \times K}$ is a matrix of coefficients. It can be shown that the PCA of Z yields the (orthonormal) u_1 , …, u_K that minimize this reconstruction error 26 .

Given an estimator $\widehat{(U\beta)_k}$ of (Uβ)_k , a naive estimator of D_K is given by taking the square root of the sum of squared estimates:

$$\widehat{D}_K = \sqrt{\sum_{k=1}^{K} \widehat{(U\beta)_k}^2}. \tag{12}$$

However, this estimator can have significant upward bias due to sampling variability. For instance, even if the true distance is 0, $\widehat{(U\beta)_k}$ is unlikely to be exactly zero, and that noise becomes strictly positive when squared.

To account for this, we apply a post-hoc Bayesian procedure to the $\widehat{(U\beta)_k}$ to shrink them towards zero before computing the sum of squares. In particular, we adopt the spike-and-slab model of 14 :

$$(U\beta)_k \sim \pi_0 \delta_0 + \sum_{t=1}^{T} \pi_t \, \mathcal{N}\!\big(0, \sigma_t^2 \, \mathrm{Var}[\widehat{(U\beta)_k}]\big),$$

where $\mathrm{Var}[\widehat{(U\beta)_k}]$ is the variance of the estimator $\widehat{(U\beta)_k}$, δ_0 is a point mass at 0, and π_0 , π_1 , …, π_T are mixing weights (that is, they are non-negative and sum to 1). Ref. 14 provides a fast empirical Bayes approach to estimate the mixing weights and obtain posterior samples of (Uβ)_k . Samples from the posterior of D_K are then obtained by applying formula (12) to the posterior samples of (Uβ)_k . We then summarize the posterior distribution by reporting the median and other quantiles. An advantage of this particular specification is that the amount of shrinkage depends on the uncertainty in the initial estimate of (Uβ)_k .

We use the following procedure to obtain $\widehat{(U\beta)_k}$:

1. Use the matrix of PCA loadings as a plug-in estimator for U . Then U z_ij is the vector of PC scores for cell i in sample j .

2. Estimate (Uβ)_k by using lme4 27 to fit model (6) to the PC scores corresponding to the k -th loading (i.e., each dimension is fit independently).

Note that only the first K rows of U need to be stored.
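To make the procedure concrete, here is a self-contained R sketch of these two steps on simulated data; it produces only the naive (unshrunk) distance estimate and omits the shrinkage and testing steps performed by the actual scDist package:

```r
library(lme4)

set.seed(1)
n_cells <- 600; G <- 100; K <- 10
sample_id <- factor(rep(1:6, each = 100))             # six samples
condition <- ifelse(as.integer(sample_id) <= 3, 0, 1) # three per condition

beta  <- c(rep(0.5, 5), rep(0, G - 5))                # sparse condition effect
omega <- matrix(rnorm(6 * G, sd = 0.2), 6, G)         # sample-level random effects
Z <- outer(condition, beta) +                         # condition shift
     omega[as.integer(sample_id), ] +                 # per-sample shift
     matrix(rnorm(n_cells * G), n_cells, G)           # cell-level noise

pc <- prcomp(Z)                                       # step 1: plug-in estimate of U
S  <- pc$x[, 1:K]                                     # PC scores (n_cells x K)

# Step 2: fit the mixed model independently for each of the K dimensions
Ubeta_hat <- sapply(seq_len(K), function(k) {
  fit <- lmer(S[, k] ~ condition + (1 | sample_id))
  unname(fixef(fit)["condition"])
})

sqrt(sum(Ubeta_hat^2))  # naive estimate of D_K; true D = sqrt(5 * 0.5^2) ~ 1.12
```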

We are particularly interested in testing the null hypothesis of D_K = 0 against the alternative D_K > 0. Because the null hypothesis corresponds to (Uβ)_k = 0 for all 1 ≤ k ≤ K, we can use the sum of individual Wald statistics as our test statistic:

$$W = \sum_{k=1}^{K} W_k, \qquad W_k = \frac{\widehat{(U\beta)_k}^2}{\mathrm{Var}[\widehat{(U\beta)_k}]}. \tag{15}$$

Under the null hypothesis that (Uβ)_k = 0, W_k can be approximated by an $F_{\nu_k, 1}$ distribution, where ν_k is estimated using Satterthwaite's approximation in lmerTest . This implies that

$$W \sim \sum_{k=1}^{K} F_{\nu_k, 1} \tag{16}$$

under the null. Moreover, the W_k are independent because we have assumed that the covariance matrices for the sample- and cell-level noise are multiples of the identity. Equation (16) is not a known distribution, but its quantiles can be approximated using Monte Carlo samples. To make this precise, let W_1 , …, W_M be draws from equation (16), where M = 10^5, and let W* be the value of equation (15) (i.e., the actual test statistic). Then the empirical p-value 28 is computed as

$$p = \frac{1 + \#\{m : W_m \geq W^*\}}{M + 1}.$$

Controlling for additional covariates

Because scDist is based on a linear model, it is straightforward to control for additional covariates, such as the age or sex of a patient, in the analysis. In particular, model (6) can be extended to

$$z_{ij} = \alpha + \beta x_{ij} + \sum_{k=1}^{p} \gamma_k w_{ijk} + \omega_j + \varepsilon_{ij}, \tag{18}$$

where $w_{ijk} \in \mathbb{R}$ is the value of the k -th covariate for cell i in sample j and $\gamma_k \in \mathbb{R}^G$ is the corresponding gene-specific effect of the k -th covariate.
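In the per-dimension mixed-model fit, such covariates enter as additional fixed effects. A minimal lme4 sketch on simulated per-cell PC scores; the age and sex covariates (and all values) are hypothetical:

```r
library(lme4)

set.seed(2)
d <- data.frame(
  score     = rnorm(600),                                  # simulated PC scores
  condition = rep(0:1, each = 300),
  age       = rep(c(45, 52, 61, 38, 57, 49), each = 100),  # hypothetical covariate
  sex       = rep(rep(c("F", "M"), 3), each = 100),        # hypothetical covariate
  sample_id = factor(rep(1:6, each = 100))
)

# Condition effect adjusted for patient-level covariates
fit <- lmer(score ~ condition + age + sex + (1 | sample_id), data = d)
fixef(fit)
```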

Choosing the number of principal components

An important choice in scDist is the number of principal components K . If K is chosen too small, then estimation accuracy may suffer, as the first few PCs may not capture enough of the distance. On the other hand, if K is chosen too large, then power may suffer, as a majority of the PCs will simply be capturing random noise (and adding degrees of freedom to the Wald statistic). Moreover, it is important that K is chosen a priori, as choosing the K that produces the lowest p-values is akin to p-hacking.

If the model is correctly specified, then it is reasonable to choose K = J − 1, where J is the number of samples (or patients). To see why, notice that the mean expression in sample 1 ≤ j ≤ J is

$$\mathbb{E}[z_{ij} \mid \omega_j] = \alpha + \beta x_j + \omega_j.$$

In particular, the J sample means lie on a ( J − 1)-dimensional subspace in $\mathbb{R}^G$. Under the assumption that the condition difference and sample-level variability are larger than the error variance σ², we should expect the first J − 1 PC vectors to capture all of the variance due to differences in sample means.

In practice, however, the model cannot be expected to be correctly specified. For this reason, we find that K = 20 is a reasonable choice when the number of samples is small (as is usually the case in scRNA-seq) and K = 50 for datasets with a large number of samples. This is in line with other single-cell methods, where the number of PCs retained is usually between 20 and 50.

Cell type annotation and “double dipping”

scDist takes as input an annotated list of cells. A common approach to annotate cells is to cluster based on gene expression. Since scDist also uses the gene expression data to measure the condition difference there are concerns associated with “double-dipping” or using the data twice. In particular, if the condition difference is very large and all of the data is used to cluster it is possible that the cells in the two conditions would be assigned to different clusters. In this case scDist would be unable to estimate the inter-condition distance, leading to a false negative. In other words, the issue of double dipping could cause scDist to be more conservative. Note that the opposite problem occurs when performing differential expression between two estimated clusters; in this case, the p -values corresponding to genes will be anti-conservative 29 .

To illustrate, we simulated a normalized count matrix with 4000 cells and 1000 genes in such a way that there are two “true” cell types and a true condition distance of 4 for both cell types (Fig.  S23 a). To cluster (annotate) the cells, we applied k -means with various choices of k and compared results by taking the median inter-condition distance across all clusters. As the number of clusters increases, the median distance decays towards 0, which demonstrates that scDist can produce false negatives when the data is over-clustered (Fig.  S23 b). To avoid this issue, one possible approach is to begin by clustering the data for only one condition and then to assign cells in the other condition by finding the nearest centroid in the existing clusters. When applied to the simulated data this approach is able to correctly estimate the condition distance even when the number of clusters k is larger than the true value.
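A small R sketch of this workaround, with two simulated cell types in a two-dimensional embedding; the condition shift and cluster count are arbitrary choices for illustration:

```r
set.seed(3)
# Condition A: two well-separated cell types in 2D
A <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
# Condition B: the same cell types plus a modest condition shift
B <- A + 0.5 + matrix(rnorm(400, sd = 0.2), ncol = 2)

km <- kmeans(A, centers = 2)  # cluster using condition A only

# Assign each condition-B cell to the nearest existing centroid
nearest <- apply(B, 1, function(x)
  which.min(colSums((t(km$centers) - x)^2)))
table(nearest)
```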

On real data, one approach to identifying possible over-clustering is to apply scDist at various cluster resolutions. We used the expression data from the small COVID-19 data 1 to construct a tree \({{\mathcal{T}}}\) with leaf nodes corresponding to the cell types in the original annotation provided by the authors (Fig. S24; see Appendix A for a description of how the tree is estimated). At each internal node \(v\in {{\mathcal{T}}}\), we applied scDist to the cluster containing all children of v. We can then visualize the estimated distances by plotting the tree (Fig. S24). Situations where the child nodes have a small distance but the parent node has a large distance could be indicative of over-clustering. For example, PB cells are almost exclusively found in cases (1977 cells in cases and 86 cells in controls), suggesting that it is reasonable to consider PB and B cells as a single cell type when applying scDist.

Feature importance

To better understand the genes that drive the observed difference in the CD14+ monocytes, we define a gene importance score. For 1 ≤ k ≤ d and 1 ≤ g ≤ G, the k-th importance score for gene g is \(| {U}_{kg}| {\beta }_{g}\). In other words, the importance score is the absolute value of the gene's k-th PC loading times its expression difference between the two conditions. Note that the gene importance score is 0 if and only if \({\beta }_{g}=0\) or \({U}_{kg}=0\). Since the \({U}_{kg}\) are fixed and known, significance can be assigned to the gene importance score using the differential expression method used to estimate \({\beta }_{g}\).

Simulated single-cell data

We test the method on data generated from model equation (6). To ensure that the "true" distance is D, we use the R package uniformly 30 to draw β from the surface of the sphere of radius D in \({{\mathbb{R}}}^{G}\). The data in Figs. 1C and 3C are obtained by setting β = 0 and σ² = 1 and varying τ² between 0 and 1.

Weighted distance

By default, scDist uses the Euclidean distance D, which treats each gene equally. In cases where a priori information is available about the relevance of each gene, scDist provides the option to estimate a weighted distance \({D}_{w}\), where \(w\in {{\mathbb{R}}}^{G}\) has non-negative components and

$${D}_{w}^{2}=\mathop{\sum }_{g=1}^{G}{w}_{g}{\beta }_{g}^{2}.$$

The weighted distance can be written in matrix form by letting \(W\in {{\mathbb{R}}}^{G\times G}\) be a diagonal matrix with \({W}_{gg}={w}_{g}\), so that

$${D}_{w}=\| \sqrt{W}\beta \| .$$

Thus, the weighted distance can be estimated by instead considering the transformed model in which \(U\sqrt{W}\) is applied to each \({z}_{ij}\). After this transformed model is obtained, estimation and inference for \({D}_{w}\) proceed in exactly the same way as in the unweighted case.

To test the accuracy of the weighted distance estimate, we considered a simulation where each gene had only a 10% chance of having \({\beta }_{g}\ne 0\) (in which case \({\beta }_{g} \sim {{\mathcal{N}}}(0,1)\); otherwise \({\beta }_{g}=0\)). We then considered three scenarios: \({w}_{g}=1\) if \({\beta }_{g}\ne 0\) and \({w}_{g}=0\) otherwise (correct weighting); \({w}_{g}=1\) for all g (unweighted); and \({w}_{g}=1\) randomly with probability 0.1 (incorrect weights). We quantified performance by taking the absolute value of the error between \({\sum }_{g}{\beta }_{g}^{2}\) and the estimated distance. Figure S3 shows that correct weighting slightly outperforms unweighted scDist, but random weights are significantly worse. Thus, the unweighted version of scDist should be preferred unless strong a priori information is available.

Robustness to model misspecification

The scDist model assumes that the cell-specific variance σ² and the sample-specific variance τ² are shared across genes. The purpose of this assumption is to ensure that the noise in the transformed model follows a spherical normal distribution. Violations of this assumption could lead to miscalibrated standard errors and hypothesis tests but should not affect estimation. To demonstrate this, we considered simulated data where each gene has σ_g ~ Gamma(r, r) and τ_g ~ Gamma(r/2, r). As r varies, the quality of the distance estimates does not change significantly (Fig. S26).

Semi-simulated COVID-19 data

COVID-19 patient data for the analysis were obtained from ref. 17, containing 1.4 million cells of 64 types from 284 PBMC samples collected from 196 individuals, including 171 COVID-19 patients and 25 healthy donors.

Ground truth

We define the ground truth as the cell-type specific transcriptomic differences between the 171 COVID-19 patients and the 25 healthy controls. Specifically, we used the following approach to define a ground truth distance:

For each gene g, we computed the log fold change \({L}_{g}\) between COVID-19 cases and controls, with \({L}_{g}={E}_{g}({\rm{Covid}})-{E}_{g}({\rm{Control}})\), where \({E}_{g}\) denotes the log-transformed expression data \(\log (1+x)\).

The ground truth distance is then defined as \(D={\sum }_{g}{L}_{g}^{2}\).

Subsequently, we excluded from further analysis any cell types not present in more than 10% of the samples. For true negative cell types, we identified the top five with the smallest fold changes and a representation of over 20,000 cells within the entire dataset. When attempting similar filtering based on cell count alone, no cell types demonstrated a sufficiently large true distance; consequently, we chose the top four cell types with over 5000 cells as our true positives (Fig. S11).

Using the ground truth, we performed two separate simulation analyses:

1. Simulation analysis I (Fig. 5A, B): Using one half of the dataset (712,621 cells; 132 case samples, 20 control samples), we created 100 subsamples consisting of 5 cases and 5 controls. For each subsample, we applied both scDist and Augur to estimate the perturbation/distance between cases and controls for each cell type. We then computed the correlation between the ground truth ranking (ordering cell types by the sum of log fold changes on the whole dataset) and the ranking obtained by each method. For scDist, we restricted to cell types that had a non-zero distance estimate in each subsample, and for Augur we restricted to cell types that had an AUC greater than 0.5 (Fig. 5A). For Fig. 5B, we took the mean estimated distance across the subsamples in which the given cell type had a non-zero distance estimate, because in some subsamples a given cell type could be completely absent.

2. Simulation analysis II (Fig. 5C–F): We subsampled the COVID-19 cohort (284 PBMC samples from 196 individuals: 171 with COVID-19 infection and 25 healthy controls) to create 1,000 downsampled cohorts, each containing samples from 10 individuals (5 with COVID-19 and 5 healthy controls). We randomly selected each sample for the downsampled cohort from the original COVID-19 cohort and further downsampled the number of cells of each cell type. This downsampling procedure increases both cohort variability and cell-number variation.

Performance evaluation in subsampled cohorts: We applied scDist and Augur to each subsampled cohort, comparing the results for true positive and false positive cell types. We partitioned the sampled cohorts into 10 groups based on cell-number variation, defined as the number of cells in the sample with the highest cell count for false negative cell types divided by the average number of cells across cell types. This procedure highlights the vulnerability of computational methods to cell-number variation, particularly in negative cell types.

Analysis of immunotherapy cohorts

Data collection

We obtained single-cell data from four cohorts 2,23,24,25, including expression counts and patient response information.

Pre-processing

To ensure uniform processing and annotation across the four scRNA-seq cohorts, we analyzed CD45+ cells (removing CD45− cells) in each cohort and annotated cells using Azimuth 31 with the reference provided for CD45+ cells.

Model to account for cohort and sample variance

To account for cohort-specific and sample-specific batch effects, scDist modeled the normalized gene expression (in lme4-style notation) as:

Z ~ X + (1 ∣ γ) + (1 ∣ γ : ω)

Here, Z represents the normalized count matrix; X denotes the binary indicator of condition (responder = 1, non-responder = 0); γ and ω are cohort- and sample-level random effects; and (1 ∣ γ : ω) models the nested effects of samples within cohorts. The inference procedure for the distance, its variance, and significance in the multi-cohort model is analogous to the single-cohort model.

We estimated the signature in the NK-2 cell type using differential expression between responders and non-responders. To account for cohort-specific and patient-specific effects in the differential expression estimation, we employed the linear mixed model described above for estimating distances, performing inference for each gene separately. The coefficient of X inferred from the linear mixed model was used as the estimate of differential expression for each gene.

Bulk RNA-seq cohorts

We obtained bulk RNA-seq data from seven cancer cohorts 32 , 33 , 34 , 35 , 36 , 37 , 38 , comprising a total of 789 patients. Within each cohort, we converted counts of each gene to TPM and normalized them to zero mean and unit standard deviation. We collected survival outcomes (both progression-free and overall) and radiologic-based responses (partial/complete responders and non-responders with stable/progressive disease) for each patient.

Evaluation of signature in bulk RNA-seq cohorts

We scored each bulk transcriptome (sample) for the signature using the strategy described in ref. 39. Specifically, the score was defined as the Spearman correlation between the normalized expression and the differential expression in the signature. We stratified patients into two groups using the median score. Kaplan–Meier plots were generated using these stratifications, and the significance of survival differences was assessed using the log-rank test. To demonstrate the association of signature levels with radiological response, we plotted signature levels separately for non-responders, partial responders, and responders.

Evaluating the Augur signature in bulk RNA-seq cohorts

A differential signature was derived for Augur ’s top prediction, plasma cells, using a procedure analogous to the one described above for scDist . This plasma signature was then assessed in bulk RNA-seq cohorts following the same evaluation strategy as applied to the scDist signature.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

Table 1 gives a list of the datasets used in each figure, as well as details about how the datasets can be obtained. Source data are provided with this paper.

Code availability

scDist is available as an R package and can be downloaded from GitHub 40: github.com/phillipnicol/scDist. The repository also includes scripts to replicate some of the figures and a demo of scDist using simulated data.

References

1. Wilk, A. J. et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 26, 1070–1076 (2020).

2. Yuen, K. C. et al. High systemic and tumor-associated IL-8 correlates with reduced clinical benefit of PD-L1 blockade. Nat. Med. 26, 693–698 (2020).

3. Crowell, H. L. et al. muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).

4. Helmink, B. A. et al. B cells and tertiary lymphoid structures promote immunotherapy response. Nature 577, 549–555 (2020).

5. Zhao, J. et al. Detection of differentially abundant cell subpopulations in scRNA-seq data. Proc. Natl Acad. Sci. 118, e2100293118 (2021).

6. Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40, 245–253 (2022).

7. Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39, 619–629 (2021).

8. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv https://arxiv.org/abs/1802.03426 (2018).

9. Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nat. Biotechnol. 39, 30–34 (2021).

10. Zimmerman, K. D., Espeland, M. A. & Langefeld, C. D. A practical solution to pseudoreplication bias in single-cell studies. Nat. Commun. 12, 1–9 (2021).

11. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

12. Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 20, 1–16 (2019).

13. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 1–15 (2019).

14. Stephens, M. False discovery rates: a new deal. Biostatistics 18, 275–294 (2017).

15. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).

16. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).

17. Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913 (2021).

18. Galati, D., Zanotta, S., Capitelli, L. & Bocchino, M. A bird's eye view on the role of dendritic cells in SARS-CoV-2 infection: perspectives for immune-based vaccines. Allergy 77, 100–110 (2022).

19. Pérez-Gómez, A. et al. Dendritic cell deficiencies persist seven months after SARS-CoV-2 infection. Cell. Mol. Immunol. 18, 2128–2139 (2021).

20. Upadhyay, A. A. et al. TREM2+ and interstitial macrophages orchestrate airway inflammation in SARS-CoV-2 infection in rhesus macaques. bioRxiv https://www.biorxiv.org/content/10.1101/2021.10.05.463212v1 (2021).

21. Wang, S. et al. S100A8/A9 in inflammation. Front. Immunol. 9, 1298 (2018).

22. Mellett, L. & Khader, S. A. S100A8/A9 in COVID-19 pathogenesis: impact on clinical outcomes. Cytokine Growth Factor Rev. 63, 90–97 (2022).

23. Luoma, A. M. et al. Tissue-resident memory and circulating T cells are early responders to pre-surgical cancer immunotherapy. Cell 185, 2918–2935 (2022).

24. Yost, K. E. et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat. Med. 25, 1251–1259 (2019).

25. Sade-Feldman, M. et al. Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell 175, 998–1013 (2018).

26. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2, 559–572 (1901).

27. Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48 (2015).

28. North, B. V., Curtis, D. & Sham, P. C. A note on the calculation of empirical P values from Monte Carlo procedures. Am. J. Hum. Genet. 71, 439–441 (2002).

29. Neufeld, A., Gao, L. L., Popp, J., Battle, A. & Witten, D. Inference after latent variable estimation for single-cell RNA sequencing data. arXiv https://arxiv.org/abs/2207.00554 (2022).

30. Laurent, S. uniformly: uniform sampling. R package version 0.2.0 https://CRAN.R-project.org/package=uniformly (2022).

31. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).

32. Mariathasan, S. et al. TGFβ attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells. Nature 554, 544–548 (2018).

33. Weber, J. S. et al. Sequential administration of nivolumab and ipilimumab with a planned switch in patients with advanced melanoma (CheckMate 064): an open-label, randomised, phase 2 trial. Lancet Oncol. 17, 943–955 (2016).

34. Liu, D. et al. Integrative molecular and clinical modeling of clinical outcomes to PD1 blockade in patients with metastatic melanoma. Nat. Med. 25, 1916–1927 (2019).

35. McDermott, D. F. et al. Clinical activity and molecular correlates of response to atezolizumab alone or in combination with bevacizumab versus sunitinib in renal cell carcinoma. Nat. Med. 24, 749–757 (2018).

36. Riaz, N. et al. Tumor and microenvironment evolution during immunotherapy with nivolumab. Cell 171, 934–949 (2017).

37. Miao, D. et al. Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science 359, 801–806 (2018).

38. Van Allen, E. M. et al. Genomic correlates of response to CTLA-4 blockade in metastatic melanoma. Science 350, 207–211 (2015).

39. Sahu, A. et al. Discovery of targets for immune–metabolic antitumor drugs identifies estrogen-related receptor alpha. Cancer Discov. 13, 672–701 (2023).

40. Nicol, P. scDist. https://doi.org/10.5281/zenodo.12709683 (2024).


Acknowledgements

P.B.N. is supported by NIH T32CA009337. A.D.S. received support from R00CA248953, the Michelson Foundation, and was partially supported by the UNM Comprehensive Cancer Center Support Grant NCI P30CA118100. We express our gratitude to Adrienne M. Luoma, Shengbao Suo, and Kai W. Wucherpfennig for providing the scRNA data 23 . We also thank Zexian Zeng for assistance with downloading and accessing the bulk RNA-seq dataset.

Author information

Authors and Affiliations

Harvard University, Cambridge, MA, USA

Phillip B. Nicol & Danielle Paulson

University of California San Diego School of Medicine, San Diego, CA, USA

Dana-Farber Cancer Institute, Boston, MA, USA

X. Shirley Liu & Rafael Irizarry

University of New Mexico Comprehensive Cancer Center, Albuquerque, NM, USA

Avinash D. Sahu


Contributions

P.B.N., D.P., G.Q., X.S.L., R.I., and A.D.S. conceived the study. P.B.N. and A.D.S. implemented the method and performed the experiments. P.B.N., R.I., and A.D.S. wrote the manuscript.

Corresponding authors

Correspondence to Rafael Irizarry or Avinash D. Sahu.

Ethics declarations

Competing interests

X.S.L. conducted this work while on the faculty at DFCI and is currently a board member and CEO of GV20 Therapeutics. P.B.N., D.P., G.Q., R.I., and A.D.S. declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information


Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Nicol, P.B., Paulson, D., Qian, G. et al. Robust identification of perturbed cell types in single-cell RNA-seq data. Nat. Commun. 15, 7610 (2024). https://doi.org/10.1038/s41467-024-51649-3


Received: 14 December 2023

Accepted: 09 August 2024

Published: 01 September 2024

DOI: https://doi.org/10.1038/s41467-024-51649-3




Choosing Data Types and Column Options

Ben Johnston


An entire post on choosing the correct data types either seems like overkill or much needed and overdue. The perspective might vary based on the databases you’ve worked with recently. I decided to write this after seeing some code with data type decisions that I would classify as questionable. There are many decisions in technology that can be ambiguous, but the correct data type should be based on business rules and a set of technical guidelines. I am going to share my thought process for deciphering the correct type here.

Selecting a data type is an important part of database and table design. A column represents an actual business attribute, is used to support data integrity, or is used for performance considerations. Care should be taken when selecting the definition for each column. Choosing the wrong type can impact each of these areas, make the system difficult to work with, and make integrations harder than necessary.

This post is targeted at data types natively available in SQL Server, but the basic concepts apply to all data engines. The physical data model should be representative of the underlying attribute it is storing. The names and exact data types will differ for each engine, but the same general rules apply to each.

These rules also apply to existing databases, but the threshold for change is much higher. An existing system presumably has applications or reports built on that data, which makes it much more complicated than choosing the correct data type during the initial design phase. If you are refactoring code, it is useful to look at the data types. This is especially true if the current data types are obviously wrong and causing problems. In these cases, it is worth the effort to change the data types.

Metrics for choosing a data type

  • Accurately represents the attribute
  • Sized correctly
  • Performance
  • Clear intent of purpose

The first metric of a good data type is that the type chosen is the best representation of the attribute. In an RDBMS, the tables shouldn’t look like an Excel spreadsheet or a sloppily imported CSV. A numeric column should have a numeric data type. A date should be in a date column. If you have a good understanding of the data, the decision process should be clear. If not, you’ll need to profile the data.

The second metric is that the data type allows for future growth, but is not too large. It is sized correctly. This is a metric for most data types. To understand if the column is sized correctly, you need an understanding of the business attribute. Sometimes this is easy and sometimes you have to use your best guess. But the analysis should happen and care should be taken when choosing column sizes. If an attribute will never be larger than 255, such as number of bathrooms in a house, it should be a tinyint instead of int or even a smallint.
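
Here’s a minimal sketch of what right-sizing looks like (the table and column names are hypothetical, not from any real system):

    CREATE TABLE dbo.House
    (
        HouseId    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        Bedrooms   tinyint       NOT NULL,  -- 0-255 is more than enough
        Bathrooms  tinyint       NOT NULL,  -- same: will never exceed 255
        SquareFeet smallint      NOT NULL,  -- up to 32,767 covers any house
        ListPrice  decimal(11,2) NOT NULL   -- exact numeric, sized for realistic prices
    );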

The third metric is performance. This will generally follow from choosing the correct data type, but it may influence decisions you make regarding data type decisions. Smaller data storage results in faster I/O. Data types with a fixed storage space (int, decimal, date, etc.) don’t have flexibility in their data footprint. To reduce the space used, you need to choose a different type. It’s not wrong to choose a bigint if that potential storage space is needed. But if you only need space for an int, it’s wasted space, which means wasted performance.

The fourth metric to consider is consistency. Data types should match between tables for the same attribute (i.e., the UserName column should be defined identically in each table in which it resides). This is obvious, but it isn’t always implemented. When the same attribute is represented in multiple tables and columns, it is very confusing when consistency is lacking. Consistency is enforced if you have foreign keys linking the tables. If the tables are denormalized report tables or are not modeled ideally, these constraints won’t be in place and mismatched data types are possible. In addition to the confusion, there is a performance penalty when you join these tables in a query and the data types don’t match. You will see this in query plans as implicit conversions.
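
A hedged sketch of the problem (hypothetical tables; the exact plan behavior depends on collation and indexing):

    -- The same attribute defined with mismatched types.
    CREATE TABLE dbo.AppUser    (UserName varchar(50)  NOT NULL PRIMARY KEY);
    CREATE TABLE dbo.LoginAudit (UserName nvarchar(50) NOT NULL);

    -- varchar has lower type precedence, so a.UserName is implicitly
    -- converted to nvarchar (CONVERT_IMPLICIT in the plan), which can
    -- turn an index seek on dbo.AppUser into a scan.
    SELECT a.UserName
    FROM dbo.AppUser AS a
    INNER JOIN dbo.LoginAudit AS l
        ON a.UserName = l.UserName;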

All of these metrics lead to intent of purpose. One of the primary reasons for carefully selecting your data type is to indicate the purpose of the column to anyone looking at that attribute. Everything you can do in your design to make interpreting objects (entities, attributes, etc.) easier is a good thing and almost always worth any extra effort put into the decision. Usually, correctly choosing your data type makes everything else easier too. I only say “usually” easier because I’m sure there are some edge cases, but they are the exceptions. Clear intent of purpose is a good indicator of correct column design. It isn’t a guarantee, but it certainly adds weight to your decision.

Rules for data type selection

Based on the metrics above, and good design practices, the following are the rules I use and recommend for selecting data types. There is room for some debate, and these reflect my personal preferences, but they are a good starting point. The main objective is to carefully consider your design decisions and understand why you are selecting a particular type. Consistency is also a primary objective. Sometimes an imperfect type is the better choice, if it is consistent with the rest of the database.

Approximate numerics

Only used for data that does not fit into decimal or other precise types

Float / real columns should only be used for numeric data that doesn’t fit into the decimal data type. As the data category in the Microsoft documentation states directly, these are approximate data types. They should not be used for exact numerics. The examples I like to use for this are things like the number of meters to the sun, or the mass of the earth measured in kilos – anything that would normally be represented using scientific notation. These are generally large, imprecise values. You wouldn’t use approximate numerics for something you can easily measure, such as the weight of cargo on a truck or the number of hectares in a corn field; these would be better represented using a decimal or integer.

Modern implementations of the floating-point standard make the output from the CPU consistent, but different data products handle the output and rounding of approximate numbers differently. This makes scenarios such as testing, comparing data between systems, and reporting more complicated. It can also be challenging when synchronizing data between systems due to the imprecision. The potential for rounding differences and rounding errors makes implementation more complicated, and is another reason I rarely use these types.
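
A small demonstration of the rounding behavior (this is a property of binary floating point, not of SQL Server specifically):

    -- Ten additions of 0.1 as float typically do not sum to exactly 1.0.
    DECLARE @f float = 0, @d decimal(10, 1) = 0, @i int = 0;
    WHILE @i < 10
    BEGIN
        SET @f = @f + CAST(0.1 AS float);
        SET @d = @d + CAST(0.1 AS decimal(10, 1));
        SET @i += 1;
    END;
    SELECT CASE WHEN @f = 1.0 THEN 'equal' ELSE 'not equal' END AS FloatResult,   -- 'not equal'
           CASE WHEN @d = 1.0 THEN 'equal' ELSE 'not equal' END AS DecimalResult; -- 'equal'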

The main consideration I have seen for using approximate numerics is faster arithmetic operations. If fast arithmetic within the SQL Server engine is extremely important for your application, the tradeoffs may be worth it. Usually, high-priority calculations are done in an API layer, so in practice I haven’t seen this as sufficient reason to use floating-point data types except when dictated by the underlying attribute.

Character strings

Should be sized correctly and avoid fixed length strings if not needed

The primary consideration with character strings is determining the correct length. If I see a table with many varchar columns of the same length, I suspect a modeling problem or poor data profiling. There isn’t a performance penalty for having a varchar field defined larger than needed, but it is suspicious. If the table looks like an Excel file (with all varchar(255) or similar columns), I really suspect that no analysis was done. When I see tables like this, I also expect to find other design issues.

The other big consideration for character strings is variable-length strings stored in char columns. This is wasted space (and wasted performance), and can lead to space-padded columns and subsequent cleanup for reports and system integrations. If a varchar column is sized correctly and anticipates all possible values, it is correct. The easiest way to know if it is sized correctly is if the size matches the source of truth, the data governance model, or the rules implemented in the GUI and API layers. These items should all match and be reflected in the data model.
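
A quick sketch of the padding behavior:

    -- char pads to its full width; varchar stores only what you insert.
    DECLARE @fixed    char(10)    = 'OH';
    DECLARE @variable varchar(10) = 'OH';
    SELECT DATALENGTH(@fixed)    AS FixedBytes,    -- 10: padded with trailing spaces
           DATALENGTH(@variable) AS VariableBytes, -- 2
           '[' + @fixed + ']'    AS Padded;        -- '[OH        ]' needs RTRIM downstream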

Unicode character strings

Only used when necessary, driven by business need

Unicode data types (nchar / nvarchar) follow the same rules as the standard character strings. They should be sized correctly. This type allows expanded character sets to be stored correctly. As with character strings, if there are multiple columns with the same Unicode data type and length, it is a code smell.

The difference between Unicode strings and the ASCII character strings, other than the obvious ability to store Unicode strings, is the storage space required for each. Unicode strings are also called wide data types. As that name implies, they are bigger than the standard character strings – they require double the space. That translates directly to performance (via rows per page). It can have an even bigger impact when used as part of the cluster.

Unicode character strings should only be used when necessary. SQL Server performance is highly dependent on limiting the amount of I/O required to return the data needed for a particular query. Each data fetch operation retrieves a page (8K) of data at a time. The more rows of data that can be packed into a page, the better that table will perform (at a very generalized, simple level). Reducing the size of your columns and, in turn, increasing the number of rows on a page of data is a free performance gain. Conversely, decreasing the number of rows per page for no reason is giving away performance.
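
A minimal illustration of the doubled storage:

    -- Unicode types store two bytes per character.
    DECLARE @narrow varchar(20)  = 'Cedar Rapids';
    DECLARE @wide   nvarchar(20) = N'Cedar Rapids';
    SELECT DATALENGTH(@narrow) AS NarrowBytes,  -- 12
           DATALENGTH(@wide)   AS WideBytes;    -- 24: half as many rows fit per 8K page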

I recently diagnosed this issue. The column design was a clear indicator that no thought had gone into the data modeling for a table. An import used nvarchar(255) for an attribute that would never be bigger than 20 characters wide and would never need Unicode support. A varchar(20) would be a reasonable choice for this column. Even worse, this attribute was represented in several other tables in the database using the correct type. This kind of sloppiness causes performance issues and confusion when doing system integrations. I was getting ready to suggest an RLS access predicate for this column and had used varchar(20) in it, which could have caused more problems.

My primary rule with Unicode data is that it should only be used when required. If you produce software whose audience and data use only a Latin character set, your default should be char / varchar. If Unicode is needed, that’s fine. But it shouldn’t be the default unless it is driven by an actual business need.

Exact numerics

Use the correct size and avoid them for strings that look like numbers

Exact numerics are for exactly that purpose: exact numerical values. Acres in a field, grams of a product, salary, units sold, or anything else that is an exact number and will fit into one of the exact numeric data types. Refer to the size boundaries in the documentation (repeated below) for those boundaries. Most exact numeric values will fit into one of these categories. As with strings, exact numerics should be sized correctly. They should be the smallest size possible while allowing for future growth, and be consistent within the enterprise.

There are many attributes that contain only numeric values but should not be stored as a number. These character strings that only look like numbers should be stored as character strings. And since they only contain 0-9 (and maybe some formatting), there is no need for Unicode character strings. I generally consider it a code smell if these values are stored as exact numerics. Attributes of this type include:

  • Social security number / tax id
  • Phone numbers
  • House number / Street number

Why is this a code smell, and why aren’t they exact numerics? No math is performed on these attributes. If it doesn’t make sense to aggregate these numbers or perform math on them, it’s an indication they may not fit into an exact numeric. The second consideration is whether the value could have a leading zero. If stored as an exact numeric, that leading zero would be lost. It requires conversion to a string and some calculations to put those zeroes back. Yes, it is easy to do those calculations using LEN and REPLICATE, but avoid the need. If the value requires character formatting (such as a Social Security Number or a 9-digit ZIP code), store it as a character string. The last item is probably obvious, but I’ll mention it for the sake of completeness. If future values might have non-numeric characters intermingled with the numeric values, it is definitely not an exact numeric.
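
A sketch of the leading-zero problem, using a ZIP code as the example attribute:

    -- An int silently drops the leading zero; char keeps it.
    DECLARE @zipAsInt  int     = 02134;    -- parsed and stored as 2134
    DECLARE @zipAsChar char(5) = '02134';
    SELECT @zipAsInt  AS IntZip,           -- 2134
           @zipAsChar AS CharZip,          -- '02134'
           -- the repair work mentioned above (REPLICATE-style padding):
           RIGHT(REPLICATE('0', 5) + CAST(@zipAsInt AS varchar(5)), 5) AS Repaired;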

Using bit is a judgement call – stay consistent

Bit columns take a full byte to store, with the caveat that up to 8 bit columns in a table will share that byte. That means space is only saved if you have more than one bit column in a table; with a single bit column, there is no space savings. Given the amount of space I see wasted in table design, this potential savings won’t sway my decision process. The bigger consideration for bit columns is that you can’t use some aggregation operations on the bit data type. If you want to perform a SUM, you need to first convert the column to tinyint, then perform the aggregation, as shown below. I find myself doing these types of aggregations frequently for data profiling operations, such as null analysis. That alone is generally enough for me to choose tinyint over bit. The advantage of bit is that the intention of purpose is clear and the allowed values are limited. Stay consistent in your design and understand the reason for choosing, or not choosing, bit as a data type.
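
A minimal sketch, assuming a hypothetical dbo.Customer table with a nullable IsActive bit column:

    -- SUM() is not allowed directly on bit ("Operand data type bit is invalid").
    SELECT COUNT(*)                                          AS TotalRows,
           SUM(CAST(IsActive AS tinyint))                    AS ActiveCount,
           SUM(CASE WHEN IsActive IS NULL THEN 1 ELSE 0 END) AS UnknownCount
    FROM dbo.Customer;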

Date and time

Use the correct size

Date and time data typing follows the same principles as the other types. Use the smallest type that fits the data. If the data doesn’t include the time, use a date column. If you only need a time column, use that rather than a datetime column. If your date and time don’t include seconds, use a smalldatetime. Datetime2 storage space differs based on the precision specified. Datetimeoffset should only be used when you will actually set and use the offset.
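
A short sketch of matching date/time types to the data they carry (hypothetical table):

    CREATE TABLE dbo.Appointment
    (
        AppointmentDate date         NOT NULL,  -- 3 bytes: no time component needed
        CheckInTime     time(0)      NOT NULL,  -- seconds precision is enough here
        CreatedAt       datetime2(3) NOT NULL   -- millisecond precision, 7 bytes
            CONSTRAINT DF_Appointment_CreatedAt DEFAULT (SYSDATETIME())
    );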

Other data types

Text, ntext, and image

Avoid these types and use the newer replacements

Text, ntext, and image are older data types that require special functions for manipulating the data in them, or are simply less efficient than their newer counterparts. These data types are scheduled to be removed in a future version of SQL Server. Use varchar(max), nvarchar(max), and varbinary(max) to replace them as needed.
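
A hedged migration sketch (hypothetical dbo.Notes table with a legacy text column; test on a copy first, since the rewrite can be expensive on large tables):

    -- Move a legacy text column to its modern replacement.
    ALTER TABLE dbo.Notes ALTER COLUMN Body varchar(max) NOT NULL;

    -- Optional follow-up: rewriting the values encourages smaller LOBs
    -- to move in-row under the new type.
    UPDATE dbo.Notes SET Body = Body;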

Uniqueidentifier (GUID)

Avoid using for the clustered index

I understand the argument for using a GUID as your primary key. It can work for mobile applications, gives more control to the application layer, can be part of a partitioning strategy, and is used for some replication types. Given those caveats, I recommend against using a GUID as the primary key, and definitely stay away from it for clusters. Yes, there is a sequential uniqueidentifier (newsequentialid()) that can be used to reduce fragmentation. But that doesn’t reduce the other issues with using a GUID as your cluster. It is still larger than I like to use, which can cause performance problems, and since the cluster is stored with each non-clustered index for lookups, it increases the size of all indexes. It generally doesn’t directly represent anything in the business. It can make the attribute unique, but it doesn’t mean anything to a business user. You will still want to define a unique index on the table to enforce business rules. Use GUIDs if they are needed. Avoid them for clusters.

As a data type, uniqueidentifier stores GUIDs more efficiently than the equivalent string representation. Storing a GUID as a uniqueidentifier also clarifies the intent of purpose for the attribute and makes it easier to find when searching metadata. It can also be used for replication, front-end applications, and geo-partitioning. Use it when necessary, but don’t use it as your cluster.
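
A small sketch of the storage difference:

    -- uniqueidentifier holds a GUID in 16 bytes; the string form needs 36 characters.
    DECLARE @g uniqueidentifier = NEWID();
    SELECT DATALENGTH(@g)                       AS GuidBytes,   -- 16
           DATALENGTH(CONVERT(varchar(36), @g)) AS StringBytes; -- 36 (72 as nvarchar)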

JSON and XML

Use caution with these types

SQL Server has had native XML support for many versions and the JSON datatype has recently been added as a native type. JSON could be stored in a varchar(max) / nvarchar(max) column before this datatype was introduced, but native support includes JSON indexes in a similar fashion to XML indexes.

These datatypes may be useful for edge cases, but don’t use them to be clever and skip data modeling. It makes everything harder later. Your queries, exports, imports and any other data integrations will all need to shred or parse these datatypes to be useful. This is extra logic and extra processing that can easily be avoided.

The native XML and the new native JSON data types can be tempting for application developers. If you use these data types, the individual columns don’t need to be defined in the database and they can be changed dynamically in the application. The native XML and JSON indexes can be used to improve performance. Don’t get pulled into this design trap. It makes the data model hard to traverse. Leave the XML and JSON in the application layer. I have used them for load tables, but they are actually easier to process outside of SQL Server. Ask why you are using these types. Question whether they should be in the database, whether they should be stored as objects in a storage system, or, better yet, whether the data should be modeled correctly in the database.
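
A hedged sketch of the shredding tax (hypothetical dbo.OrderRaw table with an nvarchar(max) OrderJson column, and hypothetical JSON property names):

    -- Every consumer of this data must repeat this parsing logic.
    SELECT j.CustomerId, j.OrderDate, j.Total
    FROM dbo.OrderRaw AS o
    CROSS APPLY OPENJSON(o.OrderJson)
         WITH (CustomerId int            '$.customerId',
               OrderDate  date           '$.orderDate',
               Total      decimal(10, 2) '$.total') AS j;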

SQL_VARIANT

Use caution with this data type

Sql_variant used as a column type is a code smell. It’s not an absolute “do not use” data type, but you really need to consider the use case and decide if it is appropriate. If you are making a new table, take a hard look at these attributes before you assign them to a column. If you find yourself considering a clever EAV (entity-attribute-value) table or pattern and see that you can leverage sql_variant for it, reconsider. Much has already been written warning against EAV, but this is a reminder to model things correctly and not use this pattern. SQL Server, or any RDBMS, is probably not the right choice for your design if this is the pattern you want to use.

Specialized types to be used as appropriate

Additional datatypes are available, and it is generally evident when they should be used. These include geography, geometry, hierarchyid, cursors and table types.

Geography and geometry are spatial datatypes, implemented as .NET CLR data types. They can store something as simple as a single geospatial point or an extremely complex geospatial shape. These datatypes can get quite large and have some limitations, such as incompatibility with querying those data types via linked servers. Even with the size of columns and limitations, they are very useful when you need to store spatial data.

The hierarchyid is a very compact datatype useful for storing, as the name suggests, hierarchy information. Hierarchyid values are not assigned automatically, and they take some work in the application to apply correctly. They are very efficient, but I have seen a lot of confusion about this type, and not many developers are familiar with it. Just because it isn’t a common data type doesn’t mean that you shouldn’t use it, but be aware that you may need to perform additional training and include some guardrails if you use it. Use it as needed, but ensure that the team understands it.

The next two data types can’t be used in table definitions, but are still categorized as data types. You won’t find these in table definitions, but you will likely see them in queries or other SQL scripts.

As the name suggests, the cursor data type is used to define server-side cursors. The standard caveats should be considered when defining cursors. They should be limited in their usage and avoided completely for normal data retrieval. The only time I find a good use for cursors is admin scripts. If you are using them for other purposes, reconsider. It is likely that there is a better method to access the data. The table data type is used for temporary storage. It is required for inline table-valued functions. For other queries you need to decide between a temporary table or a table variable. The decision process for this can be complicated and often requires testing of the specific scenario to choose the best method. Generally, smaller data sets will use table variables. Temp tables are useful when they involve larger data sets or if they need additional indexes applied. Be aware of the main differences and be open to changing the method based on your testing.

Size Boundaries

Be aware of the size boundaries of the various data types so you can choose the best one. Select the smallest type that will still allow for all foreseeable growth. Refer to the documentation for size limits. Remember that this isn’t just a storage issue; it has a direct impact on performance.

  • Bit (1 byte shared by each 8 bit columns, 0-1)
  • Tinyint (1 byte, 0 to 255)
  • Smallint (2 bytes, -2^15 (-32,768) to 2^15-1 (32,767))
  • Int (4 bytes, -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647))
  • Bigint (8 bytes, -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807))
  • Decimal precision 1-9 (5 bytes)
  • Decimal precision 10-19 (9 bytes)
  • Decimal precision 20-28 (13 bytes)
  • Decimal precision 29-38 (17 bytes)
  • Date (3 bytes)
  • Time (3-5 bytes, depending on precision)
  • Smalldatetime (4 bytes)
  • Datetime (8 bytes)
  • Datetime2 (6-8 bytes, depending on precision)
  • Datetimeoffset (8-10 bytes, depending on precision)
  • Varchar (8,000 byte maximum for the non-max type)
  • Nvarchar (4,000 character maximum for the non-max type)
  • Varchar(max) / nvarchar(max) for larger values
  • Avoid text / ntext / image / blob types

Developers should be able to defend any decision for a column definition. That includes the type, the exact size, and the nullability.

Considerations for NULL columns

Before I get too deep into considerations for null values, the concept of null deserves a quick definition. Null isn’t a specific value; it is an unknown value. It doesn’t indicate a blank string, 0, or a specific date. It is unknown.

Null columns can be a contentious topic. I tend to take a pragmatic view. Null values are inevitable with data imports and even with clean, new databases. With a new application where you have complete design control, it can be hard to get the business to agree to your database rules. They want to implement their business rules. It makes sense, since they are the customer in this scenario. Business rules can often be messy and require some flexibility. Null values help implement this flexibility.

Even if you assiduously avoid null values in tables, you will likely have to deal with them at some level. Take that into consideration with your queries and don’t plan on being able to keep null values completely out of your data.

  • Adding a NOT NULL column after a table has been created and is populated with data is a more complicated operation. Refer to the documentation for adding columns.
  • If the value is not nullable in the database, then the value should not be nullable anywhere in the code (API, GUI, ETL).
  • Ask your business analyst / resource analyst if there is a difference between zero and null. That is, does ‘zero’ mean something in the context of the attribute or column?

Magic values

One method for eliminating null is via magic values – a value used in place of null. The system might replace a null date column with ‘1/1/1900’ or a null integer with -9999. I’ve never found magic values to be any less troublesome than null. In fact, I find them to be more cumbersome. The value must be known and tested for in the same way as a null value, but it requires business knowledge to know that the magic value is there. You can usually find these magic values by performing some data profiling, but it still often requires business confirmation. I recommend just using null in these cases.
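
A minimal sketch, assuming a hypothetical dbo.Orders table where ‘1900-01-01’ is the magic value for an unknown ship date:

    -- Ordinary aggregates silently include the magic value...
    SELECT MIN(ShipDate) AS EarliestShip FROM dbo.Orders;  -- 1900-01-01

    -- ...so every query needs the business knowledge to filter it out.
    SELECT MIN(ShipDate) AS EarliestShip FROM dbo.Orders
    WHERE ShipDate <> '19000101';

    -- With NULL instead, MIN() ignores the unknown values automatically.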

Default values

Default values are another method for eliminating null columns. The problem with defaults is the assumption that the default value is correct when you don’t know the actual value of the column. This works for some attributes, but not all of them. The created and modified dates for a row are usually safe to assign a default, but other columns may not be. Like magic values, these defaults will be used in aggregates and calculations. Reports, and other queries using these values, will be impacted by their usage. I like to assign default values when possible, but they should come from business requirements.

Modeling to remove NULL

Modeling key tables using vertical partitioning can be used to eliminate null values. This can become excessive, and it is a more complicated pattern. Even with vertical partitioning, it is hard to eliminate all nullable columns. Use this pattern as appropriate and as the skill of your teams allows.

Domain lists

Forcing the table to have a value is the obvious solution to null issues – just don’t allow it. This can be easier said than done. I list domain lists / lookup tables since they are the common way to provide users with a list of valid values for these columns.

Knowing all the valid values for a column, as provided in a domain list, isn’t necessarily enough to remove the null definition. It might make it easier for users, but it can’t make unknown values known. This is the difference between knowing what is valid versus not knowing the value at all.

Tri-valued logic

The main concern with null values is that developers need to understand the tri-valued logic required to properly evaluate nullable columns. Any particular null value isn’t equal to another null value. It is unknown, so you have to test for it separately with IS NULL, or test it with a function such as ISNULL or COALESCE to replace the NULL values with a default. This can introduce unintended errors in the code. You can’t realistically remove the need for all nulls, so make sure your teams understand this, understand how to evaluate null, and include it in your testing.
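
A short sketch of the behavior:

    -- NULL never compares equal to anything, including another NULL.
    DECLARE @a int = NULL, @b int = NULL;
    SELECT CASE WHEN @a = @b    THEN 'equal'   ELSE 'unknown' END AS EqualsTest, -- 'unknown'
           CASE WHEN @a IS NULL THEN 'is null' ELSE 'value'   END AS IsNullTest, -- 'is null'
           ISNULL(@a, -1)      AS Replaced,      -- -1
           COALESCE(@a, @b, 0) AS FirstNonNull;  -- 0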

Code smells

Many of these items are covered in previous sections, but sometimes they are an indication that there is a problem with the table design, particularly when they are seen together or when a certain threshold has been met.

When I see a table with many varchar(255) columns (or varchar(100), nvarchar(200), etc.), it’s an indication that care wasn’t taken during the design of the table. These columns are an indication that the table was based on a spreadsheet or CSV import and the default values were used.

If the total potential column length for the row exceeds 8000, this is another code smell. It’s not wrong every time, but it is an indicator of a potential issue. Remember that short rows usually perform better – more rows fit on a page and each I/O operation brings back more rows. The exception to this is if the table is defined with sparse columns. That shows intent of purpose and it is probably modeled correctly.

There are some additional patterns I consider a code smell. They should be investigated if you see these on a table. If you are doing a code review and see them on a new table, definitely question the reasoning for the following items.

  • Many columns with varchar(100) / varchar(255), etc.
  • Total column length over 8000
  • Sql_variant
  • Uniqueidentifier on the cluster / PK
  • Attribute with different data types between entities
  • Only Unicode columns used

Notes for conversions

Data isn’t perfect, especially when you aren’t the source of truth for that data. This makes conversions between types inevitable. Conversions should happen as close to the source as is reasonable. I like to import data to load tables unmodified. This makes testing against the source system much easier. But this raw data is never exposed to end users and conversions happen as data leaves these staging tables.

It’s standard to refine data types during import processes or ETL. The usual reasons for this are that the import format is text only (like a CSV file), that the conversion is needed to meet enterprise standards, or that it aligns the data with existing database columns. These conversions should happen as close to the source as possible in your ETL. If you use a load table, I wouldn’t do the transformation during that load, but I would modify the data type as you extract from there, as in the sketch below. If you import directly to the final table, of course you must do the transformation during the import process.
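
A hedged sketch of typing data on the way out of a staging table (hypothetical dbo.StageOrders with all-varchar columns):

    -- TRY_CONVERT returns NULL for unconvertible values instead of failing
    -- the whole load, so bad rows can be routed to an error table.
    SELECT TRY_CONVERT(int,            OrderId)    AS OrderId,
           TRY_CONVERT(date,           OrderDate)  AS OrderDate,
           TRY_CONVERT(decimal(10, 2), OrderTotal) AS OrderTotal
    FROM dbo.StageOrders;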

Making the conversion as close to the source as possible makes it easier to keep the attribute definition consistent in the database. This helps with foreign keys and joins. Data types should match between tables for the same columns. This is forced with foreign keys. Lookups that aren’t explicit FKs won’t have this protection – one more reason to declare these constraints.

Exports and system integrations become easier when the correct data types are used. If you do the analysis on the source of truth, it alleviates that work in connected systems. Data can be fixed by placing a view layer over poorly modeled tables in your system, but I strongly believe the actual data type in the table should be correct. If you extract the data with a view or stored procedure, the conversion should happen there. Fix the data type as close to the source as possible.

Consistency

It is often better to be consistent than correct. If the data governance team is “wrong” or all other systems are not modeled correctly, you only create more problems by fixing the issue in your system. It becomes harder to match and confuses both users and developers. That inconsistency can make problems even worse than the initial bad modeling. Work with your architects at an enterprise level before you become the outlier.

Additional thoughts

Row and page compression remove some of the performance concerns with improper modeling. Even if using row or page compression, data should still be typed correctly. Intent of purpose is a sufficient reason to spend the time designing things properly.

I’d argue that a good design is reason enough to type your attributes correctly. If you get satisfaction from seeing a “correct” system, you understand this. If not, good luck to your team. I hope the performance reasons and ease-of-use reasons are convincing enough. If you’ve read this far, we probably agree.

Collations can be set at the column level, but table-level and, even better, database-level collations should be used most of the time. Setting column-level collations should be the exception, reserved for specific use cases. Set the collation at the highest level possible. Differing collations can cause issues with comparisons between columns and make working with the data harder.

Choosing the correct data type can be tedious, but it’s much more tedious to fix it later. It’s also more time consuming to adjust your queries, ETL, reports and data objects to work around a bad decision. It’s much easier to take your time and choose correctly at the start of a data project. This post outlines my decision process and basic rules, but the most important thing is being consistent and representing the data correctly. Your rules and guidelines may differ, but you should have logical rules that you follow for your environment. Go forth and do the right thing.

  • https://learn.microsoft.com/en-us/sql/t-sql/data-types/data-types-transact-sql?view=sql-server-ver16
  • https://learn.microsoft.com/en-us/sql/t-sql/data-types/table-transact-sql?view=sql-server-ver16


About the author


Ben Johnston

Ben is a data architect from Iowa and has been working with SQL Server since version 6.5 in the late 90's. Ben focuses on performance tuning, warehouse implementations and optimizations, database security, and system integrations.

Ben's contributions

  • T-SQL Programming
  • Database Administration

Ben's latest contributions:

research methods data type

Auditing SQL Server – Part 4 – Database Configuration Audit

This continues my series on auditing SQL Server. The fist parts covered discovery and documentation, server level hardware audits and SQL Server engine level audits....

research methods data type

Auditing SQL Server – Part 3 – SQL Server Configuration Audit

This is the continuation of my series on auditing SQL Server. In the first part, I discussed basic server discovery and documentation. The next section...

research methods data type

Auditing SQL Server – Part 2 – Hardware Audit

This is the second part of my series on auditing SQL Server. In the first part, I discussed basic server discovery and documentation. It covered...

  • Open access
  • Published: 02 September 2024

CTGAN-ENN: a tabular GAN-based hybrid sampling method for imbalanced and overlapped data in customer churn prediction

  • I Nyoman Mahayasa Adiputra 1 &
  • Paweena Wanchai 1  

Journal of Big Data volume 11, Article number: 121 (2024)

Class imbalance is one of many problems in customer churn datasets. Another common problem is class overlap, where the data contain similar instances across classes. The prediction task for customer churn becomes more challenging when there is class overlap in the training data. In this research, we suggest a hybrid method based on tabular GANs, called CTGAN-ENN, to address class overlap and imbalanced data in customer churn datasets. We used five different customer churn datasets from an open platform. CTGAN is a tabular GAN-based oversampling method that addresses class imbalance but still suffers from class overlap. We combined CTGAN with the ENN under-sampling technique to overcome the class overlap. CTGAN-ENN reduced the number of class overlaps for each feature in all datasets. We investigated how effective CTGAN-ENN is with each machine learning technique. Based on our experiments, CTGAN-ENN achieved satisfactory results with KNN, GBM, XGB, and LGB machine learning performance for customer churn predictions. We compared CTGAN-ENN with common over-sampling and hybrid sampling methods; CTGAN-ENN outperformed the other sampling methods, as well as algorithm-level methods with cost-sensitive learning, across several machine learning algorithms. We also compared the time consumption of CTGAN and CTGAN-ENN: CTGAN-ENN required less time than CTGAN. Our research work provides a new framework to handle customer churn prediction problems with several types of imbalanced datasets, and can be useful for real-world customer churn prediction data.

Introduction

Customer churn prediction uses data analysis and predictive algorithms to determine which clients are most likely to quit a company or cease utilizing its goods or services. By anticipating customer turnover, the business may take proactive measures to retain key clients and optimize its marketing and retention strategy. Businesses can increase revenue by accurately predicting client turnover habits and providing retention solutions [1].

The task of classifying customers presents significant hurdles due to class imbalance. Class imbalance denotes a situation where one or more classes are much more prevalent than the others [2]. Imbalanced classes can cause misleading performance metrics and increase false negative predictions. A false negative occurs when the model fails to identify a customer who is likely to churn. This condition can significantly undermine retention strategies, because customers who will churn are predicted not to.

Data-level approaches are one way to address the class imbalance problem [3]. They operate at the preprocessing stage and are independent of the machine learning prediction method. The Generative Adversarial Network (GAN) is a type of neural-network-based generative model designed to generate realistic synthetic samples [4].

Class imbalance has been addressed with GAN-based oversampling; nevertheless, class imbalance is not the only difficulty with customer churn data. Class overlap, the degree of similarity between instances of distinct classes, is another issue that GANs cannot resolve. Class overlap makes it challenging to train a classifier that can distinguish between the classes accurately, and such poor training data conditions degrade machine learning performance [5].

The most recent work on GAN-based hybrid sampling [2] overcomes class overlap and achieves the best results compared to other oversampling and hybrid sampling methods. We propose a research framework using a tabular type of GAN called CTGAN. Tabular data frequently contains categorical variables, numerical values, and specific relationships between columns, along with other structural features; it has a particular structure and constraints that traditional GANs are not well adapted to handle. Tabular GANs can be configured to meet the difficulties posed by structured data, making them a valuable tool for tabular data generation. Our experiment focuses on customer churn problems and tests the method with both classical and ensemble machine learning; its objective is to observe the differences in evaluation metrics and execution time across all algorithms. Besides data-level solutions, we compared CTGAN-ENN with an algorithm-level solution, cost-sensitive learning.

Our research objectives are highlighted as follows:

  • Customer churn prediction datasets with high-dimensional features and different imbalance ratios.
  • How CTGAN-ENN can handle class imbalance and class overlap in customer churn datasets.
  • The effectiveness of CTGAN-ENN across different machine learning algorithms and several evaluation metrics in customer churn prediction.
  • A comparison between data-level and algorithm-level solutions and CTGAN-ENN's performance.
  • Algorithm time consumption of the original CTGAN versus CTGAN-ENN in customer churn prediction.

The novelty of this paper is that it is the first work to investigate a combination of a tabular GAN (CTGAN) with an under-sampling technique (ENN). It presents a framework for handling high-dimensional features in customer churn prediction under class imbalance and class overlap, and it evaluates how effective CTGAN-ENN is in both evaluation metrics and algorithm execution time. The scope of this paper is limited to data-level solutions, because this study aims to show that a tabular GAN hybrid sampling method achieves better performance than non-tabular GAN hybrid sampling and other classical hybrid sampling methods on customer churn prediction data. Furthermore, we provide a comparison between CTGAN-ENN and cost-sensitive learning across several machine learning algorithms.

Related works

Customer churn prediction

Machine learning algorithms such as k-nearest neighbor, naïve Bayes, and decision trees have been widely used for customer churn prediction as supervised learning in recent years, but classical machine learning methods often yield unsatisfying performance. Ensemble approaches such as Random Forest, AdaBoost, Gradient Boosting, and XGBoost are used to address the shortcomings of classical machine learning [6]. The main problem in customer churn prediction is imbalanced data. One deep learning approach uses a Deep & Cross Network (DCN) to learn latent features for churn prediction and an Asymmetric Loss Function (ASL) to handle imbalanced data [1]. Based on recent studies, machine learning and deep learning alone have not reached outperforming results for customer churn prediction; the alternative is the data-level solution. This approach works at the data preprocessing stage, using traditional oversampling techniques such as SMOTE and ADASYN, or hybrid sampling techniques that combine SMOTE with under-sampling methods [6, 7]. Traditional sampling methods have difficulty producing representative synthetic data; a deep-learning-based sampling method can tackle this problem. A common choice is the Generative Adversarial Network (GAN), which produces synthetic data with deep learning and represents real data better than traditional oversampling methods [8]. The latest customer churn prediction work uses a GAN-based hybrid sampling method to overcome the class overlap problem that GAN produces; this method outperformed other sampling methods in the AUC, F1-score, and G-mean metrics [2, 9]. The class overlap problem and hybrid sampling methods are discussed in the next two subsections.

Class overlap problem

Class overlap in machine learning makes it challenging to develop a reliable classifier that can differentiate between the classes. Factors contributing to class overlap include similar data, sparse distribution, and noise. Figure 1 shows the oversampling results using CTGAN on all datasets. All datasets in this research have overlap issues, evident from the lack of clear distance between classes and the stacked data points between classes.

Figure 1. Class overlap in customer churn datasets

Hybrid sampling methods

In the last few years there has been research on customer churn prediction and several studies on managing class imbalance. Sampling approaches, further classified as under-sampling, over-sampling, and hybrid sampling, are well-known strategies for addressing class imbalance. Some studies have tried hybrid sampling. Sáez et al. [10] proposed an extension of SMOTE with a noise filter called the Iterative-Partitioning Filter (IPF) to handle not only class imbalance but also noisy data, one of the data distribution problems; their results showed the method performs better than existing techniques. Besides class imbalance, class overlap is another problem in customer churn datasets. Vuttipittayamongkol [11] proposed an NCR-based under-sampling architecture that eliminates potentially overlapping data to address class imbalance in binary datasets; the approach centers on an under-sampling combination, and the experimental results show significant improvements in sensitivity.

Geiler et al. [6] investigated an effective strategy for churn prediction with SMOTE hybrid sampling across several machine learning algorithms, including ensemble methods. They compared SMOTE with random under-sampling, SMOTE-NCR, and SMOTE-Tomek links; in their experiments, SMOTE-NCR performed best on two datasets and SMOTE-Tomek links achieved the best AUC on one dataset. Another SMOTE hybrid method, by Xu et al. [12], combines M-SMOTE with ENN and was evaluated with a random forest algorithm. The main objective of that work is to handle class imbalance in medical data with M-SMOTE, while ENN handles the misclassified data. Compared to other comparable approaches, the outcomes across ten medical datasets were more encouraging, reaching 99.5% on the F1-score metric.

Besides traditional oversampling methods, there is a neural-network-based oversampling method, the GAN (Generative Adversarial Network). Ding et al. [9] proposed RGAN-EL, which trains the GAN to prioritize the class overlap region when fitting the sample distribution. They experimented with 41 imbalanced datasets, compared their framework with several other oversampling methods, and reported promising results for RGAN-EL. The limitation of this work is that it does not use high-dimensional data from complex fields; most of the datasets have fewer than ten dimensions.

Zhu et al. [2] developed a new hybrid sampling method combining a GAN with adaptive neighborhood-based weighted under-sampling (ANWU) on customer classification datasets. The ANWU method handles class overlap by removing generated instances and original majority-class instances. They used KNN, DT, RF, and GBM in their experiments. Compared to existing benchmark approaches, the GAN-based hybrid sampling method performs better on accuracy and profit-based evaluation criteria.

Cost-sensitive learning

Cost-sensitive learning is a paradigm within machine learning that focuses on incorporating the varying costs associated with different types of errors or misclassifications into the learning process. In traditional machine learning algorithms, the objective is typically to minimize overall classification error without considering the potential differences in the consequences or costs of different types of misclassifications.

In cost-sensitive learning, cost is represented as a cost matrix, shown in Fig. 2 [13]. The cost of a false positive is \(C_{10}\) and the cost of a false negative is \(C_{01}\); in a classification problem, adjusting these costs to the objective of the prediction is important. For example, if we want to build a machine learning model that predicts customer churn, avoiding false negatives (customers predicted not to churn who actually churn) is the priority of the classification objective. Therefore, giving \(C_{01}\) a higher cost than the other entries of the matrix can be one algorithm-level solution.

Figure 2. Cost-sensitive learning cost matrix
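As an illustration of how such a cost matrix can be realized in practice, the following minimal sketch uses scikit-learn's class_weight parameter to penalize false negatives more heavily. The dataset is synthetic and the costs of 1 and 10 simply echo the values used later in this paper's experiments; none of this is the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data standing in for a churn dataset (hypothetical).
X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)

# class_weight encodes the cost matrix: misclassifying a churner (class 1)
# costs 10x more than misclassifying a non-churner (class 0).
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0)
clf.fit(X, y)
```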

Machine learning algorithm

We used two types of machine learning to experiment with the classification task in customer churn prediction after applying CTGAN-ENN. The first type is classical machine learning: KNN, DT, and NB. The second type is ensemble machine learning: XGB, RF, GBM, and LGB. These algorithms are explained in this section.

K-nearest neighbor (KNN)

K-Nearest Neighbor is a case-based learning approach that retains all the training data for categorization [14]. KNN is a well-known machine learning method that may be used for both regression and classification. Since the approach is instance-based and non-parametric, it makes no underlying assumptions about how the data are distributed. The idea behind KNN is that similar data points should lead to similar outcomes: for a new data point that must be classified or predicted, you can use the labels (for classification) or values (for regression) of its K nearest data points (neighbors) in the training dataset to make the prediction.
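A minimal sketch of this idea with scikit-learn, on synthetic data standing in for a churn dataset; k = 3 is an arbitrary small choice here, not necessarily the setting used in the paper's experiments.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a churn dataset.
X, y = make_classification(n_samples=500, random_state=0)

# KNN keeps the whole training set; a query point is labelled by majority
# vote among its k nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict(X[:5]))
```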

Decision tree (DT)

A decision tree classifier represents a recursive partition of the instance space. It consists of nodes that divide the instance space into two or more subspaces based on a discrete function over the input attributes. The root node of the tree covers the entire instance space, and the first split occurs there.

The nodes that follow are either leaf nodes, which give the final classification, or internal nodes, which have one predecessor and multiple successors [15]. For classification tasks, any new data point that travels the path to a leaf node is predicted to belong to the majority class in that leaf; for regression tasks, the leaf value is usually the mean or median of the target variable in that node.

Naïve bayes (NB)

The Naïve Bayes algorithm relies on a probabilistic approach to perform classification tasks. It operates under the premise that a feature's presence in a class is independent of the presence of any other feature in the same class [16]. The classifier is founded on Bayes' theorem together with this "naïve" assumption of feature independence, which simplifies the computations but might not hold in all real-world situations.

XGBoost (XGB)

XGBoost uses a weighted quantile sketch and a sparsity-aware algorithm for approximate tree learning. To create a scalable tree-boosting system, XGBoost also exploits cache access patterns and data compression [17]. XGBoost starts by creating a weak learner, a simple model that makes rough predictions about the data, and then builds each new tree to correct the errors of the ensemble so far.

Random forest (RF)

As a kind of bagging technique, random forest constructs several models using various data subsets and then aggregates the forecasts from each model to produce a final prediction. A random forest works by combining a set of decision trees, grown in a number of randomly selected data subspaces, into a prediction ensemble. Every tree in the ensemble is grown by first selecting a random subset of input coordinates (thereafter referred to as features or variables) at each node; based on these features, the training set determines the best split [18].

Gradient boosting machine (GBM)

GBM is an ensemble forward-learning model in which a strong predictor is built up from a sequence of weak ones. This refined decision tree method compares candidate successors using the structure score, gain computations, and ever-finer approximations to produce a set of the most satisfactory tree structures [19]. To achieve a more accurate and complete model, GBM combines the predictions of several weak models, typically decision trees, by gradually training a series of trees, each intended to fix the mistakes of the one before it.

Light gradient-boosting machine (LGB)

Light Gradient-Boosting Machine (LightGBM) is a newer algorithm in the GBDT (Gradient Boosting Decision Tree) family whose purpose is to reduce the feature set by information gain [20]. LightGBM uses a parallel voting decision tree algorithm. It is designed to maximize parallel learning by reducing memory usage, accelerating the training process, and combining sophisticated network communication. The training data is partitioned among several machines, which then carry out local voting to determine the top-k attributes and global voting to determine the top-2k attributes for each iteration [20].
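To make the boosting-versus-bagging distinction in the preceding sections concrete, here is a minimal sketch contrasting a gradient-boosting model with a random forest in scikit-learn; the hyperparameters are illustrative rather than the settings from this paper's experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Boosting: trees are fitted sequentially, each correcting the residual
# errors of the ensemble built so far.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X, y)

# Bagging: trees are fitted independently on random subsets and averaged.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(gbm.score(X, y), rf.score(X, y))
```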

We reviewed several lines of research on hybrid sampling methods. Most recent work has used traditional oversampling techniques like SMOTE [6, 10, 12], which can produce oversampled data with limited diversity. A neural-network-based oversampling technique can be the solution to sampling diversity, and the common choice is the Generative Adversarial Network (GAN). A recent study suggested GAN-based hybrid sampling to get around the GAN's class overlap problem, but the latest GAN-based hybrid sampling work has some limitations. The first is that it does not use a tabular type of GAN for tabular data problems [2, 9]. The second is that it does not measure the execution time of the algorithms, which can be a useful insight for real-world implementation. The remaining gap in the latest GAN-based hybrid sampling studies is therefore the absence of a tabular GAN applied to tabular data cases.

In this research, we propose a research framework built on a GAN-based hybrid sampling scheme: a tabular GAN (CTGAN) to handle tabular data, combined with ENN under-sampling to address class overlap. Our work focuses on customer churn prediction problems and reports execution time in the experiments. We hope this method can be useful in real-world cases for any company that wants to implement churn prediction in its campaign strategies.

Methodology

Data preprocessing using the CTGAN-ENN approach is the first of two phases in our proposed research framework, as seen in Fig. 3. The input imbalanced dataset is divided into minority and majority classes. The minority class is oversampled with CTGAN [21], which generates enough synthetic data to balance it against the majority class. The next step is to concatenate the majority data with the generated data.

Figure 3. Proposed research framework

CTGAN handles several challenges that the traditional GAN model does not. The first is mixed data types, because real-world data mixes discrete and continuous columns. The second is highly imbalanced categorical columns; the customer churn prediction datasets used in this research have high imbalance ratios [21].

Figure 4 shows the CTGAN workflow. Assume the data has two features, \(D_1\) and \(D_2\), and we want to oversample on the \(D_2\) feature with \(D_2 = 1\). All rows with \(D_2 = 1\) are selected as the training data \(T_{train}\). The generator's output is compared with \(T_{train}\) to produce a critic score, which estimates the distance between the learned conditional distribution and the conditional distribution of the real data. This process, called training-by-sampling, is better than drawing random values as in a traditional GAN.

Figure 4. CTGAN original framework
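This oversampling step is available in the ctgan package cited in this paper's experiment settings; the sketch below shows the typical usage pattern for oversampling a minority class. The file path and the column names are hypothetical placeholders, and the epoch count is illustrative.

```python
import pandas as pd
from ctgan import CTGAN

# Hypothetical churn dataframe with a binary 'churn' label column.
df = pd.read_csv("churn.csv")  # placeholder path
minority = df[df["churn"] == 1]

# Columns CTGAN should treat as categorical (hypothetical names).
discrete_columns = ["churn", "contract_type"]

# Fit CTGAN on the minority class, then sample enough synthetic rows
# to balance it against the majority class.
gan = CTGAN(epochs=300)
gan.fit(minority, discrete_columns)
synthetic_minority = gan.sample((df["churn"] == 0).sum() - len(minority))
```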

The ENN technique effectively removes instances and can therefore handle class overlap. By considering the k nearest neighbors belonging to the other class, ENN selectively eliminates data points that disagree with the majority of their neighborhood [11]. The details of the ENN algorithm are presented in Algorithm 1. The input is the concatenation \(D\) of the CTGAN output and the original data, the number of nearest neighbors is \(k = 3\), the majority sample set \(D_{maj}\) depends on the dataset, and the initial under-sampling rate is \(R = 1\).

Algorithm 1. Edited Nearest Neighbor (ENN)
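Algorithm 1's cleaning step corresponds closely to the EditedNearestNeighbours implementation in imbalanced-learn, one of the packages referenced later in this article. A minimal sketch with k = 3, as in the text, on synthetic stand-in data:

```python
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.datasets import make_classification

# Toy overlapping, imbalanced data standing in for the concatenated
# CTGAN output plus original data.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3],
                           flip_y=0.1, random_state=0)

# A sample is dropped when its 3 nearest neighbours disagree with its
# label, thinning the overlap region between classes.
enn = EditedNearestNeighbours(n_neighbors=3)
X_clean, y_clean = enn.fit_resample(X, y)
print(len(y), "->", len(y_clean))
```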

The under-sampling step removes overlapping instances from the concatenated data. The result of this stage is the final customer churn dataset, ready for machine learning prediction. The effectiveness of the CTGAN-ENN method is shown in the experiment result section.

After forming the final datasets, the second phase of our proposed research framework experimented with two types of machine learning on the customer churn datasets. The first type used KNN, Decision Trees, and Naïve Bayes, while the second was an ensemble type using XGBoost, Random Forest, GBM, and LightGBM. To evaluate the models, we used four criteria: AUC-ROC, G-mean, F1-score, and algorithm execution time.

Algorithm 2. Pseudo-code of the proposed framework

Algorithm 2 is the pseudo-code of the proposed framework, which addresses the problem of imbalanced datasets through a strategy that includes data pretreatment, augmentation, training, and evaluation phases. To resolve the underlying class imbalance, the dataset is first split into majority and minority classes. The crucial next step is the creation of artificial minority instances via the CTGAN technique, which mimics the minority class's distribution using generative adversarial networks. This augmentation gives the model a more balanced representation of the data, alleviating the class imbalance problem. After augmentation, the original majority and minority subsets are combined with the augmented minority data to create a consolidated dataset. The approach then uses the Edited Nearest Neighbors (ENN) algorithm to refine the dataset and improve its quality: by removing noisy or borderline instances, ENN enhances the final dataset's discriminative power. This cleaned dataset (called \(finalDataset_m\)) is the basis for the subsequent model training and assessment.

The balanced final dataset is then ready for the ensemble learning (\(ensembleML_m\)) and classical machine learning (\(classicalML_m\)) approaches. This dual approach lets us explore a variety of modeling paradigms to verify the performance of our proposed method. Finally, performance measurements such as Area Under the Curve (AUC), F-measure, and G-mean assess how effective the trained models are: AUC represents overall discriminative power, F-measure reflects the trade-off between precision and recall, and G-mean evaluates the model's ability to manage class imbalance. The last metric is algorithm execution time, which considers computational efficiency and provides insight into the algorithm's scalability and practical applicability.
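Putting the two phases together, a condensed sketch of the framework might look as follows. It assumes a numeric feature matrix with a binary 'churn' label; the file path, epoch count, and the choice of LightGBM as the evaluated model are illustrative assumptions, not the paper's exact configuration.

```python
import pandas as pd
from ctgan import CTGAN
from imblearn.under_sampling import EditedNearestNeighbours
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("churn.csv")  # placeholder path; features assumed numeric
majority, minority = df[df["churn"] == 0], df[df["churn"] == 1]

# Phase 1a: CTGAN oversampling of the minority class.
gan = CTGAN(epochs=300)
gan.fit(minority, discrete_columns=["churn"])
synthetic = gan.sample(len(majority) - len(minority))

# Phase 1b: concatenate everything, then remove overlap with ENN (k = 3).
combined = pd.concat([majority, minority, synthetic], ignore_index=True)
X, y = combined.drop(columns="churn"), combined["churn"].astype(int)
X_final, y_final = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)

# Phase 2: 5-fold cross-validated evaluation (AUC shown here).
scores = cross_val_score(LGBMClassifier(), X_final, y_final,
                         cv=5, scoring="roc_auc")
print("mean AUC:", scores.mean())
```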

Experiment design and result analysis

Data description

This research uses six datasets published on the Kaggle platform: telecommunication 1 (telco 1) [22], bank [23], mobile [24], telecommunication 2 (telco 2) [25], telecommunication 3 (telco 3) [26], and insurance [27]. The imbalance ratios show that all datasets have imbalance problems, ranging from 2.7 up to an extreme 7.5. This condition is one factor that makes classifiers struggle to achieve satisfying results. The datasets are summarized in Table 1 below.

Experiment settings

All experiments in this research were performed using available Python packages. Table 2 shows the techniques, packages, and parameters that we used in the experiment. We compared the results of CTGAN-ENN (CE) with the conventional hybrid sampling techniques SMOTE-ENN (SE) and ADASYN-ENN (AE), and with another GAN-based hybrid sampling technique, WGAN-GP+ENN (WE), to demonstrate the efficacy of our proposed research framework.

We applied a missing-value-filling preprocessing step, replacing null or missing values with mean values computed from the data. Because the mean replaces continuous data without introducing outliers, it yields better results than leaving values missing or null. The experiments in this study use fivefold cross-validation for data training and testing; k-fold cross-validation gives a stable evaluation value because it uses all subsets of the data.
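A brief sketch of these two preprocessing conventions, mean imputation and fivefold cross-validation, using scikit-learn; the data and the injected missing values are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, random_state=0)
X[::17, 0] = np.nan  # inject a few missing values for illustration

# Mean imputation: each missing entry is replaced by its column mean.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Fivefold cross-validation: every sample serves in a test fold exactly once.
for fold, (tr, te) in enumerate(StratifiedKFold(n_splits=5).split(X_filled, y)):
    print(f"fold {fold}: {len(tr)} train / {len(te)} test")
```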

The next stage after preprocessing the customer churn data with the hybrid sampling method is the classification task. Table 3 lists the models, packages, and parameters we employed for both classical and ensemble machine learning.

In the experiment we also provided a cost-sensitive learning approach for the Decision Tree, XGBoost, Random Forest, and Light Gradient-Boosting algorithms. The aim is to compare CTGAN-ENN not only against data-level solutions but also against algorithm-level solutions. Table 4 shows the packages and parameters used, implemented with the scikit-learn library.

Evaluation metrics

Three separate evaluation metrics were used in the experiment, each with a distinct goal. The first metric, the F1 score, is the harmonic mean of precision and recall; it provides an equitable assessment of a model's efficacy by considering both false positives and false negatives.

\(RC\) stands for recall, the ratio of accurately predicted positive observations to the total number of real positives; it assesses the model's capacity to recognize and accurately classify every occurrence of the positive class. \(PR\) stands for precision, the ratio of accurately predicted positive observations to the total number of predicted positives; it gauges how well the model's positive predictions hold up.

The second metric is AUC-ROC: plotting the true positive rate against the false positive rate yields the Receiver Operating Characteristic curve, and AUC denotes the area under it. It summarizes a model's overall performance across different thresholds.

The third is G-mean, which uses the geometric mean of recall and specificity to provide a balanced assessment of a model's performance [2]. The G-mean measure rewards identifying positive examples while preventing false positives, regardless of the distribution between sample classes.

Specificity, represented by \(SP\), gauges how well the model can identify negative cases. Rather than dividing the training and testing data at random, we employed the k-fold cross-validation approach with k = 5.
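The surrounding text defines these metrics in words; in symbols, with \(TP\), \(FP\), \(TN\), and \(FN\) denoting true positives, false positives, true negatives, and false negatives, the standard definitions are:

\[
PR = \frac{TP}{TP + FP}, \qquad RC = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot PR \cdot RC}{PR + RC}
\]

\[
SP = \frac{TN}{TN + FP}, \qquad G\text{-mean} = \sqrt{RC \times SP}
\]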

F1-score result

Tables 5, 6, 7, 8 and 9 give the results on each dataset for the different evaluation metrics; the best value in each experiment is marked in bold. Table 5 shows that CTGAN-ENN (CE) achieved the best F1-score in 21 of 42 scenarios.

WE and SE each ranked first in 6 F1-score scenarios. WE performed well with the Decision Tree and Random Forest algorithms on two datasets, while SE performed well across all machine learning algorithms but only on the telco1 dataset. Our proposed research framework outperformed across machine learning algorithms, especially KNN, GBM, XGB, and LGB.

AUC-ROC result

Table 6 shows that CE obtained the best AUC in 29 of 42 scenarios, outperforming the other sampling methods on every dataset except with the KNN and NB algorithms. Table 7 confirms that our proposed research framework also performs well on the G-mean metric, reaching the best performance in 29 of 42 scenarios. Interestingly, WE performed consistently well with the DT algorithm across all evaluation metrics.

We also found that CE worked well with ensemble machine learning in all scenarios; our technique performed exceptionally well with GBM, XGB, RF, and LGB. The mean ranking over all scenarios, discussed below, shows that CE achieved the best mean ranking among the ensemble machine learning algorithms.

G-Mean result

The machine learning algorithms divide into two types: KNN, DT, and NB are classical models, while GBM, XGB, RF, and LGB are ensemble models. The experimental results show the effectiveness of our proposed research framework. For the ensemble models, the combination with ENN (CE) gave only a small improvement over the original CTGAN, but for the classical models KNN, DT, and NB, the improvement was more significant. This is because ensemble models have robust classification ability: even without the sampling method, they achieve satisfying results.

Accuracy of minority class

Minority-class accuracy refers to the accuracy metric calculated specifically for the minority class; in this study all datasets have the same minority class (churn). Because the minority class is the positive class, minority-class accuracy was measured with the recall metric, as defined above.

Table 8 shows the minority-class accuracy results as percentages. The comparison sheds light on the efficacy of CTGAN-ENN in addressing the challenges of imbalanced datasets: CTGAN-ENN performed best in 19 of 42 scenarios, with SE second best at 10 of 42.

CTGAN-ENN and algorithm-level comparison

CTGAN-ENN achieved better results in all algorithm-level experiments, as shown in Table 9. We used cost-sensitive (CS) versions of the Decision Tree (DT), Random Forest (RF), Light Gradient-Boosting Machine (LGB), and XGBoost (XGB). The costs were 1 for class 0 (not churn) and 10 for class 1 (churn); we gave the minority class more cost to adjust the robustness of the algorithms. In our experiments, CTGAN-ENN (CE) consistently outperformed cost-sensitive learning in the AUC, F1-score, and G-mean metrics.

Mean ranking score, class overlap degree, and time execution result

As displayed in Table 10, CTGAN-ENN achieved a lower Fisher's discriminant ratio: the datasets' degree of class overlap is considerably reduced by CTGAN-ENN.
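For reference, one classical per-feature formulation of Fisher's discriminant ratio is sketched below. Conventions differ across works (some report a transformed value so that its orientation matches the reading in Table 10), so treat this as an illustration rather than the paper's exact computation.

```python
import numpy as np

def max_fisher_ratio(X, y):
    """Maximum per-feature Fisher's discriminant ratio for binary labels.

    For each feature: (mean_0 - mean_1)^2 / (var_0 + var_1); the maximum
    over features is a classical measure of class separability.
    """
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12  # guard against zero variance
    return float(np.max(num / den))
```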

Table 11 provides the average rank across all experiment scenarios; a lower mean ranking score indicates better results. CE achieved the best average rank in 14 of 21 scenarios. WE gained five best average ranks, outperforming with the DT classifier, while CE outperformed with almost all ensemble machine learning models.

We measured algorithm time consumption after data preprocessing with CTGAN and with CTGAN-ENN. CTGAN-ENN reduced the time consumption in all machine learning algorithms except NB. Although the improvement is small in several cases, CTGAN-ENN consistently has less execution time than CTGAN. On average, execution time is reduced by 38.47%, indicating that CTGAN-ENN can work effectively on large customer churn data. The detailed results of these comparisons are provided in Appendix 1.

Figure 5 visualizes all datasets using the t-SNE technique. We can see the difference between the middle and right panels: almost all stacked data points between classes were removed by ENN, producing a clear margin between classes. Across all datasets, CTGAN-ENN made the data easier for machine learning algorithms to learn, which led to the outperforming results compared to other sampling methods.

Figure 5. Visualization of non-sampling (left), CTGAN (middle), and CTGAN-ENN (right) for all datasets
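A minimal sketch of how such panels can be produced with scikit-learn's t-SNE; the synthetic data here merely stands in for one of the churn datasets.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)

# Project the feature space to 2-D and colour points by class to make
# the overlap region visible, as in Figure 5.
emb = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1], c=y, s=8)
plt.title("t-SNE projection (illustrative)")
plt.show()
```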

Finally, across all the experimental measurements, our proposed research framework, CTGAN-ENN, outperforms other over-sampling and hybrid sampling methods. The experimental results show CTGAN-ENN working very well with KNN and with the ensemble models GBM, XGB, and RF. If a company wants a fast yet accurate customer churn prediction model, KNN is the optimal choice. Using our proposed framework, a company can build a churn model with CTGAN-ENN and choose the machine learning algorithm according to its objective.

Compared with the latest work [2, 9], our work offers a new hybrid sampling perspective using a tabular GAN. The strong point of a tabular GAN is that its generative phase is built for tabular data, whereas other GANs were originally built for image data.

Summary and analysis of results

According to our experiments, classical generative data algorithms like SMOTE and ADASYN only work well on one dataset. This happens because SMOTE and ADASYN generate data from clusters of the original datasets, so the generated data may not be diverse enough for machine learning training. On some datasets SMOTE and ADASYN work because the original data itself has a good distribution.

CTGAN-ENN as presented here still has some limitations. The customer churn datasets used have only two classes (churn and not churn), whereas real-world customer classification may be a multiclass problem. GAN variants are widely developed nowadays, and beyond classification, GAN techniques can be used in other customer machine learning tasks, for example predicting product demand by day or a customer's transaction count by period. Our work does not yet cover all the possibilities for using GANs and their hybrid methods.

Our proposed research framework with the XGBoost algorithm achieved better results than the latest work on the Telco 1 and Insurance datasets [1]: an F1-score of 0.949 on Telco 1 and 0.981 on Insurance, versus 0.635 and 0.623 respectively. The algorithm-level approach, cost-sensitive learning with the DT, RF, LGB, and XGB algorithms, was also compared with CTGAN-ENN, and CTGAN-ENN consistently achieved better performance in all experiments.

We also compared against the other latest work [2]. Because the datasets differ, we matched datasets by characteristics: the latest work used the IBM dataset with a 2.76 imbalance ratio, and our work used Telco 1 with a 2.7 imbalance ratio. Our work showed better performance in both the AUC and G-mean metrics. The latest work achieved an AUC of 0.832 with GBM, while ours achieved 0.991; on G-mean, the latest work achieved 0.743 with the Random Forest algorithm, while ours achieved 0.955.

Class imbalance and class overlap are common problems in customer churn prediction, and a GAN-based hybrid sampling method was recently proposed to handle them. To obtain a better result, we proposed a tabular GAN-based hybrid sampling method, CTGAN-ENN, which combines tabular GAN-based oversampling with Edited Nearest Neighbor (ENN) under-sampling.

Our primary takeaway from the experiments is that CTGAN-ENN enhanced customer churn prediction performance. The KNN algorithm achieved the best results in algorithm time consumption; if time consumption is disregarded, GBM and XGB with CTGAN-ENN preprocessing can be the most accurate models for predicting customer churn. We also found that the DT algorithm performed well with WGAN-GP+ENN. Compared to recent studies using identical datasets and datasets with comparable properties, our proposed research framework produced better results.

Our findings have practical implications for stakeholders who want to build customer churn prediction on their company data. In the real world, customer data grows rapidly over time, and the results of this study offer insight for big data settings. By choosing the right combination of CTGAN-ENN and machine learning algorithm, stakeholders can match the model to their resources.

This study's limitation is its focus on data-level solutions and binary classification problems; future work could investigate modifications of our methodology for multi-class classification in other customer classification problems. Our results are also weaker for classical machine learning, an issue that might be handled with algorithm-level solutions; our algorithm-level experiment used only simple cost-sensitive learning without detailed analysis. In future work, we may extend our framework with a more detailed cost-sensitive learning analysis as the algorithm-level solution.

Availability of data and materials

Code and datasets can be accessed on GitHub: https://github.com/mahayasa/gan-hybrid-sampling-customer-churn . No datasets were generated or analysed during the current study.

Abbreviations

ADASYN: Adaptive Synthetic Sampling
AE: ADASYN + Edited Nearest Neighbor
AUC: Area Under Curve
CE: CTGAN + Edited Nearest Neighbor
CS: Cost-sensitive
DT: Decision tree
ENN: Edited Nearest Neighbor
KNN: K-nearest neighbor
LGB: Light Gradient-Boosting Machine
NB: Naïve Bayes
RF: Random forest
GAN: Generative Adversarial Network
GBM: Gradient boosting machine
SE: SMOTE + Edited Nearest Neighbor
SMOTE: Synthetic Minority Over-sampling Technique
WE: WGAN-GP + Edited Nearest Neighbor

References

1. Wen X, Wang Y, Ji X, Traoré MK. Three-stage churn management framework based on DCN with asymmetric loss. Expert Syst Appl. 2022;207:117998. https://doi.org/10.1016/j.eswa.2022.117998.
2. Zhu B, Pan X, Vanden Broucke S, Xiao J. A GAN-based hybrid sampling method for imbalanced customer classification. Inf Sci. 2022;609:1397–411. https://doi.org/10.1016/j.ins.2022.07.145.
3. Das S, Mullick SS, Zelinka I. On supervised class-imbalanced learning: an updated perspective and some key challenges. IEEE Trans Artif Intell. 2022;3(6):973–93. https://doi.org/10.1109/TAI.2022.3160658.
4. Goodfellow IJ, et al. Generative adversarial networks. 2014. http://arxiv.org/abs/1406.2661.
5. Huyen C. Designing machine learning systems. Sebastopol: O'Reilly Media; 2022.
6. Geiler L, Affeldt S, Nadif M. An effective strategy for churn prediction and customer profiling. Data Knowl Eng. 2022. https://doi.org/10.1016/j.datak.2022.102100.
7. Wu S, Yau W-C, Ong T-S, Chong S-C. Integrated churn prediction and customer segmentation framework for telco business. IEEE Access. 2021;9:62118–36. https://doi.org/10.1109/ACCESS.2021.3073776.
8. Su C, Wei L, Xie X. Churn prediction in telecommunications industry based on conditional Wasserstein GAN. In: 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC); 2022. p. 186–191. https://doi.org/10.1109/HiPC56025.2022.00034.
9. Ding H, Sun Y, Wang Z, Huang N, Shen Z, Cui X. RGAN-EL: a GAN and ensemble learning-based hybrid approach for imbalanced data classification. Inf Process Manag. 2023;60(2):103235. https://doi.org/10.1016/j.ipm.2022.103235.
10. Sáez JA, Luengo J, Stefanowski J, Herrera F. SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci. 2015;291:184–203. https://doi.org/10.1016/j.ins.2014.08.051.
11. Vuttipittayamongkol P, Elyan E. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci. 2020;509:47–70. https://doi.org/10.1016/j.ins.2019.08.062.
12. Xu Z, Shen D, Nie T, Kou Y. A hybrid sampling algorithm combining M-SMOTE and ENN based on random forest for medical imbalanced data. J Biomed Inform. 2020;107:103465. https://doi.org/10.1016/j.jbi.2020.103465.
13. Elkan C. The foundations of cost-sensitive learning.
14. Guo G, Wang H, Bell D, Bi Y, Greer K. KNN model-based approach in classification. LNCS 2888. Berlin: Springer; 2003.
15. Altuve M, Alvarez AJ, Severeyn E. Multiclass classification of metabolic conditions using fasting plasma levels of glucose and insulin. Health Technol (Berl). 2021;11(4):953–62. https://doi.org/10.1007/s12553-021-00550-w.
16. Kumari S, Kumar D, Mittal M. An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier. Int J Cogn Comput Eng. 2021;2:40–6. https://doi.org/10.1016/j.ijcce.2021.01.001.
17. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. 2016. https://doi.org/10.1145/2939672.2939785.
18. Biau G. Analysis of a random forests model. 2012.
19. Shrivastav LK, Jha SK. A gradient boosting machine learning approach in modeling the impact of temperature and humidity on the transmission rate of COVID-19 in India. Appl Intell. 2021;51(5):2727–39. https://doi.org/10.1007/s10489-020-01997-6.
20. Ke G, et al. LightGBM: a highly efficient gradient boosting decision tree. https://github.com/Microsoft/LightGBM. Accessed 17 Mar 2023.
21. Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling tabular data using conditional GAN. 2019. http://arxiv.org/abs/1907.00503. Accessed 8 May 2023.
22. Telco Customer Churn | Kaggle. https://www.kaggle.com/datasets/blastchar/telco-customer-churn. Accessed 07 Jun 2023.
23. Churn Modelling | Kaggle. https://www.kaggle.com/datasets/shrutimechlearn/churn-modelling. Accessed 07 Jun 2023.
24. mobile-churn-data.xlsx | Kaggle. https://www.kaggle.com/datasets/dimitaryanev/mobilechurndataxlsx. Accessed 07 Jun 2023.
25. Customer Churn Prediction 2020 | Kaggle. https://www.kaggle.com/competitions/customer-churn-prediction-2020. Accessed 07 Jun 2023.
26. Customer Churn | Kaggle. https://www.kaggle.com/datasets/royjafari/customer-churn. Accessed 18 Mar 2024.
27. Vinod Kumar. Insurance churn prediction: weekend hackathon. https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon. Accessed 15 Mar 2023.
28. SMOTE — imbalanced-learn 0.10.1 documentation. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html. Accessed 08 Jun 2023.
29. Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res. 2017;18(17):1–5.
30. ydata-synthetic: Python package for synthetic data generation for tabular and time-series data. https://docs.synthetic.ydata.ai/1.3/. Accessed 04 Jul 2023.
31. ctgan · PyPI. https://pypi.org/project/ctgan/. Accessed 08 Jun 2023.
32. EditedNearestNeighbours — imbalanced-learn 0.10.1 documentation. https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.EditedNearestNeighbours.html. Accessed 08 Jun 2023.
33. SMOTEENN — imbalanced-learn 0.10.1 documentation. https://imbalanced-learn.org/stable/references/generated/imblearn.combine.SMOTEENN.html. Accessed 08 Jun 2023.
34. Pedregosa F, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(85):2825–30.
35. XGBoost documentation — xgboost 2.0.3. https://xgboost.readthedocs.io/en/stable/. Accessed 19 Mar 2024.
36. sklearn.ensemble.RandomForestClassifier — scikit-learn 1.4.1 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. Accessed 12 Mar 2024.
37. lightgbm.LGBMClassifier — LightGBM 4.3.0.99 documentation. https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html. Accessed 19 Mar 2024.


Acknowledgements

This work was supported by a Khon Kaen University ASEAN GMS grant and is part of the AIDA (Applied Intelligence and Data Analytics) lab in the College of Computing, Khon Kaen University, Thailand.

Funding: Khon Kaen University.

Author information

Authors and affiliations

College of Computing, Khon Kaen University, Khon Kaen, Thailand

I Nyoman Mahayasa Adiputra & Paweena Wanchai


Contributions

I Nyoman Mahayasa Adiputra and Paweena Wanchai conceived of the presented idea. Both authors contributed to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript. I Nyoman Mahayasa Adiputra carried out the experiments. Paweena Wanchai supervised the project, provided critical feedback, and helped shape the research and analysis. Both authors discussed the results and contributed to the final version of the manuscript.

Corresponding author

Correspondence to Paweena Wanchai.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: CTGAN and CTGAN-ENN algorithm time execution (s)

| Algorithm | Dataset | CTGAN | CTGAN-ENN |
|---|---|---|---|
| KNN | Bank | 2.10 | |
| KNN | Mobile | 336.54 | |
| KNN | Telco1 | 8.58 | |
| KNN | Telco2 | 1.72 | |
| KNN | Telco3 | 3.22 | |
| KNN | Insurance | 81.91 | |
| DT | Bank | 1.78 | |
| DT | Mobile | 79.83 | |
| DT | Telco1 | 1.38 | |
| DT | Telco2 | 2.07 | |
| DT | Telco3 | 0.79 | |
| DT | Insurance | 7.40 | |
| NB | Bank | 0.30 | 0.30 |
| NB | Mobile | 4.69 | |
| NB | Telco1 | 0.66 | |
| NB | Telco2 | 0.18 | |
| NB | Telco3 | 0.23 | |
| NB | Insurance | 1.40 | |
| GBM | Bank | 43.70 | |
| GBM | Mobile | 1820 | |
| GBM | Telco1 | 29.40 | |
| GBM | Telco2 | 51.26 | |
| GBM | Telco3 | 27.42 | |
| GBM | Insurance | 247.61 | |
| XGB | Bank | 27.56 | |
| XGB | Mobile | 458.07 | |
| XGB | Telco1 | 27.20 | |
| XGB | Telco2 | 27.25 | |
| XGB | Telco3 | 2.72 | |
| XGB | Insurance | 158.94 | |
| RF | Bank | 41.12 | |
| RF | Mobile | 605.53 | |
| RF | Telco1 | 23.90 | |
| RF | Telco2 | 37.72 | |
| RF | Telco3 | 12.18 | |
| RF | Insurance | 141.45 | |
| LGB | Bank | 6.79 | |
| LGB | Mobile | 58.51 | |
| LGB | Telco1 | 7.48 | |
| LGB | Telco2 | 6.12 | |
| LGB | Telco3 | 11.03 | |
| LGB | Insurance | 17.52 | |

  • Bold values represent the algorithm execution time for customer churn prediction. Although the improvement is not very significant in some cases compared to CTGAN, CTGAN-ENN has a smaller execution time, which indicates the reliability of CTGAN-ENN for working on large-scale data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article:

Adiputra, I.N.M., Wanchai, P. CTGAN-ENN: a tabular GAN-based hybrid sampling method for imbalanced and overlapped data in customer churn prediction. J Big Data 11 , 121 (2024). https://doi.org/10.1186/s40537-024-00982-x


Received : 21 November 2023

Accepted : 16 August 2024

Published : 02 September 2024

DOI : https://doi.org/10.1186/s40537-024-00982-x


Keywords

  • Hybrid sampling method
  • Over-sampling
  • Under-sampling
  • Machine learning


Data Collection – Methods Types and Examples


Data Collection

Definition:

Data collection is the process of gathering and collecting information from various sources to analyze and make informed decisions based on the data collected. This can involve various methods, such as surveys, interviews, experiments, and observation.

In order for data collection to be effective, it is important to have a clear understanding of what data is needed and what the purpose of the data collection is. This can involve identifying the population or sample being studied, determining the variables to be measured, and selecting appropriate methods for collecting and recording data.

Types of Data Collection

Types of Data Collection are as follows:

Primary Data Collection

Primary data collection is the process of gathering original and firsthand information directly from the source or target population. This type of data collection involves collecting data that has not been previously gathered, recorded, or published. Primary data can be collected through various methods such as surveys, interviews, observations, experiments, and focus groups. The data collected is usually specific to the research question or objective and can provide valuable insights that cannot be obtained from secondary data sources. Primary data collection is often used in market research, social research, and scientific research.

Secondary Data Collection

Secondary data collection is the process of gathering information from existing sources that have already been collected and analyzed by someone else, rather than conducting new research to collect primary data. Secondary data can be collected from various sources, such as published reports, books, journals, newspapers, websites, government publications, and other documents.

Qualitative Data Collection

Qualitative data collection is used to gather non-numerical data such as opinions, experiences, perceptions, and feelings, through techniques such as interviews, focus groups, observations, and document analysis. It seeks to understand the deeper meaning and context of a phenomenon or situation and is often used in social sciences, psychology, and humanities. Qualitative data collection methods allow for a more in-depth and holistic exploration of research questions and can provide rich and nuanced insights into human behavior and experiences.

Quantitative Data Collection

Quantitative data collection is used to gather numerical data that can be analyzed using statistical methods. This data is typically collected through surveys, experiments, and other structured data collection methods. Quantitative data collection seeks to quantify and measure variables, such as behaviors, attitudes, and opinions, in a systematic and objective way. This data is often used to test hypotheses, identify patterns, and establish correlations between variables. Quantitative data collection methods allow for precise measurement and generalization of findings to a larger population. It is commonly used in fields such as economics, psychology, and natural sciences.

Data Collection Methods

Data Collection Methods are as follows:

Surveys

Surveys involve asking questions to a sample of individuals or organizations to collect data. Surveys can be conducted in person, over the phone, or online.

Interviews

Interviews involve a one-on-one conversation between the interviewer and the respondent. Interviews can be structured or unstructured and can be conducted in person or over the phone.

Focus Groups

Focus groups are group discussions that are moderated by a facilitator. Focus groups are used to collect qualitative data on a specific topic.

Observation

Observation involves watching and recording the behavior of people, objects, or events in their natural setting. Observation can be done overtly or covertly, depending on the research question.

Experiments

Experiments involve manipulating one or more variables and observing the effect on another variable. Experiments are commonly used in scientific research.

Case Studies

Case studies involve in-depth analysis of a single individual, organization, or event. Case studies are used to gain detailed information about a specific phenomenon.

Secondary Data Analysis

Secondary data analysis involves using existing data that was collected for another purpose. Secondary data can come from various sources, such as government agencies, academic institutions, or private companies.

How to Collect Data

The following are some steps to consider when collecting data:

  • Define the objective: Before you start collecting data, you need to define the objective of the study. This will help you determine what data you need to collect and how to collect it.
  • Identify the data sources: Identify the sources of data that will help you achieve your objective. These sources can be primary sources, such as surveys, interviews, and observations, or secondary sources, such as books, articles, and databases.
  • Determine the data collection method: Once you have identified the data sources, you need to determine the data collection method. This could be through online surveys, phone interviews, or face-to-face meetings.
  • Develop a data collection plan: Develop a plan that outlines the steps you will take to collect the data. This plan should include the timeline, the tools and equipment needed, and the personnel involved.
  • Test the data collection process: Before you start collecting data, test the data collection process to ensure that it is effective and efficient.
  • Collect the data: Collect the data according to the plan you developed. Make sure you record the data accurately and consistently.
  • Analyze the data: Once you have collected the data, analyze it to draw conclusions and make recommendations.
  • Report the findings: Report the findings of your data analysis to the relevant stakeholders. This could be in the form of a report, a presentation, or a publication.
  • Monitor and evaluate the data collection process: After the data collection process is complete, monitor and evaluate the process to identify areas for improvement in future data collection efforts.
  • Ensure data quality: Ensure that the collected data is of high quality and free from errors. This can be achieved by validating the data for accuracy, completeness, and consistency.
  • Maintain data security: Ensure that the collected data is secure and protected from unauthorized access or disclosure. This can be achieved by implementing data security protocols and using secure storage and transmission methods.
  • Follow ethical considerations: Follow ethical considerations when collecting data, such as obtaining informed consent from participants, protecting their privacy and confidentiality, and ensuring that the research does not cause harm to participants.
  • Use appropriate data analysis methods: Use appropriate data analysis methods based on the type of data collected and the research objectives. This could include statistical analysis, qualitative analysis, or a combination of both.
  • Record and store data properly: Record and store the collected data properly, in a structured and organized format. This will make it easier to retrieve and use the data in future research or analysis.
  • Collaborate with other stakeholders: Collaborate with other stakeholders, such as colleagues, experts, or community members, to ensure that the data collected is relevant and useful for the intended purpose.

Applications of Data Collection

Data collection methods are widely used in different fields, including social sciences, healthcare, business, education, and more. Here are some examples of how data collection methods are used in different fields:

  • Social sciences: Social scientists often use surveys, questionnaires, and interviews to collect data from individuals or groups. They may also use observation to collect data on social behaviors and interactions. This data is often used to study topics such as human behavior, attitudes, and beliefs.
  • Healthcare: Data collection methods are used in healthcare to monitor patient health and track treatment outcomes. Electronic health records and medical charts are commonly used to collect data on patients' medical history, diagnoses, and treatments. Researchers may also use clinical trials and surveys to collect data on the effectiveness of different treatments.
  • Business: Businesses use data collection methods to gather information on consumer behavior, market trends, and competitor activity. They may collect data through customer surveys, sales reports, and market research studies. This data is used to inform business decisions, develop marketing strategies, and improve products and services.
  • Education: In education, data collection methods are used to assess student performance and measure the effectiveness of teaching methods. Standardized tests, quizzes, and exams are commonly used to collect data on student learning outcomes. Teachers may also use classroom observation and student feedback to gather data on teaching effectiveness.
  • Agriculture: Farmers use data collection methods to monitor crop growth and health. Sensors and remote sensing technology can be used to collect data on soil moisture, temperature, and nutrient levels. This data is used to optimize crop yields and minimize waste.
  • Environmental sciences: Environmental scientists use data collection methods to monitor air and water quality, track climate patterns, and measure the impact of human activity on the environment. They may use sensors, satellite imagery, and laboratory analysis to collect data on environmental factors.
  • Transportation: Transportation companies use data collection methods to track vehicle performance, optimize routes, and improve safety. GPS systems, on-board sensors, and other tracking technologies are used to collect data on vehicle speed, fuel consumption, and driver behavior.

Examples of Data Collection

Examples of Data Collection are as follows:

  • Traffic Monitoring: Cities collect real-time data on traffic patterns and congestion through sensors on roads and cameras at intersections. This information can be used to optimize traffic flow and improve safety.
  • Social Media Monitoring: Companies can collect real-time data on social media platforms such as Twitter and Facebook to monitor their brand reputation, track customer sentiment, and respond to customer inquiries and complaints in real-time.
  • Weather Monitoring: Weather agencies collect real-time data on temperature, humidity, air pressure, and precipitation through weather stations and satellites. This information is used to provide accurate weather forecasts and warnings.
  • Stock Market Monitoring: Financial institutions collect real-time data on stock prices, trading volumes, and other market indicators to make informed investment decisions and respond to market fluctuations in real-time.
  • Health Monitoring: Medical devices such as wearable fitness trackers and smartwatches can collect real-time data on a person’s heart rate, blood pressure, and other vital signs. This information can be used to monitor health conditions and detect early warning signs of health issues (a generic version of this collection loop is sketched below).
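
All of the examples above follow the same basic pattern: poll a source at an interval, timestamp each reading, and smooth or aggregate the stream before acting on it. Here is a minimal Python sketch of that pattern, using a simulated sensor as a stand-in for a real device or API:

    import random
    import time
    from statistics import mean

    def read_sensor() -> float:
        # Stand-in for a real device or API call; returns a simulated reading.
        return 60 + random.uniform(-5, 5)

    # Collect timestamped readings at a fixed polling interval.
    readings = []
    for _ in range(10):
        readings.append({"timestamp": time.time(), "value": read_sensor()})
        time.sleep(0.1)  # real systems tune this interval to the data source

    # A rolling average smooths noisy real-time data before it is acted on.
    window = [r["value"] for r in readings[-5:]]
    print(f"rolling average of last 5 readings: {mean(window):.1f}")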

Purpose of Data Collection

The purpose of data collection can vary depending on the context and goals of the study, but generally, it serves to:

  • Provide information: Data collection provides information about a particular phenomenon or behavior that can be used to better understand it.
  • Measure progress: Data collection can be used to measure the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Support decision-making: Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions.
  • Identify trends: Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Monitor and evaluate: Data collection can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.

When to use Data Collection

Data collection is used whenever there is a need to gather information on a specific topic or phenomenon. It is typically employed in research, evaluation, and monitoring, and it is essential for making informed decisions and improving outcomes.

Data collection is particularly useful in the following scenarios:

  • Research: When conducting research, data collection is used to gather information on variables of interest to answer research questions and test hypotheses.
  • Evaluation: Data collection is used in program evaluation to assess the effectiveness of programs or interventions, and to identify areas for improvement.
  • Monitoring: Data collection is used in monitoring to track progress towards achieving goals or targets, and to identify any areas that require attention.
  • Decision-making: Data collection is used to provide decision-makers with information that can be used to inform policies, strategies, and actions.
  • Quality improvement: Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Characteristics of Data Collection

Several key characteristics help to ensure the quality and accuracy of the data gathered. These include:

  • Validity: The accuracy and relevance of the data collected in relation to the research question or objective.
  • Reliability: The consistency and stability of the data collection process, ensuring that the results obtained are consistent over time and across different contexts (a simple reliability check is sketched after this list).
  • Objectivity: The impartiality of the data collection process, ensuring that the data collected is not influenced by the biases or personal opinions of the data collector.
  • Precision: The degree of accuracy and detail in the data collected, ensuring that the data is specific and accurate enough to answer the research question or objective.
  • Timeliness: The efficiency and speed with which the data is collected, ensuring that the data is collected in time to meet the needs of the research or evaluation.
  • Ethical considerations: The ethical principles that must be followed when collecting data, such as ensuring confidentiality and obtaining informed consent from participants.
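
As one concrete illustration of reliability, the following Python sketch performs a simple test-retest check: the same instrument is administered to the same participants twice, and the two sets of scores are correlated. The scores here are invented for illustration, and a real study would choose a reliability coefficient suited to its design:

    import statistics

    # Hypothetical scores from the same six participants on two occasions.
    test_scores = [12, 15, 11, 18, 14, 16]
    retest_scores = [13, 14, 11, 17, 15, 16]

    # Test-retest reliability estimated as the Pearson correlation between
    # administrations (statistics.correlation requires Python 3.10+).
    r = statistics.correlation(test_scores, retest_scores)
    print(f"test-retest correlation: {r:.2f}")  # values near 1.0 suggest consistency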

Advantages of Data Collection

There are several advantages of data collection that make it an important process in research, evaluation, and monitoring. These advantages include:

  • Better decision-making: Data collection provides decision-makers with evidence-based information that can be used to inform policies, strategies, and actions.
  • Improved understanding: Data collection helps to improve our understanding of a particular phenomenon or behavior by providing empirical evidence that can be analyzed and interpreted.
  • Evaluation of interventions: Data collection is essential in evaluating the effectiveness of interventions or programs designed to address a particular issue or problem.
  • Identifying trends and patterns: Data collection can help identify trends and patterns over time that may indicate changes in behaviors or outcomes.
  • Increased accountability: Data collection increases accountability by providing evidence that can be used to monitor and evaluate the implementation and impact of policies, programs, and initiatives.
  • Validation of theories: Data collection can be used to test hypotheses and validate theories, leading to a better understanding of the phenomenon being studied.
  • Improved quality: Data collection is used in quality improvement efforts to identify areas where improvements can be made and to measure progress towards achieving goals.

Limitations of Data Collection

While data collection has several advantages, it also has some limitations that must be considered. These limitations include:

  • Bias: Data collection can be influenced by the biases and personal opinions of the data collector, which can lead to inaccurate or misleading results.
  • Sampling bias: The data collected may not be representative of the entire population, resulting in sampling bias and inaccurate results.
  • Cost: Data collection can be expensive and time-consuming, particularly for large-scale studies.
  • Limited scope: Data collection is limited to the variables being measured, which may not capture the entire picture or context of the phenomenon being studied.
  • Ethical considerations: Data collection must follow ethical principles to protect the rights and confidentiality of the participants, which can limit the type of data that can be collected.
  • Data quality issues: Data collection may produce missing or incomplete data, measurement errors, and inconsistencies (a basic completeness check is sketched after this list).
  • Limited generalizability: Findings based on the collected data may not transfer to other contexts or populations.
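
One practical way to catch the data quality issues noted above is an automated completeness check run before analysis. The following Python sketch counts missing values per field in a batch of records; the record structure is a hypothetical example:

    from collections import Counter

    # Hypothetical collected records; None marks a missing value.
    records = [
        {"age": 29, "score": 7},
        {"age": None, "score": 9},
        {"age": 41, "score": None},
    ]

    # Count missing values per field so incomplete data is flagged
    # before it silently distorts the analysis.
    missing = Counter()
    for record in records:
        for field, value in record.items():
            if value is None:
                missing[field] += 1

    for field, count in missing.items():
        print(f"{field}: {count} missing of {len(records)} records")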

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer



Acupuncture: Effectiveness and Safety


What is acupuncture?

Acupuncture is a technique in which practitioners insert fine needles into the skin to treat health problems. The needles may be manipulated manually or stimulated with small electrical currents (electroacupuncture). Acupuncture has been in use in some form for at least 2,500 years. It originated in traditional Chinese medicine but has gained popularity worldwide since the 1970s.

How widely is acupuncture used?

According to the World Health Organization, acupuncture is used in 103 of 129 countries that reported data.

In the United States, data from the National Health Interview Survey show that the use of acupuncture by U.S. adults more than doubled between 2002 and 2022. In 2002, 1.0 percent of U.S. adults used acupuncture; in 2022, 2.2 percent used it. 

What is acupuncture used for?

National survey data indicate that in the United States, acupuncture is most commonly used for pain, such as back, joint, or neck pain.

How does acupuncture work scientifically?

How acupuncture works is not fully understood. However, there’s evidence that acupuncture may have effects on the nervous system, effects on other body tissues, and nonspecific (placebo) effects. 

  • Studies in animals and people, including studies that used imaging methods to see what’s happening in the brain, have shown that acupuncture may affect nervous system function.
  • Acupuncture may have direct effects on the tissues where the needles are inserted. This type of effect has been seen in connective tissue.
  • Acupuncture has nonspecific effects (effects due to incidental aspects of a treatment rather than its main mechanism of action). Nonspecific effects may be due to the patient’s belief in the treatment, the relationship between the practitioner and the patient, or other factors not directly caused by the insertion of needles. In many studies, the benefit of acupuncture has been greater when it was compared with no treatment than when it was compared with sham (simulated or fake) acupuncture procedures, such as the use of a device that pokes the skin but does not penetrate it. These findings suggest that nonspecific effects contribute to the beneficial effect of acupuncture on pain or other symptoms. 
  • In recent research, a nonspecific effect was demonstrated in a unique way: Patients who had experienced pain relief during a previous acupuncture session were shown a video of that session and asked to imagine the treatment happening again. This video-guided imagery technique had a significant pain-relieving effect.

What does research show about the effectiveness of acupuncture for pain?

Research has shown that acupuncture may be helpful for several pain conditions, including back or neck pain, knee pain associated with osteoarthritis, and postoperative pain. It may also help relieve joint pain associated with the use of aromatase inhibitors, which are drugs used in people with breast cancer. 

An analysis of data from 20 studies (6,376 participants) of people with painful conditions (back pain, osteoarthritis, neck pain, or headaches) showed that the beneficial effects of acupuncture continued for a year after the end of treatment for all conditions except neck pain.

Back or Neck Pain

  • In a 2018 review, data from 12 studies (8,003 participants) showed acupuncture was more effective than no treatment for back or neck pain, and data from 10 studies (1,963 participants) showed acupuncture was more effective than sham acupuncture. The difference between acupuncture and no treatment was greater than the difference between acupuncture and sham acupuncture. The pain-relieving effect of acupuncture was comparable to that of nonsteroidal anti-inflammatory drugs (NSAIDs).
  • A 2017 clinical practice guideline from the American College of Physicians included acupuncture among the nondrug options recommended as first-line treatment for chronic low-back pain. Acupuncture is also one of the treatment options recommended for acute low-back pain. The evidence favoring acupuncture for acute low-back pain was judged to be of low quality, and the evidence for chronic low-back pain was judged to be of moderate quality.

For more information, see the NCCIH webpage on low-back pain.

Osteoarthritis

  • In a 2018 review, data from 10 studies (2,413 participants) showed acupuncture was more effective than no treatment for osteoarthritis pain, and data from 9 studies (2,376 participants) showed acupuncture was more effective than sham acupuncture. The difference between acupuncture and no treatment was greater than the difference between acupuncture and sham acupuncture. Most of the participants in these studies had knee osteoarthritis, but some had hip osteoarthritis. The pain-relieving effect of acupuncture was comparable to that of NSAIDs.
  • A 2018 review evaluated 6 studies (413 participants) of acupuncture for hip osteoarthritis. Two of the studies compared acupuncture with sham acupuncture and found little or no difference between them in terms of effects on pain. The other four studies compared acupuncture with a variety of other treatments and could not easily be compared with one another. However, one of the trials indicated that the addition of acupuncture to routine care by a physician may improve pain and function in patients with hip osteoarthritis.
  • A 2019 clinical practice guideline from the American College of Rheumatology and the Arthritis Foundation conditionally recommends acupuncture for osteoarthritis of the knee, hip, or hand. The guideline states that the greatest number of studies showing benefits have been for knee osteoarthritis.

For more information, see the NCCIH webpage on osteoarthritis.

Headache and Migraine

  • A 2020 review of nine studies that compared acupuncture with various drugs for preventing migraine found that acupuncture was slightly more effective, and study participants who received acupuncture were much less likely than those receiving drugs to drop out of studies because of side effects.
  • There’s moderate-quality evidence that acupuncture may reduce the frequency of migraines (from a 2016 evaluation of 22 studies with almost 5,000 people). The evidence from these studies also suggests that acupuncture may be better than sham acupuncture, but the difference is small. There is moderate- to low-quality evidence that acupuncture may reduce the frequency of tension headaches (from a 2016 evaluation of 12 studies with about 2,350 people).

For more information, see the NCCIH webpage on headache.

Myofascial Pain Syndrome

  • Myofascial pain syndrome is a common form of pain derived from muscles and their related connective tissue (fascia). It involves tender nodules called “trigger points.” Pressing on these nodules reproduces the patient’s pattern of pain.
  • A combined analysis of a small number of studies of acupuncture for myofascial pain syndrome showed that acupuncture applied to trigger points had a favorable effect on pain intensity (5 studies, 215 participants), but acupuncture applied to traditional acupuncture points did not (4 studies, 80 participants).  

Sciatica

  • Sciatica involves pain, weakness, numbness, or tingling in the leg, usually on one side of the body, caused by damage to or pressure on the sciatic nerve—a nerve that starts in the lower back and runs down the back of each leg.
  • Two 2015 evaluations of the evidence, one including 12 studies with 1,842 total participants and the other including 11 studies with 962 total participants, concluded that acupuncture may be helpful for sciatica pain, but the quality of the research is not good enough to allow definite conclusions to be reached.

Postoperative Pain

  • A 2016 evaluation of 11 studies of pain after surgery (with a total of 682 participants) found that patients treated with acupuncture or related techniques 1 day after surgery had less pain and used less opioid pain medicine after the operation.

Cancer Pain

  • A 2016 review of 20 studies (1,639 participants) indicated that acupuncture was not more effective in relieving cancer pain than conventional drug therapy. However, there was some evidence that acupuncture plus drug therapy might be better than drug therapy alone.
  • A 2017 review of 5 studies (181 participants) of acupuncture for aromatase inhibitor-induced joint pain in breast cancer patients concluded that 6 to 8 weeks of acupuncture treatment may help reduce the pain. However, the individual studies only included small numbers of women and used a variety of acupuncture techniques and measurement methods, so they were difficult to compare.
  • A larger 2018 study included 226 women with early-stage breast cancer who were taking aromatase inhibitors. The study found that the women who received 6 weeks of acupuncture treatment, given twice each week, reported less joint pain than the participants who received sham or no acupuncture.

Chronic Prostatitis/Chronic Pelvic Pain Syndrome

  • Chronic prostatitis/chronic pelvic pain syndrome is a condition in men that involves inflammation of or near the prostate gland; its cause is uncertain.
  • A review of 3 studies (204 total participants) suggested that acupuncture may reduce prostatitis symptoms, compared with a sham procedure. Because follow-up of the study participants was relatively brief and the numbers of studies and participants were small, a definite conclusion cannot be reached about acupuncture’s effects.

Irritable Bowel Syndrome

  • A 2019 review of 41 studies (3,440 participants) showed that acupuncture was no more effective than sham acupuncture for symptoms of irritable bowel syndrome, but there was some evidence that acupuncture could be helpful when used in addition to other forms of treatment.

For more information, see the NCCIH webpage on irritable bowel syndrome.

Fibromyalgia

  • A 2019 review of 12 studies (824 participants) of people with fibromyalgia indicated that acupuncture was significantly better than sham acupuncture for relieving pain, but the evidence was of low-to-moderate quality.

For more information, see the NCCIH webpage on fibromyalgia.

What does research show about acupuncture for conditions other than pain?

In addition to pain conditions, acupuncture has also been studied for at least 50 other health problems. There is evidence that acupuncture may help relieve seasonal allergy symptoms, stress incontinence in women, and nausea and vomiting associated with cancer treatment. It may also help relieve symptoms and improve the quality of life in people with asthma, but it has not been shown to improve lung function.

Seasonal Allergies (Allergic Rhinitis or Hay Fever)

  • A 2015 evaluation of 13 studies of acupuncture for allergic rhinitis, involving a total of 2,365 participants, found evidence that acupuncture may help relieve nasal symptoms. The study participants who received acupuncture also had lower medication scores (meaning that they used less medication to treat their symptoms) and lower blood levels of immunoglobulin E (IgE), a type of antibody associated with allergies.
  • A 2014 clinical practice guideline from the American Academy of Otolaryngology–Head and Neck Surgery included acupuncture among the options health care providers may offer to patients with allergic rhinitis.

For more information, see the NCCIH webpage on seasonal allergies.

Urinary Incontinence

  • Stress incontinence is a bladder control problem in which movement—coughing, sneezing, laughing, or physical activity—puts pressure on the bladder and causes urine to leak.
  • In a 2017 study of about 500 women with stress incontinence, participants who received electroacupuncture treatment (18 sessions over 6 weeks) had reduced urine leakage, with about two-thirds of the women having a decrease in leakage of 50 percent or more. This was a rigorous study that met current standards for avoiding bias.

Treatment-Related Nausea and Vomiting in Cancer Patients

  • Experts generally agree that acupuncture is helpful for treatment-related nausea and vomiting in cancer patients, but this conclusion is based primarily on research conducted before current guidelines for treating these symptoms were adopted. It’s uncertain whether acupuncture is beneficial when used in combination with current standard treatments for nausea and vomiting.

For more information, see the NCCIH webpage on cancer.

Asthma

  • In a study conducted in Germany in 2017, 357 participants receiving routine asthma care were randomly assigned to receive or not receive acupuncture, and an additional 1,088 people who received acupuncture for asthma were also studied. Adding acupuncture to routine care was associated with better quality of life compared to routine care alone.
  • A review of 9 earlier studies (777 participants) showed that adding acupuncture to conventional asthma treatment improved symptoms but not lung function.

For more information, see the NCCIH webpage on asthma.

Depression

  • A 2018 review of 64 studies (7,104 participants) of acupuncture for depression indicated that acupuncture may result in a moderate reduction in the severity of depression when compared with treatment as usual or no treatment. However, these findings should be interpreted with caution because most of the studies were of low or very low quality.

For more information, see the NCCIH webpage on depression.

Quitting Smoking

  • In recommendations on smoking cessation treatment issued in 2021, the U.S. Preventive Services Task Force, a panel of experts that makes evidence-based recommendations about disease prevention, did not make a recommendation about the use of acupuncture as a stop-smoking treatment because only limited evidence was available. This decision was based on a 2014 review of 9 studies (1,892 participants) that looked at the effect of acupuncture on smoking cessation results for 6 months or more and found no significant benefit. Some studies included in that review showed evidence of a possible small benefit of acupuncture on quitting smoking for shorter periods of time.

For more information, see the NCCIH webpage on quitting smoking.

Infertility

  • A 2021 review evaluated 6 studies (2,507 participants) that compared the effects of acupuncture versus sham acupuncture on the success of in vitro fertilization as a treatment for infertility. No difference was found between the acupuncture and sham acupuncture groups in rates of pregnancy or live birth.
  • A 2020 review evaluated 12 studies (1,088 participants) on the use of acupuncture to improve sperm quality in men who had low sperm numbers and low sperm motility. The reviewers concluded that the evidence was inadequate for firm conclusions to be drawn because of the varied design of the studies and the poor quality of some of them. 

Carpal Tunnel Syndrome

  • A 2018 review of 12 studies with 869 participants concluded that acupuncture and laser acupuncture (a treatment that uses lasers instead of needles) may have little or no effect on carpal tunnel syndrome symptoms in comparison with sham acupuncture. It’s uncertain how the effects of acupuncture compare with those of other treatments for this condition.    
  • In a 2017 study not included in the review described above, 80 participants with carpal tunnel syndrome were randomly assigned to one of three interventions: (1) electroacupuncture to the more affected hand; (2) electroacupuncture at “distal” body sites, near the ankle opposite to the more affected hand; and (3) local sham electroacupuncture using nonpenetrating placebo needles. All three interventions reduced symptom severity, but local and distal acupuncture were better than sham acupuncture at producing desirable changes in the wrist and the brain.

Hot Flashes Associated With Menopause

  • A 2018 review of studies of acupuncture for vasomotor symptoms associated with menopause (hot flashes and related symptoms such as night sweats) analyzed combined evidence from an earlier review of 15 studies (1,127 participants) and 4 newer studies (696 additional participants). The analysis showed that acupuncture was better than no acupuncture at reducing the frequency and severity of symptoms. However, acupuncture was not shown to be better than sham acupuncture.

For more information, see the NCCIH webpage on menopause.

What is auricular acupuncture good for?

  • Auricular acupuncture is a type of acupuncture that involves stimulating specific areas of the ear. 
  • In a 2019 review of 15 studies (930 participants) of auricular acupuncture or auricular acupressure (a form of auricular therapy that does not involve penetration with needles), the treatment significantly reduced pain intensity, and 80 percent of the individual studies showed favorable effects on various measures related to pain.
  • A 2020 review of 9 studies (783 participants) of auricular acupuncture for cancer pain showed that auricular acupuncture produced better pain relief than sham auricular acupuncture. Also, pain relief was better with a combination of auricular acupuncture and drug therapy than with drug therapy alone.
  • An inexpensive, easily learned form of auricular acupuncture called “battlefield acupuncture” has been used by the U.S. Department of Defense and Department of Veterans Affairs to treat pain. However, a 2021 review of 9 studies (692 participants) of battlefield acupuncture for pain in adults did not find any significant improvement in pain when this technique was compared with no treatment, usual care, delayed treatment, or sham battlefield acupuncture.

Is acupuncture safe?

  • Relatively few complications from using acupuncture have been reported. However, complications have resulted from use of nonsterile needles and improper delivery of treatments.
  • When not delivered properly, acupuncture can cause serious adverse effects, including infections, punctured organs, and injury to the central nervous system.
  • The U.S. Food and Drug Administration (FDA) regulates acupuncture needles as medical devices and requires that they be sterile and labeled for single use only.

Is acupuncture covered by health insurance?

  • Some health insurance policies cover acupuncture, but others don’t. Coverage is often limited based on the condition being treated.
  • An analysis of data from the Medical Expenditure Panel Survey, a nationally representative U.S. survey, showed that the share of adult acupuncturist visits with any insurance coverage increased from 41.1 percent in 2010–2011 to 50.2 percent in 2018–2019.
  • Medicare covers acupuncture only for the treatment of chronic low-back pain. Coverage began in 2020. Up to 12 acupuncture visits are covered, with an additional 8 visits available if the first 12 result in improvement. Medicaid coverage of acupuncture varies from state to state.

Do acupuncturists need to be licensed?

  • Most states license acupuncturists, but the requirements for licensing vary from state to state. To find out more about licensing of acupuncturists and other complementary health practitioners, visit the NCCIH webpage Credentialing, Licensing, and Education.

NCCIH-Funded Research

NCCIH funds research to evaluate acupuncture’s effectiveness for various kinds of pain and other conditions and to further understand how the body responds to acupuncture and how acupuncture might work. Some recent NCCIH-supported studies involve:

  • Evaluating the feasibility of using acupuncture in hospital emergency departments.
  • Testing whether the effect of acupuncture on chronic low-back pain can be enhanced by combining it with transcranial direct current stimulation.
  • Evaluating a portable acupuncture-based nerve stimulation treatment for anxiety disorders.

More To Consider

  • Don’t use acupuncture to postpone seeing a health care provider about a health problem.
  • Take charge of your health—talk with your health care providers about any complementary health approaches you use. Together, you can make shared, well-informed decisions.

For More Information

NCCIH Clearinghouse

The NCCIH Clearinghouse provides information on NCCIH and complementary and integrative health approaches, including publications and searches of Federal databases of scientific and medical literature. The Clearinghouse does not provide medical advice, treatment recommendations, or referrals to practitioners.

Toll-free in the U.S.: 1-888-644-6226

Telecommunications relay service (TRS): 7-1-1

Website: https://www.nccih.nih.gov

Email: [email protected]

Know the Science

NCCIH and the National Institutes of Health (NIH) provide tools to help you understand the basics and terminology of scientific research so you can make well-informed decisions about your health. Know the Science features a variety of materials, including interactive modules, quizzes, and videos, as well as links to informative content from Federal resources designed to help consumers make sense of health information.

Explaining How Research Works (NIH)

Know the Science: How To Make Sense of a Scientific Journal Article

Understanding Clinical Studies (NIH)

PubMed®

A service of the National Library of Medicine, PubMed® contains publication information and (in most cases) brief summaries of articles from scientific and medical journals. For guidance from NCCIH on using PubMed, see How To Find Information About Complementary Health Approaches on PubMed.

Website: https://pubmed.ncbi.nlm.nih.gov/

NIH Clinical Research Trials and You

The National Institutes of Health (NIH) has created a website, NIH Clinical Research Trials and You, to help people learn about clinical trials, why they matter, and how to participate. The site includes questions and answers about clinical trials, guidance on how to find clinical trials through ClinicalTrials.gov and other resources, and stories about the personal experiences of clinical trial participants. Clinical trials are necessary to find better ways to prevent, diagnose, and treat diseases.

Website: https://www.nih.gov/health-information/nih-clinical-research-trials-you

Research Portfolio Online Reporting Tools Expenditures & Results (RePORTER)

RePORTER is a database of information on federally funded scientific and medical research projects being conducted at research institutions.

Website: https://reporter.nih.gov

Key References

  • Befus D, Coeytaux RR, Goldstein KM, et al.  Management of menopause symptoms with acupuncture: an umbrella systematic review and meta-analysis . Journal of Alternative and Complementary Medicine. 2018;24(4):314-323.
  • Bleck R, Marquez E, Gold MA, et al. A scoping review of acupuncture insurance coverage in the United States. Acupuncture in Medicine. 2020;964528420964214.
  • Briggs JP, Shurtleff D.  Acupuncture and the complex connections between the mind and the body. JAMA. 2017;317(24):2489-2490.
  • Brinkhaus B, Roll S, Jena S, et al.  Acupuncture in patients with allergic asthma: a randomized pragmatic trial. Journal of Alternative and Complementary Medicine. 2017;23(4):268-277.
  • Chan MWC, Wu XY, Wu JCY, et al.  Safety of acupuncture: overview of systematic reviews. Scientific Reports. 2017;7(1):3369.
  • Coyle ME, Stupans I, Abdel-Nour K, et al.  Acupuncture versus placebo acupuncture for in vitro fertilisation: a systematic review and meta-analysis. Acupuncture in Medicine. 2021;39(1):20-29.
  • Hershman DL, Unger JM, Greenlee H, et al.  Effect of acupuncture vs sham acupuncture or waitlist control on joint pain related to aromatase inhibitors among women with early-stage breast cancer: a randomized clinical trial. JAMA. 2018;320(2):167-176.
  • Linde K, Allais G, Brinkhaus B, et al.  Acupuncture for the prevention of episodic migraine. Cochrane Database of Systematic Reviews. 2016;(6):CD001218. Accessed at  cochranelibrary.com on February 12, 2021.
  • Linde K, Allais G, Brinkhaus B, et al.  Acupuncture for the prevention of tension-type headache. Cochrane Database of Systematic Reviews. 2016;(4):CD007587. Accessed at  cochranelibrary.com on February 12, 2021.
  • MacPherson H, Vertosick EA, Foster NE, et al. The persistence of the effects of acupuncture after a course of treatment: a meta-analysis of patients with chronic pain . Pain. 2017;158(5):784-793.
  • Qaseem A, Wilt TJ, McLean RM, et al.  Noninvasive treatments for acute, subacute, and chronic low back pain: a clinical practice guideline from the American College of Physicians. Annals of Internal Medicine. 2017;166(7):514-530.
  • Seidman MD, Gurgel RK, Lin SY, et al.  Clinical practice guideline: allergic rhinitis. Otolaryngology—Head and Neck Surgery. 2015;152(suppl 1):S1-S43.
  • Vickers AJ, Vertosick EA, Lewith G, et al. Acupuncture for chronic pain: update of an individual patient data meta-analysis . The Journal of Pain. 2018;19(5):455-474.
  • White AR, Rampes H, Liu JP, et al.  Acupuncture and related interventions for smoking cessation. Cochrane Database of Systematic Reviews. 2014;(1):CD000009. Accessed at  cochranelibrary.com on February 17, 2021.
  • Zia FZ, Olaku O, Bao T, et al.  The National Cancer Institute’s conference on acupuncture for symptom management in oncology: state of the science, evidence, and research gaps. Journal of the National Cancer Institute. Monographs. 2017;2017(52):lgx005.

Other References

  • Adams D, Cheng F, Jou H, et al. The safety of pediatric acupuncture: a systematic review. Pediatrics. 2011;128(6):e1575-1587.
  • Candon M, Nielsen A, Dusek JA. Trends in insurance coverage for acupuncture, 2010-2019. JAMA Network Open. 2022;5(1):e2142509.
  • Cao J, Tu Y, Orr SP, et al. Analgesic effects evoked by real and imagined acupuncture: a neuroimaging study. Cerebral Cortex. 2019;29(8):3220-3231.
  • Centers for Medicare & Medicaid Services. Decision Memo for Acupuncture for Chronic Low Back Pain (CAG-00452N). Accessed at https://www.cms.gov/medicare-coverage-database/details/nca-decision-memo.aspx?NCAId=295 on June 25, 2021.
  • Chen L, Lin C-C, Huang T-W, et al. Effect of acupuncture on aromatase inhibitor-induced arthralgia in patients with breast cancer: a meta-analysis of randomized controlled trials . The Breast. 2017;33:132-138. 
  • Choi G-H, Wieland LS, Lee H, et al. Acupuncture and related interventions for the treatment of symptoms associated with carpal tunnel syndrome. Cochrane Database of Systematic Reviews. 2018;(12):CD011215. Accessed at cochranelibrary.com on January 28, 2021.
  • Cui J, Wang S, Ren J, et al. Use of acupuncture in the USA: changes over a decade (2002–2012). Acupuncture in Medicine. 2017;35(3):200-207.
  • Federman DG, Zeliadt SB, Thomas ER, et al. Battlefield acupuncture in the Veterans Health Administration: effectiveness in individual and group settings for pain and pain comorbidities. Medical Acupuncture. 2018;30(5):273-278.
  • Feng S, Han M, Fan Y, et al. Acupuncture for the treatment of allergic rhinitis: a systematic review and meta-analysis. American Journal of Rhinology & Allergy. 2015;29(1):57-62.
  • Franco JV, Turk T, Jung JH, et al. Non-pharmacological interventions for treating chronic prostatitis/chronic pelvic pain syndrome. Cochrane Database of Systematic Reviews. 2018;(5):CD012551. Accessed at cochranelibrary.com on January 28, 2021.
  • Freeman MP, Fava M, Lake J, et al. Complementary and alternative medicine in major depressive disorder: the American Psychiatric Association task force report. The Journal of Clinical Psychiatry . 2010;71(6):669-681.
  • Giovanardi CM, Cinquini M, Aguggia M, et al. Acupuncture vs. pharmacological prophylaxis of migraine: a systematic review of randomized controlled trials. Frontiers in Neurology. 2020;11:576272.
  • Hu C, Zhang H, Wu W, et al. Acupuncture for pain management in cancer: a systematic review and meta-analysis. Evidence-Based Complementary and Alternative Medicine. 2016;2016:1720239.
  • Jiang C, Jiang L, Qin Q. Conventional treatments plus acupuncture for asthma in adults and adolescent: a systematic review and meta-analysis. Evidence-Based Complementary and Alternative Medicine . 2019;2019:9580670.
  • Ji M, Wang X, Chen M, et al. The efficacy of acupuncture for the treatment of sciatica: a systematic review and meta-analysis. Evidence-Based Complementary and Alternative Medicine.  2015;2015:192808.
  • Kaptchuk TJ. Acupuncture: theory, efficacy, and practice. Annals of Internal Medicine . 2002;136(5):374-383.
  • Kolasinski SL, Neogi T, Hochberg MC, et al. 2019 American College of Rheumatology/Arthritis Foundation guideline for the management of osteoarthritis of the hand, hip, and knee. Arthritis Care & Research. 2020;72(2):149-162. 
  • Langevin H. Fascia mobility, proprioception, and myofascial pain. Life. 2021;11(7):668. 
  • Liu Z, Liu Y, Xu H, et al. Effect of electroacupuncture on urinary leakage among women with stress urinary incontinence: a randomized clinical trial. JAMA. 2017;317(24):2493-2501.
  • MacPherson H, Hammerschlag R, Coeytaux RR, et al. Unanticipated insights into biomedicine from the study of acupuncture. Journal of Alternative and Complementary Medicine. 2016;22(2):101-107.
  • Maeda Y, Kim H, Kettner N, et al. Rewiring the primary somatosensory cortex in carpal tunnel syndrome with acupuncture. Brain. 2017;140(4):914-927.
  • Manheimer E, Cheng K, Wieland LS, et al. Acupuncture for hip osteoarthritis. Cochrane Database of Systematic Reviews. 2018;(5):CD013010. Accessed at cochranelibrary.com on February 17, 2021. 
  • Moura CC, Chaves ECL, Cardoso ACLR, et al. Auricular acupuncture for chronic back pain in adults: a systematic review and metanalysis. Revista da Escola de Enfermagem da U S P. 2019;53:e03461.
  • Nahin RL, Rhee A, Stussman B. Use of complementary health approaches overall and for pain management by US adults. JAMA. 2024;331(7):613-615.
  • Napadow V. Neuroimaging somatosensory and therapeutic alliance mechanisms supporting acupuncture. Medical Acupuncture. 2020;32(6):400-402.
  • Patnode CD, Henderson JT, Coppola EL, et al. Interventions for tobacco cessation in adults, including pregnant persons: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA. 2021;325(3):280-298.
  • Qin Z, Liu X, Wu J, et al. Effectiveness of acupuncture for treating sciatica: a systematic review and meta-analysis. Evidence-Based Complementary and Alternative Medicine. 2015;2015:425108.
  • Smith CA, Armour M, Lee MS, et al. Acupuncture for depression. Cochrane Database of Systematic Reviews. 2018;(3):CD004046. Accessed at cochranelibrary.com on January 20, 2021.
  • US Preventive Services Task Force. Interventions for tobacco smoking cessation in adults, including pregnant persons. US Preventive Services Task Force recommendation statement. JAMA. 2021;325(3):265-279.
  • Vase L, Baram S, Takakura N, et al. Specifying the nonspecific components of acupuncture analgesia. Pain. 2013;154(9):1659-1667.
  • Wang R, Li X, Zhou S, et al. Manual acupuncture for myofascial pain syndrome: a systematic review and meta-analysis. Acupuncture in Medicine. 2017;35(4):241-250.
  • World Health Organization. WHO Traditional Medicine Strategy: 2014–2023. Geneva, Switzerland: World Health Organization, 2013. Accessed at https://www.who.int/publications/i/item/9789241506096 on February 2, 2021.
  • Wu M-S, Chen K-H, Chen I-F, et al. The efficacy of acupuncture in post-operative pain management: a systematic review and meta-analysis. PLoS One. 2016;11(3):e0150367.
  • Xu S, Wang L, Cooper E, et al. Adverse events of acupuncture: a systematic review of case reports. Evidence-Based Complementary and Alternative Medicine. 2013;2013:581203.
  • Yang J, Ganesh R, Wu Q, et al. Battlefield acupuncture for adult pain: a systematic review and meta-analysis of randomized controlled trials. The American Journal of Chinese Medicine. 2021;49(1):25-40.
  • Yang Y, Wen J, Hong J. The effects of auricular therapy for cancer pain: a systematic review and meta-analysis. Evidence-Based Complementary and Alternative Medicine. 2020;2020:1618767.  
  • Yeh CH, Morone NE, Chien L-C, et al. Auricular point acupressure to manage chronic low back pain in older adults: a randomized controlled pilot study. Evidence-Based Complementary and Alternative Medicine. 2014;2014:375173.
  • You F, Ruan L, Zeng L, et al. Efficacy and safety of acupuncture for the treatment of oligoasthenozoospermia: a systematic review. Andrologia. 2020;52(1):e13415.
  • Zhang X-C, Chen H, Xu W-T, et al. Acupuncture therapy for fibromyalgia: a systematic review and meta-analysis of randomized controlled trials. Journal of Pain Research. 2019;12:527-542.
  • Zheng H, Chen R, Zhao X, et al. Comparison between the effects of acupuncture relative to other controls on irritable bowel syndrome: a meta-analysis. Pain Research and Management. 2019;2019:2871505.

Acknowledgments

NCCIH thanks Pete Murray, Ph.D., David Shurtleff, Ph.D., and Helene M. Langevin, M.D., of NCCIH, for their review of the 2022 update of this fact sheet.

This publication is not copyrighted and is in the public domain. Duplication is encouraged.

NCCIH has provided this material for your information. It is not intended to substitute for the medical expertise and advice of your health care provider(s). We encourage you to discuss any decisions about treatment or care with your health care provider. The mention of any product, service, or therapy is not an endorsement by NCCIH.

Related Topics

Pain: Considering Complementary Approaches (eBook)

For Consumers

6 Things To Know When Selecting a Complementary Health Practitioner

For Health Care Providers

Mind and Body Approaches for Chronic Pain

Complementary Psychological and/or Physical Approaches for Cancer Symptoms and Treatment Side Effects

Research Results

New Findings Suggest Acupuncture Stimulation Reduces Systemic Inflammation

How the Body and Brain Achieve Carpal Tunnel Pain Relief via Acupuncture

Related Fact Sheets

Low-Back Pain and Complementary Health Approaches: What You Need To Know

Osteoarthritis: In Depth

Cancer and Complementary Health Approaches: What You Need To Know

Chronic Pain and Complementary Health Approaches

Traditional Chinese Medicine: What You Need To Know

Credentialing, Licensing, and Education

Paying for Complementary and Integrative Health Approaches

What Is Qualitative Research? | Methods & Examples

Published on June 19, 2020 by Pritha Bhandari. Revised on June 22, 2023.

Qualitative research involves collecting and analyzing non-numerical data (e.g., text, video, or audio) to understand concepts, opinions, or experiences. It can be used to gather in-depth insights into a problem or generate new ideas for research.

Qualitative research is the opposite of quantitative research, which involves collecting and analyzing numerical data for statistical analysis.

Qualitative research is commonly used in the humanities and social sciences, in subjects such as anthropology, sociology, education, health sciences, and history. Some example qualitative research questions:

  • How does social media shape body image in teenagers?
  • How do children and adults interpret healthy eating in the UK?
  • What factors influence employee retention in a large organization?
  • How is anxiety experienced around the world?
  • How can teachers integrate social issues into science curriculums?

Table of contents

  • Approaches to qualitative research
  • Qualitative research methods
  • Qualitative data analysis
  • Advantages of qualitative research
  • Disadvantages of qualitative research
  • Other interesting articles
  • Frequently asked questions about qualitative research

Qualitative research is used to understand how people experience the world. While there are many approaches to qualitative research, they tend to be flexible and focus on retaining rich meaning when interpreting data.

Common approaches include grounded theory, ethnography, action research, phenomenological research, and narrative research. They share some similarities, but emphasize different aims and perspectives.

Qualitative research approaches
Approach | What does it involve?
Grounded theory | Researchers collect rich data on a topic of interest and develop theories.
Ethnography | Researchers immerse themselves in groups or organizations to understand their cultures.
Action research | Researchers and participants collaboratively link theory to practice to drive social change.
Phenomenological research | Researchers investigate a phenomenon or event by describing and interpreting participants’ lived experiences.
Narrative research | Researchers examine how stories are told to understand how participants perceive and make sense of their experiences.

Note that qualitative research is at risk for certain research biases, including the Hawthorne effect, observer bias, recall bias, and social desirability bias. While not always totally avoidable, awareness of potential biases as you collect and analyze your data can prevent them from impacting your work too much.


Each of these research approaches involves using one or more data collection methods. These are some of the most common qualitative methods:

  • Observations: recording what you have seen, heard, or encountered in detailed field notes.
  • Interviews: personally asking people questions in one-on-one conversations.
  • Focus groups: asking questions and generating discussion among a group of people.
  • Surveys: distributing questionnaires with open-ended questions.
  • Secondary research: collecting existing data in the form of texts, images, audio or video recordings, etc.
For example, to research the culture of a large company, you might combine several of these methods:

  • You take field notes with observations and reflect on your own experiences of the company culture.
  • You distribute open-ended surveys to employees across all the company’s offices by email to find out if the culture varies across locations.
  • You conduct in-depth interviews with employees in your office to learn about their experiences and perspectives in greater detail.

Qualitative researchers often consider themselves “instruments” in research because all observations, interpretations and analyses are filtered through their own personal lens.

For this reason, when writing up your methodology for qualitative research, it’s important to reflect on your approach and to thoroughly explain the choices you made in collecting and analyzing the data.

Qualitative data can take the form of texts, photos, videos and audio. For example, you might be working with interview transcripts, survey responses, fieldnotes, or recordings from natural settings.

Most types of qualitative data analysis share the same five steps:

  • Prepare and organize your data. This may mean transcribing interviews or typing up fieldnotes.
  • Review and explore your data. Examine the data for patterns or repeated ideas that emerge.
  • Develop a data coding system. Based on your initial ideas, establish a set of codes that you can apply to categorize your data.
  • Assign codes to the data. For example, in qualitative survey analysis, this may mean going through each participant’s responses and tagging them with codes in a spreadsheet. As you go through your data, you can create new codes to add to your system if necessary.
  • Identify recurring themes. Link codes together into cohesive, overarching themes.
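To make steps 3 to 5 concrete, here is a minimal, hypothetical sketch in Python of applying a coding system to open-ended survey responses and tallying recurring codes. The responses, code names, and keyword rules are all invented for illustration; real qualitative coding is an interpretive judgment, not simple keyword matching.

    # Hypothetical sketch of steps 3-5: coding responses and tallying themes.
    from collections import Counter

    responses = [
        "I love the flexible hours, but meetings run too long.",
        "Management rarely explains its decisions.",
        "Flexible scheduling is why I stay at this company.",
    ]

    # Step 3: a toy keyword-based coding system (invented for illustration).
    coding_system = {
        "flexibility": ["flexible", "scheduling"],
        "communication": ["meetings", "explains", "decisions"],
    }

    # Step 4: assign codes to each response.
    coded = []
    for text in responses:
        codes = [code for code, keywords in coding_system.items()
                 if any(word in text.lower() for word in keywords)]
        coded.append((text, codes))

    # Step 5: count recurring codes to surface candidate themes.
    theme_counts = Counter(code for _, codes in coded for code in codes)
    print(theme_counts)  # Counter({'flexibility': 2, 'communication': 2})

In practice, qualitative analysis software or a plain spreadsheet plays the bookkeeping role shown here, while the researcher supplies the interpretation.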

There are several specific approaches to analyzing qualitative data. Although these methods share similar processes, they emphasize different concepts.

Qualitative data analysis
Approach | When to use | Example
Content analysis | To describe and categorize common words, phrases, and ideas in qualitative data. | A market researcher could perform content analysis to find out what kind of language is used in descriptions of therapeutic apps.
Thematic analysis | To identify and interpret patterns and themes in qualitative data. | A psychologist could apply thematic analysis to travel blogs to explore how tourism shapes self-identity.
Textual analysis | To examine the content, structure, and design of texts. | A media researcher could use textual analysis to understand how news coverage of celebrities has changed in the past decade.
Discourse analysis | To study communication and how language is used to achieve effects in specific contexts. | A political scientist could use discourse analysis to study how politicians generate trust in election campaigns.
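As a rough illustration of the content analysis row above, a simple word-frequency count is often the first step in describing the common words and phrases in a body of text. This is a hypothetical sketch; the app descriptions and stopword list are invented.

    # Hypothetical content-analysis sketch: frequent words in app descriptions.
    import re
    from collections import Counter

    descriptions = [
        "Calm your mind with guided meditation and daily mindfulness.",
        "Track your mood and build daily mindfulness habits.",
    ]

    stopwords = {"your", "with", "and", "the", "a"}
    words = [w for text in descriptions
             for w in re.findall(r"[a-z]+", text.lower())
             if w not in stopwords]

    print(Counter(words).most_common(3))
    # [('daily', 2), ('mindfulness', 2), ('calm', 1)]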

Qualitative research often tries to preserve the voice and perspective of participants and can be adjusted as new research questions arise. Qualitative research is good for:

  • Flexibility

The data collection and analysis process can be adapted as new ideas or patterns emerge. They are not rigidly decided beforehand.

  • Natural settings

Data collection occurs in real-world contexts or in naturalistic ways.

  • Meaningful insights

Detailed descriptions of people’s experiences, feelings and perceptions can be used in designing, testing or improving systems or products.

  • Generation of new ideas

Open-ended responses mean that researchers can uncover novel problems or opportunities that they wouldn’t have thought of otherwise.


Researchers must consider practical and theoretical limitations in analyzing and interpreting their data. Qualitative research suffers from:

  • Unreliability

The real-world setting often makes qualitative research unreliable because of uncontrolled factors that affect the data.

  • Subjectivity

Due to the researcher’s primary role in analyzing and interpreting data, qualitative research cannot be replicated. The researcher decides what is important and what is irrelevant in data analysis, so interpretations of the same data can vary greatly.

  • Limited generalizability

Small samples are often used to gather detailed data about specific contexts. Despite rigorous analysis procedures, it is difficult to draw generalizable conclusions because the data may be biased and unrepresentative of the wider population.

  • Labor-intensive

Although software can be used to manage and record large amounts of text, data analysis often has to be checked or performed manually.

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Chi square goodness of fit test
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Inclusion and exclusion criteria

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

There are five common approaches to qualitative research:

  • Grounded theory involves collecting data in order to develop new theories.
  • Ethnography involves immersing yourself in a group or organization to understand its culture.
  • Narrative research involves interpreting stories to understand how people make sense of their experiences and perceptions.
  • Phenomenological research involves investigating phenomena through people’s lived experiences.
  • Action research links theory and practice in several cycles to drive innovative changes.

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organizations.

There are various approaches to qualitative data analysis , but they all share five steps in common:

  • Prepare and organize your data.
  • Review and explore your data.
  • Develop a data coding system.
  • Assign codes to the data.
  • Identify recurring themes.

The specifics of each step depend on the focus of the analysis. Some common approaches include textual analysis , thematic analysis , and discourse analysis .

Cite this Scribbr article

Bhandari, P. (2023, June 22). What Is Qualitative Research? | Methods & Examples. Scribbr. Retrieved September 3, 2024, from https://www.scribbr.com/methodology/qualitative-research/

Money blog: Oasis resale U-turn as official reseller lowers fee amid criticism

The Money blog is your place for consumer and personal finance news and tips. Today's posts include Twickets lowering fees for Oasis tickets, the extension of the Household Support Fund and O2 Priority axing free Greggs. Listen to a Daily podcast on the Oasis ticket troubles as you scroll.

Monday 2 September 2024 20:11, UK

  • Oasis resale U-turn as Twickets lowers fee after criticism
  • Millions to get cost of living payments this winter as scheme extended
  • O2 Priority customers fume as Greggs perk scaled back


Twickets has announced it is lowering its charges after some Oasis fans had to pay more than £100 in extra fees to buy official resale tickets.

The site is where the band themselves are directing people to buy second-hand tickets at face value - having warned people against unofficial third-party sellers like StubHub and Viagogo.

One person branded the extra fees "ridiculous" (see more in the 10.10 post), after many people had already been left disappointed at the weekend when Ticketmaster's dynamic pricing pushed some tickets up to around three times the original advertised price.

Twickets said earlier that it typically charged a fee of 10-15% of the face value of the tickets.

But it has since said it will lower the charge due to "exceptional demand" from Oasis fans - taking ownership of an issue in a way fans will hope others follow. 

Richard Davies, Twickets founder, told the Money blog: "Due to the exceptional demand for the Oasis tour in 2025, Twickets have taken the decision to lower our booking fee to 10% and a 1% transactional fee (to cover bank charges) for all buyers of their tickets on our platform. In addition we have introduced a fee cap of £25 per ticket for these shows. Sellers of tickets already sell free of any Twickets charge.

"This ensures that Twickets remains hugely competitive against the secondary market, including sites such as Viagogo, Gigsberg and StubHub.

"Not only do these platforms inflate ticket prices way beyond their original face value but they also charge excessive booking fees, usually in the region of 30-40%. Twickets by comparison charges an average fee of around 12.5%"

The fee cap, which the Money blog understands is being implemented today, will apply to anyone who has already bought resale tickets through the site.

Mr Davies said Twickets was a "fan first" resale site and a "safe and affordable place" for people to trade unwanted tickets.

"The face value of a ticket is the total amount it was first purchased for, including any booking fee. Twickets does not set the face value price, that is determined by the event and the original ticketing company. The price listed on our platform is set by the seller, however no one is permitted to sell above the face-value on Twickets, and every ticket is checked before listing that it complies with this policy," he said.

Meanwhile, hundreds of people have complained to the regulator about how Oasis tickets were advertised ahead of going on sale. 

The Advertising Standards Authority said it had received 450 complaints about Ticketmaster adverts for the gigs.

Some expressed their anger on social media, as tickets worth £148 were being sold for £355 on the site within hours of release due to the "dynamic pricing" system.

A spokesperson from ASA said the complainants argue that the adverts made "misleading claims about availability and pricing".

They added: "We're carefully assessing these complaints and, as such, can't comment any further at this time.

"To emphasise, we are not currently investigating these ads."

Ticketmaster said it does not set prices and its website says this is down to the "event organiser" who "has priced these tickets according to their market value".

Despite traditionally being an affordable staple of British cuisine, the average price for a portion of fish and chips has risen by more than 50% in the past five years to nearly £10, according to the Office for National Statistics.

Sonny and Shane "the codfather" Lee told Sky News of the challenges that owning J-Henry's Fish and Chip Shop brings and why prices have skyrocketed. 

"Potatoes, fish, utilities, cooking oil - so many things [are going up]," he said. 

Shane also said that he is used to one thing at a time increasing in price, but the outlook today sees multiple costs going up all at once.  

"Potatoes [were] priced right up to about £25 a bag - the previous year it was about £10 a bag," Sonny said, noting a bad harvest last year. 

He said the business had tried hake as a cheaper fish option, but that consumers continued to prefer the more traditional, but expensive, cod and haddock. 

"It's hard and we can we can absorb the cost to a certain extent, but some of it has to be passed on," Shane added. 

After a long Saturday for millions of Oasis fans in online queues, the culture secretary says surge pricing - which pushed the price of some tickets up by three times their original advertised value to nearly £400 - will be part of the government's review of the ticket market. 

On today's episode of the Daily podcast, host Niall Paterson speaks to secondary ticketing site Viagogo. While it wasn’t part of dynamic pricing, it has offered resale tickets for thousands of pounds since Saturday. 

Matt Drew from the company accepts the industry needs a full review, while Adam Webb, from the campaign group FanFair Alliance, explains the changes it would like to see.

We've covered the fallout of the Oasis sale extensively in the Money blog today - see the culture secretary's comments on the "utterly depressing" inflated pricing in our post at 6.37am, and Twickets, the official Oasis resale site, slammed by angry fans for its "ridiculous" added fees at 10.10am.

The growing backlash culminated in action from Twickets - the company said it would lower its charges after some fans had to pay more than £100 in extra fees for resale tickets (see post at 15.47).


Last week we reported that employers will have to offer flexible working hours - including a four-day week - to all workers under new government plans.

To receive their full pay, employees would still have to work their full hours but compressed into a shorter working week - for example, a 37.5-hour week worked as four days of roughly 9.4 hours rather than five 7.5-hour days - something some workplaces already do.

Currently, employees can request flexible hours as soon as they start at a company but employers are not legally obliged to agree.

The Labour government now wants to make it so employers have to offer flexible hours from day one, except where it is "not reasonably feasible".

You can read more of the details in this report by our politics team:

But what does the public think about this? We asked our followers on LinkedIn to give their thoughts in an unofficial poll.

It revealed that the overwhelming majority of people support the idea to compress the normal week's hours into fewer days - some 83% of followers said they'd choose this option over a standard five-day week.

But despite the poll showing a clear preference for a compressed week, our followers appeared divided in the comments.

"There's going to be a huge brain-drain as people move away from companies who refuse to adapt with the times and implement a 4 working week. This will be a HUGE carrot for many orgs," said Paul Burrows, principal software solutions manager at Reality Capture.

Louise McCudden, head of external affairs at MSI Reproductive Choices, said she wasn't surprised at the amount of people choosing longer hours over fewer days as "a lot of people" are working extra hours on a regular basis anyway.

But illustrator and administrative professional Leslie McGregor noted the plan wouldn't be possible in "quite a few industries and quite a few roles, especially jobs that are customer centric and require 'round the clock service' and are heavily reliant upon people in trades, maintenance, supply and transport". 

"Very wishful thinking," she said.

Paul Williamson had a similar view. He said: "I'd love to know how any customer first service business is going to manage this."

We reported earlier that anyone with O2 Priority will have their free weekly Greggs treats replaced by £1 monthly Greggs treats - see 6.21am post.

But did you know there are loads of other ways to get food from the nation's most popular takeaway for free or at a discount?

Downloading the Greggs app is a good place to start, as the bakery lists freebies, discounts and special offers there regularly.

New users also get rewards just for signing up, so it's worth checking out. 

And there's a digital loyalty card which you can add virtual "stamps" to with each purchase to unlock discounts or other freebies.  

Vodafone rewards

Seriously begrudged Virgin Media O2 customers may want to consider switching providers. 

The Vodafone Rewards app, VeryMe, sometimes gives away free Greggs coffees, sausage rolls, sweet treats and more to customers.

Monzo customers

Monzo bank account holders can grab a sausage roll (regular or vegan), regular-sized hot drink, doughnut or muffin every week.

Birthday cake

Again, you'll need the Greggs app for this one, which will allow you to claim one free cupcake, cream cake or doughnut for your birthday each year.

Octopus customers

Octopus Energy customers with smart meters can claim one free drink each week, in-store from Greggs (or Caffè Nero).

The Greggs freebie must be a regular size hot drink.

Make new friends

If you're outgoing (and hungry), it may be worth befriending a Greggs staff member.

The staff discount at Greggs is 50% on own-produced goods and 25% off branded products. 

Iceland multi-packs

If you aren't already aware, Iceland offers four Greggs sausage rolls in a multi-pack for £3.

That means, if you're happy to bake it yourself, you'll only be paying 75p per sausage roll.

Millions of Britons could receive extra cash to help with the cost of living this winter after the government extended the Household Support Fund.

A £421m pot will be given to local councils in England to distribute, while £79m will go to the devolved administrations.

The fund will now be available until April 2025 having been due to run out this autumn.

Councils decide how to dish out their share of the fund but it's often via cash grants or vouchers.

Many councils also use the cash to work with local charities and community groups to provide residents with key appliances, school uniforms, cookery classes and items to improve energy efficiency in the home.

Chancellor Rachel Reeves said: "The £22bn black hole inherited from the previous government means we have to take tough decisions to fix the foundations of our economy.

"But extending the Household Support Fund is the right thing to do - provide targeted support for those who need it most as we head into the winter months."

The government has been criticised for withdrawing universal winter fuel payments for pensioners of up to £300 this winter - with people now needing to be in receipt of certain means-tested benefits to qualify.

People should contact their local council for details on how to apply for the Household Support Fund - they can find their council  here .

The Lloyds Bank app appears to have gone down for many, with users unable to see their transactions.

Down Detector, which monitors site outages, has seen more than 600 reports this morning.

It appears to be affecting online banking as well as the app.

There have been some suggestions the apparent issue could be due to an update.

One disgruntled user said: "Absolutely disgusting!! I have an important payment to make and my banking is down. There was no warning given prior to this? Is it a regular maintenance? Impossible to get hold of someone to find out."

A Lloyds Bank spokesperson told Sky News: "We know some of our customers are having issues viewing their recent transactions and our app may be running slower than usual.

"We're sorry about this and we're working to have everything back to normal soon."

We had anger over unofficial resale prices, then Ticketmaster's dynamic pricing - and now fees on the official resale website are causing consternation among Oasis fans.

The band has encouraged anyone wanting resale tickets to buy them at face value from Ticketmaster or Twickets - after some appeared for £6,000 or more on other sites.

"Tickets appearing on other secondary ticketing sites are either counterfeit or will be cancelled by the promoters," Oasis said.

With that in mind, fans flocked to buy resale tickets from the sites mentioned above - only to find further fees are being added on. 

Mainly Oasis, a fan page, shared one image showing a Twickets fee for two tickets as high as £138.74. 

"Selling the in demand tickets completely goes against the whole point of their company too… never mind adding a ridiculous fee on top of that," the page shared. 

Fan Brad Mains shared a photo showing two tickets priced at £337.50 each (face value of around £150, but increased due to dynamic pricing on Saturday) - supplemented by a £101.24 Twickets fee. 

That left him with a grand total of £776.24 to pay for two tickets - 2 x £337.50 = £675 in ticket price, plus the £101.24 fee (almost exactly 15% on top).

"Actually ridiculous this," he  said on X .

"Ticketmaster inflated price then sold for 'face value' on Twickets with a £100 fee. 2 x £150 face value tickets for £776, [this] should be illegal," he added. 

Twickets typically charges a fee of between 10% and 15% of the ticket value.

We have approached the company for comment.

Separately, the government is now looking at the practice of dynamic pricing - and we've had a response to that from the Competition and Markets Authority this morning.

It said: "We want fans to get a fair deal when they go to buy tickets on the secondary market and have already taken action against major resale websites to ensure consumer law is being followed properly. 

"But we think more protections are needed for consumers here, so it is positive that the government wants to address this. We now look forward to working with them to get the best outcomes for fans and fair-playing businesses."

Consumer protection law does not ban dynamic pricing and it is a widely used practice. However, the law also states that businesses should not mislead consumers about the price they must pay for a product, either by providing false or deceptive information or by leaving out important information or providing it too late.

By James Sillars, business reporter

It's a false start to the end of the summer holidays in the City.

While London is mostly back at work, trading is fairly subdued due to the US Labor (that's labour, as in work) Day holiday.

US markets will not open again until Tuesday.

There's little direction across Europe, with the FTSE 100 trading nine points down at 8,365.

Leading the gainers was Rightmove - up 24%. The property search website is the subject of a possible cash and shares takeover offer by Australian rival REA.

REA is a division of Rupert Murdoch's News Corp.

One other point to note is the continuing fluctuation in oil prices.

Brent crude is down 0.7% at the start of the week, at $76 a barrel.

Dragging the price lower is further evidence of weaker demand in China.

Australia's REA Group is considering a takeover of Rightmove, in a deal which could be worth about £4.36bn.

REA Group said in a statement this morning there are "clear similarities" between the companies, which have "highly aligned cultural values".

Rightmove is the UK's largest online property portal, while REA is Australia's largest property website. 

REA employs more than 2,800 people and is majority-owned by News Corp.

REA Group said: "REA sees a transformational opportunity to apply its globally leading capabilities and expertise to enhance customer and consumer value across the combined portfolio, and to create a global and diversified digital property company, with number one positions in Australia and the UK.

"There can be no certainty that an offer will be made, nor as to the terms on which any offer may be made."

Rightmove has been approached for comment.
