- Is an expanded literature review that attempts to collect and summarize ALL of the primary research studies previously conducted by others in pursuit of their own research questions.
- Is when the author of a systematic review uses statistical methods to summarize the results of the studies s/he found. No new primary data are generated; the author of the meta-analysis is using statistics to enhance her/his analysis of other researchers' data (a sketch of this kind of pooling follows below).
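To make the statistical side concrete, here is a minimal Python sketch of the inverse-variance pooling a fixed-effect meta-analysis performs. The effect sizes and standard errors are hypothetical stand-ins for five previously published primary studies:

```python
import math

# Hypothetical effect sizes (e.g., log odds ratios) and standard errors
# extracted from five previously published primary studies
effects = [0.42, 0.10, 0.35, 0.28, 0.51]
std_errs = [0.21, 0.18, 0.25, 0.15, 0.30]

# Fixed-effect inverse-variance pooling: more precise studies get more weight
weights = [1 / se ** 2 for se in std_errs]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"pooled effect = {pooled:.3f} (SE = {pooled_se:.3f})")
```

Note that no new primary data appear anywhere in this calculation; the pooled estimate is derived entirely from results other researchers already published.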
It is also often used in other types of resources, such as non-research-based articles, books, and documentaries, to provide background information that ties back to primary research, which the person using the source can locate if they want to see the actual primary research study.
The further away the reader gets from the original research source, the more likely the reader will lose or misinterpret the original context of the data being used.
For example:
I publish a research study (primary research) in 2014 focused on parent monitoring of their child's cell phone usage. In my study, I surveyed 800 ninth graders in Massachusetts, and the data I gathered for one specific question showed that 10% don't have a cell phone, 70% have a cell phone on their parents' plan, and 20% have a cell phone on their own individual plan.
Another researcher uses my research in their literature review published in 2018 and summarizes my findings as "one survey found that 90% of ninth graders have cell phones".
You find the 2018 article and generalize that author's summary of my findings to state that "a 2014 survey found that 10% of ninth graders don't have a cell phone".
How accurate is the context in which you interpret my data? How accurate is the context of the data for the person who reads your literature review? By the time you are using my data from the secondary resource, neither you nor your reader has any idea that this was a study of 800 participants, that it was focused on only one state, or that it distinguished, among those with a phone, between students on their own plan and students on their parents' plan.
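A short Python sketch, using the numbers from the example above, makes the loss of context explicit:

```python
# The primary study's findings as reported in 2014
primary_2014 = {
    "n": 800,
    "population": "ninth graders in Massachusetts",
    "no_phone": 0.10,
    "phone_on_parents_plan": 0.70,
    "phone_on_own_plan": 0.20,
}

# The percentages correspond to 80, 560, and 160 students respectively
counts = {key: round(share * primary_2014["n"])
          for key, share in primary_2014.items() if isinstance(share, float)}
print(counts)  # {'no_phone': 80, 'phone_on_parents_plan': 560, 'phone_on_own_plan': 160}

# What survives each round of summarization
summary_2018 = "one survey found that 90% of ninth graders have cell phones"
your_paper = "a 2014 survey found that 10% of ninth graders don't have a cell phone"
# The sample size, the single-state scope, and the plan breakdown are all gone.
```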
Tips when first evaluating a source: Once you find a research-based source, read the abstract and/or methodology section and ask yourself: who conducted the actual research process to gather the data?
Common sources where primary research in the field of Education is published are:
Any of the above sources may also be peer-reviewed, meaning that the content is reviewed by other professionals in the field before it is published. Peer-reviewed, scholarly journals are considered high quality and are often a requirement in your research assignments: they are produced by experts and professionals in the field, and every primary research article is put through a peer-review vetting process detailed by the journal publisher. For those same reasons, professionals in the field in turn tend to publish their research in such journals. It is important to keep a couple of things in mind:
Research databases are a great tool for finding and accessing primary research articles published in scholarly/academic, peer-reviewed journals. In the field of Education, our library provides access to ERIC, Education Source, ProQuest Education Database, and Academic Search Ultimate. Each of these databases has search functions to help you narrow your results list down to scholarly/academic, peer-reviewed journals.
Depending on your topic and research question you may need to explore research databases in other disciplines as well. Here are two examples:
Primary resources contain first-hand information, meaning that you are reading the author's own account of a specific topic or event that s/he participated in. Examples of primary resources include scholarly research articles, books, and diaries. Primary sources such as research articles often do not explain terminology and theoretical principles in detail, so readers of primary scholarly research should have foundational knowledge of the subject area. Use primary resources to obtain a first-hand account of an actual event and to identify original research done in a field. For many of your papers, use of primary resources will be a requirement.
Examples of a primary source are:
How to locate primary research in NU Library:
Secondary sources describe, summarize, or discuss information or details originally presented in another source; meaning the author, in most cases, did not participate in the event. This type of source is written for a broad audience and will include definitions of discipline specific terms, history relating to the topic, significant theories and principles, and summaries of major studies/events as related to the topic. Use secondary sources to obtain an overview of a topic and/or identify primary resources. Refrain from including such resources in an annotated bibliography for doctoral level work unless there is a good reason.
Examples of a secondary source are:
Locate secondary resources in NU Library within the following databases:
This workshop introduces you to the beginning stages of the research process, focusing on identifying different types of information and gathering background information through electronic books.
Have you ever wondered how businesses and researchers gather those fresh insights that drive innovation and decision-making? That's where primary research steps in. In a world where information is gold, primary research acts as a direct channel to tap into the thoughts, behaviors, and preferences of people. Whether you're exploring new market trends, fine-tuning a product, or understanding human behavior, primary research is your compass for navigating the sea of possibilities.
Primary research is the systematic process of gathering original data directly from individuals, sources, or phenomena to address specific research questions or objectives. This firsthand approach involves designing and conducting research methods such as surveys and interviews to generate unique insights and information tailored to the researcher's specific area of inquiry. Primary research enables researchers to collect data that is relevant, accurate, and directly applicable to their research goals, providing a foundation for deeper understanding and informed decision-making.
Primary research offers many advantages that contribute to its effectiveness and relevance. Here are the key benefits that make primary research a powerful tool for generating insights:
While primary research involves collecting new data, secondary research involves analyzing existing data gathered by others. Secondary research is useful for building context, identifying trends, and gaining insights from previous studies. However, primary research provides you with unique insights and a firsthand understanding of your subject.
Before embarking on your primary research journey, thorough planning is essential to ensure its success.
Clearly defining your research objectives and questions is the foundation of effective primary research. Ask yourself:
Select a method that aligns with your research objectives. Common methods include surveys, interviews, observations, experiments, case studies, and focus groups, each with strengths and limitations.
Identify the individuals, groups, or subjects you want to study. Your target audience will determine the relevance of your findings. Ensure your sample is large enough to be representative of your target population.
Primary research offers a diverse range of methods to gather data directly from sources, enabling you to gain unique insights and answers to your research questions. Each method has its strengths, and the choice of method depends on your research objectives, the nature of your subject, and the available resources.
Surveys and questionnaires are widely used methods to collect data from a large number of participants. You present a series of structured questions, which participants respond to by selecting predefined choices or providing open-ended answers.
Surveys are efficient for obtaining quantitative data and are suitable for studying opinions, preferences, behaviors, and demographics. Online platforms, such as Appinio and Google Forms, facilitate easy distribution and data collection.
Interviews involve direct conversations between the researcher and participants. Interviews can be structured, semi-structured, or unstructured.
Interviews are valuable for gathering rich qualitative data and insights into participants' experiences, thoughts, and emotions.
Observational research involves systematically observing and recording behaviors, interactions, and occurrences in natural settings. Researchers can be either active participants or passive observers. This method is ideal for studying behavior patterns, social interactions, and environmental influences.
Observational research provides a window into real-world behaviors without the potential bias that can arise from self-reporting. It requires careful planning to ensure data collection is consistent and objective.
Experiments involve manipulating variables to study cause-and-effect relationships. Researchers create controlled environments to test hypotheses and assess how changes in one variable impact another.
In contrast, A/B testing is a specific form of experimentation used in marketing and product development. It compares two versions (A and B) of a variable, such as a website layout or email subject line, to determine which performs better.
Experiments and A/B testing are powerful for establishing causal relationships and measuring the impact of interventions.
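As a concrete illustration, here is a minimal Python sketch of how an A/B test result might be evaluated with a two-proportion z-test; the conversion counts and the 2,000-user group sizes are hypothetical:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical example: version A converts 120/2000, version B converts 150/2000
z, p = two_proportion_ztest(120, 2000, 150, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the difference between the two versions is unlikely to be due to chance alone; in practice the test would be paired with a check that the group sizes give adequate statistical power.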
Case studies involve an in-depth examination of a single subject, context, or phenomenon.
Researchers gather and analyze various data sources, such as interviews, documents, and observations, to provide a holistic understanding.
Case studies are valuable for exploring complex issues in detail and generating nuanced insights. While they lack generalizability due to their focus on specific instances, case studies contribute rich contextual information to the research landscape.
Focus groups gather a small group of participants to discuss specific topics guided by a moderator. These discussions encourage participants to share their opinions, perceptions, and experiences, fostering interaction and generating qualitative data.
Focus groups are valuable for exploring collective perspectives, identifying shared trends, and uncovering diverse viewpoints. The dynamic nature of group interactions can lead to the emergence of unexpected insights.
When selecting a primary research method, consider factors such as the nature of your research question, the level of detail you require, the resources available, and the preferences of your target audience. Combining multiple methods or triangulating data from different sources often enhances the validity and depth of your findings.
By choosing the suitable primary research method for your project, you can gather meaningful insights that contribute to your understanding of the subject at hand.
To better understand how primary research is applied in various fields, let's explore some real-world examples that showcase the diversity and effectiveness of different primary research methods:
While primary research offers numerous benefits, it also comes with inherent limitations. Being aware of these limitations is essential for conducting rigorous and well-rounded research:
Primary research is your direct line to understanding your customers, improving products, and making smarter decisions. It's like having a conversation with your audience, getting insights straight from the source. Whether you're asking them questions, watching their behaviors, or testing new ideas, primary research gives you the real-deal information you need to stay competitive and relevant.
Remember, primary research isn't just for big corporations – even small businesses can tap into its power. By listening to your customers and adapting based on their input, you're not only meeting their needs but also building a stronger, customer-focused brand.
At Appinio , we're not just a market research platform but your partner in propelling your business forward. Imagine having the power to harness real-time consumer insights effortlessly, enabling you to make swift, data-driven decisions that fuel your success.
Primary research or secondary research? How do you decide which is best for your dissertation paper?
As researchers, we need to be aware of the pros and cons of the two types of research methods to make sure the selected method is the most appropriate, taking into account the topic of investigation .
The success of any dissertation paper largely depends on choosing the correct research design . Before you can decide whether to base your research strategy on primary or secondary research, it is important to understand the difference between primary resources and secondary resources.
What are primary sources?
According to UCL libraries, primary sources are articles, images, or documents that provide direct evidence or first-hand testimony about any given research topic.
It is important that we have a clear understanding of the information resulting from the actions under investigation . Primary sources allow us to get close to those events and to recognise their analysis and interpretation in scientific and academic communities.
Classic examples of primary sources include:
However, when the researcher wishes to analyse and understand information coming out of events or actions that have already occurred, their work is regarded as a secondary source.
In essence, no secondary source can be created without using primary sources. The same information source or evidence can be considered either primary or secondary, depending on who is presenting the information and where the information is presented.
Some examples of secondary sources are:
Need help getting started with your dissertation paper? Here is a comprehensive article on "How to write a dissertation – Step by step guide".
Below you will find detailed guidelines to help you make an informed decision if you have been asking yourself, "Should I use primary or secondary research in my dissertation?"
Primary research includes an exhaustive analysis of data to answer research questions that are specific and exploratory in nature.
Primary research methods involve the use of various primary research tools, such as interviews, research surveys , numerical data, observations, audio, video, and images, to collect data directly rather than relying on existing literature.
Business organisations throughout the world have their employees or an external research agency conduct primary research on their behalf to address certain issues. On the other hand, undergraduate and postgraduate students conduct primary research as part of their dissertation projects to fill an obvious research gap in their respective fields of study.
As indicated above, primary data can be collected in a number of ways, and so we have also conducted in-depth research on the most common yet independent primary data collection techniques .
When conducting primary research, it is vitally important to pay attention to the chosen sampling method which can be described as “ a specific principle used to select members of the population to participate in the research ”.
Oftentimes, the researcher might not be able to work directly with the targeted population because of its large size, so it becomes indispensable to employ statistical sampling techniques, where the researcher draws conclusions based on responses collected from a representative sample of the population.
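For example, the most basic technique, simple random sampling, can be sketched in a few lines of Python; the sampling frame of 10,000 members and the sample of 400 are hypothetical:

```python
import random

# Hypothetical sampling frame: 10,000 numbered members of the target population
population = list(range(1, 10_001))

random.seed(42)  # fix the seed so the draw is reproducible
sample = random.sample(population, k=400)  # simple random sample, without replacement

print(len(sample), sample[:5])
```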
The process of sampling in primary data collection includes the following five steps;
When conducting primary research, the researcher gathers not only verbal responses but also nonverbal communication and gestures, which play a considerable role. These help the researcher identify hidden elements that cannot be detected when conducting secondary research.
One important aspect of primary research that researchers should look into is research ethics. Keeping participants’ information confidential is a research responsibility that should never be overlooked.
Secondary research or desk-based research is the second type of research you could base your research methodology in a dissertation on. This type of research reviews and analyses existing research studies to improve the overall authenticity of the research.
Secondary research methods include the use of secondary sources of information including journal articles, published reports, public libraries, books, data available on the internet, government publications, and results from primary research studies conducted by other researchers in the past.
Unlike primary research, secondary research is cost-effective and less time-consuming simply because it uses existing literature and doesn’t require the researcher to spend time and financial resources to collect first-hand data.
Not all researchers and/or business organisations are able to afford a significant amount of money towards research, and that’s one of the reasons this type of research is the most popular in universities and organisations.
Secondary research involves the following five steps;
If you aren't sure about the correct method of research for your dissertation paper, you should get help from an expert who can guide you on whether to use primary or secondary research for your dissertation paper.
| Primary Research | Secondary Research |
|---|---|
| Research is conducted first-hand to obtain data; the researcher "owns" the data collected. | Research is based on data collected from previous research studies. |
| Primary research is based on raw data. | Secondary research is based on tried and tested data that has previously been analysed and filtered. |
| The data collected fits the researcher's needs; it is customised and gathered to meet the specific requirements of the organisation or business. | The data may or may not match the researcher's requirements. |
| The researcher is deeply involved in the research and in collecting the data. | Secondary research is fast and easy, and it aims at gaining a broader understanding of the subject matter. |
| Primary research is an expensive process and consumes a lot of time to collect and analyse data. | Secondary research is a quick process as the data is already available; the researcher should know where to look for the most appropriate data. |
When choosing between primary and secondary research, you should always take into consideration the advantages and disadvantages of both types of research so you make an informed decision.
The best way to select the correct research strategy for your dissertation is to look into your research topic, research questions , aim and objectives – and of course the available time and financial resources.
Discussion pertaining to the two research techniques clearly indicates that primary research should be chosen when a specific topic, case, or organisation is to be investigated and the researcher has access to some financial resources, whereas secondary research should be considered when the research is general in nature and can be addressed by analysing past research and published data.
Not sure which research strategy you should apply? Get in touch with us right away . At ResearchProspect, we have Masters and PhD qualified writers in all academic subjects, so you can be confident of having your research completed to the highest academic standard and well recognised in the academic world.
What is the difference between primary vs. secondary research?
Primary research involves collecting firsthand data from sources like surveys or interviews. Secondary research involves analyzing existing data, such as articles or reports. Primary is original data gathering, while secondary relies on existing information.
Ethnography is a type of research where a researcher observes people in their natural environment. Here is all you need to know about ethnography.
Experimental research refers to the experiments conducted in the laboratory or under observation in controlled conditions. Here is all you need to know about experimental research.
A case study is a detailed analysis of a situation concerning organizations, industries, and markets. The case study generally aims at identifying the weak areas.
Jan M. Sargeant
1 Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada
2 Centre for Evidence-Based Veterinary Medicine, School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Loughborough, United Kingdom
3 Department of Large Animal Clinical Sciences, College of Veterinary Medicine, Michigan State University, East Lansing, MI, United States
Clinical decisions in human and veterinary medicine should be based on the best available evidence. The results of primary research are an important component of that evidence base. Regardless of whether assessing studies for clinical case management, developing clinical practice guidelines, or performing systematic reviews, evidence from primary research should be evaluated for internal validity i.e., whether the results are free from bias (reflect the truth). Three broad approaches to evaluating internal validity are available: evaluating the potential for bias in a body of literature based on the study designs employed (levels of evidence), evaluating whether key study design features associated with the potential for bias were employed (quality assessment), and applying a judgement as to whether design elements of a study were likely to result in biased results given the specific context of the study (risk of bias assessment). The level of evidence framework for assessing internal validity assumes that internal validity can be determined based on the study design alone, and thus makes the strongest assumptions. Risk of bias assessments involve an evaluation of the potential for bias in the context of a specific study, and thus involve the least assumptions about internal validity. Quality assessment sits somewhere between the assumptions of these two. Because risk of bias assessment involves the least assumptions, this approach should be used to assess internal validity where possible. However, risk of bias instruments are not available for all study designs, some clinical questions may be addressed using multiple study designs, and some instruments that include an evaluation of internal validity also include additional components (e.g., evaluation of comprehensiveness of reporting, assessments of feasibility or an evaluation of external validity). Therefore, it may be necessary to embed questions related to risk of bias within existing quality assessment instruments. In this article, we overview the approaches to evaluating internal validity, highlight the current complexities, and propose ideas for approaching assessments of internal validity.
Every day in clinical practice, veterinary professionals need to make decisions ranging from a decision as to whether (or not) to use an intervention or to apply a diagnostic test, to decisions about the overall management of complex clinical conditions. Increasingly, it is expected that clinical decisions are evidence-based. Evidence-based veterinary medicine incorporates clinician experience, client preferences, animal needs, and scientific evidence when making clinical decisions ( 1 ). In this approach, scientific evidence is obtained from relevant research. When research-based evidence does not exist, other sources of evidence, such as expert opinion may need to be used. Traditional narrative reviews provide an overview of a topic, and thus may be an attractive way of quickly acquiring knowledge for making clinical decisions. However, narrative reviews generally do not provide information on the identification and selection of the primary research being summarized (if any), the methodological quality of the studies, or the magnitude of the expected effect ( 2 , 3 ).
Formal methods have been developed to systematically identify, select, and synthesize the available evidence to assist veterinary professionals in evidence-based decision-making. These include critically appraised topics (CATs) ( 4 ), systematic review and meta-analysis (SR-MA) ( 5 – 7 ), and clinical practice guidelines ( 8 ) (see Box 1 for a short overview of these methods). These evidence synthesis approaches have different purposes which results in different processes and endpoints, but each includes an assessment of the internal validity of the research used. Critical appraisal of an individual study also includes an evaluation of internal validity, in addition to an evaluation of feasibility and generalizability ( 10 ). The evaluation of internal validity is the focus of this article. Understanding the different ways internal validity can be assessed, and the assumptions associated with these approaches, is necessary for researchers evaluating internal validity, and for veterinary professionals to assess studies for integration of evidence into practice.
Systematic review, meta-analysis, and network meta-analysis: Systematic review is a structured methodology for identifying, selecting and evaluating all relevant research to address a structured question, which may relate to descriptive characteristics such as prevalence, etiology, efficacy of interventions, or diagnostic test accuracy ( 5 ). Meta-analysis is the statistical combination of results from multiple studies. For addressing questions on intervention efficacy, meta-analysis provides an overall effect size for pairwise comparisons between two intervention groups. Network meta-analysis allows an estimation of the comparative efficacy across all available intervention options ( 6 ), which may provide more relevant information for veterinary professionals when there are multiple intervention options available. However, systematic reviews with pairwise meta-analysis or network meta-analysis require that a body of research exists that can be synthesized to address a clinical question and can also be resource and time intensive to conduct. Therefore, there are many clinical questions for which formally synthesized research summaries do not exist.
Critically appraised topics: Critically appraised topics (CATs) use the same principles as systematic reviews to address clinical questions but employ a more rapid approach, particularly in relation to the screening and summation of the evidence. They were designed to be employed by clinicians as a way of rapidly gathering and interpreting evidence on clinical questions relating to specific cases ( 4 ). Therefore, there is a greater risk that research addressing the question may be missed. However, in the absence of a well conducted systematic review or meta-analysis, CATs can provide a faster evaluation of research addressing a clinical question and can be undertaken by veterinary professionals who may have fewer resources and potentially less methodological or statistical expertise, particularly if they are freely available and accessible.
Clinical practice guidelines: Veterinary professionals often are involved in the management of complex clinical conditions, where an array of questions need to be addressed, including those related to etiology, prognosis, diagnostic test accuracy, and intervention efficacy. Clinical practice guidelines are intended to assist healthcare professionals in assessing more than one aspect of case approach, including appropriate prevention, diagnosis, treatment, or clinical management of diseases, disorders, and other health conditions ( 9 ). Although there are differences in the methods among authors and institutions, the key elements of guideline development include the establishment of a multidisciplinary working group to develop the guidelines, the involvement of appropriate stakeholders, identification of the topic area, systematic searches for research evidence, assessment of the internal validity of studies comprising the evidence base, a process for drafting recommendations, and ongoing review and updating of the guidelines as new evidence becomes available ( 8 ).
Internal validity refers to the extent to which the study results reflect the true state of nature (i.e., whether the effect size estimated in a study is free from systematic error, also called bias) ( 11 ). Although there are a large number of named biases ( 12 ), for studies that assess interventions or risk factors, the biases can be categorized into three broad types of bias: selection bias, information bias, and confounding ( 13 ). Selection bias impacts the effect size if, compared to the source population, the exposure or intervention groups differ in the distribution of factors associated with the outcome at the time the study population is selected, or if differential loss to follow up between groups occurs during the study. In case-control studies, selection bias occurs if cases or controls are selected based on criteria that are related to the exposure of interest. Information bias occurs when there are errors in measuring the exposure or intervention, or the outcome, or both. Finally, confounding is a mixing of effects that occurs when a variable (the confounder) that is independently associated with both the exposure and the outcome is not properly controlled. When confounding is not controlled, the estimate of the relationship between the exposure and the outcome will be biased.
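To see how confounding distorts an effect estimate, consider this minimal Python simulation; the population size, the age-related probabilities, and the assumption of no true exposure effect are all hypothetical:

```python
import random

random.seed(1)

# Simulate 10,000 animals. Age (the confounder) drives both the exposure and
# the outcome; the exposure itself has NO true effect on the outcome.
exp_dis = exp_no = unexp_dis = unexp_no = 0
for _ in range(10_000):
    old = random.random() < 0.5                          # confounder
    exposed = random.random() < (0.7 if old else 0.2)    # age -> exposure
    diseased = random.random() < (0.3 if old else 0.05)  # age -> outcome
    if exposed and diseased:
        exp_dis += 1
    elif exposed:
        exp_no += 1
    elif diseased:
        unexp_dis += 1
    else:
        unexp_no += 1

risk_exposed = exp_dis / (exp_dis + exp_no)
risk_unexposed = unexp_dis / (unexp_dis + unexp_no)
# The crude risk ratio comes out near 2, even though the true effect is null.
print(f"crude risk ratio = {risk_exposed / risk_unexposed:.2f}")
```

Stratifying by age (or adjusting for it in a model) would recover a risk ratio near 1.0 within each age group, which is exactly what "properly controlling" a confounder means.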
There are several terms used to describe the approaches to assessing internal validity of primary research studies, including evidence hierarchies and levels of evidence, quality assessment, and risk of bias assessment. The use of these terms may be confusing, and it is not uncommon for some of these terms to be used interchangeably ( 14 , 15 ). Also, authors may mislabel the approaches and some evaluation tools (instruments) available for assessing internal validity may include additional components, such as those related to comprehensiveness (quality) of reporting, feasibility of applying an intervention, or external validity. Finally, some instruments may use the approach as a label for the instrument [e.g., Cochrane's risk of bias tool ( 16 ), which is an instrument that employs a risk of bias approach] and other instruments may not include the approach in the instrument name [e.g., the Jadad scale ( 17 ), which employs a quality assessment approach]. In an evaluation of the comprehensiveness of reporting in animal health systematic reviews (SRs), Sargeant et al. ( 18 ) found that a range of instruments involving all three approaches had been used for assessing the internal validity of primary research studies. Although a large number of instruments are available, the approaches within each instrument used to assess internal validity can be grouped into three broad categories: based on study design, based on the presence or absence of design features, or based on a judgement about bias in the context of the study. These categories generally correspond to levels of evidence, quality assessment, and risk of bias, respectively. Therefore, our objective was to review these approaches to assessing internal validity as distinct entities and to describe the assumptions associated with each approach. Although we provide examples of specific instruments that include an evaluation of internal validity, our focus is on the approaches, rather than the tools. We discuss advances in the use of these approaches to assessing internal validity in human healthcare and propose a process for veterinary medicine for selecting the approach with the least assumptions as appropriate to the clinical question, the purpose of the assessment, and the research found that addresses the question of interest. The target audience for this article is individuals who assess internal validity of studies, individuals who develop instruments that include items related to the assessment of internal validity, and those who use evidence synthesis products created by others, such as systematic reviews or clinical practice guidelines.
Levels of evidence is an approach to evaluating the internal validity of a body of evidence based on the potential for bias inherent in the study designs used to address the clinical question. The concept behind levels of evidence is that there is a hierarchy of study designs, with different study designs having different potential for bias. The way evidence hierarchies are used is based on either the name of the design or the description of the design. Readers of a study look for this information, then determine the design and assign a level of evidence. No further differentiation of methodological features or judgment is conducted.
Evidence hierarchies were initially introduced in 1979 by the Canadian Task Force on the Periodic Health Examination ( 19 ), with further development into an evidence pyramid by David Sackett in 1989 ( 20 ). A pyramid shaped figure commonly is used to illustrate the hierarchy of study designs for evaluating the efficacy of an intervention under realistic-use conditions (owned animals, as opposed to experimental settings), with the potential for bias decreasing from the base to the top of the pyramid ( Figure 1 ). Thus, study designs on the top of the pyramid represent those with inherently lower risk of bias compared to study designs lower on the hierarchy. The pyramid shape acknowledges that the quantity of research tends to decrease in the higher levels of evidence (for instance, there will be a larger volume of randomized controlled trials (RCTs) compared to SR-MA). Suggested modifications to the evidence pyramid for veterinary intervention studies include dividing RCTs into those conducted under realistic-use conditions vs. those conducted in nonrealistic-use conditions (e.g., research facility) ( 21 ), the inclusion of challenge trials (where disease outcomes are deliberately induced) below RCTs in the pyramid ( 21 , 22 ), and increasing the interpretability of the concept for students by displaying the hierarchy as a staircase rather than a pyramid ( 23 ).
Illustration of an evidence pyramid hierarchy for addressing intervention studies in veterinary medicine. SR, systematic review; MA, meta-analysis; RCT, randomized controlled trial.
The concept of evaluating the potential for bias in an individual study based on the study design can be extended to an evaluation of the potential for bias in a body of literature. This approach for evaluating the internal validity of a body of literature is referred to as “levels of evidence”. The approach is applied by identifying research (or other evidence) that pertains to the clinical question, determining the study design used for each of the studies, and then assigning each study to a level of evidence based on that design. For instance, a framework for levels of evidence in veterinary clinical nutrition has been proposed by Roudebush et al. ( 24 ). In this framework, level 1 evidence corresponds to at least 1 appropriately designed RCT in the target species with natural disease development, level 2 evidence would correspond to RCTs in laboratory settings with natural disease development, level 3 evidence would be obtained from non-randomized trials, deliberate disease induction trials, analytical observational studies or case series, and level 4 evidence would correspond to expert opinion, descriptive studies, studies in other species, or pathophysiological justification. Therefore, if the clinical question involves interventions, and the evidence found to address the question consists of 2 RCTs, 3 case-control studies, and 3 case series, the evidence would be designated as “level 1 evidence” because study designs with the highest evidentiary level in the available research consisted of RCTs. If all available evidence was from expert opinion, the body of research would comprise “level 4” evidence. This evidence would represent the best available evidence to inform decision-making at the time the assessment was made, although the overall level assigned would change as higher evidentiary level information becomes available.
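The assignment logic just described is simple enough to sketch in a few lines of Python; the design labels below are simplified stand-ins for the Roudebush et al. framework's full definitions:

```python
# Simplified mapping from study design to evidence level (1 = highest),
# loosely following the Roudebush et al. framework described above
LEVELS = {
    "rct_realistic_use": 1,
    "rct_laboratory": 2,
    "non_randomized_trial": 3,
    "disease_induction_trial": 3,
    "observational_study": 3,
    "case_series": 3,
    "expert_opinion": 4,
    "descriptive_study": 4,
}

def overall_level(designs):
    """A body of evidence takes the best (lowest-numbered) level present."""
    return min(LEVELS[design] for design in designs)

# The example from the text: 2 RCTs, 3 case-control studies, 3 case series
studies = (["rct_realistic_use"] * 2
           + ["observational_study"] * 3
           + ["case_series"] * 3)
print(overall_level(studies))  # -> 1, because RCTs are present
```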
The levels of evidence approach may be perceived as a quick and easy approach to assessing internal validity because it requires only a knowledge of the study design employed and not the individual features of a study that may or may not be associated with the potential for bias. However, that ease of use is based on very strong assumptions: 1) that study design maps directly to bias, 2) that authors always correctly label study designs, and 3) that authors execute and report study designs appropriately. The approach also pertains to a body of evidence, implying that there are multiple comparable studies available to address the question of interest.
An important critique of levels of evidence is that the approach focuses on the study design, rather than the actual design features that were used or the context of the study. Thus, although this framework illustrates the inherent potential for bias of the different study designs, it does not provide a consideration of the methodological rigor with which any specific individual study was conducted ( 25 ). For instance, although a well-conducted cohort study may be less biased than a poorly executed RCT, this nuance is not captured by a levels of evidence approach. Additionally, levels of evidence are based on the potential for confounding and selection biases, but there is no mechanism to evaluate the potential for information bias because this is linked to the outcome and the levels of evidence approach is based on features at the study, rather than outcome, level. For instance, RCTs provide a higher level of evidence compared to observational studies because random allocation to intervention groups minimizes the potential for confounding, and case-control studies provide a lower level of evidence than cohort studies because they are more prone to selection bias. However, a RCT that used a subjectively measured outcome would be assigned a higher level of evidence than a cohort study with an objective outcome, although the observational study may have a lower risk of information bias. Finally, studies may be mislabeled in terms of their study design; there is empirical evidence that this occurs in the veterinary literature ( 26 – 28 ). For example, studies labeled as case series in veterinary medicine frequently include a component corresponding to a cohort study design ( 27 ); these studies may be assigned an inappropriately low level of evidence if individuals classifying these studies rely on authors terminology rather than the complete design description to determine the design employed.
An additional consideration is that for questions related to aspects of clinical care other than selection of interventions, the framework and positioning of study designs included in Figure 1 may not be appropriate. Levels of evidence schema are available for other clinical questions, such as prognosis, diagnostic test accuracy, disease screening, and etiology ( 29 , 30 ).
As the name implies, quality assessment represents an evaluation of the quality of a primary research article. However, the term "quality" is difficult to specifically define in the context of evidence-based medicine, in that it does not appear to have been used consistently in the literature. The Merriam-Webster dictionary defines quality as "how good or bad something is" or "a high level of value or excellence" ( https://www.merriam-webster.com/dictionary/quality ). Quality generally is understood to be a multi-dimensional concept. While clear definitions are difficult to find in the research literature, the lay literature includes numerous treatises on the dimensions of quality. One example is the eight dimensions of quality delineated by David Garvin, which include performance, features, reliability, conformance, durability, serviceability, aesthetics, and perceived quality ( https://en.wikipedia.org/wiki/Eight_dimensions_of_quality ).
The findings from a review ( 31 ) identified that available instruments labeled as quality assessment tools varied in clarity and often involved more than just assessing internal validity. In addition to including an assessment of internal validity, quality assessment instruments also generally contain elements related to quality of reporting or an assessment of the inclusion of study features not directly related to bias, such as whether ethical approval was sought or whether the study participants were similar to those animals in the care of the individual doing the critique ( 14 , 31 – 33 ).
Quality assessment as an approach to evaluating internal validity involves an evaluation of the presence or absence of design features, i.e., a methodological checklist ( 14 , 15 ). For example, the Jadad scale ( 17 ) involves completing a checklist of whether the study was described as randomized, whether the study was described as double blind, and whether there was a description of withdrawals and dropouts, with points assigned for each category. Therefore, the Jadad scale uses a quality assessment approach to evaluating internal validity. In terms of assumptions, the quality assessment approach also makes strong assumptions, although these are less than those used in levels of evidence assessments. Instead of mapping bias to the study design, quality assessment maps bias to a design feature i.e., if a trial was randomized, it is assumed to be "good quality" and if the trial was not randomized the assumption is that it is "poor quality". The same process is followed for additional study aspects, such as blinding or losses to follow-up, and an overall assessment of quality is then based on how the study 'performs' against these questions.
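As an illustration, the three basic Jadad items described above amount to a simple tally; this Python sketch covers only those basic items, not the full scale's extra points for the appropriateness of the randomization and blinding methods:

```python
def jadad_basic_score(randomized: bool, double_blind: bool,
                      withdrawals_described: bool) -> int:
    """One point for each design feature the trial report mentions."""
    return int(randomized) + int(double_blind) + int(withdrawals_described)

# Hypothetical trial report: randomized, not blinded, withdrawals described
print(jadad_basic_score(True, False, True))  # -> 2 out of a possible 3
```

Note what the checklist cannot express: whether the absence of blinding actually matters for this particular study's outcomes. That contextual judgement is precisely what the risk of bias approach, discussed below, adds.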
Quality assessment also considers more than just confounding and selection bias as components of internal validity. The inclusion of blinding as a design feature of interest illustrates this. Blinding as a design feature is intended to reduce the potential for differential care as a source of confounding bias (blinding of caregivers) or may be intended to reduce the potential for information bias (blinding of outcome assessors). Conducting a quality assessment is more complicated and time-consuming than evaluating levels of evidence because the presence or absence of the specific design features needs to be identified and validated within the study report. However, the approach requires only that the person evaluating internal validity can identify whether (or not) a design feature was used. Therefore, this approach requires more technical expertise than the levels of evidence approach, but less than the risk of bias approach.
Risk of bias assessments have been developed specifically for evaluating the potential for elements of the design or conduct employed within a study to lead to a biased effect size ( 34 , 35 ). The components of risk of bias assessments are selected based on empirical evidence of their association with estimates of effect sizes ( 24 , 32 ). The way risk of bias assessments work is that individuals evaluating a study for internal validity answer a series of signaling questions about the presence or absence of design features followed by a judgment about the potential for the use of the design feature to lead to a biased estimate in the context of the specific study. A conclusion is then reached about potential for bias based on all evaluated design features in the context of the study. Thus, a risk of bias assessment makes fewer assumptions about the link between study design and design features compared to quality assessment. For instance, a quality assessment for an RCT would include an evaluation as to whether blinding of outcome assessors occurred, whereas a risk of bias assessment would involve an evaluation not only as to whether blinding was used, but also a judgement as to whether a lack of blinding of outcome assessors would be likely to lead to a biased estimate given the context of the study and the outcome measures used. Thus, a RCT that did not include blinding of outcome assessors might be rated as poor on a quality assessment but might not be a concern in a risk of bias assessment if the outcomes were measured objectively, precluding the likelihood that the estimate would be biased by a knowledge of the intervention group when classifying the outcomes. Because of the necessity of making a judgement about the potential that bias is associated with design features in the context of a specific topic area, this approach requires the highest level of knowledge of study design and bias. The risk of bias approach also generally is conducted at the outcome level, rather than at the study level. For instance, an unblinded RCT of interventions to treat lameness might be considered to have a high risk of bias if the outcome was assessed by owners (a subjective outcome) but not if the outcome was assessed by force plate measurement (an objective outcome). For a level of evidence assessment, the assessment of internal validity would be high quality because the trial was an RCT. For quality assessment, the study may be considered poor quality because it was unblinded, but the overall judgement would be dependent on a number of other study design flaws identified. Finally, in a risk of bias assessment, the study would likely be low risk of bias for the objective outcome and high risk of bias for the subjective outcome if blinding was not used.
Some components of a risk of bias assessment are the same as those included in a quality assessment approach (e.g., an assessment of randomization, allocation concealment, and blinding could be included in both). However, the way the assessment is done differs, with quality assessments generally involving present/absent judgements as opposed to assessments as to whether the risk of bias is likely or not. Hartling et al. ( 14 ) applied two instruments using a quality assessment approach and one instrument using a risk of bias approach to a sample of 163 trials and found that there was low correlation between quality assessment and risk of bias approaches when comparing the assessment of internal validity.
Although the critical elements for risk of bias are well described for RCTs in human healthcare and to a large extent in veterinary RCTs, these elements are not as well described for non-randomized trials and observational studies where allocation to groups is not under the control of the investigator. There are some risk of bias tools available for assessing risk of bias in non-randomized studies, such as ROBINS-I ( 36 ). However, ROBINS-I has been criticized for being challenging to use and for having low reliability, particularly amongst less experienced raters ( 37 , 38 ). A review and critique of approaches to risk of bias assessment for observational studies is available ( 39 ). It is anticipated that risk of bias tools for observational study designs, including studies related to questions of prognosis and causation, will continue to evolve as new instruments are developed and validated.
Currently, the available approaches to assessing internal validity tend to be used for different applications. Levels of evidence have previously been used for creating evidence-based recommendations or clinical practices guidelines ( 30 , 40 , 41 ), where it is anticipated that multiple study designs may have been used to address the clinical question(s) of interest. Both quality assessment and risk of bias assessment approaches have been used as a component of systematic reviews with meta-analysis or network meta-analysis, as the intended product of these reviews is to summarize a single parameter (such as incidence or prevalence) or a summary effect size (such as a risk ratio, odds ratio, or hazard ratio) where it is desired that the estimate is unbiased. Often, that estimate is derived from studies with the same study design or a narrow range of study designs from high levels in the evidence hierarchy for the research question type. Therefore, the focus is on a specific parameter estimate based on multiple studies, rather than a descriptive summary of the evidentiary strength of those studies.
However, the different approaches are not necessarily mutually exclusive, but are nested within each other based on assumptions, and the methodology and use of the different approaches has evolved over time. As previously described, a criticism of the use of levels of evidence is that the potential for bias is based on the study design that was employed, rather than the methodological rigor of a specific study ( 42 ). For this reason, many frameworks for levels of evidence included wording such as "appropriately designed" ( 24 ) or "well designed" ( 41 ) for the study designs, although the criteria for determining whether a study was designed and executed with rigor generally is not described. A lack of transparency for the criteria for evaluating internal validity of studies within an evidence level is problematic for individuals wishing to use the results. An example of the evolution toward more transparent considerations of internal validity of individual studies within a levels of evidence framework is seen in the progression of the Australian National Health and Medical Research Council (NHMRC) system for evaluating evidence in the development of clinical practice guidelines. The designation of levels of evidence in this framework originally was based on levels of evidence, with descriptors such as "properly-designed" or "well-designed" included for each type of study design ( 40 ). A concern with this approach was that the framework was not designed to address the strength of evidence from individual studies within each evidence level ( 43 ). Therefore, the framework was modified to include the use of risk of bias evaluations of individual studies within each evidence level. The combined use of levels of evidence and risk of bias assessment of studies within each level of evidence now forms the "evidence base" component of the NHMRC's FORM framework for the development of evidence-based clinical guidelines ( 44 ).
Another example of the evolution of approaches to assessing internal validity is from the Cochrane Back review group, who conduct systematic reviews of neck and back pain. The initial methods guidelines, published in 1997, recommended that a quality assessment be performed on each included study, with each item in the quality assessment tool scored based on whether the authors reported their use ( 45 ). Updated methods guidelines were published in 2003 ( 46 ). The framework for levels of evidence in this guidance was restricted to a consideration of randomized controlled trials and non-randomized controlled clinical trials, as these were considered the study designs that potentially were appropriate to address research questions in this content area. In the updated guidelines, the recommendations for the assessment of internal validity moved to a risk of bias approach, where judgements were made on whether the characteristics of each study were likely to lead to biased study results. In the 2003 methods guidelines, levels of evidence were recommended as an approach to qualitative analysis rather than the use of “vote counting” (summing the number of studies where a positive or negative outcome was reported). The guidelines were again updated in 2009 ( 47 ). In this version, the assessment of the internal validity of individual studies explicitly employed a risk of bias approach. It was further recommended that the use of evidence levels as a component of a qualitative synthesis be replaced with a formal rating of the quality of the evidence for each of the included outcomes. It was recommended that review authors use the GRADE approach for this component. The GRADE approach explicitly includes a consideration of the risk of bias across all studies included in the review, as well as an assessment of the consistency of results across studies, the directness of the evidence to the review question, the precision in the effect size estimate, and the potential for publication bias ( 48 ).
The examples from the human medical literature illustrate that assessment of internal validity need not be static, and that modifications to our approach to assessing internal validity can strengthen the evidence base for clinical decision making. When developing or using tools which include an evaluation of internal validity, the assessment of internal validity should use the approach with the least assumptions about bias. This implies that the risk of bias approach, where context specific judgements are made related to the potential for bias, is the preferred approach for assessing internal validity. The risk of bias approach is well developed for RCTs. Therefore, when RCTs are included in the evidence available to address a clinical question, a risk of bias assessment approach should be used. When evaluating internal validity as a component of a SR-MA, the Cochrane ROB2.0 tool ( 16 ) could be used for this purpose. Modifications to this tool have been proposed for evaluating trials in livestock ( 49 – 51 ). For critical appraisal instruments for RCTs, where additional components such as feasibility and external validity are a desired component, the questions or items within the instrument that are specific to assessing internal validity still could follow a risk of bias approach by specifically requiring a judgement on the potential for bias. Similarly, the use of questions or items requiring a judgement on the potential for bias also could be used for evaluation of RCTs included in clinical practice guidelines when RCTs are present in the evidence base.
However, there are circumstances where these recommendations may not be appropriate or sufficient, such as for observational studies where risk of bias assessment instruments do not formally exist, or where a variety of study designs have been identified that answer the clinical question (particularly non-intervention type questions). When observational studies are used as evidence, individuals assessing internal validity may wish to evaluate risks of bias for each study ad hoc by considering the specific risks of bias related to selection bias, information bias, and confounding in the context of the topic area. However, this approach requires considerable methodological expertise. Alternatively, a quality assessment approach could be used to evaluate internal validity for observational studies, recognizing that more assumptions related to the potential for bias are involved. As instruments for evaluating the risk of bias for observational studies are developed and validated, these could replace ad hoc or quality assessment approaches.
For situations where the evidence base includes multiple study types, such as clinical practice guidelines, the use of levels of evidence may be useful for framing the potential for bias inherent in the studies identified to address the clinical questions. However, within each evidence level, there still is a need to evaluate the internal validity of each study. The proposed approach for situations where RCTs and observational studies are included in the evidence base was described in the preceding paragraphs. For lower levels of evidence, such as case series, textbooks and narrative reviews, and expert opinion, levels of evidence could be used to emphasize that these types of evidence have high potential for bias based on their design.
It should be noted that although this article has focused on approaches to evaluating the internal validity of studies, this is only one component of the assessment of evidence. Critical appraisal, CATs, SR-MA, and clinical practice guidelines explicitly incorporate other aspects of decision-making, including a consideration of the magnitude and precision of an intervention effect or the potential clinical impact, the consistency of the research results across studies, the applicability (external validity and feasibility) of the research results, and the directness of the evidence to a clinical situation (for instance, whether the study populations are similar to those in a practice setting). However, a discussion of these components is beyond the scope of the current study. The interested reader is referred to further details on the components used in evaluating evidence for CATs (4), for SR-MA using the GRADE approach (52), for network meta-analysis (53), and for clinical practice guidelines (8, 44).
JS drafted the manuscript. All authors contributed equally to the conceptualization of this work. All authors read and approved the final contents.
Partial funding support was obtained from the University of Guelph Research Leadership Chair (Sargeant).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
A case control study is a type of observational research commonly used in the field of epidemiology. It is designed to help researchers identify factors that may contribute to a particular outcome, such as a disease or condition, by comparing subjects who have that outcome (cases) with those who do not (controls). The analysis approach is usually quantitative, but the design is worth understanding in its own right because it is particularly useful for studying rare diseases or outcomes and can provide valuable insights into potential risk factors.
In this article, we will define what a case control study is, discuss when it is most appropriately used, and provide examples, along with the advantages and disadvantages of this research approach.
A case control study is a type of observational study used to compare two groups of individuals who are largely similar, except that one group has a specific condition or outcome while the second group, called the controls, does not. The primary goal of this study design is to compare factors between the two groups to identify what may be contributing to the outcome or condition being studied.
Case control studies are usually retrospective, meaning they look backward and can use existing data to examine multiple risk factors that might explain why certain individuals developed the condition. In contrast, cohort studies are usually prospective, following individuals over a long period of time and analyzing an outcome, such as the development of a disease.
In a case control study, researchers first identify the cases, the individuals who have the condition of interest. They then construct a second, closely comparable group of controls, who share many characteristics with the case group but do not have the condition. Researchers collect data on past exposures, behaviors, and other relevant variables from both the cases and the controls.
By comparing the frequency and patterns of these exposures between the case group and an appropriate control group, researchers can identify potential risk factors associated with the condition. The quantitative measure commonly used to express the strength of association between an exposure and an outcome in case control studies is the odds ratio. Odds ratios can, in turn, inform public health interventions and guide future research.
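As a minimal sketch of how an odds ratio is computed, consider the hypothetical 2x2 table below; all counts are invented for illustration and do not come from any real study.

```python
# Hypothetical case control data: exposure counts among cases and controls
a, b = 40, 60  # cases: exposed (a), unexposed (b)
c, d = 20, 80  # controls: exposed (c), unexposed (d)

# Odds ratio = odds of exposure among cases / odds of exposure among controls
odds_ratio = (a / b) / (c / d)  # (40/60) / (20/80) = 2.67
print(f"Odds ratio: {odds_ratio:.2f}")
# An odds ratio above 1 suggests the exposure is associated with the outcome;
# confidence intervals and confounding still need to be addressed separately.
```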
This type of study is particularly valuable when studying rare diseases or conditions, as it allows researchers to gather data more quickly and efficiently than would be possible with a prospective cohort study. Additionally, case control studies are often less expensive and require fewer resources, making them a practical choice for many research questions.
However, it is important to note that case control studies can be prone to certain biases, such as recall bias and selection bias. Recall bias occurs when participants do not accurately remember past exposures, while selection bias can arise if cases and controls are not properly matched. Despite these limitations, case control studies remain a crucial method in health and epidemiological research, offering insights into the potential causes and risk factors of various health outcomes.
A case control study is particularly useful in several research scenarios, especially when the goal is to look at factors associated with rare diseases or conditions. This type of study is an efficient way to identify and evaluate risk factors associated with specific outcomes. Researchers often use case control studies when the condition under investigation has a low incidence rate, making it impractical to follow a large cohort over time to observe the development of the condition. By focusing on individuals who already have the condition and comparing them to those who do not, researchers can gain insights more quickly and with fewer resources.
This study design is also advantageous when time and funding are limited. Prospective studies can be time-consuming and costly, requiring long-term follow-up and extensive data collection. In contrast, case control studies are retrospective and can be conducted relatively quickly, as they rely on existing records and participant recall of past exposures. This makes them a cost-effective choice for preliminary investigations, allowing researchers to identify potential associations before committing to more extensive and expensive studies.
Case control studies are also appropriate when exploring multiple potential risk factors simultaneously. Since researchers collect detailed exposure information from both cases and controls, they can examine a wide range of variables and their potential associations with the condition. This flexibility is particularly useful in the early stages of research when the exact causes of a condition are not well understood.
Case control studies have been instrumental in uncovering and evaluating factors associated with diseases and understanding potential underlying causes of various health conditions. These observational studies compare individuals with the outcome of interest to a comparison group of controls without the outcome, providing valuable insights into potential risk factors. Below are two examples that illustrate how case control studies can be used in different contexts.
A classic example of a case control study examines historical exposures associated with lung cancer, such as smoking. Researchers select individuals diagnosed with lung cancer as the cases and a control group of individuals without lung cancer, matched by age, sex, and other relevant variables. Both groups are questioned about their smoking habits, including the duration and intensity of smoking.
Such a study may find a significantly higher prevalence of smoking among the cases than among the controls, suggesting a strong association between smoking and lung cancer. Findings of this kind can be crucial in establishing smoking as a major risk factor for lung cancer, leading to public health initiatives aimed at reducing smoking rates and improving health outcomes.
Another important case control study might explore the risk factors for myocardial infarction (heart attack). Researchers select patients who had experienced a myocardial infarction as the cases and match them with a control group of individuals without a history of heart attacks but with similar health status and demographic characteristics. Data is collected on various exposures, such as diet, physical activity, family history of heart disease, and other historical factors to identify potential causes.
The analysis in this example reveals that factors like high cholesterol levels, hypertension, and lack of physical activity are more common among the cases than the controls. Such findings highlight the importance of managing cholesterol and blood pressure and of maintaining an active lifestyle to reduce the risk of myocardial infarction.
Case control studies offer several advantages that make them a valuable research method in epidemiology and public health. They are particularly useful when investigating rare diseases, working with limited resources, or exploring multiple risk factors. Below are three key advantages of case control studies.
One of the primary advantages of case control studies is their efficiency in studying rare diseases. Since these studies start with individuals who already have the outcome of interest, researchers can gather sufficient data without needing to follow a large cohort over time. This is particularly beneficial when the condition is uncommon, as it allows researchers to focus their efforts on a smaller, more manageable sample size. By comparing these cases to a control group, researchers can quickly identify potential risk factors associated with the disease, accelerating the discovery of novel findings that might be difficult to obtain through other study designs such as prospective and retrospective cohort studies, which are designed around already established exposures or risk factors.
Case control studies are generally more cost-effective and time-efficient compared to other epidemiological study designs, such as cohort studies. Because they are retrospective, case control studies utilize existing records and participant recall, reducing the need for long-term follow-up and extensive data collection. This makes them a practical choice for researchers with limited budgets and time constraints. The ability to conduct these studies relatively quickly allows for faster generation of insights and can inform the design of future, more comprehensive studies if necessary.
Another significant advantage of case control studies is their ability to examine multiple risk factors simultaneously. When collecting data from both cases and controls, researchers can gather information on a wide range of exposures, behaviors, and other variables. This comprehensive data collection enables the analysis of various potential risk factors and their associations with the outcome of interest. This flexibility is particularly useful in the early stages of research when the exact causes of a condition are not well understood. By identifying several possible risk factors, case control studies can provide a broader understanding of the disease and guide further investigation.
While case control studies offer several advantages, they also come with notable disadvantages that researchers must consider. Below are two major disadvantages of case control studies.
One significant drawback of case control studies is their susceptibility to recall bias. Since these studies are retrospective, they rely on participants' memory and self-reported data regarding past exposures and behaviors. Cases and controls may recall information differently, especially if the condition being studied is severe or has had a significant impact on the individual's life. Such differential recall can bias an analysis, compounding the difficulties already posed by confounding variables and other sources of error.
For example, individuals with a disease might be more likely to remember and report certain exposures they believe contributed to their condition, while controls may not recall these details as accurately. This discrepancy can lead to biased results, as the data collected may not accurately reflect actual past exposures. One way to minimize effects from recall bias is to collect data from multiple sources to triangulate findings.
Another major disadvantage of case control studies is the potential for selection bias. Properly selecting and matching cases and controls is critical to ensure that the two groups are comparable in all relevant aspects except for the outcome of interest. If cases and controls are not appropriately matched, the contrasts observed between the groups may be due to systematic differences in who was selected rather than true associations between exposures and the outcome.
For instance, if the controls are not representative of the population that gave rise to the cases, the findings may not be generalizable. Additionally, the methods used to identify and recruit participants can also introduce bias, further complicating the interpretation of results. Selection bias can be mitigated by transparently describing the methods and assessing how representative the control group is of the population from which the cases emerged.
Addressing the demand side, particularly the concerns of cost-conscious and less technology-savvy small- and medium-sized enterprises (SMEs), is critical for the future uptake of zero-emission trucks (ZETs).
To address this question, this study took Guangdong as an example and assessed the techno-economic feasibility of ZETs from 2022 to 2030 across 14 use cases, considering operational feasibility, purchase cost gaps between ZETs and diesel trucks, and total cost of ownership (TCO) parity years relative to diesel trucks. We identified the use cases with near-term ZET transition opportunities and explored the roles played by technological development, policy incentives, operational improvements, and business models in bringing forward ZETs' TCO (and purchase cost) parity years relative to diesel trucks. We also evaluated the applicability of this study's conclusions to other Chinese regions.
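To make the TCO parity-year concept concrete, here is a minimal sketch with entirely invented figures; this is not the report's cost model, and the prices and operating costs below are placeholders, not data from the study.

```python
# Minimal sketch of a total-cost-of-ownership (TCO) parity-year calculation.
# All cost figures are hypothetical placeholders (arbitrary currency units).
def cumulative_tco(purchase_price: float, annual_operating_cost: float, years: int) -> float:
    """Cumulative cost of buying and operating a truck for `years` years."""
    return purchase_price + annual_operating_cost * years

zet_price, zet_opex = 1_000_000, 150_000      # ZET: higher purchase cost, cheaper to run
diesel_price, diesel_opex = 400_000, 260_000  # diesel: cheaper to buy, costlier to run

# The parity year is the first year the ZET's cumulative TCO falls to or below diesel's.
for year in range(1, 16):
    if cumulative_tco(zet_price, zet_opex, year) <= cumulative_tco(diesel_price, diesel_opex, year):
        print(f"TCO parity reached in year {year}")  # year 6 with these placeholder numbers
        break
else:
    print("No TCO parity within 15 years")
```

Cheaper technology or purchase incentives shrink the up-front gap, which moves the parity year earlier; this is the sense in which the study speaks of advancing ZETs' TCO parity years.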
To identify the research trends in studies related to STEM Clubs, 56 publications that met the inclusion and extraction criteria were identified from the online databases ERIC and Web of Science (WoS). These studies were analysed using the descriptive content analysis research method based on the Paper Classification Form (PCF), which covers publishing years, keywords, research methods, sample levels and sizes, data collection tools, data analysis methods, durations, purposes, and findings. The findings showed that the keywords in the studies fell under six categories: disciplines, technological concepts, academic community, learning experiences, core elements of education, and psychosocial factors (variables). Case studies were frequently employed, with middle school students serving as the main participants, in sample groups most often of 11-15, 16-20, and 201-250 participants. Surveys, questionnaires, and observations were the primary methods of data collection, and descriptive analysis was commonly used for data analysis. STEM Club programs ran from 2 to 16 weeks, with each session commonly lasting 60 to 120 min. The study purposes mainly focused on five themes: the impact of participation on aspects such as attitudes towards STEM disciplines, career paths, STEM major selection, and academic achievement; the development and implementation of sample STEM Club programs, including their challenges and limitations; the examination of students' experiences, perceptions, and the factors influencing their involvement and choice of STEM majors; the identification of aspects such as attitudinal effects and non-academic skills; and the comparison of STEM experiences between in-school and out-of-school settings. The study results mainly focused on three themes: the increase in aspects such as academic achievement, STEM major choice, engagement in STEM clubs, identity, interest in STEM, and collaboration-communication skills; the design of STEM Clubs, including sample implementations, design principles, challenges, and the factors affecting their success and sustainability; and the identification of factors influencing participation and motivation, and of barriers. Overall, this study provides a comprehensive understanding of STEM Clubs, paving the way for more targeted and informed future research endeavours.
Worldwide, STEM education, which integrates the disciplines of science, technology, engineering, and math, is gaining popularity in K-12 settings due to its capacity to enhance 21st-century skills such as adaptability, problem-solving, and creative thinking (National Research Council [NRC], 2015 ). In STEM lessons, students are frequently guided by the engineering design process, which involves identifying problems or technical challenges and creating and developing solutions. Furthermore, higher achievement in STEM education has been linked to increased enrolment in post-secondary STEM fields, offering students greater opportunities to pursue careers in these domains (Merrill & Daugherty, 2010 ). However, STEM activities require dedicated time and the restructuring of integrated curricula, necessitating careful organization of lessons. Recognizing the complexity of developing 21st-century STEM proficiency, schools are not expected to tackle this challenge alone. In addition to regular STEM classes, there exists a diverse range of extended education programs, activities, and out-of-school learning environments (Baran et al., 2016 ; Kalkan & Eroglu, 2017 ; Schweingruber et al., 2014 ). In this paper, out-of-school learning environments, informal learning environments, extended education, and afterschool programs were used synonymously. It is worth noting that the literature lacks a universally accepted definition for out-of-school learning environments, leading to the use of various interchangeable terms (Donnelly et al., 2019 ). Some of these terms include informal learning environments, extended education, afterschool programs, all-day school, extracurricular activities, out-of-school time learning, extended schools, expanded learning, and leisure-time activities. These terms refer to optional programs and clubs offered by schools that exist outside of the standard academic curriculum (Baran et al., 2016 ; Cooper, 2011 ; Kalkan & Eroglu, 2017 ; Schweingruber et al., 2014 ).
Out-of-school learning, in contrast to traditional in-school learning, offers greater flexibility in terms of time and space, as it is not bound by the constraints of the school schedule, national or state standards, and standardized tests (Cooper, 2011 ). Out-of-school learning experiences typically involve collaborative engagement, the use of tools, and immersion in authentic environments, while school environments often emphasize individual performance, independent thinking, symbolic representations, and the acquisition of generalized skills and knowledge (Resnick, 1987 ). They encompass everyday activities such as family discussions, pursuing hobbies, and engaging in daily conversations, as well as designed environments like museums, science centres, and afterschool programs (Civil, 2007 ; Hein, 2009 ). On the other hand, extended education refers to intentionally structured learning and development programs and activities that are not part of regular classes. These programs are typically offered before and after school, as well as at locations outside the school (Bae, 2018 ). As a result, out-of-school learning environments encompass a wide range of experiences, including social, cultural, and technical excursions around the school, field studies at museums, zoos, nature centres, aquariums, and planetariums, project-based learning, sports activities, nature training, and club activities (Civil, 2007 ; Donnelly et al., 2019 ; Hein, 2009 ). At this point, STEM clubs are a specialized type of extracurricular activity that engage students in hands-on projects, experiments, and learning experiences related to scientific, technological, engineering, and mathematical disciplines. STEM Clubs, described as flexible learning environments unconstrained by time or location, offer an effective approach to conducting STEM studies outside of school (Blanchard et al., 2017 ; Cooper, 2011 ; Dabney et al., 2012 ).
Out-of-school learning environments, extended education or afterschool programs, hold tremendous potential for enhancing student learning and providing them with a diverse and enriching educational experience (Robelen, 2011 ). Extensive research supports the notion that these alternative educational programs not only contribute to students' academic growth but also foster their social, emotional, and intellectual development (NRC, 2015 ). Studies have consistently shown that after-school programs play a vital role in boosting students' achievement levels (Casing & Casing, 2024 ; Pastchal-Temple, 2012 ; Shernoff & Vandell, 2007 ), and contributing to positive emotional development, including improved self-esteem, positive attitudes, and enhanced social behaviour (Afterschool Alliance, 2015 ; Durlak & Weissberg, 2007 ; Lauer et al., 2006 ; Little et al., 2008 ). Moreover, engaging in various activities within these programs allows students to develop meaningful connections, expand their social networks, enhance leadership skills (Lipscomb et al., 2017 ), and cultivate cooperation, effective communication, and innovative problem-solving abilities (Mahoney et al., 2007 ).
Implementing STEM activities in out-of-school learning environments not only supports students in making career choices and fostering meaningful learning and interest in science, but also facilitates deep learning experiences (Bybee, 2001 ; Dabney et al., 2012 ; Sahin et al., 2018 ). Furthermore, STEM Clubs enhance students' emotional skills, such as a sense of belonging and peer-to-peer communication, while also fostering 21st-century skills, facilitating the acquisition of current content, and promoting career awareness and interest in STEM professions (Blanchard et al., 2017 ). In summary, engaging in STEM activities through social club activities not only addresses time constraints but also complements formal education and contributes to students' overall development. Hence, STEM Clubs, which are part of extended education, can be defined as dynamic and flexible learning environments that provide an effective approach to conducting STEM studies beyond traditional classroom settings. These clubs offer flexibility in terms of time and location, with intentionally structured programs and activities that take place outside of regular classes. They provide students with unique opportunities to explore and deepen their understanding of STEM subjects through collaborative engagement, hands-on use of tools, and immersive experiences in authentic environments (Bae, 2018 ; Blanchard, et al., 2017 ; Bybee, 2001 ; Cooper, 2011 ; Dabney et al., 2012 ). STEM Clubs have gained immense popularity worldwide, providing students with invaluable opportunities to explore and cultivate their interests and knowledge in these crucial fields (Adams et al., 2014 ; Bell et al., 2009 ). According to America After 3PM, nearly 75% of afterschool program participants, around 5,740,836 children, have access to STEM learning opportunities (Afterschool Alliance, 2015 ).
STEM Clubs as after-school programs come in various forms and provide diverse tutoring and instructional opportunities. For instance, the Boys and Girls Club of America (BGCA) operates in numerous cities across the United States, annually serving 4.73 million students (Boys and Girls Club of America, 2019 ). This program offers students the chance to engage in activities like sports, art, dance, field trips, and addresses the underrepresentation of African Americans in STEM. Another example is the Science Club for Girls (SCFG), established by concerned parents in Cambridge to address gender inequity in math, science, and technology courses and careers. SCFG brings together girls from grades K–7 through free after-school or weekend clubs, science explorations during vacations, and community science fairs, with approximately 800 to 1,000 students participating each year. The primary goal of these clubs is to increase STEM literacy and self-confidence among K–12 girls from underrepresented groups in these fields. More examples can be found in the literature, such as the St. Jude STEM Club (SJSC), where students conducted a 10-week paediatric cancer research project using accurate data (Ayers et al., 2020 ), and After School Matters, based in Chicago, offers project-based learning that enhances students' soft skills and culminates in producing a final project based on their activities (Hirsch, 2011 ).
The literature on STEM Clubs indicates a diverse range of such clubs located worldwide, catering to different student groups, operating on varying schedules, implementing diverse activities, and employing various strategies, methodologies, experiments, and assessments (Ayers et al., 2020 ; Blanchard et al., 2017 ; Boys and Girls Club of America, 2019 ; Hirsch, 2011 ; Sahin et al., 2018 ). However, it was previously unknown which specific sample groups were most commonly studied, which analytical methods were used frequently, and which results were primarily reported, even though the overall topic of STEM Clubs has gained significant attention. Therefore, organizing and categorizing this expansive body of literature is necessary to gain deeper insights into the current state of knowledge and practices in STEM Clubs. By systematically reviewing and synthesizing the diverse range of studies on this topic, we can develop a clearer understanding of the focus areas, methodologies, and key findings that have emerged from the existing research (Fraenkel et al., 2012 ). At this point, using a content analysis method is appropriate for this purpose because this method is particularly useful for examining trends and patterns in documents (Stemler, 2000 ). Similarly, some previous research on STEM education has conducted content analyses to examine existing studies and construct holistic patterns to understand trends (Bozkurt et al., 2019 ; Chomphuphra et al., 2019 ; Irwanto et al., 2022 ; Li et al., 2020 ; Lin et al., 2019 ; Martín-Páez et al., 2019 ; Noris et al., 2023 ). However, there is a lack of content analysis specifically focused on studies of STEM Clubs in the literature and showing the trends in this topic. Analysing research trends in STEM Clubs can help build upon existing knowledge, identify gaps, explore emerging topics, and highlight successful methodologies and strategies (Fraenkel et al., 2012 ; Noris et al., 2023 ; Stemler, 2000 ). This information can be valuable for researchers, educators, and policymakers to stay up-to-date and make informed decisions regarding curriculum design (Bozkurt et al., 2019 ; Chomphuphra et al., 2019 ; Irwanto et al., 2022 ; Li et al., 2020 ; Lin et al., 2019 ; Martín-Páez et al., 2019 ; Noris et al., 2023 ), the development of effective STEM Club programs, resource allocation, and policy formulation (Blanchard et al., 2017 ; Cooper, 2011 ; Dabney et al., 2012 ). Therefore, the identification of research trends in STEM Clubs was the aim of this study.
To identify research trends, studies commonly analysed documents by considering the dimensions of articles such as keywords, publishing years, research designs, purposes, sample levels, sample sizes, data collection tools, data analysis methods, and findings (Bozkurt et al., 2019 ; Chomphuphra et al., 2019 ; Irwanto et al., 2022 ; Li et al., 2020 ; Sozbilir et al., 2012 ). Using these dimensions as a framework is a useful and common approach in content analysis because this framework allows researchers to systematically examine the key aspects of existing studies and uncover patterns, relationships, and trends within the research data (Sozbilir et al., 2012 ). Hence, since the aim of this study is to identify and analyse research trends in STEM Clubs, it focused on publishing years, keywords, research designs, purposes, sample levels, sample sizes, data collection tools, data analysis methods, and findings of the studies on STEM Clubs.
In conclusion, the main problem of this study is: "What are the characteristics of the studies on STEM Clubs?" The following sub-questions are addressed:
What is the distribution of studies on STEM Clubs by year?
What are the frequently used keywords in studies on STEM Clubs?
What are the commonly employed research designs in studies on STEM Clubs?
What are the typical purposes explored in studies on STEM Clubs?
What are the commonly observed sample levels in studies on STEM Clubs?
What are the commonly observed sample sizes in studies on STEM Clubs?
What are the commonly utilized data collection tools in studies on STEM Clubs?
What are the commonly utilized data analysis methods in studies on STEM Clubs?
What are the typical durations reported in studies on STEM Clubs?
What are the commonly reported findings in studies on STEM Clubs?
In this study, the descriptive content analysis research method was employed, which allows for a systematic and objective examination of the content within articles and a description of the general trends and research results in a particular subject matter (Lin et al., 2014; Suri & Clarke, 2009; Sozbilir et al., 2012; Stemler, 2000). Given the aim of examining research trends in STEM Clubs, this method was appropriate, as it provides a structured approach to identifying patterns and trends (Gay et al., 2012). To implement the content analysis method, this study followed the three main phases proposed by Elo and Kyngäs (2008): preparation, organizing, and reporting. In the preparation phase, the unit of analysis, such as a word or theme, is selected as the starting point; here, the topic of STEM Clubs was carefully selected. During the organizing process, the researcher strives to make sense of the data, to learn "what is going on," and to obtain a sense of the whole; in this study, the content analysis framework (sample levels, sample sizes, data collection tools, research designs, etc.) was used to interrogate the collected studies. Finally, in the reporting phase, the analyses are presented in a meaningful and coherent manner, here with visual representations such as tables and graphs. By adopting the content analysis research method and following these phases, this study aimed to gain insights into research trends in STEM Clubs, identify recurring themes, and provide a comprehensive analysis of the collected data.
The online databases ERIC and Web of Science were searched using keywords derived from a database thesaurus. These databases were chosen for their widespread recognition in the fields of education and academic research, and because they offer a substantial amount of high-quality, peer-reviewed literature. The search process involved several steps. Firstly, titles, abstracts, and keywords were searched using Boolean operators for the keywords "STEM Clubs," "STEAM Clubs," "science-technology-engineering-mathematics clubs," "after school STEM program" and "extracurricular STEM activities" (criterion-1). Secondly, the search was conducted from November to the end of December 2023, so studies published up to the end of December 2023 were included, without a specific starting-date restriction (criterion-2). Thirdly, the search was limited to scientific journal articles, book chapters, proceedings, and theses, excluding publications such as practices, letters to editors, corrections, and (guest) editorials (criterion-3). Fourthly, studies published in languages other than English were excluded, focusing exclusively on English-language publications (criterion-4). Fifthly, duplicate articles found in both databases were identified and removed. Next, the author read the contents of all the studies, including those without full-text articles, with a particular focus on the abstract sections. After that, studies on after-school programs or extracurricular activities that did not specifically involve STEM or clubs were excluded; although "extracurricular STEM activities" and "after school STEM program" were used in the search, some retrieved studies concerned after-school programs or extracurricular activities without a STEM component (criterion-5). Additionally, studies conducted in formal and informal settings within STEM clubs were included, while studies conducted in settings such as museums or trips were excluded (criterion-6), because STEM Clubs are only one subset of informal STEM education settings, which also include museums and field trips, and the focus of this study is on trends specifically related to STEM Clubs. Moreover, studies focusing solely on technology without incorporating other STEM components were excluded (criterion-7). Finally, 56 publications that met the inclusion and extraction criteria were identified: two dissertations, seven proceedings, and 47 articles from 36 different journals. By applying these criteria, the search process aimed to ensure the inclusion of relevant studies while excluding those that did not meet the specified criteria, as shown in Fig. 1.
Fig. 1 Flowchart of the article selection process
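For illustration only, the Boolean string described above might look roughly as follows when entered in a database search interface; the exact syntax and field codes vary by database and are not reported in the study, so this rendering is an assumption.

```
(title OR abstract OR keywords) contains:
  "STEM Clubs" OR "STEAM Clubs"
  OR "science-technology-engineering-mathematics clubs"
  OR "after school STEM program"
  OR "extracurricular STEM activities"
```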
Two different approaches were followed in the content analysis process of this study. In the first part, deductive content analysis was used, and a priori coding was conducted, as the categories were established prior to the analysis. The categorization matrix was created based on the Paper Classification Form (PCF) developed by Sozbilir et al. (2012). The coding scheme consisted of eight classification groups, covering publication years, keywords, research designs, sample levels, sample sizes, data collection tools, data analysis methods, and durations, with sub-categories for each section. For example, under the research designs section, the sub-categories included qualitative and quantitative methods, case study, design-case study, comparative-case study, ethnographic study, phenomenological study, survey study, experimental study, mixed and longitudinal study, and literature review study. These sub-categories were identified prior to the analysis. Coding was then applied to the data using spreadsheets in Excel, based on the categorization matrix. Frequencies for the resulting codes and categories were calculated and presented in the findings section as tables. Line charts were used for the publication years section, while word clouds, which visually represent word frequency, were used for the keywords section; word clouds display the most frequently used words in different sizes and colours based on their frequencies (DePaolo & Wilkinson, 2014). In this part, the coding was unambiguous, since the studies mostly reported the relevant information explicitly.
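As a sketch of the frequency tallying behind these tables and the word cloud (the paper describes an Excel workflow; this Python equivalent and its records are invented for illustration):

```python
from collections import Counter

# Hypothetical coded records, one per reviewed study, following the PCF
# sections described above (all values are invented examples).
studies = [
    {"design": "case study", "sample_level": "middle school", "keywords": ["STEM", "education"]},
    {"design": "survey study", "sample_level": "high school", "keywords": ["STEM", "learning"]},
    {"design": "case study", "sample_level": "middle school", "keywords": ["engineering", "STEM"]},
]

design_counts = Counter(s["design"] for s in studies)
keyword_counts = Counter(kw for s in studies for kw in s["keywords"])

print(design_counts.most_common())    # e.g. [('case study', 2), ('survey study', 1)]
print(keyword_counts.most_common(3))  # the top keywords are what a word cloud scales up
```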
In the second part, open coding and the phases of creating categories and abstraction were followed for the purposes and findings sections. Firstly, the stated purposes and findings of the studies were written out as text. The written text was then carefully reviewed, and terms describing all aspects of the content were noted in the margins. Following this open coding, the lists of categories were grouped under higher-order headings based on their similarities and dissimilarities. Each category was named using content-characteristic words, and the abstraction process was repeated as far as was reasonable and possible. In this coding process, two individuals independently reviewed ten studies, applying the coding scheme for the first part and conducting open coding for the second part. They then compared notes and resolved the differences that emerged in their initial checklists. Inter-rater reliability was calculated as 0.84 using Cohen's kappa analysis. Once coding reliability was ensured, the remaining articles were coded independently by the author. After the coding was completed, consensus was reached through discussion of any remaining disagreements about the codes, as well as about the codes and categories constructed for the purposes and findings sections. There was broad agreement throughout the coding process, since the studies had already clearly stated key characteristics such as research design, sample size, sample level, and data collection tools. Additionally, when coding the studies' stated purposes and results, the researchers stayed close to the original sentences in the studies, which led to a high level of consistency in the coded content between the two raters.
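The 0.84 figure is Cohen's kappa, i.e., inter-rater agreement corrected for chance. A minimal sketch of the computation with scikit-learn, using invented ratings for ten studies:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical design codes assigned independently by two raters to ten studies
rater1 = ["case", "survey", "case", "mixed", "case", "survey", "case", "case", "mixed", "case"]
rater2 = ["case", "survey", "case", "mixed", "case", "case",   "case", "case", "mixed", "case"]

kappa = cohen_kappa_score(rater1, rater2)  # 1.0 = perfect agreement, 0.0 = chance level
print(f"Cohen's kappa: {kappa:.2f}")
```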
Studies related to STEM Clubs first appeared in 2009 (Fig. 2). The steady increase in the number of studies conducted each year is remarkable: the majority of the 56 publications examined were published after 2015, despite a dip in 2018. Articles were most frequently published in 2019 and 2022 (8 each), least frequently in 2009, 2010, and 2014 (1 each), and there were no publications in 2012.
Fig. 2 Number of articles by year
Word clouds were utilized to present the most frequently used keywords in the articles, as shown in Fig. 3. However, because keywords were not reported for studies indexed in the ERIC database, only 30 articles were included in this analysis. The most frequently appearing keywords were "STEM," "education," and "learning." Additionally, using a content analysis method, the keywords were categorized into six groups, shown in Table 1: disciplines, technological concepts, academic community, learning experiences, core elements of education, and psychosocial factors (variables).
Fig. 3 Word cloud of the keywords used in the articles
The purposes of the identified studies were classified into six main themes: "effects of participation in STEM Clubs on" (25), "evolution of a sample program for STEM Clubs and its implementation" (25), "examination of" (11), "identification of" (3), "comparison of in-school and out-of-school STEM experiences" (2), and "others" (6). Table 2 presents the distribution of the articles' purposes across these themes. The themes "effects of participation in STEM Clubs on" and "evolution of a sample program for STEM Clubs and its implementation" received the highest and equal consideration, while "identification of" (3) and "comparison of in-school and out-of-school STEM experiences" (2) received the least.
Within the theme of "effects of participation in STEM Clubs on" there are 11 categories. The aims of the studies in this section are to examine the effect of participation in STEM Clubs on various aspects such as attitudes towards STEM disciplines or career paths, STEM major choice/career aspiration, achievement in math, science, STEM disciplines, or content knowledge, perception of scientists, strategies used, value of clubs, STEM career paths, enjoyment of physics, use of complex and scientific language, interest in STEM, creativity, critical thinking about STEM texts, images of mathematics, or climate-change beliefs/literacy. It is evident that the majority of research in this section focuses on the effects of participation in STEM Clubs on STEM major choice/career aspiration (5), achievement (4), perception of something (4), and interest in STEM (3).
Within the theme of "evolution of a sample program for STEM Clubs and its implementation" there are three categories: development of program/curriculum/activity (14), identification of program's challenges and limitations (3), and implementation of program/activity (8). The studies in this section aim to develop a sample program for STEM Clubs and describe its implementation. It can be seen that the most preferred purpose among them is the development of program/curriculum/activity (14), while the least preferred purpose is the identification of program's challenges and limitations (3). In addition, studies that focus on the development of the program, curriculum, or activity were classified under the "general" category (10). Sub-categories were created for studies specifically expressing the development of the program with a focus on a particular area, such as the maker movement or Arduino-assisted robotics and coding. Similarly, studies that explicitly mentioned the development of the program based on presented ideas and experiences formed another sub-category. Furthermore, the category related to the implementation of program/activity was divided into eight sub-categories, each indicating the specific centre of implementation, such as problem-based learning-centred and representation of blacks-centred.
The theme of "examination of" refers to studies that aim to examine certain aspects, such as the experiences and perceptions of students (7) and the factors influencing specific subjects (4). Studies focusing on examining the experiences and perceptions of students were labelled as "general" (4), while studies exploring their experiences and perceptions regarding specific content, such as influences and challenges to participation in STEM clubs (2) and assessment (1), were labelled accordingly. Additionally, studies that focused on examining factors affecting the choice of STEM majors (2), participation in STEM clubs (1), and motivation to develop interest in STEM (1) were categorized in line with their respective focuses. As shown in Table 2 , it is evident that studies focusing on examining the experiences and perceptions of students (7) were more frequently conducted compared to studies focusing on examining the factors affecting specific subjects (4).
The theme of "identification of" refers to studies that aim to identify certain aspects, such as the types of attitudinal effects (1), types of changes in affect toward engineering (1), and non-academic skills (1). Additionally, the theme of "comparison of in-school and out-of-school STEM experiences" (2) refers to studies that aim to compare STEM experiences within school and outside of school. Lastly, studies that did not fit into the aforementioned categories were included in the "others" theme (6) as no clear connection could be identified among them.
The research designs employed in the examined articles were identified as follows: qualitative methods (36), including case study (20), design-case study (6), comparative-case study (4), ethnographic study (2), phenomenological study (2), and survey study (2); quantitative methods (7), including survey study (4) and experimental study (3); mixed methods and longitudinal studies (10); and literature review (3), as illustrated in Table 3 . It can be observed that among these methods, case study was the most commonly utilized. Furthermore, it is evident that quantitative methods (7) and literature reviews (3) were employed less frequently compared to qualitative (36) and mixed methods (10). Additionally, survey studies were utilized in both quantitative and qualitative studies.
The frequencies and percentages of sample levels in the examined articles are presented in Table 4 . The studies involved participants at different educational levels, including elementary school (8), middle school (23), high school (14), pre-service teachers or undergraduate students (6), teachers (4), parents (3), and others (1). It is apparent that middle school students (23) were the most commonly utilized sample among them, while high school students (14) were more frequently chosen compared to elementary school students (8). It should be noted that while grade levels were specified for both elementary and middle school students, separate grade levels were not identified for high school students in these studies. Additionally, studies that involved mixed groups were labelled as 3-5th and 6-8th grades. However, when the mixed groups included participants from different educational levels such as elementary, middle, or high school, teachers, parents, etc., they were counted as separate levels. Furthermore, the studies conducted with participants such as pre-service teachers, undergraduates, teachers, and parents were less frequently employed compared to K-12 students.
The frequencies of sample sizes in the examined articles are presented in Table 5. In 15 studies, the sample size was not reported. The sample-size intervals were not equally spaced; instead, they were arranged with widths of 5, 10, 50, and 100. This choice was made to allow a more granular examination of the smaller samples rather than only cumulative amounts. The analysis reveals that the studies primarily used sample groups of 11-15 participants (f:8), followed by groups of 16-20 (f:4) and 201-250 (f:4). Sample sizes of 6-10, 21-25, 41-50, 50-100, and more than 2000 were the least commonly studied (f:1 each).
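A sketch of this kind of unequal-width binning, using pandas with invented sample sizes; the bin edges below approximate the intervals described above and are assumptions, since the paper's tallying was done in spreadsheets:

```python
import pandas as pd

# Hypothetical sample sizes extracted from reviewed studies (invented values)
sample_sizes = pd.Series([12, 14, 18, 7, 205, 230, 48, 95, 2400])

# Unequal-width bins: width 5 for small samples, then 10, 50, and 100 for larger ones
bins = [0, 5, 10, 15, 20, 25, 50, 100, 200, 250, 2000, float("inf")]
binned = pd.cut(sample_sizes, bins=bins)
print(binned.value_counts().sort_index())  # frequency per interval, as in Table 5
```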
The frequencies and percentages of data collection tools in the examined articles are presented in Table 6 . The analysis reveals that the studies primarily employed survey or questionnaires (31.6%) and observations (30.5%) as data collection methods, followed by interviews (15.8%), documents (13.7%), tests (4.2%), and field notes (4.2%). Regarding survey/questionnaires, Likert-type scales (f:23) were more commonly employed compared to open-ended questions (f:7). Tests were predominantly used as achievement tests (f:2) and assessments (f:2), representing the least preferred data collection tools. Furthermore, the table illustrates that multiple data collection tools were frequently employed, as the total number of tools (95) is nearly twice the number of studies (56).
The frequencies and percentages of data analysing methods in the examined articles are presented in Table 7 . The table reveals that the studies predominantly employed descriptive analysis (f:33, 41.25%), followed by inferential statistics (f:16, 20%), descriptive statistics (f:15, 18.75%), content analysis (f:14, 17.5%), and the constant-comparative method (f:2, 2.5%). It is notable that qualitative methods (f:49, 61.25%) were preferred more frequently than quantitative methods (f:31, 38.75%) in the examined studies related to STEM Clubs. Within the qualitative methods, descriptive analysis (f:33) was utilized nearly twice as often as content analysis (f:14), while within the quantitative methods, descriptive statistics (f:15) and inferential statistics (f:16), including t-tests, ANOVA, regression, and other methods, were used with comparable frequency.
The durations of STEM Clubs in the examined studies are presented in Table 8. More studies failed to state the duration of their STEM Clubs (f:37) than reported it (f:19). Among the studies that did report durations, there was no common period of time: clubs were implemented for varying numbers of weeks and sessions of varying length. Overall, STEM Clubs were conducted over three semesters (academic year and summer), 5 months, or 2 to 16 weeks, with session durations most commonly ranging from 60 to 120 min. The durations of "3 semesters," "10 weeks with 90-min sessions per week," and "unknown weeks with 60-min sessions per week" each appeared more than once.
The content analysis of the findings of the examined articles is presented by frequency in Table 9. Although the studies cover a diverse range of topics, the results can be broadly classified into three themes: the "development of or increase in certain aspects" (f:68), the "design of STEM Clubs" (f:17), and the "identification of various aspects" (f:16). The findings in the first theme concern the development of certain aspects, such as skills, or the increase in specific outcomes, such as academic achievement. The studies also explore the design of STEM Clubs through descriptions of specific cases, such as sample implementations and challenges, and they identify various aspects, such as factors and perceptions.
It is evident from the findings that the studies predominantly yielded results related to the development of, or increase in, certain aspects (f:68). Within this theme, the most commonly observed result was the development of STEM or academic achievement or STEM competency (f:11). This was followed by an increase in STEM major choice or career aspiration (f:9), an increase in engagement or participation in STEM clubs (f:5), the development of identity, including STEM, science, engineering, and under-represented groups (f:5), the development of interest in STEM (f:4), an increase in enjoyment (f:4), and the development of collaboration, leadership, or communication skills (f:4). Some results, such as the development of critical thinking, perseverance, and the teachers' profession, appeared less frequently; the results of 16 studies occurred with a frequency of 1.
Within the design of STEM Clubs, the results presented included sample implementations or design models for different purposes, such as the use of a robotics program or supporting students with disabilities (f:7), design principles or ideas for STEM Clubs, activities, or curricula (f:4), and challenges or factors affecting STEM Clubs' success and sustainability (f:3). Additionally, comparisons were made between in-school and out-of-school learning environments (f:3), highlighting the contradictions between STEM clubs and science classes, as well as differences in STEM activities and continuous-discontinuous learning experiences in mathematics. Within the identification of various aspects, the most common result was the identification of factors affecting participation or motivation in STEM clubs (f:5), followed by the identification of barriers to participation (f:2). The identification of other aspects, such as parents' roles and perspectives on STEM, was comparatively less frequent.
Considering the wide variety of STEM Clubs found in different regions around the world, this study aimed to investigate the current state of research on STEM Clubs. It is not surprising to observe an increase in the number of studies conducted on STEM Clubs over the years. This can be attributed to the overall growth in research on STEM education (Zhan et al., 2022), as STEM education often includes activities and after-school programs as integral components (Blanchard et al., 2017). Identifying relevant keywords and incorporating them into a search strategy is crucial for conducting a comprehensive and rigorous systematic review (Corrin et al., 2022). To gain a broader understanding of keyword usage in the context of STEM Clubs, a word cloud analysis was performed (McNaught & Lam, 2010). Additionally, based on the content analysis method, six categories of keywords emerged: disciplines, technological concepts, academic community, learning experiences, core elements of education, and psychosocial factors (variables). The analysis revealed that the keyword "STEM" was used most frequently in the studies examined. This may be because authors want their studies to be easily found and widely searchable by others, so they use "STEM" as a general term (Corrin et al., 2022). Similarly, the frequent use of keywords like "education" and "learning" from the "core elements of education" category could be attributed to authors' desire to use broad, searchable terms to make their studies more discoverable (Corrin et al., 2022). Additionally, it was observed that, of the STEM components, only "science" and "engineering" were used as keywords, while "mathematics" and "technology" were not present. This finding aligns with claims in the literature that mathematics is often underemphasized in STEM integration (Fitzallen, 2015; Maass et al., 2019; Stohlmann, 2018). Although the specific term "technology" did not appear in the word cloud, technology-related keywords such as "arduino," "robots," "coding," and "innovative" were present. Furthermore, the analysis revealed that authors preferred keywords related to their sample populations, such as "middle (school students)," "elementary (students)," "high school students," or "teachers." Keywords describing learning experiences, such as "extracurricular," "informal," "afterschool," "out-of-school," "social," "clubs," and "practice," were also commonly used. This preference may stem from the fact that STEM clubs are often part of informal learning environments, out-of-school programs, or afterschool activities, and these concepts are closely related to each other (Baran et al., 2016; Cooper, 2011; Kalkan & Eroglu, 2017; Schweingruber et al., 2014). Moreover, keywords related to psychosocial factors (variables), such as "disabilities," "skills," "interest," "attainment," "enactment," "expectancy-value," "self-efficacy," "engagement," "motivation," "career," "gender," "cognitive," and "identity," were also prevalent, suggesting that the articles investigated the effects of STEM Club practices on these psychosocial variables. In sum, by using these keywords, researchers can gain valuable insights and effectively search for relevant articles related to STEM Clubs, enabling them to locate appropriate resources for their research (Corrin et al., 2022).
The popularity of case studies as a research design can be attributed to the fact that studies on STEM Clubs were conducted in diverse learning environments, highlighting sample implementation designs (Adams et al., 2014; Bell et al., 2009; Robelen, 2011). Case studies offer the opportunity to present practical applications and real-world examples (Hamilton & Corbett-Whittier, 2012), which is highly valuable in the context of STEM Clubs. Additionally, the observation that quantitative methods were not as commonly utilized as qualitative methods in studies related to STEM Clubs contrasts with the predominant reliance on quantitative methods in STEM education research generally (Aslam et al., 2022; Irwanto et al., 2022; Lin et al., 2019). This suggests a lack of quantitative studies specifically focused on STEM Clubs, indicating a need for more research in this area employing quantitative approaches. It is therefore important to prioritize and conduct additional quantitative studies to further enhance our understanding of STEM Clubs and their impact. In studies on STEM Clubs, there is a higher frequency of research involving K-12 students, particularly middle school students, in parallel with other reviews in the literature (Aslam et al., 2022), compared to groups such as pre-service teachers, undergraduate students, teachers, and parents. This can be attributed to the fact that STEM Clubs are designed for K-12 students, and middle school is a crucial period for introducing them to STEM concepts and careers. Middle school students are developmentally ready for the hands-on and inquiry-based learning commonly used in STEM education. Additionally, time constraints, especially for high school students preparing for university, may limit their involvement in extensive STEM activities. Furthermore, STEM Clubs were primarily studied with sample groups of 11-15, 16-20, and 201-250 participants. The preference for 11-20 participants, rather than fewer than 10, may be attributed to the collaborative nature of STEM activities, which often require a larger team for effective teamwork and group dynamics (Magaji et al., 2022). Utilizing small groups as samples can make the case study research design the most frequently employed approach, given its compatibility with smaller sample sizes. On the other hand, the inclusion of larger groups (201-250) is suitable for survey studies, as this number can represent the total student population attending STEM Clubs throughout a semester with multiple sessions (Boys & Girls Club of America, 2019).
According to studies on STEM Clubs, surveys or questionnaires and observations were predominantly used as data collection methods. This preference can be attributed to the fact that surveys or questionnaires allow researchers to gather data on diverse aspects, including students' attitudes, perceptions, and experiences related to STEM Clubs, facilitating generalization and comparison (McLafferty, 2016). Observations were also frequently employed because they can offer a deeper understanding of the lived experiences and actual practices within STEM Clubs (Baker, 2006). As for analysis, descriptive analysis was predominantly utilized in studies on STEM Clubs, with the quantitative methods of descriptive statistics and inferential statistics being used to a similar extent. The preference for descriptive analysis may arise from its effectiveness in describing activities, experiences, and practices within STEM Clubs. Given the predominance of case study research in the analysed studies, it is not surprising to observe a high frequency of descriptive statistics in the findings. On the other hand, the extensive use of quantitative analysis methods can be attributed to the need for statistical analysis of surveys and questionnaires (Young, 2015). Consequently, future studies on STEM Clubs could benefit from considering the use of tests and field notes as additional data collection tools, along with surveys, observations, and interviews. Additionally, the development of tests specifically designed to assess aspects related to STEM could provide valuable insights (Capraro & Corlu, 2013; Grangeat et al., 2021). Moreover, increasing the utilization of content analysis and constant comparative analysis methods could further enhance the depth and richness of data analysis in STEM Club research (White & Marsh, 2006).

In the studies on STEM Clubs, the duration and scheduling of the clubs varied considerably. While there was no common period of time for STEM Clubs, they were implemented for different numbers of weeks and sessions, with session durations ranging from several minutes up to 120 minutes. It was observed, however, that STEM Clubs were predominantly conducted over the course of three semesters, comprising the academic year and summer, or for durations of 2 to 16 weeks. This scheduling pattern can be attributed to the fact that STEM Clubs were often implemented as after-school programs, designed to align with academic semesters and summer school periods to effectively reach students. Additionally, the number of weeks in these studies may have been arranged according to the duration of academic semesters, although some studies were conducted for less than a semester (Gutierrez, 2016). The prevalence of multiple sessions lasting 60 to 120 minutes can be attributed to the nature of the activities involved in STEM Clubs: these activities often require more time than regular class hours, and splitting them into separate sessions allows students to concentrate effectively on their work and engage in more in-depth learning experiences (Vennix et al., 2017).
The purposes of the studies on STEM Clubs mostly concerned the effects of participation in STEM Clubs on various aspects, such as attitudes towards STEM disciplines or career paths, STEM major choice or career aspiration, and achievement; the evolution of a sample program for STEM Clubs and its implementation, including the development of the program or activity, the identification of the program's challenges and limitations, and its implementation; followed by the examination of certain aspects, such as the experiences and perceptions of students and the factors influencing specific subjects; the identification of aspects such as the types of attitudinal effects and non-academic skills; and the comparison of in-school and out-of-school STEM experiences.

The results of the studies, paralleling these purposes, were mostly related to the development of or increase in certain aspects, such as STEM or academic achievement, STEM competency, STEM major choice or career aspiration, engagement or participation in STEM Clubs, identity, interest in STEM, enjoyment, collaboration, communication skills, and critical thinking; the design of STEM Clubs, including sample implementations or design models for different purposes (such as the use of a robotics program or provision for students with disabilities) and design principles or ideas for STEM Clubs or activities; challenges or factors affecting STEM Clubs' success and sustainability; and the comparison between in-school and out-of-school learning environments. The results were also related to the identification of various aspects, such as factors affecting participation or motivation to join STEM Clubs and barriers to participation.

At this point, it is evident that these identified categories align with the findings of studies in the literature, which claim that after-school programs such as STEM Clubs have positive impacts on students' achievement levels (NRC, 2015; Kazu & Kurtoglu Yalcin, 2021; Shernoff & Vandell, 2007), communication and innovative problem-solving abilities (Mahoney et al., 2007), leadership skills (Lipscomb et al., 2017), career decision-making (Bybee, 2001; Dabney et al., 2012; Sahin et al., 2018; Tai et al., 2006), creativity (Wan et al., 2023), 21st-century skills (Hirsch, 2011; Zeng et al., 2018), interest in STEM professions (Blanchard et al., 2017; Chittum et al., 2017; Wang et al., 2011), and knowledge in STEM fields (Adams et al., 2014; Bell et al., 2009). Furthermore, it can be inferred that the studies on STEM Clubs paid significant attention to the design descriptions of programs or activities (Nation et al., 2019). This may be because there is a need for studies that focus on designing program models for different cases (Calabrese Barton & Tan, 2018; Estrada et al., 2016). These studies can serve as examples and provide guidance for the development of STEM Clubs in various settings. By creating sample models, researchers can contribute to the improvement and expansion of STEM Clubs across different environments (Cakir & Guven, 2019; Estrada et al., 2016).
In conclusion, as with the studies on trends in STEM education (Bozkurt et al., 2019; Chomphuphra et al., 2019; Irwanto et al., 2022; Li et al., 2020; Lin et al., 2019; Martín-Páez et al., 2019; Noris et al., 2023), the analysis of prevailing research trends specifically in STEM Clubs, which are implemented in diverse environments with varying methods and purposes, can provide a comprehensive understanding of these clubs as a whole.
It can also serve as a valuable resource for guiding future investigations in this field. By identifying common approaches and gaps in methods and results, a holistic perspective on STEM Clubs can be achieved, leading to a more informed and targeted direction for future research endeavours.
Future research on STEM Clubs should consider the trends identified in this study and address methodological gaps. For instance, there is a lack of research in this area that employs quantitative approaches, so it is important for future studies to incorporate quantitative methods to enhance the understanding of STEM Clubs and their impact. Promising directions include exploring underrepresented populations, investigating the long-term impacts of STEM Clubs, and examining the effectiveness of specific pedagogical approaches or interventions within these clubs. Researchers should also analyse the common approaches used in STEM Clubs across different settings; such analysis can help uncover effective strategies, best practices, and successful models that can be replicated or adapted in various contexts. By undertaking these efforts, researchers can contribute to a more comprehensive understanding of STEM Clubs, leading to advancements in the field of STEM education.
It is important to consider the limitations of the study when interpreting its findings. The study's findings are based on the literature selected from two databases, which may introduce biases and limitations. Additionally, the study's findings are constrained by the timeframe of the literature review, and new studies may have emerged since the cut-off date, potentially impacting the representation and generalizability of the research trends identified. Another limitation lies in the construction of categories during the coding process. The coding scheme used may not have fully captured or represented all relevant terms or concepts. Some relevant terms may have been inadequately represented or identified using different words or phrases, potentially introducing limitations to the analysis. While efforts were made to ensure validity and reliability, there is still a possibility of unintended biases or inconsistencies in the categorization process.
The datasets (documents, Excel analyses) utilized in this article are available upon request from the corresponding author.
Adams, J. D., Gupta, P., & Cotumaccio, A. (2014). A museum program enhances girls’ STEM interest, motivation and persistence. Afterschool Matters, 12, 14–20.
Afterschool Alliance (2015). Full STEM ahead: Afterschool programs step up as key partners in STEM education. Retrieved November 2023 from http://www.afterschoolalliance.org/AA3PM/
Aslam, S., Saleem, A., Kennedy, T. J., Kumar, T., Parveen, K., Akram, H., & Zhang, B. (2022). Identifying the research and trends in STEM education in Pakistan: A systematic literature review. SAGE Open, 12(3), 21582440221118544.
Ayers, K. A., Wade-Jaimes, K., Wang, L., Pennella, R. A., & Pounds, S. B. (2020). The St. Jude STEM clubs: An after-school STEM club for upper elementary school students in Memphis, TN. Journal of STEM Outreach, 3(1), 1–26. https://doi.org/10.15695/jstem/v3i1.13
Bae, S. H. (2018). Concepts, models, and research of extended education. International Journal for Research on Extended Education, 6(2), 153–165.
Baker, L. (2006). Observation: A complex research method. Library Trends, 55(1), 171–189.
Baran, E., Bilici, S. C., Mesutoglu, C., & Ocak, C. (2016). Moving STEM beyond schools: Students’ perceptions about an out-of-school STEM education program. International Journal of Education in Mathematics, Science and Technology, 4(1), 9–19. https://doi.org/10.18404/ijemst.71338
Bell, P., Lewenstein, B., Shouse, A. W., & Feder, M. A. (2009). Learning science in informal environments: People, places and pursuits. National Research Council of the National Academies.
Blanchard, M. R., Hoyle, K. S., & Gutierrez, K. S. (2017). How to start a STEM club. Science Scope, 41(3), 88–94.
Boys & Girls Club of America (2019). Annual report. Retrieved November 2023 from https://www.bgca.org/about-us/annual-report
Bozkurt, A., Ucar, H., Durak, G., & Idin, S. (2019). The current state of the art in STEM research: A systematic review study. Cypriot Journal of Educational Science, 14(3), 374–383. https://doi.org/10.18844/cjes.v14i3.3447
Bybee, R. W. (2001). Achieving scientific literacy: Strategies for ensuring that free choice science education complements national formal science education efforts. In J. H. Falk (Ed.), Free choice education: How we learn science outside of school (pp. 44–63). Teachers College Press.
Cakir, N. K., & Guven, G. (2019). Arduino-assisted robotic and coding applications in science teaching: Pulsimeter activity in compliance with the 5E learning model. Science Activities, 56(2), 42–51.
Calabrese Barton, A., & Tan, E. (2018). A longitudinal study of equity-oriented STEM-rich making among youth from historically marginalized communities. American Educational Research Journal, 55(4), 761–800.
Capraro, R. M., & Corlu, M. S. (2013). Changing views on assessment for STEM project-based learning. In R. M. Capraro, M. M. Capraro, & J. R. Morgan (Eds.), STEM project-based learning (pp. 109–118). Brill.
Casing, P. I., & Casing, L. M. R. (2024). Fostering students’ mathematics achievement through after-school program in the 21st century. Online Submission, 12(3), 118–122.
Chittum, J. R., Jones, B. D., Akalin, S., & Schram, A. B. (2017). The effects of an afterschool STEM program on students’ motivation and engagement. International Journal of STEM Education, 4, 1–16.
Chomphuphra, P., Chaipidech, P., & Yuenyong, C. (2019). Trends and research issues of STEM education: A review of academic publications from 2007 to 2017. Journal of Physics: Conference Series, 1340(1), 012069.
Civil, M. (2007). Building on community knowledge: An avenue to equity in mathematics education. In N. S. Nasir & P. Cobb (Eds.), Improving access to mathematics: Diversity and equity in the classroom (pp. 105–117). Teachers College Press.
Cooper, S. (2011). An exploration of the potential for mathematical experiences in informal learning environments. Visitor Studies, 14(1), 48–65. https://doi.org/10.1080/10645578.2011.557628
Corrin, L., Thompson, K., Hwang, G. J., & Lodge, J. M. (2022). The importance of choosing the right keywords for educational technology publications. Australasian Journal of Educational Technology, 38(2), 1–8.
Dabney, K. P., Tai, R. H., Almarode, J. T., Miller-Friedmann, J. L., Sonnert, G., Sadler, P. M., & Hazari, Z. (2012). Out-of-school time science activities and their association with a career interest in STEM. International Journal of Science Education, Part B, 2(1), 63–79. https://doi.org/10.1080/21548455.2011.629455
DePaolo, C. A., & Wilkinson, K. (2014). Get your head into the clouds: Using word clouds for analyzing qualitative assessment data. TechTrends, 58, 38–44. https://doi.org/10.1007/s11528-014-0750-9
Donnelly, M., Lažetić, P., Sandoval-Hernandez, A., Kumar, K., & Whewall, S. (2019). An unequal playing field: Extra-curricular activities, soft skills and social mobility. Social Mobility Commission.
Durlak, J. A., & Weissberg, R. P. (2007). The impact of after-school programs that promote personal and social skills. Collaborative for Academic, Social, and Emotional Learning (CASEL). Retrieved from www.casel.org
Elo, S., & Kyngäs, H. (2008). The qualitative content analysis process. Journal of Advanced Nursing, 62(1), 107–115.
Estrada, M., Burnett, M., Campbell, A. G., Campbell, P. B., Denetclaw, W. F., Gutiérrez, C. G., Hurtado, S., John, G. H., Matsui, J., McGee, R., Okpodu, C. M., Robinson, T. J., Summers, M. F., Werner-Washburne, M., & Zavala, M. (2016). Improving underrepresented minority student persistence in STEM. CBE—Life Sciences Education, 15(3), es5.
Fitzallen, N. (2015). STEM education: What does mathematics have to offer? In M. Marshman, V. Geiger, & A. Bennison (Eds.), Mathematics education in the margins. Proceedings of the 38th annual conference of the Mathematics Education Research Group of Australasia (pp. 237–244). MERGA.
Fraenkel, J., Wallen, N., & Hyun, H. (2012). How to design and evaluate research in education (10th ed.). McGraw-Hill Education.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2012). Educational research: Competencies for analysis and applications (10th ed.). Pearson.
Grangeat, M., Harrison, C., & Dolin, J. (2021). Exploring assessment in STEM inquiry learning classrooms. International Journal of Science Education, 43(3), 345–361.
Gutierrez, K. S. (2016). Investigating the climate change beliefs, knowledge, behaviors, and cultural worldviews of rural middle school students and their families during an out-of-school intervention: A mixed-methods study (Publication No. 11320) [Doctoral dissertation, North Carolina State University]. NC State University Libraries.
Hamilton, L., & Corbett-Whittier, C. (2012). Using case study in education research. Sage.
Hein, G. (2009). Learning science in informal environments: People, places, and pursuits. Museums & Social Issues, 4(1), 113–124.
Hirsch, B. (2011). Learning and development in after-school programs. Phi Delta Kappan, 92(5), 66–69. https://doi.org/10.1177/003172171109200516
Irwanto, I., Saputro, A. D., Widiyanti, W., Ramadhan, M. F., & Lukman, I. R. (2022). Research trends in STEM education from 2011 to 2020: A systematic review of publications in selected journals. International Journal of Interactive Mobile Technologies (iJIM), 16(5), 19–32.
Kalkan, C., & Eroglu, S. (2017). Designing sample activities based on STEM materials for gifted/talented students in support education rooms. Journal of Gifted Education and Creativity, 4(2), 36–46. Retrieved November 2023 from https://dergipark.org.tr/tr/pub/jgedc/issue/38702/449432
Kazu, I. Y., & Kurtoglu Yalcin, C. (2021). The effect of STEM education on academic performance: A meta-analysis study. Turkish Online Journal of Educational Technology-TOJET, 20(4), 101–116.
Lauer, P. A., Akiba, M., Wilkerson, S. B., Apthorp, H. S., Snow, D., & Martin-Glenn, M. L. (2006). Out-of-school-time programs: A meta-analysis of effects for at-risk students. Review of Educational Research, 76(2), 275–313.
Li, Y., Wang, K., Xiao, Y., & Froyd, J. E. (2020). Research and trends in STEM education: A systematic review of journal publications. International Journal of STEM Education, 7(1), 1–16.
Lin, T. C., Lin, T. J., & Tsai, C. C. (2014). Research trends in science education from 2008 to 2012: A systematic content analysis of publications in selected journals. International Journal of Science Education, 36(8), 1346–1372.
Lin, T. J., Lin, T. C., Potvin, P., & Tsai, C. C. (2019). Research trends in science education from 2013 to 2017: A systematic content analysis of publications in selected journals. International Journal of Science Education, 41(3), 367–387.
Lipscomb, S., Haimson, J., Liu, A. Y., Burghardt, J., Johnson, D. R., & Thurlow, M. L. (2017). Preparing for life after high school: The characteristics and experiences of youth in special education. Findings from the National Longitudinal Transition Study 2012. Volume 2: Comparisons across disability groups: Full report (Report No. NCEE 2017-4018). U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance.
Little, P., Wimer, C., & Weiss, H. B. (2008). After school programs in the 21st century: Their potential and what it takes to achieve it. Issues and Opportunities in Out-of-School Time Evaluation, 10, 1–12.
Maass, K., Geiger, V., Ariza, M. R., & Goos, M. (2019). The role of mathematics in interdisciplinary STEM education. ZDM, 51, 869–884. https://doi.org/10.1007/s11858-019-01100-5
Magaji, A., Ade-Ojo, G., & Bijlhout, D. (2022). The impact of after school science club on the learning progress and attainment of students. International Journal of Instruction, 15(3), 171–190.
Mahoney, J. L., Parente, M. E., & Lord, H. (2007). After-school program engagement: Links to child competence and program quality and content. The Elementary School Journal, 107(4), 385–404.
Martín-Páez, T., Aguilera, D., Perales-Palacios, F. J., & Vílchez-González, J. M. (2019). What are we talking about when we talk about STEM education? A review of literature. Science Education, 103(4), 799–822.
McLafferty, S. (2016). Conducting questionnaire surveys. Key Methods in Geography, 3, 129–142.
McNaught, C., & Lam, P. (2010). Using Wordle as a supplementary research tool. Qualitative Report, 15(3), 630–643.
Merrill, C., & Daugherty, J. (2010). STEM education and leadership: A mathematics and science partnership approach. Journal of Technology Education, 21(2), 21–34.
Nation, J. M., Harlow, D., Arya, D. J., & Longtin, M. (2019). Being and becoming scientists: Design-based STEM programming for girls. Afterschool Matters, 29, 36–44.
National Research Council, Division of Behavioral and Social Sciences and Education, Board on Science Education, & Committee on Successful Out-of-School STEM Learning (2015). Identifying and supporting productive STEM programs in out-of-school settings. National Academies Press.
Noris, M., Saputro, S., & Ulimaz, A. (2023). STEM research trends from 2013 to 2022: A systematic literature review. International Journal of Technology in Education (IJTE), 6(2), 224–237. https://doi.org/10.46328/ijte.390
Pastchal-Temple, A. S. (2012). The effect of regular participation in an after-school program on student achievement, attendance, and behavior (Publication No. 4368) [Doctoral dissertation, Mississippi State University]. Mississippi State University Libraries.
Resnick, L. B. (1987). Education and learning to think. National Academy Press.
Robelen, E. (2011). New STEM schools target underrepresented groups. Education Week, 31(1), 18–19.
Sahin, A., Ekmekci, A., & Waxman, H. C. (2018). Collective effects of individual, behavioral, and contextual factors on high school students’ future STEM career plans. International Journal of Science and Mathematics Education, 16, 69–89.
Schweingruber, H., Pearson, G., & Honey, M. (Eds.). (2014). STEM integration in K-12 education: Status, prospects, and an agenda for research. National Academies Press.
Shernoff, D. J., & Vandell, D. L. (2007). Engagement in after school program activities: Quality of experience from the perspective of participants. Journal of Youth and Adolescence, 36, 891–903.
Sozbilir, M., Kutu, H., & Yasar, M. D. (2012). Science education research in Turkey: A content analysis of selected features of papers published. In J. Dillon & D. Jorde (Eds.), The world of science education: Handbook of research in Europe (pp. 1–35). Sense Publishers.
Stemler, S. (2000). An overview of content analysis. Practical Assessment, Research & Evaluation, 7(17), 1–6. https://doi.org/10.7275/z6fm-2e34
Stohlmann, M. (2018). A vision for future work to focus on the “m” in integrated STEM. School Science and Mathematics, 118(7), 310–319. https://doi.org/10.1111/ssm.12301
Suri, H., & Clarke, D. (2009). Advancements in research synthesis methods: From a methodologically inclusive perspective. Review of Educational Research, 79(1), 395–430.
Tai, R. H., Qi Liu, C., Maltese, A. V., & Fan, X. (2006). Planning early for careers in science. Science, 312(5777), 1143–1144.
Vennix, J., Den Brok, P., & Taconis, R. (2017). Perceptions of STEM-based outreach learning activities in secondary education. Learning Environments Research, 20, 21–46.
Wan, Z. H., So, W. M. W., & Zhan, Y. (2023). Investigating the effects of design-based STEM learning on primary students’ STEM creativity and epistemic beliefs. International Journal of Science and Mathematics Education, 21(Suppl. 1), 87–108.
Wang, H. H., Moore, T. J., Roehrig, G. H., & Park, M. S. (2011). STEM integration: Teacher perceptions and practice. Journal of Pre-College Engineering Education Research, 1(2), 1–13.
White, M. D., & Marsh, E. E. (2006). Content analysis: A flexible methodology. Library Trends, 55(1), 22–45.
Young, T. J. (2015). Questionnaires and surveys. In Z. Hua (Ed.), Research methods in intercultural communication: A practical guide (pp. 163–180). John Wiley & Sons. https://doi.org/10.1002/9781119166283.ch11
Zeng, Z., Yao, J., Gu, H., & Przybylski, R. (2018). A meta-analysis on the effects of STEM education on students’ abilities. Science Insights Education Frontiers, 1(1), 3–16.
Zhan, Z., Shen, W., Xu, Z., Niu, S., & You, G. (2022). A bibliometric analysis of the global landscape on STEM education (2004–2021): Towards global distribution, subject integration, and research trends. Asia Pacific Journal of Innovation and Entrepreneurship, 16(2), 171–203.
Open access funding provided by the Scientific and Technological Research Council of Türkiye (TÜBİTAK). There was no external funding received for the research conducted in this article.
Authors and Affiliations
Department of Industrial Engineering, Istanbul Aydin University, Istanbul, Turkey
Rabia Nur Öndeş
Correspondence to Rabia Nur Öndeş.
Ethical Approval and Consent
This is a review study and no ethical approval is required.
The author has no competing interests to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
Öndeş, R.N. Research Trends in STEM Clubs: A Content Analysis. Int J of Sci and Math Educ (2024). https://doi.org/10.1007/s10763-024-10477-z
Received: 19 January 2024
Accepted: 10 June 2024
Published: 25 June 2024
DOI: https://doi.org/10.1007/s10763-024-10477-z
Assessing Carbon Sink Capacity in Coal Mining Areas: A Case Study from Taiyuan City, China
Article outline:
- 2. Materials and Methods
  - 2.1. Study Area
  - 2.2. Data Source and Processing (2.2.1. Data Source; 2.2.2. Data Processing)
  - 2.3. Using Carbon Absorption Coefficient to Estimate Carbon Absorption
  - 2.4. Using Carbon Density to Estimate Carbon Storage (2.4.1. Estimation of Carbon Storage; 2.4.2. Selection and Calibration of Carbon Density)
  - 2.5. Using NPP to Estimate NEP (2.5.1. Estimation of NPP; 2.5.2. Estimation of NEP)
- 3. Results
  - 3.1. Carbon Absorption in Mining Areas
  - 3.2. Carbon Storage in Mining Areas
  - 3.3. NEP Estimation in Mining Areas
- 4. Discussion
  - 4.1. Discussion on Estimation Results and Their Methods (4.1.1. Carbon Absorption; 4.1.2. Carbon Storage; 4.1.3. NPP; 4.1.4. R_H and NEP)
  - 4.2. Evaluation of Methods for Estimating Carbon Sink Capacity in Mining Areas
  - 4.3. Novelty and Limitations for Estimating Carbon Sink Capacity in Mining Areas
  - 4.4. Countermeasures for Coal Mining Enterprises to Stabilize Carbon Sinks in Mining Areas
- 5. Conclusions
- Back matter: Supplementary Materials; Author Contributions; Institutional Review Board Statement; Informed Consent Statement; Data Availability Statement; Conflicts of Interest; Abbreviations
| Land Use Type | Area in 2021 (hm²) | Notes |
|---|---|---|
| Cultivated land | 5.66 | |
| Forest land | 201.75 | |
| Grassland | 0 | Undivided in 2021 |
| Land for mining and industry | 7.16 | Belonging to construction land |
| Land for residential area | 7.76 | |
| Land for transportation | 3.89 | |
| Water area and land for water conservancy facilities | 0.35 | Abbreviated as water area |
| Other land | 8.79 | |
| Land Use Type | Carbon Absorption Coefficient (t/hm²) | Source |
|---|---|---|
| Cultivated land | 0.007 | An et al. [ ] |
| Forest land | 0.581 | Cao and Cui [ ]; Zhang et al. [ ] |
| Grassland | 0.021 | Zhan et al. [ ]; Zhang et al. [ ] |
| Water area | 0.253 | Cao and Cui [ ]; Zhang et al. [ ]; Zhong et al. [ ] |
| Other land | 0.005 | Han et al. [ ]; Zhang et al. [ ]; Zhong et al. [ ] |
| Land Use Type | Aboveground Carbon Density (t/hm²) | Underground Carbon Density (t/hm²) | Soil Carbon Density (t/hm²) | Carbon Density of Dead Organic Matter (t/hm²) |
|---|---|---|---|---|
| Cultivated land | 1.65 | 0.32 | 87.09 | 0.16 |
| Forest land | 28.67 | 5.98 | 100.61 | 2.87 |
| Grassland | 3.35 | 2.10 | 78.84 | 0.33 |
| Land for mining and industry | 0.03 | 0 | 61.55 | 0 |
| Land for residential area | 0 | 0 | 50.07 | 0 |
| Land for transportation | 0 | 0 | 47.77 | 0 |
| Water area | 0.02 | 0 | 0 | 0 |
| Other land | 0.81 | 0.14 | 18.76 | 0.08 |
| Land Use Type | Cultivated Land | Forest Land | Grassland | Water Area | Other Land |
|---|---|---|---|---|---|
| Carbon absorption (t) | 0.04 | 117.22 | 0 | 0.09 | 0.04 |
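To make the estimation method concrete: each absorption figure above is simply the land-use area multiplied by its carbon absorption coefficient, as the methods-comparison table later in this section also notes. A minimal Python sketch (the variable names are mine, not the paper's) reproduces the published values:

```python
# Areas (hm^2, from the land-use table) and absorption coefficients (t/hm^2, from the coefficient table).
areas = {
    "Cultivated land": 5.66,
    "Forest land": 201.75,
    "Grassland": 0.0,
    "Water area": 0.35,
    "Other land": 8.79,
}
coefficients = {
    "Cultivated land": 0.007,
    "Forest land": 0.581,
    "Grassland": 0.021,
    "Water area": 0.253,
    "Other land": 0.005,
}

# Carbon absorption (t) = coefficient x area; e.g., forest: 0.581 x 201.75 = 117.22 t, matching the table.
for land_use, area in areas.items():
    absorption = coefficients[land_use] * area
    print(f"{land_use}: {absorption:.2f} t")
```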
| Land Use Type | Aboveground Carbon Storage (t) | Underground Carbon Storage (t) | Soil Carbon Storage (t) | Dead Organic Matter Carbon Storage (t) | Total Carbon Storage (t) |
|---|---|---|---|---|---|
| Cultivated land | 9.34 | 1.81 | 492.93 | 0.91 | 504.99 |
| Forest land | 5784.17 | 1206.47 | 20,298.07 | 579.02 | 27,867.73 |
| Grassland | 0 | 0 | 0 | 0 | 0 |
| Land for mining and industry | 0.21 | 0 | 440.70 | 0 | 440.91 |
| Land for residential area | 0 | 0 | 388.54 | 0 | 388.54 |
| Land for transportation | 0 | 0 | 185.83 | 0 | 185.83 |
| Water area | 0.01 | 0 | 0 | 0 | 0.01 |
| Other land | 7.12 | 1.23 | 164.90 | 0.70 | 173.95 |
| Mining area | 5800.85 | 1209.51 | 21,970.96 | 580.63 | 29,561.96 |
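The storage figures follow the same pattern: each pool's carbon storage is its carbon density multiplied by the land-use area, summed over the four carbon pools. A brief sketch (variable names are mine) reproduces the forest-land row to within rounding:

```python
# Carbon densities for forest land (t/hm^2, from the density table) and its area (hm^2).
area = 201.75
densities = {
    "aboveground": 28.67,
    "underground": 5.98,
    "soil": 100.61,
    "dead organic matter": 2.87,
}

# Carbon storage (t) = density x area, per pool and in total.
pools = {pool: d * area for pool, d in densities.items()}
total = sum(pools.values())

for pool, storage in pools.items():
    print(f"{pool}: {storage:.2f} t")
print(f"total forest carbon storage: {total:.2f} t")  # ~27,867.73 t, matching the table
```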
| Study Area | T (°C) | R (mm) | NPP_T [g/(m²·a)] | NPP_R [g/(m²·a)] |
|---|---|---|---|---|
| This study | 10.39 | 470.70 | 1441.09 | 805.25 |
| Wuzhong City | 10.19 | 189.45 | 1423.53 | 353.02 |
| Zhongning County | 10.15 | 196.89 | 1419.51 | 365.69 |
| Guyuan City | 7.12 | 439.30 | 1156.51 | 754.81 |
| Longde County | 5.76 | 500.11 | 1043.70 | 843.23 |
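The NPP_T and NPP_R columns are consistent with the standard Miami model, which estimates NPP separately from mean annual temperature T (°C) and annual precipitation R (mm) and takes the smaller of the two values. The sketch below uses the classic Miami model formulas; whether the paper applied exactly these constants is an assumption, though they reproduce the "This study" row to the published precision:

```python
import math

def npp_temperature(t_celsius: float) -> float:
    """Miami model NPP limited by mean annual temperature, in g/(m^2*a)."""
    return 3000.0 / (1.0 + math.exp(1.315 - 0.119 * t_celsius))

def npp_precipitation(r_mm: float) -> float:
    """Miami model NPP limited by annual precipitation, in g/(m^2*a)."""
    return 3000.0 * (1.0 - math.exp(-0.000664 * r_mm))

# "This study" row: T = 10.39 C, R = 470.70 mm.
t, r = 10.39, 470.70
npp_t = npp_temperature(t)    # ~1441.09
npp_r = npp_precipitation(r)  # ~805.25
npp = min(npp_t, npp_r)       # the model takes the limiting (smaller) value

print(f"NPP_T = {npp_t:.2f}, NPP_R = {npp_r:.2f}, NPP = {npp:.2f} g/(m^2*a)")
```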
| Study Area | Estimation Results of NPP [g/(m²·a)] | Study Time (Year) | Method | Source |
|---|---|---|---|---|
| This study | 805.25 | 2021 | Miami model | |
| Fenhe River Basin | 291.57 | From 2000 to 2015 | CASA model | Tian et al. [ ] |
| Shanxi Province | 273.67 | From 2000 to 2019 | CASA model | Su et al. [ ] |
| Loess Plateau | 300.74 | From 2001 to 2019 | CASA model | Song et al. [ ] |
| Lvliang contiguous poverty areas | 241.24–331.70 | From 2000 to 2018 | CASA model | Sun et al. [ ] |
| Index | Units | Main Estimation Methods | Advantage | Disadvantage | Notes |
|---|---|---|---|---|---|
| Carbon absorption | t | The product of the carbon absorption coefficient and area | The simplest method | The carbon absorption coefficient is not yet unified | Carbon absorption is also known as a carbon sink. |
| Carbon storage | t | The product of carbon density and area | The method is relatively simple | The InVEST model ignores the influence of interannual changes in carbon density; at least two years of carbon storage data are required to obtain carbon sink capacity | Carbon sink refers to the increase in carbon storage over two years. |
| NEP | g/(m²·a) | NPP minus R_H | The most complex method | Affected by the NPP and R_H calculation results; the estimation accuracy of NPP is relatively low | A positive NEP represents a carbon sink. |
Chen, F.; Liu, Y.; Guo, J.; Bai, H.; Wu, Z.; Liu, Y.; Li, R. Assessing Carbon Sink Capacity in Coal Mining Areas: A Case Study from Taiyuan City, China. Atmosphere 2024, 15, 765. https://doi.org/10.3390/atmos15070765
Supplementary Material: ZIP-Document (ZIP, 112 KiB)
Published on 26.6.2024 in Vol 8 (2024)
Authors of this article:
1 Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
2 Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan
Takanobu Hirosawa, MD, PhD
Department of Diagnostic and Generalist Medicine
Dokkyo Medical University
880 Kitakobayashi
Mibu-cho, Shimotsuga
Tochigi, 321-0293
Phone: 81 282861111
Email: [email protected]
Background: The potential of artificial intelligence (AI) chatbots, particularly ChatGPT with GPT-4 (OpenAI), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential diagnosis lists.
Objective: This study aims to assess the capability of GPT-4 in identifying the final diagnosis from differential-diagnosis lists and to compare its performance with that of physicians for case report series.
Methods: We used a database of differential-diagnosis lists from case reports in the American Journal of Case Reports , corresponding to final diagnoses. These lists were generated by 3 AI systems: GPT-4, Google Bard (currently Google Gemini), and Large Language Models by Meta AI 2 (LLaMA2). The primary outcome was focused on whether GPT-4’s evaluations identified the final diagnosis within these lists. None of these AIs received additional medical training or reinforcement. For comparison, 2 independent physicians also evaluated the lists, with any inconsistencies resolved by another physician.
Results: The 3 AIs generated a total of 1176 differential diagnosis lists from 392 case descriptions. GPT-4’s evaluations concurred with those of the physicians in 966 out of 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating a fair to good agreement between GPT-4 and the physicians’ evaluations.
Conclusions: GPT-4 demonstrated a fair to good agreement in identifying the final diagnosis from differential-diagnosis lists, comparable to physicians for case report series. Its ability to compare differential diagnosis lists with final diagnoses suggests its potential to aid clinical decision-making support through diagnostic feedback. While GPT-4 showed a fair to good agreement for evaluation, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.
Diagnostic Error and Feedback
A well-developed diagnostic process is fundamental to medicine. Diagnostic errors [ 1 ], which include missed, incorrect, or delayed diagnoses [ 2 ], result in severe misdiagnosis-related harm, affecting up to 795,000 patients annually in the United States [ 3 ]. These errors often stem from a failure to correctly identify an underlying condition [ 4 , 5 ]. Enhancing the diagnostic process is crucial, with diagnostic feedback playing a key role [ 6 ]. The feedback enables physicians to assess their diagnostic accuracy and adjust their subsequent clinical decisions accordingly [ 7 ]. Common diagnostic feedback methods include self-reflection [ 8 , 9 ], peer review [ 1 ], and clinical decision support systems (CDSSs), which aim to enhance decision-making at the point of care [ 10 ]. Unlike the retrospective nature of self and peer review processes, feedback from CDSSs is provided in real-time [ 11 ], offering immediate support and guidance during the diagnostic process. This timely feedback is particularly advantageous in fast-paced clinical settings where timely decision-making is critical.
CDSSs are categorized into 2 main types: knowledge-based and nonknowledge-based systems [ 10 ]. Knowledge-based CDSSs rely on established medical knowledge including clinical guidelines, expert protocols, and information on drug interactions. In contrast, nonknowledge-based systems, particularly those using artificial intelligence (AI), leverage advanced algorithms, machine learning, and statistical pattern recognition. Unlike their rule-based counterparts, these systems adapt over time, continuously refining their insights and recommendations. The rapid integration of AI into CDSSs highlights the growing importance of advanced technologies in health care [ 12 ]. In recent years, generative AI through large language models (LLMs) has been reshaping health care, offering improvements in diagnostic accuracy, treatment planning, and patient care [ 13 , 14 ]. AI systems, emulating human cognition, continuously learn from new data [ 15 ]. They assist health care professionals by analyzing complex patient data, thereby enhancing clinical decision-making and patient outcomes [ 10 ].
In this context of rapidly integrating AI into CDSSs, generative AIs have marked a new era in digital health. LLMs are advanced AI algorithms trained on extensive textual data, enabling them to process and generate human-like text, thereby providing valuable insights to medical diagnostics. Several generative AI tools are now available to the public, including Bard (currently Gemini) by Google [ 16 , 17 ], LLM Meta AI 2 (LLaMA2) by Meta AI [ 18 ], and ChatGPT, developed by OpenAI [ 19 ]. These AI tools, which use LLMs, have successfully passed national medical licensing exams without specific training or reinforcement [ 20 ], demonstrating their potential in medical diagnostics. Among these, ChatGPT stands out as one of the most extensively researched generative AI applications in health care [ 21 ]. Specifically, in diagnostics, a recent study has shown that these generative AI systems, particularly ChatGPT with GPT-4, demonstrate excellent diagnostic capability when answering clinical vignette questions [ 22 ]. Additionally, other studies, including our own, have assessed AI systems’ performance in one aspect of the diagnostic process, generating differential diagnosis lists [ 23 - 25 ]. While broader studies compare a variety of state-of-the-art models, our analysis focuses on the distinct capabilities and impacts of these specific tools within medical diagnostics.
The diagnostic process involves collecting clinical information, forming a differential diagnosis, and refining it through continuous feedback [ 26 ]. This feedback consists of patient outcomes, test results, and final diagnoses [ 27 , 28 ]. Similar to traditional CDSSs, generative AI systems can enhance this feedback loop [ 29 ]. However, a gap previously existed in the systematic comparison of differential diagnoses with final diagnoses through a feedback loop [ 27 ]. Given this background, it remains less explored how effectively these AI systems integrate their feedback into clinical workflow. To address this gap, exploring how generative AI systems provide feedback by comparing final diagnoses with differential-diagnosis lists represents a straightforward and viable first step. This study used differential diagnosis lists to assess diagnostic accuracy. This approach was chosen to mimic a key aspect of the clinical decision-making process, where physicians often narrow down a broad list of potential diagnoses to determine the most likely one. This method reflects a critical use case for AI in health care, potentially speeding up and refining diagnostic accuracy. In our previous short communication, we reported that the fourth generation ChatGPT (GPT-4) showed very good agreement with physicians in evaluating the lists for a limited number of case reports published from our General Internal Medicine (GIM) department [ 30 ]. Building on this research, this study focused on assessing the capability of GPT-4 in identifying the final diagnosis from differential-diagnosis lists for comprehensive case report series, compared with those of physicians. Furthermore, this research aimed to demonstrate the role of generative AI, particularly GPT-4, in enhancing the diagnostic learning cycle through effective feedback mechanisms.
We conducted an experimental study using GPT-4 and the differential-diagnosis lists generated by 3 AI systems from case descriptions. The research was conducted at the Department of Diagnostic and Generalist Medicine (GIM), Dokkyo Medical University, Tochigi, Japan. Our research methodology encompassed preparing a data set of differential-diagnosis lists and the corresponding final diagnoses, assessing these lists using GPT-4, and having physicians evaluate the lists. Figure 1 illustrates the study flow.
Since we used a database extracted from published case reports, obtaining ethical approval was not applicable.
We used our data set from a previous study (TH, YH, KM, T Sakamoto, KT, T Shimizu. Diagnostic performance of generative artificial intelligences for a series of complex case reports. Unpublished data, November 2023). From the PubMed search, we identified a total of 557 case reports. We excluded nondiagnosed cases (130 cases) and pediatric cases involving patients younger than 10 years (35 cases). The exclusion criteria were based on previous research on CDSSs [ 31 ]. After the exclusion, we included 392 case reports. The case reports were condensed into case descriptions focused on the diagnosis. The final diagnoses were typically those defined by the case reports’ authors. By inputting the case descriptions with a systematic prompt, 3 generative AI systems—GPT-4, Google Bard (currently Google Gemini), and the LLaMA2 chatbot—generated top 10 differential-diagnosis lists. The AI systems received no additional medical training or reinforcement. The main investigator (TH) conducted the entire process, with validation provided by another investigator (YH). Through this process, the data set included differential-diagnosis lists corresponding to case descriptions and final diagnoses from case reports in the American Journal of Case Reports. Detailed lists of differential diagnoses and their final diagnoses are shown in Multimedia Appendix 1.
In selecting the generative AI systems for evaluation, we focused on GPT-4 due to its distinct architectural framework and widespread use in the field of health care research. GPT-4, developed by OpenAI, is notable for its advanced natural language processing capabilities and extensive training data set, making it particularly relevant for health care [ 32 ]. We used the August 3 and September 25 versions of GPT-4 to evaluate the differential-diagnosis lists; the access dates ranged from September 11, 2023, to October 6, 2023. A structured prompt was crafted to ascertain whether GPT-4 could identify the final diagnosis within a list and, if present, its position. The prompt required direct copying and pasting of the final diagnoses and differential-diagnosis lists from our data set. We assessed the inclusion of the final diagnosis in the list (Yes=1, No=0) and its position. The prompt was selected through a preliminary investigation. To ensure unbiased output, each session was isolated by deactivating chat history and training controls and restarting GPT-4 before every new evaluation. We obtained a single output from GPT-4 for each differential-diagnosis list. The details of the structured prompt used in this study are expounded in Multimedia Appendix 2.
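The study pasted its structured prompt into the ChatGPT interface; for readers who would rather script the same check, the sketch below shows a hypothetical equivalent using OpenAI's Python client. The prompt wording, example diagnoses, and model identifier are all assumptions for illustration, not the study's materials (those are in Multimedia Appendix 2).

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical example case; the study's actual prompt differs.
final_diagnosis = "Giant cell arteritis"
differential_list = [
    "Migraine", "Tension headache", "Giant cell arteritis",
    "Trigeminal neuralgia", "Cluster headache",
]

prompt = (
    "Final diagnosis: " + final_diagnosis + "\n"
    "Differential-diagnosis list:\n"
    + "\n".join(f"{i + 1}. {d}" for i, d in enumerate(differential_list))
    + "\nIs the final diagnosis included in the list? Answer 1 for yes or 0 "
    "for no, and if yes, give its rank (1-10)."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```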
For comparison, 2 independent physicians (KM and T Sakamoto) also evaluated the differential diagnosis lists. The presence of the final diagnosis within the differential diagnosis lists was marked with a 1 or 0. A “1” was marked when the lists precisely and acceptably identified the final diagnosis [ 33 ], further ranking it from 1 to 10 based on its placement. A “0” indicated its absence. Discrepancies between the evaluations of the 2 physicians were resolved by another physician (KT). Notably, the physicians were blinded to which AI generated the lists they assessed. We selected 3 independent physicians, specializing in GIM. Selection was based on expertise in diagnostic processes and familiarity with AI technologies in health care. All physicians underwent a brief guidance session to familiarize themselves with the evaluation criteria and objectives of the study to ensure consistent assessment standards.
The primary outcome was defined as the κ coefficient for interrater agreement between GPT-4’s and the physicians’ evaluations of the differential-diagnosis lists generated by all 3 AI systems (GPT-4, Google Bard [currently Google Gemini], and the LLaMA2 chatbot). The secondary outcomes were defined as the κ coefficients for interrater agreement between GPT-4’s and the physicians’ evaluations of the lists generated by each AI system separately, as well as the comparison of ranking patterns between GPT-4’s evaluation and that of the physicians.
Analytical procedures were conducted using R (version 4.2.2; The R Foundation for Statistical Computing). The agreement between different evaluations was quantified using the Cohen κ coefficient through the irr package in R. Agreement strength was categorized as per Cohen κ benchmarks: values under 0.40 indicated poor agreement; values between 0.41 and 0.75 showed fair to good agreement; and values ranging from 0.75 to 1.00 denoted very good agreement [ 34 ]. The 95% CIs were used to quantify uncertainty. Additionally, we compared ranking patterns between GPT-4’s evaluation and that of physicians [ 35 ].
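The agreement analysis used the irr package in R; a rough Python equivalent (an assumption for illustration, not the study's code) uses scikit-learn's cohen_kappa_score on the paired 1/0 judgments:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired judgments (1 = final diagnosis included, 0 = not included).
gpt4_eval = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
physician_eval = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

kappa = cohen_kappa_score(gpt4_eval, physician_eval)

# Benchmarks cited in the study: <0.40 poor, 0.41-0.75 fair to good, >0.75 very good.
if kappa < 0.40:
    strength = "poor"
elif kappa <= 0.75:
    strength = "fair to good"
else:
    strength = "very good"

print(f"kappa = {kappa:.2f} ({strength})")
```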
This study involved 3 generative AI systems—GPT-4, Google Bard (currently Google Gemini), and the LLaMA2 chatbot—each outputting differential-diagnosis lists for 392 case descriptions, resulting in a total of 1176 lists. Of the 825 lists in which the physicians judged the final diagnosis to be included, GPT-4’s evaluation matched for 636 lists and did not match for 189 lists. Conversely, of the 351 lists in which the physicians judged the final diagnosis not to be included, GPT-4’s evaluation matched for 330 lists and did not match for 21 lists. In total, GPT-4’s evaluations matched the physicians’ evaluations in 966 out of 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating a fair to good agreement between GPT-4’s and the physicians’ evaluations. GPT-4 omitted the final diagnosis in 16.1% (n=189) of lists that the physicians’ evaluations judged to include it. Table 1 shows how GPT-4’s evaluations concurred with the physicians’ evaluations, and Table 2 details the κ coefficients for interrater agreement between GPT-4’s and the physicians’ evaluations. The representative input used in GPT-4’s evaluations is illustrated in Figure 2, and the corresponding output is shown in Figure 3. The formed data set is shown in Multimedia Appendix 3.
| Variables | GPT-4, matched | GPT-4, did not match | Total (N=1176) |
|---|---|---|---|
| Inclusion of final diagnosis | 636 | 189 | 825 |
| Noninclusion of final diagnosis | 330 | 21 | 351 |
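The reported agreement statistics can be verified directly from the counts in the table above; the short sketch below recovers the 82.1% observed agreement and the κ of approximately 0.63:

```python
# Counts from the table: rows = physicians (yes/no), columns = GPT-4 (matched / did not match).
# "Matched" in the noninclusion row means GPT-4 also answered "no".
yes_yes, yes_no = 636, 189   # physicians yes; GPT-4 yes / GPT-4 no
no_no, no_yes = 330, 21      # physicians no;  GPT-4 no  / GPT-4 yes
n = yes_yes + yes_no + no_no + no_yes  # 1176

observed = (yes_yes + no_no) / n  # proportion of lists where the raters agreed

# Chance agreement from each rater's marginal totals.
phys_yes, phys_no = yes_yes + yes_no, no_no + no_yes
gpt_yes, gpt_no = yes_yes + no_yes, yes_no + no_no
expected = (phys_yes * gpt_yes + phys_no * gpt_no) / n**2

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.1%}, kappa = {kappa:.2f}")
# -> observed agreement = 82.1%, kappa = 0.63
```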
| Differential-diagnosis lists generator | Cohen κ coefficient (95% CI) | Strength of agreement [ ] | Number of differential-diagnosis lists |
|---|---|---|---|
| All | 0.63 (0.56-0.69) | Fair to good | 1176 |
| GPT-4 | 0.47 (0.39-0.56) | Fair to good | 392 |
| Google Bard^a | 0.67 (0.52-0.73) | Fair to good | 392 |
| LLaMA2^b chatbot | 0.63 (0.52-0.73) | Fair to good | 392 |
a Currently Google Gemini.
b LLaMA2: LLM Meta AI 2.
The κ coefficients for differential-diagnosis lists generated by GPT-4, Google Bard (currently Google Gemini), and LLaMA2 chatbot were 0.47 (95% CI 0.39-0.56), 0.67 (95% CI 0.52-0.73), and 0.63 (95% CI 0.52-0.73), respectively. All κ coefficients indicated a fair to good agreement between GPT-4 and the physicians’ evaluations.
Both GPT-4’s evaluation and that of the physicians showed a general trend of decreasing frequency as the rank increased. Figure 4 shows the comparisons of ranking patterns between GPT-4 and the physicians.
Physicians’ evaluations (KM and T Sakamoto) for the differential diagnosis lists showed very good agreement, with concordance in 88.8% (n=1044) of cases. The κ coefficient was 0.75 (95% CI 0.46-0.99).
This experimental study highlights several key findings. First, GPT-4’s evaluations matched those of physicians in more than 82% (n/N=966/1176) of the cases, demonstrating fair to good agreement according to κ coefficient values. These results imply that GPT-4’s accuracy in identifying the final diagnosis within differential-diagnosis lists is comparable to that of physicians. Unlike traditional CDSSs, generative AI systems, including GPT-4, are capable of performing multiple roles in the diagnostic process including formulating and assessing differential diagnoses. These capabilities highlight GPT-4’s potential to streamline diagnostics in clinical settings by expediting diagnostic feedback [ 36 ]. Our study design focuses on GPT-4’s ability to refine and validate pre-existing diagnostic considerations as supplementary tools for medical diagnostics. This scenario is akin to real-world clinical settings where generative AI systems could verify and support physicians’ final diagnostic decisions. By assessing the AI’s accuracy in this context, we can better understand its potential role and limitations in practical medical applications. Furthermore, in medical education, generative AI tools, like GPT-4, can offer students valuable self-learning opportunities. They provide timely feedback in the form of final diagnoses [ 37 ], enabling them to cross-reference with reliable sources for verification [ 38 ].
Second, GPT-4 failed to identify the final diagnosis in 16% (n/N=189/1176) of differential-diagnosis lists, even though these diagnoses were recognized by the evaluating physicians. Notably, despite the very good agreement achieved among the physicians, GPT-4 did not reach similar levels of concordance. This discrepancy highlights potential areas for improving the system’s ability to interpret and analyze complex medical data. It arises primarily from GPT-4’s reliance on textual patterns and word associations within the provided differential-diagnosis lists: unlike physicians, who draw on a comprehensive medical knowledge base and clinical experience, generative AI systems like GPT-4 are inherently limited by their reliance on existing data patterns and textual associations. To mitigate these discrepancies, continuous development of generative AI systems for health care is needed. Additionally, future research should focus on enhancing the medical training of these systems. This will improve the generative AI systems’ diagnostic feedback, making it more adaptable to real clinical settings.
Third, regarding the rank at which the final diagnosis was found within the differential-diagnosis lists, both GPT-4’s and the physicians’ evaluations exhibited a trend of decreasing frequency as rank increased. This suggests that GPT-4’s diagnosis ranking follows a trend similar to the physicians’ ranking. Moreover, all 3 generative AI systems—GPT-4, Google Bard (currently Google Gemini), and the LLaMA2 chatbot—prioritized the most likely diagnoses at the top of the list, leading to a natural decrease in frequency as less probable diagnoses were ranked lower. Therefore, generative AI systems showed the potential not only to generate differential-diagnosis lists for clinical cases but also to evaluate these lists as feedback.
Fourth, an examination of the differential-diagnosis lists generated by the 3 different AI systems showed overlapping 95% CIs for the κ coefficients across the 3 AI platforms. One might hypothesize that GPT-4 would exhibit improved performance when evaluating differential-diagnosis lists it generated itself. However, the observed results may stem from the inherent variability in generative AI outputs, including those of GPT-4. This inherent variability underscores the challenge of maintaining a consistent standard of accuracy and reliability in the outputs of generative AI systems. Even when evaluating differential-diagnosis lists it had generated itself, GPT-4’s performance did not markedly surpass its performance on lists generated by other AI systems. Additionally, the observed performance differences may be partially due to version inconsistencies: the generation of the differential-diagnosis lists used an earlier version of GPT-4 (March 24), whereas the subsequent evaluations used later versions (August 3 and September 25). Different versions of generative AI systems can exhibit varied capabilities and outputs, potentially affecting the accuracy and consistency of diagnostic evaluations. This highlights the need for ongoing updates and version alignment in clinical AI applications to maintain reliability.
This study has several limitations. First, GPT-4’s role was limited to identifying the final diagnosis within the differential-diagnosis list. The current binary evaluation method is not a well-established approach to evaluating diagnostic performance compared with other CDSSs; another study used a 5-grade level of accuracy for a varying number of differentials [ 39 ]. Investigating more complex outcomes, such as quantitative evaluations and additional clinical suggestions, might yield different results. Second, our inputs to GPT-4 consisted only of the final diagnoses and the differential-diagnosis lists, without the case descriptions that generated those lists. Further research should examine which types of input enhance AI systems’ performance the most. Third, there was a nonnegligible risk that generative AI systems, including GPT-4, had inadvertently learned from and could replicate the information contained in publicly available case reports. Fourth, the data set was sourced from a single case-reports journal and generated by 3 AI systems. Future research would benefit from using real-world scenarios [ 40 ], and expanding the data set to include a more diverse range of AI systems is also advisable.
Regarding the limitations of generative AI systems like GPT-4, there is currently no approval for their use as CDSSs. Furthermore, GPT-4 operates as a fee-based application, which could limit its accessibility to the wider public. Additionally, the reliability of generative AI systems can vary based on the input data they were trained on; a system not exposed to diverse clinical scenarios during training may not be as effective in real-world diagnostic situations [ 41 ]. Moreover, while AI tools can assist, they do not replace the nuanced judgments and decision-making processes of human physicians [ 42 , 43 ]. Additionally, the rapid evolution of AI means that our findings may become outdated, as Google Bard and LLaMA2 have already been updated to new LLM models, Google Gemini and LLaMA3, respectively [ 17 , 44 ]. Finally, overreliance on AI without critical review could lead to diagnostic errors [ 45 ].
In our previous study involving GPT-4 [ 30 ], we observed a very good agreement with physicians in identifying final diagnoses within the differential-diagnosis lists, achieving a 95.9% agreement rate (236 out of 246 lists; κ=0.86). In contrast, this study demonstrated a fair to good agreement rate of 82.1% (966/1176 lists; κ=0.63). Despite using the same evaluation methods in both studies, the observed decrease in the agreement can be attributed to several factors: the source of case reports (GIM-published vs a broader range of case reports), the generators of differential diagnoses (physicians, GPT-3/GPT-4 vs GPT-4/Google Bard [currently Gemini]/LLaMA2 chatbot), and the volume of lists assessed (246 lists vs 1176 lists).
Future studies should explore the potential of integrating GPT-4 and similar AI systems into real-world clinical settings. This could involve developing interfaces that allow these AI systems to interact directly with electronic health records, providing real-time diagnostic feedback to physicians. Research could also focus on tailoring these AI systems to specialized medical fields, where their ability to process vast amounts of data could significantly aid complex case analysis. Another vital area for future research is the ethical implications of AI in medicine [43], particularly patient data privacy, the transparency of AI decisions, and the impact of AI-assisted diagnostics on physician-patient relationships.
Further research should also investigate the optimal use of AI technologies, including both chatbot interfaces and application programming interface (API) functionalities. A more detailed examination of API settings, such as the adjustable temperature and top-P sampling parameters, could be invaluable. Such an investigation would provide clearer guidelines on when and how to use different AI tools effectively, considering both scientific evidence and practical effectiveness.
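To make the API settings mentioned above concrete, the following is a minimal sketch using the OpenAI Python SDK; the model name, prompt, and parameter values are illustrative assumptions, not those used in this study.

```python
# Minimal sketch (illustrative, not the study's code): calling GPT-4 through
# the API, where sampling parameters can be set explicitly, unlike in the
# chatbot interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",          # illustrative model name
    temperature=0.0,        # 0 = most deterministic sampling
    top_p=1.0,              # nucleus sampling cutoff ("top-P" in the text)
    messages=[
        {"role": "system",
         "content": "You are a physician evaluating differential diagnoses."},
        {"role": "user",
         "content": "Is the final diagnosis 'X' included in this list: A, B, X? "
                    "Answer Yes or No."},
    ],
)
print(response.choices[0].message.content)
```

Lower temperature values make the evaluation more reproducible across repeated runs, which is one reason API access may be preferable to a chatbot interface for studies of this kind.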
Moreover, as a next step, our future research will focus on refining the evaluation of AI-generated differential diagnoses by incorporating more sophisticated and validated psychometric methods for assessing the quality of differential diagnoses. This approach will allow us not only to compare AI-generated outputs with those of physicians but also to treat the comparison as a form of Turing test, evaluating whether AI can match or surpass human performance in diagnostic tasks without being distinguishable from physicians [46].
GPT-4 demonstrated fair to good agreement in identifying the final diagnosis from differential-diagnosis lists, comparable to that of physicians, for a case report series. By reliably identifying diagnoses, GPT-4 can provide timely feedback by comparing final diagnoses with differential-diagnosis lists. This study therefore suggests that generative AI systems have the potential to assist physicians in the diagnostic process by providing reliable and efficient feedback, thereby contributing to improved clinical decision-making and medical education. However, it is imperative to recognize that these findings are based on an experimental study; real-world scenarios could present unique challenges, and further validation in diverse clinical environments is essential before broad implementation can be recommended.
This research was funded by the Japan Society for the Promotion of Science (JSPS) KAKENHI (grant 22K10421). This study was conducted using resources from the Department of Diagnostics and Generalist Medicine at Dokkyo Medical University.
TH, YH, KM, T Sakamoto, KT, and T Shimizu contributed to the study concept and design. TH performed the statistical analyses. TH contributed to the drafting of the manuscript. YH, KM, T Sakamoto, KT, and T Shimizu contributed to the critical revision of the manuscript for relevant intellectual content. All the authors have read and approved the final version of the manuscript.
None declared.
The differential diagnoses generated by the 3 artificial intelligence systems used in this study and the final diagnoses.
Structured prompt used in this study.
Formatted data set used in this study.
AI: artificial intelligence
CDSS: clinical decision support system
GIM: general internal medicine
LLaMA2: LLM Meta AI 2
LLM: large language model
Edited by A Mavragani; submitted 08.04.24; peer-reviewed by A Rodman, C Zhang; comments to author 24.04.24; revised version received 28.04.24; accepted 04.05.24; published 26.06.24.
©Takanobu Hirosawa, Yukinori Harada, Kazuya Mizuta, Tetsu Sakamoto, Kazuki Tokumasu, Taro Shimizu. Originally published in JMIR Formative Research (https://formative.jmir.org), 26.06.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.