
Published: 29 December 2014

Programming tools: Adventures with R

Sylvia Tippmann

Nature volume 517, pages 109–110 (2015)


  • Computational biology and bioinformatics
  • Information technology
  • Research management

13 February 2015 Owing to an editing error, the original version of this article did not make it clear what prevented Rabih Murr from practising R — he was preparing a paper for publication. The text has been updated to reflect this.

A Correction to this article was published on 04 March 2015


A guide to the popular, free statistics and visualization software that gives scientists control of their own data analysis.


For years, geneticist Helene Royo used commercial software to analyse her work. She would extract DNA from the developing sperm cells of mice, send it for analysis and then fire up a package called GeneSpring to study the results. “As a scientist, I wanted to understand everything I was doing,” she says. “But this kind of analysis didn’t allow that: I just pressed buttons and got answers.” And as Royo’s studies comparing genetic activity on different chromosomes became more involved, she realized that the commercial tool could not keep up with her data-processing demands.


With the results of her first genomic sequencing experiments in hand at the start of a new postdoc, Royo had a choice: pass the sequences over to the experts or learn to analyse the data herself. She took the plunge, and began learning how to parse data in the free, open-source software package R. It helped that the centre she had joined — the Friedrich Miescher Institute for Biomedical Research in Basel, Switzerland — ran regular courses on the software. But she was also following a wider trend: for many academics seeking to wean themselves off commercial software, R is the data-analysis tool of choice.

Besides being free, R is popular partly because it presents different faces to different users. It is, first and foremost, a programming language — requiring input through a command line, which may seem forbidding to non-coders. But beginners can surf over the complexities and call up preset software packages, which come ready-made with commands for statistical analysis and data visualization. These packages create a welcoming middle ground between the comfort of commercial ‘black-box’ solutions and the expert world of code. “R made it very easy,” says Royo. “It did everything for me.”
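To make that middle ground concrete, here is a minimal sketch of what a first R session can look like: the command line doubles as a calculator and statistics engine, and preset packages are one command away. The numbers and the package chosen are purely illustrative.

```r
# Base R works interactively, with statistics built in.
heights <- c(1.62, 1.75, 1.68, 1.81, 1.59)  # metres (made-up data)
mean(heights)        # arithmetic mean: 1.69
sd(heights)          # standard deviation
summary(heights)     # five-number summary plus the mean

# Preset packages extend the language with ready-made commands:
# install.packages("ggplot2")   # fetch a package from CRAN
# library(ggplot2)              # load it for the current session
```

The two commented lines at the end are all that is needed to move from base R to any of the thousands of contributed packages.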

That, indeed, is what R’s developers intended when they designed it in the 1990s. Ross Ihaka and Robert Gentleman, statisticians at the University of Auckland in New Zealand, had an interest in computing but lacked practical software for their needs. So they developed a programming language with which they could perform data analysis themselves. R got its name in part from its developers’ initials, although it was also a reference to the most widely used coding language at the time, S.


In the early days of the World Wide Web, R quickly attracted interest from scientists around the globe who needed statistical software and were willing to contribute ideas. Gentleman and Ihaka decided to make their source code accessible to everybody, and coding-literate scientists quickly developed packages of pre-programmed routines and commands for particular fields. “I can write software that would be good for somebody doing astronomy,” says Gentleman, “but it’s a lot better if someone doing astronomy writes software for other people doing astronomy.”

Mathematical solutions

Karline Soetaert, an oceanographer at the Royal Netherlands Institute for Sea Research in Yerseke, took up that idea when, in 2008, she wanted to check the health of zooplankton in the estuary of the river Scheldt. Soetaert wanted to calculate how fast zooplankton were dying, using measurements along the river, but R was not equipped for that. To tackle the problem, she worked with two ecologists to develop deSolve — the first package written in R to solve differential equations. “Other software can do that, but it is expensive and closed source,” she notes. Now deSolve is used by epidemiologists modelling infectious diseases, geneticists working on gene-regulatory networks and drug developers working on pharmaco­kinetics (how compounds behave in living organisms).
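The kind of problem deSolve handles can be sketched in a few lines. The model below is a stand-in, not Soetaert's actual analysis: a population N decaying at a constant rate k, the simplest differential equation of the sort her package solves.

```r
# Illustrative use of deSolve: solve dN/dt = -k * N,
# e.g. a zooplankton population dying at rate k.
library(deSolve)

decay <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dN <- -k * N          # the derivative at time t
    list(c(dN))           # deSolve expects a list of derivatives
  })
}

out <- ode(y = c(N = 100),                 # initial population
           times = seq(0, 10, by = 1),     # time points to report
           func = decay, parms = c(k = 0.3))
head(out)                                  # columns: time, N
```

Swapping in a system of several coupled equations only requires returning more derivatives from `decay`; the call to `ode()` stays the same.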

By 2003, 10 years after R’s first release, scientists had developed more than 200 packages, and the first citations of the ‘R Project’ appeared. Today, nearly 6,000 packages exist for all kinds of specialized purposes (see ‘R in science’). They allow scientists to compare a human and a Neanderthal genome (using Bioconductor); to model population growth (IPMpack); to predict equity prices (quantmod); and to visualize the results in polished graphics (ggplot2), all in a few lines of code. Experts can use R to write up manuscripts, embedding raw code in them to be run by the reader (knitr). Nearly 1 in 100 scholarly articles indexed in Elsevier’s Scopus database last year cites R or one of its packages — and in agricultural and environmental sciences, the share is even higher (see ‘A rising tide of R’).
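“Polished graphics in a few lines” is not an exaggeration; as a small illustration, a complete ggplot2 plot of R's built-in iris data set fits in one statement:

```r
# A complete, publication-style scatter plot in a few lines of ggplot2.
library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  labs(title = "Iris measurements",
       x = "Sepal length (cm)", y = "Petal length (cm)")
```

Each `+` adds a layer (points, labels, scales), which is what lets analyses grow from a quick look at the data to a finished figure without starting over.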

Statistical success

For many users, R’s quality as statistics software stands out. The tool is on a par with commercial packages such as SPSS and SAS, says Robert Muenchen, a statistician at the University of Tennessee in Knoxville who analyses the popularity of software used in statistical computing. In the past decade, R has caught up with and overtaken the market leaders. “Most likely, R became the top statistics package used during the summer of this year,” he says.

In genomics and molecular biology, a software project called Bioconductor was developed on the back of R. It helps scientists to process and compare huge numbers of genetic sequences, to query results against databases such as Gene Expression Omnibus and to upload data to the databases. It includes almost 1,000 packages, some of which help to link the millions of DNA snippets from next-generation sequencing experiments to annotated genes.
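Bioconductor packages are installed through the project's own manager rather than plain `install.packages()`. The sketch below uses the modern BiocManager route (readers in 2014 would have used the project's `biocLite()` script instead); GenomicRanges is shown simply as an example of a core package.

```r
# Installing a Bioconductor package (requires internet access).
install.packages("BiocManager")        # the Bioconductor installer, from CRAN
BiocManager::install("GenomicRanges")  # an example core Bioconductor package
library(GenomicRanges)                 # ranges of genomic coordinates
```

Once loaded, such packages interoperate: the same range or sequence objects flow between Bioconductor's annotation, alignment and visualization tools.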

For her dive into R, Royo had intensive training: under the supervision of Michael Stadler, head of the Friedrich Miescher Institute’s bioinformatics group, she took about half a year to work on R and Bioconductor. But there are plentiful chances to learn, says Karthik Ram, an ecologist at the Berkeley Institute for Data Science in California who founded rOpenSci, an initiative that helps scientists to adopt and develop R (see ‘An R starter kit’). He and his colleagues teach free courses that do not require existing programming skills and are targeted towards scientists’ specific problems.

One researcher who took that training is Megan Jennings, an ecologist at San Diego State University in California. She tracks bobcats, mountain lions and other wild animals, to understand their movements. Armed with more than 400,000 time-stamped photos to which she had appended species names — taken from 36 cameras running for almost a year — Jennings wanted to follow particular species at particular times of year. At first, she manually selected the photos she wanted and fed them into a black-box program called PRESENCE. But with Ram’s help, she is creating an R package that reads in the tagged photos, cleans them up and then sends customized subsets of the data to a pre-existing modelling package in R. “What took me one hour to do manually, I will now be able to do in five minutes,” Jennings says.
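A hypothetical sketch of the workflow Jennings describes might read in a table of tagged photo records, clean it, and subset it by species and season. The file name and column names (`file`, `species`, `timestamp`) are invented for illustration; her actual package and data differ.

```r
# Hypothetical camera-trap workflow: read, clean, subset.
photos <- read.csv("camera_traps.csv", stringsAsFactors = FALSE)
photos$timestamp <- as.POSIXct(photos$timestamp)   # parse the time stamps

# Keep one species during the summer months only.
bobcats <- subset(photos,
                  species == "bobcat" &
                  format(timestamp, "%m") %in% c("06", "07", "08"))

nrow(bobcats)   # records ready to pass on to a modelling package
```

The payoff she describes comes from the last step: the subset can be handed straight to an existing occupancy-modelling package instead of being assembled by hand.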

One of the greatest perks of R is its online support. Discussion forums about R-related topics outstrip online questions about any commercial statistics software, says Muenchen.

“It’s common to see someone post a question and the person who developed the package answer within half an hour,” he says. This rapid response is key for scientists in basic research. “I can find an answer to almost any question online,” says Royo. She can confidently do most of her day-to-day data analysis herself, and she helps out less proficient colleagues. Still, “I google things every day”, she adds. Learning R, says Royo, has not only taught her coding skills, but has also made her more critical about other scientists’ analyses.

Not every scientist is enthusiastic about learning the necessary programming — even though, says Ram, R is less intimidating than languages such as Python (let alone Perl or C). “There are going to be far more scientists that will be comfortable with point-and-click interfaces than will ever learn to program at any time,” Muenchen says. Geneticist Rabih Murr, for example, took the same R course as Royo when he was a postdoc, but preparing a paper for publication gave him little time to practise. Getting started and developing research-specific skills in R definitely requires a commitment: “It’s a matter of priorities,” he says. But after becoming a lab head at the University of Geneva in Switzerland this year, he is planning to hire someone with R experience.

Like any other skill, learning R cannot be done overnight. But Jennings says that it is worth it. “Make that time. Set it aside as an investment: for saving time later, and for building skills that can be used across multiple problems we face as scientists.”

R in science

Researchers have used R to devise software packages in all kinds of disciplines. A few are listed below; there are thousands more at the Comprehensive R Archive Network (CRAN).

Astrophysics The solaR package provides functions to determine the solar radiation that falls on Earth.

Carbon dating Bchron creates chronologies based on radiocarbon- and non-radiocarbon-dated depths of sediments.

Climate science raincpc allows researchers to obtain and analyse daily global rainfall data from the US National Oceanic and Atmospheric Administration’s Climate Prediction Center.

Epidemiology DCluster is a package for the detection of spatial clusters of diseases.

Chemistry ChemmineR is a cheminformatics toolkit for analysing small molecules in R.

Genetics Bioconductor provides tools for the analysis of high-throughput genomic data.

Pharmacokinetics The PKfit package can model the half-life and dose absorption of drugs.

Palaeoecology Neotoma provides access to data on pollen, fossil mammals and everything else on the Neotoma palaeoecology database.

Oceanography deSolve is a package for solving differential equations.

Graphics ggplot2 is one of the most popular visualization packages in R.

Phylogeny dendextend compares trees of evolutionary relationships.

Genomics The QuasR package lets researchers quantify and annotate short reads from sequencing experiments.

An R starter kit

● Install R at the Comprehensive R Archive Network. This also provides an introduction to the system.

● Many researchers recommend using a (free) powerful interface called RStudio .

● Among many online tutorials are those provided by DataCamp, rOpenSci, Software Carpentry and R-bloggers.

Change history


04 March 2015

A Correction to this paper has been published: https://doi.org/10.1038/519120a


Related links

Related links in Nature Research

Interactive notebooks: Sharing the code 2014-Nov-05

My digital toolbox: Ecologist Ethan White on interactive notebooks 2014-Sep-30

'Boot camps' teach scientists computing skills 2014-Sep-03

Nature Toolbox

Related external links

The Comprehensive R Archive Network (CRAN)


About this article

Cite this article

Tippmann, S. Programming tools: Adventures with R. Nature 517, 109–110 (2015). https://doi.org/10.1038/517109a


Published: 29 December 2014

Issue Date: 01 January 2015

DOI: https://doi.org/10.1038/517109a




Advanced R Statistical Programming and Data Models

Analysis, Machine Learning, and Visualization

  • © 2019
  • Matt Wiley
  • Joshua F. Wiley

Columbia City, USA


  • Demonstrates applied R programming to make analyses more efficient and effective
  • Shows how to handle machine learning using R
  • Includes case studies throughout book



About this book

  • Conduct advanced analyses in R, including generalized linear models, generalized additive models, mixed-effects models, machine learning, and parallel processing
  • Carry out regression modeling using R: data visualization, linear and advanced regression, additive models, and survival/time-to-event analysis
  • Handle machine learning using R, including parallel processing, dimension reduction, and feature selection and classification
  • Address missing data using multiple imputation in R
  • Work on factor analysis, generalized linear mixed models, and modeling intraindividual variability


Table of contents (13 chapters)

  • Front Matter
  • Univariate Data Visualization
  • Multivariate Data Visualization
  • ML: Introduction
  • ML: Unsupervised
  • ML: Supervised
  • Missing Data
  • GLMMs: Introduction
  • GLMMs: Linear
  • GLMMs: Advanced
  • Modelling IIV
  • Back Matter

All chapters by Matt Wiley and Joshua F. Wiley.

Bibliographic information

Book Title: Advanced R Statistical Programming and Data Models

Book Subtitle: Analysis, Machine Learning, and Visualization

Authors: Matt Wiley, Joshua F. Wiley

DOI: https://doi.org/10.1007/978-1-4842-2872-2

Publisher: Apress Berkeley, CA

eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)

Copyright Information: Matt Wiley and Joshua F. Wiley 2019

Softcover ISBN: 978-1-4842-2871-5, published 21 February 2019

eBook ISBN: 978-1-4842-2872-2, published 20 February 2019

Edition Number: 1

Number of Pages: XX, 638

Number of Illustrations: 80 b/w illustrations, 127 illustrations in colour

Topics: Programming Languages, Compilers, Interpreters; Programming Techniques; Probability and Statistics in Computer Science


Software Engineering and R Programming: A Call for Research

Although R programming has been a part of research since its origins in the 1990s, few studies address scientific software development from a Software Engineering (SE) perspective. The past few years have seen unparalleled growth in the R community, and it is time to push the boundaries of SE research and R programming forwards. This paper discusses relevant studies that close this gap. Additionally, it proposes a set of good practices derived from those findings, aiming to act as a call-to-arms for both the R and RSE (Research SE) communities to explore specific, interdisciplinary paths of research.

1 Introduction

R is a multi-paradigm statistical programming language, based on the S statistical language (Morandat et al. 2012), developed in the 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. It is maintained by the R Development Core Team (Thieme 2018). Though CRAN (the Comprehensive R Archive Network) was created for users to suggest improvements and report bugs, nowadays it is the official venue for submitting user-generated R packages (Ihaka 2017). R has gained popularity for work related to statistical analysis and mathematical modelling, and has been one of the fastest-growing programming languages (Muenchen 2017). In July 2020, R ranked 8th in the TIOBE index, which measures the popularity of programming languages; as a comparison, one year before (July 2019), TIOBE ranked R in the 20th position (TIOBE 2020). According to Korkmaz et al. (2018), “this has led to the development and distribution of over 10,000 packages, each with a specific purpose”. Furthermore, it has a vibrant end-user programming community, where the majority of contributors and core members are “not software engineers by trade, but statisticians and scientists”, with diverse technical backgrounds and application areas (German et al. 2013).

R programming has become an essential part of computational science – the “application of computer science and Software Engineering (SE) principles to solving scientific problems” (Hasselbring et al. 2019). As a result, there are numerous papers discussing R packages explicitly developed to close a particular gap or to assist in the analysis of data in a myriad of disciplines. Regardless of the language used, the development of software to assist in research ventures has been termed ‘research SE’ (RSE) (Cohen et al. 2021). Overall, RSE differs from traditional software development in several respects, such as the lifecycles used, the software goals and life-expectancy, and the requirements elicitation. This type of software is often “constructed for a particular project, and rarely maintained beyond this, leading to rapid decay, and frequent ‘reinvention of the wheel’” (Rosado de Souza et al. 2019).

However, both RSE and SE for R programming remain under-explored, with little SE-specific knowledge tailored to these two areas. This poses several problems, given that in computational science, research software is a central asset. Moreover, although most RSE-ers (the academics writing software for research) come from the research community, only a small number arrive from a professional programming background (Pinto et al. 2018; Cohen et al. 2021). Previous research showed that R programmers do not consider themselves developers (Pinto et al. 2018) and that few of them are aware of the intricacies of the language (Morandat et al. 2012). This is a problem because the lack of formal programming training can lead to lower-quality software (Hasselbring et al. 2019), as well as less-robust software (Vidoni 2021a). Ensuring sustainable development focused on code quality and maintenance is essential for the evolution of research across computational science disciplines, as faulty and low-quality software can potentially affect research results (Cohen et al. 2018).

As a result, this paper aims to provide insights into three core areas:

  • Related works that tackle both RSE and R programming, discussing their goals, motivations, relevancy, and findings. This list was curated through an unstructured review and is, by no means, complete or exhaustive.
  • Organising the findings from those manuscripts into a list of good practices for developers. This is posed as a baseline, aiming to be improved with time, application, and experience.
  • A call-to-arms for RSE and R communities, to explore interdisciplinary paths of research, covering not only empirical SE topics but also further developing the tools available to R programmers.

The rest of this paper is organised as follows. Section 2 presents the related works, introducing them one by one. Section 3 outlines the proposed best practices, and Section 4 concludes this work with a call-to-action for the community.

2 Related Works

This Section discusses relevant works organised in four sub-areas related to software development: coding in R, testing packages, reviewing them, and developers’ experiences.

Area: Coding in R

Code quality is often related to technical debt. Technical Debt (TD) is a metaphor used to reflect the implied cost of additional rework caused by choosing an easy solution in the present, instead of using a better approach that would take longer ( Samarthyam et al. 2017 ) .

Claes et al. ( 2015 ) mined software repositories (MSR) to evaluate the maintainability of R packages published in CRAN. They focused on function clones , which is the practice of duplicating functions from other packages to reduce the number of dependencies; this is often done by copying the code of an external function directly into the package under development or by re-exporting the function under an alias. Code clones are harmful because they lead to redundancy due to code duplication and are a code smell (i.e., a practice that reduces code quality, making maintenance more difficult).

The authors identified that cloning, in CRAN packages only, is often caused by several reasons: coexisting package versions (with some packages’ lines being cloned in the order of the hundreds and thousands), forked packages, packages that are cloned more than others, utility packages (i.e., those that bundle functions from other packages to simplify importing), popular packages (with functions cloned more often than in other packages), and popular functions (specific functions being cloned by a large number of packages).

Moreover, they analysed the cloning trend for packages published in CRAN. They determined that the ratio of packages impacted by cloning appears to be stable but, overall, it represents over a quarter of a million lines of code in CRAN. Quoting the authors, “those lines are included in packages representing around 50% of all code lines in CRAN” (Claes et al. 2015). Relatedly, Korkmaz et al. (2019) found that the more dependencies a package has, the less likely it is to have a higher impact. Likewise, other studies have demonstrated that scantily updated packages that depend on others that are frequently updated are prone to have more errors caused by incompatible dependencies (Plakidas et al. 2017), thus leading developers to clone functions rather than importing them.
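The trade-off behind function cloning can be sketched in a few lines. The cloned utility below is invented for illustration; `stringr::str_trim` is a real exported function used here only as an example of the alternative.

```r
# Option 1: function cloning -- copy the source of a small utility
# into your own package. No dependency is added, but this copy will
# silently miss upstream bug fixes: the code smell discussed above.
trim_whitespace <- function(x) gsub("^\\s+|\\s+$", "", x)

trim_whitespace("  zooplankton  ")   # "zooplankton"

# Option 2: declare the dependency instead, e.g. in the package
# NAMESPACE file:
#   importFrom(stringr, str_trim)
# and call stringr::str_trim(x). Upstream fixes arrive automatically,
# at the cost of one more entry in the dependency tree.
```

Claes et al.'s findings suggest many CRAN authors take option 1 precisely to avoid the update churn that option 2 exposes them to.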

Code quality is also reflected in the comments developers write in their code. The notion of Self-Admitted Technical Debt (SATD) covers the case where programmers are aware that the current implementation is not optimal and write comments alerting others to the problems of the solution (Potdar and Shihab 2014). Vidoni (2021b) conducted a three-part mixed-methods study to understand SATD in R programming, mining over 500 packages publicly available on GitHub and surveying their developers through an anonymous online questionnaire. Overall, this study uncovered that:

  • Slightly more than 1/10th of the comments are actually “commenting out” (i.e., nullifying) functions and large portions of the code. This is a code smell named dead code , which represents functions or pieces of unused code that are never called or reached. It clogs the files, effectively reducing the readability ( Alves et al. 2016 ) .
  • About 3% of the source code comments are SATD, and 40% of those discuss code debt. Moreover, about 5% of this sample discussed algorithm debt , defined as “sub-optimal implementations of algorithm logic in deep learning frameworks. Algorithm debt can pull down the performance of a system” ( Liu et al. 2020 ) .
  • In the survey, developers declared adding SATD as "self reminders" or to "schedule future work", but also responded that they rarely address the SATD they encounter, even if it was added by themselves. This trend is aligned with what happens in traditional object-oriented (OO) software development.

This work extended previous findings obtained exclusively for OO, identifying specific debt instances as developers perceive them. However, a limitation of the findings is that the dataset was manually generated. For the moment, there is no tool or package providing support to detect SATD comments in R programming automatically.
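In the absence of a dedicated detector, a rough first pass can grep package sources for classic debt markers. The sketch below is a baseline only: keyword matching both over- and under-detects compared with the manual classification used in the study, and the marker list is an assumption.

```r
# Minimal SATD keyword scan over the .R files of a package directory.
satd_markers <- "TODO|FIXME|HACK|XXX|workaround"

find_satd <- function(path) {
  files <- list.files(path, pattern = "\\.[Rr]$",
                      recursive = TRUE, full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(satd_markers, lines, ignore.case = TRUE)
    idx <- idx[grepl("#", lines[idx])]   # keep comment lines only
    if (length(idx))
      data.frame(file = f, line = idx, text = trimws(lines[idx]))
  })
  do.call(rbind, hits)                   # one row per flagged comment
}
```

Running `find_satd("R/")` on a package returns the file, line number and text of every candidate comment, which could seed the manual review such a study requires.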

Area: Testing R Packages

Vidoni (2021a) conducted a mixed-methods MSR (Mining Software Repositories) study that combined mining GitHub repositories with a developer survey to study testing technical debt (TTD) in R programming – the test dimension of TD.

Overall, this study determined that the testing of R packages has poor quality, specifically caused by the situations summarised in Table 1. A key finding concerns the types of tests being carried out. When designing test cases, good practice dictates that developers should test common cases (the “traditional” or “most used” path of an algorithm or function) as well as edge cases (values that require special handling, hence assessing the boundary conditions of an algorithm or function) (Daka and Fraser 2014). Nonetheless, this study found that almost four-fifths of the tests cover common cases, and that the vast majority of alternative paths (e.g., those accessible after a condition) are not being assessed.
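The common-versus-edge-case distinction can be made concrete with the widely used testthat package. The function under test below is invented for illustration; the point is the shape of the test suite, not the function itself.

```r
# Common cases versus edge cases, sketched with testthat.
library(testthat)

safe_mean <- function(x) {
  if (length(x) == 0) return(NA_real_)  # boundary: nothing to average
  mean(x, na.rm = TRUE)
}

test_that("common case: a plain numeric vector", {
  expect_equal(safe_mean(c(1, 2, 3)), 2)
})

test_that("edge cases: empty input and missing values", {
  expect_true(is.na(safe_mean(numeric(0))))  # boundary condition
  expect_equal(safe_mean(c(1, NA, 3)), 2)    # alternative path (na.rm)
})
```

The study's finding is that suites like the second block are the exception: the branch guarding empty input, and the `NA` path, are exactly the lines that usually go untested.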

Moreover, this study also determined that the available testing tools are limited in their documentation and the examples they provide (as indicated by survey respondents). This includes the usability of the provided assertions (given that most developers use custom-defined ones) and the lack of tools to automate the initialisation of data for testing, which often causes test suites to fail due to problems in the suite itself.

Table 1: Problems found by Vidoni (2021a) regarding unit testing of R packages.

Inadequate Unit Tests. Definition: the test suite is not ideal to ensure quality testing. Findings: many relevant lines remain untested; alternative paths (i.e., those accessible after a condition) are mostly untested; there is large variability in the coverage of packages from the same area (e.g., biostatistics); developers focus on common cases only, leading to incomplete testing.

Obscure Unit Tests. Definition: when unit tests are obscure, it becomes difficult to understand the unit test code and the production code for which the tests are written. Findings: multiple asserts have unclear messages; multiple asserts are mixed in the same test function; excessive use of user-defined asserts instead of relying on the available tools.

Improper Asserts. Definition: wrong or non-optimal usage of asserts leads to poor testing and debugging. Findings: testing concentrated on common cases; excessive use of custom asserts; developers still uncover bugs in their code even when the tests are passing.

Inexperienced Testers. Definition: testers, and their domain knowledge, are the main strength of exploratory testing; therefore, low tester fitness and non-uniform test accuracy over the whole system accumulate residual defects. Findings: survey participants are reportedly highly experienced, yet their most common challenge was lack of testing knowledge and poor documentation of tools.

Limited Test Execution. Definition: executing or running only a subset of tests to reduce the time required; a shortcut that increases the possibility of residual defects. Findings: a large number of mined packages (about 35%) only used manual testing, with no automated suite; the survey responses confirmed this proportion.

Improper Test Design. Definition: since executing all combinations of test cases is an effort-intensive process, testers often run only known, less problematic tests (i.e., those less prone to make the system fail), increasing the risk of residual defects. Findings: the study found a lack of support for automatically testing plots; the mined packages used functions to generate a plot that was later (manually) inspected by a human to evaluate readability, suitability, and other subjective values; survey results confirmed that developers struggle with plot assessment.

Křikava and Vitek (2018) conducted an MSR study to inspect R packages’ source code, making available a tool that automatically generates unit tests. In particular, they identified several testing challenges caused by the language itself, namely its extreme dynamism, coerciveness, and lack of types, which hamper the efficacy of traditional test-extraction techniques.

In particular, the authors worked with execution traces , “the sequence of operations performed by a program for a given set of input values” ( Křikava and Vitek 2018 ) , to provide genthat , a package to optimise the unit testing of a target package ( Krikava 2018 ) . genthat records the execution traces of a target package, allowing the extraction of unit test functions; however, this is limited to the public interface or the internal implementation of the target package. Overall, its process requires installation, extraction, tracing, checking and minimisation.

Both genthat and the study performed by these authors are highly valuable to the community: the minimisation phase checks the unit tests, discards those that fail, and records coverage, eliminating redundant test cases. Although this is not a solution to the lack of edge cases detected in another study (Vidoni 2021a), genthat assists developers and can potentially reduce the workload required to obtain a baseline test suite. However, this work’s main limitation is its emphasis on the coverage measure, which is not an accurate reflection of test quality. Finally, Russell et al. (2019) focused on the maintainability of R packages with respect to their testing and performance. The authors conducted an MSR study of 13,500 CRAN packages, demonstrating that “reproducible and replicable software tests are frequently not available”. This is aligned with the findings of the other authors mentioned in this Section. They concluded with recommendations to improve the long-term maintenance of a package in terms of testing and optimisation, reviewed in Section 3.
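Part of why coverage gets over-weighted is that it is trivial to measure. Assuming the covr package (a common choice for R coverage, though not the tool used in the studies above), a package author can obtain a headline number in three calls:

```r
# Measuring test coverage of an R package with covr.
library(covr)

cov <- package_coverage(".")  # run the package's tests, tracking lines hit
percent_coverage(cov)         # the single headline percentage
zero_coverage(cov)            # the lines never executed by any test
```

None of these numbers says whether edge cases or alternative paths were exercised, which is precisely the gap between high coverage and the test-quality problems listed in Table 1.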

Area: Reviewing Packages

The increased relevance of software in data science, statistics and research has heightened the need for reproducible, quality-coded software (Howison and Herbsleb 2011). Several community-led organisations were created to organise and review packages, among them rOpenSci (Ram et al. 2019; rOpenSci et al. 2021) and BioConductor (Gentleman et al. 2004). In particular, rOpenSci has established a thorough peer-review process for R packages based on the intersection of academic peer review and software review.

As a result, Codabux et al. (2021) studied rOpenSci's open peer-review process. They extracted reviews of completed and accepted packages, broke them down into individual comments, and used a card-sorting approach to determine which types of TD were most commonly discussed.

One of their main contributions is a taxonomy of TD extending current definitions to R programming. It also groups debt types by perspective, representing "who is most affected by a type of debt". They also provided examples of rOpenSci peer-review comments referring to each specific debt. This taxonomy, including recapped definitions, is summarised in Table 2.

Table 2: Taxonomy of TD types and perspectives for R packages, proposed by Codabux et al. (2021).

User perspective:
  • Usability: in the context of R, usability debt encompasses anything related to usability, interfaces, visualisation and so on.
  • Documentation: for R, this is anything related to documentation tooling (or alternatives such as LaTeX or Markdown generation), readme files, vignettes and even websites.
  • Requirements: refers to trade-offs made concerning which requirements the development team implements, and how.

Developer perspective:
  • Test: in the context of R, test debt encompasses anything related to coverage, unit testing, and test automation.
  • Defect: refers to known defects, usually identified by testing activities or by users and reported on bug-tracking systems.
  • Design: for R, this debt is related to any OO feature, including visibility, internal functions, the triple-colon operator, placement of functions in files and folders, handling of imports, returns of objects, and so on.
  • Code: in the context of R, examples of code debt are anything related to renaming classes and functions, `<-` vs. `=`, parameters and arguments in functions, FALSE/TRUE vs. F/T, print vs. warning/message.

CRAN perspective:
  • Build: in the context of R, examples of build debt are anything related to Travis, Codecov.io, GitHub Actions, CI, AppVeyor, CRAN, and CMD checks.
  • Versioning: refers to problems in source-code versioning, such as unnecessary code forks.
  • Architecture: for example, violation of modularity, which can affect architectural requirements (e.g., performance, robustness).
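Two of the code-debt items in the taxonomy (FALSE/TRUE vs. F/T, and print vs. warning/message) can be demonstrated with a short base-R sketch; the function `f` below is a hypothetical illustration, not an example from the cited study:

```r
# TRUE and FALSE are reserved words in R, but T and F are ordinary
# variables that merely default to TRUE and FALSE, so relying on them
# is fragile:
isTRUE(T)    # TRUE in a fresh session
T <- 0       # perfectly legal: T is just a global variable
isTRUE(T)    # now FALSE; any code that used T as "true" silently breaks
# (by contrast, TRUE <- 0 is an error, because TRUE is reserved)

# Similarly, print() is only for displaying values, whereas message()
# and warning() raise conditions that callers can trap or suppress:
f <- function() { message("fitting model ..."); 42 }
suppressMessages(f())  # returns 42 with no console noise
```

This is why reviewers flag `F`/`T` and `print()` diagnostics as debt: the safer spellings cost nothing and remove a class of silent failures.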

Additionally, they uncovered that almost one-third of the debt discussed is documentation debt, related to how well packages are documented. This was followed by code debt, a distribution different from the one obtained by Vidoni (2021b). The difference arises because rOpenSci reviewers focus on documentation (comments written by reviewers account for most of the documentation debt), while developers' comments concentrate on code debt. The entire classification process is detailed in the original study (Codabux et al. 2021).

Area: Developers’ Experiences

Developers’ perspectives on their work are fundamental to understanding how they develop software. However, scientific software developers have a different point of view from that of ‘traditional’ programmers (Howison and Herbsleb 2011).

Pinto et al. (2018) used an online questionnaire to survey over 1500 R developers, enriching the results with metadata extracted from GitHub profiles (provided by the respondents in their answers). Overall, they found that scientific developers are primarily self-taught but still consider peer learning a valuable secondary source. Interestingly, the participants did not perceive themselves as programmers, but rather as members of their respective disciplines. This aligns with findings from other works (Morandat et al. 2012; German et al. 2013). Though understandable, such a perception may pose a risk to the development of quality software, as developers may feel ‘justified’ in not following good coding practices (Pinto et al. 2018).

Additionally, this study found that scientific developers work alone or in small teams (up to five people). They also found that respondents spend a significant amount of time on coding and testing, elicit requirements ad hoc, and mostly ‘decide by themselves’ what to work on next rather than following any development lifecycle.

When asked about commonly faced challenges, the participants cited: cross-platform compatibility; poor documentation (a central topic for reviewers (Codabux et al. 2021)); interruptions while coding; lack of time (also mentioned by developers in another study (Vidoni 2021b)); scope bloat; lack of user feedback (related to validation rather than verification testing); and the lack of a formal reward system (e.g., the work is not credited in the scientific community (Howison and Herbsleb 2011)).

Table 3: Recommendations of best practices, according to the issues found in previous work and good practices established in the SE community.

  • Lifecycles: The lack of proper requirements elicitation and development organisation was identified as a critical problem for developers, who often resort to writing comments in the source code to remind themselves of tasks they later do not address. Extremely lightweight agile lifecycles (e.g., Extreme Programming, Crystal Clear, Kanban) can be adapted for a single developer or small groups; using them provides a project-management framework that can also organise a research project that depends on creating scientific software.
  • Teaching: Most scientific developers do not perceive themselves as programmers and are self-taught. This limits their background knowledge and the tools they have available to detect TD and other problems, potentially leading to low-quality code. Since graduate school is considered fundamental for these developers, providing a solid foundation of SE-oriented R programming for candidates whose research relies heavily on software can prove beneficial. The topics taught should be carefully selected to keep them practical and relevant yet still valuable for the candidates.
  • Coding: Problems discussed included function clones, incorrect imports, non-semantic or meaningless names, and improper visibility or file distribution of functions, among others. Avoid duplicating (i.e., copy-pasting or re-exporting) functions from other packages; use proper selective imports instead. Avoid leaving unused functions or ‘commented-out’ code segments; proper use of version control enables developers to remove them and revisit them through previous commits. Code comments are meant to be meaningful and should not be used as a planning tool; comments indicating problems or errors should be addressed (when found, if the problem is small, or at a specifically planned time if it is significant). Names should be semantic and meaningful, maintaining consistency across the whole project. Though there is no pre-established naming convention for R, previous works provide an overview, as do packages and style guides.
  • Testing: Current tests leave many relevant paths unexplored, often ignoring edge cases and damaging the robustness of the packaged code. All alternative paths should be tested (e.g., those limited by conditionals). Exceptional cases should be tested as well, e.g., evaluating that a function throws an exception or error when it should, and evaluating cases such as (but not limited to) nulls, missing and non-finite values, warnings, large numbers, empty strings, and empty variables. Other specific testing cases, including performance evaluation and profiling, are discussed and exemplified by Russell et al. (2019).
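A minimal sketch of the edge-case testing recommended above, using only base R (testthat's expect_error and expect_warning are the idiomatic packaged equivalents). The function `safe_div` is a hypothetical function invented for illustration:

```r
# Hypothetical function under test: division that must error on
# non-numeric input and warn on division by zero.
safe_div <- function(x, y) {
  if (!is.numeric(x) || !is.numeric(y)) stop("inputs must be numeric")
  if (any(y == 0, na.rm = TRUE)) warning("division by zero")
  x / y
}

# Happy path:
stopifnot(safe_div(6, 3) == 2)

# Exceptional paths -- the cases surveyed test suites tend to skip.
# 1. The function should raise the documented error:
err <- tryCatch(safe_div("a", 1), error = function(e) conditionMessage(e))
stopifnot(err == "inputs must be numeric")

# 2. The function should warn, then still return a value:
wrn <- withCallingHandlers(
  safe_div(1, 0),
  warning = function(w) {
    stopifnot(conditionMessage(w) == "division by zero")
    invokeRestart("muffleWarning")
  })
stopifnot(is.infinite(wrn))

# 3. Edge values: NA propagates; empty input yields empty output:
stopifnot(is.na(safe_div(NA_real_, 2)))
stopifnot(length(safe_div(numeric(0), numeric(0))) == 0)
```

Each assertion exercises a path a coverage-driven suite might never visit: the error branch, the warning branch, and the degenerate inputs.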

This study ( Pinto et al. 2018 ) was followed up to create a taxonomy of problems commonly faced by scientific developers ( Wiese et al. 2020 ) . They worked with over 2100 qualitatively-reported problems and grouped them into three axes; given the size of their taxonomy, only the larger groups are summarised below:

  • Technical Problems: representing almost two-thirds of the problems faced. These relate to software design and construction, software testing and debugging, software maintenance and evolution, software requirements and management, software build and release engineering, software tooling, and others (e.g., licensing, CRAN-related issues, user interfaces).
  • Social-Related Problems: representing a quarter of the problems faced by developers. The main groups are publicity, lack of support, lack of time, emotional issues, and communication and collaboration.
  • Scientific-Related Problems: the smallest category, related to the science supporting or motivating the development. The main groups are scope, background, reproducibility and data handling, with the latter being the most important.

These two works provide valuable insight into scientific software developers. As with other works mentioned in this article, although there are similarities with traditional software development (both in terms of programming paradigms and goals), the differences are notable enough to warrant further specialised investigation.

3 Towards Best Practices

Based on well-known practices for traditional software development (Sommerville 2015), this Section outlines a proposal of best practices for R developers, targeting the weaknesses found in the studies discussed in Section 2. The list provides a baseline that, through future research, can be improved and further tailored to the needs of scientific software development and of the R community itself.

The practices discussed span from overarching (e.g., related to processes) to specific activities. They are summarised in Table 3 .

4 Call to Action

Scientific software and R programming have become ubiquitous in numerous disciplines, providing essential analysis tools without which many studies could not be completed. Although R developers reportedly struggle in several areas, academic literature centred on the development of scientific software is scarce. As a result, this Section provides two calls to action: one for R users and another for RSE academics.

Research Software Engineering Call: SE for data science and scientific software development is crucial for advancing research outcomes. As a result, interdisciplinary works are increasingly needed to approach specific areas. Some suggested topics to kickstart this research are as follows:

  • Lifecycles and methodologies for project management. Current methodologies focus on the demands of projects with clear stakeholders and teams of traditional developers. As suggested in Section 3, many agile methodologies are suitable for smaller teams or even single-person developments. Studying this and evaluating its application in practice could prove highly valuable.
  • Specific debts in scientific software. Previous studies highlighted the existence of specific types of debt that are not often present in traditional software development (e.g., algorithm and reproducibility) ( Liu et al. 2020 ) and are therefore not part of currently accepted taxonomies ( Potdar and Shihab 2014 ; Alves et al. 2016 ) . Thus, exploring these specific problems can help detect uncovered problems, providing viable paths of actions and frameworks for programmers.
  • Distinct testing approaches. R is an inherently different programming paradigm, and current guidance for testing was developed for the OO paradigm. As a result, more studies are needed to tackle specific issues that may arise, such as how to test visualisations or scripts (Vidoni 2021a), and how to move beyond coverage by providing tests that are optimal yet meaningful (Křikava and Vitek 2018).

R Community Call: The following suggestions are centred on the abilities of the R community:

  • Several packages remain under-developed, reportedly providing incomplete tools, not only in the functionalities they offer but also in their documentation and examples. For instance, developers disclosed that the lack of specific examples was a major barrier to proper testing (Vidoni 2021a). Extending the examples available in current packages could be achieved through community calls, leveraging the reach of community groups such as R-Ladies and RUGs (R User Groups). Note that this suggestion does not concern package development guides but rather a community-sourced improvement of the documentation of existing packages.
  • Additionally, incorporating courses in graduate school curricula that focus on “SE for Data Science” would be beneficial for the students, as reported in other works ( Pinto et al. 2018 ; Wiese et al. 2020 ) . However, this can only be achieved through interdisciplinary work that merges specific areas of interest with RSE academics and educators alike. Once more, streamlined versions of these workshops could be replicated in different community groups.

There is a wide range of possibilities and areas in which to work, all derived from diversifying R programming and RSE. This paper highlighted meaningful work in this area and proposed a call to action to further this line of research. These ideas will, however, need to be repeatedly evaluated and refined to be valuable to R users.

Acknowledgements

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The author is grateful to both R-Ladies and rOpenSci communities that fostered the interest in this topic and to Prof. Dianne Cook for extending the invitation for this article.

5 Packages Mentioned

The following packages were mentioned in this article:

  • covr , for package coverage evaluation. Mentioned by ( Křikava and Vitek 2018 ) and Codabux et al. ( 2021 ) . Available at: https://cran.r-project.org/web/packages/covr/index.html .
  • genthat , developed by Křikava and Vitek ( 2018 ) , to optimise testing suites. Available at https://github.com/PRL-PRG/genthat .
  • pkgdown for package documentation. Mentioned by ( Codabux et al. 2021 ) as part of documentation debt. Available at: https://cran.r-project.org/web/packages/pkgdown/index.html .
  • roxygen2 , for package documentation. Recommended in Section 3 , and mentioned as examples of design and documentation debt by ( Codabux et al. 2021 ) . Available at https://cran.r-project.org/web/packages/roxygen2/index.html .
  • testthat , most used testing tool, according to findings by Vidoni ( 2021a ) . Mentioned when discussing testing debt by ( Codabux et al. 2021 ) . Available at https://cran.r-project.org/web/packages/testthat/index.html .
  • tidyverse , bundling a large number of packages and providing a style guide. Mentioned in Section 3 . Available at: https://cran.r-project.org/web/packages/tidyverse/index.html .


R Programming for Research

Colorado State University, ERHS 535

Brooke Anderson, Rachel Severson, and Nicholas Good

Online course book, ERHS 535

This is the online book for Colorado State University’s R Programming for Research courses (ERHS 535, ERHS 581A3, and ERHS 581A4).

This book includes course information, course notes, links to download pdfs of lecture slides, in-course exercises, homework assignments, and vocabulary lists for quizzes for this course.

“Give someone a program, you frustrate them for a day; teach them how to program, you frustrate them for a lifetime.” —David Leinweber

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License .

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

An academic programming language paper about R

Posted on April 27, 2012 by Derek-Jones in R bloggers | 0 Comments

[This article was first published on The Shape of Code » R , and kindly contributed to R-bloggers .]

The R language has passed another milestone: a paper aimed at the academic programming language community (or at least one section of it) has been written about it, Evaluating the Design of the R Language by Morandat, Hill, Osvald and Vitek. Hardly earth-shattering news, but it may have some impact on how R is viewed by non-users of the language (the many R users in finance probably don’t care that R seems to have been labeled as the language for doing statistics ). The paper is well written and contains some very interesting information as well as a few mistakes, although it will probably read like gobbledygook to anybody not familiar with academic programming language research. What follows has something of the form of an R user’s guide to reading this paper, plus some commentary.

The paper has roughly three parts: the first gives an overview of R, the second is a formal definition of a subset of the language, and the third an initial report of an analysis of R usage. For me, and I imagine you, dear reader, the really interesting stuff is in the third section.

What is a formal description of a subset of R (i.e., done purely using mathematics) doing in the second part? Well, until recently very little academic software engineering was empirically based and was populated by people I would classify as failed mathematicians without the common sense to be engineers. Things are starting to change but research that measures things, particularly people, is still regarded as not being respectable in some quarters. In this case the formal definition is playing the role of a virility symbol showing that the authors are obviously regular guys who happen to be indulging in a bit of empirical research.

A surprising number of papers measuring the usage of real software contain formal definitions of a subset of the language being measured. Subsets are used because handling the complete language is a big project that usually involves one or more people getting a PhD out of the work. The subset chosen has to look plausible to readers who understand the mathematics but not the programming language, broadly handling all the major constructs without getting into all the fiddly details that need years of work and many pages to describe.

The third part contains the real research, which is really about one implementation of R and the characteristics of R source in the CRAN and Bioconductor repositories, and contains lots of interesting information. Note: the authors are incorrect to aim nearly all of the criticisms in this subsection at R; these really apply to the current implementation of R and might not apply to a different implementation.

In a previous post I suggested some possibilities for speeding up the execution of R programs that depended on R usage characteristics. The Morandat paper goes a long way towards providing numbers for some of these usage characteristics (e.g., 37% of function parameters are assigned to and 36% of vectors contain a single value).

What do we learn from this first batch of measurements? R users rarely use many of the more complicated features (e.g., object oriented constructs {and this paper has been accepted at the European Conference on Object-Oriented Programming}), a result usually seen for other languages. I was a bit surprised that R programs were only 40% smaller than equivalent C programs. I think part of the reason is that some of the problems used for benchmarking are not the kind that would usually be solved using R and I did not see any ‘typical’ R programs being coded up in C for comparison, another possibility is that the authors were not thinking in R when writing the code.

One big measurement topic the authors missed is comparing their general findings with usage measurements of other languages. I think they will find lots of similar patterns of usage.

The complaint that R has been defined by the successive releases of its only implementation, rather than a written specification, applies to all widely used languages, at least in their early days. Back in the day, a major reason for creating language standards for Pascal and then C was so that other implementations could be created; the handful of major languages whose specification was written before the first implementation (e.g., PL/1, Ada) have died out or are dying out. Are multiple implementations needed in an Open Source world? The answer seems to be no for Perl and yes for PHP, Ruby, etc. The effort needed to create a written specification for the R language might be better invested in improving the efficiency of the current implementation so that a better alternative is not needed.

Needless to say the authors suggested committing the fatal programming language research mistake .

The authors have created an interesting set of tools for static and dynamic analysis of R and I look forward to reading more about their findings in future papers.




Conducting Simulation Studies in the R Programming Environment

Kevin A. Hallgren

University of New Mexico, Department of Psychology


Simulation studies allow researchers to answer specific questions about data analysis, statistical power, and best practices for obtaining accurate results in empirical research. Despite the benefits that simulation research can provide, many researchers are unfamiliar with available tools for conducting their own simulation studies. The use of simulation studies need not be restricted to researchers with advanced skills in statistics and computer programming, and such methods can be implemented by researchers with a variety of abilities and interests. The present paper provides an introduction to methods used for running simulation studies using the R statistical programming environment and is written for individuals with minimal experience running simulation studies or using R. The paper describes the rationale and benefits of using simulations and introduces R functions relevant for many simulation studies. Three examples illustrate different applications for simulation studies, including (a) the use of simulations to answer a novel question about statistical analysis, (b) the use of simulations to estimate statistical power, and (c) the use of simulations to obtain confidence intervals of parameter estimates through bootstrapping. Results and fully annotated syntax from these examples are provided.

Introduction

Simulations provide a powerful technique for answering a broad set of methodological and theoretical questions and provide a flexible framework to answer specific questions relevant to one’s own research. For example, simulations can evaluate the robustness of a statistical procedure under ideal and non-ideal conditions, and can identify strengths (e.g., accuracy of parameter estimates) and weaknesses (e.g., type-I and type-II error rates) of competing approaches for hypothesis testing. Simulations can be used to estimate the statistical power of many models that cannot be estimated directly through power tables and other classical methods (e.g., mediation analyses, hierarchical linear models, structural equation models, etc.). The procedures used for simulation studies are also at the heart of bootstrapping methods, which use resampling procedures to obtain empirical estimates of sampling distributions, confidence intervals, and p-values when a parameter sampling distribution is non-normal or unknown.
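As a small taste of the bootstrapping idea described above, the following base-R sketch computes a 95% percentile confidence interval for a mean. The data-generating choices (an exponential sample of 30 observations, 2000 resamples) are illustrative, not prescriptions:

```r
# Minimal bootstrap sketch: a 95% percentile confidence interval for
# the mean of a small, skewed sample.
set.seed(1)
x <- rexp(30, rate = 1)          # observed sample (illustrative data)

B <- 2000                        # number of bootstrap resamples
# Resample x with replacement B times, computing the mean each time,
# to obtain an empirical sampling distribution of the mean:
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# The 2.5th and 97.5th percentiles of that distribution form the
# percentile bootstrap confidence interval:
ci <- quantile(boot_means, c(0.025, 0.975))
print(ci)
```

Example 3 later in the paper applies this same resampling logic to a mediation model's indirect effect, where the sampling distribution is known to be non-normal.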

The current paper will provide an overview of the procedures involved in designing and implementing basic simulation studies in the R statistical programming environment ( R Development Core Team, 2011 ). The paper will first outline the logic and steps that are included in simulation studies. Then, it will briefly introduce R syntax that helps facilitate the use of simulations. Three examples will be introduced to show the logic and procedures involved in implementing simulation studies, with fully annotated R syntax and brief discussions of the results provided. The examples will target three different uses of simulation studies, including

  • Using simulations to answer a novel statistical question
  • Using simulations to estimate the statistical power of a model
  • Using bootstrapping to obtain a 95% confidence interval of a model parameter estimate

For demonstrative purposes, these examples will achieve their respective goals within the context of mediation models. Specifically, Example 1 will answer a novel statistical question about mediation model specification, Example 2 will estimate the statistical power of a mediation model, and Example 3 will bootstrap confidence intervals for testing the significance of an indirect effect in a mediation model. Despite the specificity of these example applications, the goal of the present paper is to provide the reader with an entry-level understanding of methods for conducting simulation studies in R that can be applied to a variety of statistical models unrelated to mediation analysis.

Rationale for Simulation Studies

Although many statistical questions can be answered directly through mathematical analysis rather than simulations, the complexity of some statistical questions makes them more easily answered through simulation methods. In these cases, simulations may be used to generate datasets that conform to a set of known properties (e.g., mean, standard deviation, degree of zero-inflation, ceiling effects, etc. are specified by the researcher) and the accuracy of the model-computed parameter estimates may be compared to their specified values to determine how adequately the model performs under the specified conditions. Because several methods may be available for analyzing datasets with these characteristics, the suitability of these different methods could also be tested using simulations to determine if some methods offer greater accuracy than others (e.g., Estabrook, Grimm, & Bowles, 2012 ; Luh & Guo, 1999 ).

Simulation studies typically are designed according to the following steps to ensure that the simulation study can be informative to the researcher’s question:

  • A set of assumptions about the nature and parameters of a dataset are specified.
  • A dataset is generated according to these assumptions.
  • Statistical analyses of interest are performed on this dataset, and the parameter estimates of interest from these analyses (e.g., model coefficient estimates, fit indices, p-values, etc.) are retained.
  • Steps 2 and 3 are repeated many times with many newly generated datasets (e.g., 1000 datasets) in order to obtain an empirical distribution of parameter estimates.
  • Often, the assumptions specified in step 1 are modified and steps 2–4 are repeated for datasets generated according to new parameters or assumptions.
  • The obtained distributions of parameter estimates from these simulated datasets are analyzed to evaluate the question of interest.
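The steps above can be sketched in a few lines of base R, here using the illustrative question of how well lm() recovers a known regression slope (all parameter values are arbitrary choices for the example, not recommendations):

```r
set.seed(42)
n_reps  <- 1000     # step 4: number of simulated datasets
n_obs   <- 50
true_b1 <- 0.5      # step 1: assumed population slope

slopes <- replicate(n_reps, {
  x <- rnorm(n_obs)                  # step 2: generate a dataset
  y <- true_b1 * x + rnorm(n_obs)    #         according to the assumptions
  coef(lm(y ~ x))[["x"]]             # step 3: fit the model, retain the estimate
})

# step 6: examine the empirical distribution of the estimates
mean(slopes)   # should be close to 0.5 (the estimator is unbiased)
sd(slopes)     # empirical standard error of the slope
```

Step 5 would simply wrap this in an outer loop over different values of true_b1, n_obs, or error variance.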

The R Statistical Programming Environment

The R statistical programming environment ( R Development Core Team, 2011 ) provides an ideal platform to conduct simulation studies. R includes the ability to fit a variety of statistical models natively, includes sophisticated procedures for data plotting, and has over 3000 add-on packages that allow for additional modeling and plotting techniques. R also allows researchers to incorporate features common in most programming languages such as loops, random number generators, conditional (if-then) logic, branching, and reading and writing of data, all of which facilitate the generation and analysis of data over many repetitions that is required for many simulation studies. R also is free, open source, and may be run across a variety of operating systems.

Several existing add-on packages already allow R users to conduct simulation studies, but typically these are designed for running simulations for a specific type of model or application. For example, the simsem package provides functions for simulating structural equation models ( Pornprasertmanit, Miller, & Schoemann, 2012 ), ergm includes functions for simulating social network exponential random graphs ( Handcock et al., 2012 ), mirt allows users to simulate multivariate-normal data for item response theory ( Chalmers, 2012 ), and the simulate function in the native stats package allows users to simulate fitted general linear models and generalized linear models. It should be noted that many simulation studies can be conducted efficiently using these pre-existing functions, and that using the alternative, more general method for running simulation studies described here may not always be necessary. However, the current paper will describe a set of general methods and functions that can be used in a variety of simulation studies, rather than describing the methods for simulating specific types of models already developed in other packages.
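For instance, the native simulate function mentioned above can be applied to a fitted linear model in one line; the toy data below are illustrative:

```r
# Draw new response vectors from a fitted model with stats::simulate().
set.seed(7)
d   <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
fit <- lm(y ~ x, data = d)

sims <- simulate(fit, nsim = 5)  # data frame: one simulated y per column
dim(sims)                        # 20 rows, 5 columns
```

Each column of sims is a response vector drawn from the fitted model's estimated parameters, which is exactly the "generate a dataset under known assumptions" step of a simulation study, with the assumptions taken from a fitted model rather than specified by hand.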

R is syntax-driven, which can create an initial hurdle that prevents many researchers from using it. While the learning curve for syntax-driven statistical languages may be steep initially, many people with little or no prior programming experience have become comfortable using R. Also, such a syntax-driven platform allows for much of the program’s flexibility described above.

The simulations used in the following tutorials utilize several basic R functions, with a rationale for their use provided below and brief descriptions with examples given in Table 1. A full tutorial on these basic functions and on using R in general is not given here; instead, the reader is referred to several open-source tutorials introducing R (Kabacoff, 2012; Owen, 2010; Spector, 2004; Venables, Smith, & R Development Core Team, 2012). Some commands that serve a secondary function and are not directly related to generating or analyzing simulation data (e.g., the write.table command for saving a dataset) are not discussed here, but descriptions of such functions are included in the annotated syntax examples in the appendices. More information about each of the functions used in this tutorial can be obtained from the help files included in R or by entering ?<command> in the R command line (e.g., enter ?c to get more information about the c command).

Table 1. Common R commands for simulation studies.

Commands for working with vectors

 c: Combines arguments to make vectors
  #create vector called a which contains the values 3, 5, 4
  a = c(3,5,4)
  #identical to above, but uses <- instead of =
  a <- c(3,5,4)
  #return the second element in vector a, which is 5
  a[2]
  #remove the contents previously stored in vector a
  a = NULL

 length: Returns the length of a vector
  #return length of vector a, which is 3
  a = c(3,5,4)
  length(a)

 rbind and cbind: Combine arguments by rows or columns
  #create matrix d that has vector a as row 1 and vector b as row 2
  a = c(3,5,4)
  b = c(9,8,7)
  d = rbind(a,b)
  #create matrix e that has two copies of matrix d joined by column
  e = cbind(d,d)

Commands for generating random values

 rnorm: Randomly samples values from a normal distribution with a given population mean and standard deviation
  #randomly sample 100 values from a normal distribution with population mean = 50 and SD = 10
  x = rnorm(100, 50, 10)

 sample: Randomly samples values from another vector
  #randomly sample 8 values from vector a, with replacement
  a = c(1,2,3,4,5,6,7,8)
  sample(a, size=8, replace=TRUE)
  #e.g., returns 3 1 3 6 5 4 2 2

 set.seed: Allows exact replication of randomly-generated numbers between simulations
  #the same 5 random numbers are returned each time the following lines are run
  set.seed(12345)
  rnorm(5, 50, 10)

Command for statistical modeling

 lm: Fits linear ordinary least squares models
  #regress y onto x1 and x2
  y = c(2,2,5,4,3,6,4,6,5,7)
  x1 = c(1,2,3,1,1,2,3,1,2,2)
  x2 = c(0,0,0,0,0,1,1,1,1,1)
  mymodel = lm(y ~ x1 + x2)
  summary(mymodel)
  #retrieve fixed effect coefficients from an lm object
  mymodel$coefficients

Commands for programming

 function: Generates a customized function
  #function that returns the sum of x1 and x2
  myfunction = function(x1, x2){
   mysum = x1 + x2
   return(mysum)
  }

 for: Creates a loop, allowing sequences of commands to be executed a specified number of times
  #create vector of empirical sample means (stored as mean_vector) from 100 random samples of size = 20, sampled from a population with mean = 50 and SD = 10
  mean_vector = NULL
  for (i in 1:100){
   x = rnorm(20, 50, 10)
   m = mean(x)
   mean_vector = c(mean_vector, m)
  }

Note: Text appearing after the # symbol is not processed by R and is typically reserved for comments and annotation. This list of commands is not exhaustive.

R is an object-oriented program that works with data structures such as vectors and data frames. Vectors are one of the simplest data structures and contain an ordered list of values. Vectors will be used throughout the examples described in this tutorial to store values for variables in simulated datasets and to store parameter estimates that are retained from statistical analyses (e.g., p-values and parameter point estimates). The examples here will make extensive use of commands for generating, indexing, and combining vectors, including the c command for generating and combining vectors, the length command for obtaining the number of items in a vector, and the rbind and cbind commands for combining vectors by row or column, respectively.

Two functions for creating random numbers, rnorm and sample, will be used in the simulation examples in this paper in order to generate values for random variables or to sample subsets of observations from an existing dataset, respectively. An additional function for setting the randomization seed, set.seed, is useful for generating the same sets of random numbers each time a simulation study is run, allowing exact replications of results.

Statistical models in these tutorials will be fit using the lm command, which fits linear regression, analysis of variance, and analysis of covariance models (note, however, that many additional native and add-on R packages can fit models outside of the general linear model framework). The lm command returns an object with information about the fitted linear model, which may be accessed through additional commands. For example, the fixed effect coefficients for the lm object called mymodel shown in Table 1 (under the lm command) can be extracted by calling for the coefficients values of mymodel and saving them to a vector f, which then contains the regression coefficients for the intercept and the effects of x1 and x2 in predicting y from the data in Table 1.

Specific fixed effects can be further extracted by indexing values from vector f; for example, the command f[2] extracts the second value in vector f, which is the fixed effect coefficient for x1.
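The omitted extraction syntax can be sketched as follows (reconstructed from the description; the data and model are those shown under the lm command in Table 1):

```r
# Data from Table 1 (under the lm command)
y  = c(2,2,5,4,3,6,4,6,5,7)
x1 = c(1,2,3,1,1,2,3,1,2,2)
x2 = c(0,0,0,0,0,1,1,1,1,1)
mymodel = lm(y ~ x1 + x2)

# Save the fixed effect coefficients to vector f
f = mymodel$coefficients
f
# (Intercept)          x1          x2
#   3.0769231   0.0769231   2.3692308

# Index the second value in f, the coefficient for x1
f[2]
```

Because the coefficients are stored in a named vector, single effects can be pulled out by position (as with f[2]) or by name (f["x1"]).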

The function command allows users to generate their own customized functions, which provides a useful way of reducing syntax when a procedure is repeated many times. For example, the first tutorial below computes several Sobel statistics each time a dataset is generated, and declaring a function that computes the Sobel statistic allows the program to call on one function each time the statistic must be computed, rather than repeating several lines of the same syntax within the simulation. The for command is used to create loops, which allow sequences of commands that are specified once to be executed several times. This is useful in simulation studies because datasets often must be generated and analyzed hundreds or thousands of times.

This section will outline examples of questions that may be answered using simulation studies and describes the methods used to answer those questions. In each example, the underlying assumptions and procedures for generating and analyzing data will be discussed, and fully annotated syntax for the simulations will be provided as appendices.

Example 1: Answering a Novel Question about Mediation Analysis

Mediation analysis is a statistical technique for analyzing whether the effect of an independent variable ( X ) on an outcome variable ( Y ) can be accounted for by an intermediate variable ( M ; see Figure 1 for graphical depiction; see Hayes 2009 for pedagogical review). When mediation is present, the degree to which X predicts Y is changed when M is added to the model in the manner shown in Figure 1 (i.e., c – c ′ ≠ 0 in Figure 1 ). The degree to which the relationship between X and Y changes ( c – c ′) is called the indirect effect, which is mathematically equivalent to the product of the path coefficients ab shown in Figure 1 . The product of path coefficients ab (or equivalently, c – c ′) represents the amount of change in outcome variable Y that can be attributed to being caused by changes in the independent variable X operating through the mediating variable M . In situations where a mediator variable cannot be directly manipulated through experimentation, mediation analysis has often been championed as a method of choice for identifying variables that may cause an observed outcome ( Y ) as part of a causal sequence where X affects M , and M in turn affects Y .

Figure 1. Direct effect model (top) and mediation model (bottom).

For example, in psychotherapy research, the number of times participants receive drink-refusal training ( X ) may impact their self-efficacy to refuse drinks ( M ), and enhanced self-efficacy may in turn cause improved abstinence from alcohol ( Y ; e.g., Witkiewitz, Donovan, & Hartzler, 2012 ). Self-efficacy cannot be directly manipulated by experiment, so researchers may use mediation analysis to test whether a particular psychotherapy increases self-efficacy, and whether this in turn increases abstinence outcomes. However, little research has identified the consequences of wrongly specifying which variables are mediator variables ( M ) versus outcome variables ( Y ). For example, it could also be possible that drink-refusal training ( X ) enhances abstinence from alcohol ( Y ), which in turn enhances self-efficacy ( M; e.g., X causes Y, Y causes M ). Support for this alternative model would guide treatment providers and subsequent research efforts toward different goals than the original model, and therefore it is important to know whether mediation models are likely to produce significant results even when the true causal order of effects is incorrectly specified by investigators.

The present example uses simulations to test whether mediation models produce significant results when the implied causal ordering of effects is switched within the tested model. Data are generated for three variables, X, M, and Y, such that M mediates the relationship between X and Y ("X-M-Y" model), using ordinary least-squares (OLS) regression. Path coefficients for a (X predicting M; see Figure 1) and b (M predicting Y, controlling for X) will each be manipulated at three levels (−0.3, 0.0, 0.3), c′ (X predicting Y, controlling for M) will be manipulated at three levels (−0.2, 0.0, 0.2), and sample size (N) will be manipulated at two levels (100, 300). This results in a 3 (a) × 3 (b) × 3 (c′) × 2 (N) design. One thousand simulated datasets will be generated in each condition. Data will be generated for an X-M-Y model, and mediation tests will be conducted on the original X-M-Y models and on models that switch the order of the M and Y variables (i.e., X-Y-M models). The Sobel test (MacKinnon, Warsi, & Dwyer, 1995; Sobel, 1982) will be computed and retained for each type of mediation model, with p < 0.05 indicating significant mediation for that particular model.

Assumptions about the nature and properties of a dataset

Data in this example are generated in accordance with OLS regression assumptions, including the assumptions that random variables are sampled from populations with normal distributions, that residual errors are normally distributed with a mean of zero, and that residual errors are homoscedastic and serially uncorrelated. Assumptions about the relationships among X, M, and Y variables from Figure 1 are guided by the equations provided by Jo (2008),

  M_i = α_m + a*X_i + ε_Mi      (Equation 1)
  Y_i = α_y + c′*X_i + b*M_i + ε_Yi      (Equation 2)

where X_i, M_i, and Y_i represent values for the independent variable, mediator, and outcome for individual i, respectively; ε_Mi and ε_Yi are the corresponding residual errors; α_m and α_y represent the intercepts for M and Y after the other effects are accounted for; and a, b, and c′ correspond with the mediation regression paths shown in Figure 1.

Generating data

Data for X , M , and Y with sample size N can be generated using the rnorm command. If N , a , b , and c ′ ( c ′ is named cp in the syntax below) are each specified as single numeric values, then the following syntax will generate data for the X , M , and Y variables.
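A sketch of the data-generating syntax described here (the parameter values shown are placeholders; the simulations below manipulate them systematically):

```r
# Placeholder parameter values for illustration
N  = 100   # sample size
a  = 0.3   # X predicting M
b  = 0.3   # M predicting Y, controlling for X
cp = 0.2   # c-prime: X predicting Y, controlling for M

X = rnorm(N, 0, 1)               # independent variable, mean 0, SD 1
M = a*X + rnorm(N, 0, 1)         # mediator: a*X plus normal error
Y = cp*X + b*M + rnorm(N, 0, 1)  # outcome: cp*X + b*M plus normal error
```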

The first line of the syntax above creates a random variable X with a mean of zero and a standard deviation of one for N observations. The second line creates a random variable M that regresses onto X with regression coefficient a and a random error with a mean of zero and standard deviation of one (error variances need not be fixed with a mean of zero and standard deviation of one, and can be specified at any value based on previous research or theoretically-expected values). The third line of syntax creates a random variable Y that regresses onto X and M with regression coefficients cp and b, respectively, with a random error that has a mean of zero and standard deviation of one. It will be shown below that the intercept parameters do not affect the significance of a mediation test, and thus the intercepts were left at zero in the three lines of code above; however, the intercept parameter could be manipulated in a similar manner to a , b , and c ′ if desired.

Statistical analyses are performed and parameters are retained

Once the random variables X, M, and Y have been generated, the next step is to perform a statistical analysis on the simulated data. In mediation analysis, the Sobel test (MacKinnon et al., 1995; Sobel, 1982) is commonly employed (although see the section below on bootstrapping). It tests the significance of a mediation effect by computing the magnitude of the indirect effect as the product of coefficients a and b (ab) and comparing this value to the standard error of ab to obtain a z-like test statistic. Specifically, the Sobel test uses the formula

  z = ab / sqrt(b^2 * s_a^2 + a^2 * s_b^2)      (Equation 3)

where s_a and s_b are the standard errors of the estimates for regression coefficients a and b, respectively. The product of coefficients ab reflects the degree to which the effect of X on Y is mediated through variable M, and is contained in the numerator of Equation 3. The standard error of the distribution of ab is in the denominator of Equation 3, and the full equation provides a z-like statistic that tests whether the ab effect is significantly different from zero. Because the Sobel test will be computed many times, writing a function to compute it provides an efficient way to run the test repeatedly. Such a function, called sobel_test, is defined below. The function takes the vectors X, M, and Y as its first, second, and third arguments, respectively, and computes regression models for M regressed onto X and for Y regressed onto X and M. The coefficients representing a, b, s_a, and s_b in Equation 3 are extracted by calling coefficients, then a Sobel test statistic is computed and returned.
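A sketch of such a function (the exact original listing is not shown in this excerpt; estimates and standard errors are taken from the summary() coefficient table, where the a path is the second row of the M ~ X model and the b path is the third row of the Y ~ X + M model):

```r
sobel_test = function(X, M, Y){
  M_X  = lm(M ~ X)        # regress M onto X
  Y_XM = lm(Y ~ X + M)    # regress Y onto X and M
  # Extract a, b, and their standard errors from the coefficient tables
  a  = summary(M_X)$coefficients[2, 1]
  sa = summary(M_X)$coefficients[2, 2]
  b  = summary(Y_XM)$coefficients[3, 1]
  sb = summary(Y_XM)$coefficients[3, 2]
  # Sobel z-statistic (Equation 3)
  return(a*b / sqrt(b^2*sa^2 + a^2*sb^2))
}
```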

Data are generated and analyzed many times under the same conditions

So far syntax has been provided to generate one set of X , M , and Y variables and to compute a Sobel z -statistic from these variables. These procedures can now be repeated several hundred or thousand times to observe how this model behaves across many samples, which may be accomplished with for loops, as shown below. In the syntax below, the procedure for generating data and computing a Sobel test is repeated reps number of times, where reps is a single integer value. For each iteration of the for loop, data are saved to a matrix called d to retain information about the iteration number (i), a , b , and c ′ parameters (a, b, and cp), the sample size (N), an indexing variable that tells whether the test statistic corresponds with an X-M-Y or X-Y-M mediation model (1 vs. 2), and the computed Sobel test statistic which calls on the sobel_test function above.
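The loop might be sketched as follows (sobel_test is repeated from above so the example is self-contained; reps is kept small here, whereas the study uses 1000):

```r
# sobel_test as defined earlier, repeated so this sketch runs on its own
sobel_test = function(X, M, Y){
  M_X  = lm(M ~ X); Y_XM = lm(Y ~ X + M)
  a  = summary(M_X)$coefficients[2, 1];  sa = summary(M_X)$coefficients[2, 2]
  b  = summary(Y_XM)$coefficients[3, 1]; sb = summary(Y_XM)$coefficients[3, 2]
  a*b / sqrt(b^2*sa^2 + a^2*sb^2)
}

a = 0.3; b = 0.3; cp = 0.2; N = 100  # one condition (placeholder values)
reps = 100                           # 1000 in the full study
d = NULL
for (i in 1:reps){
  X = rnorm(N, 0, 1)
  M = a*X + rnorm(N, 0, 1)
  Y = cp*X + b*M + rnorm(N, 0, 1)
  # model = 1: X-M-Y mediation; model = 2: X-Y-M (M and Y switched)
  d = rbind(d, c(i, a, b, cp, N, 1, sobel_test(X, M, Y)))
  d = rbind(d, c(i, a, b, cp, N, 2, sobel_test(X, Y, M)))
}
```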

The above steps can then be repeated for datasets generated according to different parameters. In the present example, we wish to test three values each of a, b, and c′, and two values of N. Syntax for manipulating these parameters is included below. The values selected for a, b, c′, and N are specified as vectors called a_list, b_list, cp_list, and N_list, respectively. Four nested for loops index through each of the values in a_list, b_list, cp_list, and N_list and extract single values for these parameters that are used for data generation. For each combination of a, b, c′, and N, reps number of datasets are generated and subjected to the Sobel test using the same syntax presented above (some syntax is omitted below for brevity, and full syntax with more detailed annotation for this example is provided in Appendix A), and the data are then saved to a matrix called d:
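A self-contained sketch of the nested loops (reps is reduced here so the sketch runs quickly; the study uses 1000 per condition):

```r
# sobel_test as defined earlier, repeated so this sketch runs on its own
sobel_test = function(X, M, Y){
  M_X  = lm(M ~ X); Y_XM = lm(Y ~ X + M)
  a  = summary(M_X)$coefficients[2, 1];  sa = summary(M_X)$coefficients[2, 2]
  b  = summary(Y_XM)$coefficients[3, 1]; sb = summary(Y_XM)$coefficients[3, 2]
  a*b / sqrt(b^2*sa^2 + a^2*sb^2)
}

a_list  = c(-0.3, 0, 0.3)
b_list  = c(-0.3, 0, 0.3)
cp_list = c(-0.2, 0, 0.2)
N_list  = c(100, 300)
reps = 2            # 1000 in the full study; reduced for illustration
d = NULL
for (N in N_list){
 for (a in a_list){
  for (b in b_list){
   for (cp in cp_list){
    for (i in 1:reps){
      X = rnorm(N, 0, 1)
      M = a*X + rnorm(N, 0, 1)
      Y = cp*X + b*M + rnorm(N, 0, 1)
      d = rbind(d, c(i, a, b, cp, N, 1, sobel_test(X, M, Y)))
      d = rbind(d, c(i, a, b, cp, N, 2, sobel_test(X, Y, M)))
    }
   }
  }
 }
}
```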

Retained parameter estimates are analyzed to evaluate the question of interest

Executing the syntax above generates a matrix d that contains Sobel test statistics for X-M-Y and X-Y-M mediation models generated from a variety of a, b, c′, and N parameters. The next step is to evaluate the results of these models. Before this is done, it will be helpful to add labels to the variables in matrix d to allow for easy extraction of subsets of the results and to facilitate their interpretation:

It is also desirable to save a backup copy of the results using the command

In the syntax above, "..." must be replaced with the directory where results should be saved. On Windows, each folder in the path must be separated by double backslashes ("\\"); on Macintosh and Linux/Unix, a single forward slash ("/") should be used (forward slashes are also accepted by R on Windows).
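For example, the labeling and saving steps might look like the following sketch (the two illustrative rows stand in for the full results matrix, and the file is written to the current working directory rather than a full path):

```r
# Two illustrative rows standing in for the simulation results matrix d,
# with columns in the order: iteration, a, b, cp, N, model, sobel_z
d = rbind(c(1, 0.3, 0.3, 0.2, 300, 1, 5.1),
          c(1, 0.3, 0.3, 0.2, 300, 2, 4.9))

# Label the columns of d for easy extraction of subsets of results
colnames(d) = c("iteration", "a", "b", "cp", "N", "model", "sobel_z")

# Save a backup copy of the results
write.csv(d, file = "mediation_output.csv", row.names = FALSE)
```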

Researchers can choose any number of ways to analyze the results of simulation studies, and the method chosen should be based on the nature of the question under examination. One way to compare the distributions of Sobel z-statistics obtained for the X-M-Y and X-Y-M mediation models in the current example is to use boxplots, which can be created in R (see ?boxplot for details) or by importing the mediation_output.csv file into other data analytic software. As seen in Figure 2, in the first two conditions, where the population parameters are a = 0.3, b = 0.3, and c′ = 0.2, Sobel tests for X-M-Y and X-Y-M mediation models produce test statistics with nearly identical distributions, and Sobel test values are almost always significant (|z| > 1.96, which corresponds with p < .05, two-tailed) when N = 300 and the other assumptions described above hold. In the latter two conditions, where the population parameters are a = 0.3, b = 0.3, and c′ = 0, test statistics for X-M-Y models remain high, while test statistics for X-Y-M models are lower, even though approximately 25% of these models still had Sobel z-test statistics with magnitudes greater than 1.96 (and thus p-values less than 0.05).

Figure 2. Boxplot of partial results from Example 1 with N = 300.

The similarity of results between X-M-Y and X-Y-M models suggests limitations of using mediation analysis to identify causal relationships. Specifically, the same datasets may produce significant results under a variety of models that support different theories of the causal ordering of relations. For example, a variable that is truly a mediator may instead be specified as an outcome and still produce “significant” results in a mediation analysis. This could imply misleading support for a causal chain due to the way researchers specify the ordering of variables in the analysis. This finding suggests that mediation analysis may produce misleading results in some situations, particularly when data are cross-sectional because of the lack of temporal-ordering for observations of X , M , and Y that could provide stronger testing of a proposed causal sequence (Maxwell & Cole, 2007; Maxwell, Cole, & Mitchell, 2011 ). One implication of these findings is that researchers who perform mediation analysis should test alternative models. For example, researchers could test alternative models with assumed mediators modeled as outcomes and assumed outcomes modeled as mediators to test whether other plausible models are also “significant” (e.g., Witkiewitz et al., 2012 ).

Example 2: Estimating the Statistical Power of a Model

Simulations can be used to estimate the statistical power of a model, that is, the likelihood of rejecting the null hypothesis for a particular effect under a given set of conditions. Although statistical power can be estimated directly for many analyses with power tables (e.g., Maxwell & Delaney, 2004) and free software such as G*Power (Erdfelder, Faul, & Buchner, 1996; see Mayr, Erdfelder, Buchner, & Faul, 2007 for a tutorial on using G*Power), many types of analyses, such as mediation analysis, currently have no well-established method for directly estimating statistical power.

The steps in Example 1 provide the necessary data to estimate the power of a mediation analysis if the assumptions and parameters specified in Example 1 remain the same. Thus, using the simulation results saved in dataset d generated in Example 1, the power of a mediation model under a given set of conditions can be estimated by identifying the relative frequency in which a mediation test was significant.

For example, the syntax below extracts the Sobel test statistic from dataset d under the condition where a = 0.3, b = 0.3, c′ = 0.2, N = 300, and model = 1 (i.e., an X-M-Y mediation model is tested). The vector of Sobel test statistics across the 1000 repetitions is saved in a variable called z_dist. The absolute value of each of the numbers in z_dist is compared against 1.96 (i.e., the z-value that corresponds with p < 0.05, two-tailed), creating a vector of values that are either TRUE (if the absolute value is greater than 1.96) or FALSE (if the absolute value is less than or equal to 1.96). The numbers of TRUE and FALSE values can be summarized using the table command (see ?table for details), which, when divided by the number of values in the vector, gives the proportion of Sobel tests with absolute value greater than 1.96:
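A sketch of this extraction (a small stand-in for matrix d is constructed here so the example runs on its own; with the actual Example 1 results, the printed proportions are those reported below):

```r
# Stand-in for matrix d from Example 1 (the sobel_z values are made up)
set.seed(12345)
d = cbind(iteration = 1:1000, a = 0.3, b = 0.3, cp = 0.2,
          N = 300, model = 1, sobel_z = rnorm(1000, 5, 1.5))

# Extract Sobel z statistics for the condition of interest
z_dist = d[d[,"a"] == 0.3 & d[,"b"] == 0.3 & d[,"cp"] == 0.2 &
           d[,"N"] == 300 & d[,"model"] == 1, "sobel_z"]

# Proportion of |z| values above 1.96 estimates the power
table(abs(z_dist) > 1.96) / length(z_dist)
```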

When the above syntax is run, the printed result shows proportions of FALSE = 0.003 and TRUE = 0.997, which indicates that 99.7% of the datasets randomly sampled under the conditions specified above produced significant Sobel tests, and that the analysis has an estimated power of 0.997.

One could also test the power of mediation models with different parameters specified. For example, the power of a model with all of the same parameters as above except a smaller sample size of N = 100 could be examined by changing the sample size condition in the syntax above, which produces output indicating that only 51.5% of the mediation models in this example were significant, reflecting the reduced power that results from the smaller sample size. Full syntax for this example is provided in Appendix B.

Example 3: Bootstrapping to Obtain Confidence Intervals

In the above examples, the Sobel test was used to determine whether a mediation effect was significant. Although the Sobel test is more robust than other methods such as Baron and Kenny's (1986) causal steps approach (Hayes, 2009; MacKinnon et al., 1995), a limitation of the Sobel test is that it assumes the sampling distribution of indirect effects (ab) is normally distributed in order for the p-value obtained from the z-like statistic to be valid. This assumption typically is not met: the sampling distributions for a and b are each normal, but multiplying a and b introduces skew into the sampling distribution of ab. Bootstrapping can be used as an alternative to the Sobel test to obtain an empirically derived sampling distribution with confidence intervals that are more accurate than those of the Sobel test.

To obtain an empirical sampling distribution of indirect effects ab, N randomly selected participants from an observed dataset are sampled with replacement, where N is equal to the original sample size. A dataset containing the observed X, M, and Y values for these randomly resampled participants is created and subjected to a mediation analysis using Equations 1 and 2. The a and b coefficients are obtained from these regression models, and the product of these coefficients, ab, is computed and retained. This procedure is repeated many times, perhaps 1,000 or 10,000 times, with a new set of subjects randomly selected with replacement from the original sample each time (Hélie, 2006). This provides an empirical sampling distribution of the product of coefficients ab that no longer requires the standard error of the estimate for ab to be computed.

The syntax below provides the steps for bootstrapping a 95% confidence interval of an indirect effect for variables X, M, and Y. A variable called ab_vector holds the bootstrapped distribution of ab values, and is initialized using the NULL argument to remove any data previously stored in this variable. A for loop is specified to repeat reps number of times, where reps is a single integer giving the number of repetitions to be used for bootstrapping. Variable s is a vector containing row numbers of participants that are randomly sampled with replacement from the original observed sample (raw data for X, M, and Y in this example are provided in the supplemental file mediation_raw_data.csv; see Appendix C for syntax to import this dataset into R). The vectors Xs, Ms, and Ys store the values of X, M, and Y, respectively, that correspond with the subjects resampled based on the vector s. Finally, M_Xs and Y_XMs are lm objects containing linear regression models for Ms regressed onto Xs and for Ys regressed onto Xs and Ms, respectively, and the a and b coefficients in these two models are extracted. The product of coefficients ab is computed and saved to ab_vector, then the resampling process and computation of the ab effect are repeated. Once the repetitions are completed, 95% confidence interval limits are obtained using the quantile command to identify the values in ab_vector at the 2.5th and 97.5th percentiles (these values could be adjusted to obtain different confidence intervals; enter ?quantile in the R console for more details), and the result is saved in a vector called bootlim. Finally, a histogram of the ab effects in ab_vector is printed and displayed in Figure 3.
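The bootstrapping syntax might look like the following sketch (simulated data stand in for the mediation_raw_data.csv file here, and reps is reduced from the thousands typically used):

```r
set.seed(12345)
# Simulated stand-in for the observed X, M, and Y in mediation_raw_data.csv
N = 100
X = rnorm(N, 0, 1)
M = 0.3*X + rnorm(N, 0, 1)
Y = 0.2*X + 0.3*M + rnorm(N, 0, 1)

reps = 200                  # number of bootstrap resamples
ab_vector = NULL            # initialize the vector of ab estimates
for (i in 1:reps){
  s  = sample(1:N, size = N, replace = TRUE)  # resample row numbers
  Xs = X[s]; Ms = M[s]; Ys = Y[s]
  M_Xs  = lm(Ms ~ Xs)
  Y_XMs = lm(Ys ~ Xs + Ms)
  a = M_Xs$coefficients[2]        # a path
  b = Y_XMs$coefficients[3]       # b path
  ab_vector = c(ab_vector, a*b)   # retain the indirect effect
}

# 95% confidence interval limits from the empirical distribution
bootlim = quantile(ab_vector, c(0.025, 0.975))
hist(ab_vector)
```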

Figure 3. Empirical distribution of indirect effects (ab) used for bootstrapping a confidence interval.

Full syntax with annotation for the bootstrapping procedure above is provided in Appendix C. Calling the bootlim vector returns the indirect effects that correspond with the 2.5th and 97.5th percentiles of the empirical sampling distribution of ab. Because this 95% confidence interval does not contain zero, the results indicate that the product of coefficients ab is significantly different from zero at p < 0.05.

The preceding sections provided demonstrations of methods to implement simulation studies for different purposes, including answering novel questions related to statistical modeling, estimating power, and bootstrapping confidence intervals. The demonstrations presented here used mediation analysis as the content area to demonstrate the underlying processes used in simulation studies, but simulation studies are not limited only to questions related to mediation. Virtually any type of analysis or model could be explored using simulation studies. While the way that researchers construct simulations depends largely on the research question of interest, the basic procedures outlined here can be applied to a large array of simulation studies.

While it is possible to run simulation studies in other programming environments (e.g., the latent variable modeling software Mplus; see Muthén & Muthén, 2002), R may provide distinct advantages over other programs for running simulation studies because it is free, open source, and cross-platform. R also allows researchers to generate and manipulate their data with much more flexibility than many other programs, and contains packages for running a multitude of statistical analyses of interest to social science researchers across a variety of domains.

There are several limitations of simulation studies that should be noted. First, real-world data often do not adhere to the assumptions and parameters by which data are generated in simulation studies. For example, unlike the linear regression models for the examples above, it is often the case in real world studies that residual errors are not homoscedastic and serially uncorrelated. That is, real-world datasets are likely to be more “dirty” than the “clean” datasets that are generated in simulation studies, which are often generated under idealistic conditions. While these “dirty” aspects of data can be incorporated into simulation studies, the degree to which these aspects should be modeled into the data may be unknown and thus at times difficult to incorporate in a realistic manner.

Second, it is practically impossible to know the values of true population parameters that are incorporated into simulation studies. For example, in the mediation examples above, the regression coefficients a , b , and c’ often may be unknown for a question of interest. Even if previous research provides empirically-estimated parameter estimates, the exact value for these population parameters is still unknown due to sampling error. To deal with this, researchers can run simulations across a variety of parameter values, as was done in Examples 1 and 2, to understand how their models may perform under different conditions, but pinpointing the exact parameter values that apply to their question of interest is unrealistic and typically impossible.

Third, simulation studies often require considerable computation time because hundreds or thousands of datasets often must be generated and analyzed. Simulations that are large or use iterative estimation routines (e.g., maximum likelihood) may take hours, days, or even weeks to run, depending on the size of the study.

Fourth, not all statistical questions require simulations to obtain meaningful answers. Many statistical questions can be answered through mathematical derivations, and in these cases simulation studies can demonstrate only what was shown already to be true through mathematical proofs ( Maxwell & Cole, 1995 ). Thus, simulation studies are utilized best when they derive answers to problems that do not contain simple mathematical solutions.

Simulation methods are relatively straightforward once the assumptions of a model and the parameters to be used for data generation are specified. Researchers who use simulation methods have tight experimental control over these assumptions and their data, and can test how a model performs under a known set of parameters (whereas with real-world data the parameters are unknown). Simulation methods are flexible and can be applied to a number of problems to obtain quantitative answers to questions that may not be answerable through other approaches. Results from simulation studies can be used to compare obtained results with their theoretically expected values and to compare competing approaches for handling data, and this flexibility allows simulation studies to be used for a variety of purposes.

Supplementary Material

Supplementary data and syntax

Acknowledgments

This research was funded by NIAAA grant F31AA021031.

The author would like to thank Mandy Owens, Chris McLouth, and Nick Gaspelin for their feedback on previous versions of this manuscript.

Appendix A. Syntax for Example 1

Appendix B. Syntax for Example 2

Appendix C. Syntax for Example 3

References

  • Baron RM, Kenny DA. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology. 1986;51(6):1173–1182.
  • Chalmers P. Multidimensional Item Response Theory [computer software]. 2012. Available from http://cran.r-project.org/web/packages/mirt/index.html
  • Erdfelder E, Faul F, Buchner A. GPOWER: A general power analysis program. Behavior Research Methods, Instruments & Computers. 1996;28(1):1–11.
  • Estabrook R, Grimm KJ, Bowles RP. A Monte Carlo simulation study assessment of the reliability of within-person variability. Psychology and Aging. 2012 Jan 23. doi:10.1037/a0026669. Advance online publication.
  • Handcock MS, Hunter DR, Butts CT, Goodreau SM, Krivitsky PN, Morris M. Fit, Simulate, and Diagnose Exponential-Family Models for Networks [computer software]. 2012. Available from http://cran.r-project.org/web/packages/ergm/index.html
  • Hayes AF. Beyond Baron and Kenny: Statistical mediation analysis in the new millennium. Communication Monographs. 2009;76(4):408–420.
  • Hélie S. An introduction to model selection: Tools and algorithms. Tutorials in Quantitative Methods for Psychology. 2006;2(1):1–10.
  • Jo B. Causal inference in randomized experiments with mediational processes. Psychological Methods. 2008;13(4):314–336.
  • Kabacoff R. Quick-R: Accessing the power of R. 2012. Retrieved from http://www.statmethods.net/
  • Luh W, Guo J. A powerful transformation trimmed mean method for one-way fixed effects ANOVA model under non-normality and inequality of variances. British Journal of Mathematical and Statistical Psychology. 1999;52(2):303–320.
  • MacKinnon DP, Warsi G, Dwyer JH. A simulation study of mediated effect measures. Multivariate Behavioral Research. 1995;30:41–62.
  • Maxwell SE, Cole DA. Tips for writing (and reading) methodological articles. Psychological Bulletin. 1995;118(2):193–198.
  • Maxwell SE, Cole DA, Mitchell MA. Bias in cross-sectional analyses of longitudinal mediation: Partial and complete mediation under an autoregressive model. Multivariate Behavioral Research. 2011;45:816–841.
  • Maxwell SE, Delaney HD. Designing experiments and analyzing data: A model comparison perspective. 2nd ed. Mahwah, NJ: Lawrence Erlbaum; 2004.
  • Mayr S, Erdfelder E, Buchner A, Faul F. A short tutorial of GPower. Tutorials in Quantitative Methods for Psychology. 2007;3(2):51–59.
  • Muthén LK, Muthén BO. Teacher's corner: How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling: A Multidisciplinary Journal. 2002;9(4):599–620.
  • Owen WJ. The R guide. 2010. Retrieved from http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
  • Pornprasertmanit S, Miller P, Schoemann A. SIMulated Structural Equation Modeling [computer software]. 2012. Available from http://cran.r-project.org/web/packages/simsem/index.html
  • R Development Core Team. R: A Language and Environment for Statistical Computing [computer software]. 2011. Available from http://www.R-project.org
  • Sobel ME. Asymptotic intervals for indirect effects in structural equations models. In: Leinhart S, editor. Sociological methodology 1982. San Francisco: Jossey-Bass; 1982. pp. 290–312.
  • Spector P. An introduction to R. 2004. Retrieved from http://www.stat.berkeley.edu/~spector/R.pdf
  • Venables WN, Smith DM, R Development Core Team. An introduction to R. 2012. Retrieved from http://cran.r-project.org/doc/manuals/R-intro.pdf
  • Witkiewitz K, Donovan DM, Hartzler B. Drink refusal training as part of a combined behavioral intervention: Effectiveness and mechanisms of change. Journal of Consulting and Clinical Psychology. 2012;80(3):440–449.

Leatherby Libraries


Unleashing the Full Potential of Your Research with Chapman Figshare

August 9, 2024

Following the successful launch of Chapman Figshare during Love Data Week, the Leatherby Libraries is excited to invite more members of the Chapman community to take full advantage of the new research data repository. Whether you have datasets, media files, or other research outputs, Chapman Figshare makes your data citable, shareable, and discoverable.


The Leatherby Libraries offers two open-access repositories, Chapman Figshare  and Chapman University Digital Commons , to support Chapman scholars and researchers in sharing and preserving their research outputs. Both platforms cater to distinct aspects of the research lifecycle due to differences in the research outputs they accommodate. Datasets can be uploaded to Chapman Figshare, while Chapman University Digital Commons houses all other research outputs such as articles, theses, posters, and more. Check out our previous blog to learn whether your work best fits in Chapman Figshare or Chapman University Digital Commons. 

Why Choose Chapman Figshare?

Shared data is increasingly valuable in today's research landscape because it strengthens the reliability and replicability of research findings. Chapman Figshare provides a secure and efficient platform for archiving and publishing your research data, ensuring it reaches a broader audience. Here's why you should consider depositing your data in Chapman Figshare:

  • Increased Visibility and Reach : Every dataset submitted to Chapman Figshare receives a Digital Object Identifier (DOI), providing a permanent link to your research. All data is indexed by DataCite, Clarivate’s Data Citation Index, Google, and Dimensions, which improves data discoverability on dataset search platforms such as Google Scholar, Web of Science, Mendeley Data, DataCite Commons, and OpenAIRE Explore.
  • Secure and Compliant Storage : Chapman Figshare’s cloud-based infrastructure ensures that your data is stored safely and securely, meeting the requirements of many publishers and funding agencies.
  • Comprehensive Usage Tracking : With detailed usage statistics, including views, downloads, citations, and Altmetrics, you can easily track how often your research is accessed and utilized globally.

How to Deposit Your Data

Ready to make your research data citable, shareable, and discoverable? Here’s how to get started:

  • Use Your Chapman Account to Sign In : Log in to Chapman Figshare using your Chapman credentials.
  • Provide Documentation : Write documentation to make your data understandable and reusable.
  • Submit Your Files : Once you have created documentation explaining your data, upload your documentation and data files to Chapman Figshare.
  • Submission Review : Our data curators will review your submission to ensure it meets Chapman Figshare’s format, size, and subject matter standards.

Join the Growing Chapman Figshare Community

Take advantage of the opportunity to enhance the visibility and impact of your research! Join the growing number of Chapman community members already benefiting from Chapman Figshare. By depositing your data, you contribute to a collaborative, transparent research platform that benefits scholars worldwide.

For detailed instructions on using Chapman Figshare, visit our Digital Repositories at Chapman University LibGuide or contact the LRDS team at [email protected] . Make your research count with Chapman Figshare!



Tool Demo Track SANER 2025

Call for Papers

The Tool Demonstration track of the 32nd International Conference on Software Analysis, Evolution, and Reengineering (SANER’25) provides an excellent opportunity for researchers and practitioners to showcase innovative tools, prototypes, and software systems related to software analysis, engineering, and refactoring. The track aims to foster knowledge exchange, collaboration, and discussions about the latest advancements in tools and technologies that support software development, maintenance, and improvement.

Tool demonstrations should showcase the implementation of research approaches through practical tools, ranging from advanced prototypes to fully developed products that are in the process of being commercialized. We particularly encourage proposals for tool demonstrations that complement full research papers: while a research paper provides background information and highlights the scientific contribution of a new software engineering approach, a tool demonstration offers an excellent opportunity to show how the scientific approach has been translated into a functional tool prototype. Authors of research papers are therefore strongly encouraged to submit the corresponding tools to this track. Tool demonstrations related to any of the topics covered by the conference are welcome.

Evaluation Criteria

Each submission will be reviewed by at least three members of the tool demonstration program committee. The committee will review each submission on its merits and quality.

A good tool paper should:

  • Fall under the topics mentioned for the SANER 2025 research track;
  • Present and discuss a tool that has NOT been published before as a tool paper;
  • Motivate the need for the tool;
  • Describe the tool’s novelty and how it relates to previous industrial or research efforts;
  • Describe the potential applications and usefulness of the tool;
  • Describe the tool’s goals, requirements, and architecture, and explain its inner workings;
  • NOT necessarily contain a large-scale empirical study of the tool, BUT any empirical results or user feedback are highly encouraged;
  • Include a URL for downloading or accessing the latest version of the tool (e.g., a GitHub URL);
  • Optionally, include in the abstract the URL of a 3-to-5-minute screencast, either with annotations or voice-over, that provides a concise version of the tool demo scenario. The video should be posted on YouTube (private, not shared) or hosted on the tool’s website.

Submission Instructions

Submissions of tool demonstrations must:

  • adhere to the conference proceedings style (IEEE proceedings paper format guidelines);
  • have a maximum of 5 pages that describe the criteria above;
  • be uploaded electronically in PDF format via the SANER 2025 EasyChair submission site.

Accepted tool demonstrations will be allocated 5 pages in the conference proceedings. Presenters of accepted tool demonstrations will have the opportunity to (i) deliver a presentation that will be included in the conference program, and (ii) conduct a hands-on session where attendees of SANER can actively use and experiment with the demonstrated tools. Please note that commercial products and tools currently under commercialization procedures CANNOT be accepted for the tool demonstration track. The purpose of these demonstrations is to emphasize scientific contributions and, as such, should not be used as sales pitches.

Important Dates

  • Paper submission: Monday, November 11, 2024 AoE
  • Notifications: Friday, December 13, 2024 AoE
  • Camera Ready: Friday, January 10, 2025 AoE

Rrezarta Krasniqi, Tool and Demo Track co-chair

University of North Carolina at Charlotte, United States

Sarah Nadi, Tool and Demo Track co-chair

New York University Abu Dhabi, United Arab Emirates; University of Alberta, Canada

