Křikava and Vitek (2018) conducted an MSR study to inspect R packages' source code, making available a tool that automatically generates unit tests. In particular, they identified several testing challenges caused by the language itself, namely its extreme dynamism, coerciveness, and lack of types, which hinder the efficacy of traditional test-extraction techniques.
In particular, the authors worked with execution traces, “the sequence of operations performed by a program for a given set of input values” (Křikava and Vitek 2018), to provide genthat, a package to optimise the unit testing of a target package (Krikava 2018). genthat records the execution traces of a target package, allowing the extraction of unit-test functions; however, extraction is limited to either the public interface or the internal implementation of the target package. Overall, its process comprises installation, extraction, tracing, checking and minimisation.
Both genthat and the study performed by these authors are highly valuable to the community, since the minimisation phase checks the unit tests, discards those that fail, and records coverage, eliminating redundant test cases. Although this is not a solution to the lack of edge cases detected in another study (Vidoni 2021a), genthat assists developers and can potentially reduce the workload required to obtain a baseline test suite. However, this work's main limitation is its emphasis on the coverage measure, which is not an accurate reflection of the tests' quality. Finally, Russell et al. (2019) focused on the maintainability of R packages in terms of their testing and performance. The authors conducted an MSR study of 13,500 CRAN packages, demonstrating that "reproducible and replicable software tests are frequently not available". This aligns with the findings of other authors mentioned in this Section. They concluded with recommendations to improve the long-term maintenance of a package in terms of testing and optimisation, reviewed in Section 3.
The increased relevance of software in data science, statistics and research has heightened the need for reproducible, quality-coded software (Howison and Herbsleb 2011). Several community-led organisations were created to organise and review packages; among them, rOpenSci (Ram et al. 2019; rOpenSci et al. 2021) and BioConductor (Gentleman et al. 2004). In particular, rOpenSci has established a thorough peer-review process for R packages based on the intersection of academic peer review and software review.
As a result, Codabux et al. (2021) studied rOpenSci's open peer-review process. They extracted the reviews of completed and accepted packages, broke down individual comments, and performed a card-sorting approach to determine which types of TD were most commonly discussed.
One of their main contributions is a taxonomy of TD extending the current definitions to R programming. It also groups debt types by perspective, representing “who is most affected by a type of debt”. They also provided examples of rOpenSci peer-review comments referring to each specific debt type. This taxonomy is summarised in Table 2, which also includes recapped definitions.
| Perspective | Debt type | Definition |
|---|---|---|
| User | Usability | In the context of R, usability debt encompasses anything related to usability, interfaces, visualisation, and so on. |
| User | Documentation | For R, this is anything related to roxygen2 (or alternatives such as LaTeX or Markdown generation), readme files, vignettes, and even websites. |
| User | Requirements | Refers to trade-offs made concerning which requirements the development team needs to implement, or how to implement them. |
| Developer | Test | In the context of R, test debt encompasses anything related to coverage, unit testing, and test automation. |
| Developer | Defect | Refers to known defects, usually identified by testing activities or by the user and reported on bug-tracking systems. |
| Developer | Design | For R, this debt is related to any OO feature, including visibility, internal functions, the triple-colon operator, placement of functions in files and folders, use of imports, returns of objects, and so on. |
| CRAN | Code | In the context of R, examples of code debt are anything related to renaming classes and functions, `<-` vs. `=`, parameters and arguments in functions, FALSE/TRUE vs. F/T, and print vs. warning/message. |
| CRAN | Build | In the context of R, examples of build debt are anything related to Travis, Codecov.io, GitHub Actions, CI, AppVeyor, CRAN, and CMD checks. |
| CRAN | Versioning | Refers to problems in source-code versioning, such as unnecessary code forks. |
| CRAN | Architecture | For example, violation of modularity, which can affect architectural requirements (e.g., performance, robustness). |
Additionally, they uncovered that almost one-third of the debt discussed is documentation debt, related to how well packages are documented. This was followed by code debt, yielding a different distribution from the one obtained by Vidoni (2021b). This difference is caused by rOpenSci reviewers focusing on documentation (e.g., comments written by reviewers account for most of the documentation debt), while developers' comments concentrate on code debt. The entire classification process is detailed in the original study by Codabux et al. (2021).
Developers’ perspectives on their work are fundamental to understanding how they develop software. However, scientific software developers have a different point of view from ‘traditional’ programmers (Howison and Herbsleb 2011).
Pinto et al. (2018) used an online questionnaire to survey over 1500 R developers, with results enriched with metadata extracted from GitHub profiles (provided by the respondents in their answers). Overall, they found that scientific developers are primarily self-taught but still consider peer learning a valuable secondary source. Interestingly, the participants did not perceive themselves as programmers, but rather as members of their own disciplines. This aligns with findings provided by other works (Morandat et al. 2012; German et al. 2013). Though understandable, such a perception may pose a risk to the development of quality software, as developers may feel ‘justified’ in not following good coding practices (Pinto et al. 2018).
Additionally, this study found that scientific developers work alone or in small teams (up to five people). Interestingly enough, they found that people spend a significant amount of time focused on coding and testing and performed an ad-hoc elicitation of requirements, mostly ‘deciding by themselves’ on what to work next, rather than following any development lifecycle.
When enquiring about commonly-faced challenges, the participants of this study mentioned the following: cross-platform compatibility, poor documentation (a central topic for reviewers (Codabux et al. 2021)), interruptions while coding, lack of time (also mentioned by developers in another study (Vidoni 2021b)), scope bloat, lack of user feedback (related to validation rather than verification testing), and lack of a formal reward system (e.g., the work is not credited in the scientific community (Howison and Herbsleb 2011)).
| Area | Problem | Recommendation |
|---|---|---|
| Lifecycles | The lack of proper requirement elicitation and development organisation was identified as a critical problem for developers, who often resort to writing comments in the source code to remind themselves of tasks they later do not address. | There are extremely lightweight agile lifecycles (e.g., Extreme Programming, Crystal Clear, Kanban) that can be adapted for a single developer or small groups. These can provide a project-management framework that can also organise a research project that depends on creating scientific software. |
| Teaching | Most scientific developers do not perceive themselves as programmers and are self-taught. This hinders their background knowledge and the tools they have available to detect TD and other problems, potentially leading to low-quality code. | Since graduate school is considered fundamental for these developers, providing a solid foundation of SE-oriented R programming for candidates whose research relies heavily on software can prove beneficial. The topics to be taught should be carefully selected to keep them practical and relevant yet still valuable for the candidates. |
| Coding | Some problems discussed were function clones, incorrect imports, non-semantic or non-meaningful names, and improper visibility or file distribution of functions, among others. | Avoid duplicating (i.e., copy-pasting or re-exporting) functions from other packages; instead, use proper selective imports. Avoid leaving unused functions or pieces of code that are ‘commented out’ to be nullified; proper use of version control enables developers to remove such segments and revisit them through previous commits. Code comments are meant to be meaningful and should not be used as a planning tool; comments indicating problems or errors should be addressed (either when found, if the problem is small, or at a specifically planned time, if the problem is significant). Names should be semantic and meaningful, maintaining consistency in the whole project; though there is no pre-established naming convention for R, previous works provide an overview, as do dedicated packages. |
| Testing | Current tests leave many relevant paths unexplored, often ignoring the testing of edge cases and damaging the robustness of the packaged code. | All alternative paths should be tested (e.g., those limited by conditionals). Exceptional cases should be tested: e.g., evaluating that a function throws an exception or error when it should, and evaluating other cases such as (but not limited to) nulls, NA, NaN, warnings, large numbers, empty strings, and empty variables, among others. Other specific testing cases, including performance evaluation and profiling, are discussed and exemplified by Russell et al. (2019). |
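The edge cases recommended above can be exercised with testthat (mentioned later in this article); the toy function and test names below are illustrative, not from any cited package:

```r
library(testthat)

# Toy function under test: division that validates its input
safe_div <- function(a, b) {
  if (!is.numeric(a) || !is.numeric(b)) stop("inputs must be numeric")
  if (length(b) == 0 || b == 0) return(NA_real_)  # empty or zero divisor
  a / b
}

test_that("safe_div handles edge cases", {
  expect_error(safe_div("x", 1))                 # wrong type throws an error
  expect_true(is.na(safe_div(1, 0)))             # division by zero
  expect_true(is.na(safe_div(1, numeric(0))))    # empty variable
  expect_equal(safe_div(1e300, 2), 5e299)        # large numbers
})
```

Each `expect_*` call covers one of the exceptional paths listed in the table, rather than only the "happy path" through the function.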
This study ( Pinto et al. 2018 ) was followed up to create a taxonomy of problems commonly faced by scientific developers ( Wiese et al. 2020 ) . They worked with over 2100 qualitatively-reported problems and grouped them into three axes; given the size of their taxonomy, only the larger groups are summarised below:
These two works provide valuable insight into scientific software developers. As with other works mentioned in this article, although there are similarities with traditional software development (in terms of both programming paradigms and goals), the differences are notable enough to warrant further specialised investigation.
Based on well-known practices for traditional software development (Sommerville 2015), this Section outlines a proposal of best practices for R developers. These are meant to target the weaknesses found by the studies discussed in Section 2. The list provides a baseline, with the aim that, through future research, it can be improved and further tailored to the needs of scientific software development and the R community itself.
The practices discussed span from overarching (e.g., related to processes) to specific activities. They are summarised in Table 3 .
Scientific software and R programming have become ubiquitous across numerous disciplines, providing essential analysis tools without which many studies could not be completed. Although R developers are reportedly struggling in several areas, academic literature centred on the development of scientific software is scarce. As a result, this Section provides two calls to action: one for R users and another for RSE academics.
Research Software Engineering Call: SE for data science and scientific software development is crucial for advancing research outcomes. As a result, interdisciplinary works are increasingly needed to approach specific areas. Some suggested topics to kickstart this research are as follows:
R Community Call: The following suggestions are centred on the abilities of the R community:
There is a wide range of possibilities and areas in which to work, all derived from diversifying R programming and RSE. This paper highlighted meaningful work in this area and proposed a call to action to further this line of research. However, these ideas need to be repeatedly evaluated and refined to be valuable to R users.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The author is grateful to both R-Ladies and rOpenSci communities that fostered the interest in this topic and to Prof. Dianne Cook for extending the invitation for this article.
The following packages were mentioned in this article:
genthat , roxygen2 , pkgdown , covr , testthat , tidyverse
Colorado State University, ERHS 535
Brooke Anderson, Rachel Severson, and Nicholas Good
This is the online book for Colorado State University’s R Programming for Research courses (ERHS 535, ERHS 581A3, and ERHS 581A4).
This book includes course information, course notes, links to download pdfs of lecture slides, in-course exercises, homework assignments, and vocabulary lists for quizzes for this course.
“Give someone a program, you frustrate them for a day; teach them how to program, you frustrate them for a lifetime.” —David Leinweber
An academic programming language paper about R.
Posted on April 27, 2012 by Derek-Jones in R-bloggers.
[This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers.]
The R language has passed another milestone: a paper aimed at the academic programming-language community (or at least one section of it) has been written about it, Evaluating the Design of the R Language by Morandat, Hill, Osvald and Vitek. Hardly earth-shattering news, but it may have some impact on how R is viewed by non-users of the language (the many R users in finance probably don’t care that R seems to have been labelled as the language for doing statistics). The paper is well written and contains some very interesting information, as well as a few mistakes, although it will probably read like gobbledygook to anybody not familiar with academic programming-language research. What follows has something of the form of an R user’s guide to reading this paper, plus some commentary.
The paper has roughly three parts: the first gives an overview of R, the second is a formal definition of a subset of the language, and the third an initial report of an analysis of R usage. For me, and I imagine for you, dear reader, the really interesting stuff is in the third section.
What is a formal description of a subset of R (i.e., done purely using mathematics) doing in the second part? Well, until recently very little academic software engineering was empirically based and was populated by people I would classify as failed mathematicians without the common sense to be engineers. Things are starting to change but research that measures things, particularly people, is still regarded as not being respectable in some quarters. In this case the formal definition is playing the role of a virility symbol showing that the authors are obviously regular guys who happen to be indulging in a bit of empirical research.
A surprising number of papers measuring the usage of real software contain formal definitions of a subset of the language being measured. Subsets are used because handling the complete language is a big project that usually involves one or more people getting a PhD out of the work. The subset chosen has to look plausible to readers who understand the mathematics but not the programming language, broadly handling all the major constructs while avoiding the fiddly details that need years of work and many pages to describe.
The third part contains the real research, which is really about one implementation of R and the characteristics of R source in the CRAN and Bioconductor repositories, and contains lots of interesting information. Note: the authors are incorrect to aim nearly all of the criticisms in this subsection at R, these really apply to the current implementation of R and might not apply to a different implementation.
In a previous post I suggested some possibilities for speeding up the execution of R programs that depended on R usage characteristics. The Morandat paper goes a long way towards providing numbers for some of these usage characteristics (e.g., 37% of function parameters are assigned to and 36% of vectors contain a single value).
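Both measurements are easy to relate to R's semantics; the snippet below (my own illustration, not code from the paper) shows what "a vector containing a single value" and "a parameter being assigned to" look like in practice:

```r
x <- 42
is.vector(x)   # a "scalar" in R is just a one-element vector
length(x)      # 1

f <- function(p) {
  p <- p * 2   # assigning to a parameter, the pattern the paper counts
  p            # the caller's copy is unchanged (copy-on-modify semantics)
}
f(3)
```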
What do we learn from this first batch of measurements? R users rarely use many of the more complicated features (e.g., object oriented constructs {and this paper has been accepted at the European Conference on Object-Oriented Programming}), a result usually seen for other languages. I was a bit surprised that R programs were only 40% smaller than equivalent C programs. I think part of the reason is that some of the problems used for benchmarking are not the kind that would usually be solved using R and I did not see any ‘typical’ R programs being coded up in C for comparison, another possibility is that the authors were not thinking in R when writing the code.
One big measurement topic the authors missed is comparing their general findings with usage measurements of other languages. I think they will find lots of similar patterns of usage.
The complaint that R has been defined by the successive releases of its only implementation, rather than by a written specification, applies to all widely used languages, at least in their early days. Back in the day, a major reason for creating language standards for Pascal and then C was so that other implementations could be created; the handful of major languages whose specification was written before the first implementation (e.g., PL/1, Ada) are dying out. Are multiple implementations needed in an Open Source world? The answer seems to be no for Perl and yes for PHP, Ruby, etc. The effort needed to create a written specification for the R language might be better invested in improving the efficiency of the current implementation, so that a better alternative is not needed.
Needless to say the authors suggested committing the fatal programming language research mistake .
The authors have created an interesting set of tools for static and dynamic analysis of R and I look forward to reading more about their findings in future papers.
Kevin A. Hallgren
University of New Mexico, Department of Psychology
Simulation studies allow researchers to answer specific questions about data analysis, statistical power, and best-practices for obtaining accurate results in empirical research. Despite the benefits that simulation research can provide, many researchers are unfamiliar with available tools for conducting their own simulation studies. The use of simulation studies need not be restricted to researchers with advanced skills in statistics and computer programming, and such methods can be implemented by researchers with a variety of abilities and interests. The present paper provides an introduction to methods used for running simulation studies using the R statistical programming environment and is written for individuals with minimal experience running simulation studies or using R. The paper describes the rationale and benefits of using simulations and introduces R functions relevant for many simulation studies. Three examples illustrate different applications for simulation studies, including (a) the use of simulations to answer a novel question about statistical analysis, (b) the use of simulations to estimate statistical power, and (c) the use of simulations to obtain confidence intervals of parameter estimates through bootstrapping. Results and fully annotated syntax from these examples are provided.
Simulations provide a powerful technique for answering a broad set of methodological and theoretical questions and provide a flexible framework to answer specific questions relevant to one’s own research. For example, simulations can evaluate the robustness of a statistical procedure under ideal and non-ideal conditions, and can identify strengths (e.g., accuracy of parameter estimates) and weaknesses (e.g., type-I and type-II error rates) of competing approaches for hypothesis testing. Simulations can be used to estimate the statistical power of many models that cannot be estimated directly through power tables and other classical methods (e.g., mediation analyses, hierarchical linear models, structural equation models, etc.). The procedures used for simulation studies are also at the heart of bootstrapping methods, which use resampling procedures to obtain empirical estimates of sampling distributions, confidence intervals, and p-values when a parameter sampling distribution is non-normal or unknown.
The current paper will provide an overview of the procedures involved in designing and implementing basic simulation studies in the R statistical programming environment (R Development Core Team, 2011). The paper will first outline the logic and steps that are included in simulation studies. Then, it will briefly introduce R syntax that helps facilitate the use of simulations. Three examples will be introduced to show the logic and procedures involved in implementing simulation studies, with fully annotated R syntax and brief discussions of the results provided. The examples will target three different uses of simulation studies: (a) answering a novel question about statistical analysis, (b) estimating statistical power, and (c) obtaining confidence intervals of parameter estimates through bootstrapping.
For demonstrative purposes, these examples will achieve their respective goals within the context of mediation models. Specifically, Example 1 will answer a novel statistical question about mediation model specification, Example 2 will estimate the statistical power of a mediation model, and Example 3 will bootstrap confidence intervals for testing the significance of an indirect effect in a mediation model. Despite the specificity of these example applications, the goal of the present paper is to provide the reader with an entry-level understanding of methods for conducting simulation studies in R that can be applied to a variety of statistical models unrelated to mediation analysis.
Although many statistical questions can be answered directly through mathematical analysis rather than simulations, the complexity of some statistical questions makes them more easily answered through simulation methods. In these cases, simulations may be used to generate datasets that conform to a set of known properties (e.g., mean, standard deviation, degree of zero-inflation, ceiling effects, etc. are specified by the researcher) and the accuracy of the model-computed parameter estimates may be compared to their specified values to determine how adequately the model performs under the specified conditions. Because several methods may be available for analyzing datasets with these characteristics, the suitability of these different methods could also be tested using simulations to determine if some methods offer greater accuracy than others (e.g., Estabrook, Grimm, & Bowles, 2012 ; Luh & Guo, 1999 ).
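This "recover the known truth" logic can be sketched in a few lines of R; the population values and repetition count below are arbitrary choices for illustration:

```r
set.seed(1)
n_reps    <- 500
true_mean <- 50
true_sd   <- 10

est <- numeric(n_reps)
for (i in 1:n_reps) {
  x      <- rnorm(30, mean = true_mean, sd = true_sd)  # data with known properties
  est[i] <- mean(x)                                    # model-computed estimate
}

mean(est)  # should be close to the specified true_mean of 50
sd(est)    # approximates the theoretical standard error, 10 / sqrt(30)
```

Comparing `mean(est)` against `true_mean` shows how adequately the estimator performs under the specified conditions; the same pattern extends to more complex models by swapping in a different data-generating step and fitted model.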
Simulation studies typically are designed according to the following steps to ensure that the simulation study can be informative to the researcher’s question:
The R statistical programming environment ( R Development Core Team, 2011 ) provides an ideal platform to conduct simulation studies. R includes the ability to fit a variety of statistical models natively, includes sophisticated procedures for data plotting, and has over 3000 add-on packages that allow for additional modeling and plotting techniques. R also allows researchers to incorporate features common in most programming languages such as loops, random number generators, conditional (if-then) logic, branching, and reading and writing of data, all of which facilitate the generation and analysis of data over many repetitions that is required for many simulation studies. R also is free, open source, and may be run across a variety of operating systems.
Several existing add-on packages already allow R users to conduct simulation studies, but typically these are designed for running simulations for a specific type of model or application. For example, the simsem package provides functions for simulating structural equation models ( Pornprasertmanit, Miller, & Schoemann, 2012 ), ergm includes functions for simulating social network exponential random graphs ( Handcock et al., 2012 ), mirt allows users to simulate multivariate-normal data for item response theory ( Chalmers, 2012 ), and the simulate function in the native stats package allows users to simulate fitted general linear models and generalized linear models. It should be noted that many simulation studies can be conducted efficiently using these pre-existing functions, and that using the alternative, more general method for running simulation studies described here may not always be necessary. However, the current paper will describe a set of general methods and functions that can be used in a variety of simulation studies, rather than describing the methods for simulating specific types of models already developed in other packages.
R is syntax-driven, which can create an initial hurdle that prevents many researchers from using it. While the learning curve for syntax-driven statistical languages may be steep initially, many people with little or no prior programming experience have become comfortable using R. Also, such a syntax-driven platform allows for much of the program’s flexibility described above.
The simulations used in the following tutorials utilize several basic R functions, with a rationale for their use provided below and a brief description with examples given in Table 1 . A full tutorial on these basic functions and on using R in general is not given here; instead, the reader is referred to several open-source tutorials introducing R ( Kabacoff, 2012 ; Owen, 2010 ; Spector, 2004 ; Venables, Smith, & R Development Core Team, 2012 ). Some commands that serve a secondary function that are not directly related to generating or analyzing simulation data (e.g., the write.table command for saving a dataset) are not discussed here but descriptions of such functions are included in the annotated syntax examples in the appendices. More information about each of the functions used in this tutorial can be obtained from the help files included in R or by entering ?<command> in the R command line (e.g., enter ?c to get more information about the c command).
Common R commands for simulation studies (Table 1).

Commands for working with vectors:

- `c` — combines arguments to make vectors.

      a = c(3,5,4)    # create vector a containing the values 3, 5, 4
      a <- c(3,5,4)   # identical to the above, using <- instead of =
      a[2]            # return the second element of a, which is 5
      a = NULL        # remove the contents previously stored in a

- `length` — returns the length of a vector.

      a = c(3,5,4)
      length(a)       # returns 3

- `rbind` and `cbind` — combine arguments by rows or columns.

      a = c(3,5,4)
      b = c(9,8,7)
      d = rbind(a,b)  # matrix d with vector a as row 1 and vector b as row 2
      e = cbind(d,d)  # matrix e containing two copies of d joined by column

Commands for generating random values:

- `rnorm` — randomly samples values from a normal distribution with a given population mean and standard deviation.

      # 100 values from a normal distribution with population mean = 50 and SD = 10
      x = rnorm(100, 50, 10)

- `sample` — randomly samples values from another vector.

      a = c(1,2,3,4,5,6,7,8)
      sample(a, size=8, replace=TRUE)  # e.g., returns 3 1 3 6 5 4 2 2

- `set.seed` — allows exact replication of randomly-generated numbers between simulations.

      # the same 5 random numbers are returned each time these lines are run
      set.seed(12345)
      rnorm(5, 50, 10)

Command for statistical modeling:

- `lm` — fits linear ordinary least squares models.

      # Regress y onto x1 and x2
      y = c(2,2,5,4,3,6,4,6,5,7)
      x1 = c(1,2,3,1,1,2,3,1,2,2)
      x2 = c(0,0,0,0,0,1,1,1,1,1)
      mymodel = lm(y ~ x1 + x2)
      summary(mymodel)
      mymodel$coefficients  # retrieve fixed-effect coefficients from an lm object

Commands for programming:

- `function` — generates a customised function.

      # function that returns the sum of x1 and x2
      myfunction = function(x1, x2){
        mysum = x1 + x2
        return(mysum)
      }

- `for` — creates a loop, allowing sequences of commands to be executed a specified number of times.

      # Create a vector of empirical sample means (mean_vector) from 100 random
      # samples of size 20, drawn from a population with mean = 50 and SD = 10
      mean_vector = NULL
      for (i in 1:100){
        x = rnorm(20, 50, 10)
        m = mean(x)
        mean_vector = c(mean_vector, m)
      }

Note: text appearing after the # symbol is not processed by R and is typically reserved for comments and annotation. The list of commands is not exhaustive.
R is an object-oriented program that works with data structures such as vectors and data frames. Vectors are one of the simplest data structures and contain an ordered list of values. Vectors will be used throughout the examples described in this tutorial to store values for variables in simulated datasets and to store parameter estimates that are retained from statistical analyses (e.g., p-values, parameter point estimates, etc.). The examples here will make extensive use of commands for generating, indexing, and combining vectors, including the c command for generating and combining vectors, the length command for obtaining the number of items in a vector, and the rbind and cbind commands for combining vectors by row or column, respectively.
Two functions for creating random numbers, rnorm and sample, will be used in the simulation examples in this paper in order to generate values for random variables or to sample subsets of observations from an existing dataset, respectively. An additional function for setting the randomization seed, set.seed, is useful for generating the same sets of random numbers each time a simulation study is run, allowing exact replications of results.
Statistical models in these tutorials will be fit using the lm command, which fits linear regression, analysis of variance, and analysis of covariance models (however, note that there are many additional native and add-on R packages that can fit a variety of models outside of the general linear model framework). The lm command returns an object with information about the fitted linear model, which may be accessed through additional commands. For example, fixed effect coefficients for the lm object called mymodel shown in Table 1 (under the lm command) can be extracted by calling the coefficients values of mymodel, such that the syntax
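The extraction syntax referenced here was not preserved in this copy; a minimal sketch consistent with the text (refitting the Table 1 model for self-containment, with f as the receiving vector) is:

```r
# Fit the model from Table 1, then save its fixed effect coefficients to f
y  = c(2,2,5,4,3,6,4,6,5,7)
x1 = c(1,2,3,1,1,2,3,1,2,2)
x2 = c(0,0,0,0,0,1,1,1,1,1)
mymodel = lm(y ~ x1 + x2)

f = mymodel$coefficients  # intercept and coefficients for x1 and x2
```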
returns the regression coefficients for the intercept and the effects of x1 and x2 in predicting y from the data in Table 1 and saves them to a vector f, which has the following values:
Specific fixed effects could be further extracted by indexing values from vector f; for example, the command f[2] would extract the second value in vector f, which is the fixed effect coefficient for x1.
The function command allows users to generate their own customized functions, which provides a useful way of reducing syntax when a procedure is repeated many times. For example, the first tutorial below computes several Sobel statistics each time a dataset is generated, and declaring a function that computes the Sobel statistic allows the program to call on one function each time the statistic must be computed, rather than repeating several lines of the same syntax within the simulation. The for command is used to create loops, which allow sequences of commands that are specified once to be executed several times. This is useful in simulation studies because datasets often must be generated and analyzed hundreds or thousands of times.
This section outlines examples of questions that may be answered using simulation studies and describes the methods used to answer those questions. In each example, the underlying assumptions and procedures for generating and analyzing data will be discussed, and fully annotated syntax for the simulations is provided in the appendices.
Mediation analysis is a statistical technique for analyzing whether the effect of an independent variable ( X ) on an outcome variable ( Y ) can be accounted for by an intermediate variable ( M ; see Figure 1 for graphical depiction; see Hayes 2009 for pedagogical review). When mediation is present, the degree to which X predicts Y is changed when M is added to the model in the manner shown in Figure 1 (i.e., c – c ′ ≠ 0 in Figure 1 ). The degree to which the relationship between X and Y changes ( c – c ′) is called the indirect effect, which is mathematically equivalent to the product of the path coefficients ab shown in Figure 1 . The product of path coefficients ab (or equivalently, c – c ′) represents the amount of change in outcome variable Y that can be attributed to being caused by changes in the independent variable X operating through the mediating variable M . In situations where a mediator variable cannot be directly manipulated through experimentation, mediation analysis has often been championed as a method of choice for identifying variables that may cause an observed outcome ( Y ) as part of a causal sequence where X affects M , and M in turn affects Y .
Figure 1. Direct effect model (top) and mediation model (bottom).
For example, in psychotherapy research, the number of times participants receive drink-refusal training ( X ) may impact their self-efficacy to refuse drinks ( M ), and enhanced self-efficacy may in turn cause improved abstinence from alcohol ( Y ; e.g., Witkiewitz, Donovan, & Hartzler, 2012 ). Self-efficacy cannot be directly manipulated by experiment, so researchers may use mediation analysis to test whether a particular psychotherapy increases self-efficacy, and whether this in turn increases abstinence outcomes. However, little research has identified the consequences of wrongly specifying which variables are mediator variables ( M ) versus outcome variables ( Y ). For example, it could also be possible that drink-refusal training ( X ) enhances abstinence from alcohol ( Y ), which in turn enhances self-efficacy ( M; e.g., X causes Y, Y causes M ). Support for this alternative model would guide treatment providers and subsequent research efforts toward different goals than the original model, and therefore it is important to know whether mediation models are likely to produce significant results even when the true causal order of effects is incorrectly specified by investigators.
The present example uses simulations to test whether mediation models produce significant results when the implied causal ordering of effects is switched within the tested model. Data are generated for three variables, X, M, and Y, such that M mediates the relationship between X and Y ("X-M-Y" model) using ordinary least-squares (OLS) regression. Path coefficients for a (X predicting M; see Figure 1) and b (M predicting Y, controlling for X) will each be manipulated at three levels (−0.3, 0.0, 0.3), c′ (X predicting Y, controlling for M) will be manipulated at three levels (−0.2, 0.0, 0.2), and sample size (N) will be manipulated at two levels (100, 300). This results in a 3 (a) × 3 (b) × 3 (c′) × 2 (N) design. One thousand simulated datasets will be generated in each condition. Data will be generated for an X-M-Y model, and mediation tests will be conducted on the original X-M-Y models and on models that switch the order of the M and Y variables (i.e., X-Y-M models). The Sobel test (MacKinnon, Warsi, & Dwyer, 1995; Sobel, 1982) will be computed and retained for each type of mediation model, with p < 0.05 indicating significant mediation for that particular model.
Data in this example are generated in accordance with OLS regression assumptions, including the assumptions that random variables are sampled from populations with normal distributions, that residual errors are normally distributed with a mean of zero, and that residual errors are homoscedastic and serially uncorrelated. Assumptions about the relationships among X , M , and Y variables from Figure 1 are guided by the equations provided by Jo (2008) ,
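The equations themselves are not reproduced in this copy; based on the description that follows (and the standard two-equation form of the mediation model), they take the form

$$M_i = \alpha_m + a X_i + e_{Mi} \tag{1}$$

$$Y_i = \alpha_y + c' X_i + b M_i + e_{Yi} \tag{2}$$

with $e_{Mi}$ and $e_{Yi}$ denoting normally distributed residual errors with mean zero.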
where X i , M i , and Y i represent values for the independent variable, mediator, and outcome for individual i , respectively; α m and α y represent the intercepts for M and Y after the other effects are accounted for, and a , b , and c ′ correspond with the mediation regression paths shown in Figure 1 .
Data for X , M , and Y with sample size N can be generated using the rnorm command. If N , a , b , and c ′ ( c ′ is named cp in the syntax below) are each specified as single numeric values, then the following syntax will generate data for the X , M , and Y variables.
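The three-line syntax block described in the next paragraph is missing from this copy; a sketch matching that description (with illustrative parameter values, and error standard deviations fixed at one) is:

```r
N = 100; a = 0.3; b = 0.3; cp = 0.2   # illustrative values; cp stands for c'
X = rnorm(N, 0, 1)                    # X with mean 0 and SD 1
M = a*X + rnorm(N, 0, 1)              # M regresses onto X with coefficient a
Y = cp*X + b*M + rnorm(N, 0, 1)       # Y regresses onto X and M
```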
The first line of the syntax above creates a random variable X with a mean of zero and a standard deviation of one for N observations. The second line creates a random variable M that regresses onto X with regression coefficient a and a random error with a mean of zero and standard deviation of one (error variances need not be fixed with a mean of zero and standard deviation of one, and can be specified at any value based on previous research or theoretically-expected values). The third line of syntax creates a random variable Y that regresses onto X and M with regression coefficients cp and b, respectively, with a random error that has a mean of zero and standard deviation of one. It will be shown below that the intercept parameters do not affect the significance of a mediation test, and thus the intercepts were left at zero in the three lines of code above; however, the intercept parameter could be manipulated in a similar manner to a , b , and c ′ if desired.
Once the random variables X, M, and Y have been generated, the next step is to perform a statistical analysis on the simulated data. In mediation analysis, the Sobel test (MacKinnon et al., 1995; Sobel, 1982) is commonly employed (although see the section below on bootstrapping); it tests the significance of a mediation effect by computing the magnitude of the indirect effect as the product of coefficients a and b (ab) and comparing this value to the standard error of ab to obtain a z-like test statistic. Specifically, the Sobel test uses the formula
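The formula (Equation 3) is not reproduced in this copy; the standard Sobel statistic, consistent with the description below, is

$$z = \frac{ab}{\sqrt{b^2 s_a^2 + a^2 s_b^2}} \tag{3}$$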
where s a and s b are the standard errors of the estimates for regression coefficients a and b , respectively. The product of coefficients ab reflects the degree to which the effect of X on Y is mediated through variable M , and is contained in the numerator of Equation 3 . The standard error of the distribution of ab is in the denominator of Equation 3 , and the Sobel statistic obtained in the full equation provides a z -like statistic that tests whether the ab effect is significantly different from zero. Because the Sobel test will be computed many times, making a function to compute the Sobel test provides an efficient way to compute the test repeatedly. Such a function is defined below and called sobel_test. The function takes three arguments, vectors X, M, and Y as the first, second, and third arguments, respectively, and computes regression models for M regressed onto X and Y regressed onto X and M. The coefficients representing a , b , s a , and s b in Equation 3 are extracted by calling coefficients, then a Sobel test is computed and returned.
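The function definition itself is missing from this copy; one implementation consistent with the description above is:

```r
# Sobel test for mediation: X, M, and Y are numeric vectors of equal length
sobel_test = function(X, M, Y){
  M_X  = lm(M ~ X)                        # M regressed onto X
  Y_XM = lm(Y ~ X + M)                    # Y regressed onto X and M
  a  = summary(M_X)$coefficients["X", "Estimate"]
  sa = summary(M_X)$coefficients["X", "Std. Error"]
  b  = summary(Y_XM)$coefficients["M", "Estimate"]
  sb = summary(Y_XM)$coefficients["M", "Std. Error"]
  a*b / sqrt(b^2*sa^2 + a^2*sb^2)         # Sobel z statistic (Equation 3)
}
```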
So far syntax has been provided to generate one set of X , M , and Y variables and to compute a Sobel z -statistic from these variables. These procedures can now be repeated several hundred or thousand times to observe how this model behaves across many samples, which may be accomplished with for loops, as shown below. In the syntax below, the procedure for generating data and computing a Sobel test is repeated reps number of times, where reps is a single integer value. For each iteration of the for loop, data are saved to a matrix called d to retain information about the iteration number (i), a , b , and c ′ parameters (a, b, and cp), the sample size (N), an indexing variable that tells whether the test statistic corresponds with an X-M-Y or X-Y-M mediation model (1 vs. 2), and the computed Sobel test statistic which calls on the sobel_test function above.
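The loop described above is missing from this copy; a sketch consistent with the description (assuming sobel_test and single values for a, b, cp, and N are already defined) is:

```r
reps = 1000
d = NULL
for (i in 1:reps){
  X = rnorm(N, 0, 1)
  M = a*X + rnorm(N, 0, 1)
  Y = cp*X + b*M + rnorm(N, 0, 1)
  # model index 1: X-M-Y mediation; model index 2: X-Y-M (M and Y switched)
  d = rbind(d, c(i, a, b, cp, N, 1, sobel_test(X, M, Y)))
  d = rbind(d, c(i, a, b, cp, N, 2, sobel_test(X, Y, M)))
}
```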
The above steps can then be repeated for datasets generated according to different parameters. In the present example, we wish to test three different values of a , b , c ′, and N . Syntax for manipulating these parameters is included below. The values selected for a , b , c ′, and N are specified as vectors called a_list, b_list, cp_list, and N_list, respectively. Four nested for loops index through each of the values in a_list, b_list, cp_list, and N_list and extract single values for these parameters that are used for data generation. For each combination of a , b , c ′, and N , reps number of datasets are generated and subjected to the Sobel test using the same syntax presented above (some syntax is omitted below for brevity, and full syntax with more detailed annotation for this example is provided in Appendix A ), and the data are then saved to a matrix called d:
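The nested-loop syntax is likewise omitted in this copy; its overall shape, per the description above, is:

```r
a_list  = c(-0.3, 0, 0.3)   # levels of a
b_list  = c(-0.3, 0, 0.3)   # levels of b
cp_list = c(-0.2, 0, 0.2)   # levels of c'
N_list  = c(100, 300)       # sample sizes
reps = 1000
d = NULL
for (N in N_list){
  for (a in a_list){
    for (b in b_list){
      for (cp in cp_list){
        for (i in 1:reps){
          X = rnorm(N, 0, 1)
          M = a*X + rnorm(N, 0, 1)
          Y = cp*X + b*M + rnorm(N, 0, 1)
          d = rbind(d, c(i, a, b, cp, N, 1, sobel_test(X, M, Y)))
          d = rbind(d, c(i, a, b, cp, N, 2, sobel_test(X, Y, M)))
        }
      }
    }
  }
}
```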
Executing the syntax above generates a matrix d that contains Sobel test statistics for X-M-Y (omitted for brevity) and X-Y-M mediation models (shown above) generated from a variety of a , b , c ′, and N parameters. The next step is to evaluate the results of these models. Before this is done, it will be helpful to add labels to the variables in matrix d to allow for easy extraction of subsets of the results and to facilitate their interpretation:
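The labeling command is not preserved here; assuming the column order used when building d (the label "sobel_z" for the test statistic is a hypothetical name), it would be:

```r
colnames(d) = c("i", "a", "b", "cp", "N", "model", "sobel_z")
```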
It is also desirable to save a backup copy of the results using the command
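The command itself is not preserved in this copy; based on the output file name used later (mediation_output.csv), it would be along the lines of:

```r
write.csv(d, "...\\mediation_output.csv")  # "..." is the target directory
```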
In the syntax above, "..." must be replaced with the directory where results should be saved, and each folder must be separated by double backslashes ("\\") if R is running on a Windows computer (on macOS and Linux/Unix, a single forward slash ("/") should be used).
Researchers can choose any number of ways to analyze the results of simulation studies, and the method chosen should be based on the nature of the question under examination. One way to compare the distributions of Sobel z-statistics obtained for the X-M-Y and X-Y-M mediation models in the current example is to use boxplots, which can be created in R (see ?boxplot for details) or in other statistical software by importing the mediation_output.csv file. As seen in Figure 2, in the first two conditions, where the population parameters are a = 0.3, b = 0.3, and c′ = 0.2, Sobel tests for X-M-Y and X-Y-M mediation models produce test statistics with nearly identical distributions, and Sobel test values are almost always significant (|z| > 1.96, which corresponds with p < .05, two-tailed) when N = 300 and the other assumptions described above hold. In the latter two conditions, where the population parameters are a = 0.3, b = 0.3, and c′ = 0, test statistics for X-M-Y models remain high, while test statistics for X-Y-M models are lower, even though approximately 25% of these models still had Sobel z-test statistics with magnitudes greater than 1.96 (and thus, p-values less than 0.05).
Figure 2. Boxplot of partial results from Example 1 with N = 300.
The similarity of results between X-M-Y and X-Y-M models suggests limitations of using mediation analysis to identify causal relationships. Specifically, the same datasets may produce significant results under a variety of models that support different theories of the causal ordering of relations. For example, a variable that is truly a mediator may instead be specified as an outcome and still produce “significant” results in a mediation analysis. This could imply misleading support for a causal chain due to the way researchers specify the ordering of variables in the analysis. This finding suggests that mediation analysis may produce misleading results in some situations, particularly when data are cross-sectional because of the lack of temporal-ordering for observations of X , M , and Y that could provide stronger testing of a proposed causal sequence (Maxwell & Cole, 2007; Maxwell, Cole, & Mitchell, 2011 ). One implication of these findings is that researchers who perform mediation analysis should test alternative models. For example, researchers could test alternative models with assumed mediators modeled as outcomes and assumed outcomes modeled as mediators to test whether other plausible models are also “significant” (e.g., Witkiewitz et al., 2012 ).
Simulations can be used to estimate the statistical power of a model, that is, the likelihood of rejecting the null hypothesis for a particular effect under a given set of conditions. Although statistical power can be estimated directly for many analyses with power tables (e.g., Maxwell & Delaney, 2004) and free software such as G*Power (Erdfelder, Faul, & Buchner, 2006; see Mayr, Erdfelder, Buchner, & Faul, 2007 for a tutorial on using G*Power), many types of analyses currently have no well-established method for directly estimating statistical power, as is the case with mediation analysis.
The steps in Example 1 provide the necessary data to estimate the power of a mediation analysis if the assumptions and parameters specified in Example 1 remain the same. Thus, using the simulation results saved in dataset d generated in Example 1, the power of a mediation model under a given set of conditions can be estimated by identifying the relative frequency in which a mediation test was significant.
For example, the syntax below extracts the Sobel test statistics from dataset d under the condition where a = 0.3, b = 0.3, c′ = 0.2, N = 300, and "model" = 1 (i.e., an X-M-Y mediation model is tested). The vector of Sobel test statistics across 1000 repetitions is saved in a variable called z_dist. The absolute value of each of the numbers in z_dist is compared against 1.96 (i.e., the z-value that corresponds with p < 0.05, two-tailed), creating a vector of values that are either TRUE (if the absolute value is greater than 1.96) or FALSE (if the absolute value is less than or equal to 1.96). The number of TRUE and FALSE values can be summarized using the table command (see ?table for details), which, when divided by the number of values in the vector, gives the proportion of Sobel tests with absolute value greater than 1.96:
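The extraction-and-tabulation syntax is missing from this copy; a sketch consistent with the description (assuming column labels have been applied to d as described earlier, with the test statistic stored under a hypothetical "sobel_z" label) is:

```r
z_dist = d[d[,"a"] == 0.3 & d[,"b"] == 0.3 & d[,"cp"] == 0.2 &
           d[,"N"] == 300 & d[,"model"] == 1, "sobel_z"]
table(abs(z_dist) > 1.96) / length(z_dist)
```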
When the above syntax is run, the following result is printed
which indicates that 99.7% of the datasets randomly sampled under the conditions specified above produced significant Sobel tests, and that the analysis has an estimated power of 0.997.
One could also test the power of mediation models with different parameters specified. For example, the power of a model with all the same parameters as above except with a smaller sample size of N = 100 could be examined using the syntax
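Under the same assumptions about column labels, the corresponding syntax changes only the N condition:

```r
z_dist = d[d[,"a"] == 0.3 & d[,"b"] == 0.3 & d[,"cp"] == 0.2 &
           d[,"N"] == 100 & d[,"model"] == 1, "sobel_z"]
table(abs(z_dist) > 1.96) / length(z_dist)
```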
which produces the following output
The output above indicates that only 51.5% of the mediation models in this example were significant, which reflects the reduced power rate due to the smaller sample size. Full syntax for this example is provided in Appendix B .
In the above examples, the Sobel test was used to determine whether a mediation effect was significant. Although the Sobel test is more robust than other methods such as Baron and Kenny's (1986) causal steps approach (Hayes, 2009; MacKinnon et al., 1995), a limitation of the Sobel test is that it assumes that the sampling distribution of indirect effects (ab) is normally distributed in order for the p-value obtained from the z-like statistic to be valid. This assumption typically is not met because, while the sampling distributions for a and b are each independently normal, multiplying a and b introduces skew into the sampling distribution of ab. Bootstrapping can be used as an alternative to the Sobel test to obtain an empirically derived sampling distribution with confidence intervals that are more accurate than the Sobel test.
To obtain an empirical sampling distribution of indirect effects ab, N randomly selected participants from an observed dataset are sampled with replacement, where N is equal to the original sample size. A dataset containing the observed X, M, and Y values for these randomly resampled participants is created and subjected to a mediation analysis using Equations 1 and 2. The a and b coefficients are obtained from these regression models, and the product of these coefficients, ab, is computed and retained. This procedure is repeated many times, perhaps 1,000 or 10,000 times, with a new set of subjects from the original sample randomly selected with replacement each time (Hélie, 2006). This provides an empirical sampling distribution of the product of coefficients ab that no longer requires the standard error of the estimate for ab to be computed.
The syntax below provides the steps for bootstrapping a 95% confidence interval of an indirect effect for variables X, M, and Y. A variable called ab_vector holds the bootstrapped distribution of ab values, and is initialized to NULL to remove any data previously stored in this variable. A for loop is specified to repeat reps number of times, where reps is a single integer representing the number of repetitions that should be used for bootstrapping. Variable s is a vector containing row numbers of participants that are randomly sampled with replacement from the original observed sample (raw data for X, M, and Y in this example are provided in the supplemental file mediation_raw_data.csv; see Appendix C for syntax to import this dataset into R). The vectors Xs, Ys, and Ms store the values for X, Y, and M, respectively, that correspond with the subjects resampled based on the vector s. Finally, M_Xs and Y_XMs are lm objects containing linear regression models for Ms regressed onto Xs and for Ys regressed onto Xs and Ms, respectively, and the a and b coefficients in these two models are extracted. The product of coefficients ab is computed and saved to ab_vector, then the resampling process and computation of the ab effect are repeated. Once the repetitions are completed, 95% confidence interval limits are obtained using the quantile command to identify the values in ab_vector at the 2.5th and 97.5th percentiles (these values could be adjusted to obtain different confidence intervals; enter ?quantile in the R console for more details), and the result is saved in a vector called bootlim. Finally, a histogram of the ab effects in ab_vector is printed and displayed in Figure 3.
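The bootstrapping syntax itself is missing from this copy; a sketch matching the description (assuming the observed vectors X, M, and Y, the sample size N, and reps are already defined) is:

```r
ab_vector = NULL                               # clear any previously stored results
for (i in 1:reps){
  s  = sample(1:N, N, replace = TRUE)          # resample row numbers with replacement
  Xs = X[s]; Ys = Y[s]; Ms = M[s]              # resampled X, Y, and M values
  M_Xs  = lm(Ms ~ Xs)                          # M regressed onto X
  Y_XMs = lm(Ys ~ Xs + Ms)                     # Y regressed onto X and M
  a = M_Xs$coefficients["Xs"]
  b = Y_XMs$coefficients["Ms"]
  ab_vector = c(ab_vector, a*b)                # retain the indirect effect ab
}
bootlim = quantile(ab_vector, c(0.025, 0.975)) # 95% confidence interval limits
hist(ab_vector)                                # empirical distribution of ab
```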
Figure 3. Empirical distribution of indirect effects (ab) used for bootstrapping a confidence interval.
Full syntax with annotation for the bootstrapping procedure above is provided in Appendix C . Calling the bootlim vector returns the indirect effects that correspond with the 2.5th and 97.5th percentile of the empirical sampling distribution of ab , giving the following output:
Because the 95% confidence interval does not contain zero, the results indicate that the product of coefficients ab is significantly different than zero at p < 0.05.
The preceding sections provided demonstrations of methods to implement simulation studies for different purposes, including answering novel questions related to statistical modeling, estimating power, and bootstrapping confidence intervals. The demonstrations presented here used mediation analysis as the content area to demonstrate the underlying processes used in simulation studies, but simulation studies are not limited to questions related to mediation. Virtually any type of analysis or model could be explored using simulation studies. While the way that researchers construct simulations depends largely on the research question of interest, the basic procedures outlined here can be applied to a large array of simulation studies.
While it is possible to run simulation studies in other programming environments (e.g., the latent variable modeling software MPlus; see Muthén & Muthén, 2002), R may provide unique advantages over other programs when running simulation studies because it is free, open source, and cross-platform. R also allows researchers to generate and manipulate their data with much more flexibility than many other programs, and contains packages to run a multitude of statistical analyses of interest to social science researchers in a variety of domains.
There are several limitations of simulation studies that should be noted. First, real-world data often do not adhere to the assumptions and parameters by which data are generated in simulation studies. For example, unlike the linear regression models for the examples above, it is often the case in real world studies that residual errors are not homoscedastic and serially uncorrelated. That is, real-world datasets are likely to be more “dirty” than the “clean” datasets that are generated in simulation studies, which are often generated under idealistic conditions. While these “dirty” aspects of data can be incorporated into simulation studies, the degree to which these aspects should be modeled into the data may be unknown and thus at times difficult to incorporate in a realistic manner.
Second, it is practically impossible to know the values of true population parameters that are incorporated into simulation studies. For example, in the mediation examples above, the regression coefficients a , b , and c’ often may be unknown for a question of interest. Even if previous research provides empirically-estimated parameter estimates, the exact value for these population parameters is still unknown due to sampling error. To deal with this, researchers can run simulations across a variety of parameter values, as was done in Examples 1 and 2, to understand how their models may perform under different conditions, but pinpointing the exact parameter values that apply to their question of interest is unrealistic and typically impossible.
Third, simulation studies often require considerable computation time because hundreds or thousands of datasets often must be generated and analyzed. Simulations that are large or use iterative estimation routines (e.g., maximum likelihood) may take hours, days, or even weeks to run, depending on the size of the study.
Fourth, not all statistical questions require simulations to obtain meaningful answers. Many statistical questions can be answered through mathematical derivations, and in these cases simulation studies can demonstrate only what was shown already to be true through mathematical proofs ( Maxwell & Cole, 1995 ). Thus, simulation studies are utilized best when they derive answers to problems that do not contain simple mathematical solutions.
Simulation methods are relatively straightforward once the assumptions of a model and the parameters to be used for data generation are specified. Researchers who use simulation methods can have tight experimental control over these assumptions and their data, and can test how a model performs under a known set of parameters (whereas with real-world data, the parameters are unknown). Simulation methods are flexible and can be applied to a number of problems to obtain quantitative answers to questions that may not be possible to derive through other approaches. Results from simulation studies can be used to compare obtained results with their theoretically expected values and to evaluate competing approaches for handling data, and the flexibility of simulation studies allows them to be used for a variety of purposes.
Supplementary data and syntax

Acknowledgments
This research was funded by NIAAA grant F31AA021031.
The author would like to thank Mandy Owens, Chris McLouth, and Nick Gaspelin for their feedback on previous versions of this manuscript.
Appendix B. Syntax for Example 2

Appendix C. Syntax for Example 3
Leatherby Libraries
August 9, 2024
Following the successful launch of Chapman Figshare during Love Data Week, the Leatherby Libraries is excited to invite more members of the Chapman community to take full advantage of the new research data repository. Whether you have datasets, media files, or other research outputs, Chapman Figshare makes your data citable, shareable, and discoverable.
The Leatherby Libraries offers two open-access repositories, Chapman Figshare and Chapman University Digital Commons , to support Chapman scholars and researchers in sharing and preserving their research outputs. Both platforms cater to distinct aspects of the research lifecycle due to differences in the research outputs they accommodate. Datasets can be uploaded to Chapman Figshare, while Chapman University Digital Commons houses all other research outputs such as articles, theses, posters, and more. Check out our previous blog to learn whether your work best fits in Chapman Figshare or Chapman University Digital Commons.
Data is increasingly valuable in today’s research landscape as it increases the reliability and replicability of research findings. Chapman Figshare provides a secure and efficient platform to archive and publish your research data, ensuring it reaches a broader audience. Here’s why you should consider depositing your data in Chapman Figshare:
Ready to make your research data citable, shareable, and discoverable? Here’s how to get started:
Take advantage of the opportunity to enhance the visibility and impact of your research! Join the growing number of Chapman community members already benefiting from Chapman Figshare. By depositing your data, you contribute to a collaborative, transparent research platform that benefits scholars worldwide.
For detailed instructions on using Chapman Figshare, visit our Digital Repositories at Chapman University LibGuide or contact the LRDS team at [email protected] . Make your research count with Chapman Figshare!
May 9, 2024 by Alyssa Castanon | Resources
Finals Week is almost here again, Panthers! The Leatherby Libraries, in collaboration with the Chapman University Student Government Association, are thrilled to announce the return of coffee and snacks at the Library for Finals Week! Rotunda After-Hours Study Commons Open 24/7 for Finals Week We are pleased to announce that the After-Hours Study Commons in
July 1, 2024 by Alyssa Castanon | Resources
The Leatherby Libraries is pleased to introduce a 3-part Introduction to R Programming Summer Workshop Series to introduce students, staff, and faculty to fundamental coding concepts in R. R programming is an effective tool that can propel efficiency and effectiveness in analyzing complex statistical analyses while producing quality data visualizations. These workshops are for beginners
Call for papers.
The Tool Demonstration track of the 32nd International Conference on Software Analysis, Evolution, and Reengineering (SANER’25) provides an excellent opportunity for researchers and practitioners to showcase innovative tools, prototypes, and software systems related to software analysis, engineering, and refactoring. The track aims to foster knowledge exchange, collaboration, and discussions about the latest advancements in tools and technologies that support software development, maintenance, and improvement.
Tool demonstrations should showcase the implementation of research approaches through practical tools. These tools can range from advanced prototypes to fully developed products that are in the process of being commercialized. We particularly encourage proposals for tool demonstrations that complement full research papers. While a research paper aims to provide background information and highlight the scientific contribution of a new software engineering approach, the tool demonstration offers an excellent opportunity to demonstrate how the scientific approach has been translated into a functional tool prototype. As a result, authors of research papers are strongly encouraged to submit the corresponding tools to this track. Tool demonstrations related to any of the topics covered in the conference are welcome and deemed suitable.