Statistics articles from across Nature Portfolio

Statistics is the application of mathematical concepts to understanding and analysing large collections of data. A central tenet of statistics is to describe the variations in a data set or population using probability distributions. This analysis aids understanding of what underlies these variations and enables predictions of future changes.

Latest Research and Reviews

Effectiveness of non-pharmaceutical interventions for COVID-19 in USA

  • Weihao Wang

Prediction of fresh herbage yield using data mining techniques with limited plant quality parameters

  • Şenol Çelik
  • Halit Tutar

An identification method of LBL underwater positioning systematic error with optimal selection criterion

  • Jiongqi Wang
  • Xuanying Zhou

Analyzing spatio-temporal dynamics of dissolved oxygen for the River Thames using superstatistical methods and machine learning

  • Takuya Boehringer
  • Christian Beck

Passive earth pressure on vertical rigid walls with negative wall friction coupling statically admissible stress field and soft computing

  • Tram Bui-Ngoc

Omicron COVID-19 immune correlates analysis of a third dose of mRNA-1273 in the COVE trial

Using data from a phase 3 efficacy trial, the authors here show that post-boost Omicron BA.1 spike-specific binding and neutralizing antibodies inversely correlate with Omicron COVID-19 and booster efficacy for naive and non-naive participants, supporting the continued use of antibody as a surrogate endpoint.

  • Lars W. P. van der Laan

News and Comment

Machine learning reveals the merging history of nearby galaxies

A probabilistic machine learning method trained on cosmological simulations is used to determine whether stars in 10,000 nearby galaxies formed internally or were accreted from other galaxies during merging events. The model predicts that only 20% of the stellar mass in present day galaxies is the result of past mergers.

Efficient learning of many-body systems

The Hamiltonian describing a quantum many-body system can be learned using measurements in thermal equilibrium. Now, a learning algorithm applicable to many natural systems has been found that requires exponentially fewer measurements than existing methods.

Fudging the volcano-plot without dredging the data

Selecting omic biomarkers using both their effect size and their differential status significance (i.e., selecting the “volcano-plot outer spray”) has long been equally biologically relevant and statistically troublesome. However, recent proposals are paving the way to resolving this dilemma.

  • Thomas Burger

Disentangling truth from bias in naturally occurring data

A technique that leverages duplicate records in crowdsourcing data could help to mitigate the effects of biases in research and services that are dependent on government records.

  • Daniel T. O’Brien

Sciama’s argument on life in a random universe and distinguishing apples from oranges

Dennis Sciama has argued that the existence of life depends on many quantities—the fundamental constants—so in a random universe life should be highly unlikely. However, without full knowledge of these constants, his argument implies a universe that could appear to be ‘intelligently designed’.

  • Zhi-Wei Wang
  • Samuel L. Braunstein

A method for generating constrained surrogate power laws

A paper in Physical Review X presents a method for numerically generating data sequences that are as likely to be observed under a power law as a given observed dataset.

  • Zoe Budrikis

Data Science: the impact of statistics

  • Regular Paper
  • Open access
  • Published: 16 February 2018
  • Volume 6, pages 189–194 (2018)

  • Claus Weihs
  • Katja Ickstadt

In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview of different proposed structures of Data Science and address the impact of statistics on such steps as data acquisition and enrichment, data exploration, data analysis and modeling, validation and representation and reporting. Also, we indicate fallacies that arise when statistical reasoning is neglected.

1 Introduction and premise

Data Science as a scientific discipline is influenced by informatics, computer science, mathematics, operations research, and statistics as well as the applied sciences.

In 1996, for the first time, the term Data Science was included in the title of a statistical conference (International Federation of Classification Societies (IFCS) “Data Science, classification, and related methods”) [37]. Even though the term was coined by statisticians, the public image of Data Science often places much more stress on computer science and business applications, in particular in the era of Big Data.

As early as the 1970s, the ideas of John Tukey [43] changed the viewpoint of statistics from a purely mathematical setting, e.g., statistical testing, to deriving hypotheses from data (exploratory setting), i.e., trying to understand the data before hypothesizing.

Another root of Data Science is Knowledge Discovery in Databases (KDD) [36] with its sub-topic Data Mining. KDD already brings together many different approaches to knowledge discovery, including inductive learning, (Bayesian) statistics, query optimization, expert systems, information theory, and fuzzy sets. Thus, KDD is a big building block for fostering interaction between different fields for the overall goal of identifying knowledge in data.

Nowadays, these ideas are combined in the notion of Data Science, leading to different definitions. One of the most comprehensive definitions of Data Science was recently given by Cao as the formula [ 12 ]:

data science = (statistics + informatics + computing + communication + sociology + management) | (data + environment + thinking).

In this formula, sociology stands for the social aspects and | (data + environment + thinking) means that all the mentioned sciences act on the basis of data, the environment and the so-called data-to-knowledge-to-wisdom thinking.

A recent, comprehensive overview of Data Science provided by Donoho in 2015 [16] focuses on the evolution of Data Science from statistics. Indeed, as early as 1997, there was an even more radical view suggesting that statistics be renamed Data Science [50]. And in 2015, a number of ASA leaders [17] released a statement about the role of statistics in Data Science, saying that “statistics and machine learning play a central role in data science.”

In our view, statistical methods are crucial in most fundamental steps of Data Science. Hence, the premise of our contribution is:

Statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty.

This paper aims at addressing the major impact of statistics on the most important steps in Data Science.

2 Steps in data science

One of the forerunners of Data Science from a structural perspective is the famous CRISP-DM (Cross Industry Standard Process for Data Mining), which is organized in six main steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment [10], see Table 1, left column. Ideas like CRISP-DM are now fundamental for applied statistics.

In our view, the main steps in Data Science have been inspired by CRISP-DM and have evolved, leading to, e.g., our definition of Data Science as a sequence of the following steps: Data Acquisition and Enrichment, Data Storage and Access, Data Exploration, Data Analysis and Modeling, Optimization of Algorithms, Model Validation and Selection, Representation and Reporting of Results, and Business Deployment of Results. Note that the steps set in small capitals (Data Storage and Access, Optimization of Algorithms, and Business Deployment of Results) are those where statistics is less involved, cp. Table 1, right column.

Usually, these steps are not just conducted once but are iterated in a cyclic loop. In addition, it is common to alternate between two or more steps. This holds especially for the steps Data Acquisition and Enrichment, Data Exploration, and Statistical Data Analysis, as well as for Statistical Data Analysis and Modeling and Model Validation and Selection.

Table  1 compares different definitions of steps in Data Science. The relationship of terms is indicated by horizontal blocks. The missing step Data Acquisition and Enrichment in CRISP-DM indicates that that scheme deals with observational data only. Moreover, in our proposal, the steps Data Storage and Access and Optimization of Algorithms are added to CRISP-DM, where statistics is less involved.

The list of steps for Data Science may even be enlarged, see, e.g., Cao in [ 12 ], Figure 6, cp. also Table  1 , middle column, for the following recent list: Domain-specific Data Applications and Problems, Data Storage and Management, Data Quality Enhancement, Data Modeling and Representation, Deep Analytics, Learning and Discovery, Simulation and Experiment Design, High-performance Processing and Analytics, Networking, Communication, Data-to-Decision and Actions.

In principle, Cao’s and our proposal cover the same main steps. However, in parts, Cao’s formulation is more detailed; e.g., our step Data Analysis and Modeling corresponds to Data Modeling and Representation, Deep Analytics, Learning and Discovery . Also, the vocabularies differ slightly, depending on whether the respective background is computer science or statistics. In that respect note that Experiment Design in Cao’s definition means the design of the simulation experiments.

In what follows, we will highlight the role of statistics by discussing all the steps in which it is heavily involved, in Sects. 2.1–2.6. These coincide with all steps in our proposal in Table 1 except the steps in small capitals. The corresponding entries Data Storage and Access and Optimization of Algorithms are mainly covered by informatics and computer science, whereas Business Deployment of Results is covered by Business Management.

2.1 Data acquisition and enrichment

Design of experiments (DOE) is essential for a systematic generation of data when the effect of noisy factors has to be identified. Controlled experiments are fundamental for robust process engineering to produce reliable products despite variation in the process variables. On the one hand, even controllable factors contain a certain amount of uncontrollable variation that affects the response. On the other hand, some factors, like environmental factors, cannot be controlled at all. Nevertheless, at least the effect of such noisy influencing factors should be controlled by, e.g., DOE.

DOE can be utilized, e.g.,

  • to systematically generate new data (data acquisition) [33],
  • to systematically reduce data bases [41], and
  • to tune (i.e., optimize) parameters of algorithms [1], i.e., to improve the data analysis methods (see Sect. 2.3) themselves.
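As a minimal sketch of systematic data generation, the following Python code builds a two-level full factorial design and estimates main effects from it. The factor names, the coded levels -1/+1, and the response formula are invented for illustration only and are not taken from the cited references.

```python
from itertools import product


def full_factorial(levels):
    """Return every combination of factor levels as a list of dicts."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[name] for name in names))]


def main_effect(design, responses, factor):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for run, y in zip(design, responses) if run[factor] == +1]
    low = [y for run, y in zip(design, responses) if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


# Hypothetical two-level factors, coded -1 (low) and +1 (high).
levels = {"temperature": [-1, +1], "pressure": [-1, +1], "speed": [-1, +1]}
design = full_factorial(levels)  # 2**3 = 8 runs

# Made-up, noise-free response: only temperature and pressure matter.
responses = [10 + 3 * run["temperature"] - 2 * run["pressure"] for run in design]

effects = {factor: main_effect(design, responses, factor) for factor in levels}
print(effects)
```

Because every level combination appears exactly once, each main effect is estimated independently of the other factors. In a real experiment the responses would be noisy measurements, and replication or a fractional design might be needed.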

Simulations [ 7 ] may also be used to generate new data. A tool for the enrichment of data bases to fill data gaps is the imputation of missing data [ 31 ].
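The simplest conceivable imputation replaces each missing value by the mean of the observed values. The sketch below assumes that missing entries are encoded as `None` and uses made-up numbers; it only illustrates the idea and is not meant to stand in for the more principled methods in the literature.

```python
from statistics import mean, variance


def impute_mean(column):
    """Replace None entries by the mean of the observed entries."""
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in column]


column = [2.0, None, 4.0, 6.0, None]  # illustrative data with two gaps
observed = [x for x in column if x is not None]
filled = impute_mean(column)

print(filled)
print(variance(observed), variance(filled))
```

The second print shows why this naive approach should be used with care: the mean is preserved, but the sample variance of the completed column is smaller than that of the observed values, so uncertainty is understated.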

Such statistical methods for data generation and enrichment need to be part of the backbone of Data Science. The exclusive use of observational data without any noise control distinctly diminishes the quality of data analysis results and may even lead to wrong result interpretation. The hope for “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” [ 4 ] appears to be wrong due to noise in the data.

Thus, experimental design is crucial for the reliability, validity, and replicability of our results.

2.2 Data exploration

Exploratory statistics is essential for data preprocessing to learn about the contents of a data base. Exploration and visualization of observed data was, in a way, initiated by John Tukey [43]. Since that time, the most laborious part of data analysis, namely data understanding and transformation, has become an important part of statistical science.

Data exploration or data mining is fundamental for the proper usage of analytical methods in Data Science. The most important contribution of statistics is the notion of distribution. It allows us to represent variability in the data as well as (a-priori) knowledge of parameters, the concept underlying Bayesian statistics. Distributions also enable us to choose adequate subsequent analytic models and methods.
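To make the role of distributions concrete, the following sketch represents a-priori knowledge about a success probability by a Beta distribution and updates it with binomial data, using the standard conjugate Beta-binomial result. The prior parameters and the observed counts are made-up values chosen for illustration.

```python
def beta_binomial_update(a, b, successes, trials):
    """Posterior Beta parameters after observing binomial data under a Beta(a, b) prior."""
    return a + successes, b + (trials - successes)


def beta_mean(a, b):
    return a / (a + b)


def beta_variance(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))


prior = (2, 2)  # hypothetical prior: weakly centered on 0.5
posterior = beta_binomial_update(*prior, successes=7, trials=10)

print(posterior, beta_mean(*posterior), beta_variance(*posterior))
```

The posterior is again a full distribution, so it carries both a point estimate (its mean) and a measure of the remaining uncertainty (its variance), which shrinks as more data arrive.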

2.3 Statistical data analysis

Finding structure in data and making predictions are the most important steps in Data Science. Here, in particular, statistical methods are essential since they are able to handle many different analytical tasks. Important examples of statistical data analysis methods are the following.

Hypothesis testing is one of the pillars of statistical analysis. Questions arising in data-driven problems can often be translated to hypotheses. Also, hypotheses are the natural links between underlying theory and statistics. Since statistical hypotheses are related to statistical tests, questions and theory can be tested for the available data. Using the same data in several different tests often makes it necessary to correct the significance levels. In applied statistics, correct multiple testing is one of the most important problems, e.g., in pharmaceutical studies [15]. Ignoring such techniques would lead to many more significant results than justified.
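The effect of correcting significance levels can be sketched with two standard procedures, the Bonferroni correction and Holm's step-down method. Both are well-known techniques chosen here for illustration; the p-values are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i if p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the sorted p-values with alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are retained
    return reject


p_values = [0.001, 0.012, 0.03, 0.04, 0.20]  # illustrative values only

uncorrected = [p <= 0.05 for p in p_values]
print(sum(uncorrected), sum(bonferroni(p_values)), sum(holm(p_values)))
```

Without correction, four of the five tests appear significant at the 5% level; after correction only one (Bonferroni) or two (Holm) remain, which illustrates how ignoring multiplicity inflates the number of apparent findings.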

Classification methods are basic for finding and predicting subpopulations from data. In the so-called unsupervised case, such subpopulations are to be found from a data set without a-priori knowledge of any cases of such subpopulations. This is often called clustering.

In the so-called supervised case, classification rules should be found from a labeled data set for the prediction of unknown labels when only influential factors are available.

Nowadays, there is a plethora of methods for the unsupervised [22] as well as for the supervised case [2].
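The difference between the two cases can be sketched with two deliberately simple representatives: k-means clustering (Lloyd's algorithm) for the unsupervised case and a nearest-centroid rule for the supervised case. These methods, the toy points, and the class labels "a" and "b" are assumptions chosen for illustration, not methods singled out by the text.

```python
import math


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def nearest(point, centers):
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, centers, iterations=20):
    """Unsupervised: Lloyd's algorithm started from the given centers."""
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, [nearest(p, centers) for p in points]


def fit_nearest_centroid(points, labels):
    """Supervised: one centroid per labeled class."""
    return {c: centroid([p for p, label in zip(points, labels) if label == c])
            for c in sorted(set(labels))}


def predict(model, point):
    return min(model, key=lambda c: math.dist(point, model[c]))


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]

# Unsupervised: no labels are given, groups are discovered from the data.
centers, clusters = kmeans(points, centers=[points[0], points[3]])

# Supervised: labels are known for the training data and predicted for new points.
model = fit_nearest_centroid(points, ["a", "a", "a", "b", "b", "b"])
print(clusters, predict(model, (0.5, 0.5)), predict(model, (5.5, 5.0)))
```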

In the age of Big Data, a new look at the classical methods appears to be necessary, though, since most of the time the computational effort of complex analysis methods grows faster than linearly with the number of observations n or the number of features p. In the case of Big Data, i.e., if n or p is large, this leads to excessive computation times and to numerical problems. This results both in the comeback of simpler optimization algorithms with low time complexity [9] and in a re-examination of the traditional methods in statistics and machine learning for Big Data [46].

Regression methods are the main tool to find global and local relationships between features when the target variable is measured. Depending on the distributional assumption for the underlying data, different approaches may be applied. Under the normality assumption, linear regression is the most common method, while generalized linear regression is usually employed for other distributions from the exponential family [18]. More advanced methods comprise functional regression for functional data [38], quantile regression [25], and regression based on loss functions other than plain squared error loss, e.g., the penalized loss of Lasso regression [11, 21]. In the context of Big Data, the challenges are similar to those for classification methods given large numbers of observations n (e.g., in data streams) and/or large numbers of features p. For the reduction of n, data reduction techniques like compressed sensing, random projection methods [20] or sampling-based procedures [28] enable faster computations. For decreasing the number p to the most influential features, variable selection or shrinkage approaches like the Lasso [21] can be employed, keeping the interpretability of the features. (Sparse) principal component analysis [21] may also be used.
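How shrinkage performs variable selection can be sketched with a small coordinate-descent solver for the Lasso objective (1/(2n)) * sum of squared residuals + lam * sum of |beta_j|. Coordinate descent with soft-thresholding is one standard way to minimize this objective; the solver below, the tiny design matrix, and the penalty value are illustrative assumptions, not an implementation taken from the cited references.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, lam, sweeps=200):
    """Coordinate descent for (1/(2n)) * ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Partial residual: the current fit without feature j.
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


# Toy data: the response depends strongly on feature 0 and weakly on feature 1.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2 * x0 + 0.1 * x1 for x0, x1 in X]

print(lasso(X, y, lam=0.0))  # no penalty: ordinary least squares
print(lasso(X, y, lam=0.5))  # penalty removes the weak feature entirely
```

With the penalty switched on, the coefficient of the weak feature is set exactly to zero while the strong one is only shrunk, which is the sense in which the Lasso reduces p while keeping the remaining features interpretable.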

Time series analysis aims at understanding and predicting temporal structure [ 42 ]. Time series are very common in studies of observational data, and prediction is the most important challenge for such data. Typical application areas are the behavioral sciences and economics as well as the natural sciences and engineering. As an example, let us have a look at signal analysis, e.g., speech or music data analysis. Here, statistical methods comprise the analysis of models in the time and frequency domains. The main aim is the prediction of future values of the time series itself or of its properties. For example, the vibrato of an audio time series might be modeled in order to realistically predict the tone in the future [ 24 ] and the fundamental frequency of a musical tone might be predicted by rules learned from elapsed time periods [ 29 ].

In econometrics, multiple time series and their co-integration are often analyzed [ 27 ]. In technical applications, process control is a common aim of time series analysis [ 34 ].
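As a minimal time-domain example of prediction, the sketch below simulates a zero-mean first-order autoregressive series, x[t] = phi * x[t-1] + noise, estimates phi by least squares, and forecasts the next values. The AR(1) model, the parameter value 0.7, and the series length are assumptions made for illustration; real signals such as audio would call for richer models.

```python
import random


def simulate_ar1(phi, n, noise_sd=1.0, seed=0):
    """Generate n values of x[t] = phi * x[t-1] + Gaussian noise."""
    rng = random.Random(seed)
    x, series = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, noise_sd)
        series.append(x)
    return series


def fit_ar1(series):
    """Least-squares estimate of phi (no intercept, i.e. a zero-mean series)."""
    previous, current = series[:-1], series[1:]
    return sum(a * b for a, b in zip(previous, current)) / sum(a * a for a in previous)


def forecast(last_value, phi, steps):
    """Iterate the fitted recursion; future noise is replaced by its mean, zero."""
    predictions = []
    for _ in range(steps):
        last_value = phi * last_value
        predictions.append(last_value)
    return predictions


series = simulate_ar1(phi=0.7, n=2000)
phi_hat = fit_ar1(series)
next_values = forecast(series[-1], phi_hat, steps=3)
print(phi_hat, next_values)
```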

2.4 Statistical modeling

Complex interactions between factors can be modeled by graphs or networks . Here, an interaction between two factors is modeled by a connection in the graph or network [ 26 , 35 ]. The graphs can be undirected as, e.g., in Gaussian graphical models, or directed as, e.g., in Bayesian networks. The main goal in network analysis is deriving the network structure. Sometimes, it is necessary to separate (unmix) subpopulation specific network topologies [ 49 ].

Stochastic differential and difference equations can represent models from the natural and engineering sciences [ 3 , 39 ]. The finding of approximate statistical models solving such equations can lead to valuable insights for, e.g., the statistical control of such processes, e.g., in mechanical engineering [ 48 ]. Such methods can build a bridge between the applied sciences and Data Science.

Local models and globalization. Typically, statistical models are only valid in sub-regions of the domain of the involved variables. Then, local models can be used [8]. The analysis of structural breaks can be basic to identify the regions for local modeling in time series [5]. Also, the analysis of concept drifts can be used to investigate model changes over time [30].

In time series, there are often hierarchies of more and more global structures. For example, in music, a basic local structure is given by the notes and more and more global ones by bars, motifs, phrases, parts etc. In order to find global properties of a time series, properties of the local models can be combined to more global characteristics [ 47 ].

Mixture models can also be used for the generalization of local to global models [ 19 , 23 ]. Model combination is essential for the characterization of real relationships since standard mathematical models are often much too simple to be valid for heterogeneous data or bigger regions of interest.

2.5 Model validation and model selection

In cases where more than one model is proposed for, e.g., prediction, statistical tests for comparing models are helpful to structure the models, e.g., concerning their predictive power [ 45 ].

Predictive power is typically assessed by means of so-called resampling methods where the distribution of power characteristics is studied by artificially varying the subpopulation used to learn the model. Characteristics of such distributions can be used for model selection [ 7 ].
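One common resampling scheme is k-fold cross-validation, in which the data used to learn the model are varied by holding out one fold at a time. The sketch below compares two candidate models on simulated data; the models (a constant mean and a straight line), the data-generating formula, and k = 5 are assumptions chosen for illustration.

```python
import random


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Simple linear regression by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def cross_validated_mse(fit, xs, ys, k=5, seed=0):
    """Return the held-out mean squared error of each of the k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(sum((ys[i] - model(xs[i])) ** 2 for i in test) / len(test))
    return scores


rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]  # simulated linear relationship

mean_scores = cross_validated_mse(fit_mean, xs, ys)
line_scores = cross_validated_mse(fit_line, xs, ys)
print(sum(mean_scores) / 5, sum(line_scores) / 5)
```

The list of fold scores is an empirical distribution of the error measure; its mean, spread, or quantiles can then be used to select between the candidate models.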

Perturbation experiments offer another possibility to evaluate the performance of models. In this way, the stability of the different models against noise is assessed [ 32 , 44 ].

Meta-analysis as well as model averaging are methods to evaluate combined models [ 13 , 14 ].

Model selection has become more and more important in recent years, since the number of classification and regression models proposed in the literature has been growing at an ever higher rate.

2.6 Representation and reporting

Visualization to interpret found structures and storing of models in an easy-to-update form are very important tasks in statistical analyses to communicate the results and safeguard data analysis deployment. Deployment is decisive for obtaining interpretable results in Data Science. It is the last step in CRISP-DM [10] and underlies the data-to-decision and action step in Cao [12].

Besides visualization and adequate model storing, for statistics, the main task is reporting of uncertainties and review [ 6 ].

3 Fallacies

The statistical methods described in Sect.  2 are fundamental for finding structure in data and for obtaining deeper insight into data, and thus, for a successful data analysis. Ignoring modern statistical thinking or using simplistic data analytics/statistical methods may lead to avoidable fallacies. This holds, in particular, for the analysis of big and/or complex data.

As mentioned at the end of Sect.  2.2 , the notion of distribution is the key contribution of statistics. Not taking into account distributions in data exploration and in modeling restricts us to report values and parameter estimates without their corresponding variability. Only the notion of distributions enables us to predict with corresponding error bands.
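One simple way to attach variability to an estimate is the percentile bootstrap, which approximates the sampling distribution of a statistic by resampling the data with replacement. The sketch below is an illustration under stated assumptions: the measurements are made up, and the number of draws and the 95% level are arbitrary choices.

```python
import random
from statistics import mean


def bootstrap_interval(data, statistic=mean, draws=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for a statistic of the data."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(draws)
    )
    lower_index = round((1 - level) / 2 * draws)
    upper_index = round((1 + level) / 2 * draws) - 1
    return estimates[lower_index], estimates[upper_index]


data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.1]  # illustrative measurements
estimate = mean(data)
lower, upper = bootstrap_interval(data)
print(estimate, (lower, upper))
```

Reporting the interval alongside the point estimate conveys how much the estimate could plausibly vary, which a bare number cannot do.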

Moreover, distributions are the key to model-based data analytics. For example, unsupervised learning can be employed to find clusters in data. If additional structure like dependency on space or time is present, it is often important to infer parameters like cluster radii and their spatio-temporal evolution. Such model-based analysis heavily depends on the notion of distributions (see [ 40 ] for an application to protein clusters).

If more than one parameter is of interest, it is advisable to compare univariate hypothesis testing approaches to multiple procedures, e.g., in multiple regression, and choose the most adequate model by variable selection. Restricting oneself to univariate testing would ignore relationships between variables.

Deeper insight into data might require more complex models, like, e.g., mixture models for detecting heterogeneous groups in data. When ignoring the mixture, the result often represents a meaningless average, and learning the subgroups by unmixing the components might be needed. In a Bayesian framework, this is enabled by, e.g., latent allocation variables in a Dirichlet mixture model. For an application of decomposing a mixture of different networks in a heterogeneous cell population in molecular biology see [ 49 ].
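The "meaningless average" can be illustrated with a two-component Gaussian mixture. Note that the sketch below unmixes the components with the EM algorithm (a maximum-likelihood approach) rather than the Bayesian Dirichlet mixture model mentioned above; it is swapped in only because it is short. The component means, sizes, and starting values are assumptions for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def em_two_gaussians(data, mu, sigma, weight, iterations=100):
    """EM for a 1-D mixture of two Gaussians; weight is the share of component 0."""
    mu, sigma = list(mu), list(sigma)
    n = len(data)
    for _ in range(iterations):
        # E-step: probability that each point belongs to component 0.
        resp = []
        for x in data:
            a = weight * normal_pdf(x, mu[0], sigma[0])
            b = (1 - weight) * normal_pdf(x, mu[1], sigma[1])
            resp.append(a / (a + b))
        # M-step: weighted means, standard deviations, and mixing weight.
        n0 = sum(resp)
        n1 = n - n0
        mu[0] = sum(r * x for r, x in zip(resp, data)) / n0
        mu[1] = sum((1 - r) * x for r, x in zip(resp, data)) / n1
        sigma[0] = math.sqrt(sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, data)) / n0)
        sigma[1] = math.sqrt(sum((1 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, data)) / n1)
        weight = n0 / n
    return mu, sigma, weight


rng = random.Random(1)
# Simulated heterogeneous population: 300 points around 0 and 100 points around 6.
data = [rng.gauss(0, 1) for _ in range(300)] + [rng.gauss(6, 1) for _ in range(100)]

overall_mean = sum(data) / len(data)
mu, sigma, weight = em_two_gaussians(data, mu=[1.0, 5.0], sigma=[1.0, 1.0], weight=0.5)
print(overall_mean, mu, weight)
```

The overall mean lies near 1.5, a value around which hardly any observations fall, whereas the fitted component means near 0 and 6 describe the two subgroups actually present.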

A mixture model might represent mixtures of components of very unequal sizes, with small components (outliers) being of particular importance. In the context of Big Data, naïve sampling procedures are often employed for model estimation. However, these have the risk of missing small mixture components. Hence, model validation or sampling according to a more suitable distribution as well as resampling methods for predictive power are important.
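The risk can be quantified under a simple assumption: if observations are drawn independently and a component has mixture weight w, the probability that a sample of size m contains no observation from that component is (1 - w)^m. The weight and sample sizes below are made-up values.

```python
def prob_component_missed(weight, sample_size):
    """Probability that an i.i.d. sample contains no point from a component of the given weight."""
    return (1 - weight) ** sample_size


# Hypothetical small component making up 0.1% of the data.
for m in (100, 1000, 10000):
    print(m, prob_component_missed(0.001, m))
```

Under these assumptions a naive subsample of 1000 observations misses the small component entirely in more than a third of all cases, which is why sampling according to a more suitable distribution may be needed.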

4 Conclusion

Following the above assessment of the capabilities and impacts of statistics, our conclusion is:

The role of statistics in Data Science is under-estimated as, e.g., compared to computer science. This holds, in particular, for the areas of data acquisition and enrichment as well as for advanced modeling needed for prediction.

Stimulated by this conclusion, statisticians are well-advised to play their role more assertively in this modern and well-accepted field of Data Science.

Only complementing and/or combining mathematical methods and computational algorithms with statistical reasoning, particularly for Big Data, will lead to scientific results based on suitable approaches. Ultimately, only a balanced interplay of all sciences involved will lead to successful solutions in Data Science.

Adenso-Diaz, B., Laguna, M.: Fine-tuning of algorithms using fractional experimental designs and local search. Oper. Res. 54 (1), 99–114 (2006)

Article   Google Scholar  

Aggarwal, C.C. (ed.): Data Classification: Algorithms and Applications. CRC Press, Boca Raton (2014)

Google Scholar  

Allen, E., Allen, L., Arciniega, A., Greenwood, P.: Construction of equivalent stochastic differential equation models. Stoch. Anal. Appl. 26 , 274–297 (2008)

Article   MathSciNet   Google Scholar  

Anderson, C.: The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine https://www.wired.com/2008/06/pb-theory/ (2008)

Aue, A., Horváth, L.: Structural breaks in time series. J. Time Ser. Anal. 34 (1), 1–16 (2013)

Berger, R.E.: A scientific approach to writing for engineers and scientists. IEEE PCS Professional Engineering Communication Series IEEE Press, Wiley (2014)

Book   Google Scholar  

Bischl, B., Mersmann, O., Trautmann, H., Weihs, C.: Resampling methods for meta-model validation with recommendations for evolutionary computation. Evol. Comput. 20 (2), 249–275 (2012)

Bischl, B., Schiffner, J., Weihs, C.: Benchmarking local classification methods. Comput. Stat. 28 (6), 2599–2619 (2013)

Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 (2016)

Brown, M.S.: Data Mining for Dummies. Wiley, London (2014)

Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)

Cao, L.: Data science: a comprehensive overview. ACM Comput. Surv. (2017). https://doi.org/10.1145/3076253

Claeskens, G., Hjort, N.L.: Model Selection and Model Averaging. Cambridge University Press, Cambridge (2008)

Cooper, H., Hedges, L.V., Valentine, J.C.: The Handbook of Research Synthesis and Meta-analysis. Russell Sage Foundation, New York City (2009)

Dmitrienko, A., Tamhane, A.C., Bretz, F.: Multiple Testing Problems in Pharmaceutical Statistics. Chapman and Hall/CRC, London (2009)

Donoho, D.: 50 Years of Data Science. http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf (2015)

Dyk, D.V., Fuentes, M., Jordan, M.I., Newton, M., Ray, B.K., Lang, D.T., Wickham, H.: ASA Statement on the Role of Statistics in Data Science. http://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/ (2015)

Fahrmeir, L., Kneib, T., Lang, S., Marx, B.: Regression: Models, Methods and Applications. Springer, Berlin (2013)

Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, Berlin (2006)

MATH   Google Scholar  

Geppert, L., Ickstadt, K., Munteanu, A., Quedenfeld, J., Sohler, C.: Random projections for Bayesian regression. Stat. Comput. 27 (1), 79–101 (2017). https://doi.org/10.1007/s11222-015-9608-z

Article   MathSciNet   MATH   Google Scholar  

Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

Hennig, C., Meila, M., Murtagh, F., Rocci, R.: Handbook of Cluster Analysis. Chapman & Hall, London (2015)

Klein, H.U., Schäfer, M., Porse, B.T., Hasemann, M.S., Ickstadt, K., Dugas, M.: Integrative analysis of histone chip-seq and transcription data using Bayesian mixture models. Bioinformatics 30 (8), 1154–1162 (2014)

Knoche, S., Ebeling, M.: The musical signal: physically and psychologically, chap 2. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 15–68. CRC Press, Boca Raton (2017)

Koenker, R.: Quantile Regression. Econometric Society Monographs, vol. 38 (2010)

Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT press, Cambridge (2009)

Lütkepohl, H.: New Introduction to Multiple Time Series Analysis. Springer, Berlin (2010)

Ma, P., Mahoney, M.W., Yu, B.: A statistical perspective on algorithmic leveraging. In: Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pp 91–99. http://jmlr.org/proceedings/papers/v32/ma14.html (2014)

Martin, R., Nagathil, A.: Digital filters and spectral analysis, chap 4. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 111–143. CRC Press, Boca Raton (2017)

Mejri, D., Limam, M., Weihs, C.: A new dynamic weighted majority control chart for data streams. Soft Comput. 22(2), 511–522. https://doi.org/10.1007/s00500-016-2351-3

Molenberghs, G., Fitzmaurice, G., Kenward, M.G., Tsiatis, A., Verbeke, G.: Handbook of Missing Data Methodology. CRC Press, Boca Raton (2014)

Molinelli, E.J., Korkut, A., Wang, W.Q., Miller, M.L., Gauthier, N.P., Jing, X., Kaushik, P., He, Q., Mills, G., Solit, D.B., Pratilas, C.A., Weigt, M., Braunstein, A., Pagnani, A., Zecchina, R., Sander, C.: Perturbation Biology: Inferring Signaling Networks in Cellular Systems. arXiv preprint arXiv:1308.5193 (2013)

Montgomery, D.C.: Design and Analysis of Experiments, 8th edn. Wiley, London (2013)

Oakland, J.: Statistical Process Control. Routledge, London (2007)

Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Los Altos (1988)

Chapter   Google Scholar  

Piateski, G., Frawley, W.: Knowledge Discovery in Databases. MIT Press, Cambridge (1991)

Press, G.: A Very Short History of Data Science. https://www.forbescom/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#5c515ed055cf (2013). [last visit: March 19, 2017]

Ramsay, J., Silverman, B.W.: Functional Data Analysis. Springer, Berlin (2005)

Särkkä, S.: Applied Stochastic Differential Equations. https://users.aalto.fi/~ssarkka/course_s2012/pdf/sde_course_booklet_2012.pdf (2012). [last visit: March 6, 2017]

Schäfer, M., Radon, Y., Klein, T., Herrmann, S., Schwender, H., Verveer, P.J., Ickstadt, K.: A Bayesian mixture model to quantify parameters of spatial clustering. Comput. Stat. Data Anal. 92 , 163–176 (2015). https://doi.org/10.1016/j.csda.2015.07.004

Schiffner, J., Weihs, C.: D-optimal plans for variable selection in data bases. Technical Report, 14/09, SFB 475 (2009)

Shumway, R.H., Stoffer, D.S.: Time Series Analysis and Its Applications: With R Examples. Springer, Berlin (2010)

Tukey, J.W.: Exploratory Data Analysis. Pearson, London (1977)

Vatcheva, I., de Jong, H., Mars, N.: Selection of perturbation experiments for model discrimination. In: Horn, W. (ed.) Proceedings of the 14th European Conference on Artificial Intelligence, ECAI-2000, IOS Press, pp 191–195 (2000)

Vatolkin, I., Weihs, C.: Evaluation, chap 13. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 329–363. CRC Press, Boca Raton (2017)

Weihs, C.: Big data classification — aspects on many features. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds.) Solving Large Scale Learning Tasks: Challenges and Algorithms, Springer Lecture Notes in Artificial Intelligence, vol. 9580, pp. 139–147 (2016)

Weihs, C., Ligges, U.: From local to global analysis of music time series. In: Morik, K., Siebes, A., Boulicault, J.F. (eds.) Detecting Local Patterns, Springer Lecture Notes in Artificial Intelligence, vol. 3539, pp. 233–245 (2005)

Weihs, C., Messaoud, A., Raabe, N.: Control charts based on models derived from differential equations. Qual. Reliab. Eng. Int. 26(8), 807–816 (2010)

Wieczorek, J., Malik-Sheriff, R.S., Fermin, Y., Grecco, H.E., Zamir, E., Ickstadt, K.: Uncovering distinct protein-network topologies in heterogeneous cell populations. BMC Syst. Biol. 9(1), 24 (2015)

Wu, J.: Statistics = data science? http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf (1997)

Acknowledgements

The authors would like to thank the editor, the guest editors and all reviewers for valuable comments on an earlier version of the manuscript. They also thank Leo Geppert for fruitful discussions.

Author information

Authors and affiliations.

Computational Statistics, TU Dortmund University, 44221, Dortmund, Germany

Claus Weihs

Mathematical Statistics and Biometric Applications, TU Dortmund University, 44221, Dortmund, Germany

Katja Ickstadt

Corresponding author

Correspondence to Claus Weihs.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Weihs, C., Ickstadt, K. Data Science: the impact of statistics. Int J Data Sci Anal 6, 189–194 (2018). https://doi.org/10.1007/s41060-018-0102-5

Received: 20 March 2017

Accepted: 25 January 2018

Published: 16 February 2018

Issue Date: November 2018

DOI: https://doi.org/10.1007/s41060-018-0102-5

ASA Journals Online

  • Journal of the American Statistical Association
  • The American Statistician
  • Journal of Agricultural, Biological, and Environmental Statistics
  • Journal of Business & Economic Statistics
  • Journal of Computational and Graphical Statistics
  • Journal of Nonparametric Statistics
  • Statistical Analysis and Data Mining: The ASA Data Science Journal
  • Statistics in Biopharmaceutical Research
  • Technometrics

ASA Open-Access Journals

  • Data Science in Science
  • Journal of Statistics and Data Science Education
  • Statistics and Public Policy
  • Statistics Surveys

ASA Co-Published Journals

  • Journal of Educational and Behavioral Statistics
  • Journal of Quantitative Analysis in Sports
  • SIAM/ASA Journal on Uncertainty Quantification
  • Journal of Survey Statistics and Methodology

Bayesian statistics for clinical research

Affiliations.

  • 1 Interdepartmental Division of Critical Care Medicine and Department of Physiology, University of Toronto, Toronto, ON, Canada; Department of Medicine, Division of Respirology, University Health Network, Toronto, ON, Canada; Toronto General Hospital Research Institute, Toronto, ON, Canada.
  • 2 Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada; Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.
  • 3 Department of Statistical Science (A Heath), University College London, London, UK; MRC Clinical Trials Unit, University College London, London, UK; Department of Biostatistics, Epidemiology, and Informatics and Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
  • PMID: 39277290
  • DOI: 10.1016/S0140-6736(24)01295-9

Frequentist and Bayesian statistics represent two differing paradigms for the analysis of data. Frequentism became the dominant mode of statistical thinking in medical practice during the 20th century. The advent of modern computing has made Bayesian analysis increasingly accessible, enabling growing use of Bayesian methods in a range of disciplines, including medical research. Rather than conceiving of probability as the expected frequency of an event (purported to be measurable and objective), Bayesian thinking conceives of probability as a measure of strength of belief (an explicitly subjective concept). Bayesian analysis combines previous information (represented by a mathematical probability distribution, the prior) with information from the study (the likelihood function) to generate an updated probability distribution (the posterior) representing the information available for clinical decision making. Owing to its fundamentally different conception of probability, Bayesian statistics offers an intuitive, flexible, and informative approach that facilitates the design, analysis, and interpretation of clinical trials. In this Review, we provide a brief account of the philosophical and methodological differences between Bayesian and frequentist approaches and survey the use of Bayesian methods for the design and analysis of clinical research.

Copyright © 2024 Elsevier Ltd. All rights reserved, including those for text and data mining, AI training, and similar technologies.

Conflict of interest statement

Declaration of interests ECG is supported by an Early Career Health Research Award from the National Sanitarium Association. MOH is supported by grant number R01-HL168202 from the National Heart, Lung, and Blood Institute (National Institutes of Health). AH is supported by a Canada Research Chair in Statistical Trial Design and the Discovery Grant Program of the Natural Sciences and Engineering Research Council of Canada (RGPIN-2021–03366). ECG receives fees for speaking or consulting from Vyaire, BioAge, Stimit, Lungpacer Medical, Getinge, Draeger, Heecap, and Zoll. He serves on the clinical advisory board for Getinge and previously served on the advisory board for Lungpacer Medical. He has received in-kind support for research from Timpel Medical, Lungpacer Medical, and Getinge. MOH has received statistical consulting fees from Unlearn.AI, Guidepoint Global, and the Berkeley Research Group; fees for editorial services from Elsevier and the American Thoracic Society; fees for serving on a data safety monitoring board from the University of California, San Francisco, and the University of Pittsburgh; and fees for pilot grant reviews from Brown University and New York University.

How to Find Statistics for a Research Paper

Last Updated: March 10, 2024

This article was co-authored by wikiHow staff writer Jennifer Mueller, JD. Jennifer Mueller is a wikiHow Content Creator. She specializes in reviewing, fact-checking, and evaluating wikiHow's content to ensure thoroughness and accuracy. Jennifer received her JD from Indiana University Maurer School of Law in 2006. There are 8 references cited in this article, which can be found at the bottom of the page. This article has been viewed 25,129 times.

When you're writing a research paper, particularly in social sciences such as political science or sociology, statistics can help you back up your conclusions with solid data. You typically can find relevant statistics using online sources. However, it's important to accurately assess the reliability of the source. You also need to understand whether the statistics you've found strengthen or undermine your arguments or conclusions before you incorporate them into your writing. [1] [2]

Identifying the Data You Need

Step 1 Outline your points or arguments.

  • For example, if you're writing a research paper for a sociology class on the effect of crime in inner cities, you may want to make the point that high school graduation rates decrease as the rate of violent crime increases.
  • To support that point, you would need data about high school graduation rates in specific inner cities, as well as violent crime rates in the same areas.
  • From that data, you would want to find statistics that show the trends in those two rates. Then you can compare those statistics to look for a correlation that would (potentially) support your point.
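
Checking two trends for a correlation can be sketched in a few lines of Python. The figures below are invented for illustration (they are not real crime or graduation data), and `pearson_r` is a hypothetical helper written for this sketch rather than a function from any particular library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical yearly figures for one city (assumed values, not real data).
violent_crimes_per_100k = [410, 455, 490, 520, 560, 610]
graduation_rate_pct = [81.0, 79.5, 78.0, 76.5, 74.0, 71.5]

r = pearson_r(violent_crimes_per_100k, graduation_rate_pct)
print(f"r = {r:.3f}")  # a value near -1 means the two rates move in opposite directions
```

A strongly negative `r` would be consistent with the point being argued, but a correlation on its own does not show that one rate causes the other.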

Step 2 Do some background research.

  • Background research also can clue you in to words or phrases that are commonly used by academics, researchers, and statisticians examining the same issues you're discussing in your research paper.
  • A basic familiarity with your topic can help you identify additional statistics that you might not have thought of before.
  • For example, in reading about the effect of violent crime in inner cities, you may find an article discussing how children coming from high-crime neighborhoods have higher rates of PTSD than children who grow up in peaceful suburbs.
  • The issue of PTSD is something you potentially could weave into your research paper, although you'd have to do more digging into the source of the statistics themselves.
  • Keep in mind when you're reading on background, this isn't necessarily limited to material that you might use as a source for your research paper. You're just trying to familiarize yourself with the subject generally.

Step 3 Distinguish between descriptive and inferential statistics.

  • With a descriptive statistic, those who collected the data got information for every person included in a specific, limited group.
  • "Only 2 percent of the students in McKinley High School's senior class have red hair" is an example of a descriptive statistic. All the students in the senior class have been accounted for, and the statistic describes only that group.
  • However, if the statisticians used the county high school's senior class as a representative sample of the county as a whole, the result would be an inferential statistic.
  • The inferential version would be phrased "According to our study, approximately 2 percent of the people in McKinley County have red hair." The statisticians didn't check the hair color of every person who lived in the county.
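
The difference between the two kinds of statistic can be sketched as follows. The counts are assumed for illustration (8 red-haired students out of 400), and the interval uses the simple normal-approximation formula, which is only rough for small proportions. It also assumes the class behaves like a random sample of the county, which a real study would have to justify.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical counts: 8 of the 400 seniors have red hair.
p_hat, low, high = proportion_interval(successes=8, n=400)

# Descriptive: an exact statement about the class itself.
print(f"{p_hat:.1%} of the senior class has red hair")

# Inferential: an estimate, with uncertainty, about the wider county.
print(f"Estimated share in the county: {low:.1%} to {high:.1%}")
```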

Step 4 Brainstorm search terms.

  • Finding the best key words can be an art form. Using what you learned from your background research, try to use words academics or other researchers in the field use when discussing your topic.
  • You not only want to search for specific words, but also synonyms for those words. You also might search for both broader categories and narrower examples of related phenomena.
  • For example, "violent crime" is a broad category that may include crimes such as assault, rape, and murder. You may not be able to find statistics that specifically track violent crime generally, but you should be able to find statistics on the murder rate in a given area.
  • If you're looking for statistics related to a particular geographic area, you'll need to be flexible there as well. For example, if you can't find statistics that relate solely to a particular neighborhood, you may want to expand outward to the city or even the county.

Step 5 Locate relevant studies and polls.

  • While you can run a general internet search using your key words to potentially find statistics you can use in your research paper, knowing specific sources can help you find reliable statistics more quickly.
  • For example, if you're looking for statistics related to various demographics in the United States, the U.S. government has many statistics available at www.usa.gov/statistics.
  • You also can check the U.S. Census Bureau's website to retrieve census statistics and data.
  • The NationMaster website collects data from the CIA World Factbook and other sources to create a wealth of statistics comparing different countries on a number of measures.

Evaluating Sources

Step 1 Judge the source's reliability.

  • Find out who was responsible for collecting the data, and why. If the organization or group behind the data collection and creation of the statistics has an ideological or political mission, their statistics may be suspect.
  • Essentially, if someone is creating statistics to support a particular position or prove their arguments, you cannot trust those statistics. There are many ways raw data can be manipulated to show trends or correlations that don't necessarily reflect reality.
  • Government sources typically are highly reliable, as are most university studies. However, even with university studies you want to see if the study was funded in whole or in part by a group or organization with an ideological or political motivation or bias.

Step 2 Understand the background of the data.

  • To explore the background adequately, use the journalistic standard of the "5 w's" – who, what, when, where, and why.
  • This means you'll want to find out who carried out the study (or, in the case of a poll, who asked the questions), what questions were asked, when and where the study or poll was conducted, and why it was conducted.
  • The answers to these questions will help you understand the purpose of the statistical research that was conducted, and whether it would be helpful in your own research paper.

Step 3 Interpret the statistics yourself.

  • You may find the statistics set forth in a report that describes these statistics and what they mean.
  • However, just because someone else has explained the meaning of the statistics doesn't mean you should necessarily take their word for it.
  • Draw on your understanding of the background of the study or poll, and look at the interpretation the author presents critically.
  • Remove the statistics themselves from the text of the report, for example by copying them into a table. Then you can interpret them on your own without being distracted by the author's interpretation.
  • If you create a table of your own from a statistical report, make sure you label it accurately so you can cite the source of the statistics later if you decide to include them in your research paper.

Step 4 Use caution when producing your own statistics.

  • If you're looking at raw data, you may need to actually calculate the statistics yourself. If you don't have any experience with statistics, talk to someone who does.
  • Your teacher or professor may be able to help you understand how to calculate the statistics correctly.
  • Even if you have access to a statistics program, there's no guarantee that the result you get actually will be accurate unless you know what information to provide the program. Remember the common phrase with computer programs: "Garbage in, garbage out."
  • Don't assume you can just divide two numbers to get a percentage, for example. There are other probability elements that must be taken into account.

Writing with Statistics

Step 1 Use statistical terms correctly.

  • For example, the word "average" is one you often see in everyday writing. However, when you're writing about statistics, the word "average" could mean up to three different things.
  • The word "average" can be used to mean the median (the middle value in the set of data), the mean (the result when you add all the values in the set and then divide by the quantity of numbers in the set), or the mode (the number or value in the set that occurs most frequently).
  • Therefore, if you read "average," you need to know which of these definitions is meant.
  • You also want to make sure that any two or more statistics you're comparing are using the same definition of "average." Not doing so could lead to a significant misinterpretation of your statistics and what they mean in the context of your research.
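
A short sketch shows how far apart the three "averages" can be for the same data. The income figures are made up for illustration; the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical household incomes; one very large value skews the data.
household_incomes = [28_000, 31_000, 31_000, 35_000, 42_000, 48_000, 250_000]

avg_mean = mean(household_incomes)      # sum of values divided by their count
avg_median = median(household_incomes)  # middle value once sorted
avg_mode = mode(household_incomes)      # most frequent value

print(f"mean:   {avg_mean:,.0f}")
print(f"median: {avg_median:,}")
print(f"mode:   {avg_mode:,}")
```

Here the single high income pulls the mean well above the median and the mode, which is why a report should say which "average" it uses.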

Step 2 Focus on presentation and readability.

  • Charts and graphs also can be useful even when you are referencing the statistics within your text. Using graphical elements can break up the text and enhance reader understanding.
  • Tables, charts, and graphs can be especially beneficial if you ultimately will have to give a presentation of your research paper, either to your class or to teachers or professors.
  • As difficult as statistics are to follow in print, they can be even more difficult to follow when someone is merely telling them to you.
  • To test the readability of the statistics in your paper, read those paragraphs out loud to yourself. If you find yourself stumbling over them or getting confused as you read, it's likely anyone else will stumble too when reading them for the first time.

Step 3 Choose statistics that support your arguments.

  • This often has as much to do with how you describe the statistics as the specific statistics you use.
  • Keep in mind that numbers themselves are neutral – it is your interpretation of those numbers that gives them meaning.

Step 4 Present the data in context.

  • For example, if you present the statistic that the murder rate in one neighborhood increased by 500 percent, and in the same period high school graduation rates decreased by 30 percent, these numbers are virtually meaningless without context.
  • You don't know what a 500 percent increase entails unless you know what the rate was before the period measured by the statistic.
  • When you say "500 percent," it sounds like a large amount, but if there was only one murder before the period measured by the statistic, then what you're actually saying is that during that period there were six murders.
  • Additionally, your statistics may be more meaningful if you can compare them to similar statistics in other areas.
  • Think of it in terms of a scientific experiment. If scientists are studying the effects of a particular drug to treat a disease, they also include a control group that doesn't take the drug. Comparing the test group to the control group helps show the drug's effectiveness.
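
The arithmetic behind a percent change is easy to get wrong, so a small sketch may help. The helper names are hypothetical and the numbers are illustrative only.

```python
def percent_change(before, after):
    """Percent change from `before` to `after`."""
    if before == 0:
        raise ValueError("percent change is undefined when the starting value is 0")
    return (after - before) / before * 100

def apply_percent_increase(before, pct):
    """Value reached after increasing `before` by `pct` percent."""
    return before * (1 + pct / 100)

# The same 500 percent increase means very different things
# depending on the starting value.
print(apply_percent_increase(1, 500))    # 6.0
print(apply_percent_increase(200, 500))  # 1200.0
print(percent_change(1, 6))              # 500.0
```

Reporting the starting value alongside the percentage lets readers see which of these situations applies.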

Step 5 Cite the source for your statistics correctly.

  • For example, you might write "According to the FBI, violent crime in McKinley County increased by 37 percent between the years 2000 and 2012."
  • A textual citation provides immediate authority to the statistics you're using, allowing your readers to trust the statistics and move on to the next point.
  • On the other hand, if you don't state where the statistics came from, your reader may be too busy mentally questioning the source of your statistics to fully grasp the point you're trying to make.

References

  1. https://owl.english.purdue.edu/owl/resource/672/1/
  2. http://writingcenter.unc.edu/handouts/statistics/
  3. https://www.nationmaster.com/country-info/stats
  4. https://www.usa.gov/statistics
  5. https://owl.english.purdue.edu/owl/resource/672/02/
  6. http://libguides.lib.msu.edu/datastats
  7. https://owl.english.purdue.edu/owl/resource/672/06/
  8. https://owl.english.purdue.edu/owl/resource/672/04/

Inferential Statistics | An Easy Introduction & Examples

Published on September 4, 2020 by Pritha Bhandari. Revised on June 22, 2023.

While descriptive statistics summarize the characteristics of a data set, inferential statistics help you come to conclusions and make predictions based on your data.

When you have collected data from a sample, you can use inferential statistics to understand the larger population from which the sample is taken.

Inferential statistics have two main uses:

  • making estimates about populations (for example, the mean SAT score of all 11th graders in the US).
  • testing hypotheses to draw conclusions about populations (for example, the relationship between SAT scores and family income).

Table of contents

  • Descriptive versus inferential statistics
  • Estimating population parameters from sample statistics
  • Hypothesis testing
  • Other interesting articles
  • Frequently asked questions about inferential statistics

Descriptive versus inferential statistics

Descriptive statistics allow you to describe a data set, while inferential statistics allow you to make inferences based on a data set.

Descriptive statistics

Using descriptive statistics, you can report characteristics of your data:

  • The distribution concerns the frequency of each value.
  • The central tendency concerns the averages of the values.
  • The variability concerns how spread out the values are.

In descriptive statistics, there is no uncertainty – the statistics precisely describe the data that you collected. If you collect data from an entire population, you can directly compare these descriptive statistics to those from other populations.

Inferential statistics

Most of the time, you can only acquire data from samples, because it is too difficult or expensive to collect data from the whole population that you’re interested in.

While descriptive statistics can only summarize a sample’s characteristics, inferential statistics use your sample to make reasonable guesses about the larger population.

With inferential statistics, it’s important to use random and unbiased sampling methods. If your sample isn’t representative of your population, then you can’t make valid statistical inferences or generalize.

Sampling error in inferential statistics

Since the size of a sample is always smaller than the size of the population, some of the population isn’t captured by sample data. This creates sampling error, which is the difference between the true population values (called parameters) and the measured sample values (called statistics).

Sampling error arises any time you use a sample, even if your sample is random and unbiased. For this reason, there is always some uncertainty in inferential statistics. However, using probability sampling methods reduces this uncertainty.
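
Sampling error can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the population is 10,000 simulated scores, the samples are simple random samples, and the seed is fixed only so the run is reproducible.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 test scores.
population = [rng.gauss(500, 100) for _ in range(10_000)]
parameter = mean(population)  # the true population mean

def sampling_error(sample_size):
    """Difference between one random sample's mean and the population mean."""
    sample = rng.sample(population, sample_size)
    return mean(sample) - parameter

# Even with random sampling the error is rarely zero,
# but it tends to shrink as the sample grows.
avg_error_small = mean(abs(sampling_error(25)) for _ in range(200))
avg_error_large = mean(abs(sampling_error(400)) for _ in range(200))

print(f"typical error with n=25:  {avg_error_small:.1f}")
print(f"typical error with n=400: {avg_error_large:.1f}")
```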

Estimating population parameters from sample statistics

The characteristics of samples and populations are described by numbers called statistics and parameters:

  • A statistic is a measure that describes the sample (e.g., sample mean).
  • A parameter is a measure that describes the whole population (e.g., population mean).

Sampling error is the difference between a parameter and a corresponding statistic. Since in most cases you don’t know the real population parameter, you can use inferential statistics to estimate these parameters in a way that takes sampling error into account.

There are two important types of estimates you can make about the population: point estimates and interval estimates.

  • A point estimate is a single value estimate of a parameter. For instance, a sample mean is a point estimate of a population mean.
  • An interval estimate gives you a range of values where the parameter is expected to lie. A confidence interval is the most common type of interval estimate.

Both types of estimates are important for gathering a clear idea of where a parameter is likely to lie.

Confidence intervals

A confidence interval uses the variability around a statistic to come up with an interval estimate for a parameter. Confidence intervals are useful for estimating parameters because they take sampling error into account.

While a point estimate gives you a precise value for the parameter you are interested in, a confidence interval tells you the uncertainty of the point estimate. They are best used in combination with each other.

Each confidence interval is associated with a confidence level. A confidence level tells you the percentage of intervals that would contain the true population parameter if you repeated the study many times and computed a new interval from each sample.

A 95% confidence level means that if you repeat your study with a new sample in exactly the same way 100 times, you can expect about 95 of the resulting intervals to contain the true population parameter.

You cannot say for sure that any single interval contains the actual population parameter. That’s because you can’t know the true value of the population parameter without collecting data from the full population.

However, with random sampling and a suitable sample size, you can reasonably expect the intervals from repeated studies to contain the parameter at the stated rate.

Your point estimate of the population mean paid vacation days is the sample mean of 19 paid vacation days.
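
The repeated-study reading of a confidence level can be checked with a simulation. This is a sketch under stated assumptions: the population of paid vacation days is simulated as normal with a known mean of 20 and standard deviation of 5 (values invented here), and each interval uses a normal approximation rather than a t distribution.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(0)                # fixed seed for reproducibility
TRUE_MEAN = 20.0                      # assumed population mean (vacation days)
TRUE_SD = 5.0                         # assumed population standard deviation
Z = NormalDist().inv_cdf(0.975)       # about 1.96 for a 95% level

def confidence_interval(sample):
    """Point estimate plus or minus a normal-approximation margin of error."""
    point_estimate = mean(sample)
    margin = Z * stdev(sample) / len(sample) ** 0.5
    return point_estimate - margin, point_estimate + margin

def coverage(n_studies=1000, sample_size=100):
    """Share of repeated studies whose interval contains the true mean."""
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size)]
        low, high = confidence_interval(sample)
        if low <= TRUE_MEAN <= high:
            hits += 1
    return hits / n_studies

rate = coverage()
print(f"{rate:.1%} of the intervals contain the true mean")
```

The printed share should land close to 95%, which is what the confidence level describes: a property of the procedure across many studies, not of any one interval.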

Hypothesis testing

Hypothesis testing is a formal process of statistical analysis using inferential statistics. The goal of hypothesis testing is to compare populations or assess relationships between variables using samples.

Hypotheses, or predictions, are tested using statistical tests. Statistical tests also estimate sampling errors so that valid inferences can be made.

Statistical tests can be parametric or non-parametric. Parametric tests are considered more statistically powerful because they are more likely to detect an effect if one exists.

Parametric tests make assumptions that include the following:

  • the population that the sample comes from follows a normal distribution of scores
  • the sample size is large enough to represent the population
  • the variances, a measure of variability, of each group being compared are similar

When your data violates any of these assumptions, non-parametric tests are more suitable. Non-parametric tests are called “distribution-free tests” because they don’t assume anything about the distribution of the population data.

Statistical tests come in three forms: tests of comparison, correlation or regression.

Comparison tests

Comparison tests assess whether there are differences in means, medians or rankings of scores of two or more groups.

To decide which test suits your aim, consider whether your data meets the conditions necessary for parametric tests, the number of samples, and the levels of measurement of your variables.

Means can only be found for interval or ratio data, while medians and rankings are more appropriate measures for ordinal data.

Comparison test | Parametric? | What’s being compared? | Samples
t test | Yes | Means | 2 samples
ANOVA | Yes | Means | 3+ samples
Mood’s median | No | Medians | 2+ samples
Wilcoxon signed-rank | No | Distributions | 2 samples
Wilcoxon rank-sum (Mann-Whitney U) | No | Sums of rankings | 2 samples
Kruskal-Wallis | No | Mean rankings | 3+ samples
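
To make the idea of a comparison test concrete, here is a sketch of a permutation test for a difference in means. Note that this is a different technique from the tests listed in the table: it is used here because it needs only the standard library and makes no normality assumption. The scores are invented, and `permutation_test` is a hypothetical helper.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p value above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical exam scores for two groups of students.
scores_a = [72, 75, 78, 80, 81, 83, 85, 88]
scores_b = [60, 62, 65, 66, 68, 70, 71, 74]

p_value = permutation_test(scores_a, scores_b)
print(f"p = {p_value:.4f}")
```

A small p value indicates that a difference this large would rarely arise if group labels were unrelated to the scores.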

Correlation tests

Correlation tests determine the extent to which two variables are associated.

Although Pearson’s r is the most statistically powerful test, Spearman’s r is appropriate for interval and ratio variables when the data doesn’t follow a normal distribution.

The chi square test of independence is the only test that can be used with nominal variables.

Correlation test | Parametric? | Variables
Pearson’s r | Yes | Interval/ratio variables
Spearman’s r | No | Ordinal/interval/ratio variables
Chi square test of independence | No | Nominal/ordinal variables
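
The difference between Pearson's and Spearman's coefficients can be sketched directly: Spearman's coefficient is Pearson's coefficient computed on ranks. The helper functions and the data below are assumptions made for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data: the relationship is monotonic but not linear.
hours_studied = [1, 2, 3, 4, 5, 6]
points_scored = [2, 4, 8, 16, 32, 64]

r_pearson = pearson_r(hours_studied, points_scored)
r_spearman = spearman_r(hours_studied, points_scored)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because the ranks line up perfectly, Spearman's coefficient reaches 1 while Pearson's stays below it, reflecting the curve in the raw values.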

Regression tests

Regression tests assess whether changes in predictor variables are associated with changes in an outcome variable. You can decide which regression test to use based on the number and types of variables you have as predictors and outcomes.

Most of the commonly used regression tests are parametric. If your data is not normally distributed, you can perform data transformations.

Data transformations help you make your data normally distributed using mathematical operations, like taking the square root of each value.
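
A quick sketch shows the effect of a square-root transformation on skewed data. The values are invented, and `skewness` is a hypothetical helper that computes the mean cubed z-score. A lower skewness means the data are more symmetric, which is a step toward normality but not a guarantee of it.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: mean cubed z-score (0 for symmetric data)."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical right-skewed measurements with a long upper tail.
raw = [1, 4, 4, 9, 9, 9, 16, 25, 49, 100]
transformed = [math.sqrt(v) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(transformed)
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```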

Regression test             Predictor                      Outcome
Simple linear regression    1 interval/ratio variable      1 interval/ratio variable
Multiple linear regression  2+ interval/ratio variable(s)  1 interval/ratio variable
Logistic regression         1+ any variable(s)             1 binary variable
Nominal regression          1+ any variable(s)             1 nominal variable
Ordinal regression          1+ any variable(s)             1 ordinal variable
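
For the simplest row of the table (one interval/ratio predictor, one interval/ratio outcome), a least-squares fit can be sketched in a few lines. The function name and the data points are hypothetical, and the sketch stops at the fitted line; a full regression test would also assess whether the slope differs significantly from zero.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical predictor and outcome values
predictor = [1, 2, 3, 4, 5]
outcome = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_linear_regression(predictor, outcome)
print(f"outcome = {intercept:.2f} + {slope:.2f} * predictor")
```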

Descriptive statistics summarize the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.

A statistic refers to measures about the sample, while a parameter refers to measures about the population.

A sampling error is the difference between a population parameter and a sample statistic.
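
The three definitions above can be tied together with a toy example. All numbers are invented, and the "sample" is simply the first four values so that the output is reproducible; in practice the sample would be drawn at random.

```python
from statistics import mean

# Hypothetical population of 10 test scores
population = [52, 47, 61, 58, 49, 55, 63, 45, 50, 60]

# Fixed subset standing in for a random sample
sample = population[:4]

parameter = mean(population)   # measure about the population
statistic = mean(sample)       # measure about the sample
sampling_error = statistic - parameter

print(parameter, statistic, sampling_error)
```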

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.
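
One way to see what "could have arisen by chance" means is an exact permutation test, sketched below. This is an illustrative technique chosen for this example, not one named in the article: it reassigns the pooled values to two groups in every possible way and counts how often the difference in means is at least as large as the one observed. The function name and data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Enumerate every way of choosing which pooled values are labelled "group a"
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        relabelled_a = [pooled[i] for i in chosen]
        relabelled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(relabelled_a) - mean(relabelled_b)) >= observed:
            extreme += 1
    return extreme / total


p_value = permutation_p_value([7, 8, 9], [1, 2, 3])
print(p_value)
```

With only three values per group there are just 20 possible relabellings, so the smallest achievable two-sided p value is 2/20; larger samples are needed before a small p value becomes possible.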

Cite this Scribbr article

Bhandari, P. (2023, June 22). Inferential Statistics | An Easy Introduction & Examples. Scribbr. Retrieved September 22, 2024, from https://www.scribbr.com/statistics/inferential-statistics/


  • Indian J Crit Care Med
  • v.25(Suppl 2); 2021 May

An Introduction to Statistics: Choosing the Correct Statistical Test

Priya Ranganathan

1 Department of Anaesthesiology, Critical Care and Pain, Tata Memorial Centre, Homi Bhabha National Institute, Mumbai, Maharashtra, India

The choice of statistical test used for analysis of data from a research study is crucial in interpreting the results of the study. This article gives an overview of the various factors that determine the selection of a statistical test and lists some statistical tests used in common practice.

How to cite this article: Ranganathan P. An Introduction to Statistics: Choosing the Correct Statistical Test. Indian J Crit Care Med 2021;25(Suppl 2):S184–S186.

In a previous article in this series, we looked at different types of data and ways to summarise them [1]. At the end of the research study, statistical analyses are performed to test the hypothesis and either support or reject it. The choice of statistical test needs to be made carefully, since the use of incorrect tests could lead to misleading conclusions. Some key questions help us to decide the type of statistical test to be used for analysis of study data [2].

What is the Research Hypothesis?

Sometimes, a study may just describe the characteristics of the sample, e.g., a prevalence study. Here, the statistical analysis involves only descriptive statistics. For example, Sridharan et al. aimed to analyze the clinical profile, species distribution, and susceptibility pattern of patients with invasive candidiasis [3]. They used descriptive statistics to express the characteristics of their study sample, including mean (and standard deviation) for normally distributed data, median (with interquartile range) for skewed data, and percentages for categorical data.

Studies may be conducted to test a hypothesis and derive inferences from the sample results to the population. This is known as inferential statistics. The goal of inferential statistics may be to assess differences between groups (comparison), establish an association between two variables (correlation), predict one variable from another (regression), or look for agreement between measurements (agreement). Studies may also look at time to a particular event, analyzed using survival analysis.

Are the Comparisons Matched (Paired) or Unmatched (Unpaired)?

Observations made on the same individual (before–after or comparing two sides of the body) are usually matched or paired. Comparisons made between individuals are usually unpaired or unmatched. Data are considered paired if the values in one set of data are likely to be influenced by the other set (as can happen in before and after readings from the same individual). Examples of paired data include serial measurements of procalcitonin in critically ill patients or comparison of pain relief during sequential administration of different analgesics in a patient with osteoarthritis.

What is the Type of Data Being Measured?

The test chosen to analyze data will depend on whether the data are categorical (and whether nominal or ordinal) or numerical (and whether skewed or normally distributed). Tests used to analyze normally distributed data are known as parametric tests; each has a nonparametric (distribution-free) counterpart that is used for data that are not normally distributed [4]. Parametric tests assume that the sample data are normally distributed and have the same characteristics as the population; nonparametric tests make no such assumptions. Parametric tests are more powerful and have a greater ability to pick up differences between groups (where they exist); in contrast, nonparametric tests are less efficient at identifying significant differences. Time-to-event data require a special type of analysis, known as survival analysis.

How Many Measurements are Being Compared?

The choice of the test differs depending on whether two or more than two measurements are being compared. This includes more than two groups (unmatched data) or more than two measurements in a group (matched data).

Tests for Comparison

Table 1 lists the tests commonly used for comparing unpaired data, depending on the number of groups and type of data. As an example, Megahed and colleagues evaluated the role of early bronchoscopy in mechanically ventilated patients with aspiration pneumonitis [5]. Patients were randomized to receive either early bronchoscopy or conventional treatment. Between-group comparisons were made using the unpaired t-test for normally distributed continuous variables, the Mann–Whitney U-test for non-normal continuous variables, and the chi-square test for categorical variables. Chowhan et al. compared the efficacy of left ventricular outflow tract velocity time integral (LVOTVTI) and carotid artery velocity time integral (CAVTI) as predictors of fluid responsiveness in patients with sepsis and septic shock [6]. Patients were divided into three groups: sepsis, septic shock, and controls. Since there were three groups, comparisons of numerical variables were done using analysis of variance (for normally distributed data) or the Kruskal–Wallis test (for skewed data).

Table 1: Tests for comparison of unpaired data

Type of data          Two groups                                    More than two groups
Nominal               Chi-square test or Fisher's exact test (both columns)
Ordinal or skewed     Mann–Whitney U-test (Wilcoxon rank sum test)  Kruskal–Wallis test
Normally distributed  Unpaired t-test                               Analysis of variance (ANOVA)

A common error is to use multiple unpaired t-tests for comparing more than two groups; e.g., for a study with three treatment groups A, B, and C, it would be incorrect to run unpaired t-tests for group A vs B, B vs C, and C vs A. The correct technique of analysis is to run ANOVA and use post hoc tests (if ANOVA yields a significant result) to determine which group is different from the others.
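
The article does not spell out why multiple t-tests are a problem; one commonly given reason is that the chance of at least one false-positive result grows with the number of tests. The sketch below illustrates this under two labeled assumptions: a significance level of 0.05 per test, and independence between tests (pairwise comparisons that share a group are not strictly independent, so the figure is only approximate).

```python
from itertools import combinations


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


groups = ["A", "B", "C"]
pairs = list(combinations(groups, 2))  # A vs B, A vs C, B vs C

fwer = familywise_error_rate(len(pairs))
print(f"{len(pairs)} pairwise tests -> familywise error rate of about {fwer:.3f}")
```

With three comparisons the approximate rate is already close to three times the nominal 5% level, which is the inflation that a single overall test followed by post hoc comparisons is meant to control.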

Table 2 lists the tests commonly used for comparing paired data, depending on the number of groups and type of data. As discussed above, it would be incorrect to use multiple paired t-tests to compare more than two measurements within a group. In the study by Chowhan et al., each parameter (LVOTVTI and CAVTI) was measured in the supine position and following passive leg raise. These represented paired readings from the same individual, and comparison of pre- and post-readings was performed using the paired t-test [6]. Verma et al. evaluated the role of physiotherapy on oxygen requirements and physiological parameters in patients with COVID-19 [7]. Each patient had pretreatment and post-treatment data for heart rate and oxygen supplementation recorded on day 1 and day 14. Since the data did not follow a normal distribution, they used Wilcoxon's matched pair test to compare the pre- and post-treatment values of heart rate (numerical variable). McNemar's test was used to compare the pre- and post-treatment supplemental oxygen status, expressed as dichotomous data (yes/no). In the study by Megahed and colleagues, patients had various parameters such as sepsis-related organ failure assessment score, lung injury score, and clinical pulmonary infection score (CPIS) measured at baseline, on day 3, and on day 7 [5]. Within-group comparisons were made using repeated measures ANOVA for normally distributed data and Friedman's test for skewed data.

Table 2: Tests for comparison of paired data

Type of data          Two measurements           More than two measurements
Nominal               McNemar's test             Cochran's Q
Ordinal or skewed     Wilcoxon signed rank test  Friedman test
Normally distributed  Paired t-test              Repeated measures ANOVA
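
Two of the entries in the first column of tests can be sketched with the standard library: the paired t statistic, which works on within-subject differences, and an exact (binomial) form of McNemar's test, which uses only the discordant pairs. The function names and all values are hypothetical, and the exact binomial version is one of several variants of McNemar's test.

```python
from math import comb, sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """t statistic computed from the within-subject differences."""
    diffs = [b - a for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p value.

    b and c are the two discordant counts (yes->no and no->yes).
    Under the null hypothesis each discordant pair is equally likely to go either way.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical heart rates for five patients before and after treatment
before = [80, 85, 90, 78, 88]
after = [76, 80, 88, 75, 83]
t_stat = paired_t_statistic(before, after)

# Hypothetical discordant pairs: 9 patients changed yes->no, 1 changed no->yes
p_mcnemar = mcnemar_exact_p(1, 9)
print(f"paired t = {t_stat:.3f}, McNemar exact p = {p_mcnemar:.4f}")
```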

Tests for Association between Variables

Table 3 lists the tests used to determine the association between variables. Correlation determines the strength of the relationship between two variables; regression allows the prediction of one variable from another. Tyagi examined the correlation between ETCO2 and PaCO2 in patients with chronic obstructive pulmonary disease with acute exacerbation, who were mechanically ventilated [8]. Since these were normally distributed variables, the linear correlation between ETCO2 and PaCO2 was determined by Pearson's correlation coefficient. Parajuli et al. compared the acute physiology and chronic health evaluation II (APACHE II) and acute physiology and chronic health evaluation IV (APACHE IV) scores to predict intensive care unit mortality, both of which were ordinal data. Correlation between APACHE II and APACHE IV scores was tested using Spearman's coefficient [9]. A study by Roshan et al. identified risk factors for the development of aspiration pneumonia following rapid sequence intubation [10]. Since the outcome was categorical binary data (aspiration pneumonia: yes/no), they performed a bivariate analysis to derive unadjusted odds ratios, followed by a multivariable logistic regression analysis to calculate adjusted odds ratios for risk factors associated with aspiration pneumonia.

Table 3: Tests for assessing the association between variables

Type of data or outcome                   Test
Both variables normally distributed       Pearson's correlation coefficient
One or both variables ordinal or skewed   Spearman's or Kendall's correlation coefficient
Nominal data                              Chi-square test; odds ratio or relative risk (for binary outcomes)
Continuous outcome                        Linear regression analysis
Categorical outcome (binary)              Logistic regression analysis
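
For the binary-outcome row, unadjusted odds ratios and relative risks come straight from a 2x2 table. The sketch below uses an invented table and hypothetical function names; the cell labels a, b, c, d follow the usual convention (exposed with outcome, exposed without, unexposed with outcome, unexposed without). Adjusted odds ratios, as in the Roshan et al. example, would require fitting a logistic regression model instead.

```python
def odds_ratio(a, b, c, d):
    """Odds of the outcome in the exposed group divided by odds in the unexposed group."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of the outcome in the exposed group divided by risk in the unexposed group."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed patients had the outcome
a, b, c, d = 20, 80, 10, 90

print(f"odds ratio = {odds_ratio(a, b, c, d):.2f}")
print(f"relative risk = {relative_risk(a, b, c, d):.2f}")
```

The two measures differ here (2.25 vs 2.00) because the outcome is not rare; they converge as the outcome becomes less common.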

Tests for Agreement between Measurements

Table 4 outlines the tests used for assessing agreement between measurements. Gunalan evaluated concordance between the National Healthcare Safety Network surveillance criteria and CPIS for the diagnosis of ventilator-associated pneumonia [11]. Since both the scores are examples of ordinal data, Kappa statistics were calculated to assess the concordance between the two methods. In the previously quoted study by Tyagi, the agreement between ETCO2 and PaCO2 (both numerical variables) was represented using the Bland–Altman method [8].

Table 4: Tests for assessing agreement between measurements

Type of data      Test
Categorical data  Cohen's kappa
Numerical data    Intraclass correlation coefficient (numerical) and Bland–Altman plot (graphical display)
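
Both rows of the table can be illustrated briefly. The sketch below computes Cohen's kappa (observed agreement corrected for the agreement expected by chance) and the numbers behind a Bland–Altman plot (mean difference, or bias, and 95% limits of agreement taken as bias plus or minus 1.96 standard deviations of the differences). Function names, ratings, and measurements are all hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def cohens_kappa(rater_1, rater_2):
    """Chance-corrected agreement between two raters on categorical labels."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / n ** 2
    return (observed - expected) / (1 - expected)


def bland_altman_limits(method_1, method_2):
    """Return (bias, lower limit, upper limit) for two sets of paired measurements."""
    diffs = [a - b for a, b in zip(method_1, method_2)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Hypothetical yes/no diagnoses from two sets of criteria
criteria_1 = ["y", "y", "y", "y", "y", "y", "n", "n", "n", "n"]
criteria_2 = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y"]
kappa = cohens_kappa(criteria_1, criteria_2)

# Hypothetical paired readings from two measurement methods
method_1 = [40, 42, 38, 45, 41]
method_2 = [38, 41, 37, 42, 40]
bias, lower, upper = bland_altman_limits(method_1, method_2)

print(f"kappa = {kappa:.2f}; bias = {bias:.2f}, limits = ({lower:.2f}, {upper:.2f})")
```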

Tests for Time-to-Event Data (Survival Analysis)

Time-to-event data represent a unique type of data where some participants have not experienced the outcome of interest at the time of analysis. Such participants are considered to be “censored” but are allowed to contribute to the analysis for the period of their follow-up. A detailed discussion on the analysis of time-to-event data is beyond the scope of this article. For analyzing time-to-event data, we use survival analysis (with the Kaplan–Meier method) and compare groups using the log-rank test. The risk of experiencing the event is expressed as a hazard ratio. The Cox proportional hazards regression model is used to identify risk factors that are significantly associated with the event.

Hasanzadeh evaluated the impact of zinc supplementation on the development of ventilator-associated pneumonia (VAP) in adult mechanically ventilated trauma patients [12]. Survival analysis (Kaplan–Meier technique) was used to calculate the median time to development of VAP after ICU admission. The Cox proportional hazards regression model was used to calculate hazard ratios to identify factors significantly associated with the development of VAP.
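
To show how censored participants "contribute to the analysis for the period of their follow-up", here is a minimal sketch of the Kaplan–Meier estimator. The function names and follow-up data are hypothetical and unrelated to the cited study. At each event time, the survival probability is multiplied by the fraction of at-risk participants who did not have the event; censored participants count as at risk until their follow-up ends.

```python
def kaplan_meier(observations):
    """Kaplan-Meier survival curve.

    observations: list of (time, event) pairs, where event is True if the
    outcome occurred at that time and False if the participant was censored.
    Returns a list of (event time, estimated survival probability).
    """
    event_times = sorted({t for t, event in observations if event})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for time, _ in observations if time >= t)
        events = sum(1 for time, event in observations if event and time == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


def median_survival_time(curve):
    """First event time at which estimated survival drops to 0.5 or below, if any."""
    for t, survival in curve:
        if survival <= 0.5:
            return t
    return None


# Hypothetical follow-up in days; False marks a censored participant
follow_up = [(2, True), (3, False), (4, True), (5, True), (6, False)]

curve = kaplan_meier(follow_up)
for t, survival in curve:
    print(f"day {t}: S(t) = {survival:.3f}")
print("median time to event:", median_survival_time(curve))
```

The participant censored on day 3 is counted in the risk set on day 2 but not on day 4, which is how partial follow-up enters the estimate without being treated as an event.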

The choice of statistical test used to analyze research data depends on the study hypothesis, the type of data, the number of measurements, and whether the data are paired or unpaired. Reviews of articles published in medical specialties such as family medicine, cytopathology, and pain have found several errors related to the use of descriptive and inferential statistics [12–15]. The statistical technique needs to be carefully chosen and specified in the protocol prior to commencement of the study, to ensure that the conclusions of the study are valid. This article has outlined the principles for selecting a statistical test, along with a list of tests used commonly. Researchers should seek help from statisticians while writing the research study protocol, to formulate the plan for statistical analysis.

Priya Ranganathan https://orcid.org/0000-0003-1004-5264

Source of support: Nil

Conflict of interest: None

References

COMMENTS

  1. Home

    Overview. Statistical Papers is a forum for presentation and critical assessment of statistical methods encouraging the discussion of methodological foundations and potential applications. The Journal stresses statistical methods that have broad applications, giving special attention to those relevant to the economic and social sciences.

  2. Statistics articles within Scientific Reports

    Read the latest Research articles in Statistics from Scientific Reports

  3. Research Papers / Publications

    Research Papers / Publications. Search Publication Type Publication Year Yan Sun, Pratik Chaudhari, Ian J. Barnett, Edgar Dobriban A ... Annals of Statistics (Accepted). Behrad Moniri, Seyed Hamed Hassani, Edgar Dobriban, Evaluating the Performance of Large Language Models via Debates.

  4. Introduction to Research Statistical Analysis: An Overview of the

    Introduction. Statistical analysis is necessary for any research project seeking to make quantitative conclusions. The following is a primer for research-based statistical analysis. It is intended to be a high-level overview of appropriate statistical testing, while not diving too deep into any specific methodology.

  5. Statistics

    A paper in Physical Review X presents a method for numerically generating data sequences that are as likely to be observed under a power law as a given observed dataset. Zoe Budrikis Research ...

  6. The Beginner's Guide to Statistical Analysis

    Table of contents. Step 1: Write your hypotheses and plan your research design. Step 2: Collect data from a sample. Step 3: Summarize your data with descriptive statistics. Step 4: Test hypotheses or make estimates with inferential statistics.

  7. Journal of Probability and Statistics

    Print ISSN: 1687-952X. Journal of Probability and Statistics is an open access journal publishing papers on the theory and application of probability and statistics that consider new methods and approaches to their implementation, or report significant results for the field. As part of Wiley's Forward Series, this journal offers a streamlined ...

  8. Articles

    Statistical Papers is a forum for presentation and critical assessment of statistical methods encouraging the discussion of methodological foundations and ...

  9. Journal of Applied Statistics

    Journal of Applied Statistics is a world-leading journal which provides a forum for communication among statisticians and practitioners for judicious application of statistical principles and innovations of statistical methodology motivated by current and important real-world examples ... The Journal publishes original research papers, review ...

  10. Data Science: the impact of statistics

    In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview over different proposed structures of Data Science and address the impact of statistics on such steps as data ...

  11. (PDF) Data Science: the impact of statistics

    In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and ...

  12. (PDF) The most-cited statistical papers

    Only a few of the most influential papers in the field of statistics are included on our list. Four of our most cited papers, Duncan (1955), Kramer (1956), and ...

  13. Descriptive Statistics for Summarising Data

    Using the data from these three rows, we can draw the following descriptive picture. Mentabil scores spanned a range of 50 (from a minimum score of 85 to a maximum score of 135). Speed scores had a range of 16.05 s (from 1.05 s - the fastest quality decision to 17.10 - the slowest quality decision).

  14. Basic statistical tools in research and data analysis

    Bad statistics may lead to bad research, and bad research may lead to unethical practice. Hence, an adequate knowledge of statistics and the appropriate use of statistical tests are important. An appropriate knowledge about the basic statistical methods will go a long way in improving the research designs and producing quality medical research ...

  15. Journals

    Journal of Educational and Behavioral Statistics. Co-sponsored by the ASA and American Educational Research Association, JEBS includes papers that present new methods of analysis, critical reviews of current practice, tutorial presentations of less well-known methods, and novel applications of already-known methods.

  16. These are the statistics papers you just have to read

    It all started with the paper by King (1968) and his critique of the R-squared parameter. Ioannidis (2005) disillusioned me about common research practices, and Gill's (1999) paper not only has a catchy title but also allowed me to understand my own collywobbles about NHST far better.

  17. Descriptive Statistics

    There are 3 main types of descriptive statistics: The distribution concerns the frequency of each value. The central tendency concerns the averages of the values. The variability or dispersion concerns how spread out the values are. You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more, in ...

  18. Bayesian statistics for clinical research

    Frequentist and Bayesian statistics represent two differing paradigms for the analysis of data. Frequentism became the dominant mode of statistical thinking in medical practice during the 20th century. The advent of modern computing has made Bayesian analysis increasingly accessible, enabling growin …

  19. Basics of statistics for primary care research

    The following are the general steps for statistical analysis: (1) formulate a hypothesis, (2) select an appropriate statistical test, (3) conduct a power analysis, (4) prepare data for analysis, (5) start with descriptive statistics, (6) check assumptions of tests, (7) run the analysis, (8) examine the statistical model, (9) report the results ...

  20. How to Find Statistics for a Research Paper: 14 Steps

    Identifying the Data You Need. 1. Outline your points or arguments. Before you can figure out what kind of statistics you need, you should have a sense of what your research paper is about. A basic outline of the points you want to make or hypotheses you're trying to prove can help you narrow your focus. [3]

  21. Inferential Statistics

    Inferential statistics have two main uses: making estimates about populations (for example, the mean SAT score of all 11th graders in the US). testing hypotheses to draw conclusions about populations (for example, the relationship between SAT scores and family income). Table of contents. Descriptive versus inferential statistics.

  22. An Introduction to Statistics: Choosing the Correct Statistical Test

    Abstract. The choice of statistical test used for analysis of data from a research study is crucial in interpreting the results of the study. This article gives an overview of the various factors that determine the selection of a statistical test and lists some statistical tests used in common practice. How to cite this article: Ranganathan P.