Fisher's exact test in R: independence test for a small sample

Performing the Fisher’s exact test in R: A reliable independence test for small sample sizes

Introduction:

In this article, we discuss Fisher’s exact test of independence, which is used to determine whether there is a significant relationship between two categorical variables. We compare it with the more commonly used Chi-square test and explain when each test should be used, depending on the sample size.

To illustrate Fisher’s exact test, we examine the association between smoking and being a professional athlete. Data were collected on 14 individuals and summarized in a contingency table of observed frequencies.

We also discuss expected frequencies and how to retrieve them using the chisq.test() function in R. The expected frequencies should be checked before choosing between the Chi-square test and Fisher’s exact test.

To perform Fisher’s exact test in R, we use the fisher.test() function and interpret its output, in particular the p-value. We reject the null hypothesis if the p-value is below the 5% significance level.

Finally, we show how to visualize the results of Fisher’s exact test with a barplot, using the ggbarstats() function from the {ggstatsplot} package. The plot shows the pattern observed in the sample alongside the p-value of the test.

We hope this article will help you understand and apply the Fisher’s exact test of independence in R effectively. If you have any questions or suggestions, please leave a comment.

The Importance of Independence Tests in Statistics

In statistics, independence tests are used to determine whether there is a significant relationship between two categorical variables. Two widely used independence tests are the Chi-square test and Fisher’s exact test. The Chi-square test is typically used when the sample is large, whereas Fisher’s exact test is preferred when the sample is small.

Understanding the Chi-square Test and Fisher’s Exact Test

The Chi-square test is appropriate when the sample is large enough, because its p-value is an approximation that becomes exact only as the sample size tends to infinity. Fisher’s exact test, on the other hand, is better suited to small samples, because its p-value is exact rather than approximate.

According to the literature, a general rule is that the Chi-square test is not appropriate when any of the expected values in the contingency table is less than 5. In such cases, the Fisher’s exact test is preferred (McCrum-Gardner, 2008; Bower, 2003).
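
The rule of thumb above can be sketched as a small helper. This is only a minimal sketch: the function name choose_independence_test(), its threshold argument and the two tables of made-up counts are hypothetical and are introduced purely for illustration.

```r
# Hypothetical helper: apply the "any expected count below 5" rule of thumb.
choose_independence_test <- function(tab, threshold = 5) {
  # chisq.test() warns when expected counts are small; the warning is
  # silenced here because only the expected frequencies are needed.
  expected <- suppressWarnings(chisq.test(tab)$expected)
  if (any(expected < threshold)) "fisher" else "chisq"
}

# Made-up counts, for illustration only.
small_tab <- matrix(c(3, 1, 1, 3), nrow = 2)
large_tab <- matrix(c(50, 30, 20, 40), nrow = 2)

choose_independence_test(small_tab)
choose_independence_test(large_tab)
```

The helper only encodes the rule of thumb; it does not replace a look at the expected frequencies themselves.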

Hypotheses for Fisher’s Exact Test

The hypotheses for Fisher’s exact test are the same as those for the Chi-square test. The null hypothesis (H0) states that there is no relationship between the two categorical variables, meaning that knowing the value of one variable does not help predict the value of the other variable. Conversely, the alternative hypothesis (H1) asserts that there is a relationship between the two variables, and knowing the value of one variable aids in predicting the value of the other variable.

Analyzing Data on Smoking and Being a Professional Athlete

To illustrate the application of independence tests, let’s consider whether there is a statistically significant association between smoking habits and being a professional athlete. The variables of interest, smoking and professional athlete status, are qualitative variables with only two possible values: “yes” or “no”. In this example, data on 14 individuals were collected.

Observed Frequencies

The observed frequencies for the contingency table are as follows:

             Non-smoker  Smoker
Athlete               7       2
Non-athlete           0       5
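
As a minimal sketch, the observed frequencies can be entered in R as follows. It assumes the counts are stored as a base R matrix; the object name dat and the dimension labels athlete and smoke are illustrative choices, not taken from the article.

```r
# Observed frequencies from the table above, stored as a 2 x 2 matrix.
# `dat`, `athlete` and `smoke` are illustrative names.
dat <- matrix(
  c(7, 0, 2, 5),  # filled column by column: Non-smoker first, then Smoker
  nrow = 2,
  dimnames = list(
    athlete = c("Athlete", "Non-athlete"),
    smoke   = c("Non-smoker", "Smoker")
  )
)

dat
sum(dat)  # total number of individuals in the sample
```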

To visualize the data, a mosaic plot can be used. From the plot, it is evident that the proportion of smokers is higher among non-athletes than athletes. However, visual representation alone is not sufficient to determine if there is a significant association in the overall population.
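
A mosaic plot of this kind can be drawn with base R graphics. The sketch below assumes the counts are stored in a matrix called dat (an illustrative name) and also computes the proportion of smokers within each group, which is what the plot displays.

```r
# Observed frequencies (illustrative object and dimension names).
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(
    athlete = c("Athlete", "Non-athlete"),
    smoke   = c("Non-smoker", "Smoker")
  )
)

# Proportion of non-smokers and smokers within each row (rows sum to 1).
row_props <- prop.table(dat, margin = 1)
round(row_props, 3)

# Mosaic plot: tile areas are proportional to the cell counts.
mosaicplot(dat, main = "Mosaic plot", color = TRUE)
```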

Understanding Expected Frequencies

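Before choosing between the two tests, it is necessary to check the expected frequencies, that is, the counts each cell would contain if the two variables were independent (row total × column total / grand total; for example, 9 × 7 / 14 = 4.5 for non-smoking athletes). In R, they are available in the expected component of the object returned by the chisq.test() function. In our example, the expected frequencies are as follows: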

             Non-smoker  Smoker
Athlete             4.5     4.5
Non-athlete         2.5     2.5

The table of expected frequencies confirms that Fisher’s exact test should be used instead of the Chi-square test, because at least one cell has an expected frequency below 5 (here, all four do). If chisq.test() is run on such a table without checking the expected frequencies first, R issues the warning “Chi-squared approximation may be incorrect”.
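
The check described above might look like the following sketch. The object names (dat, chi, warning_text, by_hand) are illustrative; the warning raised by chisq.test() is captured instead of printed so that it can be inspected.

```r
# Observed frequencies (illustrative object and dimension names).
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(
    athlete = c("Athlete", "Non-athlete"),
    smoke   = c("Non-smoker", "Smoker")
  )
)

# Run the Chi-square test only to retrieve the expected frequencies,
# capturing the small-sample warning instead of letting it print.
warning_text <- NULL
chi <- withCallingHandlers(
  chisq.test(dat),
  warning = function(w) {
    warning_text <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

chi$expected   # expected frequencies under independence
warning_text   # message of the warning raised by chisq.test()

# The same values by hand: row total * column total / grand total.
by_hand <- outer(rowSums(dat), colSums(dat)) / sum(dat)
by_hand

# TRUE when at least one expected frequency is below 5.
any(chi$expected < 5)
```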

Performing Fisher’s Exact Test in R

To conduct Fisher’s exact test in R, use the fisher.test() function. The most important part of the output is the p-value, which in this case is 0.02098. Because this is below the 5% significance level, we reject the null hypothesis and conclude that there is a significant relationship between smoking habits and being a professional athlete.
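
A minimal sketch of this step, assuming the observed frequencies are stored in a matrix named dat (an illustrative name):

```r
# Observed frequencies (illustrative object and dimension names).
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(
    athlete = c("Athlete", "Non-athlete"),
    smoke   = c("Non-smoker", "Smoker")
  )
)

# Fisher's exact test of independence (two-sided by default).
test <- fisher.test(dat)
test

# The p-value can be extracted and compared with the significance level.
test$p.value
test$p.value < 0.05
```

For a 2 x 2 table, the printed output also reports an odds ratio estimate and its confidence interval, but the p-value is the quantity used for the decision here.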

Interpreting the Results

Rejecting the null hypothesis means concluding that there is a significant relationship between the two categorical variables. Consequently, knowing the value of one variable helps predict the value of the other variable. Fisher’s exact test supports this conclusion with a p-value of 0.021.
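
To see where this p-value comes from, it can be reproduced by hand. With the row and column totals held fixed, the count of non-smoking athletes follows a hypergeometric distribution under the null hypothesis, and the two-sided p-value adds up the probabilities of all tables that are no more likely than the observed one. The sketch below uses illustrative variable names and is meant as a check, not as a replacement for fisher.test().

```r
# Observed frequencies (illustrative object and dimension names).
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(
    athlete = c("Athlete", "Non-athlete"),
    smoke   = c("Non-smoker", "Smoker")
  )
)

n_nonsmokers <- sum(dat[, "Non-smoker"])      # 7
n_smokers    <- sum(dat[, "Smoker"])          # 7
n_athletes   <- sum(dat["Athlete", ])         # 9
observed     <- dat["Athlete", "Non-smoker"]  # 7

# Values the top-left cell can take once the margins are fixed.
support <- max(0, n_athletes - n_smokers):min(n_athletes, n_nonsmokers)

# Hypergeometric probability of each possible table under independence.
probs <- dhyper(support, n_nonsmokers, n_smokers, n_athletes)
p_obs <- dhyper(observed, n_nonsmokers, n_smokers, n_athletes)

# Two-sided p-value: total probability of tables at most as likely as the
# observed one (a small relative tolerance guards against rounding ties).
p_by_hand <- sum(probs[probs <= p_obs * (1 + 1e-7)])
p_by_hand
```

Here the only tables as extreme as the observed one are those with 7 or 2 non-smoking athletes, each with probability 21/2002, which gives 42/2002.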

Visualizing the Results

The results of Fisher’s exact test can be displayed on a barplot using the ggbarstats() function from the {ggstatsplot} package. Because ggbarstats() works from a data frame rather than from a contingency table, the table is first converted into a data frame. The resulting barplot shows a higher proportion of smokers among non-athletes than among athletes, which is consistent with a relationship between smoking habits and being a professional athlete. The subtitle of the plot displays the p-value of the test.
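
One way this could be done is sketched below. The object names (dat, counts, long, subtitle) are illustrative. The ggbarstats() call assumes a version of {ggstatsplot} that accepts the results.subtitle and subtitle arguments, and it is only run if the package is installed.

```r
# Observed frequencies (illustrative object and dimension names).
dat <- matrix(
  c(7, 0, 2, 5),
  nrow = 2,
  dimnames = list(
    athlete = c("Athlete", "Non-athlete"),
    smoke   = c("Non-smoker", "Smoker")
  )
)

# Convert the contingency table to a data frame with one row per individual.
counts <- as.data.frame(as.table(dat))  # columns: athlete, smoke, Freq
long <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("athlete", "smoke")]
rownames(long) <- NULL

# Build a subtitle carrying the p-value of Fisher's exact test.
test <- fisher.test(dat)
subtitle <- paste0(
  "Fisher's exact test, p-value = ",
  ifelse(test$p.value < 0.001, "< 0.001", round(test$p.value, 3))
)

# Draw the barplot only when {ggstatsplot} is available.
if (requireNamespace("ggstatsplot", quietly = TRUE)) {
  p <- ggstatsplot::ggbarstats(
    data = long,
    x = smoke,
    y = athlete,
    results.subtitle = FALSE,  # replace the default subtitle with our own
    subtitle = subtitle
  )
  print(p)
}
```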

In conclusion, the Fisher’s exact test is a valuable tool for analyzing the relationship between categorical variables. This article has provided an overview of the test’s application and interpretation in R. If you have any questions or suggestions related to this topic, please feel free to leave a comment for further discussion.

References:
McCrum-Gardner, E. (2008). Which is the Correct Statistical Test to Use? British Journal of Oral and Maxillofacial Surgery, 46(1), 38–41.
Bower, K. M. (2003). When to Use Fisher’s Exact Test. American Society for Quality, Six Sigma Forum Magazine, 2(4), 35–37.

Summary:

This article provides an overview of the Fisher’s exact test of independence, which is used to determine if there is a significant relationship between two categorical variables. It explains the difference between the Chi-square test and the Fisher’s exact test and discusses when to use each one. The article then walks through an example of using the Fisher’s exact test in R to analyze the association between smoking habits and being a professional athlete. It explains how to calculate observed and expected frequencies, and how to interpret the results. Additionally, the article demonstrates how to visualize the results using a barplot and provides helpful tips and warnings. Overall, it is a comprehensive guide to performing and interpreting the Fisher’s exact test of independence in R.
