Six Sigma, business statistics and basic data analytics using R programming

Objective – To summarize the most common statistical analyses used in Six Sigma projects, general business statistics and analytics using R, in a way that is succinct, convenient and usable as a cheat sheet

Prerequisites – Very basic working knowledge of RStudio and intermediate knowledge of statistics. You should be able to interpret statistical outputs such as p-values.

Why R? – Readers who are Six Sigma experts might wonder why one should use R when one is already comfortable with Minitab. Well, I have tried to list some of the top reasons below:
R is FREE!! Minitab costs about $1,500, which should be a good enough reason on its own. Many organizations will not approve the purchase of Minitab due to the expense, especially in client organizations or client environments.
Analysis in Minitab is limited to the functions provided within the software suite, whereas R has more than 3,000 packages with endless functions and analyses. The possibilities in R are limitless.
Once you go through this blog, it can serve as an entry point to far more complex and graphically appealing statistics and analyses, all available in R free of any cost.
R is widely used by data scientists to perform complex analytics, and knowledge of R can open new doors for you professionally.

The downside of R is that it has a steeper learning curve and is not very intuitive.

Let’s get started!!

Dataset -
For demonstrating different functions, we will be using preexisting data sets in R like “mtcars” and “iris”, as these are available to everyone. You can view the headers and sample data from a data set using the head() command, or view the entire table by simply typing its name.
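For instance, a quick look at ‘mtcars’:

```r
# First six rows of the built-in mtcars data set
head(mtcars)

# Number of observations (mtcars has 32 rows)
nrow(mtcars)
```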


 Packages -
We will be using functions from the base, dplyr, qcc, RVAideMemoire and gmodels packages. To use a package, you first need to install it with install.packages() — e.g. install.packages("dplyr") — and then load it with library() — e.g. library(dplyr). Installing a package is required only once; loading it is required in each new session.

Things to know - All R code below appears in blue font and is immediately followed by output that I have copied directly from RStudio. '<-' is the assignment operator, and '$' in the examples below is used to select a column from a data set. You can also import data from a spreadsheet or CSV file directly into R.

Statistical summary –
·         You can use the summary(x) function to get a basic statistical summary for a data set. Below example gives the statistical summary for the ‘disp’ column in the ‘mtcars’ data set.
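The call described above looks like this:

```r
# Minimum, quartiles, median, mean and maximum for the 'disp' column
summary(mtcars$disp)
```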



·         You can also use the summarise(x) function to get a basic statistical summary. This function is part of the “dplyr” package. Below example calculates count, mean, median, standard deviation, minimum and maximum values for the ‘mpg’ column, grouped by the ‘cyl’ column from the ‘mtcars’ data set.
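A sketch of that grouped summary, assuming dplyr is installed and loaded:

```r
library(dplyr)

# Count, mean, median, sd, min and max of mpg for each cyl group
mtcars %>%
  group_by(cyl) %>%
  summarise(count  = n(),
            mean   = mean(mpg),
            median = median(mpg),
            sd     = sd(mpg),
            min    = min(mpg),
            max    = max(mpg))
```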

 
·         To calculate quartiles for a data set, the quantile(x) function can be used. Below example calculates quartiles for the ‘mpg’ column in the ‘mtcars’ data set.
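For example:

```r
# Quartiles (0%, 25%, 50%, 75%, 100%) for mpg
quantile(mtcars$mpg)
```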


·         To calculate quartiles by group, we can use the summarise and quantile functions in combination, as follows
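One way to sketch this combination, again assuming dplyr is loaded:

```r
library(dplyr)

# Quartiles of mpg within each cyl group
mtcars %>%
  group_by(cyl) %>%
  summarise(q1     = quantile(mpg, 0.25),
            median = quantile(mpg, 0.50),
            q3     = quantile(mpg, 0.75))
```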



Box Plot –
You can draw a box plot using the boxplot(x) function. Below example creates a box plot for the ‘mpg’ column in the ‘mtcars’ data set.

> boxplot(mtcars$mpg, main = "Box Plot for MPG")



 You can also create box plots by group, which helps in graphically comparing groups of data. Below example breaks ‘mpg’ down by ‘cyl’ from the ‘mtcars’ data set.

> boxplot(mtcars$mpg ~ mtcars$cyl, main = "Box Plot for MPG", xlab = "CYL")




 Histogram
Histograms depicting a frequency distribution can be created using the hist(x) function from the base package. Below example creates a histogram for the ‘mpg’ column in the ‘mtcars’ data set.

> hist(mtcars$mpg, col = "blue")



Pareto Chart –
Pareto charts can be created using the pareto.chart(x) function from the “qcc” package. You will need to use an additional command, names(), to add category labels to the defect counts. I couldn’t use the “mtcars” data set for Pareto analysis, so I have created vectors of defects and categories to create a Pareto chart. You can also select columns of an imported spreadsheet to do the same.

> library(qcc)
> defects <- c(10,20,35,55,5,15,25)
> categories <- c("Part A", "Part B", "Part C", "Part D", "Part E", "Part F", "Part G")
> names(defects) <- categories
> pareto.chart(defects)





 Normality
To check the normality of a data set, shapiro.test(x) from the stats package (loaded by default) can be used. You could also use a histogram, or compare the mean and median, to check the same in different ways; histograms and these statistical measures are already covered above. The Shapiro-Wilk test, like many other normality tests, provides a p-value to make a determination on normality.
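For example, testing the ‘mpg’ column (a p-value above 0.05 gives no evidence against normality):

```r
# Shapiro-Wilk normality test for mpg
shapiro.test(mtcars$mpg)
```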


Along with the histogram, you can also use a QQ plot to graphically assess normality. Note that qqline() only adds a reference line to an existing plot, so draw the points with qqnorm() first:
> qqnorm(mtcars$mpg)
> qqline(mtcars$mpg)
 



ANOVA –
Below is an example of a One Way ANOVA testing ‘mpg’ across ‘cyl’ from the “mtcars” data set. You can conduct an ANOVA test using the aov() function from the stats package (loaded by default).

> z <- aov(mtcars$mpg ~ mtcars$cyl)
> summary(z)

            Df Sum Sq Mean Sq F value   Pr(>F)   
mtcars$cyl   1  817.7   817.7   79.56 6.11e-10 ***
Residuals   30  308.3    10.3                    
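Note that ‘cyl’ is stored as a number, so the call above fits it as a continuous predictor (hence Df = 1 in the output). For the classical one-way ANOVA comparing the three cylinder groups, convert it to a factor first, for example:

```r
# One-way ANOVA with cyl treated as a 3-level factor (Df = 2)
summary(aov(mpg ~ factor(cyl), data = mtcars))
```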


Mood’s Median test -
If your data is non-normal, or if you want to conduct a non-parametric test to compare medians, you can use this test instead of ANOVA. Below example compares ‘mpg’ across ‘cyl’ from the “mtcars” data set. For this one, you will have to install and load the “RVAideMemoire” package.

> library(RVAideMemoire)
> mood.medtest(mtcars$mpg ~ mtcars$cyl)
 
                    Mood's median test
 
data:  mtcars$mpg by mtcars$cyl
p-value = 9.369e-08


Correlation & Regression –
 We will use the ‘mpg’ and ‘disp’ variables from ‘mtcars’ to study correlation. We will first select the columns we need to analyze from the mtcars table in the following way (select() is from the dplyr package)
> mpg_disp <- select(mtcars, mpg, disp)


We can calculate the coefficient of correlation by using the cor(x) function from the base package
> cor(mpg_disp)
            mpg       disp
mpg   1.0000000 -0.8475514
disp -0.8475514  1.0000000

We can also draw the scatter plot to study the relation graphically between the two variables
> plot(mpg_disp)
 
 
 
We can now derive the regression model for ‘mpg’ and ‘disp’ using the lm() function. The output provides us with the model coefficients and R-squared values.

> mpg_disp_reg <- lm(mtcars$mpg ~ mtcars$disp)
> summary(mpg_disp_reg)


Call:
lm(formula = mtcars$mpg ~ mtcars$disp)

Residuals:
    Min      1Q  Median      3Q     Max
-4.8922 -2.2022 -0.9631  1.6272  7.2305

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
mtcars$disp -0.041215   0.004712  -8.747 9.38e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.251 on 30 degrees of freedom
Multiple R-squared:  0.7183,     Adjusted R-squared:  0.709
F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

We can also create a fitted line plot to show the regression line using the abline(x) function. abline() can only be drawn over an existing scatter plot; it will not create a plot on its own –
> plot(mtcars$mpg,mtcars$disp)
> abline(lm(mtcars$disp~mtcars$mpg))


t-tests –
To demonstrate t-tests, we will be using the ‘iris’ data set, which is available to everyone in R. First we will get a basic summary of the two columns we will be analyzing, so that the output from the t-tests is more relatable.

> summarise(iris, mean(Sepal.Length), sd(Sepal.Length), median(Sepal.Length), n())
  mean(Sepal.Length) sd(Sepal.Length) median(Sepal.Length) n()
1           5.843333        0.8280661                  5.8 150
> summarise(iris, mean(Sepal.Width), sd(Sepal.Width), median(Sepal.Width), n())
  mean(Sepal.Width) sd(Sepal.Width) median(Sepal.Width) n()
1          3.057333       0.4358663                   3 150

·         1 sample t-test - The test mean has to be specified as mu=x. The alternative hypothesis can be selected as alternative=”greater” or alternative=”less”. If alternative is not specified, it defaults to two-sided (not equal).
> t.test(iris$Sepal.Length, mu=5)
 
            One Sample t-test
 
data:  iris$Sepal.Length
t = 12.4733, df = 149, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 5.709732 5.976934
sample estimates:
mean of x 
 5.843333 

·         2 sample t-test – The two sample data sets have to be specified within the function. You can add paired=TRUE for a paired t-test and var.equal=TRUE to specify that the variances are equal (the default is Welch’s test, which does not assume equal variances).
> t.test(iris$Sepal.Length, iris$Sepal.Width)
 
            Welch Two Sample t-test
 
data:  iris$Sepal.Length and iris$Sepal.Width
t = 36.4633, df = 225.678, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.63544 2.93656
sample estimates:
mean of x mean of y 
 5.843333  3.057333 


Proportion tests –
·         2 proportion test – In the below example, we have compared 2 proportions, 75/100 & 160/200, with the alternative hypothesis “greater”
> prop.test(c(75,160),c(100,200), alternative = "greater")
 
            2-sample test for equality of proportions with continuity correction
 
data:  c(75, 160) out of c(100, 200)
X-squared = 0.7095, df = 1, p-value = 0.8002
alternative hypothesis: greater
95 percent confidence interval:
 -0.1425725  1.0000000
sample estimates:
prop 1 prop 2 
  0.75   0.80 

·         1 proportion test – The test proportion has to be entered as p=x
> prop.test(80, 160, p=0.9, alternative = "less")
 
            1-sample proportions test with continuity correction
 
data:  80 out of 160, null probability 0.9
X-squared = 280.0174, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is less than 0.9
95 percent confidence interval:
 0.0000000 0.5675475
sample estimates:
  p 
0.5 

Confidence Interval –
For CI calculation, you will need to install and load the “gmodels” package and use the ci(x) function

> install.packages("gmodels")
> library(gmodels)
> ci(iris$Sepal.Length, confidence=0.95)
  Estimate   CI lower   CI upper Std. Error 
5.84333333 5.70973248 5.97693419 0.06761132 
 

