This page explains what the confidence interval for the average is and how you can use it. The computations of the confidence interval for the average, and the others described on this page, are part of the data analysis services we offer at CRI. Please click the "Data Analysis" button above to see the other types of data analysis we offer.
First of all, we use the word "average" throughout this page. The word "mean" is quite often used in place of "average", but "mean" has several different meanings, as this sentence itself shows, so we avoid it.
We explain how certain the value of an average is. We also explain how to tell whether the difference between an individual data value and the average computed from those data is negligible, and how to tell whether the difference between the averages computed from two groups is negligible.
Population and samples
We quite often use the words "population" and "sample" in statistical analysis. Let's say we want to know the average height of people aged between 20 and 24 in your country. Then the "population" is all the people aged between 20 and 24. It is usually quite difficult to get data from all of these people because of the cost. So what we usually do is pick some of these people and get data from them. The "samples" are the people we picked. We then compute the average from these samples and call the result the average height of people aged between 20 and 24 in your country. What we did here is estimate the average of the "population" (let us call it the "true average" on this page) from the "samples", and we silently assumed that the sample average is the same as the true average. Skeptical people might ask whether that assumption is correct. In reality, the two are usually not the same. As you might expect, the exact value of the sample average would change even if you replaced only one of the samples with another member of the population. This means that the value of the sample average has an uncertainty. So the question is how large that uncertainty is.
Our example; normal distribution
Figure 1a shows the frequency distribution of population A, which we use as an example on this page. This population contains 100,000 (0.1 million) data values. Before going any further, we would like to introduce the "normal distribution" briefly. Data of many kinds show the so-called normal distribution, which is drawn as a red bell-shaped curve in figures 1a and 1b. In some other cases we can transform non-normally distributed data into normally distributed data by applying a simple mathematical transformation. Because of this generality, the characteristics of normally distributed data have been well studied. Thus, we made example populations whose values are random but normally distributed. Figure 1b shows the frequency distribution of another population, named B. This population is also normally distributed.
Standard deviation and confidence interval for the average; how certain is that average?
We usually compute the standard deviation as a first step in evaluating how certain an average is. The standard deviation shows how scattered the data are around their average, and it becomes larger as the data scatter more. We then compute the confidence interval for the average from this standard deviation. At this point we have to specify how certain this interval is by means of a percentage called the confidence level. For example, a 95% confidence interval means that if we repeatedly extract samples from the population and compute the confidence interval for their average each time (the sample average differs each time because the samples are not the same), the average of the population would fall within these intervals 95% of the time. Thus, we say that the confidence level of the calculated confidence interval for the average is 95%. In this way we know how certain the average computed from the samples is as an estimate of the true average. We customarily use confidence levels of 99, 95 or 90%.
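The two steps described above (standard deviation first, then the interval) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily the exact procedure we use: the function name `confidence_interval` and the height values are made up, and the critical value comes from the normal approximation, which is reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(data, level=0.95):
    """Return (average, half_width) of the confidence interval for the average.

    Assumption: normal approximation for the critical value.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data values")
    average = mean(data)
    sd = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    # Two-sided critical value, about 1.96 for level=0.95
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd / math.sqrt(n)
    return average, half_width


# Hypothetical heights in cm, made up for this illustration only
heights = [168.2, 171.5, 165.0, 174.3, 169.8, 172.1, 166.7, 170.4]
average, half_width = confidence_interval(heights)
print(f"average = {average:.2f} +/- {half_width:.2f}")
```

For small samples a Student's t critical value would be the more careful choice; it is slightly larger than the normal one and therefore gives a somewhat wider interval.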
The standard deviation is also useful for knowing how many data lie within a specific interval around the average: Tchebycheff's theorem dictates that at least 1-1/C^2 of the data (C^2 means the square of C, and C must be greater than 1!) lie within C times the standard deviation of the average. The blue line in figure 1c shows what fraction of the data is included within a specific distance from the average, measured in standard deviations, when the data are normally distributed. In this case, 68.3% of the data lie within one standard deviation and 95.5% within twice the standard deviation. Tchebycheff's theorem (red line) gives smaller values, but the good thing about this theorem is that it is applicable to non-normally distributed data.
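To compare the two curves numerically, here is a small Python sketch. The function names are our own for this illustration; the normal fractions come from the cumulative distribution function of the standard normal distribution.

```python
from statistics import NormalDist


def chebyshev_lower_bound(c):
    """Minimum fraction of data within c standard deviations (any distribution)."""
    if c <= 1:
        raise ValueError("C must be greater than 1")
    return 1 - 1 / c**2


def normal_fraction_within(c):
    """Fraction of data within c standard deviations for normally distributed data."""
    nd = NormalDist()  # standard normal: average 0, standard deviation 1
    return nd.cdf(c) - nd.cdf(-c)


for c in (1.5, 2.0, 3.0):
    print(f"C = {c}: Tchebycheff >= {chebyshev_lower_bound(c):.3f}, "
          f"normal = {normal_fraction_within(c):.3f}")
```

The bound from the theorem is always the smaller number, which is the price of making no assumption about the shape of the distribution.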
Now, returning to our examples, the average and the standard deviation of population A are 100.0 and 10.0, and those of population B are 120.0 and 20.0, respectively. We extracted 500 data from each of these two populations. Figure 2 shows the frequency distributions of these samples. Let us call them sample A (Fig. 2a) and sample B (Fig. 2b). The average and the standard deviation of sample A are 99.34 and 9.43, and those of sample B are 119.76 and 19.66, respectively. The 95% confidence intervals for these sample averages are +/- 0.83 (or between 98.51 and 100.17) for sample A and +/- 1.73 (or between 118.03 and 121.49) for sample B. Here, we computed these intervals from the sample standard deviations because, unlike in our examples, it is quite unlikely in the real world that we know the standard deviation of a population whose average we are trying to estimate. These results show that the averages of populations A and B both fall within the confidence intervals for the sample averages. As you might have noticed, the confidence interval for sample average B is wider than that for sample average A. This is because the data of sample B scatter more than those of sample A, so the standard deviation of sample B is larger. This, in turn, means that the uncertainty of sample average B as an estimate of population average B is larger and, thus, the confidence interval is wider.
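The half widths quoted above can be checked approximately from the sample standard deviations and the sample size alone. The sketch below uses the normal approximation (an assumption on our part) and a made-up helper name, `half_width`. It gives roughly 0.83 for sample A and 1.72 for sample B; the small gap from 1.73 would be consistent with a slightly larger critical value, such as one taken from Student's t distribution.

```python
import math
from statistics import NormalDist


def half_width(sd, n, level=0.95):
    """Half width of the confidence interval for the average (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sd / math.sqrt(n)


# Sample standard deviations and sample size taken from the text above
for name, sd in (("A", 9.43), ("B", 19.66)):
    print(f"sample {name}: +/- {half_width(sd, 500):.2f}")
```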
Now, let us see how certain the confidence level, the "95%" of the 95% confidence interval, actually is. We ran 200 experiments, in each of which we extracted 500 data from population A and computed the confidence interval for the sample average. Every data value in population A was used exactly once across these experiments (100000/500=200). The result was that the true average fell within the confidence interval for the sample average 192 times. This is 96% of the experiments, which shows that the number 95 of the 95% confidence interval itself has an uncertainty in realistic situations. Thus, it usually does not make much sense to compare confidence intervals of, say, 95% and 96%.
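An experiment of this kind is easy to imitate. The Python sketch below is not the program behind the result above: it generates its own normally distributed population (average 100, standard deviation 10, 100,000 values), uses the normal approximation for the interval, and relies on arbitrary random seeds, so the count it prints will generally differ from 192 and will change with the seed.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def coverage_count(population, sample_size, level=0.95, seed=0):
    """Split the population into disjoint samples and count how many
    confidence intervals contain the true (population) average."""
    rng = random.Random(seed)
    shuffled = population[:]
    rng.shuffle(shuffled)  # each value ends up in exactly one sample
    true_average = fmean(population)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    trials = len(shuffled) // sample_size
    hits = 0
    for i in range(trials):
        sample = shuffled[i * sample_size:(i + 1) * sample_size]
        half_width = z * stdev(sample) / math.sqrt(sample_size)
        if abs(fmean(sample) - true_average) <= half_width:
            hits += 1
    return hits, trials


# Synthetic stand-in for population A (an assumption, not the original data)
rng = random.Random(42)
population = [rng.gauss(100.0, 10.0) for _ in range(100_000)]
hits, trials = coverage_count(population, 500)
print(f"{hits} of {trials} intervals contain the true average")
```

Running it with several seeds gives counts that scatter around 95% of the trials, which illustrates the same point: the observed percentage is itself uncertain.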
How many samples do we need?
The number of samples you extract from the population also affects the confidence interval for the sample average. Figure 3 shows how the half width of the 95% confidence interval for the sample average changes as the number of samples changes. The standard deviation is fixed, and we computed three cases. The figure shows that the confidence interval narrows rapidly at first as the number of samples increases. A narrower confidence interval means a more accurate estimate of the true average. However, this trend slows down as you extract more samples from the population and, eventually, the confidence interval hardly narrows any more. At that point your effort to obtain more data will not be rewarded much. Therefore, because obtaining data usually costs money, you need a strategy for deciding just how many samples you need.
You need three numbers to know how many samples you need. First, you have to specify the confidence level. We customarily choose 99, 95 or 90%, as described before. Next, you have to decide how large a difference between the sample average and the true average, that is, the half width of the confidence interval, you can accept. At this point you have basically determined the certainty and accuracy you want from the sample average. Then you need an estimate of the standard deviation of the population. You can base it on past experience or just give a rough estimate. From these three numbers you can draw a figure similar to figure 3 and estimate how many samples you need to accomplish your task. If you are not sure about your estimate of the standard deviation of the population, you can repeat the computation with different values and make your decision based on those results along with your financial resources. Figure 3 indicates that 500 samples (the vertical black line) were enough for population A, but we probably needed more samples for population B in our examples.
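The three numbers can also be turned into a sample count directly, by solving the half-width relation (half width = z x standard deviation / square root of the number of samples) for the number of samples. The Python sketch below assumes the normal approximation; the function name and the target half width of 1.0 are hypothetical choices for illustration only.

```python
import math
from statistics import NormalDist


def required_sample_size(sd_estimate, half_width, level=0.95):
    """Smallest number of samples whose confidence interval half width
    does not exceed the target (normal approximation)."""
    if sd_estimate <= 0 or half_width <= 0:
        raise ValueError("standard deviation and half width must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.ceil((z * sd_estimate / half_width) ** 2)


# Hypothetical target: sample average within +/- 1.0 of the true average
for sd in (10.0, 20.0):
    n = required_sample_size(sd, half_width=1.0)
    print(f"standard deviation {sd}: about {n} samples")
```

With this hypothetical target, the sketch asks for 385 samples when the standard deviation is 10 and 1537 when it is 20. Doubling the standard deviation quadruples the required number of samples, which would be in line with 500 samples being enough for population A but not for population B.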
The value of my data is different from sample average. Is this difference significant?
We separated our services into several categories for the sole purpose of introducing our services in an organized manner. To serve your needs we do combine our services in different categories.