This is the first of a series of posts where I explain some fundamental statistical concepts. Statistics has a reputation for being a) complicated, b) boring, or c) both complicated and boring.
Having an understanding of statistics is useful and actually, quite straightforward. If you take the trouble to find out what these essential concepts are, you’ll already be ahead of most people…
Another misconception is that statistics is all about predicting the future, or forecasting, using bizarre mathematical equations. It is possible to use statistics for forecasting, and that is referred to as inferential statistics – we use some data to infer a future outcome.
When we have some data to analyse, for example for a report, we use descriptive statistics.
One common example of a descriptive statistic is average. We use averages to get a sense of a ‘typical’ value when there is a set of values to choose from. This is typically the mean of a set of numbers (all the values summed together, and then divided by the total number of values in the sample), though it could be the median (the central value of a ranked set of values), or the mode (the value that occurs most frequently in a set).
If we have some temperature data from an industrial process, we might want to take the mean of that data to see what the general temperature is over a period of time.
If we calculated the median of the temperature data, we would be looking for the central value. Why might that be important? Well, if the data has a few values that are much higher than the rest (perhaps a tool went blunt and the cutting temperature increased rapidly), the high values would actually inflate the mean, and distort our view of most of the rest of the data. In such a case, the median is appropriate.
If we are interested in the most common temperature value – when the process has reached operating conditions for instance – then the mode is the average we are seeking.
However, if there is one term that separates those-who-know from those-who-don’t-know, it is standard deviation.
The standard deviation is a measure of how far the values in a data set are spread out. If the standard deviation is high, it means that generally the values are further away from the mean, and that is represented by a flat distribution. Conversely, a low standard deviation means that the values are close to the mean, so the dispersion is less. How is this useful?
When we are monitoring real-life processes, the data is not all neat and tidy. There are natural variations in the values that are recorded. When we look at such data, we need to be able to observe which values are acceptable, and those that are outside some arbitrary limits, otherwise referred to as outliers.
If the process temperature mean is 52 degrees celsius, with a standard deviation of 2 degrees we can say that:
- 1 standard deviation would represent 50 degrees to 54 degrees. This range accounts for 68.3% of the total distribution of values;
- 2 standard deviations is all of the values between 48 degrees and 56 degrees, which is 95.4% of the distribution;
- 3 standard deviations is virtually all of the values – 99.7%.
One practical example of descriptive statistics in manufacturing is Lean Six Sigma. The Greek lower case symbol for standard deviation is sigma; as we have seen above, six of the standard deviations represents 99.7% of the values that are either side of the mean.
Six sigma is therefore a powerful concept when analysing process data.
Comprehending these fundamental statistical concepts is the key to getting the most out of IIoT implementations. Initially it may be about producing reports that are meaningful and aid your understanding of the process that you are monitoring. The next stage is to have your IIoT devices produce not only the data, but also real-time descriptive statistics. Such statistics can then be used as the basis of triggers or alerts to intervene, which might support quality assurance for example.
If your devices already do this, then at least you can understand how the data is being reported, and that will help you think about the opportunities to use data from more than one process to inform your decisions.