Entropy is a measure of the amount of uncertainty or randomness in a set of data. In the case of the frequency distribution of the above text, entropy can be calculated by using the formula:
H = -Σ(p_i * log2(p_i))
where p_i is the probability of each word occurring in the text.
To calculate the entropy, first count the number of occurrences of each word in the text. Then divide each count by the total number of words in the text to find the probability of each word. Finally, plug these probabilities into the entropy formula and sum the results.
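As an illustration, here is a minimal Python sketch of this procedure (the word_entropy helper and the sample string are only illustrative, not part of the tutorial text):

```python
import math
from collections import Counter

def word_entropy(text):
    # Count how often each word occurs (lower-cased, whitespace-split)
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Sum -p * log2(p) over the probability of every distinct word
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the TIP31C transistor drives the load while the Arduino sends the PWM signal"
print(f"Entropy: {word_entropy(sample):.2f} bits")
```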
Example:
We will measure the entropy of the tutorial "TIP31C Transistor: A Guide for Arduino Projects".
The tutorial text contains 725 words in total. For example, if the word "TIP31C" appeared 10 times in the text, the probability of it occurring would be 10/725 ≈ 0.014. The same applies to every other word: you have to repeat this for each word in the text to get the probabilities needed for the entropy of the entire text.
Note: This calculation of entropy is based on word frequency. There are other methods of calculating entropy like character frequency, bi-grams, tri-grams etc. Each of these methods has its own advantages and disadvantages.
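The same idea carries over to those other units; the sketch below (using an arbitrary sample string) computes entropy over characters and over character bi-grams instead of words:

```python
import math
from collections import Counter

def entropy(items):
    # Generic Shannon entropy of any sequence of items
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "tip31c transistor guide for arduino projects"
chars = list(text)
bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
print(f"character entropy: {entropy(chars):.2f} bits")
print(f"bi-gram entropy:   {entropy(bigrams):.2f} bits")
```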
Let us assume our text has the following frequency distribution:
Frequency Distribution: TIP31C - 12, transistor - 11, Arduino - 9, power - 6, high - 6, current - 6, PWM - 4, control - 4, audio - 3, frequency - 3, performance - 3, efficiency - 3, amplification - 3, switch - 2, class B - 2, class AB - 2, DC-DC - 1, converter - 1, switching - 1, application - 1, base - 1, collector - 1, emitter - 1, motor - 1, relay - 1, load - 1, potentiometer - 1, duty cycle - 1, circuit diagram - 1, Mega - 1, code - 1, cost-effective - 1, component - 1, range - 1, specification - 1, alternative - 1, 2N2222A - 1, L293D - 1, motor shield - 1, tutorial - 1
Let us calculate the entropy of the above distribution.
For example, in the distribution above, the word "TIP31C" occurs 12 times out of a total of 103 term occurrences, so the probability of "TIP31C" occurring is 12/103 ≈ 0.117. Similarly, "transistor" occurs 11 times out of 103, giving a probability of 11/103 ≈ 0.107, and "Arduino" occurs 9 times, giving 9/103 ≈ 0.087.
Once we have calculated the probability of each word, we can plug these values into the entropy formula:
Entropy = -(0.117 * log2(0.117)) - (0.107 * log2(0.107)) - (0.087 * log2(0.087)) - ...
Calculating the entropy term for every word in the distribution and summing them, we get a final entropy value of approximately 4.73.
That is, Entropy = H ≈ 4.73 bits.
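The hand calculation can be checked numerically; the sketch below hard-codes the counts from the frequency distribution above and evaluates the same formula:

```python
import math

# Term counts copied from the frequency distribution above
counts = {
    "TIP31C": 12, "transistor": 11, "Arduino": 9, "power": 6, "high": 6,
    "current": 6, "PWM": 4, "control": 4, "audio": 3, "frequency": 3,
    "performance": 3, "efficiency": 3, "amplification": 3, "switch": 2,
    "class B": 2, "class AB": 2,
}
# The remaining 24 terms (DC-DC, converter, switching, ..., tutorial) occur once each
singles = ["DC-DC", "converter", "switching", "application", "base", "collector",
           "emitter", "motor", "relay", "load", "potentiometer", "duty cycle",
           "circuit diagram", "Mega", "code", "cost-effective", "component",
           "range", "specification", "alternative", "2N2222A", "L293D",
           "motor shield", "tutorial"]
counts.update({term: 1 for term in singles})

total = sum(counts.values())              # 103 term occurrences in all
H = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(f"{total} occurrences, H = {H:.2f} bits")   # prints: 103 occurrences, H = 4.73 bits
```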
What does a total entropy of 4.73 bits actually mean?
A total entropy of 4.73 bits means that there is a relatively high level of randomness or uncertainty in the distribution of the terms listed above. A higher entropy value indicates that the terms are more evenly distributed and that no single term is much more prevalent than the others. For comparison, if all 40 distinct terms occurred equally often, the entropy would reach its maximum of log2(40) ≈ 5.32 bits, so 4.73 bits describes a distribution that is fairly close to uniform. In short, entropy is a measure of the disorder or randomness in the distribution.
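One way to put this number in context is to compare it with that maximum possible entropy for 40 distinct terms; the ratio H / H_max is sometimes called normalized entropy, as the short sketch below shows:

```python
import math

H = 4.73                      # entropy of the distribution, computed above
n_terms = 40                  # number of distinct terms in the distribution
H_max = math.log2(n_terms)    # maximum entropy: every term equally likely
print(f"H_max = {H_max:.2f} bits, normalized entropy = {H / H_max:.0%}")
# prints: H_max = 5.32 bits, normalized entropy = 89%
```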
The above calculation can also be performed with the Information Theory Calculator, which displays a histogram of the word distribution and reports the entropy and frequency distribution, as shown below.