Understanding Shannon's Entropy metric for Information [pdf] (arxiv.org)
100 points by micouay on Oct 8, 2021 | 42 comments


In 1939, when Shannon had been working on his equations for some time, he happened to visit the mathematician John von Neumann. During their discussions, regarding what Shannon should call the "measure of uncertainty" or attenuation in phone-line signals with reference to his new information theory, according to one source:[10]

> My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea: ‘You should call it entropy, for two reasons: In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.’

[10]: M. Tribus, E.C. McIrvine, "Energy and information", Scientific American 224 (1971). Quoted via https://en.wikipedia.org/wiki/History_of_entropy



I enjoyed reading "A Mind at Play" by Soni and Goodman. There are several of these kinds of stories about his colleagues. I also like how the book goes into his childhood, including using an electrified fence to communicate with his farm neighbors!


I read your arxiv.org article on Shannon Entropy: https://arxiv.org/abs/1405.2061

What I am amazed at is that arxiv accepted this. I submitted exactly the same argument to them on 27 Jan 2009. They rejected it.

Here it is on my home page: http://dsw.users.sonic.net/entropy.html

Entropy is (1) the expected value of (2) the information of an event.
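That two-part decomposition is easy to check directly. Below is a minimal Python sketch (my own function names, not from either article):

```python
import math

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

def entropy(dist):
    """Shannon entropy: the expected value of the surprisal of each event."""
    return sum(p * surprisal(p) for p in dist if p > 0)

# A fair coin: each outcome carries 1 bit of surprisal, so the expectation is 1 bit.
print(entropy([0.5, 0.5]))            # 1.0

# A biased coin is less uncertain on average, even though tails is more surprising.
print(round(entropy([0.9, 0.1]), 3))  # 0.469
```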

You might want to post a link to your answer here and see which of our explanations is more popular: https://math.stackexchange.com/questions/331103/intuitive-ex...


The "Twenty Questions" explanation of Shannon entropy is definitely the most intuitive one I've read so far. I've heard it repeated many times, but I do see yours is from 2006! Thanks for this.

In particular here is where I read it (pdf): http://tuvalu.santafe.edu/~simon/it.pdf


It's not mine, I just found it very intuitive and decided to share here.


I didn't know arXiv actually rejects papers based on content. I thought they just had a quick glance to see if it looks like a paper.


arXiv submissions are not reviewed in detail (i.e., not reviewed to the same depth as in a journal's peer review procedure), but there is a moderation process: https://arxiv.org/help/moderation


The moderation situation is annoying enough that at least one alternative exists.

Quoting https://vixra.org/why ("Why viXra?"):

In 1991 the electronic e-print archive, now known as arXiv.org, was founded at Los Alamos National Laboratories. In the early days of the World Wide Web it was open to submissions from all scientific researchers, but gradually a policy of moderation was employed to block articles that the administrators considered unsuitable. In 2004 this was replaced by a system of endorsements to reduce the workload and place responsibility of moderation on the endorsers. The stated intention was to permit anybody from the scientific community to continue contributing. However many of us who had successfully submitted e-prints before then found that we were no longer able to. Even those with doctorates in physics and long histories of publication in scientific journals can no longer contribute to the arXiv unless they can find an endorser in a suitable research institution.

The policies of the administrators of Cornell University who now control the arXiv are so strict that even when someone succeeds in finding an endorser their e-print may still be rejected or moved to the "physics" category of the arXiv where it is likely to get less attention. Those who endorse articles that Cornell find unsuitable are under threat of losing their right to endorse or even their own ability to submit e-prints. Given the harm this might cause to their careers it is no surprise that endorsers are very conservative when considering articles from people they do not know. These policies are defended on the arXiv's endorsement help page

A few of the cases where people have been blocked from submitting to the arXiv have been detailed on the Archive Freedom website, but as time has gone by it has become clear that Cornell has no plans to bow to pressure and change their policies. Some of us now feel that the time has come to start an alternative archive which will be open to the whole scientific community. That is why viXra has been created. viXra will be open to anybody for both reading and submitting articles. We will not prevent anybody from submitting and will only reject articles in extreme cases of abuse, e.g. where the work may be vulgar, libellous, plagiaristic or dangerously misleading.

It is inevitable that viXra will therefore contain e-prints that many scientists will consider clearly wrong and unscientific. However, it will also be a repository for new ideas that the scientific establishment is not currently willing to consider. Other perfectly conventional e-prints will be found here simply because the authors were not able to find a suitable endorser for the arXiv or because they prefer a more open system. It is our belief that anybody who considers themselves to have done scientific work should have the right to place it in an archive in order to communicate the idea to a wide public. They should also be allowed to stake their claim of priority in case the idea is recognised as important in the future.

Many scientists argue that if arXiv.org had such an open policy then it would be filled with unscientific papers that waste people's time. There are problems with that argument. Firstly there are already a high number of submissions that do get into the archive which many people consider to be rubbish, but they don't agree on which ones they are. If you removed them all, the arXiv would be left with only safe papers of very limited interest. Instead of complaining about the papers they don't like, researchers need to find other ways of selecting the papers of interest to them. arXiv.org could help by providing technology to help people filter the article lists they browse.

It is also often said that the arXiv.org exclusion policies do not matter because if an independent (or amateur) scientist were to make a great discovery, it would certainly be noticed and recognised. Here are three reasons why this argument is wrong and unhelpful. Firstly, many independent scientists are just trying to do ordinary science. They do not have to make the next great paradigm shift in science before their work can be useful. Secondly, the best new ideas do not follow from conventional research and it may take several years before their importance can be appreciated. If such a discovery cannot be put in a permanent archive it will be overlooked to the detriment of both the author and the scientific community. Thirdly, it is not just independent or amateur scientists that are having problems getting access to repositories and the recognition they deserve.

Another argument is that anybody can submit their work to a journal where it will get an impartial review. The truth is that most journals are now more concerned with the commercial value of their impact factor than with the advance of science. Papers submitted by anyone without a good affiliation to a research institution find it very difficult to publish. Their work is often returned with an unhelpful note saying that it will not be passed on for review because it does not meet the criteria of the journal.

The visual design of viXra.org (but not its content) is a parody of arXiv.org to highlight Cornell University's unacceptable censorship policy. Vixra is also an experiment to see what kind of scientific work is being excluded by the arXiv. But most of all it is a serious and permanent e-print archive for scientific work. Unlike arXiv.org it is truly open to scientists from all walks of life. You can support this project by submitting your articles.


I have several major problems with this explanation (often endemic in the discussion more generally):

1. Shannon's original paper relates "the capacity [of a channel] to transmit information" as well as the potential of a system (e.g. the English language) to generate information. In other words, your pipe needs to be able to accommodate the potential volume coming from the source ("the entropy of the source determines the channel capacity"). The amount of information in a message or the amount of surprise, as the author has it, should not be confused with the amount of potential information.

2. Instead of "system" and "channel" the author uses the word "variable" which I find misleading. Channel (like a telegraph cable) and system (the English language) are specifically relevant to Shannon's discussion.

3. The discussion of "surprise," as the author has it, is misleading. Shannon is writing his paper in conversation with Hartley and Nyquist, all three specifically attempting to bracket out subjective psychological factors such as surprise, in order to describe the capacity for information transmission in terms of quantitative measures, based on "physical considerations alone" (Hartley). Surprise reintroduces a subjective, relative, psychological understanding of information the original authors wanted to avoid.


With all due respect, you seem to be saying that this 'introductory text' is insufficiently sophisticated, and proceed to dump a high density rebuttal in stilted, academic style.

You may be extra smart in the sense of understanding the topic in a deep way, but it sure does seem foolish and/or myopic to posture this way over a basic introduction that cites the original paper in the first sentence.


(1) A "Random Variable" is a term of art in probability theory and has an entropy.

(2) Your distinction between "capacity" and "potential" is pointless: you have to budget for the expected information you have to transmit.

(3) Your silly games with words would apply to any technical discussion as they all use metaphors. You sound like someone who has never done any technical work at all.


"Surprise", or rather "specific surprise", is actually a technical term in information theory, e.g., see: https://iopscience.iop.org/article/10.1088/0954-898X/10/4/30...

This doesn't seem to be how the author is using it in this arxiv paper though.


1. You are correct, there are two different perspectives: (a) the channel designer's (what most of Shannon's work is about), which deals with what you called 'potential information', and (b) the message sender's/receiver's.

2. True

3. Very true. Using psychological terms like 'surprise' in strict theories is very often misleading but tempting, because any reader can always contribute something (like their own interpretation) to such theories, and so becomes more personally/emotionally bound to them.


Shannon's original paper is very readable.


Compression is a good entry to Shannon entropy. What was also eye-opening for me was that the metric was motivated by fulfilling a specification which matches an intuitive notion of information [1], much like how the Kolmogorov axioms were characterized to capture the intuitive notion of probability, or how the Church-Turing thesis defines computation.

[1] https://en.wikipedia.org/wiki/Entropy_(information_theory)#C...
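For concreteness, here is one of those defining properties, additivity for independent sources, checked numerically (a sketch in Python; the distributions are chosen arbitrarily):

```python
import math
from itertools import product

def entropy(dist):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Additivity, one of the axioms characterizing H: for independent sources,
# the entropy of the joint distribution is the sum of the entropies.
x = [0.5, 0.25, 0.25]
y = [0.9, 0.1]
joint = [px * py for px, py in product(x, y)]

print(round(entropy(x) + entropy(y), 6))  # 1.968996
print(round(entropy(joint), 6))           # 1.968996
```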


If someone is interested in this, I highly recommend J R Pierce's Symbols, Signals, and Noise. The book puts Shannon's work in perspective and gives extremely useful context so that one can better appreciate its value. One can also clearly understand, for example, why the original title of Shannon's seminal work was "The Mathematical Theory of Communication" instead of Information.


  To store the result of a coin toss requires 1 bit of information: I can either give you a 0 or a 1. But implicit in me communicating that to you is that it is 1 'out of' something, namely 1 'out of' 2.
  To store a trit, i.e. a 0, 1 or 2, requires 2 bits. You will receive either a 00, 01, or 10. You will never receive a 11, because that is not a valid trit.
  Huffman compression reduces the amount of bits needed to send a block of data, by prematurely terminating the data communicated once a specific sequence has passed. So the Huffman compression for a trit value of 0 would not be '00', it would be '0'. We pass fewer bits, because some sequences carry an implication that this is the end of the sequence. The receiver has to know which sequences indicate a termination.
  A trit translates to 2 bits, with a wastage of 2 - log2(3) ≈ 0.42 bits. The concept of 'compression' only makes sense when person A throws a sequence of bits at you, and when they stop, you have to interpret what they just said in order to work out what it meant. In truth, the extra bits may not have been sent, but they are implied.
  When I write '1' on a piece of paper and slide it across the table to you, I am cheating; I am being ambiguous. 1 out of what? should be the correct response. A bit is more accurately communicated as 1 out of 2. A 'double' would be 1 out of a million-ish. You can only store numbers in a binary computer 'out of' something. A 1 stored in a bit field is a different quantity than a 1 out of a million-ish. In lazy human parlance a '1' represents 1 out of infinity, but this is not technologically possible to store. No computer hard drive can store enough digits to contain the information that 1 out of infinity represents.

  If I toss a coin and tell you it is a 1 out of 2, this is valid. But if I have cheated, and tossed a double-headed coin, then from my perspective the '1' aka 'heads' is 100% predictable, and therefore, for me, does not contain any information. For you, ignorant of my cheating coin, the data I communicate to you is 1 out of 2. What is the entropy of that information? Is it 1 bit, as you would believe, or 0 bits, as I would believe? The answer is that ALL MEASUREMENTS ARE NOT PROPERTIES OF OBJECTS, BUT OF RELATIONSHIPS BETWEEN A MEASURING SYSTEM AND AN OBJECT. The entropy of that coin toss does not live inside the data communicated; it lives inside my head (where the entropy is 0 bits) and also in your head (where the answer is 1 bit). The coin toss measurement of '1' does not contain either 0 or 1 bits, because a piece of data in isolation does not have any meaning. Only in my head or yours can its meaning be measured.
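The trit example can be made concrete. The sketch below (mine, not the commenter's) builds the prefix-free code described, where '0' terminates early, and compares its average length against the entropy bound log2(3):

```python
import math

# Prefix-free code for a uniform trit, as described above: '0' ends the
# symbol immediately; '10' and '11' spend the full two bits.
code = {0: "0", 1: "10", 2: "11"}

# Prefix-free check: no codeword is a prefix of another, so the receiver
# always knows where each symbol terminates.
words = list(code.values())
assert not any(a != b and b.startswith(a) for a in words for b in words)

avg_len = sum(len(w) for w in words) / 3   # (1 + 2 + 2) / 3 for uniform trits
bound = math.log2(3)                       # Shannon's lower bound per trit

print(round(avg_len, 3))   # 1.667 bits per trit, better than the naive 2
print(round(bound, 3))     # 1.585 bits: no code can average less than this
```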


Is there a distinction between the entropy of a message and the impact of a message on the entropy of the system? Where inputs tend to increase system entropy?


They are related by the Boltzmann equation. The intuitive interpretation is this: entropy is the amount of information required to determine a particular microstate from the associated macrostate. (This is how you solve Maxwell's demon)

(Note that I don't really know what I'm talking about, I'm just parroting what I remember from my information theory course)


I thought it was related to the number of micro states that are equivalent to a given macro state. So, as the number of particles increase, the temperature increases or the volume increases, there are more possible micro states. That’s why very cold things or very ordered things (crystals) have low entropy, because you can’t readily swap out microstates.

It’s also why a pile of Legos is high entropy; you could easily swap Legos around and the pile would be indistinguishable. Whereas you can’t do that with a built Lego set.


It's correct, but that point of view is not in contradiction with what I said.

You may imagine a macrostate as a state representing incomplete information. In order to uniquely determine the microstate, you need additional information proportional to the logarithm of the number of possible microstates.
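A toy version of that counting argument, assuming a hypothetical "k heads out of n coins" macrostate (my example, not the parent's):

```python
import math
from math import comb

# Macrostate: "k heads out of n coins". The microstates compatible with it
# number C(n, k); the information missing from the macrostate is log2 of that.
n = 100
for k in (0, 50):
    microstates = comb(n, k)
    # k=0 (perfectly ordered) pins the microstate down exactly: 0 bits missing.
    # k=50 leaves about 96 bits of microstate unknown.
    print(k, round(math.log2(microstates), 1))
```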


The disorder analogy can be misleading in some cases: http://entropysite.oxy.edu/cracked_crutch.html


Yes, that’s why I don’t frame it as disorder but rather explain order in terms of the microstate/macrostate relation. While a built Lego set and a crystal are well ordered and have low entropy, order and entropy aren’t the same thing.


> very ordered things (crystals) have low entropy


Shannon's entropy is also used for calculating one of the measure of alpha-diversity in ecological and microbiome studies.
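For anyone curious: the Shannon diversity index is just the entropy of the relative species abundances, conventionally in nats. A sketch with made-up counts:

```python
import math

def shannon_index(counts):
    """Shannon diversity H' of species counts, in nats (natural log)."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

even   = [25, 25, 25, 25]  # four species, evenly abundant (hypothetical counts)
skewed = [97, 1, 1, 1]     # one species dominates the sample

print(round(shannon_index(even), 3))    # ln(4) ≈ 1.386, maximal for 4 species
print(round(shannon_index(skewed), 3))  # much lower: the sample is less diverse
```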


The thing about Shannon entropy is that it depends upon alphabets and symbol systems. I want to understand how it might be used to describe presymbolic computational systems.


What is a presymbolic computational system? Shannon entropy deals with alphabets in much the same capacity as Turing machines read symbols on a tape. Fundamentally it is about differentiating between states. I don't see how you can get a meaningful definition of information before starting with the notion of discrete states. Even fuzzy logic needs a semantic of states.


Presymbolic computation is the motivation for the first restricted Boltzmann machine, Paul Smolensky’s “Harmonium”. It’s a great paper.

https://apps.dtic.mil/sti/citations/ADA620727


Presymbolic computation appears to me to be an invented term. Any theoretical or actual system can be framed in computational terms when analysed, but the properties provided through the use of symbols will still exist in a system, whether or not that analysis has been performed. The paper you cite appears to me to lean in the direction of cybernetics and control theory, that would naturally be able to translate into terms aligned to information theory. The same rules will apply to any physical system, no matter how complicated you believe it to be.


Not sure what you mean by “invented term.”

in any case, there seems to be a difference in describing a system with symbols and computational systems that use symbols for information processing. Some information processing seems possible in systems that don’t use symbols.


Any computational system uses symbols, whether or not a person has analysed the system and defined those symbols; information is symbols.


You are suggesting that information did not exist before symbols?


Where and when information exists so do symbols, in the abstract sense. It’s not symbols as representation of meaning but symbols as mediating meaning.


That’s a strong claim that is not common. Humans are generally considered to be the only animals capable of symbolic thought [1]. Information flows in far simpler systems than symbolic systems.

[1] https://www.nytimes.com/2014/12/07/magazine/hunting-for-the-...


No, the interesting thing about Shannon entropy is that it is totally independent of how information is represented (as symbols, alphabets, numbers, structured objects, whatever).


It is based on the symbol set exchanged, so how is it independent?


Information is a function only of the probabilities, not of the symbols.
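That claim is easy to demonstrate: relabel the symbols however you like and the entropy is unchanged (a quick sketch; the symbol sets are invented):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, from probabilities alone."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The same probabilities attached to two entirely different symbol sets:
letters = {"a": 0.5, "b": 0.25, "c": 0.25}
weather = {"sun": 0.5, "rain": 0.25, "snow": 0.25}

# Entropy depends only on the probabilities, not on what they label.
print(entropy(letters.values()))  # 1.5
print(entropy(weather.values()))  # 1.5
```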


Shannon entropy works for anything you can assign an alphabet of symbols to. It doesn't need the thing itself to be symbolic, just the description of the thing.


That seems like a strong claim. Any proof?


Sure. https://people.math.harvard.edu/~ctm/home/text/others/shanno...

Particularly part 4, The Continuous Channel. No discrete symbols there!


Maybe I’m not reading it properly, but in part 5, where he deals with transmission of continuous signals he doesn’t use his entropy formulation. He uses a distance metric, showing the difference between the sent signal and the received.



