Are there any good posts about the use of entropy for tasks like this? I've been wondering for quite some time how people actually use it and whether it's at all effective, but I never got around to investigating the problem myself.
First of all, how to define "entropy" for text is a bit unclear in the first place. Here it's as simple as `-Sum(x log(x))` where `x = countOccurrences(char) / len(text)`. And that raises a lot of questions about how well this actually works. How long does a string need to be for this to work? Is there a ≈constant entropy for natural languages? Is there a better approach? I mean, it seems there must be: "obviously" "vorpal" must have lower "entropy" than "hJ6&:a". You and I both "know" that because 1) the latter "seems" to draw from a much larger character set than natural language; 2) even if it didn't, the ordering of characters matters: the former just "sounds" like a real word, despite being made up by Carroll. Yet this "entropy" everybody seems to use knows nothing about any of that; both strings get exactly the same "entropy". So, OK, maybe this works well enough for yet-another-github-password-searcher. But is there anything better? Is there a more meaningful metric of randomness for text?
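For concreteness, here is a minimal sketch of that per-character entropy (assuming base-2 logs, which is the usual choice), showing that it really can't tell the two example strings apart:

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy over character frequencies, in bits per character."""
    n = len(text)
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Both strings are six distinct characters used once each, so this metric
# can't distinguish them: each comes out to log2(6) ~= 2.58 bits/char.
print(char_entropy("vorpal"))   # ~2.58
print(char_entropy("hJ6&:a"))   # ~2.58
```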
Dozens of projects like this, everybody using "entropy" as if it's something obvious, but I've never seen proper research on the subject.
Entropy is a measure of the complexity or disorder of a signal. The interesting part is that the disorder is with respect to the proper basis or dictionary: something can look complex in one encoding but have low entropy in the right one. You need to know the right basis, or figure it out from the context, to accurately determine the entropy of a signal. A much stronger way of building a tool like the OP's is to have a few pre-computed dictionaries for a range of typical source texts (source code, natural language), then encode the string against each dictionary and compare the compressibility. A high-entropy string like a secret will compress poorly against all available dictionaries.
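Roughly, and only as a sketch (the dictionaries below are made up and far too small to be useful in practice), this can be done with zlib's preset-dictionary support:

```python
import zlib

# Toy dictionaries; real ones would be built from representative corpora
# (source code, English prose, config files, ...).
DICTIONARIES = {
    "source_code": b"def class import return self for while if else try except",
    "english":     b"the and of to in that is was he for it with as his on be at",
}

def best_ratio(candidate: str) -> float:
    """Compress the candidate against each preset dictionary and return the
    best compressed-size / original-size ratio (smaller = more structured)."""
    data = candidate.encode()
    best = float("inf")
    for zdict in DICTIONARIES.values():
        comp = zlib.compressobj(level=9, zdict=zdict)
        size = len(comp.compress(data) + comp.flush())
        best = min(best, size / len(data))
    return best

# Header overhead dominates for short inputs, so only the relative ordering
# matters here: the random-looking secret should score worse everywhere.
print(best_ratio("import os and return self for the class"))
print(best_ratio("hJ6&:aX9pQ2$Lm8#Rk3@Wz7!"))
```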
I only briefly browsed the code, but this seems to be roughly what yelp/detect-secrets does.
Anyway, that doesn't really answer my question. To summarize the answers in this thread, I think PhilipRoman has captured the essence of it: strictly speaking, the idea of the entropy of a known string is nonsense. So, as I suspected, the information-theoretic definition isn't meaningfully applicable to the problem. And as other commenters like you mentioned, what we are really trying to measure is basically Kolmogorov complexity, which, strictly speaking, is incomputable, but measuring the compression rate under some well-known compression algorithm (allegedly) seems to be a good enough estimate, empirically.
But I think it's still an interesting linguistic question. Meaningful or not, it's well defined: so does it appear to work? Are there known constants for different kinds of text for any of these (or other) metrics? I would suspect this has been explored already, but neither I nor anybody else in this thread has apparently ever stumbled upon such an article.
Bookmarking to think about later... does this hold for representing numbers in one base compared to another?
Regarding a prime as having higher entropy / less structure than, say, a perfect square or a highly divisible number:
a prime is a prime in any base, but divisibility patterns look different for non-primes; if the number is divisible by the base then it may appear to have more structure (a smaller function is needed to derive it, Kolmogorov style).
Does prime factorization have anything to do with this? I can almost imagine choosing a large non-prime whose divisibility is only obvious in a particular base, such that the base becomes the secret key. The base of a number is basically specifying your dictionary, no?
I don't think there's any interesting difference with different bases. Usually the base you represent stuff in is a relatively small number (because using very large bases is already wildly inefficient). I think it only usually makes sense to consider constant or logarithmic bases. If your base is scaling linearly with your number then things are going to be weird.
The problem of finding factors is only complex when you're asking about relatively big factors. If you're looking for constant or log sized factors you can just do trial division and find them.
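For illustration, a tiny trial-division sketch that only looks for factors up to some small limit:

```python
def small_factors(n: int, limit: int) -> list[int]:
    """Trial division: collect the prime factors of n that are <= limit.
    Cheap as long as you only care about constant- or log-sized factors."""
    factors = []
    d = 2
    while d <= limit and n > 1:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    return factors

print(small_factors(2**61 - 1, 1000))       # [] -- a Mersenne prime, no small factors
print(small_factors(12 * 1_000_003, 1000))  # [2, 2, 3] -- the large factor stays hidden
```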
Changing the base of number representation with a random basis feels like XORing a string with a random string, which is to say you're adding entropy equal to the random string. My thinking is that for any number representation M, you can get any other number representation N given a well-chosen base. So when presented with the encoded N, the original number could be any other number with the same number of digits. But once you put reasonable bounds on the base, you lose that flexibility and end up adding negligible entropy.
Entropy of a particular string isn't a rigorous mathematical idea: since the string is known, by definition it can only take one value, so its "entropy" is zero bits. The reason we can distinguish non-random data from random is that only a small subset of all possible states is considered useful to humans, and since we have an idea of what that subset looks like, we can try to estimate what process was used to generate a particular string.
There are of course statistical tests like https://en.wikipedia.org/wiki/Diehard_tests, which are good enough for distinguishing low-entropy from high-entropy data, but current pseudo-random number generators have no problem passing all of them, even though their actual "entropy" is just the seed plus approximately the complexity of the algorithm.
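As a quick illustration (nothing as thorough as Diehard, just a compression check, and assuming Python 3.9+ for `random.randbytes`): a stream whose only real entropy is a small seed still looks random to a general-purpose compressor:

```python
import os
import random
import zlib

# All the "entropy" here is a small seed plus the generator algorithm, yet
# to the compressor the output is as incompressible as genuinely random bytes.
random.seed(1234)
pseudo = random.randbytes(4096)   # deterministic given the seed
truly  = os.urandom(4096)

print(len(zlib.compress(pseudo, 9)))  # ~4100 bytes: no structure found
print(len(zlib.compress(truly, 9)))   # ~4100 bytes: indistinguishable here
```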
If you’re looking for a rigorous mathematical idea, what people are trying to measure is the Kolmogorov complexity of the code. Measuring the compressed length is a rough estimate of that value.
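Something like this, as a rough sketch: the compressed length is an upper bound on the Kolmogorov complexity, up to the fixed size of the decompressor:

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed form: a crude upper bound on the
    Kolmogorov complexity, up to the (fixed) size of the decompressor."""
    return len(zlib.compress(data, 9))

english = b"the quick brown fox jumps over the lazy dog " * 100
noise   = os.urandom(len(english))

print(compressed_size(english), "of", len(english))  # tiny compared to the input
print(compressed_size(noise),   "of", len(noise))    # about the input size
```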
Yes, although (and here my understanding of Kolmogorov complexity ends) it still depends heavily on the choice of language, and it seems to me like "aaaaaaaaa" is only less complex than "pSE+4z*K58" because we assume a sane, human-centric language, which is very different from the "average" of all possible languages. Which then leads me to wonder how to construct an adversarial Turing-complete language that has unintuitive Kolmogorov complexities.
There’s no requirement that the K-complexity is measured in a human-centric language. Arguably all compression formats are languages too, which can be executed to produce the decompressed result. They are not designed to be human-centric at all, and yet they do a surprisingly decent job of providing an estimate (well, an upper bound) on Kolmogorov complexity, as we can see in this program.
Kolmogorov complexity conventionally refers to the Turing machine as the base for implementation. This indeed makes repeated letters significantly less complex than that other string. (If you want intuition for how much code is needed to do something on a Turing machine, learn and play around a bit with Brainfuck. It's actually quite nice for that.)