In the beginning - frequency analysis (Chapter 1 of 5)
Introduction
We have all heard about security and encryption, along with HTTPS, TLS, MFA and a ton of other acronyms. But how did we end up here, with all these acronyms? We'll take a trip back in time, from the codes used in ancient times to PGP.
We'll dive into the history of encryption, from simple substitution and Caesar to Vigenère, and then on to Enigma and RSA.
This first chapter focuses on simple cryptography: what it is, what the problem with it is, and a hands-on cracking of a ciphertext.
Let's begin
The cipher that we are going to tackle in this whitepaper is called a mono-alphabetic substitution cipher, or mono-substitution for short. That means that a letter from the original text is always replaced by the same substitution letter in the ciphertext. For example, if our substitution were
C → F
O → T
D → B
E → A
then the word “code” would become FTBA every single time. But how do we make sense of that garble?
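As a minimal sketch of the mapping above (in Python; the names KEY and encrypt are ours, for illustration only), a mono-substitution is just a fixed lookup table applied letter by letter:

```python
# Illustrative only: the four-letter substitution table from the example above.
KEY = {"c": "F", "o": "T", "d": "B", "e": "A"}


def encrypt(plaintext: str, key: dict) -> str:
    """Replace each letter using the table; anything not in the table is kept as-is."""
    return "".join(key.get(ch, ch) for ch in plaintext.lower())


print(encrypt("code", KEY))    # FTBA
print(encrypt("decode", KEY))  # BAFTBA
```

Because the table never changes, every "e" in the plaintext becomes "A", and that regularity is exactly what frequency analysis exploits.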
Even more, what could the following text mean?
Na vgrengvir qrirybczrag cebprff cebivqrf nzcyr bccbeghavgl gb ghar gur erdhverzragf naq fgrre gur cebwrpg bhgpbzr. Vg nyfb rafherf gung gur uvturfg inyhr srngherf ner ninvynoyr sbe nal tvira ohqtrg.
Nhgbzngrq havg grfgvat, pbagvahbhf vagrtengvba, fbhepr pbqr nf gur cevznel negvsnpg, pbairagvba bire pbasvthengvba, "whfg rabhtu" qbphzragngvba, naq serdhrag, vgrengvir qryvirel ner gur pbearefgbarf bs bhe zrgubq.
It would seem impossible to decrypt this. But as you’ll see, the longer the ciphertext, the easier it is to crack.
The technique we're going to explore today is called "frequency analysis".
The science of cryptography and attacking codes has its roots in mathematics and statistics. For example, "e" is the most common letter in the English language, "th" is the most common bigram, and "the" is the most common trigram.
With this knowledge in hand, we can begin looking at the ciphertext. If we are lucky and the text follows the statistics at least a bit, we can make some progress in deciphering it.
We're going to employ only the knowledge above (about "e" and "th") and one feature from the website https://www.cryptogram.org/resource-area/solve-a-cipher/.
There is much more to statistics and language study, but let's see how much headway we can make with these two pieces of information before pulling another page from the statistics book.
Let's paste this text into the website's text area and request the letter count by clicking the “Display/hide letter count” button.

We're going to attempt to decipher this with the assumption that “e” is the most frequent letter in the text.
The statistics show that the most frequent letter in the ciphertext is R, so let's start by supposing it stands for “e”.
The next thing we notice is 5 occurrences of the cipher word "GUR". If “R” stands for “e”, there's a high chance GUR represents the word “the”.
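Both counts can also be reproduced without the website. The sketch below is a stand-in for the site's letter-count feature, not its actual implementation; letter_counts and word_counts are hypothetical helper names, and CIPHERTEXT is the text from above pasted in as a string.

```python
import re
from collections import Counter

# The ciphertext from this chapter, copied verbatim.
CIPHERTEXT = (
    "Na vgrengvir qrirybczrag cebprff cebivqrf nzcyr bccbeghavgl gb ghar gur "
    "erdhverzragf naq fgrre gur cebwrpg bhgpbzr. Vg nyfb rafherf gung gur "
    "uvturfg inyhr srngherf ner ninvynoyr sbe nal tvira ohqtrg. "
    "Nhgbzngrq havg grfgvat, pbagvahbhf vagrtengvba, fbhepr pbqr nf gur "
    'cevznel negvsnpg, pbairagvba bire pbasvthengvba, "whfg rabhtu" '
    "qbphzragngvba, naq serdhrag, vgrengvir qryvirel ner gur pbearefgbarf "
    "bs bhe zrgubq."
)


def letter_counts(text: str) -> Counter:
    """Count letters only, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.upper() if ch.isalpha())


def word_counts(text: str) -> Counter:
    """Count whole words, ignoring case and punctuation."""
    return Counter(re.findall(r"[A-Z]+", text.upper()))


top_letter, _ = letter_counts(CIPHERTEXT).most_common(1)[0]
print(top_letter)                      # R
print(word_counts(CIPHERTEXT)["GUR"])  # 5
```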
Let’s enter the information we have up to now and see if we can make any more headway before giving up.
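Entering the guesses amounts to applying a partial key and leaving everything else blank. Here is a rough sketch, assuming unknown letters are shown as "-" (the helper name decode and the KEY table are ours, not the website's):

```python
# Guesses so far: cipher letter -> plaintext letter.
KEY = {"R": "e", "G": "t", "U": "h"}


def decode(ciphertext: str, key: dict) -> str:
    """Apply a partial key; letters we have not guessed yet are shown as '-'."""
    out = []
    for ch in ciphertext.upper():
        if ch.isalpha():
            out.append(key.get(ch, "-"))
        else:
            out.append(ch)  # keep spaces and punctuation untouched
    return "".join(out)


print(decode("gung gur uvturfg", KEY))  # th-t the h--he-t
```

Half-filled words like these are where the next guesses come from.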

The word that holds the next hint is “GUNG”. Based on our deciphering so far, the plain text for “GUNG” is “th-t”. That would imply “N” would be decrypted as “a”. It also gives us some confidence we might be on the right path.
With this new information, we go back to the text and see that “NER” would translate to “a-e”. If our luck and intuition hold, “E” would be “r” and “NER” would translate to “are”. We already know “a-e” cannot be “ate”: because this is a mono-substitution cipher (and we'll see later why this plays a crucial role), “t” is already taken by “G” in the ciphertext. So the word is not “ate”, but it could be “are”.
The next piece of information comes from the bigram “GB”. Again, we know “G” is “t”, so there's a high chance “GB” is “to”.
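The reasoning that rules out "ate" can be sketched as a filter over candidate words. The word list below is a made-up handful of examples, not a real dictionary, and fits is a hypothetical helper. It only checks that a gap is not filled with a plaintext letter that is already assigned; picking "are" among the survivors still takes luck and intuition.

```python
# Guesses so far: cipher letter -> plaintext letter.
KEY = {"R": "e", "G": "t", "U": "h", "N": "a"}

# A tiny, hand-picked candidate list (an assumption for illustration).
WORDS = ["are", "ate", "ace", "age", "ape", "the"]


def fits(partial: str, word: str, key: dict) -> bool:
    """True if word matches the partial decode and fills gaps only with unused letters."""
    if len(partial) != len(word):
        return False
    used = set(key.values())  # plaintext letters that already have a cipher letter
    for p, ch in zip(partial, word):
        if p == "-":
            if ch in used:  # e.g. "t" already belongs to "G"
                return False
        elif p != ch:
            return False
    return True


print([w for w in WORDS if fits("a-e", w, KEY)])  # ['are', 'ace', 'age', 'ape']
```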
We enter these letters and the situation looks like:

There is no low-hanging fruit left on the screen now. However, “BIRE” becomes “o-er”, which could be “over” (pun not intended :-) ).
There are bigger fish to fry, though, when it comes to NA, NF, NAL, NAQ.
We could try our logic and luck here: if (and this is a big if) NA were “an”, then NAL might be “and”. Or it could be “any”. And NF could be “as”. (Again, we already know it cannot be “at”, because “t” is “G”.)
We enter A as “n”. With that, QBPHZRAGNGVBA reads “-o---entat-on”; we presume its ending “---entat-on” is “---entation”, so we enter V as “i”.

That's the last major missing piece. At this point, we already know we are on the right path.
For programmers, PBAIRAGVBA BIRE as "-onvention over" clearly becomes "convention over", as in "convention over configuration".
NINVYNOYR as "avai-a--e" becomes "available".
PBEAREFGBARF as "corner-tone-" becomes "cornerstones".
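Each of these guesses hands us new key letters: lining the cipher word up with the guessed plaintext gives one pair per position. Here is a minimal sketch, assuming a hypothetical helper extend_key that rejects any guess contradicting the key built so far:

```python
# Key built up during the walkthrough: cipher letter -> plaintext letter.
KEY = {"R": "e", "G": "t", "U": "h", "N": "a", "E": "r",
       "B": "o", "A": "n", "V": "i", "I": "v"}


def extend_key(key: dict, cipher_word: str, guess: str) -> dict:
    """Return a copy of key extended with the pairs implied by cipher_word == guess."""
    if len(cipher_word) != len(guess):
        raise ValueError("guess has the wrong length")
    new_key = dict(key)
    for c, p in zip(cipher_word.upper(), guess.lower()):
        if c in new_key:
            if new_key[c] != p:
                raise ValueError(f"{c} already stands for {new_key[c]}, not {p}")
        elif p in new_key.values():
            raise ValueError(f"{p} is already produced by another cipher letter")
        else:
            new_key[c] = p
    return new_key


key = extend_key(KEY, "PBAIRAGVBA", "convention")
key = extend_key(key, "NINVYNOYR", "available")
key = extend_key(key, "PBEAREFGBARF", "cornerstones")
print(key["P"], key["Y"], key["O"], key["F"])  # c l b s
```

A wrong guess, such as "ate" for NER, raises an error instead of silently corrupting the key.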
And it's pretty much over now.

The secret text is:
An iterative development process provides ample opportunity to tune the requirements and steer the project outcome. It also ensures that the highest value features are available for any given budget. Automated unit testing, continuous integration, source code as the primary artifact, convention over configuration, "just enough" documentation, and frequent, iterative delivery are the cornerstones of our method.
That's true. That's the Jonah Group methodology.
Conclusion
Simple mono-alphabetic substitution can often be attacked using just basic knowledge of the language's constructs and statistics. The statistics do not always hold, but when they do (and, being statistics, that is more often than not), the secret is easily revealed, and we need a better way to hide it.
What are the ways we could employ to defeat this attack?
What do you think are the elements that helped us the most when defeating a bunch of garbled text?
If you want to try your hand at attacking a ciphertext, here's another one:
Pbzcyrgvat n shyy cebwrpg plpyr vf na rkpvgvat gvzr sbe rirelbar — lbh unir na rkpvgvat arj ncc be srngher naq jr unir n pyrnere cvpgher bs lbhe nzovgvbaf naq bcrengvat cersreraprf. Bhe eryngvbafuvc pbagvahbhfyl zveebef gur vapernfvat inyhr bs gur cebqhpg, naq jr ner orggre noyr gb freir lbhe arrqf naq jbex gb gur fcrpvsvpngvbaf gung znxr lbh zbfg pbzsbegnoyr.
Stay tuned for:
- Chapter 2 of 5: The return of the Caesar: Vigenère
- Chapter 3 of 5: The rise of the machines - Enigma
- Chapter 4 of 5: A new hope - PGP, RSA
- Chapter 5 of 5: What lies ahead - Signal, quantum