nalgh

#&summary Showoff of a poor (but simple) data compression method #&

nalgh, the {n alg}&strong{orit}{h}&strong{m} (nalgh)

nalgh is an algorithm for lossless compression. It generally compresses a text to 77%-87% of its original size. nalgh focuses on simplicity rather than speed or efficiency. See an example of nalgh in action here (using part of a random text I found on Project Gutenberg's website).

Pros

  • It's simple

Cons

  • It's inefficient
  • It's slow
  • It delivers very poor compression

How

  1. nalgh loops through the input text and makes a list of the characters that are present.
  2. The list is sorted so that the character with the most appearances comes first.
  3. nalgh loops through the input text again and creates a new text in which each character is replaced by the binary form of its position in the list, counting from 0. If "a" is the most common character in a text, "e" the second most common and "r" the third, then "a" becomes 0, "e" becomes 1, and "r" becomes 10. In the new text, the binary numbers are kept as a string, separated from each other by a "2". This means that "are" turns into "021021".
  4. nalgh checks whether the length of the new string is divisible by 5. If it is not, "2"s are appended until it is. The length of "021021" is 6, which divided by 5 leaves a remainder of 1, so it is not divisible and four "2"s are appended. The final string in this case is "0210212222".
  5. nalgh loops through the new string and converts every group of 5 characters from ternary to a single byte. In the string above, "02102" is converted into the character with value 65 ("A") and "12222" into the character with value 161 ("¡" in Latin-1). Five ternary digits have 243 possible values, while a byte has 256, so every group fits in one byte.
  6. nalgh is almost done now. To make it easy to decompress the text later on, it is necessary to record which number stands for which character of the input text. To do so, nalgh creates a string in which the character with the most appearances comes first, and so on. Continuing with the letters from the previous examples, this string would be "aer", because there are more "a"s than "e"s and more "e"s than "r"s.
  7. A character whose value is the number of distinct input characters is put first, followed by the string listing the characters in order of appearances, followed by the string of bytes converted from ternary. Using the "aer" example, the final string would be "[character with the value 3]aerA¡".
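
The seven steps above can be sketched in a few lines of Python. This is only an illustration written from the description, not the actual nalgh source (which is PHP). The function names `rank_characters`, `pack` and `compress` are made up for this sketch, and two things the description leaves open are assumed here: ties in the ranking keep first-appearance order, and every character fits in a single Latin-1 byte.

```python
from collections import Counter


def rank_characters(text):
    """Steps 1-2: distinct characters, most frequent first."""
    # Assumption: ties keep first-appearance order; the description
    # does not say how ties are broken.
    return "".join(ch for ch, _ in Counter(text).most_common())


def pack(text, ranked):
    """Steps 3-7 for a given ranking string such as "aer"."""
    if len(ranked) > 255:
        raise ValueError("the character count must fit in one byte")
    index = {ch: i for i, ch in enumerate(ranked)}
    # Step 3: binary rank of each character, joined by "2".
    digits = "2".join(format(index[ch], "b") for ch in text)
    # Step 4: pad with "2" until the length is a multiple of 5.
    digits += "2" * (-len(digits) % 5)
    # Step 5: every 5 ternary digits become one byte (0..242).
    body = bytes(int(digits[i:i + 5], 3) for i in range(0, len(digits), 5))
    # Steps 6-7: count byte, ranking string, packed body.
    # Assumption: every character fits in one Latin-1 byte.
    return bytes([len(ranked)]) + ranked.encode("latin-1") + body


def compress(text):
    """Rank the characters of the text, then pack it."""
    return pack(text, rank_characters(text))
```

With the ranking from the example, `pack("are", "aer")` gives the six bytes 3, "a", "e", "r", 65, 161, which matches the final string in step 7.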

Decompressing simply runs the same steps backwards. It is also a lot less work for the computer than compressing.
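
Going backwards could look roughly like the following Python sketch. Again, this is not the real nalgh code; the name `decompress` and the Latin-1 decoding are assumptions made for the illustration. It reads the count byte, then the list of characters, then turns each remaining byte back into 5 ternary digits and splits the result on "2".

```python
def decompress(data):
    """Reverse the nalgh steps for a bytes object (hypothetical sketch)."""
    count = data[0]                       # number of distinct characters
    # Assumption: each listed character is one Latin-1 byte.
    ranked = data[1:1 + count].decode("latin-1")
    body = data[1 + count:]
    chunks = []
    for byte in body:
        # Write each byte back as exactly 5 ternary digits.
        chunk = ""
        for _ in range(5):
            byte, trit = divmod(byte, 3)
            chunk = str(trit) + chunk
        chunks.append(chunk)
    parts = "".join(chunks).split("2")
    # The padding "2"s leave empty parts behind; skip them.
    return "".join(ranked[int(part, 2)] for part in parts if part)
```

Fed the bytes from the "aer" example (3, "a", "e", "r", 65, 161), this returns "are".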

For now nalgh only operates on text, but it shouldn't be too difficult to make it work with arbitrary bytes.

The nalgh source code is available here. For now there's only a PHP version (and an uncommented one, at that), and I doubt that there will ever be a version in another language.

I sincerely hope that nalgh will never be used. I have not yet seen a data compression method capable of performing worse than nalgh. At the same time, I do like the way nalgh works, keeping everything simple.