metanohi/site/projects/nalgh.org

#+title: nalgh
#&summary
A showcase of a poor (but simple) data compression method
#&
#+license: bysa, page
#+license: wtf, program

* nalgh, the {n alg}&strong{orit}{h}&strong{m} (nalgh)

nalgh is an algorithm for lossless compression. It generally compresses a text
to 77%-87% of its original size. The focus of nalgh is on simplicity, not
speed or efficiency. See an example of nalgh in action [[http://projects.metanohi.name/nalgh/example.php][here]] (using part of
a [[http://www.gutenberg.org/etext/22600][random text]] I found on the [[http://www.gutenberg.org/][Project Gutenberg website]]).

** Pros
+ It's simple

** Cons

+ It's inefficient
+ It's slow
+ It delivers very poor compression


** How

1. nalgh loops through the input text and makes a list of the characters that
   are present.
2. The list is sorted so that the character with the most appearances comes
   first.
3. nalgh loops through the input text again and builds a new text in which
   each character is replaced by the binary form of its position in the list,
   counting from 0. If "a" is the most frequent character in a text, "e" the
   second-most and "r" the third-most, then a becomes 0, e becomes 1, and r
   becomes 10. In the new text the binary numbers are kept as a string of
   digits, separated from each other by a "2". This means that "are", based
   on the facts just stated, turns into "021021".
4. nalgh checks whether the length of the new text string is evenly divisible
   by 5. If it is not, just enough "2"s are appended to make it so. The length
   of "021021" is 6, which divided by 5 leaves a remainder of 1, so four "2"s
   are appended. The final string is in this case "0210212222".
5. nalgh loops through the new text string and converts every group of 5
   characters from trinary (base 3) to the byte with that value. Looking at
   the above string, "02102" is converted into the character with the value
   65 ("A") and "12222" into the one with the value 161 ("¡"). 5 trinary
   digits have 243 possible values, where a byte has 256, so every group fits
   into a single byte.
6. nalgh is almost done now. To be able to easily decompress the text later on,
   it is necessary to store which number stands for which character of the
   input text. To do so, nalgh creates a text string in which the character
   with the most appearances comes first, and so on. Continuing with the
   letters from the previous examples, this string is "aer", because there
   are more "a"s than "e"s and more "e"s than "r"s.
7. A character whose value is the number of different input characters is
   joined with the string listing the characters, which is then joined with
   the string of converted trinary numbers. Using the "aer" example, the
   *final* string is "[character with the value 3]aerA¡".
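
The seven steps can be sketched in a few lines of Python. This is only an
illustration of the description above, not a port of the PHP source. The
function names (=rank_characters=, =encode_with_alphabet=, =nalgh_compress=)
are made up for this sketch, and it assumes that ties between equally frequent
characters are broken by first appearance, that there are at most 255
different characters, and that the result is kept as a string of characters
with values below 256 rather than as raw bytes.

```python
from collections import Counter


def rank_characters(text):
    """Steps 1-2: the distinct characters, most frequent first.

    Assumption: equally frequent characters keep first-appearance order.
    """
    return "".join(ch for ch, _ in Counter(text).most_common())


def encode_with_alphabet(text, alphabet):
    """Steps 3-7 for an already ranked alphabet (hypothetical helper)."""
    if len(alphabet) > 255:
        raise ValueError("the one-character count header only holds 0-255")
    rank = {ch: i for i, ch in enumerate(alphabet)}
    # Step 3: binary rank of each character, joined by the separator "2".
    digits = "2".join(format(rank[ch], "b") for ch in text)
    # Step 4: pad with "2"s until the length is a multiple of 5.
    digits += "2" * (-len(digits) % 5)
    # Step 5: read every group of 5 digits as one base-3 number (0..242).
    body = "".join(
        chr(int(digits[i:i + 5], 3)) for i in range(0, len(digits), 5)
    )
    # Steps 6-7: count character + ranked alphabet + converted body.
    return chr(len(alphabet)) + alphabet + body


def nalgh_compress(text):
    """All seven steps: rank the characters, then encode the text."""
    return encode_with_alphabet(text, rank_characters(text))
```

Calling =encode_with_alphabet("are", "aer")= reproduces the example from the
list: the character with the value 3, then "aer", then "A¡". The ranking is
kept in a separate function so that the "aer" order of the example can be
passed in directly, since the word "are" on its own does not contain more
"a"s than "e"s.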

Decompressing is simply doing the same steps backwards. It is also a lot less
stressful to the computer than compressing.
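
A rough Python sketch of the backwards direction, under the same assumptions
as the description above: the first character holds the number of different
characters, the character list follows, and the rest is the trinary data. The
names =byte_to_trits= and =nalgh_decompress= are made up for this sketch and
do not come from the PHP source.

```python
def byte_to_trits(value):
    """Turn a value from 0 to 242 back into its 5 base-3 digits."""
    digits = ""
    for _ in range(5):
        value, digit = divmod(value, 3)
        digits = str(digit) + digits
    return digits


def nalgh_decompress(data):
    """Undo the compression: header, character list, then the body."""
    if not data:
        return ""
    count = ord(data[0])               # number of different characters
    alphabet = data[1:1 + count]       # most frequent character first
    body = data[1 + count:]            # one character per 5 trinary digits
    digits = "".join(byte_to_trits(ord(ch)) for ch in body)
    # Split at the "2" separators. The padding "2"s at the end only
    # produce empty pieces, which are skipped.
    pieces = (piece for piece in digits.split("2") if piece)
    # Each piece is a binary number: the position in the character list.
    return "".join(alphabet[int(piece, 2)] for piece in pieces)
```

Given the final string from the "aer" example (the character with the value
3, followed by "aerA¡"), =nalgh_decompress= returns "are". No counting or
sorting is needed in this direction, only lookups.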

For now nalgh only operates on text, but it shouldn't be too difficult to make
it work with arbitrary binary data.

The nalgh source code is available [[http://projects.metanohi.name/nalgh/nalgh.php][here]]. For now there's only a PHP version
(and an uncommented one, at that), and I doubt that there will ever be a
version in another language.

I sincerely hope that nalgh will never be used. I have not yet seen a data
compression method capable of performing worse than nalgh. At the same time, I
do like the way nalgh works, keeping everything simple.