#+title: nalgh
#&summary
Showoff of a poor (but simple) data compression method
#&
#+license: bysa, page
#+license: wtf, program
* nalgh, the {n alg}&strong{orit}{h}&strong{m} (nalgh)
nalgh is an algorithm for lossless compression. It generally compresses a text
to 77%-87% of its original size. The focus of nalgh is on simplicity, not on
speed or efficiency. See an example of nalgh in action [[http://projects.metanohi.name/nalgh/example.php][here]] (using part of
a [[http://www.gutenberg.org/etext/22600][random text]] I found on the [[http://www.gutenberg.org/][Project Gutenberg website]]).
** Pros
+ It's simple
** Cons
+ It's inefficient
+ It's slow
+ It delivers very poor compression
** How
1. nalgh loops through input text and makes a list of the characters that are
present.
2. The list is organised so that the character with most appearances comes
first.
3. nalgh loops through input text again and creates a new text in which each
character is replaced by the binary form of its position in the list,
counting from 0. If "a" is the most frequent character in a text, "e" the
second most frequent and "r" the third, then a becomes 0, e becomes 1, and r
becomes 10. In the newly created text, the binary numbers are kept as a
string, separated from each other by a "2". This means that "are", based on
the facts just stated, will turn into "021021".
4. nalgh checks if the length of the new text string is divisible by 5. If
that is not the case, "2"s are appended to the text until it is. The length
of "021021" is 6, which divided by 5 leaves a remainder of 1, so it's not
divisible. The final string would in this case be "0210212222".
5. nalgh loops through the new text string and converts every group of 5
characters from trinary (base 3) to the byte with that value. Looking at the
above string, "02102" is converted into the character with the value 65
("A") and "12222" is converted into 161 ("¡"). 5 trinary digits have 243
possible values, whereas a byte has 256, so every group fits in one byte.
6. nalgh is almost done now. To be able to easily decompress the text later on,
it is necessary to specify the numbered equivalents of the characters in the
input text. To do so, nalgh creates a text string where the character with
the most appearances comes first, and so on. Continuing with the letters from
the previous examples, this string would be "aer", because there are more
"a"s than "e"s and more "e"s than "r"s.
7. A character with the ASCII value of the number of different input characters
is joined with the text describing the appearances of the characters, which
is then joined with the text string consisting of converted trinary
numbers. Using the "aer" example, the *final* string would be "[ASCII
character with the value 3]aerA¡".
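
The seven steps above can be sketched in a few lines of Python. This is not the PHP source, only an illustration: the function names (rank_characters, to_trits, pack, compress) are made up for this sketch, ties between equally frequent characters are broken by first appearance (the description does not say how ties are handled), and it assumes every input character fits in a single Latin-1 byte and that there are at most 255 different characters.

```python
from collections import Counter
from typing import Optional


def rank_characters(text: str) -> str:
    """Steps 1-2: the distinct characters, most frequent first.

    Assumption: ties keep first-appearance order.
    """
    return "".join(ch for ch, _ in Counter(text).most_common())


def to_trits(text: str, table: str) -> str:
    """Steps 3-4: binary positions joined by "2", padded with "2"s."""
    trits = "2".join(format(table.index(ch), "b") for ch in text)
    # Pad so that the length becomes a multiple of 5.
    return trits + "2" * (-len(trits) % 5)


def pack(trits: str) -> bytes:
    """Step 5: read every 5 base-3 digits as one byte value (0..242)."""
    return bytes(int(trits[i:i + 5], 3) for i in range(0, len(trits), 5))


def compress(text: str, table: Optional[str] = None) -> bytes:
    """Steps 6-7: count byte, then the character table, then the packed data."""
    if table is None:
        table = rank_characters(text)
    # bytes([n]) only works for n <= 255, hence the limit on distinct characters.
    return bytes([len(table)]) + table.encode("latin-1") + pack(to_trits(text, table))


# The "are" example, with the table "aer" given explicitly as in the text.
print(compress("are", table="aer"))  # b'\x03aerA\xa1'
```

Passing the table explicitly is only there to reproduce the "aer" example; called as compress(text), the sketch builds the table from the text itself.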
Decompression simply performs these steps in reverse. It is also a lot less
stressful for the computer than compression.
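
A sketch of the reverse direction in Python, assuming the layout from step 7 (one count byte, then the character table, then the packed trinary data) and Latin-1 characters. The names byte_to_trits and decompress are invented for this illustration and do not come from the PHP source. Empty pieces between "2"s are skipped, since they can only come from the padding.

```python
def byte_to_trits(value: int) -> str:
    """Turn one byte value (0..242) back into exactly 5 base-3 digits."""
    digits = ""
    for _ in range(5):
        value, digit = divmod(value, 3)
        digits = str(digit) + digits
    return digits


def decompress(data: bytes) -> str:
    """Undo the layout: count byte, character table, packed digits.

    Assumption: data is non-empty and was produced as described in step 7.
    """
    size = data[0]
    table = data[1:1 + size].decode("latin-1")
    trits = "".join(byte_to_trits(b) for b in data[1 + size:])
    # Each non-empty piece between "2"s is a binary position in the table.
    return "".join(table[int(piece, 2)] for piece in trits.split("2") if piece)


print(decompress(b"\x03aerA\xa1"))  # are
```

There is no counting or sorting to do here, only one pass over the bytes and one table lookup per character.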
For now nalgh only operates on text, but it shouldn't be too difficult to make
it work with arbitrary bytes.
The nalgh source code is available [[http://projects.metanohi.name/nalgh/nalgh.php][here]]. For now there's only a PHP version
(and an uncommented one, at that), and I doubt that there will ever be a
version in another language.
I sincerely hope that nalgh will never be used. I have not yet seen a data
compression method capable of performing worse than nalgh. At the same time, I
do like the way nalgh works, keeping everything simple.