#+title: nalgh
#&summary
Showoff of a poor (but simple) data compression method
#&
#+license: bysa, page
#+license: wtf, program

* nalgh, the {n alg}&strong{orit}{h}&strong{m} (nalgh)
nalgh is an algorithm for lossless compression. It generally compresses a text
to 77%-87% of its original size. The focus of nalgh lies on simplicity, not on
speed or efficiency. See an example of nalgh in action [[http://projects.metanohi.name/nalgh/example.php][here]] (using part of
a [[http://www.gutenberg.org/etext/22600][random text]] I found at the [[http://www.gutenberg.org/][Project Gutenberg website]]).
** Pros

+ It's simple

** Cons

+ It's inefficient
+ It's slow
+ It delivers very poor compression

** How
1. nalgh loops through the input text and makes a list of the distinct
   characters that are present.
2. The list is sorted so that the character with the most appearances comes
   first.
3. nalgh loops through the input text again and creates a new text in which
   each character is replaced by the binary form of its position in the list,
   counting from 0. If a text contains most "a"s, second-most "e"s and
   third-most "r"s, then "a" becomes 0, "e" becomes 1, and "r" becomes 10. In
   the newly created text, the binary versions are kept as a string, with a
   "2" between each pair of neighbouring parts. This means that "are", based
   on the facts just stated, will turn into "021021".
4. nalgh checks whether the length of the new text string is divisible by 5.
   If it is not, "2"s are appended until it is. The length of "021021" is 6,
   which divided by 5 leaves a remainder of 1, so it is not divisible by 5 and
   four "2"s are added. The final string would in this case be "0210212222".
5. nalgh loops through the new text string and converts every 5 characters from
   trinary (base 3) to the byte with that value. Looking at the above string,
   "02102" is 65, the ASCII character "A", and "12222" is 161, which is "¡" in
   Latin-1. 5 characters of trinary have 243 possibilities, whereas a byte has
   256, so every group fits in a single byte.
6. nalgh is almost done now. To be able to easily decompress the text later on,
   it is necessary to record which number stands for which character of the
   input text. To do so, nalgh creates a text string in which the character
   with the most appearances comes first, and so on. Continuing with the
   letters from the previous examples, this string would be "aer", because
   there are more "a"s than "e"s and more "e"s than "r"s.
7. A character whose ASCII value is the number of different input characters
   is joined with the string listing the characters by appearance, which is
   then joined with the string of bytes converted from trinary. Using the
   "aer" example, the *final* string would be "[ASCII character with the value
   3]aerA¡".
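The seven steps above can be sketched in a few lines of code. What follows is
a minimal sketch in Python, not the original PHP source; the function names
~rank_characters~, ~pack~ and ~nalgh_compress~ are made up for this
illustration. It assumes that every input character fits in one Latin-1 byte,
that the text has at most 255 different characters (so that the count fits in
one byte), and that characters with equal counts are ranked by first
appearance. None of these three points is stated in the description.

#+begin_src python
from collections import Counter


def rank_characters(text):
    """Steps 1-2: distinct characters, most frequent first.

    Ties keep the order of first appearance (an assumption; the
    description does not say how ties are broken).
    """
    return [ch for ch, _ in Counter(text).most_common()]


def pack(text, ranked):
    """Steps 3-7 for a given ranking of the characters."""
    if len(ranked) > 255:
        # Assumption: the character count must fit in a single byte.
        raise ValueError("more than 255 different characters")
    rank = {ch: i for i, ch in enumerate(ranked)}

    # Step 3: binary rank of every character, with "2" as separator.
    trits = "2".join(format(rank[ch], "b") for ch in text)

    # Step 4: pad with "2"s up to a multiple of 5.
    trits += "2" * (-len(trits) % 5)

    # Step 5: every 5 base-3 digits become one byte (value 0..242).
    body = bytes(int(trits[i:i + 5], 3) for i in range(0, len(trits), 5))

    # Steps 6-7: count byte, then the ranked characters, then the body.
    # Assumption: each character is stored as one Latin-1 byte.
    table = "".join(ranked).encode("latin-1")
    return bytes([len(ranked)]) + table + body


def nalgh_compress(text):
    """Compress text with the ranking derived from the text itself."""
    return pack(text, rank_characters(text))


if __name__ == "__main__":
    # The "are" example with the ranking "aer" from the description.
    print(pack("are", "aer"))  # b'\x03aerA\xa1'
#+end_src

Passing the ranking to ~pack~ separately makes it possible to reproduce the
"are" example, where the ranking "aer" comes from a longer surrounding text.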
Decompressing is simply doing the same steps backwards. It is also a lot less
work for the computer than compressing.
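Read backwards, the steps become: take the first byte as the number of
different characters, read that many characters as the ranking, expand every
remaining byte into 5 trinary digits, split the digits on "2", and look up
each binary number in the ranking. The following is a minimal sketch of that
in Python, not the original PHP source. The names ~to_trits~ and
~nalgh_decompress~ are hypothetical, and the sketch assumes that each
character in the ranking is stored as one Latin-1 byte.

#+begin_src python
def to_trits(value):
    """Return a byte value (0..242) as a 5-digit base-3 string."""
    if not 0 <= value <= 242:
        raise ValueError("byte value does not fit in 5 trinary digits")
    digits = []
    for _ in range(5):
        value, digit = divmod(value, 3)
        digits.append(str(digit))
    return "".join(reversed(digits))


def nalgh_decompress(data):
    """Undo the compression: count byte, ranking, then packed body."""
    if not data:
        raise ValueError("empty input")
    count = data[0]
    # Assumption: one Latin-1 byte per character in the ranking.
    ranked = data[1:1 + count].decode("latin-1")
    body = data[1 + count:]

    # Expand every byte back into its 5 trinary digits.
    trits = "".join(to_trits(b) for b in body)

    # Split on the "2" separators; the padding "2"s at the end only
    # produce empty pieces, which are skipped.
    pieces = [p for p in trits.split("2") if p]
    return "".join(ranked[int(p, 2)] for p in pieces)


if __name__ == "__main__":
    # The final string from the "aer" example decodes back to "are".
    print(nalgh_decompress(b"\x03aerA\xa1"))  # are
#+end_src

Only a single pass over the compressed bytes is needed here, with no counting
or sorting, which fits the remark that decompressing is the cheaper direction.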
For now nalgh only operates on text, but it shouldn't be too difficult to make
it work with arbitrary bytes.

The nalgh source code is available [[http://projects.metanohi.name/nalgh/nalgh.php][here]]. For now there's only a PHP version
(and an uncommented one, at that), and I doubt that there will ever be a
version in another language.

I sincerely hope that nalgh will never be used. I have not yet seen a data
compression method capable of performing worse than nalgh. At the same time, I
do like the way nalgh works, keeping everything simple.