metanohi/site/writings/ordfinder/index.org

9.7 KiB

Old junk code: Word finder

#&toc

Old junk code: Word finder

#&img;url=sadcomputer.png, float=right

Based on this, CC BY 2.0

If you ever get tired of looking at your own junk code, take a look at this.

In August 2008, when I was still learning to program in C, I created a program "ordfinder" (eng: word finder) which, given a word and a dictionary, prints the words from the dictionary which can be created from the letters from the given word in any order. Incredibly, it ended up compiling and works perfectly for any word whose length does not exceed 8 characters, although it is a bit slow.

But why not more than 8 characters? My view of memory might have been a bit naive back then, because the first step in my algorithm is to generate and store all permutations of all subsequences of the given word. That is, if the string is "me", my program stores the array `{ "m", "e", "me", "em" }` in memory before going on to reading the dictionary and looking for matches.

If the string is "you", the program stores `{ "y", "o", "yo", "oy", "u", "yu", "uy", "ou", "uo", "you", "yuo", "oyu", "ouy", "uyo", "uoy" }`.

If the string is "computer", the program stores the 109600 permutations of the subsequences of "computer".

If the string is "difficult", the length of 9 characters means that the program attempts to store 986409 strings of lengths 1 to 9. That probably takes up not more than 10 MB, so it shouldn't be a problem. However, my program seems to store the list of words on the stack instead of in memory, so words with length 9 or above cause a stack overflow to happen.

In any case, a word length of 10 would require about 100 MB, a word length of 11 about 1.2 GB, a word length of 12 about 15.6 GB, and a word length of 17 (like "inconspicuousness") about 16,5 Petabytes (16500000 GB). That's 6,5 Petabytes more than what the Internet Archive uses to store millions of websites, books, video and audio.

So perhaps neither my algorithm nor my implementation was that good.

The code

Note that this code doesn't actually compile, because of all the wrong code. However, it did compile back in 2008 which means that either I added the wrong code after I had compiled it, or I used an overfriendly compiler (I don't remember which compiler it was, but it ran on Windows). I have run the old executable with `wine`, and that works.

It's not necesarry to know C to laugh at this code, but it helps.

We'll start with some basic `#include`s.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <math.h>

So far, so good. Then the global variables with descriptive names. And let's declare four strings of length 0 to be statically allocated, because we'll just extend them later on…?

char os[0],s[0],r[0],t[0];
int l,c,rc,k,sk,i,ii,iii,ri;

The next step is to define our own version of C's builtin `strstr` function (almost). I was used to PHP, so I wanted the same return values as PHP's `strpos`.

int strpos (const char *haystack, const char *needle) {
  int i;

  if (strlen (haystack) < strlen (needle))
    return -1;

  for (i = 0; i <= (strlen (haystack) - strlen(needle)); i++) {
    if (!strncmp (&haystack[i], needle, strlen(needle)))
      return i;
  }

  return -1;
}

Then it's time for the main function. We don't want to separate it into auxiliary functions, because that's just ugly!

Indentation? Too much wastes too much space.

int main(int argc, char *argv[])
{
 if (argc>1) {
 strcpy(os,argv[1]);
 }
 else {
 printf("Indtast ord: ");
 gets(os);
 }
 printf("T\x91nker...\n");
 strcpy(s,os);
 for(i=0;s[i];i++) {
 s[i]=tolower(s[i]);
 }

Wait, what? We use `strcpy` to copy the string `argv[1]`, which contains the word we want to permute, into the statically allocated `os` with length 0? Or we read a line from standard in and save in `os`? And almost the same for `s`? That's… not good.

At least these two lines aren't that bad.

 l=strlen(s);
 c=pow(l,l);

But then begins the actual permutation generation logic. I have tried to re-understand it, with no success.

 rc=1;
 i=0;
 while (i<l-1) {
 rc=rc*(l-i);
 i++;
 }

While we're at it, why not declare two to-be-statically-allocated arrays with dynamically-generated ints as lengths?

 int ca[l];
 char ra[rc][l+1];

And then some more assignments and `while` loops…

 ri=0;
 i=0;
 while (i<c) {
 k=1;
 ii=0;
 while (ii<l && k==1) {

This formula does something. I'm not sure what.

 ca[ii]=floor(i/pow(l,l-ii-1))-floor(i/pow(l,l-ii))*l;

More `while` loops, now also with `if` statements.

 iii=0;
 while (iii<ii) {
 if (ca[ii]==ca[iii]) {k=0;}
 iii++;
 }
 ii++;
 }
 if (k==1) {
 strcpy(ra[ri],"");
 ii=0;
 while (ii<l) {
 strncpy(t,s+ca[ii],1);

Let's concatenate `t` onto `ra[ri]`, a string which hardly exists due to the `char ra[rc][l+1];` magic above.

 strcat(ra[ri],t);
 ii++;
 }

And why not concatenate an end-of-string mark onto a string which, if it doesn't have an end-of-string mark, will make `strcat` fail miserably?

 strcat(ra[ri],"\0");

And then more junk.

 sk=1;
 ii=0;
 while (ii<ri && sk==1) {
 if (strcmp(ra[ri],ra[ii])==0) {sk=0;}
 ii++;
 }
 if (sk==1) {
 //printf("%s\n",ra[ri]);
 ri++;
 }
 }
 i++;
 }
 //printf("\nOrd: %s\nOrdl\x91ngde: %d\nOrdkombinationer: %d\n",os,l,ri);

Phew… At this point, I'm certain that `ra` is supposed to be an array of all word permutations. So let's open our dictionary "ord.txt" and look for matches.

 FILE *f;
 char wrd[128];
 if (f=fopen("ord.txt","r")) {
 FILE *fw;

Everything is written both to output.txt and standard out. Anything else would be stupid.

 fw=fopen("output.txt","w");
 printf("Ord dannet af \"%s\":\n\n",os);
 fprintf(fw,"Ord dannet af \"%s\":\n\n",os);
 int wc=0;
 while(!feof(f)) {
 if(fgets(wrd,126,f)) {

The words each end with a newline, so let's replace the newline with an end-of-string mark.

 wrd[strlen(wrd)-1]=0;
 //printf("%s\n",wrd);
 k=0;
 ii=0;
 while (ii<ri && k==0) {

The magical core of the matching logic, using our own `strpos`:

 if (strpos(ra[ii],wrd)>-1) {k=1;}

If k == 1, something good happens. But it doesn't happen at once for some reason.

 ii++;
 }
 if (k==1) {
 printf("%s\n",wrd);
 fprintf(fw,"%s\n",wrd);
 wc++;
 }
 }
 }
 printf("\nI alt %d ord\n",wc);
 fprintf(fw,"\nI alt %d ord",wc);
 fclose(fw);
 fclose(f);
 system("output.txt");
 }
 return 0;
}

And that's my pretty C code.

The SML equivalent

To make my inefficient algorithm a bit clearer, I have made a few SML functions to do the same as above:

open List

(* Removes an element from a list. *)
fun remove x (y :: ys) = if x = y
                         then ys
                         else y :: remove x ys

(* Tails of a list. Stolen from Haskell's Data.List. *)
fun tails [] = [[]]
  | tails (xxs as (_ :: xs)) = xxs :: tails xs

(* Non-empty subsequences of a list. Stolen from Haskell's Data.List. *)
fun nonEmptySubsequences [] = []
  | nonEmptySubsequences (x :: xs) =
    let
        fun f (ys, r) = ys :: (x :: ys) :: r
    in 
        [x] :: foldr f [] (nonEmptySubsequences xs)
    end

(* All permutations of a list. *)
fun permutations [] = [[]]
  | permutations xs =
    let
        fun subPermutations x = map (fn ys => x :: ys) (permutations (remove x xs))
    in
        concat (map subPermutations xs)
    end


(* Permutations of subsequences of a list. *)
fun subsequencePermutations xs = concat (map permutations (nonEmptySubsequences xs))

(* The same, but for a string. *)
fun stringSubsequencePermutations s = map implode (subsequencePermutations (explode s))

(* Finds words in `wordList` which matches any permutation of any subsequence
 * of `word`. *)
fun findMatchingWords word wordList =
    let
        val wordPermutations = stringSubsequencePermutations word
    in
        filter (fn testWord =>
                   exists (fn word => word = testWord)
                          wordPermutations) wordList
    end

As well as some SML functions to calculate the number of permutations and bytes:

(* Calculates the factorial. *)
fun factorial 0 = 1
  | factorial n = n * factorial (n - 1)

(* Calculates the binomial coeffecient. *)
fun binomc n k = factorial n div (factorial k * factorial (n - k))

(* Gives [m, m + 1, ..., n]. *)
fun upTo m n = if m < n
               then m :: upTo (m + 1) n
               else [m]

(* Gives the total number of word subsequence permutations for a given word
 * length. *)
fun nPermutations len = foldl op+ 0 (map (fn n => factorial n * binomc len n)
                                         (upTo 1 len))

(* Gives the size in bytes for storing all word subsequence permutations for a
 * given word length in a space-saving way: there are `len` arrays, each taking
 * up space for the pointer to the array and the permutations of subsequences of
 * length n where `1 <= n <= len` and n is unique.
 *)
fun nSize len = 8 * len + foldl op+ 0 (
                map (fn n => (n + 1) * factorial n * binomc len n)
                    (upTo 1 len))

The alternative

Preprocess the dictionary into a clever data structure and don't use up all the memory.

#&line

Originally published here.