2013-02-11 13:40:29 +01:00
|
|
|
#+title: Old junk code: Word finder
|
|
|
|
#+summary: Less than perfect C code
|
|
|
|
#+license: wtfpl, unless otherwise noted
|
|
|
|
#+startup: showall
|
|
|
|
#&toc
|
|
|
|
|
|
|
|
|
|
|
|
* Old junk code: Word finder
|
|
|
|
|
|
|
|
#+caption: Based on [[https://commons.wikimedia.org/wiki/File:2001-91-1_Computer,_Laptop,_Pentagon_(5891422370).jpg][this]], CC BY 2.0
|
|
|
|
#&img;url=sadcomputer.png, float=right
|
|
|
|
|
|
|
|
If you ever get tired of looking at your own junk code, take a look at this.
|
|
|
|
|
|
|
|
In August 2008, when I was still learning to program in C, I created a program
|
|
|
|
"ordfinder" (eng: word finder) which, given a word and a dictionary, prints the
|
|
|
|
words from the dictionary which can be created from the letters from the given
|
|
|
|
word in any order. Incredibly, it ended up compiling and works perfectly for any
|
|
|
|
word whose length does not exceed 8 characters, although it is a bit slow.
|
|
|
|
|
|
|
|
But why not more than 8 characters? My view of memory might have been a bit
|
|
|
|
naive back then, because the first step in my algorithm is to generate and
|
|
|
|
store all permutations of all subsequences of the given word. That is, if the
|
2013-02-11 21:55:34 +01:00
|
|
|
string is "me", my program stores the array ={ "m", "e", "me", "em" }= in
|
2013-02-11 13:40:29 +01:00
|
|
|
memory before going on to reading the dictionary and looking for matches.
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
If the string is "you", the program stores ={ "y", "o", "yo", "oy", "u", "yu",
|
|
|
|
"uy", "ou", "uo", "you", "yuo", "oyu", "ouy", "uyo", "uoy" }=.
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
If the string is "computer", the program stores the 109600 permutations of the
|
|
|
|
subsequences of "computer".
|
|
|
|
|
|
|
|
If the string is "difficult", the length of 9 characters means that the program
|
|
|
|
attempts to store 986409 strings of lengths 1 to 9. That probably takes up not
|
|
|
|
more than 10 MB, so it shouldn't be a problem. However, my program seems to
|
|
|
|
store the list of words on the stack instead of in memory, so words with length
|
|
|
|
9 or above cause a stack overflow to happen.
|
|
|
|
|
|
|
|
In any case, a word length of 10 would require about 100 MB, a word length of 11
|
|
|
|
about 1.2 GB, a word length of 12 about 15.6 GB, and a word length of 17 (like
|
|
|
|
"inconspicuousness") about 16,5 Petabytes (16500000 GB). That's 6,5 Petabytes
|
|
|
|
*more* than [[http://archive.org/web/petabox.php][what the Internet Archive uses]] to store millions of websites, books,
|
|
|
|
video and audio.
|
|
|
|
|
|
|
|
So perhaps neither my algorithm nor my implementation was that good.
|
|
|
|
|
|
|
|
|
|
|
|
* The code
|
|
|
|
|
|
|
|
Note that this code doesn't actually compile, because of all the wrong
|
|
|
|
code. However, it did compile back in 2008 which means that either I added the
|
|
|
|
wrong code after I had compiled it, or I used an overfriendly compiler (I don't
|
|
|
|
remember which compiler it was, but it ran on Windows). I have run the old
|
2013-02-11 21:55:34 +01:00
|
|
|
executable with ~wine~, and that works.
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
It's not necesarry to know C to laugh at this code, but it helps.
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
We'll start with some basic ~#include~s.
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
#include <stdio.h>
|
|
|
|
#include <stdlib.h>
|
|
|
|
#include <string.h>
|
|
|
|
#include <ctype.h>
|
|
|
|
#include <math.h>
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
So far, so good. Then the global variables with descriptive names. And let's
|
|
|
|
declare four strings of length 0 to be statically allocated, because we'll just
|
|
|
|
extend them later on...?
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
char os[0],s[0],r[0],t[0];
|
|
|
|
int l,c,rc,k,sk,i,ii,iii,ri;
|
|
|
|
#+END_SRC
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
The next step is to define our own version of C's builtin ~strstr~ function
|
2013-02-11 13:40:29 +01:00
|
|
|
(almost). I was used to PHP, so I wanted the same return values as PHP's
|
2013-02-11 21:55:34 +01:00
|
|
|
~strpos~.
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
int strpos (const char *haystack, const char *needle) {
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (strlen (haystack) < strlen (needle))
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
for (i = 0; i <= (strlen (haystack) - strlen(needle)); i++) {
|
|
|
|
if (!strncmp (&haystack[i], needle, strlen(needle)))
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
Then it's time for the main function. We don't want to separate it into
|
|
|
|
auxiliary functions, because that's just ugly!
|
|
|
|
|
|
|
|
Indentation? Too much wastes too much space.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
int main(int argc, char *argv[])
|
|
|
|
{
|
|
|
|
if (argc>1) {
|
|
|
|
strcpy(os,argv[1]);
|
|
|
|
}
|
|
|
|
else {
|
|
|
|
printf("Indtast ord: ");
|
|
|
|
gets(os);
|
|
|
|
}
|
|
|
|
printf("T\x91nker...\n");
|
|
|
|
strcpy(s,os);
|
|
|
|
for(i=0;s[i];i++) {
|
|
|
|
s[i]=tolower(s[i]);
|
|
|
|
}
|
|
|
|
#+END_SRC
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
Wait, what? We use ~strcpy~ to copy the string ~argv[1]~, which contains the
|
|
|
|
word we want to permute, into the statically allocated ~os~ with length 0? Or we
|
|
|
|
read a line from standard in and save in ~os~? And almost the same for ~s~?
|
2013-02-11 13:40:29 +01:00
|
|
|
That's... not good.
|
|
|
|
|
|
|
|
At least these two lines aren't that bad.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
l=strlen(s);
|
|
|
|
c=pow(l,l);
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
But then begins the actual permutation generation logic. I have tried to
|
|
|
|
re-understand it, with no success.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
rc=1;
|
|
|
|
i=0;
|
|
|
|
while (i<l-1) {
|
|
|
|
rc=rc*(l-i);
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
While we're at it, why not declare two to-be-statically-allocated arrays with
|
|
|
|
dynamically-generated ints as lengths?
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
int ca[l];
|
|
|
|
char ra[rc][l+1];
|
|
|
|
#+END_SRC
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
And then some more assignments and ~while~ loops...
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
ri=0;
|
|
|
|
i=0;
|
|
|
|
while (i<c) {
|
|
|
|
k=1;
|
|
|
|
ii=0;
|
|
|
|
while (ii<l && k==1) {
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
This formula does something. I'm not sure what.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
ca[ii]=floor(i/pow(l,l-ii-1))-floor(i/pow(l,l-ii))*l;
|
|
|
|
#+END_SRC
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
More ~while~ loops, now also with ~if~ statements.
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
iii=0;
|
|
|
|
while (iii<ii) {
|
|
|
|
if (ca[ii]==ca[iii]) {k=0;}
|
|
|
|
iii++;
|
|
|
|
}
|
|
|
|
ii++;
|
|
|
|
}
|
|
|
|
if (k==1) {
|
|
|
|
strcpy(ra[ri],"");
|
|
|
|
ii=0;
|
|
|
|
while (ii<l) {
|
|
|
|
strncpy(t,s+ca[ii],1);
|
|
|
|
#+END_SRC
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
Let's concatenate ~t~ onto ~ra[ri]~, a string which hardly exists due to the
|
|
|
|
~char ra[rc][l+1];~ magic above.
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
strcat(ra[ri],t);
|
|
|
|
ii++;
|
|
|
|
}
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
And why not concatenate an end-of-string mark onto a string which, if it
|
2013-02-11 21:55:34 +01:00
|
|
|
doesn't have an end-of-string mark, will make ~strcat~ fail miserably?
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
strcat(ra[ri],"\0");
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
And then more junk.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
sk=1;
|
|
|
|
ii=0;
|
|
|
|
while (ii<ri && sk==1) {
|
|
|
|
if (strcmp(ra[ri],ra[ii])==0) {sk=0;}
|
|
|
|
ii++;
|
|
|
|
}
|
|
|
|
if (sk==1) {
|
|
|
|
//printf("%s\n",ra[ri]);
|
|
|
|
ri++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
//printf("\nOrd: %s\nOrdl\x91ngde: %d\nOrdkombinationer: %d\n",os,l,ri);
|
|
|
|
#+END_SRC
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
Phew... At this point, I'm certain that ~ra~ is supposed to be an array of all
|
2013-02-11 13:40:29 +01:00
|
|
|
word permutations. So let's open our dictionary "ord.txt" and look for matches.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
FILE *f;
|
|
|
|
char wrd[128];
|
|
|
|
if (f=fopen("ord.txt","r")) {
|
|
|
|
FILE *fw;
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
Everything is written both to output.txt *and* standard out. Anything else would
|
|
|
|
be stupid.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
fw=fopen("output.txt","w");
|
|
|
|
printf("Ord dannet af \"%s\":\n\n",os);
|
|
|
|
fprintf(fw,"Ord dannet af \"%s\":\n\n",os);
|
|
|
|
int wc=0;
|
|
|
|
while(!feof(f)) {
|
|
|
|
if(fgets(wrd,126,f)) {
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
The words each end with a newline, so let's replace the newline with an
|
|
|
|
end-of-string mark.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
wrd[strlen(wrd)-1]=0;
|
|
|
|
//printf("%s\n",wrd);
|
|
|
|
k=0;
|
|
|
|
ii=0;
|
|
|
|
while (ii<ri && k==0) {
|
|
|
|
#+END_SRC
|
|
|
|
|
2013-02-11 21:55:34 +01:00
|
|
|
The magical core of the matching logic, using our own ~strpos~:
|
2013-02-11 13:40:29 +01:00
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
if (strpos(ra[ii],wrd)>-1) {k=1;}
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
If ~k == 1~, something good happens. But it doesn't happen at once for some
|
|
|
|
reason.
|
|
|
|
|
|
|
|
#+BEGIN_SRC c
|
|
|
|
ii++;
|
|
|
|
}
|
|
|
|
if (k==1) {
|
|
|
|
printf("%s\n",wrd);
|
|
|
|
fprintf(fw,"%s\n",wrd);
|
|
|
|
wc++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
printf("\nI alt %d ord\n",wc);
|
|
|
|
fprintf(fw,"\nI alt %d ord",wc);
|
|
|
|
fclose(fw);
|
|
|
|
fclose(f);
|
|
|
|
system("output.txt");
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
And that's my pretty C code.
|
|
|
|
|
|
|
|
|
|
|
|
* The SML equivalent
|
|
|
|
|
|
|
|
To make my inefficient algorithm a bit clearer, I have made a few SML functions
|
|
|
|
to do the same as above:
|
|
|
|
|
|
|
|
#+BEGIN_SRC ocaml
|
|
|
|
open List
|
|
|
|
|
|
|
|
(* Removes an element from a list. *)
|
|
|
|
fun remove x (y :: ys) = if x = y
|
|
|
|
then ys
|
|
|
|
else y :: remove x ys
|
|
|
|
|
|
|
|
(* Tails of a list. Stolen from Haskell's Data.List. *)
|
|
|
|
fun tails [] = [[]]
|
|
|
|
| tails (xxs as (_ :: xs)) = xxs :: tails xs
|
|
|
|
|
|
|
|
(* Non-empty subsequences of a list. Stolen from Haskell's Data.List. *)
|
|
|
|
fun nonEmptySubsequences [] = []
|
|
|
|
| nonEmptySubsequences (x :: xs) =
|
|
|
|
let
|
|
|
|
fun f (ys, r) = ys :: (x :: ys) :: r
|
|
|
|
in
|
|
|
|
[x] :: foldr f [] (nonEmptySubsequences xs)
|
|
|
|
end
|
|
|
|
|
|
|
|
(* All permutations of a list. *)
|
|
|
|
fun permutations [] = [[]]
|
|
|
|
| permutations xs =
|
|
|
|
let
|
|
|
|
fun subPermutations x = map (fn ys => x :: ys) (permutations (remove x xs))
|
|
|
|
in
|
|
|
|
concat (map subPermutations xs)
|
|
|
|
end
|
|
|
|
|
|
|
|
|
|
|
|
(* Permutations of subsequences of a list. *)
|
|
|
|
fun subsequencePermutations xs = concat (map permutations (nonEmptySubsequences xs))
|
|
|
|
|
|
|
|
(* The same, but for a string. *)
|
|
|
|
fun stringSubsequencePermutations s = map implode (subsequencePermutations (explode s))
|
|
|
|
|
|
|
|
(* Finds words in `wordList` which matches any permutation of any subsequence
|
|
|
|
* of `word`. *)
|
|
|
|
fun findMatchingWords word wordList =
|
|
|
|
let
|
|
|
|
val wordPermutations = stringSubsequencePermutations word
|
|
|
|
in
|
|
|
|
filter (fn testWord =>
|
|
|
|
exists (fn word => word = testWord)
|
|
|
|
wordPermutations) wordList
|
|
|
|
end
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
As well as some SML functions to calculate the number of permutations and bytes:
|
|
|
|
|
|
|
|
#+BEGIN_SRC ocaml
|
|
|
|
(* Calculates the factorial. *)
|
|
|
|
fun factorial 0 = 1
|
|
|
|
| factorial n = n * factorial (n - 1)
|
|
|
|
|
|
|
|
(* Calculates the binomial coeffecient. *)
|
|
|
|
fun binomc n k = factorial n div (factorial k * factorial (n - k))
|
|
|
|
|
|
|
|
(* Gives [m, m + 1, ..., n]. *)
|
|
|
|
fun upTo m n = if m < n
|
|
|
|
then m :: upTo (m + 1) n
|
|
|
|
else [m]
|
|
|
|
|
|
|
|
(* Gives the total number of word subsequence permutations for a given word
|
|
|
|
* length. *)
|
|
|
|
fun nPermutations len = foldl op+ 0 (map (fn n => factorial n * binomc len n)
|
|
|
|
(upTo 1 len))
|
|
|
|
|
|
|
|
(* Gives the size in bytes for storing all word subsequence permutations for a
|
2013-02-11 21:55:34 +01:00
|
|
|
* given word length in a space-saving way: there are ~len~ arrays, each taking
|
2013-02-11 13:40:29 +01:00
|
|
|
* up space for the pointer to the array and the permutations of subsequences of
|
2013-02-11 21:55:34 +01:00
|
|
|
* length n where ~1 <= n <= len~ and n is unique.
|
2013-02-11 13:40:29 +01:00
|
|
|
*)
|
|
|
|
fun nSize len = 8 * len + foldl op+ 0 (
|
|
|
|
map (fn n => (n + 1) * factorial n * binomc len n)
|
|
|
|
(upTo 1 len))
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
* The alternative
|
|
|
|
|
|
|
|
Preprocess the dictionary into a clever data structure and don't use up all the
|
|
|
|
memory.
|
|
|
|
|
|
|
|
|
|
|
|
#&line
|
|
|
|
|
|
|
|
Originally published [[http://dikutal.dk/artikler/old-junk-code-word-finder][here]].
|