libstree
Using libstree
is hopefully really quite simple. As an example, let's walk through
one of the test programs found in the test directory of the
source distribution:
lcshex.c, which calculates the longest common byte sequence of
the programs you pass it as command line arguments
[1].
Since libstree
operates on arbitrary strings, it needs a string model that leaves
the implementation of the common string operations up to the user. String families that
use a given set of such operations is called a
string class (called LST_StringClass).
Strings are instances of the LST_String type, and suffix
trees are instances of LST_STree type.
Each string belongs to a string class. Since it will often be the case that an application
uses only a single class of strings (e.g., byte strings), there is a default
string class, whose properties you can specify once and then forget about it.
libstree
currently uses three string-class-specific operations:
A function to compare string items (LST_StringItemCmpFunc)
A function to copy string items (LST_StringItemCopyFunc)
A function to convert a single string into an ASCII string (LST_StringPrintFunc)
memcmp()
on the memory area that the string item occupies, uses simple character assignment to copy an item.
Our little test program works on binary files, so we can use the compare and copy
function, but for output, we prefer hexadecimal strings. libstree
already provides
a function that does this, and we use it to override the default.
Let's look at some code. We will need a suffix tree and a few strings; also, we override the output method as describes above.
LST_STree *tree; LST_StringSet *set, *result; LST_String *string; /* Create a string class with a special print method, otherwise * we can use the defaults: */ lst_stringclass_set_defaults(NULL, NULL, lst_string_print_hex); |
Since most applications will have to handle multiple strings at a time, a container for
a number of strings is convenient. libstree
provides the type LST_StringSet
that can hold a set of strings. Let's allocate such a set:
/* Create a string set to conveniently hold all our strings: */ set = lst_stringset_new(); |
Now, we check for each of the program names that the user passed in, whether the file exists and if we can read it. We then allocate a chunk of memory and read the entire program code into it:
FILE *f; u_char *data; struct stat st; /* stat() the file, open file handle etc, now read content: */ if (fread(data, 1, st.st_size, f) != (size_t) st.st_size) { printf("Error reading %s -- skipping.\n", argv[i]); free(data); continue; } |
We're ready to create a string object from this bytesequence and add it to our string set.
string = lst_string_new(data, 1, st.st_size); lst_stringset_add(set, string); |
Now, let's build a suffix tree for all the strings we just put into our string set:
/* Create a suffix tree for all strings in the set: */ tree = lst_stree_new(set); |
Now find the longest common substring of all those strings.
/* Find longest common substring(s) */ result = lst_alg_longest_common_substring(tree, 0, 0); |
Note that the result is also a string set -- there may be multiple longest strings, of the same length. You can limit the results by requesting only substrings of a given minimum and maximum length, see lst_alg_longest_common_substring(). Finally, let's print each of the strings out to the console. The string set contents can be iterated using lst_stringset_foreach():
/* Print them out: */ lst_stringset_foreach(result, string_cb, NULL); |
That's it. Please browse the other test programs to get a better feeling for the API.
[1] | What do you mean, "What is this good for!?" :)" |