Monthly Archives: November 2012

Unix Sorting and Breaking n*log(n)

I recently made a small suite of programs that sorts words in a file in theoretically less than n*log(n) time and outputs only unique words. The master program works by piping the words to two “sort” programs  to sort half of the words in the file. Finally another program merges the output of the sorts. It’s not terribly complicated but it’s a fun little program and the code came out very clean even for c. The program is a good introduction to forking, exec-ing, and piping. Heres a schematic of the program called Uniqify.

UniqifyOverview

I say the program could theoretically sort faster than n*log(n) because it uses two different sort processes. If the OS was smart it would put these processes on different cores and thus do some of the sorting work simultaneously. In reality though, you will need a large input file to overcome the lost time in creating new processes and merging the sorted output back together.

Note: You must name the executable files correctly. Start the program with “./Uniqify”. It gets input from stdin so you’ll probably want to redirect an input file. Same for stdout.
Here is the source code:

Uniqify.h
Uniqify.c
UniqParser.c
UniqMerger.c
Shell script that will compile everything and name it correctly: cmpall.txt

The DC3 Algorithm Made Simple

The DC3 algorithm is a powerful linear time suffix sorting algorithm. If you’re not familiar with suffix sorting, it is the process in which you sort the different parts of a single string. For example banana suffix sorted would be:
a
ana
anana
banana
na
nana

Efficient suffix sorting is needed in data compression, and substring matching. Practically speaking, suffix sorting can be used to search for text in an ebook, identifying long repeated sequences of DNA in a genome,  etc.

The most obvious way to suffix sort is to use a n*log(n) comparison algorithm to compare all the suffices of a string. However in the last 20 years, faster linear time suffix sorting algorithms have been developed. These algorithms beat the n*log(n) barrier by avoiding redundant sorting associated with suffix sorting. For example, from the banana suffix sort above, if you have sorted “ana” and “a”,  there is no reason to compare the entire suffices “nana” and “na”. Once you determine that they both start with “n”s you should just use the  result of the “ana” and “a” sort. The DC3 algorithm uses this rationale to create an elegant solution for linear time suffix sorting.  I’ll explain this algorithm by first exploring the process behind it and then by examining a java implementation I created.
Continue reading