Data Compression is one of the key components that has allowed an explosion in communications and the ability to store massive amounts of data.
There are many ways to compress data, however, one of the most elegant and effective is the Burrows-Wheeler Transform. In this article I will explain how the Burrows-Wheeler Transform works and offer an easy to understand java implementation.
An Overview of Burrows-Wheeler
The Burrows-Wheeler Transform is a text transformation that takes an input string and produces an output string with a high amount of repeating characters. For example the word “banana” is transformed into “annb$aa” ( the “$” will be explained shortly). This output can be efficiently compressed.
The magic of Burrows-Wheeler is it’s ability to create output that contains long runs of identical characters and it’s ability to transform the output back to it’s original state. I’ll explain how this works in two sections, Burrows-Wheeler Forward and Burrows-Wheeler Back.
Recently I checked my server’s access log to see who had been visiting my site. There is not a ton of content yet so I was shocked to see that I had over 3,872 requests since it’s launch. I can understand; I’m a pretty big deal, right?
Unfortunately a lot of the page requests were from crawlers. Your first thought may be “Google bots”, however they were only a minor player. Most of the web crawlers were “hack” bots, whose sole purpose was to break into my site. The requests would almost always be looking for unsecured parts of the site. Here’s what a typical part of the access log looks like:
Most of the requested files don’t even exist. The bot is just guessing random admin-esque urls. There are a ton of these attacks. I took the time to compile some basic info on them and found out a few interesting things.
- The total amount of requests related to attacks was about 20% of all requests.
- Most attacks made about 40 requests before giving up ( None succeeded)
- Most attacks lasted between 5 and 20 seconds
- Attacks originated from the following countries:
- USA, Poland, Singapore, Canada, South Korea, Japan, Australia, Taiwan, Brazil, Thailand, Germany, Venezuela
- The U.S. had the most attacks, 3.
- Interestingly some of the worlds largest countries by population were absent including: China(1), India(2), Indonesia(4), Pakistan(6), Nigeria(7), and Russia(8). (Population is not a good predictor of the “attacking” countries, however, GDP per capita is. Nine of the thirteen “attacking” countries had GDP per capita in the top 25.)
I remember when I installed the server there were a lot of disclaimers about maintaining security. When reading them, my first thought was always something like “Who’s trying to get into my tiny website? Not many people will even know I’m here.” I was wrong. The web is filled with crawlers, many of which are scanning for a free server. Check out a snippet from my access_log to get real glimpse at what even a tiny server like this one gets requests for.
Finally! This website is working like a well tuned machine. It’s been frustrating at times, but making making this website has been a lot of fun. I thought it’d let you guys know how I did it.
Starting from scratch:
I realize I could have chosen a web hosting service for a wordpress site. There are plenty of them and many come almost completely configured. However, that would have taken all the fun out of making my own! I started from scratch with my old desktop that has been gathering dust since I left for college and hooked it up to my home internet connection.
Setting up the machine:
First I got rid of windows and installed ubuntu 11.10. Next I installed XAMPP(an Apache server), proftp, and wordpress. Everything seemed to be going great, but I felt uneasy about not knowing enough UNIX. Particularly there are a lot of security risks associated with hosting a webserver. In order to understand what’s going on you need to know UNIX decently well. To put my anxiety to rest, I read two books: the ubuntu pocket guide and UNIX in a Nutshell. Both are great books and really helped me understand how UNIX works ( not to mention a great warm up for my upcoming Operating Systems class) After that I felt pretty confident in my website setup; it’s secure and runs smoothly.
The Dreaded DNS Configuration:
Everything I read about this said the DNS configuration, the process of making the website name, localhost, correspond to my IP address, would be terrible. It certainly wasn’t simple like the famous 5 minute wordpress install, but it was not the end of the world. I changed my firewall settings, set-up a static local IP address for my server and set my server to accept external requests. I used DynDns to register my domain and host the dynamic dns service to keep track of my ever changing IP( your home IP address is changed frequently as a result of your ISP dynamically generating IP addresses). I installed the DynDns update client(it tells DynDNS where to find your server) on my server and changed my wordpress site name to match my new domain name. About two hours after applying the domain name changes I got on my phone’s web browser and tested my site from outside our home wifi. It worked!
Keeping it humming along:
Finally, to keep the server running 24-7 I modified the BIOS to reboot after any power failures and scheduled all the necessary processes to startup when the computer restarts. When I need to login to the computer, I just use ssh. It just needs to be plugged in!
In theory the set-up is very straightforward but in reality it took a lot of head banging to make it work. Missing a detail in a configuration file could be miserable. The little errors slowed down the process a lot. However I’m really glad I did it. I have a new appreciation for UNIX and have come to really respect it’s power. My advice to any one looking at making their own website is to learn why things work. You can get going pretty quick with the manuals and how-to guides but once you hit a bump in the road you’re likely to be in way over your head.