Wednesday, August 12, 2009

How To Handle Large Amounts of Data Quickly and Easily
So the last few weeks have been crazy busy. I have been completely swamped with client work, and while that is a good problem to have, it does take away time to do other stuff (like blogging!).
One of the things I have been working on deals with handling large blobs of data. It's a bit of a tangent since it has absolutely nothing to do with UI, but I figured that since there are a lot of developers reading this site, it might be helpful.
So what do I mean by large amounts of data? Well, for the app I am building we were testing our theoretical limits for performance reasons. Imagine a grid or an Excel spreadsheet that has 700 columns (Excel itself only allows 256) and 30,000 rows. That basically equates to 1.65 million cells in the grid. Now if you compare that lump of data to what, say, Google throws around, it's pretty insignificant; however, when it comes to keeping it in memory and accessing it quickly it is a bit beefy.
Most of the time you would get around this problem by paging through the data somehow, or having some sort of filter. Unfortunately for us, we didn’t have this luxury. The application needed the whole kit and caboodle accessible to it at all times. With that as the challenge we went to work.
In the beginning…
Before I started testing the theoretical limits of the app, I was using a run-of-the-mill XML Serialized cache file. It was kinda big with the small sample data I was initially using, but I didn’t think anything of it. Once I started dealing with lumpy up there, the XML file grew to 550 Megs of pure, unadulterated crap. I don’t know about you, but I sure don’t want to throw that bad boy into memory (we actually did a few times and watched my dev box cry uncle…kinda funny and sad at the same time).
So the original solution was out the door…time for plan B. Well, my partner is a big database guy, so we went with his idea and tried to throw el lumpo into a table called “tblCache”. Ever try throwing 1.6 million inserts at SQL Server Express? Even using SqlBulkCopy it still performed like a one-legged man in a butt-kicking competition. The end result of plan B was a database that quintupled in size to over a gig and an app that performed so poorly that it made the XML file seem speedy…
That is when the crying began…
My friend came up with the idea of “let's try and find an object-based database and just put the whole cache object in there!” Fortunately we both realized that was a horrendous idea about as soon as it came out of his mouth. However, the idea of caching the object itself stuck, so my buddy began researching that a bit and came up with a solution.
The Answer
So how do you deal with a large amount of data quickly and easily? Two words: Binary Serialization. It sounds fancy, but what it basically means is that you take your big fancy object and store it as a bunch of 1s and 0s. This took lumpy and turned him into a 100 meg blob of goodness (a full 5 times smaller than the XML file). When we were describing our solution to another guy on the team, he suggested looking into in-memory compression. After we got that working, our data file shrunk down to a trim 1 meg (my buddy owes that guy dinner for that idea, by the way).
Couple that with a few database tricks (like doing a few smaller hits rather than one huge one) and we went from loading our data in a crawling 138 seconds down to a svelte 4. Sounds too good to be true? It's not…and what is better is how easy the code is once you get your mind around it.
That’s Nice…Can I Do It?
So by now you are probably thinking…nice story…but can I do the same thing with my app? The answer is a resounding yes! And what is even better is you can do it completely free (i.e. no third-party tools) using only the standard libraries of .NET. You could spend some money on a third-party tool if you need something a bit specialized (i.e. higher compression than the GZip stuff), but you definitely don't need to.
Let's get started…
First we need some imports (this project is in VB.NET, but could easily be converted to C#).
Imports System.IO
Imports System.IO.Compression
Imports System.Runtime.Serialization
Imports System.Runtime.Serialization.Formatters.Binary
Nothing too strange there, but there are probably a few you haven’t seen before.
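One thing to keep in mind before we get to the code: whatever object you plan to cache has to be marked with the <Serializable()> attribute (and so does everything it holds on to), otherwise the BinaryFormatter will throw an exception at you. A bare-bones example — MyCache is just a stand-in for whatever your own cache class looks like:

<Serializable()> _
Public Class MyCache
    'everything stored in here (and in any nested types) must also be serializable
    Public Data As String
End Class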
The code itself is relatively simple as well.
Dim compressedzipStream As GZipStream = Nothing
Dim ms As MemoryStream = Nothing
Dim b As BinaryFormatter = Nothing
Dim fs As FileStream = Nothing

Try
    'serialize the cache object into an in-memory stream of bytes
    ms = New MemoryStream
    b = New BinaryFormatter
    b.Serialize(ms, cache)

    'copy the serialized bytes into a buffer for the compression stream
    Dim buffer(CInt(ms.Length) - 1) As Byte
    ms.Position = 0
    ms.Read(buffer, 0, buffer.Length)

    'write the buffer to disk through a GZip compression stream
    fs = New FileStream("C:\FileGoesHere\FileName.whatever", FileMode.Create)
    compressedzipStream = New GZipStream(fs, CompressionMode.Compress, True)
    compressedzipStream.Write(buffer, 0, buffer.Length)

Finally
    'only close the streams that actually got opened, even if something blew up above
    If compressedzipStream IsNot Nothing Then compressedzipStream.Close()
    If fs IsNot Nothing Then fs.Close()
    If ms IsNot Nothing Then ms.Close()
End Try

Got all that? Let's break it down.
The first few lines are just initializing some variables we are gonna use later. GZipStream is .NET's built-in compression class. It is pretty good, but if you need some serious compression you might want to go with a third-party tool.
Next you'll notice a classic Try/Finally block. This is simply so we can be sure that no matter what happens our streams get closed. The first line to really notice is b.Serialize(ms, cache). This is where the magic happens, or rather where your object (in our case it is called cache) is changed from what you built into a bunch of gobbledygook that only the computer can read. The good news is, it's pretty efficient compared to other serialization formats (e.g. XML). The bad news is that you can't read it the way you can XML. It's not a big deal, but I do find myself missing that little feature.
So what is happening is your object (which can be as simple as a string, or a custom object) is being squooshed down and stored into the memory stream ms.
The compression class needs a byte array to process, so that is what we build next. Buffer is our little array that we create to match the size of the memory stream we just created. After he is initialized we read the stream in. Note: make sure to set the position of the memory stream back to 0 before reading, otherwise nothing gets copied.
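(As an aside, if you would rather not worry about resetting the position, MemoryStream also has a ToArray method that hands you the bytes directly; something like the line below would replace the three buffer lines above.)

'ToArray copies the stream's full contents no matter where Position happens to be
Dim buffer As Byte() = ms.ToArray()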
Once our buffer is loaded we create a file stream object that points to the place on your hard drive where you want to store your data. Next we create a new GZipStream and point it at our file stream. Finally we write the buffer into the compressed stream. Voila! Your data is now saved and zipped up in a nice neat bow.
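By the way, if you like to keep things tidy, you can wrap all of that up in a little helper routine. Here is a rough sketch of the same idea using Using blocks (the SaveCompressedCache name is just something I made up); it also turns out you can serialize straight into the GZipStream and skip the intermediate buffer entirely:

Imports System.IO
Imports System.IO.Compression
Imports System.Runtime.Serialization.Formatters.Binary

Public Module CacheHelper

    'Serializes any <Serializable()> object and writes it to disk gzipped.
    Public Sub SaveCompressedCache(ByVal cache As Object, ByVal path As String)
        Dim b As New BinaryFormatter

        'Using blocks close the streams for us, even if something blows up
        Using fs As New FileStream(path, FileMode.Create)
            Using zip As New GZipStream(fs, CompressionMode.Compress)
                'serialize straight into the compressed stream
                b.Serialize(zip, cache)
            End Using
        End Using
    End Sub

End Module

You would call it with something like SaveCompressedCache(cache, "C:\FileGoesHere\FileName.whatever").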
So it's on the disk now…the next thing you need to know is how to access it, right? No problem.
'open the file we saved earlier
Dim reader As New StreamReader(cacheFile.FullName)

'this stream has been squooshed so we unsquoosh it here
Dim decomp As New GZipStream(reader.BaseStream, CompressionMode.Decompress, True)

'deserialize straight out of the decompression stream back into our object
Dim b As New BinaryFormatter
cache = b.Deserialize(decomp)

decomp.Close()
reader.Close()
This looks pretty similar to what we had before. Basically we use a StreamReader to open up the file we saved earlier, then we use the GZipStream to decompress it (notice the compression mode). Finally we use our handy dandy BinaryFormatter to deserialize the now unsquooshed data into our object, and we close the streams when we are done.
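If you are compiling with Option Strict On, keep in mind that Deserialize hands you back a plain Object, so you will want to cast it back to your own type. A rough load counterpart to the helper above might look like this (LoadCompressedCache and MyCache are just my placeholder names):

'Reads the gzipped file back off disk and deserializes it into our cache type.
Public Function LoadCompressedCache(ByVal path As String) As MyCache
    Dim b As New BinaryFormatter

    Using fs As New FileStream(path, FileMode.Open)
        Using zip As New GZipStream(fs, CompressionMode.Decompress)
            'decompresses on the fly while the formatter reads from the stream
            Return DirectCast(b.Deserialize(zip), MyCache)
        End Using
    End Using
End Function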
Now that you know how to use binary serialization here is the golden rule…
Keep your objects simple
The more complex the object you are trying to serialize, the larger your file will be and the slower it will be to access. For instance, when I started using this my data file was 17 megs (uncompressed it would have been over 500 megs!). The reason was that I was storing most of my data in a generic list. Now I love generic lists, and I use them wherever I can, but in this instance they are absolute memory pigs. When the serializer crunches an object down it adds a little overhead, and with a list you pay that overhead for every single item. Serializing a few objects is no big deal, but when you are dealing with hundreds of thousands of little ones, it starts to create a big problem.
In our case we took the same data and changed it from a generic list to a comma-delimited string, and the data file shrunk down to just under 1 meg. Load time went from 24 seconds to 4, so it is a big difference. When we decompress the data I change the string back into a list, so my code didn't have to change at all.
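To give you an idea of what that swap looks like, here is a bare-bones sketch (it assumes a List(Of String) called myList whose items don't contain commas themselves):

'before serializing: flatten the list into a single comma-delimited string
Dim flattened As String = String.Join(",", myList.ToArray())

'after deserializing: blow the string back up into a list so the rest of the code is untouched
Dim restored As New List(Of String)(flattened.Split(","c))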
Now if you want to go a step further, you can make your app really fly by moving the entire save process to a separate thread. It is out of scope for this article, but it isn't as difficult as some people would lead you to believe (*cough* job security *cough*). If you use threading the whole process will seem instant to your user. Can't beat that!
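Just to show how little ceremony it takes, here is a rough sketch using a plain old Thread (it assumes the hypothetical SaveCompressedCache helper from earlier and that cache is a field on your class):

Imports System.Threading

'the method that runs on the background thread
Private Sub SaveCacheInBackground()
    SaveCompressedCache(cache, "C:\FileGoesHere\FileName.whatever")
End Sub

'somewhere in your UI code: kick the save off and carry on
Dim saveThread As New Thread(AddressOf SaveCacheInBackground)
saveThread.IsBackground = True
saveThread.Start()

The IsBackground flag just means the thread won't keep your app alive if the user closes it mid-save, so leave it off if the data is too precious to lose.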
So there you have it. Dealing with large blobs of data isn’t all that uncommon nowadays. I hope this gives you a different approach to use when speed is of the essence.
