Not long ago someone sent me an email about my SubsetSum problem post, asking how to reconstruct the set of numbers that were used to create the sum. This is pretty straightforward, until you consider scenarios in which there is more than one correct subset and you want a specific one or all of them.

But, first thing’s first. Let’s consider the easy case – we assume there is only one correct result. For example consider the follwing set { 15, 20, 35, 40 } and the sum of 90 (15+35+40). Using the standard version of the algorithm, we’ll receive the table as follows (top row has values instead of indexes to increase readability, also, I skip non-relevant rows:

X 0 15 20 35 40
0 true true true true true
15 false true true true true
20 false false true true true
35 false false true true true
40 false false false false true
50 false false false true true
55 false false false false true
90 false false false false true

And so, to reconstruct values used, we’re going to determine which number was used to create each subset.

we start with our final value of 90, the first column in which we have ‘true’ is the last one, representing the value ’40’. We add 40 to the result set and substract this value from our final sum.And so recursively, we search for numbers until we reach 0. Starting from 90 we’ll find 40, then search for 90-40 = 50, 50 is true for column representing 35, so we’ll look for 50-35 = 15, we’ll find 15 in the second column and will not search for 15-15 = 0 (and we’re done)

The code can look something like this:


So what will happen if we want to find out what numbers to use to create 35? It can be either 15+20, or just 35. Algorithm in current form will return answer using as little numbers as possible , so it will be just 35. What to do to make it return both ?

Well, we need to remember in how many ways we can create any number from previous numbers. To do this, we’ll use int[][] instead of boolean[][] and increment the value each time we can find a new way to create a number

So the new table will look like this

X 0 15 20 35 40
0 1 1 1 1 1
15 0 1 1 1 1
20 0 0 1 1 1
35 0 0 1 2 2

We can see from this table that 35 can be created from either 35 (leading to 35-35 = 0 and end of algorithm)  or from 20 (leading to 35-20 = 15).

We’ll need to branch on every such occasion, so it’s easiest to use recursion. It may look something like this:


There are two things I believe are worth noticing:

1. It’s relatively easy to apply the same optimizations as in the original post – directly if we have unique sums or more indirectly by creating lists of starting positions if we have multiple subsets giving the same sum.

2.It’s also relatively easy to count all possible subsets without enumerating them (just one line change, instead of table[i][j]++ do table[i][j] += table[i – numbers[j – 1]][j – 1];, table[sum].length – 1] will hold result after algorithm is done)

There is a lot of fun in modifying this to achieve results like ‘subset using least numbers’ or ‘subset minimizing the use of even numbers’, etc. most of which require just small changes. I encourage you to to play with it as a fun exercise.

Recently I’ve encoundered a problem that required me to examine all possible partitions of a given set. Set partition is a collection of disjoint subsets whose union is the whole set. For example a set of {1,2,3} (being a partition by itself) can be devided in the following ways:

  • {1,2,3}
  • {1,2}, {3}
  • {1,3}, {2}
  • {1}, {2,3}
  • {1}, {2}, {3}

The problem was – how to generate those partitions efficiently. Moreover, we want to generate only the partitions where there is a specific number of sets. There is only one division where 3 or 1 sets are produced, and 3 partitions where we get 2 sets. The total number of possibilities for a set of given size is defined by Bell number

There is a connected problem to this one called Restricted growth functions (RGS) – such function is defined through inequality a[i] <= 1 + max(a[1]…a[i-1]) for i = 2,…,n. The reason those are connected is that you can express one by the other. A string produced by such function – for example 010 can be mapped to a set partition of {1,3}, {2} – first and third digits are zeroes, so first and third element of the set belong to subset zero, second digit is one, so the second set element belongs to the subset one, etc.

A string of 0121123 would express a set of {1},{2,4,5},{3,6},{7}.

The naive solution greatly repeats itself by not taking into account the mapping between Set partition and RGS.
Generating RGS is fairly easy:

Lazy generation and an iterative approach is somewhat more difficult – it relies on keeping track of a largest so far used value and iterating through all possibilities that satisfy the RGS inequality just as an Integer Partitioning algorithm would. I’ll refer you to the attached file for the implementation, just let me mention one important thing here that I find interesting.

This simple recursive approach generates all possible combinations that include any number of subsets from N 1-element to 1 N-element. Which is not exactly what I wanted.

I started out with an implementation that generated all the RGSes in an iterative manner, that also generated those permutations that didn’t satisfy the minGroups/maxGroup criteria (it skipped them, not returning from iterator). This worked fine, though was doing an awful lot of unnecessary skips. Generating 3 element subsets from 14 element set was taking 1,5 s.

First observation I made was that we can quickly determine whether we have enough groups in our permutation by examining the highest group used and if at any point there is exactly the number of ‘digits’ to generate as the desired number of groups minus currently highest used, then  the ending digits are fixed (starting from current highest +1 in increasing order). This way we can ensure that we’ll always have the required number of groups.

The second observation regards the other end of the spectrum – maxGroup – all we need to do is to control whether the group number we’re assigning is not higher as the max number of groups and if so, disallow it just as we disallow any digits that would break the RGS inequality.

With those two observations acted upon in implementation we can generate only the RGSes that satisfy the conditions in linear time, thus reducing the example of 3 element subsets from 14 element set to 0,078s.

I attach the code for partition generating class without detailed explanations (other than ones above), if you find something difficult to understand (I’ve tried to keep the implementation concise), do let me know in comments and I’ll be sure to follow up.

The Algorithm (the problem)

Yesterday I’ve encountered the following problem – from a set of positive integer numbers, determine all positive integers that can be constructed from any subset of that set.

For example set {1,3,5} can produce the following numbers {1,3,5,4,6,8,9}. In general case it’s a NP complete problem solvable by determining all combinations (this is what I implemented). However for anything larger than ~25 numbers in the set it’s strating to get too long to find the answer. Moreover, a ridiculous set like this {1,1,1,1,1,1,1,1,1,1,1,1…} that can be enumerated by fifth-grader becomes imposible to process.

So of course I went on looking for a smarter solution and quickly enough found this:

Admittedly, yesterday was a long day, but I was sitting in front of the screen staring at this and couldn’t for the life of me understand how and why it works. So – let me put it out here is somewhat more elaborate but kind of step-by-step way (Once rested I understood it today and even managed to improve it a bit).

This isn’t exactly the problem as stated (find if subset sums to X or find all things subsets sum up to), but the algorithm used is the same.


How it works (on example)

Wikipedia states it’s pseudo-polynomial, the reason for it is simply that it doesn’t grow polynomialy with the number of items in the set, but with the range of numbers. So your problem with set like {1,2,3,4} will not take the same amount of time as set {1,2,3,99999999} (we’ll see why)

So – the general idea is to go through all numbers from 1 to sum({set}) (so for set {1,2,3,4} we’ll go through all numbers 1..(1+2+3+4 = 10)) and try to determine if this particular number can be constructed using first 1…X elements of the set. While trying to determine this we’ll construct an array storing the values for each number.

The array to store the results will be of type boolean and sizes [SUM_OF_NUMBERS_IN_SET+1][COUNT_OF_NUMBERS_IN_SET+1]

let’s work on an example – a trivial array of {1,3} the sum of elements is 4, the number of elements is 2.

We’ll produce an empty boolean array that will look like this

X 0 1 2
0 true true true
1 false false false
2 false false false
3 false false false
4 false false false


this is just initially set up table, the true,true in the first row indicates, that with no elements, or with only first item or with all items from the set, we can create a subset that sums up to zero (empty set) .

Now the algorithm kicks in – for each number (1…4) we’ll determine if with first 1…2 elements of the set, we can create this number. So – how can we determine that – there are just two simple cases – either we can create this number from a smaller subset (without including current number) or  we can create this number from currently analysed number and we know that the remainder we can create from previously seen numbers. The boolean array will help us keep track of that. For each element, we’ll check if either an element to the left is true (first case) or an element higher up (by the amount of currently analysed element) and adjacent to left is true (second case).

In pseudocode that will be: table[i][j] = table[i][j-1] | table[i-set[j-1]][j-1] where set[j-1]  is the value of element from the set (indexed from zero, hence j-1)

So – how will our example unfold:

row 1:

X 0 1 2
0 true true true
1 false true true
2 false false false
3 false false false
4 false false false

[1][1]  we set to true because table[1-1][0] (that is table[0][0]) is true (second case)
[1][2]  we set to true because table[1][1] is true (first case)

for next iteration, on row indexed by 2, we’ll not set anything to true:

[2][1] we’ll not set to true because neither [2][0] nor [1][0] is true
[2][2] we’ll not set to true because neither [2][1] nor [-1][0] is true

for the third row:

[3][1] we’ll not set to true because neither [3][0] nor [2][0] is true
[3][2] we’ll set to true because [0][1] is true

the last one is interesting :

[4][1] we’ll not set to true, because neither [4][0] nor [3][0] (this is the equivalent of saying, with just the first element from the set, we can’t construct 4)
[4][2] we’ll set to true because [1][1] is set to true (this is the equivalent of saying, with the first two elements, we can construct 4 – take 3, and somehow constructed 1 from first element from the set)

So the final table will look like this:

X 0 1 2
0 true true true
1 false true true
2 false false false
3 false true true
4 false false true


The answer to question ‘can we construct X’ from the set will be in table[X][set_size -1] (last element of the row)


The actual code looks like this:


And no we can print what values are possible to create from such set:



The improvement isn’t that big (not big-O kind of thing), but in practical applications it matters. The main observation is that we only set the value to true at certain index for a number and then we keep it true. So we only need to track at which index we found we can make this number (if). The improved code (memory footprint is way smaller) can go like this:

That’s it! Hope you’ll find this instructive and won’t waste too much time trying to understand how exactly are those indices set and why on a sunday evening.


“Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky”. This statement is famously attributed to Donald Knuth, and I’ve recently learned the hard way how right he was.

Consider the following problem:

You’re given an inequality |A-X| < P – where A and P are some integer constants. Find how many integers satisfy that inequality. For simplicity let’s limit ourselves to X > – 2^31 and X < 2^31-1 Simple enough, we can divide the function |A-X| to two sections (- 2^31,A> where the function is strictly decreasing and <A,2^31-1) where it’s strictly increasing.

This can be solved easily analytically – X has to be between P-A and P+A, but let’s think about a generic monotonic inequality that can be specified on runtime.

To solve this more general problem,we only need to do a binary search for the minimum X satisfying inequality in the left section and maximum X satisfying inequality in the right one.

With the typical ‘wikipedia’ implementation of binary search by default we won’t do either (it assumes decreasing order, so it won’t do for the second section, and it returns invalid value if there isn’t an exact edge-value satisfying the inequality).

So there are two major considerations for our search function:

  • increasing / decreasing order of our search domain
  • edge values – should the function return the last value that satisfies conditions, the first one that doesn’t or some value indicating error.

For our example we need both increasing / decreasing order of search and last value that satisfies the conditions. Please note that in this case, we’re not really looking for a particular value (we don’t know what that value is). So, how can we achieve it? For example like this:


First a couple of explanations:

BinarySearchOrder is just an enum of Increasing/Decreasing (and is not that necessary as we’ll see later on).  The ISatisfactionValidator is an interface with a single method isValid which takes an index (an X if you will) and returns true if the value produced for that X (whatever that would be) is still satisfies the conditions (whatever they may be) or not. This can easily be adapted to a single element search as we’ll see later on.

A couple of interesting points here:

  • long conversion for the calculation of middle point – this can be easily overlooked with disastrous consequences. M = L + (R-L)/2 – consider R = MAX and L = MIN (negative) – the R-L is way beyond 32-bits.
  • middle point selection depending on ordering – in case of increasing sequences we’re modifying the right edge so we need to make sure that whenever middle point is between two values, we choose the right one (ceil instead of floor)
  • finally – XOR for validity check is just to mirror the search pattern in case of increasing sequence.
  • One more note here – in case of multiple elements with the same value, in the decreasing sequence the algorithm will produce the first index (from the left) and in the increasing the last (first from right) with given value.

That is all well and good, but in this form, the algorithm deals nicely with monotonic functions and we usually deal with arrays. We can easily modify it to suit our needs:

Nothing really tricky here, we just determine the order and write our Stisfaction validator in such a way that it would be usable by the algorithm. The decision to throw an exception when an element is not found may in some cases not be the best, espectially in some performance critical code, but it can easy be modified.

Finally – just noticed right now that in generic search with all elements the same, a Decreasing option will be chosen (and therefore first element returned). It would be more appropriate to name the enum NonIncreasing (or modify the selection and choose Decreasing/NonDecreasing set)





I’m back after some time spent on hard working, overtiming, doubleshifting and all that since my daytime project was heading for a rough landing. On the up side it gave me a lot to think and write about. Here’s the first topic – queue implementations and thread sleeps.

First of all, what motivated this – quite often (too often) I see a piece of code that looks like this:

Where the queue is shared among producers and consumers. The problem with this code is that if you run it, it’s going to spin. If the queue is more often empty then not (and it most likely is if you don’t have enormous amounts of data flowing through it). A very frequent modification of this ‘pattern’ is one with Thread.Yield() or Thread.Sleep(0) (they’re equivalend) after the while loop.

I used to think that it’s not all that bad, the Yield would give back the scheduled time without spinning much, the waste would be only on context switching, but it wouldn’t be that CPU consuming. Wrong! It does perform slightly better under overall CPU heavy load, but in ‘idle’ mode, it still consumes a whole one CPU core (if there is one consumer).

Another variation of this is something like Thread.Sleep(1) or Thread.Sleep(5). While it helps in idle mode, it has a couple of problems. It still does a lot of context switching completely unnecessairly and it also delays your processing (the bigger the sleep value, the larger the delay, the less context switching)

So how can we do better? I used to use some form of Monitor.Wait and Monitor.Pulse/PulseAll  was a good solution, and while it works fine it’s usually a bit difficult to get completely right and the consequences of not getting it right are dire. If you miss one Pulse or mishandle some exception, you can stop processing or wait while there are messages in queue, or a whole bunch of other unwanted stuff can happen.

Happily, .NET 4.0 introduced a BlockingQueue. This code:

can completely replace what we’ve had above (minus the cancellation, we’ll get there in a moment).

The foreach loop will block until there are elements in the collection and will consume them in a nice thread safe manner. When a producer is done adding, they can call collection.CompleteAdding() and the loop will end for all consumers.

The BlockingCollection contains a bunch of static methods that allow different behaviors dealing with mulltiple collections at one (AddToAny/TakeFromAny) that also can take cancellation tokens and bail out nicely if you decide to quit processing. One word of caution – it’s not guaranteed to behave as if there were any priorities – if you have two blocking collections and you do TakeFromAny, you’ll most likely receive elements from the first one (if there are any) and then from the second one, but it’s not a guaranteed behavior (at least I have not found it documented anywhere). In .NET 4 and 4.5 ILSpy can tell us that it will always behave this way, but in 5.0+ – who knows.

One last thing. This behavior brings me to mind another cool library for .NET – reactive extensions (parts of it – some interfaces – are actually a part of the framework). In short it’s a way of treating your collections (or any deta sources) as sort of ‘data available event producers’ (and more). It’s pretty robust, and it plays nicely with linq. Check it out here:


Some time ago when working on some performance critical serialization code for a C# project, I was looking deeply into all flavors of available serialization benchmarks. One thing that struck me at the time was that there weren’t that many of them. The most comprehensive I’ve found was this one: Published by service stack so maybe a little biased, but pretty neat nonetheless.

The test published were pretty in line with what I have observed – protobuf was a clear winner, but if you didn’t want to (couldn’t) use open source, DataContractSerializer was best available to you.

After some more testing, I started finding some discrepancies between this report and my tests and they we’re pretty big – it seemed that DataContractSerializer was better than reported, and the whole order was sometimes shuffled . It all boiled down to two major differences that I’ll explain here


Data Contract Serializer performance

In the quoted tests, DataContractSerializer was almost 7 times slower than protobuf. That’s a lot. In my tests it was 3 times slower. Big difference, huh ? When I ran the tests on my own machine (the thing I love about people at is that they published on GIT the source to their tests so that you could reproduce it at home) I found that DataContractSerializer was 7 times slower. So obviously, there had to be a difference in the way we used it! And indeed there was, when I added my version it was 2,8 times slower than protobuf running on the same test set. Here’s the difference:


My version:

See the difference ?

Sure, it’s binary, you can’t always use it, but in most cases it will do, and it’s way faster. I also noticed, that the payload size is about 1/3 more than JsonDataContractSerializer, which is actually pretty good, so if you’re worried by bandwidth usage, it’s an additional plus. Here are the results of running the Northwind test data with just one additional serialization option (binary one) on my machine:

Serializer Larger than best Serialization Deserialization Slower than best
MS DataContractSerializer 5,89x 10,24x 5,38x 7,24x
MS DataContractSerializerBinary 4x 3,01x 2,75x 2,85x
MS JsonDataContractSerializer 2,22x 8,07x 9,59x 9,01x
MS BinaryFormatter 6,44x 10,35x 6,23x 7,81x 1x 1x 1x 1x
NewtonSoft.Json 2,22x 2,83x 3,00x 2,94x
ServiceStack Json 2,11x 3,43x 2,48x 2,84x
ServiceStack TypeSerializer 1,67x 2,44x 2,05x 2,19x


Having this difference aside, the results were more consistent with my tests, but still there were big differences in some cases, which leads us to second important point


It’s all about data

There is nothing that will be best for all cases (even protobuf), so you should carefully pick your use case. In my tests, the data was larger, and usually more complex than in Northwind data set used for benchmarking by Even if you look at their results, the smaller data set the bigger the difference – for RegionDto which has only one int and one string, the DataContractSerializer is 14 times slower (10 times slower on my machine, binary XML for DataContractSerializer is 3,71 slower).

So, does the DataContractSerializer perform better (especially in binary version) if larger objects are involved? It does indeed.

I created a EmployeeComplexDto class that inherits from EmployeeDto and adds some data as follows:

The object being serialized  consisted of not more than 10 Orders, 10 Customers and 10 Friends (for each test the same amount and the same data), and here’s what I got serializing and deserializing this 100 000 times:

Serializer Payload size Larger than best Serialization Deserialization Total Avg per iteration Slower than best
MS DataContractSerializer 1155 1,88x 6407245 6123807 12531052 125,3105 4,11x
MS DataContractSerializerBinary 920 1,50x 2380570 3452640 5833210 58,3321 1,91x
MS JsonDataContractSerializer 865 1,41x 7386162 14391807 21777969 217,7797 7,14x
MS BinaryFormatter 1449 2,36x 9734509 7949369 17683878 176,8388 5,80x 613 1x 1099788 1948251 3048039 30,4804 1x
NewtonSoft.Json 844 1,38x 2844681 4272574 7117255 71,1726 2,34x
ServiceStack Json 844 1,38x 4904168 5747964 10652132 106,5213 3,49x
ServiceStack TypeSerializer 753 1,23x 3055495 4606597 7662092 76,6209 2,51x

So – are you really ‘stuck’ with DataContractSerializer, or is it quite good ? The answer is of course – it depends. Surely, whenever you can, use the Binary Writer as it is way faster, but even if you do, whether it’s a best choice or not can be answered only with a specific use case in mind – think about what data will you be serializing, how often, etc.

As always with reasoning about performance, best option is to test with your data, in close to real scenarios and (preferably) on similar hardware as the application is going to run in production. The closer you get to this, the more accurate your tests will be.


The other day I was watching .NET cryptography tutorials on pluralsight ( and one thing struck me – there were some performance tests for SHA hash algorithms, the results of which were couter-intuitive. Namely – the SHA512 was faster than SHA256. I’m not an expert in cryptography, but i know a thing or two how SHA works and I could see nothing that could justify those results. So I did some testing

The result of 1000 hashings of 10 KB file on my PC:

SHA 256 – 202 ms.

SHA 512 – 955 ms.


Now that’s not what I’ve seen in the video ! I ran the ILSpy ( on the libraries  and a thing that immediately stands out as a difference between SHA256 and SHA512 (note: I was using System.Security.Cryptography.SHA256Managed, there is another flavor of SHA256 and I’ll talk about that later) is that where SHA256 uses int, SHA512 uses ulong.

Aha! I quickly recompiled everything as x64 and as expected the results were mighty different:

SHA 256 – 201 ms.

SHA 512 – 130 ms.

Now it makes sense. I guess historically framework designers thought that by the time SHA512 will be a standard, all compilations will be x64 bit so it made sense to speed it up for those kinds of builds.


One other interesting thing is that there are two (at least) mechanisms for SHA computation in .NET. The *Managed one (SHA1Managed, SHA512Managed,etc.) is older and it is not certified to be FIPS 140-2 compliant (which basically means it’s potentialy less secure and you definitely shouldn’t use it if you’re writing software for government/military or public safety use). The other option is to use CryptoServiceProviders (SHA512CryptoServiceProvider, SHA256CryptoServiceProvider, etc.). It’s altogether a better option because compliance aside, those algorithms are faster as they use native windows libraries instead of managed implementations (The only drawback is that you need framework 3.5+).

Here are all the results for different providers and different systems.

compilation SHA256Managed SHA512Managed SHA256CSP SHA512CSP
x86 202 ms. 955 ms. 150 ms. 256 ms.
x64 201 ms. 130 ms. 114 ms. 73 ms.


So, the bottom line – use CryptoServiceProvider wherever possible and get ready to move on to 64 bit compilation if you need that extra security and/or extra performance (there is about a gazilion reasons to do it anyway)



About a year ago we had some internal coding challenge in the company, in which I decided to take part. The goal was simple enough – we were given a huge file (couple of GB) in a specific format where some IDs were put on a per-row basis along with other attributes. The goal was to read off of standard input the ID and find the associated attributes (or return some error i think when the ID did not exist). The entries were sorted according to the ID .

The language was not specified, you could use any. I used C#.

With data being more or less random it was very hard to do better than just binary search, though I’ve tried various algorithms (interpolation search with binary search cutoff was in some cases a little better) but ended up with a straight binary search. I had however a small caching algorithm that was building a map of pointers to the original file so that the initial binary search would be performed in-memory rather than seeking in  the heavy file. The cache was only ensuring it was sparse enough to ensure it’s size was below 512K. This was because the test procedure involved running the program multiple times on single-item inputs rather than once on multi-item input, so I judged that loading/saving may contribute too much if the cache was large).

The benchmark for me was simply the amount of reads of the disk until the answer could be given and I managed to keep it at about 13 which compared with raw binary search was a bit better (16 without cache).

Imagine my surprise when after all tests were ran, my program took twice as much as the best one !

Of course I started analyzing why did this happen and there were two very interesting conclusions:

1. Loading and Saving

HDD disks nowadays have about 32MB of cache. One thing that I noticed was that when I ran my program twice on the same input, it returned instantly. Just because the data being accessed on the disk was already in the cache and could be returned in no time. I assumed the same (to perhaps an even higher degree) would happen to my cache being saved and loaded each time the process started.

I was wrong. Loading and especially saving 512KB of data takes time. A lot of time if you do that frequently. In our test cases, the program was ran couple of hundreds of times with different inputs. That means a couple of hundred of loads and possibly as many saves (there not always were changes in the cache, but thanks to some re-balancing there often where).

After removing the cache and reverting to the raw binary search, the program was still 25% slower than the winning one, still having one major disadvantage, but no longer having the cache.

2.Process startup time

Now we come to the main point that interested me in this story – .NET process startup time.

In native code, the binary EXE file is just executed. The binary format (PE) of the EXE defines where the code starts (where the bootstrap code starts) so that the OS can just go ahead and execute this. In .NET world however, all we have is a managed assembly. When you execute a managed EXE, it jumps to a initialization function in mscoree.dll (If I’m not mistaken, it’s a Windows dll, not a framework dll, so it will tell you so even if you don’t have the framework installed at all). There it is decided which CLR version should be initialized (1.0/2.0/4.0) and only after this initialization CLR JITs and executes your managed code. This initialization has to take time. But how much of it ?

First – some baseline tests:

Most basic program, not doing anything, just main function that exits immediately. Executed 500 times using a batch script on my PC.

2.0 39,08 s 39,55 s
4.0 37,55 s 36,08 s
4.5 36,25 s 35,51 s

Note here that 4.0 and 4.5 are not really different CLR types – it’s just framework and compiler.

Having this baseline I tried different things to speed up the startup – first changing the build type to x86/x64 instead of any cpu. Result – nothing.  The numbers didn’t differ by more than a single second (and sometimes in the wrong direction)

Next – removing all the referenced assemblies. Result – the same – no change.

After some more unsuccessful attempts of trying various optimization and compilation options which yielded nothing I’ve found the only thing that in my tests has made any difference. NGEN.

Pre-generating the native image (thus removing the need for JIT compiling) speeded up the startup time by about 10% (the best in class 4.5 compilation reached 32,63 s). All other attempts for optimization didn’t provide signifficant results.

For comparison I wrote a similar program in C++ (main function, return immediately), and used the same test procedure. The result – 26,85 s.

A bonus lesson learned (at least for me) was that for any competition before you submit your program one should check the impact of the test procedure on ones program. It may just make all the difference.


Stability testing is important, but often overlooked. Especially in stable products. We had a saying about one of the products I was working that it had two bugs – one known and one hidden. And that every time you fixed any of them it immediately (and miraculously) returned to a stable state of having two bugs.

After spending about two weeks trying to find and fix one of them, we resolved to find and fix both. Surely enough there were some signs of the other – hidden bug – namely the number of handles reported by task manager. After a couple of days it was reaching 15 000. That’s a lot of handles event for a server application. Of course the worst thing was that after weekend of idleness, the handle count didn’t go down (unless you restarted the application and then it started growing again from nominal ~150).

In situations like these WinDbg is an invaluable ally and we quickly found out that the handles are mostly to a stopped threads and are not released because a finalizer thread is stuck.

Not only stuck but stuck with a very peculiar stacktrace:

Wait for single object? No wonder it’s frozen. But what is it waiting on ?  One thing worth noticing is the last line of the abbreviated stacktrace here which seems to be pointing at some COM object problem.

Some digging inside revealed the following piece of code:

I really don’t like WMI management objects, I find them hard to work with and the API somewhat cryptic, but sometimes there is no way around it. This particular code what being executed as a part of startup configuration from the main thread which then proceeded to execute some main program timer loop. Now the thing about the management objects is that they sometime have (or use something that has) destructors which are executed by the finalizer thread.

Say your main method looks like this:

I see a lot of methods like this (especially in software that is about 5-8 years old ;) If your readConfiguration() method is using something that will be accessing COM object in a destructor – you’re in trouble (your Finalizer thread will show the same stacktrace as the one in the beginning of this post).

Now why is that: the whole issue boils down to the annotation above the Main method – [STAThread]. Your main thread while creating the COM object will associate it with its own Single Threaded Apartment. Because of this, when finalizer thread will want to do something with this COM object, it won’t be able to do this directly, but will have to do it through the thread that created this COM object – your main thread. Your main thread however will be busy doing other things and not willing to proxy to the COM object (even if you do the Thread.sleep). The end result will be your finalizer thread frozen waiting, your handle count growing and eventually a crash of the application.

How to alleviate this problem – there are multiple ways to avoid it. Easiest fix is to just remove [STAThread], provided you don’t need it for other COM objects. Other is to execute your COM object creating code in another thread that is MTA. I chose to avoid using WMI at all – we’ve found that the reading of service start mode was completely unnecessary and there only for legacy reasons.

One interesting thing I noticed is that if you call GC.WaitForPendingFinalizers() in main thread, it will indeed wait, but also will release the finalizer thread from its waitOne by interacting with the COM on its behalf.

The other day I was watching a new series of lectures about Least significant digit Radix sort given by Robert Sedgewick and Kevin Wayne on Coursera ( and it got me thinking – why don’t I use it anywhere ?

I got so comfy with Java’s mergesort and Comparable<T> that I never thought on optimizing that even in some performance critical code. So i though i should at least check what the performance impact can be.

LSD Radix sort is only useful in cases where you have a reasonable length of the keys on which you are sorting, so for my tests I assumed long as THE key data type. Reasoning behind was the following datatype:

since timestamp (Date object) can be expressed as long, and it’s perfectly reasonable to sort by date, it will do.

There are two additional things I had to define to enable more general sorting:

An interface that allows for extracting the sort key from object (and thus allowing for sorting various object by various keys – an equivalent of Comparator) and implementation of that for my data type.

So now for the sort itself:

So how this works:

We start by extracting the keys to a separate array (to avoid extracting them for every pass). It’s an optimization step, but it’s worth to do it, since calling the extractor method that many times will be expensive (but of course, there is a memory trade-off). Then, for each double octet (16 bits) starting from least significant ones, we do a key-index counting sort on extracted keys also moving the objects themselves into an auxiliary array. The result of each pass is an array sorted by that double byte. Because the sort is stable, we can start from least significant and move ‘up’.

The method is nor overly complicated, there are a couple of subtleties though:

lines 20-21 and 34-35 – the XOR with most significant bit of the most significant byte is required in order to correctly sort the negative numbers (negating the sign bit)

lines 40-46 – exchanging the auxiliary array with initial array (or just two aux arrays in case of keys) is needed in order to avoid copying all elements from aux array which is typically done in one-pass key-index sort.

Also note here, that thanks to the fact we have an even number of passes (4), we use the input array as an auxiliary array in the last pass, so we end up with sorted elements in original array and there is no need to copy elements from aux array to the original array.

So, is this any good ? Here’s results:

Size Comparable LSD Radix
10 0 ms. 10 ms.
100 1 ms. 9 ms.
1_000 3 ms. 6 ms.
10_000 37 ms. 10 ms.
100_000 178 ms. 65 ms.
1_000_000 466 ms. 179 ms.
10_000_000 8882 ms. 2219 ms.

For small number of elements it’s definitely not worth it, but that was expected. Above certain threshold though it definitely makes sense – it’s faster than system sort, and since Java uses mergesort for objects anyway, it doesn’t have that much bigger memory requirements than system sorting (the two aux key arrays, but in performance critical code if you want to sacrifice some purity for performance it too can be avoided).

There is one low-hanging optimization that can double the performance for large arrays – if you can limit your key size to a smaller number of bytes, you can get away with less passes. For example I tested with the Date converted to milliseconds from start of unix epoch which is long datatype, but afterwords did some tests with the key calculated by making the number of milliseconds relative to January 1-st 2000 (since I knew my data set couldn’t contain dates prior to that one) and lowered resolution to seconds which allowed me to fit it in 4 byte int.

It not always can be done, but as with many optimizations – a lot depends on inherent properties of your data.