“What’s the Best Way To Process a Pool of Work?”

“What’s the best way to process a pool of work?” is a recurring question. As usual, the answer is “it depends,” because the optimal approach often depends on both the characteristics of the work itself and the constraints imposed by run-time system resources.

For example, I recently received the following email from reader Sören Meyer-Eppler, where the key was to avoid oversubscribing system resources (in this case, memory):

I have an application that has multiple threads processing work from a todo queue. I have no influence over what gets into the queue and in what order (it is fed externally by the user). A single work item from the queue may take anywhere from a couple of seconds to several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2GB of memory. The memory consumption is my problem. I’m running as a 64-bit process on an 8GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I’m wondering about the best way to work around this.

1. plan conservatively and run only 4 threads. The worst case shouldn’t be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.

2. make each thread check available memory (or rather the total memory allocated by all threads) before starting on a new item. Only start when more than 2GB of memory is left. Recheck periodically, hoping that other threads will finish their memory hogs so we may eventually start. This is still dangerous if the check happens while all threads are just starting out with their allocations.

3. try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.

4. more ideas?

I’m currently leaning towards number 2 because it seems simple to implement and solves most cases. However, I’m still wondering what standard ways exist for handling situations like this. The operating system must do something very similar at the process level, after all…

I replied:

I don’t have time to write a detailed answer right now, but also consider two queues (one for big tasks and one for small tasks), or having work items give a rough size estimate (possibly by doing an extra lightweight pass over the data up front).

May I post an extract of your mail on my blog? Then others may comment and provide useful hints.

He said yes, and so here it is for your consideration.

Note also this similar question that came up a few days ago on comp.programming.threads, but with different constraints — in that case, it was about avoiding idleness rather than avoiding oversubscription.
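For concreteness, here is a minimal sketch of what option 2 from the email might look like in C++, assuming the worker threads report allocations through a shared tracker (the names here are illustrative, not from Sören’s code). Note that the caveat from the email still applies: if several threads pass the check while their items are still small, they can collectively overshoot the budget; the one-permit idea in comment 3 below addresses that.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Shared gate that tracks total memory in use and lets a worker wait for
// enough headroom before it picks up the next work item.
class MemoryGate {
public:
    explicit MemoryGate(std::size_t budgetBytes) : budget_(budgetBytes) {}

    // Called from the (overridden) allocation paths as memory comes and goes.
    void onAlloc(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(m_);
        inUse_ += bytes;
    }
    void onFree(std::size_t bytes) {
        {
            std::lock_guard<std::mutex> lock(m_);
            inUse_ -= bytes;
        }
        cv_.notify_all();   // a memory hog may have finished; let waiters recheck
    }

    // Called by a worker before starting the next item: block until at least
    // `headroom` bytes (e.g. 2GB, the worst-case item size) are free.
    void waitForHeadroom(std::size_t headroom) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return inUse_ + headroom <= budget_; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t budget_;
    std::size_t inUse_ = 0;
};
```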

15 thoughts on ““What’s the Best Way To Process a Pool of Work?””

  1. Dmitry Vyukov’s solution would be suitable in your scenario simply because the tasks seem to consume memory on the order of gigabytes. Neither pooling nor sub-pooling is likely to help here, so it is better to get as close to the host OS as possible.

  2. Thank you all for the suggestions! Very productive discussion indeed.

    @Leo: no, I’m not tracking per thread yet, but your idea sounds doable – I’ll give it a try.

    cheers,

    Sören

  3. @BuschnicK: But are you tracking the memory usage on a *per worker thread* basis? If you are, then the solution is simple: create a semaphore with one permit that is shared among all worker threads. As soon as an item allocates more than 800MB, acquire a permit and set a local flag. When the item finishes, if the local flag is set, release the permit.

    That way you’ll only ever have one work item using up more than 800MB, and with 8 threads you are capped at 7 * 800MB + 2GB = 5.6GB + 2GB = 7.6GB.

    /LS
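    For what it’s worth, a minimal sketch of this one-permit scheme (assuming C++20’s std::binary_semaphore and an allocator hook that reports the current item’s running total; the names are illustrative) might look like this:

    ```cpp
    #include <cstddef>
    #include <semaphore>

    // At most one work item above the threshold is allowed to keep growing.
    constexpr std::size_t kLargeThreshold = 800ull * 1024 * 1024;   // 800MB
    inline std::binary_semaphore bigItemPermit{1};

    // Each worker keeps one of these per work item and feeds it the item's
    // running allocation total (e.g. from the overridden new/delete).
    class LargeItemGuard {
    public:
        void onAllocated(std::size_t totalBytesForItem) {
            if (!holdsPermit_ && totalBytesForItem > kLargeThreshold) {
                bigItemPermit.acquire();   // blocks while another large item runs
                holdsPermit_ = true;
            }
        }
        ~LargeItemGuard() {
            if (holdsPermit_)
                bigItemPermit.release();   // item finished; let the next big one grow
        }
    private:
        bool holdsPermit_ = false;
    };
    ```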

  4. Isn’t this a problem similar to the one faced by thread pool implementations?

    We can use a dedicated dispatching thread to serve the task queue and a pool of threads to serve the tasks. Whenever the task queue is not empty, the dispatching thread wakes and checks the system’s memory consumption; if consumption is still acceptable, it fetches a thread from the pool to process the new task. Also, the pool’s worker threads sleep or exit based on the current count of free threads, so the thread pool can grow or shrink dynamically with the workload.

    Jeffrey Richter’s Programming Server-Side Applications for Microsoft Windows 2000 has a code snippet for implementing this type of thread pool.

    Let me know if there is any problem in this design.

  5. If the jobs can be aborted with no side effects, I would use priorities in the following way:
    Have the memory management keep track of the amount of memory allocated by each thread.
    If memory runs low, start aborting tasks according to priority.
    Priority should be a mix of:
    1. how much memory the thread has allocated (i.e., how much benefit we get from aborting this thread)
    2. how much time the thread has been running on its current job (i.e., how much CPU time we waste by aborting the job).
    After aborting, the job is put back into the job queue. It can be put at the front of the queue, in which case no thread will start that job again until at least one other job has finished; or it can be put at the back of the queue, in which case the next job (which might have a smaller memory footprint) can be fetched instead.
    It is also possible to remember, for each aborted job, how much memory and time it consumed before being aborted, to gain some insight into whether it should be started again in the current memory situation, etc.
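    A rough sketch of that victim selection (assuming per-thread memory tracking and jobs that support cooperative cancellation; all names are illustrative):

    ```cpp
    #include <chrono>
    #include <cstddef>
    #include <vector>

    struct RunningJob {
        std::size_t bytesAllocated;                       // tracked per worker thread
        std::chrono::steady_clock::time_point startTime;  // when the job began
        // ... plus whatever handle is needed to cancel and re-queue the job ...
    };

    // Higher score = better candidate to abort: lots of memory reclaimed for
    // little CPU time thrown away.
    double abortScore(const RunningJob& j) {
        double seconds = std::chrono::duration<double>(
                             std::chrono::steady_clock::now() - j.startTime).count();
        return static_cast<double>(j.bytesAllocated) / (1.0 + seconds);
    }

    // Called when memory runs low: pick the job whose abort costs the least.
    const RunningJob* pickVictim(const std::vector<RunningJob>& jobs) {
        const RunningJob* best = nullptr;
        for (const auto& j : jobs)
            if (!best || abortScore(j) > abortScore(*best))
                best = &j;
        return best;
    }
    ```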

  6. …Just be careful, if suspending work on any thread, that you never end up with *all* threads suspended!

  7. @kert: What’s the difference between an explicitly shared memory pool and the regular heap? What advantage would I gain?

    @Leo: I am already using custom allocators (boost pool allocators for small allocations and global overrides for new and delete). So tracking the memory used is not much of a problem. The question is how to use this knowledge to not run out…

    @SDR: Upgrading the target machine isn’t as easy as it sounds, because:
    a) we have lots of them, not just one (the work is not only parallel per machine but also distributed over several boxes);
    b) some have already been shipped to customers, and we can easily upgrade the software but not the hardware;
    c) we are also using Amazon’s EC2 cloud, where you have only limited control over hardware resources.

  8. Implement a memory pool that is shared between the threads.

    Such design problems and patterns are common in embedded systems, which treat memory as a finite resource in every context.

  9. Sören,

    the idea was to track memory consumption by somehow hooking into the allocator. That way, you can accurately track memory use on a per-work item basis, and suspend when a work item has grown too big.

    How does the code analyzer work? Is it your own C/C++ code, or someone else’s? A mixture of both? Is there any chance at all to intercept malloc/new calls?

    /LS

  10. If your requirements permit, you may consider upgrading the target machine. Doubling the RAM would remove your concern. This is probably more cost-effective than devoting development resources to the problem.

    If you don’t have control of your target machine, you may choose to do nothing. If you expect events where the memory usage exceeds 8GB to be rare, you may accept the performance hit of paging.

    Otherwise, is there a reliable pattern to the timing of memory usage? Or are both the amplitude and velocity of memory use random?

  11. You may try to move processing to separate OS processes; that way you get an easy way to determine resource consumption and to temporarily suspend individual processing (the OS must provide such functions).
    I.e., the main process accepts requests and distributes them among a pool of worker processes; when processing completes, it receives the results and sends them back to the user. Besides that, the main process periodically checks the memory consumption of the worker processes and suspends some of them if needed. Suspended processes are known to be “large”; they are resumed later when total memory consumption drops. Suspended processes will occupy some space in the page file (which shouldn’t be a problem given the size of modern hard drives), but they will release physical memory.
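    On POSIX systems the suspend/resume part of this scheme is just a pair of signals; a minimal sketch (the surrounding bookkeeping, such as which worker owns which request and how its memory is measured, is assumed to live elsewhere; Windows would need a different mechanism, e.g. suspending the worker’s threads or using a job object):

    ```cpp
    #include <signal.h>
    #include <sys/types.h>

    // Stop the worker: its pages can be paged out, releasing physical memory.
    void suspendWorker(pid_t pid) { kill(pid, SIGSTOP); }

    // Resume it once total memory consumption has dropped again.
    void resumeWorker(pid_t pid) { kill(pid, SIGCONT); }
    ```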

  12. Without knowing your problem in detail, I would do something along the lines of your solution #3:

    Create a resource manager that tracks resource usage (e.g. memory, long running work).

    When adding work to the queue, add info describing the work’s expected resource usage.

    When choosing which work item to process next, choose the first work item that will not overuse the available resources. Update the resource manager whenever processing of a work item begins and ends. In your example, it sounds like the initial available resources are: 8GB of memory and X allowed long-running tasks (assuming you would not like all your threads to process long-running work items).

    If you have difficulty estimating the resource consumption, then try to split work items into multiple stages. That way, estimating resource consumption becomes easier. In addition, it becomes less of a problem if you are too conservative when estimating resource consumption.

    If there are requirements for certain work items to be processed in sequence, then that should be possible too. The key thing is to enforce the sequencing only when there is an actual requirement for it.
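    A minimal sketch of such a resource manager, assuming each queued item carries a memory estimate and that callers serialize access under the queue’s own lock (all names are illustrative):

    ```cpp
    #include <cstddef>
    #include <deque>
    #include <optional>

    struct QueuedWork {
        std::size_t estimatedBytes;   // supplied when the work is enqueued
        // ... payload ...
    };

    class ResourceManager {
    public:
        explicit ResourceManager(std::size_t budgetBytes) : free_(budgetBytes) {}

        // Pick the first item that fits in the remaining budget and reserve its estimate.
        std::optional<QueuedWork> takeNext(std::deque<QueuedWork>& queue) {
            for (auto it = queue.begin(); it != queue.end(); ++it) {
                if (it->estimatedBytes <= free_) {
                    QueuedWork w = *it;
                    queue.erase(it);
                    free_ -= w.estimatedBytes;
                    return w;
                }
            }
            return std::nullopt;      // nothing fits right now; try again later
        }

        // Called when a work item finishes; returns its reservation to the budget.
        void release(const QueuedWork& w) { free_ += w.estimatedBytes; }

    private:
        std::size_t free_;
    };
    ```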

  13. More info about the problem: http://www.vxclass.com
    I’m trying to optimize the diffing back-end of vxClass.

    @Leo: Handling out-of-memory situations after they have occurred is very hard, because unfortunately they don’t always manifest as a simple out-of-memory exception but in all kinds of other nasty ways.
    Your suggestion 1 sounds a bit like my 2. I’ll certainly use something like that.
    Splitting into two queues for small and large items is a good idea – although predicting how much memory one item will take is a tough problem in our case.

    @Bob: We have already thought about how we could parallelize a single task from the queue. We couldn’t think of a good way. The way the algorithm works, every subsequent step depends on the results of all previous ones.

    Anyways, thank you very much for the suggestions thus far!

    Sören

    I’m the one who asked the original question btw ;-)

  14. Just use a single thread to process the queue. That way you will never have fights between threads over memory.

    If you need to get performance back up, then use multi-threading inside the processing of a single task.

    You might find this link useful: http://blog.mrmeyer.com/?p=4646

  15. It is very difficult to give advice, since we are told very little about the finer points of the problem, for example, what means we have for tracking memory allocation on a per-item basis. Anyway:

    0. When dealing with random and unknowable input, don’t expect to make optimal decisions. Just so you don’t kill yourself over this.

    1. Continuously check available memory while processing. If it looks to be running out, use some ordering scheme to decide which work item should either (a) dump everything and be put back into the queue or (b) just suspend execution for a little while or (c) flush itself to disk and suspend.

    2. Put in some code to handle out of memory conditions. Then just run.

    3. If the sizes of work items are known up front, use two queues. If not, check when a work item has allocated more than 1GB. That makes it “large”. If another item being processed is also large, suspend until the other item has completed. The worst case here is that you have one 2GB item and then 7 “large” ones of 1GB each, so maybe set the threshold a bit lower – 800MB gives 7×0.8 + 2 = 7.6GB max, leaving yourself a nice 400MB margin.

    /LS
