How to make png encoding much faster? I'm working with large medical images and ...

gred · on Aug 29, 2021

I've spent some time on this problem -- classic space vs. time tradeoff. Usually if you're spending a lot of time on PNG encoding, you're spending it compressing the image content. PNG compression uses the DEFLATE format, and many software stacks leverage zlib here. It sounds like you're not simply looking to adjust the compression level (space vs. time balance), so we'll skip that.

Now zlib specifically is focused on correctness and stability, to the point of ignoring some fairly obvious opportunities to improve performance. This has led to frustration, and this frustration has led performance-focused zlib forks. The guys at AWS published a performance-focused survey [1] of the zlib fork landscape fairly recently. If your stack uses zlib, you may be able to find a way to swap in a different (faster) fork. If your stack does not use zlib, you may at least be able to find a few ideas for next steps.

[1] https://aws.amazon.com/blogs/opensource/improving-zlib-cloud...

lightcatcher · on Aug 29, 2021

I have no experience in PNG encoding, but found https://github.com/brion/mtpng The author mentions "It takes about 1.25s to save a 7680×2160 desktop screenshot PNG on this machine; 0.75s on my faster laptop." which makes me think your slower performance on smaller images either comes using the max compression setting or using hardware with worse single threaded performance.

Although these don't directly solve the PNG encoding performance problem, maybe some of these ideas could help?

* if users will be using the app in an environment with plenty of bandwidth and you don't mind paying for server bandwidth, could you serve up PNGs with less compression? Max compression takes 15s and saves 35MB's. If the users have 50mbit internet, then it only takes 5.6s to transmit the extra 35MB, so you could come out 10s ahead by not compressing. (yes, I see your comment about "don't say to use lower compression", but no reason to be killed by compression CPU cost if the bandwidth is available).

* initially show the user a lossy image (could be a downsized png) that can be quickly generated. You could then upgrade to a full quality once you finish encoding the PNG, or if server bandwidth/CPU usage is an issue then you could only upgrade if the user clicks a "high-quality" button or something. If server CPU usage is an issue, the low then high quality approach could let you turn down the compression setting and save some CPU at the cost of bandwidth and user latency.

minhmeoke · on Aug 29, 2021

Are you required to use PNG or could you save the files in an alternative lossless format like TIFF [1]? If you're stuck with PNG, mtpng [2] mentioned earlier seems to be significantly faster with multithreading (>40% reduction in encoding times). If you're publishing for web, TIFF or cwebp might also be possibilities with -mt (multithreading) and -q 25 (lower compression and larger filesize but faster) flags, or an experimental GPU implementation [3].

[1] https://blender.stackexchange.com/questions/148231/what-imag...

[2] https://github.com/brion/mtpng

[3] https://emmaliu.info/15418-Final-Project/

Const-me · on Aug 29, 2021

GPGPU is the way to go.

Not terribly hard if you only need 1-2 formats supported, e.g. RGBA8 only. You don't need to port the complete codec, only some initial portion of the pipeline and stream data back from GPUs, the last steps with lossless compression of the stream ain't a good fit for GPUs.

If you want the code to run on a web server, after you'll debug the encoder your next problem is where to deploy. NVidia teslas are frickin expensive. If you wanna run on public clouds, I'd consider their VMs with AMD GPUs.

pgroves · on Aug 29, 2021

Thanks, I hadn't heard of that and I will look into it. This is a research setting with plenty of hardware we can request and not a huge number of users so that part doesn't worry me.

Const-me · on Aug 29, 2021

> This is a research setting with plenty of hardware we can request and not a huge number of users

If you don’t care about cost of ownership, use CUDA. It only runs on nVidia GPUs, but the API is nice. I like it better than vendor-agnostic equivalents like DirectCompute, OpenCL, or Vulkan Compute.

physicles · on Aug 30, 2021

I solved a similar problem last year. As others have said, your bottleneck is the compression scheme that PNG uses. Turning down the level of compression will help. If you can build a custom intermediate format, you'll see huge gains.

Here's what that custom format might look like.

(I'm guessing these images are gray scale, so the "raw" format is uint16 or uint32)

First, take the raw data and delta encode it. This is similar to PNG's concept of "filters" -- little processors that massage the data a bit to make it more compressible. Then, since most of the compression algorithms operate on unsigned ints, you'll need to apply zigzag encoding (this is superior to allowing integer underflow, as benchmarks will show).

Then, take a look at some of the dedicated integer compression algorithms. Examples: FastPFor (or TurboPFor), BP32, snappy, simple8b, and good ol' run length encoding. These are blazing fast compared to gzip.

In my use case, I didn't care how slow compression was, so I wrote an adaptive compressor that would try all compression profiles and select the smallest one.

Of course, benchmark everything.

bjornlouser · on Aug 29, 2021

> Is there a way to use multiple threads or GPU

Maybe you could write the png without compression, compress chunks of the image in parallel using 7z, then reconstitute and decompress on the client side.

pgroves · on Aug 29, 2021

This is on our list of possibilities. It would take a little more time than I'd like to spend on this problem but it would work.

primitivesuave · on Aug 29, 2021

I would also be interested in knowing the answer to this. Currently we use OpenSeadragon to generate a map tiling of whole slide images (~4 GB per image), then stitch together and crop tiles of a particular zoom layer to produce PNGs of the desired resolution.

yboris · on Aug 29, 2021

I'm unsure if this will help, but the new image format JPEG XL (.jxl) is coming soon to replace JPEG. It will have a lossless and a lossy abilities. It claims to be faster than JPEG.

Another neat feature is that it's designed to be progressive, so you could host a single 10mb original file, and the client can download just the first 1mb (up to the quality they are comfortable with).

Take a look: https://jpegxl.info/

pgroves · on Aug 29, 2021

This is a research university that moves very slow, so waiting two years for something better is actually a possibility (and prerendering to S3 works ok for now). I'll keep this bookmarked.

dehrmann · on Aug 29, 2021

Since this is Python, which encoder are you using? I'd make sure it's in C, not Python. You might also be spending a lot of time converting numpy arrays to Python arrays.

pella · on Aug 29, 2021

also check the FPGA cards (ask the Xilinx; Altera/Intel, ...)