I wouldn't have expected any functions to be 5KiB or larger. That seems like a l...

jeffbee · on Sept 11, 2020

They get larger the more aggressively you inline code.

Edited: I totally forgot about this great paper that shows the size of the top 100 most-executed functions at Google. See Figure 8.

https://people.ucsc.edu/~hlitz/papers/asmdb.pdf

Here's a gigantic function I found laying about: https://gist.github.com/jwbee/2167565e043578000ae489f67a82d6...

Protocol Buffers message decoders tend to be extremely large, since each message has an auto-generated parse function that will tend to inline everything. My local build of protobuf contains 20 functions each larger than 5KB, including a 10KB-long specialization of std::sort.

Someone · on Sept 11, 2020

That google paper has a nice observation on glibc’s memcmp:

“memcmp clearly stands out of the correlation between call frequency and function size in Figure 8. It is both extremely frequent, and, at almost 6 KiB of code, 10× larger than memcpy which is conceptually of similar complexity. Examining its layout and execution patterns (Figure 13) suggests that it does suffer from the high amount of fragmentation we observed fleetwide in the previous section. While covering 90% of executed instructions in memcmp only requires two cache lines, getting up to 99% coverage requires 41 lines, or 2.6 KiB of cache capacity. Not only is more than 50% of code cold, but it is also interspersed between the relatively hot regions, and likely unnecessarily brought in by prefetchers. Such bloat is costly – performance counter data collected by GWP indicates that 8.2% of all i-cache misses among the 100 hottest functions are from memcmp alone.

A closer look at the actual code from glibc can explain the execution patterns in Figure 13. It is hand-written in assembly and precompiled, with extensive manual loop unrolling, many conditional cases for the various alignments of the two source arrays, and large jump tables.

In our experience, code usually evolves into similar state from over-reliance on micro-optimization and micro-benchmarking. While writing in assembly can in rare cases be a necessary evil, it prevents the compiler from doing even the most basic feedback-directed code layout optimizations.

[…]

We tested this hypothesis by macro-benchmarking a version of memcmp that is specifically optimized for code size (only 2 i-cache lines) and locality. In short, it only special-cases very small string sizes (to aid the compiler in inlining very fast cases) and falls back to rep cmps for larger compares. Even though it achieves slightly lower throughput numbers than the glibc version in micro-benchmarks, this simple proof-of-concept showed an overall 0.5%-1% end-to-end performance improvement on large-footprint workloads like web search.”

I would guess the assembly version can be improved by moving those cold blocks out of the cache lines of the hot code.

jcranmer · on Sept 11, 2020

Totally non-scientific measurement (I took a few applications I had lying around, got size of symbols with nm -S) suggests that you're looking at around 0.1-1% of all functions are larger than 5KiB.

Interestingly, the largest function all cases seemed to be something on the order of "initialize a large static hashtable."

mac01021 · on Sept 11, 2020

I've seen plenty of c/c++/java functions that exceeded 200 lines of high level code. It seems reasonable to me that such a function would be 5k bytes of machine code, especially on a 64-bit architecture.

Not that such functions are the right way to structure your program...

account42 · on Sept 11, 2020

> especially on a 64-bit architecture

Pointer size has little impact on code size. amd64 code can even be smaller than x86 in some cases.

knaq · on Sept 11, 2020

I think you could easily get past a megabyte with auto-generated parser code. That would be stuff like lex, yacc, bison, antlr, flex, etc.

Anything with a big switch would qualify too, such as an emulator.