Huh, I hadn't thought I'd ever see MiniZinc outside of my university. I keep being pleasantly surprised that constraint programming and formal methods are being used somewhere out there.
The hats in the discrete optimization course are indeed good and it's a fun course. Highly recommended!
It's extremely rare that I find myself needing to pull out the big boy tools at $DAYJOB, but it's good to know they exist and how to use them. They helped out on one of the later days of Advent of Code, too.
What I like about these notes is that a complex topic is described in a simple and very practical manner. By the way, a few years ago there was another project that used declarative constraint solving to build a code generator:
This is reminiscent of an argument I had with the Mercury Prolog guys regarding "typing" in logic programming. My point boils down to this:
Any predicate can be considered a constraint. Types are constraints. While it may be reasonable to have syntactic sugar for type declarations that is transformed into predicates at compile time, it is unreasonable to lard a completely different kind of semantics on top of an already adequate one such as first-order logic.
Holy shit, this is an excellent tour. I want to look at constraint-based
register allocation for my next project, I think. I also wanted to see if it
were possible to synthesize a data format and short x86 opcode sequences that
would satisfy certain mathematical laws, but I did not figure out how to do
that. Maybe we should chat.
In particular, I love the tiny tiny Union-Find!
uf = {}

def find(x):
    # Walk parent links until x has no parent (it's a root).
    while x in uf:
        x = uf[x]
    return x

def union(x, y):
    x = find(x)
    y = find(y)
    if x != y:
        uf[x] = y  # graft x's root under y's root
    return y
That's even smaller than the impl I use in toy projects. Wow.
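For comparison, path compression costs only a couple of extra lines in this dict-based style. A hypothetical variant (not from the post, and the demo values are made up):

```python
uf = {}

def find(x):
    root = x
    while root in uf:          # walk parent links to the root
        root = uf[root]
    while x != root:           # path compression: repoint the walk
        uf[x], x = root, uf[x]
    return root

def union(x, y):
    x, y = find(x), find(y)
    if x != y:
        uf[x] = y              # graft one root under the other
    return y

# Hypothetical demo values:
union("a", "b")
union("b", "c")
find("a")                      # "c"; uf["a"] now points straight at "c"
```

Without compression (or union by rank) the worst-case find is linear in the chain length, which is usually fine for toy projects but worth knowing.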
I've actually tried to understand the original paper at one point and translated all the complicated Greek pseudo-code into, umm... regular pseudo-code to use at some future point. I spent a little time pondering how to combine it with global value numbering (whether that's even worthwhile to pursue is still an open question) for an added 'cheap' optimization, and to manage a Java-like local variable heap(?), but I haven't done anything with it yet.
It's fascinating how register allocation is not a solved problem. Thinking about it, it seems like optimal allocation would depend on the input/workload, so any ahead-of-time selection will always be a compromise.
It's NP-hard. You can solve it perfectly for a given instance using a constraint solver and adequate patience.
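To make "perfectly for a given instance" concrete: build the interference graph, then ask for the minimum number of colors. A toy brute-force sketch in Python, standing in for a real constraint solver (the function name and instance are made up for illustration):

```python
from itertools import product

def min_registers(variables, interferes):
    # Smallest k such that the interference graph is k-colorable,
    # found by trying every assignment (fine for toy instances only).
    for k in range(1, len(variables) + 1):
        for assignment in product(range(k), repeat=len(variables)):
            color = dict(zip(variables, assignment))
            if all(color[a] != color[b] for a, b in interferes):
                return k, color

# Toy instance: a-b and b-c are live at the same time, a and c never are.
k, coloring = min_registers(["a", "b", "c"], [("a", "b"), ("b", "c")])
# k == 2: a and c can share a register, b gets the other.
```

A real solver replaces the exhaustive loop with propagation and learning, but the model is the same: one variable per live range, one disequality per interference edge.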
Unfortunately instruction selection is also NP-hard, and so is instruction scheduling, and all three are inter-related. An optimal solution for one tends to be suboptimal for the other two.
Ideally one would feed all three problems into a constraint solver together. Thus far this is considered computationally infeasible, though I don't think that belief would bear up to scrutiny with a supercomputer.
>You can solve it perfectly for a given instance using a constraint solver and adequate patience.
Is this really true for cases where there is no choice but to spill registers? How would a graph coloring solver understand to spill outer loop variables instead of inner loop ones? I've never written a "proper" register allocator so I'm curious.
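One classic answer is that allocators don't color blindly: they attach a spill cost to each live range, typically weighting every use by 10^loop_depth so inner-loop variables look expensive to spill. This is the Chaitin-style heuristic, not anything specific from this thread; names and numbers below are made up:

```python
def spill_candidate(uses, degree):
    # uses: var -> list of loop depths of each use/def
    # degree: var -> number of neighbors in the interference graph
    def cost(v):
        # Chaitin-style estimate: a use at loop depth d costs ~10**d
        # memory operations if spilled; a high degree makes spilling
        # more attractive because it frees a color for more neighbors.
        return sum(10 ** d for d in uses[v]) / degree[v]
    return min(uses, key=cost)

# Hypothetical: "i" is used twice inside a loop (depth 1), "t" three
# times at top level (depth 0); both interfere with two other values.
victim = spill_candidate({"i": [1, 1], "t": [0, 0, 0]}, {"i": 2, "t": 2})
# victim == "t": cost 1.5 vs 10.0, so the outer-level value is spilled.
```

In a constraint formulation the same idea shows up as spill decisions with per-range costs in the objective function, so the solver makes the trade-off exactly rather than heuristically.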
I think one of the biggest problems is even stating what the solution of an optimal compiler would be. How does one model a program or a CPU? There almost certainly isn't one perfect model for all situations. CPUs are extremely complicated despite the surface simplicity of assembly. Naive notions of registers or instruction ordering don't really exist. But also because of this, perfect compilation isn't that crucial: the CPU kind of JITs for you. The only objective function that seems pretty clear to me is code size.
Maybe we should just make our ISAs memory-memory, and let the chips map memory addresses onto their hundreds of registers internally, serving as an L0 cache of sorts: after all, we programmers don't manage the L1/L2/L3 caches manually, do we? The CPUs do it transparently for us.
Right, yeah. Modern CPUs seem to have all sorts of abstract optimizations. It's kind of strange how we meet in the middle with hardware ISA manufacturers: they do all sorts of tricks on their side to make code go faster, and we try to generate machine code that we think will go faster, but neither side works with the other (compiler writers and ISA developers). I bet there is easy low-hanging fruit here.
There sort of is, but isn't that why stuff like VLIW exists? So that the compiler can optimize the hell out of the machine code, and the CPU doesn't have to do its own OoO nonsense?
VLIW is what the Unison project (the context behind this post) was attacking. https://unison-code.github.io/ My best understanding is that exposing the details of the CPU to the compiler via a more complex ISA has not worked that well. Possibly because the compilation problem just gets too hard, but maybe also for momentum and ecosystem reasons. I think another lesson is that it is unwise to make your ISA expose the microarchitecture, because the microarchitecture changes quite a bit between generations and individual CPUs. Things like delay slots seem like misfeatures now.
I do like short programs :). Code size does matter for instruction cache. In my application in particular, post hoc binary patching, a single byte can make a big difference in difficulty of patching.
Certainly optimal would be global, and I'm pretty sure it's isomorphic to bin packing, whether you use a fixed call interface or not. So it's solved; it's just exponential.
I'm not sure. Given a hypothetical arch with only one free register, which local variable do you allocate to it?
#include <stdlib.h>  /* atoi */

int
main(int argc, char *argv[])
{
    int sum = 0;
    int sum2 = 0;

    if (argc >= 2)
        for (int i = 0; i < atoi(argv[1]); i++)
            sum += i;

    if (argc >= 3)
        for (int i = 0; i < atoi(argv[2]); i++)
            sum2 += sum + i;

    return 0;
}
> given a hypothetical arch with only 1 free register, which local variable do you allocate to it?
A truly good compiler would use the same register for sum and sum2, and for argv[1] and argv[2] if it needs to (even if they weren't dead, as in this toy example).
With only one register, the answer will probably be “none or every single one”, depending on the instruction set, though.
For example, how do you index into arrays? Self-modifying code, as is often done on the 6502? Or do you have stack-relative indirect addressing that lets you say “add the contents of the address pointed to by the value at stack pointer plus 6 to the register”? Either way, there will be lots of contention for that single register.
Or does this hypothetical architecture have various complex “add the contents of addresses foo[register] and bar and store the result in baz[3]” instructions? (That would probably make it a very bad design. Adding a second register would be about the first thing to do (fun fact: the 4004 had 16 registers), but maybe this hypothetical design has RAM that's as fast as registers?)
If you're willing to work in the SSA domain, and not do any spill/fill, you can do optimal RA in polynomial time. (Usually quoted as linear, but quadraticish IRL.)
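For context on why that works: interference graphs of SSA programs are chordal, and greedy coloring along the reverse of a perfect elimination order (which dominance order provides) is optimal on chordal graphs. A sketch of just the coloring step, with a made-up graph and the order assumed given:

```python
def chordal_color(order, neighbors):
    # Greedy coloring in the given order; on a chordal graph this is
    # optimal when `order` is the reverse of a perfect elimination
    # order, which dominance order on SSA live ranges provides.
    color = {}
    for v in order:
        taken = {color[n] for n in neighbors[v] if n in color}
        # Smallest register index no already-colored neighbor uses.
        color[v] = next(c for c in range(len(neighbors[v]) + 1)
                        if c not in taken)
    return color

# Made-up chordal graph: a 4-cycle a-b-c-d-a with chord b-d.
g = {"a": ["b", "d"], "b": ["a", "c", "d"],
     "c": ["b", "d"], "d": ["a", "b", "c"]}
colors = chordal_color(["a", "b", "c", "d"], g)
# Uses 3 colors, matching the largest clique, e.g. {a, b, d}.
```

The polynomial bound only covers coloring, which is why the parent comment has to exclude spill/fill: deciding what to spill is where the NP-hardness comes back in.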