Huh, I hadn't thought I'd ever see MiniZinc outside of my university. I keep being pleasantly surprised that constraint programming and formal methods are being used somewhere out there.
The hats in the discrete optimization course are indeed good and it's a fun course. Highly recommended!
It's extremely rare that I find myself needing to pull out the big boy tools at $DAYJOB, but it's good to know they exist and how to use them. They helped out on one of the later days of Advent of Code, too.
What I like about these notes is that a complex topic is described in a simple and very practical manner. By the way, a few years ago there was another project that used declarative constraint solving to build a code generator:
This is reminiscent of an argument I had with the Mercury Prolog guys regarding "typing" in logic programming. My point boils down to this:
Any predicate can be considered a constraint. Types are constraints. While it may be reasonable to have syntactic sugar for type declarations that is transformed into predicates at compile time, it is unreasonable to lard a completely different kind of semantics on top of an already adequate one such as first-order logic.
Holy shit, this is an excellent tour. I want to look at constraint-based
register allocation for my next project, I think. I also wanted to see if it
were possible to synthesize a data format and short x86 opcode sequences that
would satisfy certain mathematical laws, but I did not figure out how to do
that. Maybe we should chat.
In particular, I love the tiny tiny Union-Find!
uf = {}

def find(x):
    # Walk parent links until x has no parent (it's a root).
    while x in uf:
        x = uf[x]
    return x

def union(x, y):
    x = find(x)
    y = find(y)
    if x != y:
        uf[x] = y  # graft x's root under y's root
    return y
That's even smaller than the impl I use in toy projects. Wow.
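For comparison, path compression costs only a couple of extra lines in this dict-based style. A hypothetical variant (not from the post, and the demo values are made up):

```python
uf = {}

def find(x):
    root = x
    while root in uf:          # walk parent links to the root
        root = uf[root]
    while x != root:           # path compression: repoint the walk
        uf[x], x = root, uf[x]
    return root

def union(x, y):
    x, y = find(x), find(y)
    if x != y:
        uf[x] = y              # graft one root under the other
    return y

# Hypothetical demo values:
union("a", "b")
union("b", "c")
find("a")                      # "c"; uf["a"] now points straight at "c"
```

Without compression (or union by rank) the worst-case find is linear in the chain length, which is usually fine for toy projects but worth knowing.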
I've actually tried to understand the original paper at one point and translated all the complicated Greek pseudo-code into, umm... regular pseudo-code to use at some future point. I spent a little time pondering how to combine it with global value numbering (whether that's even worthwhile to pursue is still an open question) for an added 'cheap' optimization, and to manage a Java-like local variable heap(?), but I haven't done anything with it yet.
It's fascinating how register allocation is not a solved problem. Thinking about it, it seems like optimal allocation would depend on the input/workload, so any ahead-of-time selection will always be a compromise.
It's NP-hard. You can solve it perfectly for a given instance using a constraint solver and adequate patience.
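To make "perfectly for a given instance" concrete: build the interference graph, then ask for the minimum number of colors. A toy brute-force sketch in Python, standing in for a real constraint solver (the function name and instance are made up for illustration):

```python
from itertools import product

def min_registers(variables, interferes):
    # Smallest k such that the interference graph is k-colorable,
    # found by trying every assignment (fine for toy instances only).
    for k in range(1, len(variables) + 1):
        for assignment in product(range(k), repeat=len(variables)):
            color = dict(zip(variables, assignment))
            if all(color[a] != color[b] for a, b in interferes):
                return k, color

# Toy instance: a-b and b-c are live at the same time, a and c never are.
k, coloring = min_registers(["a", "b", "c"], [("a", "b"), ("b", "c")])
# k == 2: a and c can share a register, b gets the other.
```

A real solver replaces the exhaustive loop with propagation and learning, but the model is the same: one variable per live range, one disequality per interference edge.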
Unfortunately instruction selection is also NP-hard, and so is instruction scheduling, and all three are inter-related. An optimal solution for one tends to be suboptimal for the other two.
Ideally one would feed all three problems into a constraint solver together. Thus far this is considered computationally infeasible, though I don't think that belief would bear up to scrutiny with a supercomputer.
>You can solve it perfectly for a given instance using a constraint solver and adequate patience.
Is this really true for cases where there is no choice but to spill registers? How would a graph coloring solver understand to spill outer loop variables instead of inner loop ones? I've never written a "proper" register allocator so I'm curious.
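One classic answer is that allocators don't color blindly: they attach a spill cost to each live range, typically weighting every use by 10^loop_depth so inner-loop variables look expensive to spill. This is the Chaitin-style heuristic, not anything specific from this thread; names and numbers below are made up:

```python
def spill_candidate(uses, degree):
    # uses: var -> list of loop depths of each use/def
    # degree: var -> number of neighbors in the interference graph
    def cost(v):
        # Chaitin-style estimate: a use at loop depth d costs ~10**d
        # memory operations if spilled; a high degree makes spilling
        # more attractive because it frees a color for more neighbors.
        return sum(10 ** d for d in uses[v]) / degree[v]
    return min(uses, key=cost)

# Hypothetical: "i" is used twice inside a loop (depth 1), "t" three
# times at top level (depth 0); both interfere with two other values.
victim = spill_candidate({"i": [1, 1], "t": [0, 0, 0]}, {"i": 2, "t": 2})
# victim == "t": cost 1.5 vs 10.0, so the outer-level value is spilled.
```

In a constraint formulation the same idea shows up as spill decisions with per-range costs in the objective function, so the solver makes the trade-off exactly rather than heuristically.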
I think one of the biggest problems is even stating what the solution of an optimal compiler would be. How does one model a program or a CPU? There almost certainly isn't one perfect model for all situations. CPUs are extremely complicated despite the surface simplicity of assembly. Naive notions of registers or instruction ordering don't really exist. But also because of this, perfect compilation isn't that crucial: the CPU kind of JITs for you. The only objective function that seems pretty clear to me is code size.
Maybe we should just make our ISAs memory-memory, and let the chips map memory addresses onto their hundreds of registers internally, serving as an L0 cache of sorts: after all, we programmers don't manage the L1/L2/L3 caches manually, do we? The CPUs do it transparently for us.
Right, yeah. Modern CPUs seem to have all sorts of abstract optimizations. It's kind of strange how we meet in the middle with hardware ISA manufacturers: they do all sorts of tricks on their side to make code go faster, and we try to generate machine code that we think will go faster, but neither side works with the other (compiler writers and ISA developers). I bet there is easy low-hanging fruit here.
There sort of is, but isn't that why stuff like VLIW exists? So that the compiler can optimize the hell out of the machine code, and the CPU doesn't have to do its own OoO nonsense?
VLIW is what the Unison project (the context behind this post) was attacking. https://unison-code.github.io/ My best understanding is that exposing the details of the CPU to the compiler via a more complex ISA has not worked that well. Possibly because the compilation problem just gets too hard, but maybe also for momentum and ecosystem reasons. I think another lesson is that it is unwise to make your ISA expose the microarchitecture, because the microarchitecture changes quite a bit between generations and individual CPUs. Things like delay slots seem like misfeatures now.
I do like short programs :). Code size does matter for instruction cache. In my application in particular, post hoc binary patching, a single byte can make a big difference in difficulty of patching.
Certainly optimal would be global, and I'm pretty sure it's isomorphic to bin packing, whether you use a fixed call interface or not. So it's solved; it's just exponential.
I'm not sure. Given a hypothetical arch with only one free register, which local variable do you allocate to it?
#include <stdlib.h>  /* atoi */

int
main(int argc, char *argv[])
{
    int sum = 0;
    int sum2 = 0;

    if (argc >= 2)
        for (int i = 0; i < atoi(argv[1]); i++)
            sum += i;

    if (argc >= 3)
        for (int i = 0; i < atoi(argv[2]); i++)
            sum2 += sum + i;

    return 0;
}
> given a hypothetical arch with only 1 free register, which local variable do you allocate to it?
A truly good compiler would use the same register for sum and sum2, and for argv[1] and argv[2] if it needs to (even if they weren't dead, as in this toy example).
With only one register, the answer will probably be “none or every single one”, depending on the instruction set, though.
For example, how do you index into arrays? Self-modifying code, as is often done on the 6502? Or do you have stack-relative indirect addressing that lets you say “add the contents of the address pointed to by the value at stack pointer plus 6 to the register”? Either way, there will be lots of contention for that single register.
Or does this hypothetical architecture have various complex “add the contents of addresses foo[register] and bar and store the result in baz[3]” instructions? (That would probably make it a very bad design. Adding a second register would be about the first thing to do (fun fact: the 4004 had 16 registers), but maybe this hypothetical design has RAM that's as fast as registers?)
If you're willing to work in the SSA domain, and not do any spill/fill, you can do optimal RA in polynomial time. (Usually quoted as linear, but quadraticish IRL.)
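For context on why that works: interference graphs of SSA programs are chordal, and greedy coloring along the reverse of a perfect elimination order (which dominance order provides) is optimal on chordal graphs. A sketch of just the coloring step, with a made-up graph and the order assumed given:

```python
def chordal_color(order, neighbors):
    # Greedy coloring in the given order; on a chordal graph this is
    # optimal when `order` is the reverse of a perfect elimination
    # order, which dominance order on SSA live ranges provides.
    color = {}
    for v in order:
        taken = {color[n] for n in neighbors[v] if n in color}
        # Smallest register index no already-colored neighbor uses.
        color[v] = next(c for c in range(len(neighbors[v]) + 1)
                        if c not in taken)
    return color

# Made-up chordal graph: a 4-cycle a-b-c-d-a with chord b-d.
g = {"a": ["b", "d"], "b": ["a", "c", "d"],
     "c": ["b", "d"], "d": ["a", "b", "c"]}
colors = chordal_color(["a", "b", "c", "d"], g)
# Uses 3 colors, matching the largest clique, e.g. {a, b, d}.
```

The polynomial bound only covers coloring, which is why the parent comment has to exclude spill/fill: deciding what to spill is where the NP-hardness comes back in.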