Thanks for all the great feedback! I've created a Twitter thread to discuss future development and share updates. Would love to connect with you all there:
- The DDN single-shot generator architecture is more efficient than diffusion models.
- DDN is fully end-to-end differentiable, allowing for more efficient optimization when integrated with discriminative models or reinforcement learning.
Thanks for the idea, but DDN and flow can’t be flipped into each other that easily.
1. DDN doesn’t need to be invertible.
2. Its latent is discrete, not continuous.
3. As far as I know, flow keeps input and output the same size so it can compute log|detJ| (see the change-of-variables formula below). DDN’s latent is 1-D and discrete, so that condition fails.
4. To me, “hierarchical many-shot generation + split-and-prune” is simpler and more general than “invertible design + log|detJ|.”
5. Your design seems to have abandoned the characteristics of DDN. (ZSCG, 1D tree latent, lossy compression)
The two designs start from different premises and are built differently. Your proposal would change so much that whatever came out wouldn’t be DDN any more.
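For reference, the condition in point 3 is the change-of-variables objective that normalizing flows train with; the Jacobian determinant is only defined when the map keeps the dimensionality:

```latex
% Invertible map f: R^n -> R^n, latent z = f(x) with prior p_Z
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log\bigl|\det J_f(x)\bigr|
```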
Fwiw, I'm not convinced it's a Flow, and that's my niche. But there are some interesting similarities that actually make me uncertain. A deeper dive is needed.
But to address your points:
> 1. DDN doesn’t need to be invertible
The flow doesn't need to be invertible at every point in the network. As long as you can do the mapping, the condition will hold. Like the classic coupling layer is [x_0, s(x_0)*x_1 + t(x_0)], where s and t are parametrized by arbitrary neural networks. But some of your layers look more like invertible convolutions.
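For concreteness, a minimal PyTorch sketch of that kind of affine coupling layer. This is generic RealNVP-style code, not anything from the DDN repo, and the class/argument names are just illustrative:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling: keep x0 untouched, transform x1 conditioned on x0.
    The layer is invertible even though s and t are arbitrary (non-invertible) nets."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.s = nn.Sequential(nn.Linear(half, half), nn.Tanh())  # log-scale net
        self.t = nn.Linear(half, half)                            # translation net

    def forward(self, x):
        x0, x1 = x.chunk(2, dim=-1)
        s = self.s(x0)
        y1 = x1 * torch.exp(s) + self.t(x0)      # elementwise affine transform of x1
        log_det = s.sum(dim=-1)                  # log|det J| = sum of log-scales
        return torch.cat([x0, y1], dim=-1), log_det

    def inverse(self, y):
        y0, y1 = y.chunk(2, dim=-1)
        x1 = (y1 - self.t(y0)) * torch.exp(-self.s(y0))
        return torch.cat([y0, x1], dim=-1)
```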
I think it is worth checking. FWIW I don't think an equivalence would undermine the novelty here.
> 2. Its latent is discrete, not continuous.
That's perfectly fine. Flows aren't restricted that way. Technically, flows trained on discrete data aren't exactly invertible anyway, since you add noise to dequantize it.
Also note that there are discrete flows. I'm not sure I've seen an implementation where each flow step is discrete, but that's more of an implementation issue.
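A minimal sketch of the dequantization I mean, for 8-bit image data; this is the standard trick, not something specific to DDN or any particular codebase:

```python
import torch

def dequantize(x_uint8):
    """Standard uniform dequantization for training continuous flows on 8-bit data:
    add Uniform[0, 1) noise to the integer values and rescale to [0, 1).
    The map back to the discrete data is only exact up to this added noise."""
    x = x_uint8.float()
    u = torch.rand_like(x)   # u ~ Uniform[0, 1)
    return (x + u) / 256.0
```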
> 3. As far as I know, flow keeps input and output the same size so it can compute log|detJ|.
You have a U-Net, right? Your full network is doing T: R^n -> R^n? Or at least excluding the extra embedding information? Either way, I think you might be interested in "Approximation Capabilities of Neural ODEs and Invertible Residual Networks". At minimum, their dimensionality discussion and reference to the Whitney Embedding Theorem is likely valuable to you (I don't think they say it by name?).
You may also want to look at RealNVP since they have a hierarchical architecture which does splitting.
Do note that NODEs are flows. You can see Ricky Chen's work on i-ResNets.
As for the Jacobian, I actually wouldn't call that a condition for a flow but it sure is convenient. The typical Flows people are familiar with use a change of variables formula via the Jacobian but the isomorphism is really the part that's important. If it were up to me I'd change the name but it's not lol.
> 5. Your design seems to have abandoned the characteristics of DDN. (ZSCG, 1D tree latent, lossy compression)
I think you're on the money here. I've definitely never seen something like your network before. Even if it turns out to not be its own class I don't think that's an issue. It's not obviously something else but I think it's worth digging into.
FWIW I think it looks more like a diffusion model. A SNODE. Because I think you're right that the invertibility conditions likely don't hold. But in either case, remember that even though you're estimating multiple distributions, that's equivalent to estimating a single distribution.
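One way to read that last remark: by the chain rule, the per-layer distributions you estimate compose into a single joint model over the whole trajectory, something like

```latex
% L per-layer conditionals define one joint distribution
p(x_1, \dots, x_L) \;=\; \prod_{l=1}^{L} p\bigl(x_l \mid x_{1:l-1}\bigr)
```

so "many distributions" and "one distribution" are the same object.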
I think the most interesting thing you could do is plot the trajectories like you'll find in Flow and diffusion papers. If you get crossing you can quickly rule out flows.
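Sketch of what I mean, assuming 2-D toy data and some way of recording each layer's intermediate output per sample; the helper and array layout here are hypothetical, not part of the DDN code:

```python
import matplotlib.pyplot as plt

def plot_trajectories(trajectories):
    """trajectories: NumPy array of shape (num_samples, num_layers, 2);
    trajectories[i, l] is sample i's intermediate 2-D output after layer l."""
    for traj in trajectories:
        plt.plot(traj[:, 0], traj[:, 1], alpha=0.5, lw=0.8)  # path across layers
        plt.scatter(traj[-1, 0], traj[-1, 1], s=8)           # final sample
    plt.xlabel("dim 0")
    plt.ylabel("dim 1")
    plt.title("Per-sample trajectories across layers")
    plt.show()

# If paths cross in state space, that's strong evidence the map isn't a
# deterministic flow, since flow/ODE trajectories can't intersect at the same time.
```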
I'm definitely going to spend more time with this work. It's really interesting. Good job!
This understanding is incorrect. The video samples all the leaf nodes of the entire tree only to visualize the distribution in latent space. In normal use, only the L outputs along a single path are generated.
Oh, it's the selected output; yes, that's what I meant, I was a bit confused. So in the initial design, when you first tried it, did you pass both to the next layer? Or is that something you later found to perform better?
Even in the earliest stages of the DDN concept, we had already decided to pass features down to the next layer.
I never even ran an ablation that disabled the stem features; I assume the network would still train without them, but since the previous layer has already computed the features, it would be wasteful not to reuse them. Retaining the stem features also lets DDN adopt the more efficient single-shot-generator architecture.
Another deeper reason is that, unlike diffusion models, DDN does not need the Markov-chain property between adjacent layers.
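For readers skimming the thread, here is a rough schematic of that "pass the selected branch down a single path" behaviour. It is a simplified sketch with made-up module names and a random picker standing in for guided selection, not the actual DDN implementation:

```python
import torch

def sample_single_path(layers, init_feat, num_candidates, pick=None):
    """Schematic DDN-style sampling: each layer produces num_candidates candidate
    outputs plus their features; exactly one candidate is kept (randomly here;
    a guidance function `pick` would stand in for ZSCG-style selection) and its
    features are reused by the next layer. Only L outputs are generated, even
    though the full latent tree has num_candidates**L leaves."""
    feat = init_feat
    path = []                              # 1-D discrete latent: chosen index per layer
    output = None
    for layer in layers:                   # each `layer` is assumed to return (candidates, cand_feats)
        candidates, cand_feats = layer(feat)
        k = pick(candidates) if pick is not None else torch.randint(num_candidates, ()).item()
        path.append(k)
        feat = cand_feats[k]               # pass the selected branch's features down
        output = candidates[k]             # this layer's selected output
    return output, path
```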