smusamashah's comments | Hacker News

I remember playing a Matrix mod. It was awesome. There was this famous fight near stairs with so much bullet time and effects. Another was near an elevator with pillars.

It was also part of 3DMark 2001 back then; I used it up until the mid-2010s to see the evolution of graphics cards.

https://youtu.be/VQql9LqczXI?si=zWNbTaGYOQTLWjZX


I don't understand the lines connecting two pieces of text. In most cases, the connected words have absolutely zero connection with each other.

In "Father wound" the words "abandoned at birth" are connected to "did not". Which makes it look like those visual connections are just a stylistic choice and don't carry any meaning at all.


Yes, they look really good but they're being connected by an LLM.

I had the exact same impression.

If this model is so good at estimating depth from a single image, shouldn't it also be able to take multiple images as input and estimate even better? But searching a bit, it looks like this is supposed to be single-image-to-3D only. I don't understand why it does not (cannot?) work with multiple images.

It's using Apple's SHARP method, which is monocular. https://apple.github.io/ml-sharp/

I also feel like a heavily multimodal model could be very nice for this: allow multiple images from various angles, optionally some true depth data even if imperfect (like what a basic phone LIDAR would output), why not even photos of the same place from other sources at other times (just to gather more data), and based on that generate a 3D scene you can explore, using generative AI to fill in plausible content for what is missing.

If you have multiple images you could use photogrammetry.

In the end, if you want to "fill in the blanks", an LLM will always "make up" stuff based on all of its training data.

With a technology like photogrammetry you can get much better results. Therefore, if you have multiple angled images and don't really need to make up stuff, it's better to use that.


You could use both. Photogrammetry requires you to have a lot of additional information, and/or to make a lot of assumptions (e.g. about the camera, specific lens properties, medium properties, material composition and properties, etc. - and what reasonable ranges of values are in context), if you want it to work well for general cases, as otherwise the problem you're solving is underspecified. In practice, even enumerating those assumptions is a huge task, much less defending them. That's why photogrammetry applications tend to be used for solving very specific problems in select domains.

ML models, on the other hand, are, in a big way, intuitive assumption machines. Through training, they learn what's likely and what's not, given both the input measurements and the state of the world. They bake in knowledge of what kinds of cameras exist, what kinds of measurements are being made, and what results make sense in the real world.

In the past I'd have said that for best results, we should combine the two approaches - have AI supply assumptions and estimates for an otherwise explicitly formal, photogrammetric approach. Today, I'm no longer convinced that's the case - because relative to the fuzzy world-modeling part, the actual math seems trivial and well within the capabilities of ML models to do correctly. The last few years demonstrated that ML models are capable of internally modeling calculations and executing them, so I now feel it's more likely that a sufficiently trained model will just do the photogrammetry calculations internally. See also: the Bitter Lesson.


Surely this is not an LLM?

I'm going to guess this is because the image to depth data, while good, is not perfectly accurate and therefore cannot be a shared ground truth between multiple images. At that point what you want is a more traditional structure from motion workflow, which already exists and does a decent job.

Also, are we allowed to use this model? Apple had a very restrictive licence, IIRC?
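
For anyone curious, the "traditional structure from motion" route mentioned above boils down to something like this rough two-view sketch (using OpenCV as an assumed example; the file names and intrinsics are made-up placeholders, and real pipelines such as COLMAP add calibration, many views, and bundle adjustment):

    import cv2
    import numpy as np

    # Two photos of the same scene from different viewpoints (hypothetical files).
    img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

    # Detect and match local features between the two views.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # Assumed pinhole intrinsics; a real pipeline calibrates or reads EXIF.
    K = np.array([[1000.0, 0.0, img1.shape[1] / 2],
                  [0.0, 1000.0, img1.shape[0] / 2],
                  [0.0, 0.0, 1.0]])

    # Recover the relative camera pose from the essential matrix.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Triangulate the matches into a sparse 3D point cloud (up to scale).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    pts3d = (pts4d[:3] / pts4d[3]).T

Unlike a monocular model, the geometry here is measured rather than guessed, which is why the shared-ground-truth concern above matters: the two views constrain each other.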


Multi-view approaches tend to have a very different pipeline.

Do any of the top models let you pause and think while speaking? I have to speak non-stop to Gemini Assistant and ChatGPT, which is very unnatural and makes voice mode nearly useless. Especially for non-English speakers, probably. I sometimes have to think more to translate my thoughts into English.

Have you tried talking to ChatGPT in your native tongue? I was blown away by my mother speaking her native tongue to ChatGPT and having it respond in that language. (It's ever so slightly not a mainstream one.)

Even in my own language I can't talk without any pauses.

https://samwho.dev/ (e.g. Reservoir sampling, Load balancing)

https://imadr.me/pbr/ (physically based rendering)


Re-watched on a computer monitor this time. The cut is between 00:00:19 and 00:00:20 of the demo.

I don't understand the UI at all. When I click All or something within brackets, what am I supposed to see? Covers similar to what I clicked? But the covers I see don't seem similar to me at all, no matter what I click. What am I missing? Or maybe I am expecting a different kind of similarity.

The confusion is understandable as the comparison is basic and uses image hashes (https://pypi.org/project/ImageHash/), which are pretty surface level and don't always provide reliable "this image is obviously very similar to that one" results.

You are correct that when you click something in the brackets, the results returned are covers similar to what you clicked.

Still have a lot of room for improvement as I go further down this image matching rabbit hole, but the comparison's current state does provide some useful results every so often.
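
For context, the ImageHash comparison mentioned above boils down to something like this (a minimal sketch; the file names and the threshold are made up, not necessarily the site's exact setup):

    from PIL import Image
    import imagehash

    # A perceptual hash is a tiny fingerprint of an image's overall structure.
    hash_a = imagehash.phash(Image.open("cover_a.jpg"))
    hash_b = imagehash.phash(Image.open("cover_b.jpg"))

    # Subtracting two hashes gives the Hamming distance: 0 means identical
    # fingerprints, larger values mean the covers diverge more.
    distance = hash_a - hash_b
    print(f"Hamming distance: {distance}")
    if distance <= 8:  # threshold is a judgment call, not a library default
        print("Covers look structurally similar")

Because the hash only captures coarse structure (layout, brightness), two covers that a human would call similar can still land far apart, which matches the confusion described above.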


As suggested in another comment, CLIP should give you images that actually look similar. This is a great collection of images, and using those you will find very similar covers.

http://same.energy/search?i=yN5ou

Also https://mood.zip/
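
For what it's worth, the CLIP suggestion above comes down to comparing embedding vectors rather than pixel-level hashes. A minimal sketch using the sentence-transformers CLIP wrapper (the model name and file names are just illustrative):

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    # CLIP embeds images into a space where "looks similar" roughly matches
    # human intuition, rather than matching pixel layout.
    model = SentenceTransformer("clip-ViT-B-32")

    covers = ["cover_a.jpg", "cover_b.jpg", "cover_c.jpg"]
    embeddings = model.encode([Image.open(p) for p in covers])

    # Cosine similarity ranks how alike the first cover is to the others.
    scores = util.cos_sim(embeddings[0], embeddings[1:])
    print(scores)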


What about Sonnet 4.5? I used both Opus and Sonnet on Claude.ai and found Sonnet much better at following instructions and doing exactly what was asked.

(it was for single html/js PWA to measure and track heart rate)

Opus seems to go less deep, does its own thing, and does not follow instructions exactly EVEN IF I WRITE IN ALL CAPS. With Sonnet 4.5 I can understand everything the author is saying. Maybe Opus is optimised for Claude Code and Sonnet works best on the web.


What kind of things are people building that can be almost completely built automatically like this?

I have a feeling most of these folks are talking about personal projects or work on relatively small products. I have a good amount of personal projects that I haven’t written a line of code for. After bootstrapping an MVP, I can almost entirely drive by having Claude pick up GitHub issues. They’re small codebases though.

My day job is mostly gigantic codebases that still seem to choke the best models. Also there's zero way I'd be allowed to Tailscale to my work computer from my phone.


From my perspective: tons of very simple, duplicated software. The bad thing is - there is a lot of space in different markets for such software. Here in Poland you can earn a pretty decent living as a mediocre programmer just building simple automations for small companies. I was raised in a way that I still don't have the courage to switch to such an approach, but doing this for 3-4 such entities, I can see how you can make a living from it. With LLMs you can automate 90+% of the job, if not more.

I'm kind of confused too. I spend so much time testing and reviewing code that I couldn't possibly keep up with 4 agents.

I'm wondering about the same thing; I imagine it's good for posting in #hustleculture circles.

One very common thing I do is think of a small feature and ask Claude Code for Web to implement it. It works very well.

Looking at this, it seems the moiré pattern is just waves/signals interacting with each other. Also, low-cost effects can be created using these patterns.
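
That interference intuition is easy to reproduce: overlay two gratings at slightly different angles and a low-frequency beat pattern appears. A quick numpy/matplotlib sketch (my own illustration, not taken from the linked page):

    import numpy as np
    import matplotlib.pyplot as plt

    # Two line gratings - effectively 2D "signals" - at slightly different angles.
    x, y = np.meshgrid(np.linspace(0, 1, 800), np.linspace(0, 1, 800))
    freq = 60  # lines per unit length
    grating1 = np.sin(2 * np.pi * freq * x)
    theta = np.deg2rad(3)  # rotate the second grating by a few degrees
    grating2 = np.sin(2 * np.pi * freq * (x * np.cos(theta) + y * np.sin(theta)))

    # Overlaying them (like stacking two transparencies) produces
    # low-frequency moiré fringes - the "waves interacting" effect.
    plt.imshow(grating1 * grating2, cmap="gray")
    plt.axis("off")
    plt.show()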
