
I feel like I grokked Perl enough and I still write Perl code, but I also think there are some technical reasons why it declined in popularity in the 2000s and 2010s: all those differences between $, %, and @; the idea of scalar versus list context; overuse of globals; and references. These features can all make sense if you spend enough time in Perl, and can even be defended, but they create a never-ending stream of code that looks right but is wrong, and a lot of complexity with very little benefit.

I think a reasonable solution is “people who find the answer should observe that the question was asked eight years ago, and certainly double-check the answer”. If it’s a question about company internal codebases or operations, then you should have access to see the code or resources the answer is talking about.

I have an overly reductive take on this—it’s Unix environment variables.

You have your terminal window and your .bashrc (or equivalent), and that sets a bunch of environment variables. But your GUI runs with, most likely, different environment variables. It sucks.

And here’s my controversial take on things—the “correct” resolution is to reify some higher-level concept of environment. Each process should not have its own separate copy of environment variables. Some things should be handled… ugh, I hate to say it… through RPC to some centralized system like systemd.
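
For what it's worth, systemd's user manager already keeps a per-user environment block that you can inspect and update over its bus, which is roughly the shape I have in mind (the variable names below are just examples):

    # inspect the user manager's environment block
    systemctl --user show-environment

    # add or change a variable for services started from now on
    systemctl --user set-environment EDITOR=vim

    # copy selected variables from the current shell into the user manager
    systemctl --user import-environment PATH

It's still a copy-on-spawn model underneath, but at least there's one authoritative place to ask.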


The Windows registry just sort of hovering in the background

Something that is still inheritable, between “there is one and it is global” and “there is a separate copy for each process”.

Bugs can get introduced for other reasons besides “feature not completed”.


> Langan has not produced any acclaimed works of art or science. In this way, he differs significantly from outsider intellectuals like Paul Erdös, Stephen Wolfram, Nassim Taleb, etc.

Paul Erdős is the only outsider intellectual on that list, IMO.

(Also note that ő and ö are different!)


Can you even be called an "outsider" when everyone who recognizes the name associates it with "eccentric but well respected mathematician who was well liked enough in the community that people would regularly let him sleep in their homes for days on end"? According to his Wikipedia page, Erdős collaborated with hundreds of other mathematicians. That's the very opposite of being an outsider IMO.


Of those two, I think agentic crawlers are worse.


That sounds like the desired outcome here. Your agent should respect robots.txt, OR it should be designed to not follow links.


An agent acting on my behalf, following my specific and narrowly scoped instructions, should not obey robots.txt because it's not a robot/crawler, just like a single cURL request doesn't have to obey robots.txt. (It also shouldn't generate any more traffic than a regular browser user.)

Unfortunately "mass scraping the internet for training data" and an "LLM powered user agent" get lumped together too much as "AI Crawlers". The user agent shouldn't actually be crawling.


Confused as to what you're asking for here. You want a robot acting out of spec, to not be treated as a robot acting out of spec, because you told it to?

How does this make you any different than the bad faith LLM actors they are trying to block?


robots.txt is for automated, headless crawlers, NOT user-initiated actions. If a human directly triggers the action, then robots.txt should not be followed.


But what action are you triggering that automatically follows invisible links? Especially links that aren't meant to be followed, with text explicitly saying not to follow them.

This is not banning you for following <h1><a>Today's Weather</a></h1>

If you are a robot so poorly coded that it follows links it clearly shouldn't, links that are explicitly enumerated as not to be followed, that's a problem. From an operator's perspective, how is this different from the case you described?

If a googler kicked off the googlebot manually from a session every morning, should they not respect robots.txt either?


I was responding to someone earlier saying a user agent should respect robots.txt. An LLM powered user-agent wouldn't follow links, invisible or not, because it's not crawling.


It very feasibly could. If I made an LLM agent that clicks on a returned element, and that element was this trap-doored link, that's exactly what would happen.


There's a fuzzy line between an agent analyzing the content of a single page I requested, and one making many page fetches on my behalf. I think it's fair to treat an agent that clicks an invisible link as a robot/crawler since that agent is causing more traffic than a regular user agent (browser).

Just trying to make the point that an LLM powered user agent fetching a single page at my request isn't a robot.


You're equating asking Siri to call your mom to using a robo-dialer machine.


If your specific and narrowly scoped instructions cause the agent, acting on your behalf, to click that link that clearly isn't going to help it--a link that is only being clicked by the scrapers because the scrapers are blindly downloading everything they can find without having any real goal--then, frankly, you might as well be blocked also, as your narrowly scoped instructions must literally have been something like "scrape this website without paying any attention to what you are doing"; an actual agent--just like an actual human--wouldn't find or click that link (and that this is true has nothing at all to do with robots.txt).


If it's a robot it should follow robots.txt. And if it's following invisible links it's clearly crawling.

Sure, a bad site could use this to screw with people, but bad sites have done that since forever in various ways. But if this technique helps against malicious crawlers, I think it's fair. The only downside I can see is that Google might mark you as a malware site. But again, they should be obeying robots.txt.


Your web browser is a robot, and always has been. Even using netcat to manually type your GET request is a robot in some sense, as you have a machine translating your ASCII and moving it between computers.

The significant difference isn't in whether a robot is doing the actions for you or not, it's whether the robot is a user agent for a human or not.


Should cURL follow robots.txt? What makes browser software not a robot? Should `curl <URL>` ignore robots.txt but `curl <URL> | llm` respect it?

The line gets blurrier with things like OAI's Atlas browser. It's just reskinned Chromium, a regular browser, but you can ask an LLM about the content of the page you just navigated to. The decision to use an LLM on that page is made after the page load. Doing the same thing without rendering the page doesn't seem meaningfully different.

In general, robots.txt is for headless automated crawlers fetching many pages, not software performing a specific request for a user. If there's a 1:1 mapping between a user's request and a page load, then it's not a robot. An LLM-powered user agent (browser) wouldn't follow invisible links, or any links, because it's not crawling.


How did you get the URL for curl? Do you personally look for hidden links in pages to follow? This isn't an issue for people looking at the page; it's only a problem for systems that automatically follow all the links on a page.


Yeah, I think the context for my reply got lost. I was responding to someone saying that an LLM-powered user agent (browser) should respect robots.txt. And it wouldn't be clicking the hidden link, because it's not crawling.


Maybe your agent is smart enough to determine that going against the wishes of the website owner can be detrimental to your relationship with that owner, and therefore to the likelihood of the website continuing to exist, so it is prioritizing your long-term interests over your short-term ones.


How does a server tell an agent acting on behalf of a real person from the unwashed masses of scrapers? Do agents send a special header or token that other scrapers can't easily copy?

They get lumped together because they're more or less indistinguishable and cause similar problems: server load spikes, increased bandwidth, increased AWS bill ... with no discernible benefit for the server operator such as increased user engagement or ad revenue.

Now all automated requests are considered guilty until proven innocent. If you want your agent to be allowed, it's on you to prove that you're different. Maybe start by slowing down your agent so that it doesn't make requests any faster than the average human visitor would.


The only real difference is that LLM crawlers tend not to respect /robots.txt, and some of them hammer sites with pretty heavy traffic.

The trap in the article has a link. Bots are instructed not to follow the link. The link is normally invisible to humans. A client that visits the link is probably therefore a poorly behaved bot.
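
In concrete terms, the trap is roughly this shape (the path and markup here are illustrative, not the article's exact ones):

    # robots.txt
    User-agent: *
    Disallow: /trap/

    <!-- buried in the page, invisible to human visitors -->
    <a href="/trap/" style="display:none" rel="nofollow">do not follow</a>

A compliant crawler never requests /trap/ because robots.txt forbids it, and a human never sees the link, so anything that fetches it anyway is overwhelmingly likely to be a misbehaving bot.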


> I don't see it. Can you say what 80% you feel like you're getting?

I read it as “80% of the way to Rust levels of reliability and performance.” That doesn’t mean that the type system or syntax is at all similar, but that you get some of the same benefits.

I might say that, “C gets you 80% of the way to assembly with 20% of the effort.” From context, you could make a reasonable guess that I’m talking about performance.


Yes. I've always pushed the limits of what kinds of memory and CPU usage I can get out of languages: NLP, text conversion, video encoding, image rendering, etc.

Rust beats Go in performance, but nothing like how far behind Java, C#, or scripting languages (Python, Ruby, TypeScript, etc.) are in all the work I've done with them. With Go I get most of the performance of Rust with very little effort, plus a fully contained stdlib, test suite, package manager, formatter, etc.


Rust is the most defect free language I have ever had the pleasure of working with. It's a language where you can almost be certain that if it compiles and if you wrote tests, you'll have no runtime bugs.

I can only think of two production bugs I've written in Rust this year. Minor bugs. And I write a lot of Rust.

The language has very intentional design around error handling: Result<T,E>, Option<T>, match, if let, functional predicates, mapping, `?`, etc.
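
A rough sketch of what that looks like in practice (the file name and helper functions are made up for illustration):

    use std::fs;
    use std::num::ParseIntError;

    // A fallible step: the signature forces callers to deal with the error.
    fn parse_port(s: &str) -> Result<u16, ParseIntError> {
        s.trim().parse::<u16>()
    }

    // `?` propagates either failure upward without boilerplate; callers still
    // have to acknowledge the Result.
    fn port_from_file(path: &str) -> Result<u16, Box<dyn std::error::Error>> {
        let contents = fs::read_to_string(path)?; // io::Error -> Box<dyn Error>
        let port = parse_port(&contents)?;        // ParseIntError -> Box<dyn Error>
        Ok(port)
    }

    fn main() {
        // The final consumer handles both outcomes explicitly with `match`.
        match port_from_file("port.txt") {
            Ok(port) => println!("listening on {port}"),
            Err(e) => eprintln!("bad config: {e}"),
        }
    }

The failure paths are visible in every signature, and the pass-it-along case is a single `?` instead of an if block.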

Go, on the other hand, has nil and extremely exhausting boilerplate error checking.

Honestly, Go has been one of my worst languages outside of Python, Ruby, and JavaScript for error introduction. It's a total pain in the ass to handle errors and exceptional behavior. And this leads to making mistakes and stupid gotchas.

I'm so glad newer languages are picking up on and copying Rust's design choices from day one. It's a godsend to be done with null and exceptions.

I really want a fast, memory managed, statically typed scripting language somewhere between Rust and Go that's fast to compile like Go, but designed in a safe way like Rust. I need it for my smaller tasks and scripting. Swift is kind of nice, but it's too Apple centric and hard to use outside of Apple platforms.

I'm honestly totally content to keep using Rust in a wide variety of problem domains. It's an S-tier language.


> I really want a fast, memory managed, statically typed scripting language somewhere between Rust and Go that's fast to compile

It could as well be Haskell :) Only partly a joke: https://zignar.net/2021/07/09/why-haskell-became-my-favorite...


Borgo could be that language for you. It compiles down to Go, and uses constructs like Option<T> instead of nil, Result<T,E> instead of multiple return values, etc. https://github.com/borgo-lang/borgo


> I really want a fast, memory managed, statically typed scripting language somewhere between Rust and Go that's fast to compile like Go, but designed in a safe way like Rust

OCaml is pretty much that, with a very direct relationship with Rust, so it will even feel familiar.


I agree with a lot of what you said. I'm hoping Rust will grow on me as I improve in it. I hate nil/null.

> Go... extremely exhausting boilerplate error checking

This actually isn't correct. That's because Go is the only language that makes you think about errors at every step. If you just ignore them and pass them up, like exceptions or a Maybe, you're basically just exchanging handling errors for assuming the whole thing passes or fails.

If you write actual error checking like Go's in Rust (or Java, or any other language), then Go is often less noisy.

It's just two very different approaches to error handling that the dev community is split on. Here's a pretty good explanation from a rust dev: https://www.youtube.com/watch?v=YZhwOWvoR3I


It’s very common in Go to just pass the error on since there’s no way to handle it in that layer.

Rust forces you to think about errors exactly as much, but in the common case of passing it on it’s more ergonomic.


just be careful with unwrap :)


Go is in the same performance profile as Java and C#. There are tons of benchmarks that support this.


1) For one-off scripts, and 2) if you ignore memory.

You can make just about anything faster if you provide more memory to store data in more optimized formats. That doesn't make those languages faster.

Part of the problem is that Java in the real world requires an unreasonable number of classes and 3rd party libraries. Even for basic stuff like JSON marshaling. The Java stdlib is just not very useful.

Between these two points, all my production Java systems easily use 8x more memory and still barely match the performance of my Go systems.


I genuinely can’t think of anything the Java standard library is missing, apart from a JSON parser, which is being added.

It’s your preference to prefer one over the other. I prefer Java’s standard library because at least it has a generic Set data structure in it, and C#’s standard library does have a JSON parser.

I don’t think discussions about what is in the standard library really refute anything about Go being within the same performance profile, though.


Memory is the most common tradeoff engineers make for better performance. You can trivially do so yourself with Java: feel free to cut down the heap size and Java's GC will happily chug along, running 10-100 times as often, without a second thought; they are beasts. The important metric is that Java's GC will be able to keep up with most workloads, and it won't needlessly block user threads from doing their work. Also, not running the GC as often makes Java use surprisingly small amounts of energy.

As for the stdlib, Go's is certainly impressive, but come on, I wouldn't even say that in the general case Java's standard library is smaller. It just so happens that Go was developed with the web in mind almost exclusively, while Java has a wider scope. Nonetheless, the Java standard library is certainly among the best in richness.


ZGC? It should be on par with or better than Go’s.


Java’s collectors vastly outperform Go’s. Look at the Debian binary-trees benchmarks [0]. Go just uses less memory because it’s AOT compiled from the start and Java’s strategy up until recently was to never return memory to the OS. Java programs are typically on servers where they’re the only application running.

[0] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...



IIRC the native-image GC is still the serial GC by default, which would probably perform the worst out of all the available GCs.

I know on HotSpot they’re planning to make G1 the default for every situation. Even where it would previously choose the serial GC.


Having used XSLT, I remember hating it with the passion of a thousand suns. Maybe we could have improved what we had, but anything I wanted to do was better done somehow else.

I'm glad to have all sorts of specialists on our team, like DBAs, security engineers, and QA. But we had XSLT specialists, and I thought it was just a waste of effort.

