
Question for folks in data science / ML space: Has DuckDB been replacing Pandas and NumPy for basic data processing?
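
To make that concrete, I mean basic aggregations like this, which (as I understand it, in recent DuckDB versions) can run directly against a local DataFrame; the data here is obviously made up:

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 5]})

    # DuckDB can scan the local DataFrame by name and hand back a DataFrame.
    out = duckdb.sql("SELECT store, SUM(sales) AS total FROM df GROUP BY store").df()
    print(out)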

> Our agreement with TerraPower will provide funding that supports the development of two new Natrium® units capable of generating up to 690 MW of firm power with delivery as early as 2032.

> Our partnership with Oklo helps advance the development of entirely new nuclear energy in Pike County, Ohio. This advanced nuclear technology campus — which may come online as early as 2030 — is poised to add up to 1.2 GW of clean baseload power directly into the PJM market and support our operations in the region.

It seems like they are definitely building a new plant in Ohio. I'm not sure exactly what is happening with TerraPower but it seems like an expansion rather than "purchasing power from existing nuke plants".

Perhaps I'm misreading it though.


If history repeats itself ... taxpayers will be footing the bill. Ohio has shown itself to be corrupt when it comes to its nuclear infrastructure. [0] Highly confident that politicians are lining up behind the scenes to get their slice of the pie.

[0] https://en.wikipedia.org/wiki/Ohio_nuclear_bribery_scandal


Well, private investment is a great way to avoid subsidy nonsense.

You know that there's no actual private investment in nuclear in the US.

The nuclear industry is indemnified by the taxpayers. Without that insurance backstop, there would be no nuclear energy industry.


Taxpayers are private. They earn money and give some of it to the state.

The weasel wording is strong here. That's like me saying that buying a hamburger will help advance the science of hamburger-making. I'm just trading money for hamburgers. They're trying to put a shiny coat of paint on the ugly fact that they're buying up MWh, reducing the supply of existing power for the rest of us, and burning it to desperately try to convince investors that AGI is right around the corner so that the circular funding musical chairs doesn't stop.

We got hosed when they stole our content to make chatbots. We get hosed when they build datacenters with massive tax handouts and use our cheap power to produce nothing, and we'll get hosed when the house of cards ultimately collapses and the government bails them out. The game is rigged. At least when you go to the casino everyone acknowledges that the house always wins.


what company?


> There is confusion about the less obvious benefits, confusion about how it works, confusion about the dangers (how do I adjust my well honed IPv4 spidey senses?), and confusion about how I transition my current private network

Could you be specific about what the misconceptions are?


I had Copilot produce this for you based on the comments in this discussion (as at just before the timestamp of this comment).

https://copilot.microsoft.com/shares/656dEMHWyFye5cCeicgGv


Interesting that this is getting downvoted. I truly wonder why. One of the things LLMs are good at is summarising and extracting key points. Or should I have gone to the trouble to do this myself - read the entire comment thread and manually summarise - when the person I was replying to hadn’t done that? My comment was meant in good faith: “here’s the info you wanted and how you can easily get them yourself next time”.


1. People come here for discussions with real people. The other night I was at a party and we had a great time playing chess and board games. It would be weird if someone started using stockfish, even if it is a better player. Everything stockfish does, it already knows. It doesn't learn or explore the game-space.

2. The response is still too wordy, generic, and boring. So LLMs are not really better players, at least for now.

3. With LLMs, you can produce a ton of text much faster than it can be read. Whereas the dynamic is reversed for ordinary writing. By writing this by hand, I am doing you a favor by spending more time on this comment than you will. But by reading your LLM output I am doing you a favor by spending more time reading than you did generating.

You could probably get away with using an LLM here by copying the response and then cutting down 90% of it. But at that point it would be better to just restate the points yourself in your own words.


So a cheap question, whose answer is right here in the discussion, is not downvoted, whereas I am penalised because I did not do the legwork that my correspondent would not do themselves. That's what I'm hearing.

EDITED TO ADD:

> by reading your LLM output I am doing you a favor by spending more time reading than you did generating

How could my respondent (presumably on whose behalf you are making the argument) possibly be doing me a favour when they asked the question? Is it each of our responsibility to go to some lengths to spoon feed one another when others don’t deign to feed themselves?


And yet the LLM did a better job of disparaging everyone's comments as uninformed, which they are btw.


You're not offering anything of value. We all can ask some LLM about stuff we want to know. It's like in the past, when someone would post a link to search results as a reply.


> Submit the write to the primary file

> Link fsync to that write (IOSQE_IO_LINK)

> The fsync's completion queue entry only arrives after the write completes

> Repeat for secondary file

Wait, so the OS can re-order the fsync() to happen before the write request it is supposed to be syncing? Is there a citation or link to some code for that? It seems too ridiculous to be real.

> O_DSYNC: Synchronous writes. Don't return from write() until the data is actually stable on the disk.

If you call fsync() this isn't needed correct? And if you use this, then fsync() isn't needed right?


> Wait, so the OS can re-order the fsync() to happen before the write request it is supposed to be syncing? Is there a citation or link to some code for that? It seems too ridiculous to be real.

This is an io_uring-specific thing. It doesn't guarantee any ordering between operations submitted at the same time, unless you explicitly ask it to with the `IOSQE_IO_LINK` they mentioned.

Otherwise it's as if you called write() from one thread and fsync() from another, before waiting for the write() call to return. That obviously defeats the point of using fsync() so you wouldn't do that.

> If you call fsync(), [O_DSYNC] isn't needed correct? And if you use [O_DSYNC], then fsync() isn't needed right?

I believe you're right.


I guess I'm a bit confused why the author recommends using this flag and fsync.

Related: I would think that grouping your writes and then fsyncing rather than fsyncing every time would be more efficient but it looks like a previous commenter did some testing and that isn't always the case https://news.ycombinator.com/item?id=15535814
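
For what I mean by grouping, here's a toy sketch of the two patterns (not a real benchmark; results will depend heavily on the drive and filesystem, and `bench.log` is just a scratch file):

    import os, time

    def run(batch: bool, n: int = 1000) -> float:
        fd = os.open("bench.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        t0 = time.perf_counter()
        for _ in range(n):
            os.write(fd, b"x" * 128)
            if not batch:
                os.fsync(fd)     # durability point after every record
        if batch:
            os.fsync(fd)         # one durability point for the whole group
        os.close(fd)
        return time.perf_counter() - t0

    print("fsync per write :", run(batch=False))
    print("one fsync at end:", run(batch=True))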


I'm not sure there's any good reason. Other commenters mentioned AI tells. I wouldn't consider this article a trustworthy or primary source.


Yeah that seems reasonable. The article seems to mix fsync and O_DSYNC without discussing their relationship which seems more like AI and less like a human who understands it.

It also seems if you were using io_uring and used O_DSYNC you wouldn't need to use IOSQE_IO_LINK right?

Even if you were doing primary and secondary log file writes, they are to different files so it doesn't matter if they race.


> It also seems if you were using io_uring and used O_DSYNC you wouldn't need to use IOSQE_IO_LINK right? Even if you were doing primary and secondary log file writes, they are to different files so it doesn't matter if they race.

I think there are a lot of reasons to use this flag besides a write()+f(data)sync() sequence:

* If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).

* If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.
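
Outside of io_uring, the ordering that the rename trick needs looks roughly like this in synchronous Python (file names are made up; the link flag just lets you submit the whole chain to the ring at once instead):

    import os

    def atomic_replace(path: str, data: bytes) -> None:
        tmp = path + ".tmp"                      # temp file name is arbitrary
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)                         # temp file fully durable first...
        finally:
            os.close(fd)
        os.rename(tmp, path)                     # ...then the atomic swap
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)                     # make the rename itself durable
        finally:
            os.close(dir_fd)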

btw, a clarification about my earlier comment: `O_SYNC` (no `D`) should be equivalent to calling `fsync` after every write. `O_DSYNC` should be equivalent to calling the weaker `fdatasync` after every write. The difference is the metadata stored in the inode.
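
In plain synchronous terms (Linux-only sketch, illustrative file name), those equivalences look roughly like this:

    import os

    data = b"hello\n"
    fd = os.open("a.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    os.write(fd, data)
    os.fsync(fd)        # data + all metadata durable: what O_SYNC does per write

    os.write(fd, data)
    os.fdatasync(fd)    # data + only essential metadata: what O_DSYNC does per write
    os.close(fd)

    fd = os.open("a.log", os.O_WRONLY | os.O_APPEND | os.O_DSYNC)
    os.write(fd, data)  # returns only once the data is stable; no explicit sync call
    os.close(fd)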


> I think there are a lot of reasons to use this flag besides a write()+f(data)sync() sequence:

> * If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).

I guess I meant exclusively in terms of writing to the WAL. As I understand most DBMSes synchronously write the log entries for a transaction and asynchronously write the data pages to disk via a separate API or just mark the pages as dirty and let the buffer pool manager flush them to disk at its discretion.

> * If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.

Makes sense


> As I understand most DBMSes synchronously write the log entries for a transaction and asynchronously write the data pages to disk via a separate API or just mark the pages as dirty and let the buffer pool manager flush them to disk at its discretion.

I think they do need to ensure that page doesn't get flushed before the log entry in some manner. This might happen naturally if they're doing something in single-threaded code without io_uring (or any other form of async IO). With io_uring, it could be a matter of waiting for completion entry for the log write before submitting the page write, but it could be the link instead.
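
As a toy sketch of that ordering rule (made-up file names and page layout, no real buffer pool):

    import os

    log_fd = os.open("wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    data_fd = os.open("table.dat", os.O_RDWR | os.O_CREAT, 0o644)

    PAGE_SIZE = 4096

    def commit(record: bytes, page_no: int, page: bytes) -> None:
        # 1. The log record must be durable before the page may reach disk.
        os.write(log_fd, record)
        os.fdatasync(log_fd)
        # 2. Only now may the dirty page be written (a real buffer pool would
        #    defer this and flush later, but still only after step 1).
        os.pwrite(data_fd, page, page_no * PAGE_SIZE)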


> I think they do need to ensure that page doesn't get flushed before the log entry in some manner.

Yes I agree. I meant like they synchronously write the log entries, then return success to the caller, and then deal with dirty data pages. As I recall the buffer pool manager has to do something special with dirty pages for transactions that are not committed yet.


Cool! I've always wanted something like this. Usually I just have to manually remove redundant CSS and styling options.

Can you explain why the viewport width and height are needed?


Content that appears in the viewport before scrolling is considered 'above-the-fold' and is thus prioritised to load quickest. The viewport dimensions are used to figure out what will be above-the-fold.


Wireless ethernet adapters


What was the CRDT bug?


> PyTorch has dominated the AI scene since TF1 fumbled the ball at 10th yard line

can you explain why you think TensorFlow fumbled?


I see good answers already, but here's a concrete example:

At my university we had to decide between the two libraries, so as a test we decided to write a language model from scratch. The first minor problem with TF was that (if memory serves me right) you were supposed to declare your network "backwards" - instead of saying "A -> B -> C" you had to declare "C(B(A))". The major problem, however, was that there was no way to add debug messages - either your network worked or it didn't. To make matters worse, the "official" TF tutorial on how to write a Seq2Seq model didn't run because the library had changed, and the bug reports for that were met for years with "we are changing the API so we'll fix the example once we're done".

PyTorch, by comparison, had the advantage of a Python-based interface - you simply defined classes like you always did (including debug statements!), connected them as variables, and that was that. So when I and my beginner colleagues had to decide which library to pick, "the one that's not a nightmare to debug" sounded much better than "the one that's more efficient if you have several billion training datapoints and a cluster". My colleagues and I then went on to become professionals, and we all brought PyTorch with us.


This was also my experience. TensorFlow's model of constructing then evaluating a computation graph felt at odds with Python's principles. It made it extremely difficult to debug because you couldn't print tensors easily! It didn't feel like Python at all.

Also the API changed constantly so examples from docs or open source repos wouldn't work.

They also had that weird thing about all tensors having a unique global name. I remember I tried to evaluate a DQN network twice in the same script and it errored because of that.

It's somewhat vindicating to see many people in this thread shared my frustrations. Considering the impact of these technologies I think a documentary about why TensorFlow failed and PyTorch took off would be a great watch.


The inability to use print debug to tell me the dimensions of my hidden states was 100% why TF was hard for me to use as a greenhorn MSc student.

Another consequence of this was that PyTorch let you use regular old Python for logic flow.
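
For anyone who hasn't used it, here's a minimal sketch of what that looks like in PyTorch (toy network, nothing more):

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(8, 16)
            self.fc2 = nn.Linear(16, 2)

        def forward(self, x):
            h = torch.relu(self.fc1(x))
            print("hidden state:", h.shape)   # plain print debugging
            if x.shape[0] > 2:                # plain Python control flow
                h = h * 0.5
            return self.fc2(h)

    out = TinyNet()(torch.randn(4, 8))
    out.sum().backward()                      # the graph was built as the code ran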


In 2018, I co-wrote a blog post with the inflammatory title “Don’t use TensorFlow, try PyTorch instead” (https://news.ycombinator.com/item?id=17415321). As it gained traction here, it was changed to “Keras vs PyTorch” (some edgy things that work for a private blog are not good for a corporate one). Yet the initial title stuck, and you can see it resonated well with the crowd.

TensorFlow (while a huge step on top of Theano) had issues with a strange API, mixing needlessly complex parts (even for the simplest layers) with magic-box-like optimization.

There was Keras, which I liked and used before it was cool (when it still supported the Theano backend), and it was the right decision for TF to incorporate it as the default API. But it was 1–2 years too late.

At the same time, I initially looked at PyTorch as some intern’s summer project porting from Lua to Python. I expected an imitation of the original Torch. Yet the more it developed, the better it was, with (at least to my mind) the perfect level of abstraction. On the one hand, you can easily add two tensors, as if it were NumPy (and print its values in Python, which was impossible with TF at that time). On the other hand, you can wrap anything (from just a simple operation to a huge network) in an nn.Module. So it offered this natural hierarchical approach to deep learning. It offered building blocks that can be easily created, composed, debugged, and reused. It offered a natural way of picking the abstraction level you want to work with, so it worked well for industry and experimentation with novel architectures.

So, while in 2016–2017 I was using Keras as the go-to for deep learning (https://p.migdal.pl/blog/2017/04/teaching-deep-learning/), in 2018 I saw the light of PyTorch and didn’t feel a need to look back. In 2019, even for the intro, I used PyTorch (https://github.com/stared/thinking-in-tensors-writing-in-pyt...).


Actually, I opened “Teaching deep learning” and smiled as I saw how it evolved:

> There is a handful of popular deep learning libraries, including TensorFlow, Theano, Torch and Caffe. Each of them has Python interface (now also for Torch: PyTorch)

> [...]

> EDIT (July 2017): If you want a low-level framework, PyTorch may be the best way to start. It combines relatively brief and readable code (almost like Keras) but at the same time gives low-level access to all features (actually, more than TensorFlow).

> EDIT (June 2018): In Keras or PyTorch as your first deep learning framework I discuss pros and cons of starting learning deep learning with each of them.


The original TensorFlow had an API similar to the original Lua-based Torch (the predecessor to PyTorch) that required you to first build the network, node by node, then run it. PyTorch used a completely different, and much more convenient approach, where the network is built automatically for you just by running the forward pass code (and will then be used for the backward pass), using both provided node types and arbitrary NumPy compatible code. You're basically just writing differentiable code.

This new PyTorch approach was eventually supported by TensorFlow as well ("eager execution"), but the PyTorch approach was such a huge improvement that there had been an immediate shift by many developers from TF to PyTorch, and TF never seemed able to regain the momentum.

TF also suffered from having a confusing array of alternate user libraries built on top of the core framework, none of which had great documentation, while PyTorch had a more focused approach and fantastic online support from the developer team.


LuaTorch is eager-execution. The problem with LuaTorch is the GC. You cannot rely on a traditional GC here: since each tensor was megabytes at the time (now gigabytes), you need to collect them aggressively rather than at intervals. Python's reference counting solves this issue, and of course by "collecting" I don't mean freeing the memory (PyTorch has a simple slab allocator to manage CUDA memory).


With Lua Torch the model execution was eager, but you still had to construct the model graph beforehand - it wasn't "define by run" like PyTorch.

Back in the day, having completed Andrew Ng's ML course, I built my own C++ NN framework copying this graph-mode Lua Torch API. One of the nice things about explicitly building a graph was that my framework supported having the model generate a GraphViz DOT representation of itself so I could visualize it.


Ah, I get what you mean now. I was mixing up the nn module and the tensor execution bits. (To be fair, the PyTorch nn module carries over many of these quirks!)


I'm no machine learning engineer, but I dabbled professionally with both frameworks a few years ago and the developer experience didn't even compare. The main issue with TF was that you could only choose between a powerful but incomprehensible, poorly documented [1], ultra-verbose and ever-changing low-level API, and an abstraction layer (Keras) that was too high-level to be really useful.

Maybe TF has gotten better since but at the time it really felt like an internal tool that Google decided to just throw into the wild. By contrast PyTorch offered a more reasonable level of abstraction along with excellent API documentation and tutorials, so it's no wonder that machine learning engineers (who are generally more interested in the science of the model than the technical implementation) ended up favoring it.

[1] The worst part was that Google only hosted the docs for the latest version of TF, so if you were stuck on an older version (because, oh I don't know, you wanted a stable environment to serve models in production), well, tough luck. That certainly didn't do TF any favors.


For me it was about 8 years ago. Back then TF was already bloated but had two weaknesses. Their bet on static compute graphs made writing code verbose and debugging difficult.

The few people I knew back then used Keras instead. I switched to PyTorch for my next project, which was more "batteries included".


Imagine a total newbie trying to fine-tune an image classifier, reusing some open source example code, about a decade ago.

If their folder of 10,000 labelled images contains one image that's a different size to the others, the training job will fail with an error about unexpected dimensions while concatenating.

But it won't be able to say the file's name, or that the problem is an input image of the wrong size. It'll just say it can't concatenate tensors of different sizes.

An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.


> An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.

Even seasoned developers will bounce away from frameworks or libraries - no matter if old dogs or the next hot thing - if the documentation isn't up to speed or if simple, common tasks require wading through dozens of pages of documentation.

Writing good documentation is hard enough, writing relevant "common usage examples" is even harder... but keeping them up to date and working is a rarely seen art.

And the greatest art of all of it is logging. Soooo many libraries refuse to implement detailed structured logging in internal classes (despite particularly Java and PHP offering very powerful mechanisms), making it much more difficult to troubleshoot problems in the field.


I just remember TF1 being super hard to use as a beginner and Google repeatedly insisting it had to be that way. People talk about the layering API, but it's more than that, everything about it was covered with sharp edges.


I personally believe TF1 was serving the needs of its core users. It provided a compilable compute graph with autodiff, and you got very efficient training and inference from it. There was a steep learning curve, but if you got past it, things worked very, very well. Distributed TF never really took off; it was buggy, and I think they made some wrong early design bets for performance reasons that should have been sacrificed in favor of simplicity.

I believe some years after the TF1 release, they realized the learning curve was too steep and they were losing users to PyTorch. I think the Cloud team was also attempting to sell customers on their amazing DL tech, which was falling flat. So they tried to keep the TF brand while totally changing the product under the hood by introducing imperative programming and gradient tapes. They killed TF1, upsetting those users, while not having a fully functioning TF2, all the while having plenty of documentation pointing to TF1 references that didn't work. Any new grad student made the simple choice of using a tool that was user-friendly and worked, which was PyTorch. And most old TF1 users hopped on the bandwagon.


First, the migration to 2.0 in 2019 to add eager mode support was horribly painful. Then, starting around 2.7, backward compatibility kept being broken. Not being able to load previously trained models with a new version of the library is wildly painful.


I only remember 2015 TF and I was wondering: why would I use Python to assemble a computational graph when what I really want is to write code and then differentiate through it?


Greenfielding TF2.X and not maintaining 1.X compatibility


> Reranking: the highest value 5 lines of code you'll add. The chunk ranking shifted a lot. More than you'd expect. Reranking can many times make up for a bad setup if you pass in enough chunks. We found the ideal reranker set-up to be 50 chunk input -> 15 output.

What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?


OP here. Reranking uses a specialized LLM that takes the user query and a list of candidate results, then re-orders them based on which ones are most relevant to the query.

Here's sample code: https://docs.cohere.com/reference/rerank
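
If you'd rather see it inline, here's a rough local-model equivalent using sentence-transformers (my illustration, not necessarily our exact stack; the model name, query and chunks are placeholders):

    from sentence_transformers import CrossEncoder

    query = "How do I rotate my API keys?"                 # placeholder query
    chunks = ["Rotate keys from the settings page ...",    # ~50 retrieved chunks
              "Our pricing tiers are ...",                 # in a real setup
              "Keys can be regenerated via the CLI ..."]

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in chunks])   # one relevance score per chunk
    top = [c for _, c in sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)[:15]]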


What is the difference between reranking versus generating text embeddings and comparing with cosine similarity?


My understanding:

If you generate embeddings (of the query, and of the candidate documents) and compare them for similarity, you're essentially asking whether the documents "look like the question."

If you get an LLM to evaluate how well each candidate document follows from the query, you're asking whether the documents "look like an answer to the question."

An ideal candidate chunk/document from a cosine-similarity perspective, would be one that perfectly restates what the user said — whether or not that document actually helps the user. Which can be made to work, if you're e.g. indexing a knowledge base where every KB document is SEO-optimized to embed all pertinent questions a user might ask that "should lead" to that KB document. But for such documents, even matching the user's query text against a "dumb" tf-idf index will surface them. LLMs aren't gaining you any ground here. (As is evident by the fact that webpages SEO-optimized in this way could already be easily surfaced by old-school search engines if you typed such a query into them.)

An ideal candidate chunk/document from a re-ranking LLM's perspective, would be one that an instruction-following LLM (with the whole corpus in its context) would spit out as a response, if it were prompted with the user's query. E.g. if the user asks a question that could be answered with data, a document containing that data would rank highly. And that's exactly the kind of documents we'd like "semantic search" to surface.


I've been thinking about the problem of what to do when the answer to a question is very different from the question itself in embedding space. The KB method sounds interesting and not something I'd thought about; you sort of work on the "document side", I guess. I've also heard of HyDE, which works on the query side: you generate hypothetical answers to the user query and look for documents that are similar to those answers, if I've understood it correctly.


The responses didn't hit on the main point. Re-ranking is just a mini-LLM (kept small for latency/cost reasons) that does a double check: the embedding model finds the closest M documents in R^N space, and the re-ranker picks the top K documents from those M. In theory, if we just used Gemini 2.5 Pro or GPT-5 as the re-ranker, the performance would be even better than whatever small re-ranker people choose to use.


Text similarity finds items that closely match. Reranking may select items that are less semantically "similar" but more relevant to the query.


The reranker is a cross-encoder that sees the docs and the query at the same time. What you normally do is generate embeddings ahead of time, independent of the prompt, calculate cosine similarity with the prompt, select the top-k chunks that best match it, and only then use a reranker to sort them.

Embeddings are a lossy compression, so if you feed the chunks and the prompt in at the same time, the results are better. But you can't do this for your whole DB, which is why you filter with cosine similarity first.
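
A minimal sketch of that two-stage flow (library and model choices here are just illustrative):

    import numpy as np
    from sentence_transformers import SentenceTransformer, CrossEncoder

    docs = ["chunk one ...", "chunk two ...", "chunk three ..."]   # your chunked corpus

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = embedder.encode(docs, normalize_embeddings=True)     # computed once, offline

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def search(query, m=50, k=15):
        q_emb = embedder.encode([query], normalize_embeddings=True)[0]
        sims = doc_emb @ q_emb                                     # cosine similarity (unit vectors)
        candidates = [docs[i] for i in np.argsort(-sims)[:m]]      # cheap first pass
        scores = reranker.predict([(query, d) for d in candidates])
        return [candidates[i] for i in np.argsort(-scores)[:k]]    # expensive second pass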


Because LLMs are a lot smarter than embeddings and basic math. Think of the vector / lexical search as the first approximation.

