
GizmoSQL is definitely a good option. I work at GizmoData and maintain GizmoSQL. It is an Arrow Flight SQL server with DuckDB as its back-end SQL execution engine. It supports independent, thread-safe, concurrent sessions and has robust security, logging, token-based authentication, and more.

It also has a growing list of adapters, including ODBC, JDBC, ADBC, dbt, SQLAlchemy, Metabase, Apache Superset, and more.
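
Since it speaks Arrow Flight SQL, the stock ADBC Flight SQL driver can talk to it directly; a minimal Python sketch (the endpoint and credentials are placeholders):

    import adbc_driver_flightsql.dbapi as flightsql

    # Endpoint and credentials are placeholders - point these at your server.
    with flightsql.connect(
        "grpc+tls://localhost:31337",
        db_kwargs={"username": "gizmosql_username", "password": "gizmosql_password"},
    ) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 42 AS answer")
            print(cur.fetchall())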

We also just introduced a PySpark drop-in adapter, letting you run your PySpark DataFrame workloads against GizmoSQL - for dramatic savings compared to Databricks on sub-5 TB workloads.

Check it out at: https://gizmodata.com/gizmosql

Repo: https://github.com/gizmodata/gizmosql


Oh, and GizmoData Cloud (SaaS option) is coming soon - to make it easier than ever to provision GizmoSQL instances...


Good point :) - we can re-aggregate HyperLogLog (HLL) sketches to get a pretty accurate NDV (count distinct) - see Query.farm's DataSketches DuckDB extension here: https://github.com/Query-farm/datasketches
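
For intuition on why that re-aggregation works: merging two HLL sketches is just an element-wise max over their registers, so per-partition sketches combine into a global NDV estimate. A toy Python illustration (not the extension's actual implementation):

    # Toy HLL to show mergeability - not the DataSketches extension's code.
    import hashlib

    M = 1024  # number of registers

    def hll_sketch(values):
        """Each register keeps the max 'rank' (trailing-zero run + 1) seen."""
        regs = [0] * M
        for v in values:
            h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
            idx, w, rank = h % M, h // M, 1
            while w & 1 == 0 and rank < 64:  # count trailing zeros + 1
                rank += 1
                w >>= 1
            regs[idx] = max(regs[idx], rank)
        return regs

    def hll_merge(a, b):
        """Element-wise max - this is what makes re-aggregation possible."""
        return [max(x, y) for x, y in zip(a, b)]

    s1 = hll_sketch(range(0, 50_000))        # partition 1
    s2 = hll_sketch(range(25_000, 100_000))  # partition 2 (overlaps partition 1)
    merged = hll_merge(s1, s2)               # identical to sketching the union

    # Standard HLL estimator (ignoring small/large-range corrections):
    est = 0.7213 / (1 + 1.079 / M) * M * M / sum(2.0 ** -r for r in merged)
    print(round(est))  # ~100,000 true distinct values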

We also have bitmap aggregation capabilities for exact count distinct - something I worked on implementing with Oracle, Snowflake, Databricks, and DuckDB Labs. It isn't as fast as HLL - but it is 100% accurate...
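
The bitmap idea in miniature (a conceptual sketch, not any engine's actual implementation): each partition ORs value IDs into a bitmap, partial bitmaps merge losslessly, and the popcount of the union is the exact distinct count.

    # Conceptual sketch of bitmap-based exact COUNT DISTINCT.
    def bitmap(ids):
        """Pack non-negative integer IDs into one big-int bitmap."""
        bm = 0
        for i in ids:
            bm |= 1 << i
        return bm

    part1 = bitmap([1, 5, 9, 42])
    part2 = bitmap([5, 9, 100])

    union = part1 | part2        # bitmaps merge losslessly with OR
    print(union.bit_count())     # popcount = exact NDV (here: 5; Python 3.10+)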


I remember BigQuery had DISTINCT with HLL-level accuracy 10 years ago, but it rather quickly replaced that with exact results.

How would you compare this solution to BigQuery?


Hi sdairs, we did store the data on the worker nodes for the challenge, but not in memory. We wrote the data to the local NVMe SSD storage on each node. Linux may cache the filesystem data, but we didn't load the data directly into memory - we like to preserve memory for aggregations, joins, etc. as much as possible...

It is true that you would need to run the instance(s) 24/7 to get this performance all day; a startup time of a couple of minutes is not ideal. We have a lot of work to do on the engine, but it has been a fun learning experience...


“Linux may cache the filesystem data” means there’s a non-zero likelihood that the data is in memory unless you dropped caches right before you began the benchmark. You don’t have to explicitly load it into memory for this to be true. What’s more, unless you are in charge of how memory is used, the kernel is going to make its own decisions as to what to cache and what to evict, which can make benchmarks unreproducible.

It’s important to know what you are benchmarking before you start and to control for extrinsic factors as explicitly as possible.
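
For anyone who wants to see (and control for) the effect, a quick sketch that makes the page cache visible - the file path is a placeholder, and the drop_caches step requires root:

    # Demonstrates the Linux page-cache effect on read benchmarks.
    # For a truly cold second run, drop caches between reads (requires root):
    #   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    import time

    def timed_read(path, chunk=1 << 20):
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk):
                pass
        return time.perf_counter() - start

    path = "/data/big_file"                    # placeholder: any large local file
    print("first read :", timed_read(path))   # cold-ish: hits the disk
    print("second read:", timed_read(path))   # likely served from the page cache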


Thanks for clarifying - I'm not trying to take anything away from you; I work in the OLAP space too, so it's always good to see people pushing it forwards. It would be interesting to see a comparison of totally cold vs. hot caches.

Are you looking at distributed queries directly over S3? We did this in ClickHouse and can do instant virtual sharding over large data sets in S3. We call it parallel replicas: https://clickhouse.com/blog/clickhouse-parallel-replicas


(I submitted this link.) My interest in this approach is mainly observability infra at scale - thinking about buffering detailed events, metrics, and thread samples at the edge, filtering early there, and later extracting only the things of interest. I'm a SQL & database nerd, so this approach looks interesting to me.


With 2 modern NVMe disks per host (15 GB/s each) and PCIe 5.0, it should only take about 15 s to read 30 TB into memory across 63 hosts.

You can find those disks on Hetzner. Not AWS, though.
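
Back-of-envelope, for anyone checking the arithmetic (counts taken from the parent comment):

    # Aggregate read bandwidth across the fleet.
    hosts = 63
    disks_per_host = 2
    gb_per_s_per_disk = 15  # modern PCIe 5.0 NVMe

    aggregate_gb_per_s = hosts * disks_per_host * gb_per_s_per_disk  # 1,890 GB/s
    seconds = 30_000 / aggregate_gb_per_s                            # 30 TB = 30,000 GB
    print(f"~{seconds:.1f} s")                                       # ~15.9 s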


I don’t understand why both Azure and AWS have local SSDs that are an order of magnitude slower than what I can get in a laptop. If Hetzner can do it, surely so can they!

Not to mention that Azure now exposes local drives as raw NVMe devices mapped straight through to the guest with no virtualisation overheads.


It would undercut all their higher-level services - like DynamoDB, CosmosDB, etc.

Databases would suddenly go BRRR in the cloud and show up cloud-native (S3-based) databases for the high-latency services they are.


This is something we are trying to take a novel approach to as well. We have a video demonstrating some TPC-H SF10TB queries (inner joins, etc.) running on GizmoEdge: https://www.youtube.com/watch?v=hlSx0E2jGMU


Does that study go into the global vision of DuckLake?


Hi mosselman, GizmoEdge is not open-source. DeepSeek has "smallpond" however, which is open-source: https://github.com/deepseek-ai/smallpond

I plan on getting GizmoEdge to production-grade quality eventually so folks can use it as a service or licensed software. There is a lot of work to do, though :)


Hi djhworld. The 5 s does not include the download/materialization step; that part takes the worker about 1 to 2 minutes for this data set. I didn't know this was going on Hacker News or would be this popular - I will try to get more solid stats on that part and update the blog accordingly.

You can have GizmoEdge reference cloud (remote) data as well, but of course that would be slower than what I did for the challenge here...

The data is on disk - on locally mounted NVMe on each worker - in the form of a DuckDB database file (once the worker has converted it from Parquet). I originally kept the data in Parquet, but the DuckDB format was about 10 to 15% faster - and since I was trying to squeeze out every drop of performance, I went ahead and did that...
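
The Parquet-to-DuckDB materialization itself is a one-liner; a minimal sketch (paths and table name are illustrative):

    # Materialize Parquet files into a native DuckDB database file on NVMe.
    import duckdb

    con = duckdb.connect("/mnt/nvme/lineitem.duckdb")  # illustrative path
    con.execute("""
        CREATE TABLE lineitem AS
        SELECT * FROM read_parquet('/mnt/nvme/staging/*.parquet')
    """)
    con.close()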

Thanks for the questions.

GizmoEdge is not production yet - this was just to demonstrate the art of the possible. I wanted to divide-and-conquer a huge dataset with a lot of power...


I've since learned (from a DuckDB blog) that DuckDB seems to do better with the XFS filesystem. I used ext4 for this, so I may be able to get another 10 to 15% (maybe!).

DuckDB blog: https://duckdb.org/2025/10/09/benchmark-results-14-lts


:D that is scary!


You should do it then, and post it here. I did do it with one machine as well: https://gizmodata.com/blog/gizmosql-one-trillion-row-challen...


Nobody cares if I can do it a million times faster - everyone can. It's cheating.

The whole reason you have to account for the time you spend setting it up is so that all the work spent processing the data is timed. Otherwise we could just precompute the answer and print it on demand - that is very fast and easy.

Just getting it into memory is a large bottleneck in the actual challenge.

If I first put it into a DB whose statistics track the needed min/max/mean, then it's basically instant to retrieve - but it's also slower to set up, because that work needs to be done somewhere. That's why the challenge is timed from file to result.
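
To make that concrete (a sketch - the file, table, and column names are made up): once the aggregates are materialized, the "query" is just a lookup, which is exactly why the setup has to be on the clock.

    # Precomputing min/max/mean makes retrieval trivially fast, but the scan
    # cost just moves into the setup step. File/table/column names are made up.
    import duckdb

    con = duckdb.connect()

    # Setup: the expensive part - this is where the real work happens.
    con.execute("""
        CREATE TABLE summary AS
        SELECT station, min(temp) AS mn, avg(temp) AS mean, max(temp) AS mx
        FROM read_csv('measurements.txt', delim=';', names=['station', 'temp'])
        GROUP BY station
    """)

    # "Query": basically instant - it just reads the precomputed rows.
    print(con.execute("SELECT * FROM summary ORDER BY station").fetchall())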


Challenge accepted - I'll try it on a 4XL Snowflake to get actual perf/cost


Cost-wise, 64 4XL Snowflake clusters would cost 64 x $384/hr, for a total of $24,576/hr (I believe).


What was the cost of the duck implementation?

