
You can write some SQL in Spark, but

1) Why would you want to maintain your own Spark infrastructure? Spark on Kube is a huge improvement over YARN but you still have to deal with OOMEs, filled disks, Kube upgrades, pushing custom images to container registries, etc etc etc.

2) Snowflake is probably 10-50x as performant as Spark for data manipulation. I don't know what kind of unholy demonic incantations Snowflake is doing on the backend to support their SQL performance, but it's really freaking fast. There's just no other way to cut it.

I've spent 5-10 years eking every ounce of performance I can get out of a Hadoop/Spark cluster. I'm not trying to be unreasonable about this. I would love for OSS to be competitive; it's great for the world, and it would be great for my skill set and earning potential.

But it's not a contest, and if you think standalone Spark is going to be a viable competitor in a couple years, you are deluding yourself. Make informed choices about your career and investment.



A lot of the articles I read about Snowflake involve Data Vault, which is a massive turn-off. And when their tech lead (Kent Graziano) is a prominent figure in the DV bullshit...


Snowflake and DV have no interdependency whatsoever. Snowflake is just a database. Whether you use DV to model the data inside of it, or dimensional modelling, or "big wide tables" is completely up to you; there's nothing about it that requires or benefits DV in particular.


What is DV?


Count not knowing what it is as a blessing. Run if you can.

https://en.wikipedia.org/wiki/Data_vault_modeling

Edit: As the Wikipedia article has no Criticism section I will add some references:

http://kejser.org/the-data-vault-vs-kimball-round-2/

https://timi.eu/blog/data-vaulting-from-a-bad-idea-to-ineffi...


It's a data modelling method for data warehouses; it can be used in Snowflake or on any other data management platform.


> Snowflake is probably 10-50x as performant as Spark for data manipulation

Wow, is this a fact? I haven't used either in a while, but I saw the blog post from Databricks, and Spark was more performant than Snowflake.

I assumed that's what I'd also get when I run Spark on Kube.


You should try Databricks, especially the new Photon engine powering Spark. In general more performant than Snowflake in SQL and a lot more flexible. (There are some cases in which Databricks would be slower but the perf is improving rapidly.)


Probably an oversight on your part, but I would argue it would be elegant to disclose that you are one of the co-founders.


Databricks has an extremely bad API. So, sure, your Spark jobs might be a little bit faster sometimes, but why would you use it if you can't even read the logs of running jobs?


Databricks is amazing, the Delta Live Table technology is incredible. It's very hard to approach problems like Data Lineage and Data Quality, but that platform does it in the right way.

My only concern is that they offer just a managed cloud product. That's cool for startups, but large enterprises sometimes need more governance and ownership than that.


Very surprised by this. Do you have a reference?


FYI, rxin is a co-founder of Databricks.


That explains it


The biggest selling point of Snowflake for most of the customers is that they do not need to maintain the infrastructure.


Of course you would say that it's more performant and flexible... TPC-DS was just a PR ploy.



