
Take a look at Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization - https://www.usenix.org/conference/nsdi18/presentation/geng (commercial version by the same authors - clockwork.io).


What's amazing to me is that, without reading (watching?) too much into it, and as someone who's not a seasoned South Park viewer, this could easily pass as a real episode.


> this could easily pass as a real episode.

Hard disagree. The animation, pacing, sound, everything is of considerably lower quality. You can watch full episodes on their website to compare.

https://www.southparkstudios.com/seasons/south-park


Skimming the video, 100% of the dialog was plodding, tedious, unfunny and predictable. I don't know South Park very well, but I know snappy dialog is somewhat needed. Just as much, endless vacuous paragraphs on the potential and dangers of AI seem unlikely to be what sustains interest in the show.


They actually had an episode about "the dangers of AI" recently (the boys used ChatGPT to avoid having to text with their needy girlfriends), and it was VASTLY better than this fan-made one.

https://www.southparkstudios.com/episodes/8byci4/south-park-...

https://www.youtube.com/watch?v=mvrg1SJhL_Q


Having played with ChatGPT a lot to generate scripts, I'd say this episode is as boring and formulaic as every other script it generates. For example, Cartman uses the word "baby" at the end of a sentence, which is wildly out of character for him.


See also: https://news.airbnb.com/an-update-about-our-community-in-new...

Essentially NY City is making short term rentals (rentals < 30 days) illegal for most renters. The only exception is a room without a lock, for up to 2 people, in a residence that has shared living space and is occupied by the owner.


Previous discussion: https://news.ycombinator.com/item?id=36154175

The law in question affects about 60 percent of the market and does not introduce new restrictions -- but rather a beefier enforcement structure for restrictions already on the books.


Unfortunately it's more restrictive than just owner-occupied units. It's essentially an unlocked room (for up to 2 people) in someone's house/apt that includes shared living spaces.

For example, there are many multi-family houses where a family lives upstairs but can't rent out the downstairs apt.


Right - the new law is quite restrictive. But it doesn't say what Airbnb wants us to believe it says.


Worth a read if you've ever used Airbnb for a short term stay in NYC (or if you're an Airbnb host). The City of NY is about to start enforcing a ban on short term rentals on Airbnb, VRBO, etc.


That's the take-away Airbnb wants us to have, but that's just not what the law says, as the company knows perfectly well. Please see the sibling comment in regard to this.


It seems that - at least partially - the lawsuits revolve around the need to register and how complicated this is made.

At least according to this:

https://www.citysignal.com/airbnb-regulations-nyc-january-20...

there are no substantial changes to the law in the type of rental or duration allowed: you already could not legally rent out a "whole" unit for less than 30 days. The fact that everyone did it nonetheless is another matter.


Does anyone have a simple explanation of how it structures the log data?


1) Use Zstandard with a generated dictionary kept separate from the data, but moreover:

2) Organize the log data into tables with columns, and then compress by column (so, each column has its own dictionary). This lets the compression algorithm perform optimally, since now all the similar data is right next to itself. (This reminds me of the Burrows-Wheeler transform, except much more straightforward, thanks to how similar log lines are.)

3) Search is performed without decompression. Somehow they use the dictionary to index into the table, which is very clever. I would have just compressed the search term using the same dictionary and done a binary search for that, but I think that would only work for exact matches.
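To make the column-wise idea in (2) concrete, here's a minimal sketch in Python. It uses zlib rather than Zstandard (which is what CLP actually uses, along with trained dictionaries), and the log rows are invented; the point is just that reorganizing rows into per-column streams puts similar values next to each other before compression:

```python
import zlib

# Invented sample of structured log rows: (level, module, message)
rows = [
    ("INFO", "net", "connection opened"),
    ("INFO", "net", "connection closed"),
    ("WARN", "disk", "write latency high"),
    ("INFO", "net", "connection opened"),
] * 100

# Row-oriented: concatenate whole lines, compress once.
row_blob = zlib.compress("\n".join(" ".join(r) for r in rows).encode())

# Column-oriented: compress each column's values separately, so the
# highly repetitive values (levels, module names) sit adjacent.
col_blobs = [
    zlib.compress("\n".join(r[i] for r in rows).encode())
    for i in range(3)
]

print("row-oriented bytes:   ", len(row_blob))
print("column-oriented bytes:", sum(len(b) for b in col_blobs))
```

Decompressing each column and zipping them back together reconstructs the original rows, so nothing is lost by the transposition.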


Heh, my hacked version of this was used for Apache logging. I just ran MySQL on top of a ZFS volume with compression turned on and had a column each for timestamp, IP, referral IP, URL, action, and return code. I was amazed at how fast it was, how easy it was to query, and how storage-efficient it was. Overall it was nearly as useful as a standard web traffic analysis tool that took significant time to crunch the logs, except it worked on live data.


I wonder what would happen if you stored the columns in separate tables (perhaps pairs of columns?) and queried them with a join off a shared ID (perhaps a view or materialized view?) in order to really take advantage of compression's ability to squeeze highly self-similar data that's stored together.

Also, I assume you used a smallish blocksize in ZFS because of the frequent small writes?


Well, the storage was crazy efficient; I kept checking to make sure it was recording what I thought it was. No way 50M hits could fit in a file that small...

Timestamps, especially north of 1000 hits/sec, have many bits in common. URL, referrer, and IP address were all just indexes. That worked really well because it was storage-efficient, and it made various queries like "who hit this URL" and "who is our top referrer" very fast. Things that used to require ingesting a month's worth of logs and spitting out a report could often be answered with a simple SQL query.

All in all using indexed columns was a huge win.
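The indexed-column scheme the comment describes might look roughly like this sketch (the class, URLs, and data are all invented for illustration): repeated strings are stored once in a side table, and each logged row holds only a small integer id, which makes both storage and "who hit this URL?"-style queries cheap:

```python
class DictColumn:
    """Stores each distinct string once; rows hold small integer ids."""

    def __init__(self):
        self.ids = {}      # value -> id
        self.values = []   # id -> value
        self.rows = []     # one id per logged row

    def append(self, value):
        if value not in self.ids:
            self.ids[value] = len(self.values)
            self.values.append(value)
        self.rows.append(self.ids[value])

    def lookup(self, value):
        """Row numbers whose value matches ('who hit this URL?')."""
        vid = self.ids.get(value)
        return [i for i, r in enumerate(self.rows) if r == vid]

urls = DictColumn()
for u in ["/index", "/api", "/index", "/index", "/login"]:
    urls.append(u)

print(urls.lookup("/index"))  # rows 0, 2, 3
print(len(urls.values))       # only 3 distinct strings stored
```

A real database index does the row lookup in better than linear time, of course, but the storage win is the same: the long URL string appears once no matter how many hits it gets.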


Does the user have to specify the "schema" (each unique log message type) manually, or is it learned automatically (a la gzip and friends)? I wasn't able to discover this from a cursory readthrough of the paper...


The paper mentions that CLP comes with a default set of schemas, but you can also provide your own rules for better compression and faster search.


If you have exactly two logging operations in your entire program:

log("Began {x} connected to {y}")

log("Ended {x} connected to {y}")

We can label the first one logging operation 1 and the second one logging operation 2.

Then, if logging operation 1 occurs we can write out:

1, {x}, {y} instead of "Began {x} connected to {y}", because we can reconstruct the message as long as we know what operation occurred (1) and the value of all the variables in the message. This general strategy can be extended to any number of logging operations by just giving them all a unique ID.

That is basically the source of their entire improvement. The only other thing that may cause a non-trivial improvement is that they delta-encode their timestamps instead of writing out what looks to be a 23-character timestamp string.
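Both ideas can be sketched in a few lines of Python. The template table, events, and timestamps below are invented; the point is that each event shrinks to a template ID, the variable values, and a (usually tiny) timestamp delta:

```python
# Invented template table, mirroring the two logging operations above.
TEMPLATES = {
    1: "Began {x} connected to {y}",
    2: "Ended {x} connected to {y}",
}

def encode(events):
    """events: list of (timestamp, template_id, variables) tuples."""
    out, prev_ts = [], 0
    for ts, tid, vars_ in events:
        out.append((ts - prev_ts, tid, vars_))  # store delta, not absolute
        prev_ts = ts
    return out

def decode(encoded):
    """Rebuild (timestamp, full message) pairs from the compact form."""
    msgs, ts = [], 0
    for dts, tid, vars_ in encoded:
        ts += dts
        msgs.append((ts, TEMPLATES[tid].format(**vars_)))
    return msgs

events = [
    (1000, 1, {"x": "client-a", "y": "db-1"}),
    (1003, 2, {"x": "client-a", "y": "db-1"}),
]
print(decode(encode(events)))
```

Round-tripping reproduces the original messages exactly, which is why the encoding is lossless despite never storing the template text per event.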

The columnar storage of data and dictionary deduplication, what is called Phase 2 in the article, is still not fully implemented according to the article authors and is only expected to result in a 2x improvement. In contrast, the elements I mentioned previously, Phase 1, were responsible for a 169x(!) improvement in storage density.


So does it mean you need to specify rules for all the unique logging operations you have? I was under the impression that the thing could do it automatically at runtime, based on receiving repetitive logs or something like that. If it doesn't, it is still a hell of a lot of work to compile and maintain the list of rules for your applications.


Figure 2[2] from the article is pretty good.

[2]: https://blog.uber-cdn.com/cdn-cgi/image/width=2216,quality=8...


It seems very cool to get logmine-style log-line templating built right in. I've found it's very helpful to run logs through it, and having a log system that can do this at ingest time for quicker querying seems like it'd have amazing log-digging workflow benefits.


Gavin from Zebrium here. We've done some interesting things with GPT-3 - we use it to produce plain language summaries of log events that describe the root cause of a problem. See - https://www.zebrium.com/blog/using-gpt-3-with-zebrium-for-pl... and https://www.zebrium.com/blog/real-world-examples-of-gpt-3-pl...


This is really cool. You've done a tremendous job in making this very usable (I love your yellow sticky note instructions).


Gavin from Zebrium here. We've found that logs can be a great source for detecting (and then describing) the long tail of unknown/unknowns (failure modes with unknown symptoms and causes), if only you somehow knew what to monitor for. Our approach is to find these patterns in near real-time using ML. This blog by our CTO explains the tech with some good examples: https://www.zebrium.com/blog/is-autonomous-monitoring-the-an....


Gavin from Zebrium here. Completely concur with #1. We are big advocates of writing good logs and not having to worry about structured vs unstructured (and even if you structure your logs, you'll still probably have to deal with unstructured logs in third party components).

Our approach to deal with logs is to use ML to structure them after the fact (and we can deal with changing log structures). You can read about it in a couple of our blogs like: https://www.zebrium.com/blog/using-ml-to-auto-learn-changing... and https://www.zebrium.com/blog/please-dont-make-me-structure-l....

