
> nailing down Unicode and text encodings was still considered rocket science. Now this is a solved problem

I wish…

Detecting text encoding is only easy if all you need to contend with is UTF16-with-BOM, UTF8-with-BOM, UTF8-without-BOM, and plain ASCII (which is effectively also UTF8). As soon as you might see UTF16 or UCS without a BOM, or 8-bit codepages other than plain ASCII (many apps/libs assume that these are always CP1252, a superset of the printable characters of ISO-8859-1, which may not be the case), things are not fully deterministic.
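
For the easy cases, a BOM sniff is about all there is to it; a minimal Python sketch (file name is hypothetical) that handles only the BOM-or-assume-UTF8 situations described above:

    import codecs

    def sniff_bom(data: bytes) -> str:
        # Only the deterministic cases: a recognised BOM, else assume UTF-8
        # (which also covers plain ASCII). BOM-less UTF-16 or unknown 8-bit
        # codepages still need guesswork beyond this.
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"       # decoder strips the 0xEF 0xBB 0xBF prefix
        if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
            return "utf-16"          # decoder uses the BOM to pick endianness
        return "utf-8"

    with open("incoming.csv", "rb") as f:   # hypothetical file name
        raw = f.read()
    text = raw.decode(sniff_bom(raw))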

Thankfully UTF8 has largely won out over the many 8-bit encodings, but that leaves the interesting case of UTF8-with-BOM. The standard recommends against using it, saying plain UTF8 is the way to go, but to get Excel to correctly load a UTF8 encoded CSV or similar you must include the BOM (otherwise it assumes CP1252 and characters above 127 are corrupted). But… some apps/libs are completely unaware that UTF8-with-BOM is a thing at all, so they load such files with the first column header corrupted.
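
For anyone hitting the Excel side of this, Python's utf-8-sig codec is one convenient way to emit the BOM Excel wants; a small sketch (file name and data are made up):

    import csv

    rows = [["name", "city"], ["Zoë", "München"]]   # made-up data

    # "utf-8-sig" writes the UTF8 BOM (0xEF 0xBB 0xBF) before the data, which
    # is what stops Excel assuming CP1252 and corrupting non-ASCII characters.
    with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)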

Source: we have clients pushing & pulling (or having us push/pull) data back & forth in various CSV formats, and we see some oddities in what we receive and what we are expected to send more regularly than you might think. The real fun comes when something at the client's end processes text badly (multiple steps with more than one of them incorrectly reading UTF8 as CP1252, for example) before we get hold of it, and we have to convince them that what they have sent is non-deterministically corrupt and we can't reliably fix it on the receiving end…


> to get Excel to correctly load a UTF8 encoded CSV or similar you must include the BOM

Ah so that’s the trick! I’ve run into this problem a bunch of times in the wild, where some script emits CSV which works on the developer’s machine but fails strangely with real-world data.

Good to know there’s a simple solution. I hope I remember your comment next time I see this!


Excel CSV is broken anyway, since in some (EU, ...) countries it needs ; as separator.

That's not an Excel issue. That's a locale issue.

Due to (parts of?) the EU using the comma as the decimal separator, you have to use another symbol to separate your values.


Comma for decimal separator, and point (or sometimes apostrophe) for thousands separator if there is one, is very common. IIRC more European countries use that than don't, officially, and a bunch of countries outside Europe do too.

It wouldn't normally necessitate not using comma as the field separator in CSV files though; wrapping those values in quotes is how that would usually be handled in my experience.
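
As a small illustration (values made up), Python's csv module quotes a field automatically when it contains the delimiter, so decimal commas survive even with comma as the separator:

    import csv, io, sys

    buf = io.StringIO()
    writer = csv.writer(buf)           # defaults: comma delimiter, minimal quoting
    writer.writerow(["price", "qty"])
    writer.writerow(["1,50", "3"])     # locale-style decimal comma in the value

    sys.stdout.write(buf.getvalue())   # price,qty  then  "1,50",3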

Though many people end up switching to “our way”, despite their normal locale preferences, because of compatibility issues they encounter otherwise with US/UK software written naively.


Locales should have died long ago. Use plain data, and stop parsing it differently depending on where you live. Plan9/9front got this right long ago. Just use Unicode everywhere, and use context-free units for money.

Locales are fine for display, but yes, they should not affect what goes into files for transfer. There have always been appropriate control characters in the common character sets: ASCII and most 8-bit codepages include non-printing control characters with suitable meanings that could be used unescaped in place of commas and EOLs in data fields. Numbers could be plain, perhaps with the dot still as a standard decimal point, or we could store non-integers as a pair of ints (value and scale), dates in an unambiguous format (something like one of the options from ISO8601), etc.
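
The control characters in question are ASCII 0x1F (unit separator) and 0x1E (record separator); a minimal sketch of using them in place of commas and EOLs:

    US, RS = "\x1f", "\x1e"    # ASCII unit separator and record separator

    records = [["1,50", "3"], ["2,75", "1"]]   # decimal commas need no escaping

    encoded = RS.join(US.join(fields) for fields in records)

    # Round-trip: split on the separators, with no quoting rules needed at all.
    decoded = [rec.split(US) for rec in encoded.split(RS)]
    assert decoded == records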

Unfortunately people like CSV to be at least part way human-readable, which means readable delimiters, end-of-record markers being EOLs that a text editor would understand, and the decimal/thousand/currency symbols & date formatting that they are used to.


A lot of the time when people say CSV they mean “character separated values” rather than specifically “comma separated values”.

In the text files we get from clients we sometimes see tab used instead of comma, or pipe. I don't think we've seen semicolons yet, though our standard file interpreter would quietly cope¹ as long as there is nothing really odd in the header row.

--------

[1] it uses the heuristic “the most common non-alpha-numeric non-space non-quote character found in the header row” to detect the separator used if it isn't explicitly told what to expect
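
A rough sketch of that kind of heuristic (not the actual interpreter, just an illustration of the approach):

    from collections import Counter

    def guess_separator(header_line: str) -> str:
        # Most common character in the header row that is not alphanumeric,
        # not a space, and not a quote; fall back to comma if none qualify.
        counts = Counter(
            ch for ch in header_line.rstrip("\r\n")
            if not ch.isalnum() and ch not in " '\""
        )
        return counts.most_common(1)[0][0] if counts else ","

    print(guess_separator("Forename|Surname|Email"))     # -> |
    print(guess_separator("Forename\tSurname\tEmail"))   # -> tab character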


The very fact that UTF-8 itself discourages using the BOM is just so alien to me. I understand they want it to be the last encoding and therefore not in need of an explicit indicator, but as it currently IS NOT the only encoding in use, it makes it just so difficult to work out whether I'm reading one of the weird ASCII derivatives or actual Unicode.

It's maddening and it's frustrating. The US doesn't have any of these issues, but in Europe, that's a complete mess!


> The very fact that UTF-8 itself discourages using the BOM is just so alien to me.

Adding a BOM breaks compatibility with ASCII, which is one of the benefits of using UTF-8.


> The very fact that UTF-8 itself discourages using the BOM is just so alien to me.

One of the key advantages of UTF8 is that all ASCII content is effectively UTF-8. Having the BOM present reduces that convenience a bit, and a file starting with the three bytes 0xEF,0xBB,0xBF may be mistaken by some tools for a binary file rather than readable text.


> The US doesn't have any of these issues

I think you mean “the US chooses to completely ignore these issues and gets away with it because they defined the basic standard that is used, ASCII, way-back-when, and didn't foresee it becoming an international thing so didn't think about anyone else” :)


> because they defined the basic standard that is used, ASCII

I thought it was EBCDIC /s


From Wikipedia...

    UTF-8 always has the same byte order,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8...
    Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance many programming languages permit non-ASCII bytes in string literals but not at the start of the file. ...
    A BOM is unnecessary for detecting UTF-8 encoding. UTF-8 is a sparse encoding: a large fraction of possible byte combinations do not result in valid UTF-8 text.
That last one is a weaker point, but it is true that with CSV a BOM is more likely to do harm than good.
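
That sparseness is why the usual BOM-less fallback works in practice: try a strict UTF-8 decode first and only drop back to a legacy codepage if it fails. A rough sketch, with CP1252 as an assumed fallback:

    def decode_text(data: bytes) -> str:
        # Random 8-bit codepage text is very unlikely to also be valid UTF-8,
        # so a successful strict decode is strong evidence the data is UTF-8.
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            return data.decode("cp1252", errors="replace")

    print(decode_text("naïve".encode("utf-8")))    # decoded as UTF-8
    print(decode_text("naïve".encode("cp1252")))   # falls back to CP1252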

Indeed, I've been using the BOM in all my text files for maybe decades now; those who wrote the recommendation are clearly from an English-speaking country

> are clearly from an English-speaking country

One particular English-speaking country… The UK has issues with ASCII too, as our currency symbol (£) is not included. Not nearly as much trouble as non-English languages have, with the accents & such that they need, but we are still affected.


> Most users just absolutely do not know about, care about, or worry about security, privacy, maintainability, robustness, or a host of other things.

That is a problem that needs to be fixed in those users, not something we should take advantage of as an excuse for releasing shoddy work.

> For some reason this is continually terrifying and shocking to many.

For many reasons.

It means that a good product can be outcompeted by a substandard one because it releases faster, despite the fact it will cause problems later, so good products are going to become much more rare at the same time as slop becomes much more abundant.

It means that those of us trying to produce good output will be squeezed more and more to the point where we can't do that without burning out.

It means that we can trust any given product or service even less than we were able to in the past.

It means that because we are all on the same network, any flaw could potentially affect us all, not just the people who don't care.

The people who don't care, when caring means things release at a lower cadence, are often the same people who will cry loudest and longest about how much everyone else should have cared when a serious bug bites their face off.

and so on… … …

Are you suggesting we should just sit back and let the entire software industry go the way of AAA games or worse?


> > Most users just absolutely do not know about, care about, or worry about security, privacy, maintainability, robustness, or a host of other things.

> That is a problem that needs to be fixed in those users, not something we should take advantage of as an excuse for releasing shoddy work.

Ok. Tech folks have been trying to educate users and get them to make better decisions (in the viewpoint of those tech folks) for a long time. And the current state points to how successful that's been: not very. This isn't exclusive to software... many industries have consumers who make unsound long-term choices (in the viewpoint of experts).

Taking advantage? Besides cases where folks are actually breaking the law and committing fraud, this isn't some kind of illicit activity, it's just building what the users choose to buy/use.

> It means ... It means ... It means ... It means ...

Perhaps, perhaps, perhaps, and perhaps.

> Are you suggesting we should just sit back and let the entire software industry go the way of AAA games or worse?

I'm not sure what "the way of AAA games" means. I'm just laying out how I view the last 30 years of the software industry.

I don't see any reason to expect significant change.


> I'm not sure what "the way of AAA games" means.

The rush to get things out NowNowNowNowNOWNOWNOW has resulted in massive crunches at the end (or even from the very start) of many big projects, and an apparent “sod it, it'll do, we can patch it later” attitude. Over the last decade or more this problem has become worse, with only a few exceptions to the rule.

With “vibe coding” and “vibe designing” taking more load, I expect that “sod it, it'll do, we can patch it later” will become more common everywhere¹, and that is even among those that do have an understanding of the potential security and stability issues that things going out without sufficient review can cause.

--------

[1] Once management are convinced LLM tools will increase throughput by, say, 50% in ideal cases, they'll expect output to increase by 50+% in all cases, and, as in the gaming industry, “if you can't put the hours in, someone else will” is likely to become a key driver when problems in LLM output cause delays or production issues, more so than it might be already.


> This isn't supposed to replace Windows,

OP wasn't suggesting it was, just that the lack of quality in one significant area of the company's output leads to a lack of confidence in other products that they release.


Given anything the size of Microsoft, it's not a good assumption. MS has large research teams that produce really interesting things. Their output is unrelated to released products.

Companies want us to trust their things based on positive experiences with their other things, and that works both ways.

> people publishing articles that contain these kinds of LLM-ass LLMisms don't mind and don't notice them

That certainly seems to be the case, as demonstrated by the fact that they post them. It is also safe to assume that those who fairly directly use LLM output themselves are not going to be overly bothered by the style being present in posts by others.

> but there are also always clearly real people in the comments who just don't realize that they're responding to a bot

Or perhaps many think they might be responding to someone who has just used an LLM to reword the post. Or translate it from their first language if that is not the common language of the forum in question.

TBH I don't bother (if I don't care enough to make the effort of writing something myself, then I don't care enough to have it written at all) but I try to have a little understanding for those who have problems writing (particularly those not writing in a language they are fluent in).


> Or translate it from their first language if that is not the common language of the forum in question.

While LLM-based translations might have their own specific and recognizable style (I'm not sure), it's distinct from the typical output you get when you just have an LLM write text from scratch. I'm often using LLM translations, and I've never seen it introduce patterns like "it's not x, it's y" when that wasn't in the source.


That is true, but the “negative em-dash positive” pattern is far from the only simple smell that people use to identify LLM output. For instance, certain phrases common in US politics have quickly become common in UK press releases due to LLM-based tools being used to edit/summarise/translate content.

> so that you can hibernate

The “paging space needs to be X*RAM” and “paging space needs to be RAM+Y” rules of thumb predate hibernate being a common thing (or even a thing at all), with hibernate being an extra use for that paging space, not the reason it is there in the first place. Some OSs have hibernate space allocated separately from paging/swap space.


> There’s a common rule of thumb that says you should have swap space equal to some multiple of your RAM.

That rule came about when RAM was measured in a couple of MB rather than GB, and hasn't made sense for a long time in most circumstances (if you are paging out a few GB of stuff on spinning drives your system is likely to be stalling so hard due to disk thrashing that you hit the power switch, and on SSDs you are not-so-slowly killing them due to the excess writing).

That doesn't mean it isn't still a good idea to have a little allocated just-in-case. And as RAM prices soar while IO throughput is high & latency low, we may see larger swap/RAM ratios being useful again, as RAM sizes are constrained but working sets aren't getting any smaller.

In a theoretical ideal computer, which the actual designs we have are leaky-abstraction-laden implementations of, things are the other way around: all the online storage is your active memory and RAM is just the first level of cache. That ideal hasn't historically ended up being what we have because the disparities in speed & latency between other online storage and RAM have been so high (several orders of magnitude), fast RAM has been volatile, and hardware & software designs are not stable & correct enough, so regular complete state resets are necessary.

> Why? At that point, I already have the same total memory as those with 8 GB of RAM and 8 GB of swap combined.

Because your need for fast immediate storage has increased, so 8-quick-8-slow is no longer sufficient. You are right in that this doesn't mean 16-quick-16-slow is sensible, and 128-quick-128-slow would be ridiculous. But no swap at all doesn't make sense either: on your machine imbued with silly amounts of RAM, are you really going to miss a few GB of space allocated just-in-case? When it could be the difference between slower operation for a short while and some thing(s) getting OOM-killed?


Swap is not a replacement for RAM. It is not just slow, it is very-very-very slow. Even SSDs are about 10^3 times slower at random access with small 4K blocks. Swap is for allocated but unused memory. If the system tries to use swap as active memory, it is going to become unresponsive very quickly: a 0.1% memory excess causes a 2x degradation, 1% a 10x degradation, 10% a 100x degradation.

What is allocated but unused memory? That sounds like memory that will be used in the near future, in which case we are just scheduling an annoying disk load for when it is needed.

You are of course highlighting the problem: virtual addressing was intended to abstract over memory resource usage, but it provides poor facilities for power users to finely prioritize memory usage.

The example of this is game consoles, which didn't have this layer. Game writers had to reserve parts of RAM for specific uses.

You can't do this easily in Linux afaik, because it is forcing the model upon you.


Unused or Inactive memory is memory that hasn't been accessed recently. The kernel maintains LRU (least recently used) lists for most of its memory pages. The kernel memory management works on the assumption that the least recently used pages are least likely to be accessed soon. Under memory pressure, when the kernel needs to free some memory pages, it swaps out pages at the tail of the inactive anonymous LRU.

Cgroup limits and OOM scores allow memory usage to be prioritized on a per-process and per-process-group basis. The madvise(2) syscall allows memory usage to be prioritized within a process.
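
For the within-process case, madvise is exposed in Python via mmap (3.8+, Linux); a minimal sketch with an anonymous mapping (the size is arbitrary):

    import mmap

    # Anonymous 16 MiB mapping standing in for a large, temporarily-needed buffer.
    buf = mmap.mmap(-1, 16 * 1024 * 1024)
    buf[:] = bytes(len(buf))        # touch the pages so they become resident

    # Tell the kernel this range is no longer needed: for anonymous memory the
    # pages can be dropped straight away (reading them later gives zeroes), so
    # they never compete with hotter pages or get written out to swap.
    buf.madvise(mmap.MADV_DONTNEED)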


> There is 264KB of space left for your newly created files.

This could be increased noticeably by using one of the common extended floppy formats. The 21-sectors-per-track format used by MS¹ for Windows 95's floppy distribution was widely supported enough by drives (and found to be reliable enough on standard disks) that they considered it safe for mass use, and gave 1680KB instead of the 1440KB offered by the standard 18-sector layout. The standard floppy formatting tools for Linux support creating such layouts.
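
The capacities follow directly from the geometry (cylinders × heads × sectors × 512 bytes); a quick check:

    def floppy_kb(cylinders: int, heads: int, sectors: int) -> int:
        return cylinders * heads * sectors * 512 // 1024

    print(floppy_kb(80, 2, 18))   # 1440 KB - standard 3.5" HD layout
    print(floppy_kb(80, 2, 21))   # 1680 KB - the 21-sector layout used for Win95
    print(floppy_kb(82, 2, 21))   # 1722 KB - with two extra tracks squeezed in as well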

--------

[1] There was some suggestion² that MS invented the extended floppy format (such layouts were sometimes called “windows format”), but they³ had been used elsewhere for some time before MS used them for Windows and Office.

[2] I'm not sure if this came from MS themselves, or was invented by the tech press.

[3] and even further extended formats, including a 1720KB one that squeezed in two extra tracks as well as more data per track, which IIRC was used for OS/2 install floppies.


IIRC rsync uses your default SSH options, so turning off compression is only needed if your default config explicitly turns it on (generally or just for that host). If sending compressible content, using rsync's compression instead of SSH's is more effective when updating files, because even if not sending everything it can use the existing data to form the compression dictionary window for what does get sent (though for sending whole files, SSH's compression may be preferable, as rsync is single-threaded and using SSH's compression moves that chunk of work to the SSH process).

> Are there really people who "spend weeks planning the perfect architecture" to build some automation tools for themselves?

Probably. I've been known to spend weeks planning something that I then forget and leave completely unstarted because other things took my attention!

> Commenter's history is full of 'red flags'

I wonder how much these red flags are starting to change how people write without LLMs, to avoid being accused of being a bot. A number of text checking tools suggested replacing ASCII hyphens with m-dashes in the pre-LLM-boom days¹ and I started listening to them, though I no longer do. That doesn't affect the overall sentence structure, but a lot of people jump on m-/n- dashes anywhere in text as a sign, not just in “it isn't <x> - it is <y>” like patterns.

It is certainly changing what people write about, with many threads like this one being diverted into discussing LLM output and how to spot it!

--------

[1] This is probably why there are many of them in the training data, so they are seen as significant by tokenisation steps, so they come out of the resulting models often.


It’s already happening. This came up in a webinar attended by someone from our sales team:

> "A typo or two also helps to show it’s not AI (one of the biggest issues right now)."


When it comes to forum posts, I think getting to the point quickly makes something worth reading whether or not it’s AI generated.

The best marketing is usually brief.


The best marketing is indistinguishable from non–marketing, like the label on the side of my Contoso® Widget-like Electrical Machine™ — it feels like a list of ingredients and system requirements but every brand name there was sponsored.


> … the internet is already full of LLM writing, but where it's not quite invisible yet. It's just a matter of time …

I don't think it will become significantly less visible⁰ in the near future. The models are going to hit the problem of being trained on LLM generated content, which will slow the growth in their effectiveness quite a bit. It is already a concern that people are trying to develop mitigations for, and I expect it to hit hard soon unless some new revolutionary technique pops up¹².

> those Dead Internet Theory guys score another point

I'm betting that us Habsburg Internet predictors will have our little we-told-you-so moment first!

--------

[0] Though it is already hard to tell when you don't have your thinking head properly on sometimes. I bet it is much harder for non-native speakers, even relatively fluent ones, of the target language. I'm attempting to learn Spanish and there is no way I'd see the difference at my level in the language (A1, low A2 on a good day) given it often isn't immediately obvious in my native language. It might be interesting to study how LLM generated content affects people at different levels (primary language, fluent second, fluent but in a localised creole, etc.).

[1] and that revolution will likely be in detecting generated content, which will make generated content easier to flag for other purposes too, starting an arms race rather than solving the problem overall

[2] such a revolution will pop up, it is inevitable, but I think (hope?) the chance of it happening soon is low

