The Invisible Character That Cost Me Too Much Debugging Time (dochia.dev)
23 points by ludovicianul 81 days ago | 11 comments


This is really just an ad for Dochia's testing product.

But the first half of the post really is an interesting problem -- what to do about invisible Unicode characters that wind up in a username login field, thus becoming an invalid user, because the username was copied-pasted from a source that inserted things. The post lists potential sources as:

> Copy-paste from PDFs or Word docs: Rich-text formats often inject hidden control characters.

> Email clients and chat apps: Some insert soft hyphens, directionality markers, or non-breaking spaces.

> Keyboards and IMEs: Certain language input systems add combining marks or zero-width joiners.

But of course it's part of a broader Unicode problem, like the fact that there are two ways of representing common accented characters (precomposed vs decomposed) that are also not equivalent, or that multiple accents can be in a different order. Normalization handles those cases, but it doesn't do anything about nonprinting characters.

Is there not any common method for Unicode we should be using to check for, essentially, "grapheme comparison" that doesn't just normalize but ignores non-printing codepoints?
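I'm not aware of a single standard call for this, but a rough sketch of the idea in Python could look like the following. The function names are made up for illustration, and treating every `Cf` (format) and `Cc` (control) character as ignorable is an assumption: it also drops zero-width joiners and non-joiners, which carry meaning in some scripts and in emoji sequences, so it may be too aggressive outside simple username fields.

```python
import unicodedata

# Categories treated as "invisible" here (an assumption for this sketch):
# Cf = format characters such as U+200B, U+200E, U+00AD; Cc = control characters.
INVISIBLE_CATEGORIES = {"Cf", "Cc"}


def visible_key(text: str) -> str:
    """Build a comparison key that ignores invisible code points."""
    # Drop invisible characters first so they cannot sit between a base
    # letter and its combining mark and block composition.
    kept = "".join(
        ch for ch in text if unicodedata.category(ch) not in INVISIBLE_CATEGORIES
    )
    # NFC makes precomposed and decomposed accents compare equal.
    # strip() also removes leading and trailing whitespace, including U+00A0.
    return unicodedata.normalize("NFC", kept).strip()


def same_visible_text(a: str, b: str) -> bool:
    """Compare two strings by their visible-character keys."""
    return visible_key(a) == visible_key(b)
```

With this, `same_visible_text("alice\u200e", "alice")` is true, while the raw strings still differ. Whether to store the cleaned key or reject the input outright is a separate design decision.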


Self promotion, and seems somewhat fake? The front page of their website has user testimonials from 2026.


Author here. I've just launched the tool and wanted to have some simple dev humor in it. It intentionally says future testimonials. The story is real though. And it happened with other types of hidden chars in different forms.


It literally says, 'user@dev:~$ tail -f future-happy-customers.log [INFO] What engineers will be saying (when they're not debugging)', so the future part is intentional. Do with that information what you will.


I've hit something similar recently though thankfully it didn't cause significant problems¹: a left-to-right indicator, U+200E, at the end of a user's name.

Apparently Word has a habit of inserting these in fields, whether or not they are needed in context, when any language pack that supports a right-to-left language is installed. Once added they are silently maintained and, depending on exactly what you select, may get included when you copy the text out to paste elsewhere, or when you use some form of automation to read the field value directly from the document or Word itself.

--------

[1] I noticed it while digging into some output to analyse a related issue, the file had been mashed together from content with different codepages in a way that meant it included invalid code points.


At one company I worked with, we used to import data from other systems into our project management system for clients to help them get set up.

This was the 2000s so it was all scripts (SQL scripts and VBScripts, I seem to remember). As part of it, we ended up cleaning the customer data of a myriad of problems: inconsistent capitalization, leading and trailing spaces, and this. Weird characters you didn't even know existed.

Over time more and more of these hidden characters were added to the script, because back then it wasn't a case of googling it or asking on SO.

I have a friend who works as a data analyst for a local council. He hates the school reports season as the data from the schools comes in with all sorts of weird problems in consistency.


A common issue I've come across is an invisible character added when you copy a certificate fingerprint in Windows. https://support.microsoft.com/en-us/topic/certificate-thumbp...
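For pasted fingerprints, one defensive option is to keep only hex digits before comparing. This is a minimal sketch; `clean_thumbprint` is a made-up name, and the leading U+200E in the example is just an assumed stand-in for whatever invisible character gets copied.

```python
import string


def clean_thumbprint(pasted: str) -> str:
    """Keep only hexadecimal digits and normalize to upper case."""
    # Anything that is not 0-9, a-f or A-F (spaces, colons, invisible
    # characters) is dropped.
    return "".join(ch for ch in pasted if ch in string.hexdigits).upper()
```

For example, `clean_thumbprint("\u200ea9 09 50 2d")` gives `"A909502D"`.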


Somewhat related: a coworker of mine recently wrestled with unexpected output coming from the company's internal CLI tool. What he was seeing did not match the flags that were specified in the command.

Lo and behold, his input method automatically collapsed two consecutive dashes into an en-dash (`–-f`), and the "option" was instead treated as a regular positional argument.
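A CLI could warn about this case before parsing. The sketch below is hypothetical (it is not taken from the tool in question) and flags arguments that begin with a Unicode dash other than the ASCII hyphen-minus, using the `Pd` (dash punctuation) general category.

```python
import unicodedata


def suspicious_dash_args(argv):
    """Return arguments starting with a non-ASCII dash (category Pd)."""
    return [
        arg
        for arg in argv
        # ASCII "-" is also category Pd, so it is excluded explicitly.
        if arg and arg[0] != "-" and unicodedata.category(arg[0]) == "Pd"
    ]
```

Calling `suspicious_dash_args(["\u2013-f", "--force", "file.txt"])` returns only the first argument, which a tool could then report as a probable mistyped option.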


Something like that made it into a colleague's Ruby code, and it blew up! I think he lost half a day to it.

It was back in the 1.8.7 days, just before proper Unicode support in 1.9, but I don’t remember if that was relevant to this story.

He was deleting code until the bug disappeared, and then we zeroed in and found the character.

It was in the TextMate days, and it didn’t highlight such characters.


I'm flashing back to when we used to hold the punch card to the light looking for the unprintable character that had to be hiding in there somewhere... ah, there it is!


Microsoft is terrible about this kind of stuff. We have a big problem with MS Teams replacing tabs with nbsps in XML code snippet blocks. It breaks our pom files. We've also had similar issues with pasting excel tables into emails.
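A small pre-commit style check can at least point at the offending spots. This is only a sketch with a made-up helper name; it reports the 1-based line and column of every U+00A0 in a snippet of text.

```python
def find_nbsp(text: str):
    """Return (line, column) pairs, 1-based, for each no-break space."""
    return [
        (lineno, col)
        for lineno, line in enumerate(text.splitlines(), 1)
        for col, ch in enumerate(line, 1)
        if ch == "\u00a0"
    ]
```

Run over a pasted pom fragment such as `"<a>\n\u00a0\u00a0<b/>\n</a>"`, it reports `[(2, 1), (2, 2)]`, the two no-break spaces used as indentation on the second line.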



