
It's interesting that it was exactly 7 kg. Maybe they read this document first:

(§ 40.22 Small quantities of source material.) https://www.nrc.gov/reading-rm/doc-collections/cfr/part040/p...

"(a) A general license is hereby issued authorizing commercial and industrial firms; research, educational, and medical institutions; and Federal, State, and local government agencies to receive, possess, use, and transfer uranium and thorium, in their natural isotopic concentrations and in the form of depleted uranium, for research, development, educational, commercial, or operational purposes in the following forms and quantities:

... (2) No more than a total of 7 kg (15.4 lb) of uranium and thorium at any one time...

"


Wrong country.


Perhaps it was bound for the USA eventually though?


+1 for Brother laser printers. I've had mine for a few years now and have only gone through one cartridge of toner. It works with my Linux setup too.


How does this work from a technical perspective? If I forward mail to this service instead of clicking the 'unsubscribe' link, then how do they 'unsubscribe' me on my behalf? In order to complete the unsubscription process, the newsletter owner needs to receive an email (usually generated automatically by the mail client when clicking the 'unsubscribe' link) that comes from the person being unsubscribed. If I outsource this process, this means that 'please-unsubscribe.com' will need to spoof emails from me on my behalf and send them to the address specified in the List-Unsubscribe email header. If spoofed emails are not used and they instead come from 'please-unsubscribe.com', this won't be very actionable to the newsletter owner. From their perspective, they're now going to get 'unsubscribe' notifications that all come from an email address at 'please-unsubscribe.com' which never signed up to the newsletter in the first place.

This all assumes that the newsletter unsubscription process is using the email-based method instead of the link-based method (which is a good assumption, since email-based is more common).


It's not at all common anymore to use email replies for unsubscribing. All the marketing emails I get have unsubscribe links, and it's been that way for a few years at least.


I just checked a few newsletter emails in my inbox, and they all offer List-Unsubscribe: mailto: based unsubscription. From what I see, it is quite universal and more common than link-based unsubscription. Newsletter providers prefer not to use URL-based unsubscription because some email providers will crawl the URLs in emails and automatically unsubscribe people without them knowing.

It's also possible that you're talking about a 'link' (a href) in the email body that goes to the newsletter provider's site and asks you a bunch of questions (why are you unsubscribing, etc.). That's a different concept entirely.

From what I've seen, the email-based way is the primary unsubscription method in use today. Even when you use the Gmail feature "Mark this as spam", depending on what you click, it will just send an email on your behalf to whatever is specified in the List-Unsubscribe: mailto: header.
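
For reference, the relevant headers look something like this (the addresses and token here are made up; the mailto form comes from RFC 2369, and the one-click POST form from RFC 8058):

    List-Unsubscribe: <mailto:unsub-token123@news.example.com>, <https://news.example.com/unsub?u=token123>
    List-Unsubscribe-Post: List-Unsubscribe=One-Click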


> From what I've seen, the email-based way is the primary unsubscription method in use today.

It's not. Most people unsubscribe by clicking a link in the email.

What you're referring to is the List-Unsubscribe header, which is again a different concept from what the person above you is saying regarding replying to an email and asking to be opted out.


It may very well be the case that users more often click a link to unsubscribe directly on the site than use the 'unsubscribe' feature in the email client (I don't have any stats on that).

The original question remains though: How does the stated service here work when there is only a List-Unsubscribe: header and no web form?


It's illegal (and mostly impossible with most ESPs) to send a commercial email without a valid unsubscribe link.

If you're receiving email without one, you're better off marking it as the spam it is.


As the author of this blog post, I find this statement interesting. Can you expand upon the sentiments expressed in your statement? Also, by 'read this', do you mean just the blog post, or contents of the works by Jim Roskind?


As the author of the blog post, would you be willing to change its title to "The Jim Roskind C and C++ Grammars"? The text makes it very clear that you know that C and C++ are separate languages, and that the archives actually contain separate grammar files for these separate languages. As they should.

From past experience I find "C/C++" a very grating term. Back when I was reading comp.lang.c, "C/C++" was used in questions by particular brands of people, who tried to compile C++ with a C compiler, or compile C with a C++ compiler, or use C++ features in what they thought was C, or in general were confused about the fact that there is no single language called "C/C++", but rather two separate, incompatible languages called "C" and "C++". Anyway, wouldn't it be good to not add to the (potential) confusion?


hah, because you care so much, I'll go ahead and make that change. It's probably better for SEO anyway. I made the change on my local copy. It will get pushed next time I re-build the site.


Awesome, thanks!


The languages are related but quite different. The ecosystem and ABI are rather similar, though.


Better tell ISO, NVidia, ARM, Intel, AMD, Microsoft, Apple, Google,... to fix their documentation as well.


I know you're just trolling, but I'll take a link to that ISO "C/C++" standard if you have it handy.


Not trolling at all, just fed up with "I know better". C/C++ is an abbreviation for C and C++, a very basic English grammar rule.

You want a link? Here are several,

https://www.arm.com/products/development-tools/server-and-hp...

https://software.intel.com/content/www/us/en/develop/videos/...

https://docs.microsoft.com/en-us/cpp/cpp/c-cpp-language-and-...

https://www.ibm.com/products/xl-cpp-aix-compiler-power

https://developer.amd.com/amd-aocc/

Ah, from ISO as well, no problem, can I start with one paper from Bjarne?

https://isocpp.org/blog/2020/06/thriving-in-a-crowded-and-ch...

> In addition to these groups, there is a semi-official C/C++ liaison group consisting of people who are members of both the C++ committee and the C committee....

I can provide you more papers from the same author, maybe you feel like educating Bjarne Stroustrup on how to write it.


It's worth pointing out that the resource

https://swtch.com/~rsc/regexp/regexp1.html

linked above is also by the same author, Russ Cox,

as the reference I included at the end of the blog post:

https://swtch.com/~rsc/regexp/regexp2.html

All of the insights that I presented through this visualization tool are based upon the knowledge found in Russ Cox's articles. Also, the link above on RE2:

https://github.com/google/re2

is a project that was also (started by|heavily contributed to by) Russ Cox. His writings on the topic of regular expressions are absolutely world-class and I have never found any better resource for understanding the deep low-level theory on how they work in practice.


But Cox would dislike this formulation of regular expressions. His whole point is that Perl-style regexes are not regular languages, and RE2 goes to great lengths to work around this fact.

I'm taking up his arguments with some more traditional terminology that I think may help:

http://www.oilshell.org/blog/2020/07/eggex-theory.html

The entire O'Reilly book "Mastering Regular Expressions" by Friedl is based on the "debugging regexes by backtracking" approach you present here. It talks about "optimizing" regexes by reordering clauses and optimizing the backtracking.

Russ Cox HATES THAT APPROACH and he wrote the entire RE2 project to advocate that it NOT be used! You do not need to optimize, reorder clauses, debug, or trace backtracking when you're using automata-based engines, because they run in guaranteed linear time and constant space (with respect to the length of the input). There is no backtracking.
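
Here's a tiny illustration with stock Python re (a backtracking engine); the pattern and input are contrived, and the exact timing will vary:

    import re, time

    pat = re.compile(r'(a+)+$')        # classic pathological pattern
    s = 'a' * 24 + 'b'                 # cannot match, so the engine backtracks

    start = time.time()
    pat.match(s)                       # roughly doubles in time per extra 'a'
    print('%.2f seconds' % (time.time() - start))

    # An automata-based engine (RE2, grep, awk) answers the same question in
    # time linear in len(s), because there is no backtracking to do.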

(I was at Google when he wrote RE2 and used it shortly afterward on big data, and regular languages are indeed better than regexes for such engineering. Also made clear by the HyperScan work.)

Since his site is down I can't find the original comments, but here's a twitter exchange from 2018 that backs this up:

https://twitter.com/geofflangdale/status/1060313603268501504

https://twitter.com/_rsc/status/1060702201159565313

Also, I believe your article would be improved if you simply take out this sentence:

In computer science theory, this node type should remind you of the difference between DFAs and NFAs.

It is treading close to the confusion that Cox is talking about. So I would just remove it, add a proper explanation, or add a link to the authoritative source.

----

I would love to see a followup that goes through the rest of the article with the Thompson VM and Pike VM approaches! :) I think the whole point is that you don't have to backtrack to implement this VM, and this page confuses that issue.
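
To make the "no backtracking" point concrete, here's a toy sketch of the set-of-states simulation; the NFA is hand-built for the anchored pattern a*ab, whereas a real Thompson/Pike VM would compile it from the pattern:

    def match_a_star_ab(s):
        # transitions of a hand-built NFA for a*ab; state 2 is accepting
        trans = {(0, 'a'): {0, 1}, (1, 'b'): {2}}
        states = {0}                    # set of live states
        for ch in s:                    # one pass over the input, no backtracking
            nxt = set()
            for st in states:
                nxt |= trans.get((st, ch), set())
            states = nxt
        return 2 in states

    print(match_a_star_ab('aab'))   # True
    print(match_a_star_ab('ab'))    # True
    print(match_a_star_ab('aba'))   # False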


The goal with this article was never to show the best or most efficient way to match regular expressions, but rather to convey the correct interpretation of what a given regular expression will do when you try to use it. Even discussions of slow and simple backtracking regular expression matchers are a pretty niche topic and likely too much for casual users of regular expressions.

Yes, I was thinking about also allowing it to optionally cross-compile to the more asymptotically efficient matching algorithm, also documented by Russ Cox. Everything takes time though, and I figured it would make sense to put this out there before investing too much time in something that possibly no one would use.


Sure, but automata-based engines are also "correct" and used in common tools like grep, awk, and sed.

It would be nice to make that clear in the article.

-----

It's a matter of preference, but to me debugging by backtracking is a pointless exercise. It's easier to reason about regexes as sets of strings, composed by sequence and alternation.

Write unit tests that enumerate what you accept and reject, and build them up by composition. There's no need for debugging or tracing.
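
For example, with plain Python unittest (the pattern here is made up for illustration):

    import re, unittest

    # a double-quoted string with no embedded quotes
    QUOTED = re.compile(r'"[^"]*"')

    class TestQuoted(unittest.TestCase):
        def test_accepts(self):
            for s in ['""', '"abc"', '"a b c"']:
                self.assertTrue(QUOTED.fullmatch(s), s)

        def test_rejects(self):
            for s in ['', '"', 'abc', '"a" "b"']:
                self.assertFalse(QUOTED.fullmatch(s), s)

    if __name__ == '__main__':
        unittest.main()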

I understand that the imperative style is more natural to many programmers, but I don't think it requires too much of a leap of thinking. And once you get used to the other style, it saves effort overall (as well as bounding your worst case running time).


Agreed on the point that there are better mental models out there. The difficulty is to make content that is simultaneously correct, but also not over-complicated. I finished off the post with a reference to Russ Cox's work so any sufficiently motivated individual will find the right models there.


How does the "sets of strings" model express the difference between:

   /x*/
and

   /x*?/


Update: I tested it out a little more here.

https://github.com/oilshell/blog-code/blob/master/regular-la...

Basically everything you want with regard to greedy-nongreedy can be done through automata-based engines.

They jump through a lot of hoops to make them work!

However I would also argue that when you're using them, you're doing it as a performance hack in a Perl-style regex engine. In an automata-based engine, you can just write

    /x*/
    /(x*)/
and it's what you want, and it runs fast.


Good question, and here's something I wrote a couple weeks ago to the author of the Regex Lingua Franca paper [1]:

I admit I occasionally use non-greedy repetition in Python, but I think that is the only construct I use that's not in the shell-ish engines. I believe it's usually pretty easy to rewrite without non-greedy, but I haven't tested that theory.

----

OK so I started testing that theory. I collected all the times I used

    .*?
https://github.com/oilshell/blog-code/blob/master/regular-la...

With a cursory inspection, most of them are unnecessary in an automata-based "regular language" engine.

They are "nice to have" in a Perl-style regex engine for PERFORMANCE. But automata-based engines don't have that performance problem. They don't backtrack.

The distinction between .* and .*? doesn't affect whether it matches or not, by definition. It does affect how much it backtracks. (And it affects where the submatches are, which I need to test out.)
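
A quick Python illustration (whether there is a match is the same either way; only the extent of the match differs):

    import re
    s = '<a> <b>'
    print(re.search(r'<.+>',  s).group(0))    # '<a> <b>'  (greedy)
    print(re.search(r'<.+?>', s).group(0))    # '<a>'      (non-greedy)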

So I think most of my own usages ARE "nice to have" in Python, but unnecessary in awk because it uses an automata-based engine.

But I'd like to look at it more carefully, and I'm interested in answering this question more thoroughly (feel free to contact me about it, e-mail in profile).

I started a demo here. Still needs work.

https://github.com/oilshell/blog-code/blob/master/regular-la...

----

So here's another conjecture: regexes are bad because they force you to make UNNECESSARY DISTINCTIONS in the name of performance.

You have to care about greedy vs. nongreedy, because there is backtracking.

In contrast, with regular languages, you just write what you mean, and they do it quickly for you. You can't write a slow expression for a regular language.

There are some caveats around that, like submatch extraction with nongreedy repetition, which I will test out. If anyone has specific examples or use cases, send them my way.

I can't see where I really rely on nongreedy submatches, but I'll have to look more closely. I think the nongreedy use case is usually just "skip quickly until this other string appears", e.g. the one in pgen2/tokenize.py that I didn't write.

----

As far as I remember, submatch extraction works fine in automata-based engines, and Cox's articles are in large part about the (then) little-known algorithms for doing it in automata-based engines. re2c also has a 2017 paper for their algorithm to do it in DFAs:

http://re2c.org/2017_trofimovich_tagged_deterministic_finite...

----

[1] thread: https://news.ycombinator.com/item?id=23687478


I don't agree it's primarily about performance. /".+"/ will find a quoted string, but you will certainly prefer the non-greedy match instead.

Now you may torture the greedy regex to make it work (/"[^"]+"/), but the point stands. "Did it match" is rarely enough; the regex author usually wishes to know where it matched, and to exercise control over which match is preferred if there are multiple possibilities. It's indeed possible to do this with DFAs but the theory is comparatively hard, as a cursory reading of your linked paper will show!


Yeah that is a good example, and I recorded it here:

https://github.com/oilshell/blog-code/blob/master/regular-la...

I don't mind the rewrite, but I'm not surprised that some would be inconvenienced by it. (Though that is part of the reason for the Eggex syntax.)

----

On the other side I'll offer that I wrote a lexer for bash and all its sublanguages without any nongreedy repetitions. It's dozens if not hundreds of regexes.

https://github.com/oilshell/oil/blob/master/frontend/lexer_d...

Though, regular languages are more like a building block for the lexer than the lexer itself. As mentioned elsewhere, they're too limited to express the entire lexer. They're too limited to express syntax highlighters (but so are regexes! They also need some logic outside the model, e.g. for lexing shell here docs)

-----

So I pose the question of whether there's a more powerful abstraction that's still linear time and constant space:

http://www.oilshell.org/blog/2020/07/ideas-questions.html#mo...

Of course there is, but it's a matter of picking one that will address a lot of use cases effectively.

I think re2c has grown something like

    .*?
for example. grep needs something like that too. Often you want to be able to skip input in a linear fashion, without invoking the pattern matching logic at all.

Regexes are OK if you know how to use them, but evidently most people have lots of problems with them (the necessity of OP's demo being an example of that). I think there's something better out there built on top of regular languages.


You hit upon an important cleavage: features vs worst-case perf. A generous lexer supports partial input, syntax highlighting, helpful error messages, etc., while a strict one is not so nice, but guarantees protection against malicious input. But the way you write these is very different! I also would like to write one lexer that can operate in both modes, but with one syntax.

BTW, high five from one shell author to another :)


Yeah I've been looking at fish. Definitely has the cleanest code out of any shell I've looked at :) (e.g. bash, dash, zsh, mksh)

And I came to realize what a huge effort it is on top of the language:

http://www.oilshell.org/blog/2020/02/recap.html#fish-oil-is-...

Also I wonder if you knew of the existence of ble.sh, a fish-like UI written in pure bash! I think it's the biggest shell program in the world:

http://www.oilshell.org/cross-ref.html?tag=ble.sh#ble.sh

https://github.com/oilshell/oil/wiki/The-Biggest-Shell-Progr...

----

I did add a little hack in the lexer for autocompletion of partial input. But my hope is to keep most of that logic in the parser and not the lexer. Though it is true I was thinking of breaking up the exact regex we talked about for similar reasons, e.g. a single token:

    ' [^']* '
vs. 3 tokens

    '
    [^']*
    '

I think the latter could be better for completion of paths in quotes, which bash does, but I haven't revisited it yet.


Hi, I'm the author of this page. This tool has a lot of complexity in it and a huge number of corner-cases. If you discover any bugs or XSS vulnerabilities, feel free to let me know.


This is fantastic, thanks for sharing it.

I wish I'd had it when I was a CS professor in 2004-2006, teaching Compilers to 2nd-year university students. They would have found it really useful.


I'm having trouble with backtracking and groups. It's possible I'm doing something wrong, but

    [bc]*(cd)+
doesn't seem to match

    cbcdcd
Nor does the simpler

    a*(ab)
match

    aab
But

    a*ab
matches aab.


You found a bug. The bug was: When I was computing the animation frames for the visual, after a backtracking event, I forgot to restore the 'progress' values for the PROGRESS nodes to what they were before the stack push took place. I think I fixed this issue, and I just pushed an update which should appear any second when the cache invalidates.


Cool! Thanks for looking at it/fixing it.

P.S. The whole split() thing you do is an interesting hack to not have to preserve/rewind so much state. :D I've never quite seen fork() used so...


I tried using this expression:

   ^((?:(?:https?|ftps?):\/\/)|irc:\/\/|mailto:)([\w?!~^\/\\#$%&'()*+,\-.\/:;<=@]*[\w~^\/\\#$%&'()*+,\-\/:;<=@])  
and it seems to fail to visualize it.


The (?:stuff) part is for a non-capturing group which I didn't add support for. Maybe in the near future. That would be one of the easiest features to add since the tool also doesn't support backreferences yet.

Edit: Actually, since the non-capturing group part doesn't add much here, you can just remove the '?:'s and this works:

https://blog.robertelder.org/regular-expression-visualizer/?...
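
For reference, that's the same pattern with the two '?:' markers removed (so those groups become capturing):

    ^(((https?|ftps?):\/\/)|irc:\/\/|mailto:)([\w?!~^\/\\#$%&'()*+,\-.\/:;<=@]*[\w~^\/\\#$%&'()*+,\-\/:;<=@])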


The first one that I tried also failed for the same reason, FWIW. Cool tool! :)


That's really impressive work. What flavor of regular expressions do you follow there?


I tried to follow a middle-of-the-road, works-mostly-everywhere, made-up flavour. Here is a grammar for it:

https://blog.robertelder.org/regular-expression-parser-gramm...

I may add to it in the future if there is enough interest. At the moment it doesn't support a few common things like word breaks and backreferences, but I don't think it'd be too hard to add.


Any chance of open sourcing it so we can contribute some regex features to your awesome project? :)


I'll provide another voice to support the idea that d3 is a really great library. For anyone who has tried to learn it in the past and found it a bit too complicated, I'd suggest you approach it with the following perspective: Learning d3 is more about learning SVG than it is about learning d3 (unless of course you already know SVG).


I did something similar recently using my custom terminal diff tool. Here is a more colourful terminal-based view of the differences between MN908947.3 and SARS NC_004718.3:

https://twitter.com/RobertElderSoft/status/12226906294923796...

To produce that image I used this tool:

https://github.com/RobertElderSoftware/roberteldersoftwaredi...

With these two genomes:

https://www.ncbi.nlm.nih.gov/nuccore/MN908947

https://www.ncbi.nlm.nih.gov/nuccore/NC_004718.3

with the leading numbers, spaces and newlines removed from each genome file.

My tool (as well as Unix diff with --minimal) uses the Myers diff algorithm, which just computes a mathematically minimal diff. More advanced algorithms exist for computing phylogenetic trees that take into account the biological likelihood of certain sequence changes (deletions vs. additions vs. translations, etc.).
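
If you want to play with the idea without my tool, here's a toy version using Python's difflib (which uses its own matching heuristic rather than Myers, but the output has a similar shape):

    import difflib

    a = 'ACGTTACGTA'    # toy sequences, not real genome data
    b = 'ACGATACGCTA'
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        print(tag, repr(a[i1:i2]), '->', repr(b[j1:j2]))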


"Do not rely on security.txt as authorization to test a site!", I was actually thinking of putting up a security.txt recently to do exactly this and explicitly define the terms upon which it would be ok to have people probe my site without retribution. Arguably, this might actually be the most valuable thing to include in a security.txt. I can't really see any down side (can anyone think of any?) as this would help attract attention from the most noble white hats who are likely to help you and are eager to sharpen their skills. Black hats won't care what you put in it. You'd probably want a lawyer to help you flesh out decent terms, but it would be a great playground to have a list of sites that invite people to pen test them under clear terms for mutual benefit.


Having a bug bounty program is generally a delicate dance between letting people loose on your infrastructure and catching any serious bugs. security.txt or not, that's something that probably has to be written out clearly, or else it's "all rights reserved" (don't try to hack us, and if you do find something then be very cautious).

As an additional thought that just occurred to me, we haven't seen how authorization to probe systems (a bug bounty program) would interact with the GDPR, e.g. a breach happens because of an "ethical" hacker, have we?


'expect' is definitely for edge-case automation, but in the few cases where it is useful, it is extremely useful. Despite the fact that it's quite 'old fashioned', it is a dependency of DejaGnu which, itself, is a dependency of the test cases for a huge number of GNU projects. This leads me to believe that it'll still be around for a very long time and will continue to work the same as it does now.


I used it as recently as a couple years ago to make short work of some build + deployment stuff on some slightly unusual platform. Might've been Garmin wearables? Something like that.

When you need "expect", it's wonderful.


Sure, but is it worth learning Tcl for a few edge cases? I like Python, so I used "pexpect" in the past. I am sure other languages have such libraries too.
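
A minimal pexpect sketch, for anyone who hasn't seen it (the host, prompt, and commands are made up):

    import pexpect

    child = pexpect.spawn('ssh admin@192.0.2.1')   # made-up device
    child.expect('password: ')
    child.sendline('not-a-real-password')
    child.expect(r'\$ ')
    child.sendline('show battery-status')
    child.expect(r'\$ ')
    print(child.before.decode())                   # output of the command
    child.sendline('exit')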

And if you are going for solid production, you probably want to re-implement "expect" anyway -- so all input is whitelisted and unexpected messages are flagged. You probably don't want to break your expensive 1990s industrial device because you were driving it with an expect script and it cheerfully ignored "WARNING RESERVE BATTERY DEAD, REPLACE BEFORE POWER OFF" messages.

