> Your accent is Dutch, my friend. I identified your accent based on subtle details in your pronunciation. Want to sound like a native English speaker?
I'm British; from Yorkshire.
When letting it know how it got it wrong there's no option more specific than "English - United Kingdom". That's kind of funny, if not absurd, to anyone who knows anything of the incredible range of accents across the UK.
I also think the question "Do you have an accent when speaking English?" is an odd one. Everyone has an accent when speaking any language.
I agree there is no such thing as a "British accent" (though I'm lucky that my mockney lilt is considered one), but Dutch, Danish, and Yorkshire accents are very similar for historical reasons, so it's somewhat understandable for this app to detect you as Dutch.
I find Danes speaking Danish to sound like a soft Yorkshire accent, and the vowels that Yorkies use are better written in Danish, like phøne.
> I also think the question "Do you have an accent when speaking English?" is an odd one. Everyone has an accent when speaking any language.
Sure, I agree. But look at it from the perspective of a foreigner living in an English-speaking country, which is probably their target demographic.
We know that as soon as we open our mouths the locals will instantly pigeonhole us as "a foreigner". No matter how good we might be in other areas, we will never be one of "them". The degree of prejudice that may or may not exist against us doesn't matter as much as the ever-present knowledge that the locals know we are not one of them, and the fear of being dismissed because of that.
Nobody likes to stand out like that, particularly when it so clearly puts you at a disadvantage. That sort of insecurity is what this product is aimed at.
It's not ethical to lie to people about whether they need something you're selling, especially if you're playing on their fears of vulnerability to make the sale. Laundering the lies through an AI model doesn't make it any less bad.
BoldVoice is very clear about being an American accent "training app", so that's not (necessarily) what's happening here, but the point remains.
Yeah, it's the same for having just one "German" accent. Swiss and Austrians, but also north vs middle vs south Germans, still sound different - even when they speak English.
It's quite offensive. English is my native tongue, I got a perfect IELTS score, and one of my parents was an English professor. But my accent makes me less than "native".
It's often required for immigration purposes. Countries/universities will let you off if you're coming from a country that has English as its main language, or if you've studied a degree in the language, but they often won't if you're a native English speaker living elsewhere.
The first two days were a shock, as it felt like a different language. But after some time I adjusted, and now I find both Singlish pronunciation and phrases endearing.
For example, the first time I heard "ondah-cah?" I was puzzled. Then I understood that it is "Monday can?", which, as I learned, means "Would Monday work for you?".
There's a bunch of open source work in the robot combat space, but it doesn't have to be used for robot combat specifically. The Malenki Nano is a great example: a tiny, open source receiver with three speed controllers on a single board. It has two PWM channels for servos too so you could do a ton of interesting projects with it.
Not directly relevant to the post, but seems like a good place to share.
My team and I once took on a very tricky automation project. At the time we had a complex software deployment done about once per month that involved a team of about a dozen people showing up at 4am to do it while traffic was low.
The deployment involved many manual steps and coordination of everybody involved. The person leading each deployment followed the documented list of steps and got each person to do their bit at the right time; people to run database migrations, people to install RPMs on particular servers, people to test and verify functionality. Mistakes and missed steps were not uncommon.
The very first thing we did was take the documentation and write a Jenkins job to post each step into a Slack channel specifically for coordinating the deployments. Someone clicked "go" and each step was posted as a message in that channel with a 'done' button to be clicked when that step was done. Clicking the button caused the next step to be posted.
The next release we did used that instead of one person reading the steps out of confluence. Everyone involved in the release could always see what step we were at, and when it was their turn to do their bit. This helped ensure no steps were ever missed too.
Over the following months we chipped away at that job a bit at a time. We'd pick a step in the process and automate just that step, starting with the low-hanging fruit first. The Slack message for that step went from "click to confirm you've done it" to "click to do it", with the result posted once it was done; followed by the next step to perform.
It was a long process, but it allowed the rest of the business (and us!) to gradually gain confidence in the automation, and lowered the risk of the project dramatically. Once several steps had been automated and battle-tested we removed the 'click to do' bits in between and the whole release became a couple of clicks followed by the odd bit of manual QA.
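The pattern generalizes well beyond Slack and Jenkins: each step is either a manual confirmation or a callable, and steps migrate from the first kind to the second one at a time. A minimal sketch of that idea (hypothetical step names; no Slack API specifics implied):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    name: str
    action: Optional[Callable[[], None]] = None  # None => still a manual step

def run_checklist(steps: List[Step], confirm: Callable[[str], None]) -> None:
    """Walk the checklist in order. Automated steps run their action;
    manual steps block on a human confirmation (the 'done' button)."""
    for step in steps:
        if step.action is not None:
            step.action()        # automated: 'click to do'
        else:
            confirm(step.name)   # manual: 'click when done'

# Over time, steps migrate from manual to automated:
release = [
    Step("Drain traffic", action=lambda: print("draining...")),  # automated
    Step("Run DB migrations"),                                   # still manual
    Step("Verify functionality"),                                # still manual
]
```

The nice property is that the list itself is the documentation, so automating a step never desynchronizes the runbook from reality.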
What is the point of defining a Python class with a single `run` method, and then running with `Class.run()`, instead of just defining a `function` and running with `function()`?
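For illustration, the two spellings side by side (hypothetical names): the class buys nothing until there is state to hold or multiple methods to share.

```python
# Class with a single entry point and no state:
class Deployer:
    def run(self) -> str:
        return "deployed"

# The equivalent plain function:
def deploy() -> str:
    return "deployed"

# Deployer().run() and deploy() do the same thing; the class
# adds ceremony (instantiation, self) without adding capability.
```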
Checklist automation is a very powerful tool. Making a script that just walks you through the checklist is the first step, because you have to debug the checklist. It's really hard to automate something if you don't really know what it is you're doing, and steps are always missed.
Once you debug the checklist you can start automating the steps and then you find the things that are easy for humans to do but hard for computers. That's the fun part.
In one case I worked on, they automated the sales funnel, but then customers started asking for refunds and complaining online. Turns out the automation didn't work.
I got a team together to get the customers happy and then did a Stalin's Postman test where we walked through the whole process. All but one of the steps failed.
Now that we knew what was failing, we could start the process for fixing it.
A ~20 minute video on a particular number cropping up all over the place and not one mention of the frequency illusion / Baader–Meinhof phenomenon feels at least a little bit disingenuous.
I don't doubt that it crops up a lot when people are asked for a random number between 1 and 100 (especially if, as per the video, you ignore 1, 2, 7, 42, 69, 73, and 77, and sometimes 99 and 100). But it's pretty disappointing for a big science channel like Veritasium not to even mention that if you became obsessed with a different number between 1 and 100 and went hunting for its special properties, you'd almost certainly find them.
"The Law of Fives states simply that: ALL THINGS HAPPEN IN FIVES, OR ARE DIVISIBLE BY OR MULTIPLES OF FIVE, OR ARE SOMEHOW DIRECTLY OR INDIRECTLY APPROPRIATE TO FIVE.
The Law of Fives is never wrong. In the Erisian Archives is an old memo from Omar to Mal-2: 'I find the Law of Fives to be more and more manifest the harder I look.'" - Principia Discordia, https://en.wikiquote.org/wiki/Principia_Discordia#THE_LAW_OF...
This looks cool. I ran this on some web crawl data I have locally - all the files you'd find on regular websites: HTML, CSS, JavaScript, fonts, etc.
It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".
Some woff and woff2 files it identified as "TrueType Font Data", others are "Unknown binary data (unknown)" with low confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".
I like the idea, but the current implementation can't be relied on IMO; especially not for automation.
A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37m` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
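The usual fix, for reference, is to gate the escapes on a TTY check. A Python sketch (hypothetical helper; the actual tool may do this differently):

```python
import sys

def maybe_color(text: str, code: str = "1;37", stream=None) -> str:
    # Emit ANSI escapes only when writing to an interactive terminal.
    # Pipes and files report isatty() == False, so piped output stays
    # clean for vim buffers, grep, etc.
    stream = sys.stdout if stream is None else stream
    if stream.isatty():
        return f"\x1b[{code}m{text}\x1b[0;39m"
    return text
```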
Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce the issue - send us an email at magika-dev@google.com if that is possible.
For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use cases would emerge, so it is good to know that such a model might be useful.
We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best efforts to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.
Thanks again for sharing your experience with Magika; this is very useful.
Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff
These are files that were in one of my crawl datasets.
I've worked on similar problems recently so I'm well aware of how difficult this is. An example I've given people is in automatically detecting base64-encoded data. It seems easy at first, but any four, eight, or twelve (etc) letter word is technically valid base64, so you need to decide if and how those things should be excluded.
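To make the pitfall concrete, here's a minimal structural check (real detectors layer length and entropy heuristics on top of this):

```python
import base64

def is_structurally_base64(s: str) -> bool:
    # All base64 formally requires: characters from its alphabet
    # and a length divisible by 4.
    if not s or len(s) % 4 != 0:
        return False
    try:
        base64.b64decode(s, validate=True)
        return True
    except Exception:
        return False

# The trap: ordinary 4/8/12-letter words pass the check, because
# they "decode" to arbitrary bytes just fine.
# is_structurally_base64("word")     -> True
# is_structurally_base64("password") -> True
```

So a structural check alone drowns you in false positives on English text; you have to decide whether short dictionary words should be excluded, and how.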
LOL nice b8 m8. For the rest of you who are curious, the files look like this:
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://placement.api.test4.example.com/" on this server.<P>
Reference #18.9cb0f748.1695037739.283e2e00
</BODY>
</HTML>
Are you attempting to assert that use of these files solely for the purpose of improving a software system meant to classify file types does not fall under fair use?
Here's another one for you: Do you believe that all pictures you have ever taken, all emails you have ever written, all code you have ever written could be posted here on this forum to improve someone else's software system?
If so, could you go ahead and post that zip? I'd like to ingest it in my model.
Your question seems orthogonal to the situation. The three files posted seem to be the minimum amount of information required to reproduce the bug. Fair use encompasses a LOT of uses of otherwise copyrighted work, and this seems clearly to be one.
You're so invested that you haven't noticed the files have been renamed and zipped, so they're not even indexable. How you'd expect anyone not participating in software development to find them is yet to be explained.
I'm over here trying to fathom the lack of control over one's own life it would take to turn someone into an online copyright cop, when the data in question isn't even their own, is clearly divorced from any context that would make it useful for anything other than fixing the bug, and the original copyright holder hasn't even complained.
Some people just want to argue.
If the copyright holder has a problem with the use, they are perfectly entitled to spend some of their dollar bills to file a lawsuit, as part of which the contents of the files can be entered into the public record for all to legally access, as was done with Scientology.
I'm always happy to stand up for folks who make things over people who want to police them. Especially when nothing wrong has happened. Maybe take a walk and get some fresh air?
I share your distaste for people whose only contribution is subtraction, but I suggest you lay off the sarcasm. Trolls; don't feed. (Well done on your project, BTW)
I don't see any sarcasm from me in the thread. I had serious questions. Perhaps you could point out what you see? Thanks for the supportive words about the project.
I've certainly seen people say similar things facetiously, but I was being genuine. I'm not sure if beeboobaa was trolling or not, I try to take what folks say at face value. They seemed to be pretty attached to a particular point of view, though. Happens to all of us. The thing for attachment is time and space and new experiences. Walks are great for those things, and also the best for organizing thoughts. Einstein loved taking walks for these reasons, and me too. It feels better to suggest something helpful when discussion derails, than to hurl insults as happens all too frequently.
I already had my walk this morning, thanks! If you'd like to learn more about copyright law, including about all the ways it's fuzzy around the edges for legitimate uses like this one, I highly recommend groklaw.net. PJ did wonderful work writing about such boring topics in personable and readable ways. I hope you have a great day!
Thanks for such great opportunities to post educational content to Hacker News! I genuinely hope some things go your way, man. Rooting for you. Go get 'em.
If you can’t undermine someone’s argument, undermine their nationality. American tech culture doesn’t do this as much as it should, perhaps because we know eventually those folks wake up.
Not sure what your point is, but why would i care to learn about the laws of some other dude's country that he's using to support his bizarro arguments?
And then clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) static and dynamic analysis tools; `debsums`/`rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they were supposedly installed from; a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan against what was presumed good; ~gdesktop LLM tools scan every file;
and there are extended filesystem attributes for label-based MAC systems like SELinux. Oh, and NTFS ADS.
A sufficient cryptographic hash function yields random bits with uniform probability.
DRBGs (Deterministic Random Bit Generators) need high-entropy random bits in order to continuously re-seed the RNG (random number generator).
Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?
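For reference, the per-file hashing all of these tools share looks roughly like this (stdlib sketch):

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 16) -> str:
    # Stream the file in chunks so large binaries don't need to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

(Python 3.11+ has `hashlib.file_digest(f, "sha256")` for the same job. On question 3: Argon2 is deliberately slow and memory-hard, which is the property you want for passwords, not for bulk file-integrity hashing.)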
> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API. ...
With package metadata, not a (file hash, package) database that could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.
Might as well be heating a rooftop pool with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.
Add'l useful formats:
> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories
File-based hashing is done in so many places; there's so much heat.
Sub-file hashing with feature engineering is necessary for AV, which must take packing, obfuscation, loading, and dynamic analysis into account in addition to zip archives and magic file numbers.
AV (antivirus) applications with LLMs: what do you train them on, and what are the existing signature databases?
https://SigStore.dev/ (The Linux Foundation) also has a hash-file inverted index for released artifacts.
Also otoh with a time limit,
1. What file is this? Dirname, basename, hash(es)
2. Is it supposed to be installed at such a path?
3. Per its header, is the file an archive, an image, or a document?
4. What file(s), records, and fields are packed into the file, and what transforms were applied to the data?
I'm not the person you asked, but I'm not sure I understand your question and I'd like to. It whiffed multiple common softballs, to the point it brings into question the claims made about its performance. What reasoning is there to trust it?
It had 3 failures. How is that a sign it's untrustworthy? I'm sure all alternatives have more than 3 failures. You might be making assumptions about the distribution of successes and failures (GP didn't say how many files they tested to find those 3) or how "soft" they were. In an extreme case, they might even have been crafted adversarial examples. But even if not, they might have features that really do look more like some other file type from the point of view of the classifier even if it's not easily apparent to a human. Being strictly superior to a competent human is a pretty high bar to set.
From the comment:
> It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".
Those are only soft to a human. I looked at a couple and I picked them correctly but I don't know what details the classifier was seeing which I was blind to. Not to say it was correct, just that we can't call them soft just because they're short and easy for a human.
> The bar is the file utility.
It has higher accuracy than that. You would reject it just because its failures are different, even though there are fewer of them?
Yes. Unpredictable failures are significantly worse than predictable ones. If `file` messes up, it's because it decided a ZIP-based document was a generic ZIP file. If Magika messes up, it's entirely random. I can work around `file`'s failure modes, especially ones I encounter often. Magika's failure modes strike at random and are not possible to anticipate. `file` also bails out when it doesn't know; a very common failure mode in Magika is that it confidently returns a random answer for a file type it wasn't trained on.
Your original statement was that having a couple of failures brings its performance claims into question. It doesn't, because it doesn't claim such high performance: 99.31% is lower than, say, 997 out of 1000, or whatever the GP actually tested. Of course having unpredictable failures is a worry, but it's a different worry.
They uploaded 3 sample files for the authors, there were more failures than that, and the failures that GP and others have experienced are of a less tolerable nature. This is the point I was making, that the value added by classifying files with no rigid structure is offset heavily by its unpredictable shortcomings and difficult-to-detect failure modes.
If you have a point of your own to make I'd prefer you jump to it. Nitpicking baseless assumptions like how many files the evil GP had to sift through in order to breathlessly bring us 3 bad eggs is not something I find worthwhile.
The point I'm making is that you drew a conclusion based on insufficient information, apparently by making assumptions about the distribution of failures or the definition of "easy".
It provided the wrong file-types for some files, so I cannot rely on its output to be correct.
If you wanted to, for example, use this tool to route different files to different format-specific handlers it would sometimes send files to the wrong handlers.
Except a 100% correct implementation doesn't exist AFAIK. So if I want to do anything that makes a decision based on the type of a file, I have to pick some algorithm to do that. If I can do that correctly 99% of the time, that's better than not being able to make that decision at all, which is where I'm left if a perfect implementation doesn't exist.
Nobody's asking for perfection. But the AI is offering inexplicable and obvious nondeterministic mistakes that the traditional algorithms don't suffer from.
Magika goes wrong and your fonts become audio files and nobody knows why. Magic-number detection goes wrong and your ZIP-based documents get mistaken for generic ZIP files. If you work with that edge case a lot, you can anticipate it with traditional algorithms. You can't anticipate nondeterministic hallucination.
Something like Magika is potentially useful as a second pass if conventional methods of detecting a file type fail or yield a low-confidence result. But, for the majority of binary files, those conventional methods are perfectly adequate. If the first few bytes of a file are "GIF89a", you don't need an AI to tell you that it's probably a GIF image.
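That conventional first pass is trivially cheap. A sketch with a few well-known signatures (not an exhaustive table; real tools like `file` carry thousands):

```python
# A few well-known magic numbers and the types they indicate.
MAGIC = [
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (or ZIP-based document)"),
]

def sniff(data: bytes):
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    return None  # no match: fall through to a second-pass classifier
```

A model-based classifier only needs to see the files where `sniff` returns None.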
Doesn't seem all that non-deterministic. I tested the vba.html example multiple times and it always said it was VBA. I added a space between </HEAD> and <BODY> and it correctly picked HTML as most likely but with a low confidence.
So I think we can say it's sensitive to mysterious features, not that it's non-deterministic. Still leads to your same conclusion that you can't anticipate the failures. But I don't think you can with traditional tools either. Some magic numbers are just plain text (like MZ) which could legitimately accidentally appear at the beginning of a plain text file, for example.
Where are you getting the non-determinism part from? It would seem surprising for there to be anything non-deterministic about an ML model like this, and nothing in the original reports seems to suggest that either.
Large ML models tend to be uncorrectably non-deterministic simply from doing lots of floating point math in parallel. Floating-point addition and multiplication are not associative - you may get different results depending on the order in which you add/multiply numbers.
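The non-associativity itself is easy to demonstrate; whether it reaches the model's output depends on the reduction order actually varying between runs. A quick sketch:

```python
# Floating-point addition is not associative: grouping changes rounding.
a, b, c = 0.1, 0.2, 0.3
assert (a + b) + c != a + (b + c)

# At large magnitudes the effect is starker: adding 1.0 to 1e16 first
# loses it entirely, because the spacing between doubles there is 2.0.
assert (1e16 + 1.0) + 1.0 != 1e16 + (1.0 + 1.0)
```

In a parallel reduction (e.g. a GPU sum), thread scheduling can change that grouping from run to run, which is where the run-to-run jitter comes from.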
> It would seem surprising for there to be anything non-deterministic about an ML model like this
I think there may be some confusion of ideas going in here. Machine learning is fundamentally stochastic, so it is non-deterministic almost by definition.
Just a silly aside with regard to the regex to extract domains from URLs: my little tool called unfurl [0] exists to solve exactly that sort of problem :)
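Not unfurl itself, but for anyone reaching for a regex: the Python stdlib already handles the fiddly parts of domain extraction (a minimal sketch):

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    # .hostname strips the scheme, userinfo, port, path, query, and
    # fragment - exactly the pieces naive regexes tend to trip over -
    # and lowercases the result.
    return urlparse(url).hostname or ""

# domain_of("https://user@example.com:8080/a?b#c") -> "example.com"
```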
I strongly disagree that nearly 20% of the median household income (£31,400 [0]) is reasonable for a family's annual energy bill. Median income for the poorest 20% is £14,600 - a £6k energy bill would be over 40% of that, so the chances they could afford it are slim.
Are you really suggesting that housing costs should average around £1400 per year (about one tenth the average annual rent [1]), but energy costs more than 4 times that amount are reasonable?
> Are you really suggesting that housing costs should average around £1400 per year
Relative to wages in the UK, absolutely. The working class is getting absolutely fleeced over there.
> I strongly disagree that nearly 20% of the median household income
Same thing here; there's two sides to every equation. The UK has to import almost all of its energy, and importing is expensive.
It seems to me that global energy demand is outstripping supply, and energy is going to become a permanently larger portion of everyone's budget. The way of life we've enjoyed for the last 70+ years in the Western world is quickly changing.