> Your accent is Dutch, my friend. I identified your accent based on subtle details in your pronunciation. Want to sound like a native English speaker?
I'm British; from Yorkshire.
When letting it know how it got it wrong there's no option more specific than "English - United Kingdom". That's kind of funny, if not absurd, to anyone who knows anything of the incredible range of accents across the UK.
I also think the question "Do you have an accent when speaking English?" is an odd one. Everyone has an accent when speaking any language.
I agree there is no such thing as a "British accent" (though I'm lucky that my mockney lilt is considered one), but Dutch, Danish, and Yorkshire accents are very similar for historical reasons, so it's somewhat understandable for this app to detect you as Dutch.
I find Danes speaking Danish to sound like a soft Yorkshire accent, and the vowels that Yorkies use are better written in Danish, like phøne.
> I also think the question "Do you have an accent when speaking English?" is an odd one. Everyone has an accent when speaking any language.
Sure, I agree. But look at it from the perspective of a foreigner living in an English-speaking country, which is probably their target demographic.
We know that as soon as we open our mouths the locals will instantly pigeonhole us as "a foreigner". No matter how good we might be in other areas, we will never be one of "them". The degree of prejudice that may or may not exist against us doesn't matter as much as the ever-present knowledge that the locals know we are not one of them, and the fear of being dismissed because of that.
Nobody likes to stand out like that, particularly when it so clearly puts you at a disadvantage. That sort of insecurity is what this product is aimed at.
It's not ethical to lie to people about whether they need something you're selling, especially if you're playing on their fears of vulnerability to make the sale. Laundering the lies through an AI model doesn't make it any less bad.
BoldVoice is very clear about being an American accent "training app", so that's not (necessarily) what's happening here, but the point remains.
Yeah, it's the same for having just one "German" accent. Swiss and Austrians, but also north vs middle vs south Germans, still sound different - even when they speak English.
It's quite offensive. English is my native tongue, I got a perfect IELTS score, and one of my parents was an English professor. But my accent makes me less than "native".
It's often required for immigration purposes. Countries/universities will let you off if you're coming from a country that has English as its main language, or if you've studied a degree in the language, but they often won't if you're a native English speaker living elsewhere.
The first two days were a shock, as it felt like a different language. But after some time I adjusted, and now I find both Singlish pronunciation and phrases endearing.
For example, the first time I heard "ondah-cah?" I was puzzled. Then I understood that it is "Monday can?", which, as I learned, means "Would Monday work for you?".
There's a bunch of open source work in the robot combat space, but it doesn't have to be used for robot combat specifically. The Malenki Nano is a great example: a tiny, open source receiver with three speed controllers on a single board. It has two PWM channels for servos too so you could do a ton of interesting projects with it.
Not directly relevant to the post, but seems like a good place to share.
My team and I once took on a very tricky automation project. At the time we had a complex software deployment done about once per month that involved a team of about a dozen people showing up at 4am to do it while traffic was low.
The deployment involved many manual steps and coordination of everybody involved. The person leading each deployment followed the documented list of steps and got each person to do their bit at the right time; people to run database migrations, people to install RPMs on particular servers, people to test and verify functionality. Mistakes and missed steps were not uncommon.
The very first thing we did was take the documentation and write a Jenkins job to post each step into a Slack channel specifically for coordinating the deployments. Someone clicked "go" and each step was posted as a message in that channel with a 'done' button to be clicked when that step was done. Clicking the button caused the next step to be posted.
The next release we did used that instead of one person reading the steps out of confluence. Everyone involved in the release could always see what step we were at, and when it was their turn to do their bit. This helped ensure no steps were ever missed too.
Over the following months we chipped away at that job a bit at a time. We'd pick a step in the process and automate just that step, starting with the low-hanging fruit first. The Slack message for that step went from "click to confirm you've done it" to "click to do it", with the result posted once it was done; followed by the next step to perform.
It was a long process, but it allowed the rest of the business (and us!) to gradually gain confidence in the automation, and lowered the risk of the project dramatically. Once several steps had been automated and battle-tested we removed the 'click to do' bits in between and the whole release became a couple of clicks followed by the odd bit of manual QA.
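The pattern generalizes well beyond Slack and Jenkins: each step is either a manual confirmation or a callable, and steps migrate from the first kind to the second one at a time. A minimal sketch of that idea (hypothetical step names; no Slack API specifics implied):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    name: str
    action: Optional[Callable[[], None]] = None  # None => still a manual step

def run_checklist(steps: List[Step], confirm: Callable[[str], None]) -> None:
    """Walk the checklist in order. Automated steps run their action;
    manual steps block on a human confirmation (the 'done' button)."""
    for step in steps:
        if step.action is not None:
            step.action()        # automated: 'click to do'
        else:
            confirm(step.name)   # manual: 'click when done'

# Over time, steps migrate from manual to automated:
release = [
    Step("Drain traffic", action=lambda: print("draining...")),  # automated
    Step("Run DB migrations"),                                   # still manual
    Step("Verify functionality"),                                # still manual
]
```

The nice property is that the list itself is the documentation, so automating a step never desynchronizes the runbook from reality.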
What is the point of defining a Python class with a single `run` method, and then running with `Class.run()`, instead of just defining a `function` and running with `function()`?
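For illustration, the two spellings side by side (hypothetical names): the class buys nothing until there is state to hold or multiple methods to share.

```python
# Class with a single entry point and no state:
class Deployer:
    def run(self) -> str:
        return "deployed"

# The equivalent plain function:
def deploy() -> str:
    return "deployed"

# Deployer().run() and deploy() do the same thing; the class
# adds ceremony (instantiation, self) without adding capability.
```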
Checklist automation is a very powerful tool. Making a script that just walks you through the checklist is the first step, because you have to debug the checklist. It's really hard to automate something if you don't really know what it is you're doing, and steps are always missed.
Once you debug the checklist you can start automating the steps and then you find the things that are easy for humans to do but hard for computers. That's the fun part.
In one case I worked on, they automated the sales funnel, but then customers started asking for refunds and complaining online. Turns out the automation didn't work.
I got a team together to get the customers happy and then did a Stalin's Postman test where we walked through the whole process. All but one of the steps failed.
Now that we knew what was failing, we could start the process for fixing it.
A ~20 minute video on a particular number cropping up all over the place and not one mention of the frequency illusion / Baader–Meinhof phenomenon feels at least a little bit disingenuous.
I don't doubt that it crops up a lot when people are asked for a random number between 1 and 100 (especially if, as per the video, you ignore 1, 2, 7, 42, 69, 73, and 77, and sometimes 99 and 100). But it's pretty disappointing for a big science channel like Veritasium not to even mention that if you became obsessed with a different number between 1 and 100 and went hunting for its special properties, you'd almost certainly find them.
"The Law of Fives states simply that: ALL THINGS HAPPEN IN FIVES, OR ARE DIVISIBLE BY OR MULTIPLES OF FIVE, OR ARE SOMEHOW DIRECTLY OR INDIRECTLY APPROPRIATE TO FIVE.
The Law of Fives is never wrong. In the Erisian Archives is an old memo from Omar to Mal-2: 'I find the Law of Fives to be more and more manifest the harder I look.'" - Principia Discordia, https://en.wikiquote.org/wiki/Principia_Discordia#THE_LAW_OF...
This looks cool. I ran this on some web crawl data I have locally - all the files you'd find on regular websites: HTML, CSS, JavaScript, fonts, etc.
It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".
Some woff and woff2 files it identified as "TrueType Font Data", others are "Unknown binary data (unknown)" with low confidence guesses ranging from FLAC audio to ISO 9660. Again, the `file` utility correctly identifies these files as "Web Open Font Format".
I like the idea, but the current implementation can't be relied on IMO; especially not for automation.
A minor pet peeve also: it doesn't seem to detect when its output is a pipe and strip the shell colour escapes, resulting in `^[[1;37m` and `^[[0;39m` wrapping every line if you pipe the output into a vim buffer or similar.
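The usual fix, for reference, is to gate the escapes on a TTY check. A Python sketch (hypothetical helper; the actual tool may do this differently):

```python
import sys

def maybe_color(text: str, code: str = "1;37", stream=None) -> str:
    # Emit ANSI escapes only when writing to an interactive terminal.
    # Pipes and files report isatty() == False, so piped output stays
    # clean for vim buffers, grep, etc.
    stream = sys.stdout if stream is None else stream
    if stream.isatty():
        return f"\x1b[{code}m{text}\x1b[0;39m"
    return text
```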
Thanks for the feedback -- we will look into it. If you can share the list of URLs with us, that would be very helpful so we can reproduce the issue - send us an email at magika-dev@google.com if that is possible.
For crawling we have planned a head-only model to avoid fetching the whole file, but it is not ready yet -- we weren't sure what use cases would emerge, so it is good to know that such a model might be useful.
We mostly use Magika internally to route files for AV scanning, as we wrote in the blog post, so it is possible that despite our best efforts to test Magika extensively on various file types it is not as good on font formats as it should be. We will look into it.
Thanks again for sharing your experience with Magika; this is very useful.
Here's[0] a .tgz file with 3 files in it that are misidentified by magika but correctly identified by the `file` utility: asp.html, vba.html, unknown.woff
These are files that were in one of my crawl datasets.
I've worked on similar problems recently so I'm well aware of how difficult this is. An example I've given people is in automatically detecting base64-encoded data. It seems easy at first, but any four, eight, or twelve (etc) letter word is technically valid base64, so you need to decide if and how those things should be excluded.
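To make the pitfall concrete, here's a minimal structural check (real detectors layer length and entropy heuristics on top of this):

```python
import base64

def is_structurally_base64(s: str) -> bool:
    # All base64 formally requires: characters from its alphabet
    # and a length divisible by 4.
    if not s or len(s) % 4 != 0:
        return False
    try:
        base64.b64decode(s, validate=True)
        return True
    except Exception:
        return False

# The trap: ordinary 4/8/12-letter words pass the check, because
# they "decode" to arbitrary bytes just fine.
# is_structurally_base64("word")     -> True
# is_structurally_base64("password") -> True
```

So a structural check alone drowns you in false positives on English text; you have to decide whether short dictionary words should be excluded, and how.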
LOL nice b8 m8. For the rest of you who are curious, the files look like this:
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://placement.api.test4.example.com/" on this server.<P>
Reference #18.9cb0f748.1695037739.283e2e00
</BODY>
</HTML>
Are you attempting to assert that use of these files solely for the purpose of improving a software system meant to classify file types does not fall under fair use?
Here's another one for you: Do you believe that all pictures you have ever taken, all emails you have ever written, all code you have ever written could be posted here on this forum to improve someone else's software system?
If so, could you go ahead and post that zip? I'd like to ingest it in my model.
Your question seems orthogonal to the situation. The three files posted seem to be the minimum amount of information required to reproduce the bug. Fair use encompasses a LOT of uses of otherwise copyrighted work, and this seems clearly to be one.
You're so invested that you haven't noticed the files have been renamed and zipped, so they're not even indexable. How you'd expect anyone not participating in software development to find them is yet to be explained.
I'm over here trying to fathom the lack of control over one's own life it would take to turn someone into an online copyright cop, when the data in question isn't even their own, is clearly divorced from any context that would make it useful for anything other than fixing the bug, and the original copyright holder hasn't even complained.
Some people just want to argue.
If the copyright holder has a problem with the use, they are perfectly entitled to spend some of their dollar bills to file a lawsuit, as part of which the contents of the files can be entered into the public record for all to legally access, as was done with Scientology.
I'm always happy to stand up for folks who make things over people who want to police them. Especially when nothing wrong has happened. Maybe take a walk and get some fresh air?
I share your distaste for people whose only contribution is subtraction, but I suggest you lay off the sarcasm. Trolls; don't feed. (Well done on your project, BTW)
I don't see any sarcasm from me in the thread. I had serious questions. Perhaps you could point out what you see? Thanks for the supportive words about the project.
I've certainly seen people say similar things facetiously, but I was being genuine. I'm not sure if beeboobaa was trolling or not, I try to take what folks say at face value. They seemed to be pretty attached to a particular point of view, though. Happens to all of us. The thing for attachment is time and space and new experiences. Walks are great for those things, and also the best for organizing thoughts. Einstein loved taking walks for these reasons, and me too. It feels better to suggest something helpful when discussion derails, than to hurl insults as happens all too frequently.
I already had my walk this morning, thanks! If you'd like to learn more about copyright law, including about all the ways it's fuzzy around the edges for legitimate uses like this one, I highly recommend groklaw.net. PJ did wonderful work writing about such boring topics in personable and readable ways. I hope you have a great day!
Thanks for such great opportunities to post educational content to Hacker News! I genuinely hope some things go your way, man. Rooting for you. Go get 'em.
If you can’t undermine someone’s argument, undermine their nationality. American tech culture doesn’t do this as much as it should, perhaps because we know eventually those folks wake up.
Not sure what your point is, but why would i care to learn about the laws of some other dude's country that he's using to support his bizarro arguments?
And then clusterfuzz/oss-fuzz scans .txt source files with (sandboxed) static and dynamic analysis tools; `debsums`/`rpm -Va` verify that files on disk have the same (GPG-signed) checksums as the package they were supposedly installed from; a file-based HIDS builds a database of file hashes and compares what's on disk in a later scan against what was presumed good; ~gdesktop LLM tools scan every file;
and there are extended filesystem attributes for label-based MAC systems like SELinux. Oh, and NTFS ADS.
A sufficient cryptographic hash function yields random bits with uniform probability.
DRBGs (Deterministic Random Bit Generators) need high-entropy random bits in order to continuously re-seed the RNG (random number generator).
Is it safe to assume that hashing (1) every file on disk, or (2) any given file on disk at random, will yield random bits with uniform probability; and (3) why Argon2 instead of e.g. only two rounds of SHA256?
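For reference, the per-file hashing all of these tools share looks roughly like this (stdlib sketch):

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 16) -> str:
    # Stream the file in chunks so large binaries don't need to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

(Python 3.11+ has `hashlib.file_digest(f, "sha256")` for the same job. On question 3: Argon2 is deliberately slow and memory-hard, which is the property you want for passwords, not for bulk file-integrity hashing.)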
> We provide a Go based tool that will scan your dependencies, and check them against the OSV database for known vulnerabilities via the OSV API. ...
With package metadata, not a (file hash, package) database that could be generated from OSV and the actual package files instead of their manifest of already-calculated checksums.
Might as well be heating a rooftop pool with all of this waste heat from hashing binaries built from code of unknown static and dynamic quality.
Add'l useful formats:
> Currently it is able to scan various lockfiles, debian docker containers, SPDX and CycloneDX SBOMs, and git repositories
File-based hashing is done in so many places; there's so much heat.
Sub-file hashing with feature engineering is necessary for AV, which must take packing, obfuscation, loading, and dynamic analysis into account in addition to zip archives and magic file numbers.
AV (antivirus) applications with LLMs: what do you train them on, and what are the existing signature databases?
https://SigStore.dev/ (The Linux Foundation) also has a hash-file inverted index for released artifacts.
Also otoh with a time limit,
1. What file is this? Dirname, basename, hash(es)
2. Is it supposed to be installed at such a path?
3. Per its header, is the file an archive, an image, or a document?
4. What file(s), records, and fields are packed into the file, and what transforms were applied to the data?
I'm not the person you asked, but I'm not sure I understand your question and I'd like to. It whiffed multiple common softballs, to the point it brings into question the claims made about its performance. What reasoning is there to trust it?
It had 3 failures. How is that a sign it's untrustworthy? I'm sure all alternatives have more than 3 failures. You might be making assumptions about the distribution of successes and failures (GP didn't say how many files they tested to find those 3) or how "soft" they were. In an extreme case, they might even have been crafted adversarial examples. But even if not, they might have features that really do look more like some other file type from the point of view of the classifier even if it's not easily apparent to a human. Being strictly superior to a competent human is a pretty high bar to set.
From the comment:
> It identified some simple HTML files (html, head, title, body, p tags and not much else) as "MS Visual Basic source (VBA)", "ASP source (code)", and "Generic text document" where the `file` utility correctly identified all such examples as "HTML document text".
Those are only soft to a human. I looked at a couple and I picked them correctly but I don't know what details the classifier was seeing which I was blind to. Not to say it was correct, just that we can't call them soft just because they're short and easy for a human.
> The bar is the file utility.
It has higher accuracy than that. You would reject it just because its failures are different, even though there are fewer of them?
Yes. Unpredictable failures are significantly worse than predictable ones. If `file` messes up, it's because it decided a ZIP-based document was a generic ZIP file. If Magika messes up, it's entirely random. I can work around `file`'s failure modes, especially ones I encounter often. Magika's failure modes strike at random and are not possible to anticipate. `file` also bails out when it doesn't know; a very common failure mode in Magika is that it confidently returns a random answer for a file type it wasn't trained on.
Your original statement was that having a couple of failures brings its performance claims into question. It doesn't, because it doesn't claim such high performance: 99.31% is lower than, say, 997 out of 1000, or whatever the GP actually tested. Of course having unpredictable failures is a worry, but it's a different worry.
They uploaded 3 sample files for the authors, there were more failures than that, and the failures that GP and others have experienced are of a less tolerable nature. This is the point I was making, that the value added by classifying files with no rigid structure is offset heavily by its unpredictable shortcomings and difficult-to-detect failure modes.
If you have a point of your own to make I'd prefer you jump to it. Nitpicking baseless assumptions like how many files the evil GP had to sift through in order to breathlessly bring us 3 bad eggs is not something I find worthwhile.
The point I'm making is that you drew a conclusion based on insufficient information, apparently by making assumptions about the distribution of failures or the definition of "easy".
It provided the wrong file-types for some files, so I cannot rely on its output to be correct.
If you wanted to, for example, use this tool to route different files to different format-specific handlers it would sometimes send files to the wrong handlers.
Except a 100% correct implementation doesn't exist AFAIK. So if I want to do anything that makes a decision based on the type of a file, I have to pick some algorithm to do that. If I can do that correctly 99% of the time, that's better than not being able to make that decision at all, which is where I'm left if a perfect implementation doesn't exist.
Nobody's asking for perfection. But the AI is offering inexplicable and obvious nondeterministic mistakes that the traditional algorithms don't suffer from.
Magika goes wrong and your fonts become audio files and nobody knows why. Magic-number detection goes wrong and your ZIP-based documents get mistaken for generic ZIP files. If you work with that edge case a lot, you can anticipate it with traditional algorithms. You can't anticipate nondeterministic hallucination.
Something like Magika is potentially useful as a second pass if conventional methods of detecting a file type fail or yield a low-confidence result. But, for the majority of binary files, those conventional methods are perfectly adequate. If the first few bytes of a file are "GIF89a", you don't need an AI to tell you that it's probably a GIF image.
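That conventional first pass is trivially cheap. A sketch with a few well-known signatures (not an exhaustive table; real tools like `file` carry thousands):

```python
# A few well-known magic numbers and the types they indicate.
MAGIC = [
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (or ZIP-based document)"),
]

def sniff(data: bytes):
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    return None  # no match: fall through to a second-pass classifier
```

A model-based classifier only needs to see the files where `sniff` returns None.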
Doesn't seem all that non-deterministic. I tested the vba.html example multiple times and it always said it was VBA. I added a space between </HEAD> and <BODY> and it correctly picked HTML as most likely but with a low confidence.
So I think we can say it's sensitive to mysterious features, not that it's non-deterministic. Still leads to your same conclusion that you can't anticipate the failures. But I don't think you can with traditional tools either. Some magic numbers are just plain text (like MZ) which could legitimately accidentally appear at the beginning of a plain text file, for example.
Where are you getting the non-determinism part from? It would seem surprising for there to be anything non-deterministic about an ML model like this, and nothing in the original reports seems to suggest that either.
Large ML models tend to be uncorrectably non-deterministic simply from doing lots of floating point math in parallel. Floating-point addition and multiplication are not associative - you may get different results depending on the order in which you add/multiply numbers.
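The non-associativity itself is easy to demonstrate; whether it reaches the model's output depends on the reduction order actually varying between runs. A quick sketch:

```python
# Floating-point addition is not associative: grouping changes rounding.
a, b, c = 0.1, 0.2, 0.3
assert (a + b) + c != a + (b + c)

# At large magnitudes the effect is starker: adding 1.0 to 1e16 first
# loses it entirely, because the spacing between doubles there is 2.0.
assert (1e16 + 1.0) + 1.0 != 1e16 + (1.0 + 1.0)
```

In a parallel reduction (e.g. a GPU sum), thread scheduling can change that grouping from run to run, which is where the run-to-run jitter comes from.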
> It would seem surprising for there to be anything non-deterministic about an ML model like this
I think there may be some confusion of ideas going in here. Machine learning is fundamentally stochastic, so it is non-deterministic almost by definition.
Just a silly aside with regard to the regex to extract domains from URLs: my little tool called unfurl [0] exists to solve exactly that sort of problem :)
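Not unfurl itself, but for anyone reaching for a regex: the Python stdlib already handles the fiddly parts of domain extraction (a minimal sketch):

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    # .hostname strips the scheme, userinfo, port, path, query, and
    # fragment - exactly the pieces naive regexes tend to trip over -
    # and lowercases the result.
    return urlparse(url).hostname or ""

# domain_of("https://user@example.com:8080/a?b#c") -> "example.com"
```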
I strongly disagree that nearly 20% of the median household income (£31,400 [0]) is reasonable for a family's annual energy bill. Median income for the poorest 20% is £14,600 - a £6k energy bill would be over 40% of that, so the chances they could afford it are slim.
Are you really suggesting that housing costs should average around £1400 per year (about one tenth the average annual rent [1]), but energy costs more than 4 times that amount are reasonable?
> Are you really suggesting that housing costs should average around £1400 per year
Relative to wages in the UK, absolutely. The working class is getting absolutely fleeced over there.
> I strongly disagree that nearly 20% of the median household income
Same thing here; there's two sides to every equation. The UK has to import almost all of its energy, and importing is expensive.
It seems to me that global energy demand is outstripping supply, and energy is going to become a permanently larger portion of everyone's budget. The way of life we've enjoyed for the last 70+ years in the Western world is quickly changing.