Aftonbladet may not be a very high-quality news service, but the article is still real, and these are still the words of our PM.
As an immigrant who has more or less integrated into Sweden and lived here for nearly a decade, I completely despise this op-ed. The nationalistic way he talks is something I never expected from Swedish politics; it's a nuanced way to play nationalism, but still shocking.
Ulf Kristersson sold his soul to become PM. Barely 5-6 years ago he was on the record saying he'd never ally himself and his party (M - Moderaterna) with SD; he broke that to become PM, and this is the blow SD wanted: a more nationalistic political discourse. Dragging the Swedish language into this just reeks.
I'm sorry for those of you who voted this government in...
The documentation looks really neat and in-depth; always appreciated.
Looks like you’re missing a .gitignore file. Folders like __pycache__ don’t need to be checked in.
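For what it's worth, a minimal .gitignore for a Python project could be as small as this (adjust to whatever the repo actually builds; these are just the usual suspects):

```
__pycache__/
*.pyc
.venv/
*.egg-info/
```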
Quite a neat way to crawl websites using a browser extension. That by itself is a form of donation to the search engine. Maybe in the future you can have dedicated software for self-hosted clients that users can run to crawl and index websites for mwmbl? Kinda like folding@home.
How are the batches of URLs to be crawled generated/discovered and posted to your API?
I have also thought that distributed crawling with the help of browser extensions, and/or clients like folding@home, could be a good idea. But how do you deal with "spam injections"?
Get 3 people to scrape it and see if there are significant differences.
Some pages will differ, because of A/B testing or news updates, but even an updated news page will still come back largely similar, and those that don't should probably fall into an exceptions category until it can be determined what to do about them. Maybe a flag on the URL to give you a static snapshot, or just accept that it changes often enough that even faked pages won't last long?
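Roughly something like this, as a sketch (the threshold and the three-copy setup are purely illustrative, not anything mwmbl actually does):

```python
# Sketch of the "have 3 clients scrape it and diff the results" idea.
# Threshold and decision rules are made up for illustration.
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # assumed; would need tuning per site

def similarity(a: str, b: str) -> float:
    """Rough text similarity between two crawled copies of a page."""
    return SequenceMatcher(None, a, b).ratio()

def judge(copies: list[str]) -> str:
    """Given 3 independently crawled copies of one URL, decide what to do."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    agreeing = [(i, j) for i, j in pairs
                if similarity(copies[i], copies[j]) >= SIMILARITY_THRESHOLD]
    if len(agreeing) == 3:
        return "accept"  # all three copies agree closely
    if len(agreeing) == 1:
        # two copies agree, one is the odd one out -> suspect that client
        i, j = agreeing[0]
        odd = ({0, 1, 2} - {i, j}).pop()
        return f"accept majority, flag client {odd}"
    return "exceptions queue"  # dynamic page or coordinated spam; needs review
```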
Then I'll just add 3 million bots to the network (or just enough to control about 50%) and I can win that comparison against an honest client most of the time.
It's an arms race, but this is mostly a question of rate limiting account creation, assigning a trustworthiness score to different accounts, some network analysis to detect coordinated accounts, and having some trusted accounts (run by the project) that can help double check results. Once an account uploads poisoned data, you can detect this after the attack (user-reported spam) and then block (or, more likely, shadow-ban) the malicious account.
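As a sketch of what I mean by trust scores (the numbers and names are made up; this isn't any real project's data model):

```python
# Toy trust-score model: accounts gain trust slowly when their crawls match
# a spot-check from a project-run client, and lose it quickly when they don't.
from dataclasses import dataclass, field

@dataclass
class Account:
    id: str
    trust: float = 0.1            # new accounts start barely trusted
    shadow_banned: bool = False
    history: list[bool] = field(default_factory=list)  # True = crawl verified OK

def record_spot_check(acct: Account, matched_trusted_crawl: bool) -> None:
    """Update an account after comparing one of its crawls against a trusted crawl."""
    acct.history.append(matched_trusted_crawl)
    if matched_trusted_crawl:
        acct.trust = min(1.0, acct.trust + 0.05)
    else:
        acct.trust = max(0.0, acct.trust - 0.5)  # mismatches are punished hard
    if acct.trust == 0.0:
        acct.shadow_banned = True  # keep accepting uploads, silently drop them
```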
You make it sound easy, but companies have been trying to fight this stuff for ages.
You can buy a trustworthy residential IP for a low cost, and you can buy them in bulk by the thousands. All of them are real residential IPs from any ISP of your choosing, in any country. You can rent Chrome browsers running over those IPs, driven via remote desktop and accessibility protocols (good luck banning that without running afoul of anti-discrimination laws). You can do all that for under $1k a month for something like 1 million clients.
My workplace has been on the receiving end of DDoS attacks directed through such services; the best you can do is ban the specific Chrome versions they use, and that only lasts until they update.
It's an uphill battle that you will lose in the long term if you rely on client trust.
In terms of spam injection (the concern from upthread), I don't think DDoS is relevant. If the core project controls which URLs it asks clients to process, it can just IP-ban any client that returns too many results. DDoS is a concern for other reasons, though.
I think in this specific case the spammer is on poor footing. The spammer wants to inject specific content, ideally many times. With double processing of URLs, even if the spammer controls 50% of the clients, there's a 50% chance that a simple diff will expose the injected spam. The problem for the spammer is that they need to do this many times, so the injections become statistically apparent. If the spammer can only inject a small number of messages before being detected, the cost per injected spam becomes quite high. Long-running spam campaigns could eventually be detected by content analysis, so the spammer also needs to rotate content.
Obviously you can play with the numbers: the attacker could try to control >>50% of the clients, the project could process URLs more than twice, the project could re-process N% of URLs on trusted hardware, etc. It's not easy by any means, but you can tune the knobs to increase the cost for spammers.
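Back-of-the-envelope version of that argument, assuming each URL goes to 2 randomly chosen clients and an injection is exposed as soon as it gets paired with an honest one (all numbers illustrative):

```python
# Chance that a spammer controlling a fraction p of the clients gets away
# with n injections, when every URL is crawled twice and any pairing with
# an honest client reveals the spam.
def p_undetected(p_controlled: float, n_injections: int) -> float:
    return p_controlled ** n_injections

for n in (1, 5, 20, 100):
    print(n, p_undetected(0.5, n))
# 1 -> 0.5, 5 -> ~0.03, 20 -> ~1e-6, 100 -> ~8e-31
```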
> but this is mostly a question of rate limiting account creation, assigning a trustworthiness score to different accounts, some network analysis to detect coordinated accounts, and having some trusted accounts (run by the project) that can help double check results.
Then OP has to do things that don't scale: review some pages by hand and identify a subset that can be trusted. Then OP can compare those trusted downloads against what new accounts submit and mark the bots.
Then the botnet will just be honest for, like, a year before it abuses the network. Even better: now honest new clients can be kicked when they disagree with the bot majority. So the network bleeds users.
Checking which account is honest isn't too hard: you detect that there is a "problematic mismatch" between two clients, so the project runs its own client to check. If one of the two has an exact match, you question the other.
There is a challenge with sites that serve different content based on GeoIP, A/B testing, dynamic content, etc., so some human review of the diff may help check for malice. If there's literally spam in the diff, human review will clearly catch it, and that bot gets distrusted.
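Concretely, the arbitration step could look something like this (threshold and function names invented for illustration):

```python
# On a "problematic mismatch", a project-run trusted client re-crawls the URL;
# whichever submission matches the trusted copy is kept, the other is questioned.
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.95  # assumed

def arbitrate(submission_a: str, submission_b: str, trusted_copy: str) -> str:
    sim_a = SequenceMatcher(None, submission_a, trusted_copy).ratio()
    sim_b = SequenceMatcher(None, submission_b, trusted_copy).ratio()
    if sim_a >= MATCH_THRESHOLD and sim_b < MATCH_THRESHOLD:
        return "keep A, distrust B"
    if sim_b >= MATCH_THRESHOLD and sim_a < MATCH_THRESHOLD:
        return "keep B, distrust A"
    return "send diff to human review"  # GeoIP / A/B testing / dynamic content case
```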
Then I'll simply use more bots to reach 80% of the network; then I almost always win any disagreement and your "problematic mismatch" never triggers.
Plus, I can now force you to run your own crawler anyway, which either slows progress or costs you a lot of money.
Maybe I misunderstand, but doesn't that mean you lose the benefit of having distributed crawlers if everything has to be crawled (again) locally somewhere?
YaCy can do distributed crawling and exchange the indexes (in peer-to-peer mode). I have some nodes that just receive and send indexes without crawling (much less storage-intensive).
Lists tend to attract the "proscription" prefix when they're in a political context.
Also, if you support the stance, you arguably don't want to make it easy for Russian/Belarusian users to avoid the software; you want to maximize their inconvenience - which you do when they invest time into learning something and then find out they are technically not allowed to use it in production.
That said, it's all theatre anyway - I would bet good money that most Russian businesses already didn't care about respecting FOSS licenses, considering the infamously lax attitude toward copyright enforcement in that country (and in China and Iran). So adding that sort of clause will stop absolutely no one in practice, and I bet the authors will never even try to sue anyone in Russia (or anywhere else with some sort of jurisdiction over Russia, like the WTO) for infringing it.
I'm the author of PhotoStructure. I started working on it to clean up the digital mess I had accumulated due to failed/cancelled photo management software, not to replace Google Photos.
Since then, I've been working on improving deduplication, scaling to very large libraries, and whittling away at the feature list my users and I collaborate on. I keep detailed release notes here: https://photostructure.com/about/release-notes/
Replacing at least the organization and sharing aspects of Google Photos is now my top priority. Improving search (with geo, faces, and ML object labelling) is next.
If an individual is concerned enough about the tax implications of a new acquisition, they probably have enough foresight to consider the cost of fuel.
That time when there was a big propaganda campaign to get everyone to hide inside for a big scary virus.
We probably (maybe; not really sure) saved 0.5%-1% of the population, so it was worth 2-3% of our lives spent worrying about it, 10% inflation, public transport becoming financially unviable, various career paths literally disappearing, an irreparable political split, education being deleted for a bit, etc.
Thankfully, slightly later on, pretty much everyone got it anyway, so now there's at least a critical mass of people who realise this was all a load of shit, and I don't have to adblock it by self-excluding from all social networks.