Hacker News
How Corporations Harvest Data (dkzlv.com)
36 points by dkzlv on Feb 17, 2021 | hide | past | favorite | 15 comments


Great post! If the author reads this, for his/her next article in the series, please google articles on the 'spurious data' approach used by TrackMeNot. In my opinion, spurious data is the best approach to counter tracking, even if it's the least popular. The basic idea is to regularly perform arbitrary actions over the internet. Done properly, it cheapens all the information others may collect about you.
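For illustration, here's a minimal sketch (my own, not TrackMeNot's actual code) of the spurious-data idea: periodically issue random, plausible-looking search queries so your real ones are buried in noise. The query list and search URL are made-up placeholders.

```python
import random
import time
import urllib.parse

# Decoy queries a real user might plausibly type (illustrative only).
DECOY_QUERIES = [
    "weather in oslo",
    "how to fix a leaky faucet",
    "best sci-fi novels 2020",
    "sourdough starter recipe",
    "used bicycle prices",
]

def decoy_search_url(base="https://www.example-search.com/search"):
    """Build the URL a browser would request for a random decoy query."""
    q = random.choice(DECOY_QUERIES)
    return base + "?" + urllib.parse.urlencode({"q": q})

def run_noise_loop(iterations=3, min_wait=1.0, max_wait=5.0):
    """Generate decoy request URLs with irregular pauses between them."""
    urls = []
    for _ in range(iterations):
        urls.append(decoy_search_url())
        # Randomized timing so the pattern is less obviously robotic.
        time.sleep(random.uniform(min_wait, max_wait))
    return urls
```

A real extension would of course fire these requests from the browser with real headers and cookies; this only shows the shape of the idea.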


Hi! Author here.

Just read about this extension, thanks. The concept is rather interesting. You can probably cause some problems for corporations by constantly adding informational noise, but I don't think it will change anything significantly. They will still get real information about you, even if it's mixed with noise. Your real behavior is the most valuable thing here, and it isn't going anywhere.

Also, I'm quite sure Google can detect automated actions very well; it's something they need for detecting bots that click on ads or leave comments on YouTube. The same goes for most tech corporations.


It's an approach that never gained popularity. Bruce Schneier panned the idea, if I recall correctly, because he thought the extra traffic was wasteful. He's so well respected that perhaps that played a role.

I think there are variations of the approach that could make many kinds of tracking pretty useless.

Part of my confidence comes from my bias that 'AI' is mostly hype. Google hasn't managed to filter all the spam from its results, nor to 'read the user's mind' particularly well when interpreting searches. I think they'd have trouble separating generated requests from real ones.

Somehow this reply sounds combative. That's completely unintentional.

I look forward to your next post.


that may be true for small-scale or vanilla automated actions, but every probabilistic layer costs the bad team additional engineering time and compute, and won't be 100 percent reliable. plus, you only need to obfuscate /you/, not everyone else.

right now the sensors (tracking) are largely trusted, because there isn't enough corrupt data hitting them to warrant countermeasures. if they are vacuuming up data, throw in sand. then broken glass. then wet wool. and - critically - lots of it. defeat the sensors by overwhelming them with tons of questionable data.

are these huge tech companies capable of filtering out all that cruft? of course, but this is an easily winnable arms race at the individual level, because it's too costly to do for billions of people. i see it manifesting sort of like digital guerrilla warfare.


I don't believe in this solution, and I have a few reasons for this:

1. The algorithm would have to be very complex. Corporations can detect bot-generated incoming traffic; it's part of their business model: they don't charge advertisers for bot views/clicks, for example, which is why there's no such service as "drain your competitor's ad budget". Google does this very effectively, and it's not simple pattern detection; it's complex behavioral analysis: how you (mis)click on stuff, how you move your mouse, your reaction time, your navigation flows, etc. Simple algorithms that just open new tabs and click a few links won't cut it. It's probably possible to write an algorithm that learns from you and mimics your unique behavior; it would also need some crowdsourced mechanism for plausible behavioral patterns: you can't just google random vocabulary sequences and click around chaotically; that would be caught rather quickly. Given the complexity of the whole thing, I don't think there's any party that can build it.

2. Some actions cannot be faked: your purchase history on Amazon, the places you go with your mobile phone turned on (your telco can track your geolocation), the things you post publicly from your accounts (Twitter, Instagram, HN comments, etc.). These will still give away too much information.

3. Resource overhead. The web is already slow thanks to all the trackers and ads. This solution proposes leaving the trackers intact AND adding machine-generated masking activity, which will slow the device down further. Most of the world just won't accept that.

4. Mobile. AFAIK you can't do any of this on iOS at all, and can do it only to some extent on Android. iOS has neither browser extensions nor any decent way to automate actions. On Android you'll face another issue (which applies to iOS as well): apps run in isolated sandboxes, so no outside code can reach in and do anything there. You won't be able to _simply_ let the device run fake searches on Booking.com while you sleep, for example (complex automations won't be adopted). And mobile is ~60% of internet traffic; a solution that doesn't work on mobile won't solve the problem.
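To make the behavioral-analysis point in (1) concrete, here's a toy, hypothetical example of one signal such a system might use (this is not Google's actual method): humans act with irregular timing, while naive scripts fire actions at near-constant intervals. A real detector fuses hundreds of signals like this.

```python
import statistics

def timing_variability(timestamps):
    """Coefficient of variation of gaps between consecutive actions."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    if mean == 0:
        return 0.0
    return statistics.pstdev(gaps) / mean

def looks_scripted(timestamps, threshold=0.05):
    # Near-zero variability in action timing is a strong bot signal.
    return timing_variability(timestamps) < threshold

bot_clicks = [0.0, 1.0, 2.0, 3.0, 4.0]    # metronome-regular clicks
human_clicks = [0.0, 0.8, 2.3, 2.9, 5.1]  # irregular, human-like clicks
```

A noise generator that wants to survive this kind of check has to mimic your own timing distribution, which is exactly the complexity I'm arguing nobody will build.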

Btw, in my next article I'll talk about why I think your thesis that "I only need to obfuscate me" completely misses the point. It's already published in Russian, so stay tuned for the translation :)

Long story short: since none of us is unique in any way (we're merely a combination of different popular behaviors that can easily be predicted given enough data), a solution that isn't adopted by the majority of the population won't solve the problem.


right on - looking forward to it!


Awesome article, author!

I just forwarded it to a student of the space. Such a long and well-thought-through article, with great, hilarious images along the way.

Thanks for writing, I've signed up for more spam from you.


Thanks a lot, I appreciate it <3


This is my idea for a business, which I'm not tech-savvy enough to carry out. It's called OtherYou. It's a sort of mass data-poisoning scheme that creates a double of your digital life on all platforms, but with completely bogus or conflicting PII. So if you were a 40-something Asian-American engineer living in D.C., we would make a digital "double" keeping the primary-key datapoint of your name/email the same, but with all the other PII shifted: a 20-something African-American pilot living in Miami. This would throw off/distort tracking and enhance privacy, in my view.
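As a rough sketch of what OtherYou might do (all field names and values here are made up for illustration), the core operation is: keep the identifying key, randomize everything else into a conflicting decoy:

```python
import random

# A user's real profile (illustrative placeholder data).
REAL_PROFILE = {
    "email": "user@example.com",
    "age_range": "40s",
    "occupation": "engineer",
    "city": "Washington, D.C.",
}

# Pools of plausible but conflicting attribute values.
DECOY_POOLS = {
    "age_range": ["20s", "30s", "50s", "60s"],
    "occupation": ["pilot", "chef", "teacher", "nurse"],
    "city": ["Miami", "Denver", "Portland", "Austin"],
}

def make_decoy(profile):
    """Build a 'double' that shares the primary key but conflicts on all other PII."""
    decoy = {"email": profile["email"]}  # primary key stays the same
    for field, pool in DECOY_POOLS.items():
        choices = [v for v in pool if v != profile.get(field)]
        decoy[field] = random.choice(choices)
    return decoy
```

The hard part, of course, isn't generating the decoy profile; it's making platforms actually ingest and believe its activity.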


Above I've outlined my key arguments why this, imo, won't work in real life: https://news.ycombinator.com/item?id=26177626


Yes, this is a good idea. However, if you decrease the signal-to-noise ratio, someone may still be able to extract the signal. So then the challenge becomes to make the noise look like a real signal, which may be quite a challenge.

For example, Google can figure out if you are a bot with its "I'm not a robot" checkbox. This means that they can probably also figure out that fake automated requests are indeed fake.


> For example, Google can figure out if you are a bot with its "I'm not a robot" checkbox. This means that they can probably also figure out that fake automated requests are indeed fake.

Not necessarily. Given that a noise script doesn't have to run unattended, it could simply prompt the user to solve any captchas it encounters, so captchas couldn't be used to distinguish real from fake.


Websites could have hidden captchas. For example, a website could track the movement of your mouse pointer and compute whether the movements most likely correspond to a human or not.


I laughed, but also stopped reading at

>if you're from IT — well, it would be like learning the alphabet once again. Probably you won't find anything new; if you do, that is a bad sign for your employer.

But after seeing a few other positive comments about it I decided to go back and give it a read, and I'm so glad I did. Great article.


So should your employer worry? Was it worth your time? :D



