Claim: Private GitHub repos included in AI dataset

simonw · on March 20, 2024

This is talking about The Stack. The poster says:

"I found two of my old Github repos in there. Both were deleted last year and both were private."

The Stack was constructed a while ago, so "deleted last year" wouldn't have an impact if it was constructed before then.

"Both were private" is the thing that needs to be unpacked here. Were these genuinely private repositories that had never been made public on GitHub?

https://huggingface.co/datasets/bigcode/the-stack-v2 talks about where the Stack comes from: "This dataset is derived from the Software Heritage archive, the largest public archive of software source code and accompanying development history"

You can search that here: https://archive.softwareheritage.org/browse/search/?q=simonw... - it would be interesting to know if the OP's "private" repos are included in that collection.

simonw · on March 20, 2024

Here's a reply from a GitHub staff member: https://mastodon.social/@correcthorse/112128192392083842

> I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.

> https://huggingface.co/datasets/bigcode/the-stack-v2

> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.

> That's my best guess.

beeboobaa · on March 20, 2024

Depending on the license, how is this legal?

altairprime · on March 20, 2024

In the US: Fair use disregards licenses. Fair use can be found to apply, or found not to apply, by a court of law. Archival of information is generally felt to be fair use, as is search indexing.

ThunderSizzle · on March 20, 2024

The real argument is when is AI re-use of copyrighted material a violation of copyright. That is a large grey area that will probably be determined in favor of large corporations and not in favor of individuals. (As in, Disney can use AI writers to copy you, but you won't be allowed to copy Disney)

ryan-c · on March 20, 2024

Software Heritage aggressively insists on French law... which does not have fair use.

simonw · on March 20, 2024

Which bit? The archiving on SoftwareHeritage, the gathering of that data into the Stack or the subsequent training of models?

s4mw1se · on March 20, 2024

that’s a good question… There seem to be two problems.

The definition of open source depends on a license existing in a repo. Without a license it’s not legal to copy and distribute.

Public vs. private repo is a platform issue, not the code maintainer's.

If a public repo does not have a license, it does not mean it is free to copy and distribute.

If a private repo has an open source license like MIT, then the crawler has a right to copy and distribute that repo. Regardless if it has authorization to access the repo or not.

marcinzm · on March 20, 2024

> Without a license it’s not legal to copy and distribute.

Yes it is. Due to both the terms you agree to when you use GitHub and the general Implied License that covers everything public on the internet.

https://en.wikipedia.org/wiki/Field_v._Google,_Inc.

vharuck · on March 20, 2024

Looking at that ruling, it seems the case you linked to hinged on a fact not applicable with the Stack:

>Field had actual knowledge of the Googlebot. He also was aware of the ways to prevent Google from either listing his site at all or listing it but not providing a link to the cached version. Instead of opting out, however, he chose to allow Google to both index and provide a link to the cached version.

For the AI dataset, (A) did the person know their work was being collected by this group and for this purpose, and (B) did they know of a way to prevent that collection?

AshamedCaptain · on March 20, 2024

It is not clear to me if they are _only_ using GitHub as source. The Stack explicitly mentions they are using Software Heritage as source and Software Heritage definitely sources from repositories that are NOT stored in GitHub (and never have been).

iamacyborg · on March 20, 2024

I don’t think that “implied license” you’re referring to holds up in the courts.

to11mtm · on March 20, 2024

Hopefully the crawler is smart enough to properly handle edge cases...

e.g. if the repo has some sort of /used-licenses/ folder where the licenses for packages and the like are included, it could make a bad decision.
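One way to avoid the edge case above is to look only at license files in the repository root. The following is a minimal sketch under that assumption; the thread does not describe how The Stack actually detects licenses, and `top_level_license_files` and `LICENSE_NAMES` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical list of file stems treated as license files (an assumption).
LICENSE_NAMES = ("license", "licence", "copying", "unlicense")


def top_level_license_files(paths):
    """Return the paths that look like license files in the repository root.

    Files in subdirectories (for example a /used-licenses/ folder holding
    third-party license texts) are ignored on purpose.
    """
    found = []
    for path in paths:
        parts = path.strip("/").split("/")
        if len(parts) != 1:
            continue  # not in the repository root
        # "LICENSE-MIT" and "COPYING.md" both reduce to their first stem.
        stem = re.split(r"[.\-_]", parts[0].lower(), maxsplit=1)[0]
        if stem in LICENSE_NAMES:
            found.append(path)
    return found
```

A root-only rule is conservative: it misses repos that keep their own license in a subfolder, but it does not mistake bundled dependency licenses for the repo's license.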

orf · on March 20, 2024

> Without a license it’s not legal to copy and distribute.

Is this true? When you post anything publicly, from sticking a poster on the street to making artwork like Banksy, isn’t the default set to “it’s legal to copy, unless explicitly stated otherwise”?

amarshall · on March 20, 2024

The default in the majority of the world is that most creative works (including software code) are by-default copyrighted by the author, and the author must explicitly license away those rights. Some jurisdictions (e.g. France) put limits on what rights the author is allowed to give up. I.e., the default is it is illegal to copy (subject to exemptions like “fair use”).

ryan-c · on March 20, 2024

Note that this archive project is French.

fweimer · on March 20, 2024

Banksy apparently runs a licensing program. Their artwork is most definitely under copyright, and they rely on trademark protection as well.

There is also the practical issue that a lot of content is posted publicly without consent of the copyright owner. It's simply not true that just because someone else committed a copyright violation first, you can commit further violations with impunity based on that first violation.

zzo38computer · on March 20, 2024

> If a public repo does not have a license, it does not mean it is free to copy and distribute.

Whether or not it is free to copy and distribute, it should be free to copy and distribute. (My opinion is that copyright is no good; if the file is public then you should be allowed to copy and distribute it.)

> If a private repo has an open source license like MIT, then the crawler has a right to copy and distribute that repo. Regardless if it has authorization to access the repo or not.

I should not think so. The license would only apply if you have a copy of it anyways. If you are not authorized to access it because it is private, then you would have to get a copy from somewhere else, and if nobody else is providing a copy, that shouldn't give you the right to unauthorized access. However, if it has been done, then it is done, so now there is a copy, and the license (if it is a license that allows copying it in this way) would authorize you to continue to use and distribute the copy that you have.

s4mw1se · on March 21, 2024

I’m not saying I agree or care about any of it. A sane company would never allow the use of source code from a third party without a license.

If a repo is forked and the license is deleted, the source code would need to be hashed to verify it's the exact version of an open source repo. Mainly they don’t want a copyleft or “malicious” license infecting their IP.

If the hashes don’t match then it’s not technically the same code, so a company can’t safely use it without a license.
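The hash comparison described above can be sketched roughly as follows. This assumes both copies are available as local directories; `tree_digest` is a hypothetical helper, and the default `exclude` list (so a deleted LICENSE file does not change the result) is an assumption, not something the comment specifies.

```python
import hashlib
from pathlib import Path


def tree_digest(root, exclude=("LICENSE",)):
    """Return a SHA-256 digest over the relative paths and contents of a tree.

    Git metadata and the files named in `exclude` are skipped, so a fork
    that only removed its license file still matches the upstream tree.
    """
    base = Path(root)
    files = {}
    for path in base.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(base).as_posix()
        if rel.startswith(".git/") or rel in exclude:
            continue
        files[rel] = path
    digest = hashlib.sha256()
    # Sort by relative path so the result does not depend on listing order.
    for rel in sorted(files):
        digest.update(rel.encode("utf-8") + b"\0")
        digest.update(hashlib.sha256(files[rel].read_bytes()).digest())
    return digest.hexdigest()
```

Two trees with the same digest contain the same files byte for byte (ignoring the excluded names); any edit to any file changes the digest, which is the "not technically the same code" case.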

lightedman · on March 20, 2024

[flagged]

stefanfisk · on March 20, 2024

Why would it be GitHub’s responsibility to delete the repo from a third party service?

dawnerd · on March 20, 2024

And would that extend to anyone that ever cloned it? No, that's ridiculous.

shagie · on March 20, 2024

Software Heritage isn't GitHub.

https://en.wikipedia.org/wiki/Software_Heritage

> Software Heritage is a non-profit organization which provides a service for archiving and referencing historical and contemporary software — with a focus on human readable source code. The site was unveiled in 2016 by Inria [1] and is supported by UNESCO. The project itself is structured as a non‑profit multi‑stakeholder initiative.

> Development of Software Heritage began at Inria under the direction of computer scientists Roberto Di Cosmo and Stefano Zacchiroli in early 2015, and the project was officially announced to the public on June 30, 2016.

---

I am not sure that GitHub has the authority to do a take down of a repository on a different server (and jurisdiction - it's based out of France) on behalf of a user.

542458 · on March 20, 2024

By “archived there even after you deleted from github” I believe they mean “archived on SoftwareHeritage and continue to exist there even after you deleted from github”.

buffington · on March 20, 2024

I have about fifty private github repos and hundreds more public github repos (half of them forks of other public repos). I've verified that none of my private repos are in the dataset, and all of my public repos are.

I was expecting to see at least one repo that shouldn't be there depending on when the dataset was put together. In 2015 I changed a repo from public to private, which I think might suggest that the dataset was built after 2015 since my now private repo isn't in the dataset?

godelski · on March 20, 2024

I did the same and see none of my private repos. Which is quite a lot. I do see plenty of deleted repos but those were public.

hn72774 · on March 20, 2024

Github publicly streams the global change log including all public repos. I have a public repo and noticed there were multiple clones daily after every commit I made. That's when I discovered the API: https://docs.github.com/en/rest/activity/events?apiVersion=2...

If the repo was public even for a single commit, that was likely cloned and replicated elsewhere.
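To illustrate the point: any repo that shows up in the public event feed was publicly visible at that moment. A minimal sketch, assuming a list of already-fetched event dictionaries with the `repo.name` and `created_at` fields used by the GitHub events API; `first_public_sighting` is a hypothetical helper and no network access is attempted here.

```python
def first_public_sighting(events):
    """Map each repo name to the earliest timestamp at which it appeared.

    Timestamps are ISO 8601 strings in UTC (e.g. "2015-08-11T07:28:00Z"),
    which sort correctly as plain strings.
    """
    earliest = {}
    for event in events:
        repo = (event.get("repo") or {}).get("name")
        created = event.get("created_at")
        if not repo or not created:
            continue  # skip malformed entries
        if repo not in earliest or created < earliest[repo]:
            earliest[repo] = created
    return earliest
```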

latexr · on March 20, 2024

> Were these genuinely private repositories that had never been made public on GitHub?

The poster responded to that:

https://hachyderm.io/@emenel@post.lurk.org/11212861313743638...

> I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private.

chefandy · on March 20, 2024

I'd be interested in hearing from the cited number of people. If it's like 5 randos, then that's quite possibly a misremembering, or even conceivably the victims of some unrelated code theft. If it's a few dozen people, well that would have very different implications.

latexr · on March 20, 2024

One more person:

https://news.ycombinator.com/item?id=39770776

beaugunderson · on March 20, 2024

I went through my private repos and the only one I found was one where I had forked a project, called `documentation`, and my fork made it into the Software Heritage archive. Sometime between then and now I deleted my fork and cloned a private repo with the same name. I confirmed the Software Heritage archive only has the one that was public and not the private one. I feel like we'd be hearing much louder alarm bells if the Software Heritage archive included private code given how many people rely on private GitHub repos.

yreg · on March 20, 2024

That person is unable to provide us with a repo name either.

chefandy · on March 20, 2024

Seems worth tracking publicly. Maybe I should create a GH rep.... er... hm...

agilob · on March 20, 2024

Maybe worth posting leaked Windows, Apple, Intel, Steam, The Witcher codes on GH as private repos and then seeing who will sue whom first.

cavisne · on March 20, 2024

Considering private repos used to require a paid subscription misremembering seems likely.

It used to be public by default, and enough people got confused by this that AWS & Github used to specifically scan repos for accidentally public AWS credentials.

RandallBrown · on March 20, 2024

None of my private repos are included in the software heritage project.

throwanem · on March 20, 2024

Nor mine. But if anyone in the year 10,000 needs to run a 4-digit LED segment display as a clock with a 1st-gen Raspberry Pi, they'll have the code to do it.

eagerpace · on March 20, 2024

That’s really funny to think about. By then today’s internet may be the equivalent of the little NES machine you can buy with all the games on it.

yard2010 · on March 20, 2024

He said year 10,000 not year 2052 tho

simonw · on March 20, 2024

This is a really big claim. I think we need specifics on this - I'm inclined to think this is people not understanding the GitHub public/private repo model (or misremembering the history of their repos) over GitHub deliberately leaking private code to third parties.

ametrau · on March 20, 2024

I have personally experienced this. The person who did it didn’t realise the significance of the repo being “public only for a couple of hours”. I’m also inclined to believe it’s misremembering / misconfiguring.

latexr · on March 20, 2024

I agree that might be the case. Since I do not personally know the author, I elected to use “may” in the submission title. Hopefully we get a definitive answer from those involved.

Still, the post continues to be useful for those who want to opt-out regardless.

politelemon · on March 20, 2024

I would use "claim" rather than may until it's substantiated.

latexr · on March 20, 2024

Reads a bit weirder, but changed.

treffer · on March 20, 2024

I would give it a 50:50 chance.

The other option is that the scraper got lucky with a tiny glitch (whatsoever).

On the one hand I bet github does everything to keep stuff secure, on the other hand I can't believe there wasn't a single glitch in the last few years.

And if there was a glitch then regular automated scrapers are a pretty likely siphon.

Heck I should check if any of my private repositories show up.....

rfoo · on March 20, 2024

> a tiny glitch

That's not "a tiny glitch". What you described is "GitHub may show your private repo to strangers randomly", which is an even more serious issue than them appearing in an archive some time later.

simonw · on March 20, 2024

Right: exposing private source code in this way is a showstopper bug for GitHub, and a major scandal if they've been covering it up.

A lot of companies pay GitHub a LOT of money to securely host their code. A breach like this is a way bigger story than just another "AI is training on your data!" thing.

I continue to doubt that private repos being exposed like this actually happened here.

errantmind · on March 20, 2024

For info, I checked and it looks like none of my AGPL licensed repos are included in The Stack. Neither are my private repos.

tom_ · on March 20, 2024

My GPL ones seem to be excluded.

I've got a couple that are intended to be GPL, but you wouldn't know unless you go to the GitHub issues to find the issue I raised about the licence file not being in the repo. They are included.

shkkmo · on March 20, 2024

It's weird that they exclude GPL licensed repos, but not unlicensed repos. It seems like they would have even fewer rights with the unlicensed ones.

kaibee · on March 20, 2024

There are probably a ton more unlicensed repos than licensed. It isn't weird, it's capitalism.

calvinmorrison · on March 20, 2024

The software heritage sent me an email they were going to steal copies of my software. I think they assumed because it was published as a git repo they had unlimited rights to steal, reproduce, sell, etc my copyrighted work.

To quote the email they sent me:

"The mission of Software Heritage is to collect, preserve and share all the publicly available source code: https://www.softwareheritage.org"

So they're telling me their intent is to reproduce (share) my work, just because it was publicly available.

Upon my reply they did offer to cancel the request, but also told me they are facilitating the storage of my code for users' private theft:

> "Add forge now" requests are submitted by Software Heritage users who think that a forge is worth being archived.

> After a careful examination of your arguments, we acknowledge that your forges may not be archived, so we won't process their ingestion, and close this add forge now request. However, we cannot prevent a publicly available from being archived by anyone using our "Save code now" feature

buffington · on March 20, 2024

Do you happen to still have that email? I'd like to search my inbox for anything similar, and would love some snippets to search for.

Edit: I asked this before the parent commenter included the contents of the email. Thanks parent commenter!

calvinmorrison · on March 20, 2024

Hello Calvin Morrison,

The mission of Software Heritage is to collect, preserve and share all the publicly available source code: https://www.softwareheritage.org

We have received a request to add the forge hosted at the URL below to the list of software origins that are archived, and it is our understanding that you are or know the contact person for this forge.

https://git.ceux.org/

In order to archive the forge contents, we will have to periodically pull the public repositories it contains and clone them into the Software Heritage archive. FAQs for our processes are available:

https://docs.softwareheritage.org/user/faq/#add-forge-now https://www.softwareheritage.org/faq/

Please let us know if there are any issues to consider before we launch the archival of the public repositories hosted on your infrastructure. Please use "Reply all" to ensure our system will process your answer properly.

In the absence of an answer to this message, we will start to archive your forge in the coming weeks. Only the publicly accessible repositories will be archived.

Thank you in advance for your help.

Kind regards, The Software Heritage team

-- bye, pabs

remotefonts · on March 20, 2024

Steal your repos? So you don't have them anymore because they took them? Maybe ask nicely to have them back and with a bit of luck they'll give them back to you.

rpdillon · on March 20, 2024

This is a sound point, even if it's being made a bit sarcastically. Copyright and intellectual property are an entirely different body of law than property rights, and people most often conflate them as an emotionally charged rhetorical technique that's generally rooted in a dissatisfaction that intellectual property and physical property are treated differently. The use of the word "steal" in this context is probably to incite readers to their cause, but ends up just sounding sophomoric to people that know anything about intellectual property law.

It's a shame because the GP had some valuable information to share about the emails software heritage was sending, but likely got downvoted because of this.

calvinmorrison · on March 21, 2024

Problem is GP (me) is a dickhead about it.

but frankly, it's not the first time. Github (after purchase by microsoft) did the same thing to my code. They reproduced it and put it on ice in some place in norway. That was the guise right? I mean I am sure they did, but you can bet your ass they're also using my code to train their AI.

Most of it was not licensed. It was publicly available as a portfolio so people would see I write serious code and would hire me.

So then I took down my entire github after being informed I was enrolled, couldn't opt out, etc, and then some OTHER foundation is now scanning my self hosted git repos? That made my blood BOIL!

calvinmorrison · on March 20, 2024

reproducing my work without license is theft.

rpdillon · on March 20, 2024

I mean, this is a matter of law, and it isn't. It's a curious point to double down on.

bevekspldnw · on March 20, 2024

Curiously, the people most vociferous about digital copying of IP not being “theft” tend to be those people who have produced the least amount of original IP.

Surprisingly, the people generating real, novel, IP like to buy groceries now and then.

mik1998 · on March 20, 2024

Absurd, presumably anecdotal claim.

My own anecdotal experience is that it has no relation and I know many people who have violated the copyright on the IP they themselves produced.

calvinmorrison · on March 20, 2024

mik1998 · on March 20, 2024

People who produce IP either by work for hire or by selling their copyright rights.

I've had people "steal" IP by sending me a PDF of a paper they wrote whose copyright resides with their university. Not the typical image of a thief.

ryan-c · on March 20, 2024

> However, we cannot prevent a publicly available from being archived by anyone using our "Save code now" feature

This is not true. They just had to remove about 500 public repos to comply with my copyright.

calvinmorrison · on March 21, 2024

It's true as in, that's what they told me when I asked.

ryan-c · on March 21, 2024

They're lying.

AshamedCaptain · on March 20, 2024

Have you figured out if there is a way to prevent them from doing that? The email I got, which I assume is similar, was suspiciously lacking a "just GTFO and leave my website alone" option.

dustyharddrive · on March 31, 2024

At least some of their software has unique user agents ("Software Heritage*"). I wish a popular FOSS host like Codeberg would block them.
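For a self-hosted forge served over HTTP, a block on that user agent could look roughly like the sketch below. It assumes the prefix mentioned above ("Software Heritage") is what the crawler actually sends, which the thread does not verify; `is_software_heritage` and `block_archiver` are hypothetical names, and the wrapper is plain WSGI middleware.

```python
def is_software_heritage(user_agent):
    """Return True if the User-Agent starts with the assumed crawler prefix."""
    return user_agent.strip().lower().startswith("software heritage")


def block_archiver(app):
    """Wrap a WSGI app so requests from the assumed crawler get a 403."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if is_software_heritage(agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Archiving of this forge is not permitted.\n"]
        # Everything else is passed through untouched.
        return app(environ, start_response)

    return middleware
```

User-agent matching is advisory at best: a client can send any string, so this only works while the crawler identifies itself honestly.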

calvinmorrison · on March 20, 2024

They explicitly told me they could not blacklist me, but they did do a prompt takedown.

gavinhoward · on March 20, 2024

I blocked their IP address specifically. Contact me if you want it.

caesil · on March 20, 2024

All of the comments here where the commenter found their repos say either "it only has the public ones" or "it has some private ones from x years ago and i don't quite remember if they were ever public".

So it seems probable this is a case of repo owners misremembering.

jtietema · on March 20, 2024

I agree. Only my public repos are in the data set.

latexr · on March 20, 2024

Full text from the post:

> If you had code on GitHub at any point it looks like it might be included in a large dataset called “The Stack” — If you want your code removed from this massive “ai” training data go here:

> https://huggingface.co/spaces/bigcode/in-the-stack

> I found two of my old Github repos in there. Both were deleted last year and both were private. This is a serious breach of trust by Github and @huggingface.

> Remove all your code from Github.

> CONSENT IS NOT OPT-OUT.

tailspin2019 · on March 20, 2024

> "Both were deleted last year and both were private"

I'd really like to know if they were ever public at any point because that might explain it.

None of my repos that have always been private are included (apparently).

That's not to say I'm not concerned by this...

codazoda · on March 20, 2024

This is interesting...

I have 66 repositories picked up and put in The Stack. I spot-checked the first 10. All 10 are Public on GitHub. 8 of the 10 do not have a license of any type, meaning they are covered by copyright, at least in the US, unless GitHub has some terms extending the license of public projects to 3rd parties.

One of mine is Private and was an extension I sold for a short time. I can't say if I ever made it public or not.

https://github.com/codazoda/like_roller

kevincrane · on March 20, 2024

god the language in that opt-out link is patronizing as hell. do AI people just assume everyone is happy to be a cog in their monetization scheme?

CatWChainsaw · on March 23, 2024

They assume that they don't need to care and their assumption has so far proven correct.

samtheprogram · on March 20, 2024

Thought it was bait, but I can confirm I can find private repos in the search results. What the heck?

rfoo · on March 20, 2024

Since your repo's name is public now anyway, could you please help us and post it here? I'm really curious about what happened. Since public GitHub activities were archived [1], if you post it here we can check whether it was ever public or was truly private at all times.

[1] https://www.gharchive.org/
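A check like this can be sketched against GH Archive's hourly dumps. The sketch assumes the hourly file naming on data.gharchive.org (hour not zero-padded) and the 2015-and-later event layout with a `repo.name` field; older dumps use a different schema. `archive_url` and `events_for_repo` are hypothetical helpers, and the download step is left to the caller.

```python
import gzip
import io
import json
from datetime import datetime


def archive_url(hour):
    """Build the URL of the hourly dump that covers the given datetime."""
    # The hour component is not zero-padded in GH Archive file names.
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def events_for_repo(gz_bytes, repo_name):
    """Return (created_at, type) pairs for one repo from a gzipped dump.

    Each line of the decompressed dump is one JSON event. Names are
    compared in lowercase because GitHub repo names are case-insensitive.
    """
    wanted = repo_name.lower()
    matches = []
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue
            event = json.loads(line)
            name = (event.get("repo") or {}).get("name", "")
            if name.lower() == wanted:
                matches.append((event.get("created_at"), event.get("type")))
    return matches
```

Any match means the repo produced a public event during that hour. An empty result proves less, since the archive is known to have gaps.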

yreg · on March 20, 2024

It seems the original ones (from the mastodon post) are

- https://github.com/emenel/dust

- https://github.com/emenel/portfolio

(based on https://archive.softwareheritage.org/browse/search/?q=emenel...)

Care to check them on gharchive? I bet they used to be public.

sdesol · on March 20, 2024

I checked my GitHub archive data and emenel/dust is not there, but emenel/portfolio is. Note, GitHub archive is not 100% accurate and it is missing 319 hours.

latexr · on March 20, 2024

> based on

The data for The Stack’s dataset is sourced from the Software Heritage Archive, so checking that is redundant. We need different sources.

mkishi · on March 20, 2024

Even still, both repos had READMEs [1][2] clearly meant to be read by the public. The archival was only successful years ago, with a failed snapshot as far back as 2021 [3]. This really seems like they forgot it was ever public.

Now, this is only about it being a GitHub breach. Whether unlicensed (emenel/portfolio) or GPL (emenel/dust) code should be allowed in such datasets is a different matter.

[1] https://archive.softwareheritage.org/browse/origin/directory...

[2] https://archive.softwareheritage.org/browse/origin/directory...

[3] https://archive.softwareheritage.org/browse/origin/visits/?o...

yreg · on March 20, 2024

I used SW Heritage to identify the repos that were used for training, since op did not post the repo names.

The “different source” is supposed to be gharchive.

johncoatesdev · on March 20, 2024

How do I check this? I found a repo I'm pretty sure was always private on there that I deleted a while back. https://github.com/johncoates/JCBootstrap

There's no archived version on archive.org at least.

simonw · on March 20, 2024

https://archive.softwareheritage.org/browse/origin/directory... shows a "snapshot date" of "11 August 2015, 07:28:00 UTC" - any chance it was public on that date such that the crawler could have accessed it?

sdesol · on March 20, 2024

I checked my GitHub archive (https://www.gharchive.org/) indexed data and the only repo that I saw for johncoates was LanscapeVideos, which has a last event time of 2015-06-09 07:09:52+02

It is important to note that GitHub archive is not 100% accurate and there are over 319 missing hours.

johncoatesdev · on March 20, 2024

I can't find any reason why I would have made it public. I made the repo in 2014 for internal use and don't like to share projects like that. I'm pretty careful when releasing any code publicly. It's some code that other private projects depend on. I searched for any references in public code and there are none, so there should have been no reason to make it public.

Interestingly my public code with thousands of stars isn't in "The Stack".

bredren · on March 20, 2024

This shouldn't be something where we're relying on recollection.

Presumably github repo privacy state has an audit trail. This would allow GH to prove / disprove claims on any given repo easily. I hope a rep steps in to do so.

johncoatesdev · on March 20, 2024

Yeah I agree. Tried https://news.ycombinator.com/item?id=39771541 but there's nothing related to this repo. Does GitHub send an email out when you make something public? I don't have any emails related to this repo.

simonw · on March 20, 2024

I just upgraded the tool at https://observablehq.com/@simonw/github-public-repo-history to use lowercase comparisons (previously it was case sensitive) so it's worth having another look.

tailspin2019 · on March 20, 2024

Were those repos always private?

simonw · on March 20, 2024

Which private repos? Any that you're willing to share (I get that sharing names of private repos goes against the whole idea of them being private!)

simonw · on March 20, 2024

Since a lot of this depends on whether someone had ever had a private repo public in the past, I was hoping it could be resolved using the GitHub security audit log.

You can access that for your repos here: https://github.com/settings/security-log

Then search for repo:simonw/datasette or similar.

But... it looks like the audit log only goes back 6 months, so sadly it's not useful for reviewing this particular situation which involves repos that could have been 5 or more years old.

The ClickHouse copy of the GitHub Archive is useful for reviewing things and goes back a lot further. Try it here:

https://play.clickhouse.com/play?user=play

You can run this query to see relevant events for a specific username:

    with public_events as (
      select
        created_at as timestamp,
        'Private repo made public' as action,
        repo_name
      from github_events 
      where actor_login = 'simonw'
      and event_type in ('PublicEvent')
    ),
    most_recent_public_push as (
      select
        max(created_at) as timestamp,
        'Most recent public push' as action,
        repo_name
      from github_events
      where event_type = 'PushEvent'
      and actor_login = 'simonw'
      group by repo_name
    ),
    combined as (
      select * from public_events
      union all select * from most_recent_public_push
    )
    select * from combined order by timestamp

The PublicEvent one is "When a private repository is made public" according to https://docs.github.com/en/rest/using-the-rest-api/github-ev...

I just built a tool for running this query without having to type in the SQL: https://observablehq.com/@simonw/github-public-repo-history

Explained in this TIL: https://til.simonwillison.net/clickhouse/github-public-histo...
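For anyone who wants to run the query above for a different account, here is a minimal sketch that fills in the username safely. `public_history_query` is a hypothetical helper, and the username pattern is only an approximation of GitHub's rules, used here to keep quotes and other SQL syntax out of the string; the query text itself is the one from the comment.

```python
import re

# Approximation of GitHub's username rules (an assumption): letters,
# digits and hyphens, starting with a letter or digit, at most 39 chars.
_USERNAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]{0,38}")

QUERY_TEMPLATE = """\
with public_events as (
  select
    created_at as timestamp,
    'Private repo made public' as action,
    repo_name
  from github_events
  where actor_login = '{login}'
  and event_type in ('PublicEvent')
),
most_recent_public_push as (
  select
    max(created_at) as timestamp,
    'Most recent public push' as action,
    repo_name
  from github_events
  where event_type = 'PushEvent'
  and actor_login = '{login}'
  group by repo_name
),
combined as (
  select * from public_events
  union all select * from most_recent_public_push
)
select * from combined order by timestamp
"""


def public_history_query(username):
    """Return the SQL for one account, rejecting anything odd-looking."""
    if not _USERNAME.fullmatch(username):
        raise ValueError(f"not a plausible GitHub username: {username!r}")
    return QUERY_TEMPLATE.format(login=username)
```

The resulting string can be pasted into the ClickHouse playground linked above. Validating the name before interpolation matters because the value ends up inside a quoted SQL literal.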

dangnetgod · on March 20, 2024

[flagged]

buffington · on March 20, 2024

Not all of us do, bro.

jmuguy · on March 20, 2024

Need more proof than "pretty sure they were private" and "heard from a number of people".

tailspin2019 · on March 20, 2024

From one of the comments on that post:

> I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.

> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.

> That's my best guess.

latexr · on March 20, 2024

The OP responded to that:

https://hachyderm.io/@emenel@post.lurk.org/11212861313743638...

> thanks for the additional info. I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private. So i’m not sure what to think or how to explain it.

psuedo_uuh · on March 20, 2024

It blows my mind that we’re all fine with “the home of open source software” being closed source

bogwog · on March 20, 2024

The way I rationalize my use of Github to myself is by framing it as me stealing free compute and storage from the evil Microsoft.

HaZeust · on March 20, 2024

Trust me, they're making it back.

mdaniel · on March 20, 2024

I'd presume if nothing else in copilot subscriptions, populated by harvesting umpteen bazillion repos, issues, and maybe even CI job run logs

mdswanson · on March 20, 2024

https://aibusiness.com/nlp/github-copilot-loses-20-a-month-p...

simonw · on March 20, 2024

The accuracy of that story was vigorously denied by people in a position to know: https://twitter.com/natfriedman/status/1712140497127342404

mnau · on March 20, 2024

I am pretty sure they would just clone stuff from BitBucket or SF if they didn't have it on their own platform.

mdaniel · on March 20, 2024

especially with the number of actual FOSS alternatives available right now, but that network effect, whew, it's strong :-(

mnau · on March 20, 2024

I am considering moving away because of the network effect. The more popular a repo is, the more work there is (issues, PRs...). It's not scalable.

Using another platform or self-hosting would introduce friction.

bmitc · on March 20, 2024

You can disable issues.

mnau · on March 20, 2024

That's not the point.

I am not opposed to issues per se. The problem is dealing with the amount; it's exhausting. It's just too easy to open an issue or comment on GitHub because of the network effect. This sums it up pretty well: https://nolanlawson.com/2017/03/05/what-it-feels-like-to-be-... (my workload is of course far smaller, but still it takes time and saps energy)

Yes, that issue has been open for 4 years. I don't consider it important, and none of the "me too" commenters did anything about it either. At best I sometimes get a drive-by PR, which takes far more time to deal with than if I just did it myself.

bmitc · on March 20, 2024

What FOSS solutions give me an editor in the web, Codespaces, free CI/CD compute, free website hosting, etc.?

mdaniel · on March 20, 2024

I feel as though your question conflates FOSS with free SaaS compute, but https://salsa.debian.org/help/instance_configuration#gitlab-... shows they are using the community edition (MIT https://gitlab.com/gitlab-org/gitlab-foss/-/blob/v16.10.0/LI... ) and using GitLab Pages so that's "editor in the web", "CI/CD compute", "website hosting" right there. I believe codespaces is "we run a docker container with vscode in it" so kind of a subset of "CI compute" but since I don't use that, I can't speak to whether it's included in the FOSS side of GitLab or not

bmitc · on March 21, 2024

> I feel as though your question conflates FOSS with free SaaS compute

My comment did not. The comment I replied to did. It said:

> especially with the number of actual FOSS alternatives available right now

I find GitLab quite lacking compared to GitHub and don't particularly see it as a compelling alternative. Plus, there is no actual benefit to it being FOSS.

mdaniel · on March 21, 2024

https://gitlab.com/gitlab-org/gitlab/-/merge_requests?scope=... would disagree as would the top section of every release notes post showing which community member made the biggest contribution to that GitLab release. That's not even counting the fun things I can do to my own copy of GitLab which I can modify and host for my organization's needs, no AGPL, no reverse engineering obfuscated ruby from a .vhd, just actual open source

You are welcome to find GitLab lacking (there's currently at least 66500 people who agree with you), and "compelling" is up to you, but to say it's not a full featured competitor to GitHub is disingenuous

angst_ridden · on March 20, 2024

I'm curious how they get around licenses. For example, I have repos that show up in The Stack. Some have licenses that require inclusion of the copyright in any source re-use or redistribution.

IANAL, but it seems like inclusion in the data set and subsequent distribution without the copyright notice would be a violation.

AshamedCaptain · on March 20, 2024

In The Stack FAQ, they claim that they are doing minimal analysis of the LICENSE file and SPDX tags.

I'd bet that this is enough to detect cases like GPL code, but I also bet that if this analysis fails instead of falling back to "unknown license, assume proprietary, don't copy" they fall back to "free lunch!". Because reasons.
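For illustration, the conservative fallback described above ("unknown license, assume proprietary, don't copy") could look roughly like this. This is a sketch only and not The Stack's actual pipeline: the allowlist, the function names `detect_spdx` and `may_include`, and the decision to reject compound expressions are all assumptions made for the example. Only the `SPDX-License-Identifier:` tag convention itself is standard.

```python
import re

# Assumption: a small illustrative allowlist, not anyone's real policy.
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"}

# Capture everything after the tag up to the end of that line.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:[ \t]*([^\r\n]*)")


def detect_spdx(text):
    """Return the first SPDX expression found in text, or None."""
    match = SPDX_TAG.search(text)
    if not match:
        return None
    expr = match.group(1).strip()
    # Drop a trailing comment closer such as "*/" or "-->".
    for closer in ("*/", "-->"):
        if expr.endswith(closer):
            expr = expr[: -len(closer)].strip()
    return expr or None


def may_include(text):
    """Default-deny: include only on an exact match with the allowlist.

    Missing tags, unknown identifiers, and compound expressions
    (e.g. "MIT AND GPL-2.0-only") all fall through to False.
    """
    return detect_spdx(text) in PERMISSIVE
```

The point of the design is that every failure mode of the detector lands on "exclude" rather than "include".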

angst_ridden · on March 20, 2024

I suspect you're right.

Even permissive licenses like MIT and BSD require attribution and preservation of copyright notices, though. Maybe their AI just can't "reliably detect" licenses.

xyst · on March 20, 2024

Probably some obscure GH legal clause stating “we own your data. Ownership is implied and we may do anything with it. Private vs public is a concept of accessibility over the internet. It doesn't necessarily mean it’s not accessible via intranet or other non-public means”

It’s similar to the legal clauses used for decades on social and video hosting platforms.

angst_ridden · on March 20, 2024

I don't think that gets around the code licenses. They may use it as if it does, but I'm not convinced that would hold up if someone were wealthy enough to make a court case.

ADeerAppeared · on March 20, 2024

GH interestingly doesn't grant themselves nor others that many rights.

---

  You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

  This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.

---

  Any User-Generated Content you post publicly, including issues, comments, and contributions to other Users' repositories, may be viewed by others. By setting your repositories to be viewed publicly, you agree to allow others to view and "fork" your repositories (this means that others may make their own copies of Content from your repositories in repositories they control).

  If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).

---

They don't even include training for copilot (though a dodgy lawyer will likely try to include that as "part of providing the service"). 3rd parties only get a license to fork your repo, seemingly not even a license to do anything with that repo. (And hot take: Github should just let people disable the fork button already.)

bionhoward · on March 20, 2024

What if we include the Microsoft Services Agreement? GitHub Copilot is a Microsoft AI Service

Retr0id · on March 20, 2024

47 of my repos are in the stack, and none of them are private (and I have plenty of private repos)

multimoon · on March 20, 2024

Same here. I think some people are confusing private now vs private since creation.

Retr0id · on March 20, 2024

I do think, however, that "repo no longer public" should be treated as an opt-out signal.

Mtinie · on March 20, 2024

Which is valid, but what's the timeframe for rechecking and validating if there's been a change? I'll posit it's practically impossible to catch every single change within a reasonable timeframe to ensure that "a repo no longer public, today, within a data set collected earlier" would be excluded.

I may be misunderstanding your suggestion. If so, I'm curious to learn what I missed.

Retr0id · on March 20, 2024

My assumption is that the "my private repo was included" crowd are misremembering that their repo used to be public at some point in the distant past (~years). My suggestion would be that they re-scan before each major revision of the dataset (such as this upcoming "v2" release). This would be a relatively expensive process given the numbers involved, but so be it.

(edit: I think "v2" is already released, but you get my point)
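A rough sketch of what such a re-scan could look like, assuming an unauthenticated request to GitHub's REST endpoint `GET /repos/{owner}/{repo}`, which answers 200 for a public repo and 404 for one that is private, deleted, or nonexistent. The function names, the `User-Agent` string, and the three outcome labels are made up for the example, and a real run over millions of repos would need authentication and rate-limit handling.

```python
import urllib.error
import urllib.request


def decide(status):
    """Map an HTTP status to an action for the next dataset revision."""
    if status == 200:
        return "keep"      # still publicly visible
    if status == 404:
        return "drop"      # private, deleted, or never existed
    return "recheck"       # rate limit, server error, etc.: don't guess


def fetch_status(full_name):
    """Return the HTTP status for 'owner/repo' (performs a network call)."""
    request = urllib.request.Request(
        f"https://api.github.com/repos/{full_name}",
        headers={
            "User-Agent": "rescan-sketch",  # hypothetical client name
            "Accept": "application/vnd.github+json",
        },
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def rescan(full_names):
    """Yield (repo, action) pairs for a batch of repos."""
    for name in full_names:
        yield name, decide(fetch_status(name))
```

Anything that isn't a clear 200 or 404 is deliberately left as "recheck" so that a transient error never silently keeps or drops a repo.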

Mtinie · on March 20, 2024

I can get behind this idea. Thank you for clarifying.

inopinatus · on March 20, 2024

None of mine are in the stack and I am now personally offended that all those crappy little one-file Swift and sh utilities will be excluded from the singularity

johncoatesdev · on March 20, 2024

From the reports it looks like they have to have been private and deleted

sdflhasjd · on March 20, 2024

Hmm, I do see some of my repos there, but only public ones. It would seem a bit too foolish of GitHub to put private repositories in a public dataset.

Is there any other corroboration or proof?

jaggederest · on March 20, 2024

It would be fun to upload a private repo with AWS keys in it as a canary. If the account ever gets used, the repo is no longer private?

ergonaught · on March 20, 2024

As with others, I can say they only seem to have my repos that were ever public. Nothing of mine that was always private is showing.

jddj · on March 20, 2024

Same here, public only

derriz · on March 20, 2024

Yep - same. My public repos show but not my private one.

nromiun · on March 20, 2024

I also see a 7-year-old private repo from my account. But I may have made it public initially, I don't really remember. Does anyone know how to check if a repo was ever made public?

rfoo · on March 20, 2024

Check if it ever appeared in gharchive [1].

[1] https://www.gharchive.org/
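A minimal sketch of that check, assuming the hourly dumps are served as `https://data.gharchive.org/<YYYY-MM-DD-H>.json.gz` with one JSON event per line and the repo given as `repo.name` in `owner/repo` form (the shape used for 2015-onward events; older dumps differ). The function names are made up for the example. GH Archive records public events only, so a hit means the repo was public at that time.

```python
import gzip
import json
import urllib.request


def repo_in_events(lines, full_name):
    """True if any event line names the repo (case-insensitive)."""
    wanted = full_name.lower()
    for line in lines:
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        if not isinstance(event, dict):
            continue
        repo = event.get("repo") or {}
        if isinstance(repo, dict) and str(repo.get("name", "")).lower() == wanted:
            return True
    return False


def repo_in_hour(full_name, hour):
    """Scan one hourly dump, e.g. hour='2017-03-05-15' (network call)."""
    url = f"https://data.gharchive.org/{hour}.json.gz"
    with urllib.request.urlopen(url) as response:
        with gzip.open(response, mode="rt", encoding="utf-8") as handle:
            return repo_in_events(handle, full_name)
```

Scanning hour by hour is slow for a multi-year window; it only makes sense if you can narrow down when the repo might have been public.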

abdullahkhalids · on March 20, 2024

If the repo was public for a while (say 2 years), and then became private, then the code in the Stack for your repo must be 5 years old.

If on the other hand, it's tracked to the latest commit, that is a different scenario.

netule · on March 20, 2024

The Wayback Machine [0] if you're lucky (?) enough to have had it indexed.

[0]: https://web.archive.org/

roland35 · on March 20, 2024

It is always interesting to see how copilot autocompletes things like "Todo: " or "my ssh key" and see what data leaks through!

zzo38computer · on March 20, 2024

I do not use private GitHub repositories. (If I want something private, I will store it on my own computer and/or on DVDs, etc.)

If they are using private data in an AI dataset (or for other uses) then that is a serious issue; they are copying data which is meant to be private. (Public files are public, and copies of them should be available for others to use too, though.)

AshamedCaptain · on March 20, 2024

So let me understand this.

Step 1) "Software Heritage" crawls the web (not only Github; they definitely crawl ANY gitlab instances they find online among practically everything else). They "store" ANY type of source code, regardless of the license of that code. They admit as much in their own FAQ ( https://www.softwareheritage.org/faq/#24_What_is_the_policy_... ) :

> Software Heritage archives everything that is publicly available, without preliminary tests or checks [for LICENSE file or others]. You are responsible for checking whether the source code you find in the archive can be reused, and under which terms.

Step 2) "Hugging Face" uses the Software Heritage dataset to build some AI training dataset ("The Stack"), again, completely ignoring licensing. You apparently have to manually opt-out if you don't want your COPYRIGHTED source code to be included there. But as far as I can see, opting out is only considered AT ALL for Github repositories via https://huggingface.co/datasets/bigcode/the-stack-v2 . If you have your source code published outside Github, then your code appears to be used, period.

> The Stack v2 is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack v2 must abide by the terms of the original licenses [...]

Step 3) Eventually and inevitably someone releases some model trained on this dataset, ignoring licensing again.

Step 4) Someone uses code generated by such a chatbot, unknowingly violating every software license known to man.

Step 5) ???

Step 6) Profit!

Where do these "Software Heritage" guys mention which user-agent their crawler is using so that I can permanently ban them from my websites?

At least archive.org does that and they have a much nicer way to request exclusion from their archive.

gavinhoward · on March 20, 2024

I had this problem with Software Heritage for non-Open Source stuff in my personal Gitea.

I eventually had to specifically block their IP addresses from accessing my Gitea and threaten legal action before they took the offending repos down.

I still have them blocked.

a_wild_dandan · on March 20, 2024

You lost me at step four. If reading source is fair use and generating content isn't derivative work, this is kosher. As litigation mounts, we'll learn more.

To me, a learning path that's acceptable for humans is acceptable for AIs. All of the rationales for treating AI/human learning distinctly feel flimsy and arbitrary. "The difference is learning scale." So what? "LLMs just parrot derivatives of their training data." No, they don't. That's demonstrably false, and theoretically absurd, for all the common cited reasons. "AIs obviously aren't people. So their learning restrictions must be different." Why? "People are financially benefiting from the AI's knowledge!" Yep, that's generally how employment works. Etc, etc.

My take could be totally wrong. I get that. It'll be interesting to see where the courts fall on the issue. But personally, the critics' arguments feel profoundly crummy and counterproductive; applying their logic to pre-LLM systems usually results in horrendous alternative realities. But that's just, like, my opinion, man.

AshamedCaptain · on March 21, 2024

Having an agenda to push much?

Anyway, this is strictly one level above that debate, as here they are scraping source code which is NOT free, and whose licenses could explicitly forbid copying for use in training datasets, or by the military, or even by people whose eye color I don't like.

For example, example code for proprietary tools which usually allows you to only copy it strictly for purposes of extending the proprietary tool.

matt3210 · on March 20, 2024

Opt-out consent is the equivalent of assuming a partner consents because they haven't said anything.

Consent is not Opt-out == Yes means Yes

pityJuke · on March 20, 2024

This organization claims that they'll allow you to opt out, but they seemingly haven't done so for the oldest request on their repository [0]. Pathetic.

[0]: https://github.com/bigcode-project/opt-out-v2/issues/1

t1c · on March 20, 2024

Even though they were only my public repos in there, I really don't like this being opt out instead of opt in, especially since some of those repositories were not under fully permissive licenses

matt3210 · on March 20, 2024

I and most of my coworkers have all found at least one private repo which we believe was always private to be in the data set version 2.

misterpigs · on March 20, 2024

This is a nothingburger.

I have "private" repos listed in the dataset but they were all at one time public. Searching the SoftwareHeritage site I can find those once-public repos with ancient commits.

My private repos that were always private are not listed in the dataset.

latexr · on March 20, 2024

https://archive.is/nWzk4

pluc · on March 20, 2024

Could "private" mean "kept from the outside world"?

max_ · on March 20, 2024

Time for an end to end encrypted Git Service.

LightFog · on March 20, 2024

Time to start large scale poisoning of repos.

mtam · on March 20, 2024

Just to add yet another data point. Only my Public repos included. All my private repos are not included.

drpossum · on March 20, 2024

I'm guessing these were public once (mine were), but I'll be damned if I let github host anything of mine again.

tamimio · on March 20, 2024

Encrypt your private repos..

falsandtru · on March 20, 2024

Is there any other good service for enterprise development?

throwitaway222 · on March 20, 2024

I mean, there are also a lot of "public" github repos that contain absolutely copyrighted work too.

rvz · on March 20, 2024

[flagged]

ametrau · on March 20, 2024

Was there a previous time?

nyc_data_geek · on March 20, 2024

I am Jack's complete lack of surprise.

In all seriousness, without strong data privacy regulations (e.g., GDPR), we will continue to see this sort of stuff, as the potential monetary rewards for using this sort of data far outweigh the potential liability. Cost of doing business type stuff, rather than an existential risk for abusing public trust.

My opinion is that data should be treated like a fissile element - very dangerous to hold and store, but extremely powerful when properly employed. However, it's only dangerous if the liability of storing it is significant and real; as of today, it's not (in the US).

xyst · on March 20, 2024

The crypto craze stole our energy to validate digital currency transactions on the blockchain.

Now the AI craze stole our code and collective knowledge to ultimately train their models, and hopefully replace SWEs and other knowledge-based fields (e.g., medicine).

At least with blockchain, some people got rich. But with the AI craze the only people getting rich are the rich themselves.

latexr · on March 20, 2024

> At least with blockchain, some people got rich.

But plenty more lost money or went bankrupt. The ones who got rich did so at the expense of other non-rich people.

remotefonts · on March 20, 2024

> But plenty more lost money or went bankrupt. The ones who got rich did so at the expense of other non-rich people.

[citation needed]

a_wild_dandan · on March 20, 2024

https://en.wikipedia.org/wiki/Cryptocurrency_and_crime#notab...

https://en.wikipedia.org/wiki/Bankruptcy_of_FTX

https://en.wikipedia.org/wiki/OneCoin

https://en.wikipedia.org/wiki/Bitconnect

Etc.

remotefonts · on March 20, 2024

LOL what about the amounts the banks and bankers stole from us in the last decade and a half? Why don't you complain about that also? I'm sure it dwarfs whatever scammers have stolen in crypto. But you guys hate that you missed your chance. But it's never too late. You can still invest. It's not over yet. You can't teach me anything at all about the world of crypto, believe me. I'm one of the MtGox creditors, and that has been going on for 10 years now

latexr · on March 21, 2024

https://en.wikipedia.org/wiki/Whataboutism

https://en.wikipedia.org/wiki/Moving_the_goalposts

codr7 · on March 20, 2024

Logical reasoning is still a valuable tool.

Where else would the money have come from?

remotefonts · on March 20, 2024

I got "rich" also, but I've never speculated, and TBH I've sold and spent (more spent than sold) at all kinds of valuations, and most of the riches are still unrealized, so, sorry if I doubt what you say.