[flagged] Claim: Private GitHub repos included in AI dataset (lurk.org)
193 points by latexr on March 20, 2024 | 167 comments


This is talking about The Stack. The poster says:

"I found two of my old Github repos in there. Both were deleted last year and both were private."

The Stack was constructed a while ago, so "deleted last year" wouldn't have an impact if it was constructed before then.

"Both were private" is the thing that needs to be unpacked here. Were these genuinely private repositories that had never been made public on GitHub?

https://huggingface.co/datasets/bigcode/the-stack-v2 talks about where the Stack comes from: "This dataset is derived from the Software Heritage archive, the largest public archive of software source code and accompanying development history"

You can search that here: https://archive.softwareheritage.org/browse/search/?q=simonw... - it would be interesting to know if the OP's "private" repos are included in that collection.


Here's a reply from a GitHub staff member: https://mastodon.social/@correcthorse/112128192392083842

> I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.

> https://huggingface.co/datasets/bigcode/the-stack-v2

> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.

> That's my best guess.


Depending on the license, how is this legal?


In the US: Fair use disregards licenses. Fair use can be found to apply, or found not to apply, by a court of law. Archival of information is generally felt to be fair use, as is search indexing.


The real argument is when is AI re-use of copyrighted material a violation of copyright. That is a large grey area that will probably be determined in favor of large corporations and not in favor of individuals. (As in, Disney can use AI writers to copy you, but you won't be allowed to copy Disney)


Software Heritage aggressively insists on French law... which does not have fair use.


Which bit? The archiving on SoftwareHeritage, the gathering of that data into the Stack or the subsequent training of models?


That’s a good question… There seem to be two problems.

The definition of open source depends on a license existing in a repo. Without a license it’s not legal to copy and distribute.

Public vs. private repo is a platform issue, not the code maintainer’s.

If a public repo does not have a license, it does not mean it is free to copy and distribute.

If a private repo has an open source license like MIT, then the crawler has a right to copy and distribute that repo. Regardless if it has authorization to access the repo or not.


> Without a license it’s not legal to copy and distribute.

Yes it is. Due to both the terms you agree when you use GitHub and the general Implied License that covers everything public on the internet.

https://en.wikipedia.org/wiki/Field_v._Google,_Inc.


Looking at that ruling, it seems the case you linked to hinged on a fact not applicable with the Stack:

>Field had actual knowledge of the Googlebot. He also was aware of the ways to prevent Google from either listing his site at all or listing it but not providing a link to the cached version. Instead of opting out, however, he chose to allow Google to both index and provide a link to the cached version.

For the AI dataset, (A) did the person know their work was being collected by this group and for this purpose, and (B) did they know of a way to prevent that collection?


It is not clear to me if they are _only_ using GitHub as source. The Stack explicitly mentions they are using Software Heritage as source and Software Heritage definitely sources from repositories that are NOT stored in GitHub (and never have been).


I don’t think that “implied license” you’re referring to holds up in the courts.


Hopefully the crawler is smart enough to properly handle edge cases...

E.g. if the repo has some sort of /used-licenses/ folder where the licenses for packages and the like are included, it could make a bad decision.


> Without a license it’s not legal to copy and distribute.

Is this true? When you post anything publicly, from sticking a poster on the street to making artwork like banksy, isn’t the default set to “it’s legal to copy, unless explicitly stated otherwise”?


The default in the majority of the world is that most creative works (including software code) are by-default copyrighted by the author, and the author must explicitly license away those rights. Some jurisdictions (e.g. France) put limits on what rights the author is allowed to give up. I.e., the default is it is illegal to copy (subject to exemptions like “fair use”).


Note that this archive project is French.


Banksy apparently runs a licensing program. Their artwork is most definitely under copyright, and they rely on trademark protection as well.

There is also the practical issue that a lot of content is posted publicly without consent of the copyright owner. It's simply not true that just because someone else committed a copyright violation first, you can commit further violations with impunity based on that first violation.


> If a public repo does not have a license, it does not mean it free to copy and distribute.

Whether or not it is free to copy and distribute, it should be free to copy and distribute. (My opinion is that copyright is no good; if the file is public then you should be allowed to copy and distribute it.)

> If a private repo has an open source license like MIT, then the crawler has a right to copy and distribute that repo. Regardless if it has authorization to access the repo or not.

I should not think so. The license would only apply if you have a copy of it anyways. If you are not authorized to access it because it is private, then you would have to get a copy from somewhere else, and if nobody else is providing a copy, that shouldn't give you the right to unauthorized access. However, if it has been done, then it is done, so now there is a copy, and the license (if it is a license that allows copying it in this way) would authorize you to continue to use and distribute the copy that you have.


i’m not saying I agree or care about any of it. A sane company would never allow the use of source code from a third party without a license.

If a repo is forked and the license is deleted, the source code would need to be hashed to verify it’s the exact version of an open source repo. Mainly they don’t want a copyleft or “malicious” license infecting their IP.

If the hashes don’t match then it’s not technically the same code, so a company can’t safely use it without a license.
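
A rough sketch of that hash check, assuming both the fork and the upstream release are checked out locally. The helper names (`file_digests`, `same_code`), the use of SHA-256, and the idea of ignoring a missing `LICENSE` file are my own illustrative choices, not something any particular company is known to do.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's relative path to the SHA-256 of its contents."""
    base = Path(root)
    digests = {}
    for path in base.rglob("*"):
        rel = path.relative_to(base)
        # Skip directories and Git's own metadata.
        if not path.is_file() or ".git" in rel.parts:
            continue
        digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def same_code(fork_root, upstream_root, ignore=("LICENSE",)):
    """True if both trees hold identical files, apart from names in `ignore`."""
    fork = {k: v for k, v in file_digests(fork_root).items() if k not in ignore}
    upstream = {k: v for k, v in file_digests(upstream_root).items() if k not in ignore}
    return fork == upstream
```

If `same_code("fork/", "upstream-v1.2/")` is false, then per the argument above the fork can't be treated as the licensed upstream version. Both directory names are placeholders.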


[flagged]


Why would it be GitHub’s responsibility to delete the repo from a third party service?


And would that extend to anyone that ever cloned it? No, that's ridiculous.


Software Heritage isn't GitHub.

https://en.wikipedia.org/wiki/Software_Heritage

> Software Heritage is a non-profit organization which provides a service for archiving and referencing historical and contemporary software — with a focus on human readable source code. The site was unveiled in 2016 by Inria [1] and is supported by UNESCO. The project itself is structured as a non‑profit multi‑stakeholder initiative.

> Development of Software Heritage began at Inria under the direction of computer scientists Roberto Di Cosmo and Stefano Zacchiroli in early 2015, and the project was officially announced to the public on June 30, 2016.

---

I am not sure that GitHub has the authority to do a take down of a repository on a different server (and jurisdiction - it's based out of France) on behalf of a user.


By “archived there even after you deleted from github” I believe they mean “archived on SoftwareHeritage and continue to exist there even after you deleted from github”.


I have about fifty private github repos and hundreds more public github repos (half of them forks of other public repos). I've verified that none of my private repos are in the dataset, and all of my public repos are.

I was expecting to see at least one repo that shouldn't be there depending on when the dataset was put together. In 2015 I changed a repo from public to private, which I think might suggest that the dataset was built after 2015 since my now private repo isn't in the dataset?


I did the same and see none of my private repos. Which is quite a lot. I do see plenty of deleted repos but those were public.


Github publicly streams the global change log including all public repos. I have a public repo and noticed there were multiple clones daily after every commit I made. That's when I discovered the API: https://docs.github.com/en/rest/activity/events?apiVersion=2...

If the repo was public even for a single commit, that was likely cloned and replicated elsewhere.
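
To see how easy that stream is to follow, here is a minimal sketch using only the standard library. It assumes the unauthenticated `https://api.github.com/events` endpoint and the usual event shape (`type`, `repo.name`); the function names and the `User-Agent` string are made up for illustration, and unauthenticated requests are rate limited.

```python
import json
import urllib.request

EVENTS_URL = "https://api.github.com/events"  # GitHub's "list public events" endpoint


def fetch_public_events(url=EVENTS_URL):
    """Download the most recent page of public events as a list of dicts."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "events-sketch",  # placeholder value
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def repos_seen(events, event_type="PushEvent"):
    """Return the set of repository names that appear in events of one type."""
    return {e["repo"]["name"] for e in events if e.get("type") == event_type}
```

Calling `repos_seen(fetch_public_events())` lists the repos that were just pushed to, which is all a scraper needs in order to know what to clone next.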


> Were these genuinely private repositories that had never been made public on GitHub?

The poster responded to that:

https://hachyderm.io/@emenel@post.lurk.org/11212861313743638...

> I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private.


I'd be interested in hearing from the cited number of people. If it's like 5 randos, then that's quite possibly a misremembering, or even conceivably the victims of some unrelated code theft. If it's a few dozen people, well that would have very different implications.



I went through my private repos and the only one I found was one where I had forked a project, called `documentation`, and my fork made it into the Software Heritage archive. Sometime between then and now I deleted my fork and cloned a private repo with the same name. I confirmed the Software Heritage archive only has the one that was public and not the private one. I feel like we'd be hearing much louder alarm bells if the Software Heritage archive included private code given how many people rely on private GitHub repos.


That person is unable to provide us with a repo name either.


Seems worth tracking publicly. Maybe I should create a GH rep.... er... hm...


Maybe worth posting leaked Windows, Apple, Intel, Steam, The Witcher codes on GH as private repos and then seeing who will sue whom first.


Considering private repos used to require a paid subscription misremembering seems likely.

It used to be public by default, and enough people got confused by this that AWS & Github used to specifically scan repos for accidentally public AWS credentials.


None of my private repos are included in the software heritage project.


Nor mine. But if anyone in the year 10,000 needs to run a 4-digit LED segment display as a clock with a 1st-gen Raspberry Pi, they'll have the code to do it.


That’s really funny to think about. By then today’s internet may be the equivalent of the little NES machine you can buy with all the games on it.


He said year 10,000 not year 2052 tho


This is a really big claim. I think we need specifics on this - I'm inclined to think this is people not understanding the GitHub public/private repo model (or misremembering the history of their repos) over GitHub deliberately leaking private code to third parties.


I have personally experienced this. The person who did it didn’t realise the significance of the repo being “public only for a couple of hours”. I’m also inclined to believe it’s misremembering / misconfiguring.


I agree that might be the case. Since I do not personally know the author, I elected to use “may” in the submission title. Hopefully we get a definitive answer from those involved.

Still, the post continues to be useful for those who want to opt-out regardless.


I would use "claim" rather than may until it's substantiated.


Reads a bit weirder, but changed.


I would give it a 50:50 chance.

The other option is that the scraper got lucky with a tiny glitch (whatsoever).

On the one hand I bet github does everything to keep stuff secure, on the other hand I can't believe there wasn't a single glitch in the last few years.

And if there was a glitch then regular automated scrapers are a pretty likely siphon.

Heck I should check if any of my private repositories show up.....


> a tiny glitch

That's not "a tiny glitch". What you described is "GitHub may show your private repo to strangers randomly" which is an even more serious issue than them appearing in an archive some time later.


Right: exposing private source code in this way is a showstopper bug for GitHub, and a major scandal if they've been covering it up.

A lot of companies pay GitHub a LOT of money to securely host their code. A breach like this is a way bigger story than just another "AI is training on your data!" thing.

I continue to doubt that private repos being exposed like this actually happened here.


For info, I checked and it looks like none of my AGPL licensed repos are included in The Stack. Neither are my private repos.


My GPL ones seem to be excluded.

I've got a couple that are intended to be GPL, but you wouldn't know unless you go to the GitHub issues to find the issue I raised about the licence file not being in the repo. They are included.


It's weird that they exclude GPL licensed repos, but not unlicensed repos. It seems like they would have even fewer rights with the unlicensed ones.


There's probably a ton more unlicensed repos than licensed. It isn't weird, it's capitalism.


The software heritage sent me an email they were going to steal copies of my software. I think they assumed because it was published as a git repo they had unlimited rights to steal, reproduce, sell, etc my copyrighted work.

To quote the email they sent me:

"The mission of Software Heritage is to collect, preserve and share all the publicly available source code: https://www.softwareheritage.org"

So they're telling me their intent is to reproduce (share) my work, just because it was publicly available.

Upon my reply they did offer to cancel the request, but also told me they are facilitating the storage of my code for users' private theft:

" "Add forge now" requests are submitted by Software Heritage users who think that a forge is worth being archived.

After a careful examination of your arguments, we acknowledge that your forges may not be archived, so we won't process their ingestion, and close this add forge now request. However, we cannot prevent a publicly available from being archived by anyone using our "Save code now" feature


Do you happen to still have that email? I'd like to search my inbox for anything similar, and would love some snippets to search for.

Edit: I asked this before the parent commenter included the contents of the email. Thanks parent commenter!


Hello Calvin Morrison,

The mission of Software Heritage is to collect, preserve and share all the publicly available source code: https://www.softwareheritage.org

We have received a request to add the forge hosted at the URL below to the list of software origins that are archived, and it is our understanding that you are or know the contact person for this forge.

https://git.ceux.org/

In order to archive the forge contents, we will have to periodically pull the public repositories it contains and clone them into the Software Heritage archive. FAQs for our processes are available:

https://docs.softwareheritage.org/user/faq/#add-forge-now https://www.softwareheritage.org/faq/

Please let us know if there are any issues to consider before we launch the archival of the public repositories hosted on your infrastructure. Please use "Reply all" to ensure our system will process your answer properly.

In the absence of an answer to this message, we will start to archive your forge in the coming weeks. Only the publicly accessible repositories will be archived.

Thank you in advance for your help.

Kind regards, The Software Heritage team

-- bye, pabs


Steal your repos? So you don't have them anymore because they took them? Maybe ask nicely to have them back and with a bit of luck they'll give them back to you.


This is a sound point, even if it's being made a bit sarcastically. Copyright and intellectual property are an entirely different body of law than property rights, and people most often conflate them as an emotionally charged rhetorical technique that's generally rooted in a dissatisfaction that intellectual property and physical property are treated differently. The use of the word "steal" in this context is probably to incite readers to their cause, but ends up just sounding sophomoric to people that know anything about intellectual property law.

It's a shame because the GP had some valuable information to share about the emails software heritage was sending, but likely got downvoted because of this.


Problem is GP (me) is a dickhead about it.

but frankly, it's not the first time. Github (after purchase by microsoft) did the same thing to my code. They reproduced it and put it on ice in some place in norway. That was the guise right? I mean I am sure they did, but you can bet your ass they're also using my code to train their AI.

Most of it was not licensed. It was publicly available as a portfolio so people would see I write serious code and would hire me.

So then I took down my entire github after being informed I was enrolled, couldn't opt out, etc, and then some OTHER foundation is now scanning my self hosted git repos? That made my blood BOIL!


reproducing my work without license is theft.


I mean, this is a matter of law, and it isn't. It's a curious point to double down on.


Curiously, the people most vociferous about digital copying of IP not being “theft” tend to be those people who have produced the least amount of original IP.

Surprisingly, the people generating real, novel, IP like to buy groceries now and then.


Absurd, presumably anecdotal claim.

My own anecdotal experience is that it has no relation and I know many people who have violated the copyright on the IP they themselves produced.


How does one violate their own copyright? I mean, it's my IP, I am free to license, distribute, and so forth as I see fit, there's an implicit "All Rights Reserved".


People who produce IP either by work for hire or by selling their copyright rights.

I've had people "steal" IP by sending me a PDF of a paper they wrote whose copyright resides with their university. Not the typical image of a thief.


> However, we cannot prevent a publicly available from being archived by anyone using our "Save code now" feature

This is not true. They just had to remove about 500 public repos to comply with my copyright.


It's true as in, that's what they told me when I asked.


They're lying.


Have you figured out if there is a way to prevent them from doing that? The email I got, which I assume is similar, was suspiciously lacking a "just GTFO and leave my website alone" option.


At least some of their software has unique user agents ("Software Heritage*"). I wish a popular FOSS host like Codeberg would block them.
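
For anyone self-hosting behind a Python web app, a user-agent block could be sketched as a small WSGI wrapper like the one below. The prefix `"Software Heritage"` is taken only from the claim above and may not cover all of their tooling; `block_archivers` is a hypothetical name, and a user agent is trivially changed by the client, so this is a courtesy filter at best.

```python
# Assumption: the prefix comes from the comment above, not from official docs.
BLOCKED_PREFIXES = ("Software Heritage",)


def block_archivers(app, blocked=BLOCKED_PREFIXES):
    """Wrap a WSGI app so requests from blocked user agents get a 403."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        # str.startswith accepts a tuple and matches any of its prefixes.
        if agent.startswith(blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Archiving of this site is not permitted.\n"]
        return app(environ, start_response)

    return middleware
```

The same check is usually a one-line rule in a reverse proxy, which is probably the better place for it.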


They explicitly told me they could not blacklist me, but they did do a prompt takedown.


I blocked their IP address specifically. Contact me if you want it.


All of the comments here where the commenter found their repos say either "it only has the public ones" or "it has some private ones from x years ago and i don't quite remember if they were ever public".

So it seems probable this is a case of repo owners misremembering.


I agree. Only my public repos are in the data set.


Full text from the post:

> If you had code on GitHub at any point it looks like it might be included in a large dataset called “The Stack” — If you want your code removed from this massive “ai” training data go here:

> https://huggingface.co/spaces/bigcode/in-the-stack

> I found two of my old Github repos in there. Both were deleted last year and both were private. This is a serious breach of trust by Github and @huggingface.

> Remove all your code from Github.

> CONSENT IS NOT OPT-OUT.


> "Both were deleted last year and both were private"

I'd really like to know if they were ever public at any point because that might explain it.

None of my repos that have always been private are included (apparently).

That's not to say I'm not concerned by this...


This is interesting...

I have 66 repositories picked up and put in The Stack. I spot-checked the first 10. All 10 are Public on GitHub. 8 of the 10 do not have a license of any type, meaning they are covered by copyright, at least in the US, unless GitHub has some terms extending the license of public projects to 3rd parties.

One of mine is Private and was an extension I sold for a short time. I can't say if I ever made it public or not.

https://github.com/codazoda/like_roller


god the language in that opt-out link is patronizing as hell. do AI people just assume everyone is happy to be a cog in their monetization scheme?


They assume that they don't need to care and their assumption has so far proven correct.


Thought it was bait, but I can confirm I can find private repos in the search results. What the heck?


Since your repo's name is public now anyway, could you please help us and post it here? I'm really curious about what happened. Since public GitHub activities were archived [1], if you post it here we can check whether it was ever public or has truly been private at all times.

[1] https://www.gharchive.org/
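
A minimal sketch of that check, assuming you have already downloaded one of the hourly dump files from GH Archive. It also assumes the file is gzip-compressed with one JSON event per line, each with `type`, `created_at` and `repo.name` fields; older dumps may use a different layout. `events_for_repo` is a hypothetical helper name.

```python
import gzip
import json


def events_for_repo(archive_path, repo_name):
    """Yield (created_at, type) for each event touching repo_name.

    Assumes archive_path is a gzip file with one JSON event per line,
    each carrying "type", "created_at" and a "repo" object with a "name".
    """
    wanted = repo_name.lower()  # compare case-insensitively
    with gzip.open(archive_path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue
            event = json.loads(line)
            name = (event.get("repo") or {}).get("name", "")
            if name.lower() == wanted:
                yield event.get("created_at"), event.get("type")
```

Any hit at all means the repo produced public activity during that hour. No hits proves less, since the archive is known to have gaps.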


It seems the original ones (from the mastodon post) are

- https://github.com/emenel/dust

- https://github.com/emenel/portfolio

(based on https://archive.softwareheritage.org/browse/search/?q=emenel...)

Care to check them on gharchive? I bet they used to be public.


I checked my GitHub archive data and emenel/dust is not there, but emenel/portfolio is. Note, GitHub archive is not 100% accurate and it is missing 319 hours.


> based on

The data for The Stack’s dataset is sourced from the Software Heritage Archive, so checking that is redundant. We need different sources.


Even still, both repos had READMEs [1][2] clearly meant to be read by the public. The archival was only successful years ago, with a failed snapshot as far back as 2021 [3]. This really seems like they forgot it was ever public.

Now, this is only about it being a GitHub breach. Whether unlicensed (emenel/portfolio) or GPL (emenel/dust) code should be allowed in such datasets is a different matter.

[1] https://archive.softwareheritage.org/browse/origin/directory...

[2] https://archive.softwareheritage.org/browse/origin/directory...

[3] https://archive.softwareheritage.org/browse/origin/visits/?o...


I used SW Heritage to identify the repos that were used for training, since op did not post the repo names.

The “different source” is supposed to be ghactions.


How do I check this? I found a repo I'm pretty sure was always private on there that I deleted a while back. https://github.com/johncoates/JCBootstrap

There's no archived version on archive.org at least.


https://archive.softwareheritage.org/browse/origin/directory... shows a "snapshot date" of "11 August 2015, 07:28:00 UTC" - any chance it was public on that date such that the crawler could have accessed it?


I checked my GitHub archive (https://www.gharchive.org/) indexed data and the only repo that I saw for johncoates was LanscapeVideos, which has a last event time of 2015-06-09 07:09:52+02

It is important to note that GitHub archive is not 100% accurate and there are over 319 missing hours.


I can't find any reason why I would have made it public. I made the repo in 2014 for internal use and don't like to share projects like that. I'm pretty careful when releasing any code publicly. It's some code that other private projects depend on. I searched for any references in public code and there are none, so there should have been no reason to make it public.

Interestingly my public code with thousands of stars isn't in "The Stack".


This shouldn't be something where we're relying on recollection.

Presumably github repo privacy state has an audit trail. This would allow GH to prove / disprove claims on any given repo easily. I hope a rep steps in to do so.


Yeah I agree. Tried https://news.ycombinator.com/item?id=39771541 but there's nothing related to this repo. Does GitHub send an email out when you make something public? I don't have any emails related to this repo.


I just upgraded the tool at https://observablehq.com/@simonw/github-public-repo-history to use lowercase comparisons (previously it was case sensitive) so it's worth having another look.


Were those repos always private?


Which private repos? Any that you're willing to share (I get that sharing names of private repos goes against the whole idea of them being private!)


Since a lot of this depends on whether someone had ever had a private repo public in the past, I was hoping it could be resolved using the GitHub security audit log.

You can access that for your repos here: https://github.com/settings/security-log

Then search for repo:simonw/datasette or similar.

But... it looks like the audit log only goes back 6 months, so sadly it's not useful for reviewing this particular situation which involves repos that could have been 5 or more years old.

The ClickHouse copy of the GitHub Archive is useful for reviewing things and goes back a lot further. Try it here:

https://play.clickhouse.com/play?user=play

You can run this query to see relevant events for a specific username:

    with public_events as (
      select
        created_at as timestamp,
        'Private repo made public' as action,
        repo_name
      from github_events 
      where actor_login = 'simonw'
      and event_type in ('PublicEvent')
    ),
    most_recent_public_push as (
      select
        max(created_at) as timestamp,
        'Most recent public push' as action,
        repo_name
      from github_events
      where event_type = 'PushEvent'
      and actor_login = 'simonw'
      group by repo_name
    ),
    combined as (
      select * from public_events
      union all select * from most_recent_public_push
    )
    select * from combined order by timestamp

The PublicEvent one is "When a private repository is made public" according to https://docs.github.com/en/rest/using-the-rest-api/github-ev...

I just built a tool for running this query without having to type in the SQL: https://observablehq.com/@simonw/github-public-repo-history

Explained in this TIL: https://til.simonwillison.net/clickhouse/github-public-histo...
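
If you'd rather script it, the same query can be sketched in Python. ClickHouse accepts a query as the body of an HTTP POST, but the exact endpoint URL below is an assumption based on the playground address, and `build_query` / `run_query` are hypothetical helpers. The login check is there because the username is pasted straight into the SQL.

```python
import json
import re
import urllib.request

# Assumption: the playground's HTTP interface lives at this URL.
ENDPOINT = "https://play.clickhouse.com/?user=play"

QUERY_TEMPLATE = """
with public_events as (
  select created_at as timestamp, 'Private repo made public' as action, repo_name
  from github_events
  where actor_login = '{login}' and event_type in ('PublicEvent')
),
most_recent_public_push as (
  select max(created_at) as timestamp, 'Most recent public push' as action, repo_name
  from github_events
  where event_type = 'PushEvent' and actor_login = '{login}'
  group by repo_name
),
combined as (
  select * from public_events
  union all select * from most_recent_public_push
)
select * from combined order by timestamp
format JSONEachRow
"""


def build_query(login):
    """Fill in the username, rejecting anything that is not a plausible GitHub login."""
    if not re.fullmatch(r"[A-Za-z0-9-]{1,39}", login):
        raise ValueError(f"unexpected characters in login: {login!r}")
    return QUERY_TEMPLATE.format(login=login)


def run_query(sql, endpoint=ENDPOINT):
    """POST the SQL to the endpoint and parse one JSON object per output line."""
    request = urllib.request.Request(endpoint, data=sql.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request, timeout=60) as response:
        body = response.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line]
```

Something like `run_query(build_query("simonw"))` should then return one dict per row, with a 'Private repo made public' row for every repo that was flipped from private to public.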


[flagged]


Not all of us do, bro.


Need more proof than "pretty sure they were private" and "heard from a number of people".


From one of the comments on that post:

> I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.

> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.

> That's my best guess.


The OP responded to that:

https://hachyderm.io/@emenel@post.lurk.org/11212861313743638...

> thanks for the additional info. I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private. So i’m not sure what to think or how to explain it.


It blows my mind that we’re all fine with “the home of open source software” is closed source


The way I rationalize my use of Github to myself is by framing it as me stealing free compute and storage from the evil Microsoft.


Trust me, they're making it back.


I'd presume if nothing else in copilot subscriptions, populated by harvesting umpteen bazillion repos, issues, and maybe even CI job run logs



The accuracy of that story was vigorously denied by people in a position to know: https://twitter.com/natfriedman/status/1712140497127342404


i am pretty sure they would just clone stuff from BitBucket or SF if they didn't have it on their own platform.


especially with the number of actual FOSS alternatives available right now, but that network effect, whew, it's strong :-(


I am considering moving away because of the network effect. The more popular a repo is, the more work there is (issues, PRs...). It's not scalable.

Using another platform/self-host would introduce a friction.


You can disable issues.


That's not the point.

I am not opposed to issues per se. The problem is dealing with the amount, it's exhausting. It's just too easy to open an issue or comment on GitHub because of the network effect. This sums it up pretty well: https://nolanlawson.com/2017/03/05/what-it-feels-like-to-be-... (my workload is of course far smaller, but still it takes time and saps energy)

Yes, that issue has been opened for 4 years and I don't consider it important and neither did any of "me too" comments to do anything about it. At best I sometimes get drive by PR, which takes far more time to deal with than if I just did it myself.


What FOSS solutions give me an editor in the web, Codespaces, free CI/CD compute, free website hosting, etc.?


I feel as though your question conflates FOSS with free SaaS compute, but https://salsa.debian.org/help/instance_configuration#gitlab-... shows they are using the community edition (MIT https://gitlab.com/gitlab-org/gitlab-foss/-/blob/v16.10.0/LI... ) and using GitLab Pages so that's "editor in the web", "CI/CD compute", "website hosting" right there. I believe codespaces is "we run a docker container with vscode in it" so kind of a subset of "CI compute" but since I don't use that, I can't speak to whether it's included in the FOSS side of GitLab or not


> I feel as though your question conflates FOSS with free SaaS compute

My comment did not. The comment I replied to did. It said:

> especially with the number of actual FOSS alternatives available right now

I find GitLab quite lacking compared to GitHub and don't particularly see it as a compelling alternative to GitHub. Plus, there is no actual benefit to it being FOSS.


https://gitlab.com/gitlab-org/gitlab/-/merge_requests?scope=... would disagree as would the top section of every release notes post showing which community member made the biggest contribution to that GitLab release. That's not even counting the fun things I can do to my own copy of GitLab which I can modify and host for my organization's needs, no AGPL, no reverse engineering obfuscated ruby from a .vhd, just actual open source

You are welcome to find GitLab lacking (there's currently at least 66500 people who agree with you), and "compelling" is up to you, but to say it's not a full featured competitor to GitHub is disingenuous


I'm curious how they get around licenses. For example, I have repos that show up in The Stack. Some have licenses that require inclusion of the copyright in any source re-use or redistribution.

IANAL, but it seems like inclusion in the data set and subsequent distribution without the copyright notice would be a violation.


In The Stack FAQ, they claim that they are doing minimal analysis of the LICENSE file and SPDX tags.

I'd bet that this is enough to detect cases like GPL code, but I also bet that if this analysis fails instead of falling back to "unknown license, assume proprietary, don't copy" they fall back to "free lunch!". Because reasons.
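
For contrast, here is what the conservative fallback could look like. This is only a sketch of the "unknown license, don't copy" policy, not a description of what The Stack's pipeline actually does. The allowlist, the function names, and the idea of passing in a repo-level license ID from some other detector are all my assumptions.

```python
import re

# Hypothetical allowlist; a real pipeline would choose its own.
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"}

SPDX_TAG = re.compile(r"SPDX-License-Identifier:[ \t]*([^\r\n]*)")


def spdx_id(text):
    """Return the SPDX expression tagged in a file header, or None."""
    match = SPDX_TAG.search(text)
    if not match:
        return None
    # Drop a trailing C-style comment terminator, then surrounding spaces.
    return match.group(1).replace("*/", "").strip()


def may_include(file_text, repo_license_id=None):
    """Decide whether a file may be copied, defaulting to "no".

    A tag in the file wins over the repo-level license. Anything that is
    not an exact allowlist match (unknown, missing, or a compound
    expression) is treated as proprietary.
    """
    license_id = spdx_id(file_text) or repo_license_id
    return license_id in PERMISSIVE
```

With this, an untagged file in an unlicensed repo gives `may_include(text)` as false, which is the opposite of the "free lunch" fallback.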


At least one of my repos with no license/spdx was excluded, though the source files do say "all rights reserved" in them.


I suspect you're right.

Even though even permissive licenses like MIT and BSD require attribution and preservation of copyright notices. Maybe their AI just can't "reliably detect" licenses.


Probably some obscure GH legal clause stating “we own your data. Ownership is implied and we may do anything with it. Private vs public is a concept of accessibility over the internet. It does not necessarily mean the data is inaccessible via intranet or other non-public means”

It’s similar to the legal clauses used for decades on social and video hosting platforms.


I don't think that gets around the code licenses. They may use it as if it does, but I'm not convinced that would hold up if someone were wealthy enough to make a court case.


Interestingly, GH doesn't grant itself or others that many rights.

---

  You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

  This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
---

  Any User-Generated Content you post publicly, including issues, comments, and contributions to other Users' repositories, may be viewed by others. By setting your repositories to be viewed publicly, you agree to allow others to view and "fork" your repositories (this means that others may make their own copies of Content from your repositories in repositories they control).

  If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
---

They don't even include training for Copilot (though a dodgy lawyer will likely try to include that as "part of providing the service"). Third parties only get a license to fork your repo, seemingly not even a license to do anything with that repo. (And hot take: GitHub should just let people disable the fork button already.)


What if we include the Microsoft Services Agreement? GitHub Copilot is a Microsoft AI Service


47 of my repos are in the stack, and none of them are private (and I have plenty of private repos)


Same here. I think some people are confusing private now vs private since creation.


I do think, however, that "repo no longer public" should be treated as an opt-out signal.


Which is valid, but what's the timeframe for rechecking and validating if there's been a change? I'll posit it's practically impossible to catch every single change within a reasonable timeframe to ensure that "a repo no longer public, today, within a data set collected earlier" would be excluded.

I may be misunderstanding your suggestion. If so, I'm curious to learn what I missed.


My assumption is that the "my private repo was included" crowd are misremembering that their repo used to be public at some point in the distant past (~years). My suggestion would be that they re-scan before each major revision of the dataset (such as this upcoming "v2" release). This would be a relatively expensive process given the numbers involved, but so be it.

(edit: I think "v2" is already released, but you get my point)
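
A rough sketch of what such a re-scan could look like, assuming the dataset keeps an "owner/repo" name for each GitHub entry. The function names are made up for illustration. The check relies on the GitHub REST API returning 404 to unauthenticated requests for private or deleted repositories, and unauthenticated requests are rate limited, so a real run over millions of repos would need tokens and batching.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def github_is_public(full_name: str) -> bool:
    """Unauthenticated lookup: 200 means public, 404 means private or deleted."""
    url = f"https://api.github.com/repos/{full_name}"
    request = urllib.request.Request(url, headers={"User-Agent": "rescan-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits and server errors should not count as "private"


def still_public(repos: Iterable[str], is_public: Callable[[str], bool]) -> List[str]:
    """Keep only the repos that are still public at re-scan time."""
    return [name for name in repos if is_public(name)]
```

Calling `still_public(names, github_is_public)` before cutting a release would treat "no longer public" as an opt-out; the checker is passed in so it can be swapped for a cached or authenticated one.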


I can get behind this idea. Thank you for clarifying.


None of mine are in the stack and I am now personally offended that all those crappy little one-file Swift and sh utilities will be excluded from the singularity


From the reports it looks like they have to have been private and deleted


Hmm, I do see some of my repos there, but only public ones. It would seem a bit too foolish of GitHub to put private repositories in a public dataset.

Is there any other corroboration or proof?


It would be fun to upload a private repo with AWS keys in it as a canary. If the account ever gets used, the repo is no longer private?


As with others, I can say they only seem to have my repos that were ever public. Nothing of mine that was always private is showing.


Same here, public only


Yep - same. My public repos show but not my private one.


I also see a 7-year-old private repo from my account. But I may have made it public initially; I don't really remember. Does anyone know how to check if a repo was ever public?


Check if it ever appeared in GH Archive [1].

[1] https://www.gharchive.org/
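
A minimal sketch of that check, assuming you have already downloaded one or more of the hourly dumps GH Archive publishes (gzipped, one JSON event per line) and that each event carries the repository as `repo.name` in "owner/repo" form, which is how the post-2015 dumps look as far as I know. Older dumps use a different layout and would need their own handling. The function name is made up for this example.

```python
import json
from typing import Iterable


def repo_seen_in_events(lines: Iterable[str], full_name: str) -> bool:
    """True if any event line names the repo; the archive records public events only."""
    target = full_name.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip damaged lines instead of aborting the scan
        if not isinstance(event, dict):
            continue
        repo = event.get("repo")
        if isinstance(repo, dict) and str(repo.get("name", "")).lower() == target:
            return True
    return False
```

To run it over a downloaded file, open it with `gzip.open(path, "rt", encoding="utf-8")` and pass the file object as `lines`. A hit means the repo produced a public event during that hour, so it was public at that time; no hit in one file proves nothing, since you would have to scan every hour the repo existed.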


If the repo was public for a while (say 2 years), and then became private, then the code in the Stack for your repo must be 5 years old.

If on the other hand, it's tracked to the latest commit, that is a different scenario.


The Wayback Machine [0] if you're lucky (?) enough to have had it indexed.

[0]: https://web.archive.org/


It is always interesting to see how copilot autocompletes things like "Todo: " or "my ssh key" and see what data leaks through!


I do not use private GitHub repositories. (If I want something private, I will store it on my own computer and/or on DVDs, etc.)

If they are using private data in an AI dataset (or for other uses), then that is a serious issue; they are copying data which is meant to be private. (Public files are public, though, and copies should be available for others to use too.)


So let me understand this.

Step 1) "Software Heritage" crawls the web (not only GitHub; they definitely crawl ANY GitLab instances they find online, among practically everything else). They "store" ANY type of source code, regardless of the license of that code. They admit as much in their own FAQ ( https://www.softwareheritage.org/faq/#24_What_is_the_policy_... ) :

> Software Heritage archives everything that is publicly available, without preliminary tests or checks [for LICENSE file or others]. You are responsible for checking whether the source code you find in the archive can be reused, and under which terms.

Step 2) "Hugging Face" uses the Software Heritage dataset to build some AI training dataset ("The Stack"), again, completely ignoring licensing. You apparently have to manually opt-out if you don't want your COPYRIGHTED source code to be included there. But as far as I can see, opting out is only considered AT ALL for Github repositories via https://huggingface.co/datasets/bigcode/the-stack-v2 . If you have your source code published outside Github, then your code appears to be used, period.

> The Stack v2 is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack v2 must abide by the terms of the original licenses [...]

Step 3) Eventually and inevitably someone releases some model trained on this dataset, ignoring licensing again.

Step 4) Someone uses code generated by such a chatbot, unknowingly violating every software license known to man.

Step 5) ???

Step 6) Profit!

Where do these "Software Heritage" guys mention which user-agent their crawler is using so that I can permanently ban them from my websites?

At least archive.org does that and they have a much nicer way to request exclusion from their archive.


I had this problem with Software Heritage for non-Open Source stuff in my personal Gitea.

I eventually had to specifically block their IP addresses from accessing my Gitea and threaten legal action before they took the offending repos down.

I still have them blocked.


You lost me at step four. If reading source is fair use and generating content isn't derivative work, this is kosher. As litigation mounts, we'll learn more.

To me, a learning path that's acceptable for humans is acceptable for AIs. All of the rationales for treating AI and human learning differently feel flimsy and arbitrary. "The difference is learning scale." So what? "LLMs just parrot derivatives of their training data." No, they don't. That's demonstrably false, and theoretically absurd, for all the commonly cited reasons. "AIs obviously aren't people. So their learning restrictions must be different." Why? "People are financially benefiting from the AI's knowledge!" Yep, that's generally how employment works. Etc, etc.

My take could be totally wrong. I get that. It'll be interesting to see where the courts fall on the issue. But personally, the critics' arguments feel profoundly crummy and counterproductive; applying their logic to pre-LLM systems usually results in horrendous alternative realities. But that's just, like, my opinion, man.


Having an agenda to push much?

Anyway, this is strictly one level above that debate, as here they are scraping source code which is NOT free, and whose licenses could explicitly forbid copying for use in training datasets, or by the military, or even by people whose eye color I don't like.

For example, example code for proprietary tools which usually allows you to only copy it strictly for purposes of extending the proprietary tool.


Opt-out consent is the equivalent of assuming a partner consents because they haven't said anything.

Consent is not opt-out. Yes means yes.


This organization claims that they'll allow you to opt out, but they seemingly haven't done so for the oldest request on their repository [0]. Pathetic.

[0]: https://github.com/bigcode-project/opt-out-v2/issues/1


Even though they were only my public repos in there, I really don't like this being opt out instead of opt in, especially since some of those repositories were not under fully permissive licenses


I and most of my coworkers have all found at least one private repo which we believe was always private to be in the data set version 2.


This is a nothingburger.

I have "private" repos listed in the dataset but they were all at one time public. Searching the SoftwareHeritage site I can find those once-public repos with ancient commits.

My private repos that were always private are not listed in the dataset.



Could "private" mean "kept from the outside world"?


Time for an end to end encrypted Git Service.


Time to start large scale poisoning of repos.


Just to add yet another data point: only my public repos are included. None of my private repos are.


I'm guessing these were public once (mine were), but I'll be damned if I let github host anything of mine again.


Encrypt your private repos..


Is there any other good service for enterprise development?


I mean, there are also a lot of "public" GitHub repos that contain absolutely copyrighted work too.


[flagged]


Was there a previous time?


I am Jack's complete lack of surprise.

In all seriousness, without strong data privacy regulations (e.g., GDPR), we will continue to see this sort of stuff, as the potential monetary rewards for using this sort of data far outweigh the potential liability. Cost of doing business type stuff, rather than an existential risk for abusing public trust.

My opinion is that data should be treated like a fissile element: very dangerous to hold and store, but extremely powerful when properly employed. However, it's only dangerous if the liability of storing it is significant and real; as of today, it's not (in the US).


The crypto craze stole our energy to validate digital currency transactions on the blockchain.

Now the AI craze stole our code and collective knowledge to ultimately train their models, and hopefully replace SWEs and other knowledge-based fields (e.g., medicine).

At least with blockchain, some people got rich. But with the AI craze the only people getting rich are the rich themselves.


> At least with blockchain, some people got rich.

But plenty more lost money or went bankrupt. The ones who got rich did so at the expense of other non-rich people.


> But plenty more lost money or went bankrupt. The ones who got rich did so at the expense of other non-rich people.

[citation needed]



LOL what about the amounts the banks and bankers stole from us in the last decade and a half? Why don't you complain about that also? I'm sure it dwarfs whatever scammers have stolen in crypto. But you guys hate that you missed your chance. But it's never too late. You can still invest. It's not over yet. You can't teach me anything at all about the world of crypto, believe me. I'm one of the MtGox creditors, and that has been going on for 10 years now



Logical reasoning is still a valuable tool.

Where else would the money have come from?


I got "rich" also, but I've never speculated, and TBH I've sold and spent (more spent than sold) at all kinds of valuations, and most of the riches are still unrealized, so, sorry if I doubt what you say.



