There's a story going around at the moment that people have found code from their private GitHub repositories in the AI training data known as The Stack, using this search tool: https://huggingface.co/spaces/bigcode/in-the-stack
I'm very doubtful that private data has been included in that training set. I think it's far more likely that the repositories in question were public at some point in the time, and were gathered up by the https://www.softwareheritage.org/ project when they archived code from GitHub.
But how can we tell if a private repository was public at some point in the past?
GitHub have a security audit log for logged in users, but sadly it appears to only cover the past six months.
For a longer history, we can look things up in the GitHub Archive project, which has been recording public events from the GitHub API since 2011.
TLDR: I built a tool for this here: https://observablehq.com/@simonw/github-public-repo-history
The ClickHouse team provide a public tool for querying that data using SQL as a demo of their software. We can use that to try and find out if a repository was public at some point in the past.
Access the tool here - no login required: https://play.clickhouse.com/play
Now execute the following SQL, replacing my username with yours in both places where it occurs:
with public_events as (
select
created_at as timestamp,
'Private repo made public' as action,
repo_name
from github_events
where lower(actor_login) = 'simonw'
and event_type in ('PublicEvent')
),
most_recent_public_push as (
select
max(created_at) as timestamp,
'Most recent public push' as action,
repo_name
from github_events
where event_type = 'PushEvent'
and lower(actor_login) = 'simonw'
group by repo_name
),
combined as (
select * from public_events
union all select * from most_recent_public_push
)
select * from combined order by timestamp
The result is a combined timeline showing two things:
PublicEvent
events - which GitHub describes as "When a private repository is made public. Without a doubt: the best GitHub event."PushEvent
for each repository. Repositories which started life public won't show up in the PublicEvent
list, so this aims to capture them.Here's an extract from the data I get back when I run the query for myself:
I put together an Observable Notebook that provides a UI for executing this query: https://observablehq.com/@simonw/github-public-repo-history
It uses just three cells of JavaScript. The first provides a username input, with a submit button to avoid firing off SQL queries while the user is still typing their name:
viewof username = Inputs.text({
placeholder: "Your GitHub username",
submit: true
})
The second executes the query using the ClickHouse JSON API, described previously:
results = username.trim() &&
(
await fetch("https://play.clickhouse.com/?user=play", {
method: "POST",
body: `with public_events as (
select
created_at as timestamp,
'Private repo made public' as action,
repo_name
from github_events
where lower(actor_login) = '${username.trim().toLowerCase()}'
and event_type in ('PublicEvent')
),
most_recent_public_push as (
select
max(created_at) as timestamp,
'Most recent public push' as action,
repo_name
from github_events
where event_type = 'PushEvent'
and lower(actor_login) = '${username.trim()}.toLowerCase()'
group by repo_name
),
combined as (
select * from public_events
union all select * from most_recent_public_push
)
select * from combined order by timestamp FORMAT JSON`
})
).json()
The third conditionally shows a table of results if the data has been fetched:
table = {
if (results && results.data) {
return Inputs.table(results.data);
} else {
return null;
}
}
Here's what it looks like running on Observable:
Created 2024-03-20T13:49:53-07:00, updated 2024-03-20T14:32:11-07:00 · History · Edit