TL;DR: I think it is important for Debian to consider AI models free
even if those models are based on models that do not release their
training data. In the terms of the DFSG, I think that a model itself is
often a preferred form of modification for creating derived works. Put
another way: I no longer think toxic candy is as toxic as it seemed
when I first read lumin's original ML policy.
If we focus too much on availability of data, I think we will help the
large players and force individuals and small contributors out of the
free software ecosystem.
I will be drafting a GR option to support this position.
I do not see how proposal A harms the ecosystem. It just prevents huge
binary blobs from entering Debian's main section of the archive. It
does not stop people from uploading those blobs to the non-free
section.
On Wed, 2025-02-05 at 07:45 -0700, Sam Hartman wrote:
If we are saying that to be open source software, any model you use
needs to provide full training data up to the original training run
with random parameters, I think that would harm our community.
I’d like to remind you that these huge binary blobs still contain,
in lossily compressed form, illegally obtained and unethically
pre-prepared, copies of copyrighted works, whose licences are not
honoured by the proposed implementations.
As such I cannot consider them acceptable even for Debian’s non-free.
While the act of training such a model *for data analysēs* may be
legal, distributing it, or distributing output gained from it that is
not, to quote the copyright law, a “pattern, trend [or] correlation”,
isn’t legal.
"Large language models (LMs) have been shown to memorize parts of
their training data, and when prompted appropriately, they will emit
the memorized training data verbatim."
I don't think we should focus our conversation on LLMs much, if at all.
The reason is that, even if a completely free-as-in-freedom (including
in its training dataset), high quality LLM were to materialize in the
future, its preferred form of modification (which includes the dataset)
will be practically impossible to distribute by Debian due to its size.
So when we think of concrete examples, let's focus on what could be
reasonably distributed by Debian. This includes small(er) generative AI
language models, but also all sorts of *non-generative* AI models,
e.g., classification models. The latter do not generate copyrightable
content, so most of the issues you pointed out do not apply to them.
Other issues still apply to them, including bias analyses (at a scale
which *is* manageable, addressing some of the issues pointed out by
hartmans) and ethical data sourcing.
"Sam" == Sam Johnston <samj@samj.net> writes:
On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <tg@debian.org> wrote: […]
I’d like to remind you that these huge binary blobs still contain,
in lossily compressed form, illegally obtained and unethically
pre-prepared, copies of copyrighted works, whose licences are not
honoured by the proposed implementations.
As such I cannot consider them acceptable even for Debian’s non-free.
Agreed, we know these models can and do routinely recall training data
in the course of normal operation[1].
We also know that even models carefully designed to avoid this, often
using guardrails that would be trivially removed when running locally
rather than as a service like OpenAI, will divulge their secrets if
coerced[2].
The OSI and others arguing […] demonstrates that they either do not
understand the technology or, worse, that they do and are trying to
deceive us.
For me, the debate should end here.
While the act of training such a model *for data analysēs* may be
legal, distributing it, or distributing output gained from it that is
not, to quote the copyright law, a “pattern, trend [or] correlation”,
isn’t legal.
Some 4D chess players have argued that a model is not copyrightable,
as it is merely "a set of factual observations about the data", and
that training is excusable (if unethical) under fair use and
text-and-data-mining exemptions.
This ignores the intentions of the authors of the content: unless
otherwise specified, content is published with "all rights reserved"
by default.
In any case, the result is "a statistical model that spits out
memorized information [that] might infringe [...] copyright". The
exemptions relied upon for training do not extend to reproduction
during inference, for which a test of “substantial similarity” would
apply (otherwise one might argue such copyright violations are
coincidental).
Allowing this would be knowingly shipping obfuscated binary blobs in
main, akin to a book archive (Authors Guild v. Google, 2015) with
trivially reversible encryption, or a printer driver that can
spontaneously reproduce copyrighted content from memory.
Digital Public Goods (albeit not yet certified like Fedora[3]): the
DPGA has just today "finalized the decision to make training data
mandatory for AI systems applying to become DPGs. This requirement
will […]"
While I'm not trying to be alarmist, I am alarmed. Our community was
built on respect for rights, and dropping this principle out of
expediency now would be a radical departure from the norm. I don't
think this is clear enough in lumin's proposal and "Toxic Candy".
let's focus on what could be reasonably distributed by Debian. This
includes small(er) generative AI language models, but also all sorts of
*non-generative* AI models, e.g., classification models.
The latter do not generate copyrightable content,
so most of the issues you pointed out do not apply to them.
The reason is that, even if a completely free-as-in-freedom (including
in its training dataset), high quality LLM were to materialize in the
future, its preferred form of modification (which includes the dataset)
will be practically impossible to distribute by Debian due to its size.
Other issues still apply to them, including bias analyses (at a scale
which *is* manageable, addressing some of the issues pointed out by
hartmans), and ethical data sourcing.
Preferred Form of Modification
==============================
[...]
As a practical matter, for the non-monopolies in the free software
ecosystem, the preferred form of modification for base models is the
models themselves.
I would have strongly disagreed with this until a short while ago, and
stated that unless I can run a modified training process -- which
would require the training data -- I don't have the preferred form of
modification.
However, recent advances point to new useful models being built from
other models, for example what DeepSeek accomplished with Llama. They obviously didn't have the original training data, yet still built
something very useful from the base model.
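The point about building on a base model without its training data can
be illustrated with a toy sketch of the adapter idea behind techniques
like LoRA: the base weights stay frozen (we could not retrain them
anyway, lacking the data), and adaptation lives entirely in a small
additive low-rank update. Every matrix and shape below is invented for
illustration; this is not how any particular model was actually
adapted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# "Base model": a frozen weight matrix whose training data we do not
# have and cannot reproduce.
W_base = rng.standard_normal((d, d))

# New behaviour we want from the adapted model on one probe input:
x = rng.standard_normal(d)
y_target = rng.standard_normal(d)

# LoRA-style idea: keep W_base untouched and ship only a low-rank
# update B @ A. Here a rank-1 correction is computed in closed form
# for illustration (real adapters are learned by gradient descent).
residual = y_target - W_base @ x
A = (x / (x @ x)).reshape(1, d)   # rank-1 "down" projection
B = residual.reshape(d, 1)        # rank-1 "up" projection

W_adapted = W_base + B @ A        # base weights remain frozen
print(bool(np.allclose(W_adapted @ x, y_target)))  # True
```

The adapter (B and A) is tiny compared to W_base and is fully the
derivative author's work; yet, as noted below, it is useless if the
frozen base model it presupposes ever disappears.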
So I now have a slight doubt. But it is only slight; my gut says that
even many useful derivations cannot "heal" an initial problem of
free-ness. Because if the original base model were to disappear (as you
put it in "Free Today, Gone Tomorrow"), all derivations in the chain
would lose their reproducibility, too.
Dear lumin:
First, thanks for all your work on AI and free software.
When I started my own AI explorations, I found your ML policy
inspirational in how I thought about AI and free software.
With my Debian hat on, I don't really care whether base models are
considered free or non-free. I don't think it will be important for
Debian to include base models in our archive. What I do care about is
what we can do with software that takes base models and adapts them
for a particular use case.