• A Different Take on AI

    From Sam Hartman@21:1/5 to All on Wed Feb 5 16:00:01 2025
    TL;DR: I think it is important for Debian to consider AI models free
    even if those models are based on models that do not release their
    training data. In terms of the DFSG, I think that a model itself is
    often a preferred form of modification for creating derived works. Put
    another way, I don't think toxic candy is as toxic as I thought it was
    when reading lumin's original ML policy.
    If we focus too much on availability of data, I think we will help the
    large players and force individuals and small contributors out of the
    free software ecosystem.
    I will be drafting a GR option to support this position.


    Dear lumin:

    First, thanks for all your work on AI and free software.
    When I started my own AI explorations, I found your ML policy
    inspirational in how I thought about AI and free software.
    As those explorations have progressed, often involving attempts to
    change or remove bias from models, I have come to think somewhat
    differently than you did in your original ML policy.

    I apologize that I did not include a lot of references in this message.
    I found I was having trouble finding enough time to write it at all.
    I wanted to give you some notice that I planned to draft what I believe
    is a competing GR option, and doing that took the time I had.
    I am not a researcher by trade, and I do not have all the references and
    links I wish I did at hand.
    I'm just a free software person who has been working on AI as a side
    project, because I hope it can make parts of the world I care about
    better.

    As I understand it, you believe that:

    1) Looking at the original training data would be the best approach for
    trying to remove bias from a model.

    2) It would be difficult/impossible to do that kind of work without
    access to the original training data.

    I have come to believe that:

    1) AI models are not very transparent even if you have the training
    data. Taking advantage of the training data for a base model is probably
    outside the scope of most of us even if we had it. That's definitely
    true for retraining a model, but I think it is also true for
    understanding where bias is coming from. That's for base models; I think
    that fine-tuning datasets for things like Open Assistant are within the
    scope of the masses to examine and use.

    2) I think that retraining, particularly with techniques like ORPO, is a
    more effective strategy for the democratized (read non-Google, non-Meta)
    community to change bias than working from the training data.
    In other words, I am not convinced that we would use training data, even
    if we had it, to adjust the bias of our models.
    Which is to say, I think the preferred form of modification for models
    is often the model itself rather than the training data.

    Goals
    =====


    I think both of us care about democratizing AI. We are more interested
    in preserving individuals' ability to modify and create software than in
    promoting monopolies or advantaging OpenAI, Meta, Google, and the like.
    I think we may disagree about how to do that.

    With my Debian hat on, I don't really care whether base models are
    considered free or non-free. I don't think it will be important for
    Debian to include base models in our archive.
    What I do care about is what we can do with software that takes base
    models and adapts them for a particular use case.
    If LibreOffice gained an AI assistant, our users would be well served if
    we were able to include a high-quality AI assistant that preserves their
    core freedoms.
    With my Debian hat on, I care more about what we can do with things
    derived from base models than about the base models themselves.

    Core Freedoms
    =============

    I think that the core freedoms we care about are:

    1) Being able to use software.
    2) Being able to share software.
    3) Being able to modify software.
    4) Transparency: being able to understand how software works.

    Debian has always valued transparency, but I think the DFSG and our
    practices have always valued transparency less than the other freedoms.
    There's nothing in the DFSG itself that requires transparency.
    We've had plenty of arguments over the years about things like minimized
    forms of code and whether they met the conditions of the DFSG.
    One factor that has been raised is transparency, but it mostly gets
    swept aside by the question of whether we can modify the software.
    The idea appears to be that if we have the preferred form of
    modification, that's transparent enough.
    If the upstream doesn't have any advantage in transparency over us,
    well, we decide that's free enough.

    One argument that has come up over the years when looking at vendoring
    is whether replacement is the preferred form of modification.
    Say I have some vendored blob that is a minimized representation of an
    upstream software project: minified JavaScript or some form of byte
    code.
    Most of the time I'm going to modify that by replacing it entirely with
    a new upstream version.
    So, at least for vendored code, is that good enough?
    Generally we've decided that no, it is not.
    We want individuals to be able to make arbitrary modifications to the
    code, not just replace it.

    My claim is that this analysis works differently for AI than for minified JavaScript.

    AI is Big
    =========

    I cannot get my head around how big AI training sets for base models
    are.

    I was recently looking at the DeepSeek Math paper [1]:

    [1]: https://arxiv.org/abs/2402.03300

    As I understand it, they took their DeepSeek Coder model as a base.
    So that's trained on some huge dataset--so big that they didn't even
    want to repeat it.

    Then they had a 1.2 billion token dataset (say 6 GB of uncompressed
    text) that they used for a training round--some sort of fine-tuning
    round.

    Then they applied 2**17 examples (so over a hundred thousand examples)
    where they knew both the question and a correct answer.
    But the impressive part for me was how the 1.2 billion token dataset was
    produced. I found the discussion of that process fascinating: it
    involved going over a significant chunk of the Common Crawl dataset,
    which is mind-bogglingly huge, to figure out which fraction of that
    dataset talks about math reasoning.
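
    To give a feel for the magnitudes involved, here is a trivial
    back-of-envelope check using the figures above; the bytes-per-token
    ratio is my own rough assumption, not a number from the paper.

        # Rough size arithmetic for the figures above (bytes-per-token is an assumed average).
        tokens = 1_200_000_000          # "1.2 billion token dataset"
        bytes_per_token = 5             # assumed average for English web text
        print(f"fine-tuning corpus ~ {tokens * bytes_per_token / 1e9:.1f} GB of raw text")

        examples = 2 ** 17              # the supervised question/answer examples
        print(f"2**17 = {examples} examples")

        # Common Crawl, by contrast, is orders of magnitude larger than the
        # few-gigabyte corpus that was distilled out of it.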

    Searching the 1.2 billion token dataset is clearly within our
    capability.
    But it's not at all clear to me that I could find what in a 6 GB dataset
    is producing bias.
    I think it would be quite possible to hide bias in such a dataset
    intentionally, in such a way that even given the 1.2 billion tokens we
    would find it difficult to remove the bias by modifying the dataset. I
    think there will certainly be unintentional bias there that I could not
    find.

    So, to really have the training data, we need Common Crawl, and we need
    the scripts and random seeds necessary to reproduce the 1.2 billion
    token dataset.
    I also believe there was at least one language model in that process, so
    you would also need the training data for that model.

    I am quite sure that finding bias in something that large, or even
    examining it, is outside the scope of all but well-funded players.

    I am absolutely sure that reducing Common Crawl to the 1.2 billion
    tokens--that is, actually running the data analysis, including all the
    runs of any language models involved--is outside the scope of all but
    well-funded players. In other words, taking that original training data
    and using it as a preferred form of modification is outside the scope
    of everyone we want to center in our work.

    And then we're left repeating the process for the base model DeepSeek
    Coder.

    My position is that by taking this approach we've sacrificed
    modifiability for transparency, and I am not even sure we have gained
    transparency at a price that is available to the members of the
    community we want to center in our work.
    With this focus on data, we have taken the wrong value trade-off for
    Debian.
    Debian has always put the ability to modify software first.

    Free Today, Gone Tomorrow
    =========================

    One significant concern I know lumin is aware of with requiring data is
    what happens when the data is available today but not tomorrow.
    One of the models that tried to be as open as possible ran into problems
    because its authors were forced to take down part of their dataset after
    the model was released.
    (I believe it was a copyright issue.)

    The AI copyright landscape is very fluid.
    Right now we do not know what is fair use.
    We do not even have a firm ethical ground for what sharing of data for
    AI training should look like in terms of social policy.

    We run a real risk that significant chunks of free software will depend
    on what we believe is a free model today, only to have it reclassified
    as non-free tomorrow when some of the training data is no longer
    available to the public.

    We run significant risks when different jurisdictions have different
    laws.

    It is very likely that there will be cases where models will still be distributable even when some of the training datasets underlying the
    model can no longer be distributed.

    So, you say, let's have several models and switch from one to another if
    we run into problems with one model.
    Hold that in the back of your mind. We'll come back to it.

    Debian as Second Class
    ======================

    I am concerned that if we are not careful the quality of models we are
    able to offer our users will lag significantly behind the rest of the
    world.
    If we are much more strict than other free-software projects, we will
    limit the models our users can use.
    Significant sources of training data will be available to others but not
    our users.
    I suspect that models that only need to release data information rather
    than training data will be higher quality, because they can have access
    to things like published books: works that can be freely used but not
    freely distributed, and the like.

    Our social contract promises we will value our users and free software.
    If we reduce the selection (and thus quality) of what we offer our
    users, it should somehow serve free software.
    In this instance, I believe that it probably does not meaningfully serve
    transparency, and it harms our core goal of making software easy to
    modify. In other words, I do not believe free software is being helped
    enough to justify disadvantaging our users.

    Preferred Form of Modification
    ==============================

    I talked earlier about how, if one model ended up being non-free, we
    could switch to another one.
    That happens all the time in the AI ecosystem.
    A software system has a fine-tuning dataset.
    Its authors might fine-tune a version of Llama 3, Mistral, or one of the
    newer models, each against that dataset.
    They will pick the one that performs best.
    As new models come out, the base model for some software might switch.

    As a practical matter, for the non-monopolies in the free software
    ecosystem, the preferred form of modification for base models is the
    models themselves.
    We switch out models and then adjust the code on top of that, using
    various fine-tuning and prompt-engineering techniques to adapt a model.
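
    As a rough illustration of this "swap the base model, keep the
    adaptation" workflow, here is a sketch using LoRA adapters via the peft
    library; the candidate model names are placeholders, and the actual
    training step on our own fine-tuning data is elided.

        # Sketch: the same adaptation recipe applied to interchangeable base models.
        # The adapter recipe (and our fine-tuning data) is what we maintain; the base is swappable.
        from peft import LoraConfig, get_peft_model
        from transformers import AutoModelForCausalLM

        ADAPTER_RECIPE = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

        def adapt(base_id: str):
            """Attach our small, trainable adapter to whichever base model is current."""
            base = AutoModelForCausalLM.from_pretrained(base_id)   # the base weights stay frozen
            return get_peft_model(base, ADAPTER_RECIPE)            # ...then train this on our data

        # When a better or freer base appears, we swap the identifier and re-run the same recipe:
        model_a = adapt("org-a/base-model-7b")    # hypothetical candidate bases
        model_b = adapt("org-b/other-base-7b")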

    The entire ecosystem has evolved to support this. There are
    competitions between models with similar (or the same) inputs.
    There are sites that allow you to interact with more than one model at
    once so you can choose which works best for you and switch out. (Or get
    around biases or restrictions, perhaps using ChatGPT to write part of a
    story, and a more open model to write adult scenes that ChatGPT would
    refuse to write.)

    On the other hand, I did talk about fine-tuning and task-specific or
    program-specific datasets.
    Many of those are at scales we could modify, and fine-tuning models (or
    producing adapters) based on those datasets is part of the preferred
    form of modification for the programs involved.

    What I want for Debian
    ======================

    Here's what I want to be able to do for Debian:

    * First, the bits of the model--its code and parameters--need to be
    under a DFSG-free license. So Llama 3 is never going to meet Debian
    main's needs under its current license.

    * We look at what the software authors actually do to modify models they
    incorporate to determine the preferred form of modification. If in
    practice they switch out base models and fine-tune, that's okay. In
    this situation we probably would need full access to the fine-tuning
    data, but not the training data for the base model.

    * Where it is plausible that the preferred form of modification works
    this way, we effectively cut off the source-code chain there and do not
    look further. If you are integrating model x into your software, your
    software is free if model x is under a free license and any fine-tuning
    data/scripts you use are free. I.e., if our users could actually go from
    upstream model x to what the software uses, that's DFSG-free enough even
    if the user could not reproduce model x itself.

    I firmly believe that the ability to retrain models to change their bias
    without access to the original training data will only continue to
    improve.
    Especially with techniques like ORPO, my explorations suggest that for
    smaller models we may have already reached a point that is good enough
    for free software.

    So What about the OSI Definition
    ================================

    I don't know.
    I think it depends on how the OSI definition treats derivative works.
    If what we're saying is base models need to release training data, I
    think that would harm the free software community.
    It would mean free models were always of lower quality than proprietary
    models, at least unless the fair use cases go in a direction where all
    the models are of low quality.
    I think data information is best for base models.

    If instead what we're saying is that OSI's definition is more focused
    on software incorporating models, and it is okay to use a model without
    fully specified data as an input so long as you give all the data for
    what you do to that model in your program, I could agree.

    If we are saying that to be open source software, any model you use
    needs to provide full training data up to the original training run with
    random parameters, I think that would harm our community.

  • From M. Zhou@21:1/5 to Sam Hartman on Wed Feb 5 18:30:01 2025
    Hi Sam,

    Thank you for the input. I see your point, and those points are exactly
    why I wrote proposal B in my draft. Here is my quick response after
    going through the text.

    On Wed, 2025-02-05 at 07:45 -0700, Sam Hartman wrote:

    TL;DR: I think it is important for Debian to consider AI models free
    even if those models are based on models that do not release their
    training data. In terms of the DFSG, I think that a model itself is
    often a preferred form of modification for creating derived works. Put another way, I don't think toxic candy is as toxic as I thought it was when reading lumin's original ML policy.
    If we focus too much on availability of data, I think we will help the
    large players and force individuals and small contributors out of the
    free software ecosystem.
    I will be drafting a GR option to support this position.

    I want to point out that in "preferred form of modification for creating DERIVED WORKS", the "derived works" part is where your proposal (and proposal B) differs from proposal A.

    Proposal A (toxic candy is not free software) preserves the full freedom
    for derived works, but also the freedom to inspect, study, reproduce, and
    modify the original base model. Covering only derived works is not an
    integral freedom.

    Proposal B (toxic candy is free software) is similar to treating those base models as blobs (such as firmware) that no free software community can really handle at the current stage.

    I do not see how proposal A harms the ecosystem. It just prevents huge
    binary blobs from entering Debian's main section of the archive. It does
    not stop people from uploading the binary blobs to the non-free section.

    General AI applications are not something to worry about even with proposal A. DebGPT [https://tracker.debian.org/pkg/debgpt] itself incorporates two common practices in how existing AI applications work:

    (1) By default, DebGPT behaves as a REST API client. It supports a wide
    range of existing service endpoints, including commercial and
    self-hosted ones.
    (2) The built-in backend of DebGPT can pull a binary blob from the
    internet and provide the REST endpoint using that model.
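
    To illustrate practice (1), a thin client against an OpenAI-compatible
    chat endpoint looks roughly like the following; the URL and model name
    are placeholders for whatever commercial or self-hosted service the user
    points it at, and this is not DebGPT's actual code.

        # Sketch of practice (1): a thin client against an OpenAI-compatible chat endpoint.
        # The URL and model name are placeholders; no model weights ship with the client itself.
        import json
        import urllib.request

        ENDPOINT = "http://localhost:8080/v1/chat/completions"   # e.g. a self-hosted server
        payload = {
            "model": "some-local-model",
            "messages": [{"role": "user", "content": "Summarise the Debian Social Contract."}],
        }
        req = urllib.request.Request(
            ENDPOINT,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])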

    I personally do not see how insisting on proposal A can harm the ecosystem. While developers cannot put binary blobs into main, they can still trigger the automatic download from software in main.

    I consistently believe that putting a giant binary blob (a base model) that nobody other than the upstream can reproduce into main is ridiculous. That said, non-free is somewhere such a model can go.


    My appreciation of software freedom is rooted in the equal sharing of
    knowledge that benefits humanity in the long run. When I was young,
    looking at the binary blobs of Microsoft Windows while being unable to
    easily learn how computers work really disappointed me. Discovering
    Debian made me happy with the open-source "crap", even when it falls
    behind the closed-source Ferrari.

    Proposal A preserves the integrity of knowledge for anybody who wants to
    study the stuff in depth. Proposal B departs from my appreciation of
    software freedom. I hope free software can still help people achieve
    their personal revolutions in knowledge and skill in a future that
    belongs to AI, just as it has done for me.


    Let's leave enough time for preparing the proposals. I'll focus on my
    proposal A and incorporate others' suggestions from the list.

  • From Thorsten Glaser@21:1/5 to All on Fri Feb 7 08:50:01 2025
    M. Zhou dixit:

    I do not see how proposal A harms the ecosystem. It just prevents huge
    binary blobs from entering Debian's main section of the archive. It
    does not stop people from uploading the binary blobs to non-free
    section.

    I’d like to remind you that these huge binary blobs still contain,
    in lossily compressed form, illegally obtained and unethically
    pre-prepared copies of copyrighted works, whose licences are not
    honoured by the proposed implementations.

    As such I cannot consider them acceptable even for Debian’s non-free.

    As someone publishing a lot of things under OSS licences, I, you,
    really we all are affected by this. Given that I mostly publish
    under Copyfree Ⓕ licences, attribution (and disclaimer of as much
    liability as permitted) is all I seek, and they give not even that.

    While the act of training such a model *for data analysēs* may be
    legal, distributing it, or output gained from it that is not a, and
    I quote the copyright law, “pattern, trend [or] correlation” isn’t
    legal.

    https://evolvis.org/~tg/cc.htm contains more writeup on this (and
    my Fediverse bookmark list has tons more material I need to add to
    its “Further references” section, really…) and links to a wlog entry
    containing even more on this, and to the homepage of The MirOS Licence,
    where I put an explicit interpretation requirement along the same
    lines as well.

    On Wed, 2025-02-05 at 07:45 -0700, Sam Hartman wrote:

    If we are saying that to be open source software, any model you use
    needs to provide full training data up to the original training run
    with random parameters, I think that would harm our community.

    I cannot conceive how someone in your position in Debian, or really
    any fellow DD, can make such a statement with the declared intent.

    (Others on LWN, where I spotted this (I’m not subscribed to the list,
    so Cc me on replies if you want me to see them), have pointed out the
    fallacy behind the quality argument already.)

    bye,
    //mirabilos
    --
    „Cool, /usr/share/doc/mksh/examples/uhr.gz ist ja ein Grund,
    mksh auf jedem System zu installieren.“
    -- XTaran auf der OpenRheinRuhr, ganz begeistert
    (EN: “[…]uhr.gz is a reason to install mksh on every system.”)

  • From Sam Johnston@21:1/5 to Thorsten Glaser on Fri Feb 7 13:40:04 2025
    On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <tg@debian.org> wrote:

    I’d like to remindyou that these huge binary blobs still contain,
    in lossily compressed form, illegally obtained and unethically
    pre-prepared, copies of copyrighted works, whose licences are not
    honoured by the proposed implementations.

    As such I cannot consider them acceptable even for Debian’s non-free.

    Agreed, we know these models can and do routinely recall training data
    in the course of normal operation[1]:

    "Large language models (LMs) have been shown to memorize parts of
    their training data, and when prompted appropriately, they will emit
    the memorized training data verbatim."
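
    As a toy illustration of how such memorisation is typically probed (not
    the cited paper's methodology; the model name and text are placeholders,
    and the example passage is public-domain Austen), one feeds the model a
    prefix from a suspected training document and checks whether the greedy
    continuation comes back verbatim:

        # Toy memorisation probe: prompt with a known prefix and compare the
        # model's greedy continuation against the original text.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "some-org/small-causal-lm"       # placeholder model
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id)

        prefix = "It is a truth universally acknowledged, that a single man in possession"
        known_continuation = " of a good fortune, must be in want of a wife."

        inputs = tok(prefix, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)  # greedy decoding
        continuation = tok.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)

        print("verbatim recall:", continuation.startswith(known_continuation.strip()[:30]))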

    We also know that even models carefully designed to avoid this, often
    using guardrails that would be trivially removed when running locally
    rather than as a service like OpenAI, will divulge their secrets if
    coerced[2]:

    "The Times paid someone to hack OpenAI’s products,” and even so, it
    “took them tens of thousands of attempts to generate the highly
    anomalous results”

    The OSI and others arguing that this is a valid way to protect
    sensitive training data (copyrighted content, but also personally
    identifiable information, medical records, proprietary datasets for
    federated learning, and even CSAM) demonstrate that they either do not
    understand the technology or, worse, do and are trying to deceive us.
    For me, the debate should end here.

    While the act of training such a model *for data analysēs* may be
    legal, distributing it, or output gained from it that is not a, and
    I quote the copyright law, “pattern, trend [or] correlation” isn’t legal.

    Some 4D chess players have argued that a model is not copyrightable as
    it is merely "a set of factual observations about the data", and that
    the copyright violations necessary for training are technically
    excusable (if unethical) under fair use and text and data mining
    exemptions. This ignores the intentions of the authors of the content
    (and the exemptions, which pre-date LLMs), with training on e.g.
    Common Crawl being done without their consent. Unless otherwise
    specified, content is typically published with "all rights reserved"
    by default.

    In any case, the result is "a statistical model that spits out
    memorized information [that] might infringe [...] copyright". The
    exemptions relied upon for training do not extend to reproduction
    during inference, for which a test of “substantial similarity” would
    apply (otherwise one might argue such copyright violations are
    coincidental).

    Allowing this would be knowingly shipping obfuscated binary blobs in
    main, akin to a book archive (Authors Guild v. Google, 2015) with
    trivially reversible encryption, or a printer driver that can
    spontaneously reproduce copyrighted content from memory. That we've
    been discussing these AI policy issues on the public record for years
    could even subject the project to claims of contributory copyright
    infringement when our users inevitably commit direct infringement
    (deliberately or inadvertently).

    It would be a shame to see Debian enter the same category as Grokster
    and Napster. It would also be unfortunate if Debian and derivatives
    could no longer be considered Digital Public Goods (albeit not yet
    certified like Fedora[3]), as the DPGA has just today "finalized the
    decision to make training data mandatory for AI systems applying to
    become DPGs. This requirement will help ensure that AI systems are
    built ethically and are transparent and interpretable"[4]. This too
    should give pause to advocates of allowing obviously non-free models
    in main.

    While I'm not trying to be alarmist, I am alarmed. Our community was
    built on respect for rights, and dropping this principle out of
    expediency now would be a radical departure from the norm. I don't
    think this is clear enough in lumin's proposal and "Toxic Candy"
    language, but rather than splitting the vote we should work on a
    consolidated clear and concise position, keeping the context separate.
    The alternative would also have unintended consequences, including
    chilling effects on open data, and on high-quality open models that
    emerged around/after (and in many cases, before) OSI's contentious
    OSAID release.

    - samj

    1. https://arxiv.org/abs/2202.07646
    2. https://hls.harvard.edu/today/does-chatgpt-violate-new-york-times-copyrights/
    3. https://www.networkworld.com/article/970236/fedora-linux-declared-a-digital-public-good.html
    4. https://github.com/DPGAlliance/dpg-standard/issues/193#issuecomment-2642584851

  • From Stefano Zacchiroli@21:1/5 to Sam Johnston on Fri Feb 7 16:10:01 2025
    While I'm still digesting the very impactful (for me) message by the
    other Sam (hartmans), a quick but important note on the following:

    On Fri, Feb 07, 2025 at 01:35:00PM +0100, Sam Johnston wrote:
    "Large language models (LMs) have been shown to memorize parts of
    their training data, and when prompted appropriately, they will emit
    the memorized training data verbatim."

    I don't think we should focus our conversation on LLMs much, if at all.
    The reason is that, even if a completely free-as-in-freedom (including
    in its training dataset), high quality LLM were to materialize in the
    future, its preferred form of modification (which includes the dataset)
    will be practically impossible to distribute by Debian due to its size.

    So when we think of concrete examples, let's focus on what could be
    reasonably distributed by Debian. This includes small(er) generative AI language models, but also all sorts of *non-generative* AI models, e.g., classification models. The latter do not generate copyrightable content,
    so most of the issues you pointed out do not apply to them. Other issues
    still apply to them, including bias analyses (at a scale which *is* manageable, addressing some of the issues pointed out by hartmans), and
    ethical data sourcing.

    Cheers
    --
    Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack
    Full professor of Computer Science, Télécom Paris, Polytechnic Institute of Paris
    Co-founder & CSO, Software Heritage
    Mastodon: https://mastodon.xyz/@zacchiro

  • From Sam Johnston@21:1/5 to Stefano Zacchiroli on Fri Feb 7 19:40:02 2025
    On Fri, 7 Feb 2025 at 16:04, Stefano Zacchiroli <zack@debian.org> wrote:
    I don't think we should focus our conversation on LLMs much, if at all.

    While I agree LLMs tend to be the tail wagging the dog in AI/ML
    discussion, the thread focuses on LLMs and the resulting policy will
    apply to them.

    The reason is that, even if a completely free-as-in-freedom (including
    in its training dataset), high quality LLM were to materialize in the
    future, its preferred form of modification (which includes the dataset)
    will be practically impossible to distribute by Debian due to its size.

    There are several candidates already, including Ai2's OLMo 2[1] and Pleias[2]:

    "They Said It Couldn’t Be Done[3]
    Training large language models required copyrighted data until it did
    not. [...] These represent the first ever models trained exclusively
    on open data, meaning data that are either non-copyrighted or are
    published under a permissible license. These are the first fully EU AI
    Act compliant models. In fact, Pleias sets a new standard for safety
    and openness."

    Given these provide a foundation on which future developers can build,
    as well as an example others can follow, there will be many more.
    Conversely, if we propagate the myth that these are too
    big/hard/costly to create with today's tools, let alone tomorrow's,
    then we run the risk that people will believe us. Not long ago, even obtaining a
    computer that could download and compile software was out of the reach
    of most!

    On the "preferred form" (wording from the OSD rather than the DFSG),
    this is subjective and will be different for one than for another.
    While Sam may possess the tools and techniques to assess and address
    bias to some extent with weights only, if I, as a security researcher
    or data protection officer, need to detect and entirely eliminate
    problematic content (e.g., backdoors or "right to be forgotten"
    requests), then the *only* form I can accept is the training data, thus
    making it my "preferred form". I can't just say to a privacy
    commissioner or judge "there was only a 0.7% chance patients' medical
    records would be revealed, your honour". While Sam's tools are
    improving, so are tools that can reverse the training process (e.g.,
    DLG/iDLG for model inversion, which "stands out due to its ability to
    extract sensitive information from the training dataset and compromise
    user privacy"[4]).
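
    For anyone unfamiliar with what "reversing the training process" looks
    like, here is a toy gradient-inversion sketch in the spirit of DLG,
    using a deliberately tiny linear model and made-up data (it is not the
    cited paper's code): given only the gradients produced by one training
    example, an attacker optimises a dummy input until its gradients match,
    approximately recovering the example.

        import torch
        import torch.nn as nn

        torch.manual_seed(0)

        # A deliberately tiny "victim" model and one private training example.
        model = nn.Linear(16, 4)
        loss_fn = nn.CrossEntropyLoss()
        x_true = torch.randn(1, 16)
        y_true = torch.tensor([2])

        # What the attacker observes: the gradients that one example produces.
        true_grads = [g.detach() for g in
                      torch.autograd.grad(loss_fn(model(x_true), y_true), model.parameters())]

        # DLG: optimise a dummy input (and soft label) until its gradients match.
        x_dummy = torch.randn(1, 16, requires_grad=True)
        y_dummy = torch.randn(1, 4, requires_grad=True)
        optimizer = torch.optim.LBFGS([x_dummy, y_dummy])

        def closure():
            optimizer.zero_grad()
            dummy_loss = torch.sum(
                -torch.softmax(y_dummy, dim=-1) * torch.log_softmax(model(x_dummy), dim=-1))
            dummy_grads = torch.autograd.grad(dummy_loss, model.parameters(), create_graph=True)
            grad_diff = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
            grad_diff.backward()
            return grad_diff

        for _ in range(50):
            optimizer.step(closure)

        # x_dummy now approximates x_true: the private example leaked from gradients alone.
        print("reconstruction error:", torch.dist(x_dummy.detach(), x_true).item())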

    Just as the software vendor doesn't get to tell users what constitutes
    an improvement for the purposes of the free software definition, we
    don't get to tell practitioners what the subjective "preferred form"
    means. That's why I prefer the objective "actual form" Sam referred to
    in suggesting "We look at what the software authors *actually do* to
    modify models they incorporate to determine the preferred form of modification". I guarantee some will reach for the data, so it must be
    included for that freedom to be fully protected.

    So when we think of concrete examples, let's focus on what could be reasonably distributed by Debian. This includes small(er) generative AI language models, but also all sorts of *non-generative* AI models, e.g., classification models. The latter do not generate copyrightable content,
    so most of the issues you pointed out do not apply to them.

    We can't make a valid decision or draft a policy focusing on models
    which have no ability to create output that violates copyrights, only
    to then put the project, its derivatives, and users in legal hot
    water with others that do. You do raise a good point about what we can reasonably distribute with Debian, and many models would already
    exceed our current capacity (even without the dependencies required
    for reproducibility). This is a solvable problem though, and it's
    better to deliver utility to our users by solving it than compromise
    on our principles or give up altogether. Common Crawl don't host their
    own dumps, for example.

    Other issues
    still apply to them, including bias analyses (at a scale which *is* manageable, addressing some of the issues pointed out by hartmans), and ethical data sourcing.

    I'm not sure I accept that relying on fair use for training only to
    then incite direct infringement by users through deliberate or
    inadvertent reproduction per proposed policies can be considered
    "ethical data sourcing". Even if fair use did extend to cover
    infringing model outputs, it would clearly be against the wishes of
    the authors. This much is clear from the various generative AI
    lawsuits already underway[5], including a class action against
    Bloomberg[6], who joins Software Heritage in the small and shrinking
    group of OSAID endorsers[7].

    - samj

    1. https://allenai.org/blog/olmo2
    2. https://simonwillison.net/2024/Dec/5/pleias-llms/
    3. https://huggingface.co/blog/Pclanglais/common-models
    4. https://arxiv.org/abs/2501.18934v1
    5. https://generative-ai-newsroom.com/the-current-state-of-genai-copyright-lawsuits-203a1bd0f616
    6. https://admin.bakerlaw.com/wp-content/uploads/2024/01/ECF-74-Amended-Complaint.pdf
    7. https://opensource.org/ai/endorsements

  • From Sam Hartman@21:1/5 to All on Sat Feb 8 00:30:01 2025
    "Sam" == Sam Johnston <samj@samj.net> writes:

    Sam> On Fri, 7 Feb 2025 at 16:04, Stefano Zacchiroli <zack@debian.org> wrote:
    >> I don't think we should focus our conversation on LLMs much, if
    >> at all.

    Sam> Just as the software vendor doesn't get to tell users what
    Sam> constitutes an improvement for the purposes of the free
    Sam> software definition, we don't get to tell practitioners what
    Sam> the subjective "preferred form" means. That's why I prefer the
    Sam> objective "actual form" Sam referred to in suggesting "We look
    Sam> at what the software authors *actually do* to modify models
    Sam> they incorporate to determine the preferred form of
    Sam> modification". I guarantee some will reach for the data, so it
    Sam> must be included for that freedom to be fully protected.


    Actually, no, that's not how Debian works.
    We look to what the people working on the upstream project and the
    package do when modifying the package, and generally accept that as the preferred form of modification for the package.

  • From Thorsten Glaser@21:1/5 to Sam Johnston on Sat Feb 8 00:30:01 2025
    On Fri, 7 Feb 2025, Sam Johnston wrote:

    On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <tg@debian.org> wrote:

    I’d like to remind you that these huge binary blobs still contain,
    in lossily compressed form, illegally obtained and unethically
    pre-prepared, copies of copyrighted works, whose licences are not
    honoured by the proposed implementations.

    As such I cannot consider them acceptable even for Debian’s non-free.

    Agreed, we know these models can and do routinely recall training data
    in the course of normal operation[1]:
    […]
    We also know that even models carefully designed to avoid this, often
    using guardrails that would be trivially removed when running locally
    rather than as a service like OpenAI, will divulge their secrets if coerced[2]:

    Indeed, I’ve seen more examples of this.

    The OSI and others arguing […] demonstrates they either do not
    understand the technology, or worse, do and are trying to deceive us.
    For me, the debate should end here.

    +1

    While the act of training such a model *for data analysēs* may be
    legal, distributing it, or output gained from it that is not a, and
    I quote the copyright law, “pattern, trend [or] correlation” isn’t
    legal.

    Some 4D chess players have argued that a model is not copyrightable as
    it is merely "a set of factual observations about the data", and that

    This sounds like the usual “some random software developer trying
    their hand at legalese” which lawyers routinely laugh about.

    I’ve heard that this is irrelevant: as long as the model’s output
    can reproduce sufficiently recognisable parts of others’ works,
    standalone and/or as a collage, it’s a derived work. IANAL, ofc.

    excusable (if unethical) under fair use

    … which is purely a US-american thing…

    and text and data mining exemptions.

    … which does not allow reproduction of works, only what I quoted above.

    This ignores the intentions of the authors of the content

    That, too.

    Unless otherwise specified, content is typically published with "all
    rights reserved" by default.

    The Berne Convention says so, indeed.

    In any case, the result is "a statistical model that spits out
    memorized information [that] might infringe [...] copyright". The
    exemptions relied upon for training do not extend to reproduction
    during inference, for which a test of “substantial similarity” would apply (otherwise one might argue such copyright violations are
    coincidental).

    +1

    Allowing this would be knowingly shipping obfuscated binary blobs in
    main, akin to a book archive (Authors Guild v. Google, 2015) with
    trivially reversible encryption, or a printer driver that can
    spontaneously reproduce copyrighted content from memory.

    Interesting comparisons. If you take the lossy compression into
    account (book archive as JPEG or so), with possibly increased lossiness/compression rate, this is a good fit.

    Digital Public Goods (albeit not yet certified like Fedora[3]), as the
    DPGA has just today "finalized the decision to make training data
    mandatory for AI systems applying to become DPGs. This requirement will

    Interesting, didn’t know about DPGs yet. (Hmm, they have a requirement
    for CC licences for data collections, which (except CC0 which on the
    other hand is problematic for reuse in/of code) aren’t Copyfree…grml…)

    While I'm not trying to be alarmist, I am alarmed. Our community was
    built on respect for rights, and dropping this principle out of
    expediency now would be a radical departure from the norm. I don't
    think this is clear enough in lumin's proposal and "Toxic Candy"

    I’ve not read the actual proposal (I saw the mail after responding
    to *this* thread only), but the summary by lumin in this thread makes
    it clear to me that it doesn’t go far enough, see also below.

    What does this mean for the proposed GR? Honestly, I’d rather skip it
    as it’s clear enough it’s unacceptable (your “should end here”). Mo, can you perhaps solicit input from ftpmasters first (also to see if
    they lean towards a similar hard stance)? If so, we can probably end
    up not needing one.

    *peeks at the proposals in the current text in your (Mo’s) repo*

    “A free software AI should publish the training data and training
    software under free software license, not just a FOSS-licensed
    pre-trained model along with the inference software.”

    OK, this doesn’t mention that they are acceptable for non-free, which
    your wording in the thread indicated. I could vote for that… except
    “free software license” isn’t what we need, we need “DFSG-compliant licence” (whether software or not as most FOSS data licences aren’t software licences, with The MirOS Licence as notable exception). I’ll
    file an issue about that.

    “Downside: This is not compatible with OSAID.”

    That’s an upside in my book, and many on Fedi would agree.

    I’d also argue that upsides/downsides belong into a conclusion, not
    into the text of the proposal, as they (some more than others) are
    subjective statements of the drafter. I’ll file that separately.

    (Note I haven’t looked at the rest. Still hoping we can not need a GR.)


    Zack wrote:

    let's focus on what could be reasonably distributed by Debian. This
    includes small(er) generative AI language models, but also all sorts of *non-generative* AI models, e.g., classification models.

    I think the same rule as for other statically linked binaries applies.
    All sources must be available and in Debian (or at least to Debian and
    its users) and their licences must work together and be honoured.

    For non-free we can waive the requirement to reproduce sources, but not
    that the licences of the sources are honoured and are compatible, which includes auditability.

    The licence terms of the model itself must be suitable, of course, but
    must also include the licence terms of the “training data”. The output (which isn’t just a pattern/trend/correlation) made from the model must
    also be considered “potentially a derivative work of parts or all of its input, including training data”, and so default to have the entire terms applicable to it, unless the model can know which parts were used and
    which weren’t. (The output is machine-generated, like a compiler’s, so it
    cannot be copyrighted as new work by itself, but this doesn’t mean it
    can’t be copyrighted as derived work.)

    I think this is true for all kinds of models, generative or not, though
    if a classification model is small enough that it can be proven, to at
    least reasonable exclusion, that it cannot reproduce its inputs in a
    form sufficient for copyright, it could get partial exceptions.

    For main I think I’d still want sources available. In a twist, I agree
    that those small-enough classification models (not sure about generative models) could go to non-free-firmware.

    The latter do not generate copyrightable content,
    so most of the issues you pointed out do not apply to them.

    AIUI, models and software making use of them are distinct (data vs. code, otherwise the models couldn’t go to non-free-firmware). It would have to
    be seen whether you could take a sufficiently large classification model
    and plug it, possibly with minor changes, into a “generative AI” program (gods I hate that term, it regurgitates, doesn’t generate, and it misrepresents true generative art as well); if so, or if there’s something that can take
    a model and “disassemble” it into recognisable parts of the training
    material, it’d still be an issue.

    The reason is that, even if a completely free-as-in-freedom (including
    in its training dataset), high quality LLM were to materialize in the
    future, its preferred form of modification (which includes the dataset)
    will be practically impossible to distribute by Debian due to its size.

    Probably/possibly, but there’s still a distinction between contrib and non-free (and “just no”) on the line.

    It’d also most likely not realistically be reproducible by Debian.

    I was once asked (while preparing a response to a questionnaire about
    this) what conditions it would take for me to accept “an AI”. Besides
    honouring and reproducing licence terms, attributions, etc., one
    condition was participating in a “reproducible builds” effort, where the
    “training data” and all other input used during training, such as the
    PRNG stream, would be recorded, and others with sufficiently beefy
    systems could then reproduce the created model. If this is occasionally
    checked (and if, during development, steps are taken not to
    “accidentally” break it), then we could deal with the ready-made model.
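
    A minimal sketch of the record-keeping side of such a “reproducible
    builds for training” effort could look like the following; the field
    names and file paths are made up for illustration, not a proposed
    standard.

        # Sketch: record everything needed to re-run a training job deterministically.
        import hashlib, json, random

        import numpy as np
        import torch

        SEED = 20250208

        def set_all_seeds(seed: int) -> None:
            random.seed(seed)
            np.random.seed(seed)
            torch.manual_seed(seed)
            torch.use_deterministic_algorithms(True)   # fail loudly on non-deterministic ops

        def sha256_of(path: str) -> str:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()

        set_all_seeds(SEED)
        manifest = {
            "seed": SEED,
            "training_data": {"path": "train.jsonl", "sha256": sha256_of("train.jsonl")},
            "framework": {"torch": torch.__version__},
            # ... plus training code revision, hyperparameters, hardware notes, etc.
        }
        with open("training-manifest.json", "w") as f:
            json.dump(manifest, f, indent=2)
        # Anyone with the manifest, the data, and enough hardware can attempt a rebuild.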

    From a freedom perspective, we would still want all sources available,
    so that people with the means to do so can still produce a model from
    modified sources.


    I admit I haven’t thought about some of the things I wrote above, like
    how they can fit into a Debian-ish model, as much as about the other
    things (especially what I put on the webpages linked in the previous
    mail), but they should serve as a good start.


    Other issues still apply to them, including biases analyses (at a scale
    which *is* manageable, addressing some of the issues pointed out by hartmans), and ethical data sourcing.

    And environmental concerns, indeed, indeed.

    These can probably be handled by the relevant team (d-science?) like
    they are with other prospective packages, should the other concerns (DFSG-freeness, archive rules, etc.) pass.

    bye,
    //mirabilos (still not subscribed)
    --
    Save the environment. Don’t use “AI” to summarise this eMail, please.

  • From M. Zhou@21:1/5 to Christian Kastner on Mon Feb 10 21:50:01 2025
    On Mon, 2025-02-10 at 19:12 +0100, Christian Kastner wrote:
    Preferred Form of Modification
    ==============================
    [...]
    As a practical matter, for the non-monopolies in the free software ecosystem, the preferred form of modification for base models is the
    models themselves.

    I would have strongly disagreed with this until a short while ago, and
    stated that unless I can run a modified training process -- which would require the training data -- I don't have the preferred form of
    modification.

    However, recent advances point to new useful models being built from
    other models, for example what DeepSeek accomplished with Llama. They obviously didn't have the original training data, yet still built
    something very useful from the base model.

    So I now have a slight doubt. But it is only slight; my gut says that
    even many useful derivations cannot "heal" an initial problem of
    free-ness. Because if the original base model were to disappear (as you
    put it in "Free Today, Gone Tomorrow"), all derivations in the chain
    would lose their reproducibility, too.

    And independence too, which connects to a healthy ecosystem in the long run.

    Think about the case where basemodel-v1 is released under MIT, and there
    are some derivative works around this v1 model. Then someday, the license
    of basemodel-v2 is changed to a proprietary one, and the open-source
    ecosystem around the model will simply decay.

    For traditional open-source or free software, if people are unsatisfied
    with how software-v1 is written, or the upstream of software-v1 decides
    to discontinue the effort, people can still fork the v1 work and
    potentially create a v2 independently.

    Data access matters even more for academia. Without the original training
    data, there will never be a fair comparison, let alone rigorous research
    for making real improvements. For example, ResNet (for image
    classification) is trained on ImageNet (a large-scale image dataset,
    academic use only). The original authors have already stopped making
    improvements to this "base model". However, people can still train new
    "base models" such as ViT (vision transformer) on ImageNet to make real
    improvements. The original training dataset being accessible, although
    academic use only, is one key factor that keeps this line of research
    healthy. If anybody is unsatisfied with ResNet, ViT, etc., they can
    reproduce the original base model and try to make improvements.

    No model is the endgame so far; pre-trained models are replaced very
    quickly. An open-source ecosystem built upon a frozen toxic-candy base
    model cannot iterate. Once the frozen base model becomes outdated, the
    whole ecosystem is outdated, because the system is not independent and
    cannot iterate by itself.

    Similarly, treat "sbuild" as a "frozen base model". The community can
    create sbuild-schroot, sbuild-unshare, etc. around it. When sbuild is
    discontinued, the derivative works will be impacted. However, as long as
    the fundamentals (dpkg-dev) remain public, people can still independently
    design other "frozen base models", like debspawn (systemd-nspawn based)
    or even Docker-based ones. In that sense, the Debian package builder
    ecosystem is still healthy.

    My interpretation of "toxic candy" focuses not only on the present, but
    also on the future, especially the key factors that contribute to a
    healthy, positive loop in which the ecosystem can constructively grow.

    If software freedom is defined on top of a "toxic candy" base model and
    depends on it, then once the base model quits the game and is
    discontinued, that "software freedom" has to quit the game and be
    discontinued as well, because nobody other than the original author has
    the freedom to improve the original base model itself.

    "Toxic candy" models are not reproducible and are not something people
    can independently improve. I don't believe this satisfies the definition
    of software freedom. If we disagree on this point, then the question
    becomes whether "being able to do secondary development" covers all the
    freedoms in that definition.

    Independence also matters for aspects like trustworthiness. For example,
    what if a "toxic candy" language model responds with spam advertisements
    instead of really answering the user's question? Nobody other than the
    original author is able to fix this "base model". Should I trust the
    "toxic candy" model and regard it as "free software" while being unable
    to study or modify the "base model" itself?

  • From Gard Spreemann@21:1/5 to Sam Hartman on Fri Feb 21 11:30:01 2025
    Sam Hartman <hartmans@debian.org> writes:

    Dear lumin:

    First, thanks for all your work on AI and free software.
    When I started my own AI explorations, I found your ML policy
    inspirational in how I thought about AI and free software.

    I'd like to pile on and repeat this sentiment; thank you, Mo!

    With my Debian hat on, I don't really care whether base models are
    considered free or non-free. I don't think it will be important for
    Debian to include base-models in our archive. What I do care about is
    what we can do with software that takes base models and adapts them
    for a particular use case.

    I really struggle to follow this reasoning. What about this way of
    thinking does _not_ transfer to "classical" software? And why? Why isn't
    what you're saying an equally good (or, I claim, bad) argument for
    acceptance of classical software that is somehow derived from non-free software? (An actual real-world example that springs to mind might be
    so-called open source projects that start out with leaked source code
    from e.g. a proprietary game).


    Best,
    Gard
