• A Different Take on AI

    From Sam Hartman@21:1/5 to All on Wed Feb 5 16:00:01 2025
    TL;DR: I think it is important for Debian to consider AI models free
    even if those models are based on models that do not release their
    training data. In terms of the DFSG, I think that a model itself is
    often a preferred form of modification for creating derived works. Put
    another way, I don't think toxic candy is as toxic as I thought it was
    when reading lumin's original ML policy.
    If we focus too much on availability of data, I think we will help the
    large players and force individuals and small contributors out of the
    free software ecosystem.
    I will be drafting a GR option to support this position.


    Dear lumin:

    First, thanks for all your work on AI and free software.
    When I started my own AI explorations, I found your ML policy
    inspirational in how I thought about AI and free software.
    As those explorations have progressed, often involving attempts to
    change or remove bias from models, I have come to think somewhat
    differently than you did in your original ML policy.

    I apologize that I did not include a lot of references in this message.
    I found I was having trouble finding enough time to write it at all.
    I wanted to give you some notice that I planned to draft what I believe
    is a competing GR option, and doing that took the time I had.
    I am not a researcher by trade, and I do not have all the references and
    links I wish I did at hand.
    I'm just a free software person who has been working on AI as a side
    project, because I hope it can make parts of the world I care about
    better.

    As I understand it, you believe that:

    1) Looking at the original training data would be the best approach for
    trying to remove bias from a model.

    2) It would be difficult/impossible to do that kind of work without
    access to the original training data.

    I have come to believe that:

    1) AI models are not very transparent even if you have the training
    data. Taking advantage of the training data for a base model is probably
    outside the scope of most of us even if we had it. That's definitely
    true for retraining a model, but I think it is also true for
    understanding where bias is coming from. That's for base models; I think
    that fine-tuning datasets for things like Open Assistant are within the
    scope of the masses to examine and use.

    2) I think that retraining, particularly with techniques like ORPO, is a
    more effective strategy for the democratized (read non-Google, non-Meta)
    community to change bias than working from the training data.
    In other words, I am not convinced that we would use training data, even
    if we had it, to adjust the bias of our models.
    Which is to say, I think the preferred form of modification for models
    is often the model itself rather than the training data.

    Goals
    =====


    I think both of us care about democratizing AI. We are more interested
    in preserving individuals' ability to modify and create software than in
    promoting monopolies or advantaging OpenAI, Meta, Google, and the like.
    I think we may disagree about how to do that.

    With my Debian hat on, I don't really care whether base models are
    considered free or non-free. I don't think it will be important for
    Debian to include base models in our archive.
    What I do care about is what we can do with software that takes base
    models and adapts them for a particular use case.
    If LibreOffice gained an AI assistant, our users would be well served if
    we were able to include a high-quality AI assistant that preserves their
    core freedoms.
    With my Debian hat on, I care more about what we can do with things
    derived from base models than about the base models themselves.

    Core Freedoms
    =============

    I think that the core freedoms we care about are:

    1) Being able to use software.
    2) Being able to share software.
    3) Being able to modify software.
    4) Transparency: being able to understand how software works.

    Debian has always valued transparency, but I think the DFSG and our
    practices have always valued transparency less than the other freedoms.
    There's nothing in the DFSG itself that requires transparency.
    We've had plenty of arguments over the years about things like minimized
    forms of code and whether they met the conditions of the DFSG.
    One factor that has been raised is transparency, but it mostly gets
    swept aside by the question of whether we can modify the software.
    The idea appears to be that if we have the preferred form of
    modification, that's transparent enough.
    If the upstream doesn't have any advantage in transparency over us,
    well, we decide that's free enough.

    One argument that has come up over the years when looking at vendoring
    is whether replacement is the preferred form of modification.
    Say I have some vendored blob that is a minimized representation of an
    upstream software project: minified JavaScript or some form of byte
    code.
    Most of the time I'm going to modify that by replacing it entirely with
    a new upstream version.
    So, at least for vendored code, is that good enough?
    Generally we've decided that no, it is not.
    We want individuals to be able to make arbitrary modifications to the
    code, not just replace it.

    My claim is that this analysis works differently for AI than for minified JavaScript.

    AI is Big
    =========

    I cannot get my head around how big AI training sets for base models
    are.

    I was recently looking at the DeepSeek Math paper [1]:

    [1]: https://arxiv.org/abs/2402.03300

    As I understand it, they took their DeepSeek Coder model as a base.
    So that's trained on some huge dataset--so big that they didn't even
    want to repeat it.

    Then they had a 1.2 billion token dataset (say 6 GB of uncompressed
    text) that they used for a training round--some sort of fine-tuning
    round.

    Then they applied 2**17 examples (so over a hundred thousand examples)
    where they knew both the question and a correct answer.
    But the impressive part for me was how the 1.2 billion token dataset was
    produced. I found the discussion of that process fascinating: it
    involved going over a significant chunk of the Common Crawl dataset,
    which is mind-bogglingly huge, to figure out which fraction of that
    dataset talks about math reasoning.
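
    To give a feel for the magnitudes involved, here is a trivial
    back-of-envelope check using the figures above; the bytes-per-token
    ratio is my own rough assumption, not a number from the paper.

        # Rough size arithmetic for the figures above (bytes-per-token is an assumed average).
        tokens = 1_200_000_000          # "1.2 billion token dataset"
        bytes_per_token = 5             # assumed average for English web text
        print(f"fine-tuning corpus ~ {tokens * bytes_per_token / 1e9:.1f} GB of raw text")

        examples = 2 ** 17              # the supervised question/answer examples
        print(f"2**17 = {examples} examples")

        # Common Crawl, by contrast, is orders of magnitude larger than the
        # few-gigabyte corpus that was distilled out of it.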

    Searching the 1.2 billion token dataset is clearly within our
    capability.
    But it's not at all clear to me that I could find what in a 6 GB dataset
    is producing bias.
    I think it would be quite possible to hide bias in such a dataset
    intentionally, in such a way that even given the 1.2 billion tokens we
    would find it difficult to remove the bias by modifying the dataset. I
    think there will certainly be unintentional bias there that I could not
    find.

    So, to really have the training data, we need Common Crawl, and we need
    the scripts and random seeds necessary to reproduce the 1.2 billion
    token dataset.
    I also believe there was at least one language model in that process, so
    you would also need the training data for that model.

    I am quite sure that finding bias in something that large, or even
    examining it, is outside the scope of all but well-funded players.

    I am absolutely sure that reducing Common Crawl to the 1.2 billion
    tokens--that is, actually running the data analysis, including all the
    runs of any language models involved--is outside the scope of all but
    well-funded players. In other words, taking that original training data
    and using it as a preferred form of modification is outside the scope
    of everyone we want to center in our work.

    And then we're left repeating the process for the base model DeepSeek
    Coder.

    My position is that by taking this approach we've sacrificed
    modifiability for transparency, and I am not even sure we have gained
    transparency at a price that is available to the members of the
    community we want to center in our work.
    With this focus on data, we have taken the wrong value trade-off for
    Debian.
    Debian has always put the ability to modify software first.

    Free Today, Gone Tomorrow
    =========================

    One significant concern I know lumin is aware of with requiring data is
    what happens when the data is available today but not tomorrow.
    One of the models that tried to be as open as possible ran into problems
    because its authors were forced to take down part of their dataset after
    the model was released.
    (I believe it was a copyright issue.)

    The AI copyright landscape is very fluid.
    Right now we do not know what is fair use.
    We do not even have a firm ethical ground for what sharing of data for
    AI training should look like in terms of social policy.

    We run a real risk that significant chunks of free software will depend
    on what we believe is a free model today, only to have it reclassified
    as non-free tomorrow when some of the training data is no longer
    available to the public.

    We run significant risks when different jurisdictions have different
    laws.

    It is very likely that there will be cases where models will still be distributable even when some of the training datasets underlying the
    model can no longer be distributed.

    So, you say, let's have several models and switch from one to another if
    we run into problems with one model.
    Hold that in the back of your mind. We'll come back to it.

    Debian as Second Class
    ======================

    I am concerned that if we are not careful the quality of models we are
    able to offer our users will lag significantly behind the rest of the
    world.
    If we are much more strict than other free-software projects, we will
    limit the models our users can use.
    Significant sources of training data will be available to others but not
    our users.
    I suspect that models that only need to release data information rather
    than training data will be higher quality, because they can have access
    to things like published books: works that can be freely used but not
    freely distributed, and the like.

    Our social contract promises we will value our users and free software.
    If we reduce the selection (and thus quality) of what we offer our
    users, it should somehow serve free software.
    In this instance, I believe that it probably does not meaningfully serve
    transparency, and it harms our core goal of making software easy to
    modify. In other words, I do not believe free software is being helped
    enough to justify disadvantaging our users.

    Preferred Form of Modification
    ==============================

    I talked earlier about how, if one model ended up being non-free, we
    could switch to another one.
    That happens all the time in the AI ecosystem.
    A software system has a fine-tuning dataset.
    Its authors might fine-tune a version of Llama 3, Mistral, or one of the
    newer models, each against that dataset.
    They will pick the one that performs best.
    As new models come out, the base model for some software might switch.

    As a practical matter, for the non-monopolies in the free software
    ecosystem, the preferred form of modification for base models is the
    models themselves.
    We switch out models and then adjust the code on top of that, using
    various fine-tuning and prompt-engineering techniques to adapt a model.
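
    As a rough illustration of this "swap the base model, keep the
    adaptation" workflow, here is a sketch using LoRA adapters via the peft
    library; the candidate model names are placeholders, and the actual
    training step on our own fine-tuning data is elided.

        # Sketch: the same adaptation recipe applied to interchangeable base models.
        # The adapter recipe (and our fine-tuning data) is what we maintain; the base is swappable.
        from peft import LoraConfig, get_peft_model
        from transformers import AutoModelForCausalLM

        ADAPTER_RECIPE = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

        def adapt(base_id: str):
            """Attach our small, trainable adapter to whichever base model is current."""
            base = AutoModelForCausalLM.from_pretrained(base_id)   # the base weights stay frozen
            return get_peft_model(base, ADAPTER_RECIPE)            # ...then train this on our data

        # When a better or freer base appears, we swap the identifier and re-run the same recipe:
        model_a = adapt("org-a/base-model-7b")    # hypothetical candidate bases
        model_b = adapt("org-b/other-base-7b")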

    The entire ecosystem has evolved to support this. There are
    competitions between models with similar (or the same) inputs.
    There are sites that allow you to interact with more than one model at
    once so you can choose which works best for you and switch out. (Or get
    around biases or restrictions, perhaps using ChatGPT to write part of a
    story, and a more open model to write adult scenes that ChatGPT would
    refuse to write.)

    On the other hand, I did talk about fine-tuning and task-specific or
    program-specific datasets.
    Many of those are at scales we could modify, and fine-tuning models (or
    producing adapters) based on those datasets is part of the preferred
    form of modification for the programs involved.

    What I want for Debian
    ======================

    Here's what I want to be able to do for Debian:

    * First, the bits of the model--its code and parameters--need to be
    under a DFSG-free license. So Llama 3 is never going to meet Debian
    main's needs under its current license.

    * We look at what the software authors actually do to modify models they
    incorporate to determine the preferred form of modification. If in
    practice they switch out base models and fine-tune, that's okay. In
    this situation we probably would need full access to the fine-tuning
    data, but not the training data for the base model.

    * Where it is plausible that the preferred form of modification works
    this way, we effectively cut off the source-code chain there and do not
    look further. If you are integrating model x into your software, your
    software is free if model x is under a free license and any fine-tuning
    data/scripts you use are free. I.e., if our users could actually go from
    upstream model x to what the software uses, that's DFSG-free enough even
    if the user could not reproduce model x itself.

    I firmly believe that the ability to retrain models to change their bias
    without access to the original training data will only continue to
    improve.
    Especially with techniques like ORPO, my explorations suggest that for
    smaller models we may have already reached a point that is good enough
    for free software.

    So What about the OSI Definition
    ================================

    I don't know.
    I think it depends on how the OSI definition treats derivative works.
    If what we're saying is base models need to release training data, I
    think that would harm the free software community.
    It would mean free models were always of lower quality than proprietary
    models, at least unless the fair use cases go in a direction where all
    the models are of low quality.
    I think data information is best for base models.

    If instead what we're saying is that OSI's definition is more focused
    on software incorporating models, and it is okay to use a model without
    fully specified data as an input so long as you give all the data for
    what you do to that model in your program, I could agree.

    If we are saying that to be open source software, any model you use
    needs to provide full training data up to the original training run with
    random parameters, I think that would harm our community.

  • From M. Zhou@21:1/5 to Sam Hartman on Wed Feb 5 18:30:01 2025
    Hi Sam,

    Thank you for the input. I see your point, and those points are exactly
    why I wrote proposal B in my draft. Here is my quick response after
    going through the text.

    On Wed, 2025-02-05 at 07:45 -0700, Sam Hartman wrote:

    TL;DR: I think it is important for Debian to consider AI models free
    even if those models are based on models that do not release their
    training data. In terms of the DFSG, I think that a model itself is
    often a preferred form of modification for creating derived works. Put another way, I don't think toxic candy is as toxic as I thought it was when reading lumin's original ML policy.
    If we focus too much on availability of data, I think we will help the
    large players and force individuals and small contributors out of the
    free software ecosystem.
    I will be drafting a GR option to support this position.

    I want to point out that in "preferred form of modification for creating DERIVED WORKS", the "derived works" part is where your proposal (and proposal B) differs from proposal A.

    Proposal A (toxic candy is not free software) preserves the full freedom
    for derived works, but also the freedom to inspect, study, reproduce, and
    modify the original base model. Covering only derived works is not an
    integral freedom.

    Proposal B (toxic candy is free software) is similar to treating those base models as blobs (such as firmware) that no free software community can really handle at the current stage.

    I do not see how proposal A harms the ecosystem. It just prevents huge
    binary blobs from entering Debian's main section of the archive. It does
    not stop people from uploading the binary blobs to the non-free section.

    General AI applications are not something to worry about even with proposal A. DebGPT [https://tracker.debian.org/pkg/debgpt] itself incorporates two common practices in how existing AI applications work:

    (1) By default, DebGPT behaves as a REST API client. It supports a wide
    range of existing service endpoints, including commercial and
    self-hosted ones.
    (2) The built-in backend of DebGPT can pull a binary blob from the
    internet and provide the REST endpoint using that model.
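
    To illustrate practice (1), a thin client against an OpenAI-compatible
    chat endpoint looks roughly like the following; the URL and model name
    are placeholders for whatever commercial or self-hosted service the user
    points it at, and this is not DebGPT's actual code.

        # Sketch of practice (1): a thin client against an OpenAI-compatible chat endpoint.
        # The URL and model name are placeholders; no model weights ship with the client itself.
        import json
        import urllib.request

        ENDPOINT = "http://localhost:8080/v1/chat/completions"   # e.g. a self-hosted server
        payload = {
            "model": "some-local-model",
            "messages": [{"role": "user", "content": "Summarise the Debian Social Contract."}],
        }
        req = urllib.request.Request(
            ENDPOINT,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])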

    I personally do not see how insisting on proposal A can harm the ecosystem. While developers cannot put binary blobs into main, they can still trigger the automatic download from software in main.

    I consistently believe that putting a giant binary blob (a base model) that nobody other than the upstream can reproduce into main is ridiculous. That said, non-free is somewhere such a model can go.


    My appreciation of software freedom is rooted in the equal sharing of
    knowledge that benefits humanity in the long run. When I was young,
    looking at the binary blobs of Microsoft Windows while being unable to
    easily learn how computers work really disappointed me. Discovering
    Debian made me happy with the open-source "crap", even when it falls
    behind the closed-source Ferrari.

    Proposal A preserves the integrity of knowledge for anybody who wants to
    study the stuff in depth. Proposal B departs from my appreciation of
    software freedom. I hope free software can still help people achieve
    their personal revolutions in knowledge and skill in a future that
    belongs to AI, just as it has done for me.


    Let's leave enough time for preparing the proposals. I'll focus on my
    proposal A and incorporate others' suggestions from the list.

  • From Thorsten Glaser@21:1/5 to All on Fri Feb 7 08:50:01 2025
    M. Zhou dixit:

    I do not see how proposal A harms the ecosystem. It just prevents huge
    binary blobs from entering Debian's main section of the archive. It
    does not stop people from uploading the binary blobs to non-free
    section.

    I’d like to remind you that these huge binary blobs still contain,
    in lossily compressed form, illegally obtained and unethically
    pre-prepared copies of copyrighted works, whose licences are not
    honoured by the proposed implementations.

    As such I cannot consider them acceptable even for Debian’s non-free.

    As someone publishing a lot of things under OSS licences, I, you,
    really we all are affected by this. Given that I mostly publish
    under Copyfree Ⓕ licences, attribution (and disclaimer of as much
    liability as permitted) is all I seek, and they give not even that.

    While the act of training such a model *for data analysēs* may be
    legal, distributing it, or output gained from it that is not a, and
    I quote the copyright law, “pattern, trend [or] correlation” isn’t
    legal.

    https://evolvis.org/~tg/cc.htm contains more writeup on this (and
    my Fediverse bookmark list has tons more material I need to add to
    its “Further references” section, really…) and links to a wlog entry
    containing even more on this, and to the homepage of The MirOS Licence,
    where I put an explicit interpretation requirement along the same
    lines as well.

    On Wed, 2025-02-05 at 07:45 -0700, Sam Hartman wrote:

    If we are saying that to be open source software, any model you use
    needs to provide full training data up to the original training run
    with random parameters, I think that would harm our community.

    I cannot conceive how someone in your position in Debian, or really
    any fellow DD, can make such a statement with the declared intent.

    (Others on LWN, where I spotted this (I’m not subscribed to the list,
    so Cc me on replies if you want me to see them), have pointed out the
    fallacy behind the quality argument already.)

    bye,
    //mirabilos
    --
    „Cool, /usr/share/doc/mksh/examples/uhr.gz ist ja ein Grund,
    mksh auf jedem System zu installieren.“
    -- XTaran auf der OpenRheinRuhr, ganz begeistert
    (EN: “[…]uhr.gz is a reason to install mksh on every system.”)

  • From Sam Johnston@21:1/5 to Thorsten Glaser on Fri Feb 7 13:40:04 2025
    On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <tg@debian.org> wrote:

    I’d like to remindyou that these huge binary blobs still contain,
    in lossily compressed form, illegally obtained and unethically
    pre-prepared, copies of copyrighted works, whose licences are not
    honoured by the proposed implementations.

    As such I cannot consider them acceptable even for Debian’s non-free.

    Agreed, we know these models can and do routinely recall training data
    in the course of normal operation[1]:

    "Large language models (LMs) have been shown to memorize parts of
    their training data, and when prompted appropriately, they will emit
    the memorized training data verbatim."
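
    As a toy illustration of how such memorisation is typically probed (not
    the cited paper's methodology; the model name and text are placeholders,
    and the example passage is public-domain Austen), one feeds the model a
    prefix from a suspected training document and checks whether the greedy
    continuation comes back verbatim:

        # Toy memorisation probe: prompt with a known prefix and compare the
        # model's greedy continuation against the original text.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "some-org/small-causal-lm"       # placeholder model
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id)

        prefix = "It is a truth universally acknowledged, that a single man in possession"
        known_continuation = " of a good fortune, must be in want of a wife."

        inputs = tok(prefix, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)  # greedy decoding
        continuation = tok.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)

        print("verbatim recall:", continuation.startswith(known_continuation.strip()[:30]))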

    We also know that even models carefully designed to avoid this, often
    using guardrails that would be trivially removed when running locally
    rather than as a service like OpenAI, will divulge their secrets if
    coerced[2]:

    "The Times paid someone to hack OpenAI’s products,” and even so, it
    “took them tens of thousands of attempts to generate the highly
    anomalous results”

    The OSI and others arguing that this is a valid way to protect
    sensitive training data (copyrighted content, but also personally
    identifiable information, medical records, proprietary datasets for
    federated learning, and even CSAM) demonstrate that they either do not
    understand the technology or, worse, do and are trying to deceive us.
    For me, the debate should end here.

    While the act of training such a model *for data analysēs* may be
    legal, distributing it, or output gained from it that is not a, and
    I quote the copyright law, “pattern, trend [or] correlation” isn’t legal.

    Some 4D chess players have argued that a model is not copyrightable as
    it is merely "a set of factual observations about the data", and that
    the copyright violations necessary for training are technically
    excusable (if unethical) under fair use and text and data mining
    exemptions. This ignores the intentions of the authors of the content
    (and the exemptions, which pre-date LLMs), with training on e.g.
    Common Crawl being done without their consent. Unless otherwise
    specified, content is typically published with "all rights reserved"
    by default.

    In any case, the result is "a statistical model that spits out
    memorized information [that] might infringe [...] copyright". The
    exemptions relied upon for training do not extend to reproduction
    during inference, for which a test of “substantial similarity” would
    apply (otherwise one might argue such copyright violations are
    coincidental).

    Allowing this would be knowingly shipping obfuscated binary blobs in
    main, akin to a book archive (Authors Guild v. Google, 2015) with
    trivially reversible encryption, or a printer driver that can
    spontaneously reproduce copyrighted content from memory. That we've
    been discussing these AI policy issues on the public record for years
    could even subject the project to claims of contributory copyright
    infringement when our users inevitably commit direct infringement
    (deliberately or inadvertently).

    It would be a shame to see Debian enter the same category as Grokster
    and Napster. It would also be unfortunate if Debian and derivatives
    could no longer be considered Digital Public Goods (albeit not yet
    certified like Fedora[3]), as the DPGA has just today "finalized the
    decision to make training data mandatory for AI systems applying to
    become DPGs. This requirement will help ensure that AI systems are
    built ethically and are transparent and interpretable"[4]. This too
    should give pause to advocates of allowing obviously non-free models
    in main.

    While I'm not trying to be alarmist, I am alarmed. Our community was
    built on respect for rights, and dropping this principle out of
    expediency now would be a radical departure from the norm. I don't
    think this is clear enough in lumin's proposal and "Toxic Candy"
    language, but rather than splitting the vote we should work on a
    consolidated clear and concise position, keeping the context separate.
    The alternative would also have unintended consequences, including
    chilling effects on open data, and on high-quality open models that
    emerged around/after (and in many cases, before) OSI's contentious
    OSAID release.

    - samj

    1. https://arxiv.org/abs/2202.07646
    2. https://hls.harvard.edu/today/does-chatgpt-violate-new-york-times-copyrights/
    3. https://www.networkworld.com/article/970236/fedora-linux-declared-a-digital-public-good.html
    4. https://github.com/DPGAlliance/dpg-standard/issues/193#issuecomment-2642584851

  • From Stefano Zacchiroli@21:1/5 to Sam Johnston on Fri Feb 7 16:10:01 2025
    While I'm still digesting the very impactful (for me) message by the
    other Sam (hartmans), a quick but important note on the following:

    On Fri, Feb 07, 2025 at 01:35:00PM +0100, Sam Johnston wrote:
    "Large language models (LMs) have been shown to memorize parts of
    their training data, and when prompted appropriately, they will emit
    the memorized training data verbatim."

    I don't think we should focus our conversation on LLMs much, if at all.
    The reason is that, even if a completely free-as-in-freedom (including
    in its training dataset), high quality LLM were to materialize in the
    future, its preferred form of modification (which includes the dataset)
    will be practically impossible to distribute by Debian due to its size.

    So when we think of concrete examples, let's focus on what could be
    reasonably distributed by Debian. This includes small(er) generative AI language models, but also all sorts of *non-generative* AI models, e.g., classification models. The latter do not generate copyrightable content,
    so most of the issues you pointed out do not apply to them. Other issues
    still apply to them, including bias analyses (at a scale which *is* manageable, addressing some of the issues pointed out by hartmans), and
    ethical data sourcing.

    Cheers
    --
    Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack
    Full professor of Computer Science, Télécom Paris, Polytechnic Institute of Paris
    Co-founder & CSO, Software Heritage
    Mastodon: https://mastodon.xyz/@zacchiro

  • From Sam Johnston@21:1/5 to Stefano Zacchiroli on Fri Feb 7 19:40:02 2025
    On Fri, 7 Feb 2025 at 16:04, Stefano Zacchiroli <zack@debian.org> wrote:
    I don't think we should focus our conversation on LLMs much, if at all.

    While I agree LLMs tend to be the tail wagging the dog in AI/ML
    discussion, the thread focuses on LLMs and the resulting policy will
    apply to them.

    The reason is that, even if a completely free-as-in-freedom (including
    in its training dataset), high quality LLM were to materialize in the
    future, its preferred form of modification (which includes the dataset)
    will be practically impossible to distribute by Debian due to its size.

    There are several candidates already, including Ai2's OLMo 2[1] and Pleias[2]:

    "They Said It Couldn’t Be Done[3]
    Training large language models required copyrighted data until it did
    not. [...] These represent the first ever models trained exclusively
    on open data, meaning data that are either non-copyrighted or are
    published under a permissible license. These are the first fully EU AI
    Act compliant models. In fact, Pleias sets a new standard for safety
    and openness."

    Given these provide a foundation on which future developers can build,
    as well as an example others can follow, there will be many more.
    Conversely, if we propagate the myth that these are too
    big/hard/costly to create with today's tools, let alone tomorrow's,
    then we run the risk that people will believe us. Not long ago, even obtaining a
    computer that could download and compile software was out of the reach
    of most!

    On the "preferred form" (wording from the OSD rather than the DFSG),
    this is subjective and will be different for one than for another.
    While Sam may possess the tools and techniques to assess and address
    bias to some extent with weights only, if I, as a security researcher
    or data protection officer, need to detect and entirely eliminate
    problematic content (e.g., backdoors or "right to be forgotten"
    requests), then the *only* form I can accept is the training data, thus
    making it my "preferred form". I can't just say to a privacy
    commissioner or judge "there was only a 0.7% chance patients' medical
    records would be revealed, your honour". While Sam's tools are
    improving, so are tools that can reverse the training process (e.g.,
    DLG/iDLG for model inversion, which "stands out due to its ability to
    extract sensitive information from the training dataset and compromise
    user privacy"[4]).
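
    For anyone unfamiliar with what "reversing the training process" looks
    like, here is a toy gradient-inversion sketch in the spirit of DLG,
    using a deliberately tiny linear model and made-up data (it is not the
    cited paper's code): given only the gradients produced by one training
    example, an attacker optimises a dummy input until its gradients match,
    approximately recovering the example.

        import torch
        import torch.nn as nn

        torch.manual_seed(0)

        # A deliberately tiny "victim" model and one private training example.
        model = nn.Linear(16, 4)
        loss_fn = nn.CrossEntropyLoss()
        x_true = torch.randn(1, 16)
        y_true = torch.tensor([2])

        # What the attacker observes: the gradients that one example produces.
        true_grads = [g.detach() for g in
                      torch.autograd.grad(loss_fn(model(x_true), y_true), model.parameters())]

        # DLG: optimise a dummy input (and soft label) until its gradients match.
        x_dummy = torch.randn(1, 16, requires_grad=True)
        y_dummy = torch.randn(1, 4, requires_grad=True)
        optimizer = torch.optim.LBFGS([x_dummy, y_dummy])

        def closure():
            optimizer.zero_grad()
            dummy_loss = torch.sum(
                -torch.softmax(y_dummy, dim=-1) * torch.log_softmax(model(x_dummy), dim=-1))
            dummy_grads = torch.autograd.grad(dummy_loss, model.parameters(), create_graph=True)
            grad_diff = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
            grad_diff.backward()
            return grad_diff

        for _ in range(50):
            optimizer.step(closure)

        # x_dummy now approximates x_true: the private example leaked from gradients alone.
        print("reconstruction error:", torch.dist(x_dummy.detach(), x_true).item())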

    Just as the software vendor doesn't get to tell users what constitutes
    an improvement for the purposes of the free software definition, we
    don't get to tell practitioners what the subjective "preferred form"
    means. That's why I prefer the objective "actual form" Sam referred to
    in suggesting "We look at what the software authors *actually do* to
    modify models they incorporate to determine the preferred form of modification". I guarantee some will reach for the data, so it must be
    included for that freedom to be fully protected.

    So when we think of concrete examples, let's focus on what could be reasonably distributed by Debian. This includes small(er) generative AI language models, but also all sorts of *non-generative* AI models, e.g., classification models. The latter do not generate copyrightable content,
    so most of the issues you pointed out do not apply to them.

    We can't make a valid decision or draft a policy focusing on models
    which have no ability to create output that violates copyrights, only
    to then put the project, its derivatives, and users in legal hot
    water with others that do. You do raise a good point about what we can reasonably distribute with Debian, and many models would already
    exceed our current capacity (even without the dependencies required
    for reproducibility). This is a solvable problem though, and it's
    better to deliver utility to our users by solving it than compromise
    on our principles or give up altogether. Common Crawl don't host their
    own dumps, for example.

    Other issues
    still apply to them, including bias analyses (at a scale which *is* manageable, addressing some of the issues pointed out by hartmans), and ethical data sourcing.

    I'm not sure I accept that relying on fair use for training only to
    then incite direct infringement by users through deliberate or
    inadvertent reproduction per proposed policies can be considered
    "ethical data sourcing". Even if fair use did extend to cover
    infringing model outputs, it would clearly be against the wishes of
    the authors. This much is clear from the various generative AI
    lawsuits already underway[5], including a class action against
    Bloomberg[6], who joins Software Heritage in the small and shrinking
    group of OSAID endorsers[7].

    - samj

    1. https://allenai.org/blog/olmo2
    2. https://simonwillison.net/2024/Dec/5/pleias-llms/
    3. https://huggingface.co/blog/Pclanglais/common-models
    4. https://arxiv.org/abs/2501.18934v1
    5. https://generative-ai-newsroom.com/the-current-state-of-genai-copyright-lawsuits-203a1bd0f616
    6. https://admin.bakerlaw.com/wp-content/uploads/2024/01/ECF-74-Amended-Complaint.pdf
    7. https://opensource.org/ai/endorsements

  • From Sam Hartman@21:1/5 to All on Sat Feb 8 00:30:01 2025
    "Sam" == Sam Johnston <samj@samj.net> writes:

    Sam> On Fri, 7 Feb 2025 at 16:04, Stefano Zacchiroli <zack@debian.org> wrote:
    >> I don't think we should focus our conversation on LLMs much, if
    >> at all.

    Sam> Just as the software vendor doesn't get to tell users what
    Sam> constitutes an improvement for the purposes of the free
    Sam> software definition, we don't get to tell practitioners what
    Sam> the subjective "preferred form" means. That's why I prefer the
    Sam> objective "actual form" Sam referred to in suggesting "We look
    Sam> at what the software authors *actually do* to modify models
    Sam> they incorporate to determine the preferred form of
    Sam> modification". I guarantee some will reach for the data, so it
    Sam> must be included for that freedom to be fully protected.


    Actually, no, that's not how Debian works.
    We look to what the people working on the upstream project and the
    package do when modifying the package, and generally accept that as the preferred form of modification for the package.

  • From Thorsten Glaser@21:1/5 to Sam Johnston on Sat Feb 8 00:30:01 2025
    On Fri, 7 Feb 2025, Sam Johnston wrote:

    On Fri, 7 Feb 2025 at 08:48, Thorsten Glaser <tg@debian.org> wrote:

    I’d like to remind you that these huge binary blobs still contain,
    in lossily compressed form, illegally obtained and unethically
    pre-prepared, copies of copyrighted works, whose licences are not
    honoured by the proposed implementations.

    As such I cannot consider them acceptable even for Debian’s non-free.

    Agreed, we know these models can and do routinely recall training data
    in the course of normal operation[1]:
    […]
    We also know that even models carefully designed to avoid this, often
    using guardrails that would be trivially removed when running locally
    rather than as a service like OpenAI, will divulge their secrets if coerced[2]:

    Indeed, I’ve seen more examples of this.

    The OSI and others arguing […] demonstrates they either do not
    understand the technology, or worse, do and are trying to deceive us.
    For me, the debate should end here.

    +1

    While the act of training such a model *for data analysēs* may be
    legal, distributing it, or output gained from it that is not a, and
    I quote the copyright law, “pattern, trend [or] correlation” isn’t
    legal.

    Some 4D chess players have argued that a model is not copyrightable as
    it is merely "a set of factual observations about the data", and that

    This sounds like the usual “some random software developer trying
    their hand at legalese” which lawyers routinely laugh about.

    I’ve heard that this is irrelevant: as long as the model’s output
    can reproduce sufficiently recognisable parts of others’ works,
    standalone and/or as a collage, it’s a derived work. IANAL, ofc.

    excusable (if unethical) under fair use

    … which is purely a US-american thing…

    and text and data mining exemptions.

    … which does not allow reproduction of works, only what I quoted above.

    This ignores the intentions of the authors of the content

    That, too.

    Unless otherwise specified, content is typically published with "all
    rights reserved" by default.

    The Berne Convention says so, indeed.

    In any case, the result is "a statistical model that spits out
    memorized information [that] might infringe [...] copyright". The
    exemptions relied upon for training do not extend to reproduction
    during inference, for which a test of “substantial similarity” would apply (otherwise one might argue such copyright violations are
    coincidental).

    +1

    Allowing this would be knowingly shipping obfuscated binary blobs in
    main, akin to a book archive (Authors Guild v. Google, 2015) with
    trivially reversible encryption, or a printer driver that can
    spontaneously reproduce copyrighted content from memory.

    Interesting comparisons. If you take the lossy compression into
    account (book archive as JPEG or so), with possibly increased lossiness/compression rate, this is a good fit.

    Digital Public Goods (albeit not yet certified like Fedora[3]), as the
    DPGA has just today "finalized the decision to make training data
    mandatory for AI systems applying to become DPGs. This requirement will

    Interesting, didn’t know about DPGs yet. (Hmm, they have a requirement
    for CC licences for data collections, which (except CC0 which on the
    other hand is problematic for reuse in/of code) aren’t Copyfree…grml…)

    While I'm not trying to be alarmist, I am alarmed. Our community was
    built on respect for rights, and dropping this principle out of
    expediency now would be a radical departure from the norm. I don't
    think this is clear enough in lumin's proposal and "Toxic Candy"

    I’ve not read the actual proposal (I saw the mail after responding
    to *this* thread only), but the summary by lumin in this thread makes
    it clear to me that it doesn’t go far enough, see also below.

    What does this mean for the proposed GR? Honestly, I’d rather skip it
    as it’s clear enough it’s unacceptable (your “should end here”). Mo, can you perhaps solicit input from ftpmasters first (also to see if
    they lean towards a similar hard stance)? If so, we can probably end
    up not needing one.

    *peeks at the proposals in the current text in your (Mo’s) repo*

    “A free software AI should publish the training data and training
    software under free software license, not just a FOSS-licensed
    pre-trained model along with the inference software.”

    OK, this doesn’t mention that they are acceptable for non-free, which
    your wording in the thread indicated. I could vote for that… except
    “free software license” isn’t what we need, we need “DFSG-compliant licence” (whether software or not as most FOSS data licences aren’t software licences, with The MirOS Licence as notable exception). I’ll
    file an issue about that.

    “Downside: This is not compatible with OSAID.”

    That’s an upside in my book, and many on Fedi would agree.

    I’d also argue that upsides/downsides belong into a conclusion, not
    into the text of the proposal, as they (some more than others) are
    subjective statements of the drafter. I’ll file that separately.

    (Note I haven’t looked at the rest. Still hoping we can not need a GR.)


    Zack wrote:

    let's focus on what could be reasonably distributed by Debian. This
    includes small(er) generative AI language models, but also all sorts of *non-generative* AI models, e.g., classification models.

    I think the same rule as for other statically linked binaries applies.
    All sources must be available and in Debian (or at least to Debian and
    its users) and their licences must work together and be honoured.

    For non-free we can waive the requirement to reproduce sources, but not
    that the licences of the sources are honoured and are compatible, which includes auditability.

    The licence terms of the model itself must be suitable, of course, but
    must also include the licence terms of the “training data”. The output (which isn’t just a pattern/trend/correlation) made from the model must
    also be considered “potentially a derivative work of parts or all of its input, including training data”, and so default to have the entire terms applicable to it, unless the model can know which parts were used and
    which weren’t. (The output is machine-generated, like a compiler’s, so it
    cannot be copyrighted as new work by itself, but this doesn’t mean it
    can’t be copyrighted as derived work.)

    I think this is true for all kinds of models, generative or not, though
    if a classification model is small enough that it can be proven, to at
    least reasonable exclusion, that it cannot reproduce its inputs in a
    form sufficient for copyright, it could get partial exceptions.

    For main I think I’d still want sources available. In a twist, I agree
    that those small-enough classification models (not sure about generative models) could go to non-free-firmware.

    The latter do not generate copyrightable content,
    so most of the issues you pointed out do not apply to them.

    AIUI, models and software making use of them are distinct (data vs. code, otherwise the models couldn’t go to non-free-firmware). It would have to
    be seen whether you could take a sufficiently large classification model
    and plug it, possibly with minor changes, into a “generative AI” program (gods I hate that term, it regurgitates, doesn’t generate, and it misrepresents true generative art as well); if so, or if there’s something that can take
    a model and “disassemble” it into recognisable parts of the training
    material, it’d still be an issue.

    The reason is that, even if a completely free-as-in-freedom (including
    in its training dataset), high quality LLM were to materialize in the
    future, its preferred form of modification (which includes the dataset)
    will be practically impossible to distribute by Debian due to its size.

    Probably/possibly, but there’s still a distinction between contrib and non-free (and “just no”) on the line.

    It’d also most likely not realistically be reproducible by Debian.

    I was once asked (while preparing a response to a questionnaire about
    this) what conditions it would take for me to accept “an AI”. Besides
    honouring and reproducing licence terms, attributions, etc., one
    condition was participating in a “reproducible builds” effort, where the
    “training data” and all other input used during training, such as the
    PRNG stream, would be recorded, and others with sufficiently beefy
    systems could then reproduce the created model. If this is occasionally
    checked (and if, during development, steps are taken not to
    “accidentally” break it), then we could deal with the ready-made model.
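
    A minimal sketch of the record-keeping side of such a “reproducible
    builds for training” effort could look like the following; the field
    names and file paths are made up for illustration, not a proposed
    standard.

        # Sketch: record everything needed to re-run a training job deterministically.
        import hashlib, json, random

        import numpy as np
        import torch

        SEED = 20250208

        def set_all_seeds(seed: int) -> None:
            random.seed(seed)
            np.random.seed(seed)
            torch.manual_seed(seed)
            torch.use_deterministic_algorithms(True)   # fail loudly on non-deterministic ops

        def sha256_of(path: str) -> str:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()

        set_all_seeds(SEED)
        manifest = {
            "seed": SEED,
            "training_data": {"path": "train.jsonl", "sha256": sha256_of("train.jsonl")},
            "framework": {"torch": torch.__version__},
            # ... plus training code revision, hyperparameters, hardware notes, etc.
        }
        with open("training-manifest.json", "w") as f:
            json.dump(manifest, f, indent=2)
        # Anyone with the manifest, the data, and enough hardware can attempt a rebuild.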

    From a freedom perspective, we would still want all sources available,
    so that people with the means to do so can still produce a model from
    modified sources.


    I admit I haven’t thought about some of the things I wrote above, like
    how they can fit into a Debian-ish model, as much as about the other
    things (especially what I put on the webpages linked in the previous
    mail), but they should serve as a good start.


    Other issues still apply to them, including biases analyses (at a scale
    which *is* manageable, addressing some of the issues pointed out by hartmans), and ethical data sourcing.

    And environmental concerns, indeed, indeed.

    These can probably be handled by the relevant team (d-science?) like
    they are with other prospective packages, should the other concerns (DFSG-freeness, archive rules, etc.) pass.

    bye,
    //mirabilos (still not subscribed)
    --
    Save the environment. Don’t use “AI” to summarise this eMail, please.

  • From M. Zhou@21:1/5 to Christian Kastner on Mon Feb 10 21:50:01 2025
    On Mon, 2025-02-10 at 19:12 +0100, Christian Kastner wrote:
    Preferred Form of Modification
    ==============================
    [...]
    As a practical matter, for the non-monopolies in the free software ecosystem, the preferred form of modification for base models is the
    models themselves.

    I would have strongly disagreed with this until a short while ago, and
    stated that unless I can run a modified training process -- which would require the training data -- I don't have the preferred form of
    modification.

    However, recent advances point to new useful models being built from
    other models, for example what DeepSeek accomplished with Llama. They obviously didn't have the original training data, yet still built
    something very useful from the base model.

    So I now have a slight doubt. But it is only slight; my gut says that
    even many useful derivations cannot "heal" an initial problem of
    free-ness. Because if the original base model were to disappear (as you
    put it in "Free Today, Gone Tomorrow"), all derivations in the chain
    would lose their reproducibility, too.

    And independence too, which connects to a healthy ecosystem in the long run.

    Think about the case where basemodel-v1 is released under MIT, and there
    are some derivative works around this v1 model. Then someday, the license
    of basemodel-v2 is changed to a proprietary one, and the open-source
    ecosystem around the model will simply decay.

    For traditional open-source or free software, if people are unsatisfied
    with how software-v1 is written, or the upstream of software-v1 decides
    to discontinue the effort, people can still fork the v1 work and
    potentially create a v2 independently.

    Data access matters even more for academia. Without the original training
    data, there will never be a fair comparison, let alone rigorous research
    for making real improvements. For example, ResNet (for image
    classification) is trained on ImageNet (a large-scale image dataset,
    academic use only). The original authors have already stopped making
    improvements to this "base model". However, people can still train new
    "base models" such as ViT (vision transformer) on ImageNet to make real
    improvements. The original training dataset being accessible, although
    academic use only, is one key factor that keeps this line of research
    healthy. If anybody is unsatisfied with ResNet, ViT, etc., they can
    reproduce the original base model and try to make improvements.

    No model is the endgame so far; pre-trained models are replaced very
    quickly. An open-source ecosystem built upon a frozen toxic-candy base
    model cannot iterate. Once the frozen base model becomes outdated, the
    whole ecosystem is outdated, because the system is not independent and
    cannot iterate by itself.

    Similarly, treat "sbuild" as a "frozen base model". The community can
    create sbuild-schroot, sbuild-unshare, etc. around it. When sbuild is
    discontinued, the derivative works will be impacted. However, as long as
    the fundamentals (dpkg-dev) remain public, people can still independently
    design other "frozen base models", like debspawn (systemd-nspawn based)
    or even Docker-based ones. In that sense, the Debian package builder
    ecosystem is still healthy.

    My interpretation of "toxic candy" focuses not only on the present, but
    also on the future, especially the key factors that contribute to a
    healthy, positive loop in which the ecosystem can constructively grow.

    If software freedom is defined on top of a "toxic candy" base model and
    depends on it, then once the base model quits the game and is
    discontinued, that "software freedom" has to quit the game and be
    discontinued as well, because nobody other than the original author has
    the freedom to improve the original base model itself.

    "Toxic candy" models are not reproducible and are not something people
    can independently improve. I don't believe this satisfies the definition
    of software freedom. If we disagree on this point, then the question
    becomes whether "being able to do secondary development" covers all the
    freedoms in that definition.

    Independence also matters for aspects like trustworthiness. For example,
    what if a "toxic candy" language model responds with spam advertisements
    instead of really answering the user's question? Nobody other than the
    original author is able to fix this "base model". Should I trust the
    "toxic candy" model and regard it as "free software" while being unable
    to study or modify the "base model" itself?

  • From Gard Spreemann@21:1/5 to Sam Hartman on Fri Feb 21 11:30:01 2025
    Sam Hartman <hartmans@debian.org> writes:

    Dear lumin:

    First, thanks for all your work on AI and free software.
    When I started my own AI explorations, I found your ML policy
    inspirational in how I thought about AI and free software.

    I'd like to pile on and repeat this sentiment; thank you, Mo!

    With my Debian hat on, I don't really care whether base models are
    considered free or non-free. I don't think it will be important for
    Debian to include base-models in our archive. What I do care about is
    what we can do with software that takes base models and adapts them
    for a particular use case.

    I really struggle to follow this reasoning. What about this way of
    thinking does _not_ transfer to "classical" software? And why? Why isn't
    what you're saying an equally good (or, I claim, bad) argument for
    acceptance of classical software that is somehow derived from non-free software? (An actual real-world example that springs to mind might be
    so-called open source projects that start out with leaked source code
    from e.g. a proprietary game).


    Best,
    Gard
