• Is there a more efficient threading lock?

    From Skip Montanaro@21:1/5 to All on Sat Feb 25 09:52:15 2023
    I have a multi-threaded program which calls out to a non-thread-safe
    library (not mine) in a couple places. I guard against multiple
    threads executing code there using threading.Lock. The code is
    straightforward:

    from threading import Lock

    # Something in textblob and/or nltk doesn't play nice with no-gil, so just
    # serialize all blobby accesses.
    BLOB_LOCK = Lock()

def get_terms(text):
    with BLOB_LOCK:
        phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
    for phrase in phrases:
        yield phrase

    When I monitor the application using py-spy, that with statement is
    consuming huge amounts of CPU. Does threading.Lock.acquire() sleep
    anywhere? I didn't see anything obvious poking around in the C code
    which implements this stuff. I'm no expert though, so could easily
    have missed something.

    Thx,

    Skip

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Passin@21:1/5 to Skip Montanaro on Sat Feb 25 11:48:37 2023
    On 2/25/2023 10:52 AM, Skip Montanaro wrote:
    I have a multi-threaded program which calls out to a non-thread-safe
    library (not mine) in a couple places. I guard against multiple
    threads executing code there using threading.Lock. The code is straightforward:

    from threading import Lock

    # Something in textblob and/or nltk doesn't play nice with no-gil, so just
    # serialize all blobby accesses.
    BLOB_LOCK = Lock()

def get_terms(text):
    with BLOB_LOCK:
        phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
    for phrase in phrases:
        yield phrase

    When I monitor the application using py-spy, that with statement is
    consuming huge amounts of CPU. Does threading.Lock.acquire() sleep
    anywhere? I didn't see anything obvious poking around in the C code
    which implements this stuff. I'm no expert though, so could easily
    have missed something.

    I'm no expert on locks, but you don't usually want to keep a lock while
    some long-running computation goes on. You want the computation to be
    done by a separate thread, put its results somewhere, and then notify
    the choreographing thread that the result is ready.
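A minimal sketch of that hand-off, reusing the TextBlob/EXTRACTOR names
from Skip's snippet (the queue plumbing and worker below are illustrative,
not his code):

import queue
import threading

from textblob import TextBlob  # EXTRACTOR is assumed, as in Skip's snippet

work_q = queue.Queue()         # (text, reply_q) pairs from calling threads

def blob_worker():
    # The only thread that ever touches TextBlob/NLTK, so no lock is needed.
    while True:
        text, reply_q = work_q.get()
        if text is None:       # sentinel: shut the worker down
            break
        reply_q.put(TextBlob(text, np_extractor=EXTRACTOR).noun_phrases)

threading.Thread(target=blob_worker, daemon=True).start()

def get_terms(text):
    reply_q = queue.Queue(maxsize=1)
    work_q.put((text, reply_q))
    yield from reply_q.get()   # blocks only this caller, not a shared lock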

    This link may be helpful -

    https://anandology.com/blog/using-iterators-and-generators/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter J. Holzer@21:1/5 to Skip Montanaro on Sat Feb 25 17:27:51 2023
    On 2023-02-25 09:52:15 -0600, Skip Montanaro wrote:
    I have a multi-threaded program which calls out to a non-thread-safe
    library (not mine) in a couple places. I guard against multiple
    threads executing code there using threading.Lock. The code is straightforward:

    from threading import Lock

    # Something in textblob and/or nltk doesn't play nice with no-gil, so just
    # serialize all blobby accesses.
    BLOB_LOCK = Lock()

def get_terms(text):
    with BLOB_LOCK:
        phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
    for phrase in phrases:
        yield phrase

    When I monitor the application using py-spy, that with statement is
    consuming huge amounts of CPU.

    Which OS is this?

    Does threading.Lock.acquire() sleep anywhere?

    On Linux it calls futex(2), which does sleep if it can't get the lock
    right away. (Of course if it does get the lock, it will return
    immediately which may use a lot of CPU if you are calling it a lot.)
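For reference, the uncontended cost is easy to measure directly. A rough
probe (numbers vary by OS and CPU):

import timeit
from threading import Lock

lock = Lock()

# A million uncontended acquire/release pairs; on typical hardware this
# comes out to well under a microsecond per pair.
print(timeit.timeit(lambda: (lock.acquire(), lock.release()),
                    number=1_000_000))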

    hp


    --
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

  • From Paul Rubin@21:1/5 to Skip Montanaro on Sat Feb 25 11:41:13 2023
    Skip Montanaro <skip.montanaro@gmail.com> writes:
    from threading import Lock

    1) you generally want to use RLock rather than Lock

2) I have generally felt that using locks at the app level at all is an
antipattern. The main way I've stayed sane in multi-threaded Python code
is to have every mutable strictly owned by exactly one thread, pass values
around using Queues, and have an event loop in each thread taking requests
from Queues (see the sketch after this list).

    3) I didn't know that no-gil was a now thing and I'm used to having the
    GIL. So I would have considered the multiprocessing module rather than threading, for something like this.
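
A bare-bones sketch of the pattern from point 2, with every name
illustrative: one thread owns the mutable state and runs an event loop,
and other threads talk to it only through Queues.

import queue
import threading

requests = queue.Queue()            # the owner thread's inbox

def owner_loop():
    state = {}                      # mutable state owned by this thread only
    while True:
        key, value, reply_q = requests.get()
        if key is None:             # sentinel: exit the event loop
            break
        state[key] = value          # no lock: only this thread mutates state
        reply_q.put(len(state))

owner = threading.Thread(target=owner_loop)
owner.start()

reply_q = queue.Queue()
requests.put(("spam", 1, reply_q))  # another thread sends a request...
print(reply_q.get())                # ...and waits for the answer: 1
requests.put((None, None, None))    # shut the owner down
owner.join()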

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Rubin@21:1/5 to Jon Ribbens on Sat Feb 25 13:51:38 2023
    Jon Ribbens <jon+usenet@unequivocal.eu> writes:
    1) you generally want to use RLock rather than Lock
    Why?

    So that a thread that tries to acquire it twice doesn't block itself,
    etc. Look at the threading lib docs for more info.
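
A tiny illustration of the difference (sketch only):

from threading import RLock

rlock = RLock()

def outer():
    with rlock:
        inner()         # same thread re-acquires: fine with an RLock

def inner():
    with rlock:         # with a plain Lock() this second acquire would
        pass            # block forever, deadlocking the thread

outer()                 # completes; swap in Lock() above and it hangs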

    What does this mean? Are you saying the GIL has been removed?

    Last I heard there was an experimental version of CPython with the GIL
    removed. It is supposed to take less of a performance hit due to
    INCREF/DECREF than an earlier attempt some years back. I don't know its current status.

    The GIL is an evil thing, but it has been around for so long that most
    of us have gotten used to it, and some user code actually relies on it.
    For example, with the GIL in place, a statement like "x += 1" is always
    atomic, I believe. But, I think it is better to not have any shared
    mutables regardless.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jon Ribbens@21:1/5 to Paul Rubin on Sat Feb 25 21:24:19 2023
    On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
    Skip Montanaro <skip.montanaro@gmail.com> writes:
    from threading import Lock

    1) you generally want to use RLock rather than Lock

    Why?

    2) I have generally felt that using locks at the app level at all is an antipattern. The main way I've stayed sane in multi-threaded Python
    code is to have every mutable strictly owned by exactly one thread, pass values around using Queues, and have an event loop in each thread taking requests from Queues.

    3) I didn't know that no-gil was a now thing and I'm used to having the
    GIL. So I would have considered the multiprocessing module rather than threading, for something like this.

    What does this mean? Are you saying the GIL has been removed?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter J. Holzer@21:1/5 to Skip Montanaro on Sat Feb 25 22:53:22 2023
    On 2023-02-25 09:52:15 -0600, Skip Montanaro wrote:
    BLOB_LOCK = Lock()

def get_terms(text):
    with BLOB_LOCK:
        phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
    for phrase in phrases:
        yield phrase

    When I monitor the application using py-spy, that with statement is
    consuming huge amounts of CPU.

    Another thought:

How accurate is py-spy? Is it possible that it assigns time actually
spent in

    phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases

to

    with BLOB_LOCK:

?

    hp

    --
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

  • From Skip Montanaro@21:1/5 to Peter on Sat Feb 25 15:41:52 2023
    Thanks for the responses.

    Peter wrote:

    Which OS is this?

    MacOS Ventura 13.1, M1 MacBook Pro (eight cores).

    Thomas wrote:

    I'm no expert on locks, but you don't usually want to keep a lock while
    some long-running computation goes on. You want the computation to be
    done by a separate thread, put its results somewhere, and then notify
    the choreographing thread that the result is ready.

In this case I'm extracting the noun phrases from the body of an email
message (returned as a list). I have a collection of email messages
organized by month (typically 1000 to 3000 messages per month). I'm using
concurrent.futures.ThreadPoolExecutor() with the default number of workers
(os.cpu_count() * 1.5, or 12 threads on my system) to process each month,
so 12 active threads at a time. Given that the process is pretty much CPU
bound, maybe reducing the number of workers to the CPU count would make
sense. Processing of each email message enters that with block once. That's
about as minimal as I can make it. I thought for a bit about pushing the
textblob stuff into a separate worker thread, but it wasn't obvious how to
set up queues to handle the communication between the threads created by
ThreadPoolExecutor() and the worker thread. Maybe I'll think about it
harder. (I have a related problem with SQLite, since an open database can't
be manipulated from multiple threads. That makes much of the program's
end-of-run processing single-threaded.)
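
One hedged sketch of the worker-cap idea; process_month and months below
are stand-ins, not names from Skip's program:

import os
from concurrent.futures import ThreadPoolExecutor

def process_month(month):
    return month        # stand-in for the real per-month work

months = ["2023-01", "2023-02"]     # stand-in inputs

# Cap the pool at one worker per core instead of the executor's default.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    for result in pool.map(process_month, months):
        pass            # e.g. store results into SQLite from this one thread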

    This link may be helpful -

    https://anandology.com/blog/using-iterators-and-generators/

    I don't think that's where my problem is. The lock protects the generation
    of the noun phrases. My loop which does the yielding operates outside of
    that lock's control. The version of the code is my latest, in which I
    tossed out a bunch of phrase-processing code (effectively dead end ideas
    for processing the phrases). Replacing the for loop with a simple return
    seems not to have any effect. In any case, the caller which uses the
    phrases does a fair amount of extra work with the phrases, populating a
    SQLite database, so I don't think the amount of time it takes to process a single email message is dominated by the phrase generation.

    Here's timeit output for the noun_phrases code:

% python -m timeit \
    -s 'text = """`python -m timeit --help`""" ; from textblob import TextBlob ; from textblob.np_extractors import ConllExtractor ; ext = ConllExtractor() ; phrases = TextBlob(text, np_extractor=ext).noun_phrases' \
    'phrases = TextBlob(text, np_extractor=ext).noun_phrases'
5000 loops, best of 5: 98.7 usec per loop

    I process the output of timeit's help message which looks to be about the
    same length as a typical email message, certainly the same order of
    magnitude. Also, note that I call it once in the setup to eliminate the
    initial training of the ConllExtractor instance. I don't know if ~100us qualifies as long running or not.

    I'll keep messing with it.

    Skip

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Rubin@21:1/5 to Skip Montanaro on Sat Feb 25 13:57:48 2023
    Skip Montanaro <skip.montanaro@gmail.com> writes:
    In this case I'm extracting the noun phrases from the body of an email message (returned as a list). I have a collection of email messages
    organized by month (typically 1000 to 3000 messages per month).

This is embarrassingly parallel enough that I would probably launch a
    bunch of separate command line processes with GNU Parallel, rather than
    messing with writing a multi-threaded Python program. That would also
    let you distribute the processing across multiple machines on a network,
    if the cpu requirements warranted it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Weatherby,Gerard@21:1/5 to All on Sat Feb 25 21:47:00 2023
    “I'm no expert on locks, but you don't usually want to keep a lock while
    some long-running computation goes on. You want the computation to be
    done by a separate thread, put its results somewhere, and then notify
    the choreographing thread that the result is ready.”

    Maybe. There are so many possible threaded application designs I’d hesitate to make a general statement.

The threading.Lock.acquire method has flags for both a non-blocking
attempt and a timeout, so a valid design could include a long-running
computation with a main thread or event loop polling the thread. Or the
thread could signal a main loop some other way.
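
Concretely, the two acquire variants look like this (sketch only):

from threading import Lock

lock = Lock()

# Non-blocking attempt: returns False immediately if another thread holds it.
if lock.acquire(blocking=False):
    try:
        pass            # do the guarded work
    finally:
        lock.release()

# Bounded wait: give up after half a second instead of blocking forever.
if lock.acquire(timeout=0.5):
    try:
        pass
    finally:
        lock.release()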

I’ve written some code that coordinated threads by having a process talk
to itself using a socket.socketpair. The advantage is that you can bundle
multiple items (sockets, file handles, a polling timeout) into a
select.select call, which waits without consuming resources (at least on
Linux) until something interesting happens.
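
A stripped-down sketch of that socketpair trick, with illustrative names:

import select
import socket
import threading
import time

r, w = socket.socketpair()      # in-process pipe usable with select()

def worker():
    time.sleep(1)               # stand-in for real work
    w.send(b"x")                # wake the main loop when done

threading.Thread(target=worker).start()

# The main loop sleeps in select() until a socket is readable or the
# timeout expires; more sockets/files can be added to the same call.
ready, _, _ = select.select([r], [], [], 5.0)
if ready:
    r.recv(1)                   # drain the wake-up byte, then handle the event
    print("worker finished")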

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Barry@21:1/5 to All on Sat Feb 25 22:48:47 2023
Re sqlite and threads: the C API can be compiled to be thread-safe, from
my reading of the sqlite docs. What I have not checked is how Python’s
bundled sqlite is compiled. There are claims Python’s sqlite is not
thread-safe.
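
The stdlib does expose enough to check. A small probe (check_same_thread
is a real sqlite3.connect parameter; the rest is a sketch):

import sqlite3

print(sqlite3.threadsafety)     # 0, 1 or 3, per PEP 249
print(sqlite3.sqlite_version)   # version of the bundled C library

# check_same_thread=False disables Python's per-thread guard; you are then
# responsible for serializing access yourself (e.g. with a threading.Lock)
# unless threadsafety reports the C library is fully serialized.
con = sqlite3.connect(":memory:", check_same_thread=False)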

    Barry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Passin@21:1/5 to Skip Montanaro on Sat Feb 25 17:20:20 2023
    On 2/25/2023 4:41 PM, Skip Montanaro wrote:
    Thanks for the responses.

    Peter wrote:

    Which OS is this?

    MacOS Ventura 13.1, M1 MacBook Pro (eight cores).

    Thomas wrote:

    I'm no expert on locks, but you don't usually want to keep a lock while some long-running computation goes on.  You want the computation to be done by a separate thread, put its results somewhere, and then notify
    the choreographing thread that the result is ready.

In this case I'm extracting the noun phrases from the body of an email
message (returned as a list). I have a collection of email messages
organized by month (typically 1000 to 3000 messages per month). I'm using
concurrent.futures.ThreadPoolExecutor() with the default number of workers
(os.cpu_count() * 1.5, or 12 threads on my system) to process each month,
so 12 active threads at a time. Given that the process is pretty much CPU
bound, maybe reducing the number of workers to the CPU count would make
sense. Processing of each email message enters that with block once. That's
about as minimal as I can make it. I thought for a bit about pushing the
textblob stuff into a separate worker thread, but it wasn't obvious how to
set up queues to handle the communication between the threads created by
ThreadPoolExecutor() and the worker thread. Maybe I'll think about it
harder. (I have a related problem with SQLite, since an open database can't
be manipulated from multiple threads. That makes much of the program's
end-of-run processing single-threaded.)

    If the noun extractor is single-threaded (which I think you mentioned),
    no amount of parallel access is going to help. The best you can do is
    to queue up requests so that as soon as the noun extractor returns from
    one call, it gets handed another blob. The CPU will be busy all the
    time running the noun-extraction code.

    If that's the case, you might just as well eliminate all the threads and
    just do it sequentially in the most obvious and simple manner.

    It would possibly be worth while to try this approach out and see what
    happens to the CPU usage and overall computation time.

    This link may be helpful -

https://anandology.com/blog/using-iterators-and-generators/

    I don't think that's where my problem is. The lock protects the
    generation of the noun phrases. My loop which does the yielding operates outside of that lock's control. The version of the code is my latest, in which I tossed out a bunch of phrase-processing code (effectively dead
    end ideas for processing the phrases). Replacing the for loop with a
    simple return seems not to have any effect. In any case, the caller
    which uses the phrases does a fair amount of extra work with the
    phrases, populating a SQLite database, so I don't think the amount of
    time it takes to process a single email message is dominated by the
    phrase generation.

Here's timeit output for the noun_phrases code:

% python -m timeit \
    -s 'text = """`python -m timeit --help`""" ; from textblob import TextBlob ; from textblob.np_extractors import ConllExtractor ; ext = ConllExtractor() ; phrases = TextBlob(text, np_extractor=ext).noun_phrases' \
    'phrases = TextBlob(text, np_extractor=ext).noun_phrases'
5000 loops, best of 5: 98.7 usec per loop

    I process the output of timeit's help message which looks to be about
    the same length as a typical email message, certainly the same order of magnitude. Also, note that I call it once in the setup to eliminate the initial training of the ConllExtractor instance. I don't know if ~100us qualifies as long running or not.

    I'll keep messing with it.

    Skip

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jon Ribbens@21:1/5 to Paul Rubin on Sat Feb 25 23:45:36 2023
    On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
    Jon Ribbens <jon+usenet@unequivocal.eu> writes:
    1) you generally want to use RLock rather than Lock
    Why?

    So that a thread that tries to acquire it twice doesn't block itself,
    etc. Look at the threading lib docs for more info.

    Yes, I know what the docs say, I was asking why you were making the
    statement above. I haven't used Lock very often, but I've literally
    never once in 25 years needed to use RLock. As you say, it's best
    to keep the lock-protected code brief, so it's usually pretty
    obvious that the code can't be re-entered.

    What does this mean? Are you saying the GIL has been removed?

    Last I heard there was an experimental version of CPython with the GIL removed. It is supposed to take less of a performance hit due to INCREF/DECREF than an earlier attempt some years back. I don't know its current status.

    The GIL is an evil thing, but it has been around for so long that most
    of us have gotten used to it, and some user code actually relies on it.
    For example, with the GIL in place, a statement like "x += 1" is always atomic, I believe. But, I think it is better to not have any shared
    mutables regardless.

    I think it is the case that x += 1 is atomic but foo.x += 1 is not.
    Any replacement for the GIL would have to keep the former at least,
    plus the fact that you can do hundreds of things like list.append(foo)
    which are all effectively atomic.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dennis Lee Bieber@21:1/5 to All on Sat Feb 25 22:10:48 2023
    On Sat, 25 Feb 2023 15:41:52 -0600, Skip Montanaro
    <skip.montanaro@gmail.com> declaimed the following:


concurrent.futures.ThreadPoolExecutor() with the default number of workers
(os.cpu_count() * 1.5, or 12 threads on my system) to process each month,
so 12 active threads at a time. Given that the process is pretty much CPU
bound, maybe reducing the number of workers to the CPU count would make
sense.

Unless things have improved a lot over the years, the GIL still limits
active threads to the equivalent of a single CPU. The OS may vary which
CPU it schedules the process on, but only one thread will be running at
any moment regardless of CPU count.

    Common wisdom is that Python threading works well for I/O bound systems, where each thread spends most of its time waiting for some I/O operation to complete -- thereby allowing the OS to schedule other threads.

For CPU-bound work, use of the multiprocessing package may be more suited
-- though you'll have to devise a working IPC system to transfer data
to/from the separate processes (no shared objects, as is possible with
threads).
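
A minimal multiprocessing sketch along those lines; extract() is a
stand-in for the real CPU-bound work:

from multiprocessing import Pool

def extract(text):
    return len(text.split())    # stand-in for e.g. noun-phrase extraction

if __name__ == "__main__":
    texts = ["one message body", "another message body"] * 1000
    with Pool() as pool:                    # defaults to os.cpu_count() workers
        counts = pool.map(extract, texts)   # args/results are pickled (the IPC)
    print(sum(counts))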


    --
    Wulfraed Dennis Lee Bieber AF6VN
    wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Angelico@21:1/5 to Dennis Lee Bieber on Sun Feb 26 16:50:38 2023
    On Sun, 26 Feb 2023 at 16:27, Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:

    On Sat, 25 Feb 2023 15:41:52 -0600, Skip Montanaro
    <skip.montanaro@gmail.com> declaimed the following:


concurrent.futures.ThreadPoolExecutor() with the default number of workers
(os.cpu_count() * 1.5, or 12 threads on my system) to process each month,
so 12 active threads at a time. Given that the process is pretty much CPU
bound, maybe reducing the number of workers to the CPU count would make
sense.

    Unless things have improved a lot over the years, the GIL still limits
    active threads to the equivalent of a single CPU. The OS may swap among
    which CPU as it schedules system processes, but only one thread will be running at any moment regardless of CPU count.

    Specifically, a single CPU core *executing Python bytecode*. There are
    quite a few libraries that release the GIL during computation. Here's
    a script that's quite capable of saturating my CPU entirely - in fact,
    typing this email is glitchy due to lack of resources:

import threading
import bcrypt

results = [0, 0]

def thrd():
    for _ in range(10):
        ok = bcrypt.checkpw(b"password", b'$2b$15$DGDXMb2zvPotw1rHFouzyOVzSopiLIUSedO5DVGQ1GblAd6L6I8/6')
        results[ok] += 1

threads = [threading.Thread(target=thrd) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(results)

I have four cores / eight threads, and yeah, my CPU's not exactly the
latest and greatest (i7 6700k - it was quite good some years ago, but
outstripped now), but feel free to crank the numbers if you want to.

    I'm pretty sure bcrypt won't use more than one CPU core for a single hashpw/checkpw call, but by releasing the GIL during the hard number
    crunching, it allows easy parallelization. Same goes for numpy work,
    or anything else that can be treated as a separate operation.

    So it's more accurate to say that only one CPU core can be
    *manipulating Python objects* at a time, although it's hard to pin
    down exactly what that means, making it easier to say that there can
    only be one executing Python bytecode; it should be possible for any
    function call into a C library to be a point where other threads can
    take over (most notably, any sort of I/O, but also these kinds of
    things).

    As mentioned, GIL-removal has been under discussion at times, most
    recently (and currently) with PEP 703
    https://peps.python.org/pep-0703/ - and the benefits in multithreaded applications always have to be factored against quite significant
    performance penalties. It's looking like PEP 703's proposal has the
    least-bad performance measurements of any GILectomy I've seen so far,
    showing 10% worse performance on average (possibly able to be reduced
    to 5%). As it happens, a GIL just makes sense when you want pure, raw performance, and it's only certain workloads that suffer under it.

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Angelico@21:1/5 to python-list@python.org on Sun Feb 26 16:35:50 2023
    On Sun, 26 Feb 2023 at 16:16, Jon Ribbens via Python-list <python-list@python.org> wrote:

    On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
    The GIL is an evil thing, but it has been around for so long that most
    of us have gotten used to it, and some user code actually relies on it.
    For example, with the GIL in place, a statement like "x += 1" is always atomic, I believe. But, I think it is better to not have any shared mutables regardless.

    I think it is the case that x += 1 is atomic but foo.x += 1 is not.
    Any replacement for the GIL would have to keep the former at least,
    plus the fact that you can do hundreds of things like list.append(foo)
    which are all effectively atomic.

    The GIL is most assuredly *not* an evil thing. If you think it's so
    evil, go ahead and remove it, because we'll clearly be better off
    without it, right?

    As it turns out, most GIL-removal attempts have had a fairly nasty
    negative effect on performance. The GIL is a huge performance boost.

    As to what is atomic and what is not... it's complicated, as always.
    Suppose that x (or foo.x) is a custom type:

class Thing:
    def __iadd__(self, other):
        print("Hi, I'm being added onto!")
        self.increment_by(other)
        return self

    Then no, neither of these is atomic, although if the increment itself
    is, it probably won't matter. As far as I know, the only way that it
    would be at all different for x+=1 and foo.x+=1 would be if the
    __iadd__ method both mutates and returns something other than self,
    which is quite unusual. (Most incrementing is done by either
    constructing a new object to return, or mutating the existing one, but
    not a hybrid.)

    Consider this:

import threading

d = {0: 0, 1: 0, 2: 0, 3: 0}

def thrd():
    for _ in range(10000):
        d[0] += 1
        d[1] += 1
        d[2] += 1
        d[3] += 1

threads = [threading.Thread(target=thrd) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(d)

    Is this code guaranteed to result in 500000 in every slot in the
    dictionary? What if you replace the dictionary with a four-element
    list? Do you need a GIL for this, or some other sort of lock? What
    exactly is it that is needed? To answer that question, let's look at
    exactly what happens in the disassembly:

>>> def thrd():
...     d[0] += 1
...     d[1] += 1
...
>>> import dis
>>> dis.dis(thrd)
  1           0 RESUME                   0

  2           2 LOAD_GLOBAL              0 (d)
             14 LOAD_CONST               1 (0)
             16 COPY                     2
             18 COPY                     2
             20 BINARY_SUBSCR
             30 LOAD_CONST               2 (1)
             32 BINARY_OP               13 (+=)
             36 SWAP                     3
             38 SWAP                     2
             40 STORE_SUBSCR

  3          44 LOAD_GLOBAL              0 (d)
             56 LOAD_CONST               2 (1)
             58 COPY                     2
             60 COPY                     2
             62 BINARY_SUBSCR
             72 LOAD_CONST               2 (1)
             74 BINARY_OP               13 (+=)
             78 SWAP                     3
             80 SWAP                     2
             82 STORE_SUBSCR
             86 LOAD_CONST               0 (None)
             88 RETURN_VALUE


(Your exact disassembly may differ; this was on CPython 3.12.)
    Crucially, note these three instructions that occur in each block: BINARY_SUBSCR, BINARY_OP, and STORE_SUBSCR. Those are a lookup
    (retrieving the value of d[0]), the actual addition (adding one to the
    value), and a store (putting the result back into d[0]). So it's
    actually not guaranteed to be atomic; it would be perfectly reasonable
    to interrupt that sequence and have something else do another
    subscript.

    Here's the equivalent with just incrementing a global:

>>> def thrd():
...     x += 1
...
>>> dis.dis(thrd)
  1           0 RESUME                   0

  2           2 LOAD_FAST_CHECK          0 (x)
              4 LOAD_CONST               1 (1)
              6 BINARY_OP               13 (+=)
             10 STORE_FAST               0 (x)
             12 LOAD_CONST               0 (None)
             14 RETURN_VALUE


    The exact same sequence: load, add, store. Still not atomic.

    General takeaway: The GIL is a performance feature, not a magic
    solution, and certainly not an evil beast that must be slain at any
    cost. Attempts to remove it always have to provide equivalent
    protection in some other way. But the protection you think you have
    might not be what you actually have.
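
For completeness, the conventional fix is to make the load-add-store one
critical section; a sketch of the dictionary example above with a lock
added:

import threading

d = {0: 0, 1: 0, 2: 0, 3: 0}
d_lock = threading.Lock()

def thrd():
    for _ in range(10000):
        for k in range(4):
            with d_lock:   # the read-modify-write is now one critical section
                d[k] += 1

threads = [threading.Thread(target=thrd) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(d)                   # every slot reaches 500000, GIL or no GIL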

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Barry Scott@21:1/5 to Jon Ribbens via Python-list on Sun Feb 26 11:53:26 2023
    On 25/02/2023 23:45, Jon Ribbens via Python-list wrote:
    I think it is the case that x += 1 is atomic but foo.x += 1 is not.

    No that is not true, and has never been true.

>>> def x(a):
...     a += 1
...
>>> dis.dis(x)
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (a)
              4 LOAD_CONST               1 (1)
              6 BINARY_OP               13 (+=)
             10 STORE_FAST               0 (a)
             12 LOAD_CONST               0 (None)
             14 RETURN_VALUE


    As you can see there are 4 byte code ops executed.

    Python's eval loop can switch to another thread between any of them.

It is not true that the GIL provides atomic operations in Python.

    Barry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jon Ribbens@21:1/5 to Chris Angelico on Sun Feb 26 16:09:44 2023
    On 2023-02-26, Chris Angelico <rosuav@gmail.com> wrote:
    On Sun, 26 Feb 2023 at 16:16, Jon Ribbens via Python-list
    <python-list@python.org> wrote:
    On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
    The GIL is an evil thing, but it has been around for so long that most
    of us have gotten used to it, and some user code actually relies on it.
    For example, with the GIL in place, a statement like "x += 1" is always
    atomic, I believe. But, I think it is better to not have any shared
    mutables regardless.

    I think it is the case that x += 1 is atomic but foo.x += 1 is not.
    Any replacement for the GIL would have to keep the former at least,
    plus the fact that you can do hundreds of things like list.append(foo)
    which are all effectively atomic.

    The GIL is most assuredly *not* an evil thing. If you think it's so
    evil, go ahead and remove it, because we'll clearly be better off
    without it, right?

    If you say so. I said nothing whatsoever about the GIL being evil.

    As it turns out, most GIL-removal attempts have had a fairly nasty
    negative effect on performance. The GIL is a huge performance boost.

    As to what is atomic and what is not... it's complicated, as always.
    Suppose that x (or foo.x) is a custom type:

    Yes, sure, you can make x += 1 not work even single-threaded if you
    make custom types which override basic operations. I'm talking about
    when you're dealing with simple atomic built-in types such as integers.

    Here's the equivalent with just incrementing a global:

>>> def thrd():
...     x += 1
...
>>> dis.dis(thrd)
  1           0 RESUME                   0

  2           2 LOAD_FAST_CHECK          0 (x)
              4 LOAD_CONST               1 (1)
              6 BINARY_OP               13 (+=)
             10 STORE_FAST               0 (x)
             12 LOAD_CONST               0 (None)
             14 RETURN_VALUE


    The exact same sequence: load, add, store. Still not atomic.

    And yet, it appears that *something* changed between Python 2
    and Python 3 such that it *is* atomic:

import sys, threading

class Foo:
    x = 0

foo = Foo()
y = 0

def thrd():
    global y
    for _ in range(10000):
        foo.x += 1
        y += 1

threads = [threading.Thread(target=thrd) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(sys.version)
print(foo.x, y)

    2.7.5 (default, Jun 28 2022, 15:30:04)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
    (64489, 59854)

    3.8.10 (default, Nov 14 2022, 12:59:47)
    [GCC 9.4.0]
    500000 500000

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jon Ribbens@21:1/5 to Barry Scott on Sun Feb 26 16:11:11 2023
    On 2023-02-26, Barry Scott <barry@barrys-emacs.org> wrote:
    On 25/02/2023 23:45, Jon Ribbens via Python-list wrote:
    I think it is the case that x += 1 is atomic but foo.x += 1 is not.

    No that is not true, and has never been true.

>>> def x(a):
...     a += 1
...
>>> dis.dis(x)
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (a)
              4 LOAD_CONST               1 (1)
              6 BINARY_OP               13 (+=)
             10 STORE_FAST               0 (a)
             12 LOAD_CONST               0 (None)
             14 RETURN_VALUE


    As you can see there are 4 byte code ops executed.

    Python's eval loop can switch to another thread between any of them.

It is not true that the GIL provides atomic operations in Python.

    That's oversimplifying to the point of falsehood (just as the opposite
    would be too). And: see my other reply in this thread just now - if the
    GIL isn't making "x += 1" atomic, something else is.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Skip Montanaro@21:1/5 to All on Sun Feb 26 11:53:45 2023
Thanks for the various replies. The program originally started out
single-threaded. I wandered down the multi-threaded path to see if I could
get a performance boost using Sam Gross's NoGIL fork
<https://github.com/colesbury/nogil-3.12>. I was pretty sure the GIL would
limit multi-threading performance on a stock Python interpreter. When I
first switched to threads, I didn't have a lock around the one or two
places which called out to the TextBlob <https://textblob.readthedocs.io/>
/ NLTK stuff. The use of threading.Lock was the obvious, simplest choice,
and it solved the crash I saw without it. I'm still thinking about using
queues to communicate between the email-processing threads and the
TextBlob & SQLite processing stuff.

    I had been doing a bit of pre- and post-processing of the default TextBlob
    noun phrase generation, but I wasn't happy with it, so I decided to
    experiment with an alternate noun phrase extractor <https://textblob.readthedocs.io/en/dev/api_reference.html?highlight=ConllExtractor#textblob.en.np_extractors.ConllExtractor>.
    I was happier with that, so ripped out most of the ad hoc stuff I was
    doing. While doing this code surgery, I moved back to 3.11 to have a more trusty Python interpreter. (I've yet to encounter a problem with NoGIL,
    just cutting back on moving parts, and wasn't seeing any obvious
    performance gains.)

    As for SQLite and multi-threading, I figured if the core devs hadn't yet
    gotten around to making it available then it probably wasn't
    straightforward. I wasn't willing to tackle that.

    So, I'll keep messing around. It's all just for fun <https://www.smontanaro.net/CR> anyway.

    Skip

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Rubin@21:1/5 to Chris Angelico on Sun Feb 26 10:50:45 2023
    Chris Angelico <rosuav@gmail.com> writes:
    The GIL is most assuredly *not* an evil thing. If you think it's so
    evil, go ahead and remove it, because we'll clearly be better off
    without it, right? ... As it turns out, most GIL-removal attempts
    have had a fairly nasty negative effect on performance. The GIL is a
    huge performance boost.

    The evil GIL doesn't give any performance boost. Rather, removing it
    gives a terrible performance loss, because CPython's evil reference
    counting memory management system would require evil locks around the
    evil reference counts in the absence of the GIL, causing a slowdown.

    The "right" fix is to throw out the refcounts and use a proper garbage collector, like every Lisp implementation has done back to the 1950s. MicroPython, IronPython, and other Python implementations all do this as
    well. But, this change would break CPython's C API pretty badly, so we
    are stuck. That is another reason Python 3 is a tragedy. The 2 to 3 transition would have been the right time to retire CPython entirely,
    and move to a PyPy based reference implementation.

    Yeah, yeah, deterministic releases and the myth of no pauses.
    Deterministic release is why the "with" statement was added: no need to
    rely on refcounting for that. And realtime (bounded latency) GC's exist
    for those who want them. CPython's refcounting can have unbounded
    delays if you free a large structure and millions of refcounts have to
    all be decremented.

    CPython of course ended up adding a GC anyway, to clean up after the
    refcount system in cases where there is cyclic structure needing to be reclaimed...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Skip Montanaro@21:1/5 to All on Sun Feb 26 18:20:02 2023
    And yet, it appears that *something* changed between Python 2 and Python
    3 such that it *is* atomic:

    I haven't looked, but something to check in the source is opcode
    prediction. It's possible that after the BINARY_OP executes, opcode
    prediction jumps straight to the STORE_FAST opcode, avoiding the transfer
    to the top of the virtual machine loop. That would (I think) avoid checks related to GIL release and thread switches.

    I don't guarantee that's what's going on, and even if I'm correct, I don't think you can rely on it.

    Skip

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Angelico@21:1/5 to python-list@python.org on Mon Feb 27 12:25:28 2023
    On Mon, 27 Feb 2023 at 10:42, Jon Ribbens via Python-list <python-list@python.org> wrote:

    On 2023-02-26, Chris Angelico <rosuav@gmail.com> wrote:
    On Sun, 26 Feb 2023 at 16:16, Jon Ribbens via Python-list
    <python-list@python.org> wrote:
    On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
The GIL is an evil thing, but it has been around for so long that most
of us have gotten used to it, and some user code actually relies on it.
For example, with the GIL in place, a statement like "x += 1" is always
atomic, I believe. But, I think it is better to not have any shared
mutables regardless.

    I think it is the case that x += 1 is atomic but foo.x += 1 is not.
    Any replacement for the GIL would have to keep the former at least,
    plus the fact that you can do hundreds of things like list.append(foo)
    which are all effectively atomic.

    The GIL is most assuredly *not* an evil thing. If you think it's so
    evil, go ahead and remove it, because we'll clearly be better off
    without it, right?

    If you say so. I said nothing whatsoever about the GIL being evil.

    You didn't, but I was also responding to Paul's description that the
    GIL "is an evil thing". Apologies if that wasn't clear.

    Yes, sure, you can make x += 1 not work even single-threaded if you
    make custom types which override basic operations. I'm talking about
    when you're dealing with simple atomic built-in types such as integers.

    Here's the equivalent with just incrementing a global:

    def thrd():
    ... x += 1
    ...
    dis.dis(thrd)
    1 0 RESUME 0

    2 2 LOAD_FAST_CHECK 0 (x)
    4 LOAD_CONST 1 (1)
    6 BINARY_OP 13 (+=)
    10 STORE_FAST 0 (x)
    12 LOAD_CONST 0 (None)
    14 RETURN_VALUE


    The exact same sequence: load, add, store. Still not atomic.

    And yet, it appears that *something* changed between Python 2
    and Python 3 such that it *is* atomic:

    I don't think that's a guarantee. You might be unable to make it
    break, but that doesn't mean it's dependable.

    In any case, it's not the GIL that's doing this. It might be a quirk
    of the current implementation of the core evaluation loop, or it might
    be something unrelated, but whatever it is, removing the GIL wouldn't
    change that; and it's certainly no different whether it's a global or
    an attribute of an object.

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Speer@21:1/5 to python-list@python.org on Sun Feb 26 22:19:55 2023
I wanted to provide an example showing that your claimed atomicity is
simply wrong, but I found there is something different in the 3.10+
CPython implementations.

    I've tested the code at the bottom of this message using a few docker
    python images, and it appears there is a difference starting in 3.10.0

python:3.8
EXPECTED 2560000000
ACTUAL     84533137
python:3.9
EXPECTED 2560000000
ACTUAL     95311773
python:3.10 (.8)
EXPECTED 2560000000
ACTUAL   2560000000

just to see if there was a specific sub-version of 3.10 that added it:
python:3.10.0
EXPECTED 2560000000
ACTUAL   2560000000

    nope, from the start of 3.10 this is happening

    the only difference in the bytecode I see is 3.10 adds SETUP_LOOP and
    POP_BLOCK around the for loop

I don't see anything different in the long C code that I would expect would cause this.

    AFAICT the inplace add is null for longs and so should revert to the
    long_add that always creates a new integer in x_add

    another test
    python:3.11
    EXPECTED 2560000000
    ACTUAL 2560000000

    I'm not sure where the difference is at the moment. I didn't see anything
    in the release notes given a quick glance.

    I do agree that you shouldn't depend on this unless you find a written guarantee of the behavior, as it is likely an implementation quirk of some
    kind

    --[code]--

import threading

UPDATES = 10000000
THREADS = 256

vv = 0

def update_x_times( xx ):
    global vv
    for _ in range( xx ):
        vv += 1

def main():
    tts = []
    for _ in range( THREADS ):
        tts.append( threading.Thread( target = update_x_times, args = (UPDATES,) ) )

    for tt in tts:
        tt.start()

    for tt in tts:
        tt.join()

    print( 'EXPECTED', UPDATES * THREADS )
    print( 'ACTUAL  ', vv )

if __name__ == '__main__':
    main()

On Sun, Feb 26, 2023 at 6:35 PM Jon Ribbens via Python-list <python-list@python.org> wrote:

[...]


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Speer@21:1/5 to knomenet@gmail.com on Mon Feb 27 01:26:59 2023
    https://stackoverflow.com/questions/69993959/python-threads-difference-for-3-10-and-others

    https://github.com/python/cpython/commit/4958f5d69dd2bf86866c43491caf72f774ddec97

It's a quirk of implementation. The scheduler currently only checks if it
needs to release the GIL after the POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE,
JUMP_ABSOLUTE, CALL_METHOD, CALL_FUNCTION, CALL_FUNCTION_KW, and
CALL_FUNCTION_EX opcodes.

import code
import dis
dis.dis( code.update_x_times )

 10           0 LOAD_GLOBAL              0 (range)
              2 LOAD_FAST                0 (xx)
              4 CALL_FUNCTION            1
              ##### GIL CAN RELEASE HERE #####
              6 GET_ITER
        >>    8 FOR_ITER                 6 (to 22)
             10 STORE_FAST               1 (_)

 12          12 LOAD_GLOBAL              1 (vv)
             14 LOAD_CONST               1 (1)
             16 INPLACE_ADD
             18 STORE_GLOBAL             1 (vv)
             20 JUMP_ABSOLUTE            4 (to 8)
              ##### GIL CAN RELEASE HERE (after JUMP_ABSOLUTE points the
              ##### instruction counter back to FOR_ITER, but before the
              ##### interpreter actually jumps to FOR_ITER again) #####

 10     >>   22 LOAD_CONST               0 (None)
             24 RETURN_VALUE


due to this, this section:

 12          12 LOAD_GLOBAL              1 (vv)
             14 LOAD_CONST               1 (1)
             16 INPLACE_ADD
             18 STORE_GLOBAL             1 (vv)

    is effectively locked/atomic on post-3.10 interpreters, though this is
    neither portable nor guaranteed to stay that way into the future


On Sun, Feb 26, 2023 at 10:19 PM Michael Speer <knomenet@gmail.com> wrote:

[...]



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Angelico@21:1/5 to Michael Speer on Mon Feb 27 17:37:32 2023
    On Mon, 27 Feb 2023 at 17:28, Michael Speer <knomenet@gmail.com> wrote:

    https://github.com/python/cpython/commit/4958f5d69dd2bf86866c43491caf72f774ddec97

    it's a quirk of implementation. the scheduler currently only checks if it needs to release the gil after the POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE, JUMP_ABSOLUTE, CALL_METHOD, CALL_FUNCTION, CALL_FUNCTION_KW, and CALL_FUNCTION_EX opcodes.


    Oh now that is VERY interesting. It's a quirk of implementation, yes,
    but there's a reason for it; a bug being solved. The underlying
    guarantee about __exit__ should be considered to be defined behaviour,
    meaning that the precise quirk might not be relevant even though the
    bug has to remain fixed in all future versions. But I'd also note here
    that, if it can be absolutely 100% guaranteed that the GIL will be
    released and signals checked on a reasonable interval, there's no
    particular reason to state that signals are checked after every single
    Python bytecode. (See the removed comment about empty loops, which
    would have been a serious issue and is probably why the backward jump
    rule exists.)

    So it wouldn't be too hard for a future release of Python to mandate
    atomicity of certain specific operations. Obviously it'd require
    buy-in from other implementations, but it would be rather convenient
    if, subject to some very tight rules like "only when adding integers
    onto core data types" etc, a simple statement like "x.y += 1" could
    actually be guaranteed to take place atomically.

    Though it's still probably not as useful as you might hope. In C, if I
    can do "int id = counter++;" atomically, it would guarantee me a new
    ID that no other thread could ever have. But in Python, that increment operation doesn't give you the result, so all it's really useful for
    is statistics on operations done. Still, that in itself could be of
    value in quite a few situations.
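
In today's Python the portable way to get a unique ID is still an explicit
lock around the increment-and-read; a minimal sketch:

import threading

class Counter:
    # Increment-and-read as one critical section: the closest Python
    # equivalent of an atomic "int id = counter++;" in C.
    def __init__(self):
        self._lock = threading.Lock()
        self._n = 0

    def next_id(self):
        with self._lock:
            self._n += 1
            return self._n

ids = Counter()
print(ids.next_id())   # -> 1, unique even under heavy threading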

    In any case, though, this isn't something to depend upon at the moment.

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul Rubin@21:1/5 to Chris Angelico on Mon Feb 27 00:37:18 2023
    Chris Angelico <rosuav@gmail.com> writes:
    So it wouldn't be too hard for a future release of Python to mandate atomicity of certain specific operations.... "only when adding
    integers onto core data types" etc, a simple statement like "x.y += 1"
    could actually be guaranteed to take place atomically.

That would be pretty awful and messy, have performance costs, be
CPython-specific, etc. It is likely feasible to have something like

    with atomic(): x.y += 1

    where atomic() used a locked machine instruction rather than a system
    call or anything awful like that. The article "Beautiful Concurrency" describes the GHC API for stuff like this, but not so much the
    underlying implementation:

    https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.1337&rep=rep1&type=pdf

The implementation uses a hardware compare-and-swap instruction (LOCK
CMPXCHG on the x86) that exists on most modern CPUs. This goes into more
detail, iirc:

    https://research.microsoft.com/en-us/um/people/simonpj/Papers/stm/stm.pdf

    Haskell uses its type system to make sure you don't do forbidden stuff
    like I/O inside a memory transaction, but Python could do dynamic checks
    and raise errors if you made a mistake.

    The proposed multitasking spec for ANS Forth includes compare-and-swap
    and it is actually used in a STM (software transactional memory) library written in Forth. It is very cool stuff and makes a whole lot of
    lock-related pain go away in concurrent programs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Barry@21:1/5 to All on Tue Feb 28 23:04:39 2023
    Though it's still probably not as useful as you might hope. In C, if I
    can do "int id = counter++;" atomically, it would guarantee me a new
    ID that no other thread could ever have.

C does not have to do that atomically. In fact it is free to use lots of
instructions to build the int value. And some compilers indeed do; the
Linux kernel folks see this in gcc-generated code.

    I understand you have to use the new atomics features.

    Barry

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Angelico@21:1/5 to Barry on Wed Mar 1 12:58:54 2023
    On Wed, 1 Mar 2023 at 10:04, Barry <barry@barrys-emacs.org> wrote:

    Though it's still probably not as useful as you might hope. In C, if I
    can do "int id = counter++;" atomically, it would guarantee me a new
    ID that no other thread could ever have.

C does not have to do that atomically. In fact it is free to use lots of
instructions to build the int value. And some compilers indeed do; the
Linux kernel folks see this in gcc-generated code.

    I understand you have to use the new atomics features.


    Yeah, I didn't have a good analogy so I went with a hypothetical. The atomicity would be more useful in that context as it would give
    lock-free ID generation, which doesn't work in Python.

    ChrisA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)