• Fast lookup of bulky "table"

    From Dino@21:1/5 to All on Sat Jan 14 23:26:27 2023
    Hello, I have built a PoC service in Python Flask for my work, and - now
    that the point is made - I need to make it a little more performant (to
    be honest, chances are that someone else will pick up from where I left
    off, and implement the same service from scratch in a different language (GoLang? .Net? Java?) but I am digressing).

    Anyway, my Flask service initializes by loading a big "table" of 100k
    rows and 40 columns or so (memory footprint: order of 300 MB) and then
    accepts queries through a REST endpoint. Columns are strings, enums, and
    numbers. Once initialized, the table is read only. The endpoint will
    parse the query and match it against column values (equality,
    inequality, greater than, etc.). Finally, it will return a (JSON) list of
    all rows that satisfy all conditions in the query.

    As you can imagine, this is not very performant in its current form, but performance was not the point of the PoC - at least initially.

    Before I deliver the PoC to a more experienced software architect who
    will look at my code, though, I wouldn't mind looking a bit less lame
    and doing something about performance in my own code first, possibly by
    bringing the average time for queries down from where it is now (order
    of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds on
    average.

    To be honest, I was already able to bring the time down to a handful of
    microseconds thanks to a rudimentary cache that associates the
    "signature" of a query with its result, and serves it the next time the
    same query is received, but this may not be good enough: 1) queries
    might be many and very different from one another each time, AND 2) I am
    not sure the server will have a ton of RAM if/when this thing - or
    whatever is derived from it - is placed into production.

    How can I make my queries generally more performant, ideally also in
    case of a new query?

    Here's what I have been considering:

    1. Making my cache more "modular", i.e. cache the result of certain
    (wide) queries. When a complex query comes in, I may be able to restrict
    my search to a subset of the rows (as determined by a previously cached
    partial query). This should keep the memory footprint under control.

    2. Load my data into a numpy.array and use numpy.array operations to
    slice and dice my data.

    3. Load my data into sqlite3 and use SELECT statements to query my table.
    I have never used sqlite, plus there's some extra complexity as
    comparing certain columns requires custom logic, but I wonder if this
    architecture would work well also when dealing with a 300 MB database.

    4. Other ideas?

    Hopefully I made sense. Thank you for your attention

    Dino

  • From Lars Liedtke@21:1/5 to All on Sun Jan 15 09:17:03 2023
    Hey,

    Before you start optimizing, I would suggest that you measure response times, query times, data search times and so on. In order to save time, you have to know where you "lose" time.

    Does your service really have to load the whole table at once? Yes, that might lead to quicker response times on requests, but databases are often very good at caching themselves, so that the first request might be slower than following requests with
    similar parameters. Do you use a database, or are you reading from a file? Are you maybe looping through your whole dataset on every request, instead of asking for just the specific data?

    Before you start introducing a cache and its added complexity, do you really need that cache?

    You are talking about saving microseconds; that sounds a bit as if you might be “overdoing” it. How many requests will you have in the future? At least in which order of magnitude, and how quick do they have to be? You write about 1-4 seconds on your laptop,
    but that does not really tell you that much, because most probably the service will run on a server. I am not saying that you should get a server or a cloud instance to test against, but that you should talk with your architect about that.

    I totally understand your impulse to appear as good as can be, but you have to know where you really need to debug and optimize. It will not be advantageous for you if you start to optimize for optimizing's sake. Additionally, if your service is a PoC,
    optimizing now might not be the first thing to worry about; worry instead about making everything as simple and readable as possible and about not spending too much time just showing how it could work.

    But of course, I do not know the tasks given to you and the expectations you have to fulfil. All I am trying to say is to reconsider where you really could improve and how far you have to improve.



    Lars Liedtke
  • From Peter J. Holzer@21:1/5 to Dino on Sun Jan 15 12:14:24 2023
    On 2023-01-14 23:26:27 -0500, Dino wrote:
    > Hello, I have built a PoC service in Python Flask for my work, and - now
    > that the point is made - I need to make it a little more performant (to be
    > honest, chances are that someone else will pick up from where I left off,
    > and implement the same service from scratch in a different language
    > (GoLang? .Net? Java?) but I am digressing).
    >
    > Anyway, my Flask service initializes by loading a big "table" of 100k rows
    > and 40 columns or so (memory footprint: order of 300 Mb)

    300 MB is large enough that you should at least consider putting that
    into a database (Sqlite is probably simplest. Personally I would go with PostgreSQL because I'm most familiar with it and Sqlite is a bit of an outlier).

    The main reason for putting it into a database is the ability to use
    indexes, so you don't have to scan all 100 k rows for each query.

    You may be able to do that for your Python data structures, too: Can you
    set up dicts which map to subsets you need often?
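
    For illustration, here is a minimal sketch of that idea (column names and
    rows are invented, not Dino's actual schema): build one dict per filterable
    column that maps each value to the set of row numbers holding it, then
    answer an equality query by intersecting those sets.

    from collections import defaultdict

    rows = [
        {"country": "US", "status": "active", "price": 10.0},
        {"country": "DE", "status": "active", "price": 7.5},
        {"country": "US", "status": "closed", "price": 3.2},
    ]

    # One inverted index per column: value -> set of row numbers.
    index = {col: defaultdict(set) for col in ("country", "status")}
    for i, row in enumerate(rows):
        for col in index:
            index[col][row[col]].add(i)

    def lookup(**conditions):
        """Rows matching all equality conditions, e.g. lookup(country="US")."""
        matching = None
        for col, value in conditions.items():
            hits = index[col].get(value, set())
            matching = hits if matching is None else matching & hits
        return [rows[i] for i in sorted(matching or ())]

    print(lookup(country="US", status="active"))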

    There are some specialized in-memory bitmap implementations which can be
    used for filtering. I've used
    [Judy bitmaps](https://judy.sourceforge.net/doc/Judy1_3x.htm) in the
    past (mostly in Perl).
    These days [Roaring Bitmaps](https://www.roaringbitmap.org/) is probably
    the most popular. I see several packages on PyPI - but I haven't used
    any of them yet, so no recommendation from me.

    Numpy might also help. You will still have linear scans, but it is more
    compact and many of the searches can probably be done in C and not in
    Python.
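
    As a sketch of the numpy variant (again with invented columns): keep each
    column in its own array and combine boolean masks, so the comparisons run
    as vectorized C loops rather than per-row Python code.

    import numpy as np

    n = 100_000
    rng = np.random.default_rng(0)
    price = rng.uniform(0, 100, n)                # numeric column
    status = rng.choice(["active", "closed"], n)  # enum-like column

    # Each comparison yields a boolean mask; & combines them element-wise.
    mask = (status == "active") & (price > 90)
    matching_rows = np.flatnonzero(mask)          # indices of the matching rows
    print(len(matching_rows))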

    > As you can imagine, this is not very performant in its current form, but
    > performance was not the point of the PoC - at least initially.

    For performance optimization it is very important to actually measure
    performance, and a good profiler helps very much in identifying hot
    spots. Unfortunately until recently Python was a bit deficient in this
    area, but [Scalene](https://pypi.org/project/scalene/) looks promising.

    hp

  • From Dino@21:1/5 to Peter J. Holzer on Sun Jan 15 08:27:29 2023
    Thank you, Peter. Yes, setting up my own indexes is more or less the
    idea of the modular cache that I was considering. Seeing others think in
    the same direction makes it look more viable.

    About Scalene, thank you for the pointer. I'll do some research.

    Do you have any idea about the speed of a SELECT query against a 100k
    rows / 300 Mb Sqlite db?

    Dino

  • From Dino@21:1/5 to Lars Liedtke on Sun Jan 15 08:20:01 2023
    Thank you for your answer, Lars. Just a clarification: I am already
    doing a rough measuring of my queries.

    A fresh query without any caching: < 4s.

    Cached full query: < 5 micro-s (i.e. 6 orders of magnitude faster)

    Desired speed for my PoC: < 10 ms

    Also, I didn't want to ask a question with way too many "moving parts",
    but when I talked about the "table", it's actually a 100k long list of
    IDs. I can then use each ID to invoke an API that will return those 40 attributes. The API is fast, but still, I am bound to loop through the
    whole thing to respond to the query, that's unless I pre-load the data
    into something that allows faster access.

    Also, as you correctly observed, "looking good with my colleagues" is a nice-to-have feature at this point, not really an absolute requirement :)

    Dino

  • From Thomas Passin@21:1/5 to Peter J. Holzer on Sun Jan 15 10:38:22 2023
    On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
    > On 2023-01-14 23:26:27 -0500, Dino wrote:
    >> Anyway, my Flask service initializes by loading a big "table" of 100k rows
    >> and 40 columns or so (memory footprint: order of 300 Mb)
    >
    > 300 MB is large enough that you should at least consider putting that
    > into a database (Sqlite is probably simplest. Personally I would go with
    > PostgreSQL because I'm most familiar with it and Sqlite is a bit of an
    > outlier).
    >
    > The main reason for putting it into a database is the ability to use
    > indexes, so you don't have to scan all 100 k rows for each query.

    I have an (inherited) server program that uses about 30 MB of data in a
    MySQL database. It services queries received over the network. It too
    had performance problems, to which adding indexes and smarter joins
    helped but not enough.

    I changed the program so that at startup it imports much of the data
    into Python dictionaries that are structured to support the kinds of
    queries that need the help. Response time to queries dropped
    dramatically. Some kinds of queries needed more help, and I collected auxiliary collections of (usually highly pre-processed) data into
    ordinary files, and those too get imported into dictionaries during startup.

    Note that these dictionaries do not always match the table structures.
    Some of them change the structure to make queries easier to process. You
    may be able to do that with Python code, or by creating SQL views in the database and importing directly from the views (database views take
    almost no database memory).
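
    A small sketch of that pattern with made-up table and column names (shown
    with sqlite3 for brevity; the original program used MySQL): define a view
    shaped like a common query, then import it into a dict keyed the way the
    query needs it.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'acme', 10.0), (2, 'acme', 2.5), (3, 'zenith', 7.0);
        -- A view shaped for the query "total spent per customer".
        CREATE VIEW customer_totals AS
            SELECT customer, SUM(total) AS total FROM orders GROUP BY customer;
    """)

    # Import the view into a dictionary at startup, keyed by customer.
    totals = dict(conn.execute("SELECT customer, total FROM customer_totals"))

    print(totals["acme"])   # answered from memory, no SQL at request time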

    The drawback is that all that data is now stored in memory while the
    program is running. In my case, many hundreds of MB. But if it would
    be too much memory for you - you would need to prototype it to know -
    you should let the database engine do the work. It is more highly
    optimized and efficient for searches than your code could ever be. But
    there will be a price to pay. The price is in denormalizing the database
    table design. This means to include redundant data, organized to match
    the kinds of queries that will be made. No more 3rd normal form! Your
    sql queries will need to be designed to take advantage of this new
    structure. This will be a cost because the database will be larger, but
    also because the redundancies will make it much harder to update the
    data correctly. Fortunately you do not need to do that during normal
    operation (my program's data was also static like yours).

    PostgreSQL would probably be a better choice than Sqlite, since it
    supports features such as foreign keys, and has a function definition capability.

  • From Peter J. Holzer@21:1/5 to Thomas Passin on Sun Jan 15 20:39:14 2023
    On 2023-01-15 10:38:22 -0500, Thomas Passin wrote:
    > I have an (inherited) server program that uses about 30 MB of data in a
    > MySQL database. It services queries received over the network. It too had
    > performance problems, to which adding indexes and smarter joins helped but
    > not enough.
    >
    > I changed the program so that at startup it imports much of the data into
    > Python dictionaries that are structured to support the kinds of queries
    > that need the help. Response time to queries dropped dramatically.

    This is to be expected: Firstly, because you don't have disk accesses
    any more, secondly because you don't have network latency any more and
    thirdly, because you structured the data to fit the queries.

    The thing to keep in mind is that the relational database model was
    invented to have a uniform and simple way to model all data, and that
    RDBMSs are designed to handle all workloads (from a single tiny table to thousands of tables with hundreds of terabytes) reasonably well. For any
    given application you can always find a more efficient solution than
    using an RDBMS. Sometimes it's simple (just load all the data into a
    dict and serve from there), sometimes it's a major research project.
    The nice thing about RDBMSs isn't that they are the optimal solution for anything but that they are a "good enough" solution for a large class of problems.

    hp

  • From Weatherby,Gerard@21:1/5 to Lars Liedtke on Sun Jan 15 19:23:40 2023
    That's about what I got using a Python dictionary on random data on a high memory machine.

    https://github.com/Gerardwx/database_testing.git

    It's not obvious to me how to get it much faster than that.

  • From Weatherby,Gerard@21:1/5 to Lars Liedtke on Sun Jan 15 19:36:53 2023
    I think any performance improvements would have to come from a language change or better indexing of the data.

  • From Thomas Passin@21:1/5 to Peter J. Holzer on Sun Jan 15 14:58:12 2023
    On 1/15/2023 2:39 PM, Peter J. Holzer wrote:
    > On 2023-01-15 10:38:22 -0500, Thomas Passin wrote:
    >> I changed the program so that at startup it imports much of the data into
    >> Python dictionaries that are structured to support the kinds of queries
    >> that need the help. Response time to queries dropped dramatically.
    >
    > This is to be expected: Firstly, because you don't have disk accesses
    > any more, secondly because you don't have network latency any more and
    > thirdly, because you structured the data to fit the queries.

    Of course: that's exactly why I made those changes. The tradeoff is
    using more memory for your program, sometimes a lot more.

    > The thing to keep in mind is that the relational database model was
    > invented to have a uniform and simple way to model all data, and that
    > RDBMSs are designed to handle all workloads (from a single tiny table to
    > thousands of tables with hundreds of terabytes) reasonably well. For any
    > given application you can always find a more efficient solution than
    > using an RDBMS. Sometimes it's simple (just load all the data into a
    > dict and serve from there), sometimes it's a major research project.
    > The nice thing about RDBMSs isn't that they are the optimal solution for
    > anything but that they are a "good enough" solution for a large class of
    > problems.

    Often the solution is careful (and not very normalized) table design to
    support your queries. In the case I'm discussing, it was easier for me
    to make Python do the work, and I could afford the memory load. In
    other cases, you have to put in the work on the database side. Often
    for slow queries, disk latency and I/O are not the limiting factors, but
    you have to put in the work and do the testing to make sure.

  • From dn@21:1/5 to Gerard on Mon Jan 16 09:12:30 2023
    On 16/01/2023 08.36, Weatherby,Gerard wrote:
    > I think any performance improvements would have to come from a language
    > change or better indexing of the data.

    Exactly!

    Expanding on @Peter's post: databases (relational or not) are best
    organised according to use. Some must accept rapid insert/updates.
    Others are about look-ups (data-retrieval).

    A basic RDBMS, just as a Python dict, may only offer a single key for
    efficient retrieval.

    Postgres and MySQL (for example) enable the establishment of multiple
    and sophisticated indices/indexes, and the aptly-named "views" of data.

    If the queries can be grouped according to the manner in which the data
    must be accessed, a view could be built for each. At which time, even if
    every row must be accessed, the retrieval will be made more efficient
    and/or the response better-organised.

    Thus, suppose we have a DB of people. Many retrievals are likely to utilise
    an index on 'name'. However, if at times interest is limited to place or
    suburb, an index and a view of such will speed things up from O(n).
    Similarly, if a query is only to return people with a dog license.

    Some programmers don't realise that SQL can also be used for
    calculations, eg the eponymous COUNT(), which saves (CPU-time and coding-effort) over post-processing in Python.
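
    To make that concrete, a sketch only (invented "people" table, sqlite3
    standing in for Postgres/MySQL):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE people (name TEXT, suburb TEXT, has_dog_license INTEGER);
        -- Index for the retrievals that filter by suburb.
        CREATE INDEX idx_people_suburb ON people (suburb);
        -- A view for one recurring way of looking at the data.
        CREATE VIEW dog_owners AS
            SELECT name, suburb FROM people WHERE has_dog_license = 1;
    """)

    # Let SQL do the counting instead of fetching rows and counting in Python.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM people WHERE suburb = ?", ("Ponsonby",)
    ).fetchone()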

    If there are many ways to group/examine the data, then this may not be
    possible, but that depends upon the count of views cf the value of
    speedy-response - and bearing in mind that the demands for response-time
    may vary by type of query/data-access.

    So many variables to consider ...

    --
    Regards,
    =dn

  • From rbowman@21:1/5 to Dino on Sun Jan 15 20:54:18 2023
    On Sun, 15 Jan 2023 08:27:29 -0500, Dino wrote:

    > Do you have any idea about the speed of a SELECT query against a 100k
    > rows / 300 Mb Sqlite db?

    https://www.sqlite.org/speed.html

    The site is old but has a number of comparisons. I have not used SQLite
    with Python yet but with both C and C# I've been impressed with the speed versus Postgres or MSSQL. One thing to watch is insertions. By default
    each insertion is a transaction. There is a dramatic speed increase for multiple insertions if you explicitly start a transaction, do the inserts,
    and end the transaction.
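
    From Python's sqlite3 module the same idea looks roughly like this (a
    sketch with an invented table); the point is one commit for the whole
    batch rather than one per row:

    import sqlite3

    conn = sqlite3.connect("test.db")
    conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, val TEXT)")

    rows = [(i, f"value{i}") for i in range(100_000)]

    # "with conn:" wraps the block in a single transaction:
    # it commits on success and rolls back on an exception.
    with conn:
        conn.executemany("INSERT INTO t (id, val) VALUES (?, ?)", rows)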

    My usage is a result of ESRI's dropping their C++/C# Engine API. My new
    approach uses queries against the ArcGIS Server REST interface for much of
    the functionality, but some spatial queries can be replaced with
    predetermined tabular data rather than runtime spatial queries. For
    example, for a given dataset you can determine the coordinates of every
    intersection beforehand, and the intersection of PINE ST and MAPLE AVE
    becomes a simple search in the SQLite database.

    ESRI's ArcPy is the closest replacement for the legacy C++ API, so I assume
    in the future I will be using it in conjunction with SQLite. The actual
    geodata will still need to be in a spatially aware RDBMS like SQL Server
    or PostgreSQL/PostGIS, but SQLite so far is the fastest and easiest to
    implement for non-spatial data. Also, it is in the public domain, which
    avoids the complexities of MySQL and its derivatives for commercial
    applications.

  • From Stefan Ram@21:1/5 to PythonList@DancesWithMice.info on Sun Jan 15 21:49:19 2023
    dn <PythonList@DancesWithMice.info> writes:
    > Some programmers don't realise that SQL can also be used for
    > calculations, eg the eponymous COUNT(), which saves (CPU-time and
    > coding-effort) over post-processing in Python.

    Yes, I second that! Sometimes, people only re-invent things
    in Python because they don't know SQL well enough, or they
    do not normalize their tables because they have not properly
    learned how to do this.

    I'd always start out with normalized tables and do as many
    operations in SQL as possible. I would then hesitate to
    de-normalize anything or transfer data operations into
    the programming language unless I am very sure that this
    is really advantageous.

    Once I had the task of writing VBA code to query and analyze
    data from a Jet engine (i.e., Microsoft Access). I ended up
    writing 90 % of the code in SQL and a thin layer of 10 % in VBA.
    And it was fast.

  • From Weatherby,Gerard@21:1/5 to PythonList@DancesWithMice.info on Sun Jan 15 22:31:17 2023
    With Postgresql, one can also do pre-processing in Python. https://www.postgresql.org/docs/15/plpython.html

    While it's not as convenient to develop as client-side Python, it can be used to implement complicated constraints or implement filtering on the server side, which reduces the amount of data that has to be sent back to the client.


  • From Greg Ewing@21:1/5 to Dino on Mon Jan 16 11:54:02 2023
    On 16/01/23 2:27 am, Dino wrote:
    > Do you have any idea about the speed of a SELECT query against a 100k
    > rows / 300 Mb Sqlite db?

    That depends entirely on the nature of the query and how the
    data is indexed. If it's indexed in a way that allows sqlite to
    home in directly on the wanted data, it will be very fast. If
    it has to fall back on a linear search, it probably won't be
    significantly faster than your existing Python implementation.
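
    A quick way to check which case you are in (sketch, invented table): ask
    SQLite for its query plan and look for a "USING INDEX" step versus a
    full-table "SCAN".

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE data (id INTEGER PRIMARY KEY, status TEXT, price REAL);
        CREATE INDEX idx_data_status ON data (status);
    """)

    for sql in (
        "SELECT * FROM data WHERE status = 'active'",  # can use idx_data_status
        "SELECT * FROM data WHERE price > 10",         # no index -> full scan
    ):
        plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        print(sql, "->", plan)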

    --
    Greg

  • From Thomas Passin@21:1/5 to Stefan Ram on Sun Jan 15 18:06:36 2023
    On 1/15/2023 4:49 PM, Stefan Ram wrote:
    > dn <PythonList@DancesWithMice.info> writes:
    >> Some programmers don't realise that SQL can also be used for
    >> calculations, eg the eponymous COUNT(), which saves (CPU-time and
    >> coding-effort) over post-processing in Python.
    >
    > Yes, I second that! Sometimes, people only re-invent things
    > in Python because they don't know SQL well enough, or they
    > do not normalize their tables because they have not properly
    > learned how to do this.
    >
    > I'd always start out with normalized tables and do as many
    > operations in SQL as possible. I would then hesitate to
    > de-normalize anything or transfer data operations into
    > the programming language unless I am very sure that this
    > is really advantageous.

    Yes, if you get the indexes and joins right, sometimes you can get a
    very large speed-up. It takes some experimenting and use of EXPLAIN,
    but it's worth doing. You especially want to avoid letting the database
    engine do full-table scans over and over. And you never want to send a
    lot of rows to Python and do post-filtering on them if you can avoid it.

    Use WHERE instead of HAVING if possible (HAVING works post-scan, WHERE
    works during row retrieval).
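
    For example (a sketch against an invented "sales" table), both of the
    queries below return one customer's total, but the first filters rows as
    they are read while the second groups every row and only then throws the
    unwanted groups away:

    # Filter early: WHERE prunes rows before grouping.
    fast_sql = """
        SELECT customer, SUM(total) FROM sales
        WHERE customer = 'acme'
        GROUP BY customer
    """

    # Filter late: every customer is aggregated, then all but one group is dropped.
    slow_sql = """
        SELECT customer, SUM(total) FROM sales
        GROUP BY customer
        HAVING customer = 'acme'
    """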

  • From Dino@21:1/5 to Gerard on Sun Jan 15 22:13:06 2023
    On 1/15/2023 2:23 PM, Weatherby,Gerard wrote:
    > That's about what I got using a Python dictionary on random data on a
    > high memory machine.
    >
    > https://github.com/Gerardwx/database_testing.git
    >
    > It's not obvious to me how to get it much faster than that.

    Gerard, you are a rockstar. This is going to be really useful if I do
    decide to adopt sqlite3 for my PoC, as I understand what's going on conceptually, but never really used sqlite (nor SQL in a long long
    time), so this may save me a bunch of time.

    I created a 300 Mb DB using your script. Then:

    $ ./readone.py
    testing 2654792 of 4655974
    Found somedata0002654713 for 1ed9f9cd-0a9e-47e3-b0a7-3e1fcdabe166 in 0.239335000020219 seconds

    $ ./prefetch.py
    Index build 4.420937899994897 seconds
    testing 3058568 of 4655974
    Found somedata0000202200 for 5dca1455-9cd6-4e4d-8e5a-7e6400de7ca7 in 4.4000043999403715e-06 seconds

    So, if I understand right:

    1) once I built a dict out of the DB (in about 4 seconds), I was able to
    lookup an entry/record in 4 microseconds(!)

    2) looking up a record/entry using a Sqlite query took 0.2 seconds (i.e.
    500x slower)

    Interesting. Thank you for this. Very informative. I really appreciate
    that you took the time to write this.

    The conclusion seems to me that I probably don't want to go the Sqlite
    route, as I would be placing my data into a database just to extract it
    back into a dict when I need it if I want it fast.

    Ps: a few minor fixes to the README as this may be helpful to others.

    ./venv/... => ./env/..

    i.e.
    ./env/bin/pip install -U pip
    ./env/bin/pip install -e .

    Also add part in []

    Run create.py [size of DB in bytes] prior to running readone.py and/or prefetch.py

    BTW, can you tell me what is going on here? what's := ?

    while (increase := add_some(conn,adding)) == 0:

    https://github.com/Gerardwx/database_testing/blob/main/src/database_testing/create.py#L40

    Dino

  • From David@21:1/5 to Dino on Mon Jan 16 18:53:38 2023
    On Mon, 16 Jan 2023 at 16:15, Dino <dino@no.spam.ar> wrote:

    > BTW, can you tell me what is going on here? what's := ?
    >
    > while (increase := add_some(conn,adding)) == 0:

    See here:
    https://docs.python.org/3/reference/expressions.html#assignment-expressions
    https://realpython.com/python-walrus-operator/
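
    In short, ":=" (the "walrus operator", added in Python 3.8) assigns a
    value and returns it inside an expression, so the quoted loop keeps
    calling add_some() until it returns something non-zero. A tiny standalone
    example of the same pattern:

    import random

    def roll():
        return random.randint(0, 3)

    # Keep rolling while the result (also bound to `value`) is zero.
    while (value := roll()) == 0:
        print("rolled zero, trying again")
    print("first non-zero roll:", value)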

  • From Dino@21:1/5 to David on Mon Jan 16 09:43:01 2023
    On 1/16/2023 2:53 AM, David wrote:
    > See here:
    > https://docs.python.org/3/reference/expressions.html#assignment-expressions
    > https://realpython.com/python-walrus-operator/

    Thank you, brother.

  • From Stefan Ram@21:1/5 to Thomas Passin on Mon Jan 16 15:14:06 2023
    Thomas Passin <list1@tompassin.net> writes:
    > I changed the program so that at startup it imports much of the data
    > into Python dictionaries that are structured to support the kinds of
    > queries that need the help. Response time to queries dropped
    > dramatically. Some kinds of queries needed more help, and I collected
    > auxiliary collections of (usually highly pre-processed) data into
    > ordinary files, and those too get imported into dictionaries during
    > startup.

    There are several reasons why sometimes databases are used:

    In organizations, they can manage access permissions for
    several different tables and users/rôles.

    They can coordinate parallel use and data access by several
    different processes (programs).

    They can hold data that is too large to fit into main memory.

    By replicating the same data between several different
    computers, they can provide additional security for the data.

    By transactions and referential integrity, they also can
    provide more security for the data, ensuring that certain
    integrity constraints are always met.

    When a process (program) crashes, recent changes to its
    internal tables in memory might be lost. Had it written
    those changes to a database, they would not have been lost.

    This is just what I was able to write off the top of my head.

    When none of those reasons matter, one can use dictionaries
    in Python as well. And then what Chandler Carruth showed us
    applies:

    CPUS HAVE A HIERARCHICAL CACHE SYSTEM
    (from a 2014 talk by Chandler Carruth)

    One cycle on a 3 GHz processor          1 ns
    L1 cache reference                      0.5 ns
    Branch mispredict                       5 ns
    L2 cache reference                      7 ns              14x L1 cache
    Mutex lock/unlock                       25 ns
    Main memory reference                   100 ns            20x L2, 200x L1
    Compress 1K bytes with Snappy           3,000 ns
    Send 1K bytes over 1 Gbps network       10,000 ns         0.01 ms
    Read 4K randomly from SSD               150,000 ns        0.15 ms
    Read 1 MB sequentially from memory      250,000 ns        0.25 ms
    Round trip within same datacenter       500,000 ns        0.5 ms
    Read 1 MB sequentially from SSD         1,000,000 ns      1 ms     4x memory
    Disk seek                               10,000,000 ns     10 ms    20x datacenter RT
    Read 1 MB sequentially from disk        20,000,000 ns     20 ms    80x memory, 20x SSD
    Send packet CA->Netherlands->CA         150,000,000 ns    150 ms

    . Of course, as Raymond Hettinger told us, dictionaries in
    Python have been made very fast by the work of many smart
    programmers.

    Raymond Hettinger:
    Modern Python Dictionaries -
    A confluence of a dozen great ideas,
    PyCon 2017.

    (He also explains a surprising relation they have to databases.)

    However, operating systems and databases also try to cache
    information in main memory that is estimated to be accessed
    often. So, when one is only reading data and not writing it
    to a database, a table cached in main memory might be much
    faster than a table where everything is read from disk for
    each access. Maybe some databases offer settings to tell it
    to cache certain tables in memory.
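
    SQLite, for one, can hold an entire database in RAM: the standard library
    can copy an on-disk database into an in-memory connection at startup (a
    sketch; "data.db" is a placeholder path), after which reads never touch
    the disk.

    import sqlite3

    disk = sqlite3.connect("data.db")
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)    # copy the whole database into RAM (Python 3.7+)
    disk.close()

    # Queries now run against the in-memory copy.
    tables = mem.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()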

  • From Dino@21:1/5 to Dino on Mon Jan 16 09:44:34 2023
    Just wanted to take a moment to express my gratitude to everyone who
    responded here. You have all been so incredibly helpful. Thank you

    Dino

  • From Stefan Ram@21:1/5 to rbowman on Mon Jan 16 17:01:47 2023
    rbowman <bowman@montana.com> writes:
    > On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
    >> When none of those reasons matter, one can use dictionaries in Python
    >> as well. And then what Chandler Carruth showed us applies:
    > I am missing something. Where is the data in your dictionary coming from?

    Could be a database, could be a file with the pickled
    dictionary from the preceding pass of the program or
    a file in a common or custom data format.

  • From rbowman@21:1/5 to Stefan Ram on Mon Jan 16 16:56:43 2023
    On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:

    > When none of those reasons matter, one can use dictionaries in Python
    > as well. And then what Chandler Carruth showed us applies:

    I am missing something. Where is the data in your dictionary coming from?

  • From Thomas Passin@21:1/5 to Stefan Ram on Mon Jan 16 12:26:32 2023
    On 1/16/2023 10:14 AM, Stefan Ram wrote:
    > However, operating systems and databases also try to cache
    > information in main memory that is estimated to be accessed
    > often.

    Yes, and you can only know by testing, when that's possible. Also, if
    you know that you have the same queries repeated over and over, you can
    intern (basically, cache) their results in Python and get a performance
    boost that way.

  • From Edmondo Giovannozzi@21:1/5 to All on Mon Jan 16 10:18:57 2023
    On Sunday, 15 January 2023 at 05:26:50 UTC+1, Dino wrote:
    > 2. Load my data into a numpy.array and use numpy.array operations to
    > slice and dice my data.

    As a comparison with numpy, given the following lines:

    import numpy as np
    a = np.random.randn(400,100_000)
    ia = np.argsort(a[0,:])
    a_elem = a[56, ia[0]]

    I have just taken an element randomly from a numeric table of 400x100,000 elements. To find it with numpy:

    %timeit isel = a == a_elem
    35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    And
    %timeit a[isel]
    9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    As the data are not ordered it is searching them one by one, but at C level.
    Of course it depends on a lot of things...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Peter J. Holzer@21:1/5 to Thomas Passin on Mon Jan 16 20:32:16 2023
    On 2023-01-15 18:06:36 -0500, Thomas Passin wrote:
    You especially want to avoid letting the database engine do full-table
    scans over and over. And you never want to send a lot of rows to
    Python and do post-filtering on them if you can avoid it.

    Another thing to avoid: Lots of small queries. If you have the choice
    between a single query which returns 10000 rows and 1000 queries which
    return on average 5 rows each, the former is almost always much faster.
    Each query carries some overhead. This may not be very noticeable with
    SQLite (as everything is in the same process), but it becomes more and
    more pronounced the farther the database is from the client. For
    complicated queries the parsing and planning overhead can also become significant.

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From Peter J. Holzer@21:1/5 to dn via Python-list on Mon Jan 16 20:25:22 2023
    On 2023-01-16 09:12:30 +1300, dn via Python-list wrote:
    On 16/01/2023 08.36, Weatherby,Gerard wrote:
    I think any performance improvements would have to come from a language change or better indexing of the data.
    Expanding on @Peter's post: databases (relational or not) are best organised according to use.

    I probably wasn't very clear here. I think one of the main advantages of relational databases over earlier models (especially hierarchical
    databases but also network databases) is that you *don't* have to do
    that.

    In a hierarchical database you have to decide what's higher or lower in
    the hierarchy (e.g., does a department have employees, or do employees
    have a department?) and that has a dramatic effect on performance so
    you must structure the database according to which queries are expected to be more common.

    In a relational database employees and departments are at the same level
    and you have a relationship between them. You don't need to know whether
    you'll have to look up all employees of a given department or the
    department of a given employee more often. Both will be similarly fast
    with appropriate indexes.
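    A minimal sqlite3 sketch of that point, with invented table names: once the foreign key is indexed, both directions of the lookup are cheap.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE employee (
            id INTEGER PRIMARY KEY,
            name TEXT,
            department_id INTEGER REFERENCES department(id)
        );
        -- The index on the foreign key makes "all employees of a department"
        -- about as cheap as "the department of an employee" (a PK lookup).
        CREATE INDEX employee_department_idx ON employee(department_id);
    """)

    # All employees of a given department:
    conn.execute("SELECT name FROM employee WHERE department_id = ?", (1,))
    # The department of a given employee:
    conn.execute("""SELECT d.name FROM department d
                    JOIN employee e ON e.department_id = d.id
                    WHERE e.id = ?""", (42,))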

    (Of course you should still have an idea what the data is used for when designing your data model. But semantics are much more important than
    use cases at this stage and you don't have to redesign your entire
    database just because you need a new query.)

    This flexibility comes with a cost, though: A relational database is
    almost always good enough, but almost never optimal for any given use.


    Postgres and MySQL (for example) enable the establishment of multiple and sophisticated indices/indexes, and the aptly-named "views" of data.

    If the queries can be grouped according to the manner in which the data must be accessed, a view could be built for each. Then, even if every
    row must be accessed, the retrieval will be more efficient and/or the response better organised.

    Nitpick: A view will not be more efficient (unless it's a materialized
    view which is basically just a table). To retrieve that data, the RDBMS
    has to perform exactly the same operations as for the underlying query.
    Views make life simpler for the programmer (or analyst) by letting them
    *write* simpler queries, but under the surface nothing changes. In fact
    the performance may be worse, since the perceived simplicity of the view
    may cause the human to write queries which are hard to optimize.
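    A small sqlite3 sketch of that nitpick (names invented): the view only saves typing, and EXPLAIN QUERY PLAN reports the same plan for the view and for the underlying query.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, suburb TEXT);
        CREATE VIEW person_by_suburb AS SELECT suburb, name FROM person;
    """)

    # Both statements yield the same query plan; the view changes nothing
    # about how the data is accessed, it only simplifies what we write.
    print(conn.execute("EXPLAIN QUERY PLAN "
                       "SELECT name FROM person_by_suburb WHERE suburb = 'X'"
                       ).fetchall())
    print(conn.execute("EXPLAIN QUERY PLAN "
                       "SELECT name FROM person WHERE suburb = 'X'"
                       ).fetchall())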

    (There is another important use case for views: Access control and
    enforcement of business rules. But I fear we are straying far from the
    original topic now.)

    Thus, suppose we have a DB of people. Many retrievals are likely to utilise an index on 'name'. However, if at times interest is limited to place or
    suburb, an index and view on those columns will speed things up from O(n). Similarly, if a query is only to return people with a dog license.

    I don't see where a view would help here - but maybe you are thinking of
    a more complex database than you describe.

    Some programmers don't realise that SQL can also be used for calculations,
    e.g. the eponymous COUNT(), which saves CPU time and coding effort compared to post-processing in Python.

    Agreed. Aggregate and window functions can do a lot of work in the
    database.
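    For example (a sketch with an invented person table), counting rows per suburb in SQL instead of fetching every row and counting in Python:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, suburb TEXT)")
    conn.executemany("INSERT INTO person (suburb) VALUES (?)",
                     [("A",), ("B",), ("A",)])

    # Post-processing in Python: every row is transferred first.
    counts = {}
    for (suburb,) in conn.execute("SELECT suburb FROM person"):
        counts[suburb] = counts.get(suburb, 0) + 1

    # The same result computed inside the database engine:
    counts = dict(conn.execute(
        "SELECT suburb, COUNT(*) FROM person GROUP BY suburb"))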

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From Thomas Passin@21:1/5 to rbowman on Mon Jan 16 12:28:37 2023
    On 1/16/2023 11:56 AM, rbowman wrote:
    On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:


    When none of those reasons matter, one can use dictionaries in Python
    as well. And then what Chandler Carruth showed us applies:

    I am missing something. Where is the data in your dictionary coming from?

    It would get imported on startup. This is assuming that the data does
    not get changed during operation, which the OP said is the case.
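    A minimal sketch of that idea, with invented column names: load the rows once at startup and build a dictionary per equality-filtered column, so those lookups become O(1) (range conditions would still need something else, e.g. a sorted list plus bisect):

    from collections import defaultdict

    # rows: the table loaded once at startup, e.g. from CSV/JSON.
    rows = [
        {"id": 1, "status": "new",  "price": 10.0},
        {"id": 2, "status": "used", "price": 7.5},
        # ... ~100k rows in the real service
    ]

    # One dictionary per column that is filtered on by equality.
    by_status = defaultdict(list)
    for row in rows:
        by_status[row["status"]].append(row)

    # Serving "status == 'new'" is now a dictionary lookup, not a scan.
    matches = by_status["new"]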

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Albert-Jan Roskam@21:1/5 to All on Mon Jan 16 20:59:30 2023
    On Jan 15, 2023 05:26, Dino <dino@no.spam.ar> wrote:

    [...]

    =======
    Hi,
    * I'd start by measuring where your program spends its time
    (see the sketch below): https://docs.python.org/3/library/profile.html
    * It might be useful to try DuckDB instead of
    SQLite: https://duckdb.org/why_duckdb.html
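    A minimal profiling sketch, assuming the endpoint logic lives in an ordinary function (the function here is just a stand-in):

    import cProfile
    import pstats

    def handle_query(query):           # stand-in for the real endpoint logic
        return [n for n in range(100_000) if n % 7 == 0]

    profiler = cProfile.Profile()
    profiler.enable()
    handle_query({"status": "new"})    # run a representative query
    profiler.disable()

    # Show the 10 most expensive functions by cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)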
    Best wishes,
    AJ

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dino@21:1/5 to Edmondo Giovannozzi on Mon Jan 16 18:17:42 2023
    On 1/16/2023 1:18 PM, Edmondo Giovannozzi wrote:

    [...]

    thank you for this. It's probably my lack of experience with Numpy,
    but... can you explain what is going on here in more detail?

    Thank you

    Dino

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From rbowman@21:1/5 to Thomas Passin on Tue Jan 17 01:47:19 2023
    On Mon, 16 Jan 2023 12:28:37 -0500, Thomas Passin wrote:

    On 1/16/2023 11:56 AM, rbowman wrote:
    On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:


    When none of those reasons matter, one can use dictionaries in
    Python as well. And then what Chandler Carruth showed us applies:

    I am missing something. Where is the data in your dictionary coming
    from?

    It would get imported on startup. This is assuming that the data does
    not get changed during operation, which the OP said is the case.

    Okay, so there is no data loss if the program or machine crashes. That
    makes sense. In the cases where I'm using SQLite, the input data consists
    of relatively expensive spatial queries that are processed and written out
    to persist the results for use by other applications in the suite.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edmondo Giovannozzi@21:1/5 to All on Tue Jan 17 02:42:02 2023
    On Tuesday, January 17, 2023 at 00:18:04 UTC+1, Dino wrote:
    On 1/16/2023 1:18 PM, Edmondo Giovannozzi wrote:

    [...]
    thank you for this. It's probably my lack of experience with Numpy,
    but... can you explain what is going on here in more detail?

    Thank you

    Dino

    Sorry,
    I was just creating an array of 400x100000 elements that I fill with random numbers:

    a = np.random.randn(400,100_000)

    Then I pick one element at random: it is just a stupid sort on a row, after which I take an element from another row, but it doesn't matter, I'm just taking a random element. I could have used other ways to get it, but this was the first that came to my mind.

    ia = np.argsort(a[0,:])
    a_elem = a[56, ia[0]]

    Then I'm finding that element in the whole matrix a (of course I know where it is, but I want to test the speed of a linear search done at the C level):

    %timeit isel = a == a_elem

    Actually isel is a boolean (logical) array that is True where a[i,j] == a_elem and False where a[i,j] != a_elem. It may find more than one element but, of course, in our case it will find only the element that we selected at the beginning. So it gives the
    speed of a linear search plus the time needed to allocate the boolean array. The search is over the whole matrix of 40 million elements, not just one of its rows of 100k elements.

    On a single row (which, I should say, I have chosen to be contiguous) it is much faster.

    %timeit isel = a[56,:] == a_elem
    26 µs ± 588 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    The matrix holds double precision numbers, i.e. 8 bytes each; I haven't tested it on strings of characters.

    This was meant to be an estimate of the speed one can get by going down to the C level.
    Of course you lose the possibility of having a relational database, you need to have everything in memory, etc...

    A package that implements tables based on numpy is pandas: https://pandas.pydata.org/
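    A minimal pandas sketch of the original filtering problem (column names invented), since a DataFrame handles mixed string/number columns more naturally than a plain numpy array:

    import pandas as pd

    # Hypothetical table; the real service would load ~100k rows once at
    # startup, e.g. with pd.read_csv().
    df = pd.DataFrame({
        "status": ["new", "used", "new"],
        "price":  [10.0, 7.5, 12.0],
    })

    # "status == 'new' AND price > 9" as vectorised boolean indexing.
    result = df[(df["status"] == "new") & (df["price"] > 9.0)]
    print(result.to_dict(orient="records"))   # JSON-friendly list of rows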

    I hope that it can be useful.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dino@21:1/5 to Edmondo Giovannozzi on Tue Jan 17 10:39:14 2023
    Thanks a lot, Edmondo. Or better... Grazie mille.

    On 1/17/2023 5:42 AM, Edmondo Giovannozzi wrote:

    [...]



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)