On 2023-01-14 23:26:27 -0500, Dino wrote:
Hello, I have built a PoC service in Python Flask for my work, and - now
that the point is made - I need to make it a little more performant (to be honest, chances are that someone else will pick up from where I left off,
and implement the same service from scratch in a different language (GoLang? .Net? Java?) but I am digressing).
Anyway, my Flask service initializes by loading a big "table" of 100k rows and 40 columns or so (memory footprint: order of 300 Mb)
300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with PostgreSQL because I'm most familiar with it and Sqlite is a bit of an outlier).
The main reason for putting it into a database is the ability to use
indexes, so you don't have to scan all 100 k rows for each query.
You may be able to do that for your Python data structures, too: Can you
set up dicts which map to subsets you need often?
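A minimal sketch of that idea, with hypothetical rows and a made-up "country" column: build one dict per frequently-filtered column so an equality query becomes a single lookup instead of a 100k-row scan.

```python
from collections import defaultdict

# Hypothetical rows; the real table has ~100k rows and ~40 columns.
rows = [
    {"id": 1, "country": "US", "status": "active"},
    {"id": 2, "country": "DE", "status": "inactive"},
    {"id": 3, "country": "US", "status": "inactive"},
]

# One dict per frequently-queried column, mapping value -> list of rows.
by_country = defaultdict(list)
for row in rows:
    by_country[row["country"]].append(row)

# An equality query is now a dict lookup instead of a full scan.
us_rows = by_country["US"]
```

The trade-off is extra memory per index, which matters if RAM is tight in production.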
There are some specialized in-memory bitmap implementations which can be
used for filtering. I've used
[Judy bitmaps](https://judy.sourceforge.net/doc/Judy1_3x.htm) in the
past (mostly in Perl).
These days [Roaring Bitmaps](https://www.roaringbitmap.org/) is probably
the most popular. I see several packages on PyPI - but I haven't used
any of them yet, so no recommendation from me.
Numpy might also help. You will still have linear scans, but it is more compact and many of the searches can probably be done in C and not in
Python.
As you can imagine, this is not very performant in its current form, but
performance was not the point of the PoC - at least initially.
For performance optimization it is very important to actually measure performance, and a good profiler helps very much in identifying hot
spots. Unfortunately until recently Python was a bit deficient in this
area, but [Scalene](https://pypi.org/project/scalene/) looks promising.
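Even without Scalene, the stdlib's cProfile can already locate hot spots; a minimal sketch (the deliberately slow function is made up for illustration):

```python
import cProfile
import io
import pstats

def slow_filter(rows, value):
    # Deliberately naive linear scan, standing in for a hot spot.
    return [r for r in rows if r["key"] == value]

rows = [{"key": i % 100} for i in range(100_000)]

profiler = cProfile.Profile()
profiler.enable()
slow_filter(rows, 42)
profiler.disable()

# Print the top entries by cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

The report names the functions where time is spent, which is exactly the information you need before optimizing anything.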
hp
Hey,
before you start optimizing, I would suggest that you measure response
times, query times, data search times and so on. In order to save
time, you have to know where you "lose" time.
Does your service really have to load the whole table at once? Yes that
might lead to quicker response times on requests, but databases are
often very good with caching themselves, so that the first request might
be slower than following requests with similar parameters. Do you use a database, or are you reading from a file? Are you maybe looping through
your whole dataset on every request, instead of asking for the specific
data?
Before you start introducing a cache and its added complexity, do you
really need that cache?
You are talking about saving microseconds, that sounds a bit as if you
might be “overdoing” it. How many requests will you have in the future? At least in which magnitude and how quick do they have to be? You write
about 1-4 seconds on your laptop. But that does not really tell you that much, because most probably the service will run on a server. I am not
saying that you should get a server or a cloud-instance to test against,
but to talk with your architect about that.
I totally understand your impulse to appear as good as can be, but you
have to know where you really need to debug and optimize. It will not be advantageous for you if you start to optimize for optimizing's sake. Additionally, if your service is a PoC, optimizing now might not be the
first thing you have to worry about; rather, make everything
as simple and readable as possible, and do not spend too much
time just showing how it could work.
But of course, I do not know the tasks given to you and the expectations
you have to fulfil. All I am trying to say is to reconsider where you
really could improve and how far you have to improve.
On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
On 2023-01-14 23:26:27 -0500, Dino wrote:
Anyway, my Flask service initializes by loading a big "table" of 100k rows
and 40 columns or so (memory footprint: order of 300 Mb)
300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with PostgreSQL because I'm most familiar with it and Sqlite is a bit of an outlier).
The main reason for putting it into a database is the ability to use indexes, so you don't have to scan all 100 k rows for each query.
I have an (inherited) server program that uses about 30 MB of data in a
MySQL database. It services queries received over the network. It too had performance problems, to which adding indexes and smarter joins helped but not enough.
I changed the program so that at startup it imports much of the data into Python dictionaries that are structured to support the kinds of queries that need the help. Response time to queries dropped dramatically.
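A sketch of that startup pattern, using the stdlib's sqlite3 as a stand-in for the MySQL database (schema and names are made up): pull the data once at startup and shape it for the queries you actually serve.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the real MySQL database: a small in-memory sqlite3 table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Alice", "Leeds"), ("Bob", "York"), ("Cara", "Leeds")])

# At startup, import the data into a dict structured for the queries.
by_city = defaultdict(list)
for name, city in conn.execute("SELECT name, city FROM people"):
    by_city[city].append(name)

# Each request is now a dict lookup; no per-request SQL round trip.
leeds = by_city["Leeds"]
```

This only works because the data is read-only after startup; with writes, the dicts and the database would have to be kept in sync.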
On 2023-01-15 10:38:22 -0500, Thomas Passin wrote:
I have an (inherited) server program that uses about 30 MB of data in a
MySQL database. It services queries received over the network. It too had
performance problems, to which adding indexes and smarter joins helped but not enough.
I changed the program so that at startup it imports much of the data into
Python dictionaries that are structured to support the kinds of queries that need the help. Response time to queries dropped dramatically.
This is to be expected: Firstly, because you don't have disk accesses
any more, secondly because you don't have network latency any more and thirdly, because you structured the data to fit the queries.
The thing to keep in mind is that the relational database model was
invented to have a uniform and simple way to model all data, and that
RDBMSs are designed to handle all workloads (from a single tiny table to thousands of tables with hundreds of terabytes) reasonably well. For any given application you can always find a more efficient solution than
using an RDBMS. Sometimes it's simple (just load all the data into a
dict and serve from there), sometimes it's a major research project.
The nice thing about RDBMSs isn't that they are the optimal solution for anything but that they are a "good enough" solution for a large class of problems.
I think any performance improvements would have to come from a language change or better indexing of the data.
Do you have any idea about the speed of a SELECT query against a 100k
rows / 300 Mb Sqlite db?
dn <PythonList@DancesWithMice.info> writes:
Some programmers don't realise that SQL can also be used for
calculations, eg the eponymous COUNT(), which saves (CPU-time and
coding-effort) over post-processing in Python.
Yes, I second that! Sometimes, people only re-invent things
in Python because they don't know SQL well enough, or they
do not normalize their tables because they have not properly
learned how to do this.
I'd always start out with normalized tables and do as many
operations in SQL as possible. I would then hesitate to
de-normalize anything or transfer data operations into
the programming language unless I am very sure that this
is really advantageous.
Once I had the task of writing VBA code to query and analyze
data from a Jet engine (i.e., Microsoft Access). I ended up
writing 90 % of the code in SQL and a thin layer of 10 % in VBA.
And it was fast.
That's about what I got using a Python dictionary on random data on a high memory machine.
https://github.com/Gerardwx/database_testing.git
It's not obvious to me how to get it much faster than that.
BTW, can you tell me what is going on here? What's := ?
while (increase := add_some(conn,adding)) == 0:
See here:
https://docs.python.org/3/reference/expressions.html#assignment-expressions
https://realpython.com/python-walrus-operator/
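In short, := (the "walrus operator", added in Python 3.8) assigns and returns a value inside an expression, so the loop condition can both call the function and keep its result. A runnable sketch with a hypothetical stand-in for the original's add_some(conn, adding):

```python
# Hypothetical stand-in for add_some(conn, adding): returns 0 until
# the third call, then returns a non-zero "rows added" count.
def add_some(counter):
    counter["calls"] += 1
    return 0 if counter["calls"] < 3 else counter["calls"]

counter = {"calls": 0}

# := assigns the return value to `increase` AND tests it in one step.
# Without it you would call add_some() before the loop and again
# at the bottom of each iteration.
while (increase := add_some(counter)) == 0:
    pass
```

After the loop, `increase` holds the first non-zero return value.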
I changed the program so that at startup it imports much of the data
into Python dictionaries that are structured to support the kinds of
queries that need the help. Response time to queries dropped
dramatically. Some kinds of queries needed more help, and I collected auxiliary collections of (usually highly pre-processed) data into
ordinary files, and those too get imported into dictionaries during startup.
However, operating systems and databases also try to cache
information in main memory that is estimated to be accessed
often.
Yes, and you can only know by testing, when that's possible.
Hello, I have built a PoC service in Python Flask for my work, and - now that the point is made - I need to make it a little more performant (to
be honest, chances are that someone else will pick up from where I left
off, and implement the same service from scratch in a different language (GoLang? .Net? Java?) but I am digressing).
Anyway, my Flask service initializes by loading a big "table" of 100k
rows and 40 columns or so (memory footprint: order of 300 Mb) and then accepts queries through a REST endpoint. Columns are strings, enums, and numbers. Once initialized, the table is read only. The endpoint will
parse the query and match it against column values (equality,
inequality, greater than, etc.) Finally, it will return a (JSON) list of
all rows that satisfy all conditions in the query.
As you can imagine, this is not very performant in its current form, but performance was not the point of the PoC - at least initially.
Before I deliver the PoC to a more experienced software architect who
will look at my code, though, I wouldn't mind looking a bit less lame
and do something about performance in my own code first, possibly by bringing the average time for queries down from where it is now (order
of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds on average.
To be honest, I was already able to bring the time down to a handful of microseconds thanks to a rudimentary cache that will associate the "signature" of a query to its result, and serve it the next time the
same query is received, but this may not be good enough: 1) queries
might be many and very different from one another each time, AND 2) I am
not sure the server will have a ton of RAM if/when this thing - or
whatever is derived from it - is placed into production.
How can I make my queries generally more performant, ideally also in
case of a new query?
Here's what I have been considering:
1. making my cache more "modular", i.e. cache the result of certain
(wide) queries. When a complex query comes in, I may be able to restrict
my search to a subset of the rows (as determined by a previously cached partial query). This should keep the memory footprint under control.
2. Load my data into a numpy.array and use numpy.array operations to
slice and dice my data.
3. load my data into sqlite3 and use SELECT statements to query my table.
I have never used SQLite, plus there's some extra complexity as
comparing certain columns requires custom logic, but I wonder if this architecture would work well also when dealing with a 300 Mb database.
4. Other ideas?
Hopefully I made sense. Thank you for your attention
Dino
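Option 3 above can be sketched with the stdlib's sqlite3 module. The table, columns, and query below are made up for illustration; the custom comparison logic Dino mentions could be plugged in via sqlite3's conn.create_function.

```python
import json
import sqlite3

# A tiny stand-in for the real 100k x 40 table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, category TEXT, price REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("a", "tool", 9.5), ("b", "toy", 3.0), ("c", "tool", 12.0),
])
# An index on the filtered column avoids a full-table scan per query.
conn.execute("CREATE INDEX idx_t_category ON t (category)")

def query(category, max_price):
    """Match conditions against columns and return matching rows as JSON."""
    cur = conn.execute(
        "SELECT name, category, price FROM t WHERE category = ? AND price < ?",
        (category, max_price))
    cols = [d[0] for d in cur.description]
    return json.dumps([dict(zip(cols, row)) for row in cur])

result = json.loads(query("tool", 10))
```

Because SQLite runs in-process, there is no network round trip; whether it beats well-structured dicts at 300 Mb is something only measurement can settle.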
You especially want to avoid letting the database engine do full-table
scans over and over. And you never want to send a lot of rows to
Python and do post-filtering on them if you can avoid it.
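With SQLite you can check for accidental full-table scans directly: EXPLAIN QUERY PLAN reports whether a query will use an index. A small sketch (table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.execute("CREATE INDEX idx_t_a ON t (a)")

# EXPLAIN QUERY PLAN shows "SEARCH ... USING INDEX" when the index
# is used, versus "SCAN" for a full-table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1").fetchall()
plan_text = " ".join(str(row) for row in plan)
```

Running the same check on a query filtering on the unindexed column b would show a SCAN instead.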
On 16/01/2023 08.36, Weatherby,Gerard wrote:
I think any performance improvements would have to come from a language change or better indexing of the data.
Expanding on @Peter's post: databases (relational or not) are best organised according to use.
Postgres and MySQL (for example) enable the establishment of multiple and sophisticated indices/indexes, and the aptly-named "views" of data.
If the queries can be grouped according to the manner in which the data must be accessed, a view could be built for each. At which time, even if every
row must be accessed, the retrieval will be made more efficient and/or the response better-organised.
Thus, suppose we have a DB of people. Many retrievals are likely to utilise an index on 'name'. However, if at times interest is limited to place or
suburb, an index and view of such will speed things up from O(n). Similarly, if a query is only to return people with a dog license.
Some programmers don't realise that SQL can also be used for calculations,
eg the eponymous COUNT(), which saves (CPU-time and coding-effort) over post-processing in Python.
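A sketch of that point with sqlite3 and a made-up people table: the per-suburb counts and the dog-license count are computed in SQL, rather than by fetching every row into Python and counting there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, suburb TEXT, dog_license INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", [
    ("Alice", "North", 1), ("Bob", "North", 0), ("Cara", "South", 1),
])

# Let the database do the counting; only the aggregates cross into Python.
per_suburb = dict(conn.execute(
    "SELECT suburb, COUNT(*) FROM people GROUP BY suburb"))

licensed, = conn.execute(
    "SELECT COUNT(*) FROM people WHERE dog_license = 1").fetchone()
```

On a large table this also lets the engine use its indexes, instead of shipping every row over just to discard most of them.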
On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
When none of those reasons matter, one can use dictionaries in Python
as well. And then what Chandler Carruth showed us applies:
I am missing something. Where is the data in your dictionary coming from?
As a comparison with numpy. Given the following lines:
import numpy as np
a = np.random.randn(400,100_000)
ia = np.argsort(a[0,:])
a_elem = a[56, ia[0]]
I have just taken an element randomly in a numeric table of 400x100000 elements
To find it with numpy:
%timeit isel = a == a_elem
35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And
%timeit a[isel]
9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As data are not ordered it is searching it one by one but at C level.
Of course it depends on a lot of thing...
On 1/16/2023 11:56 AM, rbowman wrote:
On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
When none of those reasons matter, one can use dictionaries in
Python as well. And then what Chandler Carruth showed us applies:
I am missing something. Where is the data in your dictionary coming
from?
It would get imported on startup. This is assuming that the data does
not get changed during operation, which the OP said is the case.
On 1/16/2023 1:18 PM, Edmondo Giovannozzi wrote:
As a comparison with numpy. Given the following lines:
import numpy as np
a = np.random.randn(400,100_000)
ia = np.argsort(a[0,:])
a_elem = a[56, ia[0]]
I have just taken an element randomly in a numeric table of 400x100000 elements
To find it with numpy:
%timeit isel = a == a_elem
35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And
%timeit a[isel]
9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As data are not ordered it is searching them one by one, but at C level.
Of course it depends on a lot of things...
Thank you for this. It's probably my lack of experience with Numpy,
but... can you explain what is going on here in more detail?
Thank you
Dino
Sorry, I should have been clearer: it gives the speed of a linear search plus the time needed to allocate the logic array. The search is over the whole matrix of 40 million elements, not just over one of its rows of 100k elements.
I was just creating an array of 400x100000 elements that I fill with random numbers:
a = np.random.randn(400,100_000)
Then I pick one element randomly: it is just a stupid sort on a row, and then I take an element in another row, but it doesn't matter, I'm just taking a random element. I may have used other ways to get that, but it was the first that came to my mind.
ia = np.argsort(a[0,:])
a_elem = a[56, ia[0]]
Then I'm finding that element in the whole matrix a (of course I know where it is, but I want to test the speed of a linear search done at the C level):
%timeit isel = a == a_elem
Actually isel is a logic array that is True where a[i,j] == a_elem and False where a[i,j] != a_elem. It may find more than one element but, of course, in our case it will find only the element that we have selected at the beginning.
On a single row (which, I should say, I have chosen to be contiguous) it is much faster:
%timeit isel = a[56,:] == a_elem
26 µs ± 588 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The matrix holds double precision numbers, that is 8 bytes each; I haven't tested it on strings of characters.
This wanted to be an estimate of the speed that one can get going to the C level.
You lose of course the possibility to have a relational database, you need to have everything in memory, etc...
A package that implements tables based on numpy is pandas: https://pandas.pydata.org/
I hope that it can be useful.
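The boolean-mask filtering shown above extends naturally to multi-column queries like Dino's; a small numpy sketch with made-up columns:

```python
import numpy as np

# Hypothetical columns of the table held as parallel numpy arrays.
price = np.array([9.5, 3.0, 12.0, 7.25])
stock = np.array([10, 0, 5, 3])

# Combine per-column comparisons into one boolean mask, evaluated at C speed.
mask = (price < 10.0) & (stock > 0)

# Indices of the rows satisfying all conditions.
matching_rows = np.flatnonzero(mask)
```

Each comparison is a linear scan, but a vectorized one, which is typically far faster than a Python-level loop over 100k rows.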