On 2023-01-14 23:26:27 -0500, Dino wrote:
Hello, I have built a PoC service in Python Flask for my work, and - now
that the point is made - I need to make it a little more performant (to be honest, chances are that someone else will pick up from where I left off,
and implement the same service from scratch in a different language (GoLang? .Net? Java?) but I am digressing).
Anyway, my Flask service initializes by loading a big "table" of 100k rows and 40 columns or so (memory footprint: order of 300 Mb)
300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with PostgreSQL because I'm most familiar with it and Sqlite is a bit of an outlier).
The main reason for putting it into a database is the ability to use
indexes, so you don't have to scan all 100 k rows for each query.
You may be able to do that for your Python data structures, too: Can you
set up dicts which map to subsets you need often?
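A minimal sketch of that idea, with hypothetical rows and a made-up "country" column: build one dict per frequently-filtered column so an equality query becomes a single lookup instead of a 100k-row scan.

```python
from collections import defaultdict

# Hypothetical rows; the real table has ~100k rows and ~40 columns.
rows = [
    {"id": 1, "country": "US", "status": "active"},
    {"id": 2, "country": "DE", "status": "inactive"},
    {"id": 3, "country": "US", "status": "inactive"},
]

# One dict per frequently-queried column, mapping value -> list of rows.
by_country = defaultdict(list)
for row in rows:
    by_country[row["country"]].append(row)

# An equality query is now a dict lookup instead of a full scan.
us_rows = by_country["US"]
```

The trade-off is extra memory per index, which matters if RAM is tight in production.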
There are some specialized in-memory bitmap implementations which can be
used for filtering. I've used
[Judy bitmaps](https://judy.sourceforge.net/doc/Judy1_3x.htm) in the
past (mostly in Perl).
These days [Roaring Bitmaps](https://www.roaringbitmap.org/) is probably
the most popular. I see several packages on PyPI - but I haven't used
any of them yet, so no recommendation from me.
Numpy might also help. You will still have linear scans, but it is more compact and many of the searches can probably be done in C and not in
Python.
As you can imagine, this is not very performant in its current form, but
performance was not the point of the PoC - at least initially.
For performance optimization it is very important to actually measure performance, and a good profiler helps very much in identifying hot
spots. Unfortunately until recently Python was a bit deficient in this
area, but [Scalene](https://pypi.org/project/scalene/) looks promising.
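Even without Scalene, the stdlib's cProfile can already locate hot spots; a minimal sketch (the deliberately slow function is made up for illustration):

```python
import cProfile
import io
import pstats

def slow_filter(rows, value):
    # Deliberately naive linear scan, standing in for a hot spot.
    return [r for r in rows if r["key"] == value]

rows = [{"key": i % 100} for i in range(100_000)]

profiler = cProfile.Profile()
profiler.enable()
slow_filter(rows, 42)
profiler.disable()

# Print the top entries by cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

The report names the functions where time is spent, which is exactly the information you need before optimizing anything.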
hp
Hey,
before you start optimizing, I would suggest that you measure response
times, query times, data search times and so on. In order to save
time, you have to know where you "lose" time.
Does your service really have to load the whole table at once? Yes that
might lead to quicker response times on requests, but databases are
often very good with caching themselves, so that the first request might
be slower than following requests with similar parameters. Do you use a database, or are you reading from a file? Are you maybe looping through
your whole dataset on every request, instead of asking for the specific
data?
Before you start introducing a cache and its added complexity, do you
really need that cache?
You are talking about saving microseconds, that sounds a bit as if you
might be “overdoing” it. How many requests will you have in the future? At least in which magnitude and how quick do they have to be? You write
about 1-4 seconds on your laptop. But that does not really tell you that much, because most probably the service will run on a server. I am not
saying that you should get a server or a cloud-instance to test against,
but to talk with your architect about that.
I totally understand your impulse to appear as good as can be, but you
have to know where you really need to debug and optimize. It will not be advantageous for you if you start to optimize for optimizing's sake. Additionally, if your service is a PoC, optimizing now might not be the
first thing you have to worry about; rather, make everything
as simple and readable as possible, and do not spend too much
time just showing how it could work.
But of course, I do not know the tasks given to you and the expectations
you have to fulfil. All I am trying to say is to reconsider where you
really could improve and how far you have to improve.
On 1/15/2023 6:14 AM, Peter J. Holzer wrote:
On 2023-01-14 23:26:27 -0500, Dino wrote:
Anyway, my Flask service initializes by loading a big "table" of 100k rows
and 40 columns or so (memory footprint: order of 300 Mb)
300 MB is large enough that you should at least consider putting that
into a database (Sqlite is probably simplest. Personally I would go with PostgreSQL because I'm most familiar with it and Sqlite is a bit of an outlier).
The main reason for putting it into a database is the ability to use indexes, so you don't have to scan all 100 k rows for each query.
I have an (inherited) server program that uses about 30 MB of data in a
MySQL database. It services queries received over the network. It too had performance problems, to which adding indexes and smarter joins helped but not enough.
I changed the program so that at startup it imports much of the data into Python dictionaries that are structured to support the kinds of queries that need the help. Response time to queries dropped dramatically.
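A sketch of that startup pattern, using the stdlib's sqlite3 as a stand-in for the MySQL database (schema and names are made up): pull the data once at startup and shape it for the queries you actually serve.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the real MySQL database: a small in-memory sqlite3 table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Alice", "Leeds"), ("Bob", "York"), ("Cara", "Leeds")])

# At startup, import the data into a dict structured for the queries.
by_city = defaultdict(list)
for name, city in conn.execute("SELECT name, city FROM people"):
    by_city[city].append(name)

# Each request is now a dict lookup; no per-request SQL round trip.
leeds = by_city["Leeds"]
```

This only works because the data is read-only after startup; with writes, the dicts and the database would have to be kept in sync.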
On 2023-01-15 10:38:22 -0500, Thomas Passin wrote:
I have an (inherited) server program that uses about 30 MB of data in a
MySQL database. It services queries received over the network. It too had
performance problems, to which adding indexes and smarter joins helped but not enough.
I changed the program so that at startup it imports much of the data into
Python dictionaries that are structured to support the kinds of queries that need the help. Response time to queries dropped dramatically.
This is to be expected: Firstly, because you don't have disk accesses
any more, secondly because you don't have network latency any more and thirdly, because you structured the data to fit the queries.
The thing to keep in mind is that the relational database model was
invented to have a uniform and simple way to model all data, and that
RDBMSs are designed to handle all workloads (from a single tiny table to thousands of tables with hundreds of terabytes) reasonably well. For any given application you can always find a more efficient solution than
using an RDBMS. Sometimes it's simple (just load all the data into a
dict and serve from there), sometimes it's a major research project.
The nice thing about RDBMSs isn't that they are the optimal solution for anything but that they are a "good enough" solution for a large class of problems.
I think any performance improvements would have to come from a language change or better indexing of the data.
Do you have any idea about the speed of a SELECT query against a 100k
rows / 300 Mb Sqlite db?
dn <PythonList@DancesWithMice.info> writes:
Some programmers don't realise that SQL can also be used for
calculations, eg the eponymous COUNT(), which saves (CPU-time and
coding-effort) over post-processing in Python.
Yes, I second that! Sometimes, people only re-invent things
in Python because they don't know SQL well enough, or they
do not normalize their tables because they have not properly
learned how to do this.
I'd always start out with normalized tables and do as many
operations in SQL as possible. I would then hesitate to
de-normalize anything or transfer data operations into
the programming language unless I am very sure that this
is really advantageous.
Once I had the task of writing VBA code to query and analyze
data from a Jet engine (i.e., Microsoft Access). I ended up
writing 90 % of the code in SQL and a thin layer of 10 % in VBA.
And it was fast.
That's about what I got using a Python dictionary on random data on a high memory machine.
https://github.com/Gerardwx/database_testing.git
It's not obvious to me how to get it much faster than that.
BTW, can you tell me what is going on here? What's := ?
while (increase := add_some(conn,adding)) == 0:
See here:
https://docs.python.org/3/reference/expressions.html#assignment-expressions
https://realpython.com/python-walrus-operator/
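In short, := (the "walrus operator", added in Python 3.8) assigns and returns a value inside an expression, so the loop condition can both call the function and keep its result. A runnable sketch with a hypothetical stand-in for the original's add_some(conn, adding):

```python
# Hypothetical stand-in for add_some(conn, adding): returns 0 until
# the third call, then returns a non-zero "rows added" count.
def add_some(counter):
    counter["calls"] += 1
    return 0 if counter["calls"] < 3 else counter["calls"]

counter = {"calls": 0}

# := assigns the return value to `increase` AND tests it in one step.
# Without it you would call add_some() before the loop and again
# at the bottom of each iteration.
while (increase := add_some(counter)) == 0:
    pass
```

After the loop, `increase` holds the first non-zero return value.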
I changed the program so that at startup it imports much of the data
into Python dictionaries that are structured to support the kinds of
queries that need the help. Response time to queries dropped
dramatically. Some kinds of queries needed more help, and I collected auxiliary collections of (usually highly pre-processed) data into
ordinary files, and those too get imported into dictionaries during startup.
However, operating systems and databases also try to cache
information in main memory that is estimated to be accessed
often.
Yes, and you can only know by testing, when that's possible.
Hello, I have built a PoC service in Python Flask for my work, and - now that the point is made - I need to make it a little more performant (to
be honest, chances are that someone else will pick up from where I left
off, and implement the same service from scratch in a different language (GoLang? .Net? Java?) but I am digressing).
Anyway, my Flask service initializes by loading a big "table" of 100k
rows and 40 columns or so (memory footprint: order of 300 Mb) and then accepts queries through a REST endpoint. Columns are strings, enums, and numbers. Once initialized, the table is read only. The endpoint will
parse the query and match it against column values (equality,
inequality, greater than, etc.) Finally, it will return a (JSON) list of
all rows that satisfy all conditions in the query.
As you can imagine, this is not very performant in its current form, but performance was not the point of the PoC - at least initially.
Before I deliver the PoC to a more experienced software architect who
will look at my code, though, I wouldn't mind looking a bit less lame
and do something about performance in my own code first, possibly by bringing the average time for queries down from where it is now (order
of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds on average.
To be honest, I was already able to bring the time down to a handful of microseconds thanks to a rudimentary cache that will associate the "signature" of a query to its result, and serve it the next time the
same query is received, but this may not be good enough: 1) queries
might be many and very different from one another each time, AND 2) I am
not sure the server will have a ton of RAM if/when this thing - or
whatever is derived from it - is placed into production.
How can I make my queries generally more performant, ideally also in
case of a new query?
Here's what I have been considering:
1. making my cache more "modular", i.e. cache the result of certain
(wide) queries. When a complex query comes in, I may be able to restrict
my search to a subset of the rows (as determined by a previously cached partial query). This should keep the memory footprint under control.
2. Load my data into a numpy.array and use numpy.array operations to
slice and dice my data.
3. load my data into sqlite3 and use SELECT statements to query my table.
I have never used SQLite, plus there's some extra complexity as
comparing certain columns requires custom logic, but I wonder if this architecture would work well also when dealing with a 300 Mb database.
4. Other ideas?
Hopefully I made sense. Thank you for your attention
Dino
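Option 3 above can be sketched with the stdlib's sqlite3 module. The table, columns, and query below are made up for illustration; the custom comparison logic Dino mentions could be plugged in via sqlite3's conn.create_function.

```python
import json
import sqlite3

# A tiny stand-in for the real 100k x 40 table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, category TEXT, price REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("a", "tool", 9.5), ("b", "toy", 3.0), ("c", "tool", 12.0),
])
# An index on the filtered column avoids a full-table scan per query.
conn.execute("CREATE INDEX idx_t_category ON t (category)")

def query(category, max_price):
    """Match conditions against columns and return matching rows as JSON."""
    cur = conn.execute(
        "SELECT name, category, price FROM t WHERE category = ? AND price < ?",
        (category, max_price))
    cols = [d[0] for d in cur.description]
    return json.dumps([dict(zip(cols, row)) for row in cur])

result = json.loads(query("tool", 10))
```

Because SQLite runs in-process, there is no network round trip; whether it beats well-structured dicts at 300 Mb is something only measurement can settle.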
You especially want to avoid letting the database engine do full-table
scans over and over. And you never want to send a lot of rows to
Python and do post-filtering on them if you can avoid it.
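With SQLite you can check for accidental full-table scans directly: EXPLAIN QUERY PLAN reports whether a query will use an index. A small sketch (table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.execute("CREATE INDEX idx_t_a ON t (a)")

# EXPLAIN QUERY PLAN shows "SEARCH ... USING INDEX" when the index
# is used, versus "SCAN" for a full-table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1").fetchall()
plan_text = " ".join(str(row) for row in plan)
```

Running the same check on a query filtering on the unindexed column b would show a SCAN instead.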
On 16/01/2023 08.36, Weatherby,Gerard wrote:
I think any performance improvements would have to come from a language change or better indexing of the data.
Expanding on @Peter's post: databases (relational or not) are best organised according to use.
Postgres and MySQL (for example) enable the establishment of multiple and sophisticated indices/indexes, and the aptly-named "views" of data.
If the queries can be grouped according to the manner in which the data must be accessed, a view could be built for each. At which time, even if every
row must be accessed, the retrieval will be made more efficient and/or the response better-organised.
Thus, suppose we have a DB of people. Many retrievals are likely to utilise an index on 'name'. However, if at times interest is limited to place or
suburb, an index and view of such will speed things up from O(n). Similarly, if a query is only to return people with a dog license.
Some programmers don't realise that SQL can also be used for calculations,
eg the eponymous COUNT(), which saves (CPU-time and coding-effort) over post-processing in Python.
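A sketch of that point with sqlite3 and a made-up people table: the per-suburb counts and the dog-license count are computed in SQL, rather than by fetching every row into Python and counting there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, suburb TEXT, dog_license INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", [
    ("Alice", "North", 1), ("Bob", "North", 0), ("Cara", "South", 1),
])

# Let the database do the counting; only the aggregates cross into Python.
per_suburb = dict(conn.execute(
    "SELECT suburb, COUNT(*) FROM people GROUP BY suburb"))

licensed, = conn.execute(
    "SELECT COUNT(*) FROM people WHERE dog_license = 1").fetchone()
```

On a large table this also lets the engine use its indexes, instead of shipping every row over just to discard most of them.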
On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
When none of those reasons matter, one can use dictionaries in Python
as well. And then what Chandler Carruth showed us applies:
I am missing something. Where is the data in your dictionary coming from?
As a comparison with numpy. Given the following lines:
import numpy as np
a = np.random.randn(400,100_000)
ia = np.argsort(a[0,:])
a_elem = a[56, ia[0]]
I have just taken an element randomly in a numeric table of 400x100000 elements
To find it with numpy:
%timeit isel = a == a_elem
35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And
%timeit a[isel]
9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As data are not ordered it is searching it one by one but at C level.
Of course it depends on a lot of thing...
On 1/16/2023 11:56 AM, rbowman wrote:
On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
When none of those reasons matter, one can use dictionaries in
Python as well. And then what Chandler Carruth showed us applies:
I am missing something. Where is the data in your dictionary coming
from?
It would get imported on startup. This is assuming that the data does
not get changed during operation, which the OP said is the case.
On 1/16/2023 1:18 PM, Edmondo Giovannozzi wrote:
As a comparison with numpy. Given the following lines:
import numpy as np
a = np.random.randn(400,100_000)
ia = np.argsort(a[0,:])
a_elem = a[56, ia[0]]
I have just taken an element randomly in a numeric table of 400x100000 elements
To find it with numpy:
%timeit isel = a == a_elem
35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And
%timeit a[isel]
9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As data are not ordered it is searching them one by one, but at C level.
Of course it depends on a lot of things...
Thank you for this. It's probably my lack of experience with Numpy,
but... can you explain what is going on here in more detail?
Thank you
Dino
Sorry, I should have been clearer: it gives the speed of a linear search plus the time needed to allocate the logic array. The search is over the whole matrix of 40 million elements, not just over one of its rows of 100k elements.
I was just creating an array of 400x100000 elements that I fill with random numbers:
a = np.random.randn(400,100_000)
Then I pick one element randomly: it is just a stupid sort on a row, and then I take an element in another row, but it doesn't matter, I'm just taking a random element. I may have used other ways to get that, but it was the first that came to my mind.
ia = np.argsort(a[0,:])
a_elem = a[56, ia[0]]
Then I'm finding that element in the whole matrix a (of course I know where it is, but I want to test the speed of a linear search done at the C level):
%timeit isel = a == a_elem
Actually isel is a logic array that is True where a[i,j] == a_elem and False where a[i,j] != a_elem. It may find more than one element but, of course, in our case it will find only the element that we have selected at the beginning.
On a single row (which, I should say, I have chosen to be contiguous) it is much faster:
%timeit isel = a[56,:] == a_elem
26 µs ± 588 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The matrix holds double precision numbers, that is 8 bytes each; I haven't tested it on strings of characters.
This wanted to be an estimate of the speed that one can get going to the C level.
You lose of course the possibility to have a relational database, you need to have everything in memory, etc...
A package that implements tables based on numpy is pandas: https://pandas.pydata.org/
I hope that it can be useful.
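The boolean-mask filtering shown above extends naturally to multi-column queries like Dino's; a small numpy sketch with made-up columns:

```python
import numpy as np

# Hypothetical columns of the table held as parallel numpy arrays.
price = np.array([9.5, 3.0, 12.0, 7.25])
stock = np.array([10, 0, 5, 3])

# Combine per-column comparisons into one boolean mask, evaluated at C speed.
mask = (price < 10.0) & (stock > 0)

# Indices of the rows satisfying all conditions.
matching_rows = np.flatnonzero(mask)
```

Each comparison is a linear scan, but a vectorized one, which is typically far faster than a Python-level loop over 100k rows.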