I want to experiment with shared memory between iForth instantiations
running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
system call used defaults to an arbitrary memory buffer, which is exactly what is needed.
First experiments are successful, I am able to pass text from one iForth
to another with literally only a single line of code. However, after hours of debugging, it turns out that sharing is only possible when both iForth instances are run as Administrator, which is somewhat understandable,
but a nuisance.
The MS example 'C' code ignores the problem, suggesting that
default security measures do not prevent the idea from working.
Does anybody know how to get around this problem (or lessen the OS
default security level a notch)?
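
For reference, the essence of the MS example is something like the C sketch below (not my iForth code; the mapping name and buffer size are invented). The second instance would call OpenFileMappingA with the same name and map its own view.

/* Pagefile-backed shared memory, following the MSDN CreateFileMapping
   example.  Passing INVALID_HANDLE_VALUE instead of a real file handle
   makes Windows back the mapping with the system paging file.  A plain
   session-local name like this (no "Global\" prefix) should not need
   Administrator rights. */
#include <windows.h>
#include <stdio.h>
#include <string.h>

#define SHM_SIZE 4096

int main(void)
{
    HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                  PAGE_READWRITE, 0, SHM_SIZE,
                                  "iforth_shm_demo");
    if (h == NULL) { printf("CreateFileMapping: %lu\n", GetLastError()); return 1; }

    char *p = (char *)MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, SHM_SIZE);
    if (p == NULL) { printf("MapViewOfFile: %lu\n", GetLastError()); return 1; }

    strcpy(p, "hello from the other iForth");  /* visible to the second process,
                                                  which does OpenFileMappingA()
                                                  with the same name */
    getchar();                                 /* keep the mapping alive       */

    UnmapViewOfFile(p);
    CloseHandle(h);
    return 0;
}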
-marcel

Maybe play with umask() before opening up shm?
First experiments are successful, I am able to pass text from one iForth
to another with literally only a single line of code.
Note that, if you want to communicate between the processes by writing
to shared memory in one process and reading in the other, modern CPUs
tend to have quite nonintuitive behaviour, and require the programmer
to jump through some hoops for reliable operation. IA-32 and AMD64
are somewhat better in that respect than, e.g., ARM, but even they
have non-intuitive behaviour.
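
In C11 terms the hoops look roughly like this (just a sketch; the mailbox layout is invented, and a Forth system would need equivalent fence or atomic words):

/* Release/acquire publication over a shared region.  'shm' would point
   into the memory-mapped area visible to both processes. */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    int64_t    payload;   /* the data being handed over            */
    atomic_int ready;     /* 0 = empty, 1 = payload may be read    */
} mailbox;

void publish(mailbox *shm, int64_t value)
{
    shm->payload = value;
    /* release: the payload store cannot be reordered after the flag store */
    atomic_store_explicit(&shm->ready, 1, memory_order_release);
}

int try_consume(mailbox *shm, int64_t *out)
{
    /* acquire: if we see ready == 1, the payload written before it is visible */
    if (atomic_load_explicit(&shm->ready, memory_order_acquire) == 0)
        return 0;
    *out = shm->payload;
    return 1;
}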
The MS example 'C' code ignores the problem, suggesting that
default security measures do not prevent the idea from working.
And, have you tried it? Does it work as non-administrator? If it
does, what's the difference from what you have tried?
On Friday, December 30, 2022 at 10:53:01 AM UTC+1, Anton Ertl wrote:
[..]
Note that, if you want to communicate between the processes by writing
to shared memory in one process and reading in the other, modern CPUs
tend to have quite nonintuitive behaviour, and require the programmer
to jump through some hoops for reliable operation. IA-32 and AMD64
are somewhat better in that respect than, e.g., ARM, but even they
have non-intuitive behaviour.
(iForth does not yet support ARM.) Your warning is appreciated, because
I thought that I was done already (apart from setting up a semaphore).
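
A named Win32 semaphore is one way to do the signalling part; a minimal C sketch (the object name is invented, and this is not the iForth code):

/* Both processes call CreateSemaphoreA with the same name; if the object
   already exists, the existing semaphore is opened instead of created. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE sem = CreateSemaphoreA(NULL, 0 /* initial */, 1 /* max */,
                                  "iforth_shm_ready");
    if (sem == NULL) { printf("CreateSemaphore: %lu\n", GetLastError()); return 1; }

    /* producer side, after writing into the shared region: */
    ReleaseSemaphore(sem, 1, NULL);

    /* consumer side, before reading from the shared region: */
    WaitForSingleObject(sem, INFINITE);

    CloseHandle(sem);
    return 0;
}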
Marcel Hendrix wrote on Friday, December 30, 2022 at 00:29:33 UTC+1:
I want to experiment with shared memory between iForth instantiations running on a multi-core CPU. On Windows, it is possible to share a memory-mapped file between programs. When a non-existing file name is given, the
used system call defaults to an arbitrary memory buffer, exactly what is needed.
First experiments are successful, I am able to pass text from one iForth
to another with literally only a single line of code. However, after hours of
debugging, it proves that the sharing is only possible when both iForth instances are run as an Administrator, which is somewhat understandable, but a nuisance.
The MS example 'C' code ignores the problem, suggesting that
default security measures do not prevent the idea from working.
Does anybody know how to get around this problem (or lessen the OS
default security level a notch)?
Perhaps this helps: https://epdf.tips/multicore-application-programming-for-windows-linux-and-oracle-solaris.html
see page 225ff
I think I got it. Shared memory is implemented.
On Saturday, January 7, 2023 at 7:54:38 PM UTC+1, Marcel Hendrix wrote:
I think I got it. Shared memory is implemented.
Now without bugs. ( https://ibb.co/Qd7Xw3g )
Neither Windows nor Linux appears to directly support shared memory
between networked computers. Is there a Forth library with RDMA
(a transparent protocol built into many network adapters)?
If it existed I could buy a refurbished HP840 workstation and *really*
get going (such workstations have 44 cores/88 threads and cost a mere
2000 Euros, 15 - 20k new, refurbished RDMA nic's are 20 Euros...).
Marcel Hendrix <m...@iae.nl> writes:
[..]
If it existed I could buy a refurbished HP840 workstation and *really*
get going (such workstations have 44 cores/88 threads and cost a mere
2000 Euros, 15 - 20k new, refurbished RDMA nic's are 20 Euros...).
Makes you wonder what's wrong with them :-)
Marcel Hendrix <m...@iae.nl> writes:
[..]
If you can live with its performance characteristics (and probably
lack of coherence), how about mmapping an NFS-mounted file (other
distributed file systems may be better for that purpose, though).
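
A minimal POSIX sketch of that idea (the path is only an example; NFS itself gives no coherence guarantees between clients, so this is best treated as a slow, loosely synchronized channel):

/* Map a (possibly NFS-mounted) file shared between processes/machines. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/nfs/shared.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 42;                  /* visible to other mappers of the same file */
    msync(p, 4096, MS_SYNC);    /* push the modified page back to the server */

    munmap(p, 4096);
    close(fd);
    return 0;
}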
Marcel Hendrix <m...@iae.nl> writes:
[..]
Unless you had a bunch of those workstations networked together, why
would you need RDMA, assuming your Forth program is running on the workstation?
This 44 core system is almost definitely slower than a 32 core
Threadripper, but might beat a 16 core Ryzen.
I have been polishing my shared memory application (iSPICE) a bit more.
The benchmark I previously showed compared running a circuit simulation
with a variable number of communicating CPUs. Only a minimal amount of data
is shared (a page with published parameters and achieved results, plus the
ready! flags). With this setup I got about a factor of 3 improvement for
8 CPUs. I hoped to improve this factor a bit with better hardware and maybe
some software tweaking.
What I didn't try until today was checking how fast the circuit simulation
ran on a single CPU, *not* using the shared memory framework. And indeed,
that is a problem: without shared memory the runtime is *3 times less* than
with shared memory. In other words, there is no net gain in having 8
mem-shared CPUs. As an additional check I started the circuit run in 3
separate windows. They all achieved the same speed as the single-run
non-shared version, proving that the hardware (cpu/memory/disk) is amply
sufficient to provide an 8 times speed-up.
I will now start working on Anton's suggestion of a shared file. Or maybe
I should try this on Linux first; maybe shared memory works better there.
-marcel
Maybe try something simple before jumping into sockets and mapped
files.
I have tried that way for the past 20 years already, and indeed it
works fine. However, my simple example shown above needs 24 threads/processes/cores (whatever) each having about 2 to 4 GB of memory.
-marcel
I have lost context, can you tell more about the simple example?
(My provider purges old messages swiftly)
I was in the exploring/debugging phase and have only very recently
completed the experiments.
The final results are that with shared memory, on Windows
11, it is possible to get an almost linear speedup with the
number of cores in use. The way shared memory is implemented
on Windows is with a memory-mapped file that uses the OS
pagefile as backing store. The file is guaranteed not to be
swapped out under reasonable conditions, and Windows keeps its
management invisible to the user.
I tried to make the file as small as possible. For this
iForth benchmark it was 11 int64's (11 * 8 bytes) and 24
extended floats (24 * 16 bytes), about 1/2 Kbyte. The file
is touched very infrequently, just 24 result writes and
then a loop over the 11 words to see if all cpu's finished
(check at 10ms intervals). At the moment I have no idea
what happens with very frequent read/writes (it is not
the intended type of use).
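
In C terms the layout and the finished-check look roughly like this (a sketch only; the field names are invented and the real code is iForth, not C):

#include <stdint.h>
#include <windows.h>

#define CPUS    11
#define RESULTS 24

typedef struct {
    volatile int64_t done[CPUS];          /* the "ready!" flag words           */
    unsigned char    result[RESULTS][16]; /* 24 extended floats, 16 bytes each */
} shm_page;                               /* 88 + 384 = 472 bytes, ~1/2 KB     */

/* CPU #0: poll every 10 ms until all n workers have posted their flag. */
void wait_for_workers(shm_page *shm, int n)
{
    for (;;) {
        int finished = 0;
        for (int i = 0; i < n; i++)
            if (shm->done[i] != 0) finished++;
        if (finished == n) return;
        Sleep(10);
    }
}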
[During debugging I was lucky. When setting the number of
working cpu's interactively, completely wrong results
were obtained. This happened because #|cpus was defined
as a VALUE in a configuration file. When changing #|cpus
from the console, the value in sconfig.frt stayed the
same (of course) while all the dynamically started cores
used the on-disk value, not the value I typed in on
CPU #0. Easy to understand in hindsight, but this type
of 'black-hole' mistake can take hours to find in a 7000+
line program. For some reason I just knew that it had to
be #|cpus that was causing the problem.]
The benchmark is a circuit file that defines a voltage
source and a 2-resistor divider, all parameterized.
These values were swept for a total of 24 different
circuits. To calculate the result for one of the
combinations takes 2.277s on a single core with iSPICE,
or 24 x that value, 54.648s, for all 24 combinations.
In the benchmark the 24 simulations are spread out over
11 processes on an 8-core CPU:
iSPICE> .ticker-info
AMD Ryzen 7 5800X 8-Core Processor
TICKS-GET uses os time & PROCESSOR-CLOCK 4192MHz
Do: < n TO PROCESSOR-CLOCK RECALIBRATE >
The aim is to get an 8 times speedup, or more if
hyperthreads bring something, and do all combinations
in less than 6.831 seconds. The best I managed is
7.694s or about 7.67 "cores", which I consider not
that bad. Here are the details (run 4 times):
% cpus   time [s]   perf. ratio
    1     49.874       1.46
    2     25.314       2.39
    3     17.391       3.23
    4     13.335       4.11
    5     10.565       5.17
    6      9.468       5.71
    7      8.712       6.22
    8      7.694       7.67
    9      7.260       7.37
   10      7.874       6.72
   11      7.856       6.73  ok
For your information: Running the same 24 variations
with LTspice 17.1.15, one of the fastest SPICE
implementations currently available, takes 382.265
seconds, almost exactly 7 times slower than the iSPICE
single-core run. Using 8 cores (LTspice pretends to
use 16 threads), that ratio becomes 62 times.
In the above table the performance ratio for a single
cpu is 1.46 (1.46 times faster than doing the 24
simulations on a single core *without* shared memory),
which might seem strange. I think the phenomenon is
caused by the fact that a single combination takes
only 2.277s, which may be too short a time for the
processor (or Windows) to ramp up the clock frequency.
If the performance factor is normalized by the timing for
1 cpu, the maximum speedup decreases to 7.67 / 1.46 ≈ 5.25.
We'll see what happens on an HPZ840.
-marcel
In article <c2fb7eb58b7ae773f632a15c1abac917@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:
[..]
I have lost context, can you tell more about the simple example?
[..]
The final results are that with shared memory, on Windows
11, it is possible to get an almost linear speedup with the
number of cores in use.
Linear speedup? That must depend on the program.
Can I surmise that the context is that you're comparing your
version/clone iSPICE with LTspice?
[..]
For your information: Running the same 24 variations
with LTspice 17.1.15, one of the fastest SPICE
implementations currently available, takes 382.265
seconds, almost exactly 7 times slower than the iSPICE
single-core run. Using 8 cores (LTspice pretends to
use 16 threads), that ratio becomes 62 times.
So LTspice becomes slower by using 8 cores,
going from 7 times slower to 62 times slower than iSPICE.
There must be a mistake here.
We'll see what happens on an HPZ840.
You are going to run Windows 11 on the HP workstation?
I'm going to install a Linux version, for I want to
experiment with CUDA.