There are many single-core Forth FPGA CPUs, but not yet any many-core ones. There is the Core 1 project, not quite ready yet. https://www.youtube.com/watch?v=KXjQdKBl7ag&t=1115s
There is the GA144, inspirational, but that has too little memory.
There is the Parallax P2 with Taqoz Forth, also a bit tight on memory.
There is the 6 GHz project mentioned here. Still not shipping.
If a Forth core is so small, an obvious win is to put lots of them on an FPGA, but so far no one has done that.
Am I the only one interested in using such a thing?
Chris
If you need it, why can't you do that yourself?
Sadly I am not yet an FPGA designer. But I am looking into going back to school to learn how to do this.
Yes I am aware that the Parallax is not a stack processor.
And no I am not asking about why no stack processors in ASIC.
I am looking for multiple stack processors on a single FPGA.
But your inability to understand my question is very interesting. As if the idea of many Forth CPUs on an FPGA is nutty, something that almost no one would even consider doing.
That was helpful. Thank you.
I am doing Forth-on-stack-processor-on-FPGA (Mecrisp-Ice) both for work and for fun, and the reason is simple: in an FPGA one has dedicated logic for complex peripheral IO. A traditional many-core microcontroller is used to get the timing right on multiple interfaces or to run two timing-critical tasks in parallel, but in an FPGA the Forth is usually only orchestrating the various peripherals, which otherwise work standalone.
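To make that concrete, here is a hedged sketch of what "orchestrating peripherals" can look like on a Mecrisp-Ice style core. The port addresses and the ready bit are made up for illustration; io@ and io! are the J1/SwapForth port-access words.

    $1000 constant uart-status   \ hypothetical status port address
    $1002 constant uart-data     \ hypothetical data port address
    : emit-byte ( c -- )
      begin uart-status io@ 1 and until  \ busy-wait on the ready bit
      uart-data io! ;                    \ hand the byte to the UART logic

The Forth only sequences the transaction; the bit-level timing lives in the FPGA fabric.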
Nevertheless, it would be perfectly possible to do if a need arises.
Christopher Lozinski <caloz...@gmail.com> wrote:
There are many single-core Forth FPGA CPUs, but not yet any many-core ones. There is the Core 1 project, not quite ready yet. https://www.youtube.com/watch?v=KXjQdKBl7ag&t=1115s
There is the GA144, inspirational, but that has too little memory.
There is the Parallax P2 with Taqoz Forth, also a bit tight on memory. There is the 6 GHz project mentioned here. Still not shipping.
If a Forth core is so small, an obvious win is to put lots of them on an FPGA, but so far no one has done that.
What do you plan to use this for?
One of the challenges with FPGA design is memory bandwidth. You can give each small core a tightly coupled local memory, and that scales with the number of cores you lay down. But those memories are only on the order of kilobytes.
If you want to go with off-chip memory, you contend for bandwidth. Everything has to share the same 16/32/64-bit memory interface, which can only be accessed by one core at a time. DRAM also has a lot of latency, so it's slow to switch from one address to another. This means a lot of small cores making small requests to a DRAM is quite inefficient.
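As a back-of-the-envelope illustration (my numbers, purely made up): if a shared DRAM interface sustains, say, 400 MB/s of small random accesses, then N cores sharing it get at most 400/N MB/s each, before counting the latency of hopping between their unrelated addresses.

    \ toy division, assuming a 64-bit Forth such as gforth
    400 constant total-MB/s            \ assumed shared-interface rate
    : per-core ( n -- MB/s ) total-MB/s swap / ;
    64 per-core .                      \ prints 6 : ~6 MB/s per core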
The solution to the DRAM problem is either to go wider (vector style) or to use caches, but both come at a cost. You also need a memory interconnect that connects all your cores to the memory, and that takes area.
Another option is not to have a memory interconnect but just a lot of communicating cores, and have them pass messages through other cores to access the DRAM. This only works if DRAM accesses are rare.
One model is the 'systolic array', where a core only needs to communicate with its neighbours. This is fine for a 2D problem that maps nicely to a 2D chip, but as soon as you go to more dimensions the point-to-point wiring gets complicated. The solution to that is a network rather than point-to-point wiring, and we're now back to the interconnect question.
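As a toy illustration of neighbour-only communication (my own sketch, not taken from any of the chips discussed): a 1D row of cells, where on each step every cell reads only its immediate left neighbour.

    4 constant #cells
    create row 1 , 2 , 3 , 4 ,
    : row@ ( i -- x ) cells row + @ ;
    : row! ( x i -- ) cells row + ! ;
    : step ( -- )               \ sweep right-to-left so each cell
      1 #cells 1- do            \ still reads its neighbour's old value
        i 1- row@ i row@ + i row!
      -1 +loop ;
    : .row ( -- ) #cells 0 do i row@ . loop cr ;
    step .row                   \ prints 1 3 5 7

Scale the row to hundreds of cells and the wiring stays local; it is only when a cell needs distant data that the interconnect question returns.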
Small cores don't really help here: you're doing less compute per core, but with more cores you need more interconnect overhead. It makes sense to spend area on bigger (wider) cores and proportionately less area on interconnect.
So my question would be: what applications would fit a sea of small cores with small local memories, but little inter-core communication?
Are there any which aren't currently served by existing hardware, and for which a tiled Forth core would beat a tile of simple RISC (eg RISC-V) cores?
On Tuesday, January 3, 2023 at 12:31:48 PM UTC+1, Theo wrote:
Christopher Lozinski <caloz...@gmail.com> wrote:[..]
What do you plan to use this for?
One of the challenges with FPGA design concerns memory bandwidth. You can have a tightly coupled local memory to a small core, and that scales with the number of cores you lay down. But those memories are only of the order of kilobytes.[..]
So my question would be: what applications would fit a sea of small cores with small local memories, but little inter-core communication?
Are there any which aren't currently served by existing hardware, and for which a tiled Forth core would beat a tile of simple RISC (eg RISC-V) cores?
Thanks Theo, very enlightening!
Up to now I was toying with the idea of eventually putting my algorithm
in an FPGA, using a few hundred cores that all worked on a tiny part of
the problem. Running the numbers with your scheme, I come up with 24 bytes/10ns (the other data + code can be kept locally in say 10 kbytes),
so with 100 cores I'd need 240GB/s memory bandwidth (3 x that of an
AMD 7950X) and 1MB on chip ... IOW the problem is hopelessly I/O
bound.
I will have to wait for a few more FPGA/memory generations, or buy an
A100 (2TB memory bandwidth $32,097.00) instead.
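For anyone who wants to check that arithmetic, a quick sketch (assuming a 64-bit Forth):

    24 100000000 *  \ 24 bytes every 10 ns = 2,400,000,000 bytes/s per core
    100 * .         \ x 100 cores: prints 240000000000, i.e. 240 GB/s
    100 10 * .      \ 100 cores x 10 KB local data: prints 1000 KB, ~1 MB on chip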
Marcel Hendrix <m...@iae.nl> wrote:[..]
so with 100 cores I'd need 240GB/s memory bandwidth (3 x that of an
AMD 7950X) and 1MB on chip ... IOW the problem is hopelessly I/O
bound.
Another option is HBM: the Agilex M parts can do up to 820 Gbit/s (102 Gbyte/s), and that's a lot easier to use than DRAM.
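The conversion is just bits to bytes; as a one-line check in Forth:

    820 8 / .   \ 820 Gbit/s over 8 bits/byte: prints 102 (Gbyte/s)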
However, in either case your application will have to be built around maximising the DRAM bandwidth, and you'll have to do whatever it takes to get that. You'll be micromanaging everything to get peak DRAM performance. I don't see Forth being anywhere near optimal for that.
Plug: a while ago we had a paper on this - not a Forth CPU but custom logic, and compared it with vector processing for maximising DRAM bandwidth. Many
of the same issues apply though: https://www.cl.cam.ac.uk/~atm26/pubs/FPL2013-BlueVec.pdf
I think it's one of those things that seems 'obvious' until you discover what you have to do in addition to make the 'obvious' thing work, which makes it a lot less attractive.
Marcel Hendrix <m...@iae.nl> writes:[..]
If you can organize the computation such that it stays mostly in the
cache for single-core implementation, it should be possible to stay in
the cache for multi-core implementation, no? Yes, there is some
additional buffering necessary, because the next core does not pick up
the data immediately, but with cache sizes of 32MB and more, you can
afford ~0.1ms of average slack given the data rate you mentioned.
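Where the ~0.1 ms comes from, using the 240 GB/s figure from earlier in the thread (a sketch, assuming a 64-bit Forth):

    \ 32 MB of cache divided by 240 GB/s, scaled to microseconds
    32000000 1000000 *  240000000000 /  .   \ prints 133, i.e. ~0.13 ms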
If you can split your problem into hundreds of small parallel tasks,
you can make good use of multi-core CPUs, possibly with a custom
scheduler for each core/thread.
I don't know enough about your problem and GPUs to comment on whether
GPUs are useful for it. The general impression I have about GPUs is
that they are good for doing the same thing to a lot of data.