Add fexpa learning path #2632
Added draft status and cascade settings to the index file.
| FEXPA can be used to rapidly perform the table lookup. With this instruction, a degree-2 polynomial is sufficient to obtain the same accuracy as the implementation we have seen before: | ||
| ```C | ||
| svfloat32_t lane_consts = svld1rq(pg, ln2_lo); // Load only ln2_lo |
Do you need ld1rq here? It's confusing because the comment says that only ln2_lo is loaded.
If ld1rq is used, why is it not loading ln2_hi too?
| Arm introduced in SVE an instruction called FEXPA: the Floating Point Exponential Accelerator. | ||
| Let’s segment the fraction part of the IEEE 754 floating-point representation into several sub-fields (Index, Exp and Remaining bits) with respective lengths of Idxb, Expb and Remb bits. |
| | IEEE754 precision | Idxb | Expb | Remb | |
| Given what we said in the previous chapters, the exponential function can be implemented with SVE intrinsics in the following way: | ||
| ```C | ||
| svfloat32_t lane_consts = svld1rq(pg, constants); // Load ln2_lo, c0, c2, c4 in register |
Is it worth using ld1rq in the example? It is not the most approachable instruction for this audience.
I think it would save a few lines and help understanding to use duplication instead.
Then you can add a note that further memory-access optimisation can be performed, and maybe link to the AOR versions.
Besides, using pg is wrong here; you need to use an all-true predicate.
| --- | ||
| ## Conclusion | ||
| The SVE2 FEXPA instruction can speed up the computation of the exponential function by performing the table lookup and bit manipulation in a single instruction. |
I would generalise to "exponential functions" (e^x, 2^x, 10^x, x^y, ...), since the instruction accelerates the computation of 2^(n/N). Up to you.
| - Fewer instructions (no back-and-forth to scalar/SVE code) | ||
| - Potentially higher aggregate throughput (more exponentials per cycle) | ||
| - Lower power & bandwidth (data being kept in SME engine) | ||
| - Cleaner fusion with GEMM/GEMV workloads |
Are you intentionally not mentioning SoftMax and AI applications? It could help understanding the use for such fusion.
Before submitting a pull request for a new Learning Path, please review Create a Learning Path
Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.