Caveat: The speed-up factors quoted on this page were obtained from micro-benchmarks of individual primitive functions; in real applications, the overall speed-up depends on the mix of primitives used.
All benchmark tests were performed on 64-bit interpreters on Linux/Microsoft Windows operating systems.
Internal Benchmarks
Internal benchmarking was performed on the initial release of Dyalog version 17.0 and the results compared with the initial release of Dyalog version 16.0.
The benchmarking process comprises over 13,000 benchmarks in more than 130 groups. For each group, the geometric mean of the timing ratios is calculated, and the groups are plotted in order of their means. The vertical axis of the graph shows the ratios as a percentage change; negative values are shown in blue and indicate a performance enhancement, and positive values are shown in red and indicate a deterioration in performance.
Results showed that core interpreter performance in Dyalog version 17.0 has an average improvement of 26% over Dyalog version 16.0.
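As a rough illustration of how the group summaries described above could be computed, the following Python sketch takes the geometric mean of the timing ratios in each group, sorts the groups by that mean, and reports each mean as a percentage change. The group names and ratios are made up, and the sketch assumes that a ratio is the version 17.0 time divided by the version 16.0 time; it is not the benchmarking tool itself.

```python
import math

def geometric_mean(ratios):
    """Geometric mean of positive timing ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def percent_change(ratio):
    """Express a timing ratio as a percentage change; negative means faster."""
    return (ratio - 1.0) * 100.0

# Hypothetical timing ratios per group (assumed: new time / old time).
groups = {
    "group_a": [0.5, 0.8, 1.0],
    "group_b": [1.1, 0.9, 1.0],
    "group_c": [0.25, 0.5, 1.0],
}

# Summarise each group, then sort the groups by their mean ratio.
summary = sorted(
    ((name, geometric_mean(rs)) for name, rs in groups.items()),
    key=lambda pair: pair[1],
)

for name, gm in summary:
    print(f"{name}: {percent_change(gm):+.1f}%")
```

The geometric mean is the natural choice for ratios because a benchmark that is twice as fast and one that is twice as slow cancel out exactly.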
Areas of Focus
- Vector instructions – Dyalog version 17.0 massively increases the use of vector instructions, both in terms of the supported platforms and the operations that are vectorised. Performance gains are very large in some cases; for example, comparing vectors of one-byte integers is 13.5 times faster in version 17.0 than in version 16.0 on modern Intel hardware.
- Simplifications – Many operations have special cases in which they are equivalent to less expensive operations. For example, 0+⍵ is just ⍵, and 0×⍵ is an array of zeroes. Using the more efficient operation (sometimes known as strength reduction) can speed up programs that require the generality of the first operation but usually only use the second. In Dyalog version 17.0, many simpler cases that can occur in scalar dyadics and set functions have been identified and are now supported by special code.
- Boolean operations – Many existing optimisations for Boolean operations have been extended with more complete, fast implementations of Boolean scalar dyadics, reductions, scans, inner products and outer products.
- Replicate and related functions – Replicate (dyadic /), expand (dyadic \), and where (monadic ⍸) have been significantly reworked, not only increasing their speed in common cases but also making them less reliant on branch prediction (which can result in varying and machine-dependent performance).
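The simplifications mentioned in the list above can be pictured with a small Python sketch. It models the two examples given there (0+⍵ and 0×⍵) on plain lists of integers; the function names `add` and `times` are hypothetical, and the sketch only illustrates the idea of checking for a cheap special case before doing the general work. It says nothing about how the interpreter detects these cases internally.

```python
def add(alpha, omega):
    """Scalar-plus-vector addition with a special case for a zero scalar."""
    if alpha == 0:
        # 0+omega is just omega: return the argument without touching its elements.
        return omega
    return [alpha + x for x in omega]

def times(alpha, omega):
    """Scalar-times-vector multiplication with a special case for a zero scalar."""
    if alpha == 0:
        # 0*omega is all zeroes: build the result without reading the elements.
        return [0] * len(omega)
    return [alpha * x for x in omega]

v = [3, 1, 4]
print(add(0, v), add(2, v))
print(times(0, v), times(2, v))
```

Returning the argument itself is safe in a language where arrays behave as values; in Python the caller would have to avoid mutating the shared list.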
Vector Instructions
- Taking the maximum of one vector of 1-byte integers and another of 2-byte integers is now 3.8x faster.
- Dividing a vector of 2-byte integers by another such vector is 4.5x faster.
- Taking the 0.5th power (square root) of a floating-point vector is now 4.7x faster.
- Adding a vector of Booleans to another vector of 1-byte integers is 5.2x faster.
- Dividing a floating-point array by itself is now 179x faster.
- Taking the first power of an array now takes a constant 60 nanoseconds.
Comparisons
Element type | Speed-up
1-byte | 13x
2-byte | 4.7x
4-byte | 1.6x
floating-point | 1.7x
decimal floating-point | 2.0x
- Comparing a vector of 2-byte integers with a scalar number outside of 2-byte range is 133x faster.
- If that scalar number is positive, then taking the minimum of the vector and the scalar now takes a constant 60 nanoseconds, regardless of the vector's length.
- Testing for equality between 1-byte integers and 1-byte characters is 209x faster.
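One plausible reason why a comparison against an out-of-range scalar can be so cheap is that the result is known from the element type alone, without reading any data. The Python sketch below illustrates that idea for 2-byte integers; the function names are hypothetical and the mechanism is an assumption, not a description of the interpreter's code.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def less_than_scalar(vec, scalar):
    """Element-wise vec < scalar, for vec known to hold only 2-byte integers."""
    if scalar > INT16_MAX:
        return [1] * len(vec)      # every 2-byte value is smaller
    if scalar <= INT16_MIN:
        return [0] * len(vec)      # no 2-byte value is smaller
    return [int(x < scalar) for x in vec]

def minimum_with_scalar(vec, scalar):
    """Element-wise minimum of vec and scalar, for 2-byte integer vec."""
    if scalar > INT16_MAX:
        return vec                 # the scalar never wins: constant-time result
    return [min(x, scalar) for x in vec]

v = [-5, 0, 32767]
print(less_than_scalar(v, 100000))
print(minimum_with_scalar(v, 100000) is v)
```

In the out-of-range case the minimum is the vector itself, so the cost no longer depends on its length.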
Indexing
- Indexing the rows of a shape 2 17 Boolean matrix by a Boolean vector is now 55x faster.
- Taking the outer product (∘.≠) of a length 1e4 Boolean vector with a length 7 Boolean vector is 22x faster.
- Selecting from up to 128-⎕IO Boolean values using a short-integer index vector is now 13x faster. This includes selecting from each row of a Boolean matrix, which is 18x faster than before.
- Selecting from up to 16-⎕IO 1-byte values using a short-integer index vector is now 2.7x faster.
Replicate (/), Expand (\), and Where (⍸)
- Compress (replicate with a Boolean left argument) on 1-byte values is 11x faster.
- Compress on 2-byte values is 6.1x faster.
- Replicate on a Boolean vector is about 2x faster, depending on the values in the left argument.
- The scalar replicate 4/ on 2-byte values is about 2.5x faster. Other types and constants benefit as well.
- The outer product (∘.+) uses replicate when the right argument is small enough. With a right argument of 5 short integers, it’s 2.8x faster.
- Where with a sparse argument consisting of a million values with one out of every hundred set is 6.1x faster.
Miscellaneous
- Reversing the rows of a shape 1e4 13 matrix of 1-byte data is 1.8x faster.
- The reduction ∧⌿ on a shape 1e4 12 Boolean matrix is at least 3.5x faster, but can finish early and be up to 53x faster.
- Minus reduction (-/) on a Boolean array is up to 1,460x faster.
- Converting integers to characters with ⎕UCS is 11.2x faster, and in the other direction it’s 12.7x faster.
- Match each (≡¨) and its negation (≢¨) are 5.8x faster on vectors of short vectors.
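The note above that ∧⌿ "can finish early" can be illustrated with a short Python sketch: once the running result is all zeroes, no later row can change it, so the remaining rows need not be read. The function name is hypothetical, the matrix is a list of rows with at least one row, and the exact strategy used by the interpreter is an assumption.

```python
def and_reduce_first(rows):
    """AND-reduce a Boolean matrix down its columns, stopping early if possible.

    Returns the result vector and the number of rows actually read.
    """
    acc = list(rows[0])
    rows_read = 1
    for row in rows[1:]:
        if not any(acc):
            break                  # all columns are already 0; nothing can change
        acc = [a & b for a, b in zip(acc, row)]
        rows_read += 1
    return acc, rows_read

matrix = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
    [1, 1, 1],
]
print(and_reduce_first(matrix))
```

How much is saved depends on the data, which is consistent with the range of speed-ups quoted for this reduction.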