#### Abstract Expressionism for Parallel Performance

Robert Bernecky<sup>1</sup> Sven-Bodo Scholz<sup>2</sup>

<sup>1</sup>Snake Island Research Inc, Canada bernecky@snakeisland.com

<sup>2</sup>Heriot-Watt University, UK S.Scholz@hw.ac.uk

This paper was presented at PLDI 2015, Portland, OR.

August 31, 2015

Optimizing Functional Array Language (FAL) compilers for languages such as APL (APEX) and SAC (sac2c), now produce code that outperforms hand-optimized C in both serial and parallel arenas, while retaining the abstract expressionist nature of well-written FAL code.

In this talk, we demonstrate how FAL can now outperform C, in both serial and OpenMP variants, by up to a third, with *no* source code modifications. We also show that modern optimizers can sometimes generate identical loops from substantially different FAL source code.

- Serial performance: physics relaxation benchmark
- Parallel performance: physics relaxation benchmark
- Wild applause

#### A Physics Benchmark: Vector Relaxation

- ▶ Inputs: temperatures (fixed) at each end of *N*-element rod
- Output: End element temperatures remain unchanged;
   Other element temps are arithmetic mean of neighbors
- ► Application: image processing, *e.g.*, dust removal (2D)
- Application: temperature distribution in a rod

Dyalog APL/S-64 Version 14.1.25324 8-core AMD FX-8350 (Piledriver) @ 4013MHz, 32GB DRAM Ubuntu 14.04LTS, sac2c Build #18605, gcc 4.8.2-19ubuntu1 100000 iterations of relaxation kernel 100001-element vector argument, *N*  Three Ways to do Vector Relaxation in Dyalog APL

- Abstract: No tinkering of "memory"
- Expressions: No need for variables (convenience only)
- TD ← { (1↑ω), (((2↓ω)+<sup>-</sup>2↓ω)÷2.0),<sup>-</sup>1↑ω}
- ► ROT←{N←ρω

$$m \leftarrow (0=1N) \vee (N-1)=1N$$
$$(m \times \omega) + (\sim m) \times ((1\varphi \omega) + 1\varphi \omega) \div 2.0 \}$$

 $\begin{array}{l} m \leftarrow (0 = \iota N) \vee (N - 1) = \iota N \\ (m \times \omega) + (\sim m) \times ((1 \text{ shift } \omega) + 1 \text{ shift } \omega) \div 2 \\ \text{shift} \leftarrow \{((\times \alpha) \times \rho \omega) \uparrow \alpha \downarrow \omega\} \end{array}$ 

```
TD \leftarrow \{(1 \land \omega), (((2 \lor \omega) + 2 \lor \omega) \div 2.0), -1 \land \omega\}
ROT \leftarrow {N \leftarrow \rho \omega
            m \leftarrow (0=1N) \vee (N-1)=1N
            (m \times \omega) + (\sim m) \times ((1 \varphi \omega) + 1 \varphi \omega) \div 2.0
SHF \leftarrow {N \leftarrow \rho \omega
            m \leftarrow (0=1N) \vee (N-1)=1N
            (m \times \omega) + (\sim m) \times ((1 \text{ shift } \omega) + 1 \text{ shift } \omega) + 2
            shift ((\times \alpha) \times \rho \omega) \wedge \alpha \downarrow \omega
                           API, TD
                                                   82.6s
    Timings: APL ROT | 203.9s
                           APL SHF | 236.9s
```

# Serial Relaxation in C Using IF/THEN/ELSE

```
for( j=0; j<N; j++) {</pre>
     if(0==j) {
       res[j] = v[j];
     } else if((N-1)==j) {
       res[j] = v[j];
     } else {
       res[j] = (v[j-1] + v[j+1])/2.0;
     }
  }
           APL TD
                       82.6s
► Timings: APL ROT
                     203.9s
                      236.9s
           APL SHF
```

# Serial Relaxation in C Using IF/THEN/ELSE

```
for( j=0; j<N; j++) {
     if(0==j) {
       res[j] = v[j];
     } else if((N-1)==j) {
       res[j] = v[j];
     } else {
       res[j] = (v[j-1] + v[j+1])/2.0;
     }
  }
                               82.6s
           APL TD
                              203.9s
           APL ROT
Timings:
                              236.9s
           APL SHF
                               16.3s
           C IF/THEN/ELSE
```

# Serial Relaxation in C Using Conditional Expressions

|          | APL TD         | 82.6s            |
|----------|----------------|------------------|
|          | APL ROT        | 203.9s<br>236.9s |
| Timings: | APL SHF        | 236.9s           |
|          | C IF/THEN/ELSE | 16.3s            |
|          | C COND         | 16.4s            |

# Serial Relaxation in SAC Using Conditional Expressions

|                              | AFLI ID        | 02.03  |
|------------------------------|----------------|--------|
| <ul> <li>Timings:</li> </ul> | APL ROT        | 203.9s |
|                              | APL SHF        | 236.9s |
|                              | C IF/THEN/ELSE | 16.3s  |
|                              | C COND         | 16.4s  |
|                              | SAC COND       | 12.0s  |
|                              |                |        |

Can SAC do better? Three data-parallel With-Loop partitions:

```
res = with {
       ([0] <= [j] < [1]) : v[j];
       ([1] <= [j] < [N-1]) :
          (v[j-1] + v[j+1])/2.0;
       ([N-1] \le [j] \le [N]) : v[j];
      } : modarray( v);
                               82.6s
           APL TD
                              203.9s
           APL ROT
                              236.9s
           APL SHF
                                16.3
► Timings: C IF/THEN/ELSE
           C COND
                                16.4
                               12.0s
           SAC COND
                                5.9s
           SAC HAND
```

# Serial Relaxation using Abstract Expressionism and APEX

- Take and drop algorithm in APEX
- TD ← { (1↑ω), (((2↓ω)+<sup>-</sup>2↓ω)÷2.0), <sup>-</sup>1↑ω}
- Approximate APEX-generated SAC code

mid = (drop([2],v)+drop([-2],v))/2.0; res = take([1],v)++mid++take([-1],v); APL TD | 82.6s > Timings: SAC HAND | 5.9s APEX TD | 5.9s

► Identical inner loops for APEX TD and SAC HAND

# Serial Relaxation using Abstract Expressionism and APEX

```
ROT (N \leftarrow \rho \omega)

m \leftarrow (0 = iN) \lor (N-1) = iN

(m \times \omega) + (\sim m) \times ((1 \phi \omega) + (1 \phi \omega) \div 2.0)

m = (0 == iota(N)) | ((N-1) == iota(N));

res = (tod(m) * v) + tod(!m) *

((rotate([1], v) + rotate([-1], v)))/2.0;
```

Rotate algorithm in APEX, generated SAC code

|          | APL ROT  | 82.6s |
|----------|----------|-------|
| Timings: | SAC HAND | 5.9s  |
|          | APEX ROT | 5.9s  |

► Identical inner loops for APEX ROT and SAC HAND

# Serial Relaxation using Abstract Expressionism and APEX

#### Shift algorithm in APEX-generated SAC code

|                              | APL TD     | 82.6s  |
|------------------------------|------------|--------|
| <ul> <li>Timings:</li> </ul> | APL ROT    | 203.9s |
|                              | APL SHF    | 236.9s |
|                              | SAC HAND   | 5.9s   |
|                              | APEX TD    | 5.9s   |
|                              | APEX ROT   | 5.9s   |
|                              | APEX SHIFT | 5.9s   |

ALL inner loops are identical!

- APL source codes differ substantially
- Very different SAC stdlib code for rotate, shift, take/drop
- *E.g.*, number of With-Loops, setup code style
- See paper for stdlib code, here: http://www.snakeisland.com/abstractexpressionism.pdf

## Serial Performance GFLOPS

- Hard to do better? SAC/APEX approach maximum GFLOPS rate
- Let's look at parallel execution



Serial Relaxation Performance (One FPU)

Robert Bernecky, Sven-Bodo Scholz

Abstract Expressionism for Parallel Performance

#### Open MP

- Basic idea: Introduce ceremonial rubbish into SOURCE code
- See paper for ceremonial details
- Basic idea: Introduce pragmas into SOURCE code #pragma omp parallel for after SOME for statements.
- Compile with -fopenmp

## Parallel Relaxation Speedup in C Performance

Timings: (higher is better)



## Optimized Parallel Relaxation in C

```
for( j=0; j<N; j++) {
    if(0==j) {
        res[j] = v[j];
        } else if((N-1)==j) {
        res[j] = v[j];
        } else {
        res[j] = (v[j-1] + v[j+1])/2.0;
        }
}</pre>
```

- Bright idea: Replace multiple "res[j] =" by "e1 ="
- Bright idea: and add "res[j] = el;" after IF-statement
- Rationale: Eliminate multiple indexed assigns into "res"
- This should improve instruction cache use

### Pessimized Parallel Relaxation in C

Timings: (higher is better)

Relaxation Performance 8 ifc ifcoptimized 6 GFLOP/s 4 2 0 2 3 4 5 6

Number of threads

Robert Bernecky, Sven-Bodo Scholz

Abstract Expressionism for Parallel Performance

### Parallel Relaxation Slowdown in C Post-mortem

```
for( j=0; j<N; j++) {
    if(0==j) {
        el = v[j];
        } else if((N-1)==j) {
        el = v[j];
        } else {
        el = (v[j-1] + v[j+1])/2.0;
        }
        res[j] = el;
}</pre>
```

- What went wrong?
- el is shared, so it hops among all threads
- Bottom line: Bright idea not so bright (watch system monitor!)
- Bottom line: Writing parallel C code is NOT trivial

- Abstract expressionist APL matches best SAC code
- SAC and APL beat C by 2.75X in serial environment
- ▶ SAC and APL beat Open MP C by 1/3 in parallel environment
- ► NO changes to APL code for parallel execution, unlike C

### Serial and Parallel Relaxation Performance

#### Higher is better



Robert Bernecky, Sven-Bodo Scholz Abstract Expressionism for Parallel Performance

# SAC Keys to High-Performance FAL Compilation

- Provide purely functional Intermediate Language (IL)
- Preserve array semantics in IL
- Scalarize small arrays, *e.g.*:
- in Gaussian Elimination pivot, replacing: mat[rowa,rowb;] + mat[rowb,rowa;] by trow + mat[rowa;] > mat[rowa;] + mat[rowb;] > mat[rowb;] + trow
- ... gives 2X speedup!
- ► Do scalarization in the compiler, *NOT* in app source code.
- ▶ Use array-based optimizations, *e.g.*, with-loop folding (WLF)
- and others...
- Stay tuned for the book!

This work was supported in part by grant EP/L00058X/1, from the UK Engineering and Physical Sciences Research Council (EPSRC). The late Ken Iverson, an Albertan farm boy, had many excellent insights, for which we are grateful. The excellent performance of the sac2c compiler is due to the diligence of many researchers, whose contributions can be found on the SaC web site at http:sac-home.org. Our thanks to Philip Mucci and John D. McCalpin for answering our AMD architecture questions. We also thank the anonymous referees for their thoughtful comments. Thank you! Questions?