# Parallel Execution

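PetaPerl automatically parallelizes eligible loops using [Rayon](https://github.com/rayon-rs/rayon), a Rust data-parallelism library built on a work-stealing thread pool. Combined with JIT compilation, this yields large speedups for compute-heavy workloads: the Mandelbrot benchmark under Performance runs 431x faster than perl5 with 8 threads.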

## How It Works

When PetaPerl encounters a parallelizable loop, it:

1. **Analyzes the loop body** for side effects and shared mutable state
2. **Identifies reduction variables** (accumulators like `$sum += ...`)
3. **Distributes iterations** across threads using Rayon's work-stealing scheduler
4. **Combines results** using the detected reduction operations

Each thread gets its own copy of loop-local variables. Reduction variables are combined after all threads complete.
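
The four steps can be pictured with a sequential simulation in plain Perl. This is only a conceptual sketch: the chunking scheme and the `chunked_sum` helper are illustrative assumptions, not PetaPerl internals, and no threads are actually started.

```perl
use strict;
use warnings;

# Hypothetical helper: simulates splitting a sum reduction across $workers.
sub chunked_sum {
    my ($data, $workers) = @_;
    my $n     = scalar @$data;
    my $chunk = int(($n + $workers - 1) / $workers);   # ceiling division

    my @partials;
    for my $w (0 .. $workers - 1) {
        my $start = $w * $chunk;
        last if $start >= $n;
        my $end = $start + $chunk - 1;
        $end = $n - 1 if $end > $n - 1;

        my $local = 0;                           # private copy of the accumulator
        $local += $data->[$_] for $start .. $end;
        push @partials, $local;
    }

    my $sum = 0;
    $sum += $_ for @partials;                    # combine step, after all chunks finish
    return $sum;
}

my @data  = (1 .. 100);
my $total = chunked_sum(\@data, 4);
print "$total\n";   # prints 5050
```

Because integer addition is associative, the partial sums give the same total regardless of how the iterations are grouped. Floating-point reductions may differ in the last digits when the grouping changes.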

## What Gets Parallelized

### JIT'd While-Loops

The JIT parallelizes a while-loop when the analysis detects the following:

- A counter variable with known bounds
- Reduction variables (accumulate-only pattern)
- No I/O or side effects in the loop body

The loop body is then compiled once and executed in parallel across threads.
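
A loop of the following shape matches the criteria listed above. The variable names and bound are made up for illustration, and whether a given loop is actually dispatched in parallel depends on PetaPerl's analysis.

```perl
use strict;
use warnings;

my $n   = 100_000;
my $i   = 0;      # counter with known bounds: 0 .. $n - 1
my $sum = 0;      # reduction variable: only ever accumulated

while ($i < $n) {
    $sum += $i * $i;   # pure arithmetic, no I/O
    $i++;
}

print "$sum\n";        # output happens outside the loop body
```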

### Built-in Functions

`map` and `grep` with pure callbacks can execute in parallel:

```perl
my @results = map { expensive_computation($_) } @large_array;
my @filtered = grep { complex_test($_) } @large_array;
```

Parallelization requires:
- No side effects in the callback
- No shared mutable state
- Collection size above the parallelization threshold
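
The difference between a pure and an impure callback can be seen in a small example. How the analyzer classifies each case is an assumption based on the rules above, and a ten-element list would likely fall below any realistic threshold; it is kept short for readability.

```perl
use strict;
use warnings;

my @input = (1 .. 10);

# Pure callbacks: the result depends only on $_
my @squares = map  { $_ * $_ }     @input;
my @evens   = grep { $_ % 2 == 0 } @input;

# Impure callback: increments a variable shared by all iterations,
# which violates the "no shared mutable state" requirement
my $calls   = 0;
my @doubled = map { $calls++; $_ * 2 } @input;
```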

## CLI Control

```bash
# Default: parallelization enabled
pperl script.pl

# Disable parallelization
pperl --no-parallel script.pl

# Explicitly enable (default)
pperl --parallel script.pl

# Set thread count (default: number of CPU cores)
pperl --threads=4 script.pl

# Set minimum collection size for parallelization
pperl --parallel-threshold=1000 script.pl
```

The test harness runs with `--no-parallel` by default to ensure deterministic test output.

## Performance

### Mandelbrot Set (1000x1000)

| Mode | Time | vs perl5 |
|------|------|----------|
| perl5 | 12,514ms | 1.0x |
| pperl interpreter | ~3,500ms | 3.6x faster |
| pperl JIT | 163ms | 76x faster |
| pperl JIT + parallel (8 threads) | 29ms | 431x faster |

### Scaling

The work-stealing scheduler scales near-linearly up to 4 threads for embarrassingly parallel workloads, then tapers off:

| Threads | Mandelbrot 4000x4000 | Scaling |
|---------|---------------------|---------|
| 1 | baseline | 1.0x |
| 2 | ~50% time | ~1.9x |
| 4 | ~25% time | ~3.8x |
| 8 | ~19% time | ~5.2x |

At 8 threads, scaling is sub-linear due to memory bandwidth, cache effects, and reduction overhead.
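
One way to read the table is parallel efficiency, defined as speedup divided by thread count. The snippet below simply reworks the speedup figures from the table; the `efficiency` helper is a name introduced here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: parallel efficiency = speedup / threads
sub efficiency {
    my ($speedup, $threads) = @_;
    return $speedup / $threads;
}

# Speedups copied from the table above: threads => speedup
my %speedup = (1 => 1.0, 2 => 1.9, 4 => 3.8, 8 => 5.2);

for my $threads (sort { $a <=> $b } keys %speedup) {
    printf "%d threads: %.1fx speedup, %.0f%% efficiency\n",
        $threads, $speedup{$threads},
        100 * efficiency($speedup{$threads}, $threads);
}
```

Efficiency stays at about 95% for 2 and 4 threads and drops to about 65% at 8 threads.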

## Limitations

### String Operations

String operations (`.=` concatenation, string building) are not parallelized. The JIT implements string support through extern calls back into the Rust runtime, and those calls require mutable access to shared state. When the JIT detects string variables in a loop, it disables parallel dispatch for that loop.
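
The two loops below illustrate the distinction. This is a sketch under the rule stated above; whether the second loop is actually dispatched in parallel also depends on the other eligibility checks.

```perl
use strict;
use warnings;

my $n = 1000;

# String building in the loop body: parallel dispatch is disabled
my $csv = '';
my $i   = 0;
while ($i < $n) {
    $csv .= "$i,";
    $i++;
}

# Numeric-only loop: no string variables, so it remains a candidate
my $total = 0;
my $j     = 0;
while ($j < $n) {
    $total += $j;
    $j++;
}
```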

### Side-Effect Detection

The parallelization analyzer is conservative. Any of these disqualify a loop:

- I/O operations (`print`, `open`, file reads)
- Global variable writes
- Subroutine calls (unless proven pure)
- Regex operations with side effects (`s///`)

False negatives (missed parallelization opportunities) are safe: the loop simply runs sequentially. False positives (incorrect parallelization) would be bugs.
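
Each loop below contains one of the disqualifying constructs from the list. The variable and subroutine names are hypothetical, chosen only to show the patterns.

```perl
use strict;
use warnings;

our $last_seen = 0;                  # package (global) variable
my @data  = (1 .. 5);
my @words = qw(apple banana cherry);

sub helper { return $_[0] + 1 }      # hypothetical subroutine

# I/O operation in the loop body
for my $x (@data) { print "$x\n" }

# Global variable write
for my $x (@data) { $last_seen = $x }

# Subroutine call (disqualifies unless proven pure)
my $sum = 0;
for my $x (@data) { $sum += helper($x) }

# Regex with side effects: s/// modifies @words in place
for my $w (@words) { $w =~ s/a/A/g }
```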

### Determinism

Parallel execution may change the order of side effects. For this reason, parallelization is only applied when the analysis proves the loop body is free of observable side effects.

Output order is preserved for `map` and `grep`: the results array maintains the same element ordering as sequential execution.
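
The ordering guarantee can be pictured as reassembling chunk results by position rather than by completion time. The following is a sequential simulation under that assumption; it does not describe how PetaPerl implements the guarantee internally.

```perl
use strict;
use warnings;

my @input      = (1 .. 12);
my @sequential = map { $_ * 10 } @input;

# Simulate three workers that finish out of order (chunk 2 first)
my %done;
for my $chunk_id (2, 0, 1) {
    my @slice = @input[ $chunk_id * 4 .. $chunk_id * 4 + 3 ];
    $done{$chunk_id} = [ map { $_ * 10 } @slice ];
}

# Reassemble by chunk index, not by completion order
my @parallel = map { @{ $done{$_} } } sort { $a <=> $b } keys %done;

print "same order\n" if "@parallel" eq "@sequential";
```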

## How Reduction Detection Works

The analyzer identifies reduction variables by collecting every variable that matches an accumulation pattern (such as `+=`) and then removing any variable that is also reset inside the loop body:

```perl
# Detected as reduction: $sum accumulates, never reset in loop
my $sum = 0;
for my $x (@data) {
    $sum += $x;
}

# NOT a reduction: $temp is accumulated but also reset each iteration
for my $x (@data) {
    my $temp = 0;        # reset (my declaration)
    $temp += $x * 2;     # accumulation, cancelled by the reset above
    $sum += $temp;       # $sum is still a reduction
}
```

The formula: `reductions = accumulations - resets`, where `-` is set difference over variable names. This prevents false positives where a variable is both accumulated and reset within the loop body.
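
The set difference itself is easy to express in Perl. The sketch below assumes the analysis has already produced two sets of variable names for a loop that accumulates `$sum` and `$temp` but resets `$temp` each iteration; the hash names are hypothetical and do not reflect the analyzer's real data structures.

```perl
use strict;
use warnings;

# Hypothetical analysis output: variable names found in each pattern
my %accumulated = map { $_ => 1 } qw(sum temp);
my %reset       = map { $_ => 1 } qw(temp);

# reductions = accumulations - resets
my @reductions = grep { !$reset{$_} } sort keys %accumulated;

print "@reductions\n";   # prints: sum
```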