# Design Sketch: Integer Division and Modulus Operators ## Overview Currently, Emboss does not support division, which limits what it can express in its expression language. In particular, it is awkward to handle things like arrays of 16-bit values whose size is specified in bytes. The reason Emboss does not yet have support for division is that division is tricky to handle correctly. This document attempts to explain the pitfalls and the proper solutions. Note that Emboss does not support non-integer arithmetic, so this document only talks about *integer* division. ## Syntax ### Symbol Many programming languages use `/` for all division operations, whether they are "integer" division, floating point division, rational arithmetic division, matrix division, or anything else. Very, very few languages (Haskell) use an operator other than `%` for modulus/remainder, and no popular language overloads `%` with a second meaning. (Note that Haskell defines `%` as *rational number division*. `mod` and `rem` are used for modulus and remainder, respectively, and `div` and `quot` are used for truncating and flooring division.) `%` is technically modulus in some languages, and remainder in others, but in all cases the equality `x == (x / n * n) + x % n` holds, and does not seem to confuse programmers. #### C, C++, C#, Java, Go, Python 2, Rust, SQL, ... All of these languages use `/` for multiple kinds of division, including integer division. Python 2 deprecated `/` for integer division; in a Python 2 program with `from __future__ import division`, `/` is floating-point division. C++ and Python both allow `/` to be overloaded for user-defined types. #### JavaScript, TypeScript, Perl 5 These languages use `/` for floating-point division, and do not directly support integer division; both operands and result must be cast to int: ((l | 0) / (r | 0)) | 0 // JS, TS int(int($l) / int($r)) # Perl #### OCaml, Haskell, Python 3 These languages use separate operators for integer and non-integer division. Integer division: l `div` r -- Haskell l / r (* OCaml *) l // r # Python 3 Non-integer division: l / r -- Haskell l /. r (* OCaml *) l / r # Python 3 #### Emboss Emboss should use `//` for division, following Python 3's example. This is the only unambiguous integer division operator used by a TIOBE top-20 language, *and* Python 3's `//` has the same division semantics (flooring division) as Emboss -- most other programming languages use truncating (round-toward-0) division. If Emboss ever gets support for floating-point arithmetic, it should probably use OCaml's `/.` operator for division on floats. Because `/` does different things in different popular languages (flooring integer division, truncating integer division, floating-point division, or multiple types, depending on operand types), Emboss should not use it. Emboss should use `%` for modulus, following the example of almost every other language. ### Precedence In most programming languages, division is a left-associative operator with equal precedence to multiplication; i.e. this: a * b / c * d is equivalent to this: ((a * b) / c) * d However, a nontrivial percentage of programmers tend to read it as: (a * b) / (c * d) This may be because, after grade school, most division is expressed in fraction form, like: a * b ----- c * d Interestingly, fewer programmers seem to mis-read extra division: a * b / c / d is usually (correctly) parsed as: ((a * b) / c) / d Given that confusion, in Emboss, the construct: a * b / c * d should be a syntax error. The constructs: a * b / c / d a * b / c + d could be syntax errors, although the second one (`... + d`), in particular, seems to get parsed correctly by all programmers in my polling. This is in keeping with two general rules of Emboss design: 1. It is better to lock things down at first and gradually ease restrictions than the other way around. 2. Emboss syntax should be as unambiguous as possible, even when it makes implementation trickier or `.emb` files slightly wordier. (See, for example, the "separate-but-equal" precedence of the `&&` and `||` operators in Emboss.) ## Semantics ### Flooring Division vs Truncating Division Roughly speaking, there are two common forms of "integer division:" *flooring* (sometimes called *Euclidean*) and *truncating* (sometimes called *round-toward-0*) division. | Operation | Flooring Result | Truncating Result | | ---------- | ---------------- | ----------------- | | +8 / +3 | 2 | 2 | | +8 / -3 | -3 | -2 | | -8 / +3 | -3 | -2 | | -8 / -3 | -3 | 2 | | +8 % +3 | 2 | 2 | | +8 % -3 | -1 | 2 | | -8 % +3 | 1 | -2 | | -8 % -3 | -2 | -2 | Most programming languages and most CPUs implement truncating division, [however, truncating division is irregular around 0](https://dl.acm.org/doi/pdf/10.1145/128861.128862), which prevents some kinds of expression rewrites (e.g., `(x + 6) / 3` is not the same as `x / 3 + 2` when `x == -4`) and would force Emboss's bounds tracking to have wider bounds in some common cases, such as modulus by a known constant, where the dividend could be either positive or negative. Emboss should use flooring division. ### Undefined Results Division is the first operation in Emboss's expression language which can have an undefined result: all other operations are total. Emboss does have some notion of "invalid value," which it uses at runtime to handle fields whose backing store does not exist or whose existence condition is false, however, "invalid value" is not fully propagated in the expression type system, and it has a somewhat different meaning than "undefined." Currently, integer types in the Emboss expression language are sets of the form: {x ∈ ℤ | (min ≤ x ∨ min = -infinity) ∧ (x ≤ max ∨ max = infinity) ∧ x MOD m = n ∧ 0 ≤ n < m ∧ min ∈ ℤ ∪ {-infinity} ∧ max ∈ ℤ ∪ {infinity} ∧ m ∈ ℤ ∪ {infinity} ∧ n ∈ ℤ} where `min`, `max`, `m`, and `n` are parameters. The special values `infinity` and `-infinity` for `max` and `min` indicate that there is no known upper or lower bound, and the special value `infinity` for `m` indicates that `x` has a constant value equal to `n`. This will have to change to something like: {x ∈ ℤ ∪ {undefined} | ((can_be_undefined ∧ x = undefined) ∨ (min ≤ x ≤ max ∧ x MOD m = n)) ∧ 0 ≤ n < m ∧ can_be_undefined ∈ 𝔹 ∧ min ∈ ℤ ∪ {-infinity, undefined} ∧ max ∈ ℤ ∪ {infinity, undefined} ∧ m ∈ ℤ ∪ {infinity} ∧ n ∈ ℤ ∪ {undefined}} where `can_be_undefined` is a new boolean parameter. This also leaks out into boolean types, which are currently either {true, false}, {true}, or {false}: they can now be {true, false}, {true}, {false}, {true, false, undefined}, {true, undefined}, {false, undefined}, or {undefined}. ### Expression Bound Calculations #### Division For an expression x of the form `l // r`, Emboss must figure out the five parameters of the type of x (`x_min`, `x_max`, `x_m`, `x_n`, `x_can_be_undefined`) from the five parameters of each of the types of `l` and `r`. x_can_be_undefined = l_can_be_undefined ∨ r_can_be_undefined ∨ r_min ≤ 0 ≤ r_max That is, `x` can be undefined if either input is undefined, or if there could be division by zero. If `r_min = 0 = r_max`, then: x_max = undefined x_min = undefined x_m = infinity x_n = undefined Otherwise, the remaining parameters can be computed by parts, although for `x_max` and `x_min` it is simpler to simply check all pairs of {`l_min`, `-1`, `1`, `l_max`} // {`r_min`, `-1`, `1`, `r_max`} (removing zeroes from the right-hand side, and removing `-1` and `1` if they do not fall between the `min` and `max`). `x_m` is: infinity if l_m = r_m = infinity else GCD(l_m, r_n) if r_m = infinity else 1 `x_n` is l_n // r_n if l_m = r_m = infinity else (l_n // r_n) MOD GCD(l_m, r_n) if r_m = infinity else 0 #### Modulus Modulus is somewhat simpler. x_can_be_undefined = l_can_be_undefined ∨ r_can_be_undefined ∨ r_min ≤ 0 ≤ r_max If `l_m = r_m = infinity`, `x_m = infinity` and `x_min = x_max = x_n = l_n % r_n`. Otherwise: x_max = max(0, r_max - 1) x_min = min(0, r_min + 1) x_m = 1 x_n = 0 Note that, as with some other bounds calculations in Emboss, there are some special cases where `l` or `r` is a very small set where it is technically possible to find a narrower bound: for example, if 6 ≤ l ≤ 7, l % 4 could have the bounds `x_max = 3` and `x_min = 2`. However, Emboss does not need to find the absolute tightest bound; it only needs to find a reasonable bound. ## Other Notes ### `IsComplete()` The *intention* of `IsComplete()` is that it should return `true` iff adding more bytes cannot change the result of `Ok()` -- that is, iff the structure is "complete" in the sense that there are enough bytes to hold the structure, even if the structure is broken in some other way. If the size of the structure is a dynamic value which involves division by zero, the second definition of `IsComplete()` becomes ill-defined, in that it becomes impossible to know if there are enough bytes for the structure: struct Foo: 0 [+2] UInt divisor 2 [+512 // divisor] UInt:8[] payload If `divisor == 0`, the size of `payload` is undefined, so the structure's completeness is undefined. On the other hand, if `divisor == 0`, then the structure can never be `Ok()`, so in that sense it is 'complete': there is no way to add more bytes and get a structure that is functional. There may be another way for the client software to recover, such as scanning an input stream for a new start-of-message marker. I (bolms@) think that the best option is: 1. Add a new method `Maybe IsUndefined()` to virtual field views (and maybe to `UIntView`, `IntView`, and `FlagView`). 2. If `IntrinsicSizeInBytes().IsUndefined()`, `IsComplete()` should return `true`. (Note that `IntrinsicSizeInBytes()` is just the way the `$size_in_bytes` implicit virtual field is spelled in C++.) * This will require a bunch of plumbing into the whole expression generation logic -- probably changing `Maybe` to have a 'why not' reason when there is no `T`. 3. Otherwise, proceed with the current `IsComplete()` logic. 4. Update the documentation for `IntrinsicSizeInBytes()` and `IsComplete()` to note this caveat. In practice, it is rare to have a dynamic value that could be zero as a divisor in a field position or length: such divisions are almost always either division by a constant or division by one of a small set of known divisors, often powers of 2 (and in many of those cases, a shift operation would fit better).