The Details: diffbehavior

Are these two functions equivalent?

# foo.py
from typing import List

def cut1(a: List[int], i: int) -> None:
  a[i:i+1] = []

def cut2(a: List[int], i: int) -> None:
  a[:] = a[:i] + a[i+1:]

Almost! But not quite.

CrossHair’s diffbehavior command can help you find out:

$ crosshair diffbehavior foo.cut1 foo.cut2

Given: (a=[9, 0], i=-1),
  foo.cut1 : after execution a=[9, 0]
  foo.cut2 : after execution a=[9, 9, 0]
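To see why this input separates the two functions, it can help to replay it by hand. The sketch below repeats the two definitions from foo.py and applies each to its own copy of the reported list; the variable names a1 and a2 are only illustrative.

```python
from typing import List


def cut1(a: List[int], i: int) -> None:
    a[i:i+1] = []


def cut2(a: List[int], i: int) -> None:
    a[:] = a[:i] + a[i+1:]


# Replay the reported input on two independent copies of the same list.
a1, a2 = [9, 0], [9, 0]

cut1(a1, -1)  # a[-1:0] is an empty slice, so nothing is removed
cut2(a2, -1)  # a[:-1] is [9] and a[0:] is [9, 0], so the list grows

print(a1)  # [9, 0]
print(a2)  # [9, 9, 0]
```

For non-negative indexes the two slices line up, which is why the functions are "almost" equivalent.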

How do I try it?

$ pip install crosshair-tool
$ crosshair diffbehavior <module>.<function> <module>.<function>

diffbehavior

$ crosshair diffbehavior --help
usage: crosshair diffbehavior [-h] [--verbose]
                              [--extra_plugin EXTRA_PLUGIN [EXTRA_PLUGIN ...]]
                              [--max_uninteresting_iterations MAX_UNINTERESTING_ITERATIONS]
                              [--per_path_timeout FLOAT]
                              [--per_condition_timeout FLOAT]
                              FUNCTION1 FUNCTION2

Find differences in the behavior of two functions.
See https://crosshair.readthedocs.io/en/latest/diff_behavior.html

positional arguments:
  FUNCTION1             first fully-qualified function to compare (e.g. "mymodule.myfunc")
  FUNCTION2             second fully-qualified function to compare

options:
  -h, --help            show this help message and exit
  --verbose, -v         Output additional debugging information on stderr
  --extra_plugin EXTRA_PLUGIN [EXTRA_PLUGIN ...]
                        Plugin file(s) you wish to use during the current execution
  --max_uninteresting_iterations MAX_UNINTERESTING_ITERATIONS
                        Maximum number of consecutive iterations to run without making
                        significant progress in exploring the codebase.
                        (by default, 5 iterations, unless --per_condition_timeout is set)

                        This option can be more useful than --per_condition_timeout
                        because the amount of time invested will scale with the complexity
                        of the code under analysis.

                        Use a small integer (3-5) for fast but weak analysis.
                        Values in the hundreds or thousands may be appropriate if you
                        intend to run CrossHair for hours.
  --per_path_timeout FLOAT
                        Maximum seconds to spend checking one execution path.
                        If unspecified:
                        1. CrossHair will timeout each path at the square root of
                           `--per_condition_timeout`, if specified.
                        2. Otherwise, it will timeout each path at a number of seconds
                           equal to `--max_uninteresting_iterations`, unless it is
                           explicitly set to zero.
                           (NOTE: `--max_uninteresting_iterations` is 5 by default)
                        3. Otherwise, it will not use any per-path timeout.
  --per_condition_timeout FLOAT
                        Maximum seconds to spend checking execution paths for one condition

diffbehavior your own code changes

Use git worktree to create an unmodified source tree, and then use crosshair diffbehavior to compare your local version to HEAD.

# Let's say we edit the cut() function in foo.py

# Step 1: Create an unmodified source tree under a directory named "clean":
$ git worktree add --detach clean

# Step 2: Have CrossHair try to detect a difference:
$ crosshair diffbehavior foo.cut clean.foo.cut

# Step 3: Remove the "clean" directory when you're done:
$ git worktree remove clean

An example shell function

If you find yourself doing this often, make a function or script. For example, you might put this function in your ~/.bashrc file:

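diffbehavior() {
    # Use "return", not "exit": "exit" would close your interactive shell.
    git worktree add --detach _clean || return 1
    # "_clean.$@" turns the first argument into "_clean.<function>" and
    # passes any further arguments (e.g. --verbose) through to crosshair.
    crosshair diffbehavior "$1" "_clean.$@"
    git worktree remove _clean
}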

Then, you can diff your uncommitted changes very easily:

$ diffbehavior foo.cut
...
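If you would rather keep this helper in Python, the same three steps might be sketched as follows. This is a rough equivalent under stated assumptions, not part of CrossHair: the function names diffbehavior_command and diffbehavior are hypothetical, the "_clean" directory name is taken from the shell function, and the script is assumed to run from the repository root with git and crosshair on your PATH.

```python
import subprocess
from typing import List, Sequence

WORKTREE = "_clean"  # same directory name as the shell function uses


def diffbehavior_command(
    function: str, extra_args: Sequence[str] = (), worktree: str = WORKTREE
) -> List[str]:
    """Build the crosshair command comparing the local function to the worktree copy."""
    return ["crosshair", "diffbehavior", function, f"{worktree}.{function}", *extra_args]


def diffbehavior(
    function: str, extra_args: Sequence[str] = (), worktree: str = WORKTREE
) -> int:
    """Create the worktree, run crosshair, and remove the worktree again."""
    # Step 1: check out the committed (unmodified) sources into a detached worktree.
    subprocess.run(["git", "worktree", "add", "--detach", worktree], check=True)
    try:
        # Step 2: let CrossHair look for a behavioral difference.
        return subprocess.run(diffbehavior_command(function, extra_args, worktree)).returncode
    finally:
        # Step 3: clean up even if crosshair fails or is interrupted.
        subprocess.run(["git", "worktree", "remove", worktree])
```

Calling diffbehavior("foo.cut", ["--verbose"]) would then run the same commands as the shell function. The try/finally is the main design difference: the worktree is removed even when the comparison is interrupted.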

Refactoring? Use diffbehavior to make sure it’s safe.

Say we start with this:

# foo.py
from typing import List

def longest_str(items: List[str]) -> str:
  longest = ''
  for item in items:
    if len(item) > len(longest):
      longest = item
  return longest

… and change it to this:

def longest_str(items: List[str]) -> str:
  return max(items,
             key=lambda item: len(item),
             default='')

We can use the shell function above to help make sure the code doesn’t operate differently:

$ diffbehavior foo.longest_str
No differences found. (attempted 15 iterations)
Consider trying longer with: --per_condition_timeout=<seconds>
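A few hand-picked inputs can make this result easier to trust. The sketch below places the two versions side by side under the hypothetical names longest_str_loop and longest_str_max and compares them on some edge cases; the sample inputs are our own choice, not CrossHair output.

```python
from typing import List


def longest_str_loop(items: List[str]) -> str:
    longest = ''
    for item in items:
        if len(item) > len(longest):
            longest = item
    return longest


def longest_str_max(items: List[str]) -> str:
    return max(items, key=lambda item: len(item), default='')


# Edge cases: empty list, only empty strings, and ties in length.
samples = [[], [''], ['a', 'bb'], ['bb', 'cc'], ['', 'x', '']]
for items in samples:
    assert longest_str_loop(items) == longest_str_max(items)
```

The tie case is the subtle one: the loop keeps the first longest string because it uses a strict comparison, and max() also returns the first maximal item it encounters.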

Developing new features or fixing bugs? diffbehavior finds inputs to test.

Say we start with this:

def isack(s: str) -> bool:
    if s in ('y', 'yes'):
        return True
    return False

… and change it to this:

def isack(s: str) -> bool:
    if s in ('y', 'yes', 'Y', 'YES'):
        return True
    if s in ('n', 'no', 'N', 'NO'):
        return False
    raise ValueError('invalid ack')

We can use the shell function above to find useful inputs for testing:

$ diffbehavior foo.isack
Given: (s='\x00'),
         foo.isack : raises ValueError('invalid ack')
  _clean.foo.isack : returns False
Given: (s='YES'),
         foo.isack : returns True
  _clean.foo.isack : returns False

CrossHair reports examples in descending order of added coverage, so consider writing your unit tests from these inputs, working from the top down.

But don’t do it blindly! CrossHair doesn’t always give pleasant examples; instead of '\x00', you can use 'a', which covers the same logic and is easier to read.
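As an illustration, the two reported inputs might turn into tests like these. This is a minimal sketch: the test names are hypothetical, the test_ prefix assumes a pytest-style runner, and the new isack is repeated here only to keep the example self-contained.

```python
def isack(s: str) -> bool:
    if s in ('y', 'yes', 'Y', 'YES'):
        return True
    if s in ('n', 'no', 'N', 'NO'):
        return False
    raise ValueError('invalid ack')


def test_unrecognized_input_is_rejected() -> None:
    # 'a' stands in for the reported '\x00': unrecognized strings now raise.
    try:
        isack('a')
    except ValueError as exc:
        assert str(exc) == 'invalid ack'
    else:
        raise AssertionError('expected ValueError')


def test_uppercase_yes_is_accepted() -> None:
    # From the reported input s='YES': the old code returned False here.
    assert isack('YES') is True
```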

How does this work?

CrossHair uses an SMT solver (a kind of theorem prover) to explore execution paths and look for arguments that make the two functions behave differently. It uses the same engine as the crosshair check and crosshair watch commands, which check code contracts.
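For one concrete input, the comparison can be pictured roughly as below. This is a simplified sketch and not CrossHair's implementation: CrossHair chooses inputs symbolically with the solver, whereas this code only checks values you hand it. The helpers observe and behaves_differently, and the two example functions, are hypothetical names introduced for illustration.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple


def observe(fn: Callable[..., Any], kwargs: Dict[str, Any]) -> Tuple[str, Any, Dict[str, Any]]:
    """Run fn on a private copy of kwargs; report its outcome and the final argument values."""
    args = copy.deepcopy(kwargs)  # each function gets its own copy to mutate
    try:
        kind, value = "returns", fn(**args)
    except Exception as exc:
        kind, value = "raises", repr(exc)
    return kind, value, args


def behaves_differently(fn1: Callable[..., Any], fn2: Callable[..., Any], kwargs: Dict[str, Any]) -> bool:
    """True if the outcome or the post-execution arguments differ."""
    return observe(fn1, kwargs) != observe(fn2, kwargs)


def sort_in_place(a: List[int]) -> None:
    a.sort()  # mutates its argument


def sort_discarded(a: List[int]) -> None:
    sorted(a)  # builds a new list and leaves a unchanged


print(behaves_differently(sort_in_place, sort_discarded, {"a": [2, 1]}))  # True
print(behaves_differently(sort_in_place, sort_discarded, {"a": [1, 2]}))  # False
```

Both example functions return None, so only the comparison of the arguments after execution tells them apart, and only for an unsorted input. Finding such inputs automatically is the part the solver contributes.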

Caveats

  • This feature, as well as CrossHair generally, is a work in progress. If you are willing to try it out, thank you! Please file bugs or start discussions to let us know how it went.

  • Be aware that the absence of an example difference does not guarantee that the functions are equivalent.

  • CrossHair likely won’t be able to detect differences in complex code. Target it at the smallest piece of logic possible.

  • Your arguments must have proper type annotations.

  • Your arguments have to be deep-copyable and equality-comparable. (This is so that CrossHair can detect code that mutates them.)

  • CrossHair is supported only on Python 3.7+ and only on CPython (the most common Python implementation).

  • Only deterministic behavior can be analyzed. (Your code must always do the same thing when starting with the same values.)

  • Be careful: CrossHair will actually run your code and may pass arbitrary arguments to it.

Credits

The diffbehavior command was inspired by Hillel Wayne’s post about cross-branch testing!