`copilot-theorem`: Add function to produce counterexamples for invalid properties. Refs #589. #595

RyanGlScott · 2025-02-24T22:42:02Z

Currently, the Copilot.Theorem.What4.prove function returns a list of results, where each result contains a SatResult that describes whether a property is Valid, Invalid, or Unknown. The Invalid result has the limitation that it does not give any information about a specific counterexample that could drive Copilot into falsifying the property, however. This makes it challenging to interpret what the results of prove mean.

This introduces a new proveWithCounterExample function to Copilot.Theorem.What4 that mirrors the type signature of prove, except that it returns a variant of SatResult (SatResultCex) where the Invalid equivalent (InvalidCex) encodes counterexample information. copilot-theorem users can then interpret the results of the counterexample in Copilot specifications.

As part of this commit, we change the definition of the CounterExample data type. This is safe to do, as CounterExample was completely unused prior to this commit, nor was it exported.

Fixes #589.

Currently, the `Copilot.Theorem.What4.prove` function returns a list of results, where each result contains a `SatResult` that describes whether a property is `Valid`, `Invalid`, or `Unknown`. The `Invalid` result has the limitation that it does not give any information about a specific counterexample that could drive Copilot into falsifying the property, however. This makes it challenging to interpret what the results of prove mean. To define a companion function to `prove` that also returns counterexample information upon a failed proof, it is convenient to be able to display `Type` information in panic messages. This commit derives a basic `Show` instance for `Type` so that `copilot-theorem` can display them whenever an internal invariant is violated.

…Copilot-Language#589. Currently, the `Copilot.Theorem.What4.prove` function returns a list of results, where each result contains a `SatResult` that describes whether a property is `Valid`, `Invalid`, or `Unknown`. The `Invalid` result has the limitation that it does not give any information about a specific counterexample that could drive Copilot into falsifying the property, however. This makes it challenging to interpret what the results of prove mean. To update the `valFromExpr` function in order to produce concrete array values for counterexample purposes, we need to call the `Array` data constructor, which has a `Typed` constraint. However, the `XEmptyArray` and `XArray` data constructors do not record evidence that their array element types were instances of the `Typed` class, which makes it impossible to use them in `valFromExpr`. This commit adds the necessary constraints to each data constructor to make their array elements `Typed`.

ivanperez-keera · 2025-02-25T09:41:52Z

Change Manager: The build has failed. Please check.

RyanGlScott · 2025-02-25T11:27:14Z

My apologies, I was accidentally relying on code that only typechecked using GHC 8.8 or later. I've fixed this now—PTAL.

RyanGlScott · 2025-02-26T11:07:36Z

Implementor: Fix implemented, review requested.

copilot/examples/what4/ArithmeticCounterExamples.hs

copilot-theorem/src/Copilot/Theorem/What4.hs

copilot-theorem/tests/Test/Copilot/Theorem/What4.hs

copilot-theorem/src/Copilot/Theorem/What4.hs

copilot/examples/what4/ArithmeticCounterExamples.hs

Copilot-Language#589. Currently, the `Copilot.Theorem.What4.prove` function returns a list of results, where each result contains a `SatResult` that describes whether a property is `Valid`, `Invalid`, or `Unknown`. The `Invalid` result has the limitation that it does not give any information about a specific counterexample that could drive Copilot into falsifying the property, however. This makes it challenging to interpret what the results of prove mean. The `valFromExpr` function (which produces concrete values when making a counterexample) was lacking cases for `XEmptyArray` and `XArray`, so it would fail if the function was called on these values. This commit adds these missing cases, which make use of the `Typed` evidence added to `XEmptyArray` and `XArray` in a previous commit. We do not yet add a case for structs, which prove more challenging.

…properties. Refs Copilot-Language#589. Currently, the `Copilot.Theorem.What4.prove` function returns a list of results, where each result contains a `SatResult` that describes whether a property is `Valid`, `Invalid`, or `Unknown`. The `Invalid` result has the limitation that it does not give any information about a specific counterexample that could drive Copilot into falsifying the property, however. This makes it challenging to interpret what the results of prove mean. This introduces a new `proveWithCounterExample` function to `Copilot.Theorem.What4` that mirrors the type signature of `prove`, except that it returns a variant of `SatResult` (`SatResultCex`) where the `Invalid` equivalent (`InvalidCex`) encodes counterexample information. `copilot-theorem` users can then interpret the results of the counterexample in Copilot specifications. As part of this commit, we change the definition of the `CounterExample` data type. This is safe to do, as `CounterExample` was completely unused prior to this commit, nor was it exported.

Copilot-Language#589. Currently, the `Copilot.Theorem.What4.prove` function returns a list of results, where each result contains a `SatResult` that describes whether a property is `Valid`, `Invalid`, or `Unknown`. The `Invalid` result has the limitation that it does not give any information about a specific counterexample that could drive Copilot into falsifying the property, however. This makes it challenging to interpret what the results of prove mean. The `CounterExample`, `SatResultCex`, and `CopilotValue` data types lack `Show` and `ShowF` instances, which makes it impractical for users to display them. This commit adds `Show` and `ShowF` instances for all three data types so that they can be shown.

…ilot-Language#589. Currently, the `Copilot.Theorem.What4.prove` function returns a list of results, where each result contains a `SatResult` that describes whether a property is `Valid`, `Invalid`, or `Unknown`. The `Invalid` result has the limitation that it does not give any information about a specific counterexample that could drive Copilot into falsifying the property, however. This makes it challenging to interpret what the results of prove mean. A prior commit has introduced a `proveWithCounterExample` function, which provides a counterexample when a property is proven invalid. This commit updates the test suite to ensure that basic uses of `proveWithCounterExample` work as intended.

…lot-Language#589. Currently, the `Copilot.Theorem.What4.prove` function returns a list of results, where each result contains a `SatResult` that describes whether a property is `Valid`, `Invalid`, or `Unknown`. The `Invalid` result has the limitation that it does not give any information about a specific counterexample that could drive Copilot into falsifying the property, however. This makes it challenging to interpret what the results of prove mean. To demonstrate how to effectively use the newly added `proveWithCounterExample` function, this commit adds a new `examples/what4/ArithmeticCounterExamples.hs` function that behaves like `examples/what4/Arithmetic.hs`, but using `proveWithCounterExamples` instead of `prove`.

…#589.

RyanGlScott · 2025-02-27T14:28:07Z

Implementor: Fix implemented, review requested.

ivanperez-keera · 2025-02-28T03:14:04Z

Change Manager: Verified that:

Solution is implemented:

The code proposed compiles and passes all tests. Details:
Build log: https://github.com/Copilot-Language/copilot/runs/37925988037

The solution proposed produces the expected result. Details:
The following Dockerfile runs an example included in the PR that demonstrates the new feature by printing the results of proving some combination of valid and invalid properties, for which counterexamples are produced when applicable. I have checked by visual inspection that the results are produced, and that the message is well formatted and informative enough:

FROM ubuntu:focal

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update

RUN apt-get install --yes \
      libz-dev \
      git \
      curl \
      gcc \
      g++ \
      make \
      libgmp3-dev  \
      pkg-config \
      z3

RUN mkdir -p $HOME/.ghcup/bin
RUN curl https://downloads.haskell.org/~ghcup/0.1.19.2/x86_64-linux-ghcup-0.1.19.2 -o $HOME/.ghcup/bin/ghcup
RUN chmod a+x $HOME/.ghcup/bin/ghcup
ENV PATH=$PATH:/root/.ghcup/bin/
ENV PATH=$PATH:/root/.cabal/bin/

SHELL ["/bin/bash", "-c"]

RUN ghcup install ghc 9.10
RUN ghcup install cabal 3.2
RUN ghcup set ghc 9.10
RUN cabal update

SHELL ["/bin/bash", "-c"]
CMD git clone $REPO && cd $NAME && git checkout $COMMIT && cd .. \
  && cabal v1-sandbox init \
  && cabal v1-install alex happy --constraint='happy <= 2' \
  && cabal v1-install $NAME/copilot**/ \
  && cabal v1-exec -- runhaskell $NAME/copilot/examples/what4/ArithmeticCounterExamples.hs \
  && echo "Success"

Command (substitute variables based on new path after merge):

$ docker run -e "REPO=https://github.com/GaloisInc/copilot-1" -e "NAME=copilot-1" -e "COMMIT=9861071718850fb8ef85e533a6c0d6d083f69094" -it copilot-verify-589

Implementation is documented. Details:
All new top-level definitions include haddock documentations, as well as the internals of the code and the examples.
Change history is clear.
Commit messages are clear.
Changelogs are updated.
Examples are updated. Details:
A new example is introduced to demonstrate the feature.
Required version bumps are evaluated. Details:
Bump required; the new feature changes the types of some value constructors in copilot-theorem, thus affecting the public API.

RyanGlScott added 2 commits February 24, 2025 15:29

RyanGlScott force-pushed the develop-copilot-theorem-counterexamples-take-three branch from 2321709 to 2aaeea1 Compare February 25, 2025 10:48