Introduction

The Java Virtual Machine (JVM) is a key part underlying the definition and portability of Java programs. The JVM is a relatively clean and simple abstract machine, which has been realized in software (as an interpreter) and is being realized in hardware as well. The JVM has a bytecoded instruction set designed to be compact and easily interpreted in either hardware or software.

The JVM is an object-oriented machine, manipulating objects as well as numeric data. The JVM also supports multithreaded programs, although much of the semantics of concurrent programs are defined by standard classes, rather than by the JVM itself.

The JVM was designed to allow efficient, safe execution of (bytecode compiled) Java programs. One innovative technique employed to promote efficiency is the use of the bytecode verifier. The JVM does not assure type-safe execution by itself. Rather, JVM execution is intended to be type-safe when running programs that have been ``accepted'' by the bytecode verifier. The bytecode verifier is intended only to accept programs whose execution on the JVM will be type-safe. The standard JVM is defined to elide some type-safety checks based on these intended guarantees, using these guarantees to justify optimizations in the JVM definition that would not be type-safe without those guarantees.

Thus, while the JVM execution is intended to be type-safe when running programs accepted by the bytecode verifier, the type-safety is not obvious. The JVM is not guaranteed to be type-safe when running arbitrary programs.

The defensive JVM (dJVM) augments the JVM definition with additional run-time checks to assure type-safe execution. The definition is intended to be obviously type-safe when run on any program. There is no need for a bytecode verifier to screen programs.

How does the dJVM differ from the JVM?

The dJVM is obviously type-safe. At least the dJVM is intended to be obviously type-safe. In contrast, the JVM is obviously not type-safe by itself. The behavior of the JVM is only specified for programs that are (mostly) type-safe. For example, the behavior of the iadd instruction is not defined unless the top two words on the operand stack are int values. In contrast, the behavior of iadd on the dJVM is completely defined.At least the behavior of iadd is defined in all contexts in which such a question makes sense. For example, it does not make sense to ask about the effect of executing an iadd instruction if the call stack is empty, because that means no method is active to execute the instruction.

The dJVM is completely defined.

Unlike the JVM, the dJVM is completely defined.

If some constraint (a ``must'' or ``must not'') in an instruction description is not satisfied at run time, the behavior of the Java Virtual Machine is undefined [emphasis added -- RMC].
[ 6.1, p. 151 in JVMS]

The effect of every instruction in the dJVM is defined for all possible dJVM states. For every state in which the effect of a JVM instruction is defined, the corresponding dJVM instruction has the same effect on the the corresponding dJVM state. In addition, in the states for which a JVM instruction is undefined, the effect of the corresponding dJVM instruction in the corresponding dJVM states is to halt with an error indication.

What use is the dJVM definition?

Some things we can do with a formal dJVM model:

Expose/resolve questions about the JVM behavior (e.g., those not answered in the Java Virtual Machine Specifications [JVMS]).
Prove that the JVM is type-safe on programs accepted by the bytecode verifier --- this is a standard interpreter-equivalence proof.
Examine the model's behavior by running test cases (assuming its an executable model).

Why is this dJVM version 0.5?

The current model is the first phase of modeling the JVM in ACL2. It was intended to demonstrate that such a model could reasonably capture the key properties of the JVM, but to be a complete JVM model. It is a rough draft.

This model includes the basic object-oriented instructions and data structures. It leaves out concurrency, interfaces, arrays, and garbage collection. It leaves out floating point data. It only includes 103 of the 202 standard JVM instructions.

Components of JVM included in dJVM

These are the essential features of the JVM, and were included in the dJVM 0.5 model so that this first model would address these key aspects of the JVM. These features also addresses the key aspects of object creation and manipulation in this initial phase of the model.

Representation of class objects and class instances.
A minimal subset of standard Java classes needed to support class definition, object creation, and method execution are provided. This includes portions of the standard classes: Object and Class.
Required instructions: new, aconst_null
Instance methods and class methods.
Required instructions: invoke virtual, invoke special, invoke static, return, areturn, ireturn.
Instance variables and class variables.
Required instructions: getfield, putfield, getstatic, putstatic.
Instance initialization methods (i.e., <init> methods).
Primitive types int and long.
Some of the instructions that manipulate int or long values are included, except those related to excluded types (e.g., the i2f instruction, which converts an int integer value to a floating point value.)
i2l, i2s, iadd, iand, idiv, iinc, imul, ineg, ior, isub, ixor, l2i, ladd, land, ldiv, bipush, sipush
Enough other instructions to run some small examples.
Such instructions include:
1. some load and store instructions
  aload, astore, iload, istore, lload, lstore
2. some stack-manipulation instructions
  pop, pop2, dup, dup2, dup_x1, dup_x2, dup2_x1, dup2_x2, swap
3. some control-flow instructions
  goto, if_acmpeq, if_acmpne, if_icmpeq, if_icmpne, if_icmplt, if_icmpge, if_icmpgt, if_icmple, ifeq, ifne, iflt, ifle, ifge, ifnull, ifnonnull

Important Features Left Out For Now

A number of important features of the JVM have not been included in the dJVM 0.5 model. Some of these are candidates for the first extensions to dJVM 0.5.

Dynamic loading of classes.
Class-file verifier and bytecode verifier.
Technically the class-file verifier and bytecode verifier are not part of the JVM itself. However, they do form an important component of most practical Java implementations. As such they would be worthy extensions of the dJVM.
Class initialization methods (i.e., <clinit> methods).
They are an implicit part of the JVM, because they are invoked implicitly when a class is created; they are never invoked explicitly.
dJVM presumes that all class objects have been constructed and initialized prior to the start of execution (i.e., effectively constructed by some primordial loading process).
Interfaces, and interface initialization methods.
Exception handling .
Including the instructions jsr, jsr_w, ret, and athrow.
Multithreading.
Monitors and synchronization.
The instructions monitorenter and monitorexit have been be left out, as well as the special-case of the method-invocation instructions that deals with synchronized methods.
Garbage collection or object finalization.
Garbage collection is logically a no-op, and so can reasonably be left out of the model.

Non-essential Features

This section describes features of the JVM that do not pose problems for modeling or formalization. They have been left out of this initial dJVM model merely to reduce the level of effort required.

Arrays and all of the instructions that create or operate on arrays
The data type char
The data types float and double
ACL2 does not support floating point numbers.ACL2 does support arbitrary-precision integers (a.k.a., bignums) and rational numbers. The axiomatic description of rational arithmetic is much more tractable than that of floating point arithmetic. Further, unlike floating point numbers, rational numbers obey the normal laws of arithmetic! So it is unlikely that the dJVM model will ever be extended to define floating point operations in the JVM. However, since the JVM is defined to perform floating point calculations according to the IEEE Floating Point Standard, there is little advantage to such an extension.
16-bit Unicode characters.
The instruction-modifier wide.
Some non-essential instructions.
1. ldc
2. Many arithmetic instructions.

Things not yet in our world

Native methods.
Native methods are really extensions to the JVM. They are effectively new instructions that are executed via the method invocation mechanism, rather than via normal instruction dispatch.
Extending the dJVM to handle native methods would require two steps:
1. Extend the representation of methods to allow recording information about native methods.
2. Extend the method invocation instructions to allow creating a call-frame for a native method.
3. Extend the instruction dispatcher to recognize when the current method is a native method, and invocate the ACL2 description of the method.
Since each native method is an extension of the dJVM, each new native method will require that an additional case be handled by the instruction dispatcher. Each time a new native is added to the model, the functions or axioms describing the native method must be added to the model and the instruction-dispatcher function must be revised to handle invocation of this new native method properly.Because ACL2 is a first-order logic, the instruction dispatcher must be extended explicitly. ACL2 lacks the Common Lisp functions apply and funcall. Functions are not ``first class objects.''
External system resources (e.g., file systems, networks, etc.)
Eventually we want to augment our model to describe interactions with external resources (e.g., file systems and networks).
The _quick instructions. They are implementation-specific optimiziations that do not currently fit nicely into the dJVM model. The ``quick'' instructions anticipate an implementation based on dispatch vectors. But the current dJVM model performs dynamic lookup, rather than caching method bindings in a dispatch vector. A variant of the current model that uses dispatch vectors or other method-caching techniques could be developed.

What is a Formal Model?

The dJVM model is built in ACL2. ACL2 is a mathematical logic based on the applicative subset of Common Lisp [CommonLisp]. As such it provides a means for describing a formal specification of the dJVM. ACL2 also has a theorem prover for the logic. This theorem prover is the latest in a line of theorem provers built by Bob Boyer and J Moore. [ACL,ACLH,ACL2,acl2-industrial] Thus we can (attempt to) prove properties of these specifications.

ACL2 allows specifications to be built using function definitions, in the manner of pure LISP and other functional programming languages. Since ACL2 is based on a subset of Common Lisp, specifications may be executed in an underlying implementation of Common Lisp. Thus we can build executable specifications.

We prefer to use the terms formal model and executable model, rather than formal specification and executable specification, because these formal artifacts are always models of the real computing systems they are meant to describe. Often they are abstract models, leaving out some level of detail in order allow formal analysis and proofs about the behavior of the model. If it is a good model of the real computing system, then the analysis of the behavior of the model should also be predictive about the behavior of the real system.

This Report as a Program

The text of this report is mechanically derived from the dJVM definition accepted and compiled by the ACL2 system. In the tradition of Knuth's literate programming, the formal program and its description were woven together, and subsequently derived portions can be processed by either a compiler, a theorem prover, or a document formatter. [literate-programming,weave]

Thus, the program text that appears in this report is derived directly from the ASCII files used for compilation and execution of the model. This increases our confidence in the fidelity of this presentation to the executable model.

There are two versions of this report. One includes only the executable portion of the dJVM definition, and is intended to be more usable as straightforward (and somewhat more compact) explanation of the behavior of the dJVM.

A second version of this report includes the theorems and rewrite rules given to the ACL2 theorem prover as part of the model development and to support ACL2 guard verification for the model. This is not a complete rendition of the ACL2 proof script because some ACL2 directives are suppressed in the text.All of the definitions and theorems of the ACL2 proof script are included. Numerous directives enabling and disabling rewrite rules are suppressed, as are the directives for including ACL2 ``books'' (i.e., libraries). Of course, all of the details are available in the ACL2 source files, from which the sources for this document were derived.

ACL2 requires that programs be defined in a bottom-up fashion. Each function must be defined (or axiomatized) before it can be used. The presentation in this report follows this same basic order. However, some portions of the definition that don't contribute to the exposition appear at the end of this document as appendices. These portions include the description of bounded-integer arithmetic and the definitions of macros used to build the actual instruction definition. The macros are described in the main text, but the details of their definitions are relegated to the appendices.

The File Structure of the Model

The ``source code'' for the ACL2 model (and this report) are divided into a number of separate files. The files form a dependency hierarchy with most files depending upon definitions and rules introduced by other files. Here's a list of the files, and roughly bottom-up order. Several low-level files that do not contribute to the exposition appear as appendices.

minor-utilities.lisp --- Appendix [ch:minor-utilities]
preliminaries.lisp --- Appendix [ch:preliminaries]
arith.lisp --- Appendix [ch:arithmetic]
primitive-values.lisp --- Chapter [ch:primitive-values]
state.lisp --- Chapters [ch:data-types] through [ch:jvm-state].
internal-operations.lisp --- Chapter [ch:internal-operations]
frame-operation-macro --- Appendix [ch:frame-operations]
define-inst-macro.lisp --- Appendix [ch:Define-JVM-Instruction-Macro]
simple-instructions-a-h.lisp --- Chapter [ch:simple-instructions]
simple-instructions-h-i.lisp --- Chapter [ch:simple-instructions]
simple-instructions-j-z.lisp --- Chapter [ch:simple-instructions]
tableswitch.lisp --- Chapter [ch:simple-instructions]
new-instance.lisp --- Chapter [ch:building-instances]
invokevirtual.lisp --- Chapter [ch:method-invocation]
invokespecial.lisp --- Chapter [ch:method-invocation]
invokestatic.lisp --- Chapter [ch:method-invocation]
method-returns.lisp --- Chapter [ch:method-invocation]
getfield.lisp --- Chapter [ch:building-instances]
getstatic.lisp --- Chapter [ch:building-instances]
run-djvm.lisp --- Chapter [ch:top-level]
initial-djvm.lisp --- Chapter [ch:top-level]

The Remainder of this Report

Here's a brief overview of the remainder of this report.

Chapters [ch:primitive-values] through [ch:data-types] describe the primitive values and data types manipulated by the dJVM.
Chapters [sec:objects] through [ch:jvm-state] build the representation of objects, classes, the call stack, the class table, and the heap to yield the full description of the dJVM state.
Chapter [ch:internal-operations] defines internal operations that manipulate the dJVM state.
Chapter 10 describes the the standard structure we will use to define dJVM instructions, and the abbreviated forms (macros) that will be used to give those definitions.
Chapter [ch:simple-instructions] defines 98 ``simple instructions'' --- instructions that manipulate only the operand stack and local variables without reference to the heap or the class table.
Chapters [ch:building-instances] and [ch:method-invocation] describe creation of objects and invocation of methods.
Chapter [ch:top-level] describes the top-level of the dJVM interpreter and the construction and initialization of initial dJVM states.
The appendices include some low-level definitions that support support the model.

Naming, Spelling, and Typeographic Conventions

This report documents a proof-of-concept for building a formal ACL2 model of the defensive Java Virtual Machine. Due to the fast pace of the development of this initial model, I have not been entirely consistent in the use of naming conventions. In fact, I employed several in different sections of the model. Common Lisp and ACL2 are normally not case sensitive in resolving function and variable names. The name Calling-Frame and the name calling-frame are considered to be the same. In some places I have taken advantage of that, and have used capitalization in an attempt to improve the readability of long function-names.I'd like to use mixed-case function names throughout the presentation, but don't yet.

When function definitions are presented, the name of the function is typeset in a larger and different type face to make it stand out. The same style is used for ACL2 events that define theorems, data structures, and so on.

Notes on ACL2, Future Extensions, etc.

The text notes about ACL2, future extensions of the dJVM model, the Java language, and the Java Virtual Machine. These notes are set out from the regular text in the following style.

Note on the
Java
Language
Specification{ The first public version of the Java language manual was approximately 35 pages. The second version was about 70 pages. The third (and current) version is over 800 pages. As you ask more and more specific questions, the explanation of the details just seem to multiply. } The margin note identifies the topic of the note.

Marginal Cross-references

The ACL2 definitions that comprise the dJVM 0.5 model appear in the text in a fixed width typeface. To facilitate perusal of the definitions cross-references to functions appear in the margins. For example, the definition of the function local-method? from page [local-method?] appears below.


(defun local-method? (method-name method-sig class-decl)
  (declare (xargs :guard (and (stringp method-name)
                              (stringp method-sig)
                              (class-decl-p class-decl))))
  (java-method-bound? method-name 
                      method-sig
                      (class-decl-methods class-decl)))

The boxes appearing in the margin are cross references for a function that appears on the corresponding line of the definition. In this case indicating that the definition of class-decl-p can be found on page [class-decl-p], and the definition of java-method-bound? can be found on page [java-method-bound?]. The function to which the cross-reference refers is printed in small capital letters (e.g., as java-method-bound? and class-decl-methods above).

Remark: If some function names are underlined, ignore the underlining. The underlining is just an indication of where the automatically generated cross-reference mechanism has broken down. The underlining should be removed in a later draft.

is supposed to put margin notes in the outer margin of the page. However, it doesn't always get it right, as you may be able to observe on this page and elsewhere in this document. Apparently the page-fill algorithm doesn't reliably communicate with the margin note placement mechanism when text is moved to start a new page. I'm not sure how to fix this.