A Cryptographic Checksum for Integrity Protection
Fred Cohen
Lehigh University

This paper describes a cryptographic checksum technique for verifying the integrity of information in computer systems with no built-in protection. The technique is based on the use of repeated encryption using an RSA cryptosystem as a pseudo-random number generator (PRNG), the use of a user specified key as a seed for the PRNG, and reduction in a pseudo-random modulus as a means for mixing user specified information with generated numbers.

Keywords: Information Protection, Integrity, Cryptography, Cryptographic Checksum, Pseudo-Random Numbers

1 - Introduction

Interest in authentication codes for integrity assurance has persisted for a long time, and is now a major application of cryptography. Historically, integrity protection mechanisms have been designed to protect transmitted information from illicit creation, deletion, and/or modification, but recent events have demonstrated that integrity protection in information systems and networks may be a more pressing problem than was ever previously suspected [1].

In transmission, it is fairly usual to assure the integrity of information by using a cryptographic authentication transform which is held by parties at both ends of the communications media. This allows the recipient to assure that information has not been illicitly modified in the media. The major difference between typical communications media and typical storage media is that in a storage media information is typically available for observation and modification over an extended period of time, whereas in a communications media, a more urgent sense of timeliness is typically required in order to launch a serious attack.

The use of cryptographic checksums to verify integrity has been studied for use in systems with no built in protection mechanisms [2], and it appears to be a feasible method for increasing the complexity of undetected modification of information. The major problem with such a mechanism is in the placement of tests within the system so that an attacker cannot determine the locations of the tests and thereby circumvent them. In systems with built-in protection mechanisms, it is possible to use cryptographic checksums to substantially reduce the space required by the operating system for maintaining integrity [3].

The major problem with the cryptographic checksum technique used in [3] is that its performance is unacceptably slow for practical use due to the performance limitations of the RSA cryptosystem [4]. A recent hardware device [5] has brought RSA performance up to the rate of 6500 bits/second, but this rate does not even approach the rate necessary for practical authentication of large files at run time (i.e. 1 Mbyte would take over 20.5 minutes). The only present systems that offer reasonable cryptographic protection and high performance in a cryptographic checksum operation are based on private key cryptosystems, and the use of such systems prohibits the application proposed in [3].

In this paper, we examine a technique for testing information to assure that is has not been modified. We perform this test by using a cryptographic checksum 'C' generated as a function of a user specified key 'Kd' and the original information. By testing that the checksum hasn't changed, we may verify the integrity of the information at a later date. Our system is based on a combination of the RSA cryptosystem and a lower quality multiplicative ring cipher in such a manner as to assure to a reasonable degree that the system cannot be violated, while providing sufficient performance for applications such as that in [3]. The major advantages of our method are:

2 - Cryptographic Authentication in Information Systems

Upon program creation or after legitimate modification, the user applies an authentication algorithm A(K,F) under a key (K) to a file (F) to generate a checksum (Ct) which is the image of F under A(K,F) at time t. In order to check the file for subsequent modification, the user again applies A(K,F) under K to the (possibly modified) version of F at time t' (F') to generate a checksum (Ct') which is the image of F' under A(K,F) at time t'. We then compare Ct to Ct' to assure that they are equal, and in the case that they are different, conclude that either F' is a corrupt version of F, or that Ct was generated with a different key than Ct'.

---------       ---------       ---------
|       |       |       |       |       |
|   F   |------>|A(K,F) |------>|  Ct   |
|       |       |       |       |       |
---------       ---------       ---------
                    |               ||
                    K               <>-------->Corrupt
---------       ---------       ---------
|       |       |       |       |       |
|   F'  |------>|A(K,F')|------>|   Ct' |
|       |       |       |       |       |
---------       ---------       ---------

A cryptographic checksum can be used by any program (P) to verify files over which it has control. Every time a file (F) is written, P replaces F's cryptographic checksum (C); and every time P is about to read F, it verifies C. Each of these instances is also an opportunity for P to verify itself.

3 - The Basic Method and the Basis for Trusting it.

The use of a user specified key (K) provides a method by which the individual user (U) may assure that another user (V) is unable to forge a valid checksum (C) of a file (F). In order for this to be effective, it must be easy to derive C from K and F, it must be difficult to derive K from valid (F,C) pairs, it must be difficult to derive a valid checksum (C') for an invalid file (F') from valid (F,C) pairs without K, and in order to be practical, the size of C (|C|) must be much smaller than the size of F (|F|) for large F.

In a more mathematical sense, we choose the problems we wish to be 'easy' and 'hard' out of the space of all problems which can arise from attempts to generate subsets of {F,K,C} from other subsets of {F,K,C}. In the case of authentication, we are also concerned with the difficulty of generating subsets of {F',C'} from subsets of {F,K,C}. We choose 'hard' to mean that no foreseeable algorithm on any foreseeable computer will be able to solve the problem in any relevant length of time, and choose easy to mean that the present algorithm can accomplish the task within reasonable time for the application.

To our knowledge, this exhaustive approach to cryptosystem analysis has never been explicitly published, and several cases have occurred in the literature where cryptosystems that were thought secure were subsequently broken by the exploitation of a failure to account for such conditions. Because it is so obvious that all such conditions must be considered to be certain of a system's ability to protect information, we assume that previous authors have simply overlooked stating it, and have had it in mind throughout their analysis. We begin with the space of all possible (non-equivalent) subset pairs as:
F=>K K=>F C=>F
F=>C K=>C C=>K
(F,K)=>C (F,C)=>K (K,C)=>F
F=>F' K=>F' C=>F'
F=>C' K=>C' C=>C'
(F,K)=>F' (F,C)=>F' (K,C)=>F'
(F,K)=>C' (F,C)=>C' (K,C)=>C'

Since our primary concern is the instance of the authentication problem wherein the legitimate user has K and the attacker does not have K, we wish to make (F,K)=>C easy, and almost any problem without K hard. For authentication, we are not concerned with the illicit generation of a C' since any C'<>C would cause us to detect corruption. Assuming that no K,K'<>K produce identical results in a significant portion of cases, no generation of F'<>F is of concern. This leaves the following constraints:

(F,K)=>C is easy and
C=>F', F=>K, F=>C, C=>F, C=>K, (F,C)=>K are all hard

We choose the use of an RSA cryptosystem as a cryptographically strong PRNG because it has been found that repeated encryption of a single initial value from an initial key is as secure as the RSA itself, both for the purposes of preventing the determination of a previous value from the present value, and for providing a subsequent value that is very randomly related to the previous value [6,7]. If we use an RSA based PRNG, we are then assured that F=>C and C=>F without K are hard, and that C=>K is at least as hard as inverting an RSA encryption.

In order to have a compressive code, we would like the property that for large F, (F,K)=>C s.t. |C| << |F|. F=>K is undeniably hard since they are both free variables, and thus are in fact unrelated. Almost any reasonable complexity algorithm for forming (F,K)=>C will be adequate if it fulfills the other properties.

The problem at hand is not in fact quite as simple as we have stated it because it is possible to forge an RSA authentication given that an attacker can perform selected plaintext attack by causing the holder of the key to perform O(|K|) signatures of attacker selected data [8]. It turns out that these attacks are only feasible if the messages being encoded are permitted to have sufficient redundancy. A feasible countermeasure to these attacks is the use of compression codes to first reduce the redundancy of the message being authenticated (as well as its length) or the use of a preprocessing step which performs a "one-way trap door function". This will be discussed in a later section of this paper.

If C=>F' were feasible, it would be feasible to generate a new program from the checksum of an old program, and thus to undetectably corrupt F by making it into F'. Even though it is unlikely that any resulting F' would be a sensible executable file, we may also be concerned with protecting data files in which case any sequence of bits would be possible. Assuming, that there is no systematic way to invert (F,K)=>C, there is no systematic way to invert the same function without K. Since the RSA PRNG is such that (F,K)=>C is hard, we are left with the attempt to randomly generate F' from C. If the size of the RSA key is sufficiently large to provide any reasonable degree of protection (i.e. >> 100 bits), the size of the checksum space is greater than 2**100, and thus we would expect the random generation of a forged F to take >> 2**99 attempts before the likelihood of a forgery is 1/2. Since this meets our own definition of hard, C=>F' is hard.

Finally, if (F,C)=>K were feasible, then so would be (F,C)=>(F',C'), in which case (F,C)=>C' would be feasible. Since (F,C)=>C' is hard, so is (F,C)=>(F',C'), and thus so is (F,C)=>K.

We have now shown that all of the problems that must be made hard in order for a cryptographic authentication to be applicable are made hard by the use of an RSA based PRNG (with the exception of the RSA signature attack to be covered later). We have also shown that the problem that must be made easy is sufficiently easy for some applications, but it remains for us to show that it can be made sufficiently easy for most reasonable applications. As we shall now see, substantial modification must be made in order to attain reasonable performance under current technology.

4 - Including File Data in the Checksum

As a notational convenience, we will use the symbol M to represent the modulus in an RSA cryptosystem, |M| to denote its length in bits, and K to denote the RSA key pair (Ke,M). In the generation of K, Kd is left ungenerated, and the values of associated variables that might allow its generation are destroyed. Thus, we have a true 'one-way' function in that no key for its inversion exists or can be feasibly generated. The basic authentication system based on the RSA as a PRNG is then:

We begin by performing an RSA encryption of the initial value S so that S cannot be determined regardless of the data in the file. If the file is empty, we get the encryption of the key. In order to make it impossible to simply replace a (file,checksum) pair (F,C) with (F',C') from another file, we include the name of the file in the scheme. In this way, we generate successive PRNs which depend on the initial value S, the file name, all the bits in the file, and their ordering. Since we do not allow an attacker to select S, it is infeasible to launch a selected plaintext attack which would allow RSA signatures to be forged. The technique is cryptographically strong because of the strength of the RSA as a PRNG, and it is compressive in that the final checksum will be no more than |M| bits long. The system is hard to forge even with K in the hands of the attacker, since the initial value S is secure from attack, and each checksum depends on S.

The only problems arise in that performing an RSA is quite poor, and the number of RSAs that must be performed is |file|/|M|. For a file size of 1Mbyte and a key length of 512, this requires almost 2,000 RSA encryptions. To eleviate this performance bottleneck, we modify the system to reduce the number of RSA encryptions by blocking data and using a weaker system to derive intermediate checksums of each block. It is important that this weaker system have certain properties in order to assure that blocks cannot be replaced without affecting the final checksum. Fortunately, certain advantages are given us by the present algorithm, in that we are given a high quality PRN as an initial value for our one way function for each block.

We are concerned with generating a function F(B,Q) where B is a block of data and Q is a random constant, such that determining a block B' such that F(B',Q)=F(B,Q) is hard. We also have the restrictions that the attacker does not have access to Q, F(B,Q) or F(B',Q), but only the general form of F, and a block B. An example of such a function is the use of Q as a modulus into which B is folded. The problem then becomes determining a value B' given B such that B is congruent to B' in an unknown modulus. There is no hope of this since we can make any B and B' equivalent or not equivalent by selecting different moduli, and thus we can be guaranteed that a B' that would work for one modulus would not work for all possible moduli. We know that this function is order dependent so that a transposition of the original data is unlikely to have an equivalent image, that the final RSA encryption in our algorithm assures that no value for F will ever be available in plaintext, and that even a value for B' that is equivalent to B (but not equal to it) in a given modulus Q is not likely to be equivalent in another modulus Q' given that Q and Q' are chosen at random. The final algorithm is then:

In this situation, there is no reason to keep K or C secret, so only the user defined initial seed S is required for user dependent cryptographic authentication. The performance of the above checksum mechanism (based on data from [5]) is computed as follows:

The total time taken is thus |F|/6400+0.08*(N+1) plus the overhead time required to read the file from the disk, store the checksum at the end, etc. Typical values for the checksum are:

  |F|    N      time
10000    2      1.8sec
10000    10     2.5sec
100000   10     17 sec
100000   100    25 sec

Significantly better performance can be attained by performing a faster operation than reduction in a modulus for the block checksum function. One suggestion that appears reasonable is the use of a multiplicative ring (i.e. A*X+B|Q where A and B are constants, and X is the next |M| sized sequence of bits from the file). This does not substantially increase the performance over the previous algorithm, but is more amenable to hardware assistance.

Another idea for improved performance is the use of only a small number of least significant bits of Q as a modulus. We add B|(Q|2**N) to Q for the new value of Q to replace step 3 of the previous algorithm. If we select N=32, for example, exhaustive search of the space for a modulus that produces an equivalent value is quite lengthy (i.e. 2**32 RSA encryptions are required on average => 343 million seconds = 10.9 years). Once this is done, the last 32 bits of the initial encryption of the user key is known, and forgery of that block is reduced to finding blocks that are equivalent in that modulus. This process has to be repeated for each block of the file being forged, since the last 32 bits of the PRN are not sufficient to tell what the next value of Q will be.

A prototype implementation of the above scheme has been developed and is available in source form from the author. The 'checker' version is intended as a subroutine which can be called by programs wishing to maintain their own integrity and the integrity of their data files. The 'general purpose' version exemplifies the use of the 'checker' version in an off the shelf product for use during boot up or hours when the PC is not in use for other purposes.

The checker program provides the basic routines for summing a file, checking the sum against a stored checksum file, storing a new checksum value when a file is changed, performing a self test on the program in which it is imbedded, and testing the checker routines. The general purpose checksum program uses the checker routines in an application which provides a user interface to check and sum files of the user's selection. It provides a typical command interface and exemplifies the use of the checker routines.

By far the most interesting application of the checksum routines are in the application of built-in self-test of programs and their associated data files. The general purpose program performs self-test of its executable code before allowing the user to use it. The same technique can be used to verify the integrity of data files by programs having exclusive access to them. By using the same routines in several related programs, and a user specified key, an individual can detect data corruption in files shared by many programs so long as they are interrelated in a POset structure [2].

5 - Summary, Conclusions, and Further Work

We have demonstrated a cryptographically strong checksumming technique with reasonable performance characteristics, and a sound argument for the belief that it performs as intended; a method by which performance may be improved over previous similar systems without substantially reducing the complexity of attack; and a method by which this technique can be easily incorporated into any program to allow it to verify its own integrity and the integrity of its data files.

The technique given here appears to be generally applicable, and well suited for such applications as those described in [2] and [3]. It is likely that this technique and closely related ones will become widespread in their application, and that their use will increase with the increased availability of high performance low priced RSA chips.

Extensions of this work to systems with lower performance constraints are quite likely since the technique allows a high degree of flexibility in the methods for block checksumming by providing most of the cryptographic protection in the inter block PRNG technique.


[1] F. Cohen, "Computer Viruses - Theory and Experiments", Computers and Security, IFIP-SEC V6#1, 1987

[2] F. Cohen, "A Complexity Based Integrity Maintenance Mechanism", Conference on Information Sciences and Systems, Princeton University, March 1986

[3] M. Pozzo and T. Gray, "An Approach to Containing Computer Viruses", Computers and Security, IFIP-SEC V6#2, 1987

[4] R. Rivest, A. Shamir, and L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems", CACM V21 (1978) pp 120-126.

[5] CYLINK, "CY1024 Processor Chip Key Management Applications", Sunnyvale, Ca. 1986

[6] U. Vizirani and V. Vizirani, "RSA bits are .732+epsilon secure", "Advances in Cryptology, proceedings of Crypto83", pp369-375, D. Chaum ed. 1984 Plenum Press, New York, NY

[7] C. Schnorr and W. Alexi, "RSA-bits are 0.5+epsilon Secure", "Advances in Cryptology - proceedings of Eurocrypt84", pp113-126, 1985, Springer-Verlag, New York, NY

[8] W. de Jonge and D. Chaum, "Attacks on Some RSA Signatures", Advances in Cryptology - proceedings of Crypto85, pp18-27, 1985, Springer-Verlag, New York, NY