Fermat's Library | Reflections on Trusting Trust annotated/explained version.

Daniel Borrow was part of the team that developed TENEX (TEN-EXtend...

Thompson received a BS in 1965 and a master's degree in 1966 in Ele...

A self reproducing program is called a quine. Quines are possible i...

The `s` array contains a representation of the program (character b...

This might seem somewhat confusing at first. How can the C compiler...

The compiler is written in C and is compiled for the system. Theref...

A countermeasure to this attack is described in my PhD dissertation...

Let’s say you have the binary for the bugged compiler (`B`) and the...

The Air Force document Thompson is referring to is [“Multics Securi...

TURING AWARD LECTURE

Reflections on Trusting Trust

To what extent should one trust a statement that a program is free of Trojan

horses? Perhaps it is more important to trust the people who wrote the

software.

KEN THOMPSON

INTRODUCTION

I thank the ACM for this award. I can't help but feel

that I am receiving this honor for timing and serendip-

ity as much as technical merit. UNIX 1 swept into popu-

larity with an industry-wide change from central main-

frames to autonomous minis. I suspect that Daniel Bob-

row [1] would be here instead of me if he could not

afford a PDP-10 and had had to "settle" for a PDP-11.

Moreover, the current state of UNIX is the result of the

labors of a large number of people.

There is an old adage, "Dance with the one that

brought you," which means that I should talk about

UNIX. I have not worked on mainstream UNIX in many

years, yet I continue to get undeserved credit for the

work of others. Therefore, I am not going to talk about

UNIX, but I want to thank everyone who has contrib-

uted.

That brings me to Dennis Ritchie. Our collaboration

has been a thing of beauty. In the ten years that we

have worked together, I can recall only one case of

miscoordination of work. On that occasion, I discovered

that we both had written the same 20-line assembly

language program. I compared the sources and was as-

tounded to find that they matched character-for-char-

acter. The result of our work together has been far

greater than the work that we each contributed.

I am a programmer. On my 1040 form, that is what I

put down as my occupation. As a programmer, I write

1 UNIX is a trademark of AT&T Bell Laboratories.

programs. I would like to present to you the cutest

program I ever wrote. I will do this in three stages and

try to bring it together at the end.

STAGE I

In college, before video games, we would amuse our-

selves by posing programming exercises. One of the

favorites was to write the shortest self-reproducing pro-

gram. Since this is an exercise divorced from reality,

the usual vehicle was FORTRAN. Actually, FORTRAN

was the language of choice for the same reason that

three-legged races are popular.

More precisely stated, the problem is to write a

source program that, when compiled and executed, will

produce as output an exact copy of its source. If you

have never done this, I urge you to try it on your own.

The discovery of how to do it is a revelation that far

surpasses any benefit obtained by being told how to do

it. The part about "shortest" was just an incentive to

demonstrate skill and determine a winner.

Figure 1 shows a self-reproducing program in the C 3

programming language. (The purist will note that the

program is not precisely a self-reproducing program,

but will produce a self-reproducing program.) This en-

try is much too large to win a prize, but it demonstrates

the technique and has two important properties that I

need to complete my story: 1) This program can be

easily written by another program. 2) This program can

contain an arbitrary amount of excess baggage that will

be reproduced along with the main algorithm. In the

example, even the comment is reproduced.

August 1984 Volume 27 Number 8 Communications of the ACM 781

Turing Award Lecture

char s[ ] = I

t011

i;i

"~lll

I'V,] ' ~

I'~t'] l p

(213 lines deleted)

• The string s is a

• representation of the body

• of this program from

'0'

• to the end.

main( )

{

int i;

printf("char\ts[ ]= {kn");

for(i=0; s[i]; i++)

printf("~t%d, \n", s[i]);

printf("%s", s);

Here are some simple transliterations to allow

a non-C programmer to read this code.

= assignment

== equal to .EQ.

!= not equal to .NE.

++ increment

'x'

single character constant

"xxx"

multiple character string

%d format to convert to decimal

%s format to convert to string

kt tab character

kn newline character

FIGURE 1.

STAGE II

The C compiler is written in C. What I am about to

describe is one of many "chicken and egg" problems

that arise when compilers are written in their own lan-

guage. In this case, I will use a specific example from

the C compiler.

C allows a string construct to specify an initialized

character array. The individual characters in the string

can be escaped to represent unprintable characters• For

example,

"Hello world\n"

represents a string with the character "\n," representing

the new line character.

Figure 2.1 is an idealization of the code in the C

compiler that interprets the character escape sequence.

This is an amazing piece of code. It "knows" in a com-

pletely portable way what character code is compiled

for a new line in any character set. The act of knowing

then allows it to recompile itself, thus perpetuating the

knowledge.

Suppose we wish to alter the C compiler to include

the sequence "\v" to represent the vertical tab charac-

ter. The extension to Figure 2.1 is obvious and is pre-

sented in Figure 2.2. We then recompile the C com-

piler, but we get a diagnostic. Obviously, since the bi-

nary version of the compiler does not know about "\v,"

the source is not legal C. We must "train" the compiler.

After it "knows" what "\v" means, then our new

change will become legal C. We look up on an ASCII

chart that a vertical tab is decimal 11. We alter our

source to look like Figure 2.3. Now the old compiler

accepts the new source. We install the resulting binary

as the new official C compiler and now we can write

the portable version the way we had it in Figure 2.2.

This is a deep concept. It is as close to a "learning"

program as I have seen. You simply tell it once, then

you can use this self-referencing definition.

STAGE III

Again, in the C compiler, Figure 3.1 represents the high

level control of the C compiler where the routine "com-

c = next( );

if(c != '\V)

return(c);

c = next( );

if(c == '\V)

return('\\');

if(c == 'n')

return('kn ');

FIGURE 2.2.

c = next( );

if(c ~= '\v)

return(c);

c = next( );

if(c == '\V)

return('kV);

if(c == 'n')

retum('kn');

if(c == 'v')

return('\v');

FIGURE 2.1.

c =

next( );

if(c != '\V)

return(c);

c = next( );

if(c == '\v)

return('\\');

if(c == 'n')

return('\ n');

if(c == 'v')

return(11 );

FIGURE 2.3.

762 Communications of the ACM

August 1984 Volume 27 Number 8

Turing Award Lecture

pile" is called to compile the next line of source. Figure

3.2 shows a simple modification to the compiler that

will deliberately miscompile source whenever a partic-

ular pattern is matched. If this were not deliberate, it

would be called a compiler "bug." Since it is deliberate,

it should be called a "Trojan horse."

The actual bug I planted in the compiler would

match code in the UNIX "login" command. The re-

placement code would miscompile the login command

so that it would accept either the intended encrypted

password or a particular known password. Thus if this

code were installed in binary and the binary were used

to compile the login command, I could log into that

system as any user.

Such blatant code would not go undetected for long.

Even the most casual perusal of the source of the C

compiler would raise suspicions.

The final step is represented in Figure 3.3. This sim-

ply adds a second Trojan horse to the one that already

exists. The second pattern is aimed at the C compiler.

The replacement code is a Stage I self-reproducing pro-

gram that inserts both Trojan horses into the compiler.

This requires a learning phase as in the Stage II exam-

ple. First we compile the modified source with the nor-

mal C compiler to produce a bugged binary. We install

this binary as the official C. We can now remove the

bugs from the source of the compiler and the new bi-

nary will reinsert the bugs whenever it is compiled. Of

course, the login command will remain bugged with no

trace in source anywhere.

compile(s)

char ,s;

FIGURE 3.1.

compile(s)

char ,s;

if(match(s, "pattern")) {

compUe("bug");

return;

FIGURE 3.2.

compile(s)

char

,s;

if(match(s, "pattern1 ")) {

compile

('bug1 ");

return;

if(match(s, =pattern 2")) I

compile

('bug 2");

return;

FIGURE 3.3.

MORAL

The moral is obvious. You can't trust code that you did

not totally create yourself. (Especially code from com-

panies that employ people like me.) No amount of

source-level verification or scrutiny will protect you

from using untrusted code. In demonstrating the possi-

bility of this kind of attack, I picked on the C compiler.

I could have picked on any program-handling program

such as an assembler, a loader, or even hardware mi-

crocode. As the level of program gets lower, these bugs

will be harder and harder to detect. A well-installed

microcode bug will be almost impossible to detect.

After trying to convince you that I cannot be trusted,

I wish to moralize. I would like to criticize the press in

its handling of the "hackers," the 414 gang, the Dalton

gang, etc. The acts performed by these kids are vandal-

ism at best and probably trespass and theft at worst. It

is only the inadequacy of the criminal code that saves

the hackers from very serious prosecution. The compa-

nies that are vulnerable to this activity, (and most large

companies are very vulnerable) are pressing hard to

update the criminal code. Unauthorized access to com-

puter systems is already a serious crime in a few states

and is currently being addressed in many more state

legislatures as well as Congress.

There is an explosive situation brewing. On the one

hand, the press, television, and movies make heros of

vandals by calling them whiz kids. On the other hand,

the acts performed by these kids will soon be punisha-

ble by years in prison.

I have watched kids testifying before Congress. It is

clear that they are completely unaware of the serious-

ness of theft acts. There is obviously a cultural gap. The

act of breaking into a computer system has to have

the

same social stigma as breaking into a neighbor's house.

It should not matter that the neighbor's door is un-

locked. The press must learn that misguided use of a

computer is no more amazing than drunk driving of an

automobile.

Acknowledgment.

I first read of the possibility of such

a Trojan horse in an Air Force critique [4] of the secu-

rity of an early implementation of Multics. I cannot find

a more specific reference to this document. I would

appreciate it if anyone who can supply this reference

would let me know.

REFERENCES

1, Bobrow, D.G., Burchfiel, J.D., Murphy, D.L., and Tomlinson, R.S.

TENEX, a paged time-sharing system for the PDP-10. Commun. ACM

15, 3 {Mar. 1972}, 135-143.

2. Kernighan, B.W., and Ritchie, D.M. The C Programming Language.

Prentice-Hall, Englewood Cliffs, N.J., 1978.

3. Ritchie, D.M., and Thompson, K. The UNIX time-sharing

system.

Commun. ACM 17, 0uly 1974), 365-375.

4. Unknown Air Force Document.

Author's Present Address: Ken Thompson, AT&T Bell Laboratories,

Room 2C-519, 600 Mountain Ave., Murray Hill, NJ 07974.

Permission to copy without fee all or part of this material is

granted

provided that the copies are not made or distributed for direct commer-

cial advantage, the ACM copyright notice and the title of the publication

and its date appear, and notice is given that copying

is by permission of

the

Association for Computing Machinery. To copy otherwise, or to

republish, requires a fee and/or specific permission.

August 1984 Volume 27 Number 8 Communications of the ACM 763

Discussion

Thompson received a BS in 1965 and a master's degree in 1966 in Electrical Engineering and Computer Science from UC Berkeley. After graduation, Thompson and Dennis Ritchie joined Bell Labs, and it was there that they worked together on UNIX. Why is there a need to compare `X` with `Z`; why can't we compare `X` with `Y`? The resulting `Y` binary might be different than `X` for reasons other than backdoors (for instance performance optimizations introduced by `A`). `B` and `Y` should however be functionally identical, therefore, they should yield the same binary (if they don't, then `B` is 'bugged'). Yes. The compiled binary would not be portable in this way. The `it` in `it is completely portable because it is...` refers to the source code, doesn't it? This is one of the concepts I was happy to understand. Comes down to (or how I've trivialized it is): write the first compiler of the new language in an already existing language, then when the new language is made, use it to build a compiler for itself, by following the specifications. Daniel Borrow was part of the team that developed TENEX (TEN-EXtended) an operating system for the DEC PDP-10. The PDP-10 cost about \$180,000 whereas the PDP-11 was considerably cheaper at around \$11,000. A self reproducing program is called a quine. Quines are possible in any Turing complete programming language. Here is a quine in python: ``` #!/usr/bin/env python s = '#!/usr/bin/env python\ns = %s ; print s %% repr(s)' ; print s % repr(s) ``` The `s` array contains a representation of the program (character by character, ending with `0`, the string terminating character in `C`). [Here]( https://gist.github.com/joaobatalha/9de926f251192c6137f4) is the full source of the quine (minus the comment). The first time you run this program, the output will be different from its original source because the characters in the `s` array will be replaced by their decimal representation in the corresponding character set. For instance, the first character in the array ‘\t’ (horizontal tab) will be replaced by 9 if the ASCII character set is being used. After that, the resulting quine reproduces itself perfectly. If you want to learn more about [quines here](http://research.swtch.com/zip) is a great article. A countermeasure to this attack is described in my PhD dissertation, [*Fully Countering Trusting Trust through Diverse Double-Compiling* by David A. Wheeler (2009)](https://www.dwheeler.com/trusting-trust/), which describes a technique for detecting this. If the problem is merely that a binary has been tampered with (as opposed to having a subverted compiler, the case discussed in this Turing aware lecture), the [reproducible builds project](https://reproducible-builds.org/) can counter the problem. The compiler is written in C and is compiled for the system. Therefore, it is completely portable because it is able to detect the literal `‘\n’` and return the character code for `‘\n’` without having it hardcoded in the source of the compiler (it does not need to know what the actual character code is, which depends on the character set used by the system). This might seem somewhat confusing at first. How can the C compiler be written in C? Well, the first C compiler was not written in C. C itself is a descendant of the language B (a language originally created for the PDP-7). The first C compiler was written in a language that Dennis Ritchie called NB (New B). What about the B compiler? The B compiler was written in PDP-7 assembler. Once you have the binary for the C compiler, you can write any program in C - including a C compiler. The Air Force document Thompson is referring to is [“Multics Security Evaluation: Vulnerability Analysis” by Paul A. Karger and Roger R. Schell](http://seclab.cs.ucdavis.edu/projects/history/papers/karg74.pdf). In Section 3.4.5.1 Karger and Schell describe the possibility for such Trojan Horse. Let’s say you have the binary for the bugged compiler (`B`) and the source of the compiler with the bugs/backdoors removed (`S`). In such situation, even if you take `S` and compile it with `B`, you will still get a buggy compiler binary. Therefore, if you later use this compiler (or any of its “descendants”) to compile the login code, the backdoor will be present. The reason this is so scary is because compilers are one of the central pieces of software for any computer system. A considerable percentage of programs have to go thru a compiler before being installed (this was especially true at the time Thompson gave this lecture, when UNIX ran in all kinds of hardware which made binary distributions more difficult). As a consequence, a compromised compiler would be able to inject hard-to-detect backdoors anywhere in the system. Technically, you could detect that something was not right if you were to look at the machine code of the compiler. By looking at the raw instructions you would be able to verify what was actually happening, and witness the insertion of the backdoor. However, this is not very feasible. People have proposed other ways of detecting a malicious compiler like the one Thompson describes. Going back to our previous scenario, say you have the binary for the bugged compiler (`B`) and the source of the compiler with the bugs/backdoors removed (`S`). Now, your friend has a clean (trusted) compiler binary (`A`). In order to verify if `B` is a buggy compiler or not you need to: - Compile `S` using `B`, which yields compiler binary `X` - Compile `S` using the trusted compiler `A` , which yields a new compiler binary `Y` - Compile `S` using `Y`, which yields a new compiler binary `Z`. If the `X` binary is different from the `Z` binary, then the `B` compiler is a “buggy” compiler in the sense that Thompson describes.

Comments

Products

Project