The Great Debate

"Will Compilers Ever Produce Code as Good as an Expert Assembly Language Programmer?"

(Randall Hyde)

Part III: Arithmetic

The "Great Debate" is a very emotional exchange that has been running continuously since the late '70s. On some newsgroup somewhere, you will find a thread discussing this topic; you will have the best luck looking in the comp.lang.asm.x86, comp.lang.c, alt.assembly, or comp.lang.c++ newsgroups. Of course, almost anything you read in these newsgroups is rubbish, no matter which side of the argument the author is championing. Because this debate has been raging for (what seems like) forever, it is clear there is no easy answer; especially not one that someone can make up off the top of their head (which describes about 99.9% of all postings to a usenet newsgroup). This page contains a series of essays that discuss advances in compilers and machine architectures in an attempt to answer the above question.

Although I intend to write a large number of these essays, I encourage others, even those with opposing viewpoints, to contribute to this exchange. If you would like to contribute a "well thought out, non-emotional essay" to this series, please send your contribution (HTML is best, ASCII text is second best) to

debate@webster.ucr.edu


I was an undergraduate at the tail end of the "efficiency is everything" phase of software engineering. Despite the premium placed on efficiency, the common thought was "numerically intensive (floating point) applications are not worth writing in assembly language because the floating point arithmetic is so slow." Sort of the precursor to the 90/10 rule, I guess.

Since those days floating point hardware has become ubiquitous and on many machines (e.g., Pentium and later for the x86 architecture) the performance of floating point arithmetic rivals that of integer arithmetic. Hence, avoiding numeric computation in assembly language simply because it is so slow to begin with is no longer a valid excuse.

A friend of mine, Anthony Tribelli, has done quite a bit of work with 3-D (i.e., floating point intensive) calculations. He recently switched from a software-based fixed-point scheme to the floating point hardware built into the Pentium processor and achieved much better results. Tony's applications need to be fast and, although the 3-D matrix transformation (i.e., matrix multiplication) code was not the biggest bottleneck in his code, he has spent considerable time speeding up this portion of it as an academic exercise. In particular, he has tried just about every commercially available x86 C compiler out there in order to determine which would produce the fastest code for his application (MSVC++ wins, by the way; much to the disappointment of many GNU fans). Although Tony's assembly language skills were somewhat rusty, he grabbed the Pentium programmer's manual and searched the Internet for Pentium optimization tricks. He hand-coded his program to within two cycles of the theoretical maximum speed (i.e., Intel's published cycle counts). The resulting code was significantly faster than that produced by any of the C compilers. This example disproves the adage that you shouldn't bother rewriting numerically intensive applications in assembly language because you won't gain anything.

Despite the anecdote above, this is not an essay about how to speed up your numerically intensive applications by using assembly language. Instead, I would like to concentrate on the fact that assembly language gives you complete control over the code the CPU executes, and sometimes this can make a big difference in the accuracy of your computations.

Consider the following two arithmetic expressions:

	X/Z + Y/Z

	(X + Y)/Z
In the mathematical world of real arithmetic, these two expressions always produce the same results for any values of X, Y, and Z. Simple seventh-grade algebra can easily prove this. Unfortunately, the rules for (infinite precision) real arithmetic do not always apply to (finite precision) floating point arithmetic or to integer arithmetic. Given appropriate values for X, Y, and Z, the two expressions above can definitely produce different answers. The best way to demonstrate this is to use the integer values X=2, Y=1, and Z=3. The first expression above produces a zero result given these values; the second produces the value one.
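For instance, a minimal C program (my own addition, not part of the original essay) makes the difference visible with exactly those values:


	#include <stdio.h>

	int main(void)
	{
		int X = 2, Y = 1, Z = 3;

		/* Each division truncates toward zero before the addition. */
		printf("X/Z + Y/Z = %d\n", X / Z + Y / Z);   /* 0 + 0 = 0 */

		/* The addition happens first, so nothing is lost to truncation. */
		printf("(X + Y)/Z = %d\n", (X + Y) / Z);     /* 3 / 3 = 1 */

		return 0;
	}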

Perhaps you're thinking "gee, that's not fair, everyone knows that integer arithmetic truncates all intermediate values." You're right (about the integer arithmetic, anyway). But keep this fact in mind: all floating point operations suffer from rounding error as well. Therefore, the order of expression evaluation can have a big impact on the accuracy of your computations.
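Here is another small sketch of my own (again, not from the original essay) showing the same sensitivity with 64-bit doubles: adding a tiny value to a huge one rounds the tiny value away entirely, so two algebraically identical groupings give different answers. The volatile qualifiers are there only to force each intermediate result out to memory as a 64-bit double.


	#include <stdio.h>

	int main(void)
	{
		volatile double big = 1.0e16, small = 1.0;
		volatile double t1  = big + small;   /* rounds back down to 1.0e16 */
		volatile double t2  = big - big;     /* exactly 0.0 */

		printf("(big + small) - big = %g\n", t1 - big);    /* prints 0 */
		printf("(big - big) + small = %g\n", t2 + small);  /* prints 1 */

		return 0;
	}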

Consider the C programming language. In order to produce high-quality output code, the C language definition allows a compiler writer to take certain liberties with the way it computes the result of an expression. In particular, a compiler can rearrange the order of evaluation within an expression. Consider the following "C" statement:


	a = (b + c) / (d + e);

The C language specification doesn't say whether the compiler first computes (b + c) or (d + e). Indeed, the compiler can rearrange this expression as:


	a = b / (d + e) + c / (d + e)

if it so chooses. For example, if the program has already computed b/(d + e) and c/(d + e), the compiler might reuse those values and reduce this statement to a single addition. You cannot force a particular order of evaluation using operator precedence; precedence only determines how operands group, not when the compiler evaluates them. Perhaps you're a little more knowledgeable than the average C programmer and you're aware of these things known as "sequence points" in an expression. Sequence points only guarantee that any side effects an expression produces will be completed before crossing a sequence point; they do not guarantee order of evaluation. For example, a ";" (at the end of a statement) is a sequence point. Consider, however, the following C statements:


	a = (d == c);
	b = (x < y) && (d == c);

Although the "(d == c)" subexpression appears to the right of the "&&" operator (which defines a sequence point), most good C compilers will evaluate "(d == c)" prior to evaluating "(x < y)" because they can reuse the value already computed in the previous statement. Therefore, you cannot force the compiler to compute "(x < y)" before "(d == c)" by simply breaking the second statement above into the sequence[1]:


	temp1 = (x < y );
	temp2 = (d == c);
	b = temp1 && temp2;
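About the closest you can come in portable C (a sketch of my own, not a claim from the original essay) is to make the temporaries volatile. Volatile accesses are side effects, so the two stores must occur in source order; even then, the compiler remains free to compute "(d == c)" before "(x < y)" as long as the stores themselves stay ordered:


	#include <stdio.h>

	int main(void)
	{
		int x = 1, y = 2, d = 3, c = 3, b;
		volatile int temp1, temp2;

		/* These two stores must happen in source order because
		   accesses to volatile objects are side effects...          */
		temp1 = (x < y);
		temp2 = (d == c);

		/* ...but the comparisons feeding them may still be computed
		   in either order; only the stores are pinned down.         */
		b = temp1 && temp2;
		printf("%d\n", b);   /* prints 1 with these sample values */
		return 0;
	}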

The bottom line is this: you cannot (portably) control the order of evaluation of subexpressions in a C program.
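To make that bottom line concrete, here is a short sketch of my own (the trace helper is hypothetical) in which the language does not fix the order of the two "evaluating ..." lines; different compilers, or different optimization levels, may legitimately print them in either order:


	#include <stdio.h>

	/* Hypothetical helper: announce which operand is being evaluated,
	   then return its value. */
	static int trace(const char *name, int value)
	{
		printf("evaluating %s\n", name);
		return value;
	}

	int main(void)
	{
		int a;

		/* Which operand of '/' is evaluated first is unspecified. */
		a = trace("b + c", 6) / trace("d + e", 3);
		printf("a = %d\n", a);   /* always 2, but the trace order varies */
		return 0;
	}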

It gets even worse! The values of many expressions in C/C++ are undefined! Consider the following (very famous) example:


		i = 4;
		A[4] = 2;
		i = i + A[i++];
		printf("%d\n", i);

A neophyte C programmer might be tempted to claim that this program would produce the output "6". A more seasoned C programmer might claim the answer could be "6" or "7" (one of these answers is probably what you would get). However, the ANSI-C definition claims the result is undefined. That is, it could be anything. Even "123456" is a reasonable result according to the ANSI-C standard.
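To see where the two "reasonable" answers come from, here is a sketch of my own (not from the original essay) that spells out each reading as well-defined C:


	#include <stdio.h>

	int main(void)
	{
		int A[8] = {0};
		int i, old;

		/* Reading 1: the "i++" takes effect first, and the assignment
		   to i then overwrites it. */
		i = 4;
		A[4] = 2;
		old = i;
		i = old + 1;         /* the i++ */
		i = old + A[old];    /* the assignment: 4 + A[4] */
		printf("%d\n", i);   /* prints 6 */

		/* Reading 2: the assignment happens first, and the deferred
		   "i++" then bumps the result. */
		i = 4;
		A[4] = 2;
		old = i;
		i = old + A[old];    /* 6 */
		i = i + 1;           /* the deferred i++ */
		printf("%d\n", i);   /* prints 7 */

		return 0;
	}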

Using assembly language eliminates all these problems. Since the CPU executes the instructions you supply in the order you supply them, you have precise control over how you compute values[2]. By carefully studying your computations and the values you expect to supply to them, you can choose an instruction sequence that maximizes the accuracy of your results. You can also specify, unambiguously, the behavior of operations whose side effects produce undefined results in C/C++. For example, if you wanted the previous expression to produce six, you'd use code like:


; Assume 32-bit integers.
;
;		i = 4;

		mov	i, 4

;		A[4] = 2;

		mov	a[4*4], 2

;		i = i + A[i++];

		mov	eax, i
		add	eax, a[eax*4]
		inc	i  ;Of course, this could go away...
		mov	i, eax

;		printf("%d\n", i);

		printf	"%d\n", i

There is another issue you must consider. If you are working on a CPU with a floating point unit (e.g., the 486 or Pentium), most internal (to the FPU) computations use a full 80 bits. Once data leaves the FPU it is generally rounded to 64 bits (or fewer). Therefore, you will get more accurate results if you leave temporary calculations in one of the FPU's eight registers. While good x86 compilers generally do all their computations within a single expression on the FPU, I haven't noticed any that attempt to keep long-term variables in the FPU registers. I certainly don't know of any compiler that would do this (across statements) as a means of maintaining the precision of an important variable; that simply requires too much intelligence.
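As a rough illustration of the effect (my own sketch; whether long double actually maps to the FPU's 80-bit format depends on the compiler: gcc on the x86 uses it, while MSVC treats long double as a 64-bit double), accumulating a value that is not exactly representable shows what the extra precision buys you:


	#include <stdio.h>

	int main(void)
	{
		double      d_sum  = 0.0;
		long double ld_sum = 0.0L;
		int         i;

		/* 0.1 is not exactly representable, so every addition rounds.
		   On compilers where long double is wider than double, the
		   long double sum stays noticeably closer to the ideal
		   1,000,000 than the 64-bit sum does. */
		for (i = 0; i < 10000000; i++) {
			d_sum  += 0.1;
			ld_sum += 0.1L;
		}

		printf("64-bit double sum: %.10f\n",  d_sum);
		printf("long double sum:   %.10Lf\n", ld_sum);
		return 0;
	}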

While this essay will not attempt to explain how to maximize the accuracy of your computations (that is well beyond its scope), hopefully you can see that assembly language's absolute control over the execution sequence provides some important benefits in those rare cases where "order of evaluation" can affect the outcome of your computations.


[1] Actually, this might not happen. I'm not 100% sure whether C is required to complete all computations that produce side effects before crossing a sequence point, or only those that would affect the final outcome. Clearly, storing a result into a variable is a side effect of an expression. However, given that most good optimizing C compilers support optimizations like "reaching definitions" and "live/dead variable analysis," I suspect my assertion is correct.
[2] Strictly speaking, this is not true. Newer processors are beginning to support "out of order" execution of instructions. However, one would hope that Intel and other chip makers will ensure that any out-of-sequence execution does not alter the results one would normally expect from a serially executing program.