[38.4] How can I decompile an executable program back into C++ source code?
You gotta be kidding, right?
Here are a few of the many reasons this is not even remotely feasible:
What makes you think the program was written in C++ to begin
with?
Even if you are sure it was originally written (at least partially)
in C++, which one of the gazillion C++ compilers produced it?
Even if you know the compiler, which particular version of the
compiler was used?
Even if you know the compiler's manufacturer and version number,
what compile-time options were used?
Even if you know the compiler's manufacturer and version number and
compile-time options, what third party libraries were linked-in, and what was
their version?
Even if you know all that stuff, most executables have had their
debugging information stripped out, so the resulting decompiled code will be
totally unreadable.
Even if you know everything about the compiler, manufacturer,
version number, compile-time options, third party libraries, and debugging
information, the cost of writing a decompiler that works with even one
particular compiler and has even a modest success rate at generating code
would be significant: on a par with writing the compiler itself from
scratch.
But the biggest question is not how you can decompile someone's code,
but why do you want to do this? If you're trying to reverse-engineer
someone else's code, shame on you; go find honest work. If you're trying to
recover from losing your own source, the best suggestion I have is to make
better backups next time.
(Don't bother writing me email saying there are legitimate reasons for
decompiling; I didn't say there weren't.)
If the compiler uses the "over-allocation" technique, the code for p = new
Fred[n] looks something like the following. Note that WORDSIZE is an
imaginary machine-dependent constant that is at least sizeof(size_t),
possibly rounded up for any alignment constraints. On many machines, this
constant will have a value of 4 or 8. It is not a real C++ identifier that
will be defined for your compiler.
// Original code: Fred* p = new Fred[n];
char* tmp = (char*) operator new[] (WORDSIZE + n * sizeof(Fred));
Fred* p = (Fred*) (tmp + WORDSIZE);
*(size_t*)tmp = n;
size_t i;
try {
  for (i = 0; i < n; ++i)
    new(p + i) Fred();           // Placement new
}
catch (...) {
  while (i-- != 0)
    (p + i)->~Fred();            // Explicit call to the destructor
  operator delete[] ((char*)p - WORDSIZE);
  throw;
}
Then the delete[] p statement becomes:
// Original code: delete[] p;
size_t n = * (size_t*) ((char*)p - WORDSIZE);
while (n-- != 0)
  (p + n)->~Fred();
operator delete[] ((char*)p - WORDSIZE);
Note that the address passed to operator delete[] is not the
same as p.
Compared to the associative array technique, this technique is faster,
but more sensitive to the problem of programmers saying delete p rather than
delete[] p. For example, if you make a programming error by saying delete
p where you should have said delete[] p, the address that is passed to
operator delete(void*) is not the address of any valid heap
allocation. This will probably corrupt the heap. Bang! You're dead!
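If you'd like to play with the over-allocation idea, here is a minimal sketch that you can compile. It is not what any particular compiler emits. Fred here is a hypothetical class that merely counts live objects, WORDSIZE is my own assumed value (the alignment of std::max_align_t), and allocateFreds(), releaseFreds() and overAllocationDemo() are made-up names standing in for the code the compiler would generate around new Fred[n] and delete[] p.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for Fred: it only counts how many objects are alive.
struct Fred {
    static int live;
    Fred()  { ++live; }
    ~Fred() { --live; }
};
int Fred::live = 0;

// Assumption: a header size that keeps the array behind it suitably aligned.
// This is not a standard constant; a real compiler chooses its own value.
const std::size_t WORDSIZE = alignof(std::max_align_t);
static_assert(WORDSIZE >= sizeof(std::size_t), "header must hold the count");

// Roughly what "new Fred[n]" does under over-allocation.
Fred* allocateFreds(std::size_t n) {
    char* tmp = static_cast<char*>(operator new[](WORDSIZE + n * sizeof(Fred)));
    Fred* p = reinterpret_cast<Fred*>(tmp + WORDSIZE);
    *reinterpret_cast<std::size_t*>(tmp) = n;   // hide the count in front of the array
    std::size_t i = 0;
    try {
        for (; i < n; ++i)
            new (p + i) Fred();                 // placement new
    } catch (...) {
        while (i-- != 0)
            (p + i)->~Fred();                   // destroy what was already built
        operator delete[](tmp);
        throw;
    }
    return p;
}

// Roughly what "delete[] p" does under over-allocation.
void releaseFreds(Fred* p) {
    if (p == nullptr) return;                   // delete[] on a null pointer is a no-op
    char* tmp = reinterpret_cast<char*>(p) - WORDSIZE;
    std::size_t n = *reinterpret_cast<std::size_t*>(tmp);
    while (n-- != 0)
        (p + n)->~Fred();                       // destroy in reverse order
    operator delete[](tmp);                     // note: tmp, not p
}

// Returns how many Freds were alive between allocation and release.
int overAllocationDemo(std::size_t n) {
    Fred* p = allocateFreds(n);
    int alive = Fred::live;
    releaseFreds(p);
    assert(Fred::live == 0);                    // every Fred was destroyed again
    return alive;
}
```

Calling overAllocationDemo(5) builds five Fred objects behind the hidden count and tears them down again. The thing to notice is that releaseFreds() hands tmp to operator delete[], not p, which is exactly why a stray delete p is so dangerous with this technique.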
If the compiler uses the associative array technique, the code for p = new
Fred[n] looks something like this (where arrayLengthAssociation is
the imaginary name of a hidden, global associative array that maps from void*
to "size_t"):
// Original code: Fred* p = new Fred[n];
Fred* p = (Fred*) operator new[] (n * sizeof(Fred));
size_t i;
try {
  for (i = 0; i < n; ++i)
    new(p + i) Fred();           // Placement new
}
catch (...) {
  while (i-- != 0)
    (p + i)->~Fred();            // Explicit call to the destructor
  operator delete[] (p);
  throw;
}
arrayLengthAssociation.insert(p, n);
Then the delete[] p statement becomes:
// Original code: delete[] p;
size_t n = arrayLengthAssociation.lookup(p);
while (n-- != 0)
  (p + n)->~Fred();
operator delete[] (p);
Cfront uses this technique (it uses an AVL tree to implement the associative
array).
Compared to the over-allocation technique, the associative array
technique is slower, but less sensitive to the problem of programmers saying
delete p rather than delete[] p. For example, if you make a programming
error by saying delete p where you should have said delete[] p, only the
first Fred in the array gets destructed, but the heap may survive
(unless you've replaced operator delete[] with something that doesn't
simply call operator delete, or unless the destructors for the other
Fred objects were necessary).
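Here is a similar compilable sketch of the associative array idea. Again, this is an illustration under my own assumptions rather than what Cfront or any other compiler really generates. I've used std::map as the table (not an AVL tree), Fred is a hypothetical object-counting class, and allocateFreds() and releaseFreds() are made-up names. One small departure from the pseudocode above: the sketch records the length inside the try block, so a failed insertion also cleans up the array.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <new>

// Hypothetical stand-in for Fred: it only counts how many objects are alive.
struct Fred {
    static int live;
    Fred()  { ++live; }
    ~Fred() { --live; }
};
int Fred::live = 0;

// Stand-in for the hidden global table; user code can't see a real compiler's.
std::map<void*, std::size_t> arrayLengthAssociation;

// Roughly what "new Fred[n]" does with an associative array.
Fred* allocateFreds(std::size_t n) {
    Fred* p = static_cast<Fred*>(operator new[](n * sizeof(Fred)));
    std::size_t i = 0;
    try {
        for (; i < n; ++i)
            new (p + i) Fred();                 // placement new
        arrayLengthAssociation[p] = n;          // remember the length for delete[]
    } catch (...) {
        while (i-- != 0)
            (p + i)->~Fred();                   // destroy what was already built
        operator delete[](p);
        throw;
    }
    return p;
}

// Roughly what "delete[] p" does with an associative array.
void releaseFreds(Fred* p) {
    if (p == nullptr) return;                   // delete[] on a null pointer is a no-op
    std::map<void*, std::size_t>::iterator it = arrayLengthAssociation.find(p);
    assert(it != arrayLengthAssociation.end()); // p must have come from allocateFreds
    if (it == arrayLengthAssociation.end()) return;
    std::size_t n = it->second;
    arrayLengthAssociation.erase(it);
    while (n-- != 0)
        (p + n)->~Fred();                       // destroy in reverse order
    operator delete[](p);                       // here the address really is p
}
```

Since the address given to operator delete[] is p itself, the heap block is exactly the one that was allocated. The price is a table lookup on every array allocation and deallocation.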
[38.9] If name mangling was standardized, could I link code compiled with compilers from different compiler vendors?
Short answer: Probably not.
In other words, some people would like to see name mangling standards
incorporated into the proposed C++ ANSI standards in an attempt to avoid
having to purchase different versions of class libraries for different
compiler vendors. However, name mangling differences are one of the smallest
differences between implementations, even on the same platform.
Here is a partial list of other differences:
Number and type of hidden arguments to member functions.
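To make "hidden arguments" a little more concrete: the best-known one is the this pointer, which implementations commonly pass to a non-static member function as an extra argument the programmer never writes. The sketch below spells that out by hand. Counter, Counter_add() and hiddenThisDemo() are names I made up for illustration, and a given compiler may pass this (and any other hidden arguments) in a different position or register, which is precisely the kind of difference that standardized mangling wouldn't fix.

```cpp
// What the programmer writes: add() appears to take one argument.
struct Counter {
    int total;
    int add(int x) { total += x; return total; }
};

// A hand-written equivalent: the object pointer becomes an explicit first
// parameter. (Assumption: a real compiler decides where it actually goes.)
int Counter_add(Counter* self, int x) {
    self->total += x;
    return self->total;
}

// Calls both forms on equal objects; returns the shared result, or -1 if
// the two forms ever disagreed.
int hiddenThisDemo() {
    Counter a = {10};
    Counter b = {10};
    int viaMember = a.add(5);            // "this" is passed behind the scenes
    int viaFree   = Counter_add(&b, 5);  // the same thing, spelled out
    return viaMember == viaFree ? viaMember : -1;
}
```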
[38.10] GNU C++ (g++) produces big executables for tiny programs; Why?
libg++ (the library used by g++) was probably compiled with debug info
(-g). On some machines, recompiling libg++ without debugging can save
lots of disk space (approximately 1 MB; the down-side: you'll be unable to
trace into libg++ calls). Merely running strip on the executable doesn't
reclaim as much as recompiling without -g and then stripping the
resulting a.out files.
Use size a.out to see how big the program code and data segments really
are, rather than ls -s a.out which includes the symbol table.
The primary yacc grammar you'll want is from Ed Willink. Ed believes
his grammar is fully compliant with the ISO/ANSI C++
standard, however he doesn't warrant it: "the grammar has not," he says,
"been used in anger." You can get the grammar without action routines or
the grammar with dummy action routines. You can also get the
corresponding lexer. For those who are interested in how he achieves
a context-free parser (by pushing all the ambiguities plus a small number of
repairs to be done later after parsing is complete), you might want to read
chapter 4 of his thesis.
There is also a very old yacc grammar that doesn't support templates,
exceptions, or namespaces; plus it deviates from the core language in some
subtle ways. You can get that grammar here or here.
These are not versions of the language, but rather versions of Cfront, which
was the original C++ translator implemented by AT&T. It has become generally
accepted to use these version numbers as if they were versions of the language
itself.
Very roughly speaking, these are the major features:
2.0 includes multiple/virtual inheritance and pure virtual functions
2.1 includes semi-nested classes and delete[] pointerToArray
3.0 includes fully-nested classes, templates and i++ vs. ++i
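In case the "i++ vs. ++i" item is cryptic: it refers to letting a class overload the prefix and postfix forms of ++ separately. In today's C++ that looks like the sketch below, where the postfix form is marked by a dummy int parameter. Number is a hypothetical class of my own, shown only to illustrate the feature.

```cpp
// Hypothetical class that overloads prefix and postfix ++ separately.
class Number {
public:
    explicit Number(int v) : v_(v) {}

    // Prefix form (++n): increment, then return the object itself.
    Number& operator++() { ++v_; return *this; }

    // Postfix form (n++): the unused int parameter marks it as postfix.
    // Return a copy of the old value, then leave the object incremented.
    Number operator++(int) { Number old(*this); ++v_; return old; }

    int value() const { return v_; }

private:
    int v_;
};
```

With Number n(1), the expression n++ yields a Number holding 1 and leaves n at 2; a following ++n yields n itself, now holding 3.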
Depends on what you mean. If you mean, "Is it possible to convert
C++ to readable and maintainable C code?" then sorry, the answer is
No: C++ features don't directly map to C, plus the generated C code
is not intended for humans to follow. If instead you mean, "Are
there compilers which convert C++ to C for the purpose of compiling
onto a platform that doesn't yet have a C++ compiler?" then you're
in luck; keep reading.
A compiler which compiles C++ to C does full syntax and semantic
checking on the program, and just happens to use C code as a way of
generating object code. Such a compiler is not merely some kind of
fancy macro processor. (And please don't email me claiming these are
preprocessors: they are not; they are full compilers.) It is
possible to implement all of the features of ISO Standard C++ by
translation to C, and except for exception handling, it typically
results in object code with efficiency comparable to that of the code
generated by a conventional C++ compiler.
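To give a feel for why C++ features "don't directly map to C," here is a hand-made sketch of how a virtual function call might be lowered to plain structs and function pointers. It is not the output of any real translator. The second half sticks to C-style constructs but is written so it compiles as C++ (hence the reinterpret_cast), and every name in it (Shape_c, Shape_vtbl, Square_c, Square_c_init and so on) is my own invention.

```cpp
// What the programmer wrote.
struct Shape {
    virtual ~Shape() {}
    virtual int area() const = 0;
};
struct Square : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};
int callAreaCpp(const Shape& s) { return s.area(); }

// A hand-lowered, C-style equivalent (assumption: real translators emit far
// less readable code and also handle destructors, RTTI and so on).
struct Shape_c;
struct Shape_vtbl {
    int (*area)(const Shape_c* self);     // one slot per virtual function
};
struct Shape_c {
    const Shape_vtbl* vptr;               // the hidden "virtual pointer"
};
struct Square_c {
    Shape_c base;                         // base-class part comes first
    int side;
};

static int Square_c_area(const Shape_c* self) {
    // Valid because base is the first member of the standard-layout Square_c.
    const Square_c* sq = reinterpret_cast<const Square_c*>(self);
    return sq->side * sq->side;
}
static const Shape_vtbl Square_c_vtbl = { &Square_c_area };

// Plays the role of the constructor: set the vptr, then the data members.
void Square_c_init(Square_c* self, int side) {
    self->base.vptr = &Square_c_vtbl;
    self->side = side;
}

// The lowered form of s.area(): an indirect call through the table.
int callAreaLowered(const Shape_c* s) { return s->vptr->area(s); }
```

Both callAreaCpp() on a Square of side 4 and callAreaLowered() on a Square_c initialized with side 4 produce 16. Multiply the bookkeeping in the second half by every class in a real program and you'll see why nobody should expect to maintain the generated C by hand.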
Here are some products that perform compilation to C (note: if you
know of any other products that do this, please let me know
(cline@parashift.com)):
LLVM is a downloadable compiler that emits C code. See also here.
Cfront, the original implementation of C++, done by Bjarne Stroustrup and
others at AT&T, generates C code. However it has two problems: it's been
difficult to obtain a license since the mid 90s when it started going through
a maze of ownership changes, and development ceased at that same time, so it
doesn't get bug fixes and doesn't support any of the newer language features
(e.g., exceptions, namespaces, RTTI, member templates).
Contrary to popular myth, as of this writing there is no version of
g++ that translates C++ to C. Such a thing seems to be doable, but I am not
aware that anyone has actually done it (yet).
Note that you typically need to specify the target platform's CPU, OS
and C compiler so that the generated C code will be specifically
targeted for this platform. This means: (a) you probably can't take
the C code generated for platform X and compile it on platform Y; and
(b) it'll be difficult to do the translation yourself; it'll
probably be a lot cheaper/safer with one of these tools.
One more time: do not email me saying these are just
preprocessors: they are not; they are compilers.