-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathenvironment.texi
828 lines (656 loc) · 31 KB
/
environment.texi
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
@node Emacs Lisp Bytecode Environment
@chapter Emacs Lisp Bytecode Environment
In this chapter we discuss the ways Emacs creates, modifies and
uses bytecode in order to run code. We describe a little of the two
kinds of interpreters Emacs has, what goes into a bytecode file, and
the inter-operability of bytecode between versions.
@menu
* Emacs Lisp Bytecode Objects::
* Emacs Lisp Bytecode Compiler::
* Emacs Lisp Bytecode Interpreter::
* Emacs Lisp Bytecode Bytes::
* Emacs Lisp Bytecode Files::
* Functions and Commands for working with LAP and Bytecode::
* Source and Bytecode Optimization::
* LAP Decompiler::
@end menu
@SECTION Emacs Lisp Bytecode Objects
@cindex bytecode
@emph{This section is expanded and edited from Chris Wellons' blog on
``Emacs byte code Internals'' and from the Emacs Lisp Reference
manual. See references at the end of this doc.}
Emacs Lisp bytecode is an encoded form of a low-level assembly format
that is suited to running Emacs Lisp primitives and functions.
Emacs Lisp bytecode is not a low-level sequence of octets (bytes) that
requires a lot of additional special-purpose machinery to run. There
is a custom C code interpreter to handle each of the instruction
primitives, and that is basically it. And even here, many of the
instructions are simply a bytecode form of some existing Emacs
primitive function like ``car'' or ``point''.
Emacs Lisp bytecode is a built-in Emacs Lisp type (the same as a
Lisp ``cons'' node, or a Lisp symbol).
Functions @code{aref} and @code{mapcar} can be used to extract the
components of bytecode once it is built, The bytecode object is made
up of other normal Emacs Lisp objects described next. Bytecode is
created using the
@findex make-byte-code
@code{make-byte-code} function.
One important component of the bytecode object is the ``constants
vector.'' It is a Emacs Lisp vector. The @code{constant} instruction
refers to one of these objects.
An Emacs Lisp object of a bytecode type is analogous to an Emacs Lisp
vector. As with a vector, elements are accessed in constant time.
The print syntax of this type is similar to vector syntax, except
@verb{|#[...]|} is displayed to display a bytecode literal instead of
@verb{|[...]|} as in a vector.
A bytecode object is one of the several kinds of functions that Emacs
understands. See @pxref{symbol-function} for other objects that act
like a function.
Valid bytecode objects have 4 to 6 elements and each element has a
particular structure elaborated on below.
There are two ways to create a bytecode object: using a bytecode
object literal or with @code{make-byte-code} (@pxref{make-byte-code}). Like vector literals,
bytecode functions don't need to be quoted.
The elements of a bytecode function literal are:
@iftex
@enumerate
@item Function Parameter (lambda) List
@item Bytecode Unibyte String
@item Constants Vector
@item Maximum Stack Usage
@item Docstring
@item ``Interactive'' Specification
@end enumerate
@end iftex
@menu
* Function Parameter (lambda) List::
* Bytecode Unibyte String::
* Constants Vector::
* Maximum Stack Usage::
* Docstring::
* Interactive Specification::
@end menu
@node Function Parameter (lambda) List
@subsection Function Parameter (lambda) List
The first element of a bytecode-function literal is the parameter list
for the @code{lambda}. The object takes on two different forms
depending on whether the function is lexically or dynamically
scoped. If the function is dynamically scoped, the argument list is a
list and is exactly what appears in Lisp code. In this case, the
arguments will be dynamically bound before executing the bytecode.
@b{Example showing how a parameter list is transformed:}
@findex byte-compile
@verbatim
ELISP> (setq lexical-binding nil) ; force lexical binding
ELISP> (byte-compile
(lambda (a b &optional c) 5))
#[(a b &optional c) "\300\207" [5] 1]
@end verbatim
Above we show raw bytecode data. Emacs after version 25
makes an effort to hide the data.
There is really no shorter way to represent the parameter list because
preserving the argument names is critical. With dynamic
scoping, while the function body is being evaluated these variables are
globally bound (eww!) to the function's arguments.
On the other hand, when the function is lexically scoped, the
parameter list is packed into an Emacs Lisp integer, indicating the
counts of the different kinds of parameters: required,
@verb{|&optional|}, and @verb{|&rest|}. No variable names are
needed. In contrast to dynamically-bound variables, the arguments
are on the stack of the byte-code interpreter before executing the
code
The following shows how parameter counts and flags are encoded:
@image{elisp-params-small,,,,.png}
The least significant 7 bits indicate the number of required
arguments. This limits compiled, lexically-scoped
functions to 127 required arguments. The 8th bit is the number of
@b{&rest} arguments (up to 1). The remaining bits indicate the total
number of optional and required arguments (not counting @b{&rest}). It's
really easy to parse these in your head when viewed as hexadecimal
because each portion almost always fits inside its own ``digit.''
@b{Examples showing how lexical parameters are encoded:}
@findex byte-compile-make-args-desc
@verbatim
ELISP> (byte-compile-make-args-desc '())
#x000 ;; (0 args, 0 rest, 0 required)
ELISP> (byte-compile-make-args-desc '(a b))
#x202 ;; (2 args, 0 rest, 2 required)
ELISP> (byte-compile-make-args-desc '(a b &optional c))
#x302 ;; (3 args, 0 rest, 2 required)
ELISP> (byte-compile-make-args-desc '(a b &optional c &rest d))
#x382 ;; (3 args, 1 rest, 2 required)
@end verbatim
The names of the arguments do not matter in lexical scope; they're
purely positional. This tighter argument specification is one of the
reasons lexical scope is sometimes faster: the byte-code interpreter doesn't
need to parse the entire lambda list and assign all of the variables
on each function invocation; furthermore, variable access is via a
compact index located usually in the operand value rather than an
index into the constants vector followed by a lookup of the variable.
@node Bytecode Unibyte String
@subsection Bytecode Unibyte String
The second element of a bytecode-function literal is either
@itemize
@item
a unibyte string, or
@item
a pointer to a unibyte string,
@item
An autoload function
@end itemize
A unibyte string is a sequence of bytes or octets. Despite the type
name, it is not interpreted with any sort of Unicode
encoding. These sequences should be created with
@code{unibyte-string()} because strings can get transformed into longer
sequences of bytes when encoded. To disambiguate the string type to
the Lisp reader when higher values are present (> 127), the strings
are printed in an escaped octal notation, keeping the string literal
inside the ASCII character set.
@b{Examples unibyte strings:}
Bytecode for @code{(defun double-eg(n) (+ n n))} is:
@verbatim
PC Byte Instruction
0 8 varref[0] n
1 137 dup
2 92 plus
3 135 return
Constants Vector: [n]
@end verbatim
To encode the byte sequence then for this we could use:
@verbatim
ELISP> (unibyte-string 8 137 92 135)
"^H^?\\\207"
@end verbatim
It is unusual to see a bytecode string that doesn't end with 135
(#o207, return).
We describe how to decode the bytecode string in @ref{Instruction-Description Format}.
However when a function has been defined as a result of reading a
bytecode file, the unibyte string is a pointer into that file. This
pointer is represented by a @code{cons} node where the @code{car} is the
filename and the @code{cdr} is the bytecode offset from the beginning of the file
@verbatim
ELISP> (aref
(symbol-function 'cl-gcd)
1) ;; 1 is the bytecode string field
"("/tmp/emacs/lisp/emacs-lisp/cl-extra.elc" . 7352)
@end verbatim
@verbatim
ELISP> (aref
(symbol-function 'ediff-buffers)
1)
(autoload "ediff" 975154 t nil)
@end verbatim
@node Constants Vector
@subsection Constants Vector
@cindex constants vector
The third object in a bytecode-function literal is the ``constants
vector''. It is a normal Emacs Lisp vector and can be created with
@code{(vector ...)} or using a vector literal.
There is a possibility for confusion by the name ``constants vector''.
The vector size and its values are indeed constant. Also, only the
@code{constant} bytecode instructions (@pxref{Constants-Vector
Retrieval Instructions}) refers to one of these objects. However, in
addition to values string and integer values that do not change,
values in this vector also can be function and variable names. So
although a variable or function @emph{name} stored in the constants
vector doesn't change, the @emph{binding} of that particular variable
or function can change, even in the course of running the bytecode.
By using a constants vector, operand sizes in the bytecode
instructions are fixed and small. Also, operand values can be shared,
reducing the size of the constant vector.
Since the constants vector is a true Emacs Lisp vector, the overall
bytecode interpreter is simpler: all Lisp objects are handled in a
unified way: the representation of a integers, vectors, lists,
strings, and other Lisp objects is no different from the
representation in the Emacs Lisp interpreter.
@b{Example Showing a Constants Vector:}
@verbatim
ELISP> (aref
(byte-compile
(lambda (a b)
(my-func '("hi" "there") a nil 5)))
2) ;; 2 is the bytecode constants field
[a my-func
("hi" "there")
nil 5]
@end verbatim
The above assumes that dynamic binding is in effect.
The constants vector in the above example contains 5 elements:
@itemize
@item @code{a} --- the symbol @code{a} which refers to a variable
@item @code{myfunc} ---the symbol @code{myfunc} which likely refers to an external function
@item @code{("hi" "there")} --- a list constant containing two strings
@item @code{nil} --- the nil constant
@item @code{5} --- the integer constant 5
@end itemize
The properties of symbol @code{a} and symbol @code{myfunc} are
consulted at run time, so there is no knowledge in the
bytecode representing the fact that @code{a} is a dynamically-bound
parameter while @code{my-func} is probably an external function.
If the lambda were lexically scoped, the constants vector would not
have the variable symbol @code{a} listed, but instead there would be a
stack entry.
Note that although the symbol @code{b} is a parameter of the lambda,
it does not appear in the constants vector, since it is not used in the
body of the function.
@node Maximum Stack Usage
@subsection Maximum Stack Usage
The fourth object in a bytecode-function literal is an integer which gives
the maximum stack space used by this bytecode. This value can be
derived from the bytecode itself, but it is pre-computed so that the
byte-code interpreter can quickly check for stack
overflow. Under-reporting this value is probably another way to crash
Emacs.
In our example above, the maximum-stack value is five since function
@code{myfunc} is called with four parameters which are pushed onto the
stack, and there is an additional stack entry pushed, the @code{myfunc}
symbol itself. All of this needs to be in place on the stack just
before a @code{call} instruction runs to perform the @code{myfunc}
call.
@node Docstring
@subsection Docstring
The fifth object in a bytecode-function literal. It is optional. As
with the bytecode unibyte string, this value is either a string
literal or a pointer to a string in a bytecode file.
@b{Examples showing DocStrings:}
@verbatim
ELISP> (aref
(byte-compile
(defun double(a)
"double parameter A"
(+ a a)))
4) ;; 4 is the bytecode docstring field
"double parameter A"
ELISP> (aref
(symbol-function 'cl-gcd)
4)
("/tmp/emacs/lisp/emacs-lisp/cl-extra.elc" . 7251)
@end verbatim
@node Interactive Specification
@subsection ``Interactive'' Specification
When there is a sixth field in the bytecode function, the function is a
command, i.e., an ``interactive'' function. Otherwise the function is
not a command. This parameter holds the exact contents of the
argument to @code{interactive} in the uncompiled function definition.
Note that @code{(interactive)} causes the sixth field to be nil,
which is distinct from there not being a sixth field.
@unnumberedsubsec Examples showing the ``interactive'' specification
@verbatim
ELISP> (aref
(byte-compile
(lambda (n)
(interactive "nNumber: ") n)
)
5) ;; 5 is the bytcode interactive specification field
"nNumber: "
ELISP> (aref
(byte-compile
(lambda (n)
(interactive (list (read))) n))
5)
(list
(read))
@end verbatim
The interactive expression is usually interpreted, which is fine because,
by definition, this code is going to be waiting on user
input, but it slows down keyboard macro playback.
@SECTION Emacs Lisp Bytecode Compiler
The bytecode compiler is an ahead-of-time compiler that
accepts Emacs Lisp input and produces bytecode that can be run by
Emacs. The compiler itself is written in Emacs Lisp @footnote{Usually
the compiler itself is compiled into bytecode, which avoids overflow
problems}, and is a comparatively compact program contained in the
files @code{bytecomp.el} and @code{byte-opt.el}.
Internally, the compiler first produces an intermediate Lisp structure
in LAP code, then performs various optimizations on that, and finally
translates the LAP code into bytecode. LAP code is used during
compilation, but not kept in memory or used when running bytecode.
It is possible to go back to LAP code from bytecode. This is done in
order to inline functions and when bytecode disassembly is requested.
@SECTION Emacs Lisp Bytecode Interpreter
@emph{Note: the @strong{bytecode} interpreter that is described here should not
be confused with the Emacs @strong{Lisp} Interpreter}.
When a function is called and the function is represented as bytecode,
control passes to the bytecode interpreter. The interpreter is
written in C and is written more for speed than readability.
The bytecode interpreter operates on a single function at a time. For
a function call, the bytecode interpreter calls other parts of Emacs,
which might call the bytecode interpreter again, recursively. Thus, in
contrast to languages like FORTH, there is no code stack per se, just
the C stack.
The bytecode interpreter implements a stack machine utilizing a
fixed-size evaluation stack, which is usually allocated as a block on
the C stack. Instructions can access either this stack or a constants
vector, which is produced at compile time and made part of the
bytecode object.
The evaluation stack, as well as the constants vector, contains Lisp
values, usually 64-bit words containing an integer (Emacs integers are
limited to 62 bits on 64-bit machines), symbol index, or a tagged
pointer to one of various Emacs structures such as markers, buffers,
floating-point numbers, vectors, or cons cells.
Values on the evaluation stack are created at run time. Values in the
constants vector are created when the byte-compiled file is read and
converted into bytecode objects. The underlying bit representation of
values in the constants vector can vary between Emacs instance; they
are constants in the sense that they do not vary within a single Emacs
instance.
Bytecode objects contain a number safely estimating the maximum stack
size the evaluation stack can grow to.
@SECTION Emacs Lisp Bytecode Bytes
The bytecode interpreter, once it has set up the evaluation stack and
constants vector, executes the instructions that make up the bytecode
byte sequence. Each instruction is between one and three bytes long,
containing an opcode in the first byte and sometimes an eight- or
16-bit integer in the following bytes. Those integers are usually
unsigned, and 16-bit integers are stored in little-endian byte order,
regardless of whether that is the natural byte order for the machine
Emacs runs on.
Some opcodes, allocated in blocks, encode an integer as part of the
opcode byte.
Bytecode instructions operate on the evaluation stack. For example,
@code{plus}, the addition function, removes two values from the
top of the stack and pushes a single value, the sum of the first two
values, back onto the stack.
Since the arguments for a function call need to be on the stack before
the function can operate on them, bytecode instructions use reverse
Polish notation: first the arguments are pushed onto the stack, then the
function or operation is called. For example, the Lisp expression
@code{(+ a b)} turns into this bytecode:
@c @code{(defun plus (a b) (+ a b))} generates
@verbatim
PC Byte Instruction
0 8 varref a
1 9 varref b
2 92 plus
@end verbatim
First @code{a} and @code{b} are dereferenced and their values pushed
onto the evaluation stack; then @code{plus} is executed, leaving
only a single value, the sum of @code{a} and @code{b}, on the stack.
@SECTION Emacs Lisp Bytecode Files
When Emacs is build from source code, there is C code for some
primitive or built-in functions. These include Lisp functions like
@code{car}, or primitive Emacs functions like @code{point}. Other
equally important functions are implemented in Emacs Lisp. These are
byte compiled and then loaded into Emacs. On many systems there is the
ability to dump Emacs in some kind of image format after these basic
functions have been loaded, but even if that does not happen, a file
called @code{loaddefs.el} is created which contains many of the
important basic primitive functions as bytecode.
When we invoke Emacs then, it has a number of functions already
loaded and these are either coded in C or have been byte compiled and
loaded. Before running a function, Emacs queries the type of code that
is associated with the function symbol and calls either its lambda
S-expression interpreter or its bytecode interpreter.
When we run @code{load}, which reads and evaluates Lisp code from a
file, at the top-level it does not matter whether the file contains
bytecode or Emacs Lisp source code. Either way the only thing done is
to open the file and read its contents using the normal
Lisp reader.
The difference between the two kinds of files is more about convention
than about their contents, and specifically two things: First the
bytecode file will have a comment header in it that starts
@verb{|;ELC^W^@^@^@|} while the source code probably does not
(although nothing to stop us from adding in that line if we feel like
it). And, in addition to this comment header, a bytecode file will
have other meta-comments such as which version of Emacs was used to
compile the file and whether optimization was used. In earlier
versions, there was information about the program that was used to
compile the program, such its version number, and the source code path
used to be in there as well. (I think these things should still be in
there but that's a different story.) See @ref{Instruction Changes
Between Emacs Releases} where we give examples of the headers to show
how they have changed.
The second thing that is typically different between source code files
and bytecode files is that bytecode files contain the @code{bytecode}
calls used in the file and lack of any @code{defun}, @code{defmacro},
or @code{lambda} calls. But again there is presumably nothing stopping
anyone from using these in their source code.
In fact, we can take a file with the @code{.elc} extension, rename it
with an @code{.el} extension and @code{load} that, and it
will run exactly the same if it had been loaded as a bytecode
file@footnote{If we go the other way and rename a Lisp file as
a bytecode file, Emacs will notice the discrepancy because at the top
of the file is a header that Emacs checks. But if we add a
reasonable-looking header we can go that direction as well.}.
Similarly, just as we can concatenate any number of independent Emacs
Lisp source code files into one file, and this is sometimes done as a
poor-man's way to create a package, we can also concatenate any
numbers of Emacs Lisp bytecode files.
Of course, there are probably certain programs that are fooled when
the extension is changed. In particular, the
@code{byte-recompile-directory} function will think that the
bytecode file does not exist because it has the wrong extension. So
even though Emacs is permissive about such matters, it is best to
stick with the normal Emacs conventions.
The final thing that should be mentioned when talking about bytecode
files is interoperability between Emacs versions.
Even though a bytecode header has a meta comment indicating the
version of Emacs that was used to compile it, that information is not
used in determining whether the bytecode file can be run or not. This
has the benefit of being able to run bytecode compiled in a different
Emacs version than the version currently running. Since Emacs bytecode
instructions do not change often, this largely works. The scary part,
though, is that opcode meanings have changed over the 30 years, and
the interpreter sometimes lacks checks. (In the past the interpreter
aborted when running an invalid bytecode.) So Emacs does not even know
when we are running bytecode from a different interpreter, and we
might run off a cliff running older or newer bytecode without a check.
Emacs developers maintain that, in practice, problems have not been
reported very much. Also, they try to keep backward compatibility
between versions so that bytecode generated in an older version of
Emacs will often still be interpreted in a recent newer version. While
this is a worthwhile intention, my experience is that this does not
always work, especially going back more than one version, and it is
unrealistic to expect for a program that is 30 years old.
Because there is no up-front checking, bytecode generated from a newer
version of Emacs will run silently on an older version until there is
opcode that the older version cannot handle. In some cases it will
complete. See @ref{Instruction Changes Between Emacs Releases} for
when this is likely to work and when it won't. Although running newer
bytecode in an older version of Emacs is not explicitly considered,
since bytecode does not change very often, this can sometimes work
out.
Note the sharp contrast with other bytecode interpreters, such as
Python, where the magic used in compiling has to be the same as the
value of the running interpreter or it will refuse to run.
It would be nice to have an Emacs Lisp bytecode
checker, perhaps a @code{safer-load} function that looks at the
bytecode. Its meta-comments would glean when there is something that is
known to cause problems. Any volunteers?
@include bytecode-commands.texi
@include optimization.texi
@node LAP Decompiler
@section LAP Decompiler
Let's face it --- most of us would rather work with Emacs Lisp, a
higher-level language than bytecode or its more human-friendly LAP
disassembly. There is a project in its early stages that can often
reconstruct Emacs Lisp from bytecode generated from Emacs Lisp.
See the @url{https://github.com/rocky/elisp-decompile, Elisp
Decompiler Project} for more details.
Although there is much work that needs to be done to make this work
more of the time, quite frankly, it blows my mind that I've been able
to get this far. And it was a lot of work to get this far. First
there is setup just to have some sort of decompilation framework,
albeit in Python. And then, I needed to first start writing
@emph{this} document since there wasn't anything nearly as complete
when I started. There were various tables around for bytecode
instructions, but I needed not just cumulative evaluation stack
effects for a bytecode operation, but how many entries on the stack
are read and then how many written. This was the motivation for adding
that to this document. Many of the mistakes that I've found here are a
direct result of working on the decompiler. And having the semantics
of each operation was helpful in writing the decompiler.
A number of people have opined that you really don't need a
decompiler.@footnote{Whenever something new or novel comes along, in
addition to the group that says ``Why not?!'' there is the ``Why
bother?'' group. In my (rocky) lifetime, I have experienced this
attitude when undo and regular expressions were added to GNU Emacs,
adding a debugger to languages where debugging support didn't exist or
was weak, using decomplation to give more precise error location
information, and here. Convincing the crowd that is happy with the
status quo is hard. The next section is more for those who have an
open mind and want to see a @emph{better} world.}
Below are the arguments offered and why I think they are inadequate.
@menu
* Of course I have the source code!::
* Isn't it simpler to just disassemble?::
@end menu
@node Of course I have the source code!
@subsection It's GNU Emacs, so of course I have the source code!
There is a difference between being able to find the source code and
having it @emph{accessible} when you need it, such as in an
interactive session or at runtime from a stack trace.
When many people hear the word ``decompile'' they think reverse
engineering or hacking code where source has deliberately been
withheld. But there are other situations where a decompiler is useful.
A common case is where you wrote the code, but have accidentally
deleted the source code and just have the bytecode file.
But, I know, you always use version control and GNU Emacs provides it's
tilde backup file.
So that leads us to the situation where there are @emph{several}
possible source-code versions around, e.g. a development version and a
stable version, or one of the various versions in your version-control
system, and you'd like to know which one of those is stored in a
bytecode file, that you have loaded.
And then we come to situation where there @emph{is} no source-code
file. One can create functions on the fly and change them on the
fly. Lisp is known for its culture in having programs create programs;
it is possible such a program doesn't have a ``debug'' switch (that
you know about) to either save or show the result before it was
byte-compiled. Perhaps you'd like to reconstruct the source code for a
function that you worked on interactively.
@node Isn't it simpler to just disassemble?
@subsection Isn't it simpler to just disassemble?
Interestingly, a number of people suggested that programmers who want
to understand bytecode should use LAP and decompile mentally.
Most people don't know LAP. In fact, before I started writing this
document, there really @emph{wasn't} any good documentation on LAP,
short of reading source code. So a side benefit is that the process of
writing a decompiler has made the documentation of what's there in
bytecode more easily accessible.
But from writing this decompiler and noting all of the subtleties and
intricacies, I am pretty convinced that those who say they understand
LAP have to spend a bit of time-consuming tedious work to decipher
things. I sure do.
This isn't is what computers were invented for? They do this kind of
thing very fast compared to humans.
Here are some simple examples:
@b{macros}
I would find it tedious enough just to descramble something that has
been macro expanded. And I am sure people may find that unsatisfying
right now with our results. Now imagine how you'd feel to add another
layer on that in going from LAP to Elisp for the expanded macros.
The LAP instructions for:
@verbatim
(define-minor-mode testing-minor-mode "Testing")
@end verbatim
expand to 60 or so LAP instructions; a lot more when various parameters
have been filled in.
And then you have things like @code{dolist} which are just long and
boring template kinds of things. Because it's long it is very easy to
lose sight of what it is.
@b{Stacked values}
Keeping track of values pushed on a stack is also tedious. Again, there
can be some non-locality in when a value is pushed with when it is
used and popped. As an example, consider the LAP instructions in Figure @ref{fig:lap-instructions}.
It is similar to a very well-known mathematical function.
@float Figure,fig:lap-instructions
@verbatim
constant fn
dup
varref n
sub1
call 1
constant fn
varref n
constant 2
diff
call 1
plus
call 1
@end verbatim
@end float
At what point is that duplicated function the second instruction used?
And what are the arguments to this function?
How long did it take you to figure this out? It takes a computer
hundredths of a second to reconstruct the Lisp code.
@b{Keyboard bindings}
This is yet another piece of tedium for the few that know how to do.
As before see how fast you can come up with the key names for:
@verbatim
[134217761]
[134217854]
[134217820]
@end verbatim
Again this is done in hundredths of a second by a computer.
@b{Pedagogy}
Another aspect about a decompiler, especially this one, is that it can help you learn LAP and Emacs bytecode, and how the existing compiler and optimizer work.
The program creates a parse tree from (abstracted) LAP instructions, where the the higher levels of the tree consist of grammar nodes that should be familiar in higher-level Emacs Lisp terms.
With this it is possible to see how the individual instructions combine to form the higher-level constructs.
Figure @ref{fig:lap-decompile} has is decompilation using of the LAP instructions in Figure @ref{fig:lap-instructions}.
@float Figure,fig:lap-decompile
@verbatim
$ lap-decompile --tree=after tmp/foo.lap
fn_body (3)
0. body
exprs
expr_stmt (2)
0. expr
call_exprn (3)
0. expr
name_expr
0 CONSTANT fn
1. expr
binary_expr (3)
0. expr
call_exprn (3)
0. expr
1 DUP
1. expr
unary_expr (2)
0. expr
2 VARREF n
1. unary_op
3 SUB1
2. 4 CALL_1 1
1. expr
call_exprn (3)
0. expr
name_expr
5 CONSTANT fn
1. expr
binary_expr (3)
0. expr
6 VARREF n
1. expr
name_expr
7 CONSTANT 2
2. binary_op
8 DIFF
2. 9 CALL_1 1
2. binary_op
10 PLUS
2. 11 CALL_1 1
1. opt_discard
1. opt_label
2. opt_return
(defun foo(n)
(fn (+ (fn (1- n)) (fn (- n 2)))))
@end verbatim
@end float
Looking at Figure @ref{fig:lap-decompile} we see that:
@verbatim
1 DUP
2 VARREF n
3 SUB1
4 CALL_1 1
@end verbatim
contains a function call and its parameters; One of the parameters is the unary expression:
@verbatim
2 VARREF n
3 SUB1
@end verbatim
And if you want to know which operation the stack value of first instruction @code{CONSTANT fn}
is used in, the nesting makes it easy to see it is the very last instruction @code{CALL_1} at offset 11.
As I mentioned before, personally, I find matching this stuff up a bit tedious.
You can also get control-flow graphs from the decompiler. See the github repository for more details.