%==============================================================================% % Start of Ch2.tex % %==============================================================================% % % Copyright % --------- % Copyright (C) 1992 Ross N. Williams. % This file contains a chapter of the FunnelWeb Hacker's Manual. % See the main TeX file for this manual for further information. % %==============================================================================% \chapter{FunnelWeb Implementation} \label{chapimplementation}\xx{FunnelWeb}{implementation} \section{Introduction} This chapter contains notes on the actual C code that implements FunnelWeb~V3. This chapter is rather patchy. It has acted mainly as a dumping ground for ideas that I bothered to write about during development. \section{History of FunnelWeb Implementations} \xx{FunnelWeb}{history}\x{FunnelWeb version 1}\x{FunnelWeb version 2}% \x{FunnelWeb version 3} The first implementation of FunnelWeb (FunnelWeb~V1) was written in Ada\x{Ada} in December 1986. The project was initially canned as requiring too much time, but was resurrected when I decided to commit to Ada\paper{USDOD83} and realized that I needed a program to write to help me to learn Ada. FunnelWeb~V1 was, in fact, my first Ada program. It took about one month to write. FunnelWeb~V1 was used intensively by myself to write Ada programs from January~1986 to July~1989 at which time I finished my Ph.D.\x{Ph.D.} and lost access to the VAX. During this time at least twenty thousand lines of code were generated using FunnelWeb. Hardly anyone but myself used FunnelWeb.\xx{FunnelWeb}{past use} After losing access to Ada and the Vax (and hence to FunnelWeb), I was forced back to programming the non-literate way. From time to time I found that I needed to use some of my old programs that I had written using Ada and FunnelWeb. I knew that Ada would become available on more machines, but certainly FunnelWeb wouldn't. I recognised a strong need for a portable version of FunnelWeb written in~C but didn't have the time or energy to create one. About this time (late 1989), David Hulse,\xn{David}{Hulse} at the time a student in Computer Science at the University of Adelaide,\xx{University}{Adelaide} volunteered to translate the 4000 line Ada version of FunnelWeb into~C. To my knowledge this translation process took about three weeks (in December 1989). The result was called FunnelWeb~V2 and was formally signed into the public domain on 5~May 1992. In general, David Hulse did a good job. However, the resultant code suffered from one or few serious defects, the most serious of which was a lack of portability. Lack of portability of the C~code, combined with the need for a rather solid design review, combined with the need to strengthen the program to bring it up to production standard, resulted in my performing a complete reworking of the code. The C code was entirely, but incrementally, replaced or reformatted. The code was also strengthened and new features were added. This process took about two months (November and December 1991). A further two months (approx) were spent writing documentation, constructing a regression test suite, porting the program to different machines, and sorting out the legal issues involved in its release. I would like to take this opportunity to record a debt of gratitude to David Hulse who translated FunnelWeb from Ada to~C. Although my reworking of his C~code obliterated most of his code, his translation was pivotal to the development process. Without his effort in moving from Ada to~C, I'm not sure that I would have mustered the energy and time to complete the process and bring FunnelWeb up to its current standard. \section{Why FunnelWeb Wasn't Used to Write Itself} \xx{FunnelWeb}{writing itself} After Knuth created the Web literate preprocessing system, he re-wrote it using Web and distributed the source code in Web source form. To allow the Web source code to be tangled by users not yet having a copy of Web, he also included the tangled Pascal code for the Tangler. While this approach is heroic and serves to convey a commitment and a confidence in literate programming, it seemed to me that writing FunnelWeb in FunnelWeb would simply be asking for trouble.\xx{trouble}{asking for} For a start, it would be very hard to modify any feature of FunnelWeb that had been used to write FunnelWeb, and the thought of what would happen if the working executable became inoperative for some reason does not bear thinking upon. One million billion computer programs were written in the non-literate style before FunnelWeb was created. Why not one more? \section{Coding Style} \xx{coding}{style} Although FunnelWeb wasn't coded under any particular coding standard, it was coded in accordance with a fairly strict personal style of C which developed during the development of FunnelWeb. This style was subsequently embodied in a real C coding standard prepared for the South\x{South Australian Government Department of Lands} Australian Government Department of Lands.\footnote{The standard is currently unavailable, but is likely to be released or published eventually.} Unfortunately, FunnelWeb was not formally developed under the standard and so some holes remain in FunnelWeb's coding style. This section aims to describe some of the more important aspects of the coding style. \thing{Portability:} This\x{portability} was a major goal of the FunnelWeb implementation. Two excellent books guided this move to portability. They were \paper{Rabinowitz90} (which deals with C code itself) and \paper{Horton90} (which deals with the portability of various library calls). Other works such as \paper{Kernighan88} and \paper{ANSI} were also helpful. \thing{Identifiers:} \paper{Rabinowitz90}\x{identifier}{length} specifies that for wide portability, identifiers of block and file scope should be unique to eight characters, and identifiers of program scope should be unique to six characters. I have gone further in FunnelWeb and actually made these restrictions actual limits on identifier length. Because names must be so short, a system of abbreviations was developed to organize the identifiers used within FunnelWeb. Each abbreviation consists of a letter pair. Here are \i{some} of the abbreviations used:\xx{identifier}{abbreviations} \begin{verbatim} bp - Body Part. cm - Compare. Used to prefix comparison routines that return [-1,0,1]. dc - Document component. dm - Dump package. el - Element. eq - Equal. Used to prefix comparison routines that return a boolean. ex - Expression. f - Global files. ll - List of lists. ln - Line record. ls - List Package. lr - Lister package. ma - Macro. mc - Macro Call. mn - Macro Name. op - Options package. pr - Parser. ps - Position record. sc - Scrap record. sn - Section. tb - Table package. ty - Typesetter directive. wf - Write file package. wl - Write with EOL (misc.c). wr - Write (misc.c). \end{verbatim} \thing{Pointers:} Variables or types denoting pointers\xx{pointers}{naming} start with \dqp{p\_}. \thing{Types:} Names denoting types end in \dqp{\_t}. Thus, a type for a pointer to a table would be named \p{p\_tb\_t}.\xx{types}{naming} \thing{File names:} All files\xx{filenames}{length} used in FunnelWeb have file names that are from one to eight characters long and file extensions that are from one to three characters long. This ensures that the files can be portably moved to all kinds of machines, even MSDOS!\x{MSDOS} \section{Use of Memory} \xx{use of}{memory} FunnelWeb is not a memory-stressed program. However, during its development, problems with the management of memory seemed to crop up again and again. This section documents some of these problems and the solutions adopted. There are three places where memory can be obtained: the heap, the stack, and from static variables. The following three sections deal with each of these areas. \section{The Heap} \xx{heap}{memory} One of the great frustrations of being a user is to find that a computer program is complaining about lack of memory when one knows full well that one has allocated at least ten times as much memory to the program as it would ever need to do its job. The reason for such error messages usually has to do with the programmer setting a fixed \dq{reasonable} limit to a particular data structure and then locking it up into an array whose bound is specified by a constant. While the use of arrays can increase the speed of a program, it also means that the user cannot increase the capacity of the program without obtaining the source code and recompiling it, which is usually a daunting option. The alternative is to use the heap for all data structures that can grow in proportion to the size of the user's input. This rule has been followed rigorously in FunnelWeb. This means that as memory spaces increase, users will be able to hand their version of FunnelWeb more memory without having to recompile it. \topicbreak Some problems arose early on the Macintosh\x{Macintosh} in the use of the heap. It seems that some of the allocations I was attempting to make were failing for some obscure reason, possibly my fault. Whatever it was, it went away when I replaced direct calls to \p{malloc}\x{malloc} with calls to a mini package I wrote (called \p{memory}) that allocated large chunks of memory and then doled out small pieces as required by the rest of the program.\xx{memory}{package} Having a package to manage all the memory allocation had two other benefits. First, only one check was required in the entire program to see if memory had run out (in the memory package), and if that failed, the program could be brought to a screaming halt. This organization was far preferable to having each piece of code that needed to allocate memory having to check to see if \p{malloc} had failed. Second, the decision to construct a mini-shell within FunnelWeb to support regression testing meant that FunnelWeb proper could be run many times in any given invocation of FunnelWeb. As a consequence it was necessary to make sure that there was no memory leakage\xx{memory}{leakage} between invocations of FunnelWeb proper. This was accomplished by reworking the memory package to operate a watermark system. The user of the package, when requesting memory, could request \dq{temporary} or \dq{permanent}. If permanent, the memory package forgot that it had allocated the memory. If temporary, the memory package places the allocated block on a list. There was then a function in the memory package that could be called to deallocate all the temporary memory. Thus, so long as all requests for memory within FunnelWeb proper were for temporary memory, and that memory was freed at the end of every run, one could be sure that there was no memory leakage. \section{The Stack} \xx{stack}{memory}\xx{stack}{size} For a while during the development of FunnelWeb a particularly nasty bug proved extremely hard to find. The symptom was that FunnelWeb would crash, sometimes at random, but more often upon entering a particular function. In the end about a day of specific debugging was required before the problem was tracked down to a stack problem. It turned out that somehow (either the fault of the Macintosh or the THINK~C language system), only 6K was being allocated for stack space!!!!!!! This experience led me immediately to go through the entire program and eliminate (or remove to the heap) any automatic variable declarations that used more than one hundred or so bytes. The lesson is clearly that C~programs that use more than a few thousand bytes of stack space are risking their portability. All large data structures should be placed in the heap. \section{Static Variables} \xx{memory}{static}\xx{static}{variables} Static variables also proved a problem on the Macintosh. It turns out that the Macintosh\x{Macintosh} THINK~C compiler\xx{ThinkC}{compiler} does not allow more than 32K of statics \i{in the entire program}. For a while this restriction was a serious threat to the program as it was discovered that constant strings were included in this total! However, some searching revealed a compiler option that removed the strings from the static category. Nevertheless, the 32K limit is rather severe. Again, it seems that for portability reasons, C~programs that use a lot of static variables are risking their portability. As a result, the FunnelWeb code avoids static variables where possible in favour of the heap. \section{Implementing Text Indentation} \xx{text}{indentation} At one point during the development of FunnelWeb, text indentation was fully implemented. However, it was subsequently removed because it was considered a dangerous feature. This section records the way in which text indentation was implemented so that if the feature ever has to be put back, this technique can be used again. 1. Create a new field in the \p{sc\_t} record call \p{sc\_postn}. \begin{verbatim} char *sc_postn; /* Pointer in the range [sc_first,sc_last+1]. */ /* It is the minimum possible value of sc_postn for */ /* which EOL does not appear in *sc_postn..*sc_last. */ /* i.e. Points to the byte following the first EOL in */ /* the scrap or sc_first if EOL does not appear. */ \end{verbatim} 2. Modify the scanner so that it generates this field. Sendtext should be modified so that it accepts an argument for the \p{p\_postn} value. \begin{verbatim} LOCAL void sendtext P_((p_ps_t,char *,char *,char *,bool)); LOCAL void sendtext(p_tkps,p_first,p_last,p_postn,is_white) /* Appends a text token to the end of the token list. */ /* IN: p_ps is a pointer to a position structure giving the position of the */ /* first character of the token. */ /* IN: p_first and p_last point to the first and last byte of the text scrap. */ /* IN: p_postn has the same definition as sc_postn (see fwdata.h). */ /* IN: is_white should be set to TRUE iff scrap is entirely whitespace. */ p_ps_t p_tkps; char *p_first; char *p_last; char *p_postn; bool is_white; { tk_t token; /* Empty text scraps should never be generated. */ assert(p_first<=p_last,"sendtext: Text scrap bounds are bad."); /* If ch=EOL then we should be scanning more text, not shipping it! */ assert(ch!=EOL,"senttext: Shipping text while still more to scan."); /* Check that p_postn is in range. See definition in fwdata.h. */ assert(p_first<=p_postn && p_postn<=p_last+1, "sendtext: p_postn is out of range."); /* Debug: Check the p_postn field using a brute force check. */ { char *i,*j; j=p_first; for (i=p_first;i<=p_last;i++) if (*i==EOL) j=i+1; assert(j==p_postn,"sendtext: sc_postn field is incorrect."); } /* Load the text token. */ token.tk_kind = TK_TEXT; ASSIGN(token.tk_ps,*p_tkps); token.tk_sc.sc_first = p_first; token.tk_sc.sc_last = p_last; token.tk_sc.sc_postn = p_postn; token.tk_white = is_white; token.tk_parno = 0; ls_add(token_list,PV &token); } \end{verbatim} Then all the calls to sendtext have to be changed: \begin{verbatim} /* @ instructs FunnelWeb to replace the special construct with the */ /* special character. Luckily one appears just before the @ !! */ /* Note: FALSE is OK because space is not a legal specialch. */ /* Note: Setting parameter p_postn to p_ch-1 is OK as EOL is not a */ /* legal specialch. */ sendtext(&ps_spec,p_ch-1,p_ch-1,p_ch-1,FALSE); break; /* + instructs FunnelWeb to insert an EOL. We can't look to the end of */ /* the previous line to find an EOL as this might be the first line. */ /* Running ahead to the end of the line is expensive, and having the */ /* liner mini-package maintain a variable for it would be extra */ /* housekeeping. Instead of all this, we just point to a static. */ {CONST static char stateol = EOL; sendtext(&ps_spec,&stateol,&stateol,(&stateol)+1,TRUE);} break; /* If we hit something that ends a text token */ /* then we can transmit a white text token. */ if (ch==specialch || ch==EOFCH) {sendtext(&ps_start,p_first,p_ch-1,MAX(p_sol,p_first),TRUE); return;} /* Otherwise we have some more (non-white) text to scan. */ /* We can then send a non-white text token. */ while (ch!=specialch && ch!=EOFCH) NEXTCH; sendtext(&ps_start,p_first,p_ch-1,MAX(p_sol,p_first),FALSE); \end{verbatim} The dump code needs to be changed too! \begin{verbatim} wf_str(p_wf,"\""); assert(token->tk_sc.sc_first !=NULL,"dm_tkls: NULL ptr1."); assert(token->tk_sc.sc_last !=NULL,"dm_tkls: NULL ptr2."); for (i=token->tk_sc.sc_first; i<=token->tk_sc.sc_last; i++) { if (i==token->tk_sc.sc_postn) wf_str(p_wf,""); if (*i=='\n') wf_wl(p_wf,""); else dm_byte(p_wf,*((ubyte_ *) i)); } if (i==token->tk_sc.sc_postn) wf_str(p_wf,""); wf_str(p_wf,"\""); } \end{verbatim} 3. Over in the Tangle module, create a massive array of pointers to scraps to be used as a stack. Maintain pointers into the stack called \p{current} and \i{base} (similar to the blank indentation variables). Implement the following: \begin{itemize} \item To write out a scrap, scan it byte by byte. Output each byte. When you hit an EOL, pop the stack back to \p{base}. Then write out an EOL followed by the stack contents but writing each scrap only from \p{postn} to end end of each scrap. When you have finished the new scrap, push it on the stack. \item When you hit a new macro to expand, save \p{base}. Restore it later. \end{itemize} The \p{postn} field solves the big problem of how to cope with something like this: \begin{verbatim} The rain in Spain falls mainly @ \end{verbatim} The trouble is that we want to text indent the lines in \p{@} with just \dqp{falls mainly~}. However, this string is only part of a scrap. The solution is to get the scanner to record, in the \p{postn} field of each scrap, the position of the first byte with a EOL-free run to the end of the scrap. This scheme is very efficient because all we are doing is pushing and popping pointers to scraps on a stack array. The main disadvantage is that the array must necessarily be finite and would impose a limit on the depth of indentation nesting (not a big problem). %==============================================================================% % Start of Ch2.tex % %==============================================================================%