php - Compiling an AST back to source code -


i'm in process of building php parser written in php, no existing parser came in my previous question. parser itself works well.

now parser little (apart static analysis). apply transformations ast , compile source code. applying transformations isn't of problem, normal visitor pattern should do.

what problem is, how compile ast source. there 2 possibilities see:

  1. compile code using predefined scheme
  2. keep formatting of original code , apply 1. on nodes changed.

for concentrate on 1. 2. seems pretty hard accomplish (but if got tips concerning that, hear them).

but i'm not sure design pattern can used compile code. easiest way see implement this, add ->compile method nodes. drawback see here, pretty hard change formatting of generated output. 1 need change nodes in order that. i'm looking different solution.

i have heard visitor pattern can used this, too, can't imagine how supposed work. understand visitor pattern have nodetraverser iterates recursively on nodes , calls ->visit method of visitor. sounds pretty promising node manipulation, visitor->visit method change node passed, don't know how can used compilation. obvious idea iterate node tree leaves root , replace visited nodes source code. somehow doesn't seem clean solution?

the problem of converting ast source code called "prettyprinting". there 2 subtle variations: regenerating text matching original as possible (i call "fidelity printing"), , (nice) prettyprinting, generates nicely formatted text. , how print matters depending on whether coders working on regenerated code (they want fidelity printing) or intention compile (at point legal prettyprinting fine).

to prettyprinting requires more information classic parser collects, aggravated fact parser generators don't support extra-information collection. call parsers collect enough information "re-engineering parsers". more details below.

the fundamental way prettyprinting accomplished walking ast ("visitor pattern" put it), , generating text based on ast node content. basic trick is: call children nodes left-to-right (assuming that's order of original text) generate text represent, interspersing additional text appropriate ast node type. prettyprint block of statements might have following psuedocode:

 prettyprintblock:      print("{"}; printnewline();      call prettyprint(node.children[1]); // prints out statements in block      print("}"); printnewline();      return;    prettyprintstatements:      i=1,number_of_children          call prettyprint(node.children[i]); print(";"); printnewline(); // print 1 statement      endo      return; 

note spits out text on fly visit tree.

there's number of details need manage:

  • for ast nodes representing literals, have regenerate literal value. harder looks if want answer accurate. printing floating point numbers without losing precision lot harder looks (scientists hate when damage value of pi). string literals, have regenerate quotes , string literal content; have careful regenerate escape sequences characters have escaped. php doubly-quoted string literals may bit more difficult, not represented single tokens in ast. (our php front end (a reengineering parser/prettyprinter represents them expression concatenates string fragments, enabling transformations applied inside string "literal").

  • spacing: languages require whitespace in critical places. tokens abc17 42 better not printed abc1742, ok tokens ( abc17 ) printed (abc17). 1 way solve problem put space wherever legal, people won't result: whitespace. not issue if compiling result.

  • newlines: languages allow arbitrary whitespace can technically regenerated single line of text. people hate this, if going compile result; have @ generated code , makes impossible. need way introduce newlines ast nodes representing major language elements (statements, blocks, methods, classes, etc.). isn't hard; when visiting node representing such construct, print out construct , append newline.

  • you discover, if want users accept result, have preserve properties of source text wouldn't think store literals, may have regenerate radix of literal; coders having entered number hex literal not happy when regenerate decimal equivalent though means same thing. likewise strings have have "original" quotes; languages allow either " or ' string quote characters , people want used originally. php, quote use matters, , determines characters in string literal has escaped. languages allow upper or lower case keywords (or abbreviations), , upper , lower case variable names meaning same variable; again original authors typically want original casing back. php has funny characters in different type of identifiers (e.g., "$") you'll discover isn't there (see $ variables in literal strings). people want original layout formatting; have store @ column-number information concrete tokens, , have prettyprinting rules when use column-number data position prettyprinted text in same column when possible, , if so-far-prettyprinted line filled past column.

  • comments: standard parsers (including 1 implemented using zend parser, i'm pretty sure) throw comments away completely. again, people hate this, , reject prettyprinted answer in comments lost. principal reason prettyprinters attempt regenerate code using original text (the other copy original code layout fidelity printing, if didn't capture column-number information). imho, right trick capture comments in ast, ast transformations can inspect/generate comments too, makes own design choice.

all of "extra" information collected reenginering parser. conventional parsers don't collect of it, makes printing acceptable asts difficult.

a more principled approach distinguishes prettyprinting purpose nice formatting, fidelity printing purpose regenerate text match original source maximal extent. should clear @ level of terminals, pretty want fidelity printing. depending on purpose, can pretty print nice formatting, or fidelity printing. strategy use default fidelity printing when ast hasn't been changed, , prettyprinting has (because change machinery doesn't have information column numbers or number radixes, etc.). transformations stamp ast nodes newly generated "no fidelity data present".

an organized approach prettyprinting nicely understand virtually text-based programming language rendered nicely in terms of rectangular blocks of text. (knuth's tex document generator has idea, too). if have set of text boxes representing pieces of regenerated code (e.g., primitive boxes generated directly terminal tokens), can imagine operators composing boxes: horizontal composition (stack 1 box right of another), vertical (stack boxes on top of each other; in effect replaces printing newlines), indent (horizontal composition box of blanks), etc. can construct prettyprinter building , composing text boxes:

 prettyprintblock:      box1=primitivebox("{"); box2=primitivebox("}");      childbox=prettyprint(node.children[1]); // gets box statements in block      resultbox=verticalbox(box1,indent(3,childbox),box2);      return resultbox;  prettyprintstatements:      resultbox=emptybox();      i=1,number_of_children          resultbox=verticalbox(resultbox,horizontalbox(prettyprint(node.children[i]); primitivebox(";")      endo      return; 

the real value in node can compose text boxes produced children in arbitrary order arbitrary intervening text. can rearrange huge blocks of text way (imagine vbox'ing methods of class in method-name order). no text spit out encountered; when root reached, or ast node known children boxes have been generated correctly.

our dms software reengineering toolkit uses approach prettyprint languages can parse (including php, java, c#, etc.). instead of attaching box computations ast nodes via visitors, attach box computations in domain-specific text-box notation

  • h(...) horizontal boxes
  • v(....) vertical boxes
  • i(...) indented boxes)

directly grammar rules, allowing succinctly express grammar (parser) , prettyprinter ("anti-parser") in 1 place. prettyprinter box rules compiled automatically dms visitor. prettyprinter machinery has smart enough understand how comments play this, , that's frankly bit arcane have once. dms example:

block = '{' statements '}' ; -- grammar rule recognize block of statements <<prettyprinter>>: { v('{',i(statements),'}'); }; 

you can see bigger example of how done wirth's oberon programming language prettyprinter showing how grammar rules , prettyprinting rules combined. php front end looks lot bigger, obviously.

a more complex way prettyprinting build syntax-directed translator (means, walk tree , build text or other data structures in tree-visted order) produce text-boxes in special text-box ast. text-box ast prettyprinted tree walk, actions trivial: print text boxes. see technical paper: pretty-printing software reengineering

an additional point: can of course go build machinery yourself. same reason choose use parser generator (its lot of work make one, , work doesn't contribute goal in interesting way) same reason want choose off-the-shelf prettyprinter generator. there lots of parser generators around. not many prettyprinter generators. [dms 1 of few has both built in.]


Comments