parsing - Alpha renaming in many languages -


i have imagine involved technical challenge: want able reliably alpha-rename identifiers in multiple languages (as many possible). require special consideration each language, , i'm asking advice how minimize amount of work need sharing code. unified parsing or abstract syntax framework has support many languages great.

for example, here python code:

def foo(x):     def bar(y):         return x+y     return bar 

an alpha renaming of x y changes x y , preserves semantics. become:

def foo(y):     def bar(y1):         return y+y1     return bar 

see how needed rename y y1 in order keep breaking code? why hard problem. seems program have have pretty knowledge of constitutes scope, rather doing, say, string search , replace.

i preserve of formatting possible: comments, spacing, indentation. not 100% necessary, nice.

any tips?

to safely, need able to determine

  • all identifiers (and things not, e.g., middle of comment) in code
  • the scopes of validity each identifer
  • the ability substitute new identifier old 1 in text
  • the ability determine if renaming identifier causes name shadowed

to determine identifiers accurately, need least langauge-accurate lexer. identifiers in php different in cobol.

to determine scopes of validity, have determine program structure in practice, since "scopes" defined such structure. means need langauge-accurate parser; scopes in php different scopes in cobol.

to determine names valid in scopes, need know language scoping rules. language may insist identifier x refer different xes depending on context in x found (consider object constructors named x different arguments). need able traverse scope structures according naming rules. single inheritance, multiple inheritance, overloading, default types pretty require build model of scopes programs, insert identifiers , corresponding types each scope, , climb point of encounter of identifier in program text through various scopes according language semantics. need symbol tables, inheritance linkages, asts, , ability navigage of these. these structures different php , cobol, share lots of common ideas need library common concept support.

to rename identifier, have modify text. in million lines of code, need point carefully. modifying ast node 1 way point carefully. actually, need modify all identifiers correspond 1 being renamed; have climb on tree find them all, or record in ast references exist can found easily. after modifyingy tree have regenerate source text after modifying ast. that's lot of machinery; see so answer on how prettyprint asts preseriving of stuff reasonably suggest should preserved. (your other choice keep track in ast of text string is, , read/patch/write file.)

before update file, need check haven't shadowed something. consider code:

 {  local x;      x=1;     {local y;      y=2;       {local z;          z=y          print(x);       }     }  } 

we agree code prints "1". decide rename y x. we've broken scoping, , print statement referred conceptually outer x refers x captured renamed y. code prints "2", our rename broke it. means 1 must check other identifiers in scopes in renamed variable might found, see if new name "captures" name weren't expecting. (this legal if print statement printed z).

this lot of machinery.

yes, there framework has of number of robust language front ends. see our dms software reengineering toolkit. has parsers producing asts, prettyprinters produce text asts, generic symbol table management machinery (including support multiple inheritance), ast visiting/modification machinery. ithas prettyprinting machinery turn asts text. has front ends c, c++, cobol , java implement name , type resolution (e.g. instanting symbol table scopes , identifier symbol table entry mappings); has front ends many other langauges don't have scoping implemented yet.

we've finished exercise in implementing "rename" java. (all above issues of course appeared). about start 1 c++.


Comments