Simplified Sem: akin to a "typed lisp" #385

saem · 2022-07-24T22:34:17Z

saem
Jul 24, 2022
Maintainer

Summary

Introduce new, or change existing (merge/split), node kinds (eventually symbols and types) to simplify the compiler. The current nkError, VM, and surrounding work are driving towards a design/implementation style that strongly prefers node/type/symbol kinds that describe everything required for initial analysis dispatch. This means no if statements to cover nil checks, none of the random flags we have everywhere, etc. The kind alone dictates most things, and the branches are independent of each other.

The rest is background and definitions, and I'll start posting concrete problems so the style can be formalized through discovery. The benefit of all this is that we end up with a better language and a nicer implementation, and it helps unblock some key pieces of the move to DOD.

Call to Action

give the background a read, nailing the style upfront is unlikely, so just get some familiarity
provide input on the concrete postings as they're posted within the discussion thread

Background:

The work on nkError has surfaced a "case/let" style along with a traditional compiler architecture cue of separate expansion/reduction/production style (descriptions below). Together these simplify the compiler code, improve human and compiler reasoning, and highlight design issues whose solutions improve the language all the way up to the syntax (interface). Overall, I consider this a big win.

The design issues remain and we need a way to solve them. The style does need to be well defined, but that happens by working through it rather than trying to write it all out too early when we have the least amount of information (discovery, not invention). So I'll post the style below as a temporary home and then this discussion thread can talk about concrete issues and solutions.

The particular strategy I'm putting forth here is to formalize the style and catalogue and/or fix the issues, both in a co-evolving iterative fashion. So instead of trying to solve a grand problem right off the bat, I'm going to post concrete issues under this discussion and, as they're solved, further build out a style guide (which at the moment amounts to pointing at a proc like semBlock).

Definitions

Summary of Principles

Some key principles that these styles drive towards:

composition requires independence: node kind processing branches must be free of cross-branch dependencies
simplicity throughout: feature implementation complexity all but guarantees flaws in the design and merit of the feature
code is data, data is code: syntax to codegen we're (partially) interpreting mini-programs, favour simple switch interpreters
completeness: nearly impossible in the presence of heavy control flow (special handling) and lack of exhaustiveness

`case/let style`

Used to describe a programming style in the compiler where:

the primary control flow of a proc is a simple case statement
facts/decisions or compiler queries are stored in self-describing let variables
- although we're attempting to avoid single use let opting for comments

A clear example of this at time of writing is semBlock:

nimskull/compiler/sem/semexprs.nim

Line 2941 in b4790ec

proc semBlock(c: PContext, n: PNode; flags: TExprFlags): PNode =

This style has a number of advantages: it's immediately clear what this proc expects, any additional control flow complexity stands out like a sore thumb, and the exhaustive nature of case statements means better reasoning at time of writing or reading, to name a few.
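
To make the shape concrete without quoting semBlock itself, here is a minimal sketch on a toy node type. Everything in it (Node, NodeKind, newNode, semToyBlock) is made up for illustration and is not the compiler's actual code; it only shows the single exhaustive case plus self-describing lets.

```nim
# Toy illustration of case/let style; none of these names exist in the compiler.
type
  NodeKind = enum
    nkEmpty, nkIntLit, nkBlock, nkError
  Node = ref object
    kind: NodeKind
    kids: seq[Node]

proc newNode(kind: NodeKind; kids: varargs[Node]): Node =
  Node(kind: kind, kids: @kids)

proc semToyBlock(n: Node): Node =
  ## The primary control flow is one exhaustive case over the node kind.
  case n.kind
  of nkBlock:
    # facts are gathered up front in self-describing lets
    let
      hasBody = n.kids.len > 0
      bodyIsEmpty = hasBody and n.kids[^1].kind == nkEmpty
    result =
      if hasBody and not bodyIsEmpty: n
      else: newNode(nkError, n)
  of nkEmpty, nkIntLit, nkError:
    # not a block: wrap it in an error node instead of guessing
    result = newNode(nkError, n)
```

Adding a new kind to the toy enum makes the case fail to compile until the new branch is written, which is the exhaustiveness benefit mentioned above.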

Expansion/Reduction/Production Style

Traditional compiler architecture suggests structuring semantic analysis in the following way:

1. our semantic analysis procedure sem1 is called on node A; we are in the expansion phase
   1. sem1 receives the node as primary input along with the environment(s) for querying
   2. sem1's caller might have derived some facts (attributes) and these are considered inherited
2. sem1 can do some early synthesis of attributes and/or queries, but the children are not yet analysed so this is usually minimal
3. sem1 analyses any of its children via analysis sub-routines, which again inherit attributes
4. sem1 may interleave the completion of child analysis sub-routines with additional analysis, as it moves into the reduction phase
5. finally sem1 finalizes the production (returned node) with all attribute data, which sem1's caller will inherit

The influence of this is far more subtle today, but the let usage and the way result is used should hint at the influences.
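
The flow described above can be sketched on a toy tree. This is an assumption-laden illustration, not compiler code: Kind, Node, the depth and total attributes, and this particular sem1 body are all hypothetical. depth plays the role of an inherited attribute and total the role of a synthesized one.

```nim
# Toy sketch of expansion -> reduction -> production; all names are hypothetical.
type
  Kind = enum
    kLit, kAdd
  Node = ref object
    kind: Kind
    value: int        # literal payload, only meaningful for kLit
    kids: seq[Node]
    depth: int        # inherited attribute: handed down by the caller
    total: int        # synthesized attribute: computed on the way back up

proc sem1(n: Node; inheritedDepth: int): Node =
  # expansion: take the node plus whatever the caller already derived
  result = Node(kind: n.kind, value: n.value, depth: inheritedDepth)
  case n.kind
  of kLit:
    # nothing to analyse below, so synthesis is immediate
    result.total = n.value
  of kAdd:
    for kid in n.kids:
      # children are analysed by sub-routines that inherit attributes
      let analysed = sem1(kid, inheritedDepth + 1)
      # reduction: fold each child's synthesized result in as it completes
      result.total += analysed.total
      result.kids.add analysed
  # production: `result` now carries every attribute for the caller
```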

reference: https://www.tutorialspoint.com/compiler_design/compiler_design_semantic_analysis.htm

saem · 2022-07-24T22:40:49Z

saem
Jul 24, 2022
Maintainer Author

Concrete Problem 1: nkStmtList and nfBlockArg

Issue: (convert to issue after this specific problem discussion concludes)

Problem Description

The node flag nfBlockArg is applied to an nkStmtList when parsing do: notation block statements. Recently, I made it so that nkStmtList and nkStmtListExpr nodes collapse during semantic analysis, as that makes the most sense from an evaluation perspective. The flag is what stops block arguments from being merged during that collapse.

Possible Solutions

I see two solutions, in both cases we introduce a node and then throw away the flag:

Introduce a new statement list kind (nkStmtListArg) that means don't collapse any children
Introduce a new list of statement lists (nkStmtListArgs), which doesn't collapse its children

One is at the child level and informs the parent, while the other is at the parent level and informs child handling. I think I prefer the second option.
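
A rough sketch of the second option on a toy AST, to show how the parent kind alone could drive the decision with no flag check. The node type and flatten proc are hypothetical, and nkStmtListArgs is the proposed kind, not an existing one.

```nim
# Hypothetical sketch; nkStmtListArgs is the proposed kind, not an existing one.
type
  NodeKind = enum
    nkIdent, nkCall, nkStmtList, nkStmtListArgs
  Node = ref object
    kind: NodeKind
    name: string
    kids: seq[Node]

proc flatten(n: Node): Node =
  result = Node(kind: n.kind, name: n.name)
  case n.kind
  of nkIdent:
    discard
  of nkStmtList:
    # nested statement lists merge into their parent
    for kid in n.kids:
      let flat = flatten(kid)
      case flat.kind
      of nkStmtList:
        result.kids.add flat.kids
      of nkIdent, nkCall, nkStmtListArgs:
        result.kids.add flat
  of nkCall, nkStmtListArgs:
    # the parent kind says: every child stays a separate node, so statement
    # lists are flattened internally but never merged upwards
    for kid in n.kids:
      result.kids.add flatten(kid)
```

Here the same nkStmtList child is merged or kept purely based on the kind of its parent, which is the "parent level informs child handling" variant.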

3 replies

saem Jul 24, 2022
Maintainer Author

@Clyybber got any thoughts on this one as you've done the most AST kind changes of the bunch?

@zerbina you and I have talked about thinking about the compiler internals sorta kinda like "lisp/racket", figured I'd ping you on this specific problem and/or you might have feedback on the overall discussion?

Clyybber Jul 25, 2022
Collaborator

I don't think another node kind is needed or will solve the issue as fundamentally you cannot collapse nkStmtList(-Expr)s that are passed as arguments into the caller "scope" like:

someCall:
  var a = 1
  a

cannot be turned into

var a = 1
someCall(a)

because someCall might be a macro or template and actually wants that whole statementlist/expr.
So flattening nkStmtList/-Expr is fine, as long as the flattening stays simple and doesn't cross call nodes (or other nodes).
After sem/expansion some of these "node-crossing" flattenings will be fine, but not necessary if we have an IR that is quite flat.

Somewhat related to this is potential removal of nkPar. Since nkStmtList/-Expr can't be flattened across call nodes anyways it makes a lot of sense to use them for precedence (and they are already, nkPar always wraps a single nkStmtList/-Expr) and get rid of nkPar. (There's still the question of whether there are macros that are fundamentally affected by a removal of nkPar though).

saem Jul 25, 2022
Maintainer Author

AFAIK, we have

someCall:
  someDo: # will become an nkstmtlist with nfBlockArg
  otherDo: # will become an nkstmtlist with nfBlockArg

The problem right now is that those do get transformed into stmtList and the flag is required to avoid merging. So a new node kind is required as it's not a simple stmtList, no?

saem · 2023-01-02T22:50:22Z

saem
Jan 2, 2023
Maintainer Author

These are some thoughts about a possible approach to handle AST (Abstract Syntax
Trees) and IRs (Intermediate Representations) -- note, an AST is an IR itself.
The reason the AST and IRs need "handling" is because they go through various
recursive changes over their lifetime in order to "lower" them to a tiny grammar
amenable to code generation, interpretation, suggestion/introspection,
documentation generation, etc. To top it all off, thanks to metaprogramming
and compile time execution, there are plenty of cycles in an already complex
process.

This "lowering" concept is key as we're effectively moving from one language to
a sub-set and/or simpler language. When metaprogramming or even incremental
compilation introduces cycles in this process, we have to support "raising",
where we take AST that's partially transformed and get it back to the consumer
in an untyped/typed format, where upon return we make no assumptions and start
with untyped syntax (eg: typed macros: typed -> untyped).

Idea Outline/Sketch

Briefly, the idea is to have a single AST/IR focusing on the sem, the parser,
and what gets passed to the backends.

type
  # introduce a "level" concept, representing language levels, or the result of
  # a compiler phase (lowering or raising)
  TAstLevel = enum
    ## the type and level names are illustrative, need improvement
    lvlUntyped # from the parser, a template, macro, etc 
    lvlNorm    # untyped but normalized
    lvlMeta    # metaprogramming annotations exist, macros/templates not applied
    lvlGnrc    # generics still exist, aren't instanced away
    lvlInst    # all generics are instanced, no generics exist
    lvlTransf  # for -> while, etc
    ...

  TNode = object
    level*: TAstLevel # use this to indicate the level of production

Then rework sem procs to be more regular and use the level field to signal
how far along an analysis left a node. Ultimately the output is
untyped/typed/transf/whatever to typed AST. By having the level field we can
have this same traversal pattern, but vary the analysis. For example, given
lvlInst then we know metaprogramming and generics "no longer exist" and we can
simplify the analysis.

The level field would be required on TType, where we can drop parts of the
type system as we lower below metaprogramming or generics. I haven't checked but
TSym more than likely needs the field as well.

Simultaneously, we can have consts and/or enums + conversion functions that
define what nodes/types/syms etc... we can and cannot see at each language
level. This allows for easy exhaustiveness and/or assert checking regardless of
the level.
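
As a sketch of what such consts could look like: the trimmed-down enums and the allowedAt table below are illustrative stand-ins (the real kind and level lists would be far longer, and none of these names are settled).

```nim
# Sketch only: the level and kind names are illustrative stand-ins.
type
  AstLevel = enum
    lvlUntyped, lvlMeta, lvlGnrc, lvlInst
  NodeKind = enum
    nkIdent, nkCall, nkMacroDef, nkGenericParams, nkProcDef

const
  metaKinds = {nkMacroDef}
  genericKinds = {nkGenericParams}
  allKinds = {low(NodeKind) .. high(NodeKind)}
  # which kinds may still appear once a node has reached a given level
  allowedAt: array[AstLevel, set[NodeKind]] = [
    lvlUntyped: allKinds,
    lvlMeta: allKinds,
    lvlGnrc: allKinds - metaKinds,
    lvlInst: allKinds - metaKinds - genericKinds
  ]

proc isValidAt(kind: NodeKind; level: AstLevel): bool =
  ## True if a node of `kind` is allowed to exist at `level`.
  kind in allowedAt[level]
```

A phase could then assert isValidAt(n.kind, n.level) on entry and exit, and a case over allowedAt[level] would give the exhaustiveness checking mentioned above.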

It would clarify things for everyone such as pragmas, and why/when/how to deal
with them and where they should disappear:

metaprogramming, disappear fairly early
metadata pragmas, sit around
type: effect/tag pragmas, lock levels, etc... stay late
- we can actually reuse some/weaker versions at lower layers!
compiler directives: all the way through the backend

Looking across various phases (sem, seminst, transf, lowerings, etc) the
traversals, structural analysis, and productions are all very similar/identical.
What's hard is how to reason about what can be received and what will be
produced. As I hope is clear from the language levels I sketched out, there are
many complex and interacting features, and unclear pre/post conditions make it
difficult to reason about.

A Brief Examination of Code Impact

I see compiler procs using it something like this, here are the first bits from
semProcAux, this is meant to be a very small illustration:

# original
proc semProcAux(c: PContext, n: PNode, kind: TSymKind,
                validPragmas: TSpecialWords, flags: TExprFlags = {}): PNode =
  result = semProcAnnotation(c, n, validPragmas)
  if result != nil: return result
  result = n
  checkMinSonsLen(n, bodyPos + 1, c.config)
  # ... rest of code ...

# with levels
proc semProcAux(c: PContext, n: PNode): PNode =
  result = semProcAnnotation(c, n)            # 1
  if result != nil: return result             # 2
  result = n                                  # 3
  case n.kind                                 # 4
  of nkProcDef, nkFuncDef:
    checkSonsLen(n, bodyPos + 1, c.config)
  # ... rest of case and code ...
  
  result.level = lvlGnrc                      # 5

With levels we can drop TSymKind, TSpecialWords, and TExprFlags. The
incoming node's kind and level are more than enough information. Examining
by comment number:

1. validPragmas is no longer needed; internally, semProcAnnotation at a level
   lower than lvlMeta now only handles type and compiler directives
2. this can be improved, but requires a few more changes, so not diving into
   that discussion yet, but between the level (lvlGnrc or lower) and kind
   (nkRoutineDefs), we can figure out what to process, so no more nil return
3. this mutation is required because semProcAux is used in so many ways, two
   "weird" ones are: a) previously typed def and/or consumption for a typed
   macro/template, b) analyse the definition of an existing symbol (TSym.ast),
   where there is an expectation for certain mutation, but not others... yeah
4. this would expand a lot, I didn't type it all out, but here we can handle
   methods, which have dispatcherPos, without impacting others; also, based on
   level we disqualify nkMacroDef or nkLambda as we could choose to lower
   past those.
5. this level is mostly true, we could drop it lower if a generic proc isn't
   produced, but that's an optimization.

Possible Extensions

adding an epoch or version field, which is bumped each time a node goes
through a phase, would let procs signal "sealing" information to each other. As
in don't change a node if its epoch is ahead of yours, likely assert to flag
errors.
with node/type/sym kinds + levels + a context variable, we can turn certain
analysis/lowering on and off. For example tuple unpacking, lambda lifting, etc
is a lot for a backend to implement; generic versions that can be
conditionally flagged on and off in sem might work for some.
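
A tiny sketch of the epoch extension, assuming a hypothetical epoch field and phase numbering (no such field exists today; Node, canMutate, and runPhase are made up for the example):

```nim
# Hypothetical sketch of the epoch idea; no such field exists today.
type
  Node = ref object
    text: string
    epoch: int   # bumped each time a phase finishes with the node

proc canMutate(n: Node; phaseEpoch: int): bool =
  ## A node whose epoch is ahead of ours was sealed by a later phase.
  n.epoch <= phaseEpoch

proc runPhase(n: Node; phaseEpoch: int; newText: string) =
  # assert to flag the error rather than silently changing a sealed node
  doAssert canMutate(n, phaseEpoch), "node was sealed by a later phase"
  n.text = newText
  n.epoch = phaseEpoch + 1
```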

Next Steps

Maybe some discussion here and then try out a cleaner form of the idea in a
branch.

Inspiration/Background

These ideas draw upon a lot of things, the most notable recent ones are:

macro output to untyped ast sanitizer PR by Zerbina
- the sanitizer "raises" the level of the AST, undoing analysis/clarity
- sanitizing per case is very nice and regular
- some nodes are plain dropped or "devolved"
Martin Odersky's Keynote: Simply Scala
- the Dotty compiler uses a single AST type throughout the compiler pipeline
- pre and post conditions are used in lieu of types between phases
- phases have regular traversals
- lots of good stuff here throughout
- most relevant bits are between time codes 7:43 to 23:01
Jonathan Goodwin's (Cone Lang author) blog about ASTs/IRs
- this is a good overview post about AST/IR management challenges
- great for those unfamiliar with the problem space
- their design priorities are spot on
- lots of other great posts, Cone is super interesting

0 replies

saem · 2023-02-09T21:44:43Z

saem
Feb 9, 2023
Maintainer Author

Stemming from "how make sem not hurt as much". Leaving it here in case someone has some insights.

I'm pretty sure this might be a baking and conceptualization gap.

Context:

At the AST level, expansion/reduction, inheritance/synthesis, and production are all pretty clear.

What gets fuzzy is when AST is hung off the symbol table, like in let/var/proc defs/etc part way through a semantic analysis procedure and we then do things like apply pragmas, further reason about types and the like -- semProcAux, semNormalizedLetOrVar, etc all do this.

Concerning part:

In both cases that symbol and its AST start diverging from the AST that's passed further down the compilation pipeline. Only to be summoned at some later point, typically identifier lookup or dispatch.

Additionally, we often re-sem the AST, which is in the past, aka decided, and that may result in mutations.

Questions/Line of Inquiry:

(Outlining my questions/what I'm trying to figure out, hopefully it's enough for someone to inject their own thoughts/questions that help get to ground.)

I think the big fuzzy part is that we don't have clear names and descriptions of the concepts at play wrt symbols and their AST handling.

Under the 'Concerning part' section, the first concern relates to the compiler doing some sort of symbolic evaluation (?), and some type level evaluation as well. The application of the pragmas during the analysis of vars is a must because we need to know if it's a threadvar, which lets us check for an initialization error. But that error generates an nkError, which IIRC doesn't make it back into the AST.

The second concern relates to the compiler potentially mutating what it shouldn't, but we don't have a clear way of differentiating symbol definition analysis vs AST generation.

Ultimately, I do believe the divergence is likely dangerous, but at times entirely necessary. Danger-wise it should only be tolerated within a proc's lifetime as it's accumulating results but shouldn't leave the proc in a diverged way (not counting helpers).

Even if I "just start" separating things into:

"symbol analysis" and its reintegration into the rest of the main production, or
the analysis of an existing symbol definition to "ensure typed symbol" vs "symbol fitting"

It clearly illustrates a big blank spot in the conceptual understanding and mental model, as indicated by the poor names.

PS: It's a somewhat similar problem for types and their AST.

1 reply

zerbina Feb 12, 2023
Collaborator

I'm unsure of how useful they are, but here are some first, mostly unstructured, thoughts:

Generally, I think that symbols should neither own nor store the AST of the entity associated with them. If we'd have a whole-module (or program) AST, then the symbol could only store a stable pointer (in the conceptual sense) into the AST, making any divergence impossible. The same would apply for types.
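
A minimal sketch of that "stable pointer" idea, under the assumption of a flat, module-wide node store. NodeId, ModuleAst, Sym, addNode, and defOf are all hypothetical names for illustration, not existing compiler types.

```nim
# Hypothetical sketch: symbols hold an index into one module-wide AST.
type
  NodeId = distinct int
  Node = object
    text: string
  ModuleAst = object
    nodes: seq[Node]
  Sym = object
    name: string
    def: NodeId    # stable handle, not a copy of the definition

proc addNode(m: var ModuleAst; n: Node): NodeId =
  ## Stores the node and returns its position as a handle.
  m.nodes.add n
  NodeId(m.nodes.high)

proc defOf(m: ModuleAst; s: Sym): Node =
  ## Looks the definition up through the handle on every access.
  m.nodes[int(s.def)]
```

Because the symbol only stores the handle, any later change to the definition in the module AST is what a lookup through the symbol sees, so the two cannot drift apart.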

If each kind of symbol is only generated in a single place (e.g. semNormalizedLetOrVar, semProcAux, etc.), then I think reasoning about them would become a bit easier. The whole process looks / would look like (listing it mainly for my own understanding):

1. macro/template pragmas only applicable on untyped AST are applied
2. the AST in a "name" slot is evaluated, which produces an identifier (or an error)
3. a basic symbol is constructed from the identifier and the context it's used in (procdef, identdef, etc.)
4. the symbol is registered with the context and associated with the AST of the definition
5. the rest of the definition is analysed
6. if there are macro/template pragmas only applicable on typed AST, they're applied, the symbol is deleted, and the whole process starts again from 1.
7. if there were no errors, the symbol is remembered in the AST via placing it in the "name" slot. Otherwise, an error node is placed there

Writing that down, it seems to me like a distinction between "symbol evaluation failed" and "semantic analysis of the associated AST failed" is missing. Or maybe they're the same thing?
