Heim > Artikel > Backend-Entwicklung > Teil I: Implementieren eines Ausdrucksinterpreters zum Erstellen von DSL – Einführung des PEG-Parsers

Teil I: Implementieren eines Ausdrucksinterpreters zum Erstellen von DSL – Einführung des PEG-Parsers

PHPzOriginal: 2024-08-05 20:40:20475Durchsuche

Part I: Implement an expression interpreter for building DSL - Introduce the PEG parser

Im letzten Beitrag habe ich euch vorgestellt, WARUM und WIE ich dieses Projekt beginne und WAS das DSL am Ende aussieht. Ab diesem Beitrag werde ich die Umsetzung des gesamten Projekts mit Ihnen teilen.

Wenn wir eine Sprache implementieren, denken wir im Allgemeinen zuerst an den Lexer und dann an den Parser. Deshalb werde ich Ihnen in diesem Beitrag vorstellen, wie ich mein DSL mit spezifizierten Details, aber weniger Konzeption umsetze, damit es hoffentlich nicht zu verwirrend wird.

Was ist Lexer?

Im Allgemeinen wird Lexer für die lexikalische Analyse oder, wenn Sie so wollen, Tokenisierung verwendet. Nehmen wir als Beispiel den Satz „We will rock you!“ (die berühmte Rock'n'Roll-Musik von Queen). Nachdem wir es gemäß der Grammatikregel des Englischen tokenisiert haben, soll es eine Liste ausgeben [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Interpunktion("!" )]. Daher ein Lexer wird hauptsächlich dazu verwendet, Texte anhand ihrer lexikalischen Bedeutung in bestimmte Typen zu klassifizieren. Dies ist für uns von Bedeutung, da eine Syntax tatsächlich aus lexikalischen Elementen und nicht aus Zeichen oder Wörtern besteht.

Normalerweise implementieren wir einen Lexer mit einem Codegenerator, der Regular Express wie Ragel, Nex usw. analysieren kann. Aber ich denke, Sie werden überrascht sein, wie einfach es ist, einen Lexer zu implementieren, nachdem Sie sich Lexical Scanning in Go von Rob Pike angesehen haben. Er hatte ein interessantes Muster zur Implementierung einer endlichen Zustandsmaschine eingeführt, die meiner Meinung nach den Kern eines Lexers darstellt.

Wie wäre es mit dem Parser?

Wie wäre es also mit einem Parser? Wofür wird es verwendet? Grundsätzlich ein Parser wird verwendet, um eine Liste lexikalischer Elemente mit dem angegebenen Muster zu identifizieren, das wir auch Grammatik genannt haben. Nehmen Sie das Beispiel „Wir werden Sie rocken!“, das wir zuvor vorgestellt haben, es erzeugt eine Folge von [Subjekt(„Wir“), Hilfssatz („wird“), Verb („Stein“), Objekt („Sie“), Interpunktion („!“)], die dem Muster von „Zukunftsform“ entspricht. Grammatik auf Englisch. Genau das macht ein Parser und wird als „Syntaxanalyse“ bezeichnet.

Nehmen wir ein weiteres, computergestützteres Beispiel: Wie wäre es mit einem Ausdruck wie 1 + 2 * 3? Es ist offensichtlich, dass sie vom Lexer in [Number(1), Operator(+), Number(2), Operator(*), Number(3)] übersetzt werden, aber in was diese Sequenz von einem Parser übersetzt wird mit einer gemeinsamen mathematischen Ausdrucksgrammatik? Im Allgemeinen wird eine Sequenz lexikalischer Elemente vom Parser wie folgt in einen Abstrakten Syntaxbaum (oder kurz AST) übersetzt:

Mit einem abstrakten Syntaxbaum können wir die Syntax in der richtigen Reihenfolge gemäß der Regel unserer Grammatik analysieren.

Lassen Sie uns einen Parser implementieren

Jetzt werden wir selbst einen Parser implementieren. Nun ja, wir sind nicht ganz auf uns allein gestellt, aber wir brauchen noch ein Tool, das uns hilft, Code für unseren Parser zu generieren, da es schwierig ist, ihn korrekt zu implementieren, und ein Handschrift-Parser möglicherweise schwierig zu warten ist, selbst wenn Sie dies tun, könnte die Leistung beeinträchtigt sein arm.

Glücklicherweise gibt es viele ausgereifte generative Parser-Tools, die uns helfen, Golang-Code mit einer Grammatikdefinitionsdatei zu generieren. Die erste Wahl, die mir in den Sinn kam, war go-yacc, das offizielle Tool zum Generieren von Go-Code für den Parser. Ich habe damit früher einen SQL-Analysator geschrieben und es war kein Vergnügen, weil es:

erfordert einen externen Lexer.
fehlende Dokumentation.
Union-Definition und Werttyp-Deklaration sind beide ziemlich verwirrend.

„Warum nicht etwas Neues ausprobieren?“ Ich denke, also los geht's, mit dem erstaunlichen Tool Peg war ich in der Lage, sowohl den Lexer als auch den Parser in einer einzigen Grammatikdatei, einer einzigen Schnittstelle, zu implementieren. Schauen wir uns den Stift genauer an.

Ein genauerer Blick auf PEG

PEG steht für „Parsing Expression Grammar“, das 2004 von Bryan Ford eingeführt wurde und eine Alternative zur traditionellen kontextfreien Grammatik darstellt, die zum Beschreiben und Ausdrücken der Syntax und des Protokolls von Programmiersprachen verwendet wird.

Seit Jahrzehnten verwenden wir generative Parser-Tools wie yacc und Bison, die CFG zum Generieren von Parser-Code unterstützen. Wenn Sie diese bereits verwendet haben, fällt es Ihnen möglicherweise schwer, Mehrdeutigkeiten zu vermeiden und sie in den Lexer oder den zu integrieren regulärer Ausdruck. Tatsächlich besteht die Syntax einer Programmiersprache nicht nur aus Mustern lexikalischer Elemente, sondern auch aus den Regeln dieser lexikalischen Elemente, die irgendwie im CFG fehlen. Wenn wir also Tools wie yacc verwenden müssen wir den Lexer selbst implementieren. Darüber hinaus müssen wir in CFG die Priorität für jeden Operator definieren, um Mehrdeutigkeiten zu vermeiden (wie die Priorität zwischen Plus und Multiplikation, sehen Sie sich das an). All diese entscheidenden Tatsachen machen es unnötig schwierig, einen Parser zu entwickeln.

But thanks to Bryan Ford, now we have another good choice, the PEG that allows us to define the lexical and syntax rule all in one single file with a tight DSL and resolve ambiguity in an elegant and simple way. Let me show you how easy it can be done with peg.

Example and code come first

I gonna take examples from my gendsl library which implements a lisp-like syntax(you can check it out here). Here is a simple snippet that can parse hex and decimal number literals in the golang style:

package playground

type parser Peg {
}

Script          <- Value EOF

EOF             <- !.

Value           <- IntegerLiteral

IntegerLiteral  <- [+\-]? ('0' ('x' / 'X') HexNumeral 
                           / DecimalNumeral ) [uU]?

HexNumeral      <- HexDigit ([_]* HexDigit)* / '0'

HexDigit        <- [0-9] / [A-F] / [a-f]

DecimalNumeral  <- [1-9] ([_]* [0-9])* / '0'     

# ...

The first line package gendsl is package declaration which decides which package the generated golang file belongs to. The following type declaration type parser Peg {} used to define the parser type, which we will use it later for evaluation but you can ignore it for now.

After the parser type declaration we can start to define your syntax rule till the end. This is different from the workflow I used to work with with yacc when I have to define a union type and a lot of token types before I can actually define my grammar, which could be really confusing. Anyway, let's take a quick look at the grammar definition.

The very first rule

If you have worked with CFG before you might find the definition DSL syntax quite familiar. The right hand side of the '<-' refers to the pattern of lexical elements, which could be some other patterns or character sequence, and the left hand side is the name of the pattern. Pretty easy, right?

Let's pay attention to the first pattern rule here since the first rule is always to entry point of the parser. The entry point Script here is consist of two parts, one is a rule refers to Value which is consist of a sequence of specified characters(we will get back to this later), the other one EOF is kind of interesting. Let's jump to the next rule to find the pattern of EOF. As you can see, EOF is consist of !.. What does !. mean? The !actually means NOT, and . means any character, so !. means NOTHING AT ALL or End Of File if you will. As a result whenever the parser find there is no character to read, it will stop here and treat it as an dummy rule call EOF which might produces the rule Script. This is quite a common pattern for PEG.

More about the rule syntax

So much like the regular expression(RE), the syntax of PEG is simple:

. stands for any character.
character set like [a-z] is also supported.
X matches one character if X is a character or a pattern when X is the name of an rule.
X? matches one character or pattern or none, while X* matches zero or more and 'X+' matches one or more.
X \ Y matches X or Y where X and Y could be characters, patterns or rule name.

Take the rule of DecimalNumeral as an example. The first part [1-9] means the start of an DecimalNumeral must be one of a digit ranging from 1 to 9, ([_]* [0-9])* means starting from the second position, all character, if there is any, must all be digit(0-9) that has might have no '_' or more than one '_' as its prefix so it could match string like "10_2_3". Otherwise, indicated by the operator \, it could also just be one single character '0' which means 0 obviously .

Resolving ambiguity

I'd like to spend more time to explain the or operator \, since it is quite important as the solution to the ambiguity. The PEG will always try to match the first pattern and then the second, the third and so on til it finds one matched, which is considered as earliest-match-first. For example, a string "ab" will never be able to match the grammar G <- 'a' / 'a' 'b', since the first character 'a' will be reduced to G but the 'b' left cannot match anything. By the way, CFG doesn't allow such a rule and will throw the reduce/shift conflict error.

There is no much syntax left, you can explore them yourself in the pointlander/peg README or peg doc.

Let's have a try

Now that we already have a simple syntax rule prepared above, though it is not the whole grammar for the gendsl project but it can still parse some numbers. Anyway let's generate some code and see if it works as we expect.

Preparation

First we have to install the peg binary tool for code generate following this guide, then we gonna setup our workspace directory for playing:

> mkdir peg_playground && peg_playground 
> go mod init peg_playground 
> touch grammar.peg

Paste the grammar we have before into the peg_playground/grammar.peg, now we should be able to genreate the code using the generate tool but why not make a Makefile in peg_playground/makefile

GO := go

.SUFFIXES: .peg .go

grammar.go: grammar.peg
    peg -switch -inline -strict -output ./$@ $<

all: grammar.go

clean:
    rm grammar.go

Generate and test the parser

Now that we have everything ready, let's generate the code for parser:

make grammar.go

After running the command, you should see a generated grammar.go in the workspace directory. Let's write a function to parse a string and access our parser:

// peg_playground/parser.go
package playground

func PrintAST(script string) error {
    parser := &parser{
        Buffer: script,
        Pretty: true,
    }

    if err := parser.Init(); err != nil {
        return err
    }
    if err := parser.Parse(); err != nil {
        return err
    }

    parser.PrintSyntaxTree()
    return nil
}

The snippet here is pretty simple, it initializes the parser, parses the script we pass to it and print the syntax tree in final. Let's write an unit test to see if it works:

// peg_playground/parser_test.go
package playground

import (
    "testing"
)

func TestPrintTree(t *testing.T) {
    if err := PrintAST(`0x123`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")

    if err := PrintAST(`10_2_3`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")
}

The test function TestPrintTree calls the PrintAST and check the error. Let's run it now and see what it gonna print:

go test . -v

Now we should see the whole syntax tree in the output:

=== RUN   TestPrintTree
Script "0x123"
 Value "0x123"
  IntegerLiteral "0x123"
   HexNumeral "123"
    HexDigit "1"
    HexDigit "2"
    HexDigit "3"
    parser_test.go:11: -----------------------------------------------------
Script "10_2_3"
 Value "10_2_3"
  IntegerLiteral "10_2_3"
   DecimalNumeral "10_2_3"
    parser_test.go:16: -----------------------------------------------------
--- PASS: TestPrintTree (0.00s)
PASS
ok      playground      0.649s

It looks great, right? Everything works as we expected. No syntax error thrown and it prints every rule matched and the string it matches in a format of tree, which could be really useful when debugging.

Wrap it up

In this post, I have introduced you the two basic but significant parts of interpreter programming language:

Lexer, for lexical analysis that transforms a string into a sequence of lexical elements.
Parser, for syntax analysis that identify the the pattern (so called grammar) in the lexical elements and produces a syntax tree.

And then I introduce the PEG for parser code generating, and address its advantages comparing the traditional CFG:

Lexer rule integrated, no standalone lexer need to be implemented.
Simple, regular expression like syntax to start up fast.
No ambiguity, no reduce/shift conflict, always earlier-match-first.

Finally I have a tiny demonstration of how to generate parser with PEG, which is the basis of our interpreter.
In next post, I will walk you through the gendsl grammar in detail.
Thank you for checking this post, hope you enjoy it.

Das obige ist der detaillierte Inhalt vonTeil I: Implementieren eines Ausdrucksinterpreters zum Erstellen von DSL – Einführung des PEG-Parsers. Für weitere Informationen folgen Sie bitte anderen verwandten Artikeln auf der PHP chinesischen Website!

Stellungnahme：

Der Inhalt dieses Artikels wird freiwillig von Internetnutzern beigesteuert und das Urheberrecht liegt beim ursprünglichen Autor. Diese Website übernimmt keine entsprechende rechtliche Verantwortung. Wenn Sie Inhalte finden, bei denen der Verdacht eines Plagiats oder einer Rechtsverletzung besteht, wenden Sie sich bitte an admin@php.cn

Vorheriger Artikel：Umleitung des Browsers über HTMXNächster Artikel：Umleitung des Browsers über HTMX

In Verbindung stehende Artikel

Mehr sehen