Part I: Implement an expression interpreter for building DSL - Introduce the PEG parser

In the previous post, I introduced why and how I started this project and what the DSL finally looks like. Starting from this post, I will share the implementation of the whole project with you.

Usually, when we implement a language, the first thing that comes to mind is the lexer, and then the parser. So in this post I will show you how I implemented my DSL with concrete details rather than abstract concepts, which I hope will keep things from getting too confusing.

What is a lexer?

Generally speaking, a lexer is used for lexical analysis, or tokenization if you prefer. Take the sentence "We will rock you!" (the famous rock song by Queen) as an example. Once we tokenize it according to the grammar rules of English, it yields a list [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation("!")]. So a lexer is mainly used to classify text into typed elements according to their lexical meaning. This matters to us because grammar is actually composed of lexical elements rather than raw characters or words.
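
To make that concrete, here is a minimal Go sketch of what a lexer's output could look like; the token kinds and the hard-coded token list are purely illustrative and are not part of any real library or of my DSL:

package main

import "fmt"

// TokenKind classifies a piece of text by its lexical meaning.
type TokenKind string

// Token pairs a kind with the text it was produced from.
type Token struct {
    Kind TokenKind
    Text string
}

func main() {
    // What a lexer would hand to the parser for "We will rock you!".
    tokens := []Token{
        {"Subject", "We"},
        {"Auxiliary", "will"},
        {"Verb", "rock"},
        {"Object", "you"},
        {"Punctuation", "!"},
    }
    for _, t := range tokens {
        fmt.Printf("%s(%q) ", t.Kind, t.Text)
    }
    fmt.Println()
}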

Usually we would implement a lexer with some code generator that can parse regular expressions, such as Ragel or nex. But I think you will be surprised by how easy it is to implement a lexer yourself after checking out Rob Pike's talk Lexical Scanning in Go. He introduces an interesting pattern for implementing a finite state machine, which I consider the core of a lexer.
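
The heart of that pattern is a state function that returns the next state function, so the whole state machine collapses into a single loop. Here is a minimal sketch of the idea in Go (my own simplified illustration of the pattern, with no error handling, not Rob Pike's original code):

package main

import "fmt"

// stateFn represents one state of the lexer; it does some work and
// returns the next state, or nil when the input is exhausted.
type stateFn func(*lexer) stateFn

type lexer struct {
    input  string
    pos    int
    tokens []string
}

// lexNumber consumes a run of digits and emits it as a single token.
func lexNumber(l *lexer) stateFn {
    start := l.pos
    for l.pos < len(l.input) && l.input[l.pos] >= '0' && l.input[l.pos] <= '9' {
        l.pos++
    }
    l.tokens = append(l.tokens, l.input[start:l.pos])
    if l.pos >= len(l.input) {
        return nil // end of input, stop the machine
    }
    return lexOperator
}

// lexOperator consumes one character as an operator token.
func lexOperator(l *lexer) stateFn {
    l.tokens = append(l.tokens, string(l.input[l.pos]))
    l.pos++
    return lexNumber
}

func main() {
    l := &lexer{input: "1+2*3"}
    // The finite state machine is just this loop.
    for state := stateFn(lexNumber); state != nil; {
        state = state(l)
    }
    fmt.Println(l.tokens) // [1 + 2 * 3]
}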

What about the parser?

So what about the parser? What is it for? Basically, a parser is used to recognize a list of lexical elements that follows a specified pattern, which we also call a grammar. Take the "We will rock you!" example we introduced before: it produces a sequence [Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation("!")], which matches the "future tense" pattern of English grammar. That is exactly what a parser does, the so-called syntax analysis.

Let's take another example in a more computer-oriented fashion: what about an expression like 1 + 2 * 3? Obviously it will be translated by the lexer into [Number(1), Operator(+), Number(2), Operator(*), Number(3)], but what will this sequence be translated into by a parser equipped with the usual grammar of mathematical expressions? Generally, a sequence of lexical elements is translated by the parser into an abstract syntax tree (AST for short), like this:

      +
     / \
    1   *
       / \
      2   3

With an abstract syntax tree, we can analyze the syntax in the correct order according to the grammar rules.
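
For instance, a minimal Go sketch of such an AST and its bottom-up evaluation could look like this (the node types here are made up for illustration; they are not the ones gendsl actually uses):

package main

import "fmt"

// Node is an AST node of an arithmetic expression.
type Node interface {
    Eval() int
}

// Number is a leaf node holding a literal value.
type Number int

func (n Number) Eval() int { return int(n) }

// BinOp is an inner node; its operator is applied only after
// both of its children have been evaluated.
type BinOp struct {
    Op          byte
    Left, Right Node
}

func (b BinOp) Eval() int {
    l, r := b.Left.Eval(), b.Right.Eval()
    if b.Op == '+' {
        return l + r
    }
    return l * r
}

func main() {
    // The tree for 1 + 2 * 3: '*' binds tighter, so it sits below '+'.
    ast := BinOp{'+', Number(1), BinOp{'*', Number(2), Number(3)}}
    fmt.Println(ast.Eval()) // 7
}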

Let's implement a parser

Now we are going to implement a parser ourselves. Well, not entirely by ourselves: we still need some tool to help us generate the code for the parser, because it is hard to implement one correctly, and a handwritten parser can be hard to maintain; even if you manage it, the performance may well be poor.

Fortunately, there are plenty of mature parser-generator tools that can help us generate Go code from a grammar definition file. The first choice that came to my mind was go-yacc, the official tool that generates Go code for parsers. I once used it to write a SQL analyzer, but it was not much fun, because it:

  • Requires an external lexer.
  • Lacks documentation.
  • Has really confusing union definitions and value type declarations.

"Why not try something new?" I thought. And here we go: with the amazing tool peg, I was able to implement the lexer and the parser within one single grammar file and one interface. Let's take a closer look at peg.

A closer look at PEG

PEG stands for Parsing Expression Grammar, introduced by Bryan Ford in 2004. It is an alternative to the traditional context-free grammar (CFG) used to describe and express the syntax of programming languages and protocols.

For the past few decades we have been using CFG-based parser-generator tools like yacc and bison to produce parser code. If you have used them before, you probably found it hard to avoid ambiguity and to glue them together with a lexer or regular expressions. In fact, the grammar of a programming language consists not only of the patterns of lexical elements but also of the rules for those lexical elements themselves, which CFG somehow leaves out, so when we use a tool like yacc we have to implement the lexer on our own. Besides, to avoid ambiguity in a CFG (for example, the precedence between addition and multiplication), we have to define the precedence of every operator. All of these facts make developing a parser unnecessarily hard.

But thanks to Bryan Ford, now we have another good choice: PEG allows us to define the lexical and syntax rules all in one single file with a compact DSL, and resolves ambiguity in an elegant and simple way. Let me show you how easily it can be done with peg.

Example and code come first

I am going to take examples from my gendsl library, which implements a Lisp-like syntax (you can check it out here). Here is a simple snippet that can parse hex and decimal number literals in the Go style:

package playground

type parser Peg {
}

Script          <- Value EOF

EOF             <- !.

Value           <- IntegerLiteral

IntegerLiteral  <- [+\-]? ('0' ('x' / 'X') HexNumeral 
                           / DecimalNumeral ) [uU]?

HexNumeral      <- HexDigit ([_]* HexDigit)* / '0'

HexDigit        <- [0-9] / [A-F] / [a-f]

DecimalNumeral  <- [1-9] ([_]* [0-9])* / '0'     

# ...                      

The first line, package playground, is the package declaration, which decides which package the generated Go file belongs to. The following type declaration, type parser Peg {}, defines the parser type; we will use it later for evaluation, but you can ignore it for now.

After the parser type declaration we can start defining our syntax rules all the way to the end of the file. This is different from the workflow I was used to with yacc, where I had to define a union type and a lot of token types before I could actually define my grammar, which could be really confusing. Anyway, let's take a quick look at the grammar definition.

The very first rule

If you have worked with CFG before, you might find this definition DSL quite familiar. The right-hand side of the '<-' is the pattern of lexical elements, which can be other patterns or character sequences, and the left-hand side is the name of the pattern. Pretty easy, right?

Let's pay attention to the first rule here, since the first rule is always the entry point of the parser. The entry point Script consists of two parts: one is a reference to the rule Value, which consists of a sequence of specified characters (we will get back to this later); the other one, EOF, is kind of interesting. Let's jump to the next rule to find the pattern of EOF. As you can see, EOF consists of !.. What does !. mean? The ! actually means NOT, and . means any character, so !. means NOTHING AT ALL, or End Of File if you will. As a result, whenever the parser finds there is no character left to read, it stops there and treats it as a dummy rule called EOF, which together with Value produces the rule Script. This is quite a common pattern for PEG.

More about the rule syntax

Much like regular expressions (RE), the syntax of PEG is simple:

  • . stands for any character.
  • Character sets like [a-z] are also supported.
  • X matches a single character when X is a character, or a pattern when X is the name of a rule.
  • X? matches X once or not at all, while X* matches X zero or more times and X+ matches X one or more times.
  • X / Y matches X or Y, where X and Y can be characters, patterns or rule names.

Take the rule DecimalNumeral as an example. The first part [1-9] means that a DecimalNumeral must start with a digit ranging from 1 to 9. ([_]* [0-9])* means that from the second position on, every character, if there is any, must be a digit (0-9), possibly prefixed by zero or more '_', so it can match a string like "10_2_3". Otherwise, as indicated by the choice operator /, it can also be just the single character '0', which obviously means zero.
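
If you are more comfortable with regular expressions, here is a rough Go translation of the DecimalNumeral rule. This is only a sketch for building intuition; the parser that peg generates does not use regexp at all:

package playground

import (
    "fmt"
    "regexp"
)

func main() {
    // Roughly equivalent to: DecimalNumeral <- [1-9] ([_]* [0-9])* / '0'
    decimalNumeral := regexp.MustCompile(`^(?:[1-9](?:_*[0-9])*|0)$`)

    // Prints: 10_2_3 true, 0 true, 012 false, 1__0 true
    for _, s := range []string{"10_2_3", "0", "012", "1__0"} {
        fmt.Println(s, decimalNumeral.MatchString(s))
    }
}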

Resolving ambiguity

I'd like to spend a bit more time explaining the choice ("or") operator /, since it is quite important as the solution to ambiguity. A PEG parser always tries to match the first alternative, then the second, the third and so on, until it finds one that matches, which is known as earliest-match-first. For example, the string "ab" will never be able to match the grammar G <- 'a' / 'a' 'b', since the first character 'a' is already reduced to G by the first alternative, and the remaining 'b' cannot match anything. By the way, a CFG-based tool would reject such a rule with a reduce/shift conflict error.
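
To see the earliest-match-first behaviour in action, here is a tiny hand-rolled Go sketch of G <- 'a' / 'a' 'b' (just an illustration of the semantics, not code generated by peg). G itself succeeds on "ab", but it only consumes the 'a', so a grammar that also requires EOF would reject the input:

package playground

import (
    "fmt"
    "strings"
)

// matchG implements G <- 'a' / 'a' 'b' with PEG semantics:
// the alternatives are tried in order and the first success wins.
func matchG(input string) (consumed int, ok bool) {
    if strings.HasPrefix(input, "a") {
        return 1, true // first alternative matched; 'a' 'b' is never tried
    }
    if strings.HasPrefix(input, "ab") {
        return 2, true // unreachable: anything starting with "ab" also starts with "a"
    }
    return 0, false
}

func main() {
    input := "ab"
    n, ok := matchG(input)
    // G matches but consumes only "a"; the trailing "b" is left over,
    // so the parse of the whole input fails.
    fmt.Println(ok, n == len(input)) // true false
}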

There is not much syntax left; you can explore the rest yourself in the pointlander/peg README or the peg doc.

Let's give it a try

Now we have the simple syntax rules prepared above. They are not the whole grammar of the gendsl project, but they can already parse some numbers. Anyway, let's generate some code and see if it works as we expect.

Preparation

First we have to install the peg binary for code generation following its guide, then we set up our workspace directory for playing:

> mkdir peg_playground && cd peg_playground
> go mod init peg_playground 
> touch grammar.peg

Paste the grammar we wrote above into peg_playground/grammar.peg. Now we should be able to generate the code using the peg tool, but why not put the command in a Makefile at peg_playground/makefile:

GO := go

.SUFFIXES: .peg .go

grammar.go: grammar.peg
    peg -switch -inline -strict -output ./$@ $<

all: grammar.go

clean:
    rm grammar.go 

Generate and test the parser

Now that we have everything ready, let's generate the code for parser:

make grammar.go

After running the command, you should see a generated grammar.go in the workspace directory. Let's write a function to parse a string and access our parser:

// peg_playground/parser.go
package playground

func PrintAST(script string) error {
    parser := &parser{
        Buffer: script,
        Pretty: true,
    }

    if err := parser.Init(); err != nil {
        return err
    }
    if err := parser.Parse(); err != nil {
        return err
    }

    parser.PrintSyntaxTree()
    return nil
}

The snippet here is pretty simple: it initializes the parser, parses the script we pass to it, and finally prints the syntax tree. Let's write a unit test to see if it works:

// peg_playground/parser_test.go
package playground

import (
    "testing"
)

func TestPrintTree(t *testing.T) {
    if err := PrintAST(`0x123`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")

    if err := PrintAST(`10_2_3`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")
}

The test function TestPrintTree calls PrintAST and checks the error. Let's run it now and see what it prints:

go test . -v

Now we should see the whole syntax tree in the output:

=== RUN   TestPrintTree
Script "0x123"
 Value "0x123"
  IntegerLiteral "0x123"
   HexNumeral "123"
    HexDigit "1"
    HexDigit "2"
    HexDigit "3"
    parser_test.go:11: -----------------------------------------------------
Script "10_2_3"
 Value "10_2_3"
  IntegerLiteral "10_2_3"
   DecimalNumeral "10_2_3"
    parser_test.go:16: -----------------------------------------------------
--- PASS: TestPrintTree (0.00s)
PASS
ok      playground      0.649s

It looks great, right? Everything works as we expected: no syntax error is thrown, and it prints every rule it matched along with the string that rule matched, in a tree format, which can be really useful for debugging.

Wrap it up

In this post, I have introduced the two basic but significant parts of an interpreted programming language:

  • Lexer, for lexical analysis that transforms a string into a sequence of lexical elements.
  • Parser, for syntax analysis, which identifies the pattern (the so-called grammar) in the lexical elements and produces a syntax tree.

Then I introduced PEG for generating parser code, and covered its advantages compared with the traditional CFG:

  • Lexer rules are integrated, so no standalone lexer needs to be implemented.
  • Simple, regular-expression-like syntax that is quick to get started with.
  • No ambiguity and no reduce/shift conflicts: it is always earliest-match-first.

Finally, I gave a tiny demonstration of how to generate a parser with peg, which is the basis of our interpreter.
In the next post, I will walk you through the gendsl grammar in detail.
Thank you for checking out this post, I hope you enjoy it.

