首頁  >  文章  >  後端開發  >  第一部分:實作用於建立 DSL 的表達式解釋器 - 介紹 PEG 解析器

第一部分:實作用於建立 DSL 的表達式解釋器 - 介紹 PEG 解析器

PHPz
PHPz原創
2024-08-05 20:40:20398瀏覽

Part I: Implement an expression interpreter for building DSL - Introduce the PEG parser

在上一篇文章中,我向您介紹了我為什麼以及如何開始這個項目,以及 DSL 最終是什麼樣子。從這篇文章開始,我將與大家分享整個專案的實現。

通常,當我們實作一種語言時,我們首先想到的是詞法分析器,然後是解析器。因此,在這篇文章中,我將向您介紹如何以指定的細節但較少的概念來實現我的 DSL,這樣我希望它不會太混亂。

什麼是詞法分析器?

一般來說,詞法分析器用於詞法分析或標記化(如果您願意的話)。我們以「We will rock you!」(皇后樂團著名的搖滾音樂)這句話為例。當我們按照英語的語法規則對其進行標記後,它將輸出一個列表[Subject("We"), Auxiliary("will"), Verb("rock"), Object("you"), Punctuation(" !" )]。因此詞法分析器主要用於根據其詞彙含義將文本分類為某些類型。 這對我們來說很重要,因為語法實際上由詞彙元素而不是字元或單字組成。

通常我們會使用一些程式碼產生器來實作詞法分析器,這些程式碼產生器可以解析正規表示式,如 Ragel、nex 等。但我想在檢查 Rob Pike 的 Go 中的詞法掃描之後,你會驚訝地發現實現詞法分析器是多麼容易。他引入了一種有趣的模式來實現有限狀態機,我認為這是詞法分析器的核心。

解析器怎麼樣?

那麼解析器怎麼樣?它有什麼用?基本上,解析器用於識別具有指定模式的詞彙元素列表,我們也稱之為語法。 以我們之前介紹過的「我們會震撼你!」為例,它產生一系列[主詞(“We”)、助動詞(“will”)、動詞(“rock”) 、受詞(“you”)、標點符號(“!”)],與“將來式”的模式匹配英語文法。這正是解析器所做的,即所謂的「語法分析」。

讓我們以更電腦化的方式再舉一個例子,像 1 + 2 * 3 這樣的表達式怎麼樣?很明顯,它們將被詞法分析器翻譯成[Number(1), Operator(+), Number(2), Operator(*), Number(3)],但是這個序列將被解析器翻譯成什麼具有通用的數學表達式語法?一般來說,詞彙元素序列會被解析器翻譯成抽象語法樹(簡稱AST),如下所示:

      *
     / \
    +   3
   /  \
  1    2

透過抽象語法樹,我們可以根據語法規則以正確的順序分析語法。

讓我們實作一個解析器

現在我們要自己實作一個解析器。好吧,並不完全靠我們自己,我們仍然需要一些工具來幫助我們為解析器生成程式碼,因為很難正確實現它,而且手寫解析器可能很難維護,即使你這樣做,性能也可能會很差。可憐。

幸運的是,有很多成熟的解析器產生工具可以幫助我們用語法定義檔產生golang程式碼。我想到的第一個選擇是 go-yacc,它是為解析器產生 go 程式碼的官方工具。我曾經用它來寫 SQL 分析器,但這並不是一種樂趣,因為它:

  • 需要外部詞法分析器。
  • 缺乏文件。
  • 聯合定義和值類型聲明都非常混亂。

「為什麼不嘗試一些新的東西呢?」我想,我們就這樣吧,有了這個令人驚嘆的工具釘,我能夠在一個語法檔案、一個介面中實現詞法分析器和解析器。讓我們仔細看看掛鉤。

仔細看 PEG

PEG 代表 Bryan Ford 於 2004 年提出的解析表達式語法,它是用於描述和表達程式語言語法和協定的傳統上下文無關語法的替代方案。

在過去的幾十年裡,我們一直在使用像yacc、bison 這樣支援CFG 的解析器生成工具來產生解析器程式碼,如果你以前使用過它們,你可能會發現很難避免歧義並將它們與詞法分析器或正規表示式。事實上,程式語言的語法不僅僅是詞法元素的模式,還包括這些詞法元素的規則,而CFG不知何故缺少,所以當我們使用像yacc這樣的工具我們必須自己實現詞法分析器。此外,為了避免 CFG 中的歧義(例如加法和乘法之間的優先級,請檢查這一點),我們必須定義每個運算符的優先級。所有這些關鍵事實使得開發解析器變得不必要的困難。

But thanks to Bryan Ford, now we have another good choice, the PEG that allows us to define the lexical and syntax rule all in one single file with a tight DSL and resolve ambiguity in an elegant and simple way. Let me show you how easy it can be done with peg.

Example and code come first

I gonna take examples from my gendsl library which implements a lisp-like syntax(you can check it out here). Here is a simple snippet that can parse hex and decimal number literals in the golang style:

package playground

type parser Peg {
}

Script          <- Value EOF

EOF             <- !.

Value           <- IntegerLiteral

IntegerLiteral  <- [+\-]? ('0' ('x' / 'X') HexNumeral 
                           / DecimalNumeral ) [uU]?

HexNumeral      <- HexDigit ([_]* HexDigit)* / '0'

HexDigit        <- [0-9] / [A-F] / [a-f]

DecimalNumeral  <- [1-9] ([_]* [0-9])* / '0'     

# ...                      

The first line package gendsl is package declaration which decides which package the generated golang file belongs to. The following type declaration type parser Peg {} used to define the parser type, which we will use it later for evaluation but you can ignore it for now.

After the parser type declaration we can start to define your syntax rule till the end. This is different from the workflow I used to work with with yacc when I have to define a union type and a lot of token types before I can actually define my grammar, which could be really confusing. Anyway, let's take a quick look at the grammar definition.

The very first rule

If you have worked with CFG before you might find the definition DSL syntax quite familiar. The right hand side of the '<-' refers to the pattern of lexical elements, which could be some other patterns or character sequence, and the left hand side is the name of the pattern. Pretty easy, right?

Let's pay attention to the first pattern rule here since the first rule is always to entry point of the parser. The entry point Script here is consist of two parts, one is a rule refers to Value which is consist of a sequence of specified characters(we will get back to this later), the other one EOF is kind of interesting. Let's jump to the next rule to find the pattern of EOF. As you can see, EOF is consist of !.. What does !. mean? The !actually means NOT, and . means any character, so !. means NOTHING AT ALL or End Of File if you will. As a result whenever the parser find there is no character to read, it will stop here and treat it as an dummy rule call EOF which might produces the rule Script. This is quite a common pattern for PEG.

More about the rule syntax

So much like the regular expression(RE), the syntax of PEG is simple:

  • . stands for any character.
  • character set like [a-z] is also supported.
  • X matches one character if X is a character or a pattern when X is the name of an rule.
  • X? matches one character or pattern or none, while X* matches zero or more and 'X+' matches one or more.
  • X \ Y matches X or Y where X and Y could be characters, patterns or rule name.

Take the rule of DecimalNumeral as an example. The first part [1-9] means the start of an DecimalNumeral must be one of a digit ranging from 1 to 9, ([_]* [0-9])* means starting from the second position, all character, if there is any, must all be digit(0-9) that has might have no '_' or more than one '_' as its prefix so it could match string like "10_2_3". Otherwise, indicated by the operator \, it could also just be one single character '0' which means 0 obviously .

Resolving ambiguity

I'd like to spend more time to explain the or operator \, since it is quite important as the solution to the ambiguity. The PEG will always try to match the first pattern and then the second, the third and so on til it finds one matched, which is considered as earliest-match-first. For example, a string "ab" will never be able to match the grammar G <- 'a' / 'a' 'b', since the first character 'a' will be reduced to G but the 'b' left cannot match anything. By the way, CFG doesn't allow such a rule and will throw the reduce/shift conflict error.

There is no much syntax left, you can explore them yourself in the pointlander/peg README or peg doc.

Let's have a try

Now that we already have a simple syntax rule prepared above, though it is not the whole grammar for the gendsl project but it can still parse some numbers. Anyway let's generate some code and see if it works as we expect.

Preparation

First we have to install the peg binary tool for code generate following this guide, then we gonna setup our workspace directory for playing:

> mkdir peg_playground && peg_playground 
> go mod init peg_playground 
> touch grammar.peg

Paste the grammar we have before into the peg_playground/grammar.peg, now we should be able to genreate the code using the generate tool but why not make a Makefile in peg_playground/makefile

GO := go

.SUFFIXES: .peg .go

grammar.go: grammar.peg
    peg -switch -inline -strict -output ./$@ $<

all: grammar.go

clean:
    rm grammar.go 

Generate and test the parser

Now that we have everything ready, let's generate the code for parser:

make grammar.go

After running the command, you should see a generated grammar.go in the workspace directory. Let's write a function to parse a string and access our parser:

// peg_playground/parser.go
package playground

func PrintAST(script string) error {
    parser := &parser{
        Buffer: script,
        Pretty: true,
    }

    if err := parser.Init(); err != nil {
        return err
    }
    if err := parser.Parse(); err != nil {
        return err
    }

    parser.PrintSyntaxTree()
    return nil
}

The snippet here is pretty simple, it initializes the parser, parses the script we pass to it and print the syntax tree in final. Let's write an unit test to see if it works:

// peg_playground/parser_test.go
package playground

import (
    "testing"
)

func TestPrintTree(t *testing.T) {
    if err := PrintAST(`0x123`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")

    if err := PrintAST(`10_2_3`); err != nil {
        t.Fatal(err)
    }
    t.Log("-----------------------------------------------------")
}

The test function TestPrintTree calls the PrintAST and check the error. Let's run it now and see what it gonna print:

go test . -v

Now we should see the whole syntax tree in the output:

=== RUN   TestPrintTree
Script "0x123"
 Value "0x123"
  IntegerLiteral "0x123"
   HexNumeral "123"
    HexDigit "1"
    HexDigit "2"
    HexDigit "3"
    parser_test.go:11: -----------------------------------------------------
Script "10_2_3"
 Value "10_2_3"
  IntegerLiteral "10_2_3"
   DecimalNumeral "10_2_3"
    parser_test.go:16: -----------------------------------------------------
--- PASS: TestPrintTree (0.00s)
PASS
ok      playground      0.649s

It looks great, right? Everything works as we expected. No syntax error thrown and it prints every rule matched and the string it matches in a format of tree, which could be really useful when debugging.

Wrap it up

In this post, I have introduced you the two basic but significant parts of interpreter programming language:

  • Lexer, for lexical analysis that transforms a string into a sequence of lexical elements.
  • Parser, for syntax analysis that identify the the pattern (so called grammar) in the lexical elements and produces a syntax tree.

And then I introduce the PEG for parser code generating, and address its advantages comparing the traditional CFG:

  • Lexer rule integrated, no standalone lexer need to be implemented.
  • Simple, regular expression like syntax to start up fast.
  • No ambiguity, no reduce/shift conflict, always earlier-match-first.

Finally I have a tiny demonstration of how to generate parser with PEG, which is the basis of our interpreter.
In next post, I will walk you through the gendsl grammar in detail.
Thank you for checking this post, hope you enjoy it.

以上是第一部分:實作用於建立 DSL 的表達式解釋器 - 介紹 PEG 解析器的詳細內容。更多資訊請關注PHP中文網其他相關文章!

陳述:
本文內容由網友自願投稿,版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容,請聯絡admin@php.cn