首页 >Java >java教程 >如何在数小时内开发代码分析器

如何在数小时内开发代码分析器

WBOY
WBOY原创
2024-08-14 18:43:32998浏览

静态分析是一个强大的工具,可以帮助开发人员控制代码质量。让我们尝试使用 Java 为 Lua 开发一个简单的分析器,看看静态分析器的底层是什么。

How to develop code analyzer in hours

小前言

我们将用 Java 编写 Lua 的静态分析器。

为什么是Lua?它的语法很简单,没有必要陷入细节之中。它也是一种很好的语言,尤其是与 JavaScript 相比。为什么选择Java?它为我们提供了全方位的技术栈,而且Java也是一种用户友好的开发语言。

不:)但是,本文介绍了我们如何在内部黑客马拉松中开发真正的分析仪试点。所以,这个标题不仅仅是一个标题诱饵——我们确实有 48 小时。我们的团队有三名开发人员,所以如果您想单独尝试这个,请准备好花费更多时间。

由于这只是一个试点,我们在文章中将我们的分析器称为“Mun”。

什么是分析仪?

在开始之前,最好先了解一下分析器是什么并规划出工作范围。总而言之,很明显:抓住代码,如果这里有问题就抱怨。我们到底需要什么?我们对静态分析的以下方面感兴趣:

  • 词法分析器和解析器。我们获取源代码并将其转换为易于使用的树(AST)。
  • AST(或抽象语法树)是将程序的数据结构表示为树的一种方式。它包含有关程序语法的数据。
  • 语义数据。仅语法数据不足以进行分析。因此,我们需要一种额外的机制来聚合树中的语义(含义)数据。例如,它可能是变量范围。
  • 数据流分析。如果我们想做一些深入的分析,我们可以尝试预测程序节点中的变量值。例如,它有助于捕获与零除法相关的错误。

听起来很乏味,但这只是分析器的核心。顺便说一句,编译器在前端也有同样的东西。然而,最棘手的部分(代码生成)就潜伏在后端。

核心的计划是什么?范围如下:

  1. 编写诊断规则来检测代码中的错误;
  2. 收集检测到的错误并发出报告;
  3. 创建我们自己的插件来查看警告(可选)。

确实,48小时太多了:)有些东西必须放弃,有些东西要精简,有些东西要重用。当我们阅读本文时,我们将考虑此类情况。

为了更好地理解,让我们将其图解一下:

How to develop code analyzer in hours

这足以开始了。您可以在我们的网站上了解有关静态分析的更多信息。

词法分析器和解析器

理论

所以,词法分析器和解析器是分析器的基础,没有它们我们就无法走得更远。如果我们想要正则表达式之外的任何东西。

使用词法分析器非常简单:将源代码作为文本处理很麻烦。因此,我们需要将其转换为中间表示,即将其拆分为标记。也就是说,词法分析器必须不聪明。这是必须具备的,因为所有最困难和最混乱的东西都会进入解析器。

然后让我们转向解析器。它接收传入的 token,计算出它们,然后构建 AST。这是一个简单的背景知识。

语言有语法,并且有不同类型的解析器可以使用它们。如果我们自己编写,最好为上下文无关语法编写简单的 LL(1) 解析器。正如我们之前所说,Lua 有直观的语法,所以这就足够了。

LL 表示从左到右读取输入字符串,并为其构造从左到右的输出。一般来说,仅仅查看当前标记 (LL(0)) 是不够的,因此我们可能需要查看前面的 k 个标记。这样的解析器称为 LL(k)。

但是,我们不需要它,因为我们不会写任何东西。为什么?我们只有 48 小时,没有时间创建解析器 - 特别是如果您从未开发过它。

选择的方法

我们必须用什么替代方案来编写我们的词法分析器和解析器?发电机!这是一整类实用程序,只需要我们提供一个专门描述的语法文件作为输入来生成解析器。

我们选择了 ANTLR v4。该工具也是用 Java 编写的,这使得它非常易于使用。经过多年的发展,它已经开始表现得很好。

问题也潜伏在这里——我们需要知道如何为解析器编写语法文件。幸运的是,在 GitHub 上找到现成的选项并不难,所以我们就从这里获取。

使用 ANTLR 配置项目后,让我们继续构建抽象语法树。

抽象语法树

说到树:ANTLR 经过精心设计,可以可视化代码分析。例如,下面是阶乘计算:

function fact (n)
  if n == 0 then
    return 1
  else
    return n * fact(n-1)
  end
end

print("enter a number:")
a = io.read("*number")
print(fact(a))

我们可以获得以下树:

How to develop code analyzer in hours

我们认为它可能更接近于分类的解析树。我们可以停在这里并开始使用它。

但我们不会这样做并将其转换为 AST。为什么?作为 Java 团队,使用与分析器中的树类似的树会更容易。您可以在此处阅读有关 Java 开发的更多信息。 Spoon 的示例类层次结构如下所示:

How to develop code analyzer in hours

好了,推迟手动工作已经足够了,是时候编写代码了。我们没有完全展示整个代码——它没有任何意义,因为它太大而且难看。为了让您稍微了解我们的思维方式,我将在剧透下留下一些代码片段,供感兴趣的人使用。

我们从树顶开始处理:

public void enterStart_(LuaParser.Start_Context ctx) {
    _file = new CtFileImpl();
    _file.setBlocks(new ArrayList<>());
    for (var chunk : ctx.children) {
        if (chunk instanceof LuaParser.ChunkContext) {
            CtBlock block = getChunk((LuaParser.ChunkContext)chunk);
            if (block == null)
                continue;

            block.setParent(_file);
            _file.getBlocks().add(block);
        }
    }
}

private CtBlock getChunk(LuaParser.ChunkContext ctx) {
    for (var block : ctx.children) {
        if (block instanceof LuaParser.BlockContext) {
            return getBlock((LuaParser.BlockContext)block);
        }
    }
    return  null;
}

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    block.setLine(ctx.start.getLine());
    block.setColumn(ctx.start.getCharPositionInLine());
    block.setStatements(new ArrayList<>());

    for (var statement : ctx.children) {
        if (statement instanceof LuaParser.StatContext) {
            var statements = getStatement((LuaParser.StatContext)statement);
            for (var ctStatement : statements) {
                ctStatement.setParent(block);
                block.getStatements().add(ctStatement);
            }
        }
    }
    return block;
}

我们只需从上到下遍历这棵树,并沿途构建我们的树。迟早,我们会到达终端节点。以下是函数参数:

private List<CtParameter> parseParameters(
    LuaParser.NamelistContext ctx,
    CtElement parent
) {
    var parameters = new ArrayList<CtParameter>();
    for (var child : ctx.children) {
        if (Objects.equals(child.toString(), ","))
            continue;

        var parameter = new CtParameterImpl();
        parameter.setParameterName(child.toString());
        parameter.setParent(parent);
        parameters.add(parameter);
    }
    return parameters;
}

这似乎也不是很有挑战性——我们只是将一个物体变成另一个物体。我们也将代码列表包含在这里。我们希望它的工作方式是清楚的。

坦率地说,从长远来看,这种方法不会带来太大的收益。我们不是将文本转换为树,而是将一棵树转换为另一棵树。这两项任务都相当乏味。我们有什么选择?

  • 我们可以从头开始编写词法分析器和解析器。这很好,但当我们受到截止日期和技能的限制时就不行了。
  • 我们可以配置 ANTLR 以立即输出所需的 AST。听起来好得令人难以置信,但我们仍然需要研究 ANTLR,这也是极大的时间浪费。
  • 解决方案快速上市。我们必须利用现有的资源,将生成的树转换为目标树。虽然不好,但是还可以忍受。
  • 我们根本不会转换。如果团队中没有 Java 分析器开发人员,我们早就这样做了。

如果我们没有为 AST 提供代码分析示例,那么本节显然是不完整的。您可以在剧透下检查树的漂亮打印,以进行相同的阶乘计算。

CtGlobal:
  CtFile:
    CtFunction: func
      Parameters:
        CtParameter: n
      CtBlock:
        CtIf:
          Condition:
            CtBinaryOperator: Equals
              Left:
                CtVariableRead: n
              Right:
                CtLiteral: 0
          Then block
            CtBlock:
              CtReturn:
                CtLiteral: 1    
          Else block
            CtBlock:
              CtReturn:
                CtBinaryOperator: Multiplication
                  Left:
                    CtVariableRead: n
                  Right:
                    CtInvocation: fact
                      Arguments:
                        CtBinaryOperator: Minus
                          Left:
                            CtVariableRead: n
                          Right:
                            CtLiteral: 1
    CtInvocation: print
      Arguments:
        CtLiteral: "enter a number:"
    CtAssignment:
      Left:
        CtVariableWrite: a
      Right:
        CtInvocation:
          CtFieldRead: 
            Target: io
            Field: read
          Arguments:
            CtParameter: "*number"
    CtInvocation: print
      Arguments:
        CtInvocation: fact
          Arguments:
            CtVariableRead: a

让我们对术语进行快速评论以结束本文。虽然最好将我们的实体称为树翻译器,但现在我们将其称为解析器。

游客

接下来,我们将开发执行分析并用于构建诊断规则的功能。这是一棵树的遍历。总的来说,很容易掌握如何实现树迭代。然而,我们需要对树节点做一些有用的事情。访客模式已登场。

你还记得经典 GoF 书中那个引人入胜的模式,它有一个很酷的实现,但使用场景却相当模糊?因此,我们有一个关于如何在实际情况中使用它的基准案例。为了保持文章简洁,我不会在书中展示它是如何实现的,但我会展示我们如何在分析器中实现它。

让我们从简单的树遍历开始。我们定义 CtScanner 类,并为单个项目及其集合添加两个 scan 方法。

public <T extends CtElement> void scan(T element) {
    if (element != null) {
        element.accept(this);
    }
}

public <T extends CtElement> void scan(List<T> elements) {
    if (elements != null) {
        for (var element : elements) {
            scan(element);
        }
    }
}

你看到这个来自CtElement接受了吗?在我们的例子中,任何继承 CtVisitable 接口的类都将实现 void accept(CtAbstractVisitor Visitor) 方法。稍后我们将讨论 CtAbstractVisitor。现在,只要知道 CtScanner 是从它继承的就足够了。

这就是 acceptCtAssignment:
中的样子

@Override
public void accept(CtAbstractVisitor visitor){
    visitor.visitCtAssignment(this);
}

是的,这是小菜一碟。节点在访问者中调用它们的方法。在我们的CtScanner中应该有树节点每个类的方法:

@Override
public void visitCtIf(CtIf ctIf) {
    scan(ctIf.getCondition());
    scan((CtStatement) ctIf.getThenStatement());
    scan((CtStatement) ctIf.getElseStatement());
}

@Override
public <T> void visitCtLiteral(CtLiteral<T> literal) {
}
@Override
public void visitCtStatement(CtStatement statement) {
}

// ....

现在让我们回到 CtAbstractVisitor,一个从 CtScanner 中提取的接口。它包含访问树节点的方法,但只有这些方法,没有 scan 方法。在访问者方法的实现中,我们要么为将来的重载留下一个占位符(如果它是终端节点),要么通过沿着树节点执行递归下降来继续展开树节点。这就是我们需要知道的才能继续。

Semantic parsing

Introduction

It'd seem that's all with the core. For example, now our analyzer can catch simple errors like the variable self-assignment. We call it the signature analysis. To provide more advanced analysis, we'd like to obtain more data about what's going on in the code. The parser has done its job, it's time to create new entities.

So far, we've completely followed the way of compilers regarding of the analyzer structure, but now our path diverges. If Lua had a compiler, it'd start analysis to ensure that the code is syntactically correct. Although our tools will continue to overlap in features.

For our needs, let's create SemanticResolver for two purposes:

  • define the variable scope;
  • evaluate the variable type based on the duck typing.

However, we'll call it from the parser along with its traversal. No decorator this time.

To keep the app structure seamless, we'll contain all the semantic data in the same AST nodes. First, let's define the necessary properties in the CtVariableAccess interface:

CtBlock getScope();
void setScope(CtBlock block);

TypeKind getTypeKind();
void setTypeKind(TypeKind type);

Variable scope

Let's start with scope; this will be our key tool for defining the variable along with its name. First, we define the variable entity inside SemanticResolver. To keep it short, we'll show you only the interface, but the gist should be clear:

public static class Variable {
    public Variable(String identifier);
    public String getIdentifier();
    public CtBlock getScope();
    public void setScope(CtBlock block);
    public void setType(TypeKind type);
    public TypeKind getType();
    // Methods use only identifier
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

Let's also further define the stack for variable scopes:

private final Stack<Pair<CtBlock, HashSet<Variable>>> stack = new Stack<>();
private final CtGlobal global;
public SemanticResolver(CtGlobal global) {
    pushStack(global);
    this.global = stack.peek();
}

public void pushStack(CtBlock block) {
    stack.push(Pair.of(block, new HashSet<>()));
}
public void popStack() {
    stack.pop();
}

The stack entry consists of a tuple of scope and variables set registered there. The stack operation is mundane. Here's how it works in the parser:

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    resolver.pushStack(block);

    // ....

    resolver.popStack();
    return block;
}

We have to register the variables somehow. If a variable is local, it's simple—let's take the current scope and pass the variable there:

public CtBlock registerLocal(Variable variable) {
    var scope = stack.pop();
    variable.setScope(scope.getLeft());
    scope.getRight().add(variable);
    stack.push(scope);
    return scope.getLeft();
}

If the local keyword isn't used, the variable will be either global or be declared somewhere above. So, we first go through the stack and double-check if it exists:

public CtBlock registerUndefined(Variable variable) {
    var pair = lookupPair(variable);
    pair.getRight().add(variable);
    return pair.getLeft();
}

public Pair<CtBlock, HashSet<Variable>> lookupPair(Variable variable) {
    var buf = new Stack<Pair<CtBlock, HashSet<Variable>>>();
    Pair<CtBlock, HashSet<Variable>> res = null;
    while (!stack.isEmpty()) {
        var scope = stack.pop();
        buf.push(scope);
        if (scope.getRight().contains(variable)) {
            res = scope;
            break;
        }
    }

    while (!buf.isEmpty()) {
        stack.push(buf.pop());
    }

    if (res == null) {
        return global;
    }

    return res;
}

We can set the scope to variables when we have the entry:

private CtVariableWrite getVariableWriteInternal(
        ParseTree ctx,
        boolean isLocal
) {
    var node = new CtVariableWriteImpl();
    node.setVariableName(ctx.getChild(0).toString());
    CtBlock scope;
    if (isLocal) {
        scope = resolver.registerLocal(
             new SemanticResolver.Variable(node.getVariableName()));
    } else {
        scope = resolver.registerUndefined(
            new SemanticResolver.Variable(node.getVariableName()));
    }
    node.setScope(scope);
    return node;
}

And defining it for readings:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName())
        );
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

We won't show the lookupScope code. It's a single-line wrapper over lookupPair, which we can see above. Here, we can end with the variable scope. We'll check the mechanism in the diagnostic rule on a separate section. Now we continue working on the semantic parsing. Now, let's move to the variable type determination.

Duck typing

How to determine the variable type? Indeed, we obtain them from the literals. Let's further define the type and the enumeration for them:

public interface CtLiteral<T> extends CtExpression, CtVisitable {
  // ....
  void setTypeKind(TypeKind kind);
  TypeKind getTypeKind();
}
public enum TypeKind {
    Undefined,
    Number,
    String,
    Boolean,
    Nil
}

Thus, the data type can be numeric, string, logical, and nil. However, it'll be undefined by default. The split of undefined and nil may seem far-fetched, but it's okay for the pilot.

We store the literal type only in the tree node, doing it from the parser:

private <T> CtLiteralImpl<T> createLiteral(
    // ....
    TypeKind type,
) {
    // ....

    literal.setTypeKind(type);
    return literal;
}

However, the variable type will be both in the tree and in SemanticResolver. So, we can request it during further traversal and the AST building:

private ArrayList<CtAssignment> parseAssignments(LuaParser.StatContext ctx) {
    // ....
    for (int i = 0; i < variables.size(); i++) {
        var assignment = new CtAssignmentImpl();
        var variable = variables.get(i);

        // ....


        variable.setTypeKind(resolver.lookupType(variable.getVariableName()));
        resolver.setType(
            variable.getVariableName(),
            variable.getScope(),
            SemanticResolver.evaluateExpressionType(expression)
        );
    }
    return assignments;
}

There is no error in the operator precedence. Let the stored variable have the type from its past assignment. It'll facilitate our work in the future. As for the methods used here, there's nothing incredible about the lookupType implementation—it's basically the same as lookupPair. There's nothing complex about setType:

public void setType(String variable, CtBlock scope, TypeKind type) {
    var opt = stack.stream()
             .filter(x -> Objects.equals(x.getLeft(), scope))
             .findFirst();
    if (opt.isPresent()) {
        var pair = opt.get();
        var newVar = new Variable(variable);
        var meta = pair.getRight()
                       .stream()
                       .filter(x -> x.equals(newVar))
                       .findFirst();
        meta.ifPresent(value -> value.setType(type));
    }
}

However, evaluateExpressionType is trickier. Computing variable types in dynamic languages can be a bit of a hassle. Just look at the jokes about JavaScript and string concatenation. However, firstly, Lua has a separate operator '..', and secondly, we're trying to keep the process simple. So, we'll only determine whether all the operands are the same type. We'll use the familiar CtScanner.

public static TypeKind evaluateExpressionType(CtExpression expression) {
    Mutable<TypeKind> type = new MutableObject<>(null);
    var typeEvaluator = new CtScanner() {
        private boolean stop = false;
        @Override
        public void scan(CtElement el) {
            if (stop) { return; }

            if (el instanceof CtVariableRead || el instanceof CtLiteral<?>) {
                var newType = el instanceof CtVariableRead
                              ? ((CtVariableRead) el).getTypeKind()
                              : ((CtLiteral<?>) el).getTypeKind();

                if (newType.equals(TypeKind.Undefined)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                } else if (type.getValue() == null) {
                    type.setValue(newType);
                } else if (!type.getValue().equals(newType)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                }
            }
            super.scan(el);
        }
    };
    typeEvaluator.scan(expression);
    return type.getValue();
}

In parseAssignments, we've set the assignment type to the (CtVariableWrite) variable but forgot about reading (CtVariableRead). Let's fix it:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        variableRead.setTypeKind(
            resolver.lookupType(variableRead.getVariableName())
        );
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName()));
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

We've completed the semantic analysis and almost ready to start searching for bugs.

Data-flow analysis

Inside structure

First, we make two quick stops. While the topic of the data-flow analysis deserves a series of articles, it'd be wrong to skip over it. Here, we won't dive deep into theory but just try to memorize the values set by literals.

First, let's fall into the sin of self-copying and define the variable entity for DataFlow again but in a simpler way. Again, we'll only show you the interface:

private static class Variable {
    private Variable(String identifier, CtBlock scope);
    // Methods use identifier and scope
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

Here's the rest of the class content:

public class DataFlow {
    private static class Variable {
        // ....
    }
    Map<Variable, Object> variableCache = new HashMap<>();

    public void scanDataFlow(CtElement element) {
        if (element instanceof CtAssignment) {
            CtAssignment variableWrite = (CtAssignment) element;
            if (variableWrite.getAssignment() instanceof CtLiteral<?>) {
                var assigned = variableWrite.getAssigned();
                var variable = new Variable(
                    assigned.getVariableName(),
                    assigned.getScope()
                );
                variableCache.put(
                    variable, 
                    getValue(variableWrite.getAssignment())
                );
            }
        }
    }

    public Object getValue(CtExpression expression) {
        if (expression instanceof CtVariableRead) {
            CtVariableRead variableRead = (CtVariableRead) expression;
            var variable = new Variable(
                variableRead.getVariableName(),
                variableRead.getScope()
            );
            return variableCache.getOrDefault(variable, null);
        } else if (expression instanceof CtLiteral<?>) {
            return ((CtLiteral<?>) expression).getValue();
        }
        return null;
    }
}

It's quite simple: in scanDataFlow, we associate the value with the variable, and in getValue, we extract that value for the set node. Everything is simple because we don't factor in branching, loops, or even expressions. Why don't we factor them in? Branching is the very topic that deserves its own series of articles. What about expressions? Well, we didn't make it in two days. Given what we've achieved in the last two days, we think this task is feasible. We just postpone it as a homework.

That's all. It's clear that such a solution is far from a real product, but we have laid some foundation. Then we can either try to enhance the code—and we'll introduce data-flow analysis via AST. Or we can redo everything properly and build the control-flow graph.

We've done the class implementation, but we haven't discussed how to use it. We've written the class, but what shall we do with it? Let's talk about it in the following section. Now we just say that DataFlow works right before calling the diagnostic rules. Those are called when traversing the finished AST. Thus, the rules will have access to the current variable values. This is called Environment; you can see it in your debugger.

Walker

Welcome to the last section regarding the core. We've already had the AST that is full of semantic data, as well as the data-flow analysis that is just waiting to be run. It's a good time to put it all together and set the stage for our diagnostic rules.

How is the analysis performed? It's simple: we get on the topmost tree node, and then we start recursive traversal. That is, we need something that will traverse the tree. We have CtScanner for that. Based on it, we define MunInvoker:

public class MunInvoker extends CtScanner {
    private final List<MunRule> rules = new ArrayList<>();
    private final Analyzer analyzer;

    public MunInvoker(Analyzer analyzer) {
        this.analyzer = analyzer;
        rules.add(new M6005(analyzer));
        rules.add(new M6020(analyzer));
        rules.add(new M7000(analyzer));
        rules.add(new M7001(analyzer));
    }

    @Override
    public <T extends CtElement> void scan(T element) {
        if (element != null) {
            analyzer.getDataFlow().scanDataFlow(element);
            rules.forEach(element::accept);
            super.scan(element);
        }
    }
}

You can notice a few unknown things in the code:

  • the Analyzer class. It encapsulates the whole analysis process and contains shared resources that need to be accessed within the rules. In our case, this is the DataFlow instance. We'll get back to Analyzer later;
  • four confusing classes that are added to rules. We'll talk about them in the next section, so don't panic. A little spoiler for you :)

Otherwise, the class operation shouldn't raise any questions. Each time we enter any tree node, the analyzer rule is called, and the variable values are evaluated right before it. Next, the traversal continues in line with to the CtScanner algorithm.

Analysis

Preparing for writing diagnostic rules

Rule class

So, we have the analyzer core prototype—it's a good time to start analyzing something.

The base for our rules is ready—it's the CtAbstractVisitor class. The analysis goes as follows: the rule overloads a few visitors and analyzes the data contained in the AST nodes. Let's extend CtAbstractVisitor with the abstract MunRule class, which we use to create rules. In this class, we also define the addRule method that generates warnings.

Speaking of warnings: what data do they need? First is a warning message to demonstrate users what they may have been mistaken about. Second, the user needs to know where and what the analyzer warns. So, let's add data about the file where the analyzer has detected the troublesome code block and the location of this code fragment.

Here's what the MunRule class looks like:

public abstract class MunRule extends CtAbstractVisitor {
    private Analyzer analyzer;
    public void MunRule(Analyzer analyzer) {
        this.analyzer = analyzer;
    }
    protected Analyzer getAnalyzer() {
        return analyzer;
    }

    protected void addRule(String message, CtElement element) {
        var warning = new Warning();
        warning.message = message;

        WarningPosition pos = new WarningPosition(
            Analyzer.getFile(), 
            element.getLine(), 
            element.getColumn() + 1
        );
        warning.positions.add(pos);
        analyzer.addWarning(warning);
    }

    public DataFlow getDataFlow() {
        return analyzer.getDataFlow();
    }
}

The WarningPosition and Warning classes are just data stores, so we won't list them. We'll talk about addWarning now.

Merging it together

The last thing to prepare is how we're going to view the diagnostic rules. To do this, we combine all our features together, using the already mentioned Analyzer class. Actually, here it is:

public class Analyzer {
    private DataFlow dataFlow = new DataFlow();

    public DataFlow getDataFlow() {
        return dataFlow;
    }

    public CtElement getAst(String pathToFile) throws IOException {
        InputStream inputStream = new FileInputStream(pathToFile);
        Lexer lexer = new LuaLexer(CharStreams.fromStream(inputStream));

        ParseTreeWalker walker = new ParseTreeWalker();
        var listener = new LuaAstParser();
        walker.walk(listener, new LuaParser(
            new CommonTokenStream(lexer)
        ).start_());
        return listener.getFile();
    }

    protected void addWarning(Warning warning) {
        Main.logger.info(
            "WARNING: " + warning.code + " "
            + warning.message + " ("
            + warning.positions.get(0).line + ", "
            + warning.positions.get(0).column + ")");
      }      

    public MunInvoker getMunInvoker() {
        return new MunInvoker(this);
    }

    public void analyze(String pathToFile) {
        try {
            var top = getAst(pathToFile);
            var invoker = getMunInvoker();
            invoker.scan(top);
        }
        catch (IOException ex) {
            Main.logger.error("IO error: " + ex.getMessage());
        }
    }
}

To explain how it works, we'll give you an example of the whole analysis process:

  1. In the getAst method, we build our AST using the scheme lexer —> parser —> tree translator;
  2. Then we call MunIvoker, which traverses the tree and calls our diagnostic rules along with the data-flow analysis;
  3. If necessary, the rules call the Analyzer class to get the DataFlow instance;
  4. They call addWarning when the analyzer spots suspicious code fragment. That one just outputs the message to the log.

We've got the prep work—it's time to start writing diagnostic rules.

Writing diagnostic rules

Assigning a variable to itself

We've decided to start writing the rules with a simple one: PVS-Studio has the Java diagnostic rule, V6005, where the analyzer check if a variable is assigned to itself. It can simply be copied and slightly adapted to our tree. Since our analyzer is called Mun, we start the numbers of our diagnostic rules with M. Let's create the M6005 class, extending MunRule, and override the visitCtAssignment method in it. The following check will be located in the method:

public class M6005 extends MunRule {
    private void addRule(CtVariableAccess variable) {
        addRule("The variable is assigned to itself.", variable);
    }
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        if (RulesUtils.equals(assignment.getAssigned(), 
            assignment.getAssignment())) {
            addRule(assignment.getAssigned());
        }
    }
}

The used RulesUtils.equals method is a wrapper and overload for another equals method that checks the name and scope:

public static boolean equals(CtVariableAccess left, CtVariableAccess right) {
    return left.getVariableName().equals(right.getVariableName())
           && left.getScope().equals(right.getScope());
}

We need to check the scope, because the next code fragment isn't an assignment of the variable to itself:

local a = 5;
begin
    local a = a;
end

Now, we can test the diagnostic rule on some simple code and see if it works. The following example will issue the warning on the marked line: "M6005 The variable is assigned to itself":

local a = 5;
local b = 3;

if (b > a) then
    a = a;      <=
end

Zero division

Well, we've warmed up, so let's move on. The analyzer has already had the primitive data-flow analysis (DataFlow) that we can and should use. Again, let's look at one of our existing diagnostic rules, V6020, where the analyzer checks for the zero division. We try to adapt it for our analyzer. The divisor can be either the variable as zero or a literal as zero. We need to access the cache of variables to check their value.

Here's the simple implementation of such a diagnostic rule:

public class M6020 extends MunRule {
  private void addWarning(CtElement expression, String opText) {
    addRule(String.format(
            "%s by zero. Denominator '%s' == 0.",
            opText, expression instanceof CtLiteral
                ? ((CtLiteral) expression).getValue()
                : ((CtVariableRead) expression).getVariableName()
        ),
        expression
    );
  }

  @Override
  public void visitCtBinaryOperator(CtBinaryOperator operator) {
    BinaryOperatorKind opKind = operator.getKind();
    if (opKind != BinaryOperatorKind.DIV && opKind != BinaryOperatorKind.MOD) {
      return;
    }

    apply(operator.getRightHandOperand(), opKind == BinaryOperatorKind.MOD);
  }


  private void apply(CtExpression expr, boolean isMod) {
    Object variable = getDataFlow().getValue(expr);
    if (variable instanceof Integer) {
      if ((Integer) variable == 0) {
        String opText = isMod ? "Mod" : "Divide";
        addWarning(expr, opText);
      }
    }
  }
}

We can see that the diagnostic rule works on simple examples and issues the following warning: "M6020 Divide by zero. Denominator 'b' == 0" on these lines:

local a = 5;
local b = 0;
local c = a / b;   <=
local d = a / 0;   <=

If you have made the expression evaluator as your homework, you may try the diagnostic rule on this code:

local b = 7;
local b = b - 7;
local c = a / b;

Overwritten types

Since we're writing the Lua analyzer, we need to write diagnostic rules for this language. Let's start simple.

Lua is a dynamic scripting language. Let's use this feature and write the diagnostic rule that allows us to catch overwritten types.

We'll also need to pick a new number for the diagnostic rule. Earlier we just copied them from the Java analyzer. Now, it seems like it's time to start a new thousand—the seventh. Who knows what diagnostic rule it'll be in PVS-Studio, but at the time of writing this article, it will be Lua.

Knowing about types, we facilitate our case: we need to check that the left and right assignments are of different types. Note that we ignore Undefined for the left part and Nil for the right part. The code looks like this:

public class M7000 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var assigned = assignment.getAssigned();
        var exprType = SemanticResolver.evaluateExpressionType(
            assignment.getAssignment());
        if (assigned.getTypeKind().equals(TypeKind.Undefined)
            || exprType.equals(TypeKind.Nil)
        ) {
            return;
        }

        if (!assigned.getTypeKind().equals(exprType)) {
            addRule(
                String.format(
                    "Type of the variable %s is overridden from %s to %s.",
                    assigned.getTypeKind().toString(),
                    exprType.toString()
                assigned,
            );
        }
    }
}

It's time to check the diagnostic rule in real cases. In the following example, our analyzer issues the warning only on the last line:

local a = "string";

if (true) then
  local a = 5;
end

a = 5;     <=

The analyzer warning: "M7000 Type of the variable a is overridden from Integer to String".

Lost local

Let's finish smoothly. The Lua plugin for VS Code has a diagnostic rule that detects global lowercase variables. This check can help detect the forgotten local identifiers. Let's implement the same diagnostic rule in our analyzer.

Here, just as before, we'll need to use the variable scope data that is obtained via the semantic analysis. We just find where the variable was declared. That's also where the variable is assigned a value. Then check its scope and name. If the variable is global and starts with a lowercase, the analyzer will warn. Easy-peasy.

Let's create a new class and override the visitCtAssignment method in it again. That way, we can look for the problematic global variables:

public class M7001 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var variable = assignment.getAssigned();
        var firstLetter = variable.getVariableName().substring(0, 1);

        if (variable.getScope() instanceof CtGlobal && 
            !firstLetter.equals(firstLetter.toUpperCase())) {
            addRule("Global variable in lowercase initial.", variable);
        }
    }
}

Let's check the diagnostic rule work. It issues the warning: "M7001 Global variable in lowercase initial." It's issued on the second line in the code snippet below:

function sum_numbers(b, c)
    a = b + c;  <=
    return a;
end

local a = sum_numbers(10, 5);

Alright, we've written the diagnostic rules, we've done with the analyzer (and it even works). Breathe out. Now we can enjoy the fruits of our labors.

Viewing warnings

Above we've already shown the code that allows us to view warnings in a console or file. Here is how the process of working with the analyzer looks like in the console. We run the analyzer using the command:

java -jar mun-analyzer.jar -input "C:\munproject\test.lua"

And we get:

How to develop code analyzer in hours

However, first, the analyzer is a tool, and it should be user-friendly. Instead of working with the console, it'd be much more convenient to work with a plugin.

又到了借钱的时间了。 PVS-Studio 已经为许多不同的 IDE 提供了插件,例如 Visual Studio Code 和 IntelliJ IDEA。您可以查看警告并浏览它们。由于分析器报告具有标准化格式,因此我们可以简单地借用 Java 分析器的 JSON 报告生成算法。该算法广泛且枯燥,因此我们不会展示它。我们仍然必须从命令行运行它,但使用参数* -output "D:git/MunProject/report.json"*.

然后,我们可以在 IntelliJ IDEA 或 VS Code 中打开报告并查看 Lua 分析器警告:

How to develop code analyzer in hours

甜甜的!现在我们可以将分析器用于其全部预期目的,而无需牺牲可用性。

星际探索

那么,我们已经编写了成熟的分析器了吗?呃,不完全是。在这段漫长旅程的终点​​,我们拥有了最真实的飞行员,经历了所有阶段。然而,增长空间是巨大的:

  1. 核心增强功能:
    • 增强数据流分析;
    • 考虑控制流;
    • 添加过程间和模块间分析。您可以在这里阅读我们如何用 C++ 做到这一点;
    • 添加注释机制以帮助增强数据流分析和鸭子类型;
    • 提供更多语义数据;
    • 微调现有机制;
    • 并且不要忘记更好的解析器。
  2. 诊断规则的增强:
    • 增强现有的;
    • 写更多新的;
    • 编写超出几十行的更复杂的诊断规则。
  3. 分析器可用性增强:
    • 创建适当的插件支持;
    • 集成到 CI/CD 中。
  4. 单元测试和回归测试,用于检查诊断规则在开发和修改过程中的性能。

还有很多很多。换句话说,从试点到成熟工具的道路相当荆棘。因此,PVS-Studio 专注于现有的方向:C#、C、C++ 和 Java,而不是新的方向。如果您使用这些语言中的任何一种编写,您可以尝试我们的分析器。

结语

这篇文章比我们想象的要长很多,所以如果您读到了最后,请发表评论:)我们很乐意收到您的反馈。

如果您对此主题感兴趣,您可以阅读有关为我们的语言开发分析器的信息:

  1. 开发新的静态分析器:PVS-Studio Java
  2. Roslyn 简介及其在程序开发中的使用
  3. 为 C# 创建基于 Roslyn API 的静态分析器

以上是如何在数小时内开发代码分析器的详细内容。更多信息请关注PHP中文网其他相关文章!

声明:
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系admin@php.cn