JVM in-depth learning: sample code for parsing Class files in Java
As a Java programmer, how can you not understand the JVM? And to learn the JVM you must understand the Class file. The Class file is to the virtual machine what water is to a fish: the virtual machine comes alive because of it. "In-depth Understanding of Java Virtual Machine" spends an entire chapter on Class files, but after reading it I was still only half clear. Some time ago I happened to read a very good book, "Write Your Own Java Virtual Machine", whose author implements a simple JVM in Go. It does not implement every JVM feature, but for anyone with even a slight interest in the JVM it is very readable. The author explains things in detail, one stage per chapter, and one of those chapters covers how to parse Class files.
The book is not thick and I read it quickly, and I gained a lot from it. But what you learn on paper always feels shallow; to really understand something you have to do it yourself, so I tried parsing a Class file on my own. Go is an excellent language, but I am not proficient in it, and in particular I am not used to its syntax of putting the type after the variable name, so I stuck with Java.
Java is cross-platform because compilation does not turn the code directly into platform-specific machine code. Instead, the compiler emits Java bytecode in binary form and stores it in a Class file; the virtual machine then loads the Class file and parses out what it needs to run the program. Each class is compiled into its own class file, and an inner class is likewise compiled as an independent class with its own class file.
Find a class file and open it with Sublime Text, and all you see is a dump of raw bytes.
Confused? Fortunately, the Java Virtual Machine Specification gives the basic format of a class file, and you only need to parse the file according to this format:
    ClassFile {
        u4 magic;
        u2 minor_version;
        u2 major_version;
        u2 constant_pool_count;
        cp_info constant_pool[constant_pool_count-1];
        u2 access_flags;
        u2 this_class;
        u2 super_class;
        u2 interfaces_count;
        u2 interfaces[interfaces_count];
        u2 fields_count;
        field_info fields[fields_count];
        u2 methods_count;
        method_info methods[methods_count];
        u2 attributes_count;
        attribute_info attributes[attributes_count];
    }
The field types in ClassFile are u1, u2, and u4. What are these? It is very simple: they are values of 1 byte, 2 bytes and 4 bytes respectively.
The first four bytes are the magic, which identifies the file format. It is usually called the magic number, and it lets the virtual machine recognize whether the file being loaded is in class format. The magic number of a class file is 0xCAFEBABE. This is not specific to class files: most binary file formats begin with a magic number that identifies them.
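As a small illustration of the magic-number check, here is a minimal sketch. The class name `MagicCheck` and the method `hasClassMagic` are made up for this example and are not part of the author's project.

```java
class MagicCheck {
    // Returns true if the array starts with the four bytes CA FE BA BE.
    // (Hypothetical helper, named for this example only.)
    static boolean hasClassMagic(byte[] header) {
        if (header.length < 4) {
            return false;
        }
        // Java bytes are signed, so mask each one before comparing.
        return (header[0] & 0xFF) == 0xCA
                && (header[1] & 0xFF) == 0xFE
                && (header[2] & 0xFF) == 0xBA
                && (header[3] & 0xFF) == 0xBE;
    }

    public static void main(String[] args) {
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04}; // "PK\3\4", the ZIP magic
        System.out.println(hasClassMagic(classHeader)); // true
        System.out.println(hasClassMagic(zipHeader));   // false
    }
}
```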
What follows the magic number and version is information about the class itself: the constant pool, class access flags, superclass, interfaces, fields, methods and so on. For the details, see the "Java Virtual Machine Specification".
As mentioned above, the field types u1, u2 and u4 are unsigned integers of 1, 2 and 4 bytes. Java has no unsigned integer types: short, int and long are signed integers of 2, 4 and 8 bytes respectively. Each unsigned value is therefore stored in the next larger signed type, so short, int and long hold u1, u2 and u4 without the sign bit getting in the way.
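    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UncheckedIOException;

    // U1.java: reads one unsigned byte into a short.
    public class U1 {
        public static short read(InputStream inputStream) {
            byte[] bytes = new byte[1];
            try {
                // readFully fills the whole array; a plain read() may return fewer bytes.
                new DataInputStream(inputStream).readFully(bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return (short) (bytes[0] & 0xFF); // mask to drop the sign extension
        }
    }

    // U2.java: reads two bytes, big-endian, into an int.
    public class U2 {
        public static int read(InputStream inputStream) {
            byte[] bytes = new byte[2];
            try {
                new DataInputStream(inputStream).readFully(bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            int num = 0;
            for (int i = 0; i < bytes.length; i++) {
                num <<= 8;
                num |= (bytes[i] & 0xff);
            }
            return num;
        }
    }

    // U4.java: reads four bytes, big-endian, into a long.
    public class U4 {
        public static long read(InputStream inputStream) {
            byte[] bytes = new byte[4];
            try {
                new DataInputStream(inputStream).readFully(bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            long num = 0;
            for (int i = 0; i < bytes.length; i++) {
                num <<= 8;
                num |= (bytes[i] & 0xff);
            }
            return num;
        }
    }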
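To see why the `& 0xFF` mask matters, the following sketch prints a byte before and after masking and then combines four bytes into an unsigned value. `UnsignedDemo` and `toUnsigned` are illustrative names, not part of the original code.

```java
class UnsignedDemo {
    // Combines big-endian bytes into one non-negative value, masking each byte
    // so that its sign bit is not smeared across the upper bits.
    static long toUnsigned(byte[] bytes) {
        long value = 0;
        for (byte b : bytes) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xCA;
        System.out.println(b);        // -54, because Java bytes are signed
        System.out.println(b & 0xFF); // 202, the unsigned value

        byte[] magic = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        System.out.println(Long.toHexString(toUnsigned(magic))); // cafebabe
    }
}
```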
With the field types defined, we can start reading the class file. First come the basics such as the magic number, and this part is very simple:
    FileInputStream inputStream = new FileInputStream(file);
    ClassFile classFile = new ClassFile();
    classFile.magic = U4.read(inputStream);
    classFile.minorVersion = U2.read(inputStream);
    classFile.majorVersion = U2.read(inputStream);
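The same three fields can also be read with the standard `java.io.DataInputStream`, which already reads big-endian unsigned values. The sketch below is an alternative to the U1/U2/U4 helpers, not the author's code; `HeaderDemo`, `readHeader` and `javaRelease` are names invented here. It also shows the usual mapping from major version to Java release, which holds for major version 49 (Java 5) and later.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class HeaderDemo {
    // Returns {magic, minor_version, major_version} from the first 8 bytes.
    static long[] readHeader(byte[] classBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(classBytes))) {
            long magic = Integer.toUnsignedLong(in.readInt()); // u4
            int minor = in.readUnsignedShort();                // u2
            int major = in.readUnsignedShort();                // u2
            return new long[] {magic, minor, major};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Major version 49 is Java 5, 52 is Java 8, and so on upward.
    static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        long[] h = readHeader(header);
        System.out.println(Long.toHexString(h[0])); // cafebabe
        System.out.println(javaRelease((int) h[2])); // 8
    }
}
```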
This part is just a warm-up; the main work is the constant pool. Before parsing it, let us first explain what the constant pool is.
The constant pool, as the name suggests, is a resource pool that stores constants. The constants here are literals and symbolic references. Literals are values such as strings and numeric constants, and symbolic references fall into three categories: class and interface references, method references, and field references. With resources placed in the constant pool, other items can simply be defined as indexes into it, which avoids wasting space. This is not unique to class files: the Android executable format dex does the same. String resources and the like are placed in DexData, and other items locate them by index. The Java Virtual Machine Specification gives the format of each entry in the constant pool:
    cp_info {
        u1 tag;
        u1 info[];
    }
The above is only the general format. The constant pool actually contains 14 different entry formats (as of Java 8; later versions of the specification add a few more), and each format is identified by a different tag value.
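For reference, the 14 tag values defined by the Java 8 edition of the specification can be written down as a lookup table. The class `ConstantTags` and its method `nameOf` are illustrative names chosen for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConstantTags {
    // Tag value -> entry kind, per the Java SE 8 class file format.
    static final Map<Integer, String> NAMES = new LinkedHashMap<>();

    static {
        NAMES.put(1, "CONSTANT_Utf8");
        NAMES.put(3, "CONSTANT_Integer");
        NAMES.put(4, "CONSTANT_Float");
        NAMES.put(5, "CONSTANT_Long");
        NAMES.put(6, "CONSTANT_Double");
        NAMES.put(7, "CONSTANT_Class");
        NAMES.put(8, "CONSTANT_String");
        NAMES.put(9, "CONSTANT_Fieldref");
        NAMES.put(10, "CONSTANT_Methodref");
        NAMES.put(11, "CONSTANT_InterfaceMethodref");
        NAMES.put(12, "CONSTANT_NameAndType");
        NAMES.put(15, "CONSTANT_MethodHandle");
        NAMES.put(16, "CONSTANT_MethodType");
        NAMES.put(18, "CONSTANT_InvokeDynamic");
    }

    // Returns the entry kind for a tag, or a placeholder for unknown tags.
    static String nameOf(int tag) {
        return NAMES.getOrDefault(tag, "unknown tag " + tag);
    }

    public static void main(String[] args) {
        System.out.println(nameOf(7)); // CONSTANT_Class
    }
}
```

Note that the tag values are not contiguous: 2, 13, 14 and 17 are not used by the Java 8 format.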
Because there are so many formats, this article only explains a few of them.
First, read the size of the constant pool and initialize the constant pool:
    // Parse the constant pool
    int constant_pool_count = U2.read(inputStream);
    ConstantPool constantPool = new ConstantPool(constant_pool_count);
    constantPool.read(inputStream);
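Next, read the entries one by one and store them in the array cpInfo. Note that cpInfo[] indexes start at 1 and index 0 is invalid, so the real size of the constant pool is constant_pool_count-1. A CONSTANT_Long or CONSTANT_Double entry occupies two consecutive indexes, which is why the loop skips one extra index after reading either of them.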
    import java.io.InputStream;

    public class ConstantPool {
        public int constant_pool_count;
        public ConstantInfo[] cpInfo;

        public ConstantPool(int count) {
            constant_pool_count = count;
            cpInfo = new ConstantInfo[constant_pool_count];
        }

        public void read(InputStream inputStream) {
            // Index 0 is unused; valid entries are 1 .. constant_pool_count-1.
            for (int i = 1; i < constant_pool_count; i++) {
                short tag = U1.read(inputStream);
                ConstantInfo constantInfo = ConstantInfo.getConstantInfo(tag);
                constantInfo.read(inputStream);
                cpInfo[i] = constantInfo;
                // Long and Double entries take up two slots, so skip the next index.
                if (tag == ConstantInfo.CONSTANT_Double || tag == ConstantInfo.CONSTANT_Long) {
                    i++;
                }
            }
        }
    }
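The indexing rules can be checked on a tiny hand-built pool. The sketch below only records the tag at each index and skips every entry body by its fixed size from the Java 8 format; it assumes the bytes start right after constant_pool_count. `PoolWalker`, `readTags` and `skip` are names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class PoolWalker {
    // Returns the tag stored at each index; 0 marks an unused index.
    static int[] readTags(byte[] poolBytes, int constantPoolCount) {
        int[] tags = new int[constantPoolCount];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(poolBytes))) {
            for (int i = 1; i < constantPoolCount; i++) {
                int tag = in.readUnsignedByte();
                tags[i] = tag;
                switch (tag) {
                    case 1: skip(in, in.readUnsignedShort()); break;  // Utf8: u2 length + bytes
                    case 3: case 4: skip(in, 4); break;               // Integer, Float
                    case 5: case 6: skip(in, 8); i++; break;          // Long, Double: two indexes
                    case 7: case 8: case 16: skip(in, 2); break;      // Class, String, MethodType
                    case 15: skip(in, 3); break;                      // MethodHandle
                    case 9: case 10: case 11: case 12: case 18:
                        skip(in, 4); break;                           // refs, NameAndType, InvokeDynamic
                    default: throw new IllegalArgumentException("unknown tag " + tag);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    // Consumes exactly n bytes, failing with EOFException if the data is truncated.
    private static void skip(DataInputStream in, int n) throws IOException {
        in.readFully(new byte[n]);
    }
}
```

For a pool with constant_pool_count = 4 that holds one Long followed by one Utf8, the Long lands at index 1, index 2 stays empty, and the Utf8 lands at index 3.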
Let's first look at the CONSTANT_Utf8 format. This entry stores a string encoded in modified UTF-8 (MUTF-8):
    CONSTANT_Utf8_info {
        u1 tag;
        u2 length;
        u1 bytes[length];
    }
So how do we read this entry?
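    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UncheckedIOException;

    public class ConstantUtf8 extends ConstantInfo {
        public String value;

        @Override
        public void read(InputStream inputStream) {
            int length = U2.read(inputStream);
            byte[] bytes = new byte[length];
            try {
                // readFully fills the whole array; a plain read() may return fewer bytes.
                new DataInputStream(inputStream).readFully(bytes);
                value = readUtf8(bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // The original copied the body of java.io.DataInputStream.readUTF() here.
        // This version delegates to readUTF() instead, by putting the u2 length
        // back in front of the bytes, which is the framing readUTF() expects.
        private String readUtf8(byte[] bytearr) throws IOException {
            byte[] framed = new byte[bytearr.length + 2];
            framed[0] = (byte) (bytearr.length >>> 8);
            framed[1] = (byte) bytearr.length;
            System.arraycopy(bytearr, 0, framed, 2, bytearr.length);
            return new DataInputStream(new ByteArrayInputStream(framed)).readUTF();
        }
    }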
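It is simple: first read the length of this entry's byte array, then call readUtf8() to convert the byte array into a String.
Next, look at CONSTANT_Class, which stores a symbolic reference to a class or interface: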
    CONSTANT_Class_info {
        u1 tag;
        u2 name_index;
    }
Note that name_index is not the string itself. It is an index into the constant pool's cpInfo array, and cpInfo[name_index] must be an entry in CONSTANT_Utf8 format.
    public class ConstantClass extends ConstantInfo {
        public int nameIndex;

        @Override
        public void read(InputStream inputStream) {
            nameIndex = U2.read(inputStream);
        }
    }
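Once the constant pool has been parsed, the data that follows can use it. For example, this_class in ClassFile points to an entry of CONSTANT_Class format in the constant pool, so we can read out the class name: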
    // access_flags (u2) comes before this_class and must already have been read.
    int classIndex = U2.read(inputStream);
    ConstantClass clazz = (ConstantClass) constantPool.cpInfo[classIndex];
    ConstantUtf8 className = (ConstantUtf8) constantPool.cpInfo[clazz.nameIndex];
    classFile.className = className.value;
    System.out.print("classname:" + classFile.className + "\n");
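The two-step lookup (this_class to a CONSTANT_Class entry, then to a CONSTANT_Utf8 entry) can be tried on a toy pool. In this sketch the pool is a plain `Object[]` in which a `String` stands for a Utf8 entry and an `Integer` stands for a Class entry's name_index; that representation, and the names `NameResolver` and `className`, are assumptions made only for the example. It also converts the internal name, which uses slashes, to the familiar dotted form.

```java
class NameResolver {
    // pool[0] is unused. String = Utf8 entry, Integer = Class entry (its name_index).
    static String className(Object[] pool, int classIndex) {
        Object entry = pool[classIndex];
        if (!(entry instanceof Integer)) {
            throw new IllegalArgumentException("index " + classIndex + " is not a Class entry");
        }
        int nameIndex = (Integer) entry;
        Object name = pool[nameIndex];
        if (!(name instanceof String)) {
            throw new IllegalArgumentException("index " + nameIndex + " is not a Utf8 entry");
        }
        // Class files store names in internal form, e.g. java/lang/Object.
        return ((String) name).replace('/', '.');
    }

    public static void main(String[] args) {
        Object[] pool = {null, "java/lang/Object", 1};
        System.out.println(className(pool, 2)); // java.lang.Object
    }
}
```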
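After the constant pool, some class information still has to be parsed, such as the superclass, interfaces and fields. But what everyone is probably most curious about is how Java instructions are stored. We all know that the Java code we write is compiled into Java bytecode, so where exactly is that bytecode stored? Before getting to the instructions, let's first look at method_info in ClassFile, whose format is as follows: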
    method_info {
        u2 access_flags;
        u2 name_index;
        u2 descriptor_index;
        u2 attributes_count;
        attribute_info attributes[attributes_count];
    }
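method_info mainly holds information about a method: access flags, the method name index, the method descriptor index, and an attribute array. The attribute array is the important part here, because the bytecode instructions are stored in it. There are many kinds of attributes. The exceptions a method declares, for example, are stored in the Exceptions attribute, and the attribute that stores bytecode instructions is the Code attribute, which as the name suggests is used to store code. The general format of an attribute is: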
    attribute_info {
        u2 attribute_name_index;
        u4 attribute_length;
        u1 info[attribute_length];
    }
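With attribute_name_index we can get the attribute name from the constant pool, and from the name we can tell which kind of attribute it is.
The specific format of the Code attribute is: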
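Because every attribute carries its own length, a parser can look up the name and simply skip the body of any attribute it does not care about. The following sketch lists attribute names that way. The `String[] utf8Pool` is a stand-in for the parsed constant pool, and `AttributeSkimmer` and `attributeNames` are hypothetical names; attribute_length is assumed to fit in an int here.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class AttributeSkimmer {
    // Reads `count` attribute_info structures and returns their names in order.
    static List<String> attributeNames(byte[] data, int count, String[] utf8Pool) {
        List<String> names = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < count; i++) {
                int nameIndex = in.readUnsignedShort(); // u2 attribute_name_index
                int length = in.readInt();              // u4 attribute_length
                names.add(utf8Pool[nameIndex]);
                in.readFully(new byte[length]);         // skip info[attribute_length]
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String[] utf8Pool = {null, "Code", "Deprecated"};
        byte[] data = {
            0, 1, 0, 0, 0, 2, 9, 9, // name index 1, length 2, two body bytes
            0, 2, 0, 0, 0, 0        // name index 2, length 0, empty body
        };
        System.out.println(attributeNames(data, 2, utf8Pool)); // [Code, Deprecated]
    }
}
```

A real parser would branch on the name at this point, for example building a CodeAttribute when the name is "Code" and skipping everything else.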
    Code_attribute {
        u2 attribute_name_index;
        u4 attribute_length;
        u2 max_stack;
        u2 max_locals;
        u4 code_length;
        u1 code[code_length];
        u2 exception_table_length;
        {   u2 start_pc;
            u2 end_pc;
            u2 handler_pc;
            u2 catch_type;
        } exception_table[exception_table_length];
        u2 attributes_count;
        attribute_info attributes[attributes_count];
    }
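The code array holds the bytecode instructions. So how do we parse it? Every opcode in code[] is one byte, and some instructions are followed by operand bytes. The instructions we normally see when disassembling with the javap command are actually mnemonics, which exist only to make bytecode easier to read. The JVM has a table that maps opcodes to mnemonics, and with that table the instructions can be translated into readable mnemonics. I simply found such a table online, saved it to a local txt file, and parse it into a HashMap when it is needed. The code is simple so I won't paste it here; see InstructionTable.java in my code.
Now we can parse the bytecode: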
    for (int j = 0; j < methodInfo.attributesCount; j++) {
        if (methodInfo.attributes[j] instanceof CodeAttribute) {
            CodeAttribute codeAttribute = (CodeAttribute) methodInfo.attributes[j];
            // Simplification: every byte is looked up as an opcode, so the operand
            // bytes of instructions such as invokespecial are printed as opcodes too.
            for (int m = 0; m < codeAttribute.codeLength; m++) {
                short code = codeAttribute.code[m];
                System.out.print(InstructionTable.getInstruction(code) + "\n");
            }
        }
    }
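A decoder that also knows how many operand bytes follow each opcode can step over them instead of misreading them. The sketch below covers only six opcodes, enough for a default constructor and a simple println call; the class `MiniDisassembler` and its table are made up for this example and are not the author's InstructionTable. Variable-length instructions such as tableswitch, lookupswitch and wide would need extra handling that is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MiniDisassembler {
    private static final Map<Integer, String> NAMES = new HashMap<>();
    private static final Map<Integer, Integer> OPERAND_BYTES = new HashMap<>();

    private static void op(int opcode, String name, int operandBytes) {
        NAMES.put(opcode, name);
        OPERAND_BYTES.put(opcode, operandBytes);
    }

    static {
        // A deliberately tiny subset of the instruction set.
        op(0x12, "ldc", 1);
        op(0x2a, "aload_0", 0);
        op(0xb1, "return", 0);
        op(0xb2, "getstatic", 2);
        op(0xb6, "invokevirtual", 2);
        op(0xb7, "invokespecial", 2);
    }

    // Decodes code[] one instruction at a time, consuming operands with each opcode.
    static List<String> disassemble(byte[] code) {
        List<String> out = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;
            String name = NAMES.get(opcode);
            if (name == null) {
                throw new IllegalArgumentException("opcode not in this sketch: " + opcode);
            }
            int n = OPERAND_BYTES.get(opcode);
            int operand = 0;
            for (int k = 1; k <= n; k++) {
                operand = (operand << 8) | (code[pc + k] & 0xFF);
            }
            // For these six opcodes the operand is a constant pool index.
            out.add(n == 0 ? name : name + " #" + operand);
            pc += 1 + n;
        }
        return out;
    }

    public static void main(String[] args) {
        // Typical body of a compiler-generated default constructor.
        byte[] code = {0x2a, (byte) 0xb7, 0x00, 0x01, (byte) 0xb1};
        System.out.println(disassemble(code)); // [aload_0, invokespecial #1, return]
    }
}
```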
The whole project is finally finished. Now let's see how well it works: pick any class file, parse it, and run:
Haha, isn't that great!