Use PHP to decode POP3 emails (2)
Introduction to MIME encoding (Author: Chen Junqing, October 24, 2000 15:09)

Subject: =?gb2312?B?xOO6w6Oh?=

This is the subject of the email, but because it is encoded we cannot read its content. Its original text is "你好！" ("Hello!"). Let's first look at the two MIME encoding methods.

Emails are encoded because many gateways on the Internet cannot correctly transmit 8-bit characters, such as Chinese characters. The principle of encoding is to convert 8-bit content into a 7-bit form so that it can be transmitted correctly, and to restore it to 8-bit content after the receiver gets it.

MIME is the abbreviation of "Multipurpose Internet Mail Extensions". Before MIME there were other mail encoding methods, such as UUENCODE. Because the MIME algorithms are simple and easy to extend, MIME has become the mainstream mail encoding method. It is used not only to transmit 8-bit characters but also to transmit binary files, such as the images and audio in email attachments, and many applications have been built on top of it.

MIME defines two encoding methods: Base64 and QP (Quoted-Printable).

Base64 is a general-purpose method, and its principle is very simple: every three bytes of data are represented by four bytes. Each of those four bytes carries only 6 bits of data and is written as a printable ASCII character, so the restriction to 7-bit characters is no longer a problem. Base64 is usually abbreviated as "B". The Subject in this letter uses Base64 encoding.

The other method is QP (Quoted-Printable), usually abbreviated as "Q". Its principle is to represent an 8-bit character as "=" followed by two hexadecimal digits. A text after QP encoding therefore usually looks like this: =B3=C2=BF=A1=C7=E5=A3=AC=C4=FA=BA=C3=A3=A1.
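The two encodings described above can be tried directly with PHP's built-in functions. The following is only a small sketch: it decodes the Base64 payload of the sample Subject, then decodes the same six bytes written in QP form (that QP string is made up for this illustration, it does not come from the email). The conversion to UTF-8 at the end assumes the mbstring extension is available and is skipped otherwise.

```php
<?php
// Base64: every 3 bytes of data are written as 4 characters (6 bits each).
$b64 = 'xOO6w6Oh';                       // payload of the sample Subject
$raw = base64_decode($b64);
echo strlen($b64), ' characters -> ', strlen($raw), " bytes\n";
echo strtoupper(bin2hex($raw)), "\n";    // C4E3BAC3A3A1

// Quoted-Printable: each 8-bit byte is written as "=" plus two hex digits.
// This QP string is an illustration built from the same six bytes.
$qp = quoted_printable_decode('=C4=E3=BA=C3=A3=A1');
var_dump($qp === $raw);                  // bool(true)

// Optional: the bytes are GB2312 text; convert them to UTF-8 for display.
// mbstring names this encoding EUC-CN.
if (function_exists('mb_convert_encoding')) {
    echo mb_convert_encoding($raw, 'UTF-8', 'EUC-CN'), "\n";
}
```

Both calls return the raw 8-bit bytes; turning them into readable text still depends on the character set named in the header.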
PHP has two built-in functions that make decoding easy: base64_decode() and quoted_printable_decode(). The former decodes Base64 and the latter decodes QP.

Now let's take a look at the content of the Subject: =?gb2312?B?xOO6w6Oh?=. The header is not necessarily encoded as a whole; only the part enclosed between the two marks =? and ?= is encoded. The text right after =? names the character set, here GB2312, and the B after the next ? stands for Base64 encoding. With this analysis in mind, let's look at the MIME decoding function. (This function was provided by Sadly, the webmaster of PHPX.COM. I put it into a class and made a few modifications. My thanks to him.)

    function decode_mime($string) {
        $pos = strpos($string, '=?');
        if (!is_int($pos)) {
            return $string;                    // no encoded word: nothing to do
        }

        $preceding = substr($string, 0, $pos); // save any preceding text

        $search = substr($string, $pos + 2);   // text after the opening "=?"
        $d1 = strpos($search, '?');
        if (!is_int($d1)) {
            return $string;
        }

        $charset = substr($string, $pos + 2, $d1); // the character set definition
        $search = substr($search, $d1 + 1);        // the part after the character set

        $d2 = strpos($search, '?');
        if (!is_int($d2)) {
            return $string;
        }

        $encoding = substr($search, 0, $d2);   // encoding method between two "?": q or b
        $search = substr($search, $d2 + 1);

        $end = strpos($search, '?=');          // the encoded content ends here
        if (!is_int($end)) {
            return $string;
        }

        $encoded_text = substr($search, 0, $end);
        // +6 accounts for the six delimiter characters removed above: "=?", "?", "?", "?="
        $rest = substr($string, (strlen($preceding . $charset . $encoding . $encoded_text) + 6));

        switch ($encoding) {
        case 'Q':
        case 'q':
            // In an encoded word, "_" stands for a space (RFC 2047)
            $decoded = quoted_printable_decode(str_replace('_', ' ', $encoded_text));
            break;

        case 'B':
        case 'b':
            $decoded = base64_decode($encoded_text);
            break;

        default:
            // unknown encoding: put the encoded word back unchanged
            $decoded = '=?' . $charset . '?' . $encoding . '?' . $encoded_text . '?=';
            break;
        }

        // convert_cyr_string() was removed in PHP 8.0; on newer versions
        // iconv('windows-1251', 'koi8-r', $decoded) performs the same conversion
        if (strtolower($encoding) != '' && strtolower($charset) == 'windows-1251'
            && in_array(strtolower($encoding), array('q', 'b'))) {
            $decoded = convert_cyr_string($decoded, 'w', 'k');
        }

        return $preceding . $decoded . $this->decode_mime($rest);
    }

This function recursively decodes a string that contains encoded segments like the Subject above. Comments have been added to the code, and I believe anyone with some basic knowledge of PHP programming can follow it. The actual decoding is done by the two built-in functions base64_decode() and quoted_printable_decode(); the rest is string analysis of the email source, which PHP's string functions make convenient. The final statement of the function, return $preceding . $decoded . $this->decode_mime($rest);, implements the recursion: whatever follows the first encoded segment is passed through the function again. Because this function actually lives in the MIME decoding class to be introduced later, it is called in the form $this->decode_mime($rest).

Now let's look at the body. This involves some of the MIME header fields, so here is a brief introduction first (readers who want to learn more should refer to the official MIME documentation).

MIME-Version: 1.0
Indicates the version number of MIME used, usually 1.0.

Content-Type:
Defines the type of the body. We use this field to know what kind of content the body is. For example, text/plain means an unformatted text body, text/html means an HTML document, image/gif means an image in GIF format, and so on.
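To make the Content-Type field more concrete, here is a minimal sketch of how its value could be split into a media type and its parameters. The helper name parse_content_type() is hypothetical and is not part of the class introduced later; the parsing is deliberately naive and assumes that parameter values contain no ";" character.

```php
<?php
// Hypothetical helper (an assumption for illustration, not the article's code):
// split a Content-Type value into its media type and its parameters.
function parse_content_type(string $value): array
{
    $parts = explode(';', $value);
    $type = strtolower(trim(array_shift($parts)));   // e.g. "text/plain"
    $params = [];
    foreach ($parts as $part) {
        if (strpos($part, '=') === false) {
            continue;                                // skip malformed pieces
        }
        // split on the first "=" only, the value itself may contain "="
        [$name, $val] = explode('=', $part, 2);
        $params[strtolower(trim($name))] = trim(trim($val), '"');
    }
    return ['type' => $type, 'params' => $params];
}

$ct = parse_content_type('text/plain; charset="gb2312"');
echo $ct['type'], "\n";                  // text/plain
echo $ct['params']['charset'], "\n";     // gb2312
```

The part before the "/" tells a mail reader how to treat the content in general, and the parameters carry the details it needs, such as the character set.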
What needs to be explained in this article are the compound types commonly used in emails. The multipart type indicates that the body is composed of multiple parts, and its subtype describes the relationship between these parts. Three subtypes are used in emails:

multipart/alternative: the body consists of alternative versions of the same content, any one of which may be chosen. It is mainly used when a message has both a plain-text and an HTML version of the body. Mail clients that support HTML will generally display the HTML body, while those that do not will display the text body.

multipart/mixed: the parts of the document are mixed, which describes the relationship between the body and attachments. If the MIME type of an email is multipart/mixed, the email contains attachments.

multipart/related: the parts of the document are related to each other. It is generally used to describe an HTML body and the images it refers to.

These compound types can be nested. For example, if an email contains an attachment and has its body in both HTML and text formats, the structure of the email is:

    Content-Type: multipart/mixed
        Part 1: Content-Type: multipart/alternative
            body in text format
            body in HTML format
        Part 2: attachment
    end of the email

Since a compound type is composed of multiple parts, a delimiter is needed to separate them. This is what boundary="----=_NextPart_000_0007_01C03166.5B1E9510" describes in the email source file above. Every Content-Type: multipart/* header carries such a description, which defines the separator between the parts. The separator is a string of unusual characters that cannot occur in the body itself. In the document, "--" followed by the boundary marks the beginning of a part. At the end of the document, "--" followed by the boundary and then another "--" marks the end of the document. Since compound types can be nested, there may be several different boundaries in one email.

There is one more MIME header field, and it is the most important one:

Content-Transfer-Encoding: base64

It indicates how this part of the document is encoded, which is either the Base64 or the QP (Quoted-Printable) method introduced above. Only by reading this field can we choose the correct decoding method.

Due to space limitations, this is all I will say about MIME here. Below I will give a class for decoding MIME emails, together with a brief description of it.
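Before the full class, the boundary and Content-Transfer-Encoding rules described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the helper names split_multipart() and decode_part() and the boundary "XYZ" are hypothetical, nested multipart bodies are not handled, and header lines are assumed not to be folded.

```php
<?php
// Hypothetical helper: split a multipart body into its raw parts.
function split_multipart(string $body, string $boundary): array
{
    $parts = [];
    $current = null;                     // stays null until the first delimiter
    foreach (preg_split('/\r\n|\n/', $body) as $line) {
        if (rtrim($line) === '--' . $boundary . '--') {
            break;                       // closing delimiter: end of the document
        }
        if (rtrim($line) === '--' . $boundary) {
            if ($current !== null) {
                $parts[] = implode("\n", $current);
            }
            $current = [];               // a new part starts here
            continue;
        }
        if ($current !== null) {
            $current[] = $line;          // text before the first delimiter is ignored
        }
    }
    if ($current !== null) {
        $parts[] = implode("\n", $current);
    }
    return $parts;
}

// Hypothetical helper: decode one part according to its
// Content-Transfer-Encoding header.
function decode_part(string $part): string
{
    // headers and content are separated by the first empty line
    $pieces = explode("\n\n", $part, 2);
    $content = $pieces[1] ?? '';
    if (preg_match('/^Content-Transfer-Encoding:\s*(\S+)/mi', $pieces[0], $m)) {
        switch (strtolower($m[1])) {
            case 'base64':
                return base64_decode($content);
            case 'quoted-printable':
                return quoted_printable_decode($content);
        }
    }
    return $content;                     // 7bit, 8bit or unknown: leave as is
}

$body = "ignored preamble\n"
      . "--XYZ\nContent-Type: text/plain\nContent-Transfer-Encoding: base64\n\naGVsbG8=\n"
      . "--XYZ\nContent-Type: text/plain\n\nplain text\n"
      . "--XYZ--\n";

foreach (split_multipart($body, 'XYZ') as $part) {
    echo decode_part($part), "\n";       // prints "hello", then "plain text"
}
```

For a nested message, the same splitting would have to be applied again to any part whose own Content-Type is multipart/*, using that part's boundary.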