
Methods to detect and delete the UTF-8 BOM that causes blank lines in a page

WBOY
2016-07-13 10:49:14

We often find extra blank lines in a page for no apparent reason, even though nothing is visible in the editor. This is usually caused by a UTF-8 BOM. Below are several methods for detecting and deleting the BOM.

The picture below shows the HTML seen in Firebug when this happens.

Figure 1

There is an inexplicable extra blank line, but it does not appear when we view the source code.


Method one, use PHP to remove the BOM. This is the method I use most often.

BOM: the Unicode file signature (Byte Order Mark, U+FEFF). In UTF-8 it is encoded as the three bytes EF BB BF.

The BOM can indicate which Unicode encoding a file uses, but when a received file has to be parsed and written into a database, the BOM only gets in the way.


While looking at utf8_encode, I came across two functions that can be used to test writing and removing a BOM.

Add a BOM before the written file content:

<?php
// Write $content to $filename, preceded by the UTF-8 BOM (EF BB BF).
function writeUTF8File($filename, $content)
{
    $f = fopen($filename, 'w');
    fwrite($f, pack("CCC", 0xef, 0xbb, 0xbf));
    fwrite($f, $content);
    fclose($f);
}
?>

Remove BOM function:

<?php
// Return $str without a leading UTF-8 BOM, if one is present.
function removeBOM($str = '')
{
    if (substr($str, 0, 3) == pack("CCC", 0xef, 0xbb, 0xbf)) {
        $str = substr($str, 3);
    }
    return $str;
}
?>

Since the BOM equals pack("CCC", 0xef, 0xbb, 0xbf), it can be removed with the removeBOM function above or with one of the following:

■ str_replace("\xEF\xBB\xBF", '', $bom_content);
■ preg_replace("/^\xEF\xBB\xBF/", '', $bom_content);

Note that str_replace removes these three bytes wherever they occur in the string, while the preg_replace pattern anchored with ^ only strips them at the start.

Also see the function to determine whether a string is UTF-8:

<?php
// Round-trip check. utf8_decode() targets ISO-8859-1, so this only holds for
// text within that range, and both functions are deprecated as of PHP 8.2.
// mb_check_encoding($string, 'UTF-8') (mbstring extension) is a more general check.
function isUTF8($string)
{
    return (utf8_encode(utf8_decode($string)) == $string);
}
?>
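Outside PHP, a similar validity check can be sketched on the command line. This is only an illustration and not part of the original method: it assumes the iconv utility is installed, and the function name is_utf8_file is hypothetical.

```shell
#!/bin/sh
# is_utf8_file FILE: succeed (exit 0) if FILE decodes cleanly as UTF-8.
# Assumption: an iconv utility is available; the function name is illustrative.
is_utf8_file() {
    # Converting UTF-8 to UTF-8 changes nothing, but iconv exits with a
    # non-zero status when it meets a byte sequence that is not valid UTF-8.
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

A file that starts with a BOM is still valid UTF-8, so this check complements BOM detection rather than replacing it.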

Method two, use the shell on a Linux system

Before discussing in detail the detection and deletion of BOM in UTF-8 encoding, you might as well warm up with an example:

shell> curl -s http://www.bKjia.c0m/ | head -1 | sed -n l
\357\273\277<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r$

As shown above, the first three bytes are 357, 273 and 277, which is the BOM in octal.
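The octal values 357, 273 and 277 are the same three bytes as hexadecimal EF, BB and BF. A minimal sketch for checking a single local file, assuming head and od are available; the function name has_bom is hypothetical and not part of the original commands:

```shell
#!/bin/sh
# has_bom FILE: succeed (exit 0) if FILE starts with the UTF-8 BOM (EF BB BF).
# The function name is an illustrative assumption.
has_bom() {
    # Dump the first three bytes as hex and strip all whitespace.
    first=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first" = "efbbbf" ]
}
```

It only looks at the first three bytes, so it never has to read a large file in full.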

shell> curl -s http://www.111cn.Net/ | head -1 | hexdump -C
00000000  ef bb bf 3c 21 44 4f 43  54 59 50 45 20 68 74 6d  |...<!DOCTYPE htm|
00000010  6c 20 50 55 42 4c 49 43  20 22 2d 2f 2f 57 33 43  |l PUBLIC "-//W3C|
00000020  2f 2f 44 54 44 20 58 48  54 4d 4c 20 31 2e 30 20  |//DTD XHTML 1.0 |
00000030  54 72 61 6e 73 69 74 69  6f 6e 61 6c 2f 2f 45 4e  |Transitional//EN|
00000040  22 20 22 68 74 74 70 3a  2f 2f 77 77 77 2e 77 33  |" "http://www.w3|
00000050  2e 6f 72 67 2f 54 52 2f  78 68 74 6d 6c 31 2f 44  |.org/TR/xhtml1/D|
00000060  54 44 2f 78 68 74 6d 6c  31 2d 74 72 61 6e 73 69  |TD/xhtml1-transi|
00000070  74 69 6f 6e 61 6c 2e 64  74 64 22 3e 0d 0a        |tional.dtd">..|

As shown above, the first three bytes are EF, BB and BF, which is the BOM in hexadecimal.

Note: the examples use pages from third-party websites, so there is no guarantee that they will always be available.

In real project development you may face hundreds or thousands of text files. If a few of them carry a BOM, they are hard to spot. If you have no UTF-8 text files with a BOM at hand, you can create a few with vi. The relevant commands are as follows:

Set UTF-8 encoding:

:set fileencoding=utf-8

Add BOM:

:set bomb

Delete BOM:

:set nobomb

Query BOM:

:set bomb?
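If you prefer not to open an editor, test files can also be produced non-interactively. The following is a small sketch under stated assumptions: the function name make_samples and the file names are made up for illustration, and printf is used with octal escapes so it works in any POSIX shell.

```shell
#!/bin/sh
# make_samples DIR: create one UTF-8 file with a BOM and one without.
# The function name and file names are illustrative assumptions.
make_samples() {
    mkdir -p "$1" || return 1
    # \357\273\277 is the UTF-8 BOM (EF BB BF) written as octal escapes.
    printf '\357\273\277<?php echo "hello";\n' > "$1/with_bom.php"
    printf '<?php echo "hello";\n' > "$1/without_bom.php"
}
```

The two files differ only in the three leading bytes, which makes them convenient for trying out the detection and deletion commands.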

How do you detect the BOM in UTF-8 encoded files?

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path

How do you delete the BOM in UTF-8 encoded files?

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | xargs sed -i '1s/^\xEF\xBB\xBF//'

The substitution is restricted to line 1. Do not append a q command here: with GNU sed -i, quitting after the first line discards the rest of the file.
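File names that contain spaces would break a plain xargs pipeline. A hedged variant that passes NUL-terminated names is sketched here. It assumes GNU grep, GNU sed and GNU xargs, and the function name strip_bom_tree is hypothetical.

```shell
#!/bin/sh
# strip_bom_tree DIR: remove a leading UTF-8 BOM from every text file under DIR.
# Assumes GNU grep, sed and xargs; the function name is illustrative.
strip_bom_tree() {
    # Build the three BOM bytes portably instead of relying on $'...' quoting.
    bom=$(printf '\357\273\277')
    # -Z and -0 pass NUL-terminated names so paths with spaces survive;
    # -r skips sed entirely when grep finds nothing.
    grep -rIlZ "^$bom" "$1" |
        xargs -0 -r sed -i '1s/^\xEF\xBB\xBF//'
}
```

Because sed rewrites files in place, it is worth running the grep part alone first and checking the list of files before stripping anything.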
Recommendation: if you use SVN, you can add the following check to the pre-commit hook to keep the BOM out of the repository.

#!/bin/bash

# Arguments passed by Subversion to the pre-commit hook.
REPOS="$1"
TXN="$2"

SVNLOOK=/usr/bin/svnlook

# Check every added (A) or updated (U) path in the transaction.
for FILE in $($SVNLOOK changed -t "$TXN" "$REPOS" | awk '/^[AU]/ {print $NF}'); do
    if $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | grep -q $'^\xEF\xBB\xBF'; then
        echo "Byte Order Mark found in $FILE" 1>&2
        exit 1
    fi
done

This article uses a lot of shell commands

Method three, use the UltraEdit editor to modify the document directly

Just save the document that shows the blank lines in a format without a BOM. The picture below shows the encoding options when UltraEdit saves a document:

Figure 2

Choose "UTF8 - no BOM" there, and the problem is solved.