About the detection and deletion of BOM in UTF-8 encoding

WBOY
2016-07-25 09:05:27
A BOM (byte order mark) at the start of a UTF-8 file can cause errors such as:

  1. Shell: #!/bin/sh: No such file or directory
  2. PHP: Warning: Cannot modify header information - headers already sent

Before discussing in detail how to detect and delete the BOM in UTF-8 files, let's warm up with an example:

shell> curl -s http://phone.jbxue.com/ | head -1 | sed -n l
\357\273\277<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r$

As shown above, the first three bytes are 357, 273, and 277, which is the BOM in octal.

shell> curl -s http://phone.jbxue.com/ | head -1 | hexdump -C
00000000  ef bb bf 3c 21 44 4f 43  54 59 50 45 20 68 74 6d  |...<!DOCTYPE htm|
00000010  6c 20 50 55 42 4c 49 43  20 22 2d 2f 2f 57 33 43  |l PUBLIC "-//W3C|
00000020  2f 2f 44 54 44 20 58 48  54 4d 4c 20 31 2e 30 20  |//DTD XHTML 1.0 |
00000030  54 72 61 6e 73 69 74 69  6f 6e 61 6c 2f 2f 45 4e  |Transitional//EN|
00000040  22 20 22 68 74 74 70 3a  2f 2f 77 77 77 2e 77 33  |" "http://www.w3|
00000050  2e 6f 72 67 2f 54 52 2f  78 68 74 6d 6c 31 2f 44  |.org/TR/xhtml1/D|
00000060  54 44 2f 78 68 74 6d 6c  31 2d 74 72 61 6e 73 69  |TD/xhtml1-transi|
00000070  74 69 6f 6e 61 6c 2e 64  74 64 22 3e 0d 0a        |tional.dtd">..|
As shown above, the first three bytes are EF, BB, and BF, which is the BOM in hexadecimal.
Note: The example uses a page from a third-party website, so there is no guarantee that it will always be available.

In actual project development, you may face hundreds or thousands of text files. If a few of them contain a BOM, they are hard to spot by eye. If you have no UTF-8 text files with a BOM to experiment with, you can use vi to create some. The relevant commands are as follows:

# Set UTF-8 encoding
:set fileencoding=utf-8
# Add BOM
:set bomb
# Delete BOM
:set nobomb
# Query BOM
:set bomb?
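
Sample files can also be produced without an editor. The following is a minimal sketch, assuming a POSIX shell with mktemp and od available; the directory variable and file names are hypothetical and chosen only for illustration.

```shell
# Create two sample files in a scratch directory: one with a BOM, one without.
dir=$(mktemp -d)

# \357\273\277 is the BOM (EF BB BF) written as octal escapes,
# which printf interprets portably in its format string.
printf '\357\273\277hello\n' > "$dir/with_bom.txt"
printf 'hello\n' > "$dir/without_bom.txt"

# Show the first three bytes of each file in hex.
od -An -tx1 -N3 "$dir/with_bom.txt"
od -An -tx1 -N3 "$dir/without_bom.txt"
```

The first od call should print the bytes ef bb bf and the second 68 65 6c (the letters "hel"); the exact spacing depends on the od implementation.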

Detect BOM in UTF-8 encoding

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path
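
Note that ^ in a grep pattern anchors at the start of every line, so the command above also reports files in which the three bytes begin a later line. A stricter check that looks only at the first three bytes of each file might be sketched as follows, assuming a POSIX shell with find, od and tr; the function name find_bom_files is hypothetical, and file names containing newlines are not handled.

```shell
# find_bom_files DIR: print every regular file under DIR
# whose first three bytes are EF BB BF.
find_bom_files() {
    find "$1" -type f | while IFS= read -r f; do
        # od dumps the first 3 bytes as hex; tr strips the spacing od adds.
        if [ "$(od -An -tx1 -N3 "$f" | tr -d ' \n')" = "efbbbf" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

It would be called as find_bom_files /path. Unlike grep -I, this sketch does not skip binary files.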
Remove BOM in UTF-8 encoding

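For a single file, the removal step can be sketched as below. This is an illustration under stated assumptions, not the article's own command: it assumes a POSIX shell with od, tail and mktemp, and the function name strip_bom is hypothetical.

```shell
# strip_bom FILE: remove a leading UTF-8 BOM (EF BB BF) from FILE, if present.
# Files without a BOM are left untouched.
strip_bom() {
    f=$1
    if [ "$(od -An -tx1 -N3 "$f" | tr -d ' \n')" = "efbbbf" ]; then
        tmp=$(mktemp) || return 1
        # tail -c +4 copies the file starting at byte 4, i.e. without the BOM.
        # Writing back with cat keeps the original file's permissions.
        tail -c +4 "$f" > "$tmp" && cat "$tmp" > "$f"
        status=$?
        rm -f "$tmp"
        return $status
    fi
}
```

Because only the first three bytes are inspected and dropped, the rest of the file is copied byte for byte. Try it on a copy of your data first.
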
shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | ...

With Subversion, you can add relevant code to the pre-commit hook to keep files with a BOM out of the repository:

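#!/bin/sh

REPOS="$1"
TXN="$2"

SVNLOOK=/usr/bin/svnlook

# The BOM bytes EF BB BF, built with printf because $'...' quoting
# is not available in every /bin/sh
BOM=$(printf '\357\273\277')

# Paths of files updated (U) or added (A) in this transaction
FILES=`$SVNLOOK changed -t "$TXN" "$REPOS" | awk '/^[UA]/ {print $2}'`

for FILE in $FILES; do
    # Reject the commit if any line of the file starts with a BOM
    if $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | grep -q "^$BOM"; then
        echo "Byte Order Mark found in $FILE" 1>&2
        exit 1
    fi
done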
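
The hook reads each file's content from a pipe, so a check that works on standard input fits the same place. A minimal sketch, assuming a POSIX shell with od and tr; the function name stream_has_bom is hypothetical.

```shell
# stream_has_bom: exit status 0 if standard input starts with EF BB BF.
stream_has_bom() {
    # With no file argument od reads standard input; -N3 limits it to 3 bytes.
    [ "$(od -An -tx1 -N3 | tr -d ' \n')" = "efbbbf" ]
}
```

Inside the loop it could stand in for the grep call, for example as $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | stream_has_bom, in which case only a BOM at the very start of the file rejects the commit.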
