Using PHP character transcoding to fix garbled data crawled from Sina

WBOY
2016-07-25 08:53:37
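    // Decode JavaScript escape()-style strings: %XX bytes and %uXXXX code units.
    function unescape($str) {
        $str = rawurldecode($str);
        // Split the string into %uXXXX tokens and runs of other text.
        preg_match_all("/(?:%u.{4})|.+/", $str, $r);
        $ar = $r[0];
        foreach ($ar as $k => $v) {
            if (substr($v, 0, 2) == '%u' && strlen($v) == 6) {
                // Pack the four hex digits into two bytes and convert them to UTF-8.
                $ar[$k] = iconv("UCS-2", "utf-8", pack("H4", substr($v, -4)));
            }
        }
        return join("", $ar);
    }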

This version has a small problem, so I replaced it with another function that seems to be more powerful.
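The article does not say what the problem was. One limitation that can be demonstrated directly is that the greedy `.+` in the first pattern swallows any `%u` escape that follows ordinary text. The sketch below compares that pattern with an ungreedy variant (the `/U` modifier); the sample input is a made-up string, not data from the original crawl.

```php
<?php
// Hypothetical sample: plain text followed by two %u escapes.
$input = "abc%u4E2D%u6587";

// Pattern from the first function: ".+" is greedy, so once it starts
// matching at "a" it consumes the rest of the string, escapes included.
preg_match_all("/(?:%u.{4})|.+/", $input, $greedy);

// Ungreedy variant: with /U, ".+" matches as little as possible (one
// character), so each %uXXXX escape is found as a separate token.
preg_match_all("/%u.{4}|.+/U", $input, $lazy);

print_r($greedy[0]); // one element: the whole input
print_r($lazy[0]);   // "a", "b", "c", "%u4E2D", "%u6587"
```

With the greedy pattern the whole input comes back as a single match, so neither escape would reach the `iconv()` branch.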

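    function unescape($str) {
        $str = rawurldecode($str);
        // Tokenize into %uXXXX, &#xXXXX;, decimal &#NNN; references, and other text.
        preg_match_all("/%u.{4}|&#x.{4};|&#\d+;|&#\d+?|.+/U", $str, $r);
        $ar = $r[0];
        foreach ($ar as $k => $v) {
            if (substr($v, 0, 2) == "%u") {
                // %uXXXX: the last four characters are the hex code unit.
                $ar[$k] = iconv("UCS-2", "utf-8", pack("H4", substr($v, -4)));
            } elseif (substr($v, 0, 3) == "&#x") {
                // &#xXXXX;: the hex digits sit between "&#x" and ";".
                $ar[$k] = iconv("UCS-2", "utf-8", pack("H4", substr($v, 3, -1)));
            } elseif (substr($v, 0, 2) == "&#") {
                // &#NNN;: keep only the decimal digits and pack them as a 16-bit big-endian value.
                $ar[$k] = iconv("UCS-2", "utf-8", pack("n", preg_replace("/[^\d]/", "", $v)));
            }
        }
        return join("", $ar);
    }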
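Both versions above pass the bare name "UCS-2" to `iconv()`. `pack("H4", ...)` and `pack("n", ...)` always emit big-endian bytes, but "UCS-2" without a suffix leaves the byte order to the iconv implementation, so the result can depend on the platform. The following is a minimal sketch of one way to remove that dependency by naming the byte order explicitly. Treating this as a portability concern is an assumption on my part rather than something the article establishes, and the helper name `codeUnitToUtf8` is hypothetical.

```php
<?php
// Hypothetical helper: convert one 4-digit hex code unit (e.g. "4E2D") to UTF-8.
function codeUnitToUtf8($hex)
{
    // pack("H4") yields the two bytes in big-endian order: 0x4E 0x2D.
    $bytes = pack("H4", $hex);
    // "UCS-2BE" states the byte order instead of relying on the platform default.
    return iconv("UCS-2BE", "UTF-8", $bytes);
}

echo codeUnitToUtf8("4E2D"), "\n"; // prints 中 (U+4E2D)
```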

After using it for a while, I found that it worked locally but not in our online environment. The server runs *nix, my local machine runs Windows XP, and the PHP versions differ. Later, I found a similar function in the PHP manual that also supports UTF-8. I think it should be more portable.

    // PHP character transcoding
    function utf8RawUrlDecode($source) {
        $decodedStr = "";
        $pos = 0;
        $len = strlen($source);
        while ($pos < $len) {
            $charAt = substr($source, $pos, 1);
            if ($charAt == '%') {
                $pos++;
                $charAt = substr($source, $pos, 1);
                if ($charAt == 'u') {
                    // we got a unicode character: %uXXXX
                    $pos++;
                    $unicodeHexVal = substr($source, $pos, 4);
                    $unicode = hexdec($unicodeHexVal);
                    // emit it as a decimal HTML numeric entity, e.g. &#20013;
                    $entity = "&#" . $unicode . ';';
                    // the entity is ASCII-only, so utf8_encode() leaves it unchanged
                    // (utf8_encode() is deprecated as of PHP 8.2)
                    $decodedStr .= utf8_encode($entity);
                    $pos += 4;
                }
                else {
                    // we have an escaped ascii character: %XX
                    $hexVal = substr($source, $pos, 2);
                    $decodedStr .= chr(hexdec($hexVal));
                    $pos += 2;
                }
            } else {
                $decodedStr .= $charAt;
                $pos++;
            }
        }
        return $decodedStr;
    }

Using this function solved the problem.
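Note that `utf8RawUrlDecode()` turns each `%uXXXX` escape into a decimal HTML entity rather than into UTF-8 bytes, which is fine when the result is printed into an HTML page. If raw UTF-8 is needed, for example before storing the crawled text, one option is to pass the result through `html_entity_decode()`. The sketch below starts from a hard-coded sample of the function's output so that it runs on its own; the sample string is an assumption for illustration.

```php
<?php
// What utf8RawUrlDecode() returns for the input "%u4E2D%u6587%21":
// each %uXXXX becomes a decimal entity, and %21 becomes "!".
$decoded = "&#20013;&#25991;!";

// html_entity_decode() converts the numeric entities into real
// characters, encoded in the charset given as the third argument.
$utf8 = html_entity_decode($decoded, ENT_QUOTES | ENT_HTML5, "UTF-8");

echo $utf8, "\n"; // prints 中文!
```

Keep in mind that `html_entity_decode()` also decodes any other entities in the string, such as `&amp;`, so it suits text that will not be treated as HTML markup afterwards.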


