
PHP function to strip unwanted formatting from content pasted directly from Word

WBOY
2016-07-25 09:08:10
<?php
/**
 * Strip the markup that Microsoft Word leaves behind in pasted HTML.
 *
 * @param string $content   HTML pasted from Word
 * @param string $allowtags Tags to keep, in strip_tags() format, e.g. '<p><b>'
 * @return string Cleaned HTML
 */
function ClearHtml($content, $allowtags = '') {
    mb_regex_encoding('UTF-8');

    // Replace MS special characters first
    $search  = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
    $replace = array("'", "'", '"', '"', '-');
    $content = preg_replace($search, $replace, $content);

    // Make sure _all_ HTML entities are converted to their plain equivalents: in
    // some MS headers, some HTML entities are encoded and some aren't
    $content = html_entity_decode($content, ENT_QUOTES, 'UTF-8');

    // Try to strip out any C-style comments first, since these, embedded in HTML comments,
    // seem to prevent strip_tags from removing HTML comments (a combination MS Word introduces)
    if (mb_stripos($content, '/*') !== FALSE) {
        // mb_ereg patterns take no delimiters; the 'm' option lets '.' match newlines
        $content = mb_eregi_replace('/\*.*?\*/', '', $content, 'm');
    }

    // Introduce a space into any arithmetic expressions that could be caught by strip_tags
    // so that they won't be stripped: '<1' becomes '< 1' (note: somewhat application specific)
    $content = preg_replace(array('/<([0-9]+)/'), array('< $1'), $content);

    $content = strip_tags($content, $allowtags);

    // Eliminate extraneous whitespace from the start and end of the text, and collapse
    // any run of two or more whitespace characters into a single space
    $content = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $content);

    // Strip out inline CSS and simplify style tags
    // (\b keeps <b> from matching <br>, <i> from matching <img>, <u> from matching <ul>)
    $search = array(
        '#<(strong|b)\b[^>]*>(.*?)</(strong|b)>#isu',
        '#<(em|i)\b[^>]*>(.*?)</(em|i)>#isu',
        '#<u\b[^>]*>(.*?)</u>#isu',
    );
    $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
    $content = preg_replace($search, $replace, $content);

    // On some of the newer MS Word exports, where you get conditionals of the form
    // 'if gte mso 9', etc., it appears that whatever is in one of the HTML comments prevents
    // strip_tags from eradicating the HTML comment that contains some MS style definitions.
    // This last bit gets rid of any leftover comments.
    $num_matches = preg_match_all('/<!--/u', $content, $matches);
    if ($num_matches) {
        $content = preg_replace('/<!--.*?-->/su', '', $content);
    }
    return $content;
}
?>
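The function's comments mention that `<1` is rewritten as `< 1` before `strip_tags` runs. That step can be tried on its own. The following is a minimal sketch, not part of the original function: `pad_lt_before_digits` is a hypothetical helper name that wraps the same `preg_replace` call, so that a comparison such as `a<1` is less likely to be mistaken for the start of a tag.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the article):
// inserts a space between '<' and a following run of digits.
function pad_lt_before_digits(string $text): string
{
    return preg_replace('/<([0-9]+)/', '< $1', $text);
}

$sample = 'if a<1 then b<20';
echo pad_lt_before_digits($sample), PHP_EOL; // if a< 1 then b< 20
```

Real tags are left alone by this pattern, because tag names do not start with a digit.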


Test:

<?php
// The <p> markup here is an assumption; the sample's original tags are not shown.
$content = '<p>《优伴户外旅行》——让旅行成为习惯!</p>'
         . '<p>越发忙碌的你,是否想给自己放个假?专注工作的你,是否还记得上一次锻炼是什么时候?优伴户外旅行,给你不一样的旅行体验:给心自由,便处处都是风景!</p>';
echo ClearHtml($content, '<p>');

/*
Result:
<p>《优伴户外旅行》--让旅行成为习惯!</p><p>越发忙碌的你,是否想给自己放个假?专注工作的你,是否还记得上一次锻炼是什么时候?优伴户外旅行,给你不一样的旅行体验:给心自由,便处处都是风景!</p>
*/
?>
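In the test result, the em dashes in the title come out as two hyphens. That conversion comes from the first replacement step and can be checked in isolation. The following is a minimal sketch: `normalize_word_punctuation` is a hypothetical helper name, and the patterns match the literal curly-quote and dash characters as they appear in the listing.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the article):
// replaces the curly quotes and em dashes that Word inserts with ASCII.
function normalize_word_punctuation(string $text): string
{
    $search  = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
    $replace = array("'", "'", '"', '"', '-');
    return preg_replace($search, $replace, $text);
}

echo normalize_word_punctuation('“Word”——‘pasted’'), PHP_EOL; // "Word"--'pasted'
```

The `u` modifier makes PCRE treat the pattern and subject as UTF-8, so the source file needs to be saved as UTF-8 for the literal characters to match.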


