
PHP+Tidy - perfect XHTML error correction + filtering

WBOY
2016-07-29 08:36:55

Input and output
Input and output are arguably the basic functions of many websites: users input data, and the website outputs that data for others to browse.
Take the currently popular blog as an example. The input is the author editing an article, and the output is the generated article page that others read.
The problem is that user input is usually uncontrolled. It may contain malformed markup or code that poses a security risk, yet the final output of the website must be correct HTML. This requires error correction and filtering of user input.
Never trust user input
You may say: there are WYSIWYG editors everywhere now, such as FCKeditor and TinyMCE, and you could name many more. Yes, they can all generate standard XHTML code automatically, but as a web developer you must have heard the rule "never trust user-submitted data".
It is therefore still necessary to correct and filter the data users submit.
Need better error correction and filtering
So far, I have not seen an implementation that satisfies me. The ones I have come across are usually inefficient, produce less than ideal results, or have obvious flaws of one kind or another. To give a well-known example: WordPress is a very widely used blog system. It is simple to operate, powerful, and has rich plug-in support. However, its integrated TinyMCE and the pile of overly clever error correction and filtering code in the background are quite a headache. Forced replacement of half-width characters, overly conservative replacement rules and the like make it hard to do something as simple as pasting a piece of code and having it display correctly.
Let me complain a little while I am at it. This blog runs on WordPress. To make the code in these articles display correctly, I searched a lot online and tried several plug-ins. In the end I went through its code and commented out some of the filtering rules, and now the code displays more decently -.-b
Of course, I don't want to criticize WordPress too much. I just want to show that it could do better.
What is Tidy and how does it work?
Excerpted from the Tidy man page:
Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For HTML variants, it detects and corrects many common coding errors and strives to produce visually equivalent markup that is both W3C compliant and works on most browsers. A common use of Tidy is to convert plain HTML to XHTML.
In short, Tidy cleans up and corrects markup into W3C-standard code, and it supports HTML, XHTML and XML. Tidy also provides a library, TidyLib, which makes its features available to other applications. Fortunately, PHP has a corresponding tidy module.
Brother, why PHP again?
Uh, this question... I'm ashamed to say it is because I only know a little bit of PHP -.-v
Fortunately, what I cover here is not pure code. There is at least some analysis of the process, and sharing that is much more useful than just posting code.
Using Tidy in PHP
To use Tidy in PHP, you need to install the Tidy module, which means loading the PHP extension tidy.so. I will skip the installation steps since they are just grunt work. Once you can see "Tidy support enabled" in phpinfo(), you are set.
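Besides reading phpinfo(), the same check can be made from code. The following is a minimal sketch; tidy_available is a hypothetical helper name chosen for illustration, not part of the tidy extension.

```php
<?php
// tidy_available() is a hypothetical helper, not a tidy API function.
// It reports whether the tidy extension is loaded and exposes the
// repair function that the rest of this article relies on.
function tidy_available(): bool
{
    return extension_loaded('tidy') && function_exists('tidy_repair_string');
}

echo tidy_available() ? "tidy is available\n" : "tidy is NOT available\n";
```

A guard like this lets the calling code fall back to returning the input unchanged when the extension is missing.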
With this module, almost everything Tidy offers can be used from PHP. Routine HTML cleanup is extremely easy. It can even build a parse tree of the document and let you work on each HTML node, much like manipulating the DOM on the client. Specific code follows below, and you can also consult the official PHP manual.
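To give a feel for the parse tree, here is a small sketch that walks the tree under the body node and collects element names. collect_tag_names and the sample markup are assumptions made for illustration; tidy_parse_string, tidy_get_body and the tidyNode members used here come from the PHP tidy extension.

```php
<?php
// collect_tag_names() is a hypothetical helper, not part of the tidy API.
// It recursively gathers the names of element nodes below (and including) $node.
function collect_tag_names(tidyNode $node, array &$names): void
{
    // Only element nodes carry a tag name worth recording.
    if ($node->type === TIDY_NODETYPE_START || $node->type === TIDY_NODETYPE_STARTEND) {
        $names[] = $node->name;
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            collect_tag_names($child, $names);
        }
    }
}

$names = [];
if (extension_loaded('tidy')) {
    // Sample input chosen for illustration only.
    $doc  = tidy_parse_string('<p>Hello <b>world</b></p>', ['output-xhtml' => true], 'utf8');
    $body = tidy_get_body($doc);
    if ($body !== null) {
        collect_tag_names($body, $names);
    }
}
print_r($names);
```

The same recursive shape is what a filtering pass needs: visit a node, decide what to do with it, then descend into its children.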
PHP+Tidy implementation of error correction and filtering
That is a lot of background, and it may seem confusing. The code that actually solves the problem is the most direct explanation.
1. Simple error correction implementation
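function HtmlFix($html)
{
    if(!function_exists('tidy_repair_string'))
        return $html;
    //use tidy to repair html code
    //repair
    $str = tidy_repair_string($html,
                  array('output-xhtml'=>true),
                  'utf8');
    //parse the repaired markup into a tree
    $str = tidy_parse_string($str,
                  array('output-xhtml'=>true),
                  'utf8');
    $s = '';
    //children of the body node
    $body = tidy_get_body($str);
    $nodes = $body ? $body->child : null;
    if(!is_array($nodes)){
        return $s;
    }
    //concatenate the markup of each child node
    foreach($nodes as $n){
        $s .= $n->value;
    }
    return $s;
}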
The code above cleans up XHTML that may be non-standard or incorrect and outputs standard XHTML (both input and output are UTF-8 encoded). The implementation is not the most streamlined, because I wrote it out in as much detail as possible to match the filtering function below.
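For comparison, a shorter variant is possible when no node-by-node processing is needed. The sketch below is an assumption on my part rather than the author's code: html_fix_short is a hypothetical name, and it relies on Tidy's show-body-only option to return just the body content in a single call.

```php
<?php
// html_fix_short() is a hypothetical, shorter variant for illustration.
function html_fix_short(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy is missing: return the input unchanged
    }
    $config = [
        'output-xhtml'   => true, // emit XHTML
        'show-body-only' => true, // return only the content of <body>
    ];
    $fixed = tidy_repair_string($html, $config, 'utf8');
    // tidy_repair_string returns false on failure; fall back to the input.
    return $fixed === false ? $html : $fixed;
}

echo html_fix_short('<p>unclosed <b>bold<br>');
```

The longer form with an explicit parse step becomes worthwhile once each node has to be inspected, which is what the filtering version needs.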
2. Advanced implementation: Error correction + filtering
Features:
Correct XHTML errors and output standard XHTML code.
Filter unsafe code without affecting how the content displays: only the unsafe code in style/javascript is removed.
Insert a tag into extremely long strings so that they wrap automatically across browsers. For background, see the related article on line breaks in extremely long text on web pages.
function HtmlFixSafe($html)
{
if(!function_exists('tidy_repair_string'))
return $html;
//use tidy to repair html code
// tidy parameter settings
$conf = array(
'output-xhtml'=>true
,'drop-empty-paras'=>false
,'join-classes'=>true
,'show-body-only'=>true
);
//repair
$str = tidy_repair_string($html,$conf,'utf8');
//Generate parse tree
$str = tidy_parse_string($str,$conf,'utf8');
$s ='';
//Get the body node
$body = @tidy_get_body($str);
//Function _dumpnode, check each node, filter and output
function _dumpnode($node,&$s){
//View the node name, if it is