
PHP Kernel Exploration: Variables - Extraordinary Strings

WBOY
2016-07-13 09:58:47


Come on, what is there to study about a string?
Don't say that. Have you ever watched "The Ordinary World"? Even ordinary strings can have extraordinary stories. A preview:
(1) In C, what is the time complexity of computing a string's length with strlen? What about in PHP?
(2) How does PHP handle multi-byte strings? How good is PHP's Unicode support?
(3) They are all strings, so why does a C string differ from a string in C++, PHP, or Java?
Data structures determine algorithms, and that saying holds here too.
So today we look at how strings are structured in PHP and how some of the related string functions are implemented.
1. String Basics
Strings are among the most frequently used data structures in PHP (arrays are the other; see "Variables (4) - Array Operations" in this PHP Kernel Exploration series). Given PHP's characteristics and typical use cases, much of our daily work is string processing. That is why PHP ships a rich set of string functions (a rough count gives about 100, a considerable number). So how are strings implemented in PHP, and how do they differ from C strings?
1. Representation of strings in PHP
There are four common forms of using strings in PHP:
(1) Double quotes
This is the most common form. Variables and escape sequences inside double quotes are parsed:
$str = "this is a string";
(2) Single quotes
Characters inside single quotes are treated as raw text, so variables and escape sequences such as \n are not parsed:
$string = "test";
$str = 'this is $string, ahan';
echo $str;
(3) Heredoc
Heredoc suits longer strings and is more flexible for multi-line text. As with double quotes, variables inside a heredoc are parsed. The common form is:
$string = "test string";
$str = <<<STR
This is a string \n,
My string is $string
STR;
echo $str;
(4) Nowdoc (supported since PHP 5.3)
Nowdoc and heredoc are close siblings. The opening identifier of a nowdoc is enclosed in single quotes and, as with single-quoted strings, variables and escape sequences are not parsed:
$s = <<<'EOT'
this is $str
this is \t test;
EOT;
echo $s;
2. The structure of strings in PHP
As mentioned before, PHP stores variables in a structure called zval (see "Variables (1) - Zval" in this series). The zval structure (as defined in PHP 5) is:
struct _zval_struct {
    zvalue_value value;       /* value */
    zend_uint refcount__gc;   /* variable ref count */
    zend_uchar type;          /* active type */
    zend_uchar is_ref__gc;    /* if it is a ref variable */
};
The value of the variable is a union, zvalue_value:
typedef union _zvalue_value {
    long lval;
    double dval;
    struct {                  /* string */
        char *val;
        int len;
    } str;
    HashTable *ht;
    zend_object_value obj;
} zvalue_value;
Extracting the string part gives:
struct {
    char *val;
    int len;
} str;
So at the lowest level a PHP string is a structure that holds a pointer to the character data and the length of the string.
Why is it designed this way, and what are the benefits? Next we compare PHP strings with C strings to explain the advantages of storing strings in such a structure.
3. Comparison with C strings
In C, a string is commonly stored in one of two ways: through a pointer or in a character array. The discussion below uses character arrays.
(1) PHP strings are binary safe, while C strings are not.
The term "binary safe" comes up often, so what exactly does it mean?
Wikipedia defines binary safe as follows:
Binary-safe is a computer programming term mainly used in connection with string manipulating functions.
A binary-safe function is essentially one that treats its input as a raw stream of data without any specific format.
It should thus work with all 256 possible values that a character can take (assuming 8-bit characters).
In other words, a binary-safe function treats its input as a raw byte stream, so every one of the 256 possible byte values is ordinary data.
So why aren't C strings binary safe? In C, a string stored in a character array always ends with '\0'. Functions such as strlen and strcmp stop at the first '\0' they meet, so a string with a '\0' byte in the middle is cut short. A PHP string records its length in len, so it can hold any sequence of bytes, '\0' included.
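The difference can be reproduced in a few lines of C. This is only an illustrative sketch: counted_str and counted_equal are hypothetical names introduced here, modeled on the val/len pair shown earlier, and are not PHP's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string, modeled on PHP's val/len pair. */
typedef struct {
    const char *val;
    size_t len;
} counted_str;

/* Binary-safe equality: compares all len bytes, '\0' bytes included. */
int counted_equal(counted_str a, counted_str b)
{
    assert(a.val != NULL && b.val != NULL);
    return a.len == b.len && memcmp(a.val, b.val, a.len) == 0;
}

/* Two 7-byte values that differ only after an embedded '\0'. */
const char raw_a[] = { 'a', 'b', 'c', '\0', 'x', 'y', 'z' };
const char raw_b[] = { 'a', 'b', 'c', '\0', 'Q', 'Q', 'Q' };
```

With these definitions, strlen(raw_a) reports 3 and strcmp(raw_a, raw_b) reports the two values as equal, because both functions stop at the first '\0'. counted_equal, given the full length of 7, sees the bytes after the '\0' and reports a difference.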
(2) Efficiency comparison
Because a C string marks its end with '\0', strlen has to scan the string until it reaches that byte, which takes O(n) time. A PHP string, by contrast, is represented by a structure such as
struct {
    char *val;
    int len;
} str;
so obtaining the length of a string takes constant time:
#define Z_STRLEN(zval) (zval).value.str.len
Of course, the cost of strlen alone does not support the conclusion that "PHP strings are more efficient than C strings" (an obvious reason is that PHP is a high-level language built on top of C). It only means that, in terms of time complexity, getting the length of a PHP string is cheaper than getting the length of a C string.
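The two costs can be put side by side in C. The sketch below is an illustration under assumptions, not code from PHP or libc: scan_length spells out the loop that a strlen-style function has to perform, and counted_str and counted_length are hypothetical names modeled on the val/len structure.

```c
#include <assert.h>
#include <stddef.h>

/* What a strlen-style function has to do: walk the bytes up to '\0'. O(n). */
size_t scan_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* Hypothetical counted string, modeled on the val/len pair. */
typedef struct {
    const char *val;
    size_t len;
} counted_str;

/* Reading a stored field: O(1), whatever the size of the string. */
size_t counted_length(const counted_str *s)
{
    assert(s != NULL);
    return s->len;
}
```

The price of the constant-time read is that every operation that changes the string must keep len in sync with the data.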
(3) Many C string functions have buffer overflow vulnerabilities
Buffer overflow is a common class of vulnerability in C, and the consequences are often fatal. A typical example:
void str2Buf(char *str) {
    char buffer[16];
    strcpy(buffer, str);
}
This function copies the contents of str into buffer, which has room for 16 bytes. If str holds 16 or more characters, the copy, including the terminating '\0', no longer fits and the buffer overflows.
Besides strcpy, functions such as gets, strcat, and sprintf have the same problem.
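One common way to avoid the overflow in C is to pass the destination size along with the pointer. The following is a minimal sketch of that idea using the standard snprintf function; copy_bounded is a hypothetical helper name chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Copy src into dst without ever writing more than cap bytes.
 * Returns 1 if the whole string fit, 0 if it had to be truncated.
 */
int copy_bounded(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    /* snprintf writes at most cap - 1 characters plus the terminating '\0'
       and returns the length the complete output would have had. */
    int needed = snprintf(dst, cap, "%s", src);
    return needed >= 0 && (size_t)needed < cap;
}
```

Called as copy_bounded(buffer, sizeof buffer, str) with a 16-byte buffer, an over-long str is cut to 15 characters plus '\0' and the return value tells the caller that truncation happened.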
PHP has no functions such as strcpy and strcat, and because of the language's simplicity it does not need them. For example, to copy a string, just use =:
$str = "this is a string";
$str_copy = $str;
Because variables share a zval in PHP, the copy wastes no space (the data is only duplicated when one of the variables is modified). And the simple . operator concatenates strings:
$str = "this is";
$str .= "test string";
echo $str;
For the memory allocation and management involved in string concatenation, see the Zend engine implementation; we skip it for now.
2. String functions (a selection)
The point of studying strings is not just to know their structure and characteristics, but to use them better. Most of our daily work involves strings: processing a date string, encrypting a password, reading user information, matching and replacing with regular expressions, replacing substrings, formatting a string, and so on. In PHP development you cannot avoid direct or indirect contact with strings (just as you cannot avoid breathing). For this reason PHP provides a large number of string functions (http://cn2.php.net/manual/en/ref.strings.php), which cover more than 90% of string operations and are basically enough.
Because there are so many string functions, it is impossible to explain them one by one. We pick only a few typical ones for a brief explanation (I believe more than 80% of PHPers already have a good grasp of the string functions).
Before we start, it is worth stressing a few principles for using string functions. Understanding them matters a great deal for using string functions efficiently and fluently. They include (but are not limited to):
(1) If a task can be done with either a regular expression or a string function, prefer the string function.
Regular expressions are an excellent text-processing tool, and for pattern search and pattern replacement they are hard to beat. Because of this, they are overused. If your string operation can be done with either string functions or regular expressions, prefer the string functions, because regular expressions can have serious performance problems in certain situations.
(2) Mind the difference between false and 0
PHP is a weakly typed language, and I believe many PHPers were bitten by this early on:
var_dump( 0 == false);//bool(true)
var_dump( 0 === false);//bool(false)
Wait, what does this have to do with string manipulation functions?
PHP has a family of search functions (such as strpos and stripos). When the search succeeds, such a function returns the index of the substring in the original string, for example with strpos:
var_dump(strpos("this is abc", "abc")); // int(8)
When the search is unsuccessful, false is returned:
var_dump(strpos("this is abc", "angle")); // bool(false)
There is a pitfall here: string indexes also start at 0! If the substring happens to appear at the very beginning of the source string, a simple == comparison cannot tell whether strpos succeeded:
var_dump(strpos("this is abc", "this")); // int(0)
Therefore we must use === to compare:
if((strpos("this is abc", "this")) === false){
// not found
}
(3) Read the manual more and avoid reinventing the wheel
I believe many PHPers have met this interview question: how do you reverse a string? Since the question only asks "how" and does not forbid PHP built-in functions, the simplest answer is naturally strrev. Another function that shows why you should not reinvent the wheel is levenshtein. As its name suggests, it returns the edit distance between two strings. Edit distance is one of the textbook cases of dynamic programming (DP) and is familiar to many people. If you meet this kind of problem, are you really going to write the DP by hand? One function does it:
$str1 = "this is test";
$str2 = "his is tes";
echo levenshtein($str1, $str2); // 2
We should all be as "lazy" as possible in certain situations, right?
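For reference, the dynamic program that levenshtein spares you from writing looks roughly like this in C. It is a minimal sketch with a cost of 1 for each insertion, deletion, and replacement; edit_distance and min3 are names chosen for this example, and the code is not PHP's internal implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Edit distance between s and t, or -1 if memory allocation fails. */
long edit_distance(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    /* One row of the DP table: row[j] = distance between s[0..i) and t[0..j). */
    size_t *row = malloc((m + 1) * sizeof *row);
    if (row == NULL) {
        return -1;
    }
    for (size_t j = 0; j <= m; j++) {
        row[j] = j;                      /* turning "" into t[0..j) takes j inserts */
    }
    for (size_t i = 1; i <= n; i++) {
        size_t diag = row[0];            /* value from the previous row, column j-1 */
        row[0] = i;                      /* turning s[0..i) into "" takes i deletes */
        for (size_t j = 1; j <= m; j++) {
            size_t up = row[j];          /* value from the previous row, column j */
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            row[j] = min3(up + 1,            /* delete s[i-1] */
                          row[j - 1] + 1,    /* insert t[j-1] */
                          diag + cost);      /* keep or replace */
            diag = up;
        }
    }
    long result = (long)row[m];
    free(row);
    return result;
}
```

For the two strings above, edit_distance("this is test", "his is tes") gives 2: one deletion at each end.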
The following is an excerpt of string operation functions (for the most common operations, please refer to the manual directly)
1. strlen
At the sight of this title, most readers probably roll their eyes.
What I want to discuss is not the function itself, but its return value.
int strlen ( string $string )
Returns the length of the given string.
The manual clearly states that "strlen returns the length of the given string", but it does not say what unit the length is in. Is it the number of characters or the number of bytes? Rather than speculate, let us test.
With GBK encoding:
echo strlen("这是中文"); //8
This shows that strlen returns the number of bytes in the string. That raises another question: in UTF-8 each of these Chinese characters takes 3 bytes, so the expected result is 12:
echo strlen("这是中文"); //12
In other words, the length computed by strlen depends on the encoding of the string, so the same text can give different values. In some cases this is not what we want, and this is where the multi-byte extension mbstring comes in:
echo mb_strlen("这是中文", "GB2312"); //4
This point is explained further in the multi-byte section, so we skip it here.
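The byte-versus-character distinction is easy to reproduce in C, where strlen also counts bytes. The sketch below assumes well-formed UTF-8 input; utf8_char_count is a hypothetical helper written for this example, and sample holds the four characters 这是中文 spelled out as UTF-8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strlen, for comparing against the byte count */

/*
 * Count characters (code points) in a UTF-8 string by skipping
 * continuation bytes, which always have the bit pattern 10xxxxxx.
 * Assumes the input is well-formed UTF-8.
 */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Four Chinese characters, 3 bytes each in UTF-8. */
const char sample[] = "\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87";
```

Here strlen(sample) is 12 while utf8_char_count(sample) is 4, the same split that strlen and mb_strlen show in PHP.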
2. str_word_count
str_word_count is another powerful and easily overlooked string function.
mixed str_word_count ( string $string [, int $format = 0 [, string $charlist ]] )
Different values of $format make str_word_count behave differently. Suppose we have this text at hand:
When I am down and, oh my soul, so weary
When troubles come and my heart burdened be
Then, I am still and wait here in the silence
Until you come and sit awhile with me
You raise me up, so I can stand on mountains
You raise me up, to walk on stormy seas
I am strong, when I am on your shoulders
You raise me up… To more than I can be
You raise me up, so I can stand on mountains
You raise me up, to walk on stormy seas
I am strong, when I am on your shoulders
You raise me up, To more than I can be.
Then:
(1)$format = 0
When $format=0, the function returns the number of words in the text:
echo str_word_count(file_get_contents("word")); //112
(2)$format = 1
When $format=1, an array of all the words in the text is returned:
print_r(str_word_count(file_get_contents("word"), 1));
Array
(
[0] => When
[1] => I
[2] => am
[3] => down
[4] => and
[5] => oh
[6] => my
[7] => soul
[8] => so
[9] => weary
[10] => When
[11] => troubles
......
)
What is this useful for? English word segmentation, for example. Remember the "word count" problem? With str_word_count the top-K word frequency problem becomes easy:
$s = file_get_contents("./word");
$a = array_count_values(str_word_count($s, 1)) ;
arsort( $a );
print_r( $a );
/*
Array
(
[I] => 10
[me] => 7
[raise] => 6
[up] => 6
[You] => 6
[am] => 6
[on] => 6
[can] => 4
[and] => 4
[be] => 3
[so] => 3
)*/
(3)$format = 2
When $format=2, an associative array is returned whose keys are the positions of the words in the string:
$a = str_word_count($s, 2);
print_r($a);
/*
Array
(
[0] => When
[5] => I
[7] => am
[10] => down
[15] => and
[20] => oh
[23] => my
[26] => soul
[32] => so
[35] => weary
[41] => When
[46] => troubles
[55] => come
...
)*/
Combined with other array functions, this enables more. For example, with array_flip you can get the position of the last occurrence of each word:
$t = array_flip(str_word_count($s, 2));
print_r($t);
And if you apply array_unique before array_flip, you get the position where each word first appears:
$t = array_flip( array_unique(str_word_count($s, 2)) );
print_r($t);
Array
(
[When] => 0
[I] => 5
[am] => 7
[down] => 10
[and] => 15
[oh] => 20
[my] => 23
[soul] => 26
[so] => 32
[weary] => 35
[troubles] => 46
[come] => 55
[heart] => 67
...
)
3. similar_text
This is another function besides the levenshtein() function that calculates the similarity of two strings:
int similar_text ( string $first , string $second [, float &$percent ] )
$t1 = "You raise me up, so I can stand on mountains";
$t2 = "You raise me up, to walk on stormy seas";
$percent = 0;
echo similar_text($t1, $t2, $percent).PHP_EOL;//26
echo $percent;// 62.650602409639
Usage aside, I am curious how string similarity is defined under the hood.
similar_text is implemented in ext/standard/string.c; the key code is excerpted below:
PHP_FUNCTION(similar_text){
    char *t1, *t2;
    zval **percent = NULL;
    int ac = ZEND_NUM_ARGS();
    int sim;
    int t1_len, t2_len;
    /* Parse the parameters */
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
        return;
    }
    /* Convert percent to a double */
    if (ac > 2) {
        convert_to_double_ex(percent);
    }
    /* Both strings are empty: t1_len == 0 && t2_len == 0 */
    if (t1_len + t2_len == 0) {
        if (ac > 2) {
            Z_DVAL_PP(percent) = 0;
        }
        RETURN_LONG(0);
    }
    /* Count the matching characters */
    sim = php_similar_char(t1, t1_len, t2, t2_len);
    /* Similarity percentage */
    if (ac > 2) {
        Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
    }
    RETURN_LONG(sim);
}
As the code shows, the number of matching characters is computed by php_similar_char, and the similarity percentage is defined by the formula:
percent = sim * 200 / (length of t1 + length of t2)
The implementation of php_similar_char:
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
    int sum;
    int pos1 = 0, pos2 = 0, max;
    php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
    if ((sum = max)) {
        if (pos1 && pos2) {
            sum += php_similar_char(txt1, pos1, txt2, pos2);
        }
        if ((pos1 + max < len1) && (pos2 + max < len2)) {
            sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max, txt2 + pos2 + max, len2 - pos2 - max);
        }
    }
    return sum;
}
This function counts the matching characters by calling php_similar_str, which finds the longest common substring of txt1 and txt2 and reports its length and its position in each string:
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
    char *p, *q;
    char *end1 = (char *) txt1 + len1;
    char *end2 = (char *) txt2 + len2;
    int l;
    *max = 0;
    /* Find the longest common substring */
    for (p = (char *) txt1; p < end1; p++) {
        for (q = (char *) txt2; q < end2; q++) {
            for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
            if (l > *max) {
                *max = l;
                *pos1 = p - txt1;
                *pos2 = q - txt2;
            }
        }
    }
}
Once php_similar_str has found a match, the original strings are split into three parts:
The first part lies to the left of the longest common substring. It may contain further common substrings, but none longer than the one found;
The second part is the longest common substring itself;
The third part lies to the right of the longest common substring and, like the first part, may contain further, shorter common substrings.
Therefore the lengths of the common substrings in the first and third parts must be found recursively:
/* Common substrings to the left of the longest one */
if (pos1 && pos2) {
    sum += php_similar_char(txt1, pos1, txt2, pos2);
}
/* Common substrings to the right of the longest one */
if ((pos1 + max < len1) && (pos2 + max < len2)) {
    sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max, txt2 + pos2 + max, len2 - pos2 - max);
}
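The matching process can be traced on a small example. For "World" and "Word", the longest common substring is "Wor" (3 characters). Nothing lies to its left, and the right-hand parts "ld" and "d" share "d" (1 character), so similar_text returns 3 + 1 = 4.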
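To experiment with the recursion outside the PHP source tree, the three functions can be condensed into a standalone C sketch. It follows the same steps as the excerpts above but drops the Zend macros; the names longest_common, similar_chars, and similar are chosen for this example and do not appear in PHP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find the first longest common substring of a[0..alen) and b[0..blen). */
static void longest_common(const char *a, size_t alen, const char *b, size_t blen,
                           size_t *pos_a, size_t *pos_b, size_t *max)
{
    *max = 0;
    *pos_a = 0;
    *pos_b = 0;
    for (size_t i = 0; i < alen; i++) {
        for (size_t j = 0; j < blen; j++) {
            size_t l = 0;
            while (i + l < alen && j + l < blen && a[i + l] == b[j + l]) {
                l++;
            }
            if (l > *max) {          /* strictly longer: the first match wins ties */
                *max = l;
                *pos_a = i;
                *pos_b = j;
            }
        }
    }
}

/* Longest common substring plus, recursively, the matches left and right of it. */
static size_t similar_chars(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t pos_a, pos_b, max;
    longest_common(a, alen, b, blen, &pos_a, &pos_b, &max);
    if (max == 0) {
        return 0;
    }
    size_t sum = max;
    if (pos_a > 0 && pos_b > 0) {
        sum += similar_chars(a, pos_a, b, pos_b);
    }
    if (pos_a + max < alen && pos_b + max < blen) {
        sum += similar_chars(a + pos_a + max, alen - pos_a - max,
                             b + pos_b + max, blen - pos_b - max);
    }
    return sum;
}

/* Returns the number of matching characters; stores the percentage if asked. */
size_t similar(const char *a, const char *b, double *percent)
{
    assert(a != NULL && b != NULL);
    size_t alen = strlen(a), blen = strlen(b);
    size_t sim = (alen + blen == 0) ? 0 : similar_chars(a, alen, b, blen);
    if (percent != NULL) {
        *percent = (alen + blen == 0) ? 0.0 : sim * 200.0 / (double)(alen + blen);
    }
    return sim;
}
```

For the two lyric lines used earlier, this sketch reproduces the values from the PHP example: 26 matching characters and 26 * 200 / (44 + 39), about 62.65 percent. Because the search compares every pair of starting positions, the cost grows quickly with string length, so it is best kept to short inputs.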
For more explanations of string functions, you can refer to the PHP online manual, and I will not list them one by one here.
3. Multi-byte strings
All the strings and string functions discussed so far are single-byte. However, the world is colorful, just as there are red watermelons and yellow watermelons, and strings are no exception. For example, the Chinese characters we commonly use are encoded with two bytes each in GBK. Multi-byte strings are not limited to Chinese; they also cover Japanese, Korean, and other scripts. That is why handling multi-byte strings matters so much.
Characters and character sets are terms you inevitably meet when programming. Readers who are not yet clear on them are advised to first read "Encoding Topic 1: Character Encoding Basics - Characters and Character Sets".
Because we mostly deal with Chinese in daily work, we take Chinese substring extraction as the example and focus on Chinese strings.
Chinese substring extraction
Extracting a substring from Chinese text has always been troublesome. The reasons are:
(1) PHP's native substr function only works on single-byte strings and is of little help with multi-byte strings.
(2) The mbstring extension must be enabled on the server, and many development environments do not enable it. That is a pity for those accustomed to mbstring.
(3) A subtler problem: in UTF-8 a Chinese character takes 3 bytes, but some special characters used in Chinese text (such as the middle dot ·) are encoded in two bytes. This makes extraction harder (after all, Chinese text can hardly avoid special characters altogether).
Headache or not, I still need to build a Chinese substring library. Its function should take a parameter list similar to substr and support both GBK and UTF-8. For efficiency, if the server has the mbstring extension enabled, mbstring's substring function should be used directly.
API:
string cnSubstr(string $str, int $start, int $len [, string $encode = 'GBK']); // note that $start and $len count characters, not bytes
We take UTF-8 as the example to explain how to extract Chinese substrings under UTF-8 encoding.
(1) Encoding ranges
UTF-8 encoding ranges (UTF-8 was originally designed to use 1-6 bytes per character, but in practice only 1-4 bytes are used):
1 byte: 00-7F
2 bytes: C080-DFBF
3 bytes: E08080-EFBFBF
4 bytes: F0808080-F7BFBFBF
Based on this, the number of bytes a character occupies can be determined from the range of its first byte:
$ord = ord($str[$i]);
$ord < 128: single byte (ASCII, including control characters)
192 <= $ord < 224: two bytes
224 <= $ord < 240: three bytes
Commonly used Chinese characters never need four bytes, so that case is not handled here.
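The first-byte rule can be written as a small C function. This is a sketch rather than a full validator, and utf8_seq_len is a hypothetical name. It is slightly stricter than the ranges listed above: it rejects the lead bytes C0, C1, and F5 and above, which well-formed UTF-8 never uses, and it also reports the four-byte case that the PHP code in this article ignores.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length in bytes of the UTF-8 sequence that starts at s[0],
 * or 0 if that byte cannot start a sequence (a continuation
 * byte or a value that never occurs in well-formed UTF-8).
 */
int utf8_seq_len(const char *s)
{
    assert(s != NULL);
    unsigned char lead = (unsigned char)s[0];
    if (lead < 0x80) {
        return 1;                         /* 0xxxxxxx: ASCII */
    }
    if (lead >= 0xC2 && lead <= 0xDF) {
        return 2;                         /* 110xxxxx */
    }
    if (lead >= 0xE0 && lead <= 0xEF) {
        return 3;                         /* 1110xxxx: most Chinese characters */
    }
    if (lead >= 0xF0 && lead <= 0xF4) {
        return 4;                         /* 11110xxx */
    }
    return 0;
}
```

The middle dot mentioned earlier (bytes C2 B7) yields 2, while a typical Chinese character such as E4 B8 AD yields 3.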
(2) When $start is negative
if ($start < 0) {
    $start += cnStrlen_utf8($str);
    if ($start < 0) {
        $start = 0;
    }
}
Most substring implementations found online do not handle $start < 0. Following the API of PHP's substr, when $start < 0 the length of the string (for multi-byte text, the number of characters) should be added to it.
Here cnStrlen_utf8 returns the number of characters in a UTF-8 string:
function cnStrlen_utf8($str) {
    $len = 0;
    $i = 0;
    $slen = strlen($str);
    while ($i < $slen) {
        $ord = ord($str[$i]);
        if ($ord < 128) {
            $i++;       // single byte
        } else if ($ord < 224) {
            $i += 2;    // two bytes
        } else {
            $i += 3;    // three bytes
        }
        $len++;
    }
    return $len;
}
So the UTF-8 extraction algorithm is:
function cnSubstr_utf8($str, $start, $len) {
    $clen = cnStrlen_utf8($str); // length in characters
    if ($start < 0) {
        $start += $clen;
        if ($start < 0) {
            $start = 0;
        }
    }
    if ($len < 0) {
        // as with substr(), a negative $len drops that many characters from the end
        $len += $clen - $start;
        if ($len < 0) {
            $len = 0;
        }
    }
    $slen = strlen($str); // length in bytes
    $i = 0;
    $count = 0;
    /* Find the starting byte position */
    while ($i < $slen && $count < $start) {
        $ord = ord($str[$i]);
        if ($ord < 128) {
            $i++;
        } else if ($ord < 224) {
            $i += 2;
        } else {
            $i += 3;
        }
        $count++;
    }
    $count = 0;
    $substr = '';
    /* Extract $len characters */
    while ($i < $slen && $count < $len) {
        $ord = ord($str[$i]);
        if ($ord < 128) {
            $substr .= $str[$i];
            $i++;
        } else if ($ord < 224) {
            $substr .= $str[$i] . $str[$i + 1];
            $i += 2;
        } else {
            $substr .= $str[$i] . $str[$i + 1] . $str[$i + 2];
            $i += 3;
        }
        $count++;
    }
    return $substr;
}
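The same two-pass idea, first skipping $start characters and then taking $len characters, can be sketched in C. The version below is an illustration under assumptions: it expects well-formed UTF-8, handles only non-negative positions, and returns a byte range instead of building a new string. byte_range, skip_chars, and utf8_substr_range are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset and byte length of a substring inside the original string. */
typedef struct {
    size_t offset;
    size_t length;
} byte_range;

/* Starting at byte offset i, step over n characters; returns the new offset. */
static size_t skip_chars(const char *s, size_t slen, size_t i, size_t n)
{
    while (i < slen && n > 0) {
        i++;                                  /* lead byte */
        while (i < slen && ((unsigned char)s[i] & 0xC0) == 0x80) {
            i++;                              /* continuation bytes, 10xxxxxx */
        }
        n--;
    }
    return i;
}

/* Locate len characters of s starting at character index start. */
byte_range utf8_substr_range(const char *s, size_t start, size_t len)
{
    assert(s != NULL);
    size_t slen = strlen(s);
    byte_range r;
    r.offset = skip_chars(s, slen, 0, start);
    r.length = skip_chars(s, slen, r.offset, len) - r.offset;
    return r;
}
```

The caller can then copy or print the bytes from s + r.offset for r.length bytes. Stepping over continuation bytes, instead of assuming 2 or 3 bytes per lead byte, means four-byte characters are handled as well.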
The final cnSubstr() can be designed as follows (there is still plenty of room for optimization):
function cnSubstr($str, $start, $len, $encode = 'gbk') {
    if (extension_loaded("mbstring")) {
        // prefer mbstring when it is available
        return mb_substr($str, $start, $len, $encode);
    }
    $enc = strtolower($encode);
    switch ($enc) {
        case 'gbk':
        case 'gb2312':
            return cnSubstr_gbk($str, $start, $len);
        case 'utf-8':
        case 'utf8':
            return cnSubstr_utf8($str, $start, $len);
        default:
            // do some warning or trigger error;
    }
}
A simple test:
$str = "This is a Chinese string string, and abs·";
for ($i = 0; $i < 10; $i++) {
    echo cnSubstr($str, $i, 3, 'utf8') . PHP_EOL;
}
Finally, here is the msubstr function provided in ThinkPHP extend (a substr built on regular expressions):
function msubstr($str, $start=0, $length, $charset="utf-8", $suffix=true) {
    if (function_exists("mb_substr"))
        $slice = mb_substr($str, $start, $length, $charset);
    elseif (function_exists('iconv_substr')) {
        $slice = iconv_substr($str, $start, $length, $charset);
        if (false === $slice) {
            $slice = '';
        }
    } else {
        $re['utf-8'] = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}/";
        $re['gb2312'] = "/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/";
        $re['gbk'] = "/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/";
        $re['big5'] = "/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])/";
        preg_match_all($re[$charset], $str, $match);
        $slice = join("", array_slice($match[0], $start, $length));
    }
    return $suffix ? $slice.'...' : $slice;
}
Because of the article's length, many other questions are not covered in detail here. As always, if you have any questions, feel free to point them out.
