
Data research tool to implement simple linear regression in PHP

WBOY
2016-07-13 17:34:27

Concept

The basic goal behind simple linear regression modeling is to find the best-fitting straight line in the two-dimensional plane formed by pairs of X and Y values (i.e., X and Y measurements). Once the line is found using the least squares method, various statistical tests can be performed to determine how well the line fits the observed Y values.

The linear equation (y = mx + b) has two parameters that must be estimated from the provided X and Y data: the slope (m) and the y-intercept (b). Once these two parameters are estimated, you can enter observed X values into the linear equation and examine the predicted Y values that the equation produces.

To estimate the m and b parameters using the least squares method, you need to find estimates of m and b that make the differences between the observed and predicted Y values as small as possible across all X values. The difference between an observed value and its predicted value is called the error (y_i - (mx_i + b)). If you square each error value and then sum the squares, the result is a number called the squared prediction error. Using the least squares method to determine the best fit means finding the estimates of m and b that minimize the squared prediction error.

Two basic methods can be used to find the estimates of m and b that satisfy the least squares criterion. In the first approach, you use a numerical search procedure to try different values of m and b and evaluate them, ultimately settling on the estimates that yield the smallest squared prediction error. The second approach is to use calculus to derive equations for estimating m and b. I'm not going to get into the calculus involved in deriving these equations, but I did use these analytical equations in the SimpleLinearRegression class to find the least squares estimates of m and b (see the getSlope() and getYIntercept() methods in the SimpleLinearRegression class).

Even if you have equations that find the least squares estimates of m and b, it does not follow that plugging these parameters into the linear equation gives a line that fits the data well. The next step in the simple linear regression procedure is to determine whether the remaining squared prediction error is acceptable.

You can use a statistical decision procedure to decide whether to reject the alternative hypothesis that a straight line fits the data. This procedure is based on computing a T statistic and using a probability function to find the probability of observing a value that large by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values; one of the important ones is the T statistic, which measures how well the linear equation fits the data. If the fit is good, the T statistic tends to be large; if the T value is small, you should replace your linear equation with a default model that assumes the mean of the Y values is the best predictor (because the average of a set of values is often a useful predictor of the next observation).

To test whether the T statistic is large enough to rule out using the mean of the Y values as the best predictor, you need to calculate the probability of obtaining that T statistic by chance. If the probability is low, the null hypothesis that the mean is the best predictor can be rejected, and you can be correspondingly more confident that a simple linear model is a good fit to the data. (See Part 1 for more information on calculating the probability of a T statistic.)

Let's go back to the statistical decision procedure. It tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the alternative hypothesis of a linear model needs to be established through theoretical and statistical arguments.
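The estimation and testing steps described above can be sketched in a few lines of PHP. This is a minimal illustration under stated assumptions, not the code of the SimpleLinearRegression class: the function name fitLine() and the sample data are hypothetical, the closed-form estimates are the standard textbook ones, and PHP 7 or later is assumed for the type declarations.

```php
<?php
// Hypothetical helper, not part of the article's SimpleLinearRegression class.
// Returns least squares estimates for y = m*x + b, the squared prediction
// error (SSE), and the T statistic for the slope with n - 2 degrees of freedom.
function fitLine(array $x, array $y): array {
    $n = count($x);
    if ($n !== count($y) || $n < 3) {
        throw new InvalidArgumentException('Need at least 3 paired observations.');
    }
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    // Sums of squares and cross-products around the means.
    $sxx = 0.0;
    $sxy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx   = $x[$i] - $meanX;
        $sxx += $dx * $dx;
        $sxy += $dx * ($y[$i] - $meanY);
    }
    if ($sxx == 0.0) {
        throw new InvalidArgumentException('All X values are identical.');
    }

    $m = $sxy / $sxx;           // slope
    $b = $meanY - $m * $meanX;  // y-intercept

    // Squared prediction error: sum of (observed - predicted)^2.
    $sse = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $error = $y[$i] - ($m * $x[$i] + $b);
        $sse  += $error * $error;
    }

    // Standard error of the slope, then T = slope / standard error.
    $stdErrSlope = sqrt(($sse / ($n - 2)) / $sxx);
    if ($stdErrSlope > 0) {
        $t = $m / $stdErrSlope;
    } else {
        $t = ($m == 0) ? 0.0 : INF; // perfect fit: no residual error
    }

    return ['m' => $m, 'b' => $b, 'sse' => $sse, 't' => $t, 'df' => $n - 2];
}

// Made-up sample data, for illustration only.
$fit = fitLine([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("m=%.4f b=%.4f SSE=%.4f t=%.4f df=%d\n",
    $fit['m'], $fit['b'], $fit['sse'], $fit['t'], $fit['df']);
// prints: m=0.6000 b=2.2000 SSE=2.4000 t=2.1213 df=3
```

The T value and degrees of freedom returned here are the inputs that a Student T probability function needs in order to decide whether the mean-only model can be rejected.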

The data research tool you will build implements a statistical decision procedure for linear models (the t-test) and provides summary data that can be used to construct the theoretical and statistical arguments needed to establish a linear model. Data research tools can be classified as decision support tools that help knowledge workers study patterns in small to medium-sized data sets.

From a learning perspective, simple linear regression modeling is worth studying because it provides a foundation for understanding more advanced forms of statistical modeling. For example, many core concepts in simple linear regression carry over directly to multiple regression, factor analysis, and time series analysis.

Simple linear regression is also a versatile modeling technique. It can be used to model curvilinear data by transforming the raw data (usually with a logarithmic or power transformation). These transformations linearize the data so that it can be modeled using simple linear regression. The resulting linear model will be represented as a linear formula related to the transformed values.
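As a rough sketch of the logarithmic transformation mentioned above, the example below fits a straight line to ln(y) for data that follow y = a * exp(k * x), then converts the intercept back to the original scale. The function name leastSquares() and the generated data are hypothetical, and PHP 7.1 or later is assumed for the array destructuring syntax.

```php
<?php
// Hypothetical helper: least squares slope and intercept for y = m*x + b.
function leastSquares(array $x, array $y): array {
    $n     = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;
    $sxx = 0.0;
    $sxy = 0.0;
    foreach ($x as $i => $xi) {
        $sxx += ($xi - $meanX) * ($xi - $meanX);
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
    }
    $m = $sxy / $sxx;
    return [$m, $meanY - $m * $meanX];
}

// Generated curvilinear data: y = 2 * exp(0.5 * x), for illustration only.
$x = [0, 1, 2, 3, 4];
$y = array_map(function ($xi) {
    return 2.0 * exp(0.5 * $xi);
}, $x);

// Linearize with a log transformation: ln(y) = ln(a) + k * x.
$logY = array_map('log', $y);
[$k, $lnA] = leastSquares($x, $logY);

// The fitted line describes the transformed values; exp() recovers a.
$a = exp($lnA);
printf("k=%.4f a=%.4f\n", $k, $a);
// prints: k=0.5000 a=2.0000
```

Note that the slope and intercept describe the relationship between X and ln(Y), so predictions on the original scale require applying exp() to the line's output.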


Probability function

In the previous article, I used R to find the probability value, thereby avoiding the problem of implementing the probability function in PHP. I wasn't completely satisfied with this solution, so I started researching what it would take to develop a probability function in PHP. I began looking online for information and code. One source for both is the Probability Functions material in the book Numerical Recipes in C. I reimplemented some of the probability function code (the gammln.c and betai.c functions) in PHP, but I still wasn't satisfied with the results: it seemed to take more code than some other implementations. Additionally, I needed the inverse probability function.

Fortunately, I stumbled upon John Pezzullo's Interactive Statistical Calculation. John's website on Probability Distribution Functions has all the functions I need, implemented in JavaScript to make learning easier.

I ported the Student T and Fisher F functions to PHP. I changed the API a bit to conform to Java naming style and embedded all functions into a class called Distribution. A great feature of this implementation is the doCommonMath method, which is reused by all functions in this library. Other tests that I didn't bother to implement (normality test and chi-square test) also use the doCommonMath method.

Another aspect of this port is also worth noting. In JavaScript, you can assign dynamically determined values to instance variables, such as:

            var PiD2 = pi() / 2
            

You can't do this in a PHP class declaration, where only simple constant values can be assigned to instance variables. Hopefully this flaw will be addressed in PHP5.
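One common way around this restriction is to assign the computed value in the constructor. The sketch below is illustrative only: the class name DistributionConstants and the property name piD2 are hypothetical and do not appear in the article's Distribution class. The class constant relies on constant expressions, which PHP accepts from version 5.6 onward; function calls such as pi() are still not permitted in declarations.

```php
<?php
// Hypothetical class, used only to illustrate the initialization pattern.
class DistributionConstants {
    // Constant expressions built from constants work in PHP 5.6 and later.
    const PI_D2 = M_PI / 2;

    // A declaration like "public $piD2 = pi() / 2;" is a compile error,
    // because a function call is not a constant expression.
    public $piD2;

    public function __construct() {
        // Values that need a function call are assigned at construction time.
        $this->piD2 = pi() / 2;
    }
}

$constants = new DistributionConstants();
printf("%.6f %.6f\n", $constants->piD2, DistributionConstants::PI_D2);
// prints: 1.570796 1.570796
```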

Note that the code in Listing 1 does not define instance variables — this is because in the JavaScript version, they are dynamically assigned values.

Listing 1. Implementing the probability functions

            <?php
            // Distribution.php
            // Copyright John Pezzullo
            // Released under same terms as PHP.
            // PHP Port and OOfying by Paul Meagher
            class Distribution {

                // Series sum reused by the distribution functions.
                function doCommonMath($q, $i, $j, $b) {
                    $zz = 1;
                    $z  = $zz;
                    $k  = $i;
                    while ($k <= $j) {
                        $zz = $zz * $q * $k / ($k - $b);
                        $z  = $z + $zz;
                        $k  = $k + 2;
                    }
                    return $z;
                }

                // Two-tailed probability of a T value with $df degrees of freedom.
                function getStudentT($t, $df) {
                    $t  = abs($t);
                    $w  = $t / sqrt($df);
                    $th = atan($w);
                    if ($df == 1) {
                        return 1 - $th / (pi() / 2);
                    }
                    $sth = sin($th);
                    $cth = cos($th);
                    if (($df % 2) == 1) {
                        // Odd degrees of freedom
                        return 1 - ($th + $sth * $cth * $this->doCommonMath($cth * $cth, 2, $df - 3, -1))
                            / (pi() / 2);
                    } else {
                        // Even degrees of freedom
                        return 1 - $sth * $this->doCommonMath($cth * $cth, 1, $df - 3, -1);
                    }
                }

                // Finds the T value for probability $p by bisection on $v,
                // where $t = (1 / $v) - 1 maps 0 < $v < 1 onto $t > 0.
                function getInverseStudentT($p, $df) {
                    $v  = 0.5;
                    $dv = 0.5;
                    $t  = 0;
                    while ($dv > 1e-6) {
                        $t  = (1 / $v) - 1;
                        $dv = $dv / 2;
                        if ($this->getStudentT($t, $df) > $p) {
                            $v = $v - $dv;
                        } else {
                            $v = $v + $dv;
                        }
                    }
                    return $t;
                }

                function getFisherF($f, $n1, $n2) {
                    // implemented but not shown
                }

                function getInverseFisherF($p, $n1, $n2) {
                    // implemented but not shown
                }
            }
            ?>
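The search strategy used by getInverseStudentT() in Listing 1 can be tried in isolation. The sketch below is an illustration rather than a replacement for the Distribution class: the function names are hypothetical, and to stay self-contained it uses the closed-form two-tailed probability for 2 degrees of freedom, 1 - t / sqrt(t^2 + 2), which is what the even-df branch of getStudentT() reduces to when df = 2. PHP 7 or later is assumed for the type declarations.

```php
<?php
// Hypothetical stand-in for getStudentT($t, 2): closed-form two-tailed
// probability of a T value with 2 degrees of freedom.
function twoTailedProbDf2(float $t): float {
    $t = abs($t);
    return 1 - $t / sqrt($t * $t + 2);
}

// Bisection as in Listing 1: search v in (0, 1), mapped to t = 1/v - 1,
// until the step size falls below 1e-6.
function invertTailProbability(callable $tailProb, float $p): float {
    $v  = 0.5;
    $dv = 0.5;
    $t  = 0.0;
    while ($dv > 1e-6) {
        $t  = (1 / $v) - 1;
        $dv = $dv / 2;
        if ($tailProb($t) > $p) {
            $v -= $dv; // probability too high: need a larger t, so a smaller v
        } else {
            $v += $dv; // probability too low: need a smaller t, so a larger v
        }
    }
    return $t;
}

// Critical T value for a two-tailed probability of 0.05 with df = 2.
$critical = invertTailProbability('twoTailedProbDf2', 0.05);
printf("%.3f\n", $critical);
// prints: 4.303
```

Because the tail probability decreases steadily as t grows, halving the step each time is enough to close in on the critical value without needing an analytic inverse.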
            
Output method

Now that you have implemented the probability function in PHP, the only remaining challenge in developing a PHP-based data research tool is designing a method for displaying the analysis results.

The simple solution is to display the values of all instance variables on the screen as needed. In the first article, I did this when displaying the linear equation, T value, and T probability for the Burnout Study. It is helpful to be able to access specific values for specific purposes, and SimpleLinearRegression supports this usage.

However, another way to output results is to systematically group parts of the output. If you study the output of the major statistical software packages used for regression analysis, you will find that they tend to group their output in the same way: a Summary Table, an Analysis of Variance table, a Parameter Estimate table, and an R Value. Accordingly, I created some output methods with the following names:
  • showSummaryTable()
