
Simple linear regression implemented in PHP

WBOY
2016-07-21 14:52:21

In Part 1 of this two-part series ("Simple Linear Regression in PHP"), I explained why math libraries are useful for PHP. I also demonstrated how to develop and implement core parts of a simple linear regression algorithm using PHP as the implementation language.

The goal of this article is to show you how to use the SimpleLinearRegression class discussed in Part 1 to build an important data research tool.

Brief Review: Concepts

The basic goal of simple linear regression modeling is to find the best-fitting straight line through a two-dimensional plane of paired X and Y values (that is, X and Y measurements). Once the line has been found using the least squares method, various statistical tests can be performed to determine how well the line accounts for the observed deviations in the Y values.

The linear equation (y = mx + b) has two parameters that must be estimated from the X and Y data provided: the slope (m) and the y-intercept (b). Once these two parameters have been estimated, you can plug observed X values into the linear equation and examine the Y predictions it generates.

To estimate the m and b parameters with the least squares method, you need to find the values of m and b that minimize the differences between the observed and predicted values of Y across all X values. The difference between an observed and a predicted value is called the error, y_i - (m x_i + b). If you square each error value and then sum the squares, the result is the sum of squared prediction errors. Using the least squares method to determine the best fit means finding the estimates of m and b that minimize this sum.
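As a rough illustration of the quantity being minimized, the sketch below computes the sum of squared prediction errors for a candidate line. The function name sumSquaredErrors() and the sample data are made up for this example and are not part of the SimpleLinearRegression class; the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the SimpleLinearRegression class):
// sum of squared prediction errors for the line y = m*x + b.
function sumSquaredErrors(array $x, array $y, float $m, float $b): float
{
    $sse = 0.0;
    foreach ($x as $i => $xi) {
        $error = $y[$i] - ($m * $xi + $b);  // observed minus predicted
        $sse  += $error * $error;
    }
    return $sse;
}

// Made-up sample data.
$x = [1, 2, 3, 4, 5];
$y = [2, 4, 5, 4, 5];

// Two candidate lines: the one that fits better yields the smaller sum.
echo sumSquaredErrors($x, $y, 0.6, 2.2), "\n";
echo sumSquaredErrors($x, $y, 1.0, 0.0), "\n";
```

Comparing the two sums for different candidate values of m and b is exactly the comparison a numerical search would make.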

Two basic methods can be used to find the estimates of m and b that satisfy the least squares criterion. In the first approach, you use a numerical search process that tries different values of m and b, evaluates them, and settles on the estimates that yield the smallest sum of squared errors. The second approach is to use calculus to derive equations for estimating m and b. I'm not going to get into the calculus involved in deriving these equations, but I did use these analytical equations in the SimpleLinearRegression class to find the least squares estimates of m and b (see the getSlope() and getYIntercept() methods of the SimpleLinearRegression class).
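For reference, the standard closed-form least squares estimates are m = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)² and b = ȳ - m·x̄, where x̄ and ȳ are the means of X and Y. The sketch below applies them in a stand-alone function. The name fitLine() is hypothetical, and whether getSlope() and getYIntercept() compute the estimates in exactly this form is an assumption, since the class source is not shown in this part. It assumes PHP 7 or later.

```php
<?php
// Hypothetical stand-alone function; the article's class exposes
// getSlope() and getYIntercept() instead.
// Assumes the X values are not all equal (otherwise $sxx is zero).
function fitLine(array $x, array $y): array
{
    $n     = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0;  // sum of cross-products of the X and Y deviations
    $sxx = 0.0;  // sum of squared X deviations
    foreach ($x as $i => $xi) {
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
        $sxx += ($xi - $meanX) ** 2;
    }

    $m = $sxy / $sxx;            // slope
    $b = $meanY - $m * $meanX;   // y-intercept
    return ['slope' => $m, 'intercept' => $b];
}

// Made-up sample data.
$fit = fitLine([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("y = %.2fx + %.2f\n", $fit['slope'], $fit['intercept']);
```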

Even if you have equations for finding the least squares estimates of m and b, plugging these parameters into the linear equation does not guarantee a straight line that fits the data well. The next step in the simple linear regression process is to determine whether the remaining prediction error is acceptable.

You can use a statistical decision procedure to evaluate the alternative hypothesis that a straight line fits the data. The procedure is based on computing a T statistic and then using a probability function to find the probability of observing a value that large by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values; one of the most important is the T statistic, which measures how well the linear equation fits the data. If the fit is good, the T statistic tends to be large; if the T value is small, you should replace your linear equation with a default model that assumes the mean of the Y values is the best predictor (because the mean of a set of values is often a useful predictor of the next observation).

To test whether the T statistic is large enough to justify not using the mean Y value as the best predictor, you need to calculate the probability of obtaining a T statistic that large by chance. If the probability is low, the null hypothesis that the mean is the best predictor can be rejected, and you can be correspondingly more confident that a simple linear model is a good fit to the data. (See Part 1 for more information on calculating the probability of a T statistic.)
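To make the T statistic concrete, here is a sketch of the textbook calculation for the slope: the estimated slope divided by its standard error, with n - 2 degrees of freedom. The function name slopeTStatistic() is hypothetical, and it is an assumption that this matches the calculation the SimpleLinearRegression class performs (Part 1 describes the class's own version). It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: T statistic for the slope of a fitted line.
// Assumes at least three data pairs, X values that are not all equal,
// and a fit that is not perfect (a zero error sum would divide by zero).
function slopeTStatistic(array $x, array $y): float
{
    $n     = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;

    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($x as $i => $xi) {
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
        $sxx += ($xi - $meanX) ** 2;
    }
    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;

    // Sum of squared prediction errors around the fitted line.
    $sse = 0.0;
    foreach ($x as $i => $xi) {
        $sse += ($y[$i] - ($m * $xi + $b)) ** 2;
    }

    // Standard error of the slope, using n - 2 degrees of freedom.
    $standardError = sqrt(($sse / ($n - 2)) / $sxx);

    return $m / $standardError;
}

// Made-up sample data: five pairs, so 3 degrees of freedom.
$t = slopeTStatistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
printf("T = %.4f with %d degrees of freedom\n", $t, 3);
```

The resulting T value and its degrees of freedom are the two inputs a Student T probability function needs to produce the probability discussed above.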

To return to the statistical decision procedure: it tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the linear-model alternative hypothesis needs to be established through theoretical and statistical arguments.

The data research tool you will build implements the statistical decision procedure for linear models (the T test) and provides summary data that can be used to construct the theoretical and statistical arguments needed to justify a linear model. The tool can be classified as a decision support tool that helps knowledge workers study patterns in small to medium-sized data sets.

From a learning perspective, simple linear regression modeling is worth studying because it is a gateway to understanding more advanced forms of statistical modeling. For example, many core concepts in simple linear regression lay a good foundation for understanding multiple regression, factor analysis, and time series analysis.

Simple linear regression is also a versatile modeling technique. It can be used to model curvilinear data by transforming the raw data (usually with a logarithmic or power transformation). These transformations linearize the data so that it can be modeled using simple linear regression. The resulting linear model will be represented as a linear formula related to the transformed values.
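A small sketch of the transformation idea: data that follow an exponential curve y = a·exp(k·x) become linear after taking the logarithm of Y, because ln(y) = ln(a) + k·x. The helper leastSquares() and the generated data are hypothetical and are used only to keep the example self-contained; the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: least squares slope and intercept.
function leastSquares(array $x, array $y): array
{
    $n     = count($x);
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($x as $i => $xi) {
        $sxy += ($xi - $meanX) * ($y[$i] - $meanY);
        $sxx += ($xi - $meanX) ** 2;
    }
    $m = $sxy / $sxx;
    return [$m, $meanY - $m * $meanX];
}

// Made-up curvilinear data generated from y = 2 * exp(0.5 * x).
$x = [0, 1, 2, 3, 4];
$y = array_map(function ($xi) { return 2 * exp(0.5 * $xi); }, $x);

// Linearize with a logarithmic transformation of Y, then fit a line.
$logY = array_map('log', $y);
list($k, $lnA) = leastSquares($x, $logY);

// The slope is k; transform the intercept back to recover a.
printf("y = %.2f * exp(%.2f * x)\n", exp($lnA), $k);
```

Note that the fitted line describes ln(y), so predictions have to be transformed back with exp() before they are compared with the raw Y values.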

Probability function

In the previous article, I got around the problem of implementing probability functions in PHP by asking R to find the probability value. I wasn't completely satisfied with this solution, so I started researching the question: what would it take to develop probability functions in PHP?

I started looking online for information and code. One source for both is the Probability Functions material in the book Numerical Recipes in C. I reimplemented some of the probability function code (the gammln.c and betai.c functions) in PHP, but I was still not satisfied with the result: it seemed to need more code than some other implementations. I also needed inverse probability functions.

Luckily, I stumbled upon John Pezzullo’s Interactive Statistical Calculation. John's website on Probability Distribution Functions has all the functions I need, implemented in JavaScript to make learning easier.

I ported the Student T and Fisher F functions to PHP. I changed the API a bit to conform to Java naming style and embedded all functions into a class called Distribution. A great feature of this implementation is the doCommonMath method, which is reused by all functions in this library. Other tests that I didn't bother to implement (normality test and chi-square test) also use the doCommonMath method.

Another aspect of this port is also worth noting. In JavaScript, you can assign dynamically computed values to instance variables, for example:

var PiD2 = pi() / 2

You cannot do this in PHP. Only simple constant values can be assigned to instance variables in their declarations. Hopefully this flaw will be resolved in PHP5.
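One common way around this restriction is to assign computed values in the constructor rather than in the property declaration. The sketch below assumes PHP 5 or later (for the __construct() method), which postdates the article; the class name CircleMath is hypothetical.

```php
<?php
// Hypothetical class, for illustration only.
class CircleMath
{
    // A property default must be a constant expression, so a function
    // call such as pi() / 2 is not allowed in this declaration.
    public $piD2;

    public function __construct()
    {
        // Computed values can be assigned when the object is created.
        $this->piD2 = pi() / 2;
    }
}

$math = new CircleMath();
echo $math->piD2, "\n";
```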

Note that the code in Listing 1 does not define instance variables — this is because in the JavaScript version, they are dynamically assigned values.

Listing 1. Implementing the probability functions


// [The start of this listing is missing from the source; the code below
// begins partway through a statement.]
doCommonMath($cth * $cth, 2, $df - 3, -1)) / (pi()/2);
    } else {
      return 1 - $sth * $this->doCommonMath($cth * $cth, 1, $df - 3, -1);
    }
  }

  function getInverseStudentT($p, $df) {
    $v  = 0.5;
    $dv = 0.5;
    $t  = 0;
    while($dv > 1e-6) {
      $t  = (1 / $v) - 1;
      $dv = $dv / 2;
      if ( $this->getStudentT($t, $df) > $p) {
        $v = $v - $dv;
      } else {
        $v = $v + $dv;
      }
    }
    return $t;
  }

  function getFisherF($f, $n1, $n2) {
    // implemented but not shown
  }

  function getInverseFisherF($p, $n1, $n2) {
    // implemented but not shown
  }

}
?>
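Because the listing as reproduced here begins partway through a method, the series helper it calls is not visible. The following is a sketch of how a two-tailed Student T probability can be computed with a finite trigonometric series, which is the approach the surviving fragment appears to take. The function names seriesSum() and studentT() and their bodies are a reconstruction for illustration, not the article's Distribution class, so treat them as an assumption. The code assumes PHP 7 or later.

```php
<?php
// Hypothetical reconstruction for illustration; not the article's code.

// Sums a finite series: each term is the previous term multiplied by
// q * k / (k - b), for k = i, i + 2, ... up to j.
function seriesSum(float $q, int $i, int $j, int $b): float
{
    $term = 1.0;
    $sum  = $term;
    for ($k = $i; $k <= $j; $k += 2) {
        $term = $term * $q * $k / ($k - $b);
        $sum += $term;
    }
    return $sum;
}

// Two-tailed probability of a T value at least as extreme as $t,
// given $df degrees of freedom ($df must be a positive integer).
function studentT(float $t, int $df): float
{
    $t     = abs($t);
    $theta = atan($t / sqrt($df));
    if ($df == 1) {
        return 1 - $theta / (M_PI / 2);
    }
    $sin = sin($theta);
    $cos = cos($theta);
    if ($df % 2 == 1) {
        // Odd degrees of freedom.
        return 1 - ($theta + $sin * $cos * seriesSum($cos * $cos, 2, $df - 3, -1)) / (M_PI / 2);
    }
    // Even degrees of freedom.
    return 1 - $sin * seriesSum($cos * $cos, 1, $df - 3, -1);
}

// Made-up example: T = 2.1213 with 3 degrees of freedom.
printf("p = %.4f\n", studentT(2.1213, 3));
```

As a sanity check, for 1 and 2 degrees of freedom the function reduces to the closed forms 1 - (2/π)·atan(t) and 1 - t/√(2 + t²). The getInverseStudentT() method shown in the listing then only needs such a probability function to bisect toward the T value that yields a requested probability.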

Graphical output

The output methods you have implemented so far all display summary values in HTML format. It is also appropriate to display scatter plots or line plots of the data in GIF, JPEG, or PNG format.

Rather than writing the code to generate line and scatter plots myself, I thought it would be better to use a PHP-based graphics library called JpGraph. JpGraph is being actively developed by Johan Persson, whose project website describes it this way:

Whether it’s a “quick and dirty” graph with minimal code, or a complex professional graph that requires very fine-grained control, JpGraph makes drawing them simple. JpGraph is equally suitable for scientific and business type graphs.

The JpGraph distribution includes a number of example scripts that can be customized to your specific needs. Using JpGraph in the data research tool was as simple as finding a sample script that did something similar to what I needed and adapting it to my specific needs.

The script in Listing 3 is extracted from the sample data research tool (explore.php) and demonstrates how to call the library and populate the Line and Scatter classes with data from the SimpleLinearRegression analysis. The comments in this code were written by Johan Persson (who does a great job documenting the JpGraph codebase).

Listing 3. Details of functions from the sample data research tool explore.php


// [The start of this listing is missing from the source; the first
// statement below is cut off at its beginning.]
SetScale("linlin");

// Setup title
$graph->title->Set("$title");
$graph->img->SetMargin(50,20,20,40);
$graph->xaxis->SetTitle("$x_name","center");
$graph->yaxis->SetTitleMargin(30);
$graph->yaxis->title->Set("$y_name");
$graph->title->SetFont(FF_FONT1,FS_BOLD);

// Make sure that the X-axis is always at the
// bottom of the plot and not just at Y=0, which is
// the default position
$graph->xaxis->SetPos('min');

// Create the scatter plot with some nice colors
$sp1 = new ScatterPlot($slr->Y, $slr->X);
$sp1->mark->SetType(MARK_FILLEDCIRCLE);
$sp1->mark->SetFillColor("red");
$sp1->SetColor("blue");
$sp1->SetWeight(3);
$sp1->mark->SetWidth(4);

// Create the regression line
$lplot = new LinePlot($slr->PredictedY, $slr->X);
$lplot->SetWeight(2);
$lplot->SetColor('navy');

// Add the plots to the graph
$graph->Add($sp1);
$graph->Add($lplot);

// ... and stroke
$graph_name = "temp/test.png";
$graph->Stroke($graph_name);
?>
 

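Data research script

The data research tool consists of a single script (explore.php) that calls methods of the SimpleLinearRegressionHTML class and the JpGraph library.

The script uses simple processing logic. The first part of the script performs basic validation on the submitted form data. If the form data passes validation, the second part of the script is executed.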
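The validation code itself is not shown here, so the following is only a sketch of what basic validation of submitted X and Y data might look like. It assumes the values arrive as two comma-separated strings; that input format, the function names parseValues() and validatePairs(), and the specific checks are all hypothetical rather than taken from explore.php. It assumes PHP 7.1 or later.

```php
<?php
// Hypothetical validation helpers; the input format is an assumption.

// Parses a comma-separated string of numbers.
// Returns null if any item is not numeric.
function parseValues(string $raw): ?array
{
    $values = [];
    foreach (explode(',', $raw) as $item) {
        $item = trim($item);
        if (!is_numeric($item)) {
            return null;
        }
        $values[] = (float) $item;
    }
    return $values;
}

// Returns a list of error messages; an empty list means the data can be analyzed.
function validatePairs(string $rawX, string $rawY): array
{
    $errors = [];
    $x = parseValues($rawX);
    $y = parseValues($rawY);
    if ($x === null || $y === null) {
        $errors[] = 'All X and Y values must be numeric.';
    } elseif (count($x) !== count($y)) {
        $errors[] = 'X and Y must contain the same number of values.';
    } elseif (count($x) < 3) {
        $errors[] = 'At least three data pairs are needed.';
    }
    return $errors;
}

$errors = validatePairs('1, 2, 3, 4, 5', '2, 4, 5, 4, 5');
echo empty($errors) ? "Form data is valid\n" : implode("\n", $errors) . "\n";
```

The three-pair minimum reflects the n - 2 degrees of freedom a simple linear regression needs for its T test; the second part of a script would run only when the returned list is empty.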

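The second part of the script contains the code that analyzes the data and displays the summary results in HTML and graphical formats. Listing 4 shows the basic structure of the explore.php script:

Listing 4. Structure of explore.php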


// [The start of this listing is missing from the source, and any HTML
// markup inside the echo strings appears to have been stripped.]
$title";
  $slr->showTableSummary($x_name, $y_name);
  echo "\n\n";

  $slr->showAnalysisOfVariance();
  echo "\n\n";

  $slr->showParameterEstimates($x_name, $y_name);
  echo "\n";

  $slr->showFormula($x_name, $y_name);
  echo "\n\n";

  $slr->showRValues($x_name, $y_name);
  echo "\n";

  include ("jpgraph/jpgraph.php");
  include ("jpgraph/jpgraph_scatter.php");
  include ("jpgraph/jpgraph_line.php");

  // The code for displaying the graphics is inline in the
  // explore.php script. The code for these two line plots
  // finishes off the script:

  // Omitted code for displaying scatter plus line plot
  // Omitted code for displaying residuals plot
}
?>
