Home  >  Article  >  Backend Development  >  The Importance of Mathematical Library for Implementing Simple Linear Regression in PHP_PHP Tutorial

The Importance of Mathematical Library for Implementing Simple Linear Regression in PHP_PHP Tutorial

WBOY
WBOYOriginal
2016-07-20 11:17:13752browse

Compared to other open source languages ​​such as Perl and Python, the PHP community lacks a strong effort to develop math libraries.

One reason for this situation may be that there are already a large number of mature mathematical tools, which may hinder the community from developing PHP tools on their own. For example, I worked on a powerful tool, S System, which had an impressive set of statistical libraries, was specifically designed to analyze data sets, and won an ACM Award in 1998 for its language design. If S or its open source cousin R is just an exec_shell call, why go to the trouble of implementing the same statistical computing functionality in PHP? For more information about the S System, its ACM Award, or R, see related references.

Isn’t this a waste of developer energy? If the motivation for developing a PHP math library was to save developer effort and use the best tool for the job, then PHP's current topic makes sense.

On the other hand, pedagogical motivations may encourage the development of PHP math libraries. For about 10% of people, mathematics is an interesting subject to explore. For those who are also proficient in PHP, the development of a PHP math library can enhance the math learning process. In other words, don't just read the chapter about T-tests, but also implement a program that can calculate the corresponding intermediate values ​​and display them in a standard format. their classes.

Through coaching and training, I hope to demonstrate that developing a PHP math library is not a difficult task and may represent an interesting technical and learning challenge. In this article, I will provide a PHP math library example called SimpleLinearRegression that demonstrates a general approach that can be used to develop PHP math libraries. Let's start by discussing some general principles that guided me in developing this SimpleLinearRegression class.

Guiding Principles

I used six general principles to guide the development of the SimpleLinearRegression class.

Create a class for each analysis model.
Use reverse linking to develop classes.
Expect a large number of getters.
Store intermediate results.
Set preferences for detailed APIs.
Perfection is not the goal.

Let’s examine each of these guidelines in more detail.

Create a class for each analysis model

Each major analysis test or process should have a PHP class with the same name as the test or process. This class contains input functions, functions for calculating intermediate and summary values, and output functions (the intermediate and summary values ​​are Display all on screen in text or graphic format).

Use reverse linking to develop classes

In mathematical programming, the coding target is usually the standard output value that an analysis procedure (such as MultipleRegression , TimeSeries , or ChiSquared ) wishes to produce. From a problem-solving perspective, this means you can use backward chaining to develop mathematical-like methods.

For example, the summary output screen displays one or more summary statistics. These summary statistical results rely on the calculation of intermediate statistical results, and these intermediate statistical results may involve deeper intermediate statistical results, and so on. This backlink-based development approach leads to the next principle.

Anticipate a large number of getters

Most of the class development work in mathematics involves calculating intermediate values ​​and summary values. In practice, this means that you shouldn't be surprised if your class contains many getter methods that calculate intermediate and aggregate values.

Store intermediate results

Storing intermediate calculation results within a result object allows you to use the intermediate results as input for subsequent calculations. This principle is implemented in the S language design. In the current context, this principle is implemented by selecting instance variables to represent calculated intermediate values ​​and summary results.

Set preferences for detailed APIs

When developing a naming scheme for the member functions and instance variables in the SimpleLinearRegression class, I discovered that if I use longer names (something like getSumSquaredError instead of getYY2) to describe the member functions and instance variables, then It is easier to understand the operation content of the function and the meaning of the variables.

I haven’t given up on abbreviated names entirely; however, when I use an abbreviated form of a name, I have to try to provide a comment that fully explains the meaning of the name. My take is this: highly abbreviated naming schemes are common in mathematical programming, but they make it more difficult to understand and prove that a certain mathematical routine is correct than it need be.

Perfection is not the goal

The goal of this coding exercise is not necessarily to develop a highly optimized and rigorous math engine for PHP. In the early stages, emphasis should be placed on learning to implement meaningful analytical tests and solving difficult problems in this area.

Instance variables

When modeling a statistical test or process, you need to indicate which instance variables are declared.

The selection of instance variables can be determined by accounting for the intermediate and summary values ​​generated by the analysis process. Each intermediate and summary value can have a corresponding instance variable, with the variable's value as an object property.

I used this analysis to determine which variables to declare for the SimpleLinearRegression class in Listing 1. Similar analysis can be performed on MultipleRegression, ANOVA, or TimeSeries procedures.

Listing 1. Instance variables of the SimpleLinearRegression class

// Copyright 2003, Paul Meagher
// Distributed under GPL 
class SimpleLinearRegression {
var $n;
var $X = array();
var $Y = array();
var $ConfInt;
var $Alpha;
var $XMean;
var $YMean;
var $SumXX;
var $SumXY;
var $SumYY;
var $Slope;
var $YInt;
var $PredictedY = array();
var $Error = array();
var $SquaredError = array();
var $TotalError;
var $SumError;
var $SumSquaredError;
var $ErrorVariance;
var $StdErr;
var $SlopeStdErr;
var $SlopeVal; // T value of Slope
var $YIntStdErr;  
var $YIntTVal; // T value for Y Intercept
var $R;
var $RSquared;
var $DF; // Degrees of Freedom
var $SlopeProb; // Probability of Slope Estimate
var $YIntProb; // Probability of Y Intercept Estimate
var $AlphaTVal; // T Value for given alpha setting
var $ConfIntOfSlope;                                
var $RPath = "/usr/local/bin/R"; // Your path here
 
var $format = "%01.2f"; // Used for formatting output
 
}
?>

Constructor

The constructor method of the SimpleLinearRegression class accepts an X and a Y vector, each with the same number of values. You can also set a default 95% confidence interval for your expected Y value.

The constructor method starts by verifying that the data form is suitable for processing. Once the input vectors pass the "equal size" and "value greater than 1" tests, the core part of the algorithm is executed.

Performing this task involves calculating the intermediate and summary values ​​of a statistical process through a series of getter methods. Assign the return value of each method call to an instance variable of the class. Storing calculation results in this way ensures that intermediate and summary values ​​are available to calling routines in chained calculations. You can also display these results by calling the output method of the class, as described in Listing 2.

Listing 2. Calling class output method

 
// Copyright 2003, Paul Meagher  
// Distributed under GPL    
function SimpleLinearRegression($X, $Y, $ConfidenceInterval="95") { 
  $numX = count($X); 
  $numY = count($Y); 
  if ($numX != $numY) { 
    die("Error: Size of X and Y vectors must be the same."); 
  }  
  if ($numX <= 1) {  
    die("Error: Size of input array must be at least 2."); 
  } 
   
  $this->n               = $numX; 
  $this->X               = $X; 
  $this->Y               = $Y;   
   
  $this->ConfInt         = $ConfidenceInterval;  
  $this->Alpha           = (1 + ($this->ConfInt / 100) ) / 2; 
  $this->XMean           = $this->getMean($this->X); 
  $this->YMean           = $this->getMean($this->Y); 
  $this->SumXX           = $this->getSumXX(); 
  $this->SumYY           = $this->getSumYY(); 
  $this->SumXY           = $this->getSumXY();     
  $this->Slope           = $this->getSlope(); 
  $this->YInt            = $this->getYInt(); 
  $this->PredictedY      = $this->getPredictedY(); 
  $this->Error           = $this->getError(); 
  $this->SquaredError    = $this->getSquaredError(); 
  $this->SumError        = $this->getSumError(); 
  $this->TotalError      = $this->getTotalError();     
  $this->SumSquaredError = $this->getSumSquaredError(); 
  $this->ErrorVariance   = $this->getErrorVariance(); 
  $this->StdErr          = $this->getStdErr();   
  $this->SlopeStdErr     = $this->getSlopeStdErr();      
  $this->YIntStdErr      = $this->getYIntStdErr();          
  $this->SlopeTVal       = $this->getSlopeTVal();             
  $this->YIntTVal        = $this->getYIntTVal();                 
  $this->R               = $this->getR();    
  $this->RSquared        = $this->getRSquared(); 
  $this->DF              = $this->getDF();           
  $this->SlopeProb       = $this->getStudentProb($this->SlopeTVal, $this->DF); 
  $this->YIntProb        = $this->getStudentProb($this->YIntTVal, $this->DF); 
  $this->AlphaTVal       = $this->getInverseStudentProb($this->Alpha, $this->DF); 
  $this->ConfIntOfSlope  = $this->getConfIntOfSlope();  
  return true; 

?> 

Method names and their sequences were derived through a combination of backlinking and reference to a statistics textbook used by undergraduate students, which explains step-by-step how to calculate intermediate values. The name of the intermediate value I need to calculate is prefixed with "get", thus deriving the method name.

Fit the model to the data

The SimpleLinearRegression procedure is used to produce a straight line fit to the data, where the straight line has the following standard equation:

 y = b + mx

The PHP format of this equation looks similar to Listing 3:

Listing 3. PHP equations to fit the model to the data


$PredictedY[$i] = $YIntercept + $Slope * $X[$i]

The SimpleLinearRegression class uses the least squares criterion to derive estimates of the Y-intercept (Y Intercept) and slope (Slope) parameters. These estimated parameters are used to construct a linear equation (see Listing 3) that models the relationship between the X and Y values.

Using the derived linear equation, you can get the predicted Y value corresponding to each X value. If the linear equation fits the data well, then the observed and predicted values ​​of Y tend to be consistent.

How to determine whether it is a good match

The SimpleLinearRegression class generates quite a few summary values. An important summary value is the T statistic, which measures how well a linear equation fits the data. If the agreement is very good, the T statistic will tend to be large. If the T statistic is small, then the linear equation should be replaced with a model that assumes that the mean of the Y values ​​is the best predictor (that is, the mean of a set of values ​​is usually a useful predictor of the next observation, make it the default model).

To test whether the T statistic is large enough not to consider the mean Y value as the best predictor, you need to calculate the random probability of obtaining the T statistic. If the probability of obtaining a T-statistic is low, then you can reject the null hypothesis that the mean is the best predictor and, accordingly, be confident that the simple linear model fits the data well.

So, how to calculate the probability of T statistic value?

Calculate the probability of T statistic

Since PHP lacks mathematical routines for calculating the probability of T statistic values, I decided to leave this task to the statistical computing package R (see www.r-project.org in Resources) to obtain the necessary values. I also want to draw attention to this bag because:

R provides many ideas that PHP developers might emulate in PHP math libraries.
With R, it is possible to determine whether the values ​​obtained from the PHP math library are consistent with those obtained from mature, freely available open source statistical packages.
The code in Listing 4 demonstrates how easy it is to leave it to R to get a value.

Listing 4. Handling it to the R statistical package to get a value


// Copyright 2003, Paul Meagher
// Distributed under GPL 
class SimpleLinearRegression {
 
var $RPath = "/usr/local/bin/R"; // Your path here
function getStudentProb($T, $df) {
$Probability = 0.0;
$cmd = "echo 'dt($T, $df)' | $this->RPath --slave";
$result = shell_exec($cmd);
list($LineNumber, $Probability) = explode(" ", trim($result));
Return $Probability;
}
function getInverseStudentProb($alpha, $df) {
$InverseProbability = 0.0;
$cmd = "echo 'qt($alpha, $df)' | $this->RPath --slave";
$result = shell_exec($cmd);
list($LineNumber, $InverseProbability) = explode(" ", trim($result));
Return $InverseProbability;
}
}
?>

Note that the path to the R executable has been set and used in both functions. The first function returns the probability value associated with the T statistic based on the Student's T distribution, while the second inverse function computes the T statistic corresponding to the given alpha setting. The getStudentProb method is used to evaluate the fit of the linear model; the getInverseStudentProb method returns an intermediate value, which is used to calculate the confidence interval for each predicted Y value.

Due to limited space, it is impossible for me to detail all the functions in this class one by one, so if you want to figure out the terminology and steps involved in simple linear regression analysis, I encourage you to refer to the statistics textbook used by undergraduate students. .

Burnup research

To demonstrate how to use this class, I can use data from a study of burnout in utilities. Michael Leiter and Kimberly Ann Meechan studied the relationship between a measure of burnout called the Exhaustion Index and an independent variable called Concentration. Concentration refers to the proportion of people's social contacts that come from their work environment.

To study the relationship between consumption index values ​​and concentration values ​​for individuals in their sample, load these values ​​into an appropriately named array and instantiate this class with these array values. After instantiating a class, display some summary values ​​generated by the class to evaluate how well the linear model fits the data.

Listing 5 shows the script that loads the data and displays summary values:

Listing 5. Script to load data and display summary values ​​


// BurnoutStudy.php
// Copyright 2003, Paul Meagher
// Distributed under GPL 
include "SimpleLinearRegression.php";
// Load data from burnout study
$Concentration = array(20,60,38,88,79,87,
                        68,12,35,70,80,92,
                                                                                                                                                                                                                                                                                                                                                                                    to                                                                                   $ExhaustionIndex = array(100,525,300,980,310,900,
                        410,296,120,501,920,810,
506,493,892,527,600,855,
709,791,718,684,141,400,970);
                                                                          $slr = new SimpleLinearRegression($Concentration, $ExhaustionIndex);
$YInt = sprintf($slr->format, $slr->YInt);
$Slope = sprintf($slr->format, $slr->Slope);
$SlopeTVal = sprintf($slr->format, $slr->SlopeTVal); 
$SlopeProb = sprintf("%01.6f", $slr->SlopeProb); 
?>













Equation:
T:
Prob > T:



Running this script through a web browser produces the following output:

Equation: Exhaustion = -29.50 + (8.87 * Concentration)

T: 6.03

Prob > T: 0.000005

The last row of this table indicates that the random probability of obtaining such a large value of T is very low. It can be concluded that a simple linear model has better predictive power than simply using the mean of the consumption values.

Knowing the concentration of connections in someone’s workplace can be used to predict the level of burnout they may be consuming. This equation tells us that for every 1 unit increase in the concentration value, the consumption value of a person in the social services field will increase by 8 units. This is further evidence that to reduce potential burnout, individuals in social services should consider making friends outside of their workplace.

This is just a rough description of what these results might mean. To fully explore the implications of this data set, you may want to study the data in more detail to make sure this is the correct interpretation. In the next article I will discuss what other analyzes should be performed.

What did you learn?

For one, you don’t have to be a rocket scientist to develop meaningful PHP-based math packages. By adhering to standard object-oriented techniques and explicitly adopting a backlink problem-solving approach, some relatively basic statistical procedures can be implemented relatively easily in PHP.

From a teaching standpoint, I think this exercise is very useful, if only because it requires you to think about statistical tests or routines at higher and lower levels of abstraction. In other words, a great way to supplement your statistical testing or procedural learning is to implement the procedure as an algorithm.

Implementing statistical tests often requires going beyond the scope of the given information and creatively solving and discovering problems. It is also a good way to discover gaps in knowledge about a subject.

On the downside, you find that PHP lacks inherent means for sampling distributions, which is necessary to implement most statistical tests. You'll need to let R do the processing to get these values, but I'm afraid you won't have the time or interest to install R. Native PHP implementations of some common probability functions can solve this problem.

Another problem: this class generates many intermediate and summary values, but the summary output doesn't actually take advantage of this. I've provided some unwieldy output, but it's neither sufficient nor well organized so that you can adequately interpret the results of the analysis. Actually, I have absolutely no idea how I can integrate the output method into this class. This needs to be addressed.

Finally, to understand the data, it’s not just about looking at the summary values. You also need to understand how individual data points are distributed. One of the best ways to do this is to graph your data. Again, I don't know much about this, but if you want to use this class to analyze real data, you need to solve this problem.

www.bkjia.comtruehttp: //www.bkjia.com/PHPjc/372077.htmlTechArticleCompared to other open source languages ​​(such as Perl and Python), the PHP community lacks strong work to develop mathematics Library. One reason for this may be that there are already large...
Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn