Fitting of noisy data

We used a linear least squares fit to analyze the linear relationship in given data. Linear data fitting is applicable to many problems, and in fact can also be used for exponential laws (simply take the logarithm of the y values and then fit) and even for power laws (take the logarithm of both the x and y values and fit); a small sketch of the logarithm trick follows below. In this worksheet we continue along the same lines, but extend the fit to allow for a quadratic function. The fit is still linear in the parameters to be determined, i.e., y = a*x^2 + b*x + c. Note that the case where nonlinear parameters are to be determined, and where the problem cannot be reduced to a linear fit of a related problem (as above), is considerably more difficult. No built-in procedures are available in Maple for nonlinear fitting at the present time; implementations of some algorithms for this case are described in the literature (cf. W. Press et al., Numerical Recipes, Cambridge University Press).
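To make the logarithm trick concrete, here is a minimal sketch (the names xdat, ydat, u and the data values are made up for illustration; we assume an exponential law y = A*exp(b*x) and use the long form of the stats fit routine that is introduced later in this worksheet):

> xdat := [0., .5, 1., 1.5, 2.]:

> ydat := [1.02, 1.61, 2.75, 4.42, 7.51]:   # roughly exp(x), with some noise

> lny := map(ln, ydat):                     # ln(y) depends linearly on x

> stats[fit, leastsquare[[x, u], u = b*x + c]]([xdat, lny]);

The fitted exponential law is then y = exp(c)*exp(b*x), i.e., A = exp(c).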

Here we follow a different approach: we use a random number generator to produce data with uncertainties of known variance. We then test the fitting techniques and the error estimates; in effect, we use Maple to perform a computer experiment. We rely solely on Maple's built-in fit procedure, even though it does not provide uncertainties for the fitted coefficients.

We start with the drawing tools that we developed for plotting symbols with error bars. They work reasonably well if the scaling on the x and y axes is not very different. You may need to scale the data if they don't perform well for you (or replace the Circle procedure by an Ellipse).

> # Circle(x, y, r, col): circle of radius r centred at (x, y); col defaults to black.

> Circle:=proc() local col;

>   if nargs<4 then col:=black else col:=args[4] fi;

>   plot([args[3]*cos(t)+args[1],args[3]*sin(t)+args[2],t=0..2*Pi],scaling=CONSTRAINED,colour=col,thickness=2);

> end:

> # Line(x1, y1, x2, y2, col): segment from (x1, y1) to (x2, y2); col defaults to black.

> Line:=proc() local col;

>   if nargs<5 then col:=black else col:=args[5] fi;

>   plot([[args[1],args[2]],[args[3],args[4]]],scaling=CONSTRAINED,colour=col);

> end:

> # Expoint(x, y, y1, y2, r, col): double-circle data symbol of radius r at (x, y)

> # with a vertical error bar between y1 and y2, capped by short horizontal lines.

> Expoint:=proc() local size,col;

>   size:=2*args[5];

>   if nargs<6 then col:=black else col:=args[6] fi;

>   Circle(args[1],args[2],0.5*args[5],col),Circle(args[1],args[2],args[5],col),Line(args[1],args[3],args[1],args[4],col),Line(args[1]-size,args[3],args[1]+size,args[3],col),Line(args[1]-size,args[4],args[1]+size,args[4],col);

> end:

> with(plots):
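As a quick check of the drawing tools (the numbers are made up for illustration), the call below draws the double-circle symbol at (1.0, 1.5) with an error bar spanning y = 1.4 to y = 1.6:

> display({Expoint(1.0, 1.5, 1.6, 1.4, 0.02, blue)});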

We need a random number generator that produces random numbers on the interval [-1,1], which are then easily transformed to some other interval. Maple has by default an integer random number generator. The stats package is of some help.

> with(stats);

[Maple Math]

> with(stats[statplots]):

We learn that there are random number generators for normally and uniformly distributed data, and a histogram function that works either with tallied (adjusted) intervals, so that each bin has equal area, or with equidistant bins. The bin size is calculated by the program, and the area under the histogram equals unity, allowing a direct probability interpretation.

> histogram([random[normald](250)]);

[Maple Plot]

> data:=random[normald](250):

> histogram[1]([data]);

[Maple Plot]

> xrand:=stats[random, uniform](20);

[Maple Math]

> histogram[1]([xrand]);

[Maple Plot]

These are (supposedly) uniformly distributed over [0,1].
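In the loops below a uniform deviate u from this list is mapped to the interval [-errsize, errsize] via errsize*(2*u - 1). A quick check on the first few values (each result should lie in [-1, 1]):

> seq(2.*xrand[i] - 1., i = 1..3);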

Linear fit case:

We generate 5 data points with random errors according to y = 1 + 0.5*x, using the x interval 0..2:

> Y:=x->1+0.5*x;

[Maple Math]

> errsize:=0.1;

[Maple Math]

> imax:=5:

> for i from 1 to imax do

>   xval[i]:=2/imax*(i-0.5):                        # equally spaced midpoints in 0..2

>   yval[i]:=Y(xval[i])+errsize*(2.*xrand[i]-1.):   # exact value plus uniform noise in [-errsize, errsize]

> od:

> P1:=plot(Y(x),x=0..2,color=red):

> P2:={}: i:='i': symbolsize:=0.02:

> for i from 1 to imax do

>   P2:={op(P2),Expoint(xval[i],yval[i],yval[i]+errsize,yval[i]-errsize,symbolsize,blue)}:

> od:

> P2:=op(P2):   # turn the set of plot structures into an expression sequence for display

> display({P1,P2});

[Maple Plot]

Note that our error bars are not standard errors but strict brackets on the error: the random numbers used to displace the exact values are uniformly distributed, so the deviation can never exceed errsize.

We are ready to perform a least squares fit; help is available via ?fit. Some experimenting reveals that fit cannot take tables but needs lists of data values, so we convert in the fit call below.
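As a check that the conversion does what we want: xval was filled as a table above, and convert turns it into a list in index order; for imax = 5 this gives the equally spaced midpoints [.2, .6, 1.0, 1.4, 1.8].

> convert(xval, list);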

> y:='y': sol:=fit[leastsquare[[x,y], y=a*x+b]]( [convert(xval,list),convert(yval,list)]);

[Maple Math]

> assign(sol);

> P3:=plot(y,x=0..2,color=green):

> display({P1,P2,P3});

[Maple Plot]

For comparison:

> sort(y); sort(Y(x));

[Maple Math]

[Maple Math]

The difference in the slope and y intercept between the 'true' and fitted results is small, in fact much smaller than the absolute error bound for each data point.
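To make "small" quantitative, one can compute the root-mean-square residual of the fit (a sketch, not part of the original worksheet; it assumes y still holds the fitted expression from the assign above):

> sqrt(add((subs(x = xval[i], y) - yval[i])^2, i = 1..imax)/imax);   # should come out well below errsize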

Quadratic fit case:

We generate 6 data points with random errors according to y = 1 - 0.5*x + 0.5*x^2.

We use the x interval 0..2 again, and keep the error size. The number of data points was increased, since there is an extra parameter to be determined.

> Y:=x->1-0.5*x+0.5*x^2;

[Maple Math]

> imax:=6:

> for i from 1 to imax do

>   xval[i]:=2/imax*(i-0.5):

>   yval[i]:=Y(xval[i])+errsize*(2.*xrand[i]-1.):

> od:

> P1:=plot(Y(x),x=0..2,color=red):

> P2:={}: i:='i': symbolsize:=0.02:

> for i from 1 to imax do

>   P2:={op(P2),Expoint(xval[i],yval[i],yval[i]+errsize,yval[i]-errsize,symbolsize,blue)}:

> od:

> P2:=op(P2):

> display({P1,P2});

[Maple Plot]

As in the linear case, the error bars are strict brackets on the error, not standard errors.

We are ready to perform a least squares fit. The model is now specified to include a quadratic term:

> y:='y': sol:=fit[leastsquare[[x,y], y=a*x^2+b*x+c]]( [convert(xval,list),convert(yval,list)]);

[Maple Math]

> assign(sol);

> P3:=plot(y,x=0..2,color=green):

> display({P1,P2,P3});

[Maple Plot]

For comparison:

> sort(y); sort(Y(x));

[Maple Math]

[Maple Math]

We learn that a least squares fit of a quadratic function can be a trickier problem: the extra parameter gives the model more freedom, so the same amount of noise produces larger deviations of the fitted coefficients from the 'true' model (red curve).
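To see the deviation term by term, one can compare the polynomial coefficients of the fitted and 'true' models directly (a sketch; it assumes y still holds the fitted quadratic from the assign above):

> seq(coeff(y, x, k) - coeff(Y(x), x, k), k = 0..2);   # fitted minus 'true' coefficient of x^k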

Exercises:

1) Improve the situation by adding more data points in the interval. Also try fewer data points. Repeat the fits with increased error margins.

2) Observe the deviation between the 'true' and the model result based on randomized data fitting outside the interval in which the fitting was performed (extrapolation problem).

3) Repeat the data fitting problems with normally distributed errors (the error size plays the role of the standard deviation); a starting point is sketched after this list.

4) Test the data fit problem for a cubic function.
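For exercise 3, a possible starting point (a sketch; it assumes the indexed form normald[mu, sigma] of the stats random generator, here with mean 0 and standard deviation errsize):

> nerr := [stats[random, normald[0, errsize]](imax)]:

> for i from 1 to imax do

>   yval[i] := Y(xval[i]) + nerr[i]:

> od: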
