Experiment¶
It is the responsibility of the user to define their own experiment structure. The usual definition will describe the sample of interest, the instrument configuration, and the measured data, and will provide a theory function which computes the expected data given the sample and instrument parameters. The theory function frequently has a physics component for computing the ideal data given the sample and an instrument effects component which computes the expected data from the ideal data. Together, sample, instrument, and theory function define the fitting model which needs to match the data.
The curve fitting problem can be expressed as:
That is, the probability of seeing a particular set of model parameter values given the observed data depends on the probability of seeing the measured data given a proposed set of parameter values scaled by the probability of those parameter values and the probability of that data being measured. The experiment definition must return the negative log likelihood as computed using the expression on the right. Bumps will explore the space of the sample and instrument parameters in the model, returning the maximum likelihood and confidence intervals on the parameters.
There is a strong relationship between the usual \(\chi^2\) optimization problem and the maximum likelihood problem. Given Gaussian uncertainty for data measurements, we find that data \(y_i\) measured with uncertainty \(\sigma_i\) will be observed for sample parameters \(p\) when the instrument is at position \(x_i\) with probability
The negative log likelihood of observing all points in the data set for the given set of sample parameters is
Note that this is the unnormalized \(\chi^2\), whose expected value is the number of degrees of freedom in the model, not the reduced \(\chi^2_R\) whose expected value is \(1\). The Bumps fitting process is not sensitive to the constant \(C\) and it can be safely ignored.
Casting the problem as a log likelihood problem rather than \(\chi^2\) provides several advantages. We can support a richer set of measurement techniques whose uncertainties do not follow a Gaussian distribution. For example, if we have a Poisson process with a low count rate, the likelihood function will be asymmetric, and a gaussian fit will tend to overestimate the rate. Furthermore, we can properly handle background rates since we can easily compute the probability of seeing the observed number of counts given the proposed signal plus background rate. Gaussian modeling can lead to negative rates for signal or background, which is fundamentally wrong. See Simple functions for a demonstration of this effect.
We can systematically incorporate prior information into our models, such as uncertainty in instrument configuration. For example, if our sample angle control motor position follows a Gaussian distribution with a target position of 3° and an uncertainty of 0.2°, we can set
ignoring the scaling constant as before, and add this to \(\tfrac12\chi^2\) to get log of the product of the uncertainties. Similarly, if we know that our sample should have a thickness of 100 ± 3.5 Å based on how we constructed the sample, we can incorporate this information into our model in the same way.
Simple experiments¶
The simplest experiment is defined by a python function which takes a list of instrument configuration and has arguments defining the parameters. For example, to fit a line you would need:
def line(x, m, b):
return m*x + b
Assuming the data was in a 3 column ascii file with x, y and uncertainty, you would turn this into a bumps model file using:
# 3 column data file with x, y and uncertainty
x,y,dy = numpy.loadtxt('line.txt').T
M = Curve(line, x, y, dy)
Using the magic of python introspection,
Curve
is able to determine
the names of the fittable parameters from the arguments to the
function. These are converted to
Parameter
objects, the
basis of the Bumps modeling system. For each parameter, we can set
bounds or values:
M.m.range(0,1) # limit slope between 0 and 45 degrees
M.b.value = 1 # the intercept is set to 1.
We could even set a parameter to a probability distribution, using
Parameter.dev
for Gaussian
distributions or setting parameter.bounds to
Distribution
for other distributions.
Bumps includes code for polynomial interpolation including
B-splines
,
monotonic splines
,
and chebyshev polynomials
.
For counts data, PoissonCurve
is also
available.
Likelihood functions¶
If you are already have the negative log likelihood function and you don’t
need to manage data, you can use it with PDF
:
x,y,dy = numpy.loadtxt('line.txt').T
def nllf(m, b):
return numpy.sum(((y - (m*x + b))/dy)**2)
M = PDF(nllf)
You can use M.m and M.b to the parameter ranges as usual, then return the model as a fitting problem:
M.m.range(-inf,inf)
M.b.range(-inf,inf)
problem = FitProblem(M)
Complex models¶
More sophisticated models, with routines for data handling and specialized
plotting should define the Fitness
interface. The Peak Fitting example sets up a problem for fitting
multiple peaks plus a background against a 2-D data set.
Models are parameterized using Parameter
objects, which identify the fitted parameters in the model, and the bounds over
which they may vary. The fitness object must provide a set of fitting
parameters to the fit problem using the
parameters
method.
Usually this returns a dictionary, with the key corresponding to the
attribute name for the parameter and the value corresponding to a
parameter object. This allows the user of the model to guess that
parameter “p1” for example can be referenced using model.p1. If the
model consists of parts, the parameters for each part must be returned.
The usual approach is to define a parameters method for each part
and build up the dictionary when needed (the parameters function is
only called at the start of the fit, so it does not need to be efficient).
This allows the user to guess that parameter “p1” of part “a” can be
referenced using model.a.p1. A set of related parameters, p1, p2, …
can be placed in a list and referenced using, e.g., model.a.p[i].
The fitness constructor should accept keyword arguments for each
parameter giving reasonable defaults for the initial value. The
parameter attribute should be created using
Parameter.default
.
This method allows the user to set an initial parameter value when the
model is defined, or set the value to be another parameter in the fitting
problem, or to a parameter expression. The name given to the default
method should include the name of the model. That way when the same
type of model is used for different data sets, the two sets of parameters
can be distinguished. Ideally the model name would be based on the
data set name so that you can more easily figure out which parameter
goes with which data.
During an analysis, the optimizer will ask to evaluate a series of
points in parameter space. Once the parameters have been set, the
update
method will be called,
if there is one. This method should clear any cached results from the
last fit point. Next the nllf
method will be called to compute the negative log likelihood of observing
the data given the current values of the parameters. This is usually
just \(\sum{(y_i - f(x_i))^2 / (2 \sigma_i^2)}\) for data measured with
Gaussian uncertainty, but any probability distribution can be used.
For the Levenberg-Marquardt optimizer, the
residuals
method will be
called instead of nllf. If residuals are unavailable, then the L-M
method cannot be used.
The numpoints
method is used
to report fitting progress. With Gaussian measurement uncertainty, the
nllf return value is \(\chi^2/2\), which has an expected value of
the number of degrees of freedom in the fit. Since this is an awkward
number, the normalized chi-square,
\(\chi^2_N = \chi^2/\text{DoF} = -2 \ln (P)/(n-p)\), is shown
instead, where \(-\ln P\) is the nllf value, \(n\) is the of points
and \(p\) is the number of fitted parameters. \(\chi^2_N\) has a value near 1
for a good fit. The same calculation is used for non-gaussian
distributions even though nllf is not returning sum squared residuals.
The save
and
plot
methods will be called at
the end of the fit. The save method should save the model for the
current point. This may include things such as the calculated scattering
curve and the real space model for scattering inverse problems, or it
may be a save of the model parameters in a format that can be loaded by
other programs. The plot method should use the current matplotlib
figure to draw the model, data, theory and residuals.
The resynth_data
method
is used for an alternative monte carlo error analysis where random
data sets are generated from the measured value and the uncertainty
then fitted. The resulting fitted parameters can be processed much
like the MCMC datasets, yielding a different estimate on the uncertainties
in the parameters. The
restore_data
method
restores the data to the originally measured values. These methods
are optional, and only used if the alternative error analysis is
requested.
Linear models¶
Linear problems with normally distributed measurement error can be
solved directly. Bumps provides bumps.wsolve.wsolve()
, which weights
values according to the uncertainty. The corresponding
bumps.wsolve.wpolyfit()
function fits polynomials with measurement
uncertainty.
Foreign models¶
If your modeling environment already contains a sophisticated parameter
handling system (e.g. sympy or PyMC) you may want to tie into the Bumps
system at a higher level. In this case you will need to define a
class which implements the FitProblem
interface. This has been done already for
PyMCProblem <bumps.pymcfit.PyMCProblem
and interested parties are directed therein for a working example.