Nonparametric statistics

Nonparametric methods are used to make inferences about infinite-dimensional parameters in statistical models.

There is a healthy tradition of study programs available in the Department of Statistics, beginning at the founding of UNSW.

When very precise knowledge about a distribution function, a curve or a surface is available, parametric methods are used to identify it from a noisy data set. When the knowledge is much less precise, non-parametric methods are used. In such situations, the modelling assumption is that the curve or surface belongs to a certain class of functions. Using the limited information in the noisy data, the statistician attempts to identify the function from the class that “best” fits the data. Typically, the function is expanded in a suitable basis and the coefficients of the expansion up to a certain order are estimated from the data. Depending on the basis used, one ends up with kernel methods, wavelet methods, spline-based methods, Edgeworth expansions, etc. The semi-parametric approach to inference is also widely used in cases where the main component of interest is parametric but the model specification involves a non-parametric “nuisance” component.
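
The basis-expansion idea can be sketched in a few lines of Python. The example below is only an illustration, not a method attributed to the group: it assumes data on [0, 1], picks a cosine basis, and estimates each expansion coefficient by a sample mean. The function names, the truncation order and the simulated Beta(2, 5) sample are all hypothetical choices.

```python
import math
import random


def cosine_basis(j, x):
    """Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_j = sqrt(2) * cos(pi * j * x)."""
    if j == 0:
        return 1.0
    return math.sqrt(2.0) * math.cos(math.pi * j * x)


def series_density_estimate(sample, order):
    """Return x -> f_hat(x), the expansion truncated after basis function `order`."""
    n = len(sample)
    # Each coefficient <f, phi_j> = E[phi_j(X)] is estimated by a sample mean.
    coeffs = [sum(cosine_basis(j, xi) for xi in sample) / n
              for j in range(order + 1)]

    def f_hat(x):
        return sum(c * cosine_basis(j, x) for j, c in enumerate(coeffs))

    return f_hat


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for illustration only
    data = [rng.betavariate(2, 5) for _ in range(500)]  # simulated sample
    f_hat = series_density_estimate(data, order=6)
    for x in (0.1, 0.3, 0.5, 0.8):
        true_density = 30.0 * x * (1.0 - x) ** 4  # Beta(2, 5) density
        print(f"x={x:.1f}  estimate={f_hat(x):.3f}  true={true_density:.3f}")
```

The truncation order plays the role of the tuning parameter: a low order gives a smooth but possibly biased estimate, a high order a noisier one. Such an estimate integrates to one by construction but may dip below zero.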

The group has strengths in:

Wavelet methods in non-parametric inference. 
These include applications in density estimation, non-parametric regression and signal analysis. Wavelets are specific orthonormal bases used for series expansions of curves that exhibit local irregularities and abrupt changes. When estimating spatially inhomogeneous curves, wavelet methods outperform traditional nonparametric methods. We have focused on improving the flexibility of wavelet methods in curve estimation. We are looking for a finer balance between the stochastic and approximation terms in the mean integrated squared error decomposition. We are also developing data-based recommendations for choosing the tuning parameters in the wavelet-based estimation procedure, and modifying the wavelet estimators so that they satisfy additional shape constraints such as non-negativity or integrating to one. Another interesting application is in choosing multiple sampling rates for economical sampling of signals. The technique involves increasing the sampling rate when high-frequency terms are incorporated in the wavelet estimator, and decreasing it when signal complexity is judged to have decreased. The size of the wavelet coefficients at suitable resolution levels is used in deciding how and when to switch rates. Wavelets have also found wide-ranging applications in image analysis.
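
As a rough illustration of wavelet-based curve estimation, the Python sketch below denoises a signal with an abrupt jump using the Haar basis and hard thresholding. It is a textbook-style example under stated assumptions, not the group's procedure: the noise level `sigma` is taken as known, the signal length must be a power of two, and the universal threshold sigma * sqrt(2 log n) is just one common choice of tuning parameter.

```python
import math
import random


def haar_forward(signal):
    """Orthonormal Haar transform; returns (approx, details ordered coarse to fine)."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    root2 = math.sqrt(2.0)
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        detail = [(approx[i] - approx[i + 1]) / root2 for i in pairs]
        approx = [(approx[i] + approx[i + 1]) / root2 for i in pairs]
        details.insert(0, detail)  # keep the coarsest level first
    return approx, details


def haar_inverse(approx, details):
    """Invert haar_forward."""
    root2 = math.sqrt(2.0)
    current = list(approx)
    for detail in details:  # coarse to fine
        finer = []
        for a, d in zip(current, detail):
            finer.append((a + d) / root2)
            finer.append((a - d) / root2)
        current = finer
    return current


def denoise(signal, sigma):
    """Hard-threshold the detail coefficients at the universal threshold."""
    threshold = sigma * math.sqrt(2.0 * math.log(len(signal)))
    approx, details = haar_forward(signal)
    kept = [[d if abs(d) > threshold else 0.0 for d in level] for level in details]
    return haar_inverse(approx, kept)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for illustration only
    truth = [0.0] * 32 + [4.0] * 32  # a curve with one abrupt change
    noisy = [t + rng.gauss(0.0, 0.5) for t in truth]
    estimate = denoise(noisy, sigma=0.5)
    mse = lambda est: sum((e - t) ** 2 for e, t in zip(est, truth)) / len(truth)
    print(f"MSE noisy: {mse(noisy):.4f}   MSE denoised: {mse(estimate):.4f}")
```

Large coefficients mark the resolution levels and locations where the curve changes quickly, which is the same information the sampling-rate application described above relies on.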

Edgeworth and Saddlepoint approximations for densities and tail-area probabilities. 
Although asymptotic in spirit with respect to the sample size, they give accurate approximations even down to a sample size of one. Similar in spirit to the saddlepoint approximation is the Wiener germ approximation, which we have applied to approximating the non-central chi-square distribution and its quantiles. When a saddlepoint approximation is “inverted”, it can be used for quantile evaluation as an alternative to the Cornish-Fisher method. Quantile evaluation has become an important area of research because of its application in financial risk evaluation.
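
A small worked case shows how well a saddlepoint approximation can behave at a sample size of one. The Python sketch below is an illustrative example chosen here, not one taken from the group's work: it uses the mean of n independent standard exponential variables, for which the cumulant generating function K(s) = -log(1 - s) and the saddlepoint equation K'(s) = x can be solved in closed form. The function names are hypothetical.

```python
import math


def saddlepoint_density_exp_mean(x, n):
    """Saddlepoint approximation to the density of the mean of n iid Exp(1) variables (x > 0)."""
    s_hat = 1.0 - 1.0 / x            # solves K'(s) = 1 / (1 - s) = x
    k = -math.log(1.0 - s_hat)       # K(s_hat), equal to log(x)
    k2 = 1.0 / (1.0 - s_hat) ** 2    # K''(s_hat), equal to x**2
    return math.sqrt(n / (2.0 * math.pi * k2)) * math.exp(n * (k - s_hat * x))


def exact_density_exp_mean(x, n):
    """Exact density of the mean: Gamma with shape n and rate n (x > 0)."""
    return math.exp(n * math.log(n) + (n - 1) * math.log(x) - n * x - math.lgamma(n))


if __name__ == "__main__":
    for n in (1, 5):
        for x in (0.2, 1.0, 3.0):
            ratio = saddlepoint_density_exp_mean(x, n) / exact_density_exp_mean(x, n)
            print(f"n={n}  x={x:.1f}  approx/exact={ratio:.4f}")
```

In this particular case the ratio of approximate to exact density does not depend on x: for n = 1 it equals e / sqrt(2 pi), roughly 1.084, so the approximation has the right shape everywhere and becomes exact once it is renormalised to integrate to one.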

(Higher-order) Edgeworth expansions deliver better approximations to the distribution of a statistic than the first-order approximation delivered by the Central Limit Theorem. They are beneficial when sample sizes are small to moderate. The graph shows results of the approximation of the distribution of a kernel estimator of the 0.9-quantile of the standard exponential distribution based on only 15 observations. The true distribution: continuous line; one-term Edgeworth approximation: dot-dashed line; Central Limit Theorem-based normal approximation: dashed line.
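
The gain from a one-term Edgeworth correction can be checked numerically in a simpler setting than the quantile estimator in the graph. The Python sketch below is an illustration under assumptions made here: it uses the standardised mean of n standard exponential observations (skewness 2), whose exact distribution is available because the sum is Erlang distributed. The function names are hypothetical.

```python
import math


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def exact_cdf_standardised_mean(x, n):
    """P(sqrt(n) * (mean - 1) <= x) for n iid Exp(1) observations, via the Erlang CDF."""
    t = n + math.sqrt(n) * x  # equivalent threshold for the sum
    if t <= 0.0:
        return 0.0
    term, total = 1.0, 1.0  # k = 0 term of sum_{k < n} t**k / k!
    for k in range(1, n):
        term *= t / k
        total += term
    return 1.0 - math.exp(-t) * total


def edgeworth_cdf(x, n, skewness=2.0):
    """One-term Edgeworth approximation: normal CDF plus a skewness correction."""
    return normal_cdf(x) - normal_pdf(x) * skewness * (x * x - 1.0) / (6.0 * math.sqrt(n))


if __name__ == "__main__":
    n = 15
    for x in (-1.5, 0.0, 1.5, 2.0):
        print(f"x={x:+.1f}  exact={exact_cdf_standardised_mean(x, n):.4f}  "
              f"normal={normal_cdf(x):.4f}  edgeworth={edgeworth_cdf(x, n):.4f}")
```

The correction term vanishes at x = 1 and x = -1, so at those two points the one-term Edgeworth and normal approximations coincide.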

Nonparametric and semi-parametric inference in regression, density estimation and inference about copulae. Modern regression models are often semiparametric. Here the influence of some linear predictors is combined with the influence of other covariates such as time or spatial position. The influence of the latter is modelled non-parametrically, resulting in a flexible semiparametric model. Efficient estimation of the parametric component is of particular interest.
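
One way to make the semiparametric idea concrete is a partially linear model y = beta * x + m(t) + noise, where beta is the parametric component and m is an unspecified smooth function. The Python sketch below uses a double-residual (Robinson-style) estimator, a technique chosen here for illustration and not necessarily the one used by the group. The scalar predictor, the Gaussian kernel, the bandwidth and the simulated data are all assumptions.

```python
import math
import random


def nw_smooth(t0, ts, values, bandwidth):
    """Nadaraya-Watson estimate of E[value | T = t0] with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((t0 - t) / bandwidth) ** 2) for t in ts]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def partially_linear_beta(xs, ts, ys, bandwidth):
    """Double-residual estimate of beta in y = beta * x + m(t) + noise."""
    # Remove the smooth dependence on t from both x and y ...
    rx = [x - nw_smooth(t, ts, xs, bandwidth) for x, t in zip(xs, ts)]
    ry = [y - nw_smooth(t, ts, ys, bandwidth) for y, t in zip(ys, ts)]
    # ... then regress residual on residual (no intercept).
    return sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)


def simulate(n, beta, rng):
    """Hypothetical data: x depends on t, and m(t) = sin(2 * pi * t)."""
    ts = [rng.random() for _ in range(n)]
    xs = [math.cos(2.0 * math.pi * t) + rng.gauss(0.0, 1.0) for t in ts]
    ys = [beta * x + math.sin(2.0 * math.pi * t) + rng.gauss(0.0, 0.2)
          for x, t in zip(xs, ts)]
    return xs, ts, ys


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for illustration only
    xs, ts, ys = simulate(400, beta=2.0, rng=rng)
    print(f"estimated beta: {partially_linear_beta(xs, ts, ys, bandwidth=0.05):.3f}")
```

The non-parametric part is never estimated for its own sake here; it is smoothed away so that the parametric coefficient can be estimated from what remains.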

Non-parametric binary regression. When the response in a regression problem is binary (that is, it can only take the values 0 and 1), the regression function is just the conditional probability that Y takes the value 1 given X. In that case, the panacea seems to be to fit a logistic model (or sometimes a probit model).

However, these models are built on strong parametric assumptions, whose validity is rarely checked. The binary nature of the response makes the usual visual tools for checking model adequacy (scatter plots, residual plots, etc.) unavailable. Blindly fitting a logistic model is thus the same as fitting a linear regression model without looking at the scatter plot! To prevent model misspecification, and hence misleading conclusions with serious consequences, non-parametric methods may be used. In the example to the left, the model is meant to predict the probability that a newborn is affected by bronchopulmonary dysplasia (BPD), a chronic lung disease particularly affecting premature babies, as a function of the baby's birth weight. The “scatter plot” of the available observations is not informative (there is no shape in the cloud of points, if it can be called a cloud of points). Based on a logistic (red) or a probit (blue) model, we would conclude that the probability of being affected is very low for babies with a birth weight above 1400 g.

This is, however, not what the data suggest! The non-parametric estimate (green) clearly shows that this probability remains approximately constant (around 0.15) for birth weights above 1300 g.
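
A non-parametric estimate of P(Y = 1 | X = x) can be as simple as a kernel-weighted proportion of ones near x. The Python sketch below shows a Nadaraya-Watson estimator as one possible choice; the source does not say which estimator produced the green curve. The data are synthetic (they are not the BPD data), and the bandwidth and function names are hypothetical.

```python
import math
import random


def nw_probability(x0, xs, ys, bandwidth):
    """Kernel-weighted proportion of responses equal to 1 near x0 (Gaussian kernel)."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations close enough to x0 for this bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for illustration only

    def true_probability(x):
        # Synthetic truth: decreases, then levels off at 0.15 instead of tending to 0.
        return 0.15 + 0.7 * math.exp(-5.0 * x)

    xs = [rng.random() for _ in range(1000)]
    ys = [1 if rng.random() < true_probability(x) else 0 for x in xs]
    for x0 in (0.1, 0.5, 0.9):
        print(f"x={x0:.1f}  estimate={nw_probability(x0, xs, ys, 0.1):.3f}  "
              f"true={true_probability(x0):.3f}")
```

Because the estimate is a weighted average of zeros and ones, it always lies between 0 and 1, and it can flatten out where a logistic or probit curve would be forced towards 0.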

Nonparametric methods in counting process intensity function estimation. In areas such as medical statistics, seismology, insurance, finance, and engineering, a very useful statistical model is the counting process, which is simply a stochastic process that registers the number of “interesting” events that have happened up to any specified time point. Examples of events of interest include the death of a study subject, failure of a human organ or of a part of an industrial product, infection by an infectious disease, relapse of a disease, recovery from a disease, an earthquake, a trade of a stock, an update of a stock index level, default on payment of credit card bills, default of a bond issuer, and firing of an excited neural cell.

In applications of counting processes, a key concept is the intensity function of the counting process, which plays an important role in understanding the underlying mechanism that generates the events of interest and in predicting when an event will occur or recur, or how many events are to be expected in a specified interval.

In different applications, the intensity function goes by different names, such as the hazard rate function in survival data analysis and reliability theory, the mortality rate function in actuarial science, and the infection rate in epidemiology. When the statistician does not have enough prior knowledge about the specific form of the intensity function to estimate it parametrically, one of several nonparametric methods can be used, such as the kernel smoothing method, the roughness penalty method, the spline method, the wavelet shrinkage method, and the local polynomial method.

The local polynomial method has been demonstrated to retain the computational ease of the kernel method while not suffering from its boundary effects. For instance, the above figure shows the estimates of a specific hazard rate function based on 100 simulated right-censored survival data sets using both the kernel method and the local linear method. It is clear that the local linear method outperforms the kernel method.
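
The boundary effect can be reproduced with a short simulation. The Python sketch below is an illustration under assumptions made here, not a reproduction of the figure: it smooths Nelson-Aalen increments with an Epanechnikov kernel, and uses one common formulation of the local linear estimator in which kernel moments are computed over the observation window [0, t_max]. The simulation settings (constant hazard 1, exponential censoring, no ties) and all names are hypothetical.

```python
import math
import random


def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def nelson_aalen_increments(times, events):
    """(time, 1 / number at risk) for each observed event; assumes no tied times."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)
    return [(times[i], 1.0 / (n - rank))
            for rank, i in enumerate(order) if events[i]]


def kernel_hazard(t, jumps, b):
    """Kernel-smoothed Nelson-Aalen increments (biased near the boundary t = 0)."""
    return sum(epanechnikov((t - s) / b) * dj for s, dj in jumps) / b


def local_linear_hazard(t, jumps, b, t_max, grid=400):
    """Local linear fit at t, for t inside [0, t_max]; returns the fitted intercept."""
    lo, hi = max(0.0, t - b), min(t_max, t + b)
    step = (hi - lo) / grid
    s0 = s1 = s2 = 0.0
    for k in range(grid):  # kernel moments over the window, midpoint rule
        d = lo + (k + 0.5) * step - t
        w = epanechnikov(d / b) / b * step
        s0, s1, s2 = s0 + w, s1 + w * d, s2 + w * d * d
    r0 = r1 = 0.0
    for s, dj in jumps:  # the same moments taken against the data
        d = s - t
        w = epanechnikov(d / b) / b * dj
        r0, r1 = r0 + w, r1 + w * d
    return (s2 * r0 - s1 * r1) / (s0 * s2 - s1 * s1)


def simulate_censored(n, rng):
    """Exp(1) lifetimes (true hazard 1) with independent exponential censoring."""
    times, events = [], []
    for _ in range(n):
        life, cens = rng.expovariate(1.0), rng.expovariate(0.3)
        times.append(min(life, cens))
        events.append(life <= cens)
    return times, events


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for illustration only
    times, events = simulate_censored(2000, rng)
    jumps = nelson_aalen_increments(times, events)
    for t in (0.0, 0.15, 0.5, 1.0):
        print(f"t={t:.2f}  kernel={kernel_hazard(t, jumps, 0.3):.3f}  "
              f"local linear={local_linear_hazard(t, jumps, 0.3, t_max=2.0):.3f}")
```

At t = 0 only half of the kernel window overlaps the data, so the plain kernel estimate is pulled towards roughly half the true hazard, while the local linear fit compensates through its kernel moments. Away from the boundary the two estimators nearly coincide.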

Functional data analysis. Functional data analysis has recently become a very hot topic in statistical research, as recent technological progress in measuring devices now allows one to observe spatio-temporal phenomena on arbitrarily fine grids, that is, almost continuously. The infinite dimension of these functional objects often poses a problem, though, so there is a clear need for efficient functional data analysis methods. As in the binary regression problem, the infinite dimension of the objects under consideration rules out any graphical representation, and hence any visual guide for specifying and validating a model. The risk of misspecification is thus even higher here than in classical data analysis, which motivates the use of non-parametric techniques. These must, however, be designed carefully, as infinite-dimensional functional methods are affected by a severe version of the infamous “Curse of Dimensionality”.