Abstract

Real-world datasets often exhibit a high degree of (possibly non-linear) correlations and constraints among their features. Consequently, despite residing in a high-dimensional embedding space, the data typically lie on a manifold with a much lower intrinsic dimension (ID), which, in the presence of noise, may depend on the scale at which the data are analysed. This situation raises interesting questions: How many variables, or combinations thereof, are necessary to describe a real-world dataset without significant information loss? What is the appropriate scale at which one should analyse and visualize data? Although these two issues are often considered unrelated, they are in fact strongly entangled and can be addressed within a unified framework.

We introduce an approach in which the optimal number of variables and the optimal scale are determined self-consistently, recognizing and bypassing the scale at which the data are affected by noise. To this end, we estimate the data ID in an adaptive manner. Sometimes, within the same dataset, it is possible to identify more than one ID, meaning that different subsets of data points lie on manifolds with different IDs. Identifying these manifolds provides a clustering of the data.
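
As a rough illustration of what estimating an ID involves, the sketch below implements the two-nearest-neighbour (TwoNN) maximum-likelihood estimator with NumPy. This is a basic, non-adaptive estimator chosen only for illustration and is not necessarily the method presented in the talk; the function name `twonn_id` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def twonn_id(points):
    """Illustrative TwoNN estimate of the intrinsic dimension.

    For each point, mu = r2 / r1 is the ratio of the distances to its
    second and first nearest neighbours. Under locally uniform density,
    mu follows a Pareto law with exponent d, whose maximum-likelihood
    estimate is d = N / sum(log(mu)).
    """
    x = np.asarray(points, dtype=float)
    # Dense pairwise Euclidean distances (fine for a few hundred points).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude each point's distance to itself
    # After partitioning at index 1, columns 0 and 1 hold the two smallest distances.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0  # drop exact duplicates, for which the ratio is undefined
    mu = r2[keep] / r1[keep]
    return keep.sum() / np.log(mu).sum()


# Synthetic data (an assumption for this example): a 2-D sheet
# embedded linearly in a 10-D space.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(500, 2))
embedding = rng.normal(size=(2, 10))
data = latent @ embedding

estimate = twonn_id(data)
print(f"embedding dimension: {data.shape[1]}, estimated ID: {estimate:.2f}")
```

On this noiseless example the estimate should be close to 2 even though the data have 10 coordinates. Adding noise, or subsampling the points so that neighbours are farther apart, changes the estimate, which is the scale dependence the abstract refers to.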

We will present applications of the data ID ranging from gene expression, protein folding and pandemic evolution to fMRI, financial and network data. All these real-world applications show how a simple topological feature such as the ID allows us to uncover rich structure in the data and to gain insight for subsequent statistical analyses.

Speaker

Antonietta Mira

Research Area

Statistics seminar

Affiliation

Università della Svizzera italiana and Insubria University

Date

Wed, 18 Feb 2026, 4:00 pm

Venue

Microsoft Teams / Anita B. Lawrence 3085