Machine Learning to Study Causality with Big Datasets: Towards Methods Yielding Valid Statistical Conclusions

The researchers want to develop machine learning methods that facilitate the study of complex causal relationships in the social and health sciences, based on large-scale record-linked databases.

Big datasets make it possible to study complex causal relationships in the social and health sciences.

In several countries, including Sweden, administrative records can be linked with health records for research purposes. Linking records for entire populations in this way over multiple decades results in large-scale databases covering millions of individuals. The data encompass hundreds of thousands of characteristics, including how socioeconomic conditions and health status develop over time for each individual, as well as for their partners, relatives, neighbors and co-workers.

The research team, which is studying socioeconomic health inequalities, has access to record-linked data infrastructures of this kind. The issues they are addressing are of a causal nature. For instance, if it is found that breast cancer survival differs between income groups, the researchers want to study the mechanisms causing this inequality.

One scientific challenge is that classical statistical methods are not designed to cope with such large volumes of data, which can result in erroneous conclusions. The researchers are therefore developing machine learning methods for causal inference, e.g. using neural networks. They hope to achieve results similar in quality to those obtained with machine learning for prediction, e.g. in automated tumor identification.

To avoid the risk of underestimating uncertainty in the final statistical results, the researchers plan to develop machine learning methods that take into account key sources of uncertainty in the assumptions on which the analysis is based. They will also be developing optimal estimation methods, i.e. methods that yield the most reliable conclusions.
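To make the idea of estimation with honest uncertainty concrete, the following is a minimal sketch on simulated data. It is not the project's method. It uses a standard doubly robust (augmented inverse probability weighting) estimator as a stand-in, with simple logistic and linear models where a real analysis might use neural networks. The data-generating process, the function names (fit_logistic, aipw_ate, simulate) and the true effect of 2.0 are all assumptions made for illustration.

```python
import numpy as np


def fit_logistic(X, t, n_iter=25):
    """Fit a logistic regression by Newton-Raphson and return the coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def aipw_ate(x, t, y):
    """Doubly robust estimate of the average treatment effect and its standard error."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])

    # Propensity score: estimated probability of treatment given the covariate.
    e = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, t)))
    e = np.clip(e, 0.01, 0.99)  # guard against extreme weights

    # Outcome regressions fitted separately in the treated and untreated groups.
    b1 = np.linalg.lstsq(X[t == 1], y[t == 1], rcond=None)[0]
    b0 = np.linalg.lstsq(X[t == 0], y[t == 0], rcond=None)[0]
    m1, m0 = X @ b1, X @ b0

    # Per-individual contributions; their mean is the estimate, and their
    # spread gives a standard error that reflects both fitted models.
    psi = m1 - m0 + t * (y - m1) / e - (1.0 - t) * (y - m0) / (1.0 - e)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)


def simulate(n=20000, seed=0):
    """Hypothetical data: x confounds treatment t and outcome y; true effect is 2.0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    t = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-x))).astype(float)
    y = 2.0 * t + 3.0 * x + rng.normal(size=n)
    return x, t, y


x, t, y = simulate()
naive = y[t == 1].mean() - y[t == 0].mean()  # ignores confounding by x
ate, se = aipw_ate(x, t, y)
print(f"naive difference in means: {naive:.2f}")
print(f"doubly robust estimate: {ate:.2f} (standard error {se:.3f})")
```

In this simulation the naive comparison of group means is distorted by the confounder, while the adjusted estimate lies close to the assumed true effect and comes with a standard error. Real record-linked data would involve far more covariates and more flexible models, and the uncertainty about untestable assumptions mentioned above is not captured by this standard error.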

The aim of the project is thus to develop tools that enable new and more reliable conclusions to be drawn from studies of the determinants of health inequalities, as well as other complex causal relationships in the social and health sciences based on large-scale datasets.

Affiliated with WASP-HS

This project is affiliated with WASP-HS and generously funded by the Marianne and Marcus Wallenberg Foundation.

Principal Investigator(s)

Xavier de Luna
Professor, Umeå University

Project Members

Tetiana Gorbach
Senior Lecturer at Umeå University; Guest Researcher at the London School of Hygiene and Tropical Medicine

Yanyan Ma
Professor at Penn State, USA

Per Gustafsson
Senior Lecturer at Umeå University

Project period: 1 June 2022 until 1 December 2027