Title

Machine Learning to Study Causality with Big Datasets: Towards Methods Yielding Valid Statistical Conclusions

About the project

Big datasets make it possible to study complex causal relationships in the social and health sciences.

In several countries, including Sweden, administrative records can be linked with health records for research purposes. Linking records for entire populations in this way over multiple decades results in large-scale databases covering millions of individuals. The data encompass hundreds of thousands of characteristics, including how socioeconomic conditions and health status develop over time for each individual, as well as for their partners, relatives, neighbors and co-workers.

The research team, which studies socioeconomic health inequalities, has access to record-linked data infrastructures of this kind. The questions the team addresses are causal in nature. For instance, if breast cancer survival is found to differ between income groups, the researchers want to study the mechanisms causing this inequality.

One scientific challenge is that classical statistical methods are not designed to cope with large volumes of data, which can result in erroneous conclusions. The researchers are developing machine learning methods for causality, for example using neural networks. They hope to achieve results similar in quality to those obtained with machine learning for prognosis purposes, for example in automated tumor identification.

To avoid underestimating uncertainty in the final statistical results, the researchers plan to develop machine learning methods that take into account key sources of uncertainty in the assumptions on which the analysis is based. They will also develop optimal estimation methods, i.e. methods that yield the most reliable conclusions.

The aim of the project is thus to develop tools that enable new and more reliable conclusions to be drawn from studies of the determinants of health inequalities, as well as other complex causal relationships in the social and health sciences based on large-scale datasets.

Duration

Start: 1 June 2022
End: 31 December 2026

Project type

Affiliated with WASP-HS

Universities and institutes

Umeå University

Project members

Xavier de Luna

Professor

Umeå University

Tetiana Gorbach

Associate Professor

Umeå University

Per Gustafsson

Associate Professor

Umeå University

Xijia Liu

Associate Professor

Umeå University

Mohammad Ghasempour

PhD student

Umeå University