MPclusters
In the spring of 2023, before starting my PhD, I worked for a couple of months
as a freelancer. Among several data science projects, I collaborated with a
biotech company in Paris whose name cannot be disclosed for confidentiality
reasons. They develop a clinical test intended to serve as an indicator of
depression. Interestingly, the tests measure the response time of patients.
The goal was to measure the predictive power of the tests against the
practitioner's knowledge of which patients were in fact depressive. In
addition, the company wanted to group patients according to their performance on
the tests to see whether certain patterns would emerge. In the most basic sense,
the company was hoping to see groups of depressive and non-depressive patients,
but also to pick up on other forms of neurological disease. We could see
differences between groups using statistical tests for salient features.
(Small image of cortex, eye etc.)
The purpose of this blog post is to give a technical overview of the steps that
can be taken to analyze temporal, multi-structural data with classification,
clustering, and sequence analysis.
Standard processing pipeline :ATTACH:
[[attachment:~/Documents/Application/KnowledgeManager/documents/27/1560df-b36b-4d51-89cf-23121b62ef47/2024-09-24_23-27-15_screenshot.png][screenshot.png]]
Description of feature engineering :ATTACH:
Because the tests are expensive, the dataset was quite small, which made an
initial feature engineering step important. This was all the more true because
the neurologists I worked with had very strong hypotheses about what was
supposed to happen in the data.
[[attachment:~/Documents/Application/KnowledgeManager/documents/27/1560df-b36b-4d51-89cf-23121b62ef47/2024-09-24_23-25-33_screenshot.png][screenshot.png]]
** Description
t_avg
t_delayed_after xyz
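To make the idea of an engineered feature concrete, here is a minimal sketch.
It assumes that t_avg denotes the mean response time per patient and per test,
which these notes do not spell out. The column names (patient_id, test_name,
response_time_s) and all values are hypothetical and do not come from the
project.

```python
import pandas as pd

# Hypothetical long-format log: one row per trial (names and values are illustrative).
trials = pd.DataFrame(
    {
        "patient_id": ["p1", "p1", "p1", "p2", "p2", "p2"],
        "test_name": ["A", "A", "B", "A", "A", "B"],
        "response_time_s": [0.40, 0.60, 1.00, 0.80, 1.20, 1.50],
    }
)

# Assumed definition of t_avg: mean response time per patient and per test,
# pivoted so that each patient becomes one row of features.
t_avg = (
    trials.groupby(["patient_id", "test_name"])["response_time_s"]
    .mean()
    .unstack("test_name")
    .add_prefix("t_avg_")
)

print(t_avg)
```

Each row of t_avg is one patient and each column one engineered feature, which
is the shape most classification and clustering tools expect.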
Methods employed
Dimensionality reduction
Statistical tests for salient features in clusters
Basic clustering / partitioning algorithm
xarray, multi-dimensional array handling
mlflow for feature selection
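As an illustration of the xarray item above, here is a small sketch of how
response times could be stored as a labelled multi-dimensional array. The
dimension names (patient, test, trial), the sizes, and the random values are
assumptions made for the example, not the layout used in the project.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Hypothetical layout: 4 patients x 2 tests x 5 trials of response times (seconds).
response_time = xr.DataArray(
    rng.uniform(0.3, 2.0, size=(4, 2, 5)),
    dims=("patient", "test", "trial"),
    coords={
        "patient": ["p1", "p2", "p3", "p4"],
        "test": ["A", "B"],
        "trial": np.arange(5),
    },
    name="response_time",
)

# Reduce over a named dimension instead of remembering axis numbers.
mean_per_test = response_time.mean(dim="trial")

# Label-based selection: all trials of test "A" for patient "p2".
p2_test_a = response_time.sel(patient="p2", test="A")

# A 2-D DataArray converts to a (patient x test) pandas DataFrame,
# which can be passed on to scikit-learn style estimators.
features = mean_per_test.to_pandas()
```

Naming the dimensions is what keeps this kind of code readable: a reduction over
"trial" stays correct even if the array is later transposed or gains a new
dimension.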
Clustering algorithms :ATTACH:
[[attachment:~/Documents/Application/KnowledgeManager/documents/27/1560df-b36b-4d51-89cf-23121b62ef47/2024-09-24_23-27-47_screenshot.png][screenshot.png]]
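The methods listed earlier (dimensionality reduction, a basic partitioning
algorithm, and statistical tests for salient features) can be chained as in the
sketch below. It is an illustration on synthetic data, not the project's
pipeline: PCA, k-means, and the Kruskal-Wallis test are my choices for the
example, and the number of clusters is fixed by hand.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for engineered features: 2 groups of 20 patients, 4 features.
# Features 0 and 1 differ between the groups; features 2 and 3 are pure noise.
group_a = rng.normal(0.0, 1.0, size=(20, 4))
group_b = rng.normal(0.0, 1.0, size=(20, 4))
group_b[:, :2] += 6.0
X = np.vstack([group_a, group_b])

# Standardize, project onto 2 principal components, then partition with k-means.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)

# Kruskal-Wallis H-test per original feature: does its distribution differ
# between the clusters that were found?
p_values = [
    kruskal(*(X[labels == k, j] for k in np.unique(labels))).pvalue
    for j in range(X.shape[1])
]

for j, p in enumerate(p_values):
    print(f"feature {j}: p = {p:.3g}")
```

One caveat: the clusters are built from the same features that are then tested,
so these p-values are best read as a descriptive ranking of which features
drive the partition rather than as confirmatory evidence, especially with a
sample this small.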
Conclusion
In conclusion, this project gave a fascinating glimpse into the process of
clinical trials. As a parallel to my current work, it demonstrated the
challenges that plague medical research, namely very small sample sizes, which
show the limits of novel and advanced machine learning techniques, as well as
overfitting and interpretability issues.
All in all, this work can be used as a blueprint for anyone who wants to work
on multi-dimensional data. I strongly recommend using xarray and, in general,
taking some time to choose the right data structures for a project before diving
in. This can make the difference between an easy-to-maintain project and a
complicated patchwork of custom scripts.