Analyzing Multi-Dimensional Data for Clinical Tests: A Practical Case Study


Purpose: This article gives a technical overview of the steps involved in analyzing temporal, multi-dimensional data for classification, clustering, and sequence analysis in the context of a clinical test. #feature_engineering #time_series #biotech

Introduction: In the Spring of 2023, before starting my PhD, I worked for a couple of months as a freelancer. Among my data science projects, I collaborated with a biotech company in Paris. They had developed a clinical test meant to serve as an indicator of depression, and they were looking for a data scientist to make sense of the results.

More specifically, the company wanted to prove that their new solution was predictive of depression and other neurological pathologies. The clinical test consisted of a series of computer-administered questions that looked like visual riddles. The format was meant to isolate the cognitive reaction time of patients, which correlates with these illnesses. The head of the project wanted to group patients using clustering algorithms, because he strongly suspected that behavioral patterns would emerge depending on the test profiles of the patients.

Domain knowledge: It was very interesting for me to dive into the problem and learn that a patient's reaction time can be broken down into (1) the time it takes for the information to travel to the brain, (2) the processing time, and (3) the time for the signal to travel back to the muscle and click on the correct answer.

The reaction time $T$ is the sum of (1) the time it takes for the information to reach the visual cortex $T_c$, (2) the processing time $T_p$, and (3) the time to send the electric impulse $T_i$:

$$T = T_c + T_p + T_i$$

Given that the company was really interested in the cognitive processing time (2), as opposed to the motor components (1) and (3), they first administered a reaction-time test as a baseline, where the patients had to click as soon as they saw an image. They then administered a second test that combined reaction time with analytical thinking to truly measure cognitive fitness. Given this understanding, I created a new variable by subtracting the motor reaction time from the total reaction time. This new feature encapsulates the specific aspect of reaction time the company was most interested in, cognitive processing, making it far more informative for the problem. This approach highlights how even a simple transformation rooted in domain knowledge can significantly enhance the quality and relevance of your analysis.
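To make the idea concrete, here is a minimal sketch of that transformation, assuming a hypothetical schema with one motor-only baseline test and one combined cognitive test per patient (the column names are mine, not the company's):

```python
import pandas as pd

# Hypothetical layout: the real dataset had many more trials and metadata.
baseline = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "rt_ms": [310.0, 305.0, 290.0, 295.0],   # motor-only test, roughly T_c + T_i
})
cognitive = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "trial": [1, 2, 1, 2],
    "rt_ms": [720.0, 690.0, 610.0, 640.0],   # combined test, roughly T_c + T_p + T_i
})

# Use each patient's median motor baseline as an estimate of T_c + T_i,
# then subtract it to approximate the cognitive processing time T_p.
motor = baseline.groupby("patient_id")["rt_ms"].median().rename("motor_ms")
cognitive = cognitive.join(motor, on="patient_id")
cognitive["delta_ms"] = cognitive["rt_ms"] - cognitive["motor_ms"]
print(cognitive)
```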

Description of feature engineering: The dataset being quite small due to the expensive nature of the tests, an initial feature-engineering step was essential. The neurologist I collaborated with had very strong hypotheses about how the data should behave, so we immediately encoded his assumptions (a sketch of two such encodings follows the list below):

  • Motor response is essentially fixed, with a slight decrease with age
  • Response patterns:
      • Patients slow down after an error
      • The delta should follow a V or an inverted V over the series
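As promised above, here is a sketch of how two of these hypotheses could be encoded as per-patient features. The function names and the toy series are mine for illustration; the actual encodings followed the dependency graph shown below.

```python
import numpy as np

def post_error_slowing(delta_ms: np.ndarray, errors: np.ndarray) -> float:
    """Mean slow-down on trials that immediately follow an error."""
    after_error = np.roll(errors, 1).astype(bool)
    after_error[0] = False  # the first trial has no predecessor
    if not after_error.any() or after_error.all():
        return 0.0
    return float(delta_ms[after_error].mean() - delta_ms[~after_error].mean())

def v_shape_score(delta_ms: np.ndarray) -> float:
    """Positive for a V-shaped series, negative for an inverted V:
    mean of the first and last thirds minus the mean of the middle."""
    n = len(delta_ms)
    third = max(n // 3, 1)
    ends = np.r_[delta_ms[:third], delta_ms[-third:]].mean()
    middle = delta_ms[third:n - third].mean() if n > 2 * third else delta_ms.mean()
    return float(ends - middle)

# Toy example: nine trials, with an error on the fourth one.
delta = np.array([420, 410, 400, 380, 460, 390, 395, 405, 415], dtype=float)
errors = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0])
print(post_error_slowing(delta, errors), v_shape_score(delta))
```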

The outcome of our discussions can be seen in the following graph. It represents the breakdown of all the features we came up with, along with their dependencies on the different parts of the data.

Feature dependency graph: How various aspects of the dataset were combined to form informative features after several discussions with the practitioners.

Data structures: At the start of the project, I knew that we would be doing a lot of experimentation, and I needed a data structure that would keep me flexible and productive. I spent a fair amount of time cleaning the dataset. The data I was given consisted of patient attributes, such as age and gender, as well as the different time series corresponding to the tests the patients took. After some research, I discovered that xarray was the perfect tool for handling such multi-dimensional structures. All the steps in the feature engineering, arithmetic between individual values and time series, differencing, filtering, and convolutions, became a breeze, which allowed me to seriously keep my hours down while delivering high value to my client.
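To illustrate why this felt like such a good fit, here is a minimal sketch with hypothetical dimension and variable names (patient, trial, and patient-level metadata as coordinates), not the project's actual schema:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
n_patients, n_trials = 4, 20

# Reaction times indexed by (patient, trial), with patient metadata as coordinates.
ds = xr.Dataset(
    data_vars={
        "baseline_rt": (("patient", "trial"), rng.normal(300, 20, (n_patients, n_trials))),
        "cognitive_rt": (("patient", "trial"), rng.normal(650, 60, (n_patients, n_trials))),
    },
    coords={
        "patient": np.arange(n_patients),
        "trial": np.arange(n_trials),
        "age": ("patient", np.array([34, 51, 28, 63])),
        "gender": ("patient", np.array(["F", "M", "F", "M"])),
    },
)

# Feature engineering reads almost like the neurologist's sentences:
ds["delta"] = ds["cognitive_rt"] - ds["baseline_rt"].median(dim="trial")
ds["delta_smooth"] = ds["delta"].rolling(trial=3, center=True).mean()   # light filtering
ds["delta_diff"] = ds["delta"].diff(dim="trial")                        # trial-to-trial change
print(ds)
```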

Rest of the project: The remainder of the project was important but a routine part of the job. Once the data was meaningful and clean, I applied a dimensionality reduction technique, UMAP, together with k-means clustering to generate clusters.
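A sketch of that step, using placeholder data and assumed hyperparameters (the real values came out of the search described next):

```python
import numpy as np
import umap                       # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# X: one row per patient, one column per engineered feature (placeholder data here).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 12))

X_scaled = StandardScaler().fit_transform(X)

# Non-linear dimensionality reduction, then clustering in the embedded space.
embedding = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                      random_state=42).fit_transform(X_scaled)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(embedding)
print(labels[:10])
```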

I ran a feature selection search with my MLflow infrastructure to optimize for cluster consistency, which resulted in clusters we judged adequate for the problem. Last but not least, we wanted a strong handle on what these clusters actually meant, so I computed summary statistics for each cluster and used statistical tests to highlight salient features within each cluster. After a few weeks of good work, the project was done and the customer was happy.
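As an illustration of the profiling step, here is a sketch that compares each cluster against the rest, feature by feature, using a Mann-Whitney U test (a reasonable non-parametric choice for small samples, shown here purely as an example rather than the exact test battery we ran):

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

def profile_clusters(features: pd.DataFrame, labels: np.ndarray) -> pd.DataFrame:
    """For each cluster and feature, compare the cluster against the rest
    and report the median shift and the Mann-Whitney p-value."""
    rows = []
    for cluster in np.unique(labels):
        in_cluster = labels == cluster
        for col in features.columns:
            a, b = features.loc[in_cluster, col], features.loc[~in_cluster, col]
            _, p = mannwhitneyu(a, b, alternative="two-sided")
            rows.append({"cluster": cluster, "feature": col,
                         "median_in": a.median(), "median_out": b.median(),
                         "p_value": p})
    return pd.DataFrame(rows).sort_values("p_value")

# Placeholder data: 80 patients, 3 engineered features, 3 clusters.
rng = np.random.default_rng(0)
feats = pd.DataFrame(rng.normal(size=(80, 3)), columns=["delta_mean", "pes", "v_score"])
labels = rng.integers(0, 3, size=80)
print(profile_clusters(feats, labels).head())
```

With a dataset this small, a multiple-comparison correction, or at least a healthy dose of skepticism, is warranted before reading too much into any single p-value.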

Final processing pipeline used to create clusters for my client.

Conclusion: This project gave me a fascinating glimpse into the world of clinical trials and post-hoc data analysis. In complement to my current work, it was a demonstration of the challenges that plague medical research, namely small datasets. Novel and advanced machine learning techniques show clear limitations in this context, such as overfitting and the problem of interpretability. At the time of writing, my current research involves unlocking new forms of synthetic data to solve the problem of limited or biased datasets (link to article).

Key takeaways:

  • Small datasets call for commensurate tools: feature engineering is king, and so are classic machine learning models.
  • Listen carefully to the client and don't be afraid to brainstorm with them.
  • Always use the right data structure. Thanks to xarray, I was able to translate the neurologist's thoughts effortlessly.
