A Conversation with Dave Kil

Dave Kil & Rupal Shah

Trustworthy ML/AI in Higher Education

Q: What’s CML Insight and its charter?

CML Insight’s mission is to help students do better by democratizing causal machine learning, which focuses on understanding the causal relationship between a treatment and its impact on student success so that such real-world evidence can be applied to help active students. We see this as the most direct and effective way to use analytics to lower equity gaps and help all students learn better and finish strong. This is why the company is named CML Insight, after causal machine learning (CML).

Q: What’s lacking in today’s approach to machine learning?

Too frequently, ML/AI has been associated with risk prediction using sensitive student data, especially demographic and other non-malleable data, which can exacerbate equity gaps. Furthermore, as ML has become commoditized, many nuances can get lost during operationalization, which can lead to suboptimal model performance and distrust in model scores. This blog discusses innovative ways of using integrated analytics to lower equity gaps, going beyond predictions from black-box models.

Q: How can your AI application ensure ethical, non-biased results? How can you ensure student privacy and data security? Do you use student data for other purposes than deriving insights for clients?

We deliver the most important insights and action recommendations in a safe, ethical, non-biased way as follows:

While prediction scores can be useful in some industry applications, we don’t deliver individual student prediction scores to higher-ed institutions because there are more efficient and effective ways to help institutions achieve student success and lower equity gaps. Instead, we build various student success models in order to construct and evaluate a large number of retrospective quasi-experiments. The real-world evidence compiled from them is then used to help active students in prospective trials with the most promising, evidence-based treatments.
We invest in thorough time-series feature engineering to infer hidden variables. Typically we find demographic and non-malleable (DNM) variables to be of little importance in predictive and causal models because the underlying derived variables explain much of whatever outcome differences may exist for DNM variables.
Our data and application platforms are built on Google Cloud Platform (GCP), which supports compliance with major privacy standards and regulations such as GDPR, FERPA, and HIPAA.
CML Insight personnel, having worked in defense R&D, Amazon, Facebook, healthcare, and higher-ed industries, are familiar with and practice privacy standards, policies and procedures, and tech stacks that preserve differential privacy in data storage, processing, and MLOps.

Q: How can you measure everything and truly know what is causally related to a treatment/intervention?

The most popular method is propensity score matching (PSM), first proposed by Rosenbaum and Rubin in 1983. The key tenet is that by (1) building a model for treatment participation using data from participants and non-participants at baseline (before treatment commences) and (2) matching in the one-dimensional model score space, quasi pilot and control groups can be formed. This treatment-participation prediction model score is called a propensity score.

The key concept is deceptively simple. Treatment and control groups that occur naturally in observational settings are balanced across all observed baseline covariates through the propensity score, which yields statistically equivalent groups. The difference in outcomes can then be attributed to the treatment, since that is the only remaining difference between the two statistically matched groups, provided that no important confounders are left unobserved.
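The two steps above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not CML Insight's implementation: the data are synthetic (one baseline covariate, a true treatment effect of 1.0), the participation model is a small logistic regression fitted by gradient descent, and matching is greedy 1:1 nearest-neighbor without replacement, with a hypothetical caliper of 0.05 on the propensity score.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, t, lr=0.5, epochs=300):
    """Step 1: model treatment participation from baseline covariates.
    Batch gradient descent; returns weights with the intercept last."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, ti in zip(X, t):
            err = sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, xi))) - ti
            for j in range(d):
                grad[j] += err * xi[j]
            grad[-1] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def propensity_scores(w, X):
    """Predicted probability of participating in the treatment."""
    return [sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def match_pairs(scores, t, caliper=0.05):
    """Step 2: greedy 1:1 matching in the one-dimensional score space.
    Each control is used at most once; pairs outside the caliper are dropped."""
    treated = [i for i, ti in enumerate(t) if ti == 1]
    controls = [i for i, ti in enumerate(t) if ti == 0]
    pairs = []
    for i in treated:
        if not controls:
            break
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs


# Synthetic observational data: the covariate x drives both participation
# and the outcome, so a naive group comparison is confounded.
random.seed(0)
X, t, y = [], [], []
for _ in range(400):
    x = random.gauss(0.0, 1.0)
    ti = 1 if random.random() < sigmoid(x) else 0
    X.append([x])
    t.append(ti)
    y.append(2.0 * x + 1.0 * ti + random.gauss(0.0, 0.5))  # true effect = 1.0

scores = propensity_scores(fit_logistic(X, t), X)
pairs = match_pairs(scores, t)

treated_y = [yi for yi, ti in zip(y, t) if ti == 1]
control_y = [yi for yi, ti in zip(y, t) if ti == 0]
naive_diff = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
matched_diff = sum(y[i] - y[j] for i, j in pairs) / len(pairs)

print(f"naive difference:   {naive_diff:.2f}")
print(f"matched difference: {matched_diff:.2f} from {len(pairs)} pairs")
```

In this toy setting the naive difference overstates the effect because participants start from a higher baseline, while the matched difference should land much closer to the true value. Real applications involve many covariates and more careful matching and diagnostics than this sketch shows.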

However, when I first implemented PSM in healthcare, I found the results to be quite noisy and inconsistent. As a result, I soon came to realize the power of robust matching using more than the propensity score and more than one outcome metric. Furthermore, I implemented multiple complementary algorithms to ensure that the results were valid from multiple perspectives. It was truly amazing to see businesses embracing causal insights, engaging in process innovations through prospective trials, and continuously building real-world evidence to help employees do their best work.
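One common way to check a match from more than one perspective is to verify covariate balance directly rather than trusting the propensity score alone. The sketch below computes the standardized mean difference (SMD) for a single covariate. It is offered as a generic diagnostic, not as the specific method described above. The sample values are made up, and the threshold of 0.1 is a frequently cited rule of thumb rather than a fixed standard.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical baseline covariate (e.g., prior-term GPA) for matched groups.
matched_treated = [3.1, 2.8, 3.4, 3.0, 2.6]
matched_control = [3.0, 2.9, 3.3, 3.1, 2.5]

smd = standardized_mean_difference(matched_treated, matched_control)
print("balanced" if abs(smd) < 0.1 else "imbalanced", round(smd, 3))
```

Running the same check on every baseline covariate, and on more than one outcome metric, gives a quick sense of whether a matched comparison can be trusted.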

Q: What about confounders pertaining to social-psychological, behavioral, motivational factors that are not observable?

It is true that they are not directly observable unless you ask through ecological momentary assessment. In my experience, these factors can be inferred through time-series feature engineering. For example, response under adversity can be ascertained by observing how students behave in LMS activities after adverse events. (Similarly, in healthcare, reaching out to patients right after adverse events is far more effective than waiting a month.) Another example is to use relative metrics with respect to a group that is exposed to the same influences, such as students in the same section or class. Oftentimes, change-based variables have been far more important than static variables because changes reflect and accommodate these social-psychological factors. In an NIH-funded research project, features derived from micro surveys, social networks, and activity event data were leveraged to quantify the effects of social and motivational factors that are crucial in spreading good health behaviors through one’s social network.
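As a rough sketch of the two feature types mentioned above, the code below derives a change-based feature (activity after an adverse event versus before it) and a relative feature (a student's value compared with peers in the same section). The function names, the two-week window, and the sample data are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def change_after_event(weekly_activity, event_week, window=2):
    """Mean activity in the `window` weeks after the event week minus the
    mean in the `window` weeks before it. The event week itself is excluded.
    Returns None when either side has no data."""
    before = weekly_activity[max(0, event_week - window):event_week]
    after = weekly_activity[event_week + 1:event_week + 1 + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)


def section_relative(value_by_student, section_of):
    """Z-score of each student's value within their own section, so that
    students are compared with peers exposed to the same course conditions."""
    by_section = {}
    for student, value in value_by_student.items():
        by_section.setdefault(section_of[student], []).append(value)
    relative = {}
    for student, value in value_by_student.items():
        peers = by_section[section_of[student]]
        sd = pstdev(peers)
        relative[student] = 0.0 if sd == 0 else (value - mean(peers)) / sd
    return relative


# Hypothetical weekly LMS logins; week index 3 holds a failed quiz.
logins = {
    "s1": [10, 12, 11, 4, 9, 12],   # dips, then recovers
    "s2": [10, 12, 11, 4, 2, 1],    # dips and keeps falling
    "s3": [6, 7, 6, 5, 6, 7],
}
section_of = {"s1": "A", "s2": "A", "s3": "B"}

recovery = {s: change_after_event(weeks, event_week=3) for s, weeks in logins.items()}
relative_recovery = section_relative(recovery, section_of)
print(recovery)
print(relative_recovery)
```

Here the two students in section A faced the same quiz, so the section-relative score separates the one who bounced back from the one who disengaged, which a static average of logins would blur.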

Q: What’s next?

As machine learning models become more commoditized, more emphasis will be placed on connecting risk model outputs to success outcomes. Lately, machine-learning vulnerabilities have been on full display in self-driving cars, self-learning chatbots, health-risk models that worsen equity gaps due to biased data, and ML models applied to quick-profit opportunities, where failures lead to human costs and higher levels of inequity, given that human costs are not equally distributed. What’s sorely needed is a balanced approach to all kinds of machine learning analytics with attention to detail, one that gets to the core of the hidden variables that explain why we do what we do and how we can reshape trajectories in a more favorable direction to help lower equity gaps, improve success rates of equitable interventions, and learn from causal ML insights. CML Insight is dedicated to finding causal insights and linking them to prospective trials to help one student at a time while helping everyone in the student success science field do their best work.