clustering calcium 45 and NIST data

I am working towards putting all the work I’ve been doing at Oak Ridge online, and one recent project has been data exploration for calcium 45 data and NIST BL2 data.

why am I doing this?

This boils down to: “why do data exploration?”, since in both of these use cases the motivations are the same as in most experimental sciences:

  1. You gather data, plan for sources of error, predict what will happen, and test, but after collecting the data you still want to get a good look at it. There could be problems you didn’t notice or unexplained trends.
  2. There are problems with your data that you anticipate, and you want to investigate them.

Really both of these come down to visualization. I used to think that data visualization was an inconsequential field, but I’ve spent hours on problems that I could have solved much faster if I had a good visualization of the subprocesses.


what do I want to understand?

There are two data sources – calcium 45 and NIST. I am more familiar with the calcium 45 data, since that experiment is a precursor to the Nab Experiment, which is what I’ve been working on for two years. For the calcium 45 data, there are a variety of effects at play:

  • data acquisition effects
    • some of the data is partially overwritten
    • some of the data is partially corrupted
    • the data acquisition system’s behavior changes across particle energy ranges
  • physics effects
    • some particles hit the same detector twice in a short time
    • some particles hit a detector edge and deposit energy in two detectors
    • some particles are actually cosmic rays
    • the setup has physical oscillations

Applying clustering methods makes it possible to explore these problems. For instance, if the system is shaking, there are several follow-up questions such as “How much does this affect the signal?” and “What is the frequency of oscillation?”.
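As a rough illustration of the “What is the frequency of oscillation?” question, here is a minimal sketch that pulls the strongest frequency out of a baseline with an FFT. The function name, sample rate, and the 50 Hz toy signal are all made up for the example; this is not the analysis I ran on the real data.

```python
import numpy as np

def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC spectral peak."""
    signal = np.asarray(signal, dtype=float)
    # Remove the mean so the zero-frequency bin does not dominate.
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate_hz)
    # Skip bin 0 (DC) when searching for the peak.
    return freqs[1 + np.argmax(spectrum[1:])]

# Toy baseline: a 50 Hz oscillation plus noise (all values hypothetical).
sample_rate_hz = 1000.0
t = np.arange(2000) / sample_rate_hz
rng = np.random.default_rng(0)
baseline = 0.5 * np.sin(2 * np.pi * 50.0 * t) + 0.1 * rng.normal(size=t.size)
estimated_hz = dominant_frequency(baseline, sample_rate_hz)
print(f"estimated oscillation frequency: {estimated_hz:.1f} Hz")
```

The height of the same spectral peak relative to the pulse amplitude gives a first handle on how much the oscillation affects the signal.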

how did I explore the data?

For both data sources, I tried several approaches to clustering and exploring the data. I varied the preprocessing significantly: I cut the data based on energy, normalized in different ways, filtered in an attempt to undo the data acquisition system’s electronics shaping, and cut the data based on the geometry of the detector. I also tried to extract clusters using a range of methods, from parametric to nonparametric.
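To make the pipeline concrete, here is a minimal sketch of one such combination: an energy cut, baseline subtraction, peak normalization, and a plain k-means written in NumPy. Everything here is an assumption for illustration: the function names, the toy pulse shapes, the 20-sample baseline window, and the farthest-point initialization (picked only because it is deterministic). It is not the code from the repository.

```python
import numpy as np

def preprocess(waveforms, energies, e_min, e_max, n_baseline=20):
    """Energy cut, baseline subtraction, and scaling to unit peak height."""
    keep = (energies >= e_min) & (energies <= e_max)
    selected = waveforms[keep].astype(float)
    # Use the first n_baseline samples as the baseline estimate.
    selected = selected - selected[:, :n_baseline].mean(axis=1, keepdims=True)
    peak = np.abs(selected).max(axis=1, keepdims=True)
    return selected / np.where(peak == 0, 1.0, peak), keep

def assign(data, centers):
    """Index of the nearest center (squared Euclidean distance) per row."""
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(data, k, n_iter=20):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [data[0]]
    for _ in range(1, k):
        d = np.min([((data - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(data[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = assign(data, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return assign(data, centers), centers

# Toy data: fast- and slow-rising pulses with random energies (hypothetical).
rng = np.random.default_rng(1)
n_events, n_samples = 200, 256
t = np.clip(np.arange(n_samples, dtype=float) - 100.0, 0.0, None)
is_slow = rng.random(n_events) < 0.5
rise = np.where(is_slow, 40.0, 5.0)
energies = rng.uniform(0.5, 2.5, n_events)
waveforms = energies[:, None] * (1.0 - np.exp(-t[None, :] / rise[:, None]))
waveforms += 0.01 * rng.normal(size=waveforms.shape)

data, keep = preprocess(waveforms, energies, e_min=1.0, e_max=2.0)
labels, centers = kmeans(data, k=2)
```

Normalizing before clustering matters here: without it, k-means mostly groups by pulse height (energy) rather than by pulse shape.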

what did I find?

For the real results, check out the GitHub page: link

The results I cared about most:

  1. There are significant oscillations with a pretty specific shape, and they are large enough to matter.
  2. The shape of the signal changes with energy. Tom Shelton had already simulated this with Monte Carlo, but it was cool to see it in the real data.
  3. The NIST data is full of an effect called “preamp saturation”, which comes from very high particle energies.
  4. The data is very nonlinear, so the nonparametric approaches were the only successful ones.
  5. As a side note, the cluster means from k-means clustering seem to make good basis functions for pseudoinverse fitting.
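The pseudoinverse fitting mentioned in the last item can be sketched as follows: treat each cluster mean as a template and solve for the least-squares coefficients that reproduce a waveform. The two toy templates below stand in for cluster means, and the function name is hypothetical.

```python
import numpy as np

def fit_with_basis(waveform, basis):
    """Least-squares coefficients c such that waveform ~= c @ basis.

    basis has shape (k, n_samples), one template (e.g. a cluster mean) per row.
    """
    return waveform @ np.linalg.pinv(basis)

# Two toy templates standing in for k-means cluster means (hypothetical).
t = np.clip(np.arange(256, dtype=float) - 100.0, 0.0, None)
basis = np.stack([1.0 - np.exp(-t / 5.0), 1.0 - np.exp(-t / 40.0)])

# A noisy waveform built from a known mix of the two templates.
rng = np.random.default_rng(2)
waveform = 2.0 * basis[0] + 3.0 * basis[1] + 0.01 * rng.normal(size=256)

coeffs = fit_with_basis(waveform, basis)
residual = waveform - coeffs @ basis
```

The fitted coefficients act as a compact description of the waveform, and a large residual flags events that none of the templates describe well.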

APS DNP 2019 talk


I recently got back from the 2019 American Physical Society Division of Nuclear Physics conference, where I gave a talk about applying machine learning tools to a signal processing problem in nuclear physics and radiation detection. This post covers some of that talk (here’s a link to the slides).

what problem is this addressing?

In radiation detection systems commonly used in nuclear physics, a charged particle such as a proton or electron hits a cold semiconductor, inducing a charge. That charge is digitized and recorded. When the data acquisition system detects an event, it records the charge over a short time window around the event (in my example 14 milliseconds). In a few different cases, multiple particles can hit the detector at nearly the same time, resulting in multiple signals in the same readout. If this isn’t identified, then the data acquisition system misreads the characteristics of the event. This edge behavior has to be caught in a precision experiment! Below is a figure showing what happens when two particles arrive closer and closer together in time, with some oscillations in the background.
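For readers who want to play with this, here is a minimal sketch of how such two-particle waveforms can be generated synthetically. The pulse shape (exponential rise and decay), the time constants, and the function names are assumptions chosen for illustration, not the detector response used in the talk.

```python
import numpy as np

def pulse(n_samples, t0, amplitude, rise=5.0, decay=500.0):
    """Single pulse starting at sample t0 (assumed exponential rise and decay)."""
    t = np.clip(np.arange(n_samples, dtype=float) - t0, 0.0, None)
    return amplitude * (1.0 - np.exp(-t / rise)) * np.exp(-t / decay)

def pileup_waveform(n_samples, t0, delay, amp1, amp2,
                    osc_amp=0.0, osc_period=200.0, noise=0.0, rng=None):
    """Two pulses separated by `delay` samples, plus oscillation and noise."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.arange(n_samples, dtype=float)
    wave = pulse(n_samples, t0, amp1) + pulse(n_samples, t0 + delay, amp2)
    wave += osc_amp * np.sin(2 * np.pi * samples / osc_period)
    wave += noise * rng.normal(size=n_samples)
    return wave

# The second particle arrives closer and closer to the first.
rng = np.random.default_rng(3)
delays = [200, 50, 10, 0]
examples = [
    pileup_waveform(1000, 300, d, 1.0, 0.7, osc_amp=0.05, noise=0.01, rng=rng)
    for d in delays
]
```

At zero delay the two pulses add up to what looks like a single larger pulse, which is exactly the case where the event’s energy gets misread.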

how does this solve the problem?

I approach this problem from two sides. First, I try to develop a supervised learner that can identify these events, trained on synthetic data. I implement

  • a vanilla 3-layer dense neural network
  • a 3-layer convolutional network
  • an LSTM network (super slow, and it trains poorly)
  • a support vector machine
  • several ensemble regression methods
  • a one dimensional ResNet 50

What I find is that the ResNet 50 does really well given a decent chunk of training data (I have as much as I want because I can create synthetic data). Instead of predicting a binary classification, I have the network predict the time delay between the two particles (minimizing the RMS error), so that a threshold can be used to cut the data and the result is more interpretable in terms of where the problems are.
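The regression-then-threshold idea can be sketched in a few lines. The numbers below are invented stand-ins for a network’s output, the function names are hypothetical, and I am assuming that a predicted delay above the threshold is what gets flagged as pileup.

```python
import numpy as np

def rms_error(predicted, true):
    """Root-mean-square error between predicted and true time delays."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.sqrt(np.mean((predicted - true) ** 2))

def flag_pileup(predicted_delay, threshold):
    """Flag events whose predicted delay exceeds the threshold (assumed rule)."""
    return np.asarray(predicted_delay, dtype=float) > threshold

# Made-up delays in samples: two single-particle events and two pileup events.
true_delay = [0, 0, 40, 120]
predicted_delay = [2, -1, 35, 126]

error = rms_error(predicted_delay, true_delay)
flags = flag_pileup(predicted_delay, threshold=10)
```

Because the output is a delay rather than a yes/no label, the threshold can be tuned after training, and the events near the threshold show where the model struggles.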

My second approach is to try to find this kind of event in real data from the Calcium 45 experiment. This is very tricky. Data wrangling is always a hassle, and on top of that this kind of event occurs only very rarely in Calcium 45 (though it will be about 22% of other experiments’ data rate). I was not expecting to find any pileup, but I was able to find about 10 examples in the data, as well as several other problems with the calcium 45 data (that will be another post!). Here’s a plot of real data – it looks very much like the synthetic data and has a very small time delay, which gives me hope about using unsupervised methods in the new experiment, Nab.

what did I learn?

There are a few takeaways I got from this:

  1. Labeling data is very time intensive – data exploration is easy when you’re just cruising around, but if you’re looking for a specific effect it’s much harder
  2. Preprocessing is far more involved than the actual machine learning – maybe that’s just the success of modern machine learning libraries, that cleaning up the data is more work than doing ML.
  3. Sometimes a simple solution is fine – I went with ResNet 50 because I could get more returns if I pushed the amount of training data to the max, but the simple 3-layer convolutional neural net performed pretty well
  4. Practice makes perfect! – I realized I was bad at public speaking about a year and a half ago and signed up for a bunch of public speaking since then. It was really rewarding to see myself improve.

where’s the code?

Here’s the code: link