Today, as part of my “Exploring Less Known Datasets for Machine Learning” series, we will have a look at a dataset on critical temperatures for superconductivity. The dataset is hosted on the UCI Machine Learning Repository and originates from a Japanese database. It was used in this publication by K. Hamidieh (2018); the baseline result there is an RMSE of about 9.5 K. Let’s see if we can beat it without much effort.

The catch with superconductors is that they only become superconducting, i.e. conduct electricity without resistance, once they are cooled below a critical temperature; above it, they lose this property. Let’s see how well we can predict this temperature from a set of extracted features.

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the superconductivity training data
filepath_input_data = "./data/train.csv"
input_data_df = pd.read_csv(filepath_input_data)

# First/last rows and summary statistics as a quick sanity check
display(input_data_df.head(3))
display(input_data_df.tail(3))
input_data_df.describe()
input_data_df.head(3):

number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence critical_temp
0 4 88.944468 57.862692 66.361592 36.116612 1.181795 1.062396 122.90607 31.794921 51.968828 ... 2.257143 2.213364 2.219783 1.368922 1.066221 1 1.085714 0.433013 0.437059 29.0
1 5 92.729214 58.518416 73.132787 36.396602 1.449309 1.057755 122.90607 36.161939 47.094633 ... 2.257143 1.888175 2.210679 1.557113 1.047221 2 1.128571 0.632456 0.468606 26.0
2 4 88.944468 57.885242 66.361592 36.122509 1.181795 0.975980 122.90607 35.741099 51.968828 ... 2.271429 2.213364 2.232679 1.368922 1.029175 1 1.114286 0.433013 0.444697 19.0

input_data_df.tail(3):

number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence critical_temp
21260 2 99.663190 95.609104 99.433882 95.464320 0.690847 0.530198 13.51362 53.041104 6.756810 ... 4.80 4.472136 4.781762 0.686962 0.450561 1 3.20 0.500000 0.400000 1.98
21261 2 99.663190 97.095602 99.433882 96.901083 0.690847 0.640883 13.51362 31.115202 6.756810 ... 4.69 4.472136 4.665819 0.686962 0.577601 1 2.21 0.500000 0.462493 1.84
21262 3 87.468333 86.858500 82.555758 80.458722 1.041270 0.895229 71.75500 43.144000 29.905282 ... 4.50 4.762203 4.242641 1.054920 0.970116 3 1.80 1.414214 1.500000 12.80

input_data_df.describe():

number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass range_atomic_mass wtd_range_atomic_mass std_atomic_mass ... wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence wtd_std_Valence critical_temp
count 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 ... 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000 21263.000000
mean 4.115224 87.557631 72.988310 71.290627 58.539916 1.165608 1.063884 115.601251 33.225218 44.391893 ... 3.153127 3.056536 3.055885 1.295682 1.052841 2.041010 1.483007 0.839342 0.673987 34.421219
std 1.439295 29.676497 33.490406 31.030272 36.651067 0.364930 0.401423 54.626887 26.967752 20.035430 ... 1.191249 1.046257 1.174815 0.393155 0.380291 1.242345 0.978176 0.484676 0.455580 34.254362
min 1.000000 6.941000 6.423452 5.320573 1.960849 0.000000 0.000000 0.000000 0.000000 0.000000 ... 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000210
25% 3.000000 72.458076 52.143839 58.041225 35.248990 0.966676 0.775363 78.512902 16.824174 32.890369 ... 2.116732 2.279705 2.091251 1.060857 0.775678 1.000000 0.921454 0.451754 0.306892 5.365000
50% 4.000000 84.922750 60.696571 66.361592 39.918385 1.199541 1.146783 122.906070 26.636008 45.123500 ... 2.618182 2.615321 2.434057 1.368922 1.166532 2.000000 1.063077 0.800000 0.500000 20.000000
75% 5.000000 100.404410 86.103540 78.116681 73.113234 1.444537 1.359418 154.119320 38.356908 59.322812 ... 4.026201 3.727919 3.914868 1.589027 1.330801 3.000000 1.918400 1.200000 1.020436 63.000000
max 9.000000 208.980400 208.980400 208.980400 208.980400 1.983797 1.958203 207.972460 205.589910 101.019700 ... 7.000000 7.000000 7.000000 2.141963 1.949739 6.000000 6.992200 3.000000 3.000000 185.000000

Well, visualizations are always nicer :)
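For example, a quick histogram of the target already shows how strongly the critical temperatures are skewed towards low values (median 20 K vs. a maximum of 185 K). A minimal sketch, reusing the dataframe and the matplotlib import from above:

# Histogram of the target variable
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(input_data_df["critical_temp"], bins=50)
ax.set_xlabel("critical temperature [K]")
ax.set_ylabel("count")
ax.set_title("Distribution of critical temperatures")
plt.show()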

Next, we have to scale the input data and throw a few machine learning algorithms at it (train-test split: 0.75/0.25, with 5-fold cross-validation).
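As a rough idea of what such a pipeline can look like, here is a minimal sketch using scikit-learn with a StandardScaler and a Random Forest as one candidate model; the concrete estimator, hyperparameters and random_state are illustrative assumptions, not the exact setup behind the results discussed below:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Separate features and target
X = input_data_df.drop(columns=["critical_temp"]).values
y = input_data_df["critical_temp"].values

# 0.75 / 0.25 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the scaler on the training set only to avoid leaking test statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5-fold cross-validation on the training set, scored by RMSE
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
cv_rmse = -cross_val_score(model, X_train_scaled, y_train, cv=5,
                           scoring="neg_root_mean_squared_error")
print("CV RMSE: %.2f +/- %.2f K" % (cv_rmse.mean(), cv_rmse.std()))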

The most interesting result is that both Random Forest and XGBoost outperform the baseline by a bit. With more careful hyperparameter tuning and perhaps different pre-processing, this could probably be improved even further. The performance of the neural networks is poor, but I didn’t spend any time on designing them; I just used one NN with 2 and one with 5 dense layers. There is one thing I want to point out here, though: metrics for regression problems, especially R2, MAE and RMSE, have trouble evaluating symmetric error distributions properly.
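To make that point more tangible, it helps to compute the scalar metrics and then look at the residual distribution itself rather than relying on a single number. A minimal sketch, assuming the fitted Random Forest and the test split from the sketch above:

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Fit on the scaled training data and evaluate on the held-out test set
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("R2:   %.3f" % r2_score(y_test, y_pred))
print("MAE:  %.2f K" % mean_absolute_error(y_test, y_pred))
print("RMSE: %.2f K" % np.sqrt(mean_squared_error(y_test, y_pred)))

# The scalar metrics hide the shape of the error distribution,
# so it is worth plotting the residuals directly
residuals = y_test - y_pred
plt.hist(residuals, bins=50)
plt.xlabel("residual (true - predicted) [K]")
plt.ylabel("count")
plt.show()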