Today, we will have a look at this dataset on Concrete Compressive Strength by Yeh (1998) [1] as part of my “Exploring Less Known Datasets for Machine Learning” series. Let us see how state-of-the-art algorithms compare with the results from 1998. (My views on this dataset are entirely based on Yeh (1998)).
Contents
- Dataset, baseline results from Yeh (1998) and some domain knowledge
- Dataset exploration and preprocessing
- Applying classical machine learning algorithms
- Applying Deep Neural Networks
- Discussion of results
Dataset, baseline results from Yeh (1998) and some domain knowledge
Legal
dataset © Prof. I-Cheng Yeh [1]; published in UCI Machine Learning Repository [2].
The aim of the dataset is to predict the concrete compressive strength of high performance concrete (HPC). HPC does not always mean high strength but covers all kinds of concrete for special applications that are not possible with standard concretes. Therefore, our target value is:
target y
- Concrete compressive strength [MPa]
In this case the compressive strength is the cylindrical compressive strength, meaning a cylindrical sample (15 cm diameter; 30 cm height) was used for testing. The value is a bit smaller than that obtained from testing cubic samples. Both tests assess the uniaxial compressive strength. Usually, we get both values if we buy concrete.
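As a rough rule of thumb (the exact factor depends on the standard and the strength class, so treat this as an approximation), the cylinder strength is about 80 % of the cube strength:
\[f_{c,\text{cyl}} \approx 0.8 \cdot f_{c,\text{cube}}\]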
To predict compressive strengths, we have these features available:
input X
- Cement \(\left[\frac{kg}{m^3}\right]\)
- Fly ash \(\left[\frac{kg}{m^3}\right]\)
- Blast furnace slag \(\left[\frac{kg}{m^3}\right]\)
- Water \(\left[\frac{kg}{m^3}\right]\)
- Superplasticizer \(\left[\frac{kg}{m^3}\right]\)
- Fine aggregate \(\left[\frac{kg}{m^3}\right]\)
- Coarse aggregate \(\left[\frac{kg}{m^3}\right]\)
- Age \([d]\)
Let us have a closer look at some of the features:
Age
Concrete hardens over time and its strength increases. Usually, concrete is tested after 28 days.
Fly ash and slag
Both can be considered as binders together with cement. Both increase the strength and durability of concrete. However, they slow down the hardening process, and therefore the concrete requires more time to reach its full compressive strength.
Superplasticizer
Superplasticizers are used to ensure better flow properties because they minimize particle segregation. Further, they allow decreasing the water-cement ratio, which leads to higher compressive strength.
Since it is a small dataset, Yeh (1998) selected training and testing sets both manually and randomly. For the manual splitting, the dataset was split into 4 sets that were used in a "k-fold style" (a sketch of such a split follows after the numbers below).
The handpicked sets show signs of overfitting:
- R² (training): 0.917 - 0.935
- R² (testing): 0.814 - 0.895
The random sets show fewer signs of overfitting and better overall performance:
- R² (training): 0.923 - 0.945
- R² (testing): 0.908 - 0.922
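How exactly Yeh (1998) drew these sets is not documented. Just to illustrate the idea, a random 4-fold split in that spirit could be set up with scikit-learn like this (the fold count, shuffling and seed are my assumptions):
# illustrative 4-fold random split; the original splits are not documented
import numpy as np
from sklearn.model_selection import KFold

X_placeholder = np.arange(1030).reshape(-1, 1)  # stands in for the 1030 samples loaded below
kf = KFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_placeholder)):
    print('fold %d: %d training / %d testing samples' % (fold, len(train_idx), len(test_idx)))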
Dataset exploration and preprocessing
Basic dataset properties
Let us load the dataset (xls file) using pandas and have a look at the data:
I cleaned the header a bit to make it more accessible.
import pandas as pd

filepath_input_data = './data/Concrete_Data_clean_header.xls'
input_data = pd.read_excel(filepath_input_data)
display(input_data.head(5))
| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Concrete compressive strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.052780 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |
display(input_data.tail(5))
| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Concrete compressive strength |
|---|---|---|---|---|---|---|---|---|---|
| 1025 | 276.4 | 116.0 | 90.3 | 179.6 | 8.9 | 870.1 | 768.3 | 28 | 44.284354 |
| 1026 | 322.2 | 0.0 | 115.6 | 196.0 | 10.4 | 817.9 | 813.4 | 28 | 31.178794 |
| 1027 | 148.5 | 139.4 | 108.6 | 192.7 | 6.1 | 892.4 | 780.0 | 28 | 23.696601 |
| 1028 | 159.1 | 186.7 | 0.0 | 175.6 | 11.3 | 989.6 | 788.9 | 28 | 32.768036 |
| 1029 | 260.9 | 100.5 | 78.3 | 200.6 | 8.6 | 864.5 | 761.5 | 28 | 32.401235 |
input_data.describe()
| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Concrete compressive strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 |
| mean | 281.165631 | 73.895485 | 54.187136 | 181.566359 | 6.203112 | 972.918592 | 773.578883 | 45.662136 | 35.817836 |
| std | 104.507142 | 86.279104 | 63.996469 | 21.355567 | 5.973492 | 77.753818 | 80.175427 | 63.169912 | 16.705679 |
| min | 102.000000 | 0.000000 | 0.000000 | 121.750000 | 0.000000 | 801.000000 | 594.000000 | 1.000000 | 2.331808 |
| 25% | 192.375000 | 0.000000 | 0.000000 | 164.900000 | 0.000000 | 932.000000 | 730.950000 | 7.000000 | 23.707115 |
| 50% | 272.900000 | 22.000000 | 0.000000 | 185.000000 | 6.350000 | 968.000000 | 779.510000 | 28.000000 | 34.442774 |
| 75% | 350.000000 | 142.950000 | 118.270000 | 192.000000 | 10.160000 | 1029.400000 | 824.000000 | 56.000000 | 46.136287 |
| max | 540.000000 | 359.400000 | 200.100000 | 247.000000 | 32.200000 | 1145.000000 | 992.600000 | 365.000000 | 82.599225 |
A scatterplot matrix helps us to understand the distribution of each feature and provides us with the Pearson correlation coefficients.
import numpy as np
import matplotlib.pyplot as plt

scatterplot_matrix = pd.plotting.scatter_matrix(input_data, alpha=0.2, figsize=(17, 17), diagonal='kde')
corr = input_data.corr().values
# annotate the upper triangle with the Pearson correlation coefficients
for i, j in zip(*np.triu_indices_from(scatterplot_matrix, k=1)):
    scatterplot_matrix[i, j].annotate('%.4f' % corr[i, j], (0.5, 0.5), xycoords='axes fraction', ha='center', va='center')
plt.tight_layout()
plt.show()
The scatterplot matrix shows us that some features, such as 'Age', are less evenly distributed than others. Boxplots tell us a bit more about our data:
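The boxplots referred to here can be produced directly from the DataFrame; the following is a minimal sketch (the layout and figure size are my own choices, not from the original post):
# one boxplot per feature; the features live on very different scales, hence separate subplots
input_data.plot(kind='box', subplots=True, layout=(3, 3), figsize=(12, 10), sharex=False, sharey=False)
plt.tight_layout()
plt.show()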
Since most of our machine learning algorithms assume normalized data, we can apply MaxAbsScaler to scale the features.
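A minimal example of this scaling step (MaxAbsScaler divides each column by its maximum absolute value, so all of our non-negative features end up in [0, 1]):
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
input_data_scaled = input_data.copy()
input_data_scaled.loc[:, :] = scaler.fit_transform(input_data)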
Feature engineering
A key factor in concrete engineering is the water-cement ratio.
Yeh (1998) used
- Water to cement
- Water to binder
- Superplasticizer to binder
- Fly ash to binder
- Slag to binder
- Fly ash + slag to binder
This raises a few questions. Mainly: what exactly is binder?
With this dataset we cannot reproduce any of the ratio values that Yeh (1998) used. It is also not clear which features he finally used. Did he use only some of them? Only the ratios and age? This is only documented properly for the (manually designed) linear regression models but not for the neural networks.
Water to cement ratio
Water to cement seems to be the simplest:
\[\text{w/c} = \frac{\text{Water} \left[\frac{kg}{m^3}\right]}{\text{Cement} \left[\frac{kg}{m^3}\right]}\]
# numpy, MaxAbsScaler and train_test_split are needed in this block
import numpy as np
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split

# `datasets` (dict) and `dataset_id` (counter) were initialized earlier,
# when dataset 0 (all original features, scaled) was created (not shown here)

# pure water-cement ratio feature
input_features_unscaled_with_wcratio = input_data.copy()
input_features_unscaled_with_wcratio.insert(input_features_unscaled_with_wcratio.shape[-1]-1, 'water-cement-ratio', input_features_unscaled_with_wcratio['Water ']/input_features_unscaled_with_wcratio['Cement'])
input_features_unscaled_with_wcratio.drop(['Water ', 'Cement'], axis=1, inplace=True)
# scale it!
scaler = MaxAbsScaler()
input_features_unscaled_with_wcratio_scaled = input_features_unscaled_with_wcratio.copy()
input_features_unscaled_with_wcratio_sc = scaler.fit_transform(input_features_unscaled_with_wcratio)
input_features_unscaled_with_wcratio_scaled.loc[:,:] = input_features_unscaled_with_wcratio_sc
# making the scaling function more accessible
extract_scaling_function = np.ones((1,input_features_unscaled_with_wcratio.shape[1]))
extract_scaling_function = scaler.inverse_transform(extract_scaling_function)
# split data into X and y
y_df = input_features_unscaled_with_wcratio_scaled['Concrete compressive strength'].copy()
X_df = input_features_unscaled_with_wcratio_scaled.copy()
X_df.drop('Concrete compressive strength', axis=1, inplace=True)
# add it to the dict of datasets for fast and iterative testing
X_train, X_test, y_train, y_test = train_test_split(X_df.values, y_df.values,test_size=0.2, random_state=42, shuffle=True)
comment = 'pure water-cement-ratio; 7 inputs, 1 output'
datasets[dataset_id] = {'X_train': X_train, 'X_test' : X_test, 'y_train': y_train, 'y_test' : y_test, 'scaler' : scaler, 'scaler_array' : extract_scaling_function, 'comment' : comment, 'dataset' : dataset_id}
dataset_id += 1
input_features_unscaled_with_wcratio['water-cement-ratio'].head()
0 0.300000
1 0.300000
2 0.685714
3 0.685714
4 0.966767
Name: water-cement-ratio, dtype: float64
input_features_unscaled_with_wcratio['water-cement-ratio'].describe()
count 1030.000000
mean 0.748269
std 0.314005
min 0.266893
25% 0.533333
50% 0.675349
75% 0.935014
max 1.882353
Name: water-cement-ratio, dtype: float64
Roughly 300 data points were added compared to the original publication, where the water-cement ratio was described as:
min 0.24
mean 0.97
max 2.73
This raises a question: what happened? Different min and mean values are explainable, but a lower max value is not.
Water to binder ratio
Depending on the definition, we could count fly ash and slag as binders as well. Usually they are weighted with k-values, which we do not have. Hence, we have to try without them:
\[\text{w/b} = \frac{\text{Water} \left[\frac{kg}{m^3}\right]}{\text{Cement} \left[\frac{kg}{m^3}\right] + \text{Fly ash} \left[\frac{kg}{m^3}\right] + \text{Blast furnace slag} \left[\frac{kg}{m^3}\right]}\]
input_features_unscaled_with_wcratio_full = input_data.copy()
input_features_unscaled_with_wcratio_full.insert(input_features_unscaled_with_wcratio_full.shape[-1]-1,'water-cement-ratio_with_slag_ash',input_features_unscaled_with_wcratio_full['Water ']/(input_features_unscaled_with_wcratio_full['Cement'] + input_features_unscaled_with_wcratio_full['Fly Ash'] + input_features_unscaled_with_wcratio_full['Blast Furnace Slag ']))
input_features_unscaled_with_wcratio_full.drop(['Water ', 'Cement', 'Fly Ash', 'Blast Furnace Slag '], axis=1, inplace=True)
scaler = MaxAbsScaler()
input_features_unscaled_with_wcratio_full_scaled = input_features_unscaled_with_wcratio_full.copy()
input_features_unscaled_with_wcratio_full_sc = scaler.fit_transform(input_features_unscaled_with_wcratio_full)
input_features_unscaled_with_wcratio_full_scaled.loc[:,:] = input_features_unscaled_with_wcratio_full_sc
extract_scaling_function = np.ones((1,input_features_unscaled_with_wcratio_full.shape[1]))
extract_scaling_function = scaler.inverse_transform(extract_scaling_function)
# split data into X and y
y_df = input_features_unscaled_with_wcratio_full_scaled['Concrete compressive strength'].copy()
X_df = input_features_unscaled_with_wcratio_full_scaled.copy()
X_df.drop('Concrete compressive strength', axis=1, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X_df.values, y_df.values,test_size=0.2, random_state=42, shuffle=True)
comment = 'water-cement-ratio (binder?); 5 inputs, 1 output'
datasets[dataset_id] = {'X_train': X_train, 'X_test' : X_test, 'y_train': y_train, 'y_test' : y_test, 'scaler' : scaler, 'scaler_array' : extract_scaling_function, 'comment' : comment, 'dataset' : dataset_id}
input_features_unscaled_with_wcratio_full['water-cement-ratio_with_slag_ash'].head()
0 0.30000
1 0.30000
2 0.48000
3 0.48000
4 0.58006
Name: water-cement-ratio_with_slag_ash, dtype: float64
input_features_unscaled_with_wcratio_full['water-cement-ratio_with_slag_ash'].describe()
count 1030.000000
mean 0.469236
std 0.127127
min 0.235073
25% 0.383916
50% 0.472118
75% 0.561244
max 0.900000
Name: water-cement-ratio_with_slag_ash, dtype: float64
Applying classical machine learning algorithms
When it comes to applying classical machine learning algorithms to a relatively small dataset (few features, not too many data points), we may try a brute-force approach to find suitable models. In this case we can use GridSearchCV from scikit-learn to train and test various hyperparameters for the following algorithms (with 4-fold cross-validation, so that the validation and test sets are of similar size: both 0.2 * dataset size):
- Linear Regression
- Decision Tree Regression
- SVM Regression
- Random Forest Regression
- AdaBoost Regression
- XGBoost Regression
In this case using the mean squared error as the scorer is appropriate. However, for later interpretation we also have to re-calculate our metrics on the unscaled data, since it represents a physical quantity. Therefore, our regression looks like this (shown here for XGBoost):
def xgboost_regression(X_train, X_test, y_train, y_test, scorer, dataset_id):
    x_gradient_boosting_regression = xgboost.XGBRegressor(random_state=42)
    grid_parameters_x_gradient_boosting_regression = {'n_estimators': [3, 5, 10, 15, 18, 20, 25, 50, 60, 80, 100, 120, 150, 200, 300],
                                                      'max_depth': [1, 2, 3, 5, 7, 9, 10, 11, 15],
                                                      'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.8, 1.0],
                                                      }
    start_time = time.time()
    grid_obj = GridSearchCV(x_gradient_boosting_regression, param_grid=grid_parameters_x_gradient_boosting_regression, cv=kfold_vs_size, n_jobs=-1, scoring=scorer, verbose=0)
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    best_x_gradient_boosting_regression = grid_fit.best_estimator_
    prediction = best_x_gradient_boosting_regression.predict(X_test)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    mae = mean_absolute_error(y_true=y_test, y_pred=prediction)
    # metrics for true values
    # r2 remains unchanged; mse and mae will change and cannot simply be rescaled
    # because there is some physical meaning behind them
    prediction_true_scale = prediction * datasets[dataset_id]['scaler_array'][:,-1]
    y_test_true_scale = y_test * datasets[dataset_id]['scaler_array'][:,-1]
    mae_true_scale = mean_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    medae_true_scale = median_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    mse_true_scale = mean_squared_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    return {'Regression type': 'eXtreme Gradient Boosting Regression', 'model': grid_fit, 'Predictions': prediction, 'R2': r2, 'MSE': mse, 'MAE': mae, 'MSE_true_scale': mse_true_scale, 'RMSE_true_scale': np.sqrt(mse_true_scale), 'MAE_true_scale': mae_true_scale, 'MedAE_true_scale': medae_true_scale, 'Training time': training_time, 'dataset': dataset_id}
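The function relies on a few names that are not shown in this excerpt (`scorer`, `kfold_vs_size`, the metric functions and the imports). A plausible setup, which would normally sit near the top of the notebook, might look like the following; note that this is my assumption, not the original code:
# assumed setup for the grid searches (not shown in the original excerpt)
import time
import xgboost
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error, median_absolute_error, r2_score

kfold_vs_size = 4  # 4-fold CV keeps validation and test sets at ~0.2 * dataset size each
scorer = make_scorer(mean_squared_error, greater_is_better=False)

# run the grid search for every prepared dataset and collect the result dicts
results = []
for ds_id, ds in datasets.items():
    results.append(xgboost_regression(ds['X_train'], ds['X_test'], ds['y_train'], ds['y_test'], scorer, ds_id))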
This leads to the following results:
Dataset 0 (no features added or dropped)
| Regression type | R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale |
|---|---|---|---|---|---|---|---|
| Linear Regression | 0.628008 | 0.014050 | 0.093831 | 95.855234 | 9.790569 | 7.750363 | 6.208384 |
| Decision Tree Regression | 0.809726 | 0.007186 | 0.056071 | 49.029968 | 7.002140 | 4.631395 | 3.137805 |
| SVM Regression | 0.796093 | 0.007701 | 0.068799 | 52.542968 | 7.248653 | 5.682717 | 4.554011 |
| Random Forest Regression | 0.875810 | 0.004690 | 0.048152 | 32.001287 | 5.656968 | 3.977335 | 2.662481 |
| AdaBoost Regression | 0.775230 | 0.008489 | 0.076112 | 57.918903 | 7.610447 | 6.286776 | 6.055077 |
| eXtreme Gradient Boosting Regression | 0.921480 | 0.002966 | 0.037675 | 20.233235 | 4.498137 | 3.111918 | 1.982341 |
Dataset 1 (water-cement ratio added, ‘Water’ and ‘Cement’ dropped)
| Regression type | R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale |
|---|---|---|---|---|---|---|---|
| Linear Regression | 0.583423 | 0.015734 | 0.098371 | 107.344074 | 10.360699 | 8.125363 | 6.605434 |
| Decision Tree Regression | 0.884075 | 0.004378 | 0.046782 | 29.871788 | 5.465509 | 3.864138 | 2.758249 |
| SVM Regression | 0.792823 | 0.007825 | 0.068682 | 53.385716 | 7.306553 | 5.673112 | 4.536345 |
| Random Forest Regression | 0.896781 | 0.003898 | 0.044800 | 26.597586 | 5.157285 | 3.700447 | 2.627287 |
| AdaBoost Regression | 0.761645 | 0.009002 | 0.076656 | 61.419547 | 7.837062 | 6.331761 | 5.468500 |
| eXtreme Gradient Boosting Regression | 0.942883 | 0.002157 | 0.033113 | 14.718070 | 3.836414 | 2.735069 | 1.980554 |
Dataset 2 (water-binder ratio added, ‘Water’, ‘Cement’, ‘Slag’ and ‘Fly ash’ dropped)
| Regression type | R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale |
|---|---|---|---|---|---|---|---|
| Linear Regression | 0.616510 | 0.014484 | 0.098235 | 98.818192 | 9.940734 | 8.114129 | 7.272473 |
| Decision Tree Regression | 0.809843 | 0.007182 | 0.056120 | 48.999786 | 6.999985 | 4.635475 | 2.871331 |
| SVM Regression | 0.780538 | 0.008289 | 0.069751 | 56.551185 | 7.520052 | 5.761395 | 4.419597 |
| Random Forest Regression | 0.888769 | 0.004201 | 0.044238 | 28.662208 | 5.353710 | 3.653984 | 2.216723 |
| AdaBoost Regression | 0.759691 | 0.009076 | 0.076208 | 61.923092 | 7.869123 | 6.294741 | 5.834458 |
| eXtreme Gradient Boosting Regression | 0.915372 | 0.003196 | 0.038303 | 21.807015 | 4.669798 | 3.163834 | 2.303538 |
First, XGBoost on dataset 1 outperforms all neural networks presented in the original paper by Yeh (1998). Moreover, XGBoost reaches a performance similar to those neural networks on all three datasets. We should not be surprised that dataset 2 does not perform as well; as mentioned earlier, the water-binder ratio is a wild guess at what might have been used.
Further, Random Forest regression performs in the range of the neural network results that were obtained on the handpicked training and testing sets.
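Before moving on to neural networks, we can also peek at which inputs drive the best XGBoost model. This is only a sketch: it assumes the `results` list from the setup above and that the model for dataset 1 sits at index 1.
# sketch: feature importances of the best XGBoost model on dataset 1 (bookkeeping is assumed)
best_xgb = results[1]['model'].best_estimator_
feature_names = input_features_unscaled_with_wcratio.drop('Concrete compressive strength', axis=1).columns
for name, importance in sorted(zip(feature_names, best_xgb.feature_importances_), key=lambda item: -item[1]):
    print('%s: %.3f' % (name, importance))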
Applying Deep Neural Networks
Do we really need neural networks? From a time-investment perspective this is questionable, but not from the perspective of having fun ;). We can use Keras to build our DNNs, since this is a small and simple dataset and therefore we do not need to optimize much for speed.
Neural Networks similar to Yeh (1998)
Yeh writes in his 1998 paper that he used one hidden layer with 8 units (for 8 input features?) and one output layer. Further, he used a learning rate of 1.0, a “momentum factor” of 0.5 and 3000 iterations. It is unclear whether any cross-validation was used. Unfortunately, that is all that was published on the neural network. I guess this is as close as we can get to it:
# simple model close to Yeh (1998)
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

def build_baseline_model(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(1))
    sgd = optimizers.SGD(lr=1.0, momentum=0.5)
    model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['mae'])
    return model
It is difficult to guess a useful batch size. Perhaps full-batch training is appropriate, since the original work was carried out in 1998. Let us start with batch_size = 16:
# example of how the training process looks for all neural networks
from keras import callbacks

def run_baseline_model(X_train, X_test, y_train, y_test, dataset_id, epochs=3000, validation_split=0.2, batch_size=16):
    model = build_baseline_model(datasets[dataset_id]['X_train'].shape[1])
    checkpoint = callbacks.ModelCheckpoint('keras_models/model_baseline_model_'+str(dataset_id)+'_best.hdf5', monitor='val_loss', save_best_only=True, save_weights_only=True)
    start_time = time.time()
    model.fit(X_train, y_train, callbacks=[checkpoint], batch_size=batch_size, epochs=epochs, validation_split=validation_split)
    training_time = time.time() - start_time
    # load best model
    model.load_weights('keras_models/model_baseline_model_'+str(dataset_id)+'_best.hdf5')
    prediction = model.predict(X_test)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    mae = mean_absolute_error(y_true=y_test, y_pred=prediction)
    # metrics for true values
    # r2 remains unchanged; mse and mae will change and cannot simply be rescaled
    # because there is some physical meaning behind them
    prediction_true_scale = prediction * datasets[dataset_id]['scaler_array'][:,-1]
    y_test_true_scale = y_test * datasets[dataset_id]['scaler_array'][:,-1]
    mae_true_scale = mean_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    medae_true_scale = median_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    mse_true_scale = mean_squared_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    return {'Regression type': 'Baseline NN', 'model': model, 'Predictions': prediction, 'R2': r2, 'MSE': mse, 'MAE': mae, 'MSE_true_scale': mse_true_scale, 'RMSE_true_scale': np.sqrt(mse_true_scale), 'MAE_true_scale': mae_true_scale, 'MedAE_true_scale': medae_true_scale, 'Training time': training_time, 'dataset': dataset_id}
This model yields total chaos:
| R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale | Dataset |
|---|---|---|---|---|---|---|---|
| -0.001312 | 0.037818 | 0.158211 | 258.019144 | 16.062974 | 13.068101 | 12.093447 | 0 |
| 0.594574 | 0.015312 | 0.101047 | 104.470664 | 10.221089 | 8.346391 | 7.861097 | 1 |
| -0.001457 | 0.037824 | 0.158224 | 258.056630 | 16.064141 | 13.069158 | 12.090982 | 2 |
Changing to a single (full) batch leads to surprisingly small changes. It seems that the SGD optimizer with a learning rate of 1 and a momentum of 0.5 causes the trouble. Therefore, given these really bad results and the poor documentation of which features were used with which model architecture, we should try some more modern approaches with a similar setup:
# simple model similar to Yeh (1998)
# optimizer changed
def build_baseline_model_modified_optimizer(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model
| R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale | dataset |
|---|---|---|---|---|---|---|---|
| 0.848254 | 0.005731 | 0.059346 | 39.102025 | 6.253161 | 4.901952 | 4.055229 | 0 |
| 0.820670 | 0.006773 | 0.066122 | 46.209889 | 6.797786 | 5.461594 | 4.458681 | 0 |
| 0.841751 | 0.005977 | 0.059619 | 40.777837 | 6.385753 | 4.924500 | 3.821284 | 1 |
# simple model similar to Yeh (1998)
# optimizer and activation function changed
def build_baseline_model_modified_optimizer_activation_function(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model
| R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale | dataset |
|---|---|---|---|---|---|---|---|
| 0.816829 | 0.006918 | 0.065828 | 47.199752 | 6.870208 | 5.437363 | 4.505559 | 1 |
| 0.826317 | 0.006560 | 0.062368 | 44.754891 | 6.689910 | 5.151518 | 3.635845 | 2 |
| 0.792628 | 0.007832 | 0.070945 | 53.435914 | 7.309987 | 5.860026 | 4.813152 | 2 |
If we use Adam as the optimizer, we achieve much better results than with stochastic gradient descent (with settings close to the original paper). And we have our first surprise: ReLU performs worse than sigmoid as the activation function in such a shallow neural network. Nevertheless, we still perform worse than XGBoost and the results reported by Yeh (1998).
More DNN experimentation
Let us try some more models. All four models use a batch_size of 128 and 1500 iterations.
def build_model_1(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model
| R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale | dataset |
|---|---|---|---|---|---|---|---|
| 0.849771 | 0.005674 | 0.057560 | 38.711114 | 6.221826 | 4.754404 | 3.278198 | 0 |
| 0.827993 | 0.006496 | 0.063274 | 44.323052 | 6.657556 | 5.226355 | 4.137564 | 1 |
| 0.814577 | 0.007003 | 0.064154 | 47.779991 | 6.912307 | 5.299110 | 4.138828 | 2 |
from keras.layers import BatchNormalization

def build_model_2(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(BatchNormalization())
    model.add(Dense(20, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(20, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model
| R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale | dataset |
|---|---|---|---|---|---|---|---|
| 0.858410 | 0.005348 | 0.052534 | 36.485129 | 6.040292 | 4.339288 | 3.415696 | 0 |
| 0.813830 | 0.007031 | 0.063669 | 47.972606 | 6.926226 | 5.259008 | 4.033573 | 1 |
| 0.843381 | 0.005915 | 0.057839 | 40.357724 | 6.352773 | 4.777441 | 3.771590 | 2 |
def build_model_3(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(30, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(50, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model
| R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale | dataset |
|---|---|---|---|---|---|---|---|
| 0.791940 | 0.007858 | 0.061559 | 53.613003 | 7.322090 | 5.084692 | 3.382265 | 0 |
| 0.813332 | 0.007050 | 0.056767 | 48.100938 | 6.935484 | 4.688899 | 3.234740 | 1 |
| 0.825277 | 0.006599 | 0.061221 | 45.022793 | 6.709903 | 5.056845 | 3.742250 | 2 |
def build_model_4(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='elu'))
    model.add(BatchNormalization())
    model.add(Dense(20, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(40, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(30, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model
| R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale | dataset |
|---|---|---|---|---|---|---|---|
| 0.830969 | 0.006384 | 0.053768 | 43.556202 | 6.599712 | 4.441234 | 2.965220 | 0 |
| 0.864979 | 0.005100 | 0.049786 | 34.792282 | 5.898498 | 4.112321 | 2.718826 | 1 |
| 0.849393 | 0.005688 | 0.057379 | 38.808450 | 6.229643 | 4.739484 | 3.987607 | 2 |
On spending more time on building DNNs
Well, we missed the baseline performance from the publication as well as the performance of XGBoost. Would it make sense to spend a few more hours training and testing various neural network architectures?
We are dealing with natural materials, a very small dataset and, more importantly, an almost undocumented dataset. Would it change anything in application?
Discussion of results
Let us assess the final model results. The best performance is reached with XGBoost on dataset 1 (pure water-cement ratio):
Besides the averaged error metrics, we can have a look at the two error types directly:
The absolute errors are more important for understanding the consequences of our model. If we take the median absolute error of ~2 MPa, the unknowns in the dataset, and the nature of concrete mixing and the construction process into consideration, then we can say that this is sufficient.
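As a sketch of where such numbers come from, the absolute test set errors of the best model can be summarized on the true scale (again assuming the `results`/`datasets` bookkeeping from above):
# sketch: absolute test set errors of the best model in MPa (true scale)
best = results[1]                                           # assumed: XGBoost on dataset 1
scale = datasets[best['dataset']]['scaler_array'][:, -1]
abs_errors = np.abs(best['Predictions'] * scale - datasets[best['dataset']]['y_test'] * scale)
print(pd.Series(abs_errors.ravel()).describe())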
Applying the best model on the complete data set
For the bean counters in data science and machine learning: no model is trained on this data. It is a final evaluation aimed at getting a deeper physical understanding of the model.
Another helpful tool with such small datasets that involve some variability is to have a look at the whole dataset. It simply helps to understand whether there are some extreme outliers in the training data or not.
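A sketch of such a check: predict on all 1030 (scaled) samples without any re-training and compare against the measured strengths (names are the assumptions from above):
# sketch: evaluate the best model on the complete dataset (no re-training)
X_all = np.vstack([datasets[1]['X_train'], datasets[1]['X_test']])
y_all = np.concatenate([datasets[1]['y_train'], datasets[1]['y_test']])
scale = datasets[1]['scaler_array'][:, -1]

prediction_all = results[1]['model'].predict(X_all) * scale
measured_all = y_all * scale

plt.scatter(measured_all, prediction_all, alpha=0.3)
plt.plot([0, measured_all.max()], [0, measured_all.max()], color='black')
plt.xlabel('measured compressive strength [MPa]')
plt.ylabel('predicted compressive strength [MPa]')
plt.show()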
What is still unclear (and some of this is mentioned in the original publication as well):
- What exactly is defined as coarse and fine aggregate? Yes, there are boundaries: all aggregates that pass a 4.25 mm sieve are considered fine. However, to derive any meaningful mechanical properties from aggregates we should know the d/D value, which describes the mesh diameter of the fine (d) and coarse (D) sieve.
- What kind of aggregates were used?
- What was the ambient temperature during hardening (by the way, there are different standard values for room temperature)? Was the temperature constant across all samples as well as within single samples?
- What type of cement was used? We have to remember that the concept of water-cement ratios is 100 years old.
- What kind of superplasticizer was used during lab testing? Was it always from the same batch? Superplasticizers act differently depending on their chemistry.
- What kind of fly ash was used? This is important because the dataset is about high performance concrete (HPC). The selection of fly ash can depend on requirements.
- What is the porosity of the concrete?
Therefore, it is almost impossible to extract more useful information from this dataset.
Nevertheless, I enjoyed exploring it!
References
[1] Yeh, I.-C. (1998): Modeling of strength of high performance concrete using artificial neural networks. In: Cement and Concrete Research 28 (12), 1797-1808. doi:10.1016/S0008-8846(98)00165-3
[2] Dua, D.; Taniskidou, K.E. (2018). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Acknowledgements
I would like to thank I-Cheng Yeh for making the dataset available.