Prediction of Concrete Compressive Strength According to Components with Machine Learning

Halis Manaz
7 min read · Oct 11, 2021


Introduction

Concrete is the most widely used material in civil engineering, which is why so much research and experimentation is devoted to it. This project tries to understand how compressive strength depends on the materials that make up the concrete. Concrete has many properties, such as shear strength and tensile strength; compressive strength is one of the most important, and like the other properties it changes with the mix of materials. Determining and understanding that change normally requires many experiments, which means a lot of time, money, and effort. There is another way: without further experimentation, machine learning models built on an existing experimental dataset can show what the compressive strength will be and how it is affected by each component.

The experimental dataset has 1,030 instances and no missing data. It has eight input variables and one output variable. The inputs are, in order: cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age. The first seven input variables are measured in kg per cubic meter; the last one, age, is measured in days. The output variable is compressive strength, measured in megapascals (MPa).

You can access the dataset here.

Data Preprocessing and Correlation Matrix

Importing Libraries and Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load the dataset and separate the inputs (X) from the output (y)
dataset = pd.read_excel('Concrete_Data.xls')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
dataset
Dataset
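Before going further, the claims above (1,030 instances, no missing data) can be verified with a quick check. A minimal sketch:

# Quick sanity check on the loaded data
print(dataset.shape)            # expected: (1030, 9)
print(dataset.isnull().sum())   # expected: 0 missing values per column
dataset.describe()              # summary statistics for each component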

Splitting the Dataset and Creating the Correlation Matrix

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

We need to split the X and y variables into X train, X test, y train, and y test. The X train and y train values are used to build a machine learning model; after building it, we make predictions with the X test values and compare them with the y test values to see how close the model's predictions are to the real values. In this model, the test set is chosen as 10% of the data.

corr = dataset.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix of Concrete Compenents")
plt.show()
Correlation Matrix

The correlation matrix has been created. However, it looks confusing and messy. To make it simpler and clearer:

# Simple version of the correlation matrix: mask the upper triangle
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
plt.title("Correlation Matrix of Concrete Components")
components = ["Cement", "Blast Furnace Slag", "Fly Ash", "Water", "Superplasticizer", "Coarse Aggregate", "Fine Aggregate", "Age", "Compressive Strength"]
sns.heatmap(corr, annot=True, cmap="coolwarm", linewidths=0.75, mask=mask, xticklabels=components, yticklabels=components)
plt.gcf().set_size_inches(12, 8)
plt.show()
Correlation Matrix (Simple Version)

According to the correlation matrix, the three highest correlations are:

# The highest three correlations
correlations = corr.unstack().sort_values(ascending = False)
correlations[correlations != 1].sort_values(ascending = False)[1:7:2]
The highest three correlations

For the highest correlated components, a 4-dimensional graph was created to better understand the relationship between them. In this graph, the x, y, z, and color axes represent cement, superplasticizer, fly ash, and compressive strength, respectively. The graph was drawn with the plotly library.

px.scatter_3d(dataset, x=dataset.columns[0], y=dataset.columns[4], z=dataset.columns[2],
              color=dataset.columns[-1], color_continuous_scale='rainbow',
              labels={dataset.columns[0]: "Cement",
                      dataset.columns[4]: "Superplasticizer",
                      dataset.columns[2]: "Fly Ash",
                      dataset.columns[-1]: "Compressive Strength"})
4D Graph

Also according to the correlation matrix, the three lowest correlations are:

# The lowest three correlations
correlations[correlations != 1].sort_values(ascending = True)[1:7:2]
The lowest three correlations

Interpretation of Correlation Results

This is one of the most critical parts of the project. The correlation matrix is just a mathematical representation of the components' effects on each other; we can't take its results at face value. We have to interpret them, because as engineers we have to decide wisely which results are important and which are not. Let's look at the results and interpret them.

1. Compressive strength has the highest correlation with cement and the lowest correlation with water. This is a very expected result, because every civil engineer knows that one of the main variables in concrete strength is the cement/water ratio (see the sketch after this list).

2. Compressive strength also has a relatively high correlation with age. That is another expected result, because concrete hardens and gains strength over time.

3. Superplasticizer has a high correlation with compressive strength and the lowest correlation with water. Superplasticizers allow the water content to be reduced without any negative effect on the workability of fresh concrete; with less water, both the cement/water ratio and the compressive strength increase.

4. Water has a low correlation with fine aggregate. That was an unexpected result, because when fine aggregate increases, concrete needs much more water.

5. Fly ash has a high correlation with superplasticizer and a low correlation with cement. That was also an unexpected result. There is no direct explanation for this situation; it should be investigated in other studies.
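As a rough check on point 1, the cement/water ratio can be correlated with strength directly. A minimal sketch, selecting columns by position since the Excel headers are long (the column order is assumed to match the list in the introduction):

# Hypothetical check: derive the cement/water ratio and correlate it with
# compressive strength. Columns are picked by position (0 = cement,
# 3 = water, -1 = compressive strength), following the order given above.
cw_ratio = dataset.iloc[:, 0] / dataset.iloc[:, 3]
strength = dataset.iloc[:, -1]
print(cw_ratio.corr(strength))  # expected to be clearly positive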

Building Machine Learning Model

In this project, seven different machine learning models are used to predict compressive strength. The best model is decided according to each model's mean squared error, mean absolute error, and R2 score.

Before building the machine learning models, a dataframe is created to compare them.

# Create dataframe for comparing models
models_comparison = pd.DataFrame(columns=["Model Name", "Mean Squared Error", "Mean Absolute Error", "R2 Score"])
models_comparison

Simple Linear Regression

Simple linear regression is the most basic model, which is why its mean squared error is high.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Create model and train
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)

# Make predictions
y_pred = linear_regressor.predict(X_test)

# 10-fold cross-validation scores on the training set
accuracies = cross_val_score(linear_regressor, X_train, y_train, cv=10)

# Record the test-set metrics in the comparison table
metrics = {"Model Name": "Linear Regression",
           "Mean Squared Error": mean_squared_error(y_test, y_pred),
           "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
           "R2 Score": r2_score(y_test, y_pred)}
models_comparison = models_comparison.append(metrics, ignore_index=True)
models_comparison

Support Vector Regression

Before training the support vector regression model, the X and y variables should be scaled, because all kernels in support vector regression are based on distance.

from sklearn.preprocessing import StandardScaler

# Scale X and y separately so y can be inverse-transformed later
sc_X = StandardScaler()
X = sc_X.fit_transform(X_train)
sc_y = StandardScaler()
y = sc_y.fit_transform(np.reshape(y_train, (-1, 1)))

Train the model and make predictions. Note that the kernel is chosen as RBF:

from sklearn.svm import SVR

regressor = SVR(kernel = 'rbf')
regressor.fit(X, y.ravel())  # ravel() flattens y to the 1D shape SVR expects

# Predict on the scaled test set, then map predictions back to MPa
y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform(X_test)).reshape(-1, 1)).ravel()

metrics = {"Model Name": "Support Vector Regression",
           "Mean Squared Error": mean_squared_error(y_test, y_pred),
           "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
           "R2 Score": r2_score(y_test, y_pred)}
models_comparison = models_comparison.append(metrics, ignore_index=True)
models_comparison

As can be seen in the table, there is a huge improvement in all metrics.

Decision Tree Regression

from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor(random_state= 0)
regressor.fit(X_train, y_train)

After training, let's look at the results.
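The evaluation step is not shown in the original code, only the resulting table; a minimal sketch mirroring the pattern used for the earlier models:

# Evaluate the decision tree on the test set and append its metrics,
# following the same pattern as the earlier models
y_pred = regressor.predict(X_test)
metrics = {"Model Name": "Decision Tree Regression",
           "Mean Squared Error": mean_squared_error(y_test, y_pred),
           "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
           "R2 Score": r2_score(y_test, y_pred)}
models_comparison = models_comparison.append(metrics, ignore_index=True)
models_comparison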

Decision Tree Regression Results

Decision tree regression gives results similar to support vector regression. To better understand how decision tree regression works, a schematic of the fitted tree can be seen below.

from sklearn import tree

# Plot the fitted tree (class_names does not apply to a regressor)
fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(regressor, feature_names=dataset.columns[:-1], filled=True)
Schematic Representation of Decision Tree Regression

Random Forest Regression

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators= 10, random_state= 0)
regressor.fit(X_train, y_train)

After training, let's look at the results.

Random Forest Regression Results

As can be seen in the table, there is a huge improvement. So far, random forest regression is the best model apart from the boosting models.

XGBoost, LightGBM, CatBoost

In order not to make the article too long, the results of these three models are shown at the same time, and only the winning model's code is walked through; a sketch of training all three appears below.
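A minimal sketch of training all three boosting models in one loop and appending their metrics, assuming the xgboost, lightgbm, and catboost packages are installed (the exact parameters of the original run are not shown, so defaults are assumed):

import xgboost
import lightgbm
import catboost

# Train each boosting model with default parameters and record its metrics
boost_models = {
    "XGBoost Regression": xgboost.XGBRegressor(random_state=0),
    "LightGBM Regression": lightgbm.LGBMRegressor(random_state=0),
    "CatBoost Regression": catboost.CatBoostRegressor(random_state=0, verbose=0),
}
for name, model in boost_models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = {"Model Name": name,
               "Mean Squared Error": mean_squared_error(y_test, y_pred),
               "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
               "R2 Score": r2_score(y_test, y_pred)}
    models_comparison = models_comparison.append(metrics, ignore_index=True)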

models_comparison = models_comparison.round(3)
models_comparison.sort_values(by = ['Mean Squared Error'])
# Winner is LightGBM Regression
Results of All Models

For the LightGBM regression model:

import lightgbm

regressor = lightgbm.LGBMRegressor()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Collect experiment results and predictions side by side
results = pd.DataFrame(columns=["Experiment Results", "Predictions"])
results["Experiment Results"] = y_test
results["Predictions"] = y_pred

To better understand how close the predictions are to the real results, a bar chart was created for 20 random results. Only 20 were used because the graph looks very cluttered when all results are plotted.

import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot

# Pick 20 random samples to keep the chart readable
results_rnd20 = results.sample(n=20)

data1 = go.Bar(x=[i for i in range(1, len(results_rnd20)+1)],
               y=results_rnd20["Experiment Results"], name='Experiment Result')
data2 = go.Bar(x=[i for i in range(1, len(results_rnd20)+1)],
               y=results_rnd20["Predictions"], name='Prediction')
data = [data1, data2]

layout = go.Layout(barmode='group', legend={'traceorder': 'normal'},
                   title='Experiment Results vs Prediction Results (Random 20 Samples)',
                   title_x=0.5, xaxis_tickfont_size=14,
                   yaxis=dict(title='Compressive Strength (MPa)',
                              titlefont_size=16, tickfont_size=14))

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='Experiment Results vs Prediction Results')

Results

As can be seen in the bar chart, the predictions are close enough to the experiment results. That means the LightGBM regression model is successful at predicting the compressive strength of concrete from its components. Machine learning is also used to predict accidents at work and the costs of projects. However, it should not be forgotten that machine learning is not magic or a fortune teller; it does not always work. It is just a model. That is why models should be created according to the aim and the outputs of the results, and we should also discuss which results are important and which are unexpected.


Halis Manaz

Hi, I am Halis. I am interested in Python, data analysis, and machine learning. I share what I have learned, what I find interesting, and my portfolio projects.