Earthquake Surrogate#


Import Libraries#

import os
import torch
import torch.nn as nn

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

import xgboost as xgb
csv_file = "Formatted_ETAS_Output.csv"
df = pd.read_csv(csv_file, sep=',', lineterminator='\n')
summary_stats = df.describe(include="all")
print(summary_stats)
           Date        Time          Year            X             Y  \
count     32001       32001  32001.000000  32001.00000  32001.000000   
unique    19127       26803           NaN          NaN           NaN   
top     4/26/64  0:31:11.00           NaN          NaN           NaN   
freq         10           5           NaN          NaN           NaN   
mean        NaN         NaN   1991.924790   -117.54676     34.810868   
std         NaN         NaN     18.407271      2.11036      2.388448   
min         NaN         NaN   1960.002196   -123.48700     29.080400   
25%         NaN         NaN   1975.851679   -118.73330     33.343000   
50%         NaN         NaN   1992.311728   -117.20600     34.587100   
75%         NaN         NaN   2007.784510   -116.10520     36.820000   
max         NaN         NaN   2023.487118   -113.24630     39.475100   

           Magnitude           Z\r  
count   32001.000000  32001.000000  
unique           NaN           NaN  
top              NaN           NaN  
freq             NaN           NaN  
mean        3.452693      9.615045  
std         0.460765      6.170866  
min         3.000000      0.000000  
25%         3.130000      4.236500  
50%         3.310000      9.076000  
75%         3.630000     15.139700  
max         7.650000     20.000000  
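
Note that the last column is read in as "Z\r": the file has Windows line endings, and passing lineterminator='\n' to read_csv leaves a stray carriage return on the final column name (this is also what triggers the "Glyph 13" font warnings on some plots below). A minimal cleanup sketch, not part of the original notebook:

# Strip stray whitespace/carriage returns from the column names (hypothetical cleanup step)
df.columns = df.columns.str.strip()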

Data Table Transformation#

The variables were transformed into forms suitable for the machine learning models. For example, the date was decomposed into separate day, month, and year fields, and the quarter of the year in which each earthquake occurred was derived from it. Feature encoding, specifically ordinal encoding, was also performed.
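
The encoding step itself is not shown in the cells below; as a minimal sketch of ordinal encoding with scikit-learn (the column and category names here are purely illustrative, not from this dataset):

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Hypothetical example: map an ordered categorical column to integer codes
example = pd.DataFrame({"Severity": ["low", "medium", "high", "medium"]})
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
example["Severity_code"] = encoder.fit_transform(example[["Severity"]]).ravel()
print(example)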

new_df = df.copy()

# Convert "Date" column to datetime in the new DataFrame
new_df["Date"] = pd.to_datetime(new_df["Date"], errors="coerce", format="%m/%d/%y")
# Two-digit years can be parsed into the wrong century (e.g. '64' -> 2064); shift any future dates back 100 years
new_df.loc[new_df["Date"].dt.year > pd.Timestamp.now().year, "Date"] -= pd.DateOffset(years=100)

# Extract components from the "Date" column in the new DataFrame
new_df["Day"] = new_df["Date"].dt.day
new_df["Month"] = new_df["Date"].dt.month
new_df["Year"] = new_df["Date"].dt.year
new_df["DayOfWeek"] = new_df["Date"].dt.dayofweek
new_df["Quarter"] = new_df["Date"].dt.quarter

# Convert "Time" column to datetime in the new DataFrame
new_df["Time"] = pd.to_datetime(new_df["Time"], format="%H:%M:%S.%f")

# Extract time components in the new DataFrame
new_df["Hour"] = new_df["Time"].dt.hour
new_df["Minute"] = new_df["Time"].dt.minute
new_df["Second"] = new_df["Time"].dt.second
new_df["Millisecond"] = new_df["Time"].dt.microsecond // 1000

# Drop the original "Time" column from the new DataFrame
new_df = new_df.drop("Time", axis=1)

# Drop the "Date" column from the new DataFrame
new_df = new_df.drop("Date", axis=1)
new_df = new_df.drop("Year", axis=1)

new_df.head()
X Y Magnitude Z\r Day Month DayOfWeek Quarter Hour Minute Second Millisecond
0 -119.0502 33.9790 6.50 8.2474 31 12 3 4 0 3 9 0
1 -115.6222 33.0793 4.25 7.9322 2 1 5 1 0 8 49 0
2 -115.6323 33.1220 3.03 8.4015 2 1 5 1 0 10 31 0
3 -115.5851 33.0745 3.03 7.9678 2 1 5 1 0 10 32 0
4 -115.6256 33.0290 3.08 7.9737 2 1 5 1 0 11 7 0

Model Training And Preprocessing#

Once the data was transformed and made suitable for our models, we split it into a 70% training set and a 30% testing set.

For certain models, relative influence could be calculated to see which predictor variables affected that model the most. These calculations and visualizations are shown for the XGBoost, Linear Regression, and Random Forest models.

X = new_df.drop("Magnitude", axis=1)  # Features
y = new_df["Magnitude"]  # Target variable

# Create a StandardScaler object
scaler = StandardScaler()

# Scale the features
X_scaled = scaler.fit_transform(X)

# Split the scaled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Print the data to verify the changes
print("Scaled X_train:")
print(X_train)

print("y_train:")
print(y_train)

print("Scaled X_test:")
print(X_test)

print("y_test:")
print(y_test)
Scaled X_train:
[[ 0.57704818 -1.19823102  1.49530055 ... -1.17876566 -0.83204313
   0.        ]
 [-0.30276807  1.3833182  -0.79177252 ...  0.78103568  1.0807553
   0.        ]
 [ 0.01798791  0.38587671 -1.28610312 ... -1.00584201 -0.89000671
   0.        ]
 ...
 [-0.40360552 -0.5749714   1.65289841 ... -0.89055958  1.54446401
   0.        ]
 [ 0.72015396  0.9207512   1.45023324 ... -0.37178863  0.84890095
   0.        ]
 [-1.95810392  1.04719511  1.58157829 ...  1.70329514  1.02279171
   0.        ]]
y_train:
31932    3.21
31187    4.40
14148    4.23
23611    3.07
23500    3.51
         ... 
29802    3.24
5390     3.05
860      3.14
15795    3.29
23654    5.13
Name: Magnitude, Length: 22400, dtype: float64
Scaled X_test:
[[ 1.04958158 -1.03921315  1.44085029 ...  0.95395933  1.19668248
   0.        ]
 [-0.04205018  0.70303321  0.13277997 ... -1.3516893  -0.426298
   0.        ]
 [-0.2884101   1.43117429 -1.3567102  ... -0.48707107  1.25464607
   0.        ]
 ...
 [-1.14211665  0.05515471  0.55235489 ... -1.29404809 -1.64353337
   0.        ]
 [-0.01992091  0.33081918 -0.01669897 ...  1.5880127  -1.00593389
   0.        ]
 [-0.12080575  1.40986305 -0.00563065 ...  0.43518839  1.0807553
   0.        ]]
y_test:
23100    3.02
17362    3.26
8993     3.26
19566    3.00
3798     3.79
         ... 
1943     3.26
29579    3.10
5955     3.92
18112    3.00
13274    3.60
Name: Magnitude, Length: 9601, dtype: float64
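
Note that the scaler above is fit on the full feature matrix before the split. A common variant (a sketch, not the notebook's code) splits first and fits the scaler on the training rows only, so the test set cannot influence the scaling:

# Hypothetical variant: fit the scaler on the training split only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)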

XGB Regressor#

A gradient-boosting algorithm that builds an ensemble of decision trees, well suited to regression and classification tasks.
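
As a rough illustration of the boosting idea (not XGBoost's actual implementation, which adds regularization and second-order gradient information), each round fits a shallow tree to the current residuals and adds a damped copy of it to the ensemble:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predict(X_tr, y_tr, X_te, n_rounds=50, lr=0.1):
    # Start from the mean prediction, then repeatedly fit a shallow tree to the residuals
    y_tr = np.asarray(y_tr, dtype=float)
    pred_tr = np.full(len(y_tr), y_tr.mean())
    pred_te = np.full(len(X_te), y_tr.mean())
    for _ in range(n_rounds):
        residuals = y_tr - pred_tr
        tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, residuals)
        pred_tr += lr * tree.predict(X_tr)
        pred_te += lr * tree.predict(X_te)
    return pred_te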

xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)
xgb_model.fit(X_train, y_train)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=42, ...)
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

val_score_xgb = xgb_model.score(X_test, y_test)
print("XGB Regressor R^2:", val_score_xgb)

predicted = xgb_model.predict(X_test)
MAE_xgb = mean_absolute_error(y_test, predicted)
print("XGB Regressor MAE:", MAE_xgb)

mean_squared_score_xgb = mean_squared_error(y_test, predicted)
print("XGB Regressor MSE:", mean_squared_score_xgb)
XGB Regressor R^2: -0.07775879866078106
XGB Regressor MAE: 0.35240284276075656
XGB Regressor MSE: 0.23095209796608107

XGB Regressor - Actual vs. Predicted#

sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=predicted, color='#1f77b4', alpha=0.5, edgecolor='none')
sns.lineplot(x=y_test, y=y_test, color='red', linestyle='dashed')

plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('XGB Regressor - Actual vs. Predicted')
plt.grid(color='gray', linestyle='--', linewidth=0.5)

sns.despine()
plt.gca().patch.set_facecolor('none')
plt.gca().set_frame_on(True)

plt.grid(True)
plt.show()
_images/fc90a438a74a4e18ced91c69337ca80e7d43468ec1b85ecb2cfcc86251bf4051.png

Feature Importance In XGB Regressor#

# Convert X_test to a DataFrame
X_test_df = pd.DataFrame(X_test, columns=X.columns)

# Get the feature importances
feature_importance = xgb_model.feature_importances_
feature_names = X_test_df.columns

# Create a DataFrame to store the feature importances
feature_importance_df_1 = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})

# Sort the features by importance in descending order
feature_importance_df_1 = feature_importance_df_1.sort_values(by='Importance', ascending=False)

print("Relative Influence of Features:")
print(feature_importance_df_1)
Relative Influence of Features:
        Feature  Importance
2           Z\r    0.126057
1             Y    0.123776
7          Hour    0.112574
8        Minute    0.111738
4         Month    0.110232
0             X    0.108772
3           Day    0.107846
9        Second    0.103845
5     DayOfWeek    0.095160
6       Quarter    0.000000
10  Millisecond    0.000000
# Plot the feature importances in a bar plot
sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance_df_1, x='Feature', y='Importance', color='#17becf',  alpha=0.7, edgecolor='black')

plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.grid(color='gray', linestyle='--', linewidth=0.5)

sns.despine()
plt.gca().patch.set_facecolor('none')
plt.gca().set_frame_on(True)


plt.grid(axis='x')
plt.show()
C:\Users\Vishal\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\pylabtools.py:151: UserWarning: Glyph 13 (\r) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
_images/640caa02eb2259f3d1510e42cd7992fe6ec51b9334997628b47ace76ca1c8ed8.png

Support Vector Regression#

A model for regression tasks that finds a function approximating the relationship between the input features and the target variable while tolerating errors within a specified margin.
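
The epsilon parameter (0.2 below) sets the width of the tube inside which prediction errors are ignored; a minimal sketch of the epsilon-insensitive loss that SVR minimizes (together with a penalty on the weights):

import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2):
    # Residuals inside the epsilon tube cost nothing; only the excess is penalized
    return np.maximum(np.abs(np.asarray(y_true) - np.asarray(y_pred)) - epsilon, 0.0)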

from sklearn.svm import SVR

regr = SVR(C=1.0, epsilon=0.2)

regr.fit(X_train, y_train)
SVR(epsilon=0.2)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

r2_svr = regr.score(X_test, y_test)
print("SVR R^2:", r2_svr)

predicted = regr.predict(X_test)

MAE_svr = mean_absolute_error(y_test, predicted)
print("SVR MAE:", MAE_svr)
mean_squared_score_svr = mean_squared_error(y_test, predicted)
print("SVR MSE:", mean_squared_score_svr)
SVR R^2: -0.04471690218943092
SVR MAE: 0.3235623886053753
SVR MSE: 0.22387157557060747

SVR - Actual vs. Predicted#

sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})
plt.figure(figsize=(10, 6))

sns.scatterplot(x=y_test, y=predicted, color='#1f77b4', alpha=0.5, edgecolor='none')
sns.lineplot(x=np.linspace(min(y_test), max(y_test)), y=np.linspace(min(y_test), max(y_test)), color='red', linestyle='dashed')

plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('SVR - Actual vs. Predicted')
plt.grid(color='gray', linestyle='--', linewidth=0.5)

sns.despine()
plt.gca().patch.set_facecolor('none')
plt.gca().set_frame_on(True)

plt.grid(True)
plt.show()
_images/37ff55894cf35d27bdb6b4219e17bebeaebfb654816c7dc14972bd067d559c42.png

MLP Regressor#

The multi-layer perceptron (MLP) regressor is an artificial neural network for regression tasks. It approximates complex non-linear relationships, mapping feature values to the corresponding output values.
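
Conceptually, each hidden layer applies a learned linear map followed by a non-linearity; a sketch of the forward pass for a single ReLU hidden layer (the same shape as the network fit below):

import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # Hidden layer: linear map + ReLU; output layer: linear map to a single value
    hidden = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ hidden + b2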

from sklearn.neural_network import MLPRegressor

smlp = MLPRegressor(hidden_layer_sizes=1000, activation='relu', solver='adam', batch_size=32, learning_rate='invscaling', max_iter=10000, alpha=0.0001).fit(X_train, y_train)
C:\Users\Vishal\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:698: UserWarning: Training interrupted by user.
  warnings.warn("Training interrupted by user.")
from sklearn.metrics import mean_squared_error

R2_mlp = smlp.score(X_test, y_test)
print("MLP Regressor R^2:", R2_mlp)

predicted = smlp.predict(X_test)

MAE_mlp = mean_absolute_error(y_test, predicted)
print("MLP Regressor MAE:", MAE_mlp)
mean_squared_score_mlp = mean_squared_error(y_test, predicted)
print("MLP Regressor MSE:", mean_squared_score_mlp)
MLP Regressor R^2: -0.06420248395389594
MLP Regressor MAE: 0.33416011477004043
MLP Regressor MSE: 0.22804712579036424

MLP Regressor - Actual vs. Predicted#

sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})
plt.figure(figsize=(10, 6))

sns.scatterplot(x=y_test, y=predicted, color='#1f77b4', alpha=0.5, edgecolor='none')
sns.lineplot(x=np.linspace(min(y_test), max(y_test)), y=np.linspace(min(y_test), max(y_test)), color='red', linestyle='dashed')

plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('MLP Regressor - Actual vs. Predicted')

plt.grid(color='gray', linestyle='--', linewidth=0.5)
sns.despine()
plt.gca().patch.set_facecolor('none')
plt.gca().set_frame_on(True)

plt.grid(True)
plt.show()
_images/fa320ede883f5008c20ea7eb3becd35a80291aedfbcc5d5c0bc458482474042b.png

Linear Regression#

This is a simple model that minimizes the sum of squared errors; the fitted model predicts along a line (more generally, a hyperplane) through the data points.
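
In other words, it solves the least-squares problem of minimizing sum_i (y_i - x_i . beta)^2 over the coefficients beta; a minimal sketch using NumPy's least-squares solver:

import numpy as np

def ols_fit(X, y):
    # Prepend an intercept column, then solve the least-squares problem for the coefficients
    Xb = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(Xb, np.asarray(y, dtype=float), rcond=None)
    return beta  # beta[0] is the intercept, beta[1:] are the feature coefficients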

from sklearn.linear_model import LinearRegression

clf_1 = LinearRegression().fit(X_train, y_train)
from sklearn.metrics import mean_squared_error

val_acc_ln = clf_1.score(X_test, y_test)

print("Linear Regression R^2:", val_acc_ln)


predicted = clf_1.predict(X_test)
MAE_ln = mean_absolute_error(y_test, predicted)
print("Linear Regression MAE:", MAE_ln)
mean_squared_score_ln = mean_squared_error(y_test, predicted)
print("Linear Regression MSE", mean_squared_score_ln)
Linear Regression R^2: -0.0007016649741236058
Linear Regression MAE: 0.3404332764766683
Linear Regression MSE 0.2144395844887611

Linear Regression - Actual vs. Predicted#

sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})
plt.figure(figsize=(10, 6))

# Scatter plot for the actual vs. predicted values
sns.scatterplot(x=y_test, y=predicted, color='#1f77b4', alpha=0.5, edgecolor='none')
sns.lineplot(x=np.linspace(min(y_test), max(y_test)), y=np.linspace(min(y_test), max(y_test)), color='red', linestyle='dashed')

plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Linear Regression - Actual vs. Predicted')
plt.grid(color='gray', linestyle='--', linewidth=0.5)

sns.despine()
plt.gca().patch.set_facecolor('none')
plt.gca().set_frame_on(True)

plt.grid(True)
plt.show()
_images/207dc6b888931c4c03b78d861f8e0c0ac5e6e0b3e796568be4c1006ee64177e7.png

Feature Importance In Linear Regression#

# Get the coefficients (feature importances) from the trained model
coefficients = clf_1.coef_

# Create a DataFrame to store the feature importances
feature_importance_df_2 = pd.DataFrame({'Feature': X_test_df.columns, 'Importance': coefficients})

# Sort the features by importance in descending order
feature_importance_df_2 = feature_importance_df_2.sort_values(by='Importance', ascending=False)

print("Relative Influence of Features:")
print(feature_importance_df_2)
Relative Influence of Features:
        Feature  Importance
4         Month    0.024669
2           Z\r    0.007264
9        Second    0.003136
3           Day    0.000082
10  Millisecond    0.000000
0             X   -0.001522
1             Y   -0.001735
8        Minute   -0.004456
5     DayOfWeek   -0.004879
6       Quarter   -0.006226
7          Hour   -0.014279
sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})
plt.figure(figsize=(10, 6))

sns.barplot(data=feature_importance_df_2, x='Feature', y='Importance', color='#17becf', alpha=0.7, edgecolor='black')

plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances in Linear Regression Model')
plt.tight_layout()


plt.grid(color='gray', linestyle='--', linewidth=0.5)
sns.despine()

plt.gca().patch.set_facecolor('none')
plt.gca().set_frame_on(True)

plt.grid(axis='x')
plt.show()
C:\Users\Vishal\AppData\Local\Temp\ipykernel_19916\816986420.py:10: UserWarning: Glyph 13 (\r) missing from current font.
  plt.tight_layout()
C:\Users\Vishal\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\pylabtools.py:151: UserWarning: Glyph 13 (\r) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
_images/edb353dc0e3aef467aa8ba068bed67de633be81a0a1da41087b0cde33186d359.png

Random Forest#

This is a supervised machine learning model that averages the predictions of many decision trees to produce its output.
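
As a sketch of the idea (not scikit-learn's full implementation, which also subsamples features at each split), each tree is trained on a bootstrap sample of the rows and the forest prediction is the mean of the trees:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_predict(X_tr, y_tr, X_te, n_trees=10, max_depth=2, seed=42):
    X_tr, y_tr = np.asarray(X_tr), np.asarray(y_tr, dtype=float)
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y_tr), len(y_tr))   # bootstrap sample of the rows
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X_tr[idx], y_tr[idx])
        preds.append(tree.predict(X_te))
    return np.mean(preds, axis=0)                     # average the trees' predictions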

clf_2 = RandomForestRegressor(max_depth=2, random_state=42).fit(X_train, y_train)
from sklearn.metrics import mean_squared_error

val_acc_rf = clf_2.score(X_test, y_test)

print("Random Forest R^2:", val_acc_rf)


predicted = clf_2.predict(X_test)
MAE_rf = mean_absolute_error(y_test, predicted)
print("Random Forest MAE:", MAE_rf)
mean_squared_score_rf = mean_squared_error(y_test, predicted)
print("Random Forest MSE", mean_squared_score_rf)
Random Forest R^2: -0.0014886206464004292
Random Forest MAE: 0.3405884869516936
Random Forest MSE 0.2146082206100755

Random Forest - Actual vs. Predicted#

sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})
plt.figure(figsize=(10, 6))

sns.scatterplot(x=y_test, y=predicted, color='#1f77b4', alpha=0.5, edgecolor='none')
sns.lineplot(x=np.linspace(min(y_test), max(y_test)), y=np.linspace(min(y_test), max(y_test)), color='red', linestyle='dashed')

plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Random Forest - Actual vs. Predicted')
plt.grid(color='gray', linestyle='--', linewidth=0.5)

sns.despine()
plt.gca().patch.set_facecolor('none')
plt.gca().set_frame_on(True)

plt.grid(True)
plt.show()
_images/f8cfdeb479f9baa255411457f20c08539fd437fd631491dd4d3e923a0f0567b1.png

Feature Importance In Random Forest#

# Get the feature importances from the trained model
feature_importance = clf_2.feature_importances_
feature_names = X_test_df.columns

# Create a DataFrame to store the feature importances
feature_importance_df_3 = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})

# Sort the features by importance in descending order
feature_importance_df_3 = feature_importance_df_3.sort_values(by='Importance', ascending=False)

print("Relative Influence of Features:")
print(feature_importance_df_3)
Relative Influence of Features:
        Feature  Importance
1             Y    0.421510
8        Minute    0.129558
2           Z\r    0.127120
0             X    0.123066
9        Second    0.070646
5     DayOfWeek    0.066045
3           Day    0.034490
4         Month    0.013680
7          Hour    0.010584
6       Quarter    0.003302
10  Millisecond    0.000000
sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})
plt.figure(figsize=(10, 6))

sns.barplot(data=feature_importance_df_3, x='Feature', y='Importance', color='#17becf', alpha=0.7, edgecolor='black')

plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances in RandomForestRegressor Model')
plt.tight_layout()
plt.grid(color='gray', linestyle='--', linewidth=0.5)

sns.despine()
plt.gca().patch.set_facecolor('none')
plt.gca().set_frame_on(True)

plt.grid(axis='x')
plt.show()
C:\Users\Vishal\AppData\Local\Temp\ipykernel_19916\139572493.py:10: UserWarning: Glyph 13 (\r) missing from current font.
  plt.tight_layout()
C:\Users\Vishal\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\pylabtools.py:151: UserWarning: Glyph 13 (\r) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
_images/ea3d2e3e619a882effd96c14d7e4db337b77b90ca97c7def10f3708ce02adfe2.png
model_names = ['XGBoost', 'SVR', 'MLP Regressor', 'Linear Regression', 'Random Forest']
mae_scores = {'XGBoost': MAE_xgb, 'SVR': MAE_svr, 'MLP Regressor': MAE_mlp,
              'Linear Regression': MAE_ln, 'Random Forest': MAE_rf}
mse_scores = {'XGBoost': mean_squared_score_xgb, 'SVR': mean_squared_score_svr, 'MLP Regressor': mean_squared_score_mlp,
              'Linear Regression': mean_squared_score_ln, 'Random Forest': mean_squared_score_rf}
r2_scores = {'XGBoost': val_score_xgb, 'SVR': r2_svr, 'MLP Regressor': R2_mlp,
             'Linear Regression': val_acc_ln, 'Random Forest': val_acc_rf}

results_df = pd.DataFrame({'Model': model_names, 'MAE': [mae_scores[model] for model in model_names],
                           'MSE': [mse_scores[model] for model in model_names], 'R^2': [r2_scores[model] for model in model_names]})
print(results_df)
               Model       MAE       MSE       R^2
0            XGBoost  0.352403  0.230952 -0.077759
1                SVR  0.323562  0.223872 -0.044717
2      MLP Regressor  0.334160  0.228047 -0.064202
3  Linear Regression  0.340433  0.214440 -0.000702
4      Random Forest  0.340588  0.214608 -0.001489

Plotting Error Values#

# Create a DataFrame from the scores dictionaries
results_df = pd.DataFrame({'Model': model_names, 'MAE': [mae_scores[model] for model in model_names],
                           'MSE': [mse_scores[model] for model in model_names], 'R^2': [r2_scores[model] for model in model_names]})
sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})

# Plot MAE scores
plt.figure(figsize=(10, 6))
sns.barplot(data=results_df, x='Model', y='MAE', color='#2ca02c',alpha=0.7, edgecolor='black')
plt.xlabel('Model')
plt.ylabel('Mean Absolute Error (MAE)')
plt.title('Mean Absolute Error (MAE) for Different Models')

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
sns.despine()

plt.grid(axis='y')
plt.show()
_images/8bf63facce3aac0b1668d85176b85dc62e1e01fb177e90c8136ae33cf3ddd350.png
sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})

# Plot MSE scores
plt.figure(figsize=(10, 6))
sns.barplot(data=results_df, x='Model', y='MSE', color='#9467bd',alpha=0.7, edgecolor='black')
plt.xlabel('Model')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('Mean Squared Error (MSE) for Different Models')

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
sns.despine()

plt.grid(axis='y')
plt.show()
_images/4ad19d8b69800aa694efb34f25d589ff87d5c9441fdc101894d1a4f4fe8dd99a.png
sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})

# Plot R^2 scores
plt.figure(figsize=(10, 6))
sns.barplot(data=results_df, x='Model', y='R^2', color='#ff7f0e',alpha=0.7, edgecolor='black')
plt.xlabel('Model')
plt.ylabel('R^2 Score')
plt.title('R^2 Score for Different Models')

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
sns.despine()

plt.grid(axis='y')
plt.show()
_images/15dcdda0b30a4668f92247ed4bc682c0e578d8f5780cf1a45689e9492b4489dc.png
feature_importance_df_1.rename(columns={'Importance': 'Importance_XGBoost'}, inplace=True)
feature_importance_df_2.rename(columns={'Importance': 'Importance_LinearRegression'}, inplace=True)
feature_importance_df_3.rename(columns={'Importance': 'Importance_RandomForest'}, inplace=True)

merged_df = pd.merge(feature_importance_df_1, feature_importance_df_2, on='Feature', how='outer')
merged_df = pd.merge(merged_df, feature_importance_df_3, on='Feature', how='outer')

print(merged_df)
        Feature  Importance_XGBoost  Importance_LinearRegression  \
0           Z\r            0.126057                     0.007264   
1             Y            0.123776                    -0.001735   
2          Hour            0.112574                    -0.014279   
3        Minute            0.111738                    -0.004456   
4         Month            0.110232                     0.024669   
5             X            0.108772                    -0.001522   
6           Day            0.107846                     0.000082   
7        Second            0.103845                     0.003136   
8     DayOfWeek            0.095160                    -0.004879   
9       Quarter            0.000000                    -0.006226   
10  Millisecond            0.000000                     0.000000   

    Importance_RandomForest  
0                  0.127120  
1                  0.421510  
2                  0.010584  
3                  0.129558  
4                  0.013680  
5                  0.123066  
6                  0.034490  
7                  0.070646  
8                  0.066045  
9                  0.003302  
10                 0.000000  

Comparison of Models#

# Set a custom color palette for the plot
custom_palette = sns.color_palette("muted")  # Using Seaborn's muted color palette
sns.set_context("notebook", rc={"font.size": 14, "axes.labelsize": 16, "axes.titlesize": 18})

plt.figure(figsize=(12, 6))

# Bar plot for feature importances using Seaborn
sns.barplot(data=merged_df, x='Feature', y='Importance_XGBoost', color=custom_palette[0], alpha=0.7, label='XGBoost', edgecolor='black')
sns.barplot(data=merged_df, x='Feature', y='Importance_LinearRegression', color=custom_palette[1], alpha=0.7, label='Linear Regression', edgecolor='black')
sns.barplot(data=merged_df, x='Feature', y='Importance_RandomForest', color=custom_palette[2], alpha=0.7, label='Random Forest', edgecolor='black')

plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances for XGBoost, Linear Regression, and RandomForest')

plt.legend()
plt.tight_layout()
plt.grid(axis='y',color='gray', linestyle='--', linewidth=0.5)

sns.despine()
plt.gca().set_frame_on(True)
plt.gca().patch.set_facecolor('none')

plt.show()
C:\Users\Vishal\AppData\Local\Temp\ipykernel_19916\1467680380.py:18: UserWarning: Glyph 13 (\r) missing from current font.
  plt.tight_layout()
C:\Users\Vishal\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\pylabtools.py:151: UserWarning: Glyph 13 (\r) missing from current font.
  fig.canvas.print_figure(bytes_io, **kw)
_images/cbaea75738636020d442ba468c2e8c757a58ab8ccaabf440ce4a0736e8c2ebca.png

Results#

Earthquakes are, on the whole, a fairly random and unpredictable phenomenon. We found that these deterministic models were not suitable for predicting earthquake magnitudes, as shown by the near-zero or negative R^2 values: every model performed about as well as, or worse than, simply predicting the mean magnitude. Although relative influence was an interesting factor to consider, the earthquake data does not fit these models. We hope to find success in the near future by exploring generative models for earthquake prediction.
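
For reference, R^2 compares a model's squared error to that of always predicting the mean target value; a score at or below zero, as seen above, means the model does no better than that baseline. A minimal sketch of the computation:

import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot; 0 matches the mean-only baseline, negative is worse
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot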