The HTRU2 dataset describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey.
Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter.
As pulsars rotate, their emission beam sweeps across the sky, and when this beam crosses our line of sight, it produces a detectable pattern of broadband radio emission. Because pulsars rotate rapidly, this pattern repeats periodically. Pulsar searches therefore involve looking for periodic radio signals with large radio telescopes.
Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection, known as a ‘candidate’, is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional information, each candidate could potentially describe a real pulsar. In practice, however, almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.
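To illustrate why averaging over many rotations helps, here is a minimal sketch on synthetic data. The Gaussian pulse shape, the noise level, and the names `n_bins` and `n_rotations` are assumptions chosen for illustration; this is not the survey's actual processing pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

n_bins = 64          # phase bins per rotation (assumed)
n_rotations = 1000   # rotations covered by one observation (assumed)

# Hypothetical pulse: a narrow Gaussian bump centred at phase 0.5.
phase = np.arange(n_bins) / n_bins
pulse = np.exp(-0.5 * ((phase - 0.5) / 0.03) ** 2)

# Every rotation sees the same pulse buried in much stronger noise.
rotations = pulse + rng.normal(0.0, 2.0, size=(n_rotations, n_bins))

# Averaging over rotations gives the integrated profile.
profile = rotations.mean(axis=0)

# Compare the noise level in off-pulse bins before and after averaging.
off_pulse = phase < 0.25
noise_single = rotations[0, off_pulse].std()
noise_averaged = profile[off_pulse].std()
print(f"noise in one rotation:   {noise_single:.2f}")
print(f"noise after averaging:   {noise_averaged:.2f}")
print(f"peak bin of the profile: {profile.argmax()}")
```

Averaging N rotations reduces uncorrelated noise by roughly a factor of the square root of N, which is why the pulse emerges in the averaged profile even when it is invisible in any single rotation.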
The dataset contains a total of 17,898 observations, of which 1,639 are positive examples and 16,259 are negative.
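These counts imply a strong class imbalance, which is worth keeping in mind when reading any accuracy figure. A quick check of the arithmetic (the variable names are only illustrative):

```python
# Counts as stated in the dataset description.
n_total = 17898
n_pulsar = 1639
n_non_pulsar = 16259

# The two classes should add up to the total.
assert n_pulsar + n_non_pulsar == n_total

positive_share = n_pulsar / n_total
imbalance_ratio = n_non_pulsar / n_pulsar
print(f"positive share: {positive_share:.1%}")
print(f"negatives per positive: {imbalance_ratio:.1f}")
```

Only about 9% of the candidates are pulsars, so a classifier that always predicts "not pulsar" would already be right about 91% of the time on the full dataset.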
In this project, we implement a Naive Bayes classifier in Python.
The data contains the following columns:
- the mean of the integrated profile;
- the standard deviation of the integrated profile;
- the excess kurtosis of the integrated profile;
- the skewness of the integrated profile;
- the mean of the DM-SNR curve;
- the standard deviation of the DM-SNR curve;
- the excess kurtosis of the DM-SNR curve;
- the skewness of the DM-SNR curve;
- the target class, where the values used are 1 for candidates identified positively as pulsars, and 0 otherwise.
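The four statistics reported for each curve can be sketched with plain NumPy. This is an illustration only: the helper name `summary_stats`, the synthetic curve, and the use of population (not sample) moments are assumptions, since the exact estimators used to build the dataset are not stated here.

```python
import numpy as np

def summary_stats(values):
    """Return (mean, std, excess kurtosis, skewness) of a 1-D curve.

    Uses population moments; this is an assumption, not necessarily
    the convention used to build the dataset.
    """
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    std = x.std()
    z = (x - mean) / std                      # standardised values
    skewness = np.mean(z ** 3)                # third standardised moment
    excess_kurtosis = np.mean(z ** 4) - 3.0   # zero for a normal distribution
    return mean, std, excess_kurtosis, skewness

# Hypothetical integrated profile: a narrow pulse on a flat baseline.
phase = np.arange(64) / 64
demo_profile = np.exp(-0.5 * ((phase - 0.5) / 0.03) ** 2)

mean, std, kurt, skew = summary_stats(demo_profile)
print(f"mean={mean:.3f} std={std:.3f} kurtosis={kurt:.3f} skewness={skew:.3f}")
```

The return order follows the column order listed above: mean, standard deviation, excess kurtosis, skewness.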
Let’s get our environment ready with the libraries we’ll need and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn import metrics
Check out the Data
#importing the dataset
df = pd.read_csv('/Users/sadegh/Desktop/DataSet GitHub/Naive Bayes/pulsar_stars.csv')
df.head()

df.info()

plt.figure(figsize=(12,8))
sns.heatmap(df.describe()[1:].transpose(),
            annot=True, linecolor="w",
            linewidths=2, cmap=sns.color_palette("Set3"))
plt.title("Data summary")
plt.show()

Exploratory Data Analysis
Let’s check out the correlation between variables!
# correlation matrix
cor_mat = df.corr()
# mask the upper triangle so each pair of variables is shown only once
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)
fig = plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat, mask=mask, square=True, annot=True, cbar=True)

Let’s check out the proportion of the target variable in the dataset!
plt.figure(figsize=(12,6))
plt.pie(df["target_class"].value_counts().values,
labels=["not pulsar stars","pulsar stars"],
autopct="%1.0f%%",wedgeprops={"linewidth":2,"edgecolor":"white"})
my_circ = plt.Circle((0,0),.7,color = "white")
plt.gca().add_artist(my_circ)
plt.subplots_adjust(wspace = .2)
plt.title("Proportion of target variable in dataset")
plt.show()

Let’s see the pair plot of the variables!
sns.pairplot(data=df,
palette="hls",
hue="target_class",
vars=["mean_profile",
"std_profile",
"kurtosis_profile",
"skewness_profile",
"mean_dmsnr_curve",
"std_dmsnr_curve",
"kurtosis_dmsnr_curve"])
plt.tight_layout()
plt.show()

Train Test Split
Split the data into a training set (75%) and a testing set (25%).
X = df.drop('target_class',axis=1)
y = df['target_class']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, random_state = 0)
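Because only about 9% of the candidates are pulsars, one option worth considering is a stratified split, which keeps the class proportions the same in both sets. The sketch below uses a synthetic stand-in (`X_demo` and `y_demo`, both hypothetical) rather than the real data, and it is not what the split above does; `stratify` is a standard parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 9% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = np.zeros(1000, dtype=int)
y_demo[:90] = 1

# stratify=y_demo preserves the positive share in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(f"positive share in train: {y_tr.mean():.3f}")
print(f"positive share in test:  {y_te.mean():.3f}")
```

Without stratification the positive share in the test set can drift from the overall share by chance, which makes metrics for the rare class noisier.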
Train a Model
Now it’s time to train a Naive Bayes Classifier.
#fitting classifier to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train,y_train)
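To make the model less of a black box, here is a minimal NumPy sketch of the idea behind Gaussian Naive Bayes: each feature is assumed to be normally distributed within each class and independent of the others, and the predicted class is the one with the highest log posterior. The function names `fit_gaussian_nb` and `predict_gaussian_nb` and the synthetic data are hypothetical; this is not scikit-learn's implementation, which among other things handles variance smoothing differently.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant avoids division by zero for constant features.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class with the highest log prior plus log likelihood."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]   # (samples, classes, features)
    log_lik = -0.5 * (
        np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None]
    ).sum(axis=2)                               # sum over independent features
    log_post = np.log(priors)[None, :] + log_lik
    return classes[np.argmax(log_post, axis=1)]

# Synthetic, imbalanced two-class data (assumed, not the pulsar data).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                    rng.normal(4.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 200 + [1] * 50)

model = fit_gaussian_nb(X_demo, y_demo)
y_hat = predict_gaussian_nb(model, X_demo)
train_accuracy = np.mean(y_hat == y_demo)
print(f"training accuracy on the synthetic data: {train_accuracy:.3f}")
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when there are many features.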

Model Evaluation
Now get predictions from the model and create a confusion matrix and a classification report.
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix,classification_report
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="OrRd" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

3946 and 308 are the correct predictions. In addition, 52 and 169 are the incorrect predictions.
Correct Predictions : 3946+308 = 4254
Incorrect Predictions: 52+169 = 221
Create a classification report for the model.
print(classification_report(y_test,y_pred))

The model’s accuracy in predicting pulsars on the test set is about 95%!
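The 95% figure follows directly from the confusion-matrix counts. In the sketch below the diagonal values come from the results above, while the placement of 52 and 169 in the off-diagonal cells is an assumption made only so the matrix can be written down.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes.
# The diagonal (3946, 308) holds the correct predictions; which
# off-diagonal cell holds 52 and which holds 169 is assumed here.
cm_counts = np.array([[3946, 52],
                      [169, 308]])

correct = np.trace(cm_counts)   # sum of the diagonal
total = cm_counts.sum()         # size of the test set
accuracy = correct / total
print(f"{correct} correct out of {total}: accuracy = {accuracy:.4f}")
```

Accuracy depends only on the diagonal and the total, so it is unaffected by that assumption. Precision and recall for the pulsar class do depend on it and should be read from the classification report.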

You may have heard the world is made up of atoms and molecules, but it’s really made up of stories.