The HTRU2 dataset describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey.
Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter.
As pulsars rotate, their emission beam sweeps across the sky; when the beam crosses our line of sight, it produces a detectable pattern of broadband radio emission. Because pulsars rotate rapidly, this pattern repeats periodically. Pulsar searches therefore involve looking for periodic radio signals with large radio telescopes.
Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. A potential signal detection, known as a ‘candidate’, is therefore averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional information, each candidate could potentially describe a real pulsar. In practice, however, almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.
The dataset contains a total of 17,898 observations, of which 1,639 are positive examples and 16,259 are negative.
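To get a feel for how unbalanced the classes are, the counts quoted above can be turned into proportions. This is only a quick arithmetic sketch; the variable names are made up for illustration.

```python
# Class counts as quoted in the dataset description
n_total = 17898
n_pulsar = 1639
n_non_pulsar = 16259

# Sanity check: the two classes should add up to the total
assert n_pulsar + n_non_pulsar == n_total

pulsar_fraction = n_pulsar / n_total
non_pulsars_per_pulsar = n_non_pulsar / n_pulsar

print(f"pulsar fraction: {pulsar_fraction:.2%}")
print(f"non-pulsars per pulsar: {non_pulsars_per_pulsar:.1f}")
```

With these counts, a bit more than 9% of the candidates are pulsars, which is roughly ten non-pulsars for every pulsar. That imbalance is worth keeping in mind when reading any accuracy figure later on.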
In this project, we implement Naive Bayes classification in Python.
The data contains the following columns:
- the mean of the integrated profile;
- the standard deviation of the integrated profile;
- the excess kurtosis of the integrated profile;
- the skewness of the integrated profile;
- the mean of the DM-SNR curve;
- the standard deviation of the DM-SNR curve;
- the excess kurtosis of the DM-SNR curve;
- the skewness of the DM-SNR curve;
- the target class, where the values used are 1 for candidates identified positively as pulsars, and 0 otherwise.
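To make the four summary statistics concrete, here is a minimal sketch that computes them for a made-up profile. It assumes population (divide by n) moments; the dataset description does not say which estimator its authors used, and the `profile` values below are invented purely for illustration.

```python
import math

def profile_statistics(values):
    """Mean, standard deviation, excess kurtosis and skewness of a 1-D series."""
    n = len(values)
    mean = sum(values) / n
    # Population central moments (assumption: divide by n, not n - 1)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "std": std,
        # Excess kurtosis subtracts 3 so that a normal distribution scores 0
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
        "skewness": m3 / std ** 3,
    }

# Hypothetical folded profile: flat baseline with one narrow pulse
profile = [1.0] * 60 + [9.0, 20.0, 9.0] + [1.0] * 61
stats = profile_statistics(profile)
print(stats)
```

A profile dominated by a single sharp peak like this one comes out with positive skewness and positive excess kurtosis, while a flat, noise-like series tends to score much lower on both.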
Let’s get our environment ready with the libraries we’ll need and then import the data!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn import metrics
Check out the Data
#importing the dataset
df = pd.read_csv('/Users/sadegh/Desktop/DataSet GitHub/Naive Bayes/pulsar_stars.csv')
df.head()

df.info()
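The plotting code further down refers to short column names such as mean_profile and std_profile. If the CSV you load comes with longer descriptive headers, one possible way to get the short names is to rename by position. This is a sketch under assumptions: it presumes the columns arrive in the order listed earlier, `rename_by_position` is a hypothetical helper, and "skewness_dmsnr_curve" is a guessed name that simply follows the pattern of the others.

```python
import pandas as pd

# Short names used by the plotting code; "skewness_dmsnr_curve" is an
# assumed name that follows the same pattern and does not appear in the text.
short_names = [
    "mean_profile", "std_profile", "kurtosis_profile", "skewness_profile",
    "mean_dmsnr_curve", "std_dmsnr_curve", "kurtosis_dmsnr_curve",
    "skewness_dmsnr_curve", "target_class",
]

def rename_by_position(frame, names):
    """Return a copy of frame whose columns are replaced, in order, by names."""
    if len(frame.columns) != len(names):
        raise ValueError(
            f"expected {len(names)} columns, got {len(frame.columns)}"
        )
    renamed = frame.copy()
    renamed.columns = names
    return renamed

# Stand-in frame with made-up long headers, used only to show the call
long_headers = [f" long header {i}" for i in range(8)] + ["target_class"]
demo = pd.DataFrame([[0.0] * 8 + [0]], columns=long_headers)
demo = rename_by_position(demo, short_names)
print(list(demo.columns))
```

On the real data the call would be along the lines of df = rename_by_position(df, short_names). Check df.head() first to confirm the column order matches.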

plt.figure(figsize=(12,8))
# df.describe()[1:] drops the "count" row so the colour scale is not dominated by it
sns.heatmap(df.describe()[1:].transpose(),
            annot=True, linecolor="w",
            linewidths=2, cmap=sns.color_palette("Set3"))
plt.title("Data summary")
plt.show()

Exploratory Data Analysis
Let’s check out the correlation between variables!
# correlation matrix
cor_mat = df.corr()
# mask the upper triangle (above the diagonal) so each pair is shown once
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)
fig = plt.gcf()
fig.set_size_inches(30, 12)
sns.heatmap(data=cor_mat, mask=mask, square=True, annot=True, cbar=True)

Let’s check out the proportion of the target variable in the dataset!
plt.figure(figsize=(12,6))
plt.pie(df["target_class"].value_counts().values,
labels=["not pulsar stars","pulsar stars"],
autopct="%1.0f%%",wedgeprops={"linewidth":2,"edgecolor":"white"})
my_circ = plt.Circle((0,0),.7,color = "white")
plt.gca().add_artist(my_circ)
plt.subplots_adjust(wspace = .2)
plt.title("Proportion of target variable in dataset")
plt.show()

Let’s see the pair plot between the variables!
sns.pairplot(data=df,
palette="hls",
hue="target_class",
vars=["mean_profile",
"std_profile",
"kurtosis_profile",
"skewness_profile",
"mean_dmsnr_curve",
"std_dmsnr_curve",
"kurtosis_dmsnr_curve"])
plt.tight_layout()
plt.show()

Train Test Split
Split the data into a training set and a testing set.
X = df.drop('target_class',axis=1)
y = df['target_class']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, random_state = 0)
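Because pulsars are so rare in this data, a plain random split can leave the training and test sets with slightly different class ratios. One option, shown here as an alternative and not as what the code above does, is the stratify argument of train_test_split. The data below is synthetic and only stands in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 10% positives (not the HTRU2 data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo asks for the same class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

Note that switching to a stratified split would change which rows land in the test set, so the confusion matrix numbers reported later would no longer match exactly.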
Train a Model
Now it’s time to train a Naive Bayes Classifier.
#fitting classifier to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train,y_train)
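For intuition about what a Gaussian Naive Bayes classifier does when it is fitted, here is a small from-scratch sketch. It is a simplified illustration, not scikit-learn's implementation: the function names are hypothetical, the toy data is invented, and the variance floor of 1e-9 is a stand-in for the library's own smoothing.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small floor keeps the variance strictly positive (simplification)
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_one(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Log of the normal density; features are treated as independent
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data with two well-separated classes (made up for illustration)
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(toy_model, [1.1, 2.1]), predict_one(toy_model, [5.1, 5.9]))
```

The "naive" part is the inner loop: each feature contributes its own log likelihood independently, even though the correlation heatmap suggests some of the pulsar features are clearly related.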

Model Evaluation
Now get predictions from the model and create a confusion matrix and a classification report.
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix,classification_report
cm = confusion_matrix(y_test,y_pred)
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="OrRd" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
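If the layout of the matrix is unclear, this tiny sketch shows how the four cells are tallied, following the convention used in the plot above: rows are actual labels, columns are predicted labels. The label lists are invented and `confusion_counts` is a hypothetical helper, not part of scikit-learn.

```python
def confusion_counts(actual, predicted):
    """2x2 matrix for binary labels: rows = actual, columns = predicted."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Made-up labels, only to illustrate the tallying
actual_demo = [0, 0, 0, 0, 1, 1, 1, 0]
predicted_demo = [0, 0, 1, 0, 1, 0, 1, 0]
cm_demo = confusion_counts(actual_demo, predicted_demo)
print(cm_demo)
```

The diagonal cells hold the correct predictions and the off-diagonal cells hold the mistakes.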

3946 and 308 are the correct predictions. In addition, 52 and 169 are the incorrect predictions.
Correct Predictions : 3946+308 = 4254
Incorrect Predictions: 52+169 = 221
Create a classification report for the model.
print(classification_report(y_test,y_pred))

The model’s accuracy on the test set is about 95% (4254 of 4475 predictions correct)!
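The accuracy figure can be checked by hand from the confusion matrix counts, and it helps to compare it with a trivial baseline that always answers "not pulsar". The sketch below uses the class counts of the full dataset for that baseline, which is an approximation, since the exact class mix of the test split is not stated.

```python
# Counts read off the confusion matrix above
correct = 3946 + 308
incorrect = 52 + 169
total = correct + incorrect

accuracy = correct / total

# Baseline: always predict the majority class, using the counts of the
# full dataset (assumption: the test split has a similar class mix)
baseline = 16259 / 17898

print(f"accuracy: {accuracy:.4f}")
print(f"majority-class baseline: {baseline:.4f}")
```

The classifier beats the baseline, but by a few percentage points rather than the jump the headline number might suggest. For rare classes like this, the per-class precision and recall in the classification report say more than accuracy alone.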

You may have heard the world is made up of atoms and molecules, but it’s really made up of stories. When you sit with an individual that’s been here, you can give quantitative data a qualitative overlay.