Email-Sms-Spam-Classifier
https://email-sms-spam-classifier-hzlsqvj8kvnjliwg8fbfdq.streamlit.app/
Objective : Prediction of Spam Email or SMS using Machine Learning
Introduction :
Spam email is unwanted junk email sent in bulk to an indiscriminate list of recipients. It is generally sent for commercial purposes, often in massive volumes by botnets (networks of infected computers). Spam can also be a malicious attempt to gain access to your system. It wastes CPU time, storage capacity and network bandwidth, and it becomes a serious problem when spam arrives in between important business emails. It is therefore essential to deal with the problems caused by spam, and machine learning methods that detect and filter spam are one way to do so.
Problem Statement :
The person responsible for sending spam messages is referred to as the spammer. Spammers gather email addresses from websites, chat rooms and similar sources. The huge volume of spam flowing through computer networks has destructive effects on email server storage, communication bandwidth, CPU power and user time. Existing systems do not detect spam effectively, which leads to low test and prediction accuracy, reduced security, loss of data and financial losses for many users.
To address the problem described above, I have built a machine learning model that predicts whether an email or SMS is spam or not spam.
Python libraries used to build the model:
- Numpy
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
The dataset contains two columns:
- TEXT (input)
- Target (output: Spam or Not Spam)
The shape of the dataset is (5572, 2).
The model is built using the following steps :
- Data cleaning
- EDA
- Text Preprocessing
- Model building
- Evaluation
- Improvement
1. Data Cleaning :
- I renamed the columns, since the original column names in the dataset were v1 and v2: v1 became ‘target’ and v2 became ‘text’.
- The target variable is categorical, so I converted it into a numerical variable using label encoding from the scikit-learn library, i.e. I mapped the spam category to 1 and the ham category to 0.
- I then checked for null values and found no missing values in the dataset.
- I checked for duplicate rows. After dropping all duplicate rows, the shape of the dataset becomes (5169, 2).
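The cleaning steps above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the project's actual notebook code: the sample rows are made up, and the use of `LabelEncoder` and `drop_duplicates(keep="first")` is an assumption about how the steps were carried out.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the raw data (assumption); the real file has 5572 rows.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam", "ham"],
    "v2": [
        "See you at lunch",
        "WIN a free prize now",
        "See you at lunch",  # exact duplicate of the first row
        "Claim your reward today",
        "Call me later",
    ],
})

# Rename the original columns to descriptive names.
df = raw.rename(columns={"v1": "target", "v2": "text"})

# Label-encode the target. Classes are sorted alphabetically,
# so ham becomes 0 and spam becomes 1.
encoder = LabelEncoder()
df["target"] = encoder.fit_transform(df["target"])

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isnull().sum()
duplicate_count = int(df.duplicated().sum())

# Keep the first occurrence of each duplicated row.
df = df.drop_duplicates(keep="first")

print(missing_per_column.to_dict(), duplicate_count, df.shape)
```

On the real dataset the same calls would take the shape from (5572, 2) to (5169, 2), as reported above.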
2. EDA (Exploratory Data Analysis) :
- I checked the distribution of the ham and spam categories in the target variable (i.e. how many messages are ham and how many are spam) and visualized it as a pie chart using the matplotlib library.
- I then created 3 new columns holding the number of characters, number of words and number of sentences in each row. For this I used NLTK, a standard Python library that provides a diverse set of algorithms for NLP. It is one of the most widely used libraries for NLP and computational linguistics.
- I plotted histograms (histplot) to see how the number of characters and the number of words are distributed in the input.
- Then I checked how the number of characters, number of words and number of sentences are correlated, using the seaborn library.
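A rough sketch of the EDA features described above is shown here. The sample messages are invented, and the two regex-based counters are simple stand-ins (an assumption) for NLTK's `word_tokenize` and `sent_tokenize`, chosen so the example runs without downloading NLTK's tokenizer data. Counts from NLTK may differ slightly.

```python
import re
import pandas as pd

# Toy data (assumption): target 0 = ham, 1 = spam.
df = pd.DataFrame({
    "target": [0, 1, 0, 1],
    "text": [
        "Ok. See you soon.",
        "WINNER! Claim your free prize now. Call today!",
        "Running late",
        "Free entry! Text WIN to enter.",
    ],
})

# Class balance: share of ham and spam messages (the numbers behind the pie chart).
class_share = df["target"].value_counts(normalize=True)

def count_words(text):
    # Words and individual punctuation marks each count as one token.
    return len(re.findall(r"\w+|[^\w\s]", text))

def count_sentences(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    return len([s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s])

df["num_characters"] = df["text"].str.len()
df["num_words"] = df["text"].apply(count_words)
df["num_sentences"] = df["text"].apply(count_sentences)

# Pairwise Pearson correlation between the three length features.
correlation = df[["num_characters", "num_words", "num_sentences"]].corr()

print(class_share, df, correlation, sep="\n\n")
```

Passing `correlation` to `seaborn.heatmap` would give a correlation plot of the kind described above.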
3. Data Preprocessing:
2. Tokenization:
Tokenization is the process of splitting a paragraph into a list of sentences and a sentence into a list of words. I converted the text of every row into a list of words.
3. Removing special characters:
Special characters such as ‘!’, ‘%’, ‘*’ and ‘$’ are removed from the sentences.
4. Removing Stop words:
Words such as ‘the’, ‘is’, ‘am’ and ‘it’ are removed because they carry little meaning and do not affect the output.
5. Stemming:
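Stemming is the process of reducing each word in the text data to its root form. For example, ‘calling’ and ‘called’ both have the root word ‘call’. Suffix-stripping stemmers such as the Porter stemmer do not always produce a dictionary word: irregular forms like ‘gone’ are usually left unchanged, and mapping them to ‘go’ requires lemmatization.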
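The tokenization, special-character removal, stop-word removal and stemming steps can be combined into one helper, sketched below. The function name `transform_text`, the lowercasing step, the regex tokenizer and the tiny stop-word list are all assumptions made for illustration. NLTK's `stopwords` corpus would normally replace the hand-written list, but it needs a one-time download.

```python
import re
from nltk.stem.porter import PorterStemmer

# Small illustrative stop-word list (assumption); NLTK's stopwords corpus is much larger.
STOP_WORDS = {"i", "am", "is", "it", "the", "he", "she", "you", "a", "an", "and", "to"}

stemmer = PorterStemmer()

def transform_text(text):
    """Tokenize, drop special characters and stop words, then stem each word."""
    # Lowercasing plus an alphanumeric-only pattern tokenizes the text and
    # discards characters such as '!', '%', '*' and '$' in one pass.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(stemmer.stem(tok) for tok in kept)

print(transform_text("I am calling!!! He called $$$"))  # prints "call call"
```

Applied to a DataFrame, this would look like `df["transformed_text"] = df["text"].apply(transform_text)`, where the new column name is again hypothetical.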
6. I created word cloud charts showing the most frequent words in spam and in ham emails or SMS messages
For Spam
For ham
7. Our final dataset will be:
8. Using TF-IDF (term frequency-inverse document frequency), I converted the text input into vectors. This produces as many columns as there are unique words in the input dataset
The shape of dataset after preprocessing becomes
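To illustrate how TF-IDF turns text into a matrix with one column per unique word, here is a minimal sketch using scikit-learn's `TfidfVectorizer` with default settings. The three-message corpus is made up, and the settings used in the actual project are not stated, so the defaults here are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-preprocessed messages (assumption).
corpus = [
    "free prize call now",
    "call me later",
    "free entry win prize",
]

vectorizer = TfidfVectorizer()
# Sparse matrix: one row per message, one column per unique term.
X = vectorizer.fit_transform(corpus)

print(X.shape)                         # (3, 8): 3 messages, 8 unique terms
print(sorted(vectorizer.vocabulary_))  # the terms behind the columns
```

If the number of columns grows too large on the full dataset, the `max_features` parameter of `TfidfVectorizer` can cap the vocabulary at the most frequent terms.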
4. Model Building and Evaluation :
Since this is a text classification problem, the Naive Bayes classifier is a good fit, as it works very well on text data.
There are 3 types of Naive Bayes classifier :
- Multinomial Naive Bayes : Works on feature vectors in which each value represents how often a term appears, i.e. its frequency. It is widely used for document classification.
- Bernoulli Naive Bayes : Works on binary features that record whether a term is present or not.
- Gaussian Naive Bayes : Assumes the features follow a continuous (Gaussian) distribution, so it is used when dealing with continuous data.
I calculated the accuracy, confusion matrix and precision score for each of the three classifiers.
Of the three, Multinomial Naive Bayes gave the best results, with an accuracy of 97.1 % and a precision of 100 %, so I chose it for the model. A precision of 100 % means that no ham message in the test set was wrongly flagged as spam.
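The comparison of the three classifiers can be sketched as below. The corpus is a tiny made-up sample, and the split ratio and `random_state` are assumptions, so the scores it prints say nothing about the 97.1 % accuracy and 100 % precision reported above, which come from the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy corpus (assumption): 1 = spam, 0 = ham.
spam = ["win a free prize now", "free entry claim your cash prize",
        "urgent claim your free reward", "win cash now call free",
        "free prize waiting claim now", "cash reward win free entry"]
ham = ["see you at lunch today", "call me when you are home",
       "are you coming to the meeting", "thanks for the lovely dinner",
       "i will be home late today", "let us meet at the office"]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=2, stratify=labels)

# Fit TF-IDF on the training texts only, then reuse it for the test texts.
# GaussianNB needs dense input, hence toarray().
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train).toarray()
X_test = vectorizer.transform(texts_test).toarray()

classifiers = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Precision: share of messages predicted as spam that really are spam.
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }
    print(name, results[name]["accuracy"], results[name]["precision"])
```

For a spam filter, precision is a natural metric to prioritize, because a false positive hides a legitimate message from the user.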
Let's see a demo of the model and test it:
Website Link:
https://email-sms-spam-classifier-hzlsqvj8kvnjliwg8fbfdq.streamlit.app/
Testing 1
Testing 2
Conclusion :
This is how I built the Email/SMS Spam Classifier model using the Naive Bayes machine learning algorithm.
Project By - Shivsharan Malage
Github
Linkedin