Sentiment Analysis with Machine Learning
Sentiment analysis, also known as opinion mining, is the process of identifying and extracting subjective information from text, images, or videos. This information can be used to determine the polarity or emotional tone of the content, which is commonly classified as positive, negative, or neutral. Sentiment analysis has several applications in marketing, customer service, social media monitoring, and more. In this blog post, we’ll explore sentiment analysis using machine learning techniques and discuss key aspects of the topic.
- Preprocessing the Data
To perform sentiment analysis, we first need to preprocess the data. This involves cleaning the text to remove any unwanted characters, converting the text to lowercase, tokenizing the text into words, and removing stop words such as “the,” “and,” “in,” etc. We also need to perform stemming or lemmatization to normalize the text and reduce the number of unique words.
Here is the Python code to preprocess the data using the Natural Language Toolkit (NLTK) library.
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # remove unwanted characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # convert to lowercase
    text = text.lower()
    # tokenize text into words
    words = word_tokenize(text)
    # remove stop words and lemmatize the remaining words
    filtered_words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return " ".join(filtered_words)
- Choosing a Machine Learning Algorithm
There are several machine learning algorithms that can be used for sentiment analysis, such as Naive Bayes, Support Vector Machines (SVM), Random Forest, and Neural Networks. Each algorithm has its strengths and weaknesses, and the choice of algorithm depends on the size of the dataset, the quality of the data, and the desired level of accuracy.
Here, we’ll use the Naive Bayes algorithm, a probabilistic classifier that applies Bayes’ theorem with a naive independence assumption between words: for each class c and document d, it computes the posterior probability P(c | d) ∝ P(c) × ∏ P(word | c) and selects the class with the highest probability.
Here is the Python code to create a Naive Bayes classifier using the scikit-learn library.
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the dataset (expected to contain 'text' and 'sentiment' columns)
data = pd.read_csv('data.csv')
# preprocess the text
data['text'] = data['text'].apply(preprocess)
# create a bag-of-words representation of the text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])
y = data['sentiment']
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)
# make predictions on the test data
y_pred = clf.predict(X_test)
# calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
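Because Naive Bayes is probabilistic, you can also inspect the posterior probabilities behind each prediction. Here is a minimal sketch, assuming the trained clf and vectorizer from above (the example review is made up for illustration):

# inspect the posterior probabilities for a new review
new_review = preprocess("This product is terrible and a waste of money")
new_vector = vectorizer.transform([new_review])
print(clf.classes_)                   # label order used by the classifier
print(clf.predict_proba(new_vector))  # posterior probability for each class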
- Evaluating the Model
To evaluate the performance of the sentiment analysis model, we need to calculate the accuracy score, precision, recall, and F1 score. The accuracy score measures the percentage of correctly classified examples, precision measures the percentage of positive predictions that are actually positive, and recall measures the percentage of true positives that are correctly identified. The F1 score is the harmonic mean of precision and recall, and it provides a balanced measure of the model’s performance. Because sentiment labels can include more than two classes (positive, negative, neutral), the code below uses macro averaging, which computes each metric per class and then averages the results.
Here is the Python code to calculate the evaluation metrics using the scikit-learn library.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
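scikit-learn can also print per-class precision, recall, and F1 in one table via classification_report, which is often more informative than the macro averages alone, and a confusion matrix shows which classes get mixed up:

from sklearn.metrics import classification_report, confusion_matrix
# per-class precision, recall, and F1 score in a single table
print(classification_report(y_test, y_pred))
# rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))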
- Improving the Model
To improve the performance of the sentiment analysis model, we can explore several techniques, such as feature engineering, hyperparameter tuning, and ensemble methods.
Feature engineering involves creating new features that capture more information about the text, such as n-grams, part-of-speech tags, and sentiment lexicons. Hyperparameter tuning involves searching for the settings that optimize the model’s performance, such as the number of features, the alpha smoothing parameter in Naive Bayes, or the kernel function in an SVM. Ensemble methods combine multiple models to improve overall performance, using techniques such as bagging, boosting, and stacking.
Here is the Python code to implement feature engineering using n-grams and hyperparameter tuning for the Naive Bayes classifier.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# create a pipeline with n-gram feature extraction and a Naive Bayes classifier
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
# specify the hyperparameters to tune
parameters = {
    'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'classifier__alpha': [0.1, 1, 10]
}
# the pipeline vectorizes the text itself, so split the raw text rather than
# the bag-of-words matrix from the earlier section
text_train, text_test, y_train, y_test = train_test_split(data['text'], y, test_size=0.2, random_state=42)
# perform a grid search to find the best hyperparameters
clf = GridSearchCV(pipeline, parameters, cv=5)
clf.fit(text_train, y_train)
print('Best parameters:', clf.best_params_)
# make predictions on the test data
y_pred = clf.predict(text_test)
# calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
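The grid search above demonstrates feature engineering (n-grams) and hyperparameter tuning. As a sketch of the third technique, here is a simple soft-voting ensemble that averages the predicted probabilities of Naive Bayes and logistic regression; pairing it with LogisticRegression is an illustrative choice, not the only option:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
# combine two classifiers by averaging their predicted class probabilities
ensemble = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', VotingClassifier(
        estimators=[
            ('nb', MultinomialNB(alpha=1.0)),
            ('lr', LogisticRegression(max_iter=1000))
        ],
        voting='soft'
    ))
])
ensemble.fit(text_train, y_train)
print('Ensemble accuracy:', accuracy_score(y_test, ensemble.predict(text_test)))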
- Conclusion
Sentiment analysis is a powerful tool for analyzing and understanding the opinions and emotions of customers, users, and stakeholders. With machine learning algorithms, we can classify the polarity of text with high accuracy and efficiency. Preprocessing the data, choosing the right algorithm, evaluating the model, and improving the model are crucial steps in building effective sentiment analysis models. By using the code snippets and resources provided in this blog post, you can start exploring the exciting world of sentiment analysis and machine learning.
Additional Resources:
- NLTK documentation: https://www.nltk.org/
- scikit-learn documentation: https://scikit-learn.org/
- Kaggle sentiment analysis datasets: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data