Prescription Drug Clustering and Recommendation System

This project aims to analyze medication user experiences and develop recommendation systems for patients and healthcare professionals. Using data from drugs.com and Kaggle, various clustering algorithms and recommendation techniques were applied to classify drugs based on reviews and side effects.

Project Objective

The primary objective of this project is to utilize data science techniques to detect different user experiences with various medications and analyze patterns among them. Additionally, the goal is to create a recommendation system to streamline the recommendation process for both healthcare professionals and patients, ultimately providing feedback to the pharmaceutical sector. This project leverages techniques like TF-IDF, clustering, and Jaccard similarity, along with tools such as Pandas, NumPy, and SciPy, to achieve these objectives.

Building the Dataset

Web Scraping

Data was scraped from drugs.com using a script that extracted information on different drugs, their side effects, and user reviews. The script iterated through multiple pages of reviews per drug, resulting in a dataset of 57,675 rows.

Figure 1. Example Row of the scraped dataset

Creation of the Dataset

The web-scraped dataset was merged with a secondary dataset from Kaggle, which contained specific drug details like alcohol compatibility, pregnancy considerations, and FDA approval status. This merged dataset underwent a data cleaning process to prepare it for analysis.
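
A minimal sketch of such a merge with Pandas is given here. The column names (drug_name, alcohol, pregnancy_category), the placeholder drugs, and the normalize_name helper are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped reviews and the Kaggle drug details.
scraped = pd.DataFrame({
    "drug_name": ["DrugA", "druga ", "DrugB"],
    "condition": ["Pain", "Pain", "Anxiety"],
    "rating": [8, 6, 9],
})
kaggle = pd.DataFrame({
    "drug_name": ["druga", "drugb"],
    "alcohol": ["yes", "no"],            # placeholder values
    "pregnancy_category": ["x1", "x2"],  # placeholder values
})

def normalize_name(names: pd.Series) -> pd.Series:
    """Lowercase and strip drug names so both sources share one join key."""
    return names.str.strip().str.lower()

scraped["drug_key"] = normalize_name(scraped["drug_name"])
kaggle["drug_key"] = normalize_name(kaggle["drug_name"])

# Left join keeps every review; validate raises if a drug key is
# duplicated on the details side.
merged = scraped.merge(
    kaggle.drop(columns="drug_name"),
    on="drug_key",
    how="left",
    validate="many_to_one",
)
```

A left join is used in this sketch so that reviews without a match in the secondary dataset are kept and can be inspected during cleaning.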

Descriptive Data Analysis

An initial analysis of the dataset examined the distribution of attributes and visualized them. The numerical attributes, such as 'rating' and 'usefulCount,' did not follow a normal distribution. The text attributes were analyzed based on review length, revealing an average length of 95 words per review. Common side effects and top conditions treated by the drugs were also identified and visualized.
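
The review-length statistic can be computed along the following lines. This is a sketch on three made-up reviews, assuming the text lives in a column named review and that words are split on whitespace.

```python
import pandas as pd

# Toy reviews; the real dataset has tens of thousands of rows.
reviews = pd.DataFrame({
    "review": [
        "Worked well with no side effects",
        "Caused nausea and mild headache for the first week",
        "No change",
    ]
})

# Whitespace tokenization is an assumption; other tokenizers give other counts.
reviews["word_count"] = reviews["review"].str.split().str.len()
mean_words = reviews["word_count"].mean()
```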

Figure 2. Top 10 Most Common Side Effects

Clustering with TF-IDF

Two distinct clustering analyses were conducted: one for clustering reviews based on conditions and another for clustering side effects based on conditions. The input for clustering was the sparse matrix derived from applying TF-IDF to the text attributes.

Description of Clustering Algorithms

DBSCAN: A density-based algorithm that groups points into high-density areas, identifying clusters of arbitrary shapes and handling noise in the data.
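CURE: A hierarchical algorithm that represents each cluster by a small set of well-scattered representative points shrunk toward the cluster centroid, which lets it handle non-spherical clusters and scale to larger datasets.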
KMeans: Partitions data into K distinct clusters, aiming to minimize variance within each cluster.
Agglomerative Hierarchical Clustering: Constructs a cluster hierarchy by iteratively merging the most similar clusters, building a dendrogram illustrating data relationships.
Manually Implemented Algorithm: A handcrafted algorithm that follows the KMeans procedure without using the prebuilt KMeans class from sklearn.cluster.
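
As an illustration of what a hand-written KMeans-style algorithm can look like, here is a short NumPy sketch of Lloyd's iteration. It is not the project's implementation; the function name simple_kmeans, the random initialization, and the stopping rule are assumptions.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style k-means on a dense array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = simple_kmeans(X, k=2)
```

The sketch works on dense arrays; a sparse TF-IDF matrix would need to be converted or the distance computation adapted.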

Figure 3. Performance metrics

Clustering Based on Reviews

Experiments with different numbers of conditions indicated that restricting the analysis to 15 conditions gave the most favorable computational cost. The dataset was filtered by reviews, stopwords were removed, and TF-IDF was used to convert the text data into numerical form. KMeans and DBSCAN were then compared, with KMeans showing better performance. The conclusion was that conditions with fewer clusters had clearer expectations.
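
The pipeline described above might look roughly like the following sketch. It assumes the reviews are clustered one condition at a time, uses TOP_N = 2 instead of 15 to keep the toy data small, and relies on invented column names and reviews.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "condition": ["Pain", "Pain", "Pain", "Anxiety", "Anxiety", "Acne"],
    "review": [
        "relieved my back pain quickly",
        "stomach upset and nausea after taking it",
        "pain relief lasted all day",
        "felt calmer but very drowsy",
        "drowsy and dizzy in the morning",
        "skin cleared after a month",
    ],
})

TOP_N = 2  # the project used 15; 2 keeps this toy example small
top_conditions = df["condition"].value_counts().nlargest(TOP_N).index
subset = df[df["condition"].isin(top_conditions)]

# Cluster the reviews of a single condition (an assumption about the setup).
pain = subset[subset["condition"] == "Pain"]
X = TfidfVectorizer(stop_words="english").fit_transform(pain["review"])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pain = pain.assign(cluster=km.labels_)
```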

Clustering Based on Side Effects

Clustering was performed on the side effects of the medicines used for a particular condition. As in the review clustering, DBSCAN yielded poor results, while KMeans, Agglomerative Hierarchical Clustering, and the manually implemented algorithm produced well-defined clusters.

Evaluation of Clustering

Clustering performance was evaluated using metrics such as the Davies-Bouldin index, silhouette score, and Calinski-Harabasz index. The results indicated that KMeans and the manually implemented algorithm provided moderate performance and well-defined clusters, unlike DBSCAN.
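
The three metrics are available in sklearn.metrics. The sketch below scores KMeans and DBSCAN on synthetic data; the data, the DBSCAN parameters, the score_clustering helper, and the choice to drop DBSCAN noise points before scoring are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(0)
# Two synthetic, well-separated blobs standing in for dense feature vectors.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 2)),
])

def score_clustering(X, labels):
    """Return the three metrics, or None if fewer than two clusters remain."""
    mask = labels != -1  # DBSCAN marks noise points with -1
    if len(set(labels[mask])) < 2:
        return None
    return {
        "silhouette": silhouette_score(X[mask], labels[mask]),
        "davies_bouldin": davies_bouldin_score(X[mask], labels[mask]),
        "calinski_harabasz": calinski_harabasz_score(X[mask], labels[mask]),
    }

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

results = {
    "KMeans": score_clustering(X, kmeans_labels),
    "DBSCAN": score_clustering(X, dbscan_labels),
}
```

Higher silhouette and Calinski-Harabasz values and lower Davies-Bouldin values indicate better-separated clusters. The Davies-Bouldin and Calinski-Harabasz functions expect dense input, so a sparse TF-IDF matrix would have to be densified or reduced first.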

Recommendation System

Three recommendation systems were implemented, each taking a medical condition and a drug as inputs to provide similar medication recommendations. The systems differ in preprocessing and statistical methods used for assessing similarity.

Recommendation System 1a (Patient)

This system is designed for patient use, employing TF-IDF and cosine similarity to recommend drugs based on user input reviews. The output includes average ratings, conditions treated, and drug classes.
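
A TF-IDF and cosine-similarity recommender of this kind can be sketched as follows. The recommend function, the column names, and the placeholder drugs and texts are hypothetical; the project's preprocessing and output fields are not reproduced here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One aggregated review text per drug (invented data).
drugs = pd.DataFrame({
    "drug": ["DrugA", "DrugB", "DrugC", "DrugD"],
    "condition": ["Pain", "Pain", "Pain", "Anxiety"],
    "text": [
        "fast pain relief mild nausea",
        "pain relief but nausea and stomach upset",
        "no relief severe headache",
        "calmer drowsy",
    ],
})

def recommend(condition, drug, top_n=2):
    """Rank other drugs for the same condition by cosine similarity of TF-IDF vectors."""
    pool = drugs[drugs["condition"] == condition].reset_index(drop=True)
    if drug not in pool["drug"].values:
        raise ValueError(f"{drug!r} has no reviews for condition {condition!r}")
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(pool["text"])
    idx = pool.index[pool["drug"] == drug][0]
    sims = cosine_similarity(tfidf[idx], tfidf).ravel()
    ranked = pool.assign(similarity=sims)
    ranked = ranked[ranked["drug"] != drug]  # never recommend the query drug itself
    return ranked.sort_values("similarity", ascending=False).head(top_n)[["drug", "similarity"]]

recs = recommend("Pain", "DrugA")
```

Fields such as average rating or drug class could then be joined onto the ranked result before it is shown to the user.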

Recommendation System 1b (Health Professionals)

This system focuses on health professionals, using TF-IDF and cosine similarity to recommend drugs and provide a word cloud of the most important side effects. It helps professionals find similar drugs for their patients and understand common side effects.

Figure 4. Wordcloud output of the recommendation system 1b example

Recommendation System 2 (TF-IDF, Clustering, Jaccard Similarity)

This system clusters medications based on side effects and computes Jaccard similarity to recommend similar drugs. It combines clustering output with similarity scores to provide recommendations.
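
Jaccard similarity between two side-effect sets is the size of their intersection divided by the size of their union. The sketch below shows one way to combine it with cluster labels, under the assumption that candidates are restricted to the query drug's cluster; the source does not specify how the two are combined, and all names and data are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Invented side-effect sets and cluster labels (e.g. from a prior KMeans run).
side_effects = {
    "DrugA": {"nausea", "headache", "dizziness"},
    "DrugB": {"nausea", "headache", "dry mouth"},
    "DrugC": {"insomnia", "weight gain"},
}
cluster_of = {"DrugA": 0, "DrugB": 0, "DrugC": 1}

def recommend_by_jaccard(drug, top_n=3):
    """Rank drugs from the same cluster by Jaccard similarity of side effects."""
    candidates = [
        d for d in side_effects
        if d != drug and cluster_of[d] == cluster_of[drug]
    ]
    scored = [(d, jaccard(side_effects[drug], side_effects[d])) for d in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

recommendations = recommend_by_jaccard("DrugA")
```

Here DrugA and DrugB share two of four distinct side effects, giving a similarity of 0.5, while DrugC is excluded because it falls in a different cluster.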

Conclusion

The project created a useful dataset, performed clustering analyses, and developed recommendation systems. KMeans proved to be the most effective clustering algorithm for drug reviews and side effects. The recommendation systems, despite using different methods, often produced the same output, which suggests that their recommendations are consistent. These findings contribute to the application of clustering algorithms and recommendation systems in drug analysis.