From 44149ea47815ff9bc5bc4161ccc1b839db54f817 Mon Sep 17 00:00:00 2001 From: Daniel Schwartz Date: Thu, 2 Nov 2023 09:47:29 -0400 Subject: [PATCH 1/9] Initial draft of clustering module --- python_clustering/python_clustering.md | 265 +++++++++++++++++++++++++ 1 file changed, 265 insertions(+) create mode 100644 python_clustering/python_clustering.md diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md new file mode 100644 index 000000000..c685151ef --- /dev/null +++ b/python_clustering/python_clustering.md @@ -0,0 +1,265 @@ + + +# Python Lesson on Clustering for Machine Learning + +@overview + +### What is clustering? +- Clustering is an unsupervised machine learning technique that groups unlabeled data points into clusters based on their similarity. The goal of clustering is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Clustering algorithms work by measuring the similarity between data points and then grouping similar data points together. There are many different clustering algorithms, each with its own strengths and weaknesses. Some of the most common clustering algorithms include K-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMMs). + + + + +[True/False] Clustering algorithms are always able to find the "correct" clusters in the data. + + +[( )] True +[(X)] False +*** +
+ +This question tests your understanding of the limitations of clustering algorithms. Clustering algorithms are heuristics, so they are not guaranteed to find the "correct" clusters in the data. The results depend on the distance metric, the initialization, and the algorithm's parameters, such as the number of clusters. + + +
+*** + + +[True/False] Clustering algorithms can be used to detect outliers in the data. + + +[(X)] True +[( )] False +*** +
+ +This question tests your understanding of how clustering relates to anomaly detection. The primary purpose of clustering is to group similar data points together, but the result can also reveal outliers: points that lie far from every cluster centroid, or that fall into very small clusters, are candidates for anomalies. This is why anomaly detection appears among the applications of clustering later in this lesson. + + +
+*** + +### Unsupervised vs. Supervised Learning +- **Unsupervised learning** is a type of machine learning where the algorithm is trained on unlabeled data. This means that the data does not have pre-defined labels or categories. The goal of unsupervised learning is to identify patterns and relationships in the data without any prior knowledge of the data. +- **Supervised learning** is a type of machine learning where the algorithm is trained on labeled data. This means that the data has pre-defined labels or categories. The goal of supervised learning is to train a model to predict the labels for new data points. + + + +Which of the following is a goal of supervised learning? + + + +[( )] Identify patterns and relationships in the data +[( )] Group similar data points together +[( )] Detect outliers in the data +[(X)] Predict the labels for new data points +[( )] Understand the underlying structure of the data +*** +
+ +Predicting the labels for new data points is a goal of supervised learning, not unsupervised learning. Unsupervised learning algorithms are used to identify patterns and relationships in the data without any prior knowledge of the data. This can be useful for tasks such as market segmentation, fraud detection, and anomaly detection. + + + +
+*** + + + + + +### Applications of clustering in machine learning +Clustering can be used for a variety of tasks, such as: + +- **Customer segmentation:** Clustering can be used to segment customers into different groups based on their demographics, purchase behavior, or other characteristics. This information can then be used to target marketing campaigns or product development efforts to specific customer segments. +- **Product grouping:** Clustering can be used to group products with similar characteristics, such as price, features, or customer reviews. This information can be used to improve product recommendations or to identify opportunities for cross-selling and up-selling. +- **Image segmentation:** Clustering can be used to segment images into different objects or regions. This information can be used in tasks such as object detection, image classification, and image compression. +- **Anomaly detection:** Clustering can be used to identify anomalous data points that are different from the rest of the data. This information can be used to detect fraud, identify errors in data collection, or predict future events. +- **Medical diagnosis:** Clustering can be used to group patients with similar symptoms or medical histories together. This information can be used to improve the accuracy of medical diagnosis and to develop more personalized treatment plans. +- **Scientific research:** Clustering can be used to identify patterns and relationships in scientific data. This information can be used to advance scientific knowledge and to develop new technologies. + +### Examples of clustering in real-world applications +- **Netflix uses clustering to recommend movies and TV shows to its users.** Netflix clusters its users based on their viewing history and then recommends movies and TV shows to users based on the clusters they belong to. 
+- **Amazon uses clustering to recommend products to its customers.** Amazon clusters its products based on customer reviews and purchase behavior. Amazon then recommends products to customers based on the clusters the products belong to and the customer's past purchase history. +- **Google uses clustering to improve the accuracy of its search results.** Google clusters search results based on the relevance of the results to the search query. Google then displays the most relevant results at the top of the search results page. +- **Banks use clustering to detect fraudulent transactions.** Banks cluster transactions based on their characteristics, such as the amount of money involved, the type of transaction, and the location of the transaction. Banks then flag anomalous transactions as potentially fraudulent. +- **Medical researchers use clustering to identify new biomarkers for diseases.** Medical researchers cluster patients based on their medical histories and symptoms. Researchers then look for patterns in the clusters to identify new biomarkers that can be used to diagnose and treat diseases. + +## K-Means Clustering Algorithm +- The K-Means clustering algorithm works by iteratively assigning data points to clusters based on their distance to the cluster centroids. The cluster centroids are the average values of all the data points in a cluster. + +``` +1. Choose the number of clusters (K): + - This is an important step, as it will determine the outcome of the clustering process. + - There is no one-size-fits-all answer to the question of how to choose K. One approach is to use the elbow method, which involves plotting the within-cluster sum of squares (WCSS) for different values of K. + - The elbow point on the plot is the point where the WCSS starts to flatten out, and this is often a good choice for K. +2. Initialize the cluster centroids + - The cluster centroids can be initialized randomly or by choosing K data points from the dataset. +3. 
Assign each data point to the cluster with the nearest centroid + - The distance between a data point and a cluster centroid can be measured using any distance metric, such as Euclidean distance or Manhattan distance. +4. Recalculate the cluster centroids + - The cluster centroids are recalculated by taking the average of all the data points in each cluster. +5. Repeat steps 3 and 4 until the cluster assignments no longer change + +``` + + + +What is the goal of the K-Means clustering algorithm? + + +[( )] To identify the clusters with the highest within-cluster sum of squares (WCSS) +[( )] To identify the clusters with the lowest within-cluster sum of squares (WCSS) +[( )] To identify the clusters with the highest between-cluster sum of squares (BCSS) +[( )] To identify the clusters with the lowest between-cluster sum of squares (BCSS) +[(X)] To group similar data points together +*** +
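+ +The goal of the K-Means clustering algorithm is to group similar data points together. It does this by iteratively assigning each data point to the cluster with the nearest centroid and then recalculating the centroids. Each iteration lowers (or leaves unchanged) the within-cluster sum of squares (WCSS), so a low WCSS is the means by which K-Means forms its groups, while grouping similar points is the overall goal. + + + +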
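The assign-and-update loop described above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the helper names `squared_distance` and `kmeans`, the toy points, and the choice to initialize centroids by sampling K data points are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical helper: returns (assignments, centroids, wcss)."""
    rng = random.Random(seed)
    # Step 2: initialize the centroids by picking k distinct data points
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest centroid
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 4: recalculate each centroid as the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    # Within-cluster sum of squares (WCSS) for the final clustering
    wcss = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return assignments, centroids, wcss


# Toy data: two visibly separate groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
assignments, centroids, wcss = kmeans(points, k=2)
print("Assignments:", assignments)
print("Centroids:", centroids)
print("WCSS:", round(wcss, 3))
```

For real work, `sklearn.cluster.KMeans` is usually the better choice, since it adds a smarter initialization and multiple restarts on top of this loop.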
+*** + + + +### Python Implementation of K-Means Clustering + + +``` +import numpy as np # Library for math manipulation, loading data +import matplotlib.pyplot as plt # Library for plotting +from sklearn.cluster import KMeans # Library for KMeans clustering + +# Load the data +data = np.loadtxt("data.csv", delimiter=",") + +# Choose the number of clusters +n_clusters = 3 + +# Initialize the KMeans model +kmeans = KMeans(n_clusters=n_clusters) + +# Fit the model to the data +kmeans.fit(data) + +# Predict the cluster labels for each data point +cluster_labels = kmeans.predict(data) + +plt.scatter(data[:, 0], data[:, 1], c=cluster_labels) +plt.xlabel("Feature 1") +plt.ylabel("Feature 2") +plt.title("K-Means Clustering") +plt.show() +``` + + +### Applying K-Means Clustering to a Real-World Dataset + +- **Loading and cleaning the data:** The first step is to load the data into Python and clean it as needed. This may involve removing outliers, handling missing values, and scaling the data. +- **Scaling the data:** It is important to scale the data before applying K-Means clustering. This helps to ensure that all features have equal importance in the clustering process. +- **Choosing the number of clusters (K):** There is no one-size-fits-all answer to the question of how to choose the number of clusters (K). One approach is to use the elbow method, which involves plotting the within-cluster sum of squares (WCSS) for different values of K. The elbow point on the plot is the point where the WCSS starts to flatten out, and this is often a good choice for K. +- **Training and evaluating the K-Means model:** Once you have chosen the number of clusters, you can train the K-Means model on the data. You can then evaluate the model by computing the silhouette score. The silhouette score is a measure of how well the data points are clustered, and a higher score indicates better clustering. 
+- **Visualizing the clusters:** Once you have trained and evaluated the K-Means model, you can visualize the clusters using a scatter plot. This can help you to understand how the data is clustered and to identify any outliers. + +### Important Notes +Clustering is a machine learning technique that groups unlabeled data points into clusters based on their similarity. It is a powerful tool that can be used to solve a variety of problems, such as customer segmentation, product grouping, and anomaly detection. However, clustering also has some limitations. Here are some of the most important limitations of clustering: + +- **Sensitivity to the initialization:** Many clustering algorithms, such as k-means clustering, are sensitive to the initialization of the cluster centroids. If the cluster centroids are not initialized correctly, the clustering algorithm may not be able to find the optimal clusters. +- **Difficulty in choosing the number of clusters:** K-means clustering requires the user to specify the number of clusters (k) in advance. However, there is no one-size-fits-all answer to the question of how to choose k. Choosing the wrong number of clusters can lead to inaccurate results. +- **Inability to handle outliers:** Clustering algorithms are often sensitive to outliers, which are data points that are significantly different from the rest of the data. Outliers can have a large impact on the clustering results and can lead to inaccurate clusters. +- **Difficulty in interpreting the results:** It can be difficult to interpret the results of clustering algorithms, especially when the data is high-dimensional. It can be difficult to understand what the clusters represent and why the data points were assigned to the clusters they were assigned to. + + + +Which of the following techniques can be used to mitigate the sensitivity of clustering algorithms to the initialization? 
+ + + +[( )] Running the clustering algorithm multiple times with different initializations and selecting the best results +[( )] Using a more robust clustering algorithm that is less sensitive to the initialization +[( )] Preprocessing the data to remove outliers +[(X)] All of the above +*** +
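+ +All of the above techniques can be used to mitigate the sensitivity of clustering algorithms to the initialization. Running the algorithm several times and keeping the run with the lowest WCSS reduces the chance of settling on a poor local optimum. Algorithms or initialization schemes that depend less on the starting centroids (for example, k-means++ initialization or hierarchical clustering) address the problem at its source. Removing outliers keeps an extreme point from being chosen as, or pulling, an initial centroid. + + + +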
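A minimal sketch of the first technique, assuming scikit-learn is available. It runs K-Means several times with a single random initialization each, then lets scikit-learn handle the restarts through `n_init`. The synthetic data from `make_blobs`, the number of clusters, and the seed values are assumptions chosen for illustration and are not part of the lesson's datasets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 blobs (illustrative only, not the lesson's data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

# One random initialization per run: the final WCSS can differ between seeds
single_run_wcss = []
for seed in range(5):
    model = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    single_run_wcss.append(model.inertia_)  # inertia_ is scikit-learn's name for WCSS

# n_init=10 runs ten initializations and keeps the one with the lowest WCSS;
# init="k-means++" spreads the starting centroids apart
best_of_ten = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

print("WCSS per single run:", np.round(single_run_wcss, 1))
print("WCSS, best of 10 k-means++ runs:", round(best_of_ten.inertia_, 1))
```

If the single-run values differ noticeably from one another, the data is sensitive to initialization and a larger `n_init` is worth the extra computation.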
+*** + + +## Conclusion + +At the end of the lesson, students should have a good understanding of the concept of clustering and how to implement the K-Means clustering algorithm in Python. They should also be able to apply K-Means clustering to real-world datasets to identify patterns and insights. + +## Additional Resources + +## Feedback + +@feedback From be925e84eb352cedaebdfe1edb3e275f0850db04 Mon Sep 17 00:00:00 2001 From: Daniel Schwartz Date: Sat, 11 Nov 2023 16:01:43 -0500 Subject: [PATCH 2/9] Added heart data for clustering python exercise --- python_clustering/data/heart.csv | 304 +++++++++++++++++++++++++++++++ 1 file changed, 304 insertions(+) create mode 100644 python_clustering/data/heart.csv diff --git a/python_clustering/data/heart.csv b/python_clustering/data/heart.csv new file mode 100644 index 000000000..0966e67b5 --- /dev/null +++ b/python_clustering/data/heart.csv @@ -0,0 +1,304 @@ +age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output +63,1,3,145,233,1,0,150,0,2.3,0,0,1,1 +37,1,2,130,250,0,1,187,0,3.5,0,0,2,1 +41,0,1,130,204,0,0,172,0,1.4,2,0,2,1 +56,1,1,120,236,0,1,178,0,0.8,2,0,2,1 +57,0,0,120,354,0,1,163,1,0.6,2,0,2,1 +57,1,0,140,192,0,1,148,0,0.4,1,0,1,1 +56,0,1,140,294,0,0,153,0,1.3,1,0,2,1 +44,1,1,120,263,0,1,173,0,0,2,0,3,1 +52,1,2,172,199,1,1,162,0,0.5,2,0,3,1 +57,1,2,150,168,0,1,174,0,1.6,2,0,2,1 +54,1,0,140,239,0,1,160,0,1.2,2,0,2,1 +48,0,2,130,275,0,1,139,0,0.2,2,0,2,1 +49,1,1,130,266,0,1,171,0,0.6,2,0,2,1 +64,1,3,110,211,0,0,144,1,1.8,1,0,2,1 +58,0,3,150,283,1,0,162,0,1,2,0,2,1 +50,0,2,120,219,0,1,158,0,1.6,1,0,2,1 +58,0,2,120,340,0,1,172,0,0,2,0,2,1 +66,0,3,150,226,0,1,114,0,2.6,0,0,2,1 +43,1,0,150,247,0,1,171,0,1.5,2,0,2,1 +69,0,3,140,239,0,1,151,0,1.8,2,2,2,1 +59,1,0,135,234,0,1,161,0,0.5,1,0,3,1 +44,1,2,130,233,0,1,179,1,0.4,2,0,2,1 +42,1,0,140,226,0,1,178,0,0,2,0,2,1 +61,1,2,150,243,1,1,137,1,1,1,0,2,1 +40,1,3,140,199,0,1,178,1,1.4,2,0,3,1 +71,0,1,160,302,0,1,162,0,0.4,2,2,2,1 
+59,1,2,150,212,1,1,157,0,1.6,2,0,2,1 +51,1,2,110,175,0,1,123,0,0.6,2,0,2,1 +65,0,2,140,417,1,0,157,0,0.8,2,1,2,1 +53,1,2,130,197,1,0,152,0,1.2,0,0,2,1 +41,0,1,105,198,0,1,168,0,0,2,1,2,1 +65,1,0,120,177,0,1,140,0,0.4,2,0,3,1 +44,1,1,130,219,0,0,188,0,0,2,0,2,1 +54,1,2,125,273,0,0,152,0,0.5,0,1,2,1 +51,1,3,125,213,0,0,125,1,1.4,2,1,2,1 +46,0,2,142,177,0,0,160,1,1.4,0,0,2,1 +54,0,2,135,304,1,1,170,0,0,2,0,2,1 +54,1,2,150,232,0,0,165,0,1.6,2,0,3,1 +65,0,2,155,269,0,1,148,0,0.8,2,0,2,1 +65,0,2,160,360,0,0,151,0,0.8,2,0,2,1 +51,0,2,140,308,0,0,142,0,1.5,2,1,2,1 +48,1,1,130,245,0,0,180,0,0.2,1,0,2,1 +45,1,0,104,208,0,0,148,1,3,1,0,2,1 +53,0,0,130,264,0,0,143,0,0.4,1,0,2,1 +39,1,2,140,321,0,0,182,0,0,2,0,2,1 +52,1,1,120,325,0,1,172,0,0.2,2,0,2,1 +44,1,2,140,235,0,0,180,0,0,2,0,2,1 +47,1,2,138,257,0,0,156,0,0,2,0,2,1 +53,0,2,128,216,0,0,115,0,0,2,0,0,1 +53,0,0,138,234,0,0,160,0,0,2,0,2,1 +51,0,2,130,256,0,0,149,0,0.5,2,0,2,1 +66,1,0,120,302,0,0,151,0,0.4,1,0,2,1 +62,1,2,130,231,0,1,146,0,1.8,1,3,3,1 +44,0,2,108,141,0,1,175,0,0.6,1,0,2,1 +63,0,2,135,252,0,0,172,0,0,2,0,2,1 +52,1,1,134,201,0,1,158,0,0.8,2,1,2,1 +48,1,0,122,222,0,0,186,0,0,2,0,2,1 +45,1,0,115,260,0,0,185,0,0,2,0,2,1 +34,1,3,118,182,0,0,174,0,0,2,0,2,1 +57,0,0,128,303,0,0,159,0,0,2,1,2,1 +71,0,2,110,265,1,0,130,0,0,2,1,2,1 +54,1,1,108,309,0,1,156,0,0,2,0,3,1 +52,1,3,118,186,0,0,190,0,0,1,0,1,1 +41,1,1,135,203,0,1,132,0,0,1,0,1,1 +58,1,2,140,211,1,0,165,0,0,2,0,2,1 +35,0,0,138,183,0,1,182,0,1.4,2,0,2,1 +51,1,2,100,222,0,1,143,1,1.2,1,0,2,1 +45,0,1,130,234,0,0,175,0,0.6,1,0,2,1 +44,1,1,120,220,0,1,170,0,0,2,0,2,1 +62,0,0,124,209,0,1,163,0,0,2,0,2,1 +54,1,2,120,258,0,0,147,0,0.4,1,0,3,1 +51,1,2,94,227,0,1,154,1,0,2,1,3,1 +29,1,1,130,204,0,0,202,0,0,2,0,2,1 +51,1,0,140,261,0,0,186,1,0,2,0,2,1 +43,0,2,122,213,0,1,165,0,0.2,1,0,2,1 +55,0,1,135,250,0,0,161,0,1.4,1,0,2,1 +51,1,2,125,245,1,0,166,0,2.4,1,0,2,1 +59,1,1,140,221,0,1,164,1,0,2,0,2,1 +52,1,1,128,205,1,1,184,0,0,2,0,2,1 +58,1,2,105,240,0,0,154,1,0.6,1,0,3,1 
+41,1,2,112,250,0,1,179,0,0,2,0,2,1 +45,1,1,128,308,0,0,170,0,0,2,0,2,1 +60,0,2,102,318,0,1,160,0,0,2,1,2,1 +52,1,3,152,298,1,1,178,0,1.2,1,0,3,1 +42,0,0,102,265,0,0,122,0,0.6,1,0,2,1 +67,0,2,115,564,0,0,160,0,1.6,1,0,3,1 +68,1,2,118,277,0,1,151,0,1,2,1,3,1 +46,1,1,101,197,1,1,156,0,0,2,0,3,1 +54,0,2,110,214,0,1,158,0,1.6,1,0,2,1 +58,0,0,100,248,0,0,122,0,1,1,0,2,1 +48,1,2,124,255,1,1,175,0,0,2,2,2,1 +57,1,0,132,207,0,1,168,1,0,2,0,3,1 +52,1,2,138,223,0,1,169,0,0,2,4,2,1 +54,0,1,132,288,1,0,159,1,0,2,1,2,1 +45,0,1,112,160,0,1,138,0,0,1,0,2,1 +53,1,0,142,226,0,0,111,1,0,2,0,3,1 +62,0,0,140,394,0,0,157,0,1.2,1,0,2,1 +52,1,0,108,233,1,1,147,0,0.1,2,3,3,1 +43,1,2,130,315,0,1,162,0,1.9,2,1,2,1 +53,1,2,130,246,1,0,173,0,0,2,3,2,1 +42,1,3,148,244,0,0,178,0,0.8,2,2,2,1 +59,1,3,178,270,0,0,145,0,4.2,0,0,3,1 +63,0,1,140,195,0,1,179,0,0,2,2,2,1 +42,1,2,120,240,1,1,194,0,0.8,0,0,3,1 +50,1,2,129,196,0,1,163,0,0,2,0,2,1 +68,0,2,120,211,0,0,115,0,1.5,1,0,2,1 +69,1,3,160,234,1,0,131,0,0.1,1,1,2,1 +45,0,0,138,236,0,0,152,1,0.2,1,0,2,1 +50,0,1,120,244,0,1,162,0,1.1,2,0,2,1 +50,0,0,110,254,0,0,159,0,0,2,0,2,1 +64,0,0,180,325,0,1,154,1,0,2,0,2,1 +57,1,2,150,126,1,1,173,0,0.2,2,1,3,1 +64,0,2,140,313,0,1,133,0,0.2,2,0,3,1 +43,1,0,110,211,0,1,161,0,0,2,0,3,1 +55,1,1,130,262,0,1,155,0,0,2,0,2,1 +37,0,2,120,215,0,1,170,0,0,2,0,2,1 +41,1,2,130,214,0,0,168,0,2,1,0,2,1 +56,1,3,120,193,0,0,162,0,1.9,1,0,3,1 +46,0,1,105,204,0,1,172,0,0,2,0,2,1 +46,0,0,138,243,0,0,152,1,0,1,0,2,1 +64,0,0,130,303,0,1,122,0,2,1,2,2,1 +59,1,0,138,271,0,0,182,0,0,2,0,2,1 +41,0,2,112,268,0,0,172,1,0,2,0,2,1 +54,0,2,108,267,0,0,167,0,0,2,0,2,1 +39,0,2,94,199,0,1,179,0,0,2,0,2,1 +34,0,1,118,210,0,1,192,0,0.7,2,0,2,1 +47,1,0,112,204,0,1,143,0,0.1,2,0,2,1 +67,0,2,152,277,0,1,172,0,0,2,1,2,1 +52,0,2,136,196,0,0,169,0,0.1,1,0,2,1 +74,0,1,120,269,0,0,121,1,0.2,2,1,2,1 +54,0,2,160,201,0,1,163,0,0,2,1,2,1 +49,0,1,134,271,0,1,162,0,0,1,0,2,1 +42,1,1,120,295,0,1,162,0,0,2,0,2,1 +41,1,1,110,235,0,1,153,0,0,2,0,2,1 
+41,0,1,126,306,0,1,163,0,0,2,0,2,1 +49,0,0,130,269,0,1,163,0,0,2,0,2,1 +60,0,2,120,178,1,1,96,0,0,2,0,2,1 +62,1,1,128,208,1,0,140,0,0,2,0,2,1 +57,1,0,110,201,0,1,126,1,1.5,1,0,1,1 +64,1,0,128,263,0,1,105,1,0.2,1,1,3,1 +51,0,2,120,295,0,0,157,0,0.6,2,0,2,1 +43,1,0,115,303,0,1,181,0,1.2,1,0,2,1 +42,0,2,120,209,0,1,173,0,0,1,0,2,1 +67,0,0,106,223,0,1,142,0,0.3,2,2,2,1 +76,0,2,140,197,0,2,116,0,1.1,1,0,2,1 +70,1,1,156,245,0,0,143,0,0,2,0,2,1 +44,0,2,118,242,0,1,149,0,0.3,1,1,2,1 +60,0,3,150,240,0,1,171,0,0.9,2,0,2,1 +44,1,2,120,226,0,1,169,0,0,2,0,2,1 +42,1,2,130,180,0,1,150,0,0,2,0,2,1 +66,1,0,160,228,0,0,138,0,2.3,2,0,1,1 +71,0,0,112,149,0,1,125,0,1.6,1,0,2,1 +64,1,3,170,227,0,0,155,0,0.6,1,0,3,1 +66,0,2,146,278,0,0,152,0,0,1,1,2,1 +39,0,2,138,220,0,1,152,0,0,1,0,2,1 +58,0,0,130,197,0,1,131,0,0.6,1,0,2,1 +47,1,2,130,253,0,1,179,0,0,2,0,2,1 +35,1,1,122,192,0,1,174,0,0,2,0,2,1 +58,1,1,125,220,0,1,144,0,0.4,1,4,3,1 +56,1,1,130,221,0,0,163,0,0,2,0,3,1 +56,1,1,120,240,0,1,169,0,0,0,0,2,1 +55,0,1,132,342,0,1,166,0,1.2,2,0,2,1 +41,1,1,120,157,0,1,182,0,0,2,0,2,1 +38,1,2,138,175,0,1,173,0,0,2,4,2,1 +38,1,2,138,175,0,1,173,0,0,2,4,2,1 +67,1,0,160,286,0,0,108,1,1.5,1,3,2,0 +67,1,0,120,229,0,0,129,1,2.6,1,2,3,0 +62,0,0,140,268,0,0,160,0,3.6,0,2,2,0 +63,1,0,130,254,0,0,147,0,1.4,1,1,3,0 +53,1,0,140,203,1,0,155,1,3.1,0,0,3,0 +56,1,2,130,256,1,0,142,1,0.6,1,1,1,0 +48,1,1,110,229,0,1,168,0,1,0,0,3,0 +58,1,1,120,284,0,0,160,0,1.8,1,0,2,0 +58,1,2,132,224,0,0,173,0,3.2,2,2,3,0 +60,1,0,130,206,0,0,132,1,2.4,1,2,3,0 +40,1,0,110,167,0,0,114,1,2,1,0,3,0 +60,1,0,117,230,1,1,160,1,1.4,2,2,3,0 +64,1,2,140,335,0,1,158,0,0,2,0,2,0 +43,1,0,120,177,0,0,120,1,2.5,1,0,3,0 +57,1,0,150,276,0,0,112,1,0.6,1,1,1,0 +55,1,0,132,353,0,1,132,1,1.2,1,1,3,0 +65,0,0,150,225,0,0,114,0,1,1,3,3,0 +61,0,0,130,330,0,0,169,0,0,2,0,2,0 +58,1,2,112,230,0,0,165,0,2.5,1,1,3,0 +50,1,0,150,243,0,0,128,0,2.6,1,0,3,0 +44,1,0,112,290,0,0,153,0,0,2,1,2,0 +60,1,0,130,253,0,1,144,1,1.4,2,1,3,0 
+54,1,0,124,266,0,0,109,1,2.2,1,1,3,0 +50,1,2,140,233,0,1,163,0,0.6,1,1,3,0 +41,1,0,110,172,0,0,158,0,0,2,0,3,0 +51,0,0,130,305,0,1,142,1,1.2,1,0,3,0 +58,1,0,128,216,0,0,131,1,2.2,1,3,3,0 +54,1,0,120,188,0,1,113,0,1.4,1,1,3,0 +60,1,0,145,282,0,0,142,1,2.8,1,2,3,0 +60,1,2,140,185,0,0,155,0,3,1,0,2,0 +59,1,0,170,326,0,0,140,1,3.4,0,0,3,0 +46,1,2,150,231,0,1,147,0,3.6,1,0,2,0 +67,1,0,125,254,1,1,163,0,0.2,1,2,3,0 +62,1,0,120,267,0,1,99,1,1.8,1,2,3,0 +65,1,0,110,248,0,0,158,0,0.6,2,2,1,0 +44,1,0,110,197,0,0,177,0,0,2,1,2,0 +60,1,0,125,258,0,0,141,1,2.8,1,1,3,0 +58,1,0,150,270,0,0,111,1,0.8,2,0,3,0 +68,1,2,180,274,1,0,150,1,1.6,1,0,3,0 +62,0,0,160,164,0,0,145,0,6.2,0,3,3,0 +52,1,0,128,255,0,1,161,1,0,2,1,3,0 +59,1,0,110,239,0,0,142,1,1.2,1,1,3,0 +60,0,0,150,258,0,0,157,0,2.6,1,2,3,0 +49,1,2,120,188,0,1,139,0,2,1,3,3,0 +59,1,0,140,177,0,1,162,1,0,2,1,3,0 +57,1,2,128,229,0,0,150,0,0.4,1,1,3,0 +61,1,0,120,260,0,1,140,1,3.6,1,1,3,0 +39,1,0,118,219,0,1,140,0,1.2,1,0,3,0 +61,0,0,145,307,0,0,146,1,1,1,0,3,0 +56,1,0,125,249,1,0,144,1,1.2,1,1,2,0 +43,0,0,132,341,1,0,136,1,3,1,0,3,0 +62,0,2,130,263,0,1,97,0,1.2,1,1,3,0 +63,1,0,130,330,1,0,132,1,1.8,2,3,3,0 +65,1,0,135,254,0,0,127,0,2.8,1,1,3,0 +48,1,0,130,256,1,0,150,1,0,2,2,3,0 +63,0,0,150,407,0,0,154,0,4,1,3,3,0 +55,1,0,140,217,0,1,111,1,5.6,0,0,3,0 +65,1,3,138,282,1,0,174,0,1.4,1,1,2,0 +56,0,0,200,288,1,0,133,1,4,0,2,3,0 +54,1,0,110,239,0,1,126,1,2.8,1,1,3,0 +70,1,0,145,174,0,1,125,1,2.6,0,0,3,0 +62,1,1,120,281,0,0,103,0,1.4,1,1,3,0 +35,1,0,120,198,0,1,130,1,1.6,1,0,3,0 +59,1,3,170,288,0,0,159,0,0.2,1,0,3,0 +64,1,2,125,309,0,1,131,1,1.8,1,0,3,0 +47,1,2,108,243,0,1,152,0,0,2,0,2,0 +57,1,0,165,289,1,0,124,0,1,1,3,3,0 +55,1,0,160,289,0,0,145,1,0.8,1,1,3,0 +64,1,0,120,246,0,0,96,1,2.2,0,1,2,0 +70,1,0,130,322,0,0,109,0,2.4,1,3,2,0 +51,1,0,140,299,0,1,173,1,1.6,2,0,3,0 +58,1,0,125,300,0,0,171,0,0,2,2,3,0 +60,1,0,140,293,0,0,170,0,1.2,1,2,3,0 +77,1,0,125,304,0,0,162,1,0,2,3,2,0 +35,1,0,126,282,0,0,156,1,0,2,0,3,0 
+70,1,2,160,269,0,1,112,1,2.9,1,1,3,0 +59,0,0,174,249,0,1,143,1,0,1,0,2,0 +64,1,0,145,212,0,0,132,0,2,1,2,1,0 +57,1,0,152,274,0,1,88,1,1.2,1,1,3,0 +56,1,0,132,184,0,0,105,1,2.1,1,1,1,0 +48,1,0,124,274,0,0,166,0,0.5,1,0,3,0 +56,0,0,134,409,0,0,150,1,1.9,1,2,3,0 +66,1,1,160,246,0,1,120,1,0,1,3,1,0 +54,1,1,192,283,0,0,195,0,0,2,1,3,0 +69,1,2,140,254,0,0,146,0,2,1,3,3,0 +51,1,0,140,298,0,1,122,1,4.2,1,3,3,0 +43,1,0,132,247,1,0,143,1,0.1,1,4,3,0 +62,0,0,138,294,1,1,106,0,1.9,1,3,2,0 +67,1,0,100,299,0,0,125,1,0.9,1,2,2,0 +59,1,3,160,273,0,0,125,0,0,2,0,2,0 +45,1,0,142,309,0,0,147,1,0,1,3,3,0 +58,1,0,128,259,0,0,130,1,3,1,2,3,0 +50,1,0,144,200,0,0,126,1,0.9,1,0,3,0 +62,0,0,150,244,0,1,154,1,1.4,1,0,2,0 +38,1,3,120,231,0,1,182,1,3.8,1,0,3,0 +66,0,0,178,228,1,1,165,1,1,1,2,3,0 +52,1,0,112,230,0,1,160,0,0,2,1,2,0 +53,1,0,123,282,0,1,95,1,2,1,2,3,0 +63,0,0,108,269,0,1,169,1,1.8,1,2,2,0 +54,1,0,110,206,0,0,108,1,0,1,1,2,0 +66,1,0,112,212,0,0,132,1,0.1,2,1,2,0 +55,0,0,180,327,0,2,117,1,3.4,1,0,2,0 +49,1,2,118,149,0,0,126,0,0.8,2,3,2,0 +54,1,0,122,286,0,0,116,1,3.2,1,2,2,0 +56,1,0,130,283,1,0,103,1,1.6,0,0,3,0 +46,1,0,120,249,0,0,144,0,0.8,2,0,3,0 +61,1,3,134,234,0,1,145,0,2.6,1,2,2,0 +67,1,0,120,237,0,1,71,0,1,1,0,2,0 +58,1,0,100,234,0,1,156,0,0.1,2,1,3,0 +47,1,0,110,275,0,0,118,1,1,1,1,2,0 +52,1,0,125,212,0,1,168,0,1,2,2,3,0 +58,1,0,146,218,0,1,105,0,2,1,1,3,0 +57,1,1,124,261,0,1,141,0,0.3,2,0,3,0 +58,0,1,136,319,1,0,152,0,0,2,2,2,0 +61,1,0,138,166,0,0,125,1,3.6,1,1,2,0 +42,1,0,136,315,0,1,125,1,1.8,1,0,1,0 +52,1,0,128,204,1,1,156,1,1,1,0,0,0 +59,1,2,126,218,1,1,134,0,2.2,1,1,1,0 +40,1,0,152,223,0,1,181,0,0,2,0,3,0 +61,1,0,140,207,0,0,138,1,1.9,2,1,3,0 +46,1,0,140,311,0,1,120,1,1.8,1,2,3,0 +59,1,3,134,204,0,1,162,0,0.8,2,2,2,0 +57,1,1,154,232,0,0,164,0,0,2,1,2,0 +57,1,0,110,335,0,1,143,1,3,1,1,3,0 +55,0,0,128,205,0,2,130,1,2,1,1,3,0 +61,1,0,148,203,0,1,161,0,0,2,1,3,0 +58,1,0,114,318,0,2,140,0,4.4,0,3,1,0 +58,0,0,170,225,1,0,146,1,2.8,1,2,1,0 
+67,1,2,152,212,0,0,150,0,0.8,1,0,3,0 +44,1,0,120,169,0,1,144,1,2.8,0,0,1,0 +63,1,0,140,187,0,0,144,1,4,2,2,3,0 +63,0,0,124,197,0,1,136,1,0,1,0,2,0 +59,1,0,164,176,1,0,90,0,1,1,2,1,0 +57,0,0,140,241,0,1,123,1,0.2,1,0,3,0 +45,1,3,110,264,0,1,132,0,1.2,1,0,3,0 +68,1,0,144,193,1,1,141,0,3.4,1,2,3,0 +57,1,0,130,131,0,1,115,1,1.2,1,1,3,0 +57,0,1,130,236,0,0,174,0,0,1,1,2,0 From c0436530d846c2eaad3478152a00cf6a42cde508 Mon Sep 17 00:00:00 2001 From: Daniel Schwartz Date: Sat, 11 Nov 2023 17:08:14 -0500 Subject: [PATCH 3/9] Added clustering python exercise --- python_clustering/python_clustering.md | 92 +++++++++++++++++++++----- 1 file changed, 74 insertions(+), 18 deletions(-) diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index c685151ef..9cf95254e 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -58,7 +58,8 @@ Previous versions: @end import: https://raw.githubusercontent.com/arcus/education_modules/main/_module_templates/macros.md -import: https://raw.githubusercontent.com/arcus/education_modules/main/_module_templates/macros_python.md +import: https://raw.githubusercontent.com/arcus/education_modules/pyodide_testing/_module_templates/macros_python.md +import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md --> # Python Lesson on Clustering for Machine Learning @@ -189,32 +190,87 @@ The goal of the K-Means clustering algorithm is to group similar data points tog ### Python Implementation of K-Means Clustering +To implement k-means clustering in Python using Scikit-learn, we can follow these steps: + +1. 
Import the necessary libraries: +```python +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from sklearn.model_selection import train_test_split +from sklearn.cluster import KMeans +from scipy.spatial import distance ``` -import numpy as np # Library for math manipulation, loading data -import matplotlib.pyplot as plt # Library for plotting -from sklearn.cluster import KMeans # Library for KMeans clustering +@Pyodide.eval + + +2. Load the data: +```python @Pyodide.exec + +import pandas as pd +import io +from pyodide.http import open_url + +url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/heart.csv" -# Load the data -data = np.loadtxt("data.csv", delimiter=",") +url_contents = open_url(url) +text = url_contents.read() +file = io.StringIO(text) -# Choose the number of clusters -n_clusters = 3 +data = pd.read_csv(file) -# Initialize the KMeans model -kmeans = KMeans(n_clusters=n_clusters) -# Fit the model to the data -kmeans.fit(data) +# Analyze data and features +data.info() +``` -# Predict the cluster labels for each data point -cluster_labels = kmeans.predict(data) -plt.scatter(data[:, 0], data[:, 1], c=cluster_labels) -plt.xlabel("Feature 1") -plt.ylabel("Feature 2") -plt.title("K-Means Clustering") +3. Visualize the data: +```python +# Create the scatter plot +data.plot.scatter(x='chol', y='trtbps', c='output', colormap='viridis') +plt.xlabel("Cholesterol") +plt.ylabel("Resting Blood Pressure") +plt.title("Scatter Plot of Cholesterol vs. Blood Pressure") plt.show() ``` +@Pyodide.eval + +4.
Normalize the features so that each one contributes equally to the distance calculation: +```python +# Min-max normalize each listed column to the range [0, 1] +def normalize(df, features): + result = df.copy() + for feature_name in features: + max_value = df[feature_name].max() + min_value = df[feature_name].min() + result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value) + return result + +normalized_data = normalize(data, data.columns) +``` +@Pyodide.eval + +5. Train the clustering model and visualize: +```python +# Run KMeans +kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2) + +# Predict clusters +identified_clusters = kmeans.fit_predict(normalized_data.values) +results = normalized_data.copy() +results['cluster'] = identified_clusters + +# Compute each point's distance from its assigned cluster centroid +distance_from_centroid = [distance.euclidean(val[:-1],kmeans.cluster_centers_[int(val[-1])]) for val in results.values] +results['dist'] = distance_from_centroid +results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist') +plt.xlabel("Cholesterol") +plt.ylabel("Resting Blood Pressure") +plt.show() +``` +@Pyodide.eval + ### Applying K-Means Clustering to a Real-World Dataset From 46656def5fac5c0e89e06742abc26fc785af7da2 Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 21 Mar 2024 09:08:24 -0400 Subject: [PATCH 4/9] Added polyps data for clustering real world example --- python_clustering/data/polyps.csv | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 python_clustering/data/polyps.csv diff --git a/python_clustering/data/polyps.csv b/python_clustering/data/polyps.csv new file mode 100644 index 000000000..54f327919 --- /dev/null +++ b/python_clustering/data/polyps.csv @@ -0,0 +1,23 @@ +"","participant_id","sex","age","baseline","treatment","number3m","number12m" +"1","001","female",17,7,"sulindac",6,NA +"2","002","female",20,77,"placebo",67,63 +"3","003","male",16,7,"sulindac",4,2 +"4","004","female",18,5,"placebo",5,28 +"5","005","male",22,23,"sulindac",16,17 
+"6","006","female",13,35,"placebo",31,61 +"7","007","female",23,11,"sulindac",6,1 +"8","008","male",34,12,"placebo",20,7 +"9","009","male",50,7,"placebo",7,15 +"10","010","male",19,318,"placebo",347,44 +"11","011","male",17,160,"sulindac",142,25 +"12","012","female",23,8,"sulindac",1,3 +"13","013","male",22,20,"placebo",16,28 +"14","014","male",30,11,"placebo",20,10 +"15","015","male",27,24,"placebo",26,40 +"16","016","male",23,34,"sulindac",27,33 +"17","017","female",22,54,"placebo",45,46 +"18","018","male",13,16,"sulindac",10,NA +"19","019","male",34,30,"placebo",30,50 +"20","020","female",23,10,"sulindac",6,3 +"21","021","female",22,20,"sulindac",5,1 +"22","022","male",42,12,"sulindac",8,4 From 3af59e39d9dba071a9c6682c2740d97a79fb4f9e Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 21 Mar 2024 09:15:19 -0400 Subject: [PATCH 5/9] Updated module to include real-world example --- python_clustering/python_clustering.md | 96 ++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index 9cf95254e..bf7e0ef50 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -66,6 +66,12 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md @overview + + + + + + ### What is clustering? - Clustering is an unsupervised machine learning technique that groups unlabeled data points into clusters based on their similarity. The goal of clustering is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Clustering algorithms work by measuring the similarity between data points and then grouping similar data points together. There are many different clustering algorithms, each with its own strengths and weaknesses. Some of the most common clustering algorithms include K-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMMs). 
@@ -190,6 +196,8 @@ The goal of the K-Means clustering algorithm is to group similar data points tog ### Python Implementation of K-Means Clustering +This dataset contains various clinical attributes of patients, including their age, sex, chest pain type (cp), resting blood pressure (trtbps), serum cholesterol level (chol), fasting blood sugar (fbs) level, resting electrocardiographic results (restecg), maximum heart rate achieved (thalachh), exercise-induced angina (exng), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slp), number of major vessels (caa) colored by fluoroscopy, thalassemia (thall) type, and the presence of heart disease (output). The data seems to be related to the diagnosis of heart disease, with the output variable indicating whether a patient has heart disease (1) or not (0). Each row represents a different patient, with their respective clinical characteristics recorded. + To implement k-means clustering in Python using Scikit-learn, we can follow these steps: 1. Import the necessary libraries: @@ -310,6 +318,94 @@ All of the above techniques can be used to mitigate the sensitivity of clusterin *** + +### Real World Code Example + +This dataset, derived and refined from a landmark study published in the New England Journal of Medicine in 1993, investigates the effectiveness of sulindac treatment in individuals with familial adenomatous polyposis (FAP), a hereditary condition characterized by the development of numerous adenomatous polyps in the colon and rectum. Enhanced from the original datasets "polyps" and "polyps3" in the {HSAUR} package, this dataset includes crucial variables such as participant ID, sex, age, baseline polyp count, assigned treatment (sulindac or placebo), and polyp counts at 3 and 12 months post-treatment. 
These enhancements involved meticulous referencing of the original paper and offer improved granularity and completeness for analyzing the impact of sulindac treatment on polyp progression in FAP patients. This dataset serves as a valuable resource for further research and analysis in the field of gastrointestinal medicine and pharmacology.
+
+1. Import packages:
+```python @Pyodide.exec
+
+import pandas as pd
+import io
+from pyodide.http import open_url
+from sklearn.cluster import KMeans
+from sklearn.preprocessing import StandardScaler
+import matplotlib.pyplot as plt
+```
+
+2. Load the data:
+```python
+# Load dataset and read to pandas dataframe
+url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/polyps.csv"
+url_contents = open_url(url)
+text = url_contents.read()
+file = io.StringIO(text)
+df = pd.read_csv(file)
+
+# Analyze data and features
+df.info()
+
+# Select features for clustering
+features = ['age', 'baseline', 'number3m', 'number12m']
+X = df[features]
+
+# Fill missing values with the mean of each column
+# (reassign instead of using inplace=True, which would modify a slice of df and trigger a pandas warning)
+X = X.fillna(X.mean())
+
+# Standardize the feature values
+scaler = StandardScaler()
+X_scaled = scaler.fit_transform(X)
+```
+@Pyodide.eval
+
+
+3. Cluster Data:
+```python
+# Define the number of clusters
+num_clusters = 3
+
+# Apply KMeans clustering (n_init is set explicitly so results do not depend on the library's default)
+kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
+kmeans.fit(X_scaled)
+
+# Assign cluster labels to the original dataframe
+df['cluster'] = kmeans.labels_
+```
+@Pyodide.eval
+
+
+4. 
Visualize Clusters:
+```python
+# Visualize clusters for 'number3m' vs 'number12m'
+plt.figure(figsize=(10, 8))
+colors = ['red', 'blue', 'green'] # Change colors as needed for more clusters
+
+for i in range(num_clusters):
+    cluster_data = df[df['cluster'] == i]
+    plt.scatter(cluster_data['number3m'], cluster_data['number12m'],
+                color=colors[i], label=f'Cluster {i}')
+
+plt.xlabel('Number of Polyps at 3 Months')
+plt.ylabel('Number of Polyps at 12 Months')
+plt.title('K-Means Clustering of Polyp Data: Number of Polyps at 3 Months vs Number of Polyps at 12 Months')
+plt.legend()
+plt.show()
+```
+@Pyodide.eval
+
+If K-Means identifies distinct clusters with minimal overlap, this suggests there might be three underlying patient groups with respect to polyp count progression. K-Means assigns the labels 0, 1, and 2 arbitrarily, so the group names below describe patterns to look for in the plot rather than specific label numbers:
+
+- **Low progression:** This cluster might represent participants who have a relatively low number of polyps at 3 months and a stable or slightly increased number at 12 months. This could be associated with effective treatment or naturally slow polyp growth.
+- **Moderate progression:** This cluster could include participants with a moderate number of polyps at 3 months and a somewhat steeper increase by 12 months. This might indicate a less effective treatment or a faster natural growth rate for polyps.
+- **High progression:** This cluster might contain participants with a high number of polyps at 3 months and a substantial increase by 12 months. This could be linked to factors like a particularly aggressive polyp type or treatment resistance. 
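One way to check interpretations like these is to summarize each cluster numerically. The sketch below is only an illustration: it uses six rows copied from the polyps CSV shown earlier in this patch series (assuming the column order is treatment, 3-month count, 12-month count), and the `cluster` values are hypothetical labels assigned by hand, not output from K-Means.

```python
import pandas as pd

# Six rows taken from the polyps CSV; the 'cluster' column is hypothetical, for illustration only.
df = pd.DataFrame({
    "treatment": ["sulindac", "placebo", "placebo", "sulindac", "placebo", "sulindac"],
    "number3m":  [6, 20, 347, 142, 45, 6],
    "number12m": [1, 7, 44, 25, 46, 3],
    "cluster":   [0, 0, 2, 2, 1, 0],
})

# Average polyp counts per cluster: shows what each numeric label actually represents.
cluster_means = df.groupby("cluster")[["number3m", "number12m"]].mean()
print(cluster_means)

# Cross-tabulate cluster membership against treatment arm.
treatment_counts = pd.crosstab(df["cluster"], df["treatment"])
print(treatment_counts)
```

With the real clustering output, the same two tables would show whether a cluster dominated by one treatment arm also has distinctly lower or higher polyp counts.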
+ +**While clustering provides valuable insights into potential patient subgroups, further analysis of treatment effects and other relevant features is necessary to fully understand the underlying factors influencing polyp count progression.** + + + + + ## Conclusion At the end of the lesson, students should have a good understanding of the concept of clustering and how to implement the K-Means clustering algorithm in Python. They should also be able to apply K-Means clustering to real-world datasets to identify patterns and insights. From 981f599a3d818dd048c8f365d32387444498de1b Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 18 Apr 2024 12:19:54 -0400 Subject: [PATCH 6/9] Addressed Rose's comments from the following module --- python_clustering/python_clustering.md | 106 +++++++++++++++++++++---- 1 file changed, 90 insertions(+), 16 deletions(-) diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index bf7e0ef50..8a41eb7cd 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -75,7 +75,20 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md ### What is clustering? - Clustering is an unsupervised machine learning technique that groups unlabeled data points into clusters based on their similarity. The goal of clustering is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Clustering algorithms work by measuring the similarity between data points and then grouping similar data points together. There are many different clustering algorithms, each with its own strengths and weaknesses. Some of the most common clustering algorithms include K-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMMs). +
+A little encouragement...
+As in many fields, machine learning involves a lot of technical language, some of which is unclear, redundant, or downright confusing. +For example: + +**Outcome** variables are also called **response variables**, **dependent variables**, or **labels**. + +**Input** variables are also called **predictors**, **features**, **independent variables**, or even just **variables**. + +To make matters worse, sometimes the same words are used to mean different things in different subfields. +If you find yourself stumbling on vocabulary as you read about machine learning, know you're not alone! + +
[True/False] Clustering algorithms are always able to find the "correct" clusters in the data. @@ -86,7 +99,11 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md ***
-This question is designed to test the test-taker's understanding of the limitations of clustering algorithms. Clustering algorithms are heuristics, which means that they do not guarantee to find the "correct" clusters in the data. The results of a clustering algorithm will depend on the distance metric used, the initialization of the algorithm, and the parameters of the algorithm. +Clustering algorithms are helpful tools, but they're not magic. Here's why this statement is false: + +- Clustering isn't about "right" or "wrong": There's often no single "correct" way to group data. Clustering depends on how you measure similarity and the type of patterns you're interested in finding. +- Different setups, different results: The clusters you get can change based on the clustering algorithm you choose, how you measure distances between data points, and even the starting settings of the algorithm. +Key takeaway: Clustering is an exploratory process. It can suggest interesting groupings in your data, but it's up to you to decide if those groupings make sense and are useful for your analysis.
@@ -101,7 +118,10 @@ This question is designed to test the test-taker's understanding of the limitati ***
-This question is designed to test the test-taker's understanding of the difference between clustering and anomaly detection. Clustering algorithms are used to group similar data points together, while anomaly detection algorithms are used to identify data points that are significantly different from the rest of the data. +While clustering algorithms can sometimes help identify potential outliers, they are not specifically designed for this purpose. Here's why: + +- Clustering focuses on grouping: Clustering algorithms aim to find groups of similar data points. Outliers, by definition, don't fit well into any group. +- Outliers might influence clusters: A significant outlier might distort the clustering process, either by forming its own tiny cluster or being forced into a larger cluster where it doesn't truly belong.
@@ -140,18 +160,25 @@ Predicting the labels for new data points is a goal of supervised learning, not Clustering can be used for a variety of tasks, such as: - **Customer segmentation:** Clustering can be used to segment customers into different groups based on their demographics, purchase behavior, or other characteristics. This information can then be used to target marketing campaigns or product development efforts to specific customer segments. -- **Product grouping:** Clustering can be used to group products with similar characteristics, such as price, features, or customer reviews. This information can be used to improve product recommendations or to identify opportunities for cross-selling and up-selling. -- **Image segmentation:** Clustering can be used to segment images into different objects or regions. This information can be used in tasks such as object detection, image classification, and image compression. -- **Anomaly detection:** Clustering can be used to identify anomalous data points that are different from the rest of the data. This information can be used to detect fraud, identify errors in data collection, or predict future events. -- **Medical diagnosis:** Clustering can be used to group patients with similar symptoms or medical histories together. This information can be used to improve the accuracy of medical diagnosis and to develop more personalized treatment plans. -- **Scientific research:** Clustering can be used to identify patterns and relationships in scientific data. This information can be used to advance scientific knowledge and to develop new technologies. - -### Examples of clustering in real-world applications -- **Netflix uses clustering to recommend movies and TV shows to its users.** Netflix clusters its users based on their viewing history and then recommends movies and TV shows to users based on the clusters they belong to. 
-- **Amazon uses clustering to recommend products to its customers.** Amazon clusters its products based on customer reviews and purchase behavior. Amazon then recommends products to customers based on the clusters the products belong to and the customer's past purchase history. -- **Google uses clustering to improve the accuracy of its search results.** Google clusters search results based on the relevance of the results to the search query. Google then displays the most relevant results at the top of the search results page. -- **Banks use clustering to detect fraudulent transactions.** Banks cluster transactions based on their characteristics, such as the amount of money involved, the type of transaction, and the location of the transaction. Banks then flag anomalous transactions as potentially fraudulent. -- **Medical researchers use clustering to identify new biomarkers for diseases.** Medical researchers cluster patients based on their medical histories and symptoms. Researchers then look for patterns in the clusters to identify new biomarkers that can be used to diagnose and treat diseases. +### Applications of Clustering in Biomedical Research + +Clustering is an invaluable machine learning technique with wide-ranging applications in biomedical research. Here are some key areas where it can be used: + +- **Patient Stratification:** Identify distinct subgroups within patient populations based on gene expression profiles, clinical data, or disease biomarkers. This can lead to insights into disease subtypes and more personalized treatment options. 
+ - Specifically in research, ["Use of Latent Class Analysis and k-Means Clustering to Identify Complex Patient Profiles"](https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2774074) employs statistical techniques to categorize patients into specific groups based on their gene expression profiles, clinical data, or biomarkers, allowing for the identification of unique disease subtypes and facilitating personalized treatment options. This approach aligns with patient stratification by utilizing clustering methods to segregate patients into distinct categories, enabling healthcare professionals to tailor interventions based on individualized characteristics and needs. +- **Drug Development:** Clustering can help group compounds based on chemical structure, efficacy, or target interactions. This facilitates the identification of novel drug candidates or the repurposing of existing drugs. + - ["Integration of k-means clustering algorithm with network analysis for drug-target interactions network prediction"](https://www.inderscienceonline.com/doi/abs/10.1504/IJDMB.2018.094776) combines k-means clustering with network analysis to predict drug-target interactions, aiding in the identification of potential drug candidates or repurposing existing drugs by grouping compounds based on their interactions and properties. This study directly aligns with drug development goals by leveraging clustering to categorize compounds and enhance the understanding of their interactions, thereby facilitating the discovery and optimization of therapeutic agents. + +- **Gene Expression Analysis:** Clustering genes with similar expression patterns across different conditions or time points can help uncover regulatory networks and potential therapeutic targets. 
+ - ["Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data"](https://link.springer.com/article/10.1186/s13059-018-1536-8) automates the extraction of co-expressed gene clusters from gene expression data, aiding in the identification of regulatory networks and potential therapeutic targets by clustering genes with similar expression patterns across different conditions or time points. This tool directly aligns with gene expression analysis goals by utilizing clustering to group genes based on their expression profiles, facilitating the discovery of underlying biological mechanisms and potential targets for intervention. + +- **Medical Image Analysis:** Segment medical images (MRI, CT scans) to differentiate tissues, identify tumors and other abnormalities. Clustering can aid in diagnosis and disease tracking. + - ["Diagnosis of Brain Tumor Using Combination of K-Means Clustering and Genetic Algorithm"](http://www.ijmi.ir/index.php/IJMI/article/view/159) utilizes a combination of k-means clustering and genetic algorithms to accurately diagnose brain tumors by segmenting medical images, demonstrating how clustering techniques can aid in medical image analysis to differentiate tissues and identify abnormalities such as tumors, aligning with the objective of leveraging clustering for diagnosis and disease tracking in medical imaging. + +- **Disease-Risk Prediction:** Analyze patient data to cluster individuals based on risk factors and medical history, enabling the prediction of susceptibility to various diseases. + - ["A K-Means Approach to Clustering Disease Progressions"](https://ieeexplore.ieee.org/document/8031156) utilizes k-means clustering to categorize individuals based on disease progression patterns, facilitating disease-risk prediction by analyzing patient data to cluster individuals according to their risk factors and medical histories. 
This study directly relates to the objective of disease-risk prediction by employing clustering techniques to identify distinct groups of patients with similar disease progressions, thereby enabling more accurate predictions of susceptibility to various diseases based on individualized characteristics. + + ## K-Means Clustering Algorithm - The K-Means clustering algorithm works by iteratively assigning data points to clusters based on their distance to the cluster centroids. The cluster centroids are the average values of all the data points in a cluster. @@ -170,7 +197,12 @@ Clustering can be used for a variety of tasks, such as: 5. Repeat steps 3 and 4 until the cluster assignments no longer change ``` +
+Learning connection
+
+To learn more about K-means clustering and for a visual explanation, watch [StatQuest: K-means clustering](https://youtu.be/4b5d3muPQmA?si=KMQxx23Ru8w7GOFP).
+
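The iterative procedure described above (assign each point to the nearest centroid, recompute the centroids, repeat until assignments stop changing) can be sketched in a few lines of NumPy. This is a minimal teaching sketch, not the scikit-learn implementation: the function name `kmeans`, its parameters, and the six sample points are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign each point to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the cluster assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two obvious groups of 2-D points (synthetic).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

In practice, `sklearn.cluster.KMeans` is preferable because it adds smarter initialization and multiple restarts (`n_init`), which make the result less dependent on the starting centroids.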
What is the goal of the K-Means clustering algorithm? @@ -192,6 +224,25 @@ The goal of the K-Means clustering algorithm is to group similar data points tog *** + + + +### Understanding Machine Learning Techniques + +Before diving into the example, it's valuable to understand some key concepts used in machine learning. These techniques help us build more accurate and reliable models for clustering. + +- **Normalization:** Normalization is crucial for scaling the features of the dataset to a uniform range, typically between 0 and 1, ensuring that each feature contributes equally to the clustering process. + + - By ensuring equitable treatment of all features, normalization prevents features with larger magnitudes from dominating distance calculations in clustering algorithms. This fosters the identification of clusters based on similarity across multiple dimensions and enhances the discovery of meaningful patterns and relationships within the data. + +- **Computing Distance from Cluster Centroid:** Calculating the distance from each data point to its assigned cluster centroid provides a quantitative measure of the data point's fit within its cluster. + + - Distance metrics aid in assessing the compactness of clusters and the separation between clusters. In applications, distance calculations play a pivotal role in cluster validation and refinement, quantifying the similarity of data points within clusters and improving the overall efficacy of clustering algorithms in delineating coherent and distinct groups within the dataset. + +- **Visualization:** Visualizing clustering results facilitates intuitive interpretation and assessment of identified clusters. + + - Visual representations, such as scatter plots, enable the identification of inherent data patterns, outliers, and delineation of cluster boundaries. 
In applications, visualization supports decision-making by giving stakeholders a direct view of the data's structure and characteristics. + ### Python Implementation of K-Means Clustering @@ -244,17 +295,26 @@ plt.show() ``` @Pyodide.eval -3. Split the data into training and testing sets: +3. Define a `normalize` function that applies min-max scaling to the columns of a DataFrame `df` listed in `features` and returns the normalized copy. Then call it on every column of `data` and store the result in `normalized_data`. ```python # Normalize dataframe def normalize(df, features): + # Create a copy of the DataFrame to avoid modifying the original data. result = df.copy() + + # Iterate through each feature specified for normalization. for feature_name in features: + # Find the maximum and minimum values of the current feature. max_value = df[feature_name].max() min_value = df[feature_name].min() + + # Normalize the current feature using the min-max scaling formula. + # This ensures that all values of the feature are scaled between 0 and 1. result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value) return result +# Call the normalize function with the entire DataFrame 'data' and all its columns. +# Store the result in 'normalized_data'. normalized_data = normalize(data, data.columns) ``` @Pyodide.eval @@ -263,15 +323,29 @@ normalized_data = normalize(data, data.columns) ```python # Run KMeans kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2) +``` +@Pyodide.eval +5. Fit the model and assign each data point to a cluster: +```python # Predict clusters identified_clusters = kmeans.fit_predict(normalized_data.values) results = normalized_data.copy() results['cluster'] = identified_clusters +``` +@Pyodide.eval -# Compute distance from cluster +6. 
Compute each data point's distance from its cluster centroid: +```python +# Compute distance from cluster. Loop through each data point and calculate the Euclidean distance between the data point and its assigned cluster centroid. distance_from_centroid = [distance.euclidean(val[:-1],kmeans.cluster_centers_[int(val[-1])]) for val in results.values] results['dist'] = distance_from_centroid +``` +@Pyodide.eval + + +7. Visualize the clusters with a scatter plot of 'chol' (Cholesterol) against 'trtbps' (Resting Blood Pressure), colored by cluster, with marker size proportional to the distance from the cluster centroid. +```python results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist') plt.xlabel("Cholesterol") plt.ylabel("Resting Blood Pressure") From 8d6f0443b601b0e390036a3297550e38b387eca5 Mon Sep 17 00:00:00 2001 From: Schwartz Date: Mon, 27 May 2024 15:39:09 -0400 Subject: [PATCH 7/9] Split clustering and extract python code --- python_clustering/python_clustering.md | 359 +++--------------- .../python_clustering_exercise.ipynb | 200 ++++++++++ 2 files changed, 261 insertions(+), 298 deletions(-) create mode 100644 python_clustering/python_clustering_exercise.ipynb diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index 8a41eb7cd..50b22decb 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -70,179 +70,6 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md - - -### What is clustering? -- Clustering is an unsupervised machine learning technique that groups unlabeled data points into clusters based on their similarity. The goal of clustering is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Clustering algorithms work by measuring the similarity between data points and then grouping similar data points together. 
There are many different clustering algorithms, each with its own strengths and weaknesses. Some of the most common clustering algorithms include K-Means clustering, hierarchical clustering, and Gaussian Mixture Models (GMMs). - -
-A little encouragement...
- -As in many fields, machine learning involves a lot of technical language, some of which is unclear, redundant, or downright confusing. -For example: - -**Outcome** variables are also called **response variables**, **dependent variables**, or **labels**. - -**Input** variables are also called **predictors**, **features**, **independent variables**, or even just **variables**. - -To make matters worse, sometimes the same words are used to mean different things in different subfields. -If you find yourself stumbling on vocabulary as you read about machine learning, know you're not alone! - -
- - -[True/False] Clustering algorithms are always able to find the "correct" clusters in the data. - - -[( )] True -[(X)] False -*** -
- -Clustering algorithms are helpful tools, but they're not magic. Here's why this statement is false: - -- Clustering isn't about "right" or "wrong": There's often no single "correct" way to group data. Clustering depends on how you measure similarity and the type of patterns you're interested in finding. -- Different setups, different results: The clusters you get can change based on the clustering algorithm you choose, how you measure distances between data points, and even the starting settings of the algorithm. -Key takeaway: Clustering is an exploratory process. It can suggest interesting groupings in your data, but it's up to you to decide if those groupings make sense and are useful for your analysis. - - -
-*** - - -[True/False] Clustering algorithms can be used to detect outliers in the data. - - -[( )] True -[(X)] False -*** -
- -While clustering algorithms can sometimes help identify potential outliers, they are not specifically designed for this purpose. Here's why: - -- Clustering focuses on grouping: Clustering algorithms aim to find groups of similar data points. Outliers, by definition, don't fit well into any group. -- Outliers might influence clusters: A significant outlier might distort the clustering process, either by forming its own tiny cluster or being forced into a larger cluster where it doesn't truly belong. - - -
-*** - -### Unsupervised vs. Supervised Learning -- **Unsupervised learning** is a type of machine learning where the algorithm is trained on unlabeled data. This means that the data does not have pre-defined labels or categories. The goal of unsupervised learning is to identify patterns and relationships in the data without any prior knowledge of the data. -- **Supervised learning** is a type of machine learning where the algorithm is trained on labeled data. This means that the data has pre-defined labels or categories. The goal of supervised learning is to train a model to predict the labels for new data points. - - - -Which of the following is a goal of supervised learning? - - - -[( )] Identify patterns and relationships in the data -[( )] Group similar data points together -[( )] Detect outliers in the data -[(X)] Predict the labels for new data points -[( )] Understand the underlying structure of the data -*** -
- -Predicting the labels for new data points is a goal of supervised learning, not unsupervised learning. Unsupervised learning algorithms are used to identify patterns and relationships in the data without any prior knowledge of the data. This can be useful for tasks such as market segmentation, fraud detection, and anomaly detection. - - - -
-*** - - - - - -### Applications of clustering in machine learning -Clustering can be used for a variety of tasks, such as: - -- **Customer segmentation:** Clustering can be used to segment customers into different groups based on their demographics, purchase behavior, or other characteristics. This information can then be used to target marketing campaigns or product development efforts to specific customer segments. -### Applications of Clustering in Biomedical Research - -Clustering is an invaluable machine learning technique with wide-ranging applications in biomedical research. Here are some key areas where it can be used : - -- **Patient Stratification:** Identify distinct subgroups within patient populations based on gene expression profiles, clinical data, or disease biomarkers. This can lead to insights into disease subtypes and more personalized treatment options. - - Specifically in research, ["Use of Latent Class Analysis and k-Means Clustering to Identify Complex Patient Profiles"](https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2774074) employs statistical techniques to categorize patients into specific groups based on their gene expression profiles, clinical data, or biomarkers, allowing for the identification of unique disease subtypes and facilitating personalized treatment options. This approach aligns with patient stratification by utilizing clustering methods to segregate patients into distinct categories, enabling healthcare professionals to tailor interventions based on individualized characteristics and needs. -- **Drug Development:** Clustering can help group compounds based on chemical structure, efficacy, or target interactions. This facilitates the identification of novel drug candidates or the repurposing of existing drugs. 
- - ["Integration of k-means clustering algorithm with network analysis for drug-target interactions network prediction"](https://www.inderscienceonline.com/doi/abs/10.1504/IJDMB.2018.094776) combines k-means clustering with network analysis to predict drug-target interactions, aiding in the identification of potential drug candidates or repurposing existing drugs by grouping compounds based on their interactions and properties. This study directly aligns with drug development goals by leveraging clustering to categorize compounds and enhance the understanding of their interactions, thereby facilitating the discovery and optimization of therapeutic agents. - -- **Gene Expression Analysis:** Clustering genes with similar expression patterns across different conditions or time points can help uncover regulatory networks and potential therapeutic targets. - - ["Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data"](https://link.springer.com/article/10.1186/s13059-018-1536-8) automates the extraction of co-expressed gene clusters from gene expression data, aiding in the identification of regulatory networks and potential therapeutic targets by clustering genes with similar expression patterns across different conditions or time points. This tool directly aligns with gene expression analysis goals by utilizing clustering to group genes based on their expression profiles, facilitating the discovery of underlying biological mechanisms and potential targets for intervention. - -- **Medical Image Analysis:** Segment medical images (MRI, CT scans) to differentiate tissues, identify tumors and other abnormalities. Clustering can aid in diagnosis and disease tracking. 
- - ["Diagnosis of Brain Tumor Using Combination of K-Means Clustering and Genetic Algorithm"](http://www.ijmi.ir/index.php/IJMI/article/view/159) utilizes a combination of k-means clustering and genetic algorithms to accurately diagnose brain tumors by segmenting medical images, demonstrating how clustering techniques can aid in medical image analysis to differentiate tissues and identify abnormalities such as tumors, aligning with the objective of leveraging clustering for diagnosis and disease tracking in medical imaging. - -- **Disease-Risk Prediction:** Analyze patient data to cluster individuals based on risk factors and medical history, enabling the prediction of susceptibility to various diseases. - - ["A K-Means Approach to Clustering Disease Progressions"](https://ieeexplore.ieee.org/document/8031156) utilizes k-means clustering to categorize individuals based on disease progression patterns, facilitating disease-risk prediction by analyzing patient data to cluster individuals according to their risk factors and medical histories. This study directly relates to the objective of disease-risk prediction by employing clustering techniques to identify distinct groups of patients with similar disease progressions, thereby enabling more accurate predictions of susceptibility to various diseases based on individualized characteristics. - - - -## K-Means Clustering Algorithm -- The K-Means clustering algorithm works by iteratively assigning data points to clusters based on their distance to the cluster centroids. The cluster centroids are the average values of all the data points in a cluster. - -``` -1. Choose the number of clusters (K): - - This is an important step, as it will determine the outcome of the clustering process. - - There is no one-size-fits-all answer to the question of how to choose K. One approach is to use the elbow method, which involves plotting the within-cluster sum of squares (WCSS) for different values of K. 
- - The elbow point on the plot is the point where the WCSS starts to flatten out, and this is often a good choice for K. -2. Initialize the cluster centroids - - The cluster centroids can be initialized randomly or by choosing K data points from the dataset. -3. Assign each data point to the cluster with the nearest centroid - - The distance between a data point and a cluster centroid can be measured using any distance metric, such as Euclidean distance or Manhattan distance. -4. Recalculate the cluster centroids - - The cluster centroids are recalculated by taking the average of all the data points in each cluster. -5. Repeat steps 3 and 4 until the cluster assignments no longer change - -``` -
-Learning connection
-
-To learn more about K-means clustering and for a visual explanation, watch [StatQuest: K-means clustering](https://youtu.be/4b5d3muPQmA?si=KMQxx23Ru8w7GOFP).
-
-
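The elbow method mentioned in step 1 of the algorithm can be sketched as follows. This is a hedged illustration on synthetic data: the dataset, the range of K values, and the variable names are assumptions for this example. scikit-learn exposes the within-cluster sum of squares (WCSS) of a fitted model as the `inertia_` attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true groups (illustration only).
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.6, random_state=0)

# Fit K-Means for K = 1..6 and record the WCSS for each K.
wcss = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = model.inertia_

for k, value in wcss.items():
    print(f"K={k}: WCSS={value:.1f}")
```

Plotting these values against K (for example with `matplotlib.pyplot.plot`) makes the elbow visible: WCSS drops sharply up to the true number of groups and flattens afterwards. On real data the bend is often less clear, so the elbow is a guide rather than a rule.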
- - -What is the goal of the K-Means clustering algorithm? - - -[( )] To identify the clusters with the highest within-cluster sum of squares (WCSS) -[( )] To identify the clusters with the lowest within-cluster sum of squares (WCSS) -[( )] To identify the clusters with the highest between-cluster sum of squares (BCSS) -[( )] To identify the clusters with the lowest between-cluster sum of squares (BCSS) -[(X)] To group similar data points together -*** -
- -The goal of the K-Means clustering algorithm is to group similar data points together. This is achieved by iteratively assigning data points to clusters based on their distance to the cluster centroids. - - - -
-*** - - - - - -### Understanding Machine Learning Techniques - -Before diving into the example, it's valuable to understand some key concepts used in machine learning. These techniques help us build more accurate and reliable models for clustering. - -- **Normalization:** Normalization is crucial for scaling the features of the dataset to a uniform range, typically between 0 and 1, ensuring that each feature contributes equally to the clustering process. - - - By ensuring equitable treatment of all features, normalization prevents features with larger magnitudes from dominating distance calculations in clustering algorithms. This fosters the identification of clusters based on similarity across multiple dimensions and enhances the discovery of meaningful patterns and relationships within the data. - -- **Computing Distance from Cluster Centroid:** Calculating the distance from each data point to its assigned cluster centroid provides a quantitative measure of the data point's fit within its cluster. - - - Distance metrics aid in assessing the compactness of clusters and the separation between clusters. In applications, distance calculations play a pivotal role in cluster validation and refinement, quantifying the similarity of data points within clusters and improving the overall efficacy of clustering algorithms in delineating coherent and distinct groups within the dataset. - -- **Visualization:** Visualizing clustering results facilitates intuitive interpretation and assessment of identified clusters. - - - Visual representations, such as scatter plots, enable the identification of inherent data patterns, outliers, and delineation of cluster boundaries. In applications, visualization aids in informed decision-making by providing stakeholders with insights into the data's structure and characteristics, fostering actionable insights and informed decisions. 
- ### Python Implementation of K-Means Clustering @@ -251,7 +78,16 @@ This dataset contains various clinical attributes of patients, including their a To implement k-means clustering in Python using Scikit-learn, we can follow these steps: -1. Import the necessary libraries: +1. Import Libraries + +* **numpy (np):** This library provides tools for numerical operations and working with arrays, which are essential for data manipulation in machine learning. +* **pandas (pd):** Pandas is used for data analysis and manipulation, especially with tabular data. It makes it easy to load, clean, and organize your data. +* **matplotlib.pyplot (plt):** Matplotlib is a powerful plotting library for creating graphs and visualizations. We'll use it to visualize our data and clustering results. +* **sklearn.model_selection (train_test_split):** We'll use this function later if we need to split our data into training and testing sets for model evaluation. +* **sklearn.cluster (KMeans):** This is where the heart of our clustering algorithm lies. KMeans is the specific algorithm we'll use to group our data into clusters. +* **scipy.spatial (distance):** Scipy is a broader scientific computing library. The distance module provides functions to calculate distances between points, which we'll use in our KMeans analysis. + + ```python import numpy as np import pandas as pd @@ -263,7 +99,11 @@ from scipy.spatial import distance @Pyodide.eval -2. Load the data: +2. Loading the Data +* `data = pd.read_csv(file)`: This line reads the CSV (Comma-Separated Values) file, which presumably contains your patient data, into a Pandas DataFrame called `data`. DataFrames are like tables, where each row represents a patient, and each column represents a feature (e.g., age, cholesterol). +* `data.info()`: This function gives you a summary of the DataFrame, showing the column names, their data types, and how many non-null values are in each column. This helps you understand the structure of your data. 
+ + ```python @Pyodide.exec import pandas as pd @@ -284,7 +124,9 @@ data.info() ``` -3. Visualize data +3. Visualize Data +This code generates a scatter plot with `chol` (Cholesterol) on the x-axis and `trtbps` (Resting Blood Pressure) on the y-axis. The data points are colored based on the `output` column, using the `viridis` colormap. Labels and a title are added, and then the plot is displayed. + ```python # Create the scatter plot data.plot.scatter(x='chol', y='trtbps', c='output', colormap='viridis') @@ -295,7 +137,12 @@ plt.show() ``` @Pyodide.eval -3. This code defines a function called normalize that performs min-max scaling normalization on a DataFrame df, specifically on the features specified by the features parameter. The normalized DataFrame is returned as the output. Then, it calls this function to normalize all columns of a DataFrame data and assigns the result to a variable named normalized_data. +4. Normalize DataFrame + +* The function `normalize(df, features)` is defined to perform min-max normalization of the features listed in `features` within the DataFrame `df`. It creates a copy `result` of the DataFrame and iterates over each feature to scale its values to the range [0, 1]. The normalized DataFrame `result` is returned. +* The `normalize` function is then applied to the `data` DataFrame to normalize all columns, and the results are stored in `normalized_data`. + + ```python # Normalize dataframe def normalize(df, features): @@ -319,32 +166,49 @@ normalized_data = normalize(data, data.columns) ``` @Pyodide.eval -4. Train the clustering model and visualize: +5. Run KMeans +* This line creates a KMeans object. + + * `n_clusters = 2` tells KMeans to find two clusters in your data. + * `max_iter = 500` sets a maximum of 500 iterations for the algorithm to converge. + * `n_init = 40` means the algorithm will be run 40 times with different random initializations, and the best result will be chosen. 
+ * `random_state = 2` ensures reproducibility; you'll get the same clustering results each time you run the code. + + ```python # Run KMeans kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2) ``` @Pyodide.eval -5. Train the clustering model and visualize: +6. Predict Clusters +* `kmeans.fit_predict()` does two things: + 1. It fits the KMeans model to your normalized data, meaning it finds the cluster centers. + 2. It predicts which cluster each data point belongs to, returning an array `identified_clusters` where each element corresponds to the cluster assignment of a data point. + +* We create a copy `results` of the `normalized_data` and add a new column `cluster` to it, storing the identified cluster labels. + + ```python -# Predict clusters identified_clusters = kmeans.fit_predict(normalized_data.values) results = normalized_data.copy() results['cluster'] = identified_clusters ``` @Pyodide.eval -6. Train the clustering model and visualize: +7. Compute Distance from Cluster Centroid +* This line calculates the Euclidean distance between each data point and its assigned cluster centroid. This distance is stored in the list `distance_from_centroid` and added as a new column `dist` in the results DataFrame. + ```python -# Compute distance from cluster. Loop through each data point and calculate the Euclidean distance between the data point and its assigned cluster centroid. distance_from_centroid = [distance.euclidean(val[:-1],kmeans.cluster_centers_[int(val[-1])]) for val in results.values] results['dist'] = distance_from_centroid ``` @Pyodide.eval -7. Train the clustering model and visualize. Scatter plot of 'chol' (Cholesterol) against 'trtbps' (Resting Blood Pressure), colored by cluster, with marker size proportional to the distance from the cluster centroid. +8. 
Train the clustering model and visualize +* Creates a scatter plot of `chol` (Cholesterol) against `trtbps` (Resting Blood Pressure), colored by the identified clusters, with marker size proportional to the distance from the cluster centroid. + ```python results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist') plt.xlabel("Cholesterol") @@ -355,134 +219,33 @@ plt.show() -### Applying K-Means Clustering to a Real-World Dataset - -- **Loading and cleaning the data:** The first step is to load the data into Python and clean it as needed. This may involve removing outliers, handling missing values, and scaling the data. -- **Scaling the data:** It is important to scale the data before applying K-Means clustering. This helps to ensure that all features have equal importance in the clustering process. -- **Choosing the number of clusters (K):** There is no one-size-fits-all answer to the question of how to choose the number of clusters (K). One approach is to use the elbow method, which involves plotting the within-cluster sum of squares (WCSS) for different values of K. The elbow point on the plot is the point where the WCSS starts to flatten out, and this is often a good choice for K. -- **Training and evaluating the K-Means model:** Once you have chosen the number of clusters, you can train the K-Means model on the data. You can then evaluate the model by computing the silhouette score. The silhouette score is a measure of how well the data points are clustered, and a higher score indicates better clustering. -- **Visualizing the clusters:** Once you have trained and evaluated the K-Means model, you can visualize the clusters using a scatter plot. This can help you to understand how the data is clustered and to identify any outliers. - -### Important Notes -Clustering is a machine learning technique that groups unlabeled data points into clusters based on their similarity. 
It is a powerful tool that can be used to solve a variety of problems, such as customer segmentation, product grouping, and anomaly detection. However, clustering also has some limitations. Here are some of the most important limitations of clustering: - -- **Sensitivity to the initialization:** Many clustering algorithms, such as k-means clustering, are sensitive to the initialization of the cluster centroids. If the cluster centroids are not initialized correctly, the clustering algorithm may not be able to find the optimal clusters. -- **Difficulty in choosing the number of clusters:** K-means clustering requires the user to specify the number of clusters (k) in advance. However, there is no one-size-fits-all answer to the question of how to choose k. Choosing the wrong number of clusters can lead to inaccurate results. -- **Inability to handle outliers:** Clustering algorithms are often sensitive to outliers, which are data points that are significantly different from the rest of the data. Outliers can have a large impact on the clustering results and can lead to inaccurate clusters. -- **Difficulty in interpreting the results:** It can be difficult to interpret the results of clustering algorithms, especially when the data is high-dimensional. It can be difficult to understand what the clusters represent and why the data points were assigned to the clusters they were assigned to. - - - -Which of the following techniques can be used to mitigate the sensitivity of clustering algorithms to the initialization? - - - -[( )] Running the clustering algorithm multiple times with different initializations and selecting the best results -[( )] Using a more robust clustering algorithm that is less sensitive to the initialization -[( )] Preprocessing the data to remove outliers -[(X)] All of the above -*** -
- -All of the above techniques can be used to mitigate the sensitivity of clustering algorithms to the initialization. - - - -
-*** - -### Real World Code Example -This dataset, derived and refined from a landmark study published in the New England Journal of Medicine in 1993, investigates the effectiveness of sulindac treatment in individuals with familial adenomatous polyposis (FAP), a hereditary condition characterized by the development of numerous adenomatous polyps in the colon and rectum. Enhanced from the original datasets "polyps" and "polyps3" in the {HSAUR} package, this dataset includes crucial variables such as participant ID, sex, age, baseline polyp count, assigned treatment (sulindac or placebo), and polyp counts at 3 and 12 months post-treatment. These enhancements involved meticulous referencing of the original paper and offer improved granularity and completeness for analyzing the impact of sulindac treatment on polyp progression in FAP patients. This dataset serves as a valuable resource for further research and analysis in the field of gastrointestinal medicine and pharmacology. - -1. Install Packages: -```python @Pyodide.exec - -import pandas as pd -import io -from pyodide.http import open_url -from sklearn.cluster import KMeans -from sklearn.preprocessing import StandardScaler -import matplotlib.pyplot as plt -``` - -2. Load the data: -```python -# Load dataset and read to pandas dataframe -url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/polyps.csv" -url_contents = open_url(url) -text = url_contents.read() -file = io.StringIO(text) -df = pd.read_csv(file) - -# Analyze data and features -df.info() - -# Select features for clustering -features = ['age', 'baseline', 'number3m', 'number12m'] -X = df[features] - -# Fill missing values with the mean of each column -X.fillna(X.mean(), inplace=True) - -# Standardize the feature values -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) -``` -@Pyodide.eval - - -3. 
Cluster Data: -```python -# Define the number of clusters -num_clusters = 3 - -# Apply KMeans clustering -kmeans = KMeans(n_clusters=num_clusters, random_state=42) -kmeans.fit(X_scaled) - -# Assign cluster labels to the original dataframe -df['cluster'] = kmeans.labels_ -``` -@Pyodide.eval +## Conclusion -4. Visualize Clusters: -```python -# Visualize clusters for 'number3m' vs 'number12m' -plt.figure(figsize=(10, 8)) -colors = ['red', 'blue', 'green'] # Change colors as needed for more clusters - -for i in range(num_clusters): - cluster_data = df[df['cluster'] == i] - plt.scatter(cluster_data['number3m'], cluster_data['number12m'], - color=colors[i], label=f'Cluster {i}') - -plt.xlabel('Number of Polyps at 3 Months') -plt.ylabel('Number of Polyps at 12 Months') -plt.title('K-Means Clustering of Polyp Data: Number of Polyps at 3 Months vs Number of Polyps at 12 Months') -plt.legend() -plt.show() -``` -@Pyodide.eval +Through this lesson, you've gained a solid foundation in clustering, a cornerstone of unsupervised machine learning. You've learned how the K-Means algorithm works, its strengths and limitations, and most importantly, how to harness it within Python's powerful data science ecosystem. -If the K-Means algorithm identified distinct clusters with minimal overlap, it suggests there might be three underlying patient groups regarding polyp count progression: +Here's a summary of key takeaways to keep in mind: -- **Cluster 1 (Low Progression):** This cluster might represent participants who have a relatively low number of polyps at 3 months and a stable or slightly increased number at 12 months. This could be associated with effective treatment or naturally slow polyp growth. -- **Cluster 2 (Moderate Progression):** This cluster could include participants with a moderate number of polyps at 3 months and a somewhat steeper increase by 12 months. This might indicate a less effective treatment or a faster natural growth rate for polyps. 
-- **Cluster 3 (High Progression):** This cluster might contain participants with a high number of polyps at 3 months and a substantial increase by 12 months. This could be linked to factors like a particularly aggressive polyp type or treatment resistance. +* **Clustering Unveils Hidden Structures:** K-Means can reveal meaningful groupings within your data that might not be immediately apparent. This is crucial for tasks like customer segmentation, anomaly detection, and even preliminary exploration before applying more complex models. +* **Real-World Applications Abound:** Clustering isn't just theoretical. We've seen how it can be used in medical diagnostics (predicting heart disease risk based on patient attributes) and in pharmaceutical research (identifying patient subgroups responding differently to treatments). This demonstrates the algorithm's versatility across domains. +* **Data Preprocessing is Key:** The quality of your clustering results depends heavily on how you prepare your data. Normalization and feature scaling are often essential steps to ensure that all features contribute equally to the clustering process. +* **K-Means Isn't Perfect:** Remember that K-Means has its limitations. It assumes clusters are spherical and of equal size, which isn't always the case in real-world data. Additionally, choosing the optimal number of clusters (K) requires careful consideration and experimentation. -**While clustering provides valuable insights into potential patient subgroups, further analysis of treatment effects and other relevant features is necessary to fully understand the underlying factors influencing polyp count progression.** - +**Looking Ahead: Beyond K-Means** +While K-Means is a great starting point, the world of clustering is vast. As you progress in your machine learning journey, you'll encounter more sophisticated algorithms like DBSCAN, hierarchical clustering, and Gaussian mixture models. Each has its own strengths and use cases. 
+Consider exploring these areas to expand your clustering toolkit: -## Conclusion +* **Dimensionality Reduction:** Techniques like PCA can help visualize high-dimensional clustered data. +* **Cluster Evaluation:** Learn metrics like silhouette score to assess the quality of your clusters objectively. +* **Ensemble Clustering:** Combining multiple clustering algorithms can often lead to more robust and accurate results. -At the end of the lesson, students should have a good understanding of the concept of clustering and how to implement the K-Means clustering algorithm in Python. They should also be able to apply K-Means clustering to real-world datasets to identify patterns and insights. +The knowledge you've gained here equips you to tackle a wide range of data analysis challenges. By applying clustering thoughtfully and critically, you can unlock valuable insights and drive data-driven decision-making. ## Additional Resources diff --git a/python_clustering/python_clustering_exercise.ipynb b/python_clustering/python_clustering_exercise.ipynb new file mode 100644 index 000000000..8433e3130 --- /dev/null +++ b/python_clustering/python_clustering_exercise.ipynb @@ -0,0 +1,200 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Introduction" + ], + "metadata": { + "id": "JPi5A3zdglwx" + } + }, + { + "cell_type": "markdown", + "source": [ + "**Real World Code Example: Analyzing Polyp Progression in FAP Patients**\n", + "\n", + "This notebook investigates the effectiveness of sulindac treatment in individuals with familial adenomatous polyposis (FAP) using a refined dataset based on a landmark study published in the New England Journal of Medicine in 1993. 
We'll use K-Means clustering to explore potential subgroups of patients based on their polyp progression over time.\n", + "\n", + "**Key Variables:**\n", + "\n", + "* `age`: Patient's age\n", + "* `baseline`: Baseline polyp count\n", + "* `number3m`: Polyp count at 3 months post-treatment\n", + "* `number12m`: Polyp count at 12 months post-treatment" + ], + "metadata": { + "id": "sZSi_qPNgomr" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Install and Import" + ], + "metadata": { + "id": "kemfT40hg05U" + } + }, + { + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "import io\n", + "from pyodide.http import open_url\n", + "from sklearn.cluster import KMeans\n", + "from sklearn.preprocessing import StandardScaler\n", + "import matplotlib.pyplot as plt" + ], + "metadata": { + "id": "jb2mmZaPhANS" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Load and Prepare Data" + ], + "metadata": { + "id": "Qls_4siSg2uu" + } + }, + { + "cell_type": "code", + "source": [ + "# Load data from GitHub\n", + "url = \"https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/polyps.csv\"\n", + "url_contents = open_url(url)\n", + "text = url_contents.read()\n", + "file = io.StringIO(text)\n", + "df = pd.read_csv(file)\n", + "\n", + "# Print data information\n", + "print(df.info())\n", + "\n", + "# Select features for clustering\n", + "features = ['age', 'baseline', 'number3m', 'number12m']\n", + "X = df[features]\n", + "\n", + "# Fill missing values with the mean\n", + "X.fillna(X.mean(), inplace=True)\n", + "\n", + "# Standardize features\n", + "scaler = StandardScaler()\n", + "X_scaled = scaler.fit_transform(X)" + ], + "metadata": { + "id": "B5HVt_INhB9J" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Cluster Data" + ], + "metadata": { + "id": "z4yBbkVVg4cW" + } + }, + { + "cell_type": "code", + "source": [ + 
"# Define number of clusters\n", + "num_clusters = 3\n", + "\n", + "# Apply K-Means clustering\n", + "kmeans = KMeans(n_clusters=num_clusters, random_state=42)\n", + "kmeans.fit(X_scaled)\n", + "\n", + "# Assign cluster labels\n", + "df['cluster'] = kmeans.labels_" + ], + "metadata": { + "id": "7-VYpcyXhDaq" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Visualize Clusters" + ], + "metadata": { + "id": "fEdnjjTlg6Pe" + } + }, + { + "cell_type": "code", + "source": [ + "# Visualize clusters\n", + "plt.figure(figsize=(10, 8))\n", + "colors = ['red', 'blue', 'green']\n", + "\n", + "for i in range(num_clusters):\n", + " cluster_data = df[df['cluster'] == i]\n", + " plt.scatter(cluster_data['number3m'], cluster_data['number12m'],\n", + " color=colors[i], label=f'Cluster {i}')\n", + "\n", + "plt.xlabel('Number of Polyps at 3 Months')\n", + "plt.ylabel('Number of Polyps at 12 Months')\n", + "plt.title('K-Means Clustering: Polyp Progression (3 vs. 12 Months)')\n", + "plt.legend()\n", + "plt.show()" + ], + "metadata": { + "id": "HUAP2uG0hE7j" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Interpretation and Further Analysis" + ], + "metadata": { + "id": "TesjSE05g8Ea" + } + }, + { + "cell_type": "markdown", + "source": [ + "The K-Means clustering results suggest potential subgroups of patients based on polyp progression patterns:\n", + "\n", + "* **Cluster 1 (Low Progression):** Potentially stable or slow polyp growth.\n", + "* **Cluster 2 (Moderate Progression):** Some increase in polyps over time.\n", + "* **Cluster 3 (High Progression):** Substantial increase in polyps over time.\n", + "\n", + "**Further Analysis (Not Shown):**\n", + "* Investigate differences in treatment (sulindac vs. 
placebo) between clusters.\n", + "* Explore other patient characteristics (e.g., age, sex) within each cluster.\n", + "* Consider alternative clustering methods or a different number of clusters.\n", + "\n", + "\n", + "Remember, clustering is exploratory. Additional analysis is needed to confirm these patterns and understand the underlying factors influencing polyp progression." + ], + "metadata": { + "id": "XuwxBwNshH5r" + } + } + ] +} \ No newline at end of file From a9aade13c61052527369dc964c8ee3aa3084add3 Mon Sep 17 00:00:00 2001 From: Schwartz Date: Sun, 23 Jun 2024 20:42:44 -0400 Subject: [PATCH 8/9] Updated changes based off Elizabeth's comments --- python_clustering/python_clustering.md | 256 +++++++++++++++++++++++-- 1 file changed, 236 insertions(+), 20 deletions(-) diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index 50b22decb..00438ed9b 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -2,7 +2,7 @@ author: Daniel Schwartz email: des338@drexel.edu -version: 0.0.0 +version: 1.0.0 current_version_description: Initial version module_type: standard docs_version: 3.0.0 @@ -68,17 +68,37 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md +## Summary of Key Concepts in Clustering + +- **Clustering Definition:** Clustering is an unsupervised machine learning technique used to group unlabeled data points into clusters based on their similarity. The goal is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Common algorithms include K-Means, hierarchical clustering, and Gaussian Mixture Models. + +- **Unsupervised vs. Supervised Learning:** Clustering falls under unsupervised learning, where algorithms are trained on unlabeled data to identify patterns and relationships without prior knowledge. 
Supervised learning, on the other hand, involves training on labeled data to predict labels for new data points. + +- **Applications:** Clustering finds applications in various fields such as customer segmentation, biomedical research, drug development, gene expression analysis, medical image analysis, and disease-risk prediction. + +- **K-Means Clustering Algorithm:** K-Means works by iteratively assigning data points to clusters based on their distance to cluster centroids. Key steps include choosing the number of clusters (K), initializing centroids, assigning data points, recalculating centroids, and iterating until convergence. + +- **Understanding Techniques:** Techniques like normalization, computing distances from cluster centroids, and visualization aid in building accurate clustering models and interpreting results. + +- **Challenges and Limitations:** Challenges include sensitivity to initialization, difficulty in choosing the number of clusters, handling outliers, and interpreting results in high-dimensional data. + +- **Mitigating Sensitivity:** Techniques like running the algorithm multiple times with different initializations, using robust algorithms, and preprocessing data help mitigate sensitivity to initialization. + +- **Conclusion:** Clustering is a powerful tool with diverse applications, but it's essential to understand its limitations and challenges. With foundational knowledge in clustering techniques, one can explore advanced methods and make informed decisions in data analysis and machine learning endeavors. 
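
The K-Means steps named in the summary above (choose K, initialize centroids, assign points, recalculate centroids, iterate until convergence) can be sketched in plain NumPy. This is a minimal illustration, not the scikit-learn implementation used later in the lesson; the function name `kmeans_sketch`, its parameters, and the six toy points are all assumptions made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Hypothetical helper: a bare-bones K-Means loop for illustration only."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster happens to be empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up toy data: two tight, well-separated groups of three points each
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

In practice, scikit-learn's `KMeans` adds smarter initialization and repeated restarts on top of this basic loop, which is why the lesson uses it rather than a hand-written version.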
-### Python Implementation of K-Means Clustering +## Python Implementation of K-Means Clustering This dataset contains various clinical attributes of patients, including their age, sex, chest pain type (cp), resting blood pressure (trtbps), serum cholesterol level (chol), fasting blood sugar (fbs) level, resting electrocardiographic results (restecg), maximum heart rate achieved (thalachh), exercise-induced angina (exng), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slp), number of major vessels (caa) colored by fluoroscopy, thalassemia (thall) type, and the presence of heart disease (output). The data seems to be related to the diagnosis of heart disease, with the output variable indicating whether a patient has heart disease (1) or not (0). Each row represents a different patient, with their respective clinical characteristics recorded. To implement k-means clustering in Python using Scikit-learn, we can follow these steps: -1. Import Libraries +### 1. Import Libraries + +**Description:** +This step imports essential libraries needed for data manipulation, analysis, and visualization, as well as the KMeans clustering algorithm. * **numpy (np):** This library provides tools for numerical operations and working with arrays, which are essential for data manipulation in machine learning. * **pandas (pd):** Pandas is used for data analysis and manipulation, especially with tabular data. It makes it easy to load, clean, and organize your data. @@ -87,6 +107,8 @@ To implement k-means clustering in Python using Scikit-learn, we can follow thes * **sklearn.cluster (KMeans):** This is where the heart of our clustering algorithm lies. KMeans is the specific algorithm we'll use to group our data into clusters. * **scipy.spatial (distance):** Scipy is a broader scientific computing library. The distance module provides functions to calculate distances between points, which we'll use in our KMeans analysis. 
+**Why is it important:** +These libraries provide the foundational tools and functions required to perform data preprocessing, clustering, and visualization. Without them, we wouldn't be able to efficiently handle the data or perform the clustering analysis. ```python import numpy as np @@ -98,11 +120,23 @@ from scipy.spatial import distance ``` @Pyodide.eval +**Output:** +There is no direct output for this step, as it is focused on importing necessary libraries. However, successful execution without errors indicates that the libraries are correctly imported and ready for use. + + + + +### 2. Loading the Data + +**Description:** +This step involves loading the patient data from a CSV file into a Pandas DataFrame and then examining the structure of the data. -2. Loading the Data * `data = pd.read_csv(file)`: This line reads the CSV (Comma-Separated Values) file, which presumably contains your patient data, into a Pandas DataFrame called `data`. DataFrames are like tables, where each row represents a patient, and each column represents a feature (e.g., age, cholesterol). * `data.info()`: This function gives you a summary of the DataFrame, showing the column names, their data types, and how many non-null values are in each column. This helps you understand the structure of your data. +**Why it's important:** +Understanding the structure of your data is crucial before performing any data manipulation or analysis. It helps identify any missing values, understand data types, and get a general overview of the dataset. + ```python @Pyodide.exec @@ -124,8 +158,19 @@ data.info() ``` -3. Visualize Data -This code generates a scatter plot with `chol` (Cholesterol) on the x-axis and `trtbps` (Resting Blood Pressure) on the y-axis. The data points are colored based on the `output` column, using the `viridis` colormap. Labels and a title are added, and then the plot is displayed. 
+**Output:** + +`data.info()` gives a summary of the DataFrame, including the number of non-null entries for each column and their data types. +`print(data.head())` displays the first few rows of the DataFrame to give learners a feel for what the data looks like. + + + +### 3. Visualize Data +**Description:** This code generates a scatter plot with `chol` (Cholesterol) on the x-axis and `trtbps` (Resting Blood Pressure) on the y-axis. The data points are colored based on the `output` column, using the `viridis` colormap. Labels and a title are added, and then the plot is displayed. + + +**Why it's important:** +Plotting the raw data before clustering shows how the points are distributed across the two features, whether any groups or outliers are visible, and gives a baseline for comparing against the cluster-colored plot produced later. ```python # Create the scatter plot @@ -137,11 +182,21 @@ plt.show() ``` @Pyodide.eval -4. Normalize DataFrame + +**Output:** +By adding the `print(data.head())` statement, you can see the first few rows of the data, which helps understand the dataset's structure and the columns being used in the plot. + + +### 4. Normalize DataFrame + +**Description:** * The function `normalize(df, features)` is defined to perform min-max normalization of the features listed in `features` within the DataFrame `df`. It creates a copy `result` of the DataFrame and iterates over each feature to scale its values to the range [0, 1]. The normalized DataFrame `result` is returned. * The `normalize` function is then applied to the `data` DataFrame to normalize all columns, and the results are stored in `normalized_data`. +**Why it is important:** +Normalization is crucial because it scales the data to a common range without distorting differences in the ranges of values. This ensures that no single feature dominates the clustering algorithm due to its scale, leading to more meaningful and comparable results. 
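
Min-max normalization, as described above, rescales each feature value $x$ using the smallest and largest values observed for that feature. A sketch of the formula, where $x_{\min}$ and $x_{\max}$ are notation introduced here for illustration:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

For a hypothetical feature that ranges from 100 to 300, a value of 150 maps to $(150 - 100) / (300 - 100) = 0.25$, the minimum maps to 0, and the maximum maps to 1.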
+ ```python # Normalize dataframe @@ -163,11 +218,24 @@ def normalize(df, features): # Call the normalize function with the entire DataFrame 'data' and all its columns. # Store the result in 'normalized_data'. normalized_data = normalize(data, data.columns) + +# Print the normalized data to see the transformed values +print(normalized_data) ``` @Pyodide.eval -5. Run KMeans -* This line creates a KMeans object. +**Output:** +This code performs min-max normalization on the dataset and prints the resulting normalized_data. The output will show the scaled values of each feature, ensuring that all values are between 0 and 1. This step is critical for ensuring that the clustering algorithm treats each feature equally. + + + + + +### 5. Run KMeans +**Description:** This line creates a KMeans object. + +**Why this is important:** +The KMeans algorithm is a popular clustering method that partitions the data into distinct groups (clusters) based on feature similarity. By configuring the parameters, we can control the behavior of the algorithm and ensure consistent results. * `n_clusters = 2` tells KMeans to find two clusters in your data. * `max_iter = 500` sets a maximum of 500 iterations for the algorithm to converge. @@ -176,38 +244,91 @@ normalized_data = normalize(data, data.columns) ```python -# Run KMeans -kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2) +# Create KMeans object +kmeans = KMeans(n_clusters=2, max_iter=500, n_init=40, random_state=2) +print("KMeans object created with the following parameters:") +print(f"Number of clusters: {kmeans.n_clusters}") +print(f"Maximum iterations: {kmeans.max_iter}") +print(f"Number of initializations: {kmeans.n_init}") +print(f"Random state: {kmeans.random_state}") + ``` @Pyodide.eval -6. Predict Clusters +**Output:** +Since the KMeans object creation itself does not produce output, the impact of this step will be evident in the following steps where we fit the model and predict clusters. 
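To illustrate what the four parameters control, the following is a simplified NumPy sketch of the K-Means loop. It is not scikit-learn's implementation, which uses a different default initialization and further optimizations; the function `simple_kmeans` and the synthetic two-blob data are assumptions made purely for illustration.

```python
import numpy as np

def simple_kmeans(X, n_clusters=2, max_iter=500, n_init=40, random_state=2):
    """Simplified K-Means: returns (labels, centers, inertia) of the best run."""
    rng = np.random.default_rng(random_state)  # fixed seed -> repeatable results
    best = None
    for _ in range(n_init):  # n_init independent restarts
        # Initialize centroids by picking distinct data points at random.
        centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
        for _ in range(max_iter):  # cap on refinement steps per restart
            # Assign each point to its nearest centroid.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Move each centroid to the mean of its points (keep it if empty).
            new_centers = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(n_clusters)
            ])
            if np.allclose(new_centers, centers):
                break  # converged
            centers = new_centers
        # Inertia: sum of squared distances to the assigned centroid.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        inertia = float((d[np.arange(len(X)), labels] ** 2).sum())
        if best is None or inertia < best[2]:
            best = (labels, centers, inertia)
    return best

# Synthetic data: two well-separated blobs in the unit square.
data_rng = np.random.default_rng(0)
X = np.vstack([
    data_rng.normal(loc=0.2, scale=0.03, size=(20, 2)),
    data_rng.normal(loc=0.8, scale=0.03, size=(20, 2)),
])

labels, centers, inertia = simple_kmeans(X)
labels_again, _, _ = simple_kmeans(X)
print(np.array_equal(labels, labels_again))  # same seed, same assignments
```

In this sketch, `n_init` restarts guard against a poor random start by keeping the run with the lowest inertia, `max_iter` bounds the work per restart, and the seed makes the whole procedure repeatable. The same parameter names play these roles in scikit-learn's `KMeans`.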
+ + + + + +### 6. Predict Clusters +**Description:** + * `kmeans.fit_predict()` does two things: + 1. It fits the KMeans model to your normalized data, meaning it finds the cluster centers. 2. It predicts which cluster each data point belongs to, returning an array `identified_clusters` where each element corresponds to the cluster assignment of a data point. * We create a copy `results` of the `normalized_data` and add a new column `cluster` to it, storing the identified cluster labels. +**Why this is important:** +Fitting the KMeans model to the data and predicting clusters are crucial steps in the clustering process. By assigning each data point to a cluster, we can analyze patterns and group similar data points together. This can reveal underlying structures in the data and help in further analysis or decision-making processes. + ```python +# Fit the KMeans model to the normalized data and predict the clusters identified_clusters = kmeans.fit_predict(normalized_data.values) + +# Create a copy of the normalized data to store the results results = normalized_data.copy() + +# Add the identified cluster labels as a new column 'cluster' in the results DataFrame results['cluster'] = identified_clusters + +# Print the results to observe the DataFrame with the cluster assignments +print(results.head()) ``` @Pyodide.eval -7. Compute Distance from Cluster Centroid -* This line calculates the Euclidean distance between each data point and its assigned cluster centroid. This distance is stored in the list `distance_from_centroid` and added as a new column `dist` in the results DataFrame. +**Output:** +The output will be a preview of the first few rows of the results DataFrame, which now includes the original normalized data along with the new cluster column. In this output: + +* Each row corresponds to a data point (e.g., a patient's data in a medical dataset). +* The columns represent the normalized features (e.g., age, sex, cp, etc.). 
+* The `cluster` column indicates the cluster assignment for each data point, with values such as 0 or 1 representing different clusters. +* This output shows how the data points have been grouped into clusters by the KMeans algorithm. + + + + +### 7. Compute Distance from Cluster Centroid +**Description:** This line calculates the Euclidean distance between each data point and its assigned cluster centroid. This distance is stored in the list `distance_from_centroid` and added as a new column `dist` in the results DataFrame. + +**Why this is important:** +Computing the distance from each data point to its cluster centroid provides insight into how well the data points are clustered around their centroids. It helps assess the compactness of clusters and can be useful for evaluating the quality of the clustering. ```python -distance_from_centroid = [distance.euclidean(val[:-1],kmeans.cluster_centers_[int(val[-1])]) for val in results.values] +# Calculate the Euclidean distance between each data point and its assigned cluster centroid +distance_from_centroid = [distance.euclidean(val[:-1], kmeans.cluster_centers_[int(val[-1])]) for val in results.values] + +# Add the computed distances as a new column 'dist' in the results DataFrame results['dist'] = distance_from_centroid + +# Print the results to observe the DataFrame with the distance values +print(results.head()) ``` @Pyodide.eval +**Output:** +The output will display the first few rows of the results DataFrame with the newly added `dist` column, which holds the distance of each data point from its assigned cluster centroid. This output allows learners to understand how the distances are calculated and see the impact of the clustering on the data. + -8. Train the clustering model and visualize -* Creates a scatter plot of `chol` (Cholesterol) against `trtbps` (Resting Blood Pressure), colored by the identified clusters, with marker size proportional to the distance from the cluster centroid. +### 8. 
Train the clustering model and visualize +**Description:** Creates a scatter plot of `chol` (Cholesterol) against `trtbps` (Resting Blood Pressure), colored by the identified clusters, with marker size proportional to the distance from the cluster centroid. + +**Why this is important:** +Visualization is crucial for understanding clustering results. By plotting the data points with identified clusters, we can visually inspect how well the clustering algorithm has grouped similar data points together. Additionally, using marker size to represent the distance from the cluster centroid provides insights into the compactness of each cluster. ```python results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist') @@ -217,8 +338,42 @@ plt.show() ``` @Pyodide.eval - +**Output:** +The output is a scatter plot where each data point is represented by a marker. The markers are colored based on the identified clusters, and their sizes vary depending on the distance from the cluster centroid. This visualization allows learners to visually inspect how the data points are grouped into clusters and how compact each cluster is. + + +## Review your knowledge + +```python +from sklearn.cluster import KMeans + +# Create a KMeans instance with ____ clusters +kmeans = KMeans(n_clusters=____) + +# Fit the model to the data +kmeans.fit(____) + +# Get the cluster centroids +centroids = kmeans.cluster_centers_ + +# Predict the cluster labels for the data points +labels = kmeans.predict(____) +``` + + +Fill in the blanks to implement the K-Means clustering algorithm in Python: + +[( )] `k`, `k`, `X` +[( )] `n_clusters`, `K`, `data` +[(X)] `K`, `n_clusters`, `data` +[( )] `data`, `n_clusters`, `K` +*** +
+ +This question tests your understanding of how to set up K-Means clustering with the scikit-learn library in Python. To answer correctly, you need to identify which blanks take the number of clusters and which take the dataset. In the correct option, "K, n_clusters, data," the number of clusters (`K`) is supplied through the `n_clusters` parameter of `KMeans`, and the dataset (`data`) is what gets passed to `fit` and `predict`. + +
@@ -227,6 +382,7 @@ plt.show() Through this lesson, you've gained a solid foundation in clustering, a cornerstone of unsupervised machine learning. You've learned how the K-Means algorithm works, its strengths and limitations, and most importantly, how to harness it within Python's powerful data science ecosystem. +### Key Takeaways Here's a summary of key takeaways to keep in mind: * **Clustering Unveils Hidden Structures:** K-Means can reveal meaningful groupings within your data that might not be immediately apparent. This is crucial for tasks like customer segmentation, anomaly detection, and even preliminary exploration before applying more complex models. @@ -235,8 +391,7 @@ Here's a summary of key takeaways to keep in mind: * **K-Means Isn't Perfect:** Remember that K-Means has its limitations. It assumes clusters are spherical and of equal size, which isn't always the case in real-world data. Additionally, choosing the optimal number of clusters (K) requires careful consideration and experimentation. -**Looking Ahead: Beyond K-Means** - +### Beyond K-Means While K-Means is a great starting point, the world of clustering is vast. As you progress in your machine learning journey, you'll encounter more sophisticated algorithms like DBSCAN, hierarchical clustering, and Gaussian mixture models. Each has its own strengths and use cases. Consider exploring these areas to expand your clustering toolkit: @@ -249,6 +404,67 @@ The knowledge you've gained here equips you to tackle a wide range of data analy ## Additional Resources +### Full Code Implementation + +This section consolidates all the code from this module into a single cell block, for easy copying and pasting by those who want to run the entire process quickly. 
While this single block of code isn't designed as a step-by-step educational tool, it serves as a convenient reference for future use and helps streamline the process for those already familiar with the concepts. Below is the complete code implementation: + +```python +# Import Libraries +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from sklearn.model_selection import train_test_split +from sklearn.cluster import KMeans +from scipy.spatial import distance +import io +from pyodide.http import open_url + +# Load Data +url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/heart.csv" +url_contents = open_url(url) +text = url_contents.read() +file = io.StringIO(text) +data = pd.read_csv(file) +data.info() + +# Visualize Data +data.plot.scatter(x='chol', y='trtbps', c='output', colormap='viridis') +plt.xlabel("Cholesterol") +plt.ylabel("Resting Blood Pressure") +plt.title("Scatter Plot of Cholesterol vs. Blood Pressure") +plt.show() + +# Normalize DataFrame +def normalize(df, features): + result = df.copy() + for feature_name in features: + max_value = df[feature_name].max() + min_value = df[feature_name].min() + result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value) + return result + +normalized_data = normalize(data, data.columns) + +# Run KMeans +kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2) + +# Predict Clusters +identified_clusters = kmeans.fit_predict(normalized_data.values) +results = normalized_data.copy() +results['cluster'] = identified_clusters + +# Compute Distance from Cluster Centroid +distance_from_centroid = [distance.euclidean(val[:-1], kmeans.cluster_centers_[int(val[-1])]) for val in results.values] +results['dist'] = distance_from_centroid + +# Visualize Clusters +results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist') +plt.xlabel("Cholesterol") +plt.ylabel("Resting Blood Pressure") 
+plt.show() +``` +@Pyodide.eval + ## Feedback @feedback From 98ff10500fbddabf253982d37ebc671768417ad5 Mon Sep 17 00:00:00 2001 From: drelliche <99294374+drelliche@users.noreply.github.com> Date: Wed, 17 Jul 2024 11:29:40 -0400 Subject: [PATCH 9/9] first pass --- python_clustering/python_clustering.md | 33 ++++++++++++++++---------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md index 00438ed9b..5c9fd0aa6 100644 --- a/python_clustering/python_clustering.md +++ b/python_clustering/python_clustering.md @@ -49,12 +49,7 @@ coding_language: python @end @version_history - -Previous versions: - -- [x.x.x](link): that version's current version description -- [x.x.x](link): that version's current version description -- [x.x.x](link): that version's current version description +No previous versions. @end import: https://raw.githubusercontent.com/arcus/education_modules/main/_module_templates/macros.md @@ -62,21 +57,23 @@ import: https://raw.githubusercontent.com/arcus/education_modules/pyodide_testin import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md --> -# Python Lesson on Clustering for Machine Learning +# Clustering in Python @overview -## Summary of Key Concepts in Clustering +## Review of Clustering -- **Clustering Definition:** Clustering is an unsupervised machine learning technique used to group unlabeled data points into clusters based on their similarity. The goal is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Common algorithms include K-Means, hierarchical clustering, and Gaussian Mixture Models. +**Clustering** is a machine learning technique used to group unlabeled data points into clusters based on their similarity. The goal is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. 
In this lesson we will work through an example of K-Means clustering. Other common algorithms hierarchical clustering, and Gaussian Mixture Models. -- **Unsupervised vs. Supervised Learning:** Clustering falls under unsupervised learning, where algorithms are trained on unlabeled data to identify patterns and relationships without prior knowledge. Supervised learning, on the other hand, involves training on labeled data to predict labels for new data points. +For a more in-depth look at what clustering is, see the [_other clustering module_](link). + +Clustering is a type of **unsupervised learning**. Unsupervised learning algorithms are algorithms trained on unlabeled data to identify patterns and relationships without prior knowledge. This is different from supervised learning, where an algorithm is initially trained on labeled data in order to predict labels for new data points. - **Applications:** Clustering finds applications in various fields such as customer segmentation, biomedical research, drug development, gene expression analysis, medical image analysis, and disease-risk prediction. -- **K-Means Clustering Algorithm:** K-Means works by iteratively assigning data points to clusters based on their distance to cluster centroids. Key steps include choosing the number of clusters (K), initializing centroids, assigning data points, recalculating centroids, and iterating until convergence. + - **Understanding Techniques:** Techniques like normalization, computing distances from cluster centroids, and visualization aid in building accurate clustering models and interpreting results. @@ -88,9 +85,19 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md -## Python Implementation of K-Means Clustering - +## The K-Means Clustering Algorithm + +The **K-Means Clustering Algorithm**, sometimes refered to as simply "K-Means," works by iteratively assigning data points to clusters based on their distance to cluster centroids. 
+ +The key steps of K-Means clustering are: +1. choosing the number of clusters (K), +2. initializing centroids, +3. assigning data points to their nearest centroid and recalculating centroids, and +4. iterating until convergence. + +### Clustering Patients +We are going to use a dataset ***(add more details about its origin)***. This dataset contains various clinical attributes of patients, including their age, sex, chest pain type (cp), resting blood pressure (trtbps), serum cholesterol level (chol), fasting blood sugar level (fbs), resting electrocardiographic results (restecg), maximum heart rate achieved (thalachh), exercise-induced angina (exng), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slp), number of major vessels colored by fluoroscopy (caa), thalassemia type (thall), and the presence of heart disease (output). The data appears to relate to the diagnosis of heart disease, with the output variable indicating whether a patient has heart disease (1) or not (0). Each row represents a different patient, with their respective clinical characteristics recorded. To implement k-means clustering in Python using Scikit-learn, we can follow these steps: