Cohen’s d statistical measure : 30-Oct-2023
Clustering (Unsupervised Learning): Clustering is a technique used in unsupervised machine learning to group data points based on their inherent similarities, without any prior labels or categorizations. The main goal is to partition a dataset into clusters where items within a cluster are more alike to each other than to items in other clusters. This method helps uncover hidden patterns within data, making it particularly valuable when we don’t have any predefined categories or when we want to discover new insights from the data.
Hierarchical Clustering: Hierarchical clustering creates a tree of clusters. Unlike K-means, we don’t need to specify the number of clusters upfront. The method starts by treating each data point as a single cluster and then continually merges the closest pairs of clusters until only one large cluster remains. The result is a tree-like diagram called a dendrogram, which gives a multi-level hierarchy of clusters. One can then decide the number of clusters by cutting the dendrogram at a desired level. Hierarchical clustering is great for smaller datasets and when we want to understand hierarchical relationships, but it can be computationally intensive for larger datasets.
K-Means clustering has solidified its position in the world of unsupervised machine learning, offering a potent technique to group data points based on their similarities. This algorithm endeavors to partition the dataset into ‘k’ distinct clusters, each defined by a central point known as a centroid. It iteratively assigns data points to the cluster with the nearest centroid, recalculating centroids until convergence. With applications ranging from customer segmentation in marketing to image compression in computer vision, K-Means stands as a versatile solution for pattern recognition.
In contrast, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) takes a distinctive approach, identifying regions of high data density. Unlike K-Means, DBSCAN doesn’t require users to predefine the number of clusters. It classifies points into core, border, and noise categories. Core points, surrounded by a minimum number of other points within a specified radius, form cluster nuclei. Border points lie on cluster peripheries, while sparser regions contain noise points. DBSCAN excels at discovering clusters of arbitrary shapes and effectively handling outliers.
When choosing between K-Means and DBSCAN, the nature of the dataset and desired outcomes are crucial considerations. K-Means is ideal when the number of clusters is known, and clusters are well-defined and spherical. In contrast, DBSCAN shines with datasets of varying densities and irregularly shaped clusters. The adaptability of these clustering algorithms empowers data scientists to unveil hidden structures, facilitating more informed decision-making across diverse fields.
Today I have gone through the statistical analysis of the age distribution for people killed by police, comparing black and white populations. Here’s a summary that i have undergone about the data.
Overview: The analysis focuses on the age distribution of individuals killed by police, specifically comparing black and white populations.
Black Population:
In class today, we looked at the ages of people who were killed by the police, using information from the Washington Post. When we checked the ages of everyone, we noticed that the numbers were a bit tilted to the right, meaning that more people were older. We did some math stuff, and it turns out that about 67% of the people who were shot by the police were between 24 and 50 years old.
Then, we did the same thing but just for black and white people. For black individuals, about 78% were between 25 and 44, and for white individuals, it was between 23 and 53.
Next, we wanted to see if there was a difference in the average ages between black and white people shot by the police. And yep, there was! We used something called the Monte Carlo method and figured out that there’s roughly a 7-year difference, and it didn’t happen by chance.
To understand how much this age difference matters overall, we used Cohen’s d method. The number we got was 0.577, and according to the guidelines, that’s a medium-sized difference. So, it seems like the age gap has a noticeable impact on the whole dataset.
Today I have gained knowledge on Bootstrap. Bootstrapping is a statistical method for estimating the sampling distribution of a statistic without assuming a known underlying distribution. It works by repeatedly resampling the data with replacement and calculating the statistic of interest on each resampled dataset. It can be used to estimate the sampling distribution of any statistic, including the sample median, sample variance, and sample correlation coefficient. It is a powerful tool for statistical inference, and it can be used in a variety of settings.