Cohen’s d statistical measure: 30-Oct-2023

Cohen’s d is a statistical measure that quantifies the standardized difference between two means. In simpler terms, it expresses the size of the difference between two groups in standard deviation units, so a larger Cohen’s d indicates a more substantial difference between the groups being compared. In the examination of the age distribution of individuals fatally shot by the police, Cohen’s d was used to evaluate the effect size of the age gap between black and white victims.
To compute Cohen’s d, the difference between the means (here, the mean ages of white and black individuals killed by police) is divided by the pooled standard deviation of the two groups. The pooled standard deviation is a weighted average of the two groups’ standard deviations, adjusted for their sample sizes. In the given dataset, the computed Cohen’s d was 0.577485, a medium effect size according to Cohen’s guidelines (as extended by Sawilowsky). This implies that the 7.3-year average age difference between white and black individuals killed by police is of moderate magnitude: not small, not large, but in between. The resulting value provides a standardized measure of effect size, helping to interpret the magnitude of the difference between the two group means.
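
Here is a minimal sketch in Python of how this computation works (the ages below are made-up stand-ins, not the actual dataset):

    import numpy as np

    def cohens_d(x, y):
        # Cohen's d: difference in means divided by the pooled standard deviation
        nx, ny = len(x), len(y)
        pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

    # Made-up ages standing in for the two groups
    white_ages = np.array([38, 45, 29, 52, 41, 36, 47])
    black_ages = np.array([27, 33, 24, 39, 31, 28, 35])
    print(cohens_d(white_ages, black_ages))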


Hierarchical Clustering

Clustering (Unsupervised Learning): Clustering is a technique used in unsupervised machine learning to group data points based on their inherent similarities, without any prior labels or categorizations. The main goal is to partition a dataset into clusters where items within a cluster are more similar to each other than to items in other clusters. This method helps uncover hidden patterns within data, making it particularly valuable when we don’t have predefined categories or when we want to discover new insights from the data.


Hierarchical Clustering: Hierarchical clustering creates a tree of clusters. Unlike K-means, we don’t need to specify the number of clusters upfront. The method starts by treating each data point as a single cluster and then continually merges the closest pairs of clusters until only one large cluster remains. The result is a tree-like diagram called a dendrogram, which gives a multi-level hierarchy of clusters. One can then decide the number of clusters by cutting the dendrogram at a desired level. Hierarchical clustering is great for smaller datasets and when we want to understand hierarchical relationships, but it can be computationally intensive for larger datasets.
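
Here is a minimal sketch using SciPy, with random 2-D points standing in for real data; it builds the merge tree and then cuts the dendrogram into three clusters:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))      # stand-in data: 20 points in 2-D

    # Agglomerative merge tree; 'ward' merges the pair of clusters
    # that least increases within-cluster variance
    Z = linkage(X, method='ward')

    # "Cut" the dendrogram to obtain a flat assignment into 3 clusters
    labels = fcluster(Z, t=3, criterion='maxclust')
    print(labels)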

K-Means and DBSCAN

K-Means clustering has solidified its position in the world of unsupervised machine learning, offering a potent technique to group data points based on their similarities. This algorithm endeavors to partition the dataset into ‘k’ distinct clusters, each defined by a central point known as a centroid. It iteratively assigns data points to the cluster with the nearest centroid, recalculating centroids until convergence. With applications ranging from customer segmentation in marketing to image compression in computer vision, K-Means stands as a versatile solution for pattern recognition.
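
Here is a quick scikit-learn sketch (random points standing in for a real dataset):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))     # stand-in feature matrix

    # Partition into k=3 clusters; fit_predict returns each point's cluster id
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    print(kmeans.cluster_centers_)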

In contrast, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) takes a distinctive approach, identifying regions of high data density. Unlike K-Means, DBSCAN doesn’t require users to predefine the number of clusters. It classifies points into core, border, and noise categories. Core points, surrounded by a minimum number of other points within a specified radius, form cluster nuclei. Border points lie on cluster peripheries, while sparser regions contain noise points. DBSCAN excels at discovering clusters of arbitrary shapes and effectively handling outliers.
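
And a matching DBSCAN sketch; the eps and min_samples values below are illustrative, not tuned:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))     # stand-in feature matrix

    # eps is the neighborhood radius; min_samples is the minimum number of
    # neighbors required for a core point (both values are illustrative)
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

    # DBSCAN labels noise points -1; the remaining labels are cluster ids
    print(set(labels))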

When choosing between K-Means and DBSCAN, the nature of the dataset and desired outcomes are crucial considerations. K-Means is ideal when the number of clusters is known, and clusters are well-defined and spherical. In contrast, DBSCAN shines with datasets of varying densities and irregularly shaped clusters. The adaptability of these clustering algorithms empowers data scientists to unveil hidden structures, facilitating more informed decision-making across diverse fields.


Difference in average ages – black people versus white people shot by police: October 18

Today I went through the statistical analysis of the age distribution for people killed by police, comparing black and white populations. Here’s a summary of what I found in the data.

Overview: The analysis focuses on the age distribution of individuals killed by police, specifically comparing black and white populations.

Black Population:

    • About 68% of ages for black people killed by police fall within one standard deviation of the mean (between Mean − Standard Deviation and Mean + Standard Deviation); see the code sketch after this list.
    • That one-standard-deviation range runs from 21.29 to 44.30 years.

White Population:

    • Similarly, about 68% of ages for white people killed by police fall within one standard deviation of the mean.
    • That range runs from 26.70 to 53.26 years.

Observations:

    • The analysis suggests that white individuals killed by police are, on average, statistically significantly older than black individuals.
    • Visualizations, such as smooth histograms, show the differences in age distributions between the black and white populations.
    • A standard t-test is complicated by deviations from normality in the age distributions.

Conclusion:

    • The mean age of white individuals killed by police is almost 40 (the midpoint of the one-standard-deviation range above).
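
Here is a sketch of the kind of one-standard-deviation check described above. The filename and column names (a 'race' column with codes 'B'/'W' and an 'age' column) are assumptions about the Washington Post data layout:

    import pandas as pd

    # Assumed filename and schema for the Washington Post dataset
    df = pd.read_csv('fatal-police-shootings-data.csv')

    for race in ['B', 'W']:
        ages = df.loc[df['race'] == race, 'age'].dropna()
        mean, sd = ages.mean(), ages.std()
        # Fraction of ages inside (mean - sd, mean + sd); should be near 0.68
        share = ((ages >= mean - sd) & (ages <= mean + sd)).mean()
        print(race, round(mean - sd, 2), round(mean + sd, 2), round(share, 2))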

Project 2 Data Sample analysis: October 16

In class today, we looked at the ages of people who were killed by the police, using data from the Washington Post. When we checked the ages of everyone, we noticed that the distribution was skewed to the right: most people were on the younger side, with a long tail stretching toward older ages. We computed the mean and standard deviation, and it turns out that about 67% of the people who were shot by the police were between 24 and 50 years old.

Then, we did the same thing separately for black and white people. For black individuals, about 78% were between 25 and 44; for white individuals, the range was 23 to 53.

Next, we wanted to see if there was a difference in the average ages between black and white people shot by the police. And yep, there was! We used the Monte Carlo method and found a difference of roughly 7 years, and the simulation showed it was very unlikely to have happened by chance.
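
The exact procedure we used in class isn’t reproduced here, but a minimal Monte Carlo (permutation-style) sketch looks like this, assuming hypothetical arrays white_ages and black_ages:

    import numpy as np

    def monte_carlo_p(white_ages, black_ages, n_iter=10_000, seed=0):
        # Observed difference in mean ages
        observed = np.mean(white_ages) - np.mean(black_ages)
        pooled = np.concatenate([white_ages, black_ages])
        rng = np.random.default_rng(seed)
        count = 0
        for _ in range(n_iter):
            # Shuffle group labels: any difference now arises by chance alone
            rng.shuffle(pooled)
            diff = np.mean(pooled[:len(white_ages)]) - np.mean(pooled[len(white_ages):])
            if diff >= observed:
                count += 1
        return count / n_iter   # small p-value: difference unlikely by chance

    # Usage with made-up ages
    white_ages = np.array([38.0, 45, 29, 52, 41, 36, 47])
    black_ages = np.array([27.0, 33, 24, 39, 31, 28, 35])
    print(monte_carlo_p(white_ages, black_ages))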

To understand how much this age difference matters overall, we used Cohen’s d. The number we got was 0.577, which the guidelines call a medium-sized effect. So the age gap is a real, noticeable difference across the dataset, though not a huge one.

Estimating the sampling distribution using Bootstrap: 04-Oct-2023

Today I learned about the bootstrap. Bootstrapping is a statistical method for estimating the sampling distribution of a statistic without assuming a known underlying distribution. It works by repeatedly resampling the data with replacement and calculating the statistic of interest on each resampled dataset. It can be used to estimate the sampling distribution of virtually any statistic, including the sample median, sample variance, and sample correlation coefficient, which makes it a powerful tool for statistical inference in a wide variety of settings.

Here is an example of how bootstrapping can be used:
Suppose we want to test the hypothesis that the average height of men is different from the average height of women. We could collect a sample of heights from men and a sample of heights from women, and then use bootstrapping to estimate the sampling distribution of the difference in sample means.
To do this, we would draw bootstrap samples from the men’s and women’s samples with replacement, and then calculate the difference in sample means on each bootstrap sample. The distribution of the difference in sample means from the bootstrap samples would be an estimate of the sampling distribution of the difference in sample means.
We could then use this distribution to carry out the hypothesis test. One common approach is to compute a 95% percentile interval from the bootstrap differences: if the interval excludes zero, we reject the null hypothesis and conclude that there is a statistically significant difference in the average heights of men and women. (An alternative is to resample under the null hypothesis, for example from the pooled data, and take the p-value as the proportion of bootstrap differences at least as large as the one observed; a p-value below 0.05 again leads to rejection.)
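
Here is a minimal sketch of the height example with made-up numbers, using the percentile-interval approach:

    import numpy as np

    rng = np.random.default_rng(0)
    men   = np.array([178, 182, 171, 190, 175, 169, 184, 180])   # made-up heights (cm)
    women = np.array([165, 170, 158, 172, 161, 168, 163, 166])

    # Bootstrap the sampling distribution of the difference in means
    diffs = np.empty(10_000)
    for i in range(diffs.size):
        m = rng.choice(men, size=men.size, replace=True)     # resample with replacement
        w = rng.choice(women, size=women.size, replace=True)
        diffs[i] = m.mean() - w.mean()

    # 95% percentile interval; if it excludes 0, the difference is significant
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    print(round(lo, 2), round(hi, 2))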
Bootstrapping is a versatile and powerful statistical tool that can be used for a variety of purposes. It is a good choice for researchers who want to make inferences about their data without assuming a known underlying distribution.