A Bayesian statistician approaches statistical inference using Bayesian probability, which represents a degree of belief or certainty. Bayesian statistics incorporates prior knowledge or beliefs about parameters and updates them based on observed data using Bayes’ theorem. This leads to the calculation of posterior probabilities, which express the probability of hypotheses given the data. Bayesian methods are particularly useful when dealing with small sample sizes or when incorporating existing knowledge into statistical analysis.
For example, imagine a scenario where a pharmaceutical company wants to test the effectiveness of a new drug. A Bayesian statistician would start with prior beliefs about the drug’s effectiveness based on existing knowledge or previous studies. As new data from clinical trials becomes available, these prior beliefs are updated using Bayes’ theorem to calculate the posterior probability of the drug being effective. The Bayesian approach allows for the incorporation of prior knowledge into the analysis, making it especially useful when dealing with limited data.
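To make this concrete, here is a minimal sketch of that kind of Bayesian update, assuming a Beta prior on the drug’s success rate and binomial trial data; the prior parameters and trial counts are made up purely for illustration.

```python
# A minimal sketch of a Bayesian update for the drug example, assuming a
# Beta prior on the drug's success rate and binomial trial data.
# The prior parameters and trial counts below are illustrative, not real.
from scipy import stats

# Prior belief: success rate around 50%, with moderate uncertainty (Beta(5, 5)).
prior_alpha, prior_beta = 5, 5

# Hypothetical clinical-trial data: 36 successes out of 50 patients.
successes, trials = 36, 50

# With a Beta prior and a binomial likelihood, the posterior is also a Beta
# distribution (conjugacy), so Bayes' theorem reduces to adding counts.
posterior = stats.beta(prior_alpha + successes, prior_beta + (trials - successes))

print(f"Posterior mean effectiveness: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

With these made-up numbers the posterior concentrates around a roughly 68% success rate, and the credible interval shows how much uncertainty remains after only 50 patients.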
A frequentist statistician approaches statistical inference by focusing on the frequency or probability of events. In this framework, probabilities are associated with the frequency of events in repeated random sampling. Key concepts include point estimation, confidence intervals, and hypothesis testing. The emphasis is on using observed data to make inferences about the true values of population parameters. Frequentist methods do not assign probabilities to hypotheses; instead, they view hypotheses as fixed and the data as variable.
For example, let’s say we want to estimate the average height of students in a school. A frequentist statistician would take random samples of students, calculate the average height in each sample, and use these averages to make inferences about the true average height of all students in the school. The focus is on using the observed data (sample means) to estimate and make statements about the population parameter (true average height).
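Here is a quick sketch of this frequentist estimate, using a simulated sample of heights (the numbers are illustrative, not real school data).

```python
# A minimal frequentist sketch for the height example: estimate the population
# mean from a sample and report a 95% confidence interval.
# The sample below is simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=165, scale=8, size=40)   # hypothetical heights in cm

mean = sample.mean()
sem = stats.sem(sample)                          # standard error of the mean
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)

print(f"Point estimate: {mean:.1f} cm")
print(f"95% confidence interval: ({ci[0]:.1f}, {ci[1]:.1f}) cm")
```

The confidence interval is a statement about the procedure: across repeated samples, intervals built this way would cover the true average height about 95% of the time.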
Today I learned about the KNN (K-Nearest Neighbors) algorithm, a handy tool for our Project 2 dataset. KNN excels at classification problems where similar observations sit close together in feature space. The process involves addressing missing values, normalizing numerical features for equal importance, selecting relevant features like demographics and city data, splitting the dataset for training and testing, determining the optimal number of neighbors (K), training the model, evaluating its performance, and finally using it to predict categories for new data points. KNN’s focus on similarity makes it particularly useful for uncovering patterns in geographical dynamics.
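Here is a rough sketch of that workflow with scikit-learn; the feature matrix, the labels, and the choice of k = 5 are placeholders rather than the actual Project 2 setup.

```python
# A rough sketch of the KNN workflow described above, using scikit-learn.
# X and y stand in for the Project 2 features and target; k = 5 is an
# illustrative choice, not a tuned value.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))        # stand-in for demographic/city features
y = rng.integers(0, 2, size=200)     # stand-in for the target category

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Impute missing values, normalize features so they carry equal weight,
# then fit KNN with k = 5.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

In practice, k would be chosen by trying several values (for example with cross-validation) rather than fixed at 5.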
Clustering (Unsupervised Learning): Clustering is a technique used in unsupervised machine learning to group data points based on their inherent similarities, without any prior labels or categorizations. The main goal is to partition a dataset into clusters where items within a cluster are more alike to each other than to items in other clusters. This method helps uncover hidden patterns within data, making it particularly valuable when we don’t have any predefined categories or when we want to discover new insights from the data.
Hierarchical Clustering: Hierarchical clustering creates a tree of clusters. Unlike K-means, we don’t need to specify the number of clusters upfront. The method starts by treating each data point as a single cluster and then continually merges the closest pairs of clusters until only one large cluster remains. The result is a tree-like diagram called a dendrogram, which gives a multi-level hierarchy of clusters. One can then decide the number of clusters by cutting the dendrogram at a desired level. Hierarchical clustering is great for smaller datasets and when we want to understand hierarchical relationships, but it can be computationally intensive for larger datasets.
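A small sketch of hierarchical clustering with SciPy on made-up 2-D points; the Ward linkage and the two-cluster cut are illustrative choices, not requirements.

```python
# Agglomerative (hierarchical) clustering sketch on synthetic 2-D points.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal([0, 0], 0.5, size=(10, 2)),   # one tight group of points
    rng.normal([5, 5], 0.5, size=(10, 2)),   # a second, well-separated group
])

# Merge the closest clusters step by step (Ward linkage minimizes variance).
Z = linkage(points, method="ward")

# "Cut" the resulting tree so that two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree if matplotlib is installed.
```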
K-Means clustering has solidified its position in the world of unsupervised machine learning, offering a potent technique to group data points based on their similarities. This algorithm endeavors to partition the dataset into ‘k’ distinct clusters, each defined by a central point known as a centroid. It iteratively assigns data points to the cluster with the nearest centroid, recalculating centroids until convergence. With applications ranging from customer segmentation in marketing to image compression in computer vision, K-Means stands as a versatile solution for pattern recognition.
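A minimal K-Means sketch with scikit-learn on synthetic blobs; choosing k = 3 is an assumption that happens to match the generated data.

```python
# K-Means sketch: partition synthetic data into k = 3 clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # iteratively assigns points and updates centroids

print("Centroids:\n", kmeans.cluster_centers_)
```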
In contrast, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) takes a distinctive approach, identifying regions of high data density. Unlike K-Means, DBSCAN doesn’t require users to predefine the number of clusters. It classifies points into core, border, and noise categories. Core points, surrounded by a minimum number of other points within a specified radius, form cluster nuclei. Border points lie on cluster peripheries, while sparser regions contain noise points. DBSCAN excels at discovering clusters of arbitrary shapes and effectively handling outliers.
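And a matching DBSCAN sketch; the eps and min_samples values are illustrative and would need tuning on real data.

```python
# DBSCAN sketch on a dataset with non-spherical clusters (two "moons").
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                  # cluster ids; -1 marks noise points

print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points:", list(labels).count(-1))
```

Note that no number of clusters was specified; DBSCAN discovers the two crescent shapes from the density of the points alone.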
When choosing between K-Means and DBSCAN, the nature of the dataset and desired outcomes are crucial considerations. K-Means is ideal when the number of clusters is known, and clusters are well-defined and spherical. In contrast, DBSCAN shines with datasets of varying densities and irregularly shaped clusters. The adaptability of these clustering algorithms empowers data scientists to unveil hidden structures, facilitating more informed decision-making across diverse fields.
Today I went through the statistical analysis of the age distribution for people killed by police, comparing black and white populations. Here’s a summary of what I worked through in the data.
Overview: The analysis focuses on the age distribution of individuals killed by police, specifically comparing black and white populations.
Black Population:
- About 68% of ages for black people killed by police fall within one standard deviation of the mean (between Mean – Standard Deviation and Mean + Standard Deviation).
- This one-standard-deviation range runs from roughly 21.29 to 44.3 years (a quick way to check a range like this is sketched after this summary).

White Population:
- Similarly, about 68% of ages for white people killed by police fall within one standard deviation of the mean (between Mean – Standard Deviation and Mean + Standard Deviation).
- This one-standard-deviation range runs from roughly 26.70 to 53.26 years.
- The analysis suggests that white individuals killed by police are, on average, statistically significantly older than black individuals.
- Visualizations, such as smooth histograms, show the differences in age distributions between black and white populations.
- The text mentions the challenge of using a t-test due to deviations from normality in the age distributions.

Conclusion:
- The mean age of white individuals killed by police is noted to be almost 28.
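For reference, here is a hedged sketch of how a mean ± one standard deviation range and the "about 68%" share could be checked; the ages array is a simulated stand-in, since the Washington Post data itself isn’t reproduced here.

```python
# Checking the mean ± 1 SD range and the share of observations inside it,
# on a simulated stand-in for one group's ages.
import numpy as np

rng = np.random.default_rng(3)
ages = rng.normal(33, 11, size=500)          # placeholder ages

mean, sd = ages.mean(), ages.std(ddof=1)
low, high = mean - sd, mean + sd
share = np.mean((ages >= low) & (ages <= high))

print(f"Mean ± 1 SD: {low:.2f} to {high:.2f}")
print(f"Share of ages within that range: {share:.0%}")
```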
In class today, we looked at the ages of people who were killed by the police, using information from the Washington Post. When we checked the ages of everyone, we noticed that the distribution was skewed to the right, meaning most ages sit on the younger side with a long tail stretching toward older ages. We ran the numbers, and it turns out that about 67% of the people who were shot by the police were between 24 and 50 years old.
Then, we did the same thing but just for black and white people. For black individuals, about 78% were between 25 and 44, and for white individuals, it was between 23 and 53.
Next, we wanted to see if there was a difference in the average ages between black and white people shot by the police. And yep, there was! We used something called the Monte Carlo method and figured out that there’s roughly a 7-year difference, and it’s very unlikely to have happened by chance.
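One common way to run such a Monte Carlo check is a permutation test like the sketch below; the exact method we used in class may have differed in details, and the two age arrays here are simulated placeholders for the real data.

```python
# Monte Carlo permutation test sketch: is the observed difference in mean age
# larger than what random relabeling of the two groups would produce?
import numpy as np

rng = np.random.default_rng(7)
ages_black = rng.normal(33, 11, size=300)    # placeholder samples
ages_white = rng.normal(40, 13, size=500)

observed = ages_white.mean() - ages_black.mean()
combined = np.concatenate([ages_black, ages_white])
n_black = len(ages_black)

n_iter, count = 10_000, 0
for _ in range(n_iter):
    rng.shuffle(combined)                    # randomly relabel the ages
    diff = combined[n_black:].mean() - combined[:n_black].mean()
    if diff >= observed:
        count += 1

print(f"Observed difference: {observed:.2f} years")
print(f"Monte Carlo p-value: {count / n_iter:.4f}")
```

A tiny p-value here would mean the observed gap is very unlikely under the assumption that the group labels don’t matter.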
To understand how much this age difference matters overall, we used Cohen’s d. The number we got was 0.577, and according to the usual guidelines, that’s a medium-sized effect. So the age gap is not just statistically significant but also noticeable in practical terms.
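For completeness, here is a short sketch of computing Cohen’s d with a pooled standard deviation (values between about 0.5 and 0.8 are conventionally called a medium effect); the data arrays are again placeholders.

```python
# Cohen's d sketch: standardized difference between two group means.
import numpy as np

def cohens_d(a, b):
    # Pooled standard deviation across the two groups.
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(7)
ages_black = rng.normal(33, 11, size=300)    # placeholder samples
ages_white = rng.normal(40, 13, size=500)

print(f"Cohen's d: {cohens_d(ages_black, ages_white):.3f}")
```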