Bayesian Statistician: 08-Nov-2023

A Bayesian statistician approaches statistical inference using Bayesian probability, which represents a degree of belief or certainty. Bayesian statistics incorporates prior knowledge or beliefs about parameters and updates them based on observed data using Bayes’ theorem. This leads to the calculation of posterior probabilities, which express the probability of hypotheses given the data. Bayesian methods are particularly useful when dealing with small sample sizes or when incorporating existing knowledge into statistical analysis.

For example, imagine a scenario where a pharmaceutical company wants to test the effectiveness of a new drug. A Bayesian statistician would start with prior beliefs about the drug’s effectiveness based on existing knowledge or previous studies. As new data from clinical trials become available, these prior beliefs are updated using Bayes’ theorem to calculate the posterior probability of the drug being effective. The Bayesian approach allows for the incorporation of prior knowledge into the analysis, making it especially useful when dealing with limited data.
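
As a small illustration of this update step, here is a minimal sketch in Python assuming a Beta-Binomial model for the drug example; the prior parameters and trial counts are hypothetical and chosen only to show how Bayes’ theorem turns a prior into a posterior.

from scipy import stats

# Prior belief about the response rate: Beta(2, 2), i.e. "probably around 50%" but uncertain
prior_a, prior_b = 2, 2

# New clinical-trial data (hypothetical): 38 responders out of 50 patients
successes, trials = 38, 50

# Conjugate update: posterior is Beta(prior_a + successes, prior_b + failures)
post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior = stats.beta(post_a, post_b)

print(f"Posterior mean response rate: {posterior.mean():.3f}")
print(f"P(response rate > 0.6)      : {1 - posterior.cdf(0.6):.3f}")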

Frequentist Statistician: 06-Nov-2023

A frequentist statistician approaches statistical inference by focusing on the frequency or probability of events. In this framework, probabilities are associated with the frequency of events in repeated random sampling. Key concepts include point estimation, confidence intervals, and hypothesis testing. The emphasis is on using observed data to make inferences about the true values of population parameters. Frequentist methods do not assign probabilities to hypotheses; instead, they view hypotheses as fixed and the data as variable.

For example, let’s say we want to estimate the average height of students in a school. A frequentist statistician would take random samples of students, calculate the average height in each sample, and use these averages to make inferences about the true average height of all students in the school. The focus is on using the observed data (sample means) to estimate and make statements about the population parameter (true average height).
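
A minimal sketch of this procedure, using simulated heights in place of real student data; a point estimate and a 95% confidence interval for the true mean are computed from a single random sample.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=165, scale=8, size=40)   # heights in cm (simulated sample)

mean = sample.mean()
sem = stats.sem(sample)                          # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"Sample mean height      : {mean:.1f} cm")
print(f"95% CI for the true mean: ({ci_low:.1f}, {ci_high:.1f}) cm")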

KNN Algorithm: 01-Nov-2023

Today I learned about the KNN (K-Nearest Neighbors) algorithm, a handy tool for our Project 2 dataset. KNN excels at finding similarities in datasets where similar features tend to cluster together. The process involves addressing missing values, normalizing numerical features so each carries equal weight, selecting relevant features like demographics and city data, splitting the dataset for training and testing, determining the optimal number of neighbors (K), training the model, evaluating its performance, and finally using it to predict categories for new data points. KNN’s focus on similarity makes it particularly useful for uncovering patterns in geographical dynamics.
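
A minimal sketch of that workflow with scikit-learn, using a synthetic dataset as a stand-in for the Project 2 features (the features and labels here are hypothetical placeholders):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))            # e.g. age, income, population, density
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # synthetic category labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Normalize features so each contributes equally, then fit KNN with K=5
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
print("Predicted category for a new point:", model.predict([[0.1, -0.5, 1.2, 0.3]]))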

Cohen’s d Statistical Measure: 30-Oct-2023

Cohen’s d is a statistical measure that quantifies the standardized difference between two means. In simpler terms, it assesses the size of the difference between two groups by expressing it in standard deviation units; a larger Cohen’s d indicates a more substantial difference between the groups being compared. In the examination of the age distribution of individuals fatally shot by the police, Cohen’s d was employed to evaluate the effect size of the age gap between black and white victims.
To compute Cohen’s d, the difference between the means (here, the mean ages of white and black individuals killed by police) is divided by the pooled standard deviation of both groups. The pooled standard deviation is a weighted average of the two groups’ standard deviations, with weights adjusted for the sample sizes. In the given dataset, the computed Cohen’s d was 0.577485, a medium effect size according to the Cohen-Sawilowsky guidelines. This implies that the 7.3-year average age difference between white and black individuals killed by police is of moderate magnitude: neither a small nor a large effect, but one falling in between. The resulting Cohen’s d value thus provides a standardized measure of effect size, helping to interpret the magnitude of the difference between the two group means.
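
A minimal sketch of this calculation; the two age arrays below are synthetic placeholders rather than the actual police-shooting data, but the formula (difference of means over pooled standard deviation) is the one described above.

import numpy as np

def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    s1, s2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    # Pooled standard deviation, weighted by each group's degrees of freedom
    pooled_sd = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

rng = np.random.default_rng(1)
ages_white = rng.normal(40, 13, size=500)   # illustrative values only
ages_black = rng.normal(33, 12, size=300)

print(f"Cohen's d: {cohens_d(ages_white, ages_black):.3f}")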

Hierarchical Clustering

Clustering (Unsupervised Learning): Clustering is a technique used in unsupervised machine learning to group data points based on their inherent similarities, without any prior labels or categorizations. The main goal is to partition a dataset into clusters where items within a cluster are more similar to one another than to items in other clusters. This method helps uncover hidden patterns within data, making it particularly valuable when we don’t have any predefined categories or when we want to discover new insights from the data.

Hierarchical Clustering: Hierarchical clustering creates a tree of clusters. Unlike K-means, we don’t need to specify the number of clusters upfront. The method starts by treating each data point as a single cluster and then continually merges the closest pairs of clusters until only one large cluster remains. The result is a tree-like diagram called a dendrogram, which gives a multi-level hierarchy of clusters. One can then decide the number of clusters by cutting the dendrogram at a desired level. Hierarchical clustering is great for smaller datasets and when we want to understand hierarchical relationships, but it can be computationally intensive for larger datasets.
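
A minimal sketch of agglomerative (hierarchical) clustering with SciPy on synthetic 2-D data, building the linkage tree and then cutting it into a chosen number of clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(3, 0.5, (20, 2)),
               rng.normal(6, 0.5, (20, 2))])

Z = linkage(X, method="ward")                     # merge the closest clusters step by step
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters

print("Cluster sizes:", np.bincount(labels)[1:])
# scipy.cluster.hierarchy.dendrogram(Z) can be used to plot the full tree.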

K-Means and DBSCAN

K-Means clustering has solidified its position in the world of unsupervised machine learning, offering a potent technique to group data points based on their similarities. This algorithm endeavors to partition the dataset into ‘k’ distinct clusters, each defined by a central point known as a centroid. It iteratively assigns data points to the cluster with the nearest centroid, recalculating centroids until convergence. With applications ranging from customer segmentation in marketing to image compression in computer vision, K-Means stands as a versatile solution for pattern recognition.

In contrast, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) takes a distinctive approach, identifying regions of high data density. Unlike K-Means, DBSCAN doesn’t require users to predefine the number of clusters. It classifies points into core, border, and noise categories. Core points, surrounded by a minimum number of other points within a specified radius, form cluster nuclei. Border points lie on cluster peripheries, while sparser regions contain noise points. DBSCAN excels at discovering clusters of arbitrary shapes and effectively handling outliers.
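
A minimal sketch contrasting the two algorithms with scikit-learn on the classic “two moons” dataset, where DBSCAN’s density-based approach handles the non-spherical shapes that K-Means struggles with; the eps and min_samples values are illustrative.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means: the number of clusters must be chosen up front
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no cluster count needed; eps sets the neighborhood radius,
# min_samples the density threshold, and a label of -1 marks noise points
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means cluster labels found:", set(kmeans_labels))
print("DBSCAN cluster labels found :", set(dbscan_labels))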

When choosing between K-Means and DBSCAN, the nature of the dataset and desired outcomes are crucial considerations. K-Means is ideal when the number of clusters is known, and clusters are well-defined and spherical. In contrast, DBSCAN shines with datasets of varying densities and irregularly shaped clusters. The adaptability of these clustering algorithms empowers data scientists to unveil hidden structures, facilitating more informed decision-making across diverse fields.