Difference in average ages – black people versus white people shot by police: October 18

Today I went through the statistical analysis of the age distribution for people killed by police, comparing the black and white populations. Here is a summary of what I took away from the data.

Overview: The analysis focuses on the age distribution of individuals killed by police, specifically comparing black and white populations. (A sketch of how these per-group summaries could be computed appears below.)

Black Population:

    • About 68% of ages for black people killed by police fall within one standard deviation of the mean (between Mean – Standard Deviation and Mean + Standard Deviation).
    • This one-standard-deviation interval runs from about 21.29 to 44.3 years.

White Population:

    • Similarly, about 68% of ages for white people killed by police fall within one standard deviation of the mean (between Mean – Standard Deviation and Mean + Standard Deviation).
    • This one-standard-deviation interval runs from about 26.70 to 53.26 years.
    • The analysis suggests that white individuals killed by police are, on average, statistically significantly older than black individuals.
    • Visualizations, such as smooth histograms, show the differences in age distributions between the black and white populations.
    • Using a t-test is challenging here because of deviations from normality in the age distributions.

Conclusion:

    • The mean age of white individuals killed by police is noted to be almost 28.
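For reference, here is a minimal sketch of how these per-group summaries could be computed in Python. It assumes a CSV export of the Washington Post fatal police shootings data with columns named "age" and "race" ("B" for Black, "W" for White); the file name and column names are assumptions and may differ from what we actually used in class.

# A minimal sketch of the per-group age summaries described above.
# File name and column names are assumptions; adjust to the real dataset.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")

for race_code, label in [("B", "Black"), ("W", "White")]:
    ages = df.loc[df["race"] == race_code, "age"].dropna()
    mean, sd = ages.mean(), ages.std()
    print(f"{label}: mean = {mean:.2f}, sd = {sd:.2f}, "
          f"one-SD interval = ({mean - sd:.2f}, {mean + sd:.2f})")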

Project 2 Data Sample Analysis: October 16

In class today, we looked at the ages of people who were killed by the police, using data from the Washington Post. When we checked the ages of everyone, we noticed that the distribution was skewed to the right, meaning most people were on the younger side with a long tail of older ages. Using the mean and standard deviation, we worked out that about 67% of the people who were shot by the police were between 24 and 50 years old.

Then, we did the same thing but just for black and white people. For black individuals, about 78% were between 25 and 44, and for white individuals, it was between 23 and 53.

Next, we wanted to see if there was a difference in the average ages of black and white people shot by the police. And yes, there was! We used the Monte Carlo method and found a difference of roughly 7 years, one that is very unlikely to have arisen by chance.
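A rough sketch of the kind of Monte Carlo (random-relabeling) test described above is shown below. The arrays black_ages and white_ages are assumed inputs, and the exact resampling scheme used in class may have been different.

# A hedged sketch of a Monte Carlo test for the observed gap in mean ages.
# black_ages and white_ages are assumed NumPy arrays of ages.
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_diff_test(black_ages, white_ages, n_iter=10_000):
    observed = white_ages.mean() - black_ages.mean()
    pooled = np.concatenate([black_ages, white_ages])
    n_black = len(black_ages)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                      # random relabeling under the null
        diff = pooled[n_black:].mean() - pooled[:n_black].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_iter              # observed gap and Monte Carlo p-value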

To understand how much this age difference matters overall, we used Cohen's d. The value we got was 0.577, which by the usual guidelines is a medium-sized difference. So the age gap represents a noticeable, practically meaningful effect across the dataset.
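For completeness, here is a small sketch of the Cohen's d calculation with a pooled standard deviation; black_ages and white_ages are the same assumed NumPy arrays as in the previous sketch.

# Cohen's d for the age gap, using the pooled standard deviation.
import numpy as np

def cohens_d(a, b):
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# A value near 0.5 is conventionally read as a "medium" effect size.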

Estimating the sampling distribution using Bootstrap: 04-Oct-2023

Today I gained knowledge of the bootstrap. Bootstrapping is a statistical method for estimating the sampling distribution of a statistic without assuming a known underlying distribution. It works by repeatedly resampling the data with replacement and calculating the statistic of interest on each resampled dataset. It can be used to estimate the sampling distribution of any statistic, including the sample median, sample variance, and sample correlation coefficient, which makes it a powerful tool for statistical inference in a wide variety of settings.

Here is an example of how bootstrapping can be used:
Suppose we want to test the hypothesis that the average height of men is different from the average height of women. We could collect a sample of heights from men and a sample of heights from women, and then use bootstrapping to estimate the sampling distribution of the difference in sample means.
To do this, we would draw bootstrap samples from the men’s and women’s samples with replacement, and then calculate the difference in sample means on each bootstrap sample. The distribution of the difference in sample means from the bootstrap samples would be an estimate of the sampling distribution of the difference in sample means.
We could then use this distribution to calculate a p-value for the hypothesis test. To do so, the resampling is carried out under the null hypothesis, for example by centering both groups on the pooled mean before drawing the bootstrap samples; the p-value is then the proportion of bootstrap samples whose difference in sample means is as large as or larger than the difference observed in the original sample. If the p-value is less than a significance level of 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference in the average heights of men and women.
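Below is a minimal sketch of this bootstrap test in Python. Centering both groups on the pooled mean is one common way of imposing the null hypothesis of equal means (others exist), and the arrays men and women are assumed inputs of heights.

# A minimal bootstrap sketch for the heights example. Both groups are centered
# on the pooled mean to impose the null of equal means before resampling.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_diff_test(men, women, n_boot=10_000):
    observed = men.mean() - women.mean()
    pooled_mean = np.concatenate([men, women]).mean()
    men0 = men - men.mean() + pooled_mean        # impose the null: equal means
    women0 = women - women.mean() + pooled_mean
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        m = rng.choice(men0, size=len(men0), replace=True)
        w = rng.choice(women0, size=len(women0), replace=True)
        diffs[i] = m.mean() - w.mean()
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_value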
Bootstrapping is a versatile and powerful statistical tool that can be used for a variety of purposes. It is a good choice for researchers who want to make inferences about their data without assuming a known underlying distribution.

K-Fold Cross Validation using Health Dataset: 27-Sep-2023

In today’s class, we delved into the practical application of cross-validation using a dataset built around three vital health variables: obesity, inactivity, and diabetes. The dataset offered a comprehensive view of these health aspects, drawing on detailed measurements from 354 individuals. Our main goal was to develop predictive models that could shed light on the relationships between these variables, and to assess and select the most suitable model we adopted a 5-fold cross-validation technique.
This method divides the dataset into five distinct sections or “folds.” In each iteration, one fold is reserved for testing while the other four are used for training, giving a systematic way to gauge how well our models adapt to unseen data. As someone who values precision and accuracy in data analysis, the concept of Mean Squared Error (MSE) resonated with me. MSE served as a reliable yardstick for measuring model performance, quantifying how closely the predictions aligned with the actual data, and it was reassuring to have such a robust metric for evaluating the models objectively.
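Here is a sketch of how 5-fold cross-validation with MSE could be set up in scikit-learn. The file name and the column names "obesity", "inactivity", and "diabetes" are placeholders for whatever the class dataset actually uses, and predicting %diabetes from the other two variables is used purely as an example.

# A sketch of 5-fold cross-validation with MSE; names are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

health = pd.read_csv("health_data.csv")          # assumed file name
X = health[["obesity", "inactivity"]]            # assumed column names
y = health["diabetes"]

scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("Per-fold MSE:", -scores)
print("Mean CV MSE:", -scores.mean())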
To gain a more intuitive understanding of the dataset, we constructed a 3D scatter plot. This visual representation brought the data to life, with each data point depicted as a black dot, and the axes showcasing the values for obesity, inactivity, and diabetes. This visual tool allowed us to spot trends and clusters within the data, enhancing our grasp of these health-related variables.
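And here is a matplotlib sketch of that 3D scatter plot, using the same assumed file and column names as in the cross-validation sketch above.

# A sketch of the 3D scatter plot of obesity, inactivity, and diabetes.
import matplotlib.pyplot as plt
import pandas as pd

health = pd.read_csv("health_data.csv")          # assumed file and column names
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(health["obesity"], health["inactivity"], health["diabetes"],
           color="black", s=10)
ax.set_xlabel("% Obesity")
ax.set_ylabel("% Inactivity")
ax.set_zlabel("% Diabetes")
plt.show()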

Resampling Methods: Cross-Validation: 25-Sep-2023

Today I gained knowledge of two resampling methods, cross-validation and the bootstrap, by watching the videos. These methods are statistical techniques used in data analysis and machine learning to estimate the performance of a model or to create new datasets by resampling the existing data. They are particularly useful when data are limited or when we want to assess the robustness of a model. For example, they provide estimates of test-set prediction error and of the standard deviation and bias of our parameter estimates.

Cross-validation serves as a valuable technique for evaluating how well a predictive model performs on unseen data. The method involves dividing the dataset into several partitions, commonly referred to as K subsets or “folds” (hence the name K-fold cross-validation). The model is trained on all but one of these folds and tested on the remaining fold, and this cycle repeats several times, with each iteration using a different fold as the test set while the rest are used for training.

Estimating prediction error requires distinguishing between test error and training error. The test error is the error made when predicting the response on a new observation, one that was not used to train the method, whereas the training error can be easily calculated by applying the statistical learning method to the observations used in its training. In the validation-set approach, we randomly divide the set of samples into two parts: a training set and a validation set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation-set error provides an estimate of the test error.
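As a quick illustration of the validation-set approach, here is a sketch using synthetic placeholder data; the split proportion and the linear model are arbitrary choices for the example, not the ones from class.

# Validation-set approach: one random split into training and validation halves,
# with the validation MSE serving as an estimate of the test error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                   # synthetic predictors
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print("Validation-set MSE (estimate of test error):", val_mse)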

In summary, these resampling methods and insights into prediction error estimation equip me with the tools and knowledge needed to assess model performance and make data-driven decisions in the field of data analysis and machine learning.

Linear Model Explanation with Crab Molt Model: 09-20-2023

In today’s class, I learned about fitting a linear regression model to data in which both variables are not normally distributed and have high skewness, high variance, and high kurtosis. This is a challenging scenario for linear regression, because the method assumes certain characteristics of the data that may not hold here. When working with such data, it is important to consider alternative modeling approaches that can handle non-normality, skewness, and high kurtosis, such as robust regression methods or non-linear models. The best way to illustrate this is with the crab molt data.
For example, pre-molt describes the size of a crab’s shell prior to molting, while post-molt refers to the size of the shell after molting.
Today’s task was to build a linear model to predict the size of a crab’s shell before molting from the size of the shell after molting. We also tried to work out whether the difference between the two states is statistically significant, and concluded, using standard statistical inference, that the p-value in this case was less than 0.05, which leads us to reject the null hypothesis that there is no real difference.
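Here is a hedged sketch of how such an analysis could be run in Python. The premolt and postmolt arrays below are synthetic placeholders rather than the real crab molt data, and the class may have used a different test for the pre/post difference, so treat this as illustrative only.

# Sketch for the crab molt example: fit pre-molt ~ post-molt and look at the
# pre/post difference. The data below are synthetic stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
postmolt = rng.normal(loc=150.0, scale=10.0, size=100)        # synthetic stand-in data
premolt = postmolt - 14.0 + rng.normal(scale=2.0, size=100)   # synthetic stand-in data

fit = stats.linregress(postmolt, premolt)                     # linear model: premolt from postmolt
print("slope:", fit.slope, "intercept:", fit.intercept, "slope p-value:", fit.pvalue)

t_stat, p_value = stats.ttest_rel(premolt, postmolt)          # paired test of the pre/post difference
print("paired t-test p-value:", p_value)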
We also learned about the t-test, a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups or populations. For example, imagine two groups: Group A, taught using Method 1, and Group B, taught using Method 2, and we want to determine whether there is a significant difference in their average test scores. Ideally the t-test would know the true spread of the scores in the population (a kind of scaling term in the math), but it doesn’t, so it has to estimate it from the data. If that estimate is reasonable (under some specific conditions), the test can use its rulebook to say whether the groups genuinely differ.
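To make the “scaling term” idea concrete, here is a small sketch that estimates it (the pooled standard deviation) by hand and checks the result against scipy; the scores are made-up placeholder numbers.

# The t-test estimates the unknown population spread from the samples (the
# pooled standard deviation) and uses it to scale the difference in means.
import numpy as np
from scipy import stats

group_a = np.array([72, 85, 78, 90, 66, 81, 74, 88])   # Method 1 (placeholder scores)
group_b = np.array([80, 92, 86, 95, 79, 88, 84, 91])   # Method 2 (placeholder scores)

n_a, n_b = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1))
                    / (n_a + n_b - 2))                  # the estimated "scaling term"
t_manual = (group_a.mean() - group_b.mean()) / (pooled_sd * np.sqrt(1 / n_a + 1 / n_b))

t_scipy, p_value = stats.ttest_ind(group_a, group_b)   # should match t_manual
print(t_manual, t_scipy, p_value)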
In summary, a linear model for the crab molt data is a tool used to study and quantify the molting behavior of crabs, with the goal of understanding the factors that influence it. Such models can help researchers and practitioners make informed decisions related to crab populations and management.

Multiple Linear Regression and its model

Today I covered multiple regression and its models. Multiple regression is a statistical method that allows us to model the relationship between a dependent variable (also known as the outcome or target variable) and multiple independent variables (also known as the predictor or explanatory variables).
The multiple regression model is a linear equation of the form:
y = b0 + b1x1 + b2x2 + … + bnxn
where:
y is the dependent variable
x1, x2, …, xn are the independent variables
b0 is the intercept
b1, b2, …, bn are the regression coefficients
The regression coefficients represent the strength and direction of the relationship between each independent variable and the dependent variable. For example, if the regression coefficient for an independent variable is positive and significant, it means that an increase in that independent variable is associated with an increase in the dependent variable, holding the other predictors constant.
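A sketch of fitting such a model with statsmodels is shown below; the data are synthetic placeholders with two predictors, so the printed coefficients are purely illustrative.

# Fit y = b0 + b1*x1 + b2*x2 and inspect the coefficients and their p-values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                             # two synthetic predictors x1, x2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

X_design = sm.add_constant(X)                             # adds the intercept column b0
model = sm.OLS(y, X_design).fit()
print(model.params)                                       # estimates of b0, b1, b2
print(model.pvalues)                                      # significance of each coefficient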
Linear Model:
A linear model is a statistical model that assumes that the relationship between the dependent variable (also known as the outcome or target variable) and the independent variables (also known as the predictor or explanatory variables) is linear.

In other words, a linear model assumes that the dependent variable can be expressed as a linear function of the independent variables, plus an error term.

Linear models are widely used in a variety of fields, including science, engineering, business, and economics. They are used to make predictions, identify important relationships, and control for the effects of other variables.

Quadratic Model:

A quadratic model is a mathematical model that represents the relationship between a dependent variable (also known as the outcome or target variable) and an independent variable (also known as the predictor or explanatory variable) using a quadratic equation.

A quadratic equation is a polynomial equation of the second degree. The general form of a quadratic equation is:

ax^2 + bx + c = 0

where a, b, and c are real numbers and a ≠ 0. When used as a regression model, the dependent variable is expressed as y = ax^2 + bx + c, plus an error term.

The graph of a quadratic equation is a parabola. A parabola is a U-shaped curve that opens up or down. The vertex of the parabola is the point where the curve changes direction.

This model can be used to model a wide variety of real-world phenomena, such as the motion of a projectile, the growth of a population, and the production of a good or service.
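As an illustration, here is a sketch of fitting a quadratic model to projectile-style data with numpy.polyfit; the data are synthetic placeholders.

# Fit a quadratic model y = a*x^2 + b*x + c to noisy synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = -4.9 * x**2 + 30.0 * x + 2.0 + rng.normal(scale=5.0, size=x.size)

a, b, c = np.polyfit(x, y, deg=2)      # coefficients returned from highest degree down
print(f"fitted model: y = {a:.2f}x^2 + {b:.2f}x + {c:.2f}")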

P-Value

The p-value is the probability, computed assuming the null hypothesis is true (the hypothesis that treats all groups as the same), of observing a difference at least as large as the one seen in the data. The P stands for probability, and the value measures how compatible the observed difference between groups is with chance alone. Being a probability, P can take any value between 0 and 1. Values close to 0 indicate that the observed difference is unlikely to be due to chance alone, whereas a P value close to 1 suggests that the observed difference is entirely consistent with chance.

For example:

Let’s say we want to perform an experiment to see whether a new weight-loss drug (Drug X) causes people to lose body weight. We randomly sample a collection of volunteers and randomly assign them to two groups: Group A and Group B. We give Group A a placebo, which contains no active ingredients, so Group A is the control group. We give Group B the new drug (Drug X).

The participants are weighed at the start of the study and at the end of the study so that we can work out the change in body weight. At the end of the study we find that Group A’s average change was 0 kg; in other words, they neither gained nor lost weight. Group B’s average change was -1 kg, so on average they lost 1 kg of body weight. Does this mean that the drug worked?

To determine this, we first ask ourselves: what would happen in a world where the weight change in volunteers who receive Drug X (Group B) is the same as the weight change in volunteers who receive the placebo (Group A)?

This is where the null hypothesis comes in. The null hypothesis usually states that there is no difference between groups. So our null hypothesis is: the weight change in those who receive Drug X is the same as the weight change in those who receive the placebo.

Now, we ask ourselves: If this null hypothesis were true, what is the chance (or probability) of discovering a 1 kg reduction (or more) in body weight in those treated with Drug X from our sample?

This probability, or p-value, measures the strength of evidence against the null hypothesis. We can think of this as a court trial where the defendant is innocent until proven guilty; in this case, the defendant is the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.

To determine the p-value, scientists use statistical hypothesis tests. Common examples include the Student’s t-test and one-way ANOVA. Since this is a top-line overview, we will not go into the statistical details here; instead, let’s pretend we have performed a statistical test using our data.
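For illustration only, here is what such a test could look like in Python. The weight changes below are made-up placeholder numbers, not real experimental data, so the resulting p-value will not be the 0.02 used in the discussion that follows.

# A pretend version of the Drug X analysis with a two-sample t-test.
from scipy import stats

placebo_change = [0.2, -0.1, 0.4, 0.0, -0.3, 0.1, 0.3, -0.2]      # Group A (placeholder data)
drug_x_change = [-1.2, -0.8, -1.5, -0.9, -1.1, -0.7, -1.4, -1.0]  # Group B (placeholder data)

t_stat, p_value = stats.ttest_ind(drug_x_change, placebo_change)
print("p-value:", p_value)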

So, after inputting our experimental data into a statistical test, we get a p-value in return. Let’s say that for our example the p-value is 0.02. The p-value is a fraction, but it may be easier to convert it to a percentage to understand the concept: a value of 0.02 corresponds to 2% (simply multiply the fraction by 100).

But what does this p-value result of 0.02 (or 2%) actually represent?

Essentially a p-value of 2% means that if the null hypothesis were true (that the two population means are identical), then there is a 2% chance of observing a difference as large (or larger) than what we observed in our sample. To put that into perspective, a 2% chance corresponds to 1 in every 50 experiments of this size. But, how can this be? What is accounting for this 2%? Simply, this 2% can be accounted for by random noise.

Let’s elaborate a little more on Random noise.

There are quite a few things that can affect a p-value, and some of these factors are collectively known as random noise or random chance. One factor that can contribute to random noise, especially in human studies, is the coincidence of random sampling. For example, humans can exhibit a large amount of variation from one another due to genetic and environmental influences.

If we relate this back to our example, some people may carry an unknown gene that speeds up their metabolism and causes them to lose more weight than those without the gene. When recruiting volunteers for our experiment, we did not perform any DNA analysis before randomly assigning the volunteers to Group A (the control group) or Group B (the Drug X group), so there was no way of knowing who was a carrier.

Imagine a situation where, just by pure coincidence, more volunteers with the high-metabolism gene are placed in Group B than in Group A; it then makes sense that this group lost more weight. So you can see that a pure coincidence of random sampling can have a knock-on effect on the p-value.

So, to sum up, a p-value is a probability, a value between 0 and 1. It represents the probability of obtaining the observed difference (or a larger one) in the outcome measure of the sample, given that no difference exists between the treatments in the population (that is, given that the null hypothesis is true).

Random noise can affect the p-value. An example of random noise is the coincidence of random sampling.

Simple Linear Regression

Today I learned about simple linear regression, a statistical method that allows us to summarize and study the relationship between two continuous variables:

    • the independent variable X, also known as the predictor, regressor, or explanatory variable;
    • the dependent variable Y, also known as the outcome or predicted variable.

Mathematically, we can write this linear relationship as
Y = α + βX.

For the datasets provided, relating %obesity and %inactivity, X may represent %obesity and Y may represent %inactivity; we can then regress %inactivity onto %obesity by fitting the model.

α and β are two unknown constants that represent the intercept and slope terms in the linear model. The fitted model gives predictions of the form

ˆy = ˆα + ˆβx

where ˆy indicates a prediction of Y on the basis of X = x. The underlying model is Y = α + βX + ε, where ε is a mean-zero random error term. Here we use a hat symbol, ˆ, to denote the estimated value of an unknown parameter or coefficient, or the predicted value of the response.

The least squares approach chooses ˆα and ˆβ to minimize the RSS. Using some calculus, we can show that the minimizers are

ˆβ = (∑(from i=1 to n)(xi − ¯x)(yi − ¯y)) ⁄ (∑(from i=1 to n)(xi − ¯x)²)

ˆα = ¯y − ˆβ¯x
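These formulas translate directly into NumPy. In the sketch below, x and y are synthetic placeholders standing in for the %obesity and %inactivity values.

# Least-squares slope and intercept computed from the formulas above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(15, 40, size=100)                       # placeholder %obesity values
y = 5.0 + 0.6 * x + rng.normal(scale=2.0, size=100)     # placeholder %inactivity values

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
print("slope:", beta_hat, "intercept:", alpha_hat)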

The first thing we have to do is generate a description of the %obesity and %inactivity data for the common data points. This is simply good data-analytic practice for getting to know and understand our data, and it is the basic first step in any statistical analysis.

After this we will import all of the %obesity data, extract from that list the data points for which we also have %inactivity data, and then generate descriptive statistics for those %obesity data points. The next topic session is about heteroscedasticity, where we plot the residuals versus the predicted values from the linear model (a sketch is shown below).
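Here is a sketch of that residuals-versus-fitted plot, reusing the same kind of synthetic placeholder data as in the previous sketch; with the real data, x and y would be the %obesity and %inactivity values for the common data points.

# Heteroscedasticity check: fit the simple linear model and plot residuals
# against fitted values.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(15, 40, size=100)                       # placeholder %obesity values
y = 5.0 + 0.6 * x + rng.normal(scale=2.0, size=100)     # placeholder %inactivity values

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
fitted = alpha_hat + beta_hat * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted %inactivity")
plt.ylabel("Residual")
plt.title("Residuals vs fitted values")
plt.show()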
