Month: September 2023
Resampling Methods: Cross-Validation 25-sep-2023
Today I gained knowledge of Resampling Methods: Cross-Validation and the Bootstrap by watching the videos. These methods are statistical techniques used in data analysis and machine learning to estimate the performance of a model or to create new datasets by resampling the existing data. They are particularly useful when we are dealing with limited datasets or when we want to assess the reliability of a model. For example, they provide estimates of test-set prediction error and of the standard deviation and bias of our parameter estimates.
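To make the bootstrap idea concrete, here is a minimal sketch (my own illustration, not from the videos) of estimating the standard error of a statistic by resampling with replacement; the data array and sample sizes are made up for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # illustrative sample

def bootstrap_se(sample, statistic, n_boot=1000, rng=rng):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    estimates = [
        statistic(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(n_boot)
    ]
    return np.std(estimates, ddof=1)

print("Bootstrap SE of the mean:", bootstrap_se(data, np.mean))
```

Each bootstrap sample is the same size as the original, so the spread of the resampled statistics approximates the sampling variability of the estimator.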
Cross-validation is a valuable technique for evaluating how well a predictive model performs on unseen data. The method involves dividing the dataset into several partitions, commonly referred to as K subsets or folds, which is why it is called K-fold cross-validation. The model is trained on all but one of these folds and tested on the remaining fold, and this cycle repeats several times, with each iteration using a different fold as the test set while the rest are employed for training.
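A minimal sketch of K-fold cross-validation using scikit-learn; the synthetic dataset and the linear model here are stand-ins for illustration, not the course data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the held-out fold, repeat
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("Per-fold MSE:", -scores)
print("CV estimate of test MSE:", -scores.mean())
```

Averaging the per-fold errors gives a single cross-validation estimate of the test error.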
Estimating prediction error requires distinguishing between test error and training error. The test error is the error made when predicting the response on a new observation, one that was not used to train the method, whereas the training error can be easily calculated by applying the statistical learning method to the very observations used in its training. In the validation set approach, we randomly divide the set of samples into two parts: a training set and a validation set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation-set error provides an estimate of the test error.
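A sketch of the validation set approach with scikit-learn, again on made-up data; the 50/50 split is one common choice, not a rule:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=1)

# Randomly divide the samples into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  random_state=1)

model = LinearRegression().fit(X_train, y_train)             # fit on training set
val_error = mean_squared_error(y_val, model.predict(X_val))  # validation-set error
print("Estimated test MSE:", val_error)
```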
In summary, these resampling methods and insights into prediction error estimation equip me with the tools and knowledge needed to assess model performance and make data-driven decisions in the field of data analysis and machine learning.
Linear Model Explanation with the Crab Molt Model: 20-sep-2023
Multiple Linear Regression and its model
A linear model assumes that the dependent variable can be expressed as a linear function of the independent variables, plus an error term.
Linear models are widely used in a variety of fields, including science, engineering, business, and economics. They are used to make predictions, identify important relationships, and control for the effects of other variables.
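As a small illustration of multiple linear regression (the data and coefficients below are made up for demonstration), the model can be fit by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))                      # two independent variables
beta_true = np.array([3.0, -1.5])
y = 2.0 + X @ beta_true + rng.normal(scale=0.5, size=n)  # linear function + error term

# Least squares fit: prepend a column of ones for the intercept
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("Intercept and slopes:", coef)
```

The recovered intercept and slopes should be close to the true values of 2.0, 3.0, and -1.5.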
Quadratic Model:
A quadratic model is a mathematical model that represents the relationship between a dependent variable (also known as the outcome or target variable) and an independent variable (also known as the predictor or explanatory variable) using a quadratic equation.
A quadratic equation is a polynomial equation of the second degree. The general form of a quadratic equation is:
ax^2 + bx + c = 0
where a, b, and c are real numbers and a ≠ 0.
The graph of a quadratic equation is a parabola. A parabola is a U-shaped curve that opens up or down. The vertex of the parabola is the point where the curve changes direction.
Quadratic models can be used to describe a wide variety of real-world phenomena, such as the motion of a projectile, the growth of a population, or the production of a good or service.
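A sketch of fitting a quadratic model to projectile-style data; the trajectory values and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 4, 50)                       # time (s)
height = -4.9 * t**2 + 20 * t + 1.5             # projectile: a*t^2 + b*t + c
height_obs = height + rng.normal(scale=0.5, size=t.size)  # noisy observations

# Fit a degree-2 polynomial (quadratic model) by least squares
a, b, c = np.polyfit(t, height_obs, deg=2)
print(f"Fitted model: {a:.2f}*t^2 + {b:.2f}*t + {c:.2f}")

# Vertex of the fitted parabola (where the curve changes direction)
t_vertex = -b / (2 * a)
print("Vertex at t =", t_vertex)
```

Since a is negative here, the parabola opens downward and the vertex is the maximum height.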
P-Value
The p-value is often loosely described as the probability that the null hypothesis is true; more precisely, it is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one measured. The null hypothesis treats all groups as the same or equal. The P stands for probability, and it measures how likely it is that any observed difference between groups is due to chance. Being a probability, P can take any value between 0 and 1. Values close to 0 indicate that the observed difference is unlikely to be due to chance, whereas a P value close to 1 suggests the difference between the groups is no more than what chance alone would produce.
For example:
Let’s say we want to perform an experiment to see if a new type of weight-loss drug (Drug X) causes people to lose body weight. We randomly sample a collection of volunteers and randomly assign them to two groups: Group A and Group B. We give Group A a placebo, which contains no active ingredients; Group A is therefore the control group. We give Group B the new drug (Drug X).
The participants are weighed at the start of the study and at the end, so we can work out the change in body weight. Suppose we find that Group A’s average body weight difference was 0 kg; in other words, they did not gain or lose any body weight. Group B’s average weight difference was -1 kg, so on average they lost 1 kg of body weight. Does this mean that the drug worked?
To determine this, we first ask ourselves: what would happen in a world where the weight difference in volunteers who receive Drug X (Group B) is the same as the weight difference in those who receive the placebo (Group A)?
This is where the null hypothesis comes in. Usually, the null hypothesis states that there is no difference between groups. So, our null hypothesis is: the weight difference in those who receive Drug X is the same as the weight difference in those who receive the placebo.
Now, we ask ourselves: If this null hypothesis were true, what is the chance (or probability) of discovering a 1 kg reduction (or more) in body weight in those treated with Drug X from our sample?
This probability, or p-value, measures the strength of evidence against the null hypothesis. We can think of this as a court trial where the defendant is innocent until proven guilty; in this case, the defendant is the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.
To determine the p-value, scientists use statistical hypothesis tests. Common examples include Student’s t-test and one-way ANOVA. Since this is a top-line overview, we will not get bogged down in statistical jargon; instead, let’s pretend we have performed a statistical test using our data.
So, after inputting our experimental data into a statistical test, we get a p-value in return. Let’s say for our example the p-value is 0.02. It is worth mentioning that the p-value is a fraction; however, it may be easier to convert it to a percentage to understand the concept. A value of 0.02 corresponds to 2% (simply multiply the fraction by 100).
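A sketch of how such a p-value could be computed in practice, using SciPy’s two-sample t-test on simulated weight differences; the group sizes, spread, and resulting p-value are illustrative, chosen only to mirror the example above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated weight differences (kg); group means mirror the example (0 vs -1)
group_a = rng.normal(loc=0.0, scale=1.2, size=30)   # placebo
group_b = rng.normal(loc=-1.0, scale=1.2, size=30)  # Drug X

# Two-sample Student's t-test of the null hypothesis of equal means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```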
But what does this p-value result of 0.02 (or 2%) actually represent?
Essentially, a p-value of 2% means that if the null hypothesis were true (that the two population means are identical), then there is a 2% chance of observing a difference as large as (or larger than) the one we observed in our sample. To put that into perspective, a 2% chance corresponds to 1 in every 50 experiments of this size. But how can this be? What accounts for this 2%? Simply put, it can be accounted for by random noise.
Let’s elaborate a little more on random noise.
There are quite a few things that can impact a p-value, and some of these factors are collectively known as random noise or random chance. One factor that can contribute to random noise, especially in human studies, is the coincidence of random sampling. For example, humans can exhibit a large amount of variation between one another due to genetic and environmental influences.
If we relate back to our example, some humans may carry an unknown gene that speeds up their metabolism and causes them to lose more weight than those without the gene. When recruiting volunteers for our experiment, we did not perform any DNA analysis before randomly assigning the volunteers to either Group A, the control group, or Group B, the Drug X group; so, there was no way of knowing who was a carrier and who was not.
Imagine a situation where, just by pure coincidence, more volunteers with the high-metabolism gene are placed in Group B than in Group A; it then makes sense that this group lost more weight. Here you can see that a pure coincidence of random sampling can have a knock-on effect on the p-value.
So, to sum up, a p-value refers to a probability value. This p-value is a value between 0 and 1. It represents the probability of obtaining the observed difference (or a larger one) in the outcome measure of the sample, given that no difference exists between treatments in the population (the null hypothesis is true).
Random noise can affect the p-value. An example of random noise is the coincidence of random sampling.
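To see how random noise alone produces small p-values, here is a small simulation of my own (not from the course material): both groups are drawn from the same population, so the null hypothesis is true in every experiment, yet roughly 5% of experiments still reach p < 0.05 by chance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Simulate many experiments where the null hypothesis is TRUE:
# both groups are drawn from the same population
false_positives = 0
n_experiments = 1000
for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.2, size=30)
    b = rng.normal(loc=0.0, scale=1.2, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of experiments show p < 0.05 purely from random sampling
print("False-positive rate:", false_positives / n_experiments)
```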
Simple Linear Regression
Today I learned about Simple Linear Regression, a statistical method that allows us to summarize and study relationships between two continuous variables:
- The independent variable X, also known as the predictor, regressor, or explanatory variable.
- The dependent variable Y, also known as the outcome or predicted variable.
Mathematically, we can write this linear relationship as
Y = α + βX
For the datasets provided, relating %Obesity and %Inactivity, X may represent %Obesity and Y may represent %Inactivity; we can then regress %Inactivity onto %Obesity by fitting this model. Here α and β are two unknown constants that represent the intercept and slope terms in the linear model.
ŷ = α̂ + β̂x
where ŷ indicates a prediction of Y on the basis of X = x. The underlying model also includes a mean-zero random error term ε, so that Y = α + βX + ε. Here we use a hat symbol, ˆ, to denote the estimated value of an unknown parameter or coefficient, or the predicted value of the response.
The least squares approach chooses α̂ and β̂ to minimize the residual sum of squares (RSS). Using some calculus, we can show that the minimizers are
β̂ = ( ∑(from i=1 to n)(xᵢ − x̄)(yᵢ − ȳ) ) ⁄ ( ∑(from i=1 to n)(xᵢ − x̄)² )
α̂ = ȳ − β̂x̄
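These formulas translate directly into code. The sketch below computes α̂ and β̂ from the equations above on simulated data; the x and y values are made-up placeholders for the %Obesity and %Inactivity columns:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=30, scale=5, size=100)        # e.g. %Obesity (illustrative)
y = 0.8 * x + 2 + rng.normal(scale=2, size=100)  # e.g. %Inactivity (illustrative)

x_bar, y_bar = x.mean(), y.mean()
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
alpha_hat = y_bar - beta_hat * x_bar                                     # intercept

y_pred = alpha_hat + beta_hat * x   # predictions: y_hat = alpha_hat + beta_hat * x
print(f"intercept = {alpha_hat:.3f}, slope = {beta_hat:.3f}")
```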
The first thing we have to do is generate descriptive statistics of the %obesity and %inactivity data for the common data points. This is simply good data-analytic practice for getting to know and understand our data; it is the basic first step in any statistical analysis.
After this, we will import all the %obesity data, extract those data points for which we also have %inactivity data, and generate descriptive statistics for that subset. The next topic session covers heteroscedasticity, where we plot the residuals versus the predicted values from the linear model.
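A sketch of that workflow in pandas; the file names, the FIPS join key, and the column names are assumptions standing in for the actual course datasets:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# File and column names here are placeholders for the course datasets
obesity = pd.read_csv("obesity.csv")        # assumed columns: FIPS, pct_obesity
inactivity = pd.read_csv("inactivity.csv")  # assumed columns: FIPS, pct_inactivity

# Keep only the data points present in both datasets
common = obesity.merge(inactivity, on="FIPS")
print(common[["pct_obesity", "pct_inactivity"]].describe())

# Fit the linear model and plot residuals vs. predicted values
beta, alpha = np.polyfit(common["pct_obesity"], common["pct_inactivity"], deg=1)
predicted = alpha + beta * common["pct_obesity"]
residuals = common["pct_inactivity"] - predicted

plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted %inactivity")
plt.ylabel("Residual")
plt.title("Residuals vs. predicted values")
plt.show()
```

If the spread of the residuals widens or narrows as the predicted values increase, that is a visual sign of heteroscedasticity.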