Month: September 2023
Resampling Methods: Cross-Validation 25-sep-2023
Today I gained knowledge of Resampling Methods: Cross-Validation and the Bootstrap by watching the videos. These methods are statistical techniques used in data analysis and machine learning to estimate the performance of a model or to create new datasets by resampling the existing data. They are particularly useful when we are dealing with limited datasets or when we want to assess the reliability of a model. For example, they provide estimates of test-set prediction error and of the standard deviation and bias of our parameter estimates.
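To make the bootstrap idea concrete, here is a minimal sketch (my own illustration, not from the videos) of estimating the standard error of a statistic by resampling with replacement; the data array and sample sizes are made up for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # illustrative sample

def bootstrap_se(sample, statistic, n_boot=1000, rng=rng):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    estimates = [
        statistic(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(n_boot)
    ]
    return np.std(estimates, ddof=1)

print("Bootstrap SE of the mean:", bootstrap_se(data, np.mean))
```

Each bootstrap sample is the same size as the original, so the spread of the resampled statistics approximates the sampling variability of the estimator.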
Cross-validation is a valuable technique for evaluating how well a predictive model performs on unseen data. The method involves dividing the dataset into several partitions, commonly referred to as K subsets or folds, which is why it is called K-fold cross-validation. The model is trained on all but one of these folds and tested on the remaining fold, and this cycle repeats several times, with each iteration using a different fold as the test set while the rest are employed for training.
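A minimal sketch of K-fold cross-validation using scikit-learn; the synthetic dataset and the linear model here are stand-ins for illustration, not the course data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the held-out fold, repeat
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("Per-fold MSE:", -scores)
print("CV estimate of test MSE:", -scores.mean())
```

Averaging the per-fold errors gives a single cross-validation estimate of the test error.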
Estimating prediction error requires distinguishing between test error and training error. The test error is the error made when predicting the response on a new observation, one that was not used to train the method, whereas the training error can be easily calculated by applying the statistical learning method to the very observations used in its training. In the validation set approach, we randomly divide the set of samples into two parts: a training set and a validation set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation-set error provides an estimate of the test error.
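A sketch of the validation set approach with scikit-learn, again on made-up data; the 50/50 split is one common choice, not a rule:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=1)

# Randomly divide the samples into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  random_state=1)

model = LinearRegression().fit(X_train, y_train)             # fit on training set
val_error = mean_squared_error(y_val, model.predict(X_val))  # validation-set error
print("Estimated test MSE:", val_error)
```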
In summary, these resampling methods and insights into prediction error estimation equip me with the tools and knowledge needed to assess model performance and make data-driven decisions in the field of data analysis and machine learning.
Linear Model Explanation with the Crab Molt Model: 20-sep-2023
Multiple Linear Regression and its model
A linear model assumes that the dependent variable can be expressed as a linear function of the independent variables, plus an error term.
Linear models are widely used in a variety of fields, including science, engineering, business, and economics. They are used to make predictions, identify important relationships, and control for the effects of other variables.
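As a small illustration of multiple linear regression (the data and coefficients below are made up for demonstration), the model can be fit by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))                      # two independent variables
beta_true = np.array([3.0, -1.5])
y = 2.0 + X @ beta_true + rng.normal(scale=0.5, size=n)  # linear function + error term

# Least squares fit: prepend a column of ones for the intercept
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("Intercept and slopes:", coef)
```

The recovered intercept and slopes should be close to the true values of 2.0, 3.0, and -1.5.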
Quadratic Model:
A quadratic model is a mathematical model that represents the relationship between a dependent variable (also known as the outcome or target variable) and an independent variable (also known as the predictor or explanatory variable) using a quadratic equation.
A quadratic equation is a polynomial equation of the second degree. The general form of a quadratic equation is:
ax^2 + bx + c = 0
where a, b, and c are real numbers and a ≠ 0.
The graph of a quadratic equation is a parabola. A parabola is a U-shaped curve that opens up or down. The vertex of the parabola is the point where the curve changes direction.
Quadratic models can be used to describe a wide variety of real-world phenomena, such as the motion of a projectile, the growth of a population, or the production of a good or service.
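A sketch of fitting a quadratic model to projectile-style data; the trajectory values and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 4, 50)                       # time (s)
height = -4.9 * t**2 + 20 * t + 1.5             # projectile: a*t^2 + b*t + c
height_obs = height + rng.normal(scale=0.5, size=t.size)  # noisy observations

# Fit a degree-2 polynomial (quadratic model) by least squares
a, b, c = np.polyfit(t, height_obs, deg=2)
print(f"Fitted model: {a:.2f}*t^2 + {b:.2f}*t + {c:.2f}")

# Vertex of the fitted parabola (where the curve changes direction)
t_vertex = -b / (2 * a)
print("Vertex at t =", t_vertex)
```

Since a is negative here, the parabola opens downward and the vertex is the maximum height.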
P-Value
The p-value is often loosely described as the probability that the null hypothesis is true; more precisely, it is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one measured. The null hypothesis treats all groups as the same or equal. The P stands for probability, and it measures how likely it is that any observed difference between groups is due to chance. Being a probability, P can take any value between 0 and 1. Values close to 0 indicate that the observed difference is unlikely to be due to chance, whereas a P value close to 1 suggests the difference between the groups is no more than what chance alone would produce.
For example:
Let’s say we want to perform an experiment to see if a new type of weight-loss drug (Drug X) causes people to lose body weight. We randomly sample a collection of volunteers and randomly assign them to two groups: Group A and Group B. We give Group A a placebo, which contains no active ingredients; Group A is therefore the control group. We give Group B the new drug (Drug X).
The participants are weighed at the start of the study and at the end, so we can work out the change in body weight. Suppose we find that Group A’s average body weight difference was 0 kg; in other words, they did not gain or lose any body weight. Group B’s average weight difference was -1 kg, so on average they lost 1 kg of body weight. Does this mean that the drug worked?
To determine this, we first ask ourselves: what would happen in a world where the weight difference in volunteers who receive Drug X (Group B) is the same as the weight difference in those who receive the placebo (Group A)?
This is where the null hypothesis comes in. Usually, the null hypothesis states that there is no difference between groups. So, our null hypothesis is: the weight difference in those who receive Drug X is the same as the weight difference in those who receive the placebo.
Now, we ask ourselves: If this null hypothesis were true, what is the chance (or probability) of discovering a 1 kg reduction (or more) in body weight in those treated with Drug X from our sample?
This probability, or p-value, measures the strength of evidence against the null hypothesis. We can think of this as a court trial where the defendant is innocent until proven guilty; in this case, the defendant is the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.
To determine the p-value, scientists use statistical hypothesis tests. Common examples include Student’s t-test and one-way ANOVA. Since this is a top-line overview, we will not get bogged down in statistical jargon; instead, let’s pretend we have performed a statistical test using our data.
So, after inputting our experimental data into a statistical test, we get a p-value in return. Let’s say for our example the p-value is 0.02. It is worth mentioning that the p-value is a fraction; however, it may be easier to convert it to a percentage to understand the concept. A value of 0.02 corresponds to 2% (simply multiply the fraction by 100).
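A sketch of how such a p-value could be computed in practice, using SciPy’s two-sample t-test on simulated weight differences; the group sizes, spread, and resulting p-value are illustrative, chosen only to mirror the example above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated weight differences (kg); group means mirror the example (0 vs -1)
group_a = rng.normal(loc=0.0, scale=1.2, size=30)   # placebo
group_b = rng.normal(loc=-1.0, scale=1.2, size=30)  # Drug X

# Two-sample Student's t-test of the null hypothesis of equal means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```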
But what does this p-value result of 0.02 (or 2%) actually represent?
Essentially, a p-value of 2% means that if the null hypothesis were true (that the two population means are identical), then there is a 2% chance of observing a difference as large as (or larger than) the one we observed in our sample. To put that into perspective, a 2% chance corresponds to 1 in every 50 experiments of this size. But how can this be? What accounts for this 2%? Simply put, it can be accounted for by random noise.
Let’s elaborate a little more on random noise.
There are quite a few things that can impact a p-value, and some of these factors are collectively known as random noise or random chance. One factor that can contribute to random noise, especially in human studies, is the coincidence of random sampling. For example, humans can exhibit a large amount of variation between one another due to genetic and environmental influences.
If we relate back to our example, some humans may carry an unknown gene that speeds up their metabolism and causes them to lose more weight than those without the gene. When recruiting volunteers for our experiment, we did not perform any DNA analysis before randomly assigning the volunteers to either Group A, the control group, or Group B, the Drug X group; so, there was no way of knowing who was a carrier and who was not.
Imagine a situation where, just by pure coincidence, more volunteers with the high-metabolism gene are placed in Group B than in Group A; it then makes sense that this group lost more weight. Here you can see that a pure coincidence of random sampling can have a knock-on effect on the p-value.
So, to sum up, a p-value refers to a probability value. This p-value is a value between 0 and 1. It represents the probability of obtaining the observed difference (or a larger one) in the outcome measure of the sample, given that no difference exists between treatments in the population (the null hypothesis is true).
Random noise can affect the p-value. An example of random noise is the coincidence of random sampling.
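To see how random noise alone produces small p-values, here is a small simulation of my own (not from the course material): both groups are drawn from the same population, so the null hypothesis is true in every experiment, yet roughly 5% of experiments still reach p < 0.05 by chance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Simulate many experiments where the null hypothesis is TRUE:
# both groups are drawn from the same population
false_positives = 0
n_experiments = 1000
for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.2, size=30)
    b = rng.normal(loc=0.0, scale=1.2, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of experiments show p < 0.05 purely from random sampling
print("False-positive rate:", false_positives / n_experiments)
```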
Simple Linear Regression
Today I learned about Simple Linear Regression, a statistical method that allows us to summarize and study relationships between two continuous variables:
- The independent variable X, also known as the predictor, regressor, or explanatory variable.
- The dependent variable Y, also known as the outcome or predicted variable.
Mathematically, we can write this linear relationship as
Y = α + βX
For the datasets provided, relating %Obesity and %Inactivity, X may represent %Obesity and Y may represent %Inactivity; we can then regress %Inactivity onto %Obesity by fitting this model. Here α and β are two unknown constants that represent the intercept and slope terms in the linear model.
ŷ = α̂ + β̂x
where ŷ indicates a prediction of Y on the basis of X = x. The underlying model also includes a mean-zero random error term ε, so that Y = α + βX + ε. Here we use a hat symbol, ˆ, to denote the estimated value of an unknown parameter or coefficient, or the predicted value of the response.
The least squares approach chooses α̂ and β̂ to minimize the residual sum of squares (RSS). Using some calculus, we can show that the minimizers are
β̂ = ( ∑(from i=1 to n)(xᵢ − x̄)(yᵢ − ȳ) ) ⁄ ( ∑(from i=1 to n)(xᵢ − x̄)² )
α̂ = ȳ − β̂x̄
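These formulas translate directly into code. The sketch below computes α̂ and β̂ from the equations above on simulated data; the x and y values are made-up placeholders for the %Obesity and %Inactivity columns:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=30, scale=5, size=100)        # e.g. %Obesity (illustrative)
y = 0.8 * x + 2 + rng.normal(scale=2, size=100)  # e.g. %Inactivity (illustrative)

x_bar, y_bar = x.mean(), y.mean()
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
alpha_hat = y_bar - beta_hat * x_bar                                     # intercept

y_pred = alpha_hat + beta_hat * x   # predictions: y_hat = alpha_hat + beta_hat * x
print(f"intercept = {alpha_hat:.3f}, slope = {beta_hat:.3f}")
```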
The first thing we have to do is generate descriptive statistics of the %obesity and %inactivity data for the common data points. This is simply good data-analytic practice for getting to know and understand our data; it is the basic first step in any statistical analysis.
After this, we will import all the %obesity data, extract those data points for which we also have %inactivity data, and generate descriptive statistics for that subset. The next topic session covers heteroscedasticity, where we plot the residuals versus the predicted values from the linear model.
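A sketch of that workflow in pandas; the file names, the FIPS join key, and the column names are assumptions standing in for the actual course datasets:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# File and column names here are placeholders for the course datasets
obesity = pd.read_csv("obesity.csv")        # assumed columns: FIPS, pct_obesity
inactivity = pd.read_csv("inactivity.csv")  # assumed columns: FIPS, pct_inactivity

# Keep only the data points present in both datasets
common = obesity.merge(inactivity, on="FIPS")
print(common[["pct_obesity", "pct_inactivity"]].describe())

# Fit the linear model and plot residuals vs. predicted values
beta, alpha = np.polyfit(common["pct_obesity"], common["pct_inactivity"], deg=1)
predicted = alpha + beta * common["pct_obesity"]
residuals = common["pct_inactivity"] - predicted

plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted %inactivity")
plt.ylabel("Residual")
plt.title("Residuals vs. predicted values")
plt.show()
```

If the spread of the residuals widens or narrows as the predicted values increase, that is a visual sign of heteroscedasticity.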