In this task, you will check for outliers and assess their potential impact on your estimates.
Before you analyze your data, check the distribution and normality of each continuous variable and identify any outliers.
Statements | Explanation |
---|---|
proc univariate data=demo_BP2b normal plot; | Use the proc univariate procedure to get all default descriptive statistics, such as the mean, minimum and maximum values, standard deviation, and skewness. Use the normal and plot options to request tests for normality and distribution plots, including a normal probability plot. |
where ridstatr=2 and ridageyr>=20; | Use the where statement to select the participants who were interviewed and examined in the MEC and who were age 20 years and older. |
id seqn; | Use the id statement to list the sequence numbers associated with extreme values in the output. |
var lbxtc; | Use the var statement to list the variable of interest. |
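
Assembled into a single step, the statements in the table might look like the sketch below. It assumes the analytic dataset demo_BP2b is already available in the WORK library, and it places normal and plot as options on the proc univariate statement, which is where SAS accepts them. The closing run statement is not shown in the table and is added here to end the step.

```sas
/* Sketch only: assumes demo_BP2b already exists in the WORK library. */
proc univariate data=demo_BP2b normal plot;
  where ridstatr=2 and ridageyr>=20;  /* interviewed and MEC-examined, age 20+ */
  id seqn;                            /* label extreme observations with SEQN  */
  var lbxtc;                          /* total serum cholesterol               */
run;
```

The extreme observations section of the output lists the lowest and highest lbxtc values together with their SEQN, which is what the later steps use to identify candidate outliers.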
Highlighted items from the univariate analysis output:
In this example, you will plot the 4-year MEC survey weight (wtmec4yr) against the distribution of the cholesterol variable to determine whether the extreme observations are outliers.
Statements | Explanation |
---|---|
symbol1 value=dot height=.2; | Use the symbol1 statement with the value and height options to format the points in the plot. |
proc gplot data=demo_BP2b; | Use the proc gplot procedure to plot the total serum cholesterol (lbxtc) against the corresponding sample weight for each observation in the dataset. |
where ridstatr=2 and ridageyr>=20; | Use the where statement to select the participants who were interviewed and examined in the MEC and who were age 20 years and older. |
plot wtmec4yr*lbxtc / frame; | Use the plot statement to place the survey weight on the vertical axis and total cholesterol on the horizontal axis. |
title 'NHANES 1999-2002, adults age 20 years and older'; | Use the title statement to label the plot. |
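
Written out as one step, the plot request might look like the following sketch. It assumes SAS/GRAPH is licensed and that demo_BP2b is in the WORK library. The where condition is the one used elsewhere in this task, and the run and quit statements are additions to end the procedure, not part of the table.

```sas
/* Sketch only: requires SAS/GRAPH; assumes demo_BP2b exists in WORK. */
symbol1 value=dot height=.2;           /* small dots for each observation */

proc gplot data=demo_BP2b;
  where ridstatr=2 and ridageyr>=20;   /* same subpopulation as the univariate step */
  plot wtmec4yr*lbxtc / frame;         /* y = survey weight, x = total cholesterol  */
  title 'NHANES 1999-2002, adults age 20 years and older';
run;
quit;                                  /* proc gplot stays active until quit */
```

Points that sit far out on the cholesterol axis and also carry a large survey weight are the ones most likely to influence weighted estimates.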
Highlighted items from plotting the survey weight against the distribution of the cholesterol variable:
In this step you will delete the outliers and compare the estimates obtained with and without them:
Statements | Explanation |
---|---|
data exclu_3SPs; set demo_BP2b; | Use the data and set statements to create the dataset exclu_3SPs from your analytic dataset. |
if seqn in (10494, 13996, 17821) then delete; | Use the if-then statement to delete the outliers by their SEQN, as identified in the plot of survey weight versus the distribution of the variable. The SEQNs associated with these outliers are listed in the proc univariate output under extreme observations. |
proc format; value race 1='Mexican American'; | Use the proc format procedure to give easily understood labels to your race/ethnicity variable values. |
proc means data=demo_BP2b mean stderr maxdec=1; | Use the proc means procedure to determine the mean and standard error for the dataset with the outliers. |
where ridstatr=2 and ridageyr>=20; | Use the where statement to select the participants who were interviewed and examined in the MEC and who were age 20 years and older. |
var lbxtc; | Use the var statement to indicate the variable of interest. |
class ridreth1; | Use the class statement to group the variable of interest by race/ethnicity categories. |
weight wtmec4yr; | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data is used. |
format ridreth1 race.; | Use the format statement to label your race variable with easily understood labels you defined in the proc format statement. |
proc means data=exclu_3SPs mean stderr maxdec=1; | Use the proc means procedure to determine the mean and standard error for the dataset without the outliers. |
var lbxtc; | Use the var statement to indicate the variable of interest. |
class ridreth1; | Use the class statement to group the variable of interest by race/ethnicity categories. |
weight wtmec4yr; | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for four years of data is used. |
format ridreth1 race.; | Use the format statement to label your race variable with easily understood labels you defined in the proc format statement. |
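
Put together, the comparison might be run as in the sketch below. Several details are assumptions: the run statements are added to close each step, the race format contains only the one label shown in the table (other codes would print as raw numbers), and the where statement is repeated in the second proc means so that both runs cover the same subpopulation, although the table does not list it there.

```sas
/* Sketch only: assumes demo_BP2b exists in the WORK library. */

/* Copy the analytic dataset, dropping the three outliers by SEQN. */
data exclu_3SPs;
  set demo_BP2b;
  if seqn in (10494, 13996, 17821) then delete;
run;

/* Only the label shown in the table is defined here. */
proc format;
  value race 1='Mexican American';
run;

/* Weighted mean and standard error WITH the outliers. */
proc means data=demo_BP2b mean stderr maxdec=1;
  where ridstatr=2 and ridageyr>=20;
  var lbxtc;
  class ridreth1;
  weight wtmec4yr;
  format ridreth1 race.;
run;

/* Weighted mean and standard error WITHOUT the outliers. */
proc means data=exclu_3SPs mean stderr maxdec=1;
  where ridstatr=2 and ridageyr>=20;  /* assumption: same subset as the first run */
  var lbxtc;
  class ridreth1;
  weight wtmec4yr;
  format ridreth1 race.;
run;
```

Comparing the two sets of output by race/ethnicity category shows how much the three deleted observations shift the mean and standard error.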
Highlighted items from comparison of the results with and without outliers: