How to Identify and Describe the Impact of Influential Outliers
Before you analyze your data, examine it for outlying values.
Delete Observations with Implausible Values
A week contains 10,080 minutes (7 days × 24 hours × 60 minutes), so any larger value of weekly activity is impossible. In the PAQMSTR dataset we created a variable, TOTMINW, that combines weekly minutes of household/yard, transportation, and leisure-time activity into total minutes of moderate-to-vigorous physical activity per week. Delete study participants whose TOTMINW exceeds 10,080.
Sample Code
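proc freq data=paq;
  /* List each implausible value with the participant's sequence number */
  tables TOTMINW*SEQN / list;
  /* Use a single WHERE statement: a second WHERE would replace the first */
  where WTINT4CD > 0 and RIDAGEYR >= 16 and TOTMINW > 10080;
run;

data paq;
  set paq;
  /* Drop participants who report more minutes than a week contains */
  if TOTMINW > 10080 then delete;
run;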
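If you would rather leave the original dataset intact and keep a record of what was removed, the same rule can route observations to two datasets instead of overwriting paq. This is only a sketch; the dataset names paq_clean and paq_implausible are hypothetical and do not appear in this tutorial.

```sas
/* Sketch only: paq_clean and paq_implausible are hypothetical names. */
data paq_clean paq_implausible;
  set paq;
  /* A missing TOTMINW compares as less than any number in SAS, so it is kept */
  if TOTMINW > 10080 then output paq_implausible;
  else output paq_clean;
run;
```

The SAS log reports the number of observations written to each dataset, which gives a quick count of how many participants the rule excluded.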
Check for Outliers among Plausible Data by Running a Univariate Analysis
Use the UNIVARIATE procedure to get the default descriptive statistics, such as the mean, minimum and maximum values, standard deviation, and skewness. Use the VAR statement to identify the variable of interest (TOTMINW). Use the ID statement to list the sequence numbers associated with extreme values in the output.
Sample Code
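/* NORMAL requests tests for normality; PLOT requests distribution plots */
proc univariate data=paq normal plot;
  var TOTMINW;
  where WTINT4CD > 0 and RIDAGEYR >= 16;
  /* Label extreme observations with the participant's sequence number */
  id seqn;
  title 'Distribution of TOTMINW among study participants aged 16 and older';
run;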
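By default, the extreme observations table lists the five lowest and five highest values. If you want to review more candidates, something along these lines may help. It is a sketch that assumes the NEXTROBS= option and the ExtremeObs ODS table name behave as documented in your SAS release; the choice of 10 is arbitrary.

```sas
/* Sketch only: print just the extreme observations table, 10 per tail. */
ods select ExtremeObs;
proc univariate data=paq nextrobs=10;
  var TOTMINW;
  where WTINT4CD > 0 and RIDAGEYR >= 16;
  id seqn;
run;
```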
Plot Sample Weight against the Distribution of the Variable
Use the GPLOT procedure to plot the sample weight (WTINT4CD) of each observation against its total minutes of moderate-to-vigorous activity per week (TOTMINW). An extreme value matters most when it also carries a large sample weight, because it then has more influence on weighted estimates. Set 7,560 minutes per week as the maximum reasonable volume of weekly activity: 18 hours per day for 7 days, assuming that study participants sleep at least 6 hours each night. The HREF= option draws a vertical reference line at that value.
Sample Code
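/* Plot each observation as a small square */
symbol1 value=square height=.5;
proc gplot data=paq;
  /* Sample weight on the vertical axis, TOTMINW on the horizontal axis */
  plot WTINT4CD*TOTMINW / href=7560 frame;
run;
quit;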
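To see exactly which participants fall to the right of the reference line, a short listing can accompany the plot. This is a sketch, not part of the original program; it uses only the variables already named in this tutorial.

```sas
/* Sketch only: list observations above the reasonable maximum. */
/* 18 waking hours x 60 minutes x 7 days = 7,560 minutes per week */
proc print data=paq noobs;
  where TOTMINW > 7560;
  var SEQN TOTMINW WTINT4CD;
run;
```

Reading the listed sample weights alongside the plot makes it easier to judge which of these observations are likely to be influential.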