Outliers

Outliers are values of a numeric variable that do not fit the general pattern of the rest of the data. They can be identified by visually inspecting the data, with either a histogram (for a single variable) or a scatterplot (for two variables).
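As a rough illustration of the histogram check, the sketch below bins a small set of values and prints a text histogram. The data, the bin width, and the helper name bin_counts are all hypothetical and chosen only for this example; in practice you would plot your own variable with whatever graphing tool you already use.

```python
from collections import Counter

# Hypothetical measurements (assumption); 98.0 is planted as a possible outlier.
values = [12.1, 13.4, 11.8, 12.9, 14.2, 13.1, 12.5, 98.0, 13.7, 12.2]
BIN_WIDTH = 10  # assumed bin width, picked to suit this made-up data


def bin_counts(data, bin_width):
    """Count values per bin, keyed by each bin's lower edge, including empty bins."""
    counts = Counter(int(v // bin_width) for v in data)
    first, last = min(counts), max(counts)
    return {i * bin_width: counts.get(i, 0) for i in range(first, last + 1)}


bins = bin_counts(values, BIN_WIDTH)
for lower, count in bins.items():
    # One '#' per value in the bin; empty bins print as blank rows.
    print(f"{lower:>3}-{lower + BIN_WIDTH:<3} {'#' * count}")
```

A value that sits alone, separated from the main cluster by a run of empty bins, is a candidate outlier worth investigating.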

What happens if I have an outlier in my data?

If you notice an outlier in your dataset, the first step is to investigate whether it is a measurement error or a typo. If an error in the data file cannot be ruled out, try to trace the value back to the source of the data. Is it possible that the person coding the data made an entry mistake? Is it plausible that a survey respondent made an error? Was the sample subject in question possibly outside of your target population? In general, we do not advocate removing outliers from the dataset unless there is a sound reason to exclude those values, and what counts as a sound reason can itself be debatable.

How will an outlier affect my analysis?

Some outliers are more influential than others. An influential point is a value that, if removed, would noticeably alter the results of your analysis. To gauge a point's influence, run your analysis with and without that sample subject. Does your interpretation of the data change when you remove it? If so, you have a choice to make: remove that record and continue with your analysis (which we wouldn’t recommend without justification), or present your results with and without the influential point and discuss the difference in your report. Another option is to use a more advanced statistical method, such as robust regression, which is less sensitive to outliers.
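The with-and-without comparison can be sketched as follows. This is a minimal illustration that assumes the analysis is a simple straight-line regression fitted by ordinary least squares; the data, the suspect record, and the helper name fit_line are hypothetical, and your own analysis may call for a different model.

```python
def fit_line(points):
    """Ordinary least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data (assumption): y is roughly 2x, except for the last record.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8), (6, 12.0), (10, 5.0)]
suspect = (10, 5.0)  # the record being checked for influence

slope_all, intercept_all = fit_line(data)
slope_wo, intercept_wo = fit_line([p for p in data if p != suspect])

print(f"with suspect:    slope={slope_all:.2f}, intercept={intercept_all:.2f}")
print(f"without suspect: slope={slope_wo:.2f}, intercept={intercept_wo:.2f}")
```

With this made-up data, the fitted slope falls from about 2 to well below 1 once the suspect record is included, so the point is influential and both sets of results would be worth reporting.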