I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Any suggestions? I tried using the retain function, but I can only figure out how to retain last non-missing value.
I thinks what you are looking for might be more like interpolation. While this is not mean of two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called proc expand. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of of dates and cumulative calculations.
data have; input Id Amount; datalines; 1 10 2 . 3 20 4 30 5 . 6 . 7 40 ; run; proc expand data=have out=Expanded; convert amount=amount_expanded / method=join; id id; /*second is column name */ run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
data have; input id amount; cards; 1 10 2 . 3 20 4 30 5 . 6 . 7 40 ; run; proc sort data=have out=reversed; by descending id; run; data retain_non_missing; set reversed; retain next_non_missing; if amount ne . then next_non_missing = amount; run; proc sort data=retain_non_missing out=ordered; by id; run; data final; set ordered; retain last_non_missing; if amount ne . then last_non_missing = amount; if amount = . then amount = (last_non_missing + next_non_missing) / 2; run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, allowing it to use
RETAIN to carry the
next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a
PROC to do this in a more controlled way (I don't know anything about
PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
What it does is uses two
set statements, one that gets the main (and previous) amounts, and one that sets until the next amount is found. Here I use the sequence of
id variables to guarantee it will be the right record, but you could write this differently if needed (keeping track of what loop you're on) if the id variables aren't sequential or in an order of any sort.
I use the
first.amount check to make sure we don't try to execute the second
set statement more than we should (which would terminate early).
You need to do two things differently if you want first/last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume last_amount is missing, meaning the last row just gets the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose, I don't know your data.
data have; input Id Amount; datalines; 1 10 2 . 3 20 4 30 5 . 6 . 7 40 ;;;; run; data want; set have; by amount notsorted; *so we can tell if we have consecutive missings; retain prev_amount; *next_amount is auto-retained; if not missing(amount ) then prev_amount=amount; else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as; else if first.amount then do; do until ((next_id > id and not missing(next_amount)) or (eof)); set have(rename=(id=next_id amount=next_amount)) end=eof; end; amount = mean(prev_amount,next_amount); end; else amount = mean(prev_amount,next_amount); run;