Dataset looks like the following. There are 1000's of IDs and 100,000's of observations.
'Wanted Value' is the desired output using the variable "Value". The number of observations for ID is not constant, normally between 5 and 30. There could be intermittent missing values in between which need to be repopulated plus or minus the date difference, depending upon the moving forward vs backward. Missing values are labelled as ".". Thanks for your help!!
ID | Date | Value | Date Diff | Wanted Value |
1 | 1/1/2000 | . | - | 600-92-91-91 |
1 | 4/1/2000 | . | 91 | 600-92-91-91 |
1 | 7/1/2000 | . | 91 | 600-92-91 |
1 | 10/1/2000 | . | 92 | 600-92 |
1 | 1/1/2001 | 600 | 92 | 600 |
1 | 4/1/2001 | 600 | 90 | 600 |
1 | 7/1/2001 | . | 91 | 600+91 |
1 | 10/1/2001 | 730 | 92 | 730 |
2 | 1/1/2002 | . | 92 | 550-92 |
2 | 4/1/2002 | 550 | 90 | 550 |
2 | 7/1/2002 | . | 91 | 550+91 |
2 | 10/1/2002 | 600 | 92 | 600 |
2 | 1/1/2003 | . | 92 | 600+92 |
2 | 4/1/2003 | . | 90 | 600+92+90 |
2 | 7/1/2003 | . | 91 | 600+92+90+91 |
2 | 10/1/2003 | . | 92 | 600+92+90+91+92 |
2 | 1/1/2004 | . | 92 | 600+92+90+91+92+92 |
2 | 4/1/2004 | . | 91 | 600+92+90+91+92+92+91 |
data have;
infile cards expandtabs truncover;
input ID Date :$20. Value DateDiff;
cards;
1 1/1/2000 . .
1 4/1/2000 . 91
1 7/1/2000 . 91
1 10/1/2000 . 92
1 1/1/2001 600 92
1 4/1/2001 600 90
1 7/1/2001 . 91
1 10/1/2001 730 92
2 1/1/2002 . 92
2 4/1/2002 550 90
2 7/1/2002 . 91
2 10/1/2002 600 92
2 1/1/2003 . 92
2 4/1/2003 . 90
2 7/1/2003 . 91
2 10/1/2003 . 92
2 1/1/2004 . 92
2 4/1/2004 . 91
;
data temp;
set have;
n+1;
if missing(DateDiff) then DateDiff=0;
run;
data temp1;
do until(last.id);
set temp;
by id notsorted;
if not missing(value) then want=value;
else want=want+DateDiff;
output;
end;
run;
proc sort data=temp1 out=temp2;
by descending n;
run;
data temp3;
do until(last.id);
set temp2;
by id notsorted;
if not missing(want) then want2=want;
else want2=want2-DateDiff;
output;
end;
run;
proc sort data=temp3 out=want(drop=want n);
by n;
run;
So the final number should be:
ID | Date | Value | Date Diff | Wanted Value |
1 | 1/1/2000 | . | - | 326 |
and so on.
Is that second column supposed to be a DATE value? Which dates are those? Is the second value supposed to mean April fools day ? or the fourth day of January?
Why are some of the cells using subtraction and some are using addition?
What is the meaning of the column header 'date diff'? Is that a difference in dates? So 92 means about 3 months? Or is a difference in the VALUES. The later makes more sense in terms of the equations you have the last column.
Why does one cell have just a hyphen in it? Is that supposed by a missing value? Or some type of special missing that is different than the other missing values?
Is that second column supposed to be a DATE value? Which dates are those? Is the second value supposed to mean April fools day ? or the fourth day of January?
-- 1 date to 2 date is always about 3 months (April, 1st - Jan 1st and so on)
Why are some of the cells using subtraction and some are using addition?
-- if 'wanted value' back in time then subtraction is needed. Opposite for forward in time i.e. addition
What is the meaning of the column header 'date diff'? Is that a difference in dates?
-- Yes: Date2 - Date1
So 92 means about 3 months?
- Yes
Or is a difference in the VALUES. The later makes more sense in terms of the equations you have the last column.
Why does one cell have just a hyphen in it? Is that supposed by a missing value? Or some type of special missing that is different than the other missing values?
- that should be 0 or no values. Because of Date1 - (no previous date).
Explain this part:
Why are some of the cells using subtraction and some are using addition?
-- if 'wanted value' back in time then subtraction is needed. Opposite for forward in time i.e. addition
How would we know if your wanted value is back in time or forward in time? Back in time or forward in time from what point in time?
@PaigeMiller wrote:
Explain this part:
Why are some of the cells using subtraction and some are using addition?
-- if 'wanted value' back in time then subtraction is needed. Opposite for forward in time i.e. addition
How would we know if your wanted value is back in time or forward in time? Back in time or forward in time from what point in time?
I think the reference point intended is, for each ID, the first record where VALUE is not null.
@SAS_nova wrote:
Why does one cell have just a hyphen in it? Is that supposed by a missing value? Or some type of special missing that is different than the other missing values?
- that should be 0 or no values. Because of Date1 - (no previous date).
Since you are calculating stuff BY ID, shouldn't DateDIff be 0 (or null) for the first record of each ID? Meaning ID=2 Date=1/1/2002 should have DateDiff=0 not 92? Or alternatively, maybe DateDiff should have a value for the first record, because you know how many days were in that quarter, even if you don't have a date value in your data for the prior quarter.
But again, more big picture words about the purpose of this code would be helpful. Also curious that the meaning of VALUE is. I assume it's unit is days, since you are subtracting from and adding days to it.
You have to explain how "Wanted Value" is computed from the rest of the data.
We can't possibly tell you if a macro may be appropriate until we understand this (although it sure doesn't look like a macro is needed right now).
@SAS_nova wrote:
'Wanted Value' is the desired output using the variable "Value". You need to provide RULES for how Value is calculated. Examples I can program to exactly match your shown "need" but without a set of rules on how things are calculated the code is very unlikely to work for any other data.
"There could be intermittent missing values in between which need to be repopulated plus or minus the date difference, depending upon the moving forward vs backward." What is "repopulated"? Into what variable? Using what rules? What does "forward" mean? What does "backward" mean? How do we know when to apply forward and when backward?
It appears that "date diff" is the number of days from the previous date value, is that interpretation correct?
If so I am wondering what the "value" is if you are adding/subtracting days with that value.
I've stared at this for a while, and I think I'm starting to see the pattern.
I think the data you HAVE is the table with first four columns.
The data you WANT is the same table with the 5th column, WantedValue, added. Is that right? And the values of WantedValues is the result of evaluating the expression you entered (e.g. 600-92-91-91=326, so you want the value 326 for the first row).
It looks to me like your WantedValue is basically an accumulator. Where if VALUE is nonmissing, then WantedValue=VALUE. Else if Value is missing then WantedValue=WantedValue (retained)+Date Diff. Is that right?
The only tricky part is that you don't have an initial value for Wanted Value. You need to compute the initial WantedValue for each ID as the first non-missing VALUE for that ID minus the sum of all the DateDIff values before that record. I think. : )
That said, can you describe in words what you are doing? The dates are clearly the start of each quarter. DateDiff is number of days in a quarter. What is the meaning of VALUE? If this is all just date arithmetic, there is probably a better way to get what you want.
Also, could you please post your HAVE data as code (i.e. a DATA step with cards statement)?
But to answer your question, no, it doesn't look like a macro problem. There is likely a DATA step approach to solving this.
@HB wrote:
I'm wondering why if datediff is in fact a calculated date difference why one would store that instead of calculating it when it is needed.
Yep, part of the next round of clarifications.
The proper way to present example data is a DATA step with DATALINES, see here:
data have;
input id $ date :yymmdd10. value date_diff;
format date yymmdd10.;
datalines;
1 2000-01-01 . .
1 2000-04-01 . 91
;
Variable types, attributes and raw contents are clear.
Dates in YMD order are unambiguous.
Please expand on this example. Make changes to the code where necessary (where my assumptions do not reflect your real situation).
data have;
infile cards expandtabs truncover;
input ID Date :$20. Value DateDiff;
cards;
1 1/1/2000 . .
1 4/1/2000 . 91
1 7/1/2000 . 91
1 10/1/2000 . 92
1 1/1/2001 600 92
1 4/1/2001 600 90
1 7/1/2001 . 91
1 10/1/2001 730 92
2 1/1/2002 . 92
2 4/1/2002 550 90
2 7/1/2002 . 91
2 10/1/2002 600 92
2 1/1/2003 . 92
2 4/1/2003 . 90
2 7/1/2003 . 91
2 10/1/2003 . 92
2 1/1/2004 . 92
2 4/1/2004 . 91
;
data temp;
set have;
n+1;
if missing(DateDiff) then DateDiff=0;
run;
data temp1;
do until(last.id);
set temp;
by id notsorted;
if not missing(value) then want=value;
else want=want+DateDiff;
output;
end;
run;
proc sort data=temp1 out=temp2;
by descending n;
run;
data temp3;
do until(last.id);
set temp2;
by id notsorted;
if not missing(want) then want2=want;
else want2=want2-DateDiff;
output;
end;
run;
proc sort data=temp3 out=want(drop=want n);
by n;
run;
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.