Hoping to receive help with the following problem; data set immediately below followed by problem.
The actual data set is much more complex, but here's simplified data:
data test;
input id year date period span;
datalines;
1 2000 9 1 40
1 2000 9 2 375
1 2000 10 1 10
1 2000 10 2 355
1 2001 1 1 -15
1 2001 1 2 290
1 2001 9 1 39
1 2001 9 2 320
2 2000 11 1 43
2 2000 11 2 350
2 2000 12 1 25
2 2000 12 2 310
2 2000 1 1 -40
2 2000 1 2 280
;
run;
For each id-year group (the above data set is spaced based upon the id-year groupings), I need to keep a single id-year-date group (i.e., a set of period 1-2 pairs) where the observation from period 1 (of the period 1-2 pair) has a value of span which meets two criterion: 1) span is greater than 0, and 2) the value of span is closest to 0 of all other period 1 observations for the id-year group. Note, I do not want the period 2 observation that meets those 2 criterion, rather I want to keep period 2 based upon its period 1 counterpart that best meets those criterion.
Therefore, the final data set should look like this:
1 2000 9 1 40 (eliminated because 40 is farther away from 0 than 10)
1 2000 9 2 375 (eliminated because its period 1 pair was eliminated due to its paired period 1's span being at a greater distance)
1 2000 10 1 10
1 2000 10 2 355
1 2001 1 1 -15 (eliminated because the value of span is negative: -15)
1 2001 1 2 290 (eliminated because its period 1 pair was eliminated due to a negative span)
1 2001 9 1 39
1 2001 9 2 320
2 2000 11 1 43 (eliminated because 43 is farther away from 0 than 25)
2 2000 11 2 350 (eliminated because its period 1 pair was eliminated due to its paired period 1's span being at a greater distance)
2 2000 12 1 25
2 2000 12 2 310
2 2000 1 1 -40 (eliminated because the value of span was negative: -40)
2 2000 1 2 280 (eliminated because its period 1 pair was eliminated due to a negative span)
The actual dataset has more than just periods 1 and 2 and can contain any number of period groupings per id-year.
An acceptable solution could possibly copy the value of span from period 1 of each id-year-date group to the other periods within the group and then do a two-step process of first, eliminating all observations with span less than zero, and then second, sorting on id-year-period-span in descending order and then retain only the first observation of each id-year-period group...?
is this
2 2000 12 1 25
2 2000 12 2 310
correct in your expected output
or should it be
B 2000 1 2 280 ?
data test;
input id $ year period span;
datalines;
A 2000 1 40
A 2000 2 375
A 2000 1 10
A 2000 2 355
A 2001 1 -15
A 2001 2 290
A 2001 1 39
A 2001 2 320
B 2000 1 43
B 2000 2 350
B 2000 1 25
B 2000 2 310
B 2000 1 -40
B 2000 2 280
;
run;
proc sql;
create table want as
select *
from test
group by id,year,period
having min(case when span<0 then . else span end)=span
order by id, year, period;
quit;
Please test your example data datastep before posting it. ID is missing in your example as you do not use: input id $ so id is treated as numeric;
If ID or date has any role in the result, which I think it does, then you should show it.
I think your "want" example is incorrect: You show
1 2001 9 2 320
but there is a row of data
A 2001 1 2 290
which is the same year and period but 290 is closest to 0.
proc sort data=test; by id year period span; run; data want; set test; where span>0; by id year period ; if first.period; run;
I apologize for my test data, I posted it and then continually made revisions trying to make it more clear, but just caused everyone more problems.
That being said, my want data is not incorrect. Each period 1-2 pair comes as a pair based on the date value. It is imperative that the period 1 value in span of each 1-2 pair not be negative, therefore, A 2001 1 2 290 is disqualified because its period 1 pair based on date is A 2001 1 1 -15.
data test;
input id $ year period span;
datalines;
A 2000 1 40
A 2000 2 375
A 2000 1 10
A 2000 2 355
A 2001 1 -15
A 2001 2 290
A 2001 1 39
A 2001 2 320
B 2000 1 43
B 2000 2 350
B 2000 1 25
B 2000 2 310
B 2000 1 -40
B 2000 2 280
;
run;
data grp;
set test;
by id year period;
if period =1 then
grp+1;
run;
data temp;
f=1;
do until(last.grp);
set grp;
by grp notsorted;
if period =1 and span<0 then f=.;
if not missing(f) then output;
end;
drop grp;
run;
proc sql;
create table want as
select *
from temp
group by id, year,period
having min(span)=span
order by id, year, period;
quit;
@yeaforme wrote:
I apologize for my test data, I posted it and then continually made revisions trying to make it more clear, but just caused everyone more problems.
That being said, my want data is not incorrect. Each period 1-2 pair comes as a pair based on the date value. It is imperative that the period 1 value in span of each 1-2 pair not be negative, therefore, A 2001 1 2 290 is disqualified because its period 1 pair based on date is A 2001 1 1 -15.
Since you have changed the data at least twice and the requirements as well since I started to answer this request I have no further response.
@ballardw Oh my gosh, i didn't notice that
The requirements have not changed, though how detailed your understanding of it may have.
data test;
input id year date period span;
datalines;
1 2000 9 1 40
1 2000 9 2 375
1 2000 10 1 10
1 2000 10 2 355
1 2001 1 1 -15
1 2001 1 2 290
1 2001 9 1 39
1 2001 9 2 320
2 2000 11 1 43
2 2000 11 2 350
2 2000 12 1 25
2 2000 12 2 310
2 2000 1 1 -40
2 2000 1 2 280
;
run;
data grp;
set test;
by id year period notsorted;
if period =1 then
grp+1;
run;
data temp;
f=1;
do until(last.grp);
set grp;
by grp notsorted;
if period =1 and span<0 then f=.;
if not missing(f) then output;
end;
drop grp f;
run;
proc sql;
create table want as
select *
from temp
group by id, year,period
having min(span)=span
order by id, year, period;
quit;
I've figured out a solution. It involves: 1) using the retain statement to first spread the period 1's span value to the rest of group, 2) sort based on the retained span value, 3) eliminate group with negative retained span, and then 4) retain only the first group based on retained span. It may not be the most elegant, but fits the bill. Thanks to all for trying.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.