Re: Remove a group based on conditions

yeaforme · Posted 06-05-2018 03:53 PM

Hoping to receive help with the following problem; data set immediately below followed by problem.

The actual data set is much more complex, but here's simplified data:

data test;

input id year date period span;

datalines;

1 2000 9 1 40

1 2000 9 2 375

1 2000 10 1 10

1 2000 10 2 355

1 2001 1 1 -15

1 2001 1 2 290

1 2001 9 1 39

1 2001 9 2 320

2 2000 11 1 43

2 2000 11 2 350

2 2000 12 1 25

2 2000 12 2 310

2 2000 1 1 -40

2 2000 1 2 280

;

run;

For each id-year group (the above data set is spaced based upon the id-year groupings), I need to keep a single id-year-date group (i.e., a set of period 1-2 pairs) where the observation from period 1 (of the period 1-2 pair) has a value of span which meets two criterion: 1) span is greater than 0, and 2) the value of span is closest to 0 of all other period 1 observations for the id-year group. Note, I do not want the period 2 observation that meets those 2 criterion, rather I want to keep period 2 based upon its period 1 counterpart that best meets those criterion.

Therefore, the final data set should look like this:

~~1 2000 9 1 40~~ (eliminated because 40 is farther away from 0 than 10)

~~1 2000 9 2 375~~ (eliminated because its period 1 pair was eliminated due to its paired period 1's span being at a greater distance)

1 2000 10 1 10

1 2000 10 2 355

~~1 2001 1 1 -15~~ (eliminated because the value of span is negative: -15)

~~1 2001 1 2 290~~ (eliminated because its period 1 pair was eliminated due to a negative span)

1 2001 9 1 39

1 2001 9 2 320

~~2 2000 11 1 43~~ (eliminated because 43 is farther away from 0 than 25)

~~2 2000 11 2 350~~ (eliminated because its period 1 pair was eliminated due to its paired period 1's span being at a greater distance)

2 2000 12 1 25

2 2000 12 2 310

~~2 2000 1 1 -40~~ (eliminated because the value of span was negative: -40)

~~2 2000 1 2 280~~ (eliminated because its period 1 pair was eliminated due to a negative span)

The actual dataset has more than just periods 1 and 2 and can contain any number of period groupings per id-year.

An acceptable solution could possibly copy the value of span from period 1 of each id-year-date group to the other periods within the group and then do a two-step process of first, eliminating all observations with span less than zero, and then second, sorting on id-year-period-span in descending order and then retain only the first observation of each id-year-period group...?

novinosrin · Posted 06-05-2018 04:18 PM

is this

2 2000 12 1 25

2 2000 12 2 310

correct in your expected output

or should it be

B 2000 1 2 280 ?

data test;

input id $ year period span;

datalines;
A 2000 1 40
A 2000 2 375
A 2000 1 10
A 2000 2 355 
A 2001 1 -15
A 2001 2 290
A 2001 1 39
A 2001 2 320
B 2000 1 43
B 2000 2 350
B 2000 1 25
B 2000 2 310
B 2000 1 -40
B 2000 2 280
;

run;

proc sql;
create table want as
select *
from test
group by id,year,period
having min(case  when span<0 then . else span    end)=span
order by id, year, period;
quit;

yeaforme · Posted 06-05-2018 04:20 PM

The 2 in ID should be B, but the correct grouping for ID B should be the 1-2 pair with values 25-310. I added an additional date variable to the original data to hopefully make the span sets more obvious.

yeaforme · Posted 06-05-2018 05:04 PM

Please see updated inquiry.

ballardw · Posted 06-05-2018 04:21 PM

Please test your example data datastep before posting it. ID is missing in your example as you do not use: input id $ so id is treated as numeric;

If ID or date has any role in the result, which I think it does, then you should show it.

I think your "want" example is incorrect: You show

1 2001 9 2 320

but there is a row of data

A 2001 1 2 290

which is the same year and period but 290 is closest to 0.

proc sort data=test;
   by id year period  span;
run;
data want;
   set test;
   where span>0;
   by id year period ;
   if first.period;
run;

yeaforme · Posted 06-05-2018 04:27 PM

I apologize for my test data, I posted it and then continually made revisions trying to make it more clear, but just caused everyone more problems.

That being said, my want data is not incorrect. Each period 1-2 pair comes as a pair based on the date value. It is imperative that the period 1 value in span of each 1-2 pair not be negative, therefore, A 2001 1 2 290 is disqualified because its period 1 pair based on date is A 2001 1 1 -15.

novinosrin · Posted 06-05-2018 05:45 PM

data test;

input id $ year period span;

datalines;
A 2000 1 40
A 2000 2 375
A 2000 1 10
A 2000 2 355 
A 2001 1 -15
A 2001 2 290
A 2001 1 39
A 2001 2 320
B 2000 1 43
B 2000 2 350
B 2000 1 25
B 2000 2 310
B 2000 1 -40
B 2000 2 280
;

run;
data grp;
set test;
by id  year period;
if period =1 then 
grp+1;
run;
data temp;
f=1;
do until(last.grp);
set grp;
by  grp notsorted;
if period =1 and span<0 then f=.;
if not missing(f) then output;
end;
drop grp;
run;

proc sql;
create table want as
select *
from temp
group by id, year,period
having min(span)=span
order by id, year, period;
quit;

ballardw · Posted 06-05-2018 05:59 PM

@yeaforme wrote:

I apologize for my test data, I posted it and then continually made revisions trying to make it more clear, but just caused everyone more problems.

That being said, my want data is not incorrect. Each period 1-2 pair comes as a pair based on the date value. It is imperative that the period 1 value in span of each 1-2 pair not be negative, therefore, A 2001 1 2 290 is disqualified because its period 1 pair based on date is A 2001 1 1 -15.

Since you have changed the data at least twice and the requirements as well since I started to answer this request I have no further response.

novinosrin · Posted 06-05-2018 06:04 PM

@ballardw Oh my gosh, i didn't notice that

yeaforme · Posted 06-05-2018 06:40 PM

The requirements have not changed, though how detailed your understanding of it may have.

novinosrin · Posted 06-05-2018 06:06 PM

data test;

input id year date period span;

datalines;
1 2000 9 1 40
1 2000 9 2 375
1 2000 10 1 10
1 2000 10 2 355 
1 2001 1 1 -15
1 2001 1 2 290
1 2001 9 1 39
1 2001 9 2 320
2 2000 11 1 43
2 2000 11 2 350
2 2000 12 1 25
2 2000 12 2 310
2 2000 1 1 -40
2 2000 1 2 280
;

run;

data grp;
set test;
by id  year period notsorted;
if period =1 then 
grp+1;
run;
data temp;
f=1;
do until(last.grp);
set grp;
by  grp notsorted;
if period =1 and span<0 then f=.;
if not missing(f) then output;
end;
drop grp f;
run;

proc sql;
create table want as
select *
from temp
group by id, year,period
having min(span)=span
order by id, year, period;
quit;

yeaforme · Posted 06-05-2018 06:47 PM

I've figured out a solution. It involves: 1) using the retain statement to first spread the period 1's span value to the rest of group, 2) sort based on the retained span value, 3) eliminate group with negative retained span, and then 4) retain only the first group based on retained span. It may not be the most elegant, but fits the bill. Thanks to all for trying.

Registration is open

SAS Training: Just a Click Away