Hi,
I have following dataset.
data test;
input id date yymmdd8. X Y;
format date date9.;
datalines;
1 20160131 1 1
1 20160229 1 2
1 20160331 2 3
1 20160430 3 4
1 20160531 3 5
1 20160630 3 6
2 20160131 1 1
2 20160229 1 2
2 20160331 2 3
2 20160430 2 4
2 20160531 3 5
2 20160630 3 6
3 20160131 1 1
3 20160229 4 2
3 20160331 9 3
3 20160430 9 4
;
run;
id | date | X | Y |
1 | 31Jan2016 | 1 | 1 |
1 | 29Feb2016 | 1 | 2 |
1 | 31Mar2016 | 2 | 3 |
1 | 30Apr2016 | 3 | 4 |
1 | 31May2016 | 3 | 5 |
1 | 30Jun2016 | 3 | 6 |
2 | 31Jan2016 | 1 | 1 |
2 | 29Feb2016 | 1 | 2 |
2 | 31Mar2016 | 2 | 3 |
2 | 30Apr2016 | 2 | 4 |
2 | 31May2016 | 3 | 5 |
2 | 30Jun2016 | 3 | 6 |
3 | 31Jan2016 | 1 | 1 |
3 | 29Feb2016 | 4 | 2 |
3 | 31Mar2016 | 9 | 3 |
3 | 30Apr2016 | 9 | 4 |
From this I need to select the record for each ID, at the instance when (X > 2) for the first time.
So from the above dataset my output should be:
id | date | X | Y |
1 | 30Apr2016 | 3 | 4 |
2 | 31May2016 | 3 | 5 |
3 | 29Feb2016 | 4 | 2 |
I tried doing:
data test;
input id date yymmdd8. X Y;
format date date9.;
datalines;
1 20160131 1 1
1 20160229 1 2
1 20160331 2 3
1 20160430 3 4
1 20160531 3 5
1 20160630 3 6
2 20160131 1 1
2 20160229 1 2
2 20160331 2 3
2 20160430 2 4
2 20160531 3 5
2 20160630 3 6
3 20160131 1 1
3 20160229 4 2
3 20160331 9 3
3 20160430 9 4
;
run;
proc sort data=test out=test1;
by id descending x date;
run;
data test2;
set test1;
by id;
if first.id then output;
run;
But this would give me:
id | date | X | Y |
1 | 30Apr2016 | 3 | 4 |
2 | 31May2016 | 3 | 5 |
3 | 31Mar2016 | 9 | 3 |
Wrong entry for Id=3.
Any easy way to do this on a dataset with 60million records.
Thanks
I'm sure you will get to choose from a few possibilities. Here's one that assumes your data set is sorted by ID DATE:
data want;
need_a_record='Y';
do until (last.id);
set have;
by id;
if need_a_record='Y' and X > 2 then do;
output;
need_a_record='N';
end;
end;
if need_a_record='Y' then output;
drop need_a_record;
run;
*** Statement added to handle the change in requirements.
I'm sure you will get to choose from a few possibilities. Here's one that assumes your data set is sorted by ID DATE:
data want;
need_a_record='Y';
do until (last.id);
set have;
by id;
if need_a_record='Y' and X > 2 then do;
output;
need_a_record='N';
end;
end;
if need_a_record='Y' then output;
drop need_a_record;
run;
*** Statement added to handle the change in requirements.
@Astounding @ballardwthanks
If not sorted:
proc sort data=test ; by id date; run; data want;
set subset (where=(x >2)); by id date; if first.id; run;
If sorted by id and date then the datastep should suffice assuming you do not want any id if there is never a value greater than 2.
If you do want something when there are not any values for x > 2 then we need a rule.
@ballardw @Astounding Sorry as i think throught it, i shouldnt drops the ones that never went x>2. So for the ones which has X < 2, i would need to retain the most latest record. In the data below ID-4 has X=1 throughout, so I need to retain the last record for 4, which is 4 20160430 1 4.
data test;
input id date yymmdd8. X Y;
format date date9.;
datalines;
1 20160131 1 1
1 20160229 1 2
1 20160331 2 3
1 20160430 3 4
1 20160531 3 5
1 20160630 3 6
2 20160131 1 1
2 20160229 1 2
2 20160331 2 3
2 20160430 2 4
2 20160531 3 5
2 20160630 3 6
3 20160131 1 1
3 20160229 4 2
3 20160331 9 3
3 20160430 9 4
4 20160131 1 1
4 20160229 1 2
4 20160331 1 3
4 20160430 1 4
5 20160131 1 1
5 20160229 1 2
5 20160331 2 3
5 20160430 2 4
;
run;
thanks
Another approach:
proc sort data=test; by id date; run; data want; set test; by id date; retain found; if first.id then found=0; if found=0 and x>2 then do; output; found=1; end; if last.id and found=0 then output; drop found; run;
Please post code in code boxes opened with the {i} icon. Sometimes we get artifacts from pasting in the main window as your code gets transformed to html in the main box of the message. For instance copy some of your code from the forum and paste back into your editor. See any difference?
data have; input id date yymmdd8. X Y; format date date9.; datalines; 1 20160131 1 1 1 20160229 1 2 1 20160331 2 3 1 20160430 3 4 1 20160531 3 5 1 20160630 3 6 2 20160131 1 1 2 20160229 1 2 2 20160331 2 3 2 20160430 2 4 2 20160531 3 5 2 20160630 3 6 3 20160131 1 1 3 20160229 4 2 3 20160331 9 3 3 20160430 9 4 ; run; data want; set have; prev_x = lag(X); if prev_x > 2 and x > 2 then bad_now = 'N'; else do; if x>2 then bad_now='Y'; else bad_now='N'; end; run; data final; set want; where bad_now='Y'; run;
@ballardwthanks. I have thought of another approach 🙂
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.