Solved: Re: Output a group if conditional statement is met

OlsabeckT29 · Posted 05-30-2023 07:47 AM

Hello,

I have a dataset of 18 million records. There are repeating person ids. I need this dataset split into two separate datasets. One dataset should be for a person who has ever had mx = NVX. The other data set would be for everyone else.

Have:

Obs person_id MX 
1 10000103722 PFR 
2 10000103722 PFR 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
8 10000189649 PFR 
9 10000371289 PFR 
10 10000371289 PFR 
11 10000751831 NVX
12 10000810473 MOD 
13 10000810473 MOD 
14 10000829168 JSN 
15 10000864651 PFR 
16 10000864651 PFR 
17 10000865007 PFR 
18 10000865007 NVX 
19 10000865500 PFR 
20 10000865500 PFR 
21 10000865883 MOD 
22 10000865883 MOD 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX

Want:

Obs person_id MX 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
11 10000751831 NVX
17 10000865007 PFR 
18 10000865007 NVX 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX

I used the below code to try and do this, but it is only outputting records when mx=NVX and not outputting the entire group when NVX is found.

Data admin_adult2 admin_nvx;
Set admin_adult;
By person_id;
	if mx IN ('NVX') then output admin_nvx;
	Else output admin_adult2;
Run;

Kurt_Bremser · Posted 06-03-2023 01:08 AM

Use a double DO loop (your dataset seems to already be sorted by person_id):

data want;
do until (last.person_id);
  set have;
  by person_id;
  if mx = "NVX" then flag = 1;
end;
do until (last.person_id);
  set have;
  by person_id;
  if flag then output;
end;
drop flag;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

A_Kh · Posted 05-30-2023 09:31 AM

One way to get the result is to create a macro variable to reference all person_id that have met the criteria., then subset the data based on the macro variable values.
Eg (assuming person_id is a numeric variable):

data have;
input Obs person_id MX $;
cards; 
1 10000103722 PFR 
2 10000103722 PFR 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
8 10000189649 PFR 
9 10000371289 PFR 
10 10000371289 PFR 
11 10000751831 NVX
12 10000810473 MOD 
13 10000810473 MOD 
14 10000829168 JSN 
15 10000864651 PFR 
16 10000864651 PFR 
17 10000865007 PFR 
18 10000865007 NVX 
19 10000865500 PFR 
20 10000865500 PFR 
21 10000865883 MOD 
22 10000865883 MOD 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX
; 
proc print; run; 

proc sql noprint;
	select distinct person_id format=16.
		into: ids separated by ','
			from have
		where upcase(strip(mx)) eq 'NVX';
	%put Selected Person_id= &ids;
quit; 
data want1 want2;
	set have;
	if person_id in (&ids) then output want1;
	else output want2;
run;

Ksharp · Posted 05-31-2023 07:35 AM

data have;
input Obs person_id MX $;
cards; 
1 10000103722 PFR 
2 10000103722 PFR 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
8 10000189649 PFR 
9 10000371289 PFR 
10 10000371289 PFR 
11 10000751831 NVX
12 10000810473 MOD 
13 10000810473 MOD 
14 10000829168 JSN 
15 10000864651 PFR 
16 10000864651 PFR 
17 10000865007 PFR 
18 10000865007 NVX 
19 10000865500 PFR 
20 10000865500 PFR 
21 10000865883 MOD 
22 10000865883 MOD 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX
; 


proc sql;
create table NVX          as select distinct person_id from have where MX='NVX';
create table admin_nvx    as select * from have where person_id in (select  person_id from NVX);
create table admin_adult2 as select * from have where person_id not in (select  person_id from NVX);
quit;

s_lassen · Posted 05-31-2023 10:09 AM

Assuming your input data is sorted by person_id, a fast way to do it is this:

data admin_nvx admin_adult2;
  merge have(where=(mx='NVX') in=NVX) have;
  by person_id;
  if nvx then output admin_nvx;
  else output admin_adult2;
run;

donricardo · Posted 06-02-2023 05:30 PM

See if this does the job - It's returning the same number of records as your output above, but in a slightly different order. Just using some proc sort and SQL....

proc sort data=have nodupkey out=have_1_nvx;
by person_id;
where mx = 'NVX';
run;

proc sql;
select have.obs, have.person_id format=12., have.mx
from have_1_nvx,have
where have_1_nvx.person_id = have.person_id;
quit;

mkeintz · Posted 06-02-2023 11:20 PM

If you data are sorted by person_id, then a self merge will help you deterimine where person_id's have an 'NVX' record:


data have;
input Obs person_id MX $;
cards; 
1 10000103722 PFR 
2 10000103722 PFR 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
8 10000189649 PFR 
9 10000371289 PFR 
10 10000371289 PFR 
11 10000751831 NVX
12 10000810473 MOD 
13 10000810473 MOD 
14 10000829168 JSN 
15 10000864651 PFR 
16 10000864651 PFR 
17 10000865007 PFR 
18 10000865007 NVX 
19 10000865500 PFR 
20 10000865500 PFR 
21 10000865883 MOD 
22 10000865883 MOD 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX
; 
data want1 want2;
  merge have (where=(mx='NVX') in=found_nvx)
        have ;
  by person_id;
  if found_nvx then output want1;
  else output want2;
run;

The IN parameter FOUND_NVX will be 1 for an entire person_id even if only 1 such record is found.

And it's crucial to this self-merge that the dataset name with the WHERE=(mx='NVX') parameter appears first. This generates the desired creation of the FOUND_NVX dummy, while allowing the sequence of data records from the second reference to HAVE overwrite records from the first, to be passed on to the output datasets.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Kurt_Bremser · Posted 06-03-2023 01:08 AM

Use a double DO loop (your dataset seems to already be sorted by person_id):

data want;
do until (last.person_id);
  set have;
  by person_id;
  if mx = "NVX" then flag = 1;
end;
do until (last.person_id);
  set have;
  by person_id;
  if flag then output;
end;
drop flag;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS Innovate 2025: Call for Content

Classroom Training Available!