Hello,
I have a dataset of 18 million records. There are repeating person ids. I need this dataset split into two separate datasets. One dataset should be for a person who has ever had mx = NVX. The other data set would be for everyone else.
Have:
Obs person_id MX
1 10000103722 PFR
2 10000103722 PFR
3 10000181132 NVX
4 10000181132 MOD
5 10000189649 PFR
6 10000189649 PFR
7 10000189649 NVX
8 10000189649 PFR
9 10000371289 PFR
10 10000371289 PFR
11 10000751831 NVX
12 10000810473 MOD
13 10000810473 MOD
14 10000829168 JSN
15 10000864651 PFR
16 10000864651 PFR
17 10000865007 PFR
18 10000865007 NVX
19 10000865500 PFR
20 10000865500 PFR
21 10000865883 MOD
22 10000865883 MOD
23 10000866093 PFR
24 10000866093 NVX
25 10000866093 NVX
Want:
Obs person_id MX
3 10000181132 NVX
4 10000181132 MOD
5 10000189649 PFR
6 10000189649 PFR
7 10000189649 NVX
11 10000751831 NVX
17 10000865007 PFR
18 10000865007 NVX
23 10000866093 PFR
24 10000866093 NVX
25 10000866093 NVX
I used the below code to try and do this, but it is only outputting records when mx=NVX and not outputting the entire group when NVX is found.
Data admin_adult2 admin_nvx;
Set admin_adult;
By person_id;
if mx IN ('NVX') then output admin_nvx;
Else output admin_adult2;
Run;
Use a double DO loop (your dataset seems to already be sorted by person_id):
data want;
do until (last.person_id);
set have;
by person_id;
if mx = "NVX" then flag = 1;
end;
do until (last.person_id);
set have;
by person_id;
if flag then output;
end;
drop flag;
run;
One way to get the result is to create a macro variable to reference all person_id that have met the criteria., then subset the data based on the macro variable values.
Eg (assuming person_id is a numeric variable):
data have;
input Obs person_id MX $;
cards;
1 10000103722 PFR
2 10000103722 PFR
3 10000181132 NVX
4 10000181132 MOD
5 10000189649 PFR
6 10000189649 PFR
7 10000189649 NVX
8 10000189649 PFR
9 10000371289 PFR
10 10000371289 PFR
11 10000751831 NVX
12 10000810473 MOD
13 10000810473 MOD
14 10000829168 JSN
15 10000864651 PFR
16 10000864651 PFR
17 10000865007 PFR
18 10000865007 NVX
19 10000865500 PFR
20 10000865500 PFR
21 10000865883 MOD
22 10000865883 MOD
23 10000866093 PFR
24 10000866093 NVX
25 10000866093 NVX
;
proc print; run;
proc sql noprint;
select distinct person_id format=16.
into: ids separated by ','
from have
where upcase(strip(mx)) eq 'NVX';
%put Selected Person_id= &ids;
quit;
data want1 want2;
set have;
if person_id in (&ids) then output want1;
else output want2;
run;
data have;
input Obs person_id MX $;
cards;
1 10000103722 PFR
2 10000103722 PFR
3 10000181132 NVX
4 10000181132 MOD
5 10000189649 PFR
6 10000189649 PFR
7 10000189649 NVX
8 10000189649 PFR
9 10000371289 PFR
10 10000371289 PFR
11 10000751831 NVX
12 10000810473 MOD
13 10000810473 MOD
14 10000829168 JSN
15 10000864651 PFR
16 10000864651 PFR
17 10000865007 PFR
18 10000865007 NVX
19 10000865500 PFR
20 10000865500 PFR
21 10000865883 MOD
22 10000865883 MOD
23 10000866093 PFR
24 10000866093 NVX
25 10000866093 NVX
;
proc sql;
create table NVX as select distinct person_id from have where MX='NVX';
create table admin_nvx as select * from have where person_id in (select person_id from NVX);
create table admin_adult2 as select * from have where person_id not in (select person_id from NVX);
quit;
Assuming your input data is sorted by person_id, a fast way to do it is this:
data admin_nvx admin_adult2;
merge have(where=(mx='NVX') in=NVX) have;
by person_id;
if nvx then output admin_nvx;
else output admin_adult2;
run;
See if this does the job - It's returning the same number of records as your output above, but in a slightly different order. Just using some proc sort and SQL....
proc sort data=have nodupkey out=have_1_nvx;
by person_id;
where mx = 'NVX';
run;
proc sql;
select have.obs, have.person_id format=12., have.mx
from have_1_nvx,have
where have_1_nvx.person_id = have.person_id;
quit;
If you data are sorted by person_id, then a self merge will help you deterimine where person_id's have an 'NVX' record:
data have;
input Obs person_id MX $;
cards;
1 10000103722 PFR
2 10000103722 PFR
3 10000181132 NVX
4 10000181132 MOD
5 10000189649 PFR
6 10000189649 PFR
7 10000189649 NVX
8 10000189649 PFR
9 10000371289 PFR
10 10000371289 PFR
11 10000751831 NVX
12 10000810473 MOD
13 10000810473 MOD
14 10000829168 JSN
15 10000864651 PFR
16 10000864651 PFR
17 10000865007 PFR
18 10000865007 NVX
19 10000865500 PFR
20 10000865500 PFR
21 10000865883 MOD
22 10000865883 MOD
23 10000866093 PFR
24 10000866093 NVX
25 10000866093 NVX
;
data want1 want2;
merge have (where=(mx='NVX') in=found_nvx)
have ;
by person_id;
if found_nvx then output want1;
else output want2;
run;
The IN parameter FOUND_NVX will be 1 for an entire person_id even if only 1 such record is found.
And it's crucial to this self-merge that the dataset name with the WHERE=(mx='NVX') parameter appears first. This generates the desired creation of the FOUND_NVX dummy, while allowing the sequence of data records from the second reference to HAVE overwrite records from the first, to be passed on to the output datasets.
Use a double DO loop (your dataset seems to already be sorted by person_id):
data want;
do until (last.person_id);
set have;
by person_id;
if mx = "NVX" then flag = 1;
end;
do until (last.person_id);
set have;
by person_id;
if flag then output;
end;
drop flag;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.