BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
OlsabeckT29
Fluorite | Level 6

Hello,

I have a dataset of 18 million records. There are repeating person ids. I need this dataset split into two separate datasets. One dataset should be for a person who has ever had mx = NVX. The other data set would be for everyone else.

Have:

 

Obs person_id MX 
1 10000103722 PFR 
2 10000103722 PFR 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
8 10000189649 PFR 
9 10000371289 PFR 
10 10000371289 PFR 
11 10000751831 NVX
12 10000810473 MOD 
13 10000810473 MOD 
14 10000829168 JSN 
15 10000864651 PFR 
16 10000864651 PFR 
17 10000865007 PFR 
18 10000865007 NVX 
19 10000865500 PFR 
20 10000865500 PFR 
21 10000865883 MOD 
22 10000865883 MOD 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX

Want: 

Obs person_id MX 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
11 10000751831 NVX
17 10000865007 PFR 
18 10000865007 NVX 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX

I used the below code to try and do this, but it is only outputting records when mx=NVX and not outputting the entire group when NVX is found.

Data admin_adult2 admin_nvx;
Set admin_adult;
By person_id;
	if mx IN ('NVX') then output admin_nvx;
	Else output admin_adult2;
Run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Use a double DO loop (your dataset seems to already be sorted by person_id):

data want;
do until (last.person_id);
  set have;
  by person_id;
  if mx = "NVX" then flag = 1;
end;
do until (last.person_id);
  set have;
  by person_id;
  if flag then output;
end;
drop flag;
run;

View solution in original post

6 REPLIES 6
A_Kh
Lapis Lazuli | Level 10

One way to get the result is to create a macro variable to reference all person_id that have met the criteria., then subset the data based on the macro variable values. 
Eg (assuming person_id is a numeric variable):

data have;
input Obs person_id MX $;
cards; 
1 10000103722 PFR 
2 10000103722 PFR 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
8 10000189649 PFR 
9 10000371289 PFR 
10 10000371289 PFR 
11 10000751831 NVX
12 10000810473 MOD 
13 10000810473 MOD 
14 10000829168 JSN 
15 10000864651 PFR 
16 10000864651 PFR 
17 10000865007 PFR 
18 10000865007 NVX 
19 10000865500 PFR 
20 10000865500 PFR 
21 10000865883 MOD 
22 10000865883 MOD 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX
; 
proc print; run; 

proc sql noprint;
	select distinct person_id format=16.
		into: ids separated by ','
			from have
		where upcase(strip(mx)) eq 'NVX';
	%put Selected Person_id= &ids;
quit; 
data want1 want2;
	set have;
	if person_id in (&ids) then output want1;
	else output want2;
run; 
Ksharp
Super User
data have;
input Obs person_id MX $;
cards; 
1 10000103722 PFR 
2 10000103722 PFR 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
8 10000189649 PFR 
9 10000371289 PFR 
10 10000371289 PFR 
11 10000751831 NVX
12 10000810473 MOD 
13 10000810473 MOD 
14 10000829168 JSN 
15 10000864651 PFR 
16 10000864651 PFR 
17 10000865007 PFR 
18 10000865007 NVX 
19 10000865500 PFR 
20 10000865500 PFR 
21 10000865883 MOD 
22 10000865883 MOD 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX
; 


proc sql;
create table NVX          as select distinct person_id from have where MX='NVX';
create table admin_nvx    as select * from have where person_id in (select  person_id from NVX);
create table admin_adult2 as select * from have where person_id not in (select  person_id from NVX);
quit;
s_lassen
Meteorite | Level 14

Assuming your input data is sorted by person_id, a fast way to do it is this:

data admin_nvx admin_adult2;
  merge have(where=(mx='NVX') in=NVX) have;
  by person_id;
  if nvx then output admin_nvx;
  else output admin_adult2;
run;
donricardo
SAS Employee

See if this does the job - It's returning the same number of records as your output above, but in a slightly different order.   Just using some proc sort and SQL....

 

proc sort data=have nodupkey out=have_1_nvx;
by person_id;
where mx = 'NVX';
run;


proc sql;
select have.obs, have.person_id format=12., have.mx
from have_1_nvx,have
where have_1_nvx.person_id = have.person_id;
quit;

donricardo_0-1685741384740.png

 

 

mkeintz
PROC Star

If you data are sorted by person_id, then a self merge will help you deterimine where person_id's have an 'NVX' record:

 


data have;
input Obs person_id MX $;
cards; 
1 10000103722 PFR 
2 10000103722 PFR 
3 10000181132 NVX 
4 10000181132 MOD 
5 10000189649 PFR 
6 10000189649 PFR 
7 10000189649 NVX 
8 10000189649 PFR 
9 10000371289 PFR 
10 10000371289 PFR 
11 10000751831 NVX
12 10000810473 MOD 
13 10000810473 MOD 
14 10000829168 JSN 
15 10000864651 PFR 
16 10000864651 PFR 
17 10000865007 PFR 
18 10000865007 NVX 
19 10000865500 PFR 
20 10000865500 PFR 
21 10000865883 MOD 
22 10000865883 MOD 
23 10000866093 PFR 
24 10000866093 NVX 
25 10000866093 NVX
; 
data want1 want2;
  merge have (where=(mx='NVX') in=found_nvx)
        have ;
  by person_id;
  if found_nvx then output want1;
  else output want2;
run;

The IN parameter FOUND_NVX will be 1 for an entire person_id even if only 1 such record is found.

 

And it's crucial to this self-merge that the dataset name with the WHERE=(mx='NVX') parameter appears first.  This generates the desired creation of the FOUND_NVX dummy, while allowing the sequence of data records from the second reference to HAVE overwrite records from the first, to be passed on to the output datasets.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Kurt_Bremser
Super User

Use a double DO loop (your dataset seems to already be sorted by person_id):

data want;
do until (last.person_id);
  set have;
  by person_id;
  if mx = "NVX" then flag = 1;
end;
do until (last.person_id);
  set have;
  by person_id;
  if flag then output;
end;
drop flag;
run;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 894 views
  • 2 likes
  • 7 in conversation