Dear Community,
I have 2 data sets. Let's call them REFERENCE and PRACTICE.
Here's what REFERENCE looks like:
Group Code1 Code2 Code3
A 12 123 1234
B 25 26
C 526
Here's what PRACTICE looks like:
ID Var1 Var2 Var3 Var4 Var5
1 123
2 26 12 762 85
3 9 34 526
For each row in PRACTICE, I want to check if any of Var1 to Var5 exists in any row of REFERENCE. If so, then I want to flag that ID with the corresponding Group.
I think that this is best done by having 3 indicator variables in PRACTICE. Here's what I want PRACTICE to look like.
ID Var1 Var2 Var3 Var4 Var5 A B C
1 123 1 0 0
2 26 12 762 85 1 1 0
3 9 34 526 0 0 1
Ultimately, I want to count how many observations in PRACTICE fall under each group. (It is possible for an observation in PRACTICE to fall under multiple groups.)
Would you know how to do this efficiently? My actual REFERENCE has 14 rows and 3 codes. My actual PRACTICE has about 7,000 rows and up to 11 variables.
Thanks for your help!
Eric
How about:
data REFERENCE ; input Group $ Code1 Code2 Code3 ; cards; A 12 123 1234 B 25 26 . C 526 . . ; run; data PRACTICE ; input ID Var1 Var2 Var3 Var4 Var5 ; cards; 1 123 . . . . 2 26 12 762 85 . 3 9 34 526 . . ; run; data hash(keep=group val); set REFERENCE; array c{*} code: ; do i=1 to dim(c); val=c{i};if not missing(val) then output; end; run; proc sql noprint; select group into : list separated by ' ' from REFERENCE; quit; data want(drop=i j k rc val group); if _n_ eq 1 then do; if 0 then set hash; declare hash ha(dataset:'hash'); ha.definekey('val'); ha.definedata('group'); ha.definedone(); end; set PRACTICE; array g{*} &list ; array v{*} var: ; do k=1 to dim(g); g{k}=0; end; do i=1 to dim(v); val=v{i};rc=ha.find(); if rc=0 then do; do j=1 to dim(g); if group=vname(g{j}) then g{j}=1; end; end; end; run;
Xia Keshan
Transpose your reference data to create 14 formats and then loop through your dataset to check.
You could also transpose both datasets and merge them together, and then back to the original data.
Honestly though, the data isn't big enough to worry about efficiency because any extra time spent in programming/problem solving will outweigh any net speed benefits.
Hi,
You can use proc sql like below (slows down on a lot of observations - then I would suggest the one further down):
Simple:
proc sql;
create table WANT as
select A.*,
(select THIS.VAR1 from WORK.REFERENCE THIS
where THIS.VAR1=A.VAR1 and THIS.VAR2=A.VAR2
and THIS.VAR3=THIS.VAR4 and THIS.VAR5=A.VAR5) as FLAG
from WORK.PRACTICE;
quit;
Using join:
proc sql;
create table WANT as
select A.*,
B.VAR1 /* Non-null=present */
from WORK.PRACTICE A
left join WORK.REFERENCE B
on A.VAR1=B.VAR1
and A.VAR2=B.VAR2
and A.VAR3=B.VAR3
and A.VAR4=B.VAR4
and A.VAR5=B.VAR5;
quit;
How about:
data REFERENCE ; input Group $ Code1 Code2 Code3 ; cards; A 12 123 1234 B 25 26 . C 526 . . ; run; data PRACTICE ; input ID Var1 Var2 Var3 Var4 Var5 ; cards; 1 123 . . . . 2 26 12 762 85 . 3 9 34 526 . . ; run; data hash(keep=group val); set REFERENCE; array c{*} code: ; do i=1 to dim(c); val=c{i};if not missing(val) then output; end; run; proc sql noprint; select group into : list separated by ' ' from REFERENCE; quit; data want(drop=i j k rc val group); if _n_ eq 1 then do; if 0 then set hash; declare hash ha(dataset:'hash'); ha.definekey('val'); ha.definedata('group'); ha.definedone(); end; set PRACTICE; array g{*} &list ; array v{*} var: ; do k=1 to dim(g); g{k}=0; end; do i=1 to dim(v); val=v{i};rc=ha.find(); if rc=0 then do; do j=1 to dim(g); if group=vname(g{j}) then g{j}=1; end; end; end; run;
Xia Keshan
Thank you, Ksharp! This was a very useful and fast script!
To anybody else who wants to use it, be sure to use category names that are different from any of the indices that Ksharp uses in the arrays. Most notably, don't use i, j or k as category names - he uses them in the arrays, and then he deletes them.
Also, thanks to everyone else who pitched in with solutions!
Here's a longer way for sure...
data REFERENCE ;
input Group $ Code1 Code2 Code3 ;
cards;
A 12 123 1234
B 25 26 .
C 526 . .
;
run;
data PRACTICE ;
input ID Var1 Var2 Var3 Var4 Var5 ;
cards;
1 123 . . . .
2 26 12 762 85 .
3 9 34 526 . .
;
run;
proc transpose data=reference out=reft prefix=Code;
by group;
run;
proc transpose data=practice out=pract prefix=var;
by id;
run;
proc sql;
create table temp as
select p.*, r.group, 1 as indicator
from pract as p
left join reft as r
on r.code1=p.var1
where p.var1 ne .
and r.group ne ''
order by id;
quit;
proc transpose data=temp out=tempt;
id group;
by id;
var indicator;
run;
data want;
merge practice tempt (drop=_name_);
by id;
run;
Another old school solution:
data REFERENCE ;
input Group $ Code1 Code2 Code3 ;
cards;
A 12 123 1234
B 25 26 .
C 526 . .
;
run;
data PRACTICE ;
input ID Var1 Var2 Var3 Var4 Var5 ;
cards;
1 123 . . . .
2 26 12 762 85 .
3 9 34 526 . .
;
run;
PROC SQL NOPRINT;
SELECT DISTINCT GROUP INTO :GROUP SEPARATED BY ' ' FROM REFERENCE;
QUIT;
data want;
set practice;
ARRAY grp &group. ;
DO OVER GRP; GRP=0;END;
array var var:;
do i=1 to nobs;
set reference nobs=nobs point=i;
array code code:;
do over var;
if missing (var) then CONTINUE;
ELSE if var in code then
DO OVER GRP;
GRP=IFN(VNAME(GRP)=GROUP,1,GRP);
END;
END;
END;
DROP GROUP CODE:;
RUN;
If you want efficiency, 's solution is the answer.
Regards,
Haikuo
PROC TRANSPOSE DATA=ref OUT=tref(WHERE=(NOT Missing(col1)));
BY group;
VAR code:; * brings in all variables starting with code ;
RUN;
PROC TRANSPOSE DATA=practice OUT=tprac(WHERE=(NOT Missing(col1)));
BY id;
VAR var:; * brings in all variables starting with var ;
RUN;
PROC SQL;
CREATE TABLE results AS
SELECT *
FROM tref a, tprac(RENAME=(_name_=_name_p)) b
WHERE a.col1 = b.col1
;
QUIT;
you should be able to carry on from there.
Transpose them both. Join them and calculate your new flag values. Transpose that and merge it back with the PRACTICE file.
proc transpose data=practice out=tpractice (where=(not missing(col1)));
by id ;
var var:;
run;
proc transpose data=reference out=treference (where=(not missing(col1)));
by group;
var code:;
run;
proc sql ;
create table tall as
select p.id
, t.group as flagname
, max(p.col1 = t.col1) as flag
from trefernce t
, tpractice p
group by 1,2
order by 1,2
;
quit;
proc transpose data=tall out=tall2 (drop=_name_);
by id ;
id flagname ;
var flag ;
run;
data want ;
merge practice tall2;
by id;
run;
I would go for a simple approach. Create one format. While this could be automated using REFERENCE as a CNTLIN= data set, your data is small enough that you could actually just type it:
proc format;
value group 12, 123, 1234 = 'A'
25, 26 = 'B'
526 = 'Not Found';
run;
Then use that format in a DATA step:
data want;
set practice;
array vars {11} var1 - var11;
a=0;
b=0;
c=0;
do _i_=1 to 11;
test_group = put(vars{_i_}, group.);
if test_group='A' then a=1;
else if test_group='B' then b=1;
else if test_group='C' then c=1;
end;
drop test_group;
run;
There would be one restriction. Any CODE could only be mapped to one group. Otherwise, you might even call it straightforward.
Good luck.
Here is a way that goes through the large PRACTICE file only once.
First step is to pull out the GROUP values into a macro variable so they can be used as the new variable names.
proc sql noprint ;
select group
into :flaglist separated by ' '
from reference
;
quit;
Then process PRACTICE and sequentially bring in each observation of REFERENCE. Then loop over the VARxxx variables from PRACTICE and see if any are in the CODExxx variables from REFERENCE. You can stop at the first match.
data want ;
set practice ;
array flags &flaglist ;
array vars var: ;
do p=1 to dim(flags);
set reference(keep=code:) point=p ;
do i=1 to dim(vars) until (flags(p)=1);
flags(p)=0<whichn(vars(i),of code:);
end;
end;
drop i code:;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.