Solved: Checking whether the value of one variable in a data set exists in a r...

EricCai · Posted 09-18-2014 11:03 PM

Dear Community,

I have 2 data sets. Let's call them REFERENCE and PRACTICE.

Here's what REFERENCE looks like:

Group Code1 Code2 Code3

A 12 123 1234

B 25 26

C 526

Here's what PRACTICE looks like:

ID Var1 Var2 Var3 Var4 Var5

1 123

2 26 12 762 85

3 9 34 526

For each row in PRACTICE, I want to check if any of Var1 to Var5 exists in any row of REFERENCE. If so, then I want to flag that ID with the corresponding Group.

I think that this is best done by having 3 indicator variables in PRACTICE. Here's what I want PRACTICE to look like.

ID Var1 Var2 Var3 Var4 Var5 A B C

1 123 1 0 0

2 26 12 762 85 1 1 0

3 9 34 526 0 0 1

Ultimately, I want to count how many observations in PRACTICE fall under each group. (It is possible for an observation in PRACTICE to fall under multiple groups.)

Would you know how to do this efficiently? My actual REFERENCE has 14 rows and 3 codes. My actual PRACTICE has about 7,000 rows and up to 11 variables.

Thanks for your help!

Eric

Ksharp · Posted 09-19-2014 09:41 AM

How about:


data REFERENCE ;
input Group   $  Code1     Code2     Code3 ;
cards;
A           12           123         1234
B           25            26  .
C           526      .  .
;
run;
 
 
data PRACTICE ;
input ID               Var1          Var2          Var3       Var4          Var5 ;
cards;
1                123  . . . .
2                26             12              762         85  .
3                9               34               526 . .
;
run;
data hash(keep=group val);
 set REFERENCE;
 array c{*} code: ;
 do i=1 to dim(c);
  val=c{i};if not missing(val) then output;
 end;
run;
proc sql noprint;
select group  into : list separated by ' ' from REFERENCE;
quit;
data want(drop=i j k rc val group);
 if _n_ eq 1 then do;
  if 0 then set hash;
  declare hash ha(dataset:'hash');
   ha.definekey('val');
   ha.definedata('group');
   ha.definedone();
end;
set PRACTICE;
array g{*} &list ;
array v{*} var: ;
do k=1 to dim(g);
 g{k}=0;
end;
do i=1 to dim(v);
 val=v{i};rc=ha.find();
 if rc=0 then do;
  do j=1 to dim(g);
    if group=vname(g{j}) then g{j}=1;
  end;
 end;
end;
run;

Xia Keshan

View solution in original post

Reeza · Posted 09-19-2014 12:07 AM

Transpose your reference data to create 14 formats and then loop through your dataset to check.

You could also transpose both datasets and merge them together, and then back to the original data.

Honestly though, the data isn't big enough to worry about efficiency because any extra time spent in programming/problem solving will outweigh any net speed benefits.

RW9 · Posted 09-19-2014 04:31 AM

Hi,

You can use proc sql like below (slows down on a lot of observations - then I would suggest the one further down):

Simple:

proc sql;

create table WANT as

select A.*,

(select THIS.VAR1 from WORK.REFERENCE THIS

where THIS.VAR1=A.VAR1 and THIS.VAR2=A.VAR2

and THIS.VAR3=THIS.VAR4 and THIS.VAR5=A.VAR5) as FLAG

from WORK.PRACTICE;

quit;

Using join:

proc sql;

create table WANT as

select A.*,

B.VAR1 /* Non-null=present */

from WORK.PRACTICE A

left join WORK.REFERENCE B

on A.VAR1=B.VAR1

and A.VAR2=B.VAR2

and A.VAR3=B.VAR3

and A.VAR4=B.VAR4

and A.VAR5=B.VAR5;

quit;

Ksharp · Posted 09-19-2014 09:41 AM

How about:


data REFERENCE ;
input Group   $  Code1     Code2     Code3 ;
cards;
A           12           123         1234
B           25            26  .
C           526      .  .
;
run;
 
 
data PRACTICE ;
input ID               Var1          Var2          Var3       Var4          Var5 ;
cards;
1                123  . . . .
2                26             12              762         85  .
3                9               34               526 . .
;
run;
data hash(keep=group val);
 set REFERENCE;
 array c{*} code: ;
 do i=1 to dim(c);
  val=c{i};if not missing(val) then output;
 end;
run;
proc sql noprint;
select group  into : list separated by ' ' from REFERENCE;
quit;
data want(drop=i j k rc val group);
 if _n_ eq 1 then do;
  if 0 then set hash;
  declare hash ha(dataset:'hash');
   ha.definekey('val');
   ha.definedata('group');
   ha.definedone();
end;
set PRACTICE;
array g{*} &list ;
array v{*} var: ;
do k=1 to dim(g);
 g{k}=0;
end;
do i=1 to dim(v);
 val=v{i};rc=ha.find();
 if rc=0 then do;
  do j=1 to dim(g);
    if group=vname(g{j}) then g{j}=1;
  end;
 end;
end;
run;

Xia Keshan

EricCai · Posted 09-25-2014 12:44 PM

Thank you, Ksharp! This was a very useful and fast script!

To anybody else who wants to use it, be sure to use category names that are different from any of the indices that Ksharp uses in the arrays. Most notably, don't use i, j or k as category names - he uses them in the arrays, and then he deletes them.

Also, thanks to everyone else who pitched in with solutions!

Reeza · Posted 09-19-2014 10:11 AM

Here's a longer way for sure...

data REFERENCE ;

input Group $ Code1 Code2 Code3 ;

cards;

A 12 123 1234

B 25 26 .

C 526 . .

;

run;

data PRACTICE ;

input ID Var1 Var2 Var3 Var4 Var5 ;

cards;

1 123 . . . .

2 26 12 762 85 .

3 9 34 526 . .

;

run;

proc transpose data=reference out=reft prefix=Code;

by group;

run;

proc transpose data=practice out=pract prefix=var;

by id;

run;

proc sql;

create table temp as

select p.*, r.group, 1 as indicator

from pract as p

left join reft as r

on r.code1=p.var1

where p.var1 ne .

and r.group ne ''

order by id;

quit;

proc transpose data=temp out=tempt;

id group;

by id;

var indicator;

run;

data want;

merge practice tempt (drop=_name_);

by id;

run;

Haikuo · Posted 09-19-2014 10:59 AM

Another old school solution:

data REFERENCE ;

input Group $ Code1 Code2 Code3 ;

cards;

A 12 123 1234

B 25 26 .

C 526 . .

;

run;

data PRACTICE ;

input ID Var1 Var2 Var3 Var4 Var5 ;

cards;

1 123 . . . .

2 26 12 762 85 .

3 9 34 526 . .

;

run;

PROC SQL NOPRINT;

SELECT DISTINCT GROUP INTO :GROUP SEPARATED BY ' ' FROM REFERENCE;

QUIT;

data want;

set practice;

ARRAY grp &group. ;

DO OVER GRP; GRP=0;END;

array var var:;

do i=1 to nobs;

set reference nobs=nobs point=i;

array code code:;

do over var;

if missing (var) then CONTINUE;

ELSE if var in code then

DO OVER GRP;

GRP=IFN(VNAME(GRP)=GROUP,1,GRP);

END;

DROP GROUP CODE:;

RUN;

If you want efficiency, 's solution is the answer.

Regards,

Haikuo

rcwright · Posted 09-19-2014 01:50 PM

PROC TRANSPOSE DATA=ref OUT=tref(WHERE=(NOT Missing(col1)));

BY group;

VAR code:; * brings in all variables starting with code ;

RUN;

PROC TRANSPOSE DATA=practice OUT=tprac(WHERE=(NOT Missing(col1)));

BY id;

VAR var:; * brings in all variables starting with var ;

RUN;

PROC SQL;

CREATE TABLE results AS

SELECT *

FROM tref a, tprac(RENAME=(_name_=_name_p)) b

WHERE a.col1 = b.col1

;

QUIT;

you should be able to carry on from there.

Tom · Posted 09-19-2014 02:50 PM

Transpose them both. Join them and calculate your new flag values. Transpose that and merge it back with the PRACTICE file.

proc transpose data=practice out=tpractice (where=(not missing(col1)));

by id ;

var var:;

run;

proc transpose data=reference out=treference (where=(not missing(col1)));

by group;

var code:;

run;

proc sql ;

create table tall as

select p.id

, t.group as flagname

, max(p.col1 = t.col1) as flag

from trefernce t

, tpractice p

group by 1,2

order by 1,2

;

quit;

proc transpose data=tall out=tall2 (drop=_name_);

by id ;

id flagname ;

var flag ;

run;

data want ;

merge practice tall2;

by id;

run;

Astounding · Posted 09-19-2014 04:48 PM

I would go for a simple approach. Create one format. While this could be automated using REFERENCE as a CNTLIN= data set, your data is small enough that you could actually just type it:

proc format;

value group 12, 123, 1234 = 'A'

25, 26 = 'B'

526 = 'Not Found';

run;

Then use that format in a DATA step:

data want;

set practice;

array vars {11} var1 - var11;

a=0;

b=0;

c=0;

do _i_=1 to 11;

test_group = put(vars{_i_}, group.);

if test_group='A' then a=1;

else if test_group='B' then b=1;

else if test_group='C' then c=1;

end;

drop test_group;

run;

There would be one restriction. Any CODE could only be mapped to one group. Otherwise, you might even call it straightforward.

Good luck.

Tom · Posted 09-20-2014 12:54 PM

Here is a way that goes through the large PRACTICE file only once.

First step is to pull out the GROUP values into a macro variable so they can be used as the new variable names.

proc sql noprint ;

select group

into :flaglist separated by ' '

from reference

;

quit;

Then process PRACTICE and sequentially bring in each observation of REFERENCE. Then loop over the VARxxx variables from PRACTICE and see if any are in the CODExxx variables from REFERENCE. You can stop at the first match.

data want ;

set practice ;

array flags &flaglist ;

array vars var: ;

do p=1 to dim(flags);

set reference(keep=code:) point=p ;

do i=1 to dim(vars) until (flags(p)=1);

flags(p)=0<whichn(vars(i),of code:);

end;

drop i code:;

run;

Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Registration is open

SAS Training: Just a Click Away