DATA Step, Macro, Functions and more

Checking whether the value of one variable in a data set exists in a row of another data set

Accepted Solution Solved
Reply
Contributor
Posts: 40
Accepted Solution

Checking whether the value of one variable in a data set exists in a row of another data set

Dear Community,

I have 2 data sets.  Let's call them REFERENCE and PRACTICE.

Here's what REFERENCE looks like:

Group     Code1     Code2     Code3

A           12           123         1234

B           25            26

C           526

Here's what PRACTICE looks like:

ID               Var1          Var2          Var3       Var4          Var5

1                123

2                26             12              762         85

3                9               34               526

For each row in PRACTICE, I want to check if any of Var1 to Var5 exists in any row of REFERENCE.  If so, then I want to flag that ID with the corresponding Group.

I think that this is best done by having 3 indicator variables in PRACTICE.  Here's what I want PRACTICE to look like.

ID               Var1          Var2          Var3       Var4          Var5         A          B          C

1                123                                                                           1          0           0

2                26             12              762         85                             1          1          0

3                9               34               526                                        0          0          1

Ultimately, I want to count how many observations in PRACTICE fall under each group.  (It is possible for an observation in PRACTICE to fall under multiple groups.)

Would you know how to do this efficiently?  My actual REFERENCE has 14 rows and 3 codes.  My actual PRACTICE has about 7,000 rows and up to 11 variables.

Thanks for your help!

Eric


Accepted Solutions
Solution
‎09-19-2014 09:41 AM
Super User
Posts: 10,023

Re: Checking whether the value of one variable in a data set exists in a row of another data set

How about:


data REFERENCE ;
input Group   $  Code1     Code2     Code3 ;
cards;
A           12           123         1234
B           25            26  .
C           526      .  .
;
run;
 
 
data PRACTICE ;
input ID               Var1          Var2          Var3       Var4          Var5 ;
cards;
1                123  . . . .
2                26             12              762         85  .
3                9               34               526 . .
;
run;
data hash(keep=group val);
 set REFERENCE;
 array c{*} code: ;
 do i=1 to dim(c);
  val=c{i};if not missing(val) then output;
 end;
run;
proc sql noprint;
select group  into : list separated by ' ' from REFERENCE;
quit;
data want(drop=i j k rc val group);
 if _n_ eq 1 then do;
  if 0 then set hash;
  declare hash ha(dataset:'hash');
   ha.definekey('val');
   ha.definedata('group');
   ha.definedone();
end;
set PRACTICE;
array g{*} &list ;
array v{*} var: ;
do k=1 to dim(g);
 g{k}=0;
end;
do i=1 to dim(v);
 val=v{i};rc=ha.find();
 if rc=0 then do;
  do j=1 to dim(g);
    if group=vname(g{j}) then g{j}=1;
  end;
 end;
end;
run;

Xia Keshan

View solution in original post


All Replies
Super User
Posts: 19,772

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Transpose your reference data to create 14 formats and then loop through your dataset to check.

You could also transpose both datasets and merge them together, and then back to the original data.

Honestly though, the data isn't big enough to worry about efficiency because any extra time spent in programming/problem solving will outweigh any net speed benefits.

Super User
Super User
Posts: 7,942

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Hi,

You can use proc sql like below (slows down on a lot of observations - then I would suggest the one further down):

Simple:

proc sql;

     create table WANT as

     select     A.*,

                    (select THIS.VAR1 from WORK.REFERENCE THIS

                     where THIS.VAR1=A.VAR1 and THIS.VAR2=A.VAR2

                         and THIS.VAR3=THIS.VAR4 and THIS.VAR5=A.VAR5) as FLAG

     from       WORK.PRACTICE;

quit;

Using join:

proc sql;

     create table WANT as

     select     A.*,

                   B.VAR1  /* Non-null=present */

     from      WORK.PRACTICE A

     left join WORK.REFERENCE B

     on          A.VAR1=B.VAR1

     and        A.VAR2=B.VAR2

     and        A.VAR3=B.VAR3

     and        A.VAR4=B.VAR4

     and        A.VAR5=B.VAR5;

quit;

Solution
‎09-19-2014 09:41 AM
Super User
Posts: 10,023

Re: Checking whether the value of one variable in a data set exists in a row of another data set

How about:


data REFERENCE ;
input Group   $  Code1     Code2     Code3 ;
cards;
A           12           123         1234
B           25            26  .
C           526      .  .
;
run;
 
 
data PRACTICE ;
input ID               Var1          Var2          Var3       Var4          Var5 ;
cards;
1                123  . . . .
2                26             12              762         85  .
3                9               34               526 . .
;
run;
data hash(keep=group val);
 set REFERENCE;
 array c{*} code: ;
 do i=1 to dim(c);
  val=c{i};if not missing(val) then output;
 end;
run;
proc sql noprint;
select group  into : list separated by ' ' from REFERENCE;
quit;
data want(drop=i j k rc val group);
 if _n_ eq 1 then do;
  if 0 then set hash;
  declare hash ha(dataset:'hash');
   ha.definekey('val');
   ha.definedata('group');
   ha.definedone();
end;
set PRACTICE;
array g{*} &list ;
array v{*} var: ;
do k=1 to dim(g);
 g{k}=0;
end;
do i=1 to dim(v);
 val=v{i};rc=ha.find();
 if rc=0 then do;
  do j=1 to dim(g);
    if group=vname(g{j}) then g{j}=1;
  end;
 end;
end;
run;

Xia Keshan

Contributor
Posts: 40

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Thank you, Ksharp!  This was a very useful and fast script!

To anybody else who wants to use it, be sure to use category names that are different from any of the indices that Ksharp uses in the arrays.  Most notably, don't use i, j or k as category names - he uses them in the arrays, and then he deletes them.

Also, thanks to everyone else who pitched in with solutions!

Super User
Posts: 19,772

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Here's a longer way for sure...

data REFERENCE ;

input Group   $  Code1     Code2     Code3 ;

cards;

A           12           123         1234

B           25            26  .

C           526      .  .

;

run;

data PRACTICE ;

input ID               Var1          Var2          Var3       Var4          Var5 ;

cards;

1                123  . . . .

2                26             12              762         85  .

3                9               34               526 . .

;

run;

proc transpose data=reference out=reft prefix=Code;

by group;

run;

proc transpose data=practice out=pract prefix=var;

by id;

run;

proc sql;

    create table temp as

    select p.*, r.group, 1 as indicator

    from pract as p

    left join reft as r

    on r.code1=p.var1

    where p.var1 ne .

        and r.group ne ''

    order by id;

quit;

proc transpose data=temp out=tempt;

id group;

by id;

var indicator;

run;

data want;

merge practice tempt (drop=_name_);

by id;

run;

Respected Advisor
Posts: 3,156

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Another old school solution:

data REFERENCE ;

input Group   $ Code1     Code2     Code3 ;

cards;

A           12           123         1234

B           25            26 .

C           526      . .

;

run;

data PRACTICE ;

input ID               Var1          Var2          Var3       Var4          Var5 ;

cards;

1                123  . . . .

2                26             12              762         85 .

3                9               34               526 . .

;

run;

PROC SQL NOPRINT;

SELECT DISTINCT GROUP INTO :GROUP SEPARATED BY ' ' FROM REFERENCE;

QUIT;

data want;

set practice;

ARRAY grp &group. ;

DO OVER GRP; GRP=0;END;

array var var:;

do i=1 to nobs;

      set reference nobs=nobs point=i;

         array code code:;

            do over var;

              if missing (var) then CONTINUE;

           ELSE if var in code then

                DO OVER GRP;

                  GRP=IFN(VNAME(GRP)=GROUP,1,GRP);

                 END;

             END;

     END;

DROP GROUP CODE:;

RUN;

If you want efficiency, 's solution is the answer.

Regards,

Haikuo

Contributor
Posts: 26

Re: Checking whether the value of one variable in a data set exists in a row of another data set

PROC TRANSPOSE DATA=ref OUT=tref(WHERE=(NOT Missing(col1)));

     BY group;

     VAR code:; * brings in all variables starting with code ;

RUN;

PROC TRANSPOSE DATA=practice OUT=tprac(WHERE=(NOT Missing(col1)));

     BY id;

     VAR var:; * brings in all variables starting with var ;

RUN;

PROC SQL;

     CREATE TABLE results AS

          SELECT *

          FROM tref a, tprac(RENAME=(_name_=_name_p)) b

          WHERE a.col1 = b.col1

       ;

QUIT;

you should be able to carry on from there.

Super User
Super User
Posts: 7,039

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Transpose them both.  Join them and calculate your new flag values.  Transpose that  and merge it back with the PRACTICE file.

proc transpose data=practice out=tpractice (where=(not missing(col1)));

  by id ;

  var var:;

run;

proc transpose data=reference out=treference (where=(not missing(col1)));

  by group;

  var code:;

run;

proc sql ;

create table tall as

   select p.id

        , t.group as flagname

        , max(p.col1 = t.col1) as flag

  from trefernce t

     , tpractice p

  group by 1,2

  order by 1,2

;

quit;

proc transpose data=tall out=tall2 (drop=_name_);

  by id  ;

  id flagname ;

  var flag ;

run;

data want ;

  merge practice tall2;

  by id;

run;

Super User
Posts: 5,498

Re: Checking whether the value of one variable in a data set exists in a row of another data set

I would go for a simple approach.  Create one format.  While this could be automated using REFERENCE as a CNTLIN= data set, your data is small enough that you could actually just type it:

proc format;

  value group 12, 123, 1234 = 'A'

    25, 26 = 'B'

    526 = 'Not Found';

run;

Then use that format in a DATA step:

data want;

  set practice;

  array vars {11} var1 - var11;

  a=0;

  b=0;

  c=0;

  do _i_=1 to 11;

     test_group = put(vars{_i_}, group.);

     if test_group='A' then a=1;

     else if test_group='B' then b=1;

     else if test_group='C' then c=1;

  end;

  drop test_group;

run;

There would be one restriction.  Any CODE could only be mapped to one group.  Otherwise, you might even call it straightforward.

Good luck.

Super User
Super User
Posts: 7,039

Re: Checking whether the value of one variable in a data set exists in a row of another data set

Here is a way that goes through the large PRACTICE file only once.

First step is to pull out the GROUP values into a macro variable so they can be used as the new variable names.

proc sql noprint ;

  select group

    into :flaglist separated by ' '

    from reference

  ;

quit;


Then process PRACTICE and sequentially bring in each observation of REFERENCE.  Then loop over the VARxxx variables from PRACTICE and see if any are in the CODExxx variables from REFERENCE.  You can stop at the first match.

data want ;

  set practice ;

  array flags &flaglist ;

  array vars  var: ;

  do p=1 to dim(flags);

    set reference(keep=codeSmiley Happy point=p ;

    do i=1 to dim(vars) until (flags(p)=1);

      flags(p)=0<whichn(vars(i),of codeSmiley Happy;

    end;

  end;

  drop i code:;

run;

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 6562 views
  • 3 likes
  • 8 in conversation