BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
EricCai
Calcite | Level 5

Dear Community,

I have 2 data sets.  Let's call them REFERENCE and PRACTICE.

Here's what REFERENCE looks like:

Group     Code1     Code2     Code3

A           12           123         1234

B           25            26

C           526

Here's what PRACTICE looks like:

ID               Var1          Var2          Var3       Var4          Var5

1                123

2                26             12              762         85

3                9               34               526

For each row in PRACTICE, I want to check if any of Var1 to Var5 exists in any row of REFERENCE.  If so, then I want to flag that ID with the corresponding Group.

I think that this is best done by having 3 indicator variables in PRACTICE.  Here's what I want PRACTICE to look like.

ID               Var1          Var2          Var3       Var4          Var5         A          B          C

1                123                                                                           1          0           0

2                26             12              762         85                             1          1          0

3                9               34               526                                        0          0          1

Ultimately, I want to count how many observations in PRACTICE fall under each group.  (It is possible for an observation in PRACTICE to fall under multiple groups.)

Would you know how to do this efficiently?  My actual REFERENCE has 14 rows and 3 codes.  My actual PRACTICE has about 7,000 rows and up to 11 variables.

Thanks for your help!

Eric

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

How about:


data REFERENCE ;
input Group   $  Code1     Code2     Code3 ;
cards;
A           12           123         1234
B           25            26  .
C           526      .  .
;
run;
 
 
data PRACTICE ;
input ID               Var1          Var2          Var3       Var4          Var5 ;
cards;
1                123  . . . .
2                26             12              762         85  .
3                9               34               526 . .
;
run;
data hash(keep=group val);
 set REFERENCE;
 array c{*} code: ;
 do i=1 to dim(c);
  val=c{i};if not missing(val) then output;
 end;
run;
proc sql noprint;
select group  into : list separated by ' ' from REFERENCE;
quit;
data want(drop=i j k rc val group);
 if _n_ eq 1 then do;
  if 0 then set hash;
  declare hash ha(dataset:'hash');
   ha.definekey('val');
   ha.definedata('group');
   ha.definedone();
end;
set PRACTICE;
array g{*} &list ;
array v{*} var: ;
do k=1 to dim(g);
 g{k}=0;
end;
do i=1 to dim(v);
 val=v{i};rc=ha.find();
 if rc=0 then do;
  do j=1 to dim(g);
    if group=vname(g{j}) then g{j}=1;
  end;
 end;
end;
run;

Xia Keshan

View solution in original post

10 REPLIES 10
Reeza
Super User

Transpose your reference data to create 14 formats and then loop through your dataset to check.

You could also transpose both datasets and merge them together, and then back to the original data.

Honestly though, the data isn't big enough to worry about efficiency because any extra time spent in programming/problem solving will outweigh any net speed benefits.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Hi,

You can use proc sql like below (slows down on a lot of observations - then I would suggest the one further down):

Simple:

proc sql;

     create table WANT as

     select     A.*,

                    (select THIS.VAR1 from WORK.REFERENCE THIS

                     where THIS.VAR1=A.VAR1 and THIS.VAR2=A.VAR2

                         and THIS.VAR3=THIS.VAR4 and THIS.VAR5=A.VAR5) as FLAG

     from       WORK.PRACTICE;

quit;

Using join:

proc sql;

     create table WANT as

     select     A.*,

                   B.VAR1  /* Non-null=present */

     from      WORK.PRACTICE A

     left join WORK.REFERENCE B

     on          A.VAR1=B.VAR1

     and        A.VAR2=B.VAR2

     and        A.VAR3=B.VAR3

     and        A.VAR4=B.VAR4

     and        A.VAR5=B.VAR5;

quit;

Ksharp
Super User

How about:


data REFERENCE ;
input Group   $  Code1     Code2     Code3 ;
cards;
A           12           123         1234
B           25            26  .
C           526      .  .
;
run;
 
 
data PRACTICE ;
input ID               Var1          Var2          Var3       Var4          Var5 ;
cards;
1                123  . . . .
2                26             12              762         85  .
3                9               34               526 . .
;
run;
data hash(keep=group val);
 set REFERENCE;
 array c{*} code: ;
 do i=1 to dim(c);
  val=c{i};if not missing(val) then output;
 end;
run;
proc sql noprint;
select group  into : list separated by ' ' from REFERENCE;
quit;
data want(drop=i j k rc val group);
 if _n_ eq 1 then do;
  if 0 then set hash;
  declare hash ha(dataset:'hash');
   ha.definekey('val');
   ha.definedata('group');
   ha.definedone();
end;
set PRACTICE;
array g{*} &list ;
array v{*} var: ;
do k=1 to dim(g);
 g{k}=0;
end;
do i=1 to dim(v);
 val=v{i};rc=ha.find();
 if rc=0 then do;
  do j=1 to dim(g);
    if group=vname(g{j}) then g{j}=1;
  end;
 end;
end;
run;

Xia Keshan

EricCai
Calcite | Level 5

Thank you, Ksharp!  This was a very useful and fast script!

To anybody else who wants to use it, be sure to use category names that are different from any of the indices that Ksharp uses in the arrays.  Most notably, don't use i, j or k as category names - he uses them in the arrays, and then he deletes them.

Also, thanks to everyone else who pitched in with solutions!

Reeza
Super User

Here's a longer way for sure...

data REFERENCE ;

input Group   $  Code1     Code2     Code3 ;

cards;

A           12           123         1234

B           25            26  .

C           526      .  .

;

run;

data PRACTICE ;

input ID               Var1          Var2          Var3       Var4          Var5 ;

cards;

1                123  . . . .

2                26             12              762         85  .

3                9               34               526 . .

;

run;

proc transpose data=reference out=reft prefix=Code;

by group;

run;

proc transpose data=practice out=pract prefix=var;

by id;

run;

proc sql;

    create table temp as

    select p.*, r.group, 1 as indicator

    from pract as p

    left join reft as r

    on r.code1=p.var1

    where p.var1 ne .

        and r.group ne ''

    order by id;

quit;

proc transpose data=temp out=tempt;

id group;

by id;

var indicator;

run;

data want;

merge practice tempt (drop=_name_);

by id;

run;

Haikuo
Onyx | Level 15

Another old school solution:

data REFERENCE ;

input Group   $ Code1     Code2     Code3 ;

cards;

A           12           123         1234

B           25            26 .

C           526      . .

;

run;

data PRACTICE ;

input ID               Var1          Var2          Var3       Var4          Var5 ;

cards;

1                123  . . . .

2                26             12              762         85 .

3                9               34               526 . .

;

run;

PROC SQL NOPRINT;

SELECT DISTINCT GROUP INTO :GROUP SEPARATED BY ' ' FROM REFERENCE;

QUIT;

data want;

set practice;

ARRAY grp &group. ;

DO OVER GRP; GRP=0;END;

array var var:;

do i=1 to nobs;

      set reference nobs=nobs point=i;

         array code code:;

            do over var;

              if missing (var) then CONTINUE;

           ELSE if var in code then

                DO OVER GRP;

                  GRP=IFN(VNAME(GRP)=GROUP,1,GRP);

                 END;

             END;

     END;

DROP GROUP CODE:;

RUN;

If you want efficiency, 's solution is the answer.

Regards,

Haikuo

rcwright
Calcite | Level 5

PROC TRANSPOSE DATA=ref OUT=tref(WHERE=(NOT Missing(col1)));

     BY group;

     VAR code:; * brings in all variables starting with code ;

RUN;

PROC TRANSPOSE DATA=practice OUT=tprac(WHERE=(NOT Missing(col1)));

     BY id;

     VAR var:; * brings in all variables starting with var ;

RUN;

PROC SQL;

     CREATE TABLE results AS

          SELECT *

          FROM tref a, tprac(RENAME=(_name_=_name_p)) b

          WHERE a.col1 = b.col1

       ;

QUIT;

you should be able to carry on from there.

Tom
Super User Tom
Super User

Transpose them both.  Join them and calculate your new flag values.  Transpose that  and merge it back with the PRACTICE file.

proc transpose data=practice out=tpractice (where=(not missing(col1)));

  by id ;

  var var:;

run;

proc transpose data=reference out=treference (where=(not missing(col1)));

  by group;

  var code:;

run;

proc sql ;

create table tall as

   select p.id

        , t.group as flagname

        , max(p.col1 = t.col1) as flag

  from trefernce t

     , tpractice p

  group by 1,2

  order by 1,2

;

quit;

proc transpose data=tall out=tall2 (drop=_name_);

  by id  ;

  id flagname ;

  var flag ;

run;

data want ;

  merge practice tall2;

  by id;

run;

Astounding
PROC Star

I would go for a simple approach.  Create one format.  While this could be automated using REFERENCE as a CNTLIN= data set, your data is small enough that you could actually just type it:

proc format;

  value group 12, 123, 1234 = 'A'

    25, 26 = 'B'

    526 = 'Not Found';

run;

Then use that format in a DATA step:

data want;

  set practice;

  array vars {11} var1 - var11;

  a=0;

  b=0;

  c=0;

  do _i_=1 to 11;

     test_group = put(vars{_i_}, group.);

     if test_group='A' then a=1;

     else if test_group='B' then b=1;

     else if test_group='C' then c=1;

  end;

  drop test_group;

run;

There would be one restriction.  Any CODE could only be mapped to one group.  Otherwise, you might even call it straightforward.

Good luck.

Tom
Super User Tom
Super User

Here is a way that goes through the large PRACTICE file only once.

First step is to pull out the GROUP values into a macro variable so they can be used as the new variable names.

proc sql noprint ;

  select group

    into :flaglist separated by ' '

    from reference

  ;

quit;


Then process PRACTICE and sequentially bring in each observation of REFERENCE.  Then loop over the VARxxx variables from PRACTICE and see if any are in the CODExxx variables from REFERENCE.  You can stop at the first match.

data want ;

  set practice ;

  array flags &flaglist ;

  array vars  var: ;

  do p=1 to dim(flags);

    set reference(keep=code:) point=p ;

    do i=1 to dim(vars) until (flags(p)=1);

      flags(p)=0<whichn(vars(i),of code:);

    end;

  end;

  drop i code:;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 16434 views
  • 3 likes
  • 8 in conversation