Solved: Re: Merge Info with Long List of Info

TC_ · Posted 02-16-2022 01:40 PM

Hello,

I was working with labelling some data by merging. The goal is to give an ICCC coding based on patients' cancer site, histology, and cancer behavior. The way I am doing it is illustrated below. For simplicity, I am only showing 1 ID to be the "by" variable. "Tab1"-"Tab3" are the names of the tables for easier referring.

This requires a perfect match of the IDs. But currently Tab2 is in the format below:

The 3 "by" variables are the first 3 columns and I wanted to get "ICCC". But the 3 "by" variables are in a very compact format, the dash means a range (C000-C002 = C000, C001, C002 etc.) so each row actually contains information of a large number of rows. Same for Histology and Behavior. So if expanded fully, the first row will be 810 * 2 * 1= 1620 rows and the second row will be 3 * 9 * 1 = 27 rows. And I have about 130 of these compact rows of information. I fully expanded it and it contained over 400 million rows. This makes merging very challenging.

What would be a better way to obtain "ICCC" using patients' Site, Histology, and Behavior? Any help is appreciated!

ChrisNZ · Posted 02-16-2022 10:54 PM

Like this?

data T1;
  X='COE22'; Y="1"; output;
  X='XOE32'; Y="2"; output;
  X='XOE21'; Y="1"; output;
  X='XOE35'; Y="1"; output;
  X='XOE35'; Y="7"; output;
run;
data T2;
  X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
  X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
data T3;
  set T1;
  RC=dosubl(cats('data _null_;                                     '
                ,'  set T2;                                        '
                ,'  do XI=1 to countw(X,",");                      '
                ,'    XSTR=scan(X, XI, ",");                       '
                ,'    if index(XSTR,"-") then do;                  '
                ,'      X1=scan(XSTR,1);X2=scan(XSTR,2);           '
                ,'    end;                                         '
                ,'    else do;                                     '
                ,'      X1=XSTR;X2=XSTR;                           '
                ,'    end;                                         '
                ,'    do YI=1 to countw(Y,",");                    '
                ,'      YSTR=scan(Y, YI, ",");                     '
                ,'      if index(YSTR,"-") then do;                '
                ,'        Y1=scan(YSTR,1);Y2=scan(YSTR,2);         '
                ,'      end;                                       '
                ,'      else do;                                   '
                ,'        Y1=YSTR;Y2=YSTR;                         '
                ,'      end;                                       '
                ,'      if strip(X1)<="', X, '"<=strip(X2)         '
                ,'       & strip(Y1)<="', Y, '"<=strip(Y2) then do;'
                ,'        call symputx("ICCC",A);                  '
                ,'        stop;                                    '
                ,'      end;                                       '
                ,'    end;                                         '
                ,'  end;                                           '
                ,'  call symputx("ICCC"," ");                      '
                ,'run;                                             '));
  ICCC=symget('ICCC');
run;
proc print; run;

X	Y	ICCC
COE22	1	1
XOE32	2
XOE21	1	2
XOE35	1	2
XOE35	7

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 02-16-2022 07:52 PM

What does Tab1 look like?

Do you want 400m rows as the result?

High-Performance SAS Coding - Third Edition

TC_ · Posted 02-16-2022 08:35 PM

Tab1 has about 30,000 rows. Each row is a patient's information of Site, Histology, and Behavior. I want the corresponding ICCC information for the 30,000 patients. Tab1 looks like this

And I want the final result (Tab3) to be like this

ChrisNZ · Posted 02-16-2022 08:42 PM

So you don't want a ROW for ID=2 in the results, like you showed in the original question?

You only keep rows for which there's an ICCC?

High-Performance SAS Coding - Third Edition

TC_ · Posted 02-16-2022 08:45 PM

Sorry for the confusion, that was more for an illustration of the merging process.

I want to have an ICCC for all those with complete Site, Histology, and Behavior, as shown in my response above. Also pasted again here:

ChrisNZ · Posted 02-16-2022 08:44 PM

Please provide some test data as a program (data step or SQL) so we can use it in our reply.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 02-16-2022 09:52 PM

Since no data is supplied, I use an example.

One way to do this:

data T1;
  X='COE22'; output;
  X='XOE32'; output;
  X='XOE21'; output;
run;
data T2;
  X='COE20-COE24, COE35'; A=1; output;
  X='XOE20-XOE24, XOE35'; A=2; output;
run;
data T3;
  set T1;
  RC=dosubl(cats('data _null_;                            '
                ,'  set T2;                               '
                ,'  do I=1 to countw(X,",");              '
                ,'    INTVL=scan(X, I, ",");              '
                ,'    if index(INTVL,"-") then do;        '
                ,'      V1=scan(INTVL,1);V2=scan(INTVL,2);'
                ,'    end;                                '
                ,'    else do;                            '
                ,'      V1=INTVL;V2=INTVL;                '
                ,'    end;                                '
                ,'    if V1<="', X, '"<=V2 then do;       '
                ,'      call symputx("ICCC",A);           '
                ,'      stop;                             '
                ,'    end;                                '
                ,'  end;                                  '
                ,'  call symputx("ICCC"," ");             '
                ,'run;                                    '));
  ICCC=symget('ICCC');
run;
proc print; run;

X	ICCC
COE22	1
XOE32
XOE21	2

High-Performance SAS Coding - Third Edition

TC_ · Posted 02-16-2022 10:07 PM

data T1;
  X='COE22'; Y="1"; output;
  X='XOE32'; Y="2"; output;
  X='XOE21'; Y="1"; output;
run;
data T2;
  X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
  X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;

Sorry for the late response. Could you possibly expand what you did to a 2-column match?

Thank you!

ChrisNZ · Posted 02-16-2022 10:54 PM

Like this?

data T1;
  X='COE22'; Y="1"; output;
  X='XOE32'; Y="2"; output;
  X='XOE21'; Y="1"; output;
  X='XOE35'; Y="1"; output;
  X='XOE35'; Y="7"; output;
run;
data T2;
  X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
  X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
data T3;
  set T1;
  RC=dosubl(cats('data _null_;                                     '
                ,'  set T2;                                        '
                ,'  do XI=1 to countw(X,",");                      '
                ,'    XSTR=scan(X, XI, ",");                       '
                ,'    if index(XSTR,"-") then do;                  '
                ,'      X1=scan(XSTR,1);X2=scan(XSTR,2);           '
                ,'    end;                                         '
                ,'    else do;                                     '
                ,'      X1=XSTR;X2=XSTR;                           '
                ,'    end;                                         '
                ,'    do YI=1 to countw(Y,",");                    '
                ,'      YSTR=scan(Y, YI, ",");                     '
                ,'      if index(YSTR,"-") then do;                '
                ,'        Y1=scan(YSTR,1);Y2=scan(YSTR,2);         '
                ,'      end;                                       '
                ,'      else do;                                   '
                ,'        Y1=YSTR;Y2=YSTR;                         '
                ,'      end;                                       '
                ,'      if strip(X1)<="', X, '"<=strip(X2)         '
                ,'       & strip(Y1)<="', Y, '"<=strip(Y2) then do;'
                ,'        call symputx("ICCC",A);                  '
                ,'        stop;                                    '
                ,'      end;                                       '
                ,'    end;                                         '
                ,'  end;                                           '
                ,'  call symputx("ICCC"," ");                      '
                ,'run;                                             '));
  ICCC=symget('ICCC');
run;
proc print; run;

X	Y	ICCC
COE22	1	1
XOE32	2
XOE21	1	2
XOE35	1	2
XOE35	7

High-Performance SAS Coding - Third Edition

TC_ · Posted 02-16-2022 11:06 PM

Perfect, thank you so much!

TC_ · Posted 02-20-2022 12:42 PM

Sorry to bother again after this question is marked as "Solved". @ChrisNZ

Do you think I might be able to use X of T2 in their original compact format? Below are the sample input datasets.

data input;
input hist $;
cards;
100
105
110
110
;

data  in2;
length hist $15.;
input hist label;
cards;
100-105,108 1
110 3
;
run;

I was thinking of using each row of IN2 in their exact compact format

if hist in (100-105,108)

In your previous post, your way was to do it in 2 steps,

if 100<=hist<=105
if hist=108

I try to do this because the dataset I work with is complicated and it takes really long to loop through all the combinations. To give you a general idea, some rows look like this(columns in the red circle correspond to the "X" and "Y" in your T2 dataset). And I have 136 rows of data in this format.

I tried to write something similar to what's in your previous post:

data test;
	set input;
	RC=dosubl(cats('data _null_;                                     '
                                ,'  set in2;                             '
				,'call symput("hist_list",hist);         '
				,'if ', hist,  ' in (&hist_list) then do;'
				,'call symputx("lab",label);             ' 
				,'stop;                                  '
				,'call symputx("lab"," ");               '
				,'run;                                   '));
	label=symget('lab');
run;

But things didn't work out so well.

Again, thank you so much for all the help and inspiration!

ChrisNZ · Posted 02-21-2022 08:57 PM

I was thinking of using each row of IN2 in their exact compact format
if hist in (100-105,108)

This is invalid syntax. If it was valid, I'd have used it as it makes the code much simpler.

The equivalent syntax is

if 100<=hist<=105 | hist=108

which might be marginally faster than

if 100<=hist<=105
if hist=108

or not: In both cases it's 2 tests being performed.
Since you may have many combinations of values, the OR clause might become slower as it's lengthened.

I never tested this, but I expected the difference to be minimal if it exists.

You might be able to use in() for all the single values instead of several tests. This might make a small difference, but the bottom line is that all these value comparisons need to be performed, so it cannot be fast.
Another way to hope for performance improvement is to put the most common matches at the top of the tests so that all the subsequent else comparisons don't take place.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 02-21-2022 09:12 PM

One way to keep the current logic and make it faster would be to parse T2 as is done now to create all the test code from the values, but to save that code in a text file,

In other words, use T2 to create a text file that contains a succession of IF tests.

Then, instead of calling a data step for each row of T1, which is very inefficient, you simply

data T3;
  set T1;
  %include(tests);
run;

High-Performance SAS Coding - Third Edition

TC_ · Posted 02-21-2022 09:38 PM

Yes, I feel like the current bottleneck is with the reading of T2 dataset for every single row of the T1 dataset. The process flooded my log and I was asked to save/clear the log before the process was done.
Regarding your %include, can you please explain what you meant by "use T2 to create a text file that contains a succession of IF tests"? I have no experience with this kind of code writing.
Thank you!

ChrisNZ · Posted 02-23-2022 01:41 AM

Something like this:

data T1;
  X='COE22'; Y="1"; output;
  X='XOE32'; Y="2"; output;
  X='XOE21'; Y="1"; output;
  X='XOE35'; Y="1"; output;
  X='XOE35'; Y="7"; output;
run;
data T2;
  X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
  X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
filename TESTS "&wdir\tests.sas";
data _null_;                                     
  set T2;  
  file TESTS; 
  do XI=1 to countw(X,",");                      
    XSTR=scan(X, XI, ",");                       
    if index(XSTR,"-") then do;                  
       X1=scan(XSTR,1);X2=scan(XSTR,2);
    end;                                         
    else do;                                     
      X1=XSTR;X2=XSTR;                           
    end;                                         
    do YI=1 to countw(Y,",");   
      TESTNO+1; 
      YSTR=scan(Y, YI, ",");                     
      if index(YSTR,"-") then do;                
        Y1=scan(YSTR,1);Y2=scan(YSTR,2);         
      end;                                       
      else do;                                   
        Y1=YSTR;Y2=YSTR;                         
      end;                                       
      TEST= catx(' ', ifc(TESTNO>1,'else',' '), 'if', quote(strip(X1)), '<= X <=', quote(strip(X2))        
           ,'and', quote(strip(Y1)), '<= Y <=', quote(strip(Y2)), 'then ICCC=', A, ';');
      put TEST;                  
    end;                                         
  end;                                           
run;         
options source2; 
data T3; 
  set T1;
  %include TESTS ;
run;

63         options source2;
64         data T3;
65           set T1;
66           %include TESTS ;
NOTE: %INCLUDE (level 1) file TESTS is file X:\SASWORK\_TD23384_NZ8037SPSAS9003_\tests.sas.
67        +if "COE20" <= X <= "COE24" and "1" <= Y <= "1" then ICCC= 1 ;
68        +else if "COE20" <= X <= "COE24" and "2" <= Y <= "2" then ICCC= 1 ;
69        +else if "COE35" <= X <= "COE35" and "1" <= Y <= "1" then ICCC= 1 ;
70        +else if "COE35" <= X <= "COE35" and "2" <= Y <= "2" then ICCC= 1 ;
71        +else if "XOE20" <= X <= "XOE24" and "1" <= Y <= "2" then ICCC= 2 ;
72        +else if "XOE35" <= X <= "XOE35" and "1" <= Y <= "2" then ICCC= 2 ;
NOTE: %INCLUDE (level 1) ending.
73         run;

NOTE: There were 5 observations read from the data set WORK.T1.
NOTE: The data set WORK.T3 has 5 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

You might be able to optimise the tests further by grouping them, and using an equality test when there's no range

High-Performance SAS Coding - Third Edition

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!