BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

Hello,

I was working with labelling some data by merging. The goal is to give an ICCC coding based on patients' cancer site, histology, and cancer behavior. The way I am doing it is illustrated below. For simplicity, I am only showing 1 ID to be the "by" variable. "Tab1"-"Tab3" are the names of the tables for easier referring. 

merge.png

 

This requires a perfect match of the IDs. But currently Tab2 is in the format below:

leu.png

 

The 3 "by" variables are the first 3 columns and I wanted to get "ICCC". But the 3 "by" variables are in a very compact format, the dash means a range (C000-C002 = C000, C001, C002 etc.) so each row actually contains information of a large number of rows. Same for Histology and Behavior. So if expanded fully, the first row will be 810 * 2 * 1= 1620 rows and the second row will be 3 * 9 * 1 = 27 rows. And I have about 130 of these compact rows of information. I fully expanded it and it contained over 400 million rows. This makes merging very challenging.

 

What would be a better way to obtain "ICCC" using patients' Site, Histology, and Behavior? Any help is appreciated!

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Like this?

data T1;
  X='COE22'; Y="1"; output;
  X='XOE32'; Y="2"; output;
  X='XOE21'; Y="1"; output;
  X='XOE35'; Y="1"; output;
  X='XOE35'; Y="7"; output;
run;
data T2;
  X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
  X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
data T3;
  set T1;
  RC=dosubl(cats('data _null_;                                     '
                ,'  set T2;                                        '
                ,'  do XI=1 to countw(X,",");                      '
                ,'    XSTR=scan(X, XI, ",");                       '
                ,'    if index(XSTR,"-") then do;                  '
                ,'      X1=scan(XSTR,1);X2=scan(XSTR,2);           '
                ,'    end;                                         '
                ,'    else do;                                     '
                ,'      X1=XSTR;X2=XSTR;                           '
                ,'    end;                                         '
                ,'    do YI=1 to countw(Y,",");                    '
                ,'      YSTR=scan(Y, YI, ",");                     '
                ,'      if index(YSTR,"-") then do;                '
                ,'        Y1=scan(YSTR,1);Y2=scan(YSTR,2);         '
                ,'      end;                                       '
                ,'      else do;                                   '
                ,'        Y1=YSTR;Y2=YSTR;                         '
                ,'      end;                                       '
                ,'      if strip(X1)<="', X, '"<=strip(X2)         '
                ,'       & strip(Y1)<="', Y, '"<=strip(Y2) then do;'
                ,'        call symputx("ICCC",A);                  '
                ,'        stop;                                    '
                ,'      end;                                       '
                ,'    end;                                         '
                ,'  end;                                           '
                ,'  call symputx("ICCC"," ");                      '
                ,'run;                                             '));
  ICCC=symget('ICCC');
run;
proc print; run;
X Y ICCC
COE22 1 1
XOE32 2  
XOE21 1 2
XOE35 1 2
XOE35 7  

 

View solution in original post

17 REPLIES 17
ChrisNZ
Tourmaline | Level 20

What does Tab1 look like?

Do you want 400m rows as the result?

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

Tab1 has about 30,000 rows. Each row is a patient's information of Site, Histology, and Behavior. I want the corresponding ICCC information for the 30,000 patients. Tab1 looks like this

pat.png

 

And I want the final result (Tab3) to be like this

pat2.png

 

ChrisNZ
Tourmaline | Level 20

So you don't want a ROW for ID=2 in the results, like you showed in the original question?

You only keep rows for which there's an ICCC?

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

Sorry for the confusion, that was more for an illustration of the merging process.

I want to have an ICCC for all those with complete Site, Histology, and Behavior, as shown in my response above. Also pasted again here:

pat2.png

ChrisNZ
Tourmaline | Level 20

Please provide some test data as a program (data step or SQL) so we can use it in our reply.

ChrisNZ
Tourmaline | Level 20

Since no data is supplied, I use an example.

One way to do this:

data T1;
  X='COE22'; output;
  X='XOE32'; output;
  X='XOE21'; output;
run;
data T2;
  X='COE20-COE24, COE35'; A=1; output;
  X='XOE20-XOE24, XOE35'; A=2; output;
run;
data T3;
  set T1;
  RC=dosubl(cats('data _null_;                            '
                ,'  set T2;                               '
                ,'  do I=1 to countw(X,",");              '
                ,'    INTVL=scan(X, I, ",");              '
                ,'    if index(INTVL,"-") then do;        '
                ,'      V1=scan(INTVL,1);V2=scan(INTVL,2);'
                ,'    end;                                '
                ,'    else do;                            '
                ,'      V1=INTVL;V2=INTVL;                '
                ,'    end;                                '
                ,'    if V1<="', X, '"<=V2 then do;       '
                ,'      call symputx("ICCC",A);           '
                ,'      stop;                             '
                ,'    end;                                '
                ,'  end;                                  '
                ,'  call symputx("ICCC"," ");             '
                ,'run;                                    '));
  ICCC=symget('ICCC');
run;
proc print; run;
X ICCC
COE22 1
XOE32  
XOE21 2

 

 

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7
data T1;
  X='COE22'; Y="1"; output;
  X='XOE32'; Y="2"; output;
  X='XOE21'; Y="1"; output;
run;
data T2;
  X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
  X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;

Sorry for the late response. Could you possibly expand what you did to a 2-column match?

Thank you!

ChrisNZ
Tourmaline | Level 20

Like this?

data T1;
  X='COE22'; Y="1"; output;
  X='XOE32'; Y="2"; output;
  X='XOE21'; Y="1"; output;
  X='XOE35'; Y="1"; output;
  X='XOE35'; Y="7"; output;
run;
data T2;
  X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
  X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
data T3;
  set T1;
  RC=dosubl(cats('data _null_;                                     '
                ,'  set T2;                                        '
                ,'  do XI=1 to countw(X,",");                      '
                ,'    XSTR=scan(X, XI, ",");                       '
                ,'    if index(XSTR,"-") then do;                  '
                ,'      X1=scan(XSTR,1);X2=scan(XSTR,2);           '
                ,'    end;                                         '
                ,'    else do;                                     '
                ,'      X1=XSTR;X2=XSTR;                           '
                ,'    end;                                         '
                ,'    do YI=1 to countw(Y,",");                    '
                ,'      YSTR=scan(Y, YI, ",");                     '
                ,'      if index(YSTR,"-") then do;                '
                ,'        Y1=scan(YSTR,1);Y2=scan(YSTR,2);         '
                ,'      end;                                       '
                ,'      else do;                                   '
                ,'        Y1=YSTR;Y2=YSTR;                         '
                ,'      end;                                       '
                ,'      if strip(X1)<="', X, '"<=strip(X2)         '
                ,'       & strip(Y1)<="', Y, '"<=strip(Y2) then do;'
                ,'        call symputx("ICCC",A);                  '
                ,'        stop;                                    '
                ,'      end;                                       '
                ,'    end;                                         '
                ,'  end;                                           '
                ,'  call symputx("ICCC"," ");                      '
                ,'run;                                             '));
  ICCC=symget('ICCC');
run;
proc print; run;
X Y ICCC
COE22 1 1
XOE32 2  
XOE21 1 2
XOE35 1 2
XOE35 7  

 

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

Perfect, thank you so much!

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7

Sorry to bother again after this question is marked as "Solved". @ChrisNZ 

Do you think I might be able to use X of T2 in their original compact format? Below are the sample input datasets. 

data input;
input hist $;
cards;
100
105
110
110
;

data  in2;
length hist $15.;
input hist label;
cards;
100-105,108 1
110 3
;
run;

I was thinking of using each row of IN2 in their exact compact format

if hist in (100-105,108)

In your previous post, your way was to do it in 2 steps, 

if 100<=hist<=105
if hist=108

I try to do this because the dataset I work with is complicated and it takes really long to loop through all the combinations. To give you a general idea, some rows look like this(columns in the red circle correspond to the "X" and "Y" in your T2 dataset). And I have 136 rows of data in this format.

sample.png

 

I tried to write something similar to what's in your previous post:

data test;
	set input;
	RC=dosubl(cats('data _null_;                                     '
                                ,'  set in2;                             '
				,'call symput("hist_list",hist);         '
				,'if ', hist,  ' in (&hist_list) then do;'
				,'call symputx("lab",label);             ' 
				,'stop;                                  '
				,'call symputx("lab"," ");               '
				,'run;                                   '));
	label=symget('lab');
run;

But things didn't work out so well.

 

Again, thank you so much for all the help and inspiration!

 

ChrisNZ
Tourmaline | Level 20

I was thinking of using each row of IN2 in their exact compact format

if hist in (100-105,108)

This is invalid syntax. If it was valid, I'd have used it as it makes the code much simpler.

The equivalent syntax is 

if 100<=hist<=105 | hist=108

which might be marginally faster than 

if 100<=hist<=105
if hist=108

 or not: In both cases it's 2 tests being performed.
Since you may have many combinations of values, the OR clause might become slower as it's lengthened.

I never tested this, but I expected the difference to be minimal if it exists.

 

You might be able to use in() for all the single values instead of several tests. This might make a small difference, but the bottom line is that all these value comparisons need to be performed, so it cannot be fast.
Another way to hope for performance improvement is to put the most common matches at the top of the tests so that all the subsequent else comparisons don't take place.

ChrisNZ
Tourmaline | Level 20

One way to keep the current logic and make it faster would be to parse T2 as is done now to create all the test code from the values, but to save that code in a text file,

In other words, use T2 to create a text file that contains a succession of IF tests.

 

Then, instead of calling a data step for each row of T1, which is very inefficient, you simply

data T3;
  set T1;
  %include(tests);
run;

 

 

TC_
Obsidian | Level 7 TC_
Obsidian | Level 7
Yes, I feel like the current bottleneck is with the reading of T2 dataset for every single row of the T1 dataset. The process flooded my log and I was asked to save/clear the log before the process was done.
Regarding your %include, can you please explain what you meant by "use T2 to create a text file that contains a succession of IF tests"? I have no experience with this kind of code writing.
Thank you!
ChrisNZ
Tourmaline | Level 20

Something like this:

 

data T1;
  X='COE22'; Y="1"; output;
  X='XOE32'; Y="2"; output;
  X='XOE21'; Y="1"; output;
  X='XOE35'; Y="1"; output;
  X='XOE35'; Y="7"; output;
run;
data T2;
  X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
  X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
filename TESTS "&wdir\tests.sas";
data _null_;                                     
  set T2;  
  file TESTS; 
  do XI=1 to countw(X,",");                      
    XSTR=scan(X, XI, ",");                       
    if index(XSTR,"-") then do;                  
       X1=scan(XSTR,1);X2=scan(XSTR,2);
    end;                                         
    else do;                                     
      X1=XSTR;X2=XSTR;                           
    end;                                         
    do YI=1 to countw(Y,",");   
      TESTNO+1; 
      YSTR=scan(Y, YI, ",");                     
      if index(YSTR,"-") then do;                
        Y1=scan(YSTR,1);Y2=scan(YSTR,2);         
      end;                                       
      else do;                                   
        Y1=YSTR;Y2=YSTR;                         
      end;                                       
      TEST= catx(' ', ifc(TESTNO>1,'else',' '), 'if', quote(strip(X1)), '<= X <=', quote(strip(X2))        
           ,'and', quote(strip(Y1)), '<= Y <=', quote(strip(Y2)), 'then ICCC=', A, ';');
      put TEST;                  
    end;                                         
  end;                                           
run;         
options source2; 
data T3; 
  set T1;
  %include TESTS ;
run;
63         options source2;
64         data T3;
65           set T1;
66           %include TESTS ;
NOTE: %INCLUDE (level 1) file TESTS is file X:\SASWORK\_TD23384_NZ8037SPSAS9003_\tests.sas.
67        +if "COE20" <= X <= "COE24" and "1" <= Y <= "1" then ICCC= 1 ;
68        +else if "COE20" <= X <= "COE24" and "2" <= Y <= "2" then ICCC= 1 ;
69        +else if "COE35" <= X <= "COE35" and "1" <= Y <= "1" then ICCC= 1 ;
70        +else if "COE35" <= X <= "COE35" and "2" <= Y <= "2" then ICCC= 1 ;
71        +else if "XOE20" <= X <= "XOE24" and "1" <= Y <= "2" then ICCC= 2 ;
72        +else if "XOE35" <= X <= "XOE35" and "1" <= Y <= "2" then ICCC= 2 ;
NOTE: %INCLUDE (level 1) ending.
73         run;

NOTE: There were 5 observations read from the data set WORK.T1.
NOTE: The data set WORK.T3 has 5 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      

You might be able to optimise the tests further by grouping them, and using an equality test when there's no range

 

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 17 replies
  • 1157 views
  • 1 like
  • 2 in conversation