Hello,
I was working with labelling some data by merging. The goal is to give an ICCC coding based on patients' cancer site, histology, and cancer behavior. The way I am doing it is illustrated below. For simplicity, I am only showing 1 ID to be the "by" variable. "Tab1"-"Tab3" are the names of the tables for easier referring.
This requires a perfect match of the IDs. But currently Tab2 is in the format below:
The 3 "by" variables are the first 3 columns and I wanted to get "ICCC". But the 3 "by" variables are in a very compact format, the dash means a range (C000-C002 = C000, C001, C002 etc.) so each row actually contains information of a large number of rows. Same for Histology and Behavior. So if expanded fully, the first row will be 810 * 2 * 1= 1620 rows and the second row will be 3 * 9 * 1 = 27 rows. And I have about 130 of these compact rows of information. I fully expanded it and it contained over 400 million rows. This makes merging very challenging.
What would be a better way to obtain "ICCC" using patients' Site, Histology, and Behavior? Any help is appreciated!
Like this?
data T1;
X='COE22'; Y="1"; output;
X='XOE32'; Y="2"; output;
X='XOE21'; Y="1"; output;
X='XOE35'; Y="1"; output;
X='XOE35'; Y="7"; output;
run;
data T2;
X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
data T3;
set T1;
RC=dosubl(cats('data _null_; '
,' set T2; '
,' do XI=1 to countw(X,","); '
,' XSTR=scan(X, XI, ","); '
,' if index(XSTR,"-") then do; '
,' X1=scan(XSTR,1);X2=scan(XSTR,2); '
,' end; '
,' else do; '
,' X1=XSTR;X2=XSTR; '
,' end; '
,' do YI=1 to countw(Y,","); '
,' YSTR=scan(Y, YI, ","); '
,' if index(YSTR,"-") then do; '
,' Y1=scan(YSTR,1);Y2=scan(YSTR,2); '
,' end; '
,' else do; '
,' Y1=YSTR;Y2=YSTR; '
,' end; '
,' if strip(X1)<="', X, '"<=strip(X2) '
,' & strip(Y1)<="', Y, '"<=strip(Y2) then do;'
,' call symputx("ICCC",A); '
,' stop; '
,' end; '
,' end; '
,' end; '
,' call symputx("ICCC"," "); '
,'run; '));
ICCC=symget('ICCC');
run;
proc print; run;
What does Tab1 look like?
Do you want 400m rows as the result?
Tab1 has about 30,000 rows. Each row is a patient's information of Site, Histology, and Behavior. I want the corresponding ICCC information for the 30,000 patients. Tab1 looks like this
And I want the final result (Tab3) to be like this
So you don't want a ROW for ID=2 in the results, like you showed in the original question?
You only keep rows for which there's an ICCC?
Sorry for the confusion, that was more for an illustration of the merging process.
I want to have an ICCC for all those with complete Site, Histology, and Behavior, as shown in my response above. Also pasted again here:
Please provide some test data as a program (data step or SQL) so we can use it in our reply.
Since no data is supplied, I use an example.
One way to do this:
data T1;
X='COE22'; output;
X='XOE32'; output;
X='XOE21'; output;
run;
data T2;
X='COE20-COE24, COE35'; A=1; output;
X='XOE20-XOE24, XOE35'; A=2; output;
run;
data T3;
set T1;
RC=dosubl(cats('data _null_; '
,' set T2; '
,' do I=1 to countw(X,","); '
,' INTVL=scan(X, I, ","); '
,' if index(INTVL,"-") then do; '
,' V1=scan(INTVL,1);V2=scan(INTVL,2);'
,' end; '
,' else do; '
,' V1=INTVL;V2=INTVL; '
,' end; '
,' if V1<="', X, '"<=V2 then do; '
,' call symputx("ICCC",A); '
,' stop; '
,' end; '
,' end; '
,' call symputx("ICCC"," "); '
,'run; '));
ICCC=symget('ICCC');
run;
proc print; run;
X | ICCC |
---|---|
COE22 | 1 |
XOE32 | |
XOE21 | 2 |
data T1;
X='COE22'; Y="1"; output;
X='XOE32'; Y="2"; output;
X='XOE21'; Y="1"; output;
run;
data T2;
X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
Sorry for the late response. Could you possibly expand what you did to a 2-column match?
Thank you!
Like this?
data T1;
X='COE22'; Y="1"; output;
X='XOE32'; Y="2"; output;
X='XOE21'; Y="1"; output;
X='XOE35'; Y="1"; output;
X='XOE35'; Y="7"; output;
run;
data T2;
X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
data T3;
set T1;
RC=dosubl(cats('data _null_; '
,' set T2; '
,' do XI=1 to countw(X,","); '
,' XSTR=scan(X, XI, ","); '
,' if index(XSTR,"-") then do; '
,' X1=scan(XSTR,1);X2=scan(XSTR,2); '
,' end; '
,' else do; '
,' X1=XSTR;X2=XSTR; '
,' end; '
,' do YI=1 to countw(Y,","); '
,' YSTR=scan(Y, YI, ","); '
,' if index(YSTR,"-") then do; '
,' Y1=scan(YSTR,1);Y2=scan(YSTR,2); '
,' end; '
,' else do; '
,' Y1=YSTR;Y2=YSTR; '
,' end; '
,' if strip(X1)<="', X, '"<=strip(X2) '
,' & strip(Y1)<="', Y, '"<=strip(Y2) then do;'
,' call symputx("ICCC",A); '
,' stop; '
,' end; '
,' end; '
,' end; '
,' call symputx("ICCC"," "); '
,'run; '));
ICCC=symget('ICCC');
run;
proc print; run;
Perfect, thank you so much!
Sorry to bother again after this question is marked as "Solved". @ChrisNZ
Do you think I might be able to use X of T2 in their original compact format? Below are the sample input datasets.
data input;
input hist $;
cards;
100
105
110
110
;
data in2;
length hist $15.;
input hist label;
cards;
100-105,108 1
110 3
;
run;
I was thinking of using each row of IN2 in their exact compact format
if hist in (100-105,108)
In your previous post, your way was to do it in 2 steps,
if 100<=hist<=105
if hist=108
I try to do this because the dataset I work with is complicated and it takes really long to loop through all the combinations. To give you a general idea, some rows look like this(columns in the red circle correspond to the "X" and "Y" in your T2 dataset). And I have 136 rows of data in this format.
I tried to write something similar to what's in your previous post:
data test;
set input;
RC=dosubl(cats('data _null_; '
,' set in2; '
,'call symput("hist_list",hist); '
,'if ', hist, ' in (&hist_list) then do;'
,'call symputx("lab",label); '
,'stop; '
,'call symputx("lab"," "); '
,'run; '));
label=symget('lab');
run;
But things didn't work out so well.
Again, thank you so much for all the help and inspiration!
I was thinking of using each row of IN2 in their exact compact format
if hist in (100-105,108)
This is invalid syntax. If it was valid, I'd have used it as it makes the code much simpler.
The equivalent syntax is
if 100<=hist<=105 | hist=108
which might be marginally faster than
if 100<=hist<=105
if hist=108
or not: In both cases it's 2 tests being performed.
Since you may have many combinations of values, the OR clause might become slower as it's lengthened.
I never tested this, but I expected the difference to be minimal if it exists.
You might be able to use in() for all the single values instead of several tests. This might make a small difference, but the bottom line is that all these value comparisons need to be performed, so it cannot be fast.
Another way to hope for performance improvement is to put the most common matches at the top of the tests so that all the subsequent else comparisons don't take place.
One way to keep the current logic and make it faster would be to parse T2 as is done now to create all the test code from the values, but to save that code in a text file,
In other words, use T2 to create a text file that contains a succession of IF tests.
Then, instead of calling a data step for each row of T1, which is very inefficient, you simply
data T3;
set T1;
%include(tests);
run;
Something like this:
data T1;
X='COE22'; Y="1"; output;
X='XOE32'; Y="2"; output;
X='XOE21'; Y="1"; output;
X='XOE35'; Y="1"; output;
X='XOE35'; Y="7"; output;
run;
data T2;
X='COE20-COE24, COE35'; Y="1,2"; A=1; output;
X='XOE20-XOE24, XOE35'; Y="1-2,4"; A=2; output;
run;
filename TESTS "&wdir\tests.sas";
data _null_;
set T2;
file TESTS;
do XI=1 to countw(X,",");
XSTR=scan(X, XI, ",");
if index(XSTR,"-") then do;
X1=scan(XSTR,1);X2=scan(XSTR,2);
end;
else do;
X1=XSTR;X2=XSTR;
end;
do YI=1 to countw(Y,",");
TESTNO+1;
YSTR=scan(Y, YI, ",");
if index(YSTR,"-") then do;
Y1=scan(YSTR,1);Y2=scan(YSTR,2);
end;
else do;
Y1=YSTR;Y2=YSTR;
end;
TEST= catx(' ', ifc(TESTNO>1,'else',' '), 'if', quote(strip(X1)), '<= X <=', quote(strip(X2))
,'and', quote(strip(Y1)), '<= Y <=', quote(strip(Y2)), 'then ICCC=', A, ';');
put TEST;
end;
end;
run;
options source2;
data T3;
set T1;
%include TESTS ;
run;
63 options source2; 64 data T3; 65 set T1; 66 %include TESTS ; NOTE: %INCLUDE (level 1) file TESTS is file X:\SASWORK\_TD23384_NZ8037SPSAS9003_\tests.sas. 67 +if "COE20" <= X <= "COE24" and "1" <= Y <= "1" then ICCC= 1 ; 68 +else if "COE20" <= X <= "COE24" and "2" <= Y <= "2" then ICCC= 1 ; 69 +else if "COE35" <= X <= "COE35" and "1" <= Y <= "1" then ICCC= 1 ; 70 +else if "COE35" <= X <= "COE35" and "2" <= Y <= "2" then ICCC= 1 ; 71 +else if "XOE20" <= X <= "XOE24" and "1" <= Y <= "2" then ICCC= 2 ; 72 +else if "XOE35" <= X <= "XOE35" and "1" <= Y <= "2" then ICCC= 2 ; NOTE: %INCLUDE (level 1) ending. 73 run; NOTE: There were 5 observations read from the data set WORK.T1. NOTE: The data set WORK.T3 has 5 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
You might be able to optimise the tests further by grouping them, and using an equality test when there's no range
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.