I am using SAS version 9.4 on z/OS mainframe.
I have the following data in two datasets (as an example) -
Dataset A
Variable DSNA V SNA
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
Dataset B
Variable DSNB VSNB
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
I want to search Dataset A DSNA - using Dataset B DSNB. I cannot just merge Dataset A and B - because DSNB is a substring of the variable DSNA. In the above example, I need to return ABC.DEF.GHI.LMB and VSNA of 2 from the search using DSNB value of ABC.DEF.GHI. Basically - return the last (highest) value in DSNA.
Can somebody kindly give me some hints how to go about doing this? Thank you.
If I understand your problem correctly, DSNB is always the leftmost substring of DSNA. Then something like this may work:
data a;
length dsna $20 vsna 8;
input dsna vsna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;run;
data b;
length dsnb $20 vsnb 8;
input dsnb vsnb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;run;
proc sort data=a;
by dsna;
run;
proc sort data=b;
by dsnb;
run;
data want;
set a(in=a) b(rename=(dsnb=dsna vsnb=vsna) in=b);
by dsna;
if b then do;
dsnb=dsna;
vsnb=vsna;
end;
retain dsnb vsnb;
if a;
if dsna=:trim(dsnb);
run;
If DSNB variable is a fixed length then create substring of DSNA as new variable and use simple merge or
if they may be more then one match use proc sql.
Unfortunately, DSNB is not of fixed length. I only showed the example as fixed length, but in the actual file, is variable length.
Also, I am not familiar with PROC SQL - can somebody kindly post some examples of coding - so that I can follow? Thank you.
In fact, both DSNA and DSNB are variable in length.
Try next code:
proc sql;
create table to_check as
select a.* , b.*
from A as a
join B as b
on index(DSNA, DSNB) > 0;
quit;
Can you guarantee a minimum length? For example, can you guarantee that all DSNB values are at least 3 characters long?
Next code is tested and gives the wanted results:
data dataA;
length DSNA $ 20;
input DSNA $ v_sna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;
run;
data dataB;
length DSNB $15;
input DSNB $ v_snb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;
run;
proc sql;
create table to_check as
select a.*, b.*
from dataA as a
join dataB as b
on index(DSNA, strip(DSNB)) > 0;
quit;
Hi Shmuel,
Thank you very much. I am now testing the code in my z/OS mainframe. Will let you know if it works.
To Astounding -
Yes - the minimum length for both DSNA and DSNB are all greater than 8 characters.
Hello,
I am testing the SQL - running it on my mainframe - z/OS operating system - the SQL procedure started at 16:56 and now is already 18:35 - and it is still running.
DSNA variable has 855593 observations and DSNB has 17471 observations.
The mainframe is a 1,700 MIPS machine - definitely not slow.
I am wondering if there is anything else I could check? Thank you.
Alternative to sql -
As DSNB is much smaller (less observations) then DSNA you may try use iterative hash technique
in a datastep instead sql, but I'm not expert with it.
The hashing will read DSNB once, at the beginning into memory and
do all checking reading DSNA once.
@TingSern wrote:
Hello,
I am testing the SQL - running it on my mainframe - z/OS operating system - the SQL procedure started at 16:56 and now is already 18:35 - and it is still running.
DSNA variable has 855593 observations and DSNB has 17471 observations.
The mainframe is a 1,700 MIPS machine - definitely not slow.
I am wondering if there is anything else I could check? Thank you.
If either set has duplicate values for either the search or target string you might consider searching on DISTINCT values of the strings:
such as
proc sql; create table to_check as select a.*, b.* from (select distinct DSNA from dataA) as a join (select distinct DSNB from dataB) as b on index(DSNA, strip(DSNB)) > 0; quit;
and then if you need more values from either of the sets use the resulting set with a LEFT or RIGHT join on the appropriate variable.
If the code that is running is similar to the @Shmuel's original posting it is comparing every single record in A with every record in B, which would be 855593 * 17471 comparisons. Which might take some time. Also only include the variables you might need for matching purposes at this point as the more variables involved the more disk IO is going to be needed.
Given that you have a minimum length of 8 for DSNB:
proc sql;
create table want as
select a.*, b.* from dataset_a a, dataset_b b
where substr(dsna, 1, 😎 = substr(dsnb, 1, 😎 and
dsnb <= dsna
order by dsnb dsna;
quit;
It's untested, but it should speed things up and give you the right observations to subset.
It does run into problems, however, if there are duplicate values for DSNB in DATASET_B.
It's conceivable that this would actually run faster if you create (ahead of time) separate variables holding the first 8 characters of DSNA and DSNB so that SQL doesn't need to apply SUBSTR.
Finally, note that this does not give you the final answer. It gives you a superset of the records you need. You still need to filter out the largest with something like:
data really_want;
set want;
by dsnb;
if last.dsnb;
run;
SQL takes too much overheads. A simple DATA procedure is much faster (at least on z/OS mainframe). Thank you very much for all your help.
If I understand your problem correctly, DSNB is always the leftmost substring of DSNA. Then something like this may work:
data a;
length dsna $20 vsna 8;
input dsna vsna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;run;
data b;
length dsnb $20 vsnb 8;
input dsnb vsnb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;run;
proc sort data=a;
by dsna;
run;
proc sort data=b;
by dsnb;
run;
data want;
set a(in=a) b(rename=(dsnb=dsna vsnb=vsna) in=b);
by dsna;
if b then do;
dsnb=dsna;
vsnb=vsna;
end;
retain dsnb vsnb;
if a;
if dsna=:trim(dsnb);
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.