BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
TingSern
Obsidian | Level 7

 

 

 

 

 

 

 

 

 

 

I am using SAS version  9.4 on z/OS mainframe.

 

I have the following data in two datasets (as an example) -

 

Dataset A 

 

Variable DSNA                        V SNA

ABC.DEF.GHI.DEU                  5

ABC.DEF.GHI.KYZ                  1

ABC.DEF.GHI.LMB                  2

ADD.XYZ.GHI.ABD                  1

 

Dataset B

 

Variable DSNB                         VSNB

ABC.DEF.GHI                           3

ADD.XYZ.GHI                           5

 

I want to search Dataset A DSNA - using Dataset B DSNB.  I cannot just merge Dataset A and B - because DSNB is a substring of the variable DSNA.  In the above example, I need to return ABC.DEF.GHI.LMB and VSNA of 2 from the search using DSNB value of ABC.DEF.GHI.   Basically - return the last (highest) value in DSNA. 

 

Can somebody kindly give me some hints how to go about doing this?  Thank you. 

 

1 ACCEPTED SOLUTION

Accepted Solutions
s_lassen
Meteorite | Level 14

If I understand your problem correctly, DSNB is always the leftmost substring of DSNA. Then something like this may work:

data a;
  length dsna $20 vsna 8;
  input dsna vsna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;run;


data b;
  length dsnb $20 vsnb 8;
  input dsnb vsnb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;run;

proc sort data=a;
  by dsna;
run;

proc sort data=b;
  by dsnb;
run;

data want;
  set a(in=a) b(rename=(dsnb=dsna vsnb=vsna) in=b);
  by dsna;
  if b then do;
    dsnb=dsna;
    vsnb=vsna;
    end;
  retain dsnb vsnb;
  if a;
  if dsna=:trim(dsnb);
run;

View solution in original post

14 REPLIES 14
Shmuel
Garnet | Level 18

If DSNB variable is a fixed length then create substring of DSNA as new variable and use simple merge or

if they may be more then one match use proc sql.

TingSern
Obsidian | Level 7

Unfortunately, DSNB is not of fixed length.  I only showed the example as fixed length, but in the actual file, is variable length. 

 

Also, I am not familiar with PROC SQL - can somebody kindly post some examples of coding - so that I can follow?  Thank you.

TingSern
Obsidian | Level 7

In fact, both DSNA and DSNB are variable in length. 

Shmuel
Garnet | Level 18

Try next code:

 

proc sql;

   create table to_check as

  select a.* , b.*

  from A as a

  join  B as b

  on index(DSNA, DSNB) > 0;

quit;

Astounding
PROC Star

Can you guarantee a minimum length?  For example, can you guarantee that all DSNB values are at least 3 characters long?

Shmuel
Garnet | Level 18

Next code is tested and gives the wanted results:

data dataA;
length DSNA $ 20;
input DSNA $ v_sna;
cards;
ABC.DEF.GHI.DEU                  5
ABC.DEF.GHI.KYZ                  1
ABC.DEF.GHI.LMB                  2
ADD.XYZ.GHI.ABD                  1
;
run;

data dataB;
length DSNB $15; input DSNB $ v_snb; cards; ABC.DEF.GHI 3 ADD.XYZ.GHI 5 ; run; proc sql; create table to_check as select a.*, b.* from dataA as a join dataB as b on index(DSNA, strip(DSNB)) > 0; quit;
TingSern
Obsidian | Level 7

Hi Shmuel,

 

Thank you very much.  I am now testing the code in my z/OS mainframe.  Will let you know if it works. 

TingSern
Obsidian | Level 7

To Astounding -

 

Yes - the minimum length for both DSNA and DSNB are all greater than 8 characters.

 

 

TingSern
Obsidian | Level 7

Hello,

 

I am testing the SQL - running it on my mainframe - z/OS operating system - the SQL procedure started at 16:56 and now is already 18:35 - and it is still running.

 

DSNA variable has 855593 observations and DSNB has 17471 observations.

 

The mainframe is a 1,700 MIPS machine - definitely not slow.

 

I am wondering if there is anything else I could check?   Thank you.

Shmuel
Garnet | Level 18

Alternative to sql - 

 

As DSNB is much smaller (less observations) then DSNA you may try use iterative hash technique

in a datastep instead sql, but I'm not expert with it.

 

The hashing will read DSNB once, at the beginning into memory and

do all checking reading DSNA once.

ballardw
Super User

@TingSern wrote:

Hello,

 

I am testing the SQL - running it on my mainframe - z/OS operating system - the SQL procedure started at 16:56 and now is already 18:35 - and it is still running.

 

DSNA variable has 855593 observations and DSNB has 17471 observations.

 

The mainframe is a 1,700 MIPS machine - definitely not slow.

 

I am wondering if there is anything else I could check?   Thank you.


If either set has duplicate values for either the search or target string you might consider searching on DISTINCT values of the strings:

such as

proc sql;
  create table to_check as
  select a.*, b.*
  from (select distinct DSNA from dataA) as a 
  join (select distinct DSNB from dataB) as b
  on index(DSNA, strip(DSNB)) > 0;
quit;

and then if you need more values from either of the sets use the resulting set with a LEFT or RIGHT join on the appropriate variable.

 

 

If the code that is running is similar to the @Shmuel's original posting it is comparing every single record in A with every record in B, which would be 855593 * 17471 comparisons. Which might take some time. Also only include the variables you might need for matching purposes at this point as the more variables involved the more disk IO is going to be needed.

Astounding
PROC Star

Given that you have a minimum length of 8 for DSNB:

 

proc sql;

create table want as

select a.*, b.* from dataset_a a, dataset_b b

where substr(dsna, 1, 😎 = substr(dsnb, 1, 😎 and

dsnb <= dsna

order by dsnb dsna;

quit;

 

It's untested, but it should speed things up and give you the right observations to subset.

 

It does run into problems, however, if there are duplicate values for DSNB in DATASET_B.

 

It's conceivable that this would actually run faster if you create (ahead of time) separate variables holding the first 8 characters of DSNA and DSNB so that SQL doesn't need to apply SUBSTR.

 

Finally, note that this does not give you the final answer.  It gives you a superset of the records you need.  You still need to filter out the largest with something like:

 

data really_want;

set want;

by dsnb;

if last.dsnb;

run;

TingSern
Obsidian | Level 7

SQL takes too much overheads.  A simple DATA procedure is much faster (at least on z/OS mainframe).  Thank you very much for all your help.

s_lassen
Meteorite | Level 14

If I understand your problem correctly, DSNB is always the leftmost substring of DSNA. Then something like this may work:

data a;
  length dsna $20 vsna 8;
  input dsna vsna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;run;


data b;
  length dsnb $20 vsnb 8;
  input dsnb vsnb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;run;

proc sort data=a;
  by dsna;
run;

proc sort data=b;
  by dsnb;
run;

data want;
  set a(in=a) b(rename=(dsnb=dsna vsnb=vsna) in=b);
  by dsna;
  if b then do;
    dsnb=dsna;
    vsnb=vsna;
    end;
  retain dsnb vsnb;
  if a;
  if dsna=:trim(dsnb);
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1495 views
  • 0 likes
  • 5 in conversation