Solved: Re: Search a dataset using a substring of a variable

TingSern · Posted 01-25-2018 02:08 AM

I am using SAS version 9.4 on z/OS mainframe.

I have the following data in two datasets (as an example) -

Dataset A

Variable DSNA V SNA

ABC.DEF.GHI.DEU 5

ABC.DEF.GHI.KYZ 1

ABC.DEF.GHI.LMB 2

ADD.XYZ.GHI.ABD 1

Dataset B

Variable DSNB VSNB

ABC.DEF.GHI 3

ADD.XYZ.GHI 5

I want to search Dataset A DSNA - using Dataset B DSNB. I cannot just merge Dataset A and B - because DSNB is a substring of the variable DSNA. In the above example, I need to return ABC.DEF.GHI.LMB and VSNA of 2 from the search using DSNB value of ABC.DEF.GHI. Basically - return the last (highest) value in DSNA.

Can somebody kindly give me some hints how to go about doing this? Thank you.

s_lassen · Posted 01-26-2018 08:10 AM

If I understand your problem correctly, DSNB is always the leftmost substring of DSNA. Then something like this may work:

data a;
  length dsna $20 vsna 8;
  input dsna vsna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;run;


data b;
  length dsnb $20 vsnb 8;
  input dsnb vsnb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;run;

proc sort data=a;
  by dsna;
run;

proc sort data=b;
  by dsnb;
run;

data want;
  set a(in=a) b(rename=(dsnb=dsna vsnb=vsna) in=b);
  by dsna;
  if b then do;
    dsnb=dsna;
    vsnb=vsna;
    end;
  retain dsnb vsnb;
  if a;
  if dsna=:trim(dsnb);
run;

View solution in original post

Shmuel · Posted 01-25-2018 02:17 AM

If DSNB variable is a fixed length then create substring of DSNA as new variable and use simple merge or

if they may be more then one match use proc sql.

TingSern · Posted 01-25-2018 02:53 AM

Unfortunately, DSNB is not of fixed length. I only showed the example as fixed length, but in the actual file, is variable length.

Also, I am not familiar with PROC SQL - can somebody kindly post some examples of coding - so that I can follow? Thank you.

TingSern · Posted 01-25-2018 02:54 AM

In fact, both DSNA and DSNB are variable in length.

Shmuel · Posted 01-25-2018 03:01 AM

Try next code:

proc sql;

create table to_check as

select a.* , b.*

from A as a

join B as b

on index(DSNA, DSNB) > 0;

quit;

Astounding · Posted 01-25-2018 03:37 AM

Can you guarantee a minimum length? For example, can you guarantee that all DSNB values are at least 3 characters long?

Shmuel · Posted 01-25-2018 03:47 AM

Next code is tested and gives the wanted results:

data dataA;
length DSNA $ 20;
input DSNA $ v_sna;
cards;
ABC.DEF.GHI.DEU                  5
ABC.DEF.GHI.KYZ                  1
ABC.DEF.GHI.LMB                  2
ADD.XYZ.GHI.ABD                  1
;
run;

data dataB;
length DSNB $15;
input DSNB $ v_snb;
cards;
ABC.DEF.GHI                           3
ADD.XYZ.GHI                           5
;
run;

proc sql;
  create table to_check as
  select a.*, b.*
  from dataA as a 
  join dataB as b
  on index(DSNA, strip(DSNB)) > 0;
quit;

TingSern · Posted 01-25-2018 03:58 AM

Hi Shmuel,

Thank you very much. I am now testing the code in my z/OS mainframe. Will let you know if it works.

TingSern · Posted 01-25-2018 04:14 AM

To Astounding -

Yes - the minimum length for both DSNA and DSNB are all greater than 8 characters.

TingSern · Posted 01-25-2018 05:40 AM

Hello,

I am testing the SQL - running it on my mainframe - z/OS operating system - the SQL procedure started at 16:56 and now is already 18:35 - and it is still running.

DSNA variable has 855593 observations and DSNB has 17471 observations.

The mainframe is a 1,700 MIPS machine - definitely not slow.

I am wondering if there is anything else I could check? Thank you.

Shmuel · Posted 01-25-2018 06:48 AM

Alternative to sql -

As DSNB is much smaller (less observations) then DSNA you may try use iterative hash technique

in a datastep instead sql, but I'm not expert with it.

The hashing will read DSNB once, at the beginning into memory and

do all checking reading DSNA once.

ballardw · Posted 01-25-2018 10:55 AM

@TingSern wrote:

Hello,

I am testing the SQL - running it on my mainframe - z/OS operating system - the SQL procedure started at 16:56 and now is already 18:35 - and it is still running.

DSNA variable has 855593 observations and DSNB has 17471 observations.

The mainframe is a 1,700 MIPS machine - definitely not slow.

I am wondering if there is anything else I could check? Thank you.

If either set has duplicate values for either the search or target string you might consider searching on DISTINCT values of the strings:

such as

proc sql;
  create table to_check as
  select a.*, b.*
  from (select distinct DSNA from dataA) as a 
  join (select distinct DSNB from dataB) as b
  on index(DSNA, strip(DSNB)) > 0;
quit;

and then if you need more values from either of the sets use the resulting set with a LEFT or RIGHT join on the appropriate variable.

If the code that is running is similar to the @Shmuel's original posting it is comparing every single record in A with every record in B, which would be 855593 * 17471 comparisons. Which might take some time. Also only include the variables you might need for matching purposes at this point as the more variables involved the more disk IO is going to be needed.

Astounding · Posted 01-25-2018 07:19 AM

Given that you have a minimum length of 8 for DSNB:

proc sql;

create table want as

select a.*, b.* from dataset_a a, dataset_b b

where substr(dsna, 1, 😎 = substr(dsnb, 1, 😎 and

dsnb <= dsna

order by dsnb dsna;

quit;

It's untested, but it should speed things up and give you the right observations to subset.

It does run into problems, however, if there are duplicate values for DSNB in DATASET_B.

It's conceivable that this would actually run faster if you create (ahead of time) separate variables holding the first 8 characters of DSNA and DSNB so that SQL doesn't need to apply SUBSTR.

Finally, note that this does not give you the final answer. It gives you a superset of the records you need. You still need to filter out the largest with something like:

data really_want;

set want;

by dsnb;

if last.dsnb;

run;

TingSern · Posted 01-29-2018 06:57 AM

SQL takes too much overheads. A simple DATA procedure is much faster (at least on z/OS mainframe). Thank you very much for all your help.

s_lassen · Posted 01-26-2018 08:10 AM

If I understand your problem correctly, DSNB is always the leftmost substring of DSNA. Then something like this may work:

data a;
  length dsna $20 vsna 8;
  input dsna vsna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;run;


data b;
  length dsnb $20 vsnb 8;
  input dsnb vsnb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;run;

proc sort data=a;
  by dsna;
run;

proc sort data=b;
  by dsnb;
run;

data want;
  set a(in=a) b(rename=(dsnb=dsna vsnb=vsna) in=b);
  by dsna;
  if b then do;
    dsnb=dsna;
    vsnb=vsna;
    end;
  retain dsnb vsnb;
  if a;
  if dsna=:trim(dsnb);
run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away