Search a dataset using a substring of a variable

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 15
Accepted Solution

Search a dataset using a substring of a variable

 

 

 

 

 

 

 

 

 

 

I am using SAS version  9.4 on z/OS mainframe.

 

I have the following data in two datasets (as an example) -

 

Dataset A 

 

Variable DSNA                        V SNA

ABC.DEF.GHI.DEU                  5

ABC.DEF.GHI.KYZ                  1

ABC.DEF.GHI.LMB                  2

ADD.XYZ.GHI.ABD                  1

 

Dataset B

 

Variable DSNB                         VSNB

ABC.DEF.GHI                           3

ADD.XYZ.GHI                           5

 

I want to search Dataset A DSNA - using Dataset B DSNB.  I cannot just merge Dataset A and B - because DSNB is a substring of the variable DSNA.  In the above example, I need to return ABC.DEF.GHI.LMB and VSNA of 2 from the search using DSNB value of ABC.DEF.GHI.   Basically - return the last (highest) value in DSNA. 

 

Can somebody kindly give me some hints how to go about doing this?  Thank you. 

 


Accepted Solutions
Solution
4 weeks ago
PROC Star
Posts: 169

Re: Search a dataset using a substring of a variable

If I understand your problem correctly, DSNB is always the leftmost substring of DSNA. Then something like this may work:

data a;
  length dsna $20 vsna 8;
  input dsna vsna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;run;


data b;
  length dsnb $20 vsnb 8;
  input dsnb vsnb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;run;

proc sort data=a;
  by dsna;
run;

proc sort data=b;
  by dsnb;
run;

data want;
  set a(in=a) b(rename=(dsnb=dsna vsnb=vsna) in=b);
  by dsna;
  if b then do;
    dsnb=dsna;
    vsnb=vsna;
    end;
  retain dsnb vsnb;
  if a;
  if dsna=:trim(dsnb);
run;

View solution in original post


All Replies
Trusted Advisor
Posts: 1,683

Re: Search a dataset using a substring of a variable

If DSNB variable is a fixed length then create substring of DSNA as new variable and use simple merge or

if they may be more then one match use proc sql.

Occasional Contributor
Posts: 15

Re: Search a dataset using a substring of a variable

Unfortunately, DSNB is not of fixed length.  I only showed the example as fixed length, but in the actual file, is variable length. 

 

Also, I am not familiar with PROC SQL - can somebody kindly post some examples of coding - so that I can follow?  Thank you.

Occasional Contributor
Posts: 15

Re: Search a dataset using a substring of a variable

In fact, both DSNA and DSNB are variable in length. 

Trusted Advisor
Posts: 1,683

Re: Search a dataset using a substring of a variable

[ Edited ]

Try next code:

 

proc sql;

   create table to_check as

  select a.* , b.*

  from A as a

  join  B as b

  on index(DSNA, DSNB) > 0;

quit;

Super User
Posts: 6,008

Re: Search a dataset using a substring of a variable

Can you guarantee a minimum length?  For example, can you guarantee that all DSNB values are at least 3 characters long?

Trusted Advisor
Posts: 1,683

Re: Search a dataset using a substring of a variable

[ Edited ]
Posted in reply to Astounding

Next code is tested and gives the wanted results:

data dataA;
length DSNA $ 20;
input DSNA $ v_sna;
cards;
ABC.DEF.GHI.DEU                  5
ABC.DEF.GHI.KYZ                  1
ABC.DEF.GHI.LMB                  2
ADD.XYZ.GHI.ABD                  1
;
run;

data dataB;
length DSNB $15; input DSNB $ v_snb; cards; ABC.DEF.GHI 3 ADD.XYZ.GHI 5 ; run; proc sql; create table to_check as select a.*, b.* from dataA as a join dataB as b on index(DSNA, strip(DSNB)) > 0; quit;
Occasional Contributor
Posts: 15

Re: Search a dataset using a substring of a variable

Hi Shmuel,

 

Thank you very much.  I am now testing the code in my z/OS mainframe.  Will let you know if it works. 

Occasional Contributor
Posts: 15

Re: Search a dataset using a substring of a variable

Posted in reply to Astounding

To Astounding -

 

Yes - the minimum length for both DSNA and DSNB are all greater than 8 characters.

 

 

Occasional Contributor
Posts: 15

Re: Search a dataset using a substring of a variable

Hello,

 

I am testing the SQL - running it on my mainframe - z/OS operating system - the SQL procedure started at 16:56 and now is already 18:35 - and it is still running.

 

DSNA variable has 855593 observations and DSNB has 17471 observations.

 

The mainframe is a 1,700 MIPS machine - definitely not slow.

 

I am wondering if there is anything else I could check?   Thank you.

Trusted Advisor
Posts: 1,683

Re: Search a dataset using a substring of a variable

[ Edited ]

Alternative to sql - 

 

As DSNB is much smaller (less observations) then DSNA you may try use iterative hash technique

in a datastep instead sql, but I'm not expert with it.

 

The hashing will read DSNB once, at the beginning into memory and

do all checking reading DSNA once.

Super User
Posts: 12,148

Re: Search a dataset using a substring of a variable


TingSern wrote:

Hello,

 

I am testing the SQL - running it on my mainframe - z/OS operating system - the SQL procedure started at 16:56 and now is already 18:35 - and it is still running.

 

DSNA variable has 855593 observations and DSNB has 17471 observations.

 

The mainframe is a 1,700 MIPS machine - definitely not slow.

 

I am wondering if there is anything else I could check?   Thank you.


If either set has duplicate values for either the search or target string you might consider searching on DISTINCT values of the strings:

such as

proc sql;
  create table to_check as
  select a.*, b.*
  from (select distinct DSNA from dataA) as a 
  join (select distinct DSNB from dataB) as b
  on index(DSNA, strip(DSNB)) > 0;
quit;

and then if you need more values from either of the sets use the resulting set with a LEFT or RIGHT join on the appropriate variable.

 

 

If the code that is running is similar to the @Shmuel's original posting it is comparing every single record in A with every record in B, which would be 855593 * 17471 comparisons. Which might take some time. Also only include the variables you might need for matching purposes at this point as the more variables involved the more disk IO is going to be needed.

Super User
Posts: 6,008

Re: Search a dataset using a substring of a variable

[ Edited ]

Given that you have a minimum length of 8 for DSNB:

 

proc sql;

create table want as

select a.*, b.* from dataset_a a, dataset_b b

where substr(dsna, 1, 8) = substr(dsnb, 1, 8) and

dsnb <= dsna

order by dsnb dsna;

quit;

 

It's untested, but it should speed things up and give you the right observations to subset.

 

It does run into problems, however, if there are duplicate values for DSNB in DATASET_B.

 

It's conceivable that this would actually run faster if you create (ahead of time) separate variables holding the first 8 characters of DSNA and DSNB so that SQL doesn't need to apply SUBSTR.

 

Finally, note that this does not give you the final answer.  It gives you a superset of the records you need.  You still need to filter out the largest with something like:

 

data really_want;

set want;

by dsnb;

if last.dsnb;

run;

Occasional Contributor
Posts: 15

Re: Search a dataset using a substring of a variable

Posted in reply to Astounding

SQL takes too much overheads.  A simple DATA procedure is much faster (at least on z/OS mainframe).  Thank you very much for all your help.

Solution
4 weeks ago
PROC Star
Posts: 169

Re: Search a dataset using a substring of a variable

If I understand your problem correctly, DSNB is always the leftmost substring of DSNA. Then something like this may work:

data a;
  length dsna $20 vsna 8;
  input dsna vsna;
cards;
ABC.DEF.GHI.DEU 5
ABC.DEF.GHI.KYZ 1
ABC.DEF.GHI.LMB 2
ADD.XYZ.GHI.ABD 1
;run;


data b;
  length dsnb $20 vsnb 8;
  input dsnb vsnb;
cards;
ABC.DEF.GHI 3
ADD.XYZ.GHI 5
;run;

proc sort data=a;
  by dsna;
run;

proc sort data=b;
  by dsnb;
run;

data want;
  set a(in=a) b(rename=(dsnb=dsna vsnb=vsna) in=b);
  by dsna;
  if b then do;
    dsnb=dsna;
    vsnb=vsna;
    end;
  retain dsnb vsnb;
  if a;
  if dsna=:trim(dsnb);
run;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 140 views
  • 0 likes
  • 5 in conversation