Re: SAS proc SQL and Inner join - what are alternative methods

imanojkumar1 · Posted 11-15-2016 09:08 AM

What I want to do is to find an alternative to the following code:

PROC SQL;
       CREATE TABLE WORK.XX AS
       SELECT DISTINCT t2.WC, t2.CWC
          FROM WORK.YY t1
               INNER JOIN WORK.ZZ t2 ON (t1.MC = t2.WC)
    ;
QUIT;

Could someone please help in doing the same thing using hash or any other method?

Kurt_Bremser · Posted 11-15-2016 09:16 AM

If there is a 1(YY) to n(ZZ) relation, then

data work.xx;
merge
  work.yy (in=a keep=mc rename=(mc=wc))
  work.zz (in=b keep=wc cwc)
;
by wc;
if a and b;
run;

proc sort data=work.xx nodupkey;
by wc cwc;
run;

should deliver the same result.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

imanojkumar1 · Posted 11-15-2016 02:08 PM

Many thanks for your help.

unfortunately, it returned me an error:

ERROR: BY variables are not properly sorted on data set WORK.YY
a=1 b=1 WC=7507 CWC= FIRST.WC=0 LAST.WC=1 _ERROR_=1 _N_=585904753

Kurt_Bremser · Posted 11-15-2016 02:18 PM

Then you need to sort your datasets by the variables that are used in the merge.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

imanojkumar1 · Posted 11-16-2016 04:10 AM

These are relatively large tables. sorting might take a long time. I am providing a sample data in a reply in the trailing communication (please see below). Thanks.

Kurt_Bremser · Posted 11-16-2016 06:22 AM

The solution depends very much on the size(s). If one of the tables fits into MEMSIZE, creating a format or using a hash object can speed your process up.

If that is not the case, you will have to sort the tables for the merge. In my experience, using explicit sorts and a merge will out-perform SQL when the contributing tables grow large

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PeterClemmensen · Posted 11-15-2016 09:17 AM

Do the variables WC and CWC also exist in WORK.YY? 🙂

The DATA to DATA Step Macro
Blog: SASnrd

Kurt_Bremser · Posted 11-15-2016 09:20 AM

@PeterClemmensen wrote:

Do the variables WC and CWC also exist in WORK.YY? 🙂

Good catch. I edited my post to safeguard against that.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

imanojkumar1 · Posted 11-15-2016 09:48 AM

XX Table has following vars:
MC LC MCC MCN TLC DD

YY:
WC CWC TS MS Y X

Where MC and WC are same datatype (char) and common values but only var names are different i.e. in XX it is MC and in YY it is WC

CWC is only in YY and not in XX

LinusH · Posted 11-16-2016 12:51 AM

That's a quite neat SQL, why do you need an alternative?

Data never sleeps

imanojkumar1 · Posted 11-16-2016 04:06 AM

I have the following tables:

data have01;
infile cards truncover expandtabs;
input MC $ LC $ MCC $ MCN $ TLC $ DD $ ODS_TimeStamp ODS_LUpd zTPl $ PuD $;
cards;
1853 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
1856 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
1869 DR14 1 Vetu SM3 . 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 0 .
2024 DV16 1 Vetu SM3 2008-01-31 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 47 .
2025 DV16 1 Vetu SM3 2008-01-31 24SEP2013:10:06:53.580 20JUL2016:12:55:39.240 47 .
run;

You might have to format date column in the above table.

data have02;
infile cards truncover expandtabs;
input WPMVId ToSTimeStamp TId ASN WC $ CWC $ TSide $ MNo Y X;
cards;
1 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 1 -82140 2468
2 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 2 -81940 2466
3 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 3 -81739 2463
4 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 4 -81539 2459
5 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 5 -81339 2456
6 21AUG2012:17:57:39.000 20949 1 7604 HPUS230 R 6 -81139 2453
run;

You might have to format date column in the above table.

Please help me using some alternative to SQL code above, specifically when I have issue that my Table 2 above is almost 0.8 billion rows data and it takes hell a lot time to run SQL query as above.

LinusH · Posted 11-16-2016 11:30 AM

But how large is t1?
Are the tables by any chance indexed on MC/WC respectively?
And how large/small does t1 subsets t2 aproxmately?
Unless small sub with indexes hash table or user def format are your best shots.

Data never sleeps

Registration is open

SAS Training: Just a Click Away