
I do not understand why a join done with PROC SQL does not give the same result as the hash object lookup below. What did I do wrong?

 

data tab1;
input id1 id2 ts;
datalines;
1 1 3
1 2 5
1 3 6
1 2 20
1 2 22
run;
data tab2;
input id id1 id2 time_from time_to descr $;
datalines;
1 1 1 4 6 xxx
2 1 2 4 6 yyy
3 1 2 21 30 kkk
run;

/*sql version*/
proc sql;
create table  ris_sql as select 
		a.*,
		b.descr 
		from     tab1 as a 
		left join tab2 as b 
		on a.id1=b.id1 and 
		   a.id2=b.id2 and
		  (a.ts between b.time_from and b.time_to);
	quit; 

/*hash version*/
data ris_hash;
attrib descr format=$10.;
if _n_=1 then do;
	declare hash s(dataset: 'tab2');
	s.definekey('id1','id2');
	s.definedata('time_from', 'time_to','descr');
	s.definedone();
	end;
set tab1;
rc=s.find();
if not(time_from <= ts <= time_to) then call missing(descr);
drop rc time_from time_to;
run;

 

1 ACCEPTED SOLUTION
DanielSantos
Barite | Level 11

Hi.

 

That's because your data set has duplicate key entries. Only the first entry per key is loaded into the hash object.

 

You should use the MULTIDATA option and cycle through FIND/FIND_NEXT until there are no more matches for the key.

 

Check this very good paper by Paul M. Dorfman: http://support.sas.com/resources/papers/proceedings16/10200-2016.pdf
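
For illustration, a minimal sketch of that approach applied to the step in the question, assuming the MULTIDATA:'Y' constructor argument and a FIND/FIND_NEXT loop as described in the paper:

data ris_hash;
attrib descr format=$10.;
if _n_=1 then do;
	/* multidata:'y' keeps every tab2 entry per key, not just the first */
	declare hash s(dataset: 'tab2', multidata: 'y');
	s.definekey('id1','id2');
	s.definedata('time_from', 'time_to','descr');
	s.definedone();
	end;
set tab1;
call missing(time_from, time_to, descr);
found=0;
rc=s.find();                      /* first entry for this key, if any */
do while (rc=0 and not found);
	if time_from <= ts <= time_to then found=1;
	else rc=s.find_next();        /* next entry with the same key */
end;
if not found then call missing(descr);
drop rc found time_from time_to;
run;

This keeps the first range that contains TS; a row whose TS fell inside several ranges would still produce only one output record, whereas the SQL left join would produce one per match.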

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt


2 REPLIES
rogerjdeangelis
Barite | Level 11
For small data sets I don't think you can easily force SAS to use a hash join.

/* T007370: SQL hash join (MAGIC=103) faster than SQL sort merge */

Hi Team

I wanted to select about 1,000,000 records from a 5,000,000-row
table and add 100 variables at the same time.
The finder file was 800 MB.

I noticed that the default SQL method (sort) took almost twice as
long as the SQL hash method. Note that I have plenty of RAM (128 GB).

The code below will not run in EG, where MEMSIZE is set at 0.5 GB.

* SQL sort-merge join: 103 seconds;
* SQL hash join:        60 seconds;


* there is an excellent paper in the most recent PharmaSUG proceedings
  on the SQL MAGIC option;

* 800mb finder file with additional variables c1-c100;
options compress=no;
data find_benes(compress=no drop=i j);
  array chr[100] $8 c1-c100;
  do i=100000 to 10000000 by 10;
    pat_id=encrypt(i);  /* encrypt() appears to be a user-written function not shown in the post */
    do j=1 to 100;
      chr[j]=put(1e8*uniform(-1),z8.);
    end;
    output;
  end;
run;quit;

/*
800mb table
Up to 40 obs from find_benes total obs=990001

pat_id     C1        C100

1202046  75950697   28236507
1202056  22914180   68703933
1202066  61116305   19959418
1202076  58626531   82712843
1202086  10725652   84593862
*/

* 4 gb;
options compress=no;
data all_benes(compress=no drop=i j);
  array chr[100] $8 d1-d100;
  do i=100000 to 10000000 by 2;
    pat_id=encrypt(i);
    do j=1 to 100;
      chr[j]=put(1e8*uniform(-1),z8.);
    end;
    output;
  end;
run;quit;


/*
4gb table with variables d1-d100
Up to 40 obs from all_benes total obs=4,950,001

 pat_id     D1                   D100

 1201766  90585571   28236507
 1201768  55665932   68703933
 1201770  29801506   19959418
 1201772  67134560   82712843
 1201774  74029386   84593862
*/


* SQL sort-merge join: 103 seconds;
* SQL hash join:        60 seconds;
proc sql _method magic=103;
    create
      table gotem as
    select
     l.*
     ,r.*
    from
      find_benes as l, all_benes as r
    where
     l.pat_id = r.pat_id
;quit;


/* magic=101 = cartesian
   magic=102 = sort merge
   magic=103 = hash
*/
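
For comparison, a sketch of the same join forced down the sort-merge path (MAGIC=102 per the table above; presumably the roughly 103-second plan, with GOTEM_SORT as a hypothetical output name):

* force the sort-merge join plan instead of the hash join;
proc sql _method magic=102;   /* _METHOD prints the plan actually chosen */
    create table gotem_sort as
    select l.*, r.*
    from find_benes as l, all_benes as r
    where l.pat_id = r.pat_id
;quit;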

Note: if you have a good amount of RAM (60-128 GB), you might want to consider the SASFILE
statement (LOAD, OPEN and CLOSE) for intermediate WORK data sets, but don't forget
to free the memory often. Unfortunately SAS does not have a means of keeping the
utility files in memory, and indexes are not supported. Perhaps a new kind of index
is needed: an in-memory index.
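
A minimal sketch of that SASFILE idea, reusing the finder table from the step above (GOTEM_MEM is a hypothetical output name; LOAD reads the whole file into memory, CLOSE frees it):

* pin a heavily reused intermediate WORK data set in memory;
sasfile work.find_benes load;    /* allocate buffers and read the whole file into memory */

proc sql _method magic=103;      /* this pass now reads find_benes from memory */
    create table gotem_mem as
    select l.*, r.*
    from find_benes as l, all_benes as r
    where l.pat_id = r.pat_id
;quit;

sasfile work.find_benes close;   /* free the memory as soon as the table is no longer needed */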

Sorting a 'big' data set:
I suspect that using a hash with 7 SYSTASKs (splitting the big data set into 7 FIRSTOBS=/OBS=
pieces that sum to less than the total data set size) might be much faster than the SAS
sort on the big data set, even though the sort is multithreaded.
Of course you need a lot of RAM.
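
A single-session sketch of the splitting mechanics, using the ALL_BENES table from above with two pieces instead of seven (PIECE1, PIECE2 and ALL_BENES_SORTED are hypothetical names; the post's idea would run each PROC SORT in its own SYSTASK batch session, or load each piece into a hash, in parallel):

* carve the table into pieces with FIRSTOBS=/OBS=, sort each piece, then interleave;
proc sort data=all_benes(firstobs=1 obs=2475000) out=piece1; by pat_id; run;
proc sort data=all_benes(firstobs=2475001)       out=piece2; by pat_id; run;

data all_benes_sorted;
  set piece1 piece2;   /* both inputs are sorted by pat_id, so SET/BY interleaves them */
  by pat_id;
run;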

