
I do not understand why a join done with PROC SQL does not give the same result as the hash object lookup below. What did I do wrong?

 

data tab1;
input id1 id2 ts;
datalines;
1 1 3
1 2 5
1 3 6
1 2 20
1 2 22
run;
data tab2;
input id id1 id2 time_from time_to descr $;
datalines;
1 1 1 4 6 xxx
2 1 2 4 6 yyy
3 1 2 21 30 kkk
run;

/*sql version*/
proc sql;
create table  ris_sql as select 
		a.*,
		b.descr 
		from     tab1 as a 
		left join tab2 as b 
		on a.id1=b.id1 and 
		   a.id2=b.id2 and
		  (a.ts between b.time_from and b.time_to);
	quit; 

/*hash version*/
data ris_hash;
attrib descr format=$10.;
if _n_=1 then do;
	declare hash s(dataset: 'tab2');
	s.definekey('id1','id2');
	s.definedata('time_from', 'time_to','descr');
	s.definedone();
	end;
set tab1;
rc=s.find();
if not(time_from <= ts <= time_to) then call missing(descr);
drop rc time_from time_to;
run;

 

1 ACCEPTED SOLUTION
DanielSantos
Barite | Level 11

Hi.

 

That's because your data set has duplicate key entries. Only the first entry per key is loaded into the hash object.

 

You should use the MULTIDATA option and cycle through FIND/FIND_NEXT until there are no more matches for the key.

 

Check this very good paper by Paul M. Dorfman: http://support.sas.com/resources/papers/proceedings16/10200-2016.pdf
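
For illustration, a minimal sketch of that approach applied to the step in the question, assuming the MULTIDATA:'Y' constructor argument and a FIND/FIND_NEXT loop as described in the paper:

data ris_hash;
attrib descr format=$10.;
if _n_=1 then do;
	/* multidata:'y' keeps every tab2 entry per key, not just the first */
	declare hash s(dataset: 'tab2', multidata: 'y');
	s.definekey('id1','id2');
	s.definedata('time_from', 'time_to','descr');
	s.definedone();
	end;
set tab1;
call missing(time_from, time_to, descr);
found=0;
rc=s.find();                      /* first entry for this key, if any */
do while (rc=0 and not found);
	if time_from <= ts <= time_to then found=1;
	else rc=s.find_next();        /* next entry with the same key */
end;
if not found then call missing(descr);
drop rc found time_from time_to;
run;

This keeps the first range that contains TS; a row whose TS fell inside several ranges would still produce only one output record, whereas the SQL left join would produce one per match.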

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt


2 REPLIES
rogerjdeangelis
Barite | Level 11
For small data sets I don't think you can easily force SAS to use a hash join.

/* T007370: SQL hash join (MAGIC=103) faster than SQL sort merge */

Hi Team

I wanted to select about 1,000,000 records from a 5,000,000-row
table and add 100 variables at the same time.
The finder file was 800 MB.

I noticed that the default SQL method (sort) took almost twice as
long as the SQL hash method. Note that I have plenty of RAM (128 GB).

The code below will not run in EG, where MEMSIZE is set at 0.5 GB.

* SQL sort-merge join: 103 seconds;
* SQL hash join:        60 seconds;


* there is an excellent paper in the most recent PharmaSUG proceedings
  on the SQL MAGIC option;

* 800mb finder file with additional variables c1-c100;
options compress=no;
data find_benes(compress=no drop=i j);
  array chr[100] $8 c1-c100;
  do i=100000 to 10000000 by 10;
    pat_id=encrypt(i);  /* encrypt() appears to be a user-written function not shown in the post */
    do j=1 to 100;
      chr[j]=put(1e8*uniform(-1),z8.);
    end;
    output;
  end;
run;quit;

/*
800mb table
Up to 40 obs from find_benes total obs=990001

pat_id     C1        C100

1202046  75950697   28236507
1202056  22914180   68703933
1202066  61116305   19959418
1202076  58626531   82712843
1202086  10725652   84593862
*/

* 4 gb;
options compress=no;
data all_benes(compress=no drop=i j);
  array chr[100] $8 d1-d100;
  do i=100000 to 10000000 by 2;
    pat_id=encrypt(i);
    do j=1 to 100;
      chr[j]=put(1e8*uniform(-1),z8.);
    end;
    output;
  end;
run;quit;


/*
4gb table with variables d1-d100
Up to 40 obs from all_benes total obs=4,950,001

 pat_id     D1                   D100

 1201766  90585571   28236507
 1201768  55665932   68703933
 1201770  29801506   19959418
 1201772  67134560   82712843
 1201774  74029386   84593862
*/


* SQL sort-merge join: 103 seconds;
* SQL hash join:        60 seconds;
proc sql _method magic=103;
    create
      table gotem as
    select
     l.*
     ,r.*
    from
      find_benes as l, all_benes as r
    where
     l.pat_id = r.pat_id
;quit;


/* magic=101 = cartesian
   magic=102 = sort merge
   magic=103 = hash
*/
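
For comparison, a sketch of the same join forced down the sort-merge path (MAGIC=102 per the table above; presumably the roughly 103-second plan, with GOTEM_SORT as a hypothetical output name):

* force the sort-merge join plan instead of the hash join;
proc sql _method magic=102;   /* _METHOD prints the plan actually chosen */
    create table gotem_sort as
    select l.*, r.*
    from find_benes as l, all_benes as r
    where l.pat_id = r.pat_id
;quit;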

Note: if you have a good amount of RAM (60-128 GB), you might want to consider the SASFILE
statement (LOAD, OPEN and CLOSE) for intermediate WORK data sets, but don't forget
to free the memory often. Unfortunately SAS does not have a means of keeping the
utility files in memory, and indexes are not supported. Perhaps a new kind of index
is needed: an in-memory index.
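
A minimal sketch of that SASFILE idea, reusing the finder table from the step above (GOTEM_MEM is a hypothetical output name; LOAD reads the whole file into memory, CLOSE frees it):

* pin a heavily reused intermediate WORK data set in memory;
sasfile work.find_benes load;    /* allocate buffers and read the whole file into memory */

proc sql _method magic=103;      /* this pass now reads find_benes from memory */
    create table gotem_mem as
    select l.*, r.*
    from find_benes as l, all_benes as r
    where l.pat_id = r.pat_id
;quit;

sasfile work.find_benes close;   /* free the memory as soon as the table is no longer needed */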

Sorting a 'big' data set:
I suspect that using a hash with 7 SYSTASKs (splitting the big data set into 7 FIRSTOBS=/OBS=
pieces that sum to less than the total data set size) might be much faster than the SAS
sort on the big data set, even though the sort is multithreaded.
Of course you need a lot of RAM.
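
A single-session sketch of the splitting mechanics, using the ALL_BENES table from above with two pieces instead of seven (PIECE1, PIECE2 and ALL_BENES_SORTED are hypothetical names; the post's idea would run each PROC SORT in its own SYSTASK batch session, or load each piece into a hash, in parallel):

* carve the table into pieces with FIRSTOBS=/OBS=, sort each piece, then interleave;
proc sort data=all_benes(firstobs=1 obs=2475000) out=piece1; by pat_id; run;
proc sort data=all_benes(firstobs=2475001)       out=piece2; by pat_id; run;

data all_benes_sorted;
  set piece1 piece2;   /* both inputs are sorted by pat_id, so SET/BY interleaves them */
  by pat_id;
run;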

