Solved: Re: DATA MERGE on large datasets but SAS runs out of memory - Page 3

Cruise · Posted 04-23-2019 03:59 PM

@novinosrin

1. Do both the datasets have the same set of ID's or could it differ? It could differ

2. Do the names of the ID's or in other words keys differ? IDs in two datasets are same name and format.

novinosrin · Posted 04-23-2019 04:08 PM

Ok @Cruise

Let me know if you can follow this and execute

"1. Do both the datasets have the same set of ID's or could it differ? It could differ

2. Do the names of the ID's or in other words keys differ? IDs in two datasets are same name and format."


DATA LARGE;
INPUT ID $ CODES $ DATE; 
CARDS;
ALLSS 04185 18127 
ALLSS 04185 18129 
ALLSS 04185 18130 
ALLSS 25000 16840 
ALLSS 25000 16874 
ALLSS 25000 16972 
ALLSS 25000 17063 
ALLSS 25000 17184 
ALLSS 25000 17188 
AALLSS 25000 17231 
AALLSS 25000 17321 
AALLSS 25000 17458 
AALLSS 25000 17604 
AALLSS 25000 17867 
AALLSS 25000 17979 
AALLSS 25000 18078 
AALLSS 25002 17374 
AALLSS 2669 17545 
AALLSS 2689 17374 
AALLSS 2689 17601 
AALLSS 2720 17184 
AALLSS 2724 16840 
AALLSS 2724 17063 
AALLSS 2724 17175 
AALLSS 2724 17231 
AALLSS 2724 17321 
AALLSS 2724 17374 
AALLSS 2724 17601 
AALLSS 2724 17867 
AALLSS 2724 17979 
AALLSS 2729 17188 
AALLSS 2752 17184 
AALLSS 2772 17825 
AALLSS 27800 17184 
AALLSS 2859 16552 
AALLSS 2859 16664 
AALLSS 2859 16852 
AALLSS 2859 16909 
AALLSS 2948 17184 
AALLSS 30000 17545 
AALLSS 311 16874 
AALLSS 311 16972 
AALLSS 311 17321 
AALLSS 311 17458 
AALLSS 311 18078 
AALLSS 32723 17184 
AALLSS 3310 17188 
AALLSS 4019 16636 
AALLSS 4019 16840 
AALLSS 4019 16874 
AALLSS 4019 16909 
AALLSS 4019 16972 
AALLSS 4019 17063 
AALLSS 4019 17175 
AALLSS 4019 17184 
AALLSS 4019 17188 
AALLSS 4019 17231 
AALLSS 4019 17321 
AALLSS 4019 17374 
AALLSS 4019 17458 
AALLSS 4019 17601 
AALLSS 42760 17604 
AALLSS 4293 17184 
AALLSS 43491 17184 
AALLSS 43491 17185 
AALLSS 43491 17186 
AALLSS 43491 17187 
AALLSS 4660 17825 
AALLSS 496 16840 
AALLSS 496 16874 
AALLSS 496 16905 
AALLSS 496 16935 
AALLSS 496 16966 
AALLSS 496 16996 
AALLSS 496 17027 
AALLSS 496 17058 
AALLSS 496 17088 
AALLSS 496 17119 
AALLSS 496 17149 
AALLSS 496 17180 
AALLSS 496 17211 
AALLSS 496 17239 
AALLSS 496 17270 
AALLSS 496 17300 
AALLSS 496 17331 
AALLSS 496 17361 
AALLSS 496 17392 
AALLSS 496 17423 
AALLSS 496 17453 
AALLSS 496 17458 
AALLSS 496 17484 
AALLSS 496 17514 
AALLSS 496 17545 
AALLSS 496 17576 
AALLSS 496 17605 
AALLSS 496 17636 
AALLSS 496 17666 
AALLSS 496 17697 
AALLSS 496 17727 
AALLSS 496 17758 
;

DATA SMALL;
length id $8;
id='AALLSS';
RUN;
/*sort is need for by group processing i.e do until;set;by*/
proc sort data=large;
by ID;
run;
/*Keeping only the matching(exisiting) ids in small*/
proc sql;
create table temp as 
select *
from large
where id in (select id from small)
order by id;
quit;

/*The following requires a sorted approach hence the proc sort by id*/
data want;
if _n_=1 then 
    do;
      if 0 then set small large;
      dcl hash h1 (multidata:'y');
      h1.defineKey('id');
      h1.defineData('codes','date');
      h1.defineDone();
    end;
/*Reading your large by id i.e only one id at a time and load
in the hash object*/
do until(last.id);
set temp;
by id;
rc=h1.add();
end;
/*Reading your small and look up the large*/
do until(last.id);
set small;
by id;
  do while(h1.do_over() eq 0);
    output;
  end;
end;
/*Clear the contents of hash freeing memory space after processing each id*/
h1.clear();
drop rc:;
run;

novinosrin · Posted 04-23-2019 04:11 PM

Also a simple inner join in proc sql will suffice, you could test to see how this performs

proc sql;
create table want as
select a.id,codes,date
from small a inner join large b
on a.id=b.id
order by id,codes ,date;
quit;

Cruise · Posted 04-23-2019 04:15 PM

@novinosrin

I'm running the latest hash code now. speaking of simple sql inner join

1. does sql jointakes less work space than merge in data step?

2. does hash object approach takes less work space than any of above approaches?

my struggle is the "out of memory" and "running out of work space" issue.

novinosrin · Posted 04-23-2019 04:17 PM

I know. I just gave you both. Let's test to see what works. If one of the two works, use it. 🙂 If both work, great. The one that doesn't is garbage

Cruise · Posted 04-23-2019 04:33 PM

@novinosrin

Below is the n of distinct mbr_id from small and large datasets.

proc sql; select count(distinct id) from small; quit; /*N=281,577*/
proc sql; select count(distinct id) from large; quit; /*N=349,893*/

However, I find only 4,667 rows which is 4 unique ID in "want" data.

NOTE: There were 108956917 observations read from the data set WORK.TEMP.
NOTE: There were 181979 observations read from the data set WORK.SMALL.
NOTE: The data set WORK.WANT has 4667 observations and 26 variables.
NOTE: DATA statement used (Total process time):
real time 23.46 seconds
cpu time 23.47 seconds

I'm trying proc sql simple inner join now,

novinosrin · Posted 04-23-2019 04:36 PM

did you do a select count(distinct) from temp ?

That should tally with the want

Cruise · Posted 04-23-2019 04:39 PM

@novinosrin

proc sql; select count(distinct mbr_id) from temp; quit; /*N=181,978*/

novinosrin · Posted 04-23-2019 04:44 PM

That strange that you are getting only 4000 odd in your want.

Basically the idea is to have the same id's in filtered temp as that of small. So in other words, the small is all unique id's whereas in temp, the very same has duplicates.

So When you sort both by ID. The sequence would be the same. Therefore, making the hash look up process work such that

1. Load the group 1 from Temp

2. Read the group 1 from Small

3. Find the associated variables.

Also make sure, you name correctly, in some you seem to have MBR_id and others ID. In that case, Hash requires explicit reference.

Cruise · Posted 04-23-2019 04:50 PM

@novinosrin

does this code look al'right? this is what resulted in 4000 in want data.

proc sort data=small; /*N distinct=281,577, Nrows=281,577*/
by id;
proc sort data=temp; /*N distinct=181,978, Nrows=108,956,917*/
by id;
proc sort data=large; /*N distinct=349,893, Nrows=295,780,166*/
by id;
/*The following requires a sorted approach hence the proc sort by id*/
data want;
if _n_=1 then 
    do;
      if 0 then set small large;
      dcl hash h1 (multidata:'y');
      h1.defineKey('id');
      h1.defineData('codes','date');
      h1.defineDone();
    end;
/*Reading your large by id i.e only one id at a time and load
in the hash object*/
do until(last.id);
set temp;
by id;
rc=h1.add();
end;
/*Reading your small and look up the large*/
do until(last.id);
set small;
by id;
  do while(h1.do_over() eq 0);
    output;
  end;
end;
/*Clear the contents of hash freeing memory space after processing each id*/
h1.clear();
drop rc:;
run;

novinosrin · Posted 04-23-2019 04:59 PM

Let's examine this

proc sort data=small; /*N distinct=281,577, Nrows=281,577*/
by id;
proc sort data=temp; /*N distinct=181,978, Nrows=108,956,917*/
by id;
proc sort data=large; /*N distinct=349,893, Nrows=295,780,166*/
by id;

before we go further.

In temp,we intend to have only the ID's that exist in small by filtering the large --using where id in(select id in small)

Now that your distinct count in small being higher than that of temp by almost 100 ,000. Blimey! that's a lot. So a lot of ID's in small are not actually present in large or temp in the first place. This is the first conclusion.

Now run a test to see how many ID's in small matches with temp.

proc sql;

create table check as

select count(distinct id) as c

from small

where id in (select distinct id from temp);

quit;

The above should be very small. Let me know

Cruise · Posted 04-23-2019 05:24 PM

I just switched between work stations. and running my queries from the scratch. in the meantime, proc sql, simple inner join appears to have worked. I'll let you know.

Cruise · Posted 04-23-2019 05:30 PM

This is correct: "So a lot of ID's in small are not actually present in large or temp in the first place". Small dataset is the universe of all possible mbr_id of which many are not in large data. Large data is large because of repeated rows not by n of distinct subjects.

novinosrin · Posted 04-24-2019 09:20 AM

Good morning @Cruise

Is the climax over? So what worked or didn't work. Should we assume you are good and happy with your progress?

Cruise · Posted 04-24-2019 09:25 AM

My SAS is still running "proc sort nodupkey; by id codes date;" since 7am and now it is: 9:30. I needed to go back to this step because I switched between work stations. Thanks for asking. I'll update as soon as this deduplication process finish and run hash table code again. Sound good?

Classroom Training Available!