1. Do both the datasets have the same set of ID's or could it differ? It could differ
2. Do the names of the ID's or in other words keys differ? IDs in two datasets are same name and format.
Ok @Cruise
Let me know if you can follow this and execute
"1. Do both the datasets have the same set of ID's or could it differ? It could differ
2. Do the names of the ID's or in other words keys differ? IDs in two datasets are same name and format."
DATA LARGE;
INPUT ID $ CODES $ DATE;
CARDS;
ALLSS 04185 18127
ALLSS 04185 18129
ALLSS 04185 18130
ALLSS 25000 16840
ALLSS 25000 16874
ALLSS 25000 16972
ALLSS 25000 17063
ALLSS 25000 17184
ALLSS 25000 17188
AALLSS 25000 17231
AALLSS 25000 17321
AALLSS 25000 17458
AALLSS 25000 17604
AALLSS 25000 17867
AALLSS 25000 17979
AALLSS 25000 18078
AALLSS 25002 17374
AALLSS 2669 17545
AALLSS 2689 17374
AALLSS 2689 17601
AALLSS 2720 17184
AALLSS 2724 16840
AALLSS 2724 17063
AALLSS 2724 17175
AALLSS 2724 17231
AALLSS 2724 17321
AALLSS 2724 17374
AALLSS 2724 17601
AALLSS 2724 17867
AALLSS 2724 17979
AALLSS 2729 17188
AALLSS 2752 17184
AALLSS 2772 17825
AALLSS 27800 17184
AALLSS 2859 16552
AALLSS 2859 16664
AALLSS 2859 16852
AALLSS 2859 16909
AALLSS 2948 17184
AALLSS 30000 17545
AALLSS 311 16874
AALLSS 311 16972
AALLSS 311 17321
AALLSS 311 17458
AALLSS 311 18078
AALLSS 32723 17184
AALLSS 3310 17188
AALLSS 4019 16636
AALLSS 4019 16840
AALLSS 4019 16874
AALLSS 4019 16909
AALLSS 4019 16972
AALLSS 4019 17063
AALLSS 4019 17175
AALLSS 4019 17184
AALLSS 4019 17188
AALLSS 4019 17231
AALLSS 4019 17321
AALLSS 4019 17374
AALLSS 4019 17458
AALLSS 4019 17601
AALLSS 42760 17604
AALLSS 4293 17184
AALLSS 43491 17184
AALLSS 43491 17185
AALLSS 43491 17186
AALLSS 43491 17187
AALLSS 4660 17825
AALLSS 496 16840
AALLSS 496 16874
AALLSS 496 16905
AALLSS 496 16935
AALLSS 496 16966
AALLSS 496 16996
AALLSS 496 17027
AALLSS 496 17058
AALLSS 496 17088
AALLSS 496 17119
AALLSS 496 17149
AALLSS 496 17180
AALLSS 496 17211
AALLSS 496 17239
AALLSS 496 17270
AALLSS 496 17300
AALLSS 496 17331
AALLSS 496 17361
AALLSS 496 17392
AALLSS 496 17423
AALLSS 496 17453
AALLSS 496 17458
AALLSS 496 17484
AALLSS 496 17514
AALLSS 496 17545
AALLSS 496 17576
AALLSS 496 17605
AALLSS 496 17636
AALLSS 496 17666
AALLSS 496 17697
AALLSS 496 17727
AALLSS 496 17758
;
DATA SMALL;
length id $8;
id='AALLSS';
RUN;
/*sort is need for by group processing i.e do until;set;by*/
proc sort data=large;
by ID;
run;
/*Keeping only the matching(exisiting) ids in small*/
proc sql;
create table temp as
select *
from large
where id in (select id from small)
order by id;
quit;
/*The following requires a sorted approach hence the proc sort by id*/
data want;
if _n_=1 then
do;
if 0 then set small large;
dcl hash h1 (multidata:'y');
h1.defineKey('id');
h1.defineData('codes','date');
h1.defineDone();
end;
/*Reading your large by id i.e only one id at a time and load
in the hash object*/
do until(last.id);
set temp;
by id;
rc=h1.add();
end;
/*Reading your small and look up the large*/
do until(last.id);
set small;
by id;
do while(h1.do_over() eq 0);
output;
end;
end;
/*Clear the contents of hash freeing memory space after processing each id*/
h1.clear();
drop rc:;
run;
Also a simple inner join in proc sql will suffice, you could test to see how this performs
proc sql;
create table want as
select a.id,codes,date
from small a inner join large b
on a.id=b.id
order by id,codes ,date;
quit;
I'm running the latest hash code now. speaking of simple sql inner join
1. does sql jointakes less work space than merge in data step?
2. does hash object approach takes less work space than any of above approaches?
my struggle is the "out of memory" and "running out of work space" issue.
I know. I just gave you both. Let's test to see what works. If one of the two works, use it. 🙂 If both work, great. The one that doesn't is garbage
Below is the n of distinct mbr_id from small and large datasets.
proc sql; select count(distinct id) from small; quit; /*N=281,577*/
proc sql; select count(distinct id) from large; quit; /*N=349,893*/
However, I find only 4,667 rows which is 4 unique ID in "want" data.
NOTE: There were 108956917 observations read from the data set WORK.TEMP.
NOTE: There were 181979 observations read from the data set WORK.SMALL.
NOTE: The data set WORK.WANT has 4667 observations and 26 variables.
NOTE: DATA statement used (Total process time):
real time 23.46 seconds
cpu time 23.47 seconds
I'm trying proc sql simple inner join now,
did you do a select count(distinct) from temp ?
That should tally with the want
proc sql; select count(distinct mbr_id) from temp; quit; /*N=181,978*/
That strange that you are getting only 4000 odd in your want.
Basically the idea is to have the same id's in filtered temp as that of small. So in other words, the small is all unique id's whereas in temp, the very same has duplicates.
So When you sort both by ID. The sequence would be the same. Therefore, making the hash look up process work such that
1. Load the group 1 from Temp
2. Read the group 1 from Small
3. Find the associated variables.
Also make sure, you name correctly, in some you seem to have MBR_id and others ID. In that case, Hash requires explicit reference.
does this code look al'right? this is what resulted in 4000 in want data.
proc sort data=small; /*N distinct=281,577, Nrows=281,577*/
by id;
proc sort data=temp; /*N distinct=181,978, Nrows=108,956,917*/
by id;
proc sort data=large; /*N distinct=349,893, Nrows=295,780,166*/
by id;
/*The following requires a sorted approach hence the proc sort by id*/
data want;
if _n_=1 then
do;
if 0 then set small large;
dcl hash h1 (multidata:'y');
h1.defineKey('id');
h1.defineData('codes','date');
h1.defineDone();
end;
/*Reading your large by id i.e only one id at a time and load
in the hash object*/
do until(last.id);
set temp;
by id;
rc=h1.add();
end;
/*Reading your small and look up the large*/
do until(last.id);
set small;
by id;
do while(h1.do_over() eq 0);
output;
end;
end;
/*Clear the contents of hash freeing memory space after processing each id*/
h1.clear();
drop rc:;
run;
Let's examine this
proc sort data=small; /*N distinct=281,577, Nrows=281,577*/
by id;
proc sort data=temp; /*N distinct=181,978, Nrows=108,956,917*/
by id;
proc sort data=large; /*N distinct=349,893, Nrows=295,780,166*/
by id;
before we go further.
In temp,we intend to have only the ID's that exist in small by filtering the large --using where id in(select id in small)
Now that your distinct count in small being higher than that of temp by almost 100 ,000. Blimey! that's a lot. So a lot of ID's in small are not actually present in large or temp in the first place. This is the first conclusion.
Now run a test to see how many ID's in small matches with temp.
proc sql;
create table check as
select count(distinct id) as c
from small
where id in (select distinct id from temp);
quit;
The above should be very small. Let me know
Good morning @Cruise
Is the climax over? So what worked or didn't work. Should we assume you are good and happy with your progress?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.