BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

@novinosrin 

1. Do both the datasets have the same set of ID's or could it differ? It could differ

2. Do the names of the ID's or in other words keys differ? IDs in two datasets are same name and format.

novinosrin
Tourmaline | Level 20

Ok @Cruise 

Let me know if you can follow this and execute

"1. Do both the datasets have the same set of ID's or could it differ? It could differ

2. Do the names of the ID's or in other words keys differ? IDs in two datasets are same name and format."

 


DATA LARGE;
INPUT ID $ CODES $ DATE; 
CARDS;
ALLSS 04185 18127 
ALLSS 04185 18129 
ALLSS 04185 18130 
ALLSS 25000 16840 
ALLSS 25000 16874 
ALLSS 25000 16972 
ALLSS 25000 17063 
ALLSS 25000 17184 
ALLSS 25000 17188 
AALLSS 25000 17231 
AALLSS 25000 17321 
AALLSS 25000 17458 
AALLSS 25000 17604 
AALLSS 25000 17867 
AALLSS 25000 17979 
AALLSS 25000 18078 
AALLSS 25002 17374 
AALLSS 2669 17545 
AALLSS 2689 17374 
AALLSS 2689 17601 
AALLSS 2720 17184 
AALLSS 2724 16840 
AALLSS 2724 17063 
AALLSS 2724 17175 
AALLSS 2724 17231 
AALLSS 2724 17321 
AALLSS 2724 17374 
AALLSS 2724 17601 
AALLSS 2724 17867 
AALLSS 2724 17979 
AALLSS 2729 17188 
AALLSS 2752 17184 
AALLSS 2772 17825 
AALLSS 27800 17184 
AALLSS 2859 16552 
AALLSS 2859 16664 
AALLSS 2859 16852 
AALLSS 2859 16909 
AALLSS 2948 17184 
AALLSS 30000 17545 
AALLSS 311 16874 
AALLSS 311 16972 
AALLSS 311 17321 
AALLSS 311 17458 
AALLSS 311 18078 
AALLSS 32723 17184 
AALLSS 3310 17188 
AALLSS 4019 16636 
AALLSS 4019 16840 
AALLSS 4019 16874 
AALLSS 4019 16909 
AALLSS 4019 16972 
AALLSS 4019 17063 
AALLSS 4019 17175 
AALLSS 4019 17184 
AALLSS 4019 17188 
AALLSS 4019 17231 
AALLSS 4019 17321 
AALLSS 4019 17374 
AALLSS 4019 17458 
AALLSS 4019 17601 
AALLSS 42760 17604 
AALLSS 4293 17184 
AALLSS 43491 17184 
AALLSS 43491 17185 
AALLSS 43491 17186 
AALLSS 43491 17187 
AALLSS 4660 17825 
AALLSS 496 16840 
AALLSS 496 16874 
AALLSS 496 16905 
AALLSS 496 16935 
AALLSS 496 16966 
AALLSS 496 16996 
AALLSS 496 17027 
AALLSS 496 17058 
AALLSS 496 17088 
AALLSS 496 17119 
AALLSS 496 17149 
AALLSS 496 17180 
AALLSS 496 17211 
AALLSS 496 17239 
AALLSS 496 17270 
AALLSS 496 17300 
AALLSS 496 17331 
AALLSS 496 17361 
AALLSS 496 17392 
AALLSS 496 17423 
AALLSS 496 17453 
AALLSS 496 17458 
AALLSS 496 17484 
AALLSS 496 17514 
AALLSS 496 17545 
AALLSS 496 17576 
AALLSS 496 17605 
AALLSS 496 17636 
AALLSS 496 17666 
AALLSS 496 17697 
AALLSS 496 17727 
AALLSS 496 17758 
;

DATA SMALL;
length id $8;
id='AALLSS';
RUN;
/*sort is need for by group processing i.e do until;set;by*/
proc sort data=large;
by ID;
run;
/*Keeping only the matching(exisiting) ids in small*/
proc sql;
create table temp as 
select *
from large
where id in (select id from small)
order by id;
quit;

/*The following requires a sorted approach hence the proc sort by id*/
data want;
if _n_=1 then 
    do;
      if 0 then set small large;
      dcl hash h1 (multidata:'y');
      h1.defineKey('id');
      h1.defineData('codes','date');
      h1.defineDone();
    end;
/*Reading your large by id i.e only one id at a time and load
in the hash object*/
do until(last.id);
set temp;
by id;
rc=h1.add();
end;
/*Reading your small and look up the large*/
do until(last.id);
set small;
by id;
  do while(h1.do_over() eq 0);
    output;
  end;
end;
/*Clear the contents of hash freeing memory space after processing each id*/
h1.clear();
drop rc:;
run;
novinosrin
Tourmaline | Level 20

Also a simple inner join in proc sql will suffice, you could test to see how this performs

 

proc sql;
create table want as
select a.id,codes,date
from small a inner join large b
on a.id=b.id
order by id,codes ,date;
quit;
Cruise
Ammonite | Level 13

@novinosrin 

I'm running the latest hash code now. speaking of simple sql inner join

1. does sql jointakes less work space than merge in data step?

2. does hash object approach takes less work space than any of above approaches?

 

my struggle is the "out of memory" and "running out of work space" issue.

novinosrin
Tourmaline | Level 20

I know. I just gave you both. Let's test to see what works. If one of the two works, use it. 🙂 If both work, great. The one that doesn't is garbage 

Cruise
Ammonite | Level 13

@novinosrin 

Below is the n of distinct mbr_id from small and large datasets.

proc sql; select count(distinct id) from small; quit; /*N=281,577*/
proc sql; select count(distinct id) from large; quit; /*N=349,893*/

However, I find only 4,667 rows which is 4 unique ID in "want" data.

 

NOTE: There were 108956917 observations read from the data set WORK.TEMP.
NOTE: There were 181979 observations read from the data set WORK.SMALL.
NOTE: The data set WORK.WANT has 4667 observations and 26 variables.
NOTE: DATA statement used (Total process time):
real time 23.46 seconds
cpu time 23.47 seconds

 

I'm trying proc sql simple inner join now,

novinosrin
Tourmaline | Level 20

did you do a select count(distinct) from temp ?

 

That should tally with the want

Cruise
Ammonite | Level 13

@novinosrin 

proc sql; select count(distinct mbr_id) from temp; quit; /*N=181,978*/
novinosrin
Tourmaline | Level 20

That strange that you are getting only 4000 odd in your want. 

 

Basically the idea is to have the same id's in filtered temp as that of small. So in other words, the small is all unique id's whereas in temp, the very same has duplicates.

 

So When you sort both by ID. The sequence would be the same. Therefore, making the hash look up process work such that

1. Load the group 1 from Temp

2. Read the group 1 from  Small

3. Find the associated variables. 

 

 

Also make sure, you name correctly, in some you seem to have MBR_id and others ID. In that case, Hash requires explicit reference.

 

Cruise
Ammonite | Level 13

@novinosrin 

does this code look al'right? this is what resulted in 4000 in want data.

 

proc sort data=small; /*N distinct=281,577, Nrows=281,577*/
by id;
proc sort data=temp; /*N distinct=181,978, Nrows=108,956,917*/
by id;
proc sort data=large; /*N distinct=349,893, Nrows=295,780,166*/
by id;
/*The following requires a sorted approach hence the proc sort by id*/
data want;
if _n_=1 then 
    do;
      if 0 then set small large;
      dcl hash h1 (multidata:'y');
      h1.defineKey('id');
      h1.defineData('codes','date');
      h1.defineDone();
    end;
/*Reading your large by id i.e only one id at a time and load
in the hash object*/
do until(last.id);
set temp;
by id;
rc=h1.add();
end;
/*Reading your small and look up the large*/
do until(last.id);
set small;
by id;
  do while(h1.do_over() eq 0);
    output;
  end;
end;
/*Clear the contents of hash freeing memory space after processing each id*/
h1.clear();
drop rc:;
run;
novinosrin
Tourmaline | Level 20

Let's examine this

proc sort data=small; /*N distinct=281,577, Nrows=281,577*/
by id;
proc sort data=temp; /*N distinct=181,978, Nrows=108,956,917*/
by id;
proc sort data=large; /*N distinct=349,893, Nrows=295,780,166*/
by id;

before we go further.

 

In temp,we intend to have only the ID's that exist in small  by filtering the large --using where id in(select id in small)

 

Now that your distinct count in small being higher than that of temp by almost 100 ,000. Blimey! that's a lot. So a lot of ID's in small are not actually present in large or temp in the first place. This is the first conclusion.

 

Now run a test to see how many ID's in small matches with temp.

proc sql;

create table check as

select count(distinct id) as c

from small 

where id in  (select distinct id from temp);

quit;

 

The above should be very small. Let me know

Cruise
Ammonite | Level 13
I just switched between work stations. and running my queries from the scratch. in the meantime, proc sql, simple inner join appears to have worked. I'll let you know.
Cruise
Ammonite | Level 13
This is correct: "So a lot of ID's in small are not actually present in large or temp in the first place". Small dataset is the universe of all possible mbr_id of which many are not in large data. Large data is large because of repeated rows not by n of distinct subjects.
novinosrin
Tourmaline | Level 20

Good morning @Cruise 

 

Is the climax over? So what worked or didn't work. Should we assume you are good and happy with your progress?

 

 

Cruise
Ammonite | Level 13
My SAS is still running "proc sort nodupkey; by id codes date;" since 7am and now it is: 9:30. I needed to go back to this step because I switched between work stations. Thanks for asking. I'll update as soon as this deduplication process finish and run hash table code again. Sound good?

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 60 replies
  • 3465 views
  • 35 likes
  • 7 in conversation