Is h1 taking its "h" from libname h? I'm trying to customize the code to my actual directories. Thanks.
h1 is the name of the hash object the code defines (DCL is short for DECLARE). You could use any name here. Conceptually it's the same as defining a libref in a LIBNAME statement.
The libref for your source table is used within the dataset option of the hash DECLARE statement.
Since Z.DATA_LARGE (560 MB) is very small, a hash table is the best way to do it.
Or try PROC FORMAT.
@Patrick I get this error.
254 data O.MERGED;
255 if _n_=1 then
256 do;
257 if 0 then set R.data_large(keep=(id var1 var2));
-
214
23
258 dcl h1 (dataset:'R.data_large(keep=(id var1 var2))';
--
565
ERROR: DATA STEP Component Object failure. Aborted during the COMPILATION phase.
ERROR 214-322: Variable name ( is not valid.
ERROR 23-7: Invalid value for the KEEP option.
ERROR 565-185: H1 is not a known class name.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
259 h1.defineKey('id');
260 h1.defineData('var1','var2');
261 h2.defineDone();
262 end;
263 set R.all_sites;
264 do while(h1.do_over() eq 0);
265 output;
266 end;
267 run;
That's why I normally need sample data. I tend to add "typos" to my code which I can normally spot and fix quickly when testing. Here: no brackets are required for the dataset KEEP option, and the DCL statement needs the class name, i.e. dcl hash h1.
data h.merged;
if _n_=1 then
do;
if 0 then set z.data_large(keep=id var1 var2);
dcl hash h1 (dataset:'z.data_large(keep=id var1 var2)');
h1.defineKey('id');
h1.defineData('var1','var2');
h1.defineDone();
end;
set z.all_sites;
do while(h1.do_over() eq 0);
output;
end;
run;
And @novinosrin is absolutely right: MULTIDATA is also required in your case. So here is the latest cut:
data h.merged;
if _n_=1 then
do;
if 0 then set z.data_large(keep=id var1 var2);
dcl hash h1 (dataset:'z.data_large(keep=id var1 var2)',multidata:'y');
h1.defineKey('id');
h1.defineData('var1','var2');
h1.defineDone();
end;
set z.all_sites;
do while(h1.do_over() eq 0);
output;
end;
run;
To whom it may concern,
I think the moment DO_OVER (the shorthand equivalent of the FIND/FIND_NEXT loop) is in use, and the lookup is 1:N, dcl hash h1 (dataset:'large',multidata:'y'); would be required.
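For readers new to MULTIDATA: the point is that one key can map to several data rows, and the DO_OVER loop retrieves all of them. A rough analogy in Python (not SAS; the sample values are made up to mirror the thread's data):

```python
# Hypothetical sample rows (id, code, date) standing in for data_large.
large = [
    ("AALLSS", "4019", 16636),
    ("AALLSS", "4019", 16840),
    ("ALLSS", "25000", 16840),
]

# Like DCL HASH ... multidata:'y': one key keeps every matching row.
lookup = {}
for key, code, date in large:
    lookup.setdefault(key, []).append((code, date))

# Like the do_over() loop: output one row per matching hash entry.
small = ["AALLSS"]
merged = [(key, code, date)
          for key in small
          for code, date in lookup.get(key, [])]
```

Without MULTIDATA (a plain dict of single values), only one of the two "AALLSS" rows would survive, which is why a 1:N lookup needs it.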
@Cruise That's just a syntax error
Here is the correction:
dcl hash h1 (dataset:'R.data_large(keep=id var1 var2)');
The code below got SAS running, but gave me the error shown when I used the latest cut from @Patrick, adding hash to the DCL statement as @novinosrin suggested. I'll post the sample data soon.
353 data k.merged;
354 if _n_=1 then
355 do;
356 if 0 then set R.data_large(keep=id var1 var2);
NOTE: Data file R.data_large is in a format that is native to another host, or the file encoding does
not match the session encoding. Cross Environment Data Access will be used, which might
require additional CPU resources and might reduce performance.
357 dcl hash h1 (dataset:'R.DX(keep=id var1 var2)',multidata:'y');
358 h1.defineKey('ID');
359 h1.defineData('var1','var2');
360 h1.defineDone();
361 end;
362 set R.all_sites;
363 do while(h1.do_over() eq 0);
364 output;
365 end;
366 run;
NOTE: Data file R.data_large.DATA is in a format that is native to another host, or the file encoding does
not match the session encoding. Cross Environment Data Access will be used, which might
require additional CPU resources and might reduce performance.
ERROR: Hash object added 22020080 items when memory failure occurred.
FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set K.MERGED may be incomplete. When this step was stopped there were 0
observations and 4 variables.
WARNING: Data set K.MERGED was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:31.52
cpu time 19.79 seconds
ERROR: Hash object added 22020080 items when memory failure occurred
That's now a totally different issue. A SAS Hash table resides in memory and it appears that the load into memory failed after 22M rows.
That's nothing we can fix unless you get more memory for your session. Maybe excluding all rows with a missing var1 (where=(var1 ne .)) also helps to keep the total row count below this 22M+ limit - something I should have added to the code from the start.
You can estimate the memory requirement by summing up the lengths of the 3 variables loaded into the hash, multiplied by the number of rows in your table (those with a non-missing var1).
You can get information about memory available via Proc Options group=memory;run;
You will need to talk to your SAS admin if you require more memory than allowed to you for a SAS session.
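As a back-of-the-envelope illustration of that estimate (done here in Python, not SAS; the 8-byte lengths are an assumption, so check your actual variable lengths with PROC CONTENTS):

```python
# Assumed lengths: three 8-byte numeric variables (id, var1, var2).
row_bytes = 8 + 8 + 8
rows = 22_020_080  # items the log says were added before the failure
approx_mb = row_bytes * rows / 1024 ** 2
print(f"roughly {approx_mb:.0f} MB for the raw variable values alone")
# The true footprint is larger: the hash object also stores key copies,
# pointers and padding per item, so budget a generous multiple of this.
```

That already lands around half a gigabyte before hash overhead, which is why the load can fail even when the source data set itself is only 560 MB on disk.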
data h.merged;
if _n_=1 then
do;
if 0 then set z.data_large(keep=id var1 var2);
dcl hash h1 (dataset:'z.data_large(keep=id var1 var2 where=(var1 ne .))',multidata:'y');
h1.defineKey('id');
h1.defineData('var1','var2');
h1.defineDone();
end;
set z.all_sites;
do while(h1.do_over() eq 0);
output;
end;
run;
PROC SORT NODUPKEY is memory intensive as well. At this point I'm wondering whether to use the hash table approach one site at a time using a WHERE statement. The all_sites data has 40 sites, and I can do (where site=1 through site=40) one site at a time.
1. Will the hash table save time compared with a MERGE in a DATA step?
2. Can you point out where I can insert the WHERE statement in the hash table code?
Thanks in advance.
Hi @Cruise Hmm, PROC SORT isn't working? I was about to recommend the below (which clears memory) with confidence, but now I'm recommending it without confidence.
The following assumes both datasets have the same IDs. The comments will give you an idea.
/*Creating samples- full deck N */
data large;
set sashelp.class;
run;
/*Creating small with unique Id's as you mentioned 1:N*/
proc sort data=sashelp.class out=small(keep=sex) nodupkey;
by sex;
run;
proc sort data=large ;
by sex;
run;
/*The following requires a sorted approach, hence the PROC SORT by the key*/
data want;
if _n_=1 then
do;
if 0 then set small large;
dcl hash h1 (multidata:'y');
h1.defineKey('sex');
h1.defineData('name','age','height','weight');
h1.defineDone();
end;
/*Read your large one BY group at a time and load
that group into the hash object*/
do until(last.sex);
set large;
by sex;
rc=h1.add();
end;
/*Read your small and look up against the large*/
do until(last.sex);
set small;
by sex;
do while(h1.do_over() eq 0);
output;
end;
end;
/*Clear the contents of hash freeing memory space after processing each id*/
h1.clear();
drop rc:;
run;
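The memory saver in the step above is that only one BY group at a time lives in the hash, and h1.clear() releases it before the next group loads. The same pattern sketched in Python (illustrative only; names and sample values are made up, using the sashelp.class example's sex key):

```python
from itertools import groupby
from operator import itemgetter

# Both inputs sorted by the key, mirroring the PROC SORT steps above.
large = [("F", "Alice"), ("F", "Barbara"), ("M", "Alfred")]
small = {"F", "M"}  # one row per key, like the NODUPKEY output

out = []
for key, rows in groupby(large, key=itemgetter(0)):
    # Load only the current BY group into the "hash" ...
    h1 = {key: [name for _, name in rows]}
    # ... look it up from the small side, one output row per match ...
    if key in small:
        for name in h1[key]:
            out.append((key, name))
    # ... then clear, so peak memory is one group, not the whole table.
    h1.clear()
```

Peak memory is bounded by the largest single BY group rather than the full 22M+ rows, which is the whole point of this variant.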
PROC SORT NODUPKEY by ID, CODES and DATE halved the data size. However, the code you provided output a WANT data set with 0 observations. Any idea what I might be doing wrong at my end? Below are my large and small data.
DATA LARGE;
INPUT ID $ CODES $ DATE;
CARDS;
ALLSS 04185 18127
ALLSS 04185 18129
ALLSS 04185 18130
ALLSS 25000 16840
ALLSS 25000 16874
ALLSS 25000 16972
ALLSS 25000 17063
ALLSS 25000 17184
ALLSS 25000 17188
AALLSS 25000 17231
AALLSS 25000 17321
AALLSS 25000 17458
AALLSS 25000 17604
AALLSS 25000 17867
AALLSS 25000 17979
AALLSS 25000 18078
AALLSS 25002 17374
AALLSS 2669 17545
AALLSS 2689 17374
AALLSS 2689 17601
AALLSS 2720 17184
AALLSS 2724 16840
AALLSS 2724 17063
AALLSS 2724 17175
AALLSS 2724 17231
AALLSS 2724 17321
AALLSS 2724 17374
AALLSS 2724 17601
AALLSS 2724 17867
AALLSS 2724 17979
AALLSS 2729 17188
AALLSS 2752 17184
AALLSS 2772 17825
AALLSS 27800 17184
AALLSS 2859 16552
AALLSS 2859 16664
AALLSS 2859 16852
AALLSS 2859 16909
AALLSS 2948 17184
AALLSS 30000 17545
AALLSS 311 16874
AALLSS 311 16972
AALLSS 311 17321
AALLSS 311 17458
AALLSS 311 18078
AALLSS 32723 17184
AALLSS 3310 17188
AALLSS 4019 16636
AALLSS 4019 16840
AALLSS 4019 16874
AALLSS 4019 16909
AALLSS 4019 16972
AALLSS 4019 17063
AALLSS 4019 17175
AALLSS 4019 17184
AALLSS 4019 17188
AALLSS 4019 17231
AALLSS 4019 17321
AALLSS 4019 17374
AALLSS 4019 17458
AALLSS 4019 17601
AALLSS 42760 17604
AALLSS 4293 17184
AALLSS 43491 17184
AALLSS 43491 17185
AALLSS 43491 17186
AALLSS 43491 17187
AALLSS 4660 17825
AALLSS 496 16840
AALLSS 496 16874
AALLSS 496 16905
AALLSS 496 16935
AALLSS 496 16966
AALLSS 496 16996
AALLSS 496 17027
AALLSS 496 17058
AALLSS 496 17088
AALLSS 496 17119
AALLSS 496 17149
AALLSS 496 17180
AALLSS 496 17211
AALLSS 496 17239
AALLSS 496 17270
AALLSS 496 17300
AALLSS 496 17331
AALLSS 496 17361
AALLSS 496 17392
AALLSS 496 17423
AALLSS 496 17453
AALLSS 496 17458
AALLSS 496 17484
AALLSS 496 17514
AALLSS 496 17545
AALLSS 496 17576
AALLSS 496 17605
AALLSS 496 17636
AALLSS 496 17666
AALLSS 496 17697
AALLSS 496 17727
AALLSS 496 17758
;
DATA SMALL;
MBR_ID='AALLSS';
RUN;
Hi @Cruise
Please clarify the below:
1. Do both datasets have the same set of IDs, or could they differ? /*This is very important, as you will have seen from the note in my previous post*/
2. Do the names of the IDs (in other words, the keys) differ?