Thanks all for your comments.
I have made a test splitting the dataset into 10 smaller datasets; the results are given below.
options fullstimer;
proc sort data = Dataset_A out = Dataset_B (compress=binary);
by KEY;
run;
NOTE: There were 32498768 observations read from the data set Dataset_A.
NOTE: The data set WORK.PERSONSHORT has 32498768 observations and 31 variables.
NOTE: Compressing data set Dataset_B decreased size by 71.08 percent.
Compressed is 391645 pages; un-compressed would require 1354116 pages.
NOTE: PROCEDURE SORT used (Total process time):
real time 44:04.28
user cpu time 1:54.30
system cpu time 1:13.66
Memory 66929k
OS Memory 100216k
Timestamp 16/09/2014 17:22:45
Now I split the dataset with a macro (thanks to Selvaratnam Sridharma for the tool):
%macro split(num);
data _null_;
  if 0 then set DATASET_B nobs=count;
  call symput('numobs',put(count,8.));
run;
%let n=%sysevalf(&numobs/&num,ceil);
data %do J=1 %to &num; DATASET_B_&J %end; ;
  set DATASET_B;
  %do I=1 %to &num;
    if %eval(&n*(&i-1)) < _n_ <= %eval(&n*&I)
      then output DATASET_B_&I;
  %end;
run;
%mend split;
%split (10)
NOTE: DATA STEP stopped due to looping.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
NOTE: There were 32498768 observations read from the data set WORK.DATASET_B.
NOTE: The data set WORK.DATASET_B_1 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_2 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_3 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_4 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_5 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_6 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_7 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_8 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_9 has 3249877 observations and 31 variables.
NOTE: The data set WORK.DATASET_B_10 has 3249875 observations and 31 variables.
NOTE: DATA statement used (Total process time):
real time 15:13.22
cpu time 49.68 seconds
NOTE: This SAS session is using a registry in WORK. All changes will be lost at the end of
this session.
Another macro (this time made by me) to run PROC SORT through all the datasets:
%macro sort;
%do i=1 %to 10;
  proc sort data = DATASET_B_&i;
    by KEY;
  run;
%end;
%mend sort;
%sort
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_1.
NOTE: The data set WORK.DATASET_B_1 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 4:19.91
cpu time 17.59 seconds
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_2.
NOTE: The data set WORK.DATASET_B_2 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 5:08.46
cpu time 19.56 seconds
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_3.
NOTE: The data set WORK.DATASET_B_3 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 2:29.01
cpu time 19.51 seconds
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_4.
NOTE: The data set WORK.DATASET_B_4 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 4:03.43
cpu time 17.03 seconds
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_5.
NOTE: The data set WORK.DATASET_B_5 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 6:03.90
cpu time 18.95 seconds
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_6.
NOTE: The data set WORK.DATASET_B_6 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 6:44.73
cpu time 19.23 seconds
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_7.
NOTE: The data set WORK.DATASET_B_7 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 2:48.26
cpu time 18.72 seconds
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_8.
NOTE: The data set WORK.DATASET_B_8 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 3:36.37
cpu time 20.17 seconds
NOTE: There were 3249877 observations read from the data set WORK.DATASET_B_9.
NOTE: The data set WORK.DATASET_B_9 has 3249877 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 4:41.83
cpu time 19.00 seconds
NOTE: There were 3249875 observations read from the data set WORK.DATASET_B_10.
NOTE: The data set WORK.DATASET_B_10 has 3249875 observations and 31 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 3:17.87
cpu time 19.48 seconds
The sum of all the sort procedures is almost 41 minutes, without counting the 15 minutes the split macro takes. And we still need to combine the pieces to build the dataset again (by the way, what do you think is the best way to merge all the datasets?).
As you can see, splitting into 10 datasets isn't faster than the original PROC SORT. If I am able to free up space I'll try with perhaps 20 smaller datasets.
AmySwinford, I have SAS/ACCESS ready for use on my PC, but I have never used it and, to be honest, I don't know where to start to make it work in local mode (server and client on the same machine).
I think defragmenting the hard disk is a very good idea; I'll try it if my user has admin privileges...
Regards
Not much different from what you have already been provided with, but split each value of "KEY" out into its own dataset, then set them back together again (note that depending on the number of keys this could create a lot of small datasets):
/* Some test data */
data have;
attrib key format=$5.;
array var{15} $20.;
do i=1 to 1000; /*32498768;*/
key=strip(put(mod(i,234),best.));
output;
end;
run;
/* Get list of unique KEY values */
proc sql;
create table WORK.LOOP as
select distinct
KEY
from WORK.HAVE
order by KEY;
quit;
/* Generate code to split data into each key value */
data _null_;
set loop;
call execute('data work.srt'||strip(key)||'; set work.have (where=(key="'||strip(key)||'")); run;');
run;
/* Generate final datastep to append back together again */
data _null_;
set loop end=last;
if _n_=1 then call execute('data want; set ');
call execute(' work.srt'||strip(key));
if last then call execute(';run;');
run;
If your Key1 and Key2 are numbers and their minimum and maximum values are small enough (so that an array over that range fits in your system memory), you may reduce the workload considerably with a temporary array.
You also need to find out the maximum number of observations sharing the same Key1/Key2 combination. The array holds the record ID (RID) of each observation for its Key1/Key2 pair.
You then use this RID in the POINT= option of SET and get the sorted output.
Here is an example using SASHELP.CLASS with AGE and HEIGHT. Although HEIGHT has a decimal point, that does not matter when it is used as an array index; the decimal part is ignored.
The same idea can be used with a hash table too.
data class;
set sashelp.class;
run;
data need;
  /* 3-D temporary array: AGE (11:16) x HEIGHT (51:72) x occurrence (1:2) */
  array k[11:16, 51:72, 2] _temporary_;
  if _n_ = 1 then do;
    /* first pass: store each record number under its AGE/HEIGHT combination */
    do _n_ = 1 by 1 until(eof);
      set class(keep = age height) end = eof;
      i = 1;
      do while(k[age, height, i] > 0); i + 1; end;
      k[age, height, i] = _n_;
    end;
  end;
  /* second pass: walk the array in key order and re-read the rows via POINT= */
  do i = lbound1(k) to hbound1(k);
    do j = lbound2(k) to hbound2(k);
      m = 1;
      do while(k[i, j, m] > 0);
        ptr = k[i, j, m];
        set class point = ptr;
        output;
        m + 1;
        if m > hbound3(k) then leave;
      end;
    end;
  end;
  stop;
  drop i j m;
run;
proc print data = need;
run;
Note the use of
array k[11:16, 51:72, 2] _temporary_;
Some homework is needed first to keep the array sizes minimal: the bounds 11:16 for AGE, 51:72 for HEIGHT, and the maximum occurrence of 2 per key pair were found from the data. The RID is _N_.
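For illustration, a small sketch (my own addition, not part of the original post) of how that homework could be done for SASHELP.CLASS:
proc sql;
  /* bounds for the first two array dimensions */
  select min(age), max(age), min(height), max(height)
    from sashelp.class;
  /* maximum number of rows sharing one AGE/HEIGHT pair (third dimension) */
  select max(n) as max_occurrence
    from (select count(*) as n
          from sashelp.class
          group by age, height);
quit;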
The Hash version may simplify your homework in finding the metadata of the data set.
The SASHELP.CLASS data set is used.
data class;
set sashelp.class;
run;
data need;
  if _n_ = 1 then do;
    if 0 then set class(keep = age height); /* define AGE and HEIGHT in the PDV */
    declare hash h(multidata:'Y', ordered:'Y', hashexp:16); * may try 20;
    h.definekey('age','height');
    h.definedata('RID');
    h.definedone();
    declare hiter hi ('h');
    /* load every record id keyed by AGE/HEIGHT; ORDERED:'Y' keeps key order */
    do RID = 1 by 1 until(eof);
      set class(keep = age height) end = eof;
      h.add();
    end;
  end;
  /* iterate the hash in key order and re-read the full rows via POINT= */
  do rc = hi.first() by 0 while(rc = 0);
    set class point = RID;
    output;
    rc = hi.next();
  end;
  stop;
run;
proc print data = need;
run;
For version 9.1 (which does not have the MULTIDATA hash option), a simple change can make the keys unique. Splitting the data set and joining the pieces back together may be a big drawback.
Again with:
data class;
set sashelp.class;
run;
data need;
if _n_ = 1 then do;
if 0 then set class(keep = age height);
declare hash h(ordered:'Y', hashexp:16);
h.definekey('age','height','RID');
h.definedata('RID');
h.definedone();
declare hiter hi ('h');
do RID = 1 by 1 until(eof);
set class(keep = age height) end = eof;
h.add();
end;
end;
do rc = hi.first() by 0 while(rc = 0);
set class point = RID;
output;
rc = hi.next();
end;
stop;
run;
Thanks all for your ideas to improve the code. All the approaches that rely on system memory simply don't work because of the size limitation (only 4 GB) and the impossibility of changing to a more powerful machine.
I'll try splitting into more datasets (perhaps 20 or 25), but what is the best option for merging all those datasets back together?
Could I use an array in the MERGE statement?
Regards
Sergio,
I think dividing the data set into smaller ones that fit your memory for sorting seems to be the last resort. Merge them back together by the BY variables; an ARRAY is not handy for merging them.
Regards,
Sergio,
There are still basic questions that your posts have left unaddressed. Do you need to combine your sorted data sets? Or do you merely need to verify matches and mismatches? One basic idea would be to cut out the variables that might not be needed:
proc sort data=huge (keep=key) out=sorted NODUPKEY;
by key;
run;
There is no need to sort 31 variables if you only need 1 or 2.
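For illustration only (my own sketch, not part of the original reply), once both key-only extracts are sorted with NODUPKEY, matches and mismatches can be flagged in one pass; here "other_keys" stands for a hypothetical key-only extract from the other source:
data matched key_only_in_huge key_only_in_other;
  merge sorted     (in=a)
        other_keys (in=b);
  by key;
  if a and b then output matched;        /* key present in both sources     */
  else if a then output key_only_in_huge;
  else output key_only_in_other;
run;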
If you need to sort the full data set, the "splitting into pieces" approach might work faster along these lines. For a character version of KEY that contains digits only:
proc sort data=huge (where=(key =: '1')) out=output_1;
by key;
run;
proc sort data=huge (where=(key =: '2')) out=output_2;
by key;
run;
...
10 times
...
You would then need a SET statement (not a MERGE statement, which would be much slower) to combine the pieces:
data sorted;
set output_0 output_1 output_2 output_3 ... output_9;
by key;
run;
But I think you still need to clarify the result you are seeking before the solutions go any further. Some details about the data might be important (is there one key or two, is the key character or numeric, do you know anything about the distribution of the key).
Good luck.
Morning all.
I attached a screenshot of my dataset. The key is only var1 (numeric, length 8). It's something like a Social Security number but shorter. I need almost all the fields because I need to check whether these variables match other variables that are in the source.
Thanks
Sergio,
Read the following paper about sorting a large data set. The macro from the paper is given below; it does not work as given.
/** Paper: www.lexjansen.com/nesug/nesug97/coders/labrecqu.pdf **/
libname lib1 'd:\..library location..';
%let infile = filenm1;
%let incremnt = 999999;
%let byvars = ssn dob;
%let keepvar = ssn dob;
%macro qdsort;
* get number of records in dataset *;
data _null_;
retain p 1;
set lib1.&infile nobs=nobs point=p;
call symput('nobs',nobs);
stop;
run;
* initialize macro counters and sort *;
* 'incremnt' size chunks of data and *;
* write to temporary files *;
%let process = 0;
%let obs1 = 1;
%do %while(&obs1<=&nobs);
%let process = %eval(&process + 1);
proc sort data=lib1.&infile
(firstobs=&obs1 obs= %eval(&obs1 + &incremnt)
keep=&keepvar)
out=temp&process nodupkey;
by &byvars;
run;
%let obs1= %eval(&obs1+(&incremnt+1));
%end;
* merge sorted temporary files *;
data lib1.&infile;
merge
%do i=1 %to &process;
temp&i;
%end;
by &byvar;
run;
%mend;
I have fixed the problem, and the corrected macro is given below.
The corrections are in the record-count step (CALL SYMPUTX with NOBS= and an IF 0 THEN SET) and in the final step, which uses SET with a BY statement instead of MERGE and references &byvars rather than &byvar.
It is used with an example data set built from SASHELP.CLASS.
libname lib1 'd:/prod/Codes/';
data lib1.class;
set sashelp.class;
run;
%let infile = class;
%let incremnt = 4;
%let byvars = age height;
%let keepvar = age height weight Name;
%macro qdsort;
* get number of records in dataset *;
data _null_;
call symputx('nobs',nobs);
if 0 then set lib1.&infile nobs=nobs;
stop;
run;
* initialize macro counters and sort *;
* 'incremnt' size chunks of data and *;
* write to temporary files *;
%let process = 0;
%let obs1 = 1;
%do %while(&obs1<=&nobs);
%let process = %eval(&process + 1);
proc sort data=lib1.&infile
(firstobs=&obs1 obs= %eval(&obs1 + &incremnt)
keep=&keepvar)
out=temp&process nodupkey;
by &byvars;
run;
%let obs1= %eval(&obs1+(&incremnt+1));
%end;
* merge sorted temporary files *;
data lib1.&infile;
set
%do i=1 %to &process;
temp&i
%end;
;
by &byvars;
run;
%mend;
%qdsort;
This might be what you are looking for. I hope your problem is solved!
I feel there are other, more efficient ways (taking less runtime), but they need more information from you. After sorting the data set you are presumably doing something with it; if you tell us what, someone may find a solution that does not require a full sort. You could also fabricate a small input data set, prepare the required output data set, and explain the process of deriving the output, to save time for the responders.
This way you may also learn SAS programming.
Regards.
Sergio, I have to thank you for keeping my brain sharp by listening, trying to understand the issue, and trying to find acceptable ways out.
The "proc sort" will use multi threading (using all 4 cores) as it reaches some point. It will split up your dataset in smaller pieces sort them and doe a merger when writing out the results. There is some point with data sizing where the efficiency of this process rapidly falls by resource consumption on the system. As you are already short on that I would not advice to run those in parallel.
If sorting, let us say, a 2 GB piece is acceptably fast (under a minute), then sorting 10 of them serially will take roughly 10 times that time.
Combining those 10 files will be the last step. Do not expect miracles from this: copying that much data still takes the time it takes.
Running processes in parallel can be done with batch scripting or with MP CONNECT, part of SAS/CONNECT.
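As a rough sketch only (my addition; it assumes SAS/CONNECT is licensed, and the session and libref names TASK1, TASK2 and PWORK are mine), two pieces could be sorted in parallel MP CONNECT sessions like this:
options sascmd="!sascmd" autosignon;   /* spawn local sessions with the same SAS command */
rsubmit task1 wait=no inheritlib=(work=pwork);
  proc sort data=pwork.dataset_b_1; by KEY; run;
endrsubmit;
rsubmit task2 wait=no inheritlib=(work=pwork);
  proc sort data=pwork.dataset_b_2; by KEY; run;
endrsubmit;
waitfor _all_ task1 task2;   /* block until both sessions finish */
signoff _all_;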
Knowledge of the maximum and minimum values of KEY1 and KEY2 would probably be useful for the array solution.
I cannot look into your wallet, but checking for an external 128 GB SSD I found one (Philips external SSD drive USB3.0 128GB - Prijzen - Tweakers) for about 100 €.
Muthia, that paper describes the approach Sergio has already tried. Splitting a big dataset into smaller chunks will work around a possible memory error, but it says nothing about turnaround time.
The paper you found comes from the health environment, not from hardcore IT specialists (I am on the edge, not really hardcore).
Performance and tuning while running into your hardware limits is something of all times; these days it is associated with "big data". Although Sergio's hardware is very limited for today, it was mainframe sizing 20 years ago. The least time Sergio can achieve is 2 or 3 times the time it takes to copy that file (drag and drop) with Windows tools. As we have estimated the file at about 22 GB, I guess that will be some 10 or 20 minutes. As we do not know what this copy time really is, we do not know what efficiency gains are possible (coming down from 45 minutes).
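As a rough calculation of my own (the throughput figure is an assumption, not a measurement): at a sustained sequential disk rate of about 30 MB/s, one pass over 22 GB takes roughly 22 × 1024 / 30 ≈ 750 seconds, i.e. 12-13 minutes, consistent with that 10-20 minute guess; at 60 MB/s it would be closer to 6 minutes.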
Just for sorting there is also the TAGSORT option, but it is terribly slow. It can, however, be used when you are running out of resources and getting errors, and execution time is less critical.
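For reference, a minimal sketch (my addition, reusing the dataset names from earlier in the thread); TAGSORT keeps only the BY variables and record addresses in the sort work area, trading run time for lower resource use:
proc sort data=Dataset_A out=Dataset_B tagsort;
  by KEY;
run;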
These questions and techniques are rather old; you can find http://www2.sas.com/proceedings/sugi24/Coders/p095-24.pdf (1995 big iron) and http://www2.sas.com/proceedings/sugi26/p121-26.pdf (1997). Some metrics have changed since those times.
What Astounding is after is investing human thinking and algorithmic alternatives in smarter resource-usage approaches. That is also the "big data" game. Knowing what was solved in the old days (limited hardware) by smart people, there may be options; I do not know whether those efforts are worth it (ROI).
It is possible to do that for Sergio's whole original problem, but that could expose too much of what he wants to keep private.
Another option is to use all observations but sort only on the keys, also keeping the observation number with them.
Having the sorted result on the keys, you can get back all the other record information using the POINT= dataset option. It is the same fundamental approach as using indexes: building an index creates a dataset (index file) that points to the data. That sounds nice in terms of logic, but the I/O bottleneck is different: reading the data is no longer sequential (the most efficient access) but becomes random. With a hit ratio of about 10-20% of all data you will be as fast, or as slow, as reading all the data. For the turnaround time of your process you have to expect something like a 5 times longer run than the sequential one. Quickly evaluated, Sergio will not see any gain of importance from that.
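A minimal sketch of that key-plus-record-number idea (my addition; it assumes the 31-variable table is Dataset_A and the key is KEY, as earlier in the thread, while keyfile, rid and sorted are names I made up):
data keyfile;
  set Dataset_A(keep=KEY);
  rid = _n_;                 /* record number pointing back into Dataset_A */
run;
proc sort data=keyfile;      /* small sort: one key plus one pointer */
  by KEY;
run;
data sorted;
  set keyfile;               /* drives the step, already in KEY order */
  set Dataset_A point=rid;   /* random read of the full record - heavy I/O, as noted above */
  drop rid;
run;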
Hi all
Thank you very much for all your help. It's much appreciated.
Regards