In the following code I'm reading the entire dataset with a SET statement; it has 10 million records and 25+ variables. Since this is time consuming, is there a more efficient way to read the dataset and extract the distinct values of ENTITY and REPORTING? Can we implement a bitmap-index-style approach here (is anything like that available in SAS)?
/* extract unique entity_id and reporting cause */
data dis_entity_id;
   set &tblname.;
   format ENTITY $ENTITY. REPORTING $50.;
   keep ENTITY REPORTING;
run;

proc sort data=dis_entity_id nodupkey;
   by ENTITY REPORTING;
run;
@Babloo wrote:
Could you please help me with one example?
@Tom wrote: Not sure you can get any more efficient than PROC SORT.
Very true.
Example:
/* Create sample data for demonstration */
proc format;
   value $entity
      other=[$revers5.];
run;

data have;
   call streaminit(27182818);
   length reporting $50 entity $5; /* "reverse" order intended */
   array nv[14] (4567:4580);
   array cv[14] $9 (14*'Test data');
   do _n_=1 to 1e7;
      entity=put(rand('integer',99999),z5.);
      reporting='Reporting cause: customer request last '||put(rand('integer',7),downame.-l);
      output;
   end;
   format entity $entity.;
run; /* 10000000 obs., 30 variables, 2.76 GB, run time: 2.94 s */
/* Use three different methods to extract the unique keys (in ascending order) */

data _null_;
   dcl hash h(dataset:'have', ordered:'a', hashexp:20);
   h.definekey('entity','reporting'); /* explicit list to define variable order */
   h.definedone();
   h.output(dataset:'want');
   stop;
   set have;
run; /* 699992 obs., 6.43 s (with default hashexp:8 it took 10-11 s) */

proc sort data=have(keep=entity reporting) out=want2 nodupkey;
   by entity reporting;
run; /* 699992 obs., 5.61 s */

proc sql;
   create table want3 as
   select distinct entity, reporting
   from have;
quit; /* 699992 obs., 6.93 s */
Log:
8     data have;
9       call streaminit(27182818);
10      length reporting $50 entity $5; /* "reverse" order intended */
11      array nv[14] (4567:4580);
12      array cv[14] $9 (14*'Test data');
13      do _n_=1 to 1e7;
14        entity=put(rand('integer',99999),z5.);
15        reporting='Reporting cause: customer request last '||put(rand('integer',7),downame.-l);
16        output;
17      end;
18      format entity $entity.;
19    run;

NOTE: The data set WORK.HAVE has 10000000 observations and 30 variables.
NOTE: DATA statement used (Total process time):
      real time           2.94 seconds
      cpu time            2.94 seconds

20
21    data _null_;
22      dcl hash h(dataset:'have', ordered:'a', hashexp:20);
23      h.definekey('entity','reporting'); /* explicit list to define variable order */
24      h.definedone();
25      h.output(dataset:'want');
26      stop;
27      set have;
28    run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 699992 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           6.43 seconds
      cpu time            6.36 seconds

29
30    proc sort data=have(keep=entity reporting) out=want2 nodupkey;
31      by entity reporting;
32    run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: SAS threaded sort was used.
NOTE: 9300008 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT2 has 699992 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           5.61 seconds
      cpu time            7.97 seconds

33
34    proc sql;
35      create table want3 as
36      select distinct entity, reporting
37      from have;
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT3 created, with 699992 rows and 2 columns.
38    quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.93 seconds
      cpu time            9.15 seconds
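One tweak worth trying on top of the hash method (not timed above, so take it as an assumption rather than a measured result): restrict both the hash load and the SET statement to the two key variables, so the step never has to move all 30 variables around:

data _null_;
   dcl hash h(dataset:'have(keep=entity reporting)', ordered:'a', hashexp:20);
   h.definekey('entity','reporting');
   h.definedone();
   h.output(dataset:'want');
   stop;
   set have(keep=entity reporting); /* defines the host variables only; never executed */
run;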
One-time tests of this kind are inconclusive. There are too many confounding factors: the platform, resource distribution, the nature of the keys, etc.
On my Windows machine, your steps fare this way (the times are averaged over 3 test runs):
Kind regards
Paul D.
@hashman wrote:
One-time tests of this kind are inconclusive. There are too many confounding factors: (...)
Yes, sure, I know. 🙂
Indeed, I was hesitant to post my comparison without describing the dependence on various factors. I thought this was well known to most programmers. Next time in a similar situation I will at least mention this fact. My code was primarily intended to provide the OP with the requested hash example using a HAVE dataset (hopefully) similar to their original.
Thanks for adding your own time measurements. It's always interesting to see more than one performance profile. The different ratios between the four run times on my workstation may be due to the fact that my WORK library is located on a hybrid RAM disk (composed of RAM and a special kind of SSD), which obviously mitigates the usual performance gap between (hard) disk and memory access.
Your points are well taken. My "machine" is an ancient W520 i-7 2.4GHz Intel Lenovo with 8G RAM and a half-empty 500G SSD.
As far as the OP's question goes, methinks its main thrust was to find out whether the distinct key values could be obtained without reading the entire input file, which of course cannot be done unless this information is created during the file creation and stored in some special structure on the side (like in some RDBMS, where you can get these counts via an index-only scan).
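To make that "stored on the side" idea concrete, here's a hedged sketch (SOURCE_DATA and the step details are hypothetical, not the OP's actual code): while the big file is being written, collect the distinct key pairs in a hash object and save them as a small companion table, so they never need to be re-extracted later:

data have;
   if _n_ = 1 then do;
      dcl hash u(ordered:'a');
      u.definekey('entity','reporting');
      u.definedone();
   end;
   set source_data end=last; /* hypothetical input feed */
   output;
   rc = u.ref(); /* adds the key pair only if it is not yet in the hash */
   if last then u.output(dataset:'distinct_keys');
   drop rc;
run;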
Kind regards
Paul D.
Not sure you can get any more efficient than PROC SORT.
proc sort data=&tblname.(keep=ENTITY REPORTING) out=dis_entity_id nodupkey;
   by ENTITY REPORTING;
   format ENTITY $ENTITY.;
run;
Hopefully, you realize that if you have an unindexed SAS data file, then in order to get the unique values for a key, you'll have to read the entire file, no matter which method of getting the distinct values you employ. This is simply because the first key-value you're looking for can sit in the first record and the last key-value in the last record, irrespective of how the file is ordered.
Now if the file is already indexed, in principle you'd think that to get the distinct key-values, you should be able to read only the index nodes rather than the entire file. Such a capability is embedded in some RDBMS, but the index has to be organized in a way that allows it. To my knowledge, SAS indexes don't afford this capability; if I'm wrong on this head, I'd be elated to be corrected. I do know that using proc CONTENTS you can get info about the key centiles and the number of unique key-values for a key, but not the distinct counts for every key-value. A minimal sketch of pulling that centile info is below.
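For completeness, the sketch (this only reports index metadata via PROC CONTENTS; it is not a way around reading the file for per-value counts):

proc datasets library=work nolist;
   modify have;
   index create entity; /* simple index on one key */
quit;

proc contents data=have centiles;
run;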
Kind regards
Paul D.