Babloo
Rhodochrosite | Level 12

In the following code I'm reading the entire dataset with a SET statement; it has 10 million records and 25+ variables. Since this is time consuming, is there a more efficient way to read the dataset and extract the distinct values for ENTITY and REPORTING? Can we implement a bitmap-index kind of approach here (is that available in SAS)?

 

/*extract unique entity_id and reporting cause*/

data dis_entity_id ;
  set &tblname.;
  format ENTITY $ENTITY. REPORTING $50.;
  keep ENTITY REPORTING;
run;

proc sort data=dis_entity_id nodupkey;
  by ENTITY REPORTING;
run;
9 REPLIES
Astounding
PROC Star
Load the data set into a hash table.

Use ENTITY and REPORTING as the keys.

Use a method that ignores duplicates.

Unload the result.
Babloo
Rhodochrosite | Level 12
Could you please help me with one example?
FreelanceReinh
Jade | Level 19

@Babloo wrote:
Could you please help me with one example?


@Tom wrote:

Not sure you can get any more efficient than PROC SORT.

Very true.

 

Example:

/* Create sample data for demonstration */

proc format;
value $entity
other=[$revers5.];
run;

data have;
call streaminit(27182818);
length reporting $50 entity $5; /* "reverse" order intended */
array nv[14] (4567:4580);
array cv[14] $9 (14*'Test data');
do _n_=1 to 1e7;
  entity=put(rand('integer',99999),z5.);
  reporting='Reporting cause: customer request last '||put(rand('integer',7),downame.-l);
  output;
end;
format entity $entity.;
run; /* 10000000 obs., 30 variables, 2.76 GB, run time: 2.94 s */

/* Use three different methods to extract the unique keys (in ascending order) */

data _null_;
dcl hash h(dataset:'have', ordered:'a', hashexp:20);
h.definekey('entity','reporting'); /* explicit list to define variable order */
h.definedone();
h.output(dataset:'want');
stop;
set have;
run; /* 699992 obs., 6.43 s (with default hashexp:8 it took 10-11 s) */

proc sort data=have(keep=entity reporting) out=want2 nodupkey;
by entity reporting;
run; /* 699992 obs., 5.61 s */

proc sql;
create table want3 as
select distinct entity, reporting
from have;
quit; /* 699992 obs., 6.93 s */

Log:

8    data have;
9    call streaminit(27182818);
10   length reporting $50 entity $5; /* "reverse" order intended */
11   array nv[14] (4567:4580);
12   array cv[14] $9 (14*'Test data');
13   do _n_=1 to 1e7;
14     entity=put(rand('integer',99999),z5.);
15     reporting='Reporting cause: customer request last '||put(rand('integer',7),downame.-l);
16     output;
17   end;
18   format entity $entity.;
19   run;

NOTE: The data set WORK.HAVE has 10000000 observations and 30 variables.
NOTE: DATA statement used (Total process time):
      real time           2.94 seconds
      cpu time            2.94 seconds


20
21   data _null_;
22   dcl hash h(dataset:'have', ordered:'a', hashexp:20);
23   h.definekey('entity','reporting'); /* explicit list to define variable order */
24   h.definedone();
25   h.output(dataset:'want');
26   stop;
27   set have;
28   run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 699992 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           6.43 seconds
      cpu time            6.36 seconds


29
30   proc sort data=have(keep=entity reporting) out=want2 nodupkey;
31   by entity reporting;
32   run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: SAS threaded sort was used.
NOTE: 9300008 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT2 has 699992 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           5.61 seconds
      cpu time            7.97 seconds


33
34   proc sql;
35   create table want3 as
36   select distinct entity, reporting
37   from have;
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT3 created, with 699992 rows and 2 columns.

38   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.93 seconds
      cpu time            9.15 seconds

 

hashman
Ammonite | Level 13

@FreelanceReinh:

One-time tests of this kind are inconclusive. Too many confounding factors: it depends on the platform, resource distribution, the nature of the keys, etc.

On my Windows machine, your steps fare this way (the times are averaged over 3 test runs):

  1. Creation of HAVE: Time: 7.3 s, Memory: 0.4 MB
  2. Hash: Time: 6.4 s, Memory: 180 MB
  3. Sort: Time: 12.1 s, Memory: 1000 MB
  4. SQL:  Time: 14.5 s, Memory: 1000 MB

Kind regards

Paul D.

 

FreelanceReinh
Jade | Level 19

@hashman wrote:

@FreelanceReinh:

One-time tests of this kind are inconclusive. Too many confounding factors: (...)


Yes, sure, I know. 🙂

Indeed, I was hesitant to post my comparison without describing the dependence on various factors. I thought this was well known to most programmers. Next time in a similar situation I will at least mention this fact. My code was primarily intended to provide the OP with the requested hash example using a HAVE dataset (hopefully) similar to their original.

 

Thanks for adding your own time measurements. It's always interesting to see more than one performance profile. The different ratios between the four run times on my workstation may be due to the fact that my WORK library is located on a hybrid RAM disk (composed of RAM and a special kind of SSD), which obviously mitigates the usual performance gap between (hard) disk and memory access.

hashman
Ammonite | Level 13

@FreelanceReinh:

Your points are well taken. My "machine" is an ancient Lenovo W520 with a 2.4 GHz Intel i7, 8 GB of RAM, and a half-empty 500 GB SSD.

 

As far as the OP's question goes, methinks its main thrust was to find out whether the distinct key-value counts could be obtained without reading the entire input file - which of course cannot be done, unless this information is created during file creation and stored in some special structure on the side (as in some RDBMS, where you can get these counts via an index-only scan).
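For illustration, here is a minimal sketch of creating such a side structure in SAS at file-creation time; the dataset reference &tblname. and the index name UKEY are assumptions for the example, and building the index itself still requires a full pass over the data, so it only pays off for repeated use.

/* Sketch (assumed names): rebuild the table with a composite index
   UKEY on (ENTITY, REPORTING) at creation time. The index metadata
   then lives "on the side" in the index file. */
data indexed_copy(index=(ukey=(entity reporting)));
  set &tblname.;
run;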

 

Kind regards

Paul D.

Astounding
PROC Star
I don't have an example, so one of us will have to learn how. I think it's more appropriate for you to learn it, rather than for me to do it for you.
Tom
Super User

Not sure you can get any more efficient than PROC SORT.

proc sort data=&tblname.(keep=ENTITY REPORTING) out=dis_entity_id nodupkey;
  by ENTITY REPORTING;
  format entity $entity.;
run;
hashman
Ammonite | Level 13

@Babloo:

Hopefully, you realize that if you have an unindexed SAS data file, then in order to get the unique values for a key you'll have to read the entire file, no matter which method of getting the distinct values you employ. This is simply because the first key value you're looking for can be in the first record and the last key value in the last record, irrespective of how the file is ordered.

 

Now if the file is already indexed, you'd think that, in principle, to get the distinct key values you should be able to read only the primary index nodes rather than the entire file. Such a capability is embedded in some RDBMS, but there the index has to be organized in a way that allows it. To my knowledge, SAS indexes don't afford this capability - if I'm wrong on this, I'd be elated to be corrected. I do know that with proc CONTENTS you can get the key centiles and the number of unique key values for an indexed key - but not the distinct counts for every key value.
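As a rough sketch of that last point (assumed names: the WORK library and the HAVE dataset from the example above), a simple index can be created with PROC DATASETS and then inspected with the CENTILES option of PROC CONTENTS, which reports the number of unique values for the indexed key:

/* Sketch (assumed names): create a simple index on ENTITY, then list
   the index metadata. CENTILES prints the cumulative key distribution
   and the number of unique values stored with the index. */
proc datasets library=work nolist;
  modify have;
  index create entity;
quit;

proc contents data=work.have centiles;
run;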

 

Kind regards

Paul D.
