In the following code I'm reading the entire dataset with a SET statement; it has 10 million records and 25+ variables. Since this is time consuming, is there a more efficient way to read the dataset and extract the distinct values of ENTITY and REPORTING? Can we implement a bitmap-index-style approach here (is anything like that available in SAS)?
/* extract unique entity_id and reporting cause */
data dis_entity_id;
   set &tblname.;
   format ENTITY $ENTITY. REPORTING $50.;
   keep ENTITY REPORTING;
run;

proc sort data=dis_entity_id nodupkey;
   by ENTITY REPORTING;
run;
@Babloo wrote:
Could you please help me with one example?
@Tom wrote: Not sure you can get any more efficient than PROC SORT.
Very true.
Example:
/* Create sample data for demonstration */
proc format;
   value $entity
      other=[$revers5.];
run;

data have;
   call streaminit(27182818);
   length reporting $50 entity $5; /* "reverse" order intended */
   array nv[14] (4567:4580);
   array cv[14] $9 (14*'Test data');
   do _n_=1 to 1e7;
      entity=put(rand('integer',99999),z5.);
      reporting='Reporting cause: customer request last '||put(rand('integer',7),downame.-l);
      output;
   end;
   format entity $entity.;
run; /* 10000000 obs., 30 variables, 2.76 GB, run time: 2.94 s */
/* Use three different methods to extract the unique keys (in ascending order) */

data _null_;
   dcl hash h(dataset:'have', ordered:'a', hashexp:20);
   h.definekey('entity','reporting'); /* explicit list to define variable order */
   h.definedone();
   h.output(dataset:'want');
   stop;
   set have;
run; /* 699992 obs., 6.43 s (with default hashexp:8 it took 10-11 s) */

proc sort data=have(keep=entity reporting) out=want2 nodupkey;
   by entity reporting;
run; /* 699992 obs., 5.61 s */

proc sql;
   create table want3 as
   select distinct entity, reporting
   from have;
quit; /* 699992 obs., 6.93 s */
Log:
8     data have;
9       call streaminit(27182818);
10      length reporting $50 entity $5; /* "reverse" order intended */
11      array nv[14] (4567:4580);
12      array cv[14] $9 (14*'Test data');
13      do _n_=1 to 1e7;
14        entity=put(rand('integer',99999),z5.);
15        reporting='Reporting cause: customer request last '||put(rand('integer',7),downame.-l);
16        output;
17      end;
18      format entity $entity.;
19    run;

NOTE: The data set WORK.HAVE has 10000000 observations and 30 variables.
NOTE: DATA statement used (Total process time):
      real time           2.94 seconds
      cpu time            2.94 seconds

20
21    data _null_;
22      dcl hash h(dataset:'have', ordered:'a', hashexp:20);
23      h.definekey('entity','reporting'); /* explicit list to define variable order */
24      h.definedone();
25      h.output(dataset:'want');
26      stop;
27      set have;
28    run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 699992 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           6.43 seconds
      cpu time            6.36 seconds

29
30    proc sort data=have(keep=entity reporting) out=want2 nodupkey;
31      by entity reporting;
32    run;

NOTE: There were 10000000 observations read from the data set WORK.HAVE.
NOTE: SAS threaded sort was used.
NOTE: 9300008 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT2 has 699992 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           5.61 seconds
      cpu time            7.97 seconds

33
34    proc sql;
35      create table want3 as
36      select distinct entity, reporting
37      from have;
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT3 created, with 699992 rows and 2 columns.
38    quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.93 seconds
      cpu time            9.15 seconds
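One tweak worth trying on top of the hash method (not timed above, so take it as an assumption rather than a measured result): restrict both the hash load and the SET statement to the two key variables, so the step never has to move all 30 variables around:

data _null_;
   dcl hash h(dataset:'have(keep=entity reporting)', ordered:'a', hashexp:20);
   h.definekey('entity','reporting');
   h.definedone();
   h.output(dataset:'want');
   stop;
   set have(keep=entity reporting); /* defines the host variables only; never executed */
run;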
One-time tests of this kind are inconclusive. There are too many confounding factors: the platform, resource distribution, the nature of the keys, etc.
On my Windows machine, your steps fare this way (the times are averaged over 3 test runs):
Kind regards
Paul D.
@hashman wrote:
One-time tests of this kind are inconclusive. There are too many confounding factors: (...)
Yes, sure, I know. 🙂
Indeed, I was hesitant to post my comparison without describing the dependence on various factors. I thought this was well known to most programmers. Next time in a similar situation I will at least mention this fact. My code was primarily intended to provide the OP with the requested hash example using a HAVE dataset (hopefully) similar to their original.
Thanks for adding your own time measurements. It's always interesting to see more than one performance profile. The different ratios between the four run times on my workstation may be due to the fact that my WORK library is located on a hybrid RAM disk (composed of RAM and a special kind of SSD), which obviously mitigates the usual performance gap between (hard) disk and memory access.
Your points are well taken. My "machine" is an ancient W520 i-7 2.4GHz Intel Lenovo with 8G RAM and a half-empty 500G SSD.
As far as the OP's question goes, methinks its main thrust was to find out whether the distinct key values could be obtained without reading the entire input file, which of course cannot be done unless this information is created during the file creation and stored in some special structure on the side (like in some RDBMS, where you can get these counts via an index-only scan).
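To make that "stored on the side" idea concrete, here's a hedged sketch (SOURCE_DATA and the step details are hypothetical, not the OP's actual code): while the big file is being written, collect the distinct key pairs in a hash object and save them as a small companion table, so they never need to be re-extracted later:

data have;
   if _n_ = 1 then do;
      dcl hash u(ordered:'a');
      u.definekey('entity','reporting');
      u.definedone();
   end;
   set source_data end=last; /* hypothetical input feed */
   output;
   rc = u.ref(); /* adds the key pair only if it is not yet in the hash */
   if last then u.output(dataset:'distinct_keys');
   drop rc;
run;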
Kind regards
Paul D.
Not sure you can get any more efficient than PROC SORT.
proc sort data=&tblname.(keep=ENTITY REPORTING) out=dis_entity_id nodupkey;
   by ENTITY REPORTING;
   format ENTITY $ENTITY.;
run;
Hopefully, you realize that if you have an unindexed SAS data file, then in order to get the unique values for a key, you'll have to read the entire file, no matter which method of getting the distinct values you employ. This is simply because the first key-value you're looking for can sit in the first record and the last key-value in the last record, irrespective of how the file is ordered.
Now if the file is already indexed, in principle you'd think that to get the distinct key-values, you should be able to read only the index nodes rather than the entire file. Such a capability is embedded in some RDBMS, but the index has to be organized in a way that allows it. To my knowledge, SAS indexes don't afford this capability; if I'm wrong on this head, I'd be elated to be corrected. I do know that using proc CONTENTS you can get info about the key centiles and the number of unique key-values for a key, but not the distinct counts for every key-value. A minimal sketch of pulling that centile info is below.
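For completeness, the sketch (this only reports index metadata via PROC CONTENTS; it is not a way around reading the file for per-value counts):

proc datasets library=work nolist;
   modify have;
   index create entity; /* simple index on one key */
quit;

proc contents data=have centiles;
run;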
Kind regards
Paul D.