Hello,
I am trying to repurpose SAS code for use in SAS Viya with CAS.
Can hash tables be integrated with CAS, so that processing runs multi-threaded and intermediate results are stored in CAS memory?
When I run SAS 9.4 hash table code that references a CAS library in the SET statement, the hash_table.output(dataset:'out'); call creates a CAS table casuser.out. However, this table contains junk data: character variables look something like this:
�����a�@��
���S�@�����
All dates revert to the year 1960 and numerics are zeros.
When debugging, I can see that the CAS library is read correctly.
Does CAS even support hash tables? I could not find anything in the documentation.
Thank you.
Accepted Solutions
I can replicate this behavior except I get 100009 rows (usually):
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 200000 observations read from the table TEST in caslib CASUSER.
NOTE: The table out in caslib CASUSER has 100009 observations and 7 variables.
I think what is happening here is a timing issue. I updated the hash definition to include the threadid and hostname both when a row is added to the hash and when it is replaced:
data _null_;
   length hostname hostname_add $50;
   if _n_=1 then do;
      dcl hash h(ordered:"A");
      h.defineKey("id");
      /* track which thread/host added or replaced each id */
      h.defineData("id","var","var_c","threadid_add","threadid_replace","hostname_add","hostname_replace");
      h.defineDone();
      call missing(threadid, threadid_add, hostname, hostname_add);
   end;
   set casuser.test end=lr;
   *by id;   /* uncomment to send all rows for a given id to the same thread */
   if h.find()=0 then do;
      /* key already present: accumulate and record the replacing thread */
      var=sum(var,var_n);
      threadid_replace = _threadid_;
      hostname_replace = _hostname_;
      h.replace();
   end;
   else do;
      /* key not found: initialize and record the adding thread */
      var=var_n;
      threadid_add = _threadid_;
      hostname_add = _hostname_;
      h.add();
   end;
   if lr then h.output(dataset:'out');
run;
data dup_id_rows;
   set casuser.out;
   by id;
   /* keep only ids that appear more than once in the output */
   if not (first.id and last.id);
run;
The dup_id_rows dataset contains the 18 rows associated with the 9 "extra" ids. In each case, the id was added to the hash twice, by two different threads, and never replaced.
There is no locking of the hash table going on behind the scenes, so I think these are instances where the second thread checked the hash before the first thread's add was completed.
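Spelled out as a timeline, this is the interleaving I believe produces each duplicate (illustrative SAS comments only, not runnable logic):
* Thread A: h.find() returns not-found;
* Thread B: h.find() returns not-found, because A's add has not completed yet;
* Thread A: h.add() inserts the id;
* Thread B: h.add() inserts the same id again, and neither row is ever replaced;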
The issue is not as pronounced as it could be, in this particular case, because the casuser.test table is not distributed across all the worker nodes. When it was initially created, the log said:
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step has no input data set and will run in a single thread.
It can be distributed by running table.partition:
proc cas;
   table.partition /
      casout={caslib="casuser", name="test", replace=true}
      table={caslib="casuser", name="test"};
   table.tabledetails / caslib="casuser" name="test" level="node";
   table.tableinfo / table="test";
quit;
My system has 4 workers, so the output of table.tabledetails now shows 50k rows on each node instead of 200k on one node. When I run the hash DATA step there are now 100088 rows due to more threads at work:
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 200000 observations read from the table TEST in caslib CASUSER.
NOTE: The table out in caslib CASUSER has 100088 observations and 7 variables.
To avoid this and still run multi-threaded, we can make all the rows associated with a given id get processed by the same thread. This can be done by adding BY ID; to the DATA step. I have it in the code above but commented out. When I uncomment it and run, there are 100000 rows (see the stripped-down version after the log below):
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 200000 observations read from the table TEST in caslib CASUSER.
NOTE: The table out in caslib CASUSER has 100000 observations and 7 variables.
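For reference, here is a minimal version of the corrected step, with BY ID active and the diagnostic thread/host variables stripped out:
data _null_;
   if _n_=1 then do;
      dcl hash h(ordered:"A");
      h.defineKey("id");
      h.defineData("id","var","var_c");
      h.defineDone();
   end;
   set casuser.test end=lr;
   by id;   /* groups all rows for a given id into the same thread */
   if h.find()=0 then do;
      var=sum(var,var_n);
      h.replace();
   end;
   else do;
      var=var_n;
      h.add();
   end;
   if lr then h.output(dataset:'out');
run;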
The documentation of the DECLARE statement shows that it is also valid in CAS.
Please post the complete log of the step creating the dataset.
The DECLARE examples in the documentation reference reading from the WORK library, not a CAS library.
*create table with duplicate IDs, numeric and character variables;
data test;
   do id=1 to 100000;
      var_n=id*2;
      var_c=cats(_N_,"_n_",var_n);
      output;
      var_n=1/id;
      var_c=cats(_N_,"_f_",var_n);
      output;
   end;
run;
*use Hash for aggregation by ID;
data _null_;
   if _n_=1 then do;
      dcl hash h(ordered:"A");
      h.defineKey("id");
      h.defineData("id","var","var_c");
      h.defineDone();
   end;
   set test end=lr;
   if h.find()=0 then do;
      var=sum(var,var_n);
      h.replace();
   end;
   else do;
      var=var_n;
      h.add();
   end;
   if lr then h.output(dataset:'out');
run;
The log from the above program, running in the regular SAS session, gives me 100000 rows, which is exactly what I expect, because the first DATA step writes two entries per id.
libname casuser cas;
data casuser.test;
   do id=1 to 100000;
      var_n=id*2;
      var_c=cats(_N_,"_n_",var_n);
      output;
      var_n=1/id;
      var_c=cats(_N_,"_f_",var_n);
      output;
   end;
run;
The above creates the expected 200K entries in the CAS library and works as expected.
data _null_;
   if _n_=1 then do;
      dcl hash h(ordered:"A");
      h.defineKey("id");
      h.defineData("id","var","var_c");
      h.defineDone();
   end;
   set casuser.test end=lr;
   if h.find()=0 then do;
      var=sum(var,var_n);
      h.replace();
   end;
   else do;
      var=var_n;
      h.add();
   end;
   if lr then h.output(dataset:'out');
run;
Running the above against casuser.test, the CAS library, produces some 5 extra entries.
The above is a very basic sample; I am actually using iterators and multiple hashes. The code works when SAS processes it on a single workstation; only when I start using CAS are the total row counts way off and the data becomes meaningless.
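For context, the iterator usage I mean looks roughly like this (a simplified sketch with placeholder variables; my real code uses multiple hashes):
data _null_;
   if _n_=1 then do;
      dcl hash h(ordered:"A");
      h.defineKey("id");
      h.defineData("id","var");
      h.defineDone();
      dcl hiter hi("h");   /* iterator over the hash */
   end;
   set test end=lr;
   var=var_n;
   rc=h.replace();   /* add or update the key */
   if lr then do;
      /* walk the hash in ascending key order */
      rc=hi.first();
      do while (rc=0);
         put id= var=;
         rc=hi.next();
      end;
   end;
run;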
Thank you for getting to the solution!