☑ This topic is solved.
val_nikolajevs
Obsidian | Level 7

Hello,

I am trying to repurpose SAS code for use in SAS Viya with CAS.

Can hash tables be integrated with CAS, so that processing runs multi-threaded and intermediate results are stored in CAS memory?

When I run SAS 9.4 hash table code that references a CAS library in the SET statement, the hash_table.output(dataset:'out') call creates a CAS table, casuser.out. However, this table contains junk data: the character variables are garbled, all dates collapse to the year 1960 (the SAS date epoch, i.e. numeric zero), and the numeric variables are zeros.

When debugging, I can see that the CAS library is read correctly.

Does CAS even support hash tables? I could not find anything in the documentation.

Thank you.

 

1 ACCEPTED SOLUTION
DerylHollick
SAS Employee

I can replicate this behavior except I get 100009 rows (usually):

 

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 200000 observations read from the table TEST in caslib CASUSER.
NOTE: The table out in caslib CASUSER has 100009 observations and 7 variables.

 

I think what is happening here is a timing issue.  I updated the hash definition to include the threadid and hostname both when a row is added to the hash and when it is replaced:  

data _null_;
  length hostname hostname_add $50;
  if _n_=1 then do;
      /* set up the hash on the first iteration */
      dcl hash h(ordered:"A");
      h.defineKey("id");
      h.defineData("id","var","var_c", "threadid_add", "threadid_replace", "hostname_add", "hostname_replace");
      h.defineDone();
      call missing(threadid, threadid_add, hostname, hostname_add);
   end;
set casuser.test end=lr;
*by id;  /* uncomment so all rows for a given id go to the same thread */

if h.find()=0 then do;
 /* key already present: accumulate and record who replaced it */
 var=sum(var,var_n);
 threadid_replace = _threadid_;
 hostname_replace = _hostname_;
 h.replace();
end;
else do;
 /* key not found: record who added it */
 var=var_n;
 threadid_add = _threadid_;
 hostname_add = _hostname_;
 h.add();
end;
if lr then h.output(dataset:'out');
run;

/* keep the ids that appear more than once in the output */
data dup_id_rows;
set casuser.out;
by id;
if not (first.id and last.id);
run;

 

The dup_id_rows dataset contains the 18 rows associated with the 9 "extra" ids.  In each case, the id was added to the hash twice, each time by a different thread, and never replaced.
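To quantify the duplicates without inspecting rows, a quick count along these lines should also work (a sketch; dup_ids is just an illustrative column name, and PROC SQL here runs on the compute server, pulling the CAS table over, which is fine at this size):

/* Sketch: count ids that ended up with more than one output row */
proc sql;
  select count(*) as dup_ids
  from (select id
        from casuser.out
        group by id
        having count(*) > 1);
quit;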

 

[Screenshot: hash.JPG]

 

There is no locking of the hash table going on behind the scenes, so I think these are instances where the second thread checked the hash before the first thread's add had completed.
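In other words, a hypothetical interleaving like the following would produce a duplicate key (purely illustrative):

/* Illustrative check-then-act race for one id:                       */
/*   thread A: h.find() -> not found                                  */
/*   thread B: h.find() -> not found (A's add not yet visible)        */
/*   thread A: h.add()  -> id stored                                  */
/*   thread B: h.add()  -> id stored again, so the id appears twice   */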

 

In this particular case the issue is not as pronounced as it could be, because the casuser.test table is not distributed across all the worker nodes.  When it was initially created, the log said:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step has no input data set and will run in a single thread.

 

It can be distributed by running table.partition:

proc cas;
   /* redistribute the table across the worker nodes */
   table.partition / casout={caslib="casuser", name="test", replace=true} table={caslib="casuser", name="test"};
   /* then check the per-node row counts and table info */
   table.tabledetails / caslib="casuser" name="test" level="node";
   table.tableinfo / table="test";
quit;

 

My system has 4 workers, so the output of table.tabledetails now shows 50k rows on each node instead of 200k on one node.  When I run the hash DATA step, there are now 100088 rows because more threads are at work:

 

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 200000 observations read from the table TEST in caslib CASUSER.
NOTE: The table out in caslib CASUSER has 100088 observations and 7 variables.

 

To avoid this and still run multi-threaded, we can make all the rows associated with a given ID be processed by the same thread.  This can be done by adding BY ID; to the DATA step.  I have it in the code above but commented out.  When I uncomment it and run, there are 100000 rows (a consolidated version of the step appears after the log):

 

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 200000 observations read from the table TEST in caslib CASUSER.
NOTE: The table out in caslib CASUSER has 100000 observations and 7 variables.
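
For reference, here is that essential aggregation step with BY ID in place, with the threadid and hostname instrumentation removed:

data _null_;
  if _n_=1 then do;
      dcl hash h(ordered:"A");
      h.defineKey("id");
      h.defineData("id","var","var_c");
      h.defineDone();
   end;
set casuser.test end=lr;
by id;  /* all rows for a given id are processed by the same thread */

if h.find()=0 then do;
 var=sum(var,var_n);
 h.replace();
end;
else do;
 var=var_n;
 h.add();
end;
if lr then h.output(dataset:'out');
run;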

 


4 REPLIES
val_nikolajevs
Obsidian | Level 7

The DECLARE statement documentation only references reading from the WORK library, not a CAS library. Here is a minimal reproduction of the problem:

*create table with duplicate IDs, numeric and character variables;
data test;
do id=1 to 100000;
   var_n=id*2;
   var_c=cats(_N_,"_n_",var_n);
   output;
   var_n=1/id;
   var_c=cats(_N_,"_f_",var_n);
   output;
end;
run;

*use Hash for aggregation by ID;
data _null_;
  if _n_=1 then do;
      dcl hash h(ordered:"A");
      h.defineKey("id");
      h.defineData("id","var","var_c");
      h.defineDone();
  end;
set test end=lr;

if h.find()=0 then do;
 var=sum(var,var_n);
 h.replace();
end;
else do;
 var=var_n;
 h.add();
end;
if lr then h.output(dataset:'out');
run;

The log from the above program running on the work server gives me 100000 rows, which is exactly what I expect, because the first DATA step writes two entries per id.

NOTE: The data set WORK.OUT has 100000 observations and 3 variables.
NOTE: There were 200000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.16 seconds
user cpu time 0.12 seconds
system cpu time 0.04 seconds
memory 33531.40k
OS Memory 65156.00k
Now create the same test table in a CAS library:

libname casuser cas;
data casuser.test;
do id=1 to 100000;
   var_n=id*2;
   var_c=cats(_N_,"_n_",var_n);
   output;
   var_n=1/id;
   var_c=cats(_N_,"_f_",var_n);
   output;
end;
run;
The above creates the expected 200k entries in the CAS library, so table creation works as expected:
NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step has no input data set and will run in a single thread.
NOTE: The table test in caslib CASUSER(xxxx) has 200000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.29 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds

Now the same hash aggregation, this time reading from the CAS table:

data _null_;
  if _n_=1 then do;
      dcl hash h(ordered:"A");
      h.defineKey("id");
      h.defineData("id","var","var_c");
      h.defineDone();
  end;
set casuser.test end=lr;

if h.find()=0 then do;
 var=sum(var,var_n);
 h.replace();
end;
else do;
 var=var_n;
 h.add();
end;
if lr then h.output(dataset:'out');
run;
Running the above against casuser.test, the CAS library, produces 5 extra entries:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.
NOTE: There were 200000 observations read from the table TEST in caslib CASUSER(xxxx).
NOTE: The table out in caslib CASUSER(xxxx) has 100005 observations and 3 variables.
NOTE: DATA statement used (Total process time):


The above is a very basic sample; my real code uses iterators and multiple hashes. It works in SAS processing on a single workstation; only when I start using CAS are the total row counts way off and the data meaningless.
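
As an aside: for this particular aggregation, a hash-free variant using BY-group processing should produce the same per-id sums and sidestep the timing issue entirely, since BY groups keep all rows for an id together. A sketch, not tested against this thread's data (casuser.out2 is an illustrative name):

/* Sketch: per-id aggregation with BY groups instead of a hash */
data casuser.out2;
  set casuser.test;
  by id;                  /* all rows for an id are processed together */
  if first.id then var=0; /* reset the accumulator for each id */
  var + var_n;            /* sum statement: accumulate var_n within the id */
  if last.id then output; /* write one row per id */
  keep id var var_c;
run;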
val_nikolajevs
Obsidian | Level 7

Thank you for getting to the solution!

