Re: Using HASH to add sequence number to unsorted dataset

ma4 · Posted 03-10-2020 04:13 AM

Hello everybody!

I am a frequent reader and fan of the community and have now a problem where I can't find a previous solution on the board or other publications.

This issue is that I am currently getting into HASH objects due to the sheer size of datasets I have to work with. So far HASH has made my life much easier.

But the problem I have is following: I have a large dataset with multiple rows for each subject. I want to allocate a sequence number which increments each line of a subject and starts again at 1 for the next subject. The subjects have to be sorted in a certain order. I have created a test dataset and the "normal" code I am using for this step.

As the datastep is massive I would rather not sort it, but determeine the order by using the sorting variables as key. I have a feeling that this is possible but everything I've tried hasn't worked.

/* Create Input Data Set */ 
data source;   
	length USUBJID PARCAT1N PARCAT2N PARAMCD AVALC $ 16;   
	input USUBJID PARCAT1N PARCAT2N PARAMCD AVALC; 
	datalines; 
		1010835359 1 1 VAR1 VALUE1
		1010835359 1 2 VAR2 VALUE2
		1010836424 1 1 VAR1 VALUE1
		1010836424 1 2 VAR2 VALUE2
		1010836424 2 3 VAR3 VALUE3
		1010835359 2 3 VAR3 VALUE3
		1010835359 2 4 VAR4 VALUE4
		1010835359 3 5 VAR5 VALUE5
		1010835359 3 6 VAR6 VALUE6
		1010836424 2 4 VAR4 VALUE4
		1010836424 3 5 VAR5 VALUE5
		1010836424 3 6 VAR6 VALUE6

	;
run;

/*The "normal" code I would ususally use. This take too miuch time to run*/
proc sort data=source out=source_sort; by USUBJID PARCAT1N PARCAT2N PARAMCD AVALC; run;

data want;
	set source_sort;
		count + 1;
		by USUBJID;
		if first.USUBJID then count = 1;
run;

I thought about something like this, where the sequence number is according to the order of the variables (USUBJID PARCAT1N PARCAT2N PARAMCD AVALC) but I dont have to do an extra sort:

/* Load and iterate over hash */ 
data want;   

declare hash ht(dataset:"source",ordered:"a");   
	ht.definekey('USUBJID','PARCAT1N','PARCAT2N','PARAMCD','AVALC');
	ht.defineDone(); 

declare hiter iter("ht");   

set source;
	rc = iter.first();   count = 1;
	if rc = 0 then output;
	rc = iter.next(); count + 1;
run;

Is something like this possible with hash?

I would be very thankful for help!

Greeting,

Marius

ed_sas_member · Posted 03-10-2020 04:33 AM

Hi @ma4

Here is a suggestion to achieve this:

/* Sort the table using a hash (avoid PROC SORT) */ 
data _null_;   
	if _n_=1 then do;
		declare hash ht(ordered:"a");   
		ht.definekey('USUBJID','PARCAT1N','PARCAT2N','PARAMCD','AVALC');
		ht.definedata('USUBJID','PARCAT1N','PARCAT2N','PARAMCD','AVALC');
		ht.defineDone(); 
	end;
	set source;
	ht.add();
	ht.output(dataset:'source_sorted');
run; 

/* Create the COUNT variable */
data want;
	set source_sorted;
	by USUBJID;
	if first.USUBJID then count = 0;
	count + 1;
run;

All the best,

ma4 · Posted 03-10-2020 05:38 AM

Hello,

thanks for the quick reply! It worked but the log file showed me:

1239 data _null_;
1240 if _n_=1 then do;
1241 declare hash ht(ordered:"a");
1242 ht.definekey('USUBJID','PARCAT1N','PARCAT2N','PARAMCD','AVALC');
1243 ht.definedata('USUBJID','PARCAT1N','PARCAT2N','PARAMCD','AVALC');
1244 ht.defineDone();
1245 end;
1246 set source;
1247 ht.add();
1248 ht.output(dataset:'source_sorted');
1249 run;
NOTE: The data set WORK.SOURCE_SORTED has 1 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 2 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 3 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 4 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 5 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 6 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 7 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 8 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 9 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 10 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 11 observations and 5 variables.
NOTE: The data set WORK.SOURCE_SORTED has 12 observations and 5 variables.
NOTE: There were 12 observations read from the data set WORK.SOURCE.
NOTE: DATA statement used (Total process time):
real time 0.13 seconds
cpu time 0.07 seconds

If I would use this on my main dataset this would iterate through every line and take prrobably longer than an actual proc sort.

ed_sas_member · Posted 03-10-2020 06:35 AM

Hi @ma4

No problem. I have just realized that I've made it harder for myself with my previous code. This one will be much efficient:

data _null_;	
	if _n_=1 then do;
		set source;
		declare hash ht (dataset:'source', ordered:'a');
		ht.definekey ('USUBJID','PARCAT1N','PARCAT2N','PARAMCD','AVALC');
		ht.definedata ('USUBJID','PARCAT1N','PARCAT2N','PARAMCD','AVALC');
		ht.definedone ();
	end;
	ht.output(dataset:"source_sorted");	
run;

My best,

s_lassen · Posted 03-10-2020 04:38 AM

I assume that the big dataset has many more columns than shown in the example, and that is why it takes so long to sort.

In that case I would suggest something like this:

data extract;
  set have(keep=USUBJID PARCAT1N PARCAT2N PARACMD AVALC);
  obsno=_N_;
run;

proc sort data=extract;
  by USUBJID PARCAT1N PARCAT2N PARACMD AVALC;
run;

data counts(keep=obsno count);
  set extract;
  by USUBJID;
  if first.USUBJID then count=1;
  else count+1;
run;

proc sort data=counts;
  by obsno;
run;

data want;
  set have;
  set counts(drop=obsno);
run;

ma4 · Posted 03-10-2020 05:42 AM

Thank you very much! This is an amazing idea! I would have never though of that.

I tried on my main dataset and using only the key variables reduced my sort time considerably (down to 30% of the original time). I will check later the implementation in my main program.

Greetings

Marius

FreelanceReinh · Posted 03-10-2020 06:45 AM

Hello @ma4,

I'm not sure to what extent your sample dataset is representative of your real data. (Note that count happens to equal PARCAT2N.) If the input dataset is already sorted by PARCAT1N PARCAT2N PARAMCD AVALC within each subset of observations with the same USUBJID, as is the case in dataset SOURCE, you can assign the sequence numbers like this (and thus avoid any sorting):

data want;
if _n_=1 then do;
  dcl hash h();
  h.definekey('usubjid');
  h.definedata('count');
  h.definedone();
end;
set source;
if ~h.find() then count=count+1;
else count=1;
h.replace();
run;

Classroom Training Available!