Solved: Re: PROC FCMP for removing missing values in array and OF operators

TedP · Posted 11-20-2015 02:28 PM

Hi everyone,

First post. Just began learning SAS about a couple weeks ago and am finding some things unbelievably infuriating. I'm used to stackoverflow handing me answers on a platter but this doesn't seem to be the case with SAS so here I am....

My requirements are very straightforward. I am writing a user defined functions using PROC FCMP, passing it an array. I want to create a new array without missing values and calculate a median. I realize that I don't need a UDF for this but this answer will generalize to many other things I'd like to do. I'd also like to know how to have a function variable assigned to a macro variable.

Here is my current function.

proc fcmp outlib=work.functions.sample;
	function remove_missing(t[*]);
		array n[100] /nosymbols;
		array n_new[100] / nosymbols;
		do i = 1 to dim(t);
			n[i] = t[i];
		end;
		num_missing = cmiss(of n[*]);
		kept = 100 - num_missing;

		call dynamic_array(n_new, kept);
		olddim = dim(n);
		newdim = dim(n_new);
		put olddim= newdim=;
		do i = 1 to dim(n_new);
			n_new[i]=n[i];
		end;
		m = median(of n_new[*]); *error here. Index out of bounds.;
		return(m);
	endsub;
run;
quit;

And Data step

options cmplib=work.functions;
data _null_;
	array prices[*] t1-t55;
	do i = 1 to 10;
		prices[i] = i*2;
	end;
	med = remove_missing(prices);
run;

After dynamically altering the new array, running median(of n_new[*]) attempts to use all 100 of the previously defined array than the 10 for the new dimension. And the reason I must make a new array inside my function is that the OF operator needs a contiguous (or temporary) array for it to work.

Also, I would absolutely love to know to the following in a function. Have a macro variable assigned to a function variable. %SYSFUNC somehow doesn not work with the dim function.

variable = dim(array);
%let macrovar = variable;

Tom · Posted 11-20-2015 06:21 PM

Are you calculating something that the existing time series procedures cannot do?

Are you calculating something that PROC MEANS cannot calculate?

It looks to me like you are just calculating the MEDIAN and the MODE over the set of values that are a fixed number of observations before or after a value? Why not just output those as multiple records and let PROC MEANS do the work?

* Get some sample data ;
data sample;
  set sashelp.iris (keep=species petalwidth);
  rename petalwidth=price ;
run;

* Set macro variables so it will be easy to convert to macro ;
%let lag=3 ;
%let dsn=sample;
%let byvars=species;
%let var=price ;

First expand the data to add your lead and lag observations. You can do this as a data step view to save permanent disk space if you data is large.

data lead_lag / view=lead_lag ;
  keep rowno &var;
  do until (last.%scan(&byvars,-1)) ;
    set &dsn;
    by &byvars ;
    _stop+1;
    if first.species then _start=_stop ;
  end;
  do until (last.%scan(&byvars,-1)) ;
    set &dsn;
    by &byvars;
    rowno+1;
    do p=max(_start,rowno-&lag) to min(rowno+&lag,_stop);
      set &dsn point=p;
      output;
    end;
  end;
run;

Then calculate the statistic for each of the original rows.

proc summary data=lead_lag nway;
  by rowno ;
  var &var ;
  output out=want(drop=_type_ _freq_) mode=mode mean=mean median=median ;
run;

Then combine the new statistic columns back with the source data.

data want ;
  merge &dsn want ;
run;

Results look like this:

Obs    Species    price    rowno    mode      mean     median

  1    Setosa       2         1       2     3.00000      2.5
  2    Setosa       3         2       2     2.80000      2.0
  3    Setosa       2         3       2     2.66667      2.0
  4    Setosa       5         4       2     2.57143      2.0
  5    Setosa       2         5       2     2.42857      2.0
  6    Setosa       2         6       2     2.28571      2.0
  7    Setosa       2         7       2     2.85714      2.0
  8    Setosa       1         8       2     2.42857      2.0
  9    Setosa       2         9       2     2.42857      2.0
 10    Setosa       6        10       2     2.57143      2.0
 11    Setosa       2        11       2     2.57143      2.0

View solution in original post

Tom · Posted 11-20-2015 02:42 PM

What is the purpose of the proposed function?

What does it mean to remove the missing values?

How is the median going to be any different once you have removed the missing values?

TedP · Posted 11-20-2015 02:58 PM

The above problem is just a dummy version of a much larger function where I compute the mode of an array of numbers using a hash table. If there is no mode (each number appars once) or the frequency of the most common number of the array is less than 30% of the length of the data, then the function returns the median. Else it returns the mode.

My main issue is that I cannot compute median(of array[*]) and cmiss(of array[*]) in the udf because for whatever reason proc fcmp needs a contiguous or temporary array. My current solution is to compute the median and missing count outside of the udf and just pass those values as parameters into my function.

Here is my real function

proc fcmp outlib=work.functions.sample;
	function find_mode(prices[*], num_missing, the_median);
		length price 8;
		length count 8;
		declare hash hash_initial();
		rc = hash_initial.defineKey('price');
		rc = hash_initial.defineData('price', 'count');
		rc = hash_initial.defineDone();
		declare hash hash_ordered(ordered:'d');
		rc = hash_ordered.defineKey('count');
		rc = hash_ordered.defineData('price', 'count');
		rc = hash_ordered.defineDone();
		declare hiter hash_initial_iter('hash_initial');
		declare hiter hash_ordered_iter('hash_ordered');

	do i = 1 to dim(prices);
		price = prices[i];
		rc = hash_initial.find();
		if rc ^= 0 then
			do;
				if price = . then count = 0;
				else count = 1;
				rc = hash_initial.add();
			end;
		else
			do;
				if price ^= . then count = count + 1;
				rc = hash_initial.replace();
			end;
	end;
		
	rc = hash_initial_iter.first();
	do while (rc=0);
     	rc = hash_ordered.replace();
        rc=hash_initial_iter.next();
  	end;

	rc = hash_ordered_iter.first();
	num_non_missing = dim(prices) - num_missing;
	if num_non_missing > 0 then
		do;
		if count = 1 or count / num_non_missing < .3 then price = the_median;
		end;
	else price = .;

	rc = hash_initial_iter.delete(); 
	rc = hash_ordered_iter.delete();
	rc = hash_initial.clear();
	rc = hash_ordered.clear();
	
	return(price);
	endsub;
run;
quit;

Reeza · Posted 11-20-2015 03:19 PM

Consider transposing your data, using BY processing within a PROC MEANS instead.

PROC STDIZE can be used for filling in missing values.

TedP · Posted 11-20-2015 03:38 PM

Reeza, I'd rather not get derailed going down a path other than the answer to my question. It seems there should be a simple solution. Pass an array to a function. Make it a contiguous array. Compute summary statistics using the of operator. What also would work is %let arraydim = %sysfunc(dim(oldarray)); newarray array[&arraydim]; But there are two issues. %sysfunc doesn't allow dim and even if it did it wouldn't be able calculate dim(oldarray) because oldarray is not a macro variable.

And to give more background. I am taking the leading and lagging 7 values by group during the datastep and stuffing them into an array to calculate the mode so proc means wouldn't work here. Here is the code that use my my main udf find_mode.

It works fine the way it is. I just want to move the num_missing and the_median statements into my udf, which should be very easy...but I can't figure it out.

%macro short_mode(dataset, new_dataset, price_col, output_col, lastcol, num_lags);
options cmplib=sp.functions;
data &new_dataset;
	%let lastcol1 = %sysfunc(catx(,&lastcol,1));
  	set &dataset;
  	by geomarket segment CURR_OPER_NAME HL_LL_Desc &lastcol;
	
	%do i = 1 %to &num_lags;
		%let obs = %eval(&i + 1);
		if eof&i=0 then
			set &dataset (firstobs=&obs keep=geomarket segment CURR_OPER_NAME HL_LL_Desc &lastcol &price_col 
							rename=(&price_col=lead&i HL_LL_Desc=HL_LL_Desc1 CURR_OPER_NAME=CURR_OPER_NAME1
								segment=segment1 geomarket=geomarket1 &lastcol=&lastcol1)) end=eof&i;
	  	if HL_LL_Desc1 ^= HL_LL_Desc or CURR_OPER_NAME1 ^= CURR_OPER_NAME or segment1 ^= segment or geomarket1 ^= geomarket or &lastcol1 ^= &lastcol
			then lead&i=.;

		lag&i = lag&i(&price_col);
		if lag&i(geomarket) ^= geomarket or lag&i(segment) ^= segment or lag&i(CURR_OPER_NAME) ^= CURR_OPER_NAME or 
			lag&i(HL_LL_Desc) ^= HL_LL_Desc or lag&i(&lastcol) ^= &lastcol then	
  		lag&i=.;
	%end;

	array prices[*] lead1-lead&num_lags &price_col lag1-lag&num_lags;
	%let array_dim = %eval(&num_lags * 2 + 1);
	array prices_temp(&array_dim) _temporary_;
	do i = 1 to dim(prices);
		prices_temp[i]=prices[i];
	end;
	num_missing = cmiss(of prices_temp[*]);
	the_median = median(of prices_temp[*]);
	mode = find_mode(prices_temp, num_missing, the_median);
	&keepdrop;
	rename mode = &output_col;
run;
%mend;

%let keepdrop = drop i rc num_non_missing lag1-lag7 lead1-lead7 
					count num_missing the_median;
%short_mode(ftl_rolling, ftl_rolling_mode, Actual_Price_per_Unit,
			RollingModeCat, UOM_PPU_Type, 7);

Tom · Posted 11-20-2015 03:47 PM

Why do you need to calculate the mode using a function instead of running the code in the data step?

Couldn't you generate the needed code using a macro if it was too cumberson to re-type?

What value is being added by the function that out ways its added complexity to implemen, especially given the limits that SAS has imposed on the flexibilty of passing values to the function?

TedP · Posted 11-20-2015 03:58 PM

Just to super clarify. My code works well and runs quite fast. Even on a million rows of data it only takes a minute to calculate a rolling mode and its doing an absurd amount of calculations. I just want to tidy it up a bit more with something that is dead easy in other languages.

"Why do you need to calculate the mode using a function instead of running the code in the data step? "

A udf is necessary (unless you can otherwise tell me so) to return a single value. I don't know how to declare a hash table in a macro and don't know how to return a single value from a macro or from a data step. A function (in my mind and just like all other languages I've ever programmed in) takes in values, does a calculation and spits out a single value.

"Couldn't you generate the needed code using a macro if it was too cumberson to re-type?"

I wish I knew how to declare hash tables in macros and return a single value. I thought this was what functions were for.

"What value is being added by the function that out ways its added complexity to implemen, especially given the limits that SAS has imposed on the flexibilty of passing values to the function?"

Just like any other function in any language. Reusability, modular code. But I'm new to SAS so perhaps there are better solutions but it seems incredibly simple what I am asking.

Tom · Posted 11-20-2015 06:21 PM

Are you calculating something that the existing time series procedures cannot do?

Are you calculating something that PROC MEANS cannot calculate?

It looks to me like you are just calculating the MEDIAN and the MODE over the set of values that are a fixed number of observations before or after a value? Why not just output those as multiple records and let PROC MEANS do the work?

* Get some sample data ;
data sample;
  set sashelp.iris (keep=species petalwidth);
  rename petalwidth=price ;
run;

* Set macro variables so it will be easy to convert to macro ;
%let lag=3 ;
%let dsn=sample;
%let byvars=species;
%let var=price ;

First expand the data to add your lead and lag observations. You can do this as a data step view to save permanent disk space if you data is large.

data lead_lag / view=lead_lag ;
  keep rowno &var;
  do until (last.%scan(&byvars,-1)) ;
    set &dsn;
    by &byvars ;
    _stop+1;
    if first.species then _start=_stop ;
  end;
  do until (last.%scan(&byvars,-1)) ;
    set &dsn;
    by &byvars;
    rowno+1;
    do p=max(_start,rowno-&lag) to min(rowno+&lag,_stop);
      set &dsn point=p;
      output;
    end;
  end;
run;

Then calculate the statistic for each of the original rows.

proc summary data=lead_lag nway;
  by rowno ;
  var &var ;
  output out=want(drop=_type_ _freq_) mode=mode mean=mean median=median ;
run;

Then combine the new statistic columns back with the source data.

data want ;
  merge &dsn want ;
run;

Results look like this:

Obs    Species    price    rowno    mode      mean     median

  1    Setosa       2         1       2     3.00000      2.5
  2    Setosa       3         2       2     2.80000      2.0
  3    Setosa       2         3       2     2.66667      2.0
  4    Setosa       5         4       2     2.57143      2.0
  5    Setosa       2         5       2     2.42857      2.0
  6    Setosa       2         6       2     2.28571      2.0
  7    Setosa       2         7       2     2.85714      2.0
  8    Setosa       1         8       2     2.42857      2.0
  9    Setosa       2         9       2     2.42857      2.0
 10    Setosa       6        10       2     2.57143      2.0
 11    Setosa       2        11       2     2.57143      2.0

TedP · Posted 11-20-2015 06:46 PM

Very nice. Thats extremely good and very fast. Thanks a lot! I'll have to tweak just a bit to add my logic for when choosing the median but I won't have to use arrays now.

Reeza · Posted 11-20-2015 04:13 PM

The difference is that an Array in SAS is not an object. It's a shortcut reference to variables. The concept of data object don't exist at least within traditional SAS. If you go down IML and/or DS2 then that differs.

Its better to sometimes learn the best way to do things in a new language rather than to shoehorn a different method into the language.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!