Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- PROC FCMP for removing missing values in array and...

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 02:28 PM

Hi everyone,

First post. Just began learning SAS about a couple weeks ago and am finding some things unbelievably infuriating. I'm used to stackoverflow handing me answers on a platter but this doesn't seem to be the case with SAS so here I am....

My requirements are very straightforward. I am writing a user defined functions using PROC FCMP, passing it an array. I want to create a new array without missing values and calculate a median. I realize that I don't need a UDF for this but this answer will generalize to many other things I'd like to do. I'd also like to know how to have a function variable assigned to a macro variable.

Here is my current function.

proc fcmp outlib=work.functions.sample; function remove_missing(t[*]); array n[100] /nosymbols; array n_new[100] / nosymbols; do i = 1 to dim(t); n[i] = t[i]; end; num_missing = cmiss(of n[*]); kept = 100 - num_missing; call dynamic_array(n_new, kept); olddim = dim(n); newdim = dim(n_new); put olddim= newdim=; do i = 1 to dim(n_new); n_new[i]=n[i]; end; m = median(of n_new[*]); *error here. Index out of bounds.; return(m); endsub; run; quit;

And Data step

options cmplib=work.functions; data _null_; array prices[*] t1-t55; do i = 1 to 10; prices[i] = i*2; end; med = remove_missing(prices); run;

After dynamically altering the new array, running **median(of n_new[*])** attempts to use all 100 of the previously defined array than the 10 for the new dimension. And the reason I must make a new array inside my function is that the OF operator needs a contiguous (or temporary) array for it to work.

Also, I would absolutely love to know to the following in a function. Have a macro variable assigned to a function variable. %SYSFUNC somehow doesn not work with the **dim **function.

variable = dim(array); %let macrovar = variable;

Accepted Solutions

Solution

11-20-2015
06:35 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 06:21 PM - edited 11-20-2015 06:30 PM

Are you calculating something that the existing time series procedures cannot do?

Are you calculating something that PROC MEANS cannot calculate?

It looks to me like you are just calculating the MEDIAN and the MODE over the set of values that are a fixed number of observations before or after a value? Why not just output those as multiple records and let PROC MEANS do the work?

```
* Get some sample data ;
data sample;
set sashelp.iris (keep=species petalwidth);
rename petalwidth=price ;
run;
* Set macro variables so it will be easy to convert to macro ;
%let lag=3 ;
%let dsn=sample;
%let byvars=species;
%let var=price ;
```

First expand the data to add your lead and lag observations. You can do this as a data step view to save permanent disk space if you data is large.

```
data lead_lag / view=lead_lag ;
keep rowno &var;
do until (last.%scan(&byvars,-1)) ;
set &dsn;
by &byvars ;
_stop+1;
if first.species then _start=_stop ;
end;
do until (last.%scan(&byvars,-1)) ;
set &dsn;
by &byvars;
rowno+1;
do p=max(_start,rowno-&lag) to min(rowno+&lag,_stop);
set &dsn point=p;
output;
end;
end;
run;
```

Then calculate the statistic for each of the original rows.

```
proc summary data=lead_lag nway;
by rowno ;
var &var ;
output out=want(drop=_type_ _freq_) mode=mode mean=mean median=median ;
run;
```

Then combine the new statistic columns back with the source data.

```
data want ;
merge &dsn want ;
run;
```

Results look like this:

```
Obs Species price rowno mode mean median
1 Setosa 2 1 2 3.00000 2.5
2 Setosa 3 2 2 2.80000 2.0
3 Setosa 2 3 2 2.66667 2.0
4 Setosa 5 4 2 2.57143 2.0
5 Setosa 2 5 2 2.42857 2.0
6 Setosa 2 6 2 2.28571 2.0
7 Setosa 2 7 2 2.85714 2.0
8 Setosa 1 8 2 2.42857 2.0
9 Setosa 2 9 2 2.42857 2.0
10 Setosa 6 10 2 2.57143 2.0
11 Setosa 2 11 2 2.57143 2.0
```

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 02:42 PM

What is the purpose of the proposed function?

What does it mean to remove the missing values?

How is the median going to be any different once you have removed the missing values?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 02:58 PM

The above problem is just a dummy version of a much larger function where I compute the mode of an array of numbers using a hash table. If there is no mode (each number appars once) or the frequency of the most common number of the array is less than 30% of the length of the data, then the function returns the median. Else it returns the mode.

My main issue is that I cannot compute **median(of array[*])** and **cmiss(of array[*])** in the udf because for whatever reason **proc fcmp** needs a contiguous or temporary array. My current solution is to compute the median and missing count outside of the udf and just pass those values as parameters into my function.

Here is my real function

proc fcmp outlib=work.functions.sample; function find_mode(prices[*], num_missing, the_median); length price 8; length count 8; declare hash hash_initial(); rc = hash_initial.defineKey('price'); rc = hash_initial.defineData('price', 'count'); rc = hash_initial.defineDone(); declare hash hash_ordered(ordered:'d'); rc = hash_ordered.defineKey('count'); rc = hash_ordered.defineData('price', 'count'); rc = hash_ordered.defineDone(); declare hiter hash_initial_iter('hash_initial'); declare hiter hash_ordered_iter('hash_ordered'); do i = 1 to dim(prices); price = prices[i]; rc = hash_initial.find(); if rc ^= 0 then do; if price = . then count = 0; else count = 1; rc = hash_initial.add(); end; else do; if price ^= . then count = count + 1; rc = hash_initial.replace(); end; end; rc = hash_initial_iter.first(); do while (rc=0); rc = hash_ordered.replace(); rc=hash_initial_iter.next(); end; rc = hash_ordered_iter.first(); num_non_missing = dim(prices) - num_missing; if num_non_missing > 0 then do; if count = 1 or count / num_non_missing < .3 then price = the_median; end; else price = .; rc = hash_initial_iter.delete(); rc = hash_ordered_iter.delete(); rc = hash_initial.clear(); rc = hash_ordered.clear(); return(price); endsub; run; quit;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 03:19 PM - edited 11-20-2015 03:31 PM

Consider transposing your data, using BY processing within a PROC MEANS instead.

PROC STDIZE can be used for filling in missing values.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 03:38 PM

Reeza, I'd rather not get derailed going down a path other than the answer to my question. It seems there should be a simple solution. Pass an array to a function. Make it a contiguous array. Compute summary statistics using the **of **operator. What also would work is **%let arraydim = %sysfunc(dim(oldarray)); newarray array[&arraydim]; **But there are two issues. **%sysfunc** doesn't allow **dim ** and even if it did it wouldn't be able calculate **dim(oldarray)** because **oldarray **is not a macro variable.

And to give more background. I am taking the leading and lagging 7 values by group during the datastep and stuffing them into an array to calculate the mode so proc means wouldn't work here. Here is the code that use my my main udf **find_mode**.

It works fine the way it is. I just want to move the **num_missing **and **the_median **statements into my udf, which should be very easy...but I can't figure it out.

%macro short_mode(dataset, new_dataset, price_col, output_col, lastcol, num_lags); options cmplib=sp.functions; data &new_dataset; %let lastcol1 = %sysfunc(catx(,&lastcol,1)); set &dataset; by geomarket segment CURR_OPER_NAME HL_LL_Desc &lastcol; %do i = 1 %to &num_lags; %let obs = %eval(&i + 1); if eof&i=0 then set &dataset (firstobs=&obs keep=geomarket segment CURR_OPER_NAME HL_LL_Desc &lastcol &price_col rename=(&price_col=lead&i HL_LL_Desc=HL_LL_Desc1 CURR_OPER_NAME=CURR_OPER_NAME1 segment=segment1 geomarket=geomarket1 &lastcol=&lastcol1)) end=eof&i; if HL_LL_Desc1 ^= HL_LL_Desc or CURR_OPER_NAME1 ^= CURR_OPER_NAME or segment1 ^= segment or geomarket1 ^= geomarket or &lastcol1 ^= &lastcol then lead&i=.; lag&i = lag&i(&price_col); if lag&i(geomarket) ^= geomarket or lag&i(segment) ^= segment or lag&i(CURR_OPER_NAME) ^= CURR_OPER_NAME or lag&i(HL_LL_Desc) ^= HL_LL_Desc or lag&i(&lastcol) ^= &lastcol then lag&i=.; %end; array prices[*] lead1-lead&num_lags &price_col lag1-lag&num_lags; %let array_dim = %eval(&num_lags * 2 + 1); array prices_temp(&array_dim) _temporary_; do i = 1 to dim(prices); prices_temp[i]=prices[i]; end; num_missing = cmiss(of prices_temp[*]); the_median = median(of prices_temp[*]); mode = find_mode(prices_temp, num_missing, the_median); &keepdrop; rename mode = &output_col; run; %mend; %let keepdrop = drop i rc num_non_missing lag1-lag7 lead1-lead7 count num_missing the_median; %short_mode(ftl_rolling, ftl_rolling_mode, Actual_Price_per_Unit, RollingModeCat, UOM_PPU_Type, 7);

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 03:47 PM

Why do you need to calculate the mode using a function instead of running the code in the data step?

Couldn't you generate the needed code using a macro if it was too cumberson to re-type?

What value is being added by the function that out ways its added complexity to implemen, especially given the limits that SAS has imposed on the flexibilty of passing values to the function?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 03:58 PM - edited 11-20-2015 03:59 PM

Just to super clarify. My code works well and runs quite fast. Even on a million rows of data it only takes a minute to calculate a rolling mode and its doing an absurd amount of calculations. I just want to tidy it up a bit more with something that is dead easy in other languages.

"*Why do you need to calculate the mode using a function instead of running the code in the data step?* "

A udf is necessary (unless you can otherwise tell me so) to return a single value. I don't know how to declare a hash table in a macro and don't know how to return a single value from a macro or from a data step. A function (in my mind and just like all other languages I've ever programmed in) takes in values, does a calculation and spits out a single value.

"*Couldn't you generate the needed code using a macro if it was too cumberson to re-type?*"

I wish I knew how to declare hash tables in macros and return a single value. I thought this was what functions were for.

"*What value is being added by the function that out ways its added complexity to implemen, especially given the limits that SAS has imposed on the flexibilty of passing values to the function?*"

Just like any other function in any language. Reusability, modular code. But I'm new to SAS so perhaps there are better solutions but it seems incredibly simple what I am asking.

Solution

11-20-2015
06:35 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 06:21 PM - edited 11-20-2015 06:30 PM

Are you calculating something that the existing time series procedures cannot do?

Are you calculating something that PROC MEANS cannot calculate?

It looks to me like you are just calculating the MEDIAN and the MODE over the set of values that are a fixed number of observations before or after a value? Why not just output those as multiple records and let PROC MEANS do the work?

```
* Get some sample data ;
data sample;
set sashelp.iris (keep=species petalwidth);
rename petalwidth=price ;
run;
* Set macro variables so it will be easy to convert to macro ;
%let lag=3 ;
%let dsn=sample;
%let byvars=species;
%let var=price ;
```

First expand the data to add your lead and lag observations. You can do this as a data step view to save permanent disk space if you data is large.

```
data lead_lag / view=lead_lag ;
keep rowno &var;
do until (last.%scan(&byvars,-1)) ;
set &dsn;
by &byvars ;
_stop+1;
if first.species then _start=_stop ;
end;
do until (last.%scan(&byvars,-1)) ;
set &dsn;
by &byvars;
rowno+1;
do p=max(_start,rowno-&lag) to min(rowno+&lag,_stop);
set &dsn point=p;
output;
end;
end;
run;
```

Then calculate the statistic for each of the original rows.

```
proc summary data=lead_lag nway;
by rowno ;
var &var ;
output out=want(drop=_type_ _freq_) mode=mode mean=mean median=median ;
run;
```

Then combine the new statistic columns back with the source data.

```
data want ;
merge &dsn want ;
run;
```

Results look like this:

```
Obs Species price rowno mode mean median
1 Setosa 2 1 2 3.00000 2.5
2 Setosa 3 2 2 2.80000 2.0
3 Setosa 2 3 2 2.66667 2.0
4 Setosa 5 4 2 2.57143 2.0
5 Setosa 2 5 2 2.42857 2.0
6 Setosa 2 6 2 2.28571 2.0
7 Setosa 2 7 2 2.85714 2.0
8 Setosa 1 8 2 2.42857 2.0
9 Setosa 2 9 2 2.42857 2.0
10 Setosa 6 10 2 2.57143 2.0
11 Setosa 2 11 2 2.57143 2.0
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 06:46 PM - edited 11-20-2015 06:48 PM

Very nice. Thats extremely good and very fast. Thanks a lot! I'll have to tweak just a bit to add my logic for when choosing the median but I won't have to use arrays now.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-20-2015 04:13 PM

The difference is that an Array in SAS is not an object. It's a shortcut reference to variables. The concept of data object don't exist at least within traditional SAS. If you go down IML and/or DS2 then that differs.

Its better to sometimes learn the best way to do things in a new language rather than to shoehorn a different method into the language.

Its better to sometimes learn the best way to do things in a new language rather than to shoehorn a different method into the language.