Solved: Why does proc format create duplicates?

Emjay · Posted 10-08-2019 06:55 AM

I want to use proc format to transform my numerical variable into a range variable. This works fine.

But if I want to do some further analysis with this new range variable I have the problem that proc format has created duplicates but I don't see why.

So let's say I have the following:

data have; 
input ID KBZ; 
datalines;
1 0.25
2 0.33
3 1.2
4 5.8
5 3.2
6 3.8
7 3.44
8 4.7
;
run;

proc format;
value KBZrange (FUZZ=0)
low-1 = "<=1" 
1<-2 = "1-2" 
2<-3 = "2-3"
3<-4 = "3-4"
4<-5 = "4-5"
5<-high = ">=5";
quit;

data have;
set have;
format KBZ_range KBZrange.;
KBZ_range=KBZ;
run;

proc sql;
create table why_dup as
select distinct KBZ_range from have;
quit;

Why are there now this extra rows although they look the same to me?

PeterClemmensen · Posted 10-08-2019 07:08 AM

No. PROC SQL considers the actual values of KBZ_range. Not the formated values of KBZ_range. Right now, KBZ_range is exactly equal to KBZ. They just look different because you gave it a different format.

If you do like below however, then you give KBZ_range the values of the ranges you specify in PROC SORT (the actual strings), and you get 5 obs

data have;
set have;
KBZ_range=put(KBZ, KBZrange.);
run;

proc sql;
create table why_dup as
select distinct KBZ_range from have;
quit;

Result:

KBZ_Range
1-2
3-4
4-5
<=1
>=5

Makes sense?

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 10-08-2019 06:58 AM

I don't see any extra rows? Your initial data has 8 rows and so does why_dup?

The DATA to DATA Step Macro
Blog: SASnrd

Emjay · Posted 10-08-2019 07:01 AM

Yes it has 8 rows, but shouldn't it have only 5 as I wrote select distinct?

PeterClemmensen · Posted 10-08-2019 07:08 AM

No. PROC SQL considers the actual values of KBZ_range. Not the formated values of KBZ_range. Right now, KBZ_range is exactly equal to KBZ. They just look different because you gave it a different format.

If you do like below however, then you give KBZ_range the values of the ranges you specify in PROC SORT (the actual strings), and you get 5 obs

data have;
set have;
KBZ_range=put(KBZ, KBZrange.);
run;

proc sql;
create table why_dup as
select distinct KBZ_range from have;
quit;

Result:

KBZ_Range
1-2
3-4
4-5
<=1
>=5

Makes sense?

The DATA to DATA Step Macro
Blog: SASnrd

Emjay · Posted 10-08-2019 07:16 AM

Yes, makes sense. 🙂

Now I also understand why KBZ_range still had the type numeric and this blue circle next to it. I thought this was because sas didn't know how to display my custom format.

Thanks!

PeterClemmensen · Posted 10-08-2019 07:17 AM

Spot on! SAS knows exactly how to display your custom format 🙂

Anytime, glad you found your answer.

The DATA to DATA Step Macro
Blog: SASnrd

ballardw · Posted 10-08-2019 11:25 AM

@Emjay wrote:

Yes, makes sense. 🙂

Now I also understand why KBZ_range still had the type numeric and this blue circle next to it. I thought this was because sas didn't know how to display my custom format.

Thanks!

Consider:

proc summary data=have ;
   class kbz;
   format kbz kbzrange.;
   output out=work.summary  ;
run;

You will find the groups created by the format are honored by other procedures, though you might be surprised sometimes what the actual value (unformatted) kept with Proc Summary output.

Emjay · Posted 10-08-2019 11:32 AM

You are right! And that's exactly why I was so confused. Cause I used my original approach with a proc freq a while ago and there it also worked fine... But now I know. 🙂

ballardw · Posted 10-08-2019 03:41 PM

@Emjay wrote:

You are right! And that's exactly why I was so confused. Cause I used my original approach with a proc freq a while ago and there it also worked fine... But now I know. 🙂

I use formats to validate some long lists of values and sometime use things like

if put(variable, validformatname.) = 'Valid' then <what I wanted to do>;
Else do <whatever when the value isn't on the valid list>;

You should be able to use:

proc sql;
create table why_dup as
select distinct put(KBZ,KBZrange. ) as KBZ_range from have;
quit;

Emjay · Posted 10-09-2019 03:17 AM

I've seen this format approach in if-statements but I never understood the benefit.

What would be the reason to rather write:

proc format;
value KBZ2range (FUZZ=0)
low-1 = "valid" 
1<-2 = "nv" 
2<-3 = "valid"
3<-4 = "nv"
4<-5 = "valid"
5<-high = "nv";
quit;

data foo;
set have;
if put(KBZ,KBZ2range.)='valid' then check='yes'; 
else check='no';
run;

Instead of:

data foo2;
set have;
if (KBZ<=1 or 2<KBZ<=3 or 4<KBZ<=5) then check='yes'; 
else check='no';
run;

A lot of extra typing...

Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Re: Why does proc format create duplicates?

Classroom Training Available!