Help using Base SAS procedures

Result issue with data step

Reply
N/A
Posts: 0

Result issue with data step

Hello All

I am trying to transpose the data mentioned in the datalines statement below using the array concept.

data work.temp;
input Org_ID $ Value;
Datalines;
1000 0.22
1000 0.12
1000 0.11
1100 0.14
1100 0.34
1100 0.21
;
run;

Data work.new (keep = Org_ID Value1-Value3 );
Array Values {3} Value1-Value3;
DO I=1 TO 3;
SET WORK.TEMP;
RETAIN ORG_ID;
VALUES{I} = VALUE;
end;
RUN;

How ever the resultant dataset "New" is printing the Org_ID in the last column versus the first. I am not sure why this is happening.

This is how it looks:
Value 1 Value 2 Value 3 Org_ID
0.22 0.21 0.11 1000
0.14 0.34 0.21 1100

Could anyone help me with this. Appreciate your time and help.

Pappu
Contributor
Posts: 29

Re: Result issue with data step

Try moving
set work.temp;
to right after
Data work.new (keep = Org_ID Value1-Value3 );
like I've done below. That's the only change I made to your code.



data work.temp;
input Org_ID $ Value;
Datalines;
1000 0.22
1000 0.12
1000 0.11
1100 0.14
1100 0.34
1100 0.21
;
run;

Data work.new (keep = Org_ID Value1-Value3 );
SET WORK.TEMP;
Array Values {3} Value1-Value3;
DO I=1 TO 3;
RETAIN ORG_ID;
VALUES{I} = VALUE;
end;
RUN;
N/A
Posts: 0

Re: Result issue with data step

Hello Gary

Thank you for the correction. It sure works now.

Pappu.
SAS Super FREQ
Posts: 8,743

Re: Result issue with data step

Hi:
When SAS creates WORK.NEW, it needs to build something called the "descriptor portion" for the NEW dataset. Therefore, at compile time (of the code) SAS scans your program and builds this descriptor portion using a buffer area to hold the column names and values for each observation. The column names come from any references, assignment statements or SET/MERGE datasets in your entire program. At compile time, the buffer area just holds the column names. Then at execution time, the buffer area holds the observation information being read in and at the end of the data step SAS writes the "keep" variables to the OUTPUT dataset (in this case, WORK.NEW).

So, by scanning through your program (from top to bottom) SAS builds a buffer area that looks "sort of" like this:
[pre]
value1 value2 value3 i org_id value _n_ _error_
[/pre]

The first references to variables that SAS finds are the variables in the ARRAY statement (the reference in the KEEP statement isn't used until SAS has a complete list of ALL the variables). Even though the KEEP statement -appears- first in your code, it is not -used- first to make the list of variables in the buffer area (technically called the "Program Data Vector" or PDV).

VALUE1, VALUE2, VALUE3 are put in the buffer from the ARRAY statement. Next, the DO loop is encountered and I is added to the buffer. Then the SET statement is encountered and Org_ID and VALUE are added in whatever order they appear in WORK.TEMP, Then, _N_ and _ERROR_ are added to the buffer area -- they are internal variables that SAS always uses for DATA step programs.

Next, the Drop or Keep statements or options are applied to this complete list of variables:
[pre]
value1 value2 value3 i org_id value _n_ _error_
K K K D K D D D
[/pre]

K means Keep and D means Drop. SAS has to have the whole column list before it can keep track of the DROP/KEEPs for the variables. So now, this is the order that the variables have been created: Value1, Value2, Value3, Org_ID. That's why when you do a default PROC PRINT, you get the variables in this order. You can control that with a VAR statement inside the PROC PRINT.

Your program assumes that there will ALWAYS be 3 observations for every Org_ID -- if an Org_ID only has 2 observations, then your SET statement will start to become "off" and continue to be "off" for the rest of the data. I rarely code a SET statement inside a DO loop of this form -- there are times to use the technique, but this would not be the technique I'd use for this problem.

This could be a more robust program if you investigated some other ways to process the data...such as using the SET statement outside the DO loop or using PROC TRANSPOSE.

Both of those methods are illustrated below with some new data that contains one Org_ID (1105) that only has 2 observations and another Org_ID that has only 1 observation (1106) using BY group processing -- either with the DATA step program or with PROC TRANSPOSE will ensure that every Org_ID by group stays together for processing.

cynthia

[pre]

data work.temp;
infile datalines;
input Org_ID $ Value;
return;
Datalines;
1000 0.22
1000 0.12
1000 0.11
1100 0.14
1100 0.34
1100 0.21
1105 0.33
1105 0.44
1106 0.66
1200 0.11
1200 0.22
1200 0.33
;
run;

ods listing;
Data work.new2 (keep = Org_ID Value1-Value3 numval);
** read work.temp and use by group processing;
set work.temp;
by Org_ID;
** retain value1, value2, value3;
** and declare them in an ARRAY statement;
retain value1 value2 value3 i;
Array Values {3} Value1-Value3;

** for every "new" Org_ID, initialize values to missing;
** reset i to 0;
if first.Org_ID then do;
value1=.; value2=.; value3=.;
i = 0;
end;
** increment i for every observation;
i + 1;

** assign the array member a value, based on I;
VALUES{I} = VALUE;

** by the time the last Org_ID is read, the array is full;
** create a var to hold the number of values and output;
** the new observation.;
if last.Org_ID then do;
numval = i;
output;
end;
RUN;

proc print data=work.new2;
title '1) with array and set outside do loop';
run;

proc transpose data=temp out=new_trans;
by Org_ID;
var value;
run;

proc print data=new_trans;
title '2) with transpose';
run;
[/pre]
N/A
Posts: 0

Re: Result issue with data step

Hello Cynthia

This is great. I do have a quick extension to the question/ your program. The example from you and the transpose works assuming that the missing values are the 2nd and the 3rd. What if my dataset had another column which defines what the 3 values are and now how does the program change to account for a missing value for the first activity instead of 2nd ro 3rd.

Again thanks soo much for all the explaination.

data work.temp;
infile datalines;
input org_id $ activity $ value;
return;
datalines;
1000 WF_BP 0.22
1000 WF_ISF 0.12
1000 WF_WP 0.11
1100 WF_BP 0.14
1100 WF_ISF 0.34
1100 WF_WP 0.21
1105 WF_BP 0.33
1105 WF_WP 0.44
1106 WF_WP 0.66
1200 WF_BP 0.11
1200 WF_ISF 0.22
1200 WF_WP 0.33
;
run;

Thanks
Pappu.
SAS Super FREQ
Posts: 8,743

Re: Result issue with data step

Hi:
The change is fairly simple. Although you -can- have numbered variables created or treated as array members, you can also have unnumbered variables treated as though they were in an array. For example, these would all be OK ways to create array references:
[pre]
array value{3} WF_BP WF_ISF WF_WP;
array other{3} o1 o2 o3;
array tvchar{4} $ lucy ricky fred ethel;
[/pre]

So in the above example, VALUE{1} would point to WF_BP, VALUE{2} would point to WF_ISF and VALUE{3} would point to WF_WP....just like OTHER{1} points to the variable O1 and TVCHAR{1} point to the variable LUCY.

Given your data example, if the variable ACTIVITY holds the name of the variable you want to assign and the variable VALUE holds the value for that variable, then one way to change my earlier example would be as shown below.

With this method, the index for the array comes from the value of the Activity variable. So, if any of the 3 activities were missing in the data, that array member would have a missing value -- the array index in this case is not automatically incrementing for each member of the Org_ID by group -- the array index (AIND) is created based on the value of Activity.

There are other ways of processing, you could have made value1-value3 and varname1-varname3, but that seemed overly complicated and unnecessary. Between these 2 programs and the others you've received, you'll just have to experiment and run tests on your data until you find the right combination of programming logic to give you the end result you desire.

cynthia

[pre]
ods listing;
Data work.new3 (keep = Org_ID WF_BP WF_ISF WF_WP);
** read work.temp and use by group processing;
set work.temp;
by Org_ID;
** retain WF_BP WF_ISF WF_WP;
** and declare them in an ARRAY statement;
retain WF_BP WF_ISF WF_WP;
Array Values {3} WF_BP WF_ISF WF_WP;

** for every "new" Org_ID, initialize values to missing;
** reset i to 0;
if first.Org_ID then do;
WF_BP=.; WF_ISF=.; WF_WP=.;
end;

** set AIND (index for array) based on Activity value;
** if activity value is not found, then new variable;
** will be missing in final data set.;
** Could use a format for this part.;

if Activity = 'WF_BP' then AIND = 1;
else if Activity = 'WF_ISF' then AIND = 2;
else if Activity = 'WF_WP' then AIND = 3;
else aind = .;

** assign the array member a value, based on Activity -- only if AIND is 1,2 3;
** if AIND is missing (.), then the corresponding array member will be missing;
if aind in (1,2,3) then VALUES{AIND} = VALUE;

** by the time the last Org_ID is read, the array is full;
** create a var to hold the number of values and output;
** the new observation.;
if last.Org_ID then do;
output;
end;
RUN;

proc print data=work.new3;
title '1) with different array';
run;
[/pre]

Results (using your data as an example):
[pre]
Obs org_id WF_BP WF_ISF WF_WP

1 1000 0.22 0.12 0.11
2 1100 0.14 0.34 0.21
3 1105 0.33 . 0.44
4 1106 . . 0.66
5 1200 0.11 0.22 0.33
[/pre]
N/A
Posts: 0

Re: Result issue with data step

Hello Cynthia

Thank you for the detailed information. I do have another question for you.

I used the code you created and added some variables to it. Shown below:

Data facall09.temp ;
set facall09.testallvars;
run;

proc sort data=facall09.temp;
by org_id activity;
run;

ods listing;
Data facall09.CLUSTERINPUT (drop = workarea activity waterfall aind);
** read facall09.sas_input_09_data and use by group processing;
set facall09.temp;
by Org_ID;
** retain WF_BP WF_ISF WF_WP;
** and declare them in an ARRAY statement;
retain WF_BP WF_ISF WF_SF WF_TTR WF_TT WF_WP;
Array Values {6} WF_BP WF_ISF WF_SF WF_TTR WF_TT WF_WP;
** Assign formats to be treated as numeric values;
format WF_BP 3.2 WF_ISF 3.2 WF_SF 3.2 WF_TTR 3.2 WF_TT 3.2 WF_WP 3.2;
** for every "new" Org_ID, initialize values to missing;
** reset i to 0;
if first.Org_ID then do;
WF_BP=.;WF_ISF=.; WF_SF =.;WF_TTR = .; WF_TT =.;WF_WP=.;
end;

** set AIND (index for array) based on Activity value;
** if activity value is not found, then new variable;
** will be missing in final data set.;
** Could use a format for this part.;

if Activity = WF_BP then AIND = 1;
else if Activity = WF_ISF then AIND = 2;
else if Activity = WF_SF then AIND = 3;
ELSE IF ACTIVITY = WF_TTR THEN AIND = 4;
else if Activity = WF_TT then AIND = 5;
else if Activity = WF_WP then AIND = 6;
else aind = .;

** assign the array member a value, based on Activity -- only if;
** AIND is 1,2 3,4,5;
** if AIND is missing (.), then the corresponding array;
** member will be missing;
if aind in (1,2,3,4,5,6) then Values{AIND} = Waterfall;

** by the time the last Org_ID is read, the array is full;
** create a var to hold the number of values and output;
** the new observation.;
if last.Org_ID then do;
output;
end;
RUN;

Now there are 2 issues with it:
1. Here is my dataset.
- There are 6 groups of Activity for each org_id such as WF_BP, WF_ISF, WF_SF, WF_TTR, WF_TT, WF_WP
- I sorted the data based on activity.
-As I had mentioned earlier the issue about missing values is still continuing. Org_ID # 1000 does not have a value for WF_ISF and so the value of WF_SF is being assigned to it and the WF_WP is receiving a missing value.


Org_ID Activity Waterfall
1000 WF_BP 0.27
1000 WF_SF 0.38
1000 WF_TT 3.1
1000 WF_TTR 2.16
1000 WF_WP 0.11
1100 WF_BP 0.24
1100 WF_SF 0.27
1100 WF_SF 0.31
1100 WF_TT 5.3
1100 WF_TTR 3.6
1200 WF_BP 0.19
1200 WF_ISF 0.23
1200 WF_ISF 0.28
1200 WF_SF 0.31
1200 WF_WP 0.06

2. In the log I am getting the following notes / errors:

NOTE: Character values have been converted to numeric values at the places given by: (Line)Smiley SadColumn).
383:8 384:13 385:13 386:13 387:13 388:13
NOTE: Invalid numeric data, Activity='WF_BP' , at line 383 column 8.

NOTE: Invalid numeric data, Activity='WF_SF' , at line 383 column 8.
and so on for all the values for each activity.

Could you suggest why this is happening and a solution to it.

Thanks a lot for all your time in advance.
Pappu.
SAS Super FREQ
Posts: 8,743

Re: Result issue with data step

Hi:
A careful comparison of these 2 IF statements may reveal part of the problem:
[pre]
My IF statement:

if Activity = 'WF_BP' then AIND = 1;
else if Activity = 'WF_ISF' then AIND = 2;
else if Activity = 'WF_WP' then AIND = 3;
else aind = .;
***** ***** ***** ***** ***** *****
Your IF statement:

if Activity = WF_BP then AIND = 1;
else if Activity = WF_ISF then AIND = 2;
else if Activity = WF_SF then AIND = 3;
ELSE IF ACTIVITY = WF_TTR THEN AIND = 4;
else if Activity = WF_TT then AIND = 5;
else if Activity = WF_WP then AIND = 6;
else aind = .;
[/pre]

My IF statement is asking whether the value of ACTIVITY is a character string 'WF_BP' and if the comparison is true, then AIND is sent to 1. Your IF statement is asking whether the value of ACTIVITY (a character variable) is equal to the value of the variable WF_BP (a numeric variable). The character string "WF_BP" is the value of a character variable, ACTIVITY. In my program "WF_BP" is different from WF_BP -- which is a reference to the numeric variable that you WANT to create. In your program, you're asking SAS to compare the value in a character variable to the value in a numeric variable.

Although the quotes in my comparison may not have seemed important or significant, they are allowing the index for the array to be set correctly. At this point (when you're in the IF logic), you have not set any values for WF_BP (the variable) yet. So then, SAS sees that the value of ACTIVITY is a character string 'WF_BP' and it's confused because you have tried to compare the value of a character variable to the value of a numeric variable.

Why did you take the quotes out of the comparison? The whole purpose of the IF was to see the value of the character string -inside- the ACTIVITY variable and set the array index accordingly. Since the ACTIVITY variable is a character variable, it MUST be compared to a character string....if you want the logic to work. The logic of the program will ONLY work if the AIND variable is set correctly, with a bad IF condition, AIND will -never- be set correctly.

If you don't understand or don't want to use an ARRAY, you would have had to hard-code every assignment statement....something like this. However, note that the condition still has to compare ACTIVITY to a character string:
[pre]
if Activity = 'WF_BP' then do;
WF_BP = waterfall;
end;
else if else if Activity = 'WF_ISF' then do;
WF_ISF = waterfall;
end;
. . . more conditions . . . ;
[/pre]


When you start with a program that works and you turn it into a program that doesn't work, you have to ask yourself "what changed" -- and then to paraphrase Sherlock Holmes (or A. C. Doyle) -- whatever changed, however insignificant or improbable it seems, must be causing or contributing to the problem.

cynthia
Ask a Question
Discussion stats
  • 7 replies
  • 163 views
  • 0 likes
  • 3 in conversation