DATA Step, Macro, Functions and more

using array, first., and last. to create only one observation and multiple variables per subject

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 11
Accepted Solution

using array, first., and last. to create only one observation and multiple variables per subject

Hi all!

I am having trouble using array, first., and last. to create only one observation and multiple variables per subject. The data set has 18,082 observations with 3 variables: ID_NO, SYMPTOM_NO, and SYMPTOM. I need to keep the id_no variable and lose the symptom_no and symptom variables yet create 5 variables called sympt1 (for heartburns), sympt2 (for sickness), sympt3 (for spasm), sympt4 (for temperature), and sympt5 (for tiredness), and one constant variable disease (which is always = 0)

Data set given with the 1st 15 obs:

id_no

symptom_no

symptom

1

1

Heartburns

1

2

Sickness

4

5

Tiredness

6

1

Heartburns

7

4

Temperature

8

1

Heartburns

8

5

Tiredness

9

2

Sickness

9

3

Spasm

9

4

Temperature

10

3

Spasm

10

5

Tiredness

11

3

Spasm

11

5

Tiredness

12

4

Temperature

12

5

Tiredness

13

3

Spasm

13

4

Temperature

14

3

Spasm

15

1

Heartburns

Goal set up:

Obs

id_no

sympt1

sympt2

sympt3

sympt4

sympt5

disease

1

1

Heartburns

Sickness

 

 

 

0

2

4

 

 

 

 

Tiredness

0

3

6

Heartburns

 

 

 

 

0

4

7

 

 

 

Temperature

 

0

5

8

Heartburns

 

 

 

Tiredness

0

6

9

 

Sickness

Spasm

Temperature

 

0

7

10

 

 

Spasm

 

Tiredness

0

8

11

 

 

Spasm

 

Tiredness

0

9

12

 

 

 

Temperature

Tiredness

0

10

13

 

 

Spasm

Temperature

 

0

11

14

 

 

Spasm

 

 

0

12

15

Heartburns

 

 

Temperature

 

0

13

16

 

 

 

 

Tiredness

0

14

17

Heartburns

 

 

 

 

0

15

19

 

Sickness

 

 

 

0

 

 

 

Here is the code i am using and the log error message.

****1.IMPORT;

options nodate nonumber;
%macro P3 (a, b, c);
proc import out= &a
datafile= "C:\HW5\&b"
dbms=xlsx replace;
getnames=yes;
run;
proc sort data=&a;
by &c;
run;
Proc print data = &a;
Run;
%mend P3;

%P3 (PROJECT3_F17, Project3.xlsx, id_no);
***2. USE ARRAY STATEMENT ALONG WITH FIRST.ID and LAST.ID to reorganize;
data sympt;
set PROJECT3_F17;
retain id_no sympt_no sympt1-sympt5;
length sympt1-sympt5 $20;
array symptoms (5) sympt1-sympt5;
by symptom_no;
do i=1 to 5;
if first.symptom_no then symptoms(i)=sympt1;
else symptoms (i)=.;

end;
drop symptom symptom_no;
if last.symptom_no then output;
disease=0;
format id_no sympt1-sympt5 disease;
run;
proc print data=sympt;
run;

 

LOG:

 

127055 options nodate nonumber;
127056 ****1.IMPORT;
127057 %macro P3 (a, b, c);
127058 proc import out= &a
127059 datafile= "C:\HW5\&b"
127060 dbms=xlsx replace;
127061 getnames=yes;
127062 run;
127063 proc sort data=&a;
127064 by &c;
127065 run;
127066 Proc print data = &a;
127067 Run;
127068 %mend P3;
127069
127070 %P3 (PROJECT3_F17, Project3.xlsx, id_no);

NOTE: The import data set has 18082 observations and 3 variables.
NOTE: WORK.PROJECT3_F17 data set was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.91 seconds
cpu time 0.79 seconds

 

NOTE: There were 18082 observations read from the data set WORK.PROJECT3_F17.
NOTE: The data set WORK.PROJECT3_F17 has 18082 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds

 

NOTE: There were 18082 observations read from the data set WORK.PROJECT3_F17.
NOTE: PROCEDURE PRINT used (Total process time):
real time 3.11 seconds
cpu time 2.73 seconds


127071 ***2.REORGANIZE variables: USE ARRAY STATEMENT ALONG WITH FIRST.ID and LAST.ID;
127072 data sympt;
127073 set PROJECT3_F17;
127074 retain id_no sympt_no sympt1-sympt5;
127075 length sympt1-sympt5 $20;
127076 array symptoms (5) sympt1-sympt5;
127077 *array v (5) symptom_no;
127078 by symptom_no;
127079 do i=1 to 5;
127080 if first.symptom_no then symptoms(i)=sympt1;
127081 *else symptoms (i) = .;
127082 else symptoms (i)=.;
127083
127084 end;
127085 drop symptom symptom_no;
127086 if last.symptom_no then output;
127087 *if symptoms (i) = Heartburns then symptoms (i)= sympt1;
127088 *if symptoms (i) = Sickness then symptoms (i)= sympt2;
127089 *if symptoms (i) = Spasm then symptoms (i)= sympt3;
127090 *if symptoms (i) = Temperature then symptoms (i)= sympt4;
127091 *if symptoms (i) = Tiredness then symptoms (i)= sympt5;
127092 *else symptoms (i) = .;
127093 *end;
127094 disease=0;
127095 format id_no sympt1-sympt5 disease;
127096 *data sympt2;
127097 *set sympt;
127098 run;

NOTE: Numeric values have been converted to character values at the places given by:
(Line)Smiley SadColumn).
127082:14
NOTE: Variable sympt_no is uninitialized.
ERROR: BY variables are not properly sorted on data set WORK.PROJECT3_F17.
id_no=4 symptom_no=5 symptom=Tiredness sympt1= sympt2= sympt3= sympt4= sympt5=
FIRST.symptom_no=1 LAST.symptom_no=1 i=. disease=. _ERROR_=1 _N_=3
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 4 observations read from the data set WORK.PROJECT3_F17.
WARNING: The data set WORK.SYMPT may be incomplete. When this step was stopped there were 2
observations and 8 variables.
WARNING: Data set WORK.SYMPT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.46 seconds
cpu time 0.09 seconds


127099 proc print data=sympt;
127100 run;

NOTE: No observations in data set WORK.SYMPT.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

 

Please help!


Accepted Solutions
Solution
‎02-10-2018 04:21 PM
PROC Star
Posts: 8,142

Re: using array, first., and last. to create only one observation and multiple variables per subject

Posted in reply to aespinarey

I'd do it as follows:

data sympt (drop=symptom_no symptom);
  set PROJECT3_F17;
  retain sympt1-sympt5;
  retain disease 0;
  length sympt1-sympt5 $15;
  array symptoms (5) $15. sympt1-sympt5;
  by id_no;
  if first.id_no then call missing(of symptoms(*));
  symptoms(symptom_no)=symptom;
  if last.id_no then output;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
PROC Star
Posts: 1,547

Re: using array, first., and last. to create only one observation and multiple variables per subject

Posted in reply to aespinarey

looking only at your sample, makes me want to do it this way:

data have;
input id_no	symptom_no	symptom :$15.;
datalines;
1	1	Heartburns
1	2	Sickness
4	5	Tiredness
6	1	Heartburns
7	4	Temperature
8	1	Heartburns
8	5	Tiredness
9	2	Sickness
9	3	Spasm
9	4	Temperature
10	3	Spasm
10	5	Tiredness
11	3	Spasm
11	5	Tiredness
12	4	Temperature
12	5	Tiredness
13	3	Spasm
13	4	Temperature
14	3	Spasm
15	1	Heartburns
;

proc transpose data=have out=want(drop=_name_) ;
by id_no;
var symptom;
id symptom;
run;

data final_want;
set want;
disease=0;
run;

PROC Star
Posts: 502

Re: using array, first., and last. to create only one observation and multiple variables per subject

[ Edited ]
Posted in reply to aespinarey

if you have few variables then one way to do this is 

proc sql;
create table want as
select id_no,
max(case when symptom_no = 1 then symptom end) as sympt1,
max(case when symptom_no = 2 then symptom end) as sympt2,
max(case when symptom_no = 3 then symptom end) as sympt3,
max(case when symptom_no = 4 then symptom end) as sympt4,
max(case when symptom_no = 5 then symptom end) as sympt5,
"0" as Disease
from abc
group by id_no;
Solution
‎02-10-2018 04:21 PM
PROC Star
Posts: 8,142

Re: using array, first., and last. to create only one observation and multiple variables per subject

Posted in reply to aespinarey

I'd do it as follows:

data sympt (drop=symptom_no symptom);
  set PROJECT3_F17;
  retain sympt1-sympt5;
  retain disease 0;
  length sympt1-sympt5 $15;
  array symptoms (5) $15. sympt1-sympt5;
  by id_no;
  if first.id_no then call missing(of symptoms(*));
  symptoms(symptom_no)=symptom;
  if last.id_no then output;
run;

Art, CEO, AnalystFinder.com

 

PROC Star
Posts: 1,547

Re: using array, first., and last. to create only one observation and multiple variables per subject

Good afternoon @art297 I did think of your approach however, should the number of symptoms changes, you are gonna have somebody in prod support team to change to the array subsript right? or perhaps

 

would you take a max(count distinct symptoms) into array_num and call it as array subscript to make it dynamic?

 

This is the only concern. Not a big deal but just in case

Occasional Contributor
Posts: 11

Re: using array, first., and last. to create only one observation and multiple variables per subject

Thank you! it worked perfectly! 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 205 views
  • 4 likes
  • 4 in conversation