Solved: How do I set the first occurrence value to missing if no first occurre...

itmightbeharry · Posted 12-13-2020 01:23 PM

Hello everyone:

I have a data set for which I have created a variable "onset" set equal to the time of first occurrence by id. I'm mostly satisfied with the output using PROC PRINT and the PROC MEANS output is exactly what I'm looking for. The problem I've encountered is when merging this data with other data. There is no value for onset for id # 1, which has no occurrence of "fever"; however, when merging with other data, each onset value moves up one, with id #10 missing. I would like to have "." for onset for id # 1 in the merged data. Any help or suggestions would be very much appreciated! I have done my best to copy my code and output below, along with the desired output at the very bottom.

title 'Summary of Onset of Fever by Group';
data work.prob3e;
  set work.fever;
  by id;
  where fever=1;
  if first.id;
  onset=time;
  label onset='day of first fever';
run;
proc print data=work.prob3e;
run;
proc means data=work.prob3e maxdec=1 nonobs n mean std min median max;
  class group;
  var onset;
run;
title 'Listing of New Variables';
data work.prob3g;
  set work.prob3b;
  set work.prob3c;
  set work.duration;
  set work.prob3e;
  set work.prob3f;
  by id;
run;
proc print data=work.prob3g;
id id;
run;

Here the first observation is id 2. Id 1 has no occurrence of fever:

Summary of Onset of Fever by Group

Obs	temp	id	time	group	fever	onset
1	101.626	2	4	1	1	4
2	100.728	3	3	1	1	3
3	100.850	4	9	1	1	9
4	101.173	5	9	1	1	9
5	101.163	6	1	2	1	1
6	101.730	7	1	2	1	1
7	101.314	8	1	2	1	1
8	103.964	9	1	2	1	1
9	101.370	10	3	2	1	3

All of this is correct:

The MEANS Procedure

Analysis Variable : onset day of first fever
treatment group	N	Mean	Std Dev	Minimum	Median	Maximum
1	4	6.3	3.2	3.0	6.5	9.0
2	5	1.4	0.9	1.0	1.0	3.0

Here's where things go awry. The onset for id 2 from previous has moved up one position to id 1, and so forth for every id, with id 10 missing. I would like onset=. for id 1.

Listing of New Variables

id	temp	time	group	fever	nfev	ever	consecutive	duration	onset	auc
1	101.626	4	1	1	0	NO	0	0	4	879.900
2	100.728	3	1	1	1	YES	0	1	3	881.142
3	100.850	9	1	1	2	YES	0	1	9	892.676
4	101.173	9	1	1	1	YES	0	1	9	872.479
5	101.163	1	1	1	1	YES	0	1	1	889.449
6	101.730	1	2	1	7	YES	0	2	1	913.268
7	101.314	1	2	1	6	YES	0	3	1	906.276
8	103.964	1	2	1	9	YES	0	7	1	915.809
9	101.370	3	2	1	9	YES	0	5	3	914.003

I can take care of the labels and variables output, but this is what I'm trying to produce:

mkeintz · Posted 12-13-2020 01:49 PM

Not only will your results be mismatched, but (as you have discovered) your data set will have no more observations than the smallest contributor among your SET statements. Learn about match-merging using the MERGE statement accompanied by a BY statement: take a look at the examples in SAS® 9.4 and SAS® Viya 3.5 Programming Documentation, MERGE Statement page.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 12-13-2020 01:40 PM

Are you sure you want multiple SET statements which is an Interleave?

I would expect a merge instead but by group not by ID.

The different methods of combining data are outlined here, I think you're looking for a match merge

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=lrcon&docsetTarget=n1tgk0...

You'll have to provide more details but I suspect you'll need more than one merge for your data sets.

It looks like you're trying to add summary statistics to a main data set? For these type of calculations I find SQL much easier personally, as it can save you multiple steps. Determining the "FIRST" occurrence can also be done via PROC SUMMARY.

https://github.com/statgeek/SAS-Tutorials/blob/master/add_average_value_to_dataset.sas

@itmightbeharry wrote:

Hello everyone:

I have a data set for which I have created a variable "onset" set equal to the time of first occurrence by id. I'm mostly satisfied with the output using PROC PRINT and the PROC MEANS output is exactly what I'm looking for. The problem I've encountered is when merging this data with other data. There is no value for onset for id # 1, which has no occurrence of "fever"; however, when merging with other data, each onset value moves up one, with id #10 missing. I would like to have "." for onset for id # 1 in the merged data. Any help or suggestions would be very much appreciated! I have done my best to copy my code and output below, along with the desired output at the very bottom.
title 'Summary of Onset of Fever by Group';
data work.prob3e;
  set work.fever;
  by id;
  where fever=1;
  if first.id;
  onset=time;
  label onset='day of first fever';
run;
proc print data=work.prob3e;
run;
proc means data=work.prob3e maxdec=1 nonobs n mean std min median max;
  class group;
  var onset;
run;
title 'Listing of New Variables';
data work.prob3g;
  set work.prob3b;
  set work.prob3c;
  set work.duration;
  set work.prob3e;
  set work.prob3f;
  by id;
run;
proc print data=work.prob3g;
id id;
run;
Here the first observation is id 2. Id 1 has no occurrence of fever:

Summary of Onset of Fever by Group

Obs

temp

id

time

group

fever

onset

1

101.626

2

4

1

1

4

2

100.728

3

3

1

1

3

3

100.850

4

9

1

1

9

4

101.173

5

9

1

1

9

5

101.163

6

1

2

1

1

6

101.730

7

1

2

1

1

7

101.314

8

1

2

1

1

8

103.964

9

1

2

1

1

9

101.370

10

3

2

1

3

All of this is correct:

The MEANS Procedure

Analysis Variable : onset day of first fever

treatment group

N

Mean

Std Dev

Minimum

Median

Maximum

1

4

6.3

3.2

3.0

6.5

9.0

2

5

1.4

0.9

1.0

1.0

3.0

Here's where things go awry. The onset for id 2 from previous has moved up one position to id 1, and so forth for every id, with id 10 missing. I would like onset=. for id 1.

Listing of New Variables

id

temp

time

group

fever

nfev

ever

consecutive

duration

onset

auc

1

101.626

4

1

1

0

NO

0

0

4

879.900

2

100.728

3

1

1

1

YES

0

1

3

881.142

3

100.850

9

1

1

2

YES

0

1

9

892.676

4

101.173

9

1

1

1

YES

0

1

9

872.479

5

101.163

1

1

1

1

YES

0

1

1

889.449

6

101.730

1

2

1

7

YES

0

2

1

913.268

7

101.314

1

2

1

6

YES

0

3

1

906.276

8

103.964

1

2

1

9

YES

0

7

1

915.809

9

101.370

3

2

1

9

YES

0

5

3

914.003

I can take care of the labels and variables output, but this is what I'm trying to produce:

itmightbeharry · Posted 12-13-2020 01:57 PM

Hello Reeza:

Thank you very much for reaching out! I've inserted my full code below, along with the output produced. For the final output, I would like the id number to be my first variable. I have not yet labeled the variables where I'm trying to combine data, but the ones I looking to display from the final output I'd copied are "id", "group", "nfev", "duration"

title 'Summary of Fever by Group and Day';
footnote 'Temperature >= 100.4 F is considered a fever.';
data prob3;
  set cert.tempdata;
  label fever='fever';
  if temp>=100.4 then fever='YES';
    else if temp<100.4 then fever='NO';
run;
proc freq data=prob3;
  tables time*group*fever / nopercent nocol;
run;
title 'Summary of Number of Days with Fever by Group';
footnote;
proc sort data=cert.tempdata out=work.prob3b1;
  by id;
run;
data work.prob3b;
  set work.prob3b1;
  by id;
  if temp>=100.4 then fever=1;
    else if temp<100.4 then fever=0;
  if first.id then nfev=0;
  nfev+fever;
  if last.id;
run;
proc means data=work.prob3b nonobs n mean std min median max maxdec=1;
  label nfev='# days with fever';
  var nfev;
  class group;
run;
title 'Summary of Ever Having Fever by Group';
data work.prob3c;
  set work.prob3b;
  label ever='ever had fever';
  if nfev>=1 then ever='YES';
    else if nfev=0 then ever='NO';
run;
proc freq data=work.prob3c;
  tables group*ever / nopercent nocol;
run;
title 'Summary of Number of Days with Longest Consecutive Fever by Group';
data work.fever;
  set cert.tempdata;
  if temp>=100.4 then fever=1;
    else if temp<100.4 then fever=0;
run;
data work.duration;
  do until (last.id);
    set work.fever;
	by id;
	if fever = 0 then consecutive = 0;
	if fever then consecutive+1;
	duration = max(duration, consecutive);
  end;
  consecutive=0;
  label duration='# days with longest consecutive fever';
run;
proc means data = work.duration nway maxdec=1 nonobs n mean std min median max;
  class group;
  var duration;
run;
title 'Summary of Onset of Fever by Group';
data work.prob3e;
  set work.fever;
  by id;
  where fever=1;
  if first.id;
  onset=time;
  label onset='day of first fever';
run;
proc print data=work.prob3e;
run;
proc means data=work.prob3e maxdec=1 nonobs n mean std min median max;
  class group;
  var onset;
run;
title Summary of Area Under the Curve for Temperature by Group;
data work.prob3f;
do until(last.id);
  set cert.tempdata; by id time;
  if last.time then
    auc = sum(auc, mean(temp, prevTemp) * range(time, prevTime));
  prevTemp = temp;
  prevTime = time;
  end;
keep id auc group;
label auc='area under the curve for temperature';
run;
proc means data=work.prob3f maxdec=1 nonobs n mean std min median max;
  class group;
  var auc;
run;
title 'Listing of New Variables';
data work.prob3g;
  set work.prob3b;
  set work.prob3c;
  set work.duration;
  set work.prob3e;
  set work.prob3f;
  by id;
run;
proc print data=work.prob3g;
id id;
run;

Summary of Fever by Group and Day

The FREQ Procedure

Frequency

Row Pct

Table 1 of group by fever

Controlling for time=1

group(treatment group)

fever(fever)

NO

YES

Total

1

5

100.00

0

0.00

5

2

1

20.00

4

80.00

5

Total

6

4

10

Frequency

Row Pct

Table 2 of group by fever

Controlling for time=2

group(treatment group)

fever(fever)

NO

YES

Total

1

5

100.00

0

0.00

5

2

40.00

3

60.00

5

Total

7

3

10

Frequency

Row Pct

Table 3 of group by fever

Controlling for time=3

group(treatment group)

fever(fever)

NO

YES

Total

1

4

80.00

1

20.00

5

2

0

0.00

5

100.00

5

Total

4

6

10

Frequency

Row Pct

Table 4 of group by fever

Controlling for time=4

group(treatment group)

fever(fever)

NO

YES

Total

1

4

80.00

1

20.00

5

2

40.00

3

60.00

5

Total

6

4

10

Frequency

Row Pct

Table 5 of group by fever

Controlling for time=5

group(treatment group)

fever(fever)

NO

YES

Total

1

5

100.00

0

0.00

5

2

3

60.00

2

40.00

5

Total

8

2

10

Frequency

Row Pct

Table 6 of group by fever

Controlling for time=6

group(treatment group)

fever(fever)

NO

YES

Total

1

4

80.00

1

20.00

5

2

1

20.00

4

80.00

5

Total

5

10

Frequency

Row Pct

Table 7 of group by fever

Controlling for time=7

group(treatment group)

fever(fever)

NO

YES

Total

1

5

100.00

0

0.00

5

2

1

20.00

4

80.00

5

Total

6

4

10

Frequency

Row Pct

Table 8 of group by fever

Controlling for time=8

group(treatment group)

fever(fever)

NO

YES

Total

1

5

100.00

0

0.00

5

2

40.00

3

60.00

5

Total

7

3

10

Frequency

Row Pct

Table 9 of group by fever

Controlling for time=9

group(treatment group)

fever(fever)

NO

YES

Total

1

3

60.00

2

40.00

5

2

1

20.00

4

80.00

5

Total

4

6

10

Frequency

Row Pct

Table 10 of group by fever

Controlling for time=10

group(treatment group)

fever(fever)

NO

YES

Total

1

5

100.00

0

0.00

5

2

0

0.00

5

100.00

5

Total

5

10

Temperature >= 100.4 F is considered a fever.

Summary of Number of Days with Fever by Group

The MEANS Procedure

Analysis Variable : nfev # days with fever
treatment group	N	Mean	Std Dev	Minimum	Median	Maximum
1	5	1.0	0.7	0.0	1.0	2.0
2	5	7.4	1.5	6.0	7.0	9.0

Summary of Ever Having Fever by Group

The FREQ Procedure

Frequency

Row Pct

Table of group by ever

group(treatment group)

ever(ever had fever)

NO

YES

Total

1

20.00

4

80.00

5

2

0

0.00

5

100.00

5

Total

1

9

10

Summary of Number of Days with Longest Consecutive Fever by Group

The MEANS Procedure

Analysis Variable : duration # days with longest consecutive fever
treatment group	N	Mean	Std Dev	Minimum	Median	Maximum
1	5	0.8	0.4	0.0	1.0	1.0
2	5	4.0	2.0	2.0	3.0	7.0

Summary of Onset of Fever by Group

Obs	temp	id	time	group	fever	onset
1	101.626	2	4	1	1	4
2	100.728	3	3	1	1	3
3	100.850	4	9	1	1	9
4	101.173	5	9	1	1	9
5	101.163	6	1	2	1	1
6	101.730	7	1	2	1	1
7	101.314	8	1	2	1	1
8	103.964	9	1	2	1	1
9	101.370	10	3	2	1	3

Summary of Onset of Fever by Group

The MEANS Procedure

Analysis Variable : onset day of first fever
treatment group	N	Mean	Std Dev	Minimum	Median	Maximum
1	4	6.3	3.2	3.0	6.5	9.0
2	5	1.4	0.9	1.0	1.0	3.0

Summary of Area Under the Curve for Temperature by Group

The MEANS Procedure

Analysis Variable : auc area under the curve for temperature
treatment group	N	Mean	Std Dev	Minimum	Median	Maximum
1	5	883.1	8.0	872.5	881.1	892.7
2	5	911.2	4.4	906.3	913.3	915.8

Listing of New Variables

id	temp	time	group	fever	nfev	ever	consecutive	duration	onset	auc
1	101.626	4	1	1	0	NO	0	0	4	879.900
2	100.728	3	1	1	1	YES	0	1	3	881.142
3	100.850	9	1	1	2	YES	0	1	9	892.676
4	101.173	9	1	1	1	YES	0	1	9	872.479
5	101.163	1	1	1	1	YES	0	1	1	889.449
6	101.730	1	2	1	7	YES	0	2	1	913.268
7	101.314	1	2	1	6	YES	0	3	1	906.276
8	103.964	1	2	1	9	YES	0	7	1	915.809
9	101.370	3	2	1	9	YES	0	5	3	914.003

mkeintz · Posted 12-13-2020 02:22 PM

@Reeza wrote:

Are you sure you want multiple SET statements which is an Interleave?

... stuff deleted ...

@Reeza

I don't think this particular OP example is an interleave. Interleave would require multiple dataset names in a single SET statement, accompanied by a BY statement. In the OP's case, there are multiple SET statements, each with a single dataset name. The code is structured like:

data want;
   set a;
   set b;
   set c;
   by id;
  ....
run;

The BY statement here is tied only to the SET C statement. It doesn't track the ID values in A or B. So this actually is a merge, but not a match-merge, as the OP has discovered.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

itmightbeharry · Posted 12-13-2020 03:17 PM

That's right! I was trying a one-to-one reading approach, without realizing the impact this approach would have upon the missing onset value for the first id. Once I used match-merging, what I was looking for was easily produced.

Reeza · Posted 12-14-2020 10:53 AM

Thanks for the correction @mkeintz !

mkeintz · Posted 12-13-2020 01:49 PM

Not only will your results be mismatched, but (as you have discovered) your data set will have no more observations than the smallest contributor among your SET statements. Learn about match-merging using the MERGE statement accompanied by a BY statement: take a look at the examples in SAS® 9.4 and SAS® Viya 3.5 Programming Documentation, MERGE Statement page.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

itmightbeharry · Posted 12-13-2020 02:04 PM

Thank you for the assistance mkeintz! Just to make sure I'm understanding, if I use a merge statement, as opposed to multiple set statements, would I want to follow the merge statement with the data sets, i.e.

merge work.prob3b work.duration?

Thanks again!

itmightbeharry · Posted 12-13-2020 02:07 PM

I just tried this and that's exactly what I needed. Thank you very much to both you and Reeza! I greatly appreciate the documentation each of you provided and how prompt you both were in assisting me.

How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Re: How do I set the first occurrence value to missing if no first occurrence?

Registration is open