BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

I'm replacing missing values in data_of_diagnosis_mm by random integer 1 thru 12. However, integers replacing must be logical to death of month. Can you please help specify a condition in macro below where month of diagnosis occurs earlier than month of death? Without such condition specified I end up with bunch of people died before diagnosed. 

 

 


%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))
%mend;

data temp1; set temp; 
if date_of_diagnosis_mm in ('99') then date_of_diagnosis_mm = %RandBetween(1, 12);
run; 
where date_of_death_mm>date_of_diagnosis_mm

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Like this?


data _null_;  
  do DATE_OF_DEATH_MM = 1 to 12; 
    DATE_OF_DIAGNOSIS_MM = %RandBetween(1, DATE_OF_DEATH_MM );
    putlog DATE_OF_DEATH_MM= DATE_OF_DIAGNOSIS_MM=;
  end;
run; 

DATE_OF_DEATH_MM=1 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=2 DATE_OF_DIAGNOSIS_MM=2
DATE_OF_DEATH_MM=3 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=4 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=5 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=6 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=7 DATE_OF_DIAGNOSIS_MM=7
DATE_OF_DEATH_MM=8 DATE_OF_DIAGNOSIS_MM=6
DATE_OF_DEATH_MM=9 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=10 DATE_OF_DIAGNOSIS_MM=3
DATE_OF_DEATH_MM=11 DATE_OF_DIAGNOSIS_MM=8
DATE_OF_DEATH_MM=12 DATE_OF_DIAGNOSIS_MM=2

View solution in original post

16 REPLIES 16
ChrisNZ
Tourmaline | Level 20

Like this?


data _null_;  
  do DATE_OF_DEATH_MM = 1 to 12; 
    DATE_OF_DIAGNOSIS_MM = %RandBetween(1, DATE_OF_DEATH_MM );
    putlog DATE_OF_DEATH_MM= DATE_OF_DIAGNOSIS_MM=;
  end;
run; 

DATE_OF_DEATH_MM=1 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=2 DATE_OF_DIAGNOSIS_MM=2
DATE_OF_DEATH_MM=3 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=4 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=5 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=6 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=7 DATE_OF_DIAGNOSIS_MM=7
DATE_OF_DEATH_MM=8 DATE_OF_DIAGNOSIS_MM=6
DATE_OF_DEATH_MM=9 DATE_OF_DIAGNOSIS_MM=1
DATE_OF_DEATH_MM=10 DATE_OF_DIAGNOSIS_MM=3
DATE_OF_DEATH_MM=11 DATE_OF_DIAGNOSIS_MM=8
DATE_OF_DEATH_MM=12 DATE_OF_DIAGNOSIS_MM=2

Cruise
Ammonite | Level 13

@ChrisNZ Thanks Chris.

The code eliminated 99 in date_of_diagnosis_mm. But, if you look at freq of "date_of_diagnosis_mm" before and after imputing freq of after values decreased after month 6, ie., 1329(before) vs 1192(after). Isn't freq after imputing supposed to increase because they're now added with portions of 99 allocated to them?

 

before imputing

date_of_diagnosis_mm Frequency Percent
1 1776 12.38
2 1542 10.75
3 1586 11.06
4 1495 10.42
5 1360 9.48
6 1326 9.24
7 1153 8.04
8 1112 7.75
9 933 6.50
10 828 5.77
11 630 4.39
12 331 2.31
99 271 1.89

 


after imputing

date_of_diagnosis_mm Frequency Percent
1 2634 18.36
2 2315 16.14
3 1894 13.21
4 1599 11.15
5 1389 9.68
6 1192 8.31
7 973 6.78
8 811 5.65
9 634 4.42
10 483 3.37
11 293 2.04
12 126 0.88

 

 

data b; set temp;
DATE_OF_DIAGNOSIS_MM = %RandBetween(1, DATE_OF_DEATH_MM );
where date_of_death_mm ne . and date_of_diagnosis_yyyy=date_of_death_yyyy; 
run;

proc freq data=temp; 
tables date_of_diagnosis_mm/list missing nocum;
where date_of_death_mm ne . and date_of_diagnosis_yyyy=date_of_death_yyyy; 
title "before imputing";
run; 

proc freq data=b; 
tables date_of_diagnosis_mm/list missing nocum;
where date_of_death_mm ne . and date_of_diagnosis_yyyy=date_of_death_yyyy; 
title "after imputing";
run;

 

Kurt_Bremser
Super User

Since you have a where condition, all observations where date_of_diagnosis and date_of_death do not fall into the same year are discarded, reducing the overall numbers.

See this slightly changed code, run against the import of your csv:

proc import datafile="$HOME/sascommunity/b.csv"
out=b
dbms=csv replace;
run;

proc freq data=b;
tables date_of_diagnosis_mm;
run;

data b1;
set b;
if
  date_of_diagnosis_mm = 99 and
  date_of_death_mm ne . and
  date_of_diagnosis_yyyy = date_of_death_yyyy
then date_of_diagnosis_mm = rand("INTEGER", date_of_death_mm);
run;

proc freq data=b1;
tables date_of_diagnosis_mm;
run;

Result:

                          The FREQ Procedure

        date_of_                             Cumulative    Cumulative
    diagnosis_mm    Frequency     Percent     Frequency      Percent
---------------------------------------------------------------------
               1        3316        8.23          3316         8.23  
               2        3066        7.61          6382        15.84  
               3        3510        8.71          9892        24.55  
               4        3435        8.53         13327        33.08  
               5        3511        8.72         16838        41.80  
               6        3445        8.55         20283        50.35  
               7        3306        8.21         23589        58.55  
               8        3304        8.20         26893        66.76  
               9        3245        8.05         30138        74.81  
              10        3347        8.31         33485        83.12  
              11        3113        7.73         36598        90.85  
              12        3091        7.67         39689        98.52  
              99         597        1.48         40286       100.00  
                                                                     

                          The FREQ Procedure

        date_of_                             Cumulative    Cumulative
    diagnosis_mm    Frequency     Percent     Frequency      Percent
---------------------------------------------------------------------
               1        3343        8.30          3343         8.30  
               2        3085        7.66          6428        15.96  
               3        3528        8.76          9956        24.71  
               4        3456        8.58         13412        33.29  
               5        3518        8.73         16930        42.02  
               6        3452        8.57         20382        50.59  
               7        3313        8.22         23695        58.82  
               8        3309        8.21         27004        67.03  
               9        3247        8.06         30251        75.09  
              10        3348        8.31         33599        83.40  
              11        3115        7.73         36714        91.13  
              12        3091        7.67         39805        98.81  
              99         481        1.19         40286       100.00  

You can see that a small shift from 99 to the other months (ecept 12) has happened, with the cumulative sum staying the same.

There's still a lot of 99's left where there's no death date or where death happened in another year.

Cruise
Ammonite | Level 13

@ChrisNZ

 

I used your approach like below: It appears that problem is SOLVED!!!

 

 

/*if you use data attached to this form:*/

proc import datafile="...\b.csv"
out=cr.b
dbms=csv replace;
run;
data b1; set b; do DATE_OF_DEATH_MM = 1 to 12; IF DATE_OF_DIAGNOSIS_MM=99 THEN DATE_OF_DIAGNOSIS_MM= %RandBetween(DATE_OF_DEATH_MM,12); where date_of_death_mm ne . and date_of_diagnosis_yyyy = date_of_death_yyyy; end; run; proc freq data=b; tables date_of_diagnosis_mm; where date_of_death_mm ne . and date_of_diagnosis_yyyy = date_of_death_yyyy; run; proc freq data=b1; tables date_of_diagnosis_mm; where date_of_death_mm ne . and date_of_diagnosis_yyyy = date_of_death_yyyy; run;

Result:

BEFORE

date_of_diagnosis_mm

n %

AFTER

date_of_diagnosis_mm

n %
1 679 11.7 1 687 11.88
2 645 11.2 2 654 11.31
3 617 10.7 3 625 10.81
4 618 10.7 4 627 10.84
5 562 9.72 5 568 9.82
6 530 9.17 6 535 9.25
7 472 8.16 7 488 8.44
8 449 7.77 8 450 7.78
9 372 6.43 9 387 6.69
10 329 5.69 10 338 5.85
11 257 4.44 11 269 4.65
12 136 2.35 12 154 2.66
99 116 2.01      
PGStats
Opal | Level 21

Assuming that date_of_diagnosis_mm and date_of_death_mm are month numbers:

 

 

data temp1; 
set temp; 
if date_of_diagnosis_mm in (99) then 
    date_of_diagnosis_mm = rand("INTEGER", date_of_death_mm);
run;
PG
Cruise
Ammonite | Level 13

@PGStats

Thanks. There's something wrong. The code replace 99 by . Any idea why?

Otherwise, i really like the idea.

%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))
%mend;

data b1; set b;
if date_of_diagnosis_mm in (99) then
   date_of_diagnosis_mm = rand("INTEGER", date_of_death_mm);
where date_of_death_mm ne . and date_of_diagnosis_yyyy=date_of_death_yyyy;
run;
proc freq data=b1;
tables date_of_diagnosis_mm*date_of_death_mm/list missing;
run;  

 

date_of_diagnosis_mm date_of_death_mm Freq Percent
. 1 8 0.06
. 2 23 0.16
. 3 18 0.13
. 4 19 0.13
. 5 21 0.15
. 6 30 0.21
. 7 25 0.17
. 8 16 0.11
. 9 27 0.19
. 10 33 0.23
. 11 17 0.12
. 12 34 0.24
1 1 307 2.14
1 2 339 2.36
1 3 216 1.51
1 4 166 1.16
1 5 134 0.93
1 6 101 0.7
1 7 84 0.59
1 8 96 0.67
1 9 95 0.66
1 10 85 0.59
1 11 86 0.6
1 12 67 0.47
2 2 303 2.11
2 3 284 1.98
2 4 202 1.41
2 5 163 1.14
2 6 104 0.73
2 7 97 0.68
2 8 84 0.59
2 9 85 0.59
2 10 68 0.47
2 11 73 0.51
2 12 79 0.55
3 3 313 2.18
3 4 318 2.22
3 5 209 1.46
3 6 155 1.08
3 7 115 0.8
3 8 106 0.74
3 9 87 0.61
3 10 95 0.66
3 11 86 0.6
3 12 102 0.71
4 4 317 2.21
4 5 297 2.07
4 6 229 1.6
4 7 156 1.09
4 8 108 0.75
4 9 100 0.7
4 10 89 0.62
4 11 109 0.76
4 12 90 0.63
5 5 303 2.11
5 6 308 2.15
5 7 196 1.37
5 8 171 1.19
5 9 128 0.89
5 10 106 0.74
5 11 67 0.47
5 12 81 0.56
6 6 271 1.89
6 7 314 2.19
6 8 195 1.36
6 9 160 1.12
6 10 143 1
6 11 129 0.9
6 12 114 0.79
7 7 299 2.08
7 8 288 2.01
7 9 186 1.3
7 10 157 1.09
7 11 121 0.84
7 12 102 0.71
8 8 304 2.12
8 9 306 2.13
8 10 217 1.51
8 11 154 1.07
8 12 131 0.91
9 9 266 1.85
9 10 317 2.21
9 11 200 1.39
9 12 150 1.05
10 10 281 1.96
10 11 332 2.31
10 12 215 1.5
11 11 299 2.08
11 12 331 2.31
12 12 331 2.31
Cruise
Ammonite | Level 13

@Kurt_Bremser, @PGStats

 

this is ERROR log I get.

 

925  data cr.b1; set cr.b;
926  if
927    date_of_diagnosis_mm = 99 and
928    date_of_death_mm ne . and
929    date_of_diagnosis_yyyy = date_of_death_yyyy
930  then date_of_diagnosis_mm = rand("INTEGER", date_of_death_mm);
931  run;

ERROR: The first argument of the RAND function must be a character string with a value of
       BERNOULLI, BETA, BINOMIAL, CAUCHY, CHISQUARE, ERLANG, EXPONENTIAL, F, GAMMA, GAUSSIAN,
       GEOMETRIC, HYPERGEOMETRIC, LOGNORMAL, NEGB, NORMAL, POISSON, T, TABLE, TRIANGULAR,
       UNIFORM, or WEIBULL.
NOTE: Argument 1 to function RAND('INTEGER',6) at line 930 column 29 is invalid.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2004 date_of_death_dd=26
date_of_death_mm=6 date_of_death_yyyy=2004 _ERROR_=1 _N_=2899
ERROR: The first argument of the RAND function must be a character string with a value of
       BERNOULLI, BETA, BINOMIAL, CAUCHY, CHISQUARE, ERLANG, EXPONENTIAL, F, GAMMA, GAUSSIAN,
       GEOMETRIC, HYPERGEOMETRIC, LOGNORMAL, NEGB, NORMAL, POISSON, T, TABLE, TRIANGULAR,
       UNIFORM, or WEIBULL.
NOTE: Argument 1 to function RAND('INTEGER',12) at line 930 column 29 is invalid.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2004 date_of_death_dd=21
date_of_death_mm=12 date_of_death_yyyy=2004 _ERROR_=1 _N_=3621
ERROR: The first argument of the RAND function must be a character string with a value of
       BERNOULLI, BETA, BINOMIAL, CAUCHY, CHISQUARE, ERLANG, EXPONENTIAL, F, GAMMA, GAUSSIAN,
       GEOMETRIC, HYPERGEOMETRIC, LOGNORMAL, NEGB, NORMAL, POISSON, T, TABLE, TRIANGULAR,
       UNIFORM, or WEIBULL.
NOTE: Argument 1 to function RAND('INTEGER',2) at line 930 column 29 is invalid.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2004 date_of_death_dd=28
date_of_death_mm=2 date_of_death_yyyy=2004 _ERROR_=1 _N_=3632
ERROR: The first argument of the RAND function must be a character string with a value of
       BERNOULLI, BETA, BINOMIAL, CAUCHY, CHISQUARE, ERLANG, EXPONENTIAL, F, GAMMA, GAUSSIAN,
       GEOMETRIC, HYPERGEOMETRIC, LOGNORMAL, NEGB, NORMAL, POISSON, T, TABLE, TRIANGULAR,
       UNIFORM, or WEIBULL.
NOTE: Argument 1 to function RAND('INTEGER',7) at line 930 column 29 is invalid.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2004 date_of_death_dd=11
date_of_death_mm=7 date_of_death_yyyy=2004 _ERROR_=1 _N_=3694
ERROR: The first argument of the RAND function must be a character string with a value of
       BERNOULLI, BETA, BINOMIAL, CAUCHY, CHISQUARE, ERLANG, EXPONENTIAL, F, GAMMA, GAUSSIAN,
       GEOMETRIC, HYPERGEOMETRIC, LOGNORMAL, NEGB, NORMAL, POISSON, T, TABLE, TRIANGULAR,
       UNIFORM, or WEIBULL.
NOTE: Argument 1 to function RAND('INTEGER',10) at line 930 column 29 is invalid.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2005 date_of_death_dd=18
date_of_death_mm=10 date_of_death_yyyy=2005 _ERROR_=1 _N_=6775
NOTE: Invalid argument to function RAND('INTEGER',6) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2005 date_of_death_dd=11
date_of_death_mm=6 date_of_death_yyyy=2005 _ERROR_=1 _N_=7226
NOTE: Invalid argument to function RAND('INTEGER',2) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2005 date_of_death_dd=8
date_of_death_mm=2 date_of_death_yyyy=2005 _ERROR_=1 _N_=7725
NOTE: Invalid argument to function RAND('INTEGER',9) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2005 date_of_death_dd=28
date_of_death_mm=9 date_of_death_yyyy=2005 _ERROR_=1 _N_=7769
NOTE: Invalid argument to function RAND('INTEGER',10) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2005 date_of_death_dd=31
date_of_death_mm=10 date_of_death_yyyy=2005 _ERROR_=1 _N_=7874
NOTE: Invalid argument to function RAND('INTEGER',7) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2006 date_of_death_dd=1
date_of_death_mm=7 date_of_death_yyyy=2006 _ERROR_=1 _N_=10607
NOTE: Invalid argument to function RAND('INTEGER',5) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2007 date_of_death_dd=31
date_of_death_mm=5 date_of_death_yyyy=2007 _ERROR_=1 _N_=12501
NOTE: Invalid argument to function RAND('INTEGER',3) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2007 date_of_death_dd=30
date_of_death_mm=3 date_of_death_yyyy=2007 _ERROR_=1 _N_=12580
NOTE: Invalid argument to function RAND('INTEGER',7) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2006 date_of_death_dd=10
date_of_death_mm=7 date_of_death_yyyy=2006 _ERROR_=1 _N_=13166
NOTE: Invalid argument to function RAND('INTEGER',4) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2006 date_of_death_dd=16
date_of_death_mm=4 date_of_death_yyyy=2006 _ERROR_=1 _N_=13167
NOTE: Invalid argument to function RAND('INTEGER',5) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2006 date_of_death_dd=29
date_of_death_mm=5 date_of_death_yyyy=2006 _ERROR_=1 _N_=13528
NOTE: Invalid argument to function RAND('INTEGER',6) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2007 date_of_death_dd=23
date_of_death_mm=6 date_of_death_yyyy=2007 _ERROR_=1 _N_=14045
NOTE: Invalid argument to function RAND('INTEGER',9) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2007 date_of_death_dd=25
date_of_death_mm=9 date_of_death_yyyy=2007 _ERROR_=1 _N_=14512
NOTE: Invalid argument to function RAND('INTEGER',4) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2007 date_of_death_dd=27
date_of_death_mm=4 date_of_death_yyyy=2007 _ERROR_=1 _N_=14719
NOTE: Invalid argument to function RAND('INTEGER',4) at line 930 column 29.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2008 date_of_death_dd=22
date_of_death_mm=4 date_of_death_yyyy=2008 _ERROR_=1 _N_=15474
NOTE: Invalid argument to function RAND('INTEGER',6) at line 930 column 29.
WARNING: Limit set by ERRORS= option reached.  Further errors of this type will not be printed.
date_of_diagnosis_dd=99 date_of_diagnosis_mm=. date_of_diagnosis_yyyy=2007 date_of_death_dd=2
date_of_death_mm=6 date_of_death_yyyy=2007 _ERROR_=1 _N_=16148
NOTE: Mathematical operations could not be performed at the following places. The results of the
      operations have been set to missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      116 at 930:29
NOTE: There were 40286 observations read from the data set CR.B.
NOTE: The data set CR.B1 has 40286 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.03 seconds


Kurt_Bremser
Super User

A month (1-12) on its own is a useless value. Only in connection with a year will it depict a certain real-world month. So you should expand your data accordingly, and then create your random diagnosis month within the last 12 months of the person's life. If you use intelligent data (working with SAS date values instead of year/month pairs), you can use the intnx() function, calculating from the person's death date.

Cruise
Ammonite | Level 13

Hi guys, attached is the sample from my data, as @Kurt_Bremser suggested. Thanks for help! @PGStats @ChrisNZ

 

proc sql;
create table b(keep=date_of_death: date_of_diagnosis:) as
select *
from data
where ranuni (0)<X;
quit;


proc import datafile="...\b.csv"
out=data
dbms=csv replace;
run;

Kurt_Bremser
Super User

When running your code

proc import datafile="$HOME/sascommunity/b.csv"
out=data
dbms=csv replace;
run;

proc sql;
create table b(keep=date_of_death: date_of_diagnosis:) as
select *
from data
where ranuni (0)<X;
quit;

, I get this ERROR in the SQL:

65         proc sql;
66         create table b(keep=date_of_death: date_of_diagnosis:) as
67         select *
68         from data
69         where ranuni (0)<X;
ERROR: The following columns were not found in the contributing tables: X.
Cruise
Ammonite | Level 13

@Kurt_Bremser proc sql part is just to show where my data came from. So ignore sql part plz.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1316 views
  • 2 likes
  • 4 in conversation