BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

Dear SAS experts:

 

I'd like to create 100 indicator variables which mark 0 through 100% of the rows in a random fashion. For example, MARK50 in the mock data would mark 50% of the rows selected randomly while MARK25 variable would mark 25% of the rows in a random selection. Desirable dataset has 100 indicator dummy variables marking 0-100% of the rows in random.  

This is my attempt to manually conduct a simulation analysis in my previous post.  If I could just create mark 0, 15, 50, 75 and 100 then I would come up with a skeleton of the final desirable plot I want. I need this prelim information int he meantime.

 

https://communities.sas.com/t5/SAS-Programming/Simulate-parameter-estimates-of-the-model-for-missing...

 

Thank you so much for your precious time!

PS: I'm familiar with proc surveyselect. But I want to create indicator variables within the dataset rather than creating subsets of random samples.

 

DATA HAVE;
INPUT ID RANDOM MARK50 MARK25;
CARDS;
1  1  1
2  0  0
3  1  0
4  0  1
5  1  0
6  0  0
7  1  0
8  0  1
9  1  0
10 0  0
11 1  0
12 0  1
;
RUN;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26
DATA HAVE;
    INPUT ID;
    rand=rand('uniform');
CARDS;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
;
RUN;
proc rank data=have out=ranked;
    var rand;
	ranks ranks;
run;
data want;
	set ranked;
	array mark mark1-mark99;
	do i=1 to dim(mark);
	    if ranks<=i then mark(i)=1; 
		else mark(i)=0;
	end;
run;
--
Paige Miller

View solution in original post

19 REPLIES 19
PaigeMiller
Diamond | Level 26
DATA HAVE;
    INPUT ID;
    rand=rand('uniform');
    array mark mark1-mark99;
    do i=1 to dim(mark);
        if rand<(i/100) then mark(i)=1;
	else mark(i)=0;
    end;
CARDS;
1  1  1
2  0  0
3  1  0
4  0  1
5  1  0
6  0  0
7  1  0
8  0  1
9  1  0
10 0  0
11 1  0
12 0  1
;
RUN;
--
Paige Miller
Cruise
Ammonite | Level 13

@PaigeMiller 

Thanks a lot. I tried your approach on 1 thru 100 row dataset. How to tie mark1 totals to 1., so on., mark100 totals to 100. Possible? 

 

DATA HAVE;
    INPUT ID;
    rand=rand('uniform');
    array mark mark1-mark99;
    do i=1 to dim(mark);
        if rand<(i/100) then mark(i)=1;
	else mark(i)=0;
    end;
CARDS;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
;
RUN;

proc summary data=have;
var mark1-mark100;
output out=totals sum=;
run;
proc transpose data=totals;
run;
proc print; run; 
PaigeMiller
Diamond | Level 26

How to tie mark1 totals to 1., so on., mark100 totals to 100. Possible? 

 

I don't know what this means.

--
Paige Miller
Cruise
Ammonite | Level 13

@PaigeMiller correct me If I'm wrong. I thought if i have 100 observations and randomly selecting them 100 times. Total number of rows selected randomly aiming to cover the 50% of the dataset of 100 rows then I would end up selecting 50 rows. Right? therefore, mark50 would match up to the 50 rows selected in total. That is how I mean by matching markN to the N and the N% of the data selection in random on the 100 rows of data. Make sense?

Reeza
Super User

Wouldn't this just be an array?

 

data want;
set have;

call streaminit(300) ; *to allow for replication;
array pct(100) pct1-pct100;

do i=1 to 100;
pct(i) = rand('bernoulli', i/100);
end;

run;

@Cruise wrote:

Dear SAS experts:

 

I'd like to create 100 indicator variables which mark 0 through 100% of the rows in a random fashion. For example, MARK50 in the mock data would mark 50% of the rows selected randomly while MARK25 variable would mark 25% of the rows in a random selection. Desirable dataset has 100 indicator dummy variables marking 0-100% of the rows in random.  

This is my attempt to manually conduct a simulation analysis in my previous post.  If I could just create mark 0, 15, 50, 75 and 100 then I would come up with a skeleton of the final desirable plot I want. I need this prelim information int he meantime.

 

https://communities.sas.com/t5/SAS-Programming/Simulate-parameter-estimates-of-the-model-for-missing...

 

Thank you so much for your precious time!

PS: I'm familiar with proc surveyselect. But I want to create indicator variables within the dataset rather than creating subsets of random samples.

 

DATA HAVE;
INPUT ID RANDOM MARK50 MARK25;
CARDS;
1  1  1
2  0  0
3  1  0
4  0  1
5  1  0
6  0  0
7  1  0
8  0  1
9  1  0
10 0  0
11 1  0
12 0  1
;
RUN;

 

 


 

PaigeMiller
Diamond | Level 26
DATA HAVE;
    INPUT ID;
    rand=rand('uniform');
CARDS;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
;
RUN;
proc rank data=have out=ranked;
    var rand;
	ranks ranks;
run;
data want;
	set ranked;
	array mark mark1-mark99;
	do i=1 to dim(mark);
	    if ranks<=i then mark(i)=1; 
		else mark(i)=0;
	end;
run;
--
Paige Miller
Cruise
Ammonite | Level 13

@PaigeMiller 

 

Your last approach is what I wanted. mark1 flags 1 row, mark2 flags 2 rows et.c.

my actual data has 20K rows and i will use these marks as an indicator of the extent of a missing, thus, replaced data in my manual simulation analysis. mark50, would indicate that the 50% of the variable of interest was missing,therefore, replaced by a proxy value (July1 for the month/day).

 

RANDOM_MARK.png

 

PaigeMiller
Diamond | Level 26

@Reeza wrote:

Wouldn't this just be an array?

 

data want;
set have;

call streaminit(300) ; *to allow for replication;
array pct(100) pct1-pct100;

do i=1 to 100;
pct(i) = rand('bernoulli', i/100);
end;

run;


It's still not 100% clear to me, but it seems that the question from @Cruise  requires only one value of 1 and 99 values of zero in column 1; and exactly 2 values of 1 and 98 values of zero in column 2; and so on. So, this use of the Bernoulli random variable does not produce that result.

--
Paige Miller
Reeza
Super User

@PaigeMiller  given the last reponse of this:

 

 

I thought if i have 100 observations and randomly selecting them 100 times. Total number of rows selected randomly aiming to cover the 50% of the dataset of 100 rows then I would end up selecting 50 rows. Right? therefore, mark50 would match up to the 50 rows selected in total. That is how I mean by matching markN to the N and the N% of the data selection in random on the 100 rows of data. Make sense?

 

This would create those variables. 

 

PCT1 would have 1% with values of 1.

PCT50 would have 50% of values with 1, ie 50% selected.

PaigeMiller
Diamond | Level 26

@Reeza wrote:

@PaigeMiller  given the last reponse of this:

 

 

I thought if i have 100 observations and randomly selecting them 100 times. Total number of rows selected randomly aiming to cover the 50% of the dataset of 100 rows then I would end up selecting 50 rows. Right? therefore, mark50 would match up to the 50 rows selected in total. That is how I mean by matching markN to the N and the N% of the data selection in random on the 100 rows of data. Make sense?

 

This would create those variables. 

 

PCT1 would have 1% with values of 1.

PCT50 would have 50% of values with 1, ie 50% selected.


data want;
set have;

call streaminit(300) ; *to allow for replication;
array pct(100) pct1-pct100;

do i=1 to 100;
pct(i) = rand('bernoulli', i/100);
end;

run;
proc summary data=want;
    var pct1-pct100;
	output out=stats sum=;
run;

If it produced the result as I now understand it, PROC SUMMARY should produce a sum of 1 for pct1, and 2 for pct 2 and so on, and this code does not produce that result.

--
Paige Miller
Reeza
Super User

@PaigeMiller wrote:

@Reeza wrote:

@PaigeMiller  given the last reponse of this:

 

 

I thought if i have 100 observations and randomly selecting them 100 times. Total number of rows selected randomly aiming to cover the 50% of the dataset of 100 rows then I would end up selecting 50 rows. Right? therefore, mark50 would match up to the 50 rows selected in total. That is how I mean by matching markN to the N and the N% of the data selection in random on the 100 rows of data. Make sense?

 

This would create those variables. 

 

PCT1 would have 1% with values of 1.

PCT50 would have 50% of values with 1, ie 50% selected.


data want;
set have;

call streaminit(300) ; *to allow for replication;
array pct(100) pct1-pct100;

do i=1 to 100;
pct(i) = rand('bernoulli', i/100);
end;

run;
proc summary data=want;
    var pct1-pct100;
	output out=stats sum=;
run;

If it produced the result as I now understand it, PROC SUMMARY should produce a sum of 1 for pct1, and 2 for pct 2 and so on, and this code does not produce that result.


You would need to look at the mean, not the sum because the number of observations would affect the results. When done in that case, it most definitely does produce a series of approximately 1-100%.

 

delete_probabilities.png

Cruise
Ammonite | Level 13
I see what you're saying. I'll use both your and paige miller's approaches on my actual data for a comparison purpose. I'll report back in here.
Reeza
Super User
Doesn't matter which one you use, just wanted to ensure that I wasn't missing something. Both work and allow you to choose samples of a specific size though. You should review David L. Cassell's paper on don't be loopy - it covers simulations in SAS both using an approach that you're trying and survey select.

Cruise
Ammonite | Level 13

@Reeza @PaigeMiller 

 

I ended up using  Reeza's approach since this took a percentile of the rows. Paige's approach marked N 1-100 rows in total from 27,440 rows.

 

REEZA.png

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 2783 views
  • 12 likes
  • 3 in conversation