BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sasthalie
Fluorite | Level 6

Hi, 

Here's an example of my data:

start_date     levelA   levelB   levelC   levelD

15-01-2020  0            1           0          0

16-01-2020   0            0          1           0

20-02-2020     1          0           0          0

20-02-2020    0            0          1           0

25-03-2020     0            0          1           0

25-03-2020     0            0          0          1

(etc.)

 

Each line is a patient and each patient can either be A, B, C or D. Multiple patients have the same start_date (numeric).

 

Here's what I've been trying (unsuccessfully to create):

start_date    n_levelA    n_levelB   n_levelC     n_levelD     high 

15-01-2020    0              1               0                0                  1

16-01-2020     0              0              1                 0                 1 

20-02-2020     1             0                 1                0                 2 

25-03-2020     0              0                1                1                 2

 

My end goal is to have this data in wide format in order to create a stacked histogram (x=start_date, y=number of patient, with stacks for each level A to D), which requires me to find out the highest and lowest number of observations for each date. 

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@sasthalie wrote:

Hi, 

Here's an example of my data:

start_date     levelA   levelB   levelC   levelD

15-01-2020  0            1           0          0

16-01-2020   0            0          1           0

20-02-2020     1          0           0          0

20-02-2020    0            0          1           0

25-03-2020     0            0          1           0

25-03-2020     0            0          0          1

(etc.)

 

Each line is a patient and each patient can either be A, B, C or D. Multiple patients have the same start_date (numeric).

 

Here's what I've been trying (unsuccessfully to create):

start_date    n_levelA    n_levelB   n_levelC     n_levelD     high 

15-01-2020    0              1               0                0                  1

16-01-2020     0              0              1                 0                 1 

20-02-2020     1             0                 1                0                 2 

25-03-2020     0              0                1                1                 2

 

My end goal is to have this data in wide format in order to create a stacked histogram (x=start_date, y=number of patient, with stacks for each level A to D), which requires me to find out the highest and lowest number of observations for each date. 

 

Thank you!


What are the rules for assigning values to high?

 

A guess:

data have;
   input start_date :ddmmyy10. levelA levelB levelC levelD;
   format start_date ddmmyy10.;
datalines;
15-01-2020 0  1  0  0
16-01-2020 0  0  1  0
20-02-2020 1  0  0  0
20-02-2020 0  0  1  0
25-03-2020 0  0  1  0
25-03-2020 0  0  0  1
;

proc summary data=have nway;
   class start_date;
   var level: ;
   output out=want(drop=_type_ rename=(_freq_=high)) max=;
run;

Note the data step to provide something to code with and code placed in a text box so it is easy to copy and paste into an editor.

 

Proc summary code above gets the maximum value of each level variable (the : after level is a list maker for all variable names that start with level). The procedure automatically includes two variables in the output set, _type_ indicates combinations of class variables and would all have the same value as the NWAY option only outputs the highest type. Run the code without nway and remove the drop to see what _type_ looks like and the data. The second automatic variable is _freq_ which represents number of records with the class variable combination, so renaming it to "high" seems to match your unstated requirement on how to get value for "high".

 

View solution in original post

3 REPLIES 3
ballardw
Super User

@sasthalie wrote:

Hi, 

Here's an example of my data:

start_date     levelA   levelB   levelC   levelD

15-01-2020  0            1           0          0

16-01-2020   0            0          1           0

20-02-2020     1          0           0          0

20-02-2020    0            0          1           0

25-03-2020     0            0          1           0

25-03-2020     0            0          0          1

(etc.)

 

Each line is a patient and each patient can either be A, B, C or D. Multiple patients have the same start_date (numeric).

 

Here's what I've been trying (unsuccessfully to create):

start_date    n_levelA    n_levelB   n_levelC     n_levelD     high 

15-01-2020    0              1               0                0                  1

16-01-2020     0              0              1                 0                 1 

20-02-2020     1             0                 1                0                 2 

25-03-2020     0              0                1                1                 2

 

My end goal is to have this data in wide format in order to create a stacked histogram (x=start_date, y=number of patient, with stacks for each level A to D), which requires me to find out the highest and lowest number of observations for each date. 

 

Thank you!


What are the rules for assigning values to high?

 

A guess:

data have;
   input start_date :ddmmyy10. levelA levelB levelC levelD;
   format start_date ddmmyy10.;
datalines;
15-01-2020 0  1  0  0
16-01-2020 0  0  1  0
20-02-2020 1  0  0  0
20-02-2020 0  0  1  0
25-03-2020 0  0  1  0
25-03-2020 0  0  0  1
;

proc summary data=have nway;
   class start_date;
   var level: ;
   output out=want(drop=_type_ rename=(_freq_=high)) max=;
run;

Note the data step to provide something to code with and code placed in a text box so it is easy to copy and paste into an editor.

 

Proc summary code above gets the maximum value of each level variable (the : after level is a list maker for all variable names that start with level). The procedure automatically includes two variables in the output set, _type_ indicates combinations of class variables and would all have the same value as the NWAY option only outputs the highest type. Run the code without nway and remove the drop to see what _type_ looks like and the data. The second automatic variable is _freq_ which represents number of records with the class variable combination, so renaming it to "high" seems to match your unstated requirement on how to get value for "high".

 

Reeza
Super User

This can get you started:

data have;
   input start_date :ddmmyy10. levelA levelB levelC levelD;
   format start_date ddmmyy10.;
   patient_id = _n_;
datalines;
15-01-2020 0  1  0  0
16-01-2020 0  0  1  0
20-02-2020 1  0  0  0
20-02-2020 0  0  1  0
25-03-2020 0  0  1  0
25-03-2020 0  0  0  1
;

proc transpose data=have out=long(where=(col1 ne 0));
by patient_id start_date;
var level:;
run;

data level_data;
set long;
level = substr(compress(reverse(_name_)), 1, 1);
run;

proc sgplot data=level_data;
vbar start_date / group=level datalabel seglabel;
run;

proc freq data=level_data;
table start_date * level / nopercent norow nocol;
run;

 

 

sasthalie
Fluorite | Level 6

This one works out great, thank you!

I changed the last paragraph for this, since I'm looking to make a histogram (not bar chart). It seems to also be correct. 

 

proc sgplot data=level_data;
histogram start_date/ group=level;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 535 views
  • 2 likes
  • 3 in conversation