Hi all,
Im attempting to have only one entry point for each operation(variable name: id) in my long dataset. One operation can include information on multiple lines (as for line1 and 2).
Now it is in the format:
Operation 1 | Lead 1 |
Operation 1 | Lead 2 |
Operation 2 | |
Operation 3 |
I would like:
Operation 1 | Lead 1 | Lead 2 |
Operation 2 | . | . |
Operation 3 | . | . |
Lead_x contains multiple variables (33 variables) which I think should be renamed to move to the corresponding row with a suffix indicating 1,2, etc
Part of data for one patient (record 15) (not showing all 33 variables corresponding to lead, but some):
data test1;
input record id date record_1 id_1 id_lead type_lead$;
informat date ddmmyy10. ;
format date ddmmyy10.;
datalines;
15 62 15-09-2003 15 62 84 LVL
15 62 15-09-2003 15 62 83 HVL
15 85 20-12-2010 . . . .
15 13 04-08-2017 . . . .
;
run;
I would like the two first rows to be together as:
15 62 15-09-2003 15 62 84 LVL 83 HVL
so only the new variables are in first row
I thought some of the code would include if statement,.. something like: if id = id_1 but id_lead is not equal to impl_id
Want output to be:
record |
id |
date |
record_1 |
id_1 |
id_lead |
type_lead |
id_lead_2 | type_lead_2 |
15 | 62 | 15-09-2003 | 15 | 62 | 84 | LVL | 83 | HVL |
15 | 85 | 20-12-2010 | ||||||
15 | 85 | 04-08-2017 |
How to rename new variables (id_lead_2 and type_lead_2) and how to change them from one row to the other ..
As seen the unique variable connecting line 1 and 2 is the id and id_1 which is the same .
Hope to get some help
Can each operation have multiple "leads"?
- Yes, each operation can have up till 3 leads
What does "lead" mean? Are you collecting electrocardiograms during the operation?
No, not electrocardiograms. Its implantation of heart devices. And leads are those wires which are put in the body.
So if I divide the variables into 4 levels. I have:
- a patient identifier: record
- operation information: id (unique for each operation) date (operation date)
- battery information: record_1 (actually copied from record), id_1 (copied from id above)
- lead/wire information: id_lead (unique for each implanted wire) and type_lead
Now I got the data to include one record pr operation pr patient. As the data was pivoted in three "extra" levels, I got a total of more than 500 variables.
I've used a datastep to clean it up a bit by using keep and drop statements.
Im doing some more coding to classify what happens at each operation.
After that I shall present my data with only one record pr patient in the study group.
Think I can use:
proc summary data = want;
by record ;
output out=one_line_pr_patient (drop=_type_) idgroup(out[1] ( id_1 -- type_lead_3 ) =);
run;
and then again clean it up a bit, so the variables with same contents is dropped.
Or is there a more "correct" way of specifing earlyer with variables (the string of lead information) that can get pivoted ?
I'm going to go off on a tangent, and provide advice. The usual advice is to not convert data sets to wide. This generally is not necessary, and causes subsequent programming to be more difficult. See Maxim 19 "Long beats Wide".
So, I feel compelled to ask ... why do you want a wide data set like this? What can you do with a wide data set that you can't do with a long data set?
Thanks @PaigeMiller for the nice link . Found the description of long/wide
In this particular dataset the reason is that dates, reason of operation, complications and so on are linked with a unique number between dataset. For instance the date of lead implantation is not in the same dataset. I used proc sql to gather information between different sets.
If I get the data from each operation on same line, I (think) its gonna be easier to calculate batery duration and so on
But I still haven't found the correct step to use in SAS language
In this particular dataset the reason is that dates, reason of operation, complications and so on are linked with a unique number between dataset. For instance the date of lead implantation is not in the same dataset. I used proc sql to gather information between different sets.
This is fine and makes sense, but it is not what I was asking about. I want to know, once you have this wide data set, what analysis thereafter will you be doing that requires wide rather than long?
Ah,
Im actually not good enough in sas to tell whether specific analysis cant be made in wide vs long.
However, what I need to do with the dataset:
- create a new variable which needs informations from the new variables (from the "long" observation line")
- Use the new variables to divide the patients in operation types and for each type describe the demographics
- calculate which operation type has the best outcome , and if there is a statistical difference
Use the new variables to divide the patients in operation types and for each type describe the demographics
This to me screams long data, not wide data.
Even the merging can be done in the long data form, if I am understanding you properly.
Ok, since Im still feeling pretty new - can you please explain
I personally thought it would be easier to calcute the number of operations and leads pr operation, if they were presented wide.
Lets say
patient_1 Operation_1 Lead
patient_1 operation _1 lead
How can I get SAS to "count" them ( I have a unique lead-nr) in long dataset ?
@rookie21 wrote:
Ok, since Im still feeling pretty new - can you please explain
I personally thought it would be easier to calcute the number of operations and leads pr operation, if they were presented wide.
Lets say
patient_1 Operation_1 Lead
patient_1 operation _1 lead
How can I get SAS to "count" them ( I have a unique lead-nr) in long dataset ?
I am not grasping the meaning of your question here. Make it more specific and concrete.
What do you want to count?
If PATIENT is the variable with the patient identifier and OPERATION is the variable that identifies the operation type then you can count how many times each patient had each operation type with a simple PROC FREQ.
proc freq;
tables patient*operation / list;
run;
If you need something more exotic you might need to do some pre-processing or perhaps use PROC SQL.
It is not hard to do, but it really seems like a waste of time as the result will be a dataset that is almost impossible to use for anything.
The easiest way to transpose a whole series of variables is to use PROC SUMMARY and the IDGROUP feature.
You will need to know before calling PROC SUMMARY the maximum number of observations you want to keep. If you don't know you can either just set some value larger than you expect or run a separated step to count the maximum. For this example I will use 5 since I can see you only have 4 observations so 5 will be more than enough.
Here is your data step cleaned up a little. Do not indent the lines of data, it will make it hard to write the input statement. Do not intend the DATALINES (also known as CARDS) statement as that will just encourage the program editor to indent the lines of data. Do not display dates with MDY or DMY order, that will just confuse 50% of your audience.
data test1;
input record id date record_1 id_1 id_lead type_lead $;
informat date ddmmyy10. ;
format date yymmdd10.;
datalines;
15 62 15-09-2003 15 62 84 LVL
15 62 15-09-2003 15 62 83 HVL
15 85 20-12-2010 . . . .
15 13 04-08-2017 . . . .
;
So to transpose with PROC SUMMARY you need some key variable that uniquely identifies the observations you want to combine. So it looks like that is RECORD in this case. The data needs to be grouped by the key variable(s) so you can use a BY statement. If it is grouped, but not actually sorted in ascending or descending order then use the NOTSORTED keyword on the BY statement.
You also need the list of variables you want to take the values of.
proc summary data=test1 ;
by record ;
output out=want(drop=_type_) idgroup(out[5] (id date record_1 id_1 id_lead type_lead)= );
run;
You could also include some other summary statistics if you wanted.
@PaigeMiller Can see @Tom replied to my question with the proc freq step 🙂
I tried this code as you wrote
proc summary data=test1 ;
by record ;
output out=want(drop=_type_) idgroup(out[5] (id date record_1 id_1 id_lead type_lead)= );
run;
This collects all data on one line (everything for that patient) instead of for each operation.
Then I tried the same step with id:
proc sort data=test1;
by id; run;
proc summary data=test1 ;
by id ;
output out=want(drop=_type_) idgroup(out[5] (id date record_1 id_1 id_lead type_lead)= );
run;
and got (showing part of output)
Here it is asuming 5 of every variable . Is it not possible to have id date record_1 id_1
only once, while the unique data for the additional lead (id_lead and type_lead) is folded out with 5 repeated events ?
I tried deleting id date record_1 id_1
in the parentes, but that solely took out those data in the out put
You need to understand how your data is organized.
What are the variables that uniquely identify the observations you want to treat as a single group?
What does RECORD mean? Is that the patient identifier?
What does ID mean? Is that the operation identifier? Can the same ID be used by multiple patients? That doesn't seem plausible if it is an identifier of a particular operation. But if it is instead a classification of an operation (Hernia operation versus Open Heart Surgery) then it will not be unique.
Is the goal to get one record per patient per operation?
Then it sounds like your should use both variables in the BY statement.
But if the ID is a "type of operation" field then you might also need some other variable to uniquely identify the group. It is possible to multiple operations on your knee for example. So perhaps you need to include the DATE also?
proc summary data=test1 ;
by record id date;
output out=want(drop=_type_) idgroup(out[5] (record_1 id_1 id_lead type_lead)= );
run;
Agreeing with @Tom
I realize that figuring out the best layout for data, and the best way to proceed with analysis, is something that beginners cannot know (even experienced people struggle with this). This is why I am trying to steer you into what seems to be a better approach (and I assume the same is true for Tom).
But we can't know your data as well as you know it. Maxim 3, you know. So a combination of experienced SAS programmers and your knowledge of the data would be a very good combination. But we need you to explain in more detail what the data is, and more detail about what analysis will be done.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.