Solved: Re: Renaming variables and reduce number of observations

rookie21 · Posted 12-21-2021 09:58 AM

Hi all,

Im attempting to have only one entry point for each operation(variable name: id) in my long dataset. One operation can include information on multiple lines (as for line1 and 2).

Now it is in the format:

Operation 1	Lead 1
Operation 1	Lead 2
Operation 2
Operation 3

I would like:

Operation 1	Lead 1	Lead 2
Operation 2	.	.
Operation 3	.	.

Lead_x contains multiple variables (33 variables) which I think should be renamed to move to the corresponding row with a suffix indicating 1,2, etc

Part of data for one patient (record 15) (not showing all 33 variables corresponding to lead, but some):

data test1;
   input record id date record_1 id_1 id_lead type_lead$;
   informat date ddmmyy10. ;
	format date ddmmyy10.;
   datalines;
	15 62 15-09-2003 15 62 84 LVL
	15 62 15-09-2003 15 62 83 HVL
	15 85 20-12-2010 . . . .
	15 13 04-08-2017 . . . . 
;
run;

I would like the two first rows to be together as:

15 62 15-09-2003 15 62 84 LVL 83 HVL

so only the new variables are in first row

I thought some of the code would include if statement,.. something like: if id = id_1 but id_lead is not equal to impl_id

Want output to be:

`record`	`id`	`date`	`record_1`	`id_1`	`id_lead`	`type_lead`	id_lead_2	type_lead_2
15	62	15-09-2003	15	62	84	LVL	83	HVL
15	85	20-12-2010
15	85	04-08-2017

How to rename new variables (id_lead_2 and type_lead_2) and how to change them from one row to the other ..

As seen the unique variable connecting line 1 and 2 is the id and id_1 which is the same .

Hope to get some help

rookie21 · Posted 12-22-2021 08:43 AM

Can each operation have multiple "leads"?

- Yes, each operation can have up till 3 leads

What does "lead" mean? Are you collecting electrocardiograms during the operation?

No, not electrocardiograms. Its implantation of heart devices. And leads are those wires which are put in the body.

So if I divide the variables into 4 levels. I have:

- a patient identifier: record

- operation information: id (unique for each operation) date (operation date)

- battery information: record_1 (actually copied from record), id_1 (copied from id above)

- lead/wire information: id_lead (unique for each implanted wire) and type_lead

Now I got the data to include one record pr operation pr patient. As the data was pivoted in three "extra" levels, I got a total of more than 500 variables.

I've used a datastep to clean it up a bit by using keep and drop statements.

Im doing some more coding to classify what happens at each operation.

After that I shall present my data with only one record pr patient in the study group.

Think I can use:

proc summary data = want;
	by record ;
	output out=one_line_pr_patient (drop=_type_) idgroup(out[1] ( id_1 -- type_lead_3 ) =); 
run;

and then again clean it up a bit, so the variables with same contents is dropped.

Or is there a more "correct" way of specifing earlyer with variables (the string of lead information) that can get pivoted ?

View solution in original post

PaigeMiller · Posted 12-21-2021 10:03 AM

I'm going to go off on a tangent, and provide advice. The usual advice is to not convert data sets to wide. This generally is not necessary, and causes subsequent programming to be more difficult. See Maxim 19 "Long beats Wide".

So, I feel compelled to ask ... why do you want a wide data set like this? What can you do with a wide data set that you can't do with a long data set?

--
Paige Miller

rookie21 · Posted 12-21-2021 10:36 AM

Thanks @PaigeMiller for the nice link . Found the description of long/wide

In this particular dataset the reason is that dates, reason of operation, complications and so on are linked with a unique number between dataset. For instance the date of lead implantation is not in the same dataset. I used proc sql to gather information between different sets.

If I get the data from each operation on same line, I (think) its gonna be easier to calculate batery duration and so on

But I still haven't found the correct step to use in SAS language

PaigeMiller · Posted 12-21-2021 10:39 AM

In this particular dataset the reason is that dates, reason of operation, complications and so on are linked with a unique number between dataset. For instance the date of lead implantation is not in the same dataset. I used proc sql to gather information between different sets.

This is fine and makes sense, but it is not what I was asking about. I want to know, once you have this wide data set, what analysis thereafter will you be doing that requires wide rather than long?

--
Paige Miller

rookie21 · Posted 12-21-2021 10:48 AM

Ah,

Im actually not good enough in sas to tell whether specific analysis cant be made in wide vs long.

However, what I need to do with the dataset:

- create a new variable which needs informations from the new variables (from the "long" observation line")

- Use the new variables to divide the patients in operation types and for each type describe the demographics

- calculate which operation type has the best outcome , and if there is a statistical difference

PaigeMiller · Posted 12-21-2021 10:50 AM

Use the new variables to divide the patients in operation types and for each type describe the demographics

This to me screams long data, not wide data.

Even the merging can be done in the long data form, if I am understanding you properly.

--
Paige Miller

rookie21 · Posted 12-21-2021 11:08 AM

Ok, since Im still feeling pretty new - can you please explain

I personally thought it would be easier to calcute the number of operations and leads pr operation, if they were presented wide.

Lets say

patient_1 Operation_1 Lead

patient_1 operation _1 lead

How can I get SAS to "count" them ( I have a unique lead-nr) in long dataset ?

PaigeMiller · Posted 12-21-2021 11:10 AM

@rookie21 wrote:

Ok, since Im still feeling pretty new - can you please explain

I personally thought it would be easier to calcute the number of operations and leads pr operation, if they were presented wide.

Lets say

patient_1 Operation_1 Lead

patient_1 operation _1 lead

How can I get SAS to "count" them ( I have a unique lead-nr) in long dataset ?

I am not grasping the meaning of your question here. Make it more specific and concrete.

--
Paige Miller

Tom · Posted 12-21-2021 11:12 AM

What do you want to count?

If PATIENT is the variable with the patient identifier and OPERATION is the variable that identifies the operation type then you can count how many times each patient had each operation type with a simple PROC FREQ.

proc freq;
  tables patient*operation / list;
run;

If you need something more exotic you might need to do some pre-processing or perhaps use PROC SQL.

Tom · Posted 12-21-2021 10:50 AM

It is not hard to do, but it really seems like a waste of time as the result will be a dataset that is almost impossible to use for anything.

The easiest way to transpose a whole series of variables is to use PROC SUMMARY and the IDGROUP feature.

You will need to know before calling PROC SUMMARY the maximum number of observations you want to keep. If you don't know you can either just set some value larger than you expect or run a separated step to count the maximum. For this example I will use 5 since I can see you only have 4 observations so 5 will be more than enough.

Here is your data step cleaned up a little. Do not indent the lines of data, it will make it hard to write the input statement. Do not intend the DATALINES (also known as CARDS) statement as that will just encourage the program editor to indent the lines of data. Do not display dates with MDY or DMY order, that will just confuse 50% of your audience.

data test1;
  input record id date record_1 id_1 id_lead type_lead $;
  informat date ddmmyy10. ;
  format date yymmdd10.;
datalines;
15 62 15-09-2003 15 62 84 LVL
15 62 15-09-2003 15 62 83 HVL
15 85 20-12-2010 . . . .
15 13 04-08-2017 . . . . 
;

So to transpose with PROC SUMMARY you need some key variable that uniquely identifies the observations you want to combine. So it looks like that is RECORD in this case. The data needs to be grouped by the key variable(s) so you can use a BY statement. If it is grouped, but not actually sorted in ascending or descending order then use the NOTSORTED keyword on the BY statement.

You also need the list of variables you want to take the values of.

proc summary data=test1 ;
  by record ;
  output out=want(drop=_type_) idgroup(out[5] (id date record_1 id_1 id_lead type_lead)= );
run;

You could also include some other summary statistics if you wanted.

rookie21 · Posted 12-21-2021 11:28 AM

@PaigeMiller Can see @Tom replied to my question with the proc freq step 🙂

I tried this code as you wrote

proc summary data=test1 ;
  by record ;
  output out=want(drop=_type_) idgroup(out[5] (id date record_1 id_1 id_lead type_lead)= );
run;

This collects all data on one line (everything for that patient) instead of for each operation.

Then I tried the same step with id:

proc sort data=test1; 
by id; run; 

proc summary data=test1 ;
  by id ;
  output out=want(drop=_type_) idgroup(out[5] (id date record_1 id_1 id_lead type_lead)= );
run;

and got (showing part of output)

Here it is asuming 5 of every variable . Is it not possible to have id date record_1 id_1 only once, while the unique data for the additional lead (id_lead and type_lead) is folded out with 5 repeated events ?

I tried deleting id date record_1 id_1 in the parentes, but that solely took out those data in the out put

Tom · Posted 12-21-2021 11:53 AM

You need to understand how your data is organized.

What are the variables that uniquely identify the observations you want to treat as a single group?

What does RECORD mean? Is that the patient identifier?

What does ID mean? Is that the operation identifier? Can the same ID be used by multiple patients? That doesn't seem plausible if it is an identifier of a particular operation. But if it is instead a classification of an operation (Hernia operation versus Open Heart Surgery) then it will not be unique.

Is the goal to get one record per patient per operation?

Then it sounds like your should use both variables in the BY statement.

But if the ID is a "type of operation" field then you might also need some other variable to uniquely identify the group. It is possible to multiple operations on your knee for example. So perhaps you need to include the DATE also?

proc summary data=test1 ;
  by record id date;
  output out=want(drop=_type_) idgroup(out[5] (record_1 id_1 id_lead type_lead)= );
run;

PaigeMiller · Posted 12-21-2021 12:06 PM

Agreeing with @Tom

I realize that figuring out the best layout for data, and the best way to proceed with analysis, is something that beginners cannot know (even experienced people struggle with this). This is why I am trying to steer you into what seems to be a better approach (and I assume the same is true for Tom).

But we can't know your data as well as you know it. Maxim 3, you know. So a combination of experienced SAS programmers and your knowledge of the data would be a very good combination. But we need you to explain in more detail what the data is, and more detail about what analysis will be done.

--
Paige Miller

rookie21 · Posted 12-21-2021 02:11 PM

Yes, its really difficult to be steps ahead now
I have tried to answer specific on Toms questions and really hope this makes more sense now as it seems so difficult . Hoped I could use proc transpose, but haven't seen it beeing used on a serial of variables.
Will try understand the steps in proc summary

rookie21 · Posted 12-21-2021 02:04 PM

Yes,
Record is patient identifier.
ID is operation identifier. The information in ID is equal to ID_1.
ID is unique for each operation and is therefore not shared between patients. ID is not "type" of operation.
The serial of information I would like to transpose as one is id_lead and type_lead.

The goal is to in the end to get on record pr patient - but first as you write: one record per patient per operation?

Does it make better sense now ?

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!