BookmarkSubscribeRSS Feed
Jeffery_Porter
Obsidian | Level 7

Hello and good morning!

 

I have 4 different datasets that I am looking to merge as one based on the patient's accountnumber as the unique identifier. However due to the amount of columns it will be difficult to track all of the different columns from the different sets of data without some sort of flag. I was wonder if it would be possible to Rename all of the Columns inside of a dataset to have a prefix of "PG_"+name if the data is coming from the PG dataset or "RTA_"+name if the data is coming from the RTA dataset.  

 

I've seen online a lot of different information and I was hoping someone would be able to point me to a specific solution for my situation.

 

so the end result would be to have placed 4 different datasets together into 1, each set would have a prefix to flag where it came from to make it easier to track that specific information.

 

AccountNumber | AdmissionDate | DCDate | PG_Helping | PG_Sight | PG_Skin | RTA_Symptoms | RTA_Signs | Fall_Date | Fall_Location etc. etc.

 

Thank you for your assistance!

6 REPLIES 6
ballardw
Super User

I am not sure that what you are requesting is going to do what you need.

First are you actually meaning "merge" as in match rows by the patients account number? Are you sure that every account number exists only one time (or none) in every one of those 4 data sets? If not you have a many-to-many situation which is a bit more complicated to deal with (unless you are pairing that account number with other identifiers such as a date that does only occur once per set).

When I see the word "patient" I actually expect everything to be multiple in every set because a single visit may have multiple procedures, tests, therapies or what have you that tend to have multiple entries in almost every thing related.

 

It may be that you actually mean, or would be better off with appending data (adding rows) and having an actual identifier added to each record so that you know the source. Which would look like:

 

data want; 
   set file1
        file2
        setsomething
        yetanotherdataset
        indsname=inname
   ;
   source=inname;
run;

The option INDSNAME= creates a temporary variable with the given name, inname in my example, that has the full name of the data set each record comes from, library and set name. This variable is captured in Source so it becomes a permanent variable with that information.

Of course once you have a variable it can be manipulated like any other character variable. So if you don't need the library name you could remove it or parse other information that might be of use.

Jeffery_Porter
Obsidian | Level 7

Thank you both for your responses.

 

I am not fully sure of everything you said, as some of that is new to me. However, I am willing to learn!

 

Each of these data sets pre-exist and have a distinct value associated with them - the patient's account number. It is listed 1 time in each dataset as a identifier. 

Examples of each dataset

[DemoGraphic Data]

PatientAccountNum | FirstName | LastName | Birthdate | County | Admission Date etc.

J110114542112 | John | Doe | 01/01/1960 | Duval | 03/20/2023

[Press-Ganey Data]

PatientAccountNum | FirstName | LastName | PTgoals | PTconcerns | OTgoals | OTconcerns

J110114542112 | John | Doe | 5 | 5 | 4 | 5

[RTA Data]

PatientAccountNum | FirstName | LastName | Date_of_event | symptoms | durationOfSymptoms 

J110114542112 | John | Doe | 03/24/2023 | SOB,Tachycardia | <1 hour

[Falls Data]

PatientAccountNum | FirstName | LastName | DateofFall | LocationOfFall | TimeOfFall

J110114542112 | John | Doe | 03/22/2023 | Room | 1300

 

This is assuming that the patient with account number J110114542112 actually had all of these events occur (if it didn't then it would only contain the information that is actually listed)

 

EXPECTED OUTPUT:

After I remove some of the duplicate data:

 

PatientAccountNum | FirstName | LastName | Birthdate | County | Admission Date | PG_PTgoals | PG_PTconcerns | PG_OTgoals | PG_OTconcerns | RTA_Date_of_event | RTA_symptoms | RTA_durationOfSymptoms  | Fall_DateofFall | Fall_LocationOfFall | Fall_TimeOfFall

 

J110114542112 | John | Doe | 01/01/1960 | Duval | 03/20/2023 | 5 | 5 | 4 | 5 | 03/24/2023 | SOB,Tachycardia | <1 hour | 03/22/2023 | Room | 1300

 

I am attempting to tie in patient demographic data that is gathered from our EMR that already pre-exists in a dataset, with these data I want to tie in link/merge the data found in 3 other datasets (all merged on the patient's account number - which exists uniquely to the patients' visit.) using the account number of the patient.

 

I am currently doing this now, however it is very time consuming and I was wondering if I could learn a faster way that would allow me to flag all of the new data that gets merged from the datasets with a prefix to make it easier to identify exactly where the information is coming from when I am building dashboards. If there is not easily solution I can also just individually pull and rename each of the elements (about 500) but I wanted to check here first.

 

Thank you!

Tom
Super User Tom
Super User

First let's convert your listings into actual datasets (with valid SAS names).

data DemoGraphic;
  infile cards dsd dlm='|' truncover;
  input PatientAccountNum :$15. FirstName :$20. LastName :$20. Birthdate :mmddyy. County :$20. AdmissionDate :mmddyy.;
  format Birthdate AdmissionDate yymmdd10.;
cards;
J110114542112|John|Doe|01/01/1960|Duval|03/20/2023
;

data PressGaney;
  infile cards dsd dlm='|' truncover;
  input PatientAccountNum :$15. FirstName :$20. LastName :$20. PTgoals  PTconcerns OTgoals OTconcerns;
cards;
J110114542112|John|Doe|5|5|4|5
;

data RTA;
  infile cards dsd dlm='|' truncover;
  input PatientAccountNum :$15. FirstName :$20. LastName :$20. Date_of_event :mmddyy. symptoms :$20. durationOfSymptoms :$20.;
  format Date_of_event yymmdd10.;
cards;
J110114542112|John|Doe|03/24/2023|SOB,Tachycardia|<1 hour
;

data Falls;
  infile cards dsd dlm='|' truncover;
  input PatientAccountNum :$15. FirstName :$20. LastName :$20. DateofFall :mmddyy. LocationOfFall :$20. TimeOfFall ;
  format DateofFall yymmdd10.;
cards;
J110114542112|John|Doe|03/22/2023|Room|1300
;

Make sure each dataset is sorted by the key variable, PatientAccountNum , and then just MERGE them.

If you know everyone is going to be in the DEMOGRAPHICS dataset then just drop the other common variables.

data want;
  merge DemoGraphic 
        PressGaney(drop=FirstName LastName) 
        RTA(drop=FirstName LastName)
        Falls(drop=FirstName LastName)
  ;
  by PatientAccountNum ;
run;

If you don't know that then perhaps you can just merge by ALL of the common variables (and hope that their name is recorded the same in every place).


data want;
  merge DemoGraphic PressGaney RTA Falls;
  by PatientAccountNum FirstName LastName;
run;

Result:

2908  data _null_;
2909    set want;
2910    put (_all_) (=/);
2911  run;


PatientAccountNum=J110114542112
FirstName=John
LastName=Doe
Birthdate=1960-01-01
County=Duval
AdmissionDate=2023-03-20
PTgoals=5
PTconcerns=5
OTgoals=4
OTconcerns=5
Date_of_event=2023-03-24
symptoms=SOB,Tachycardia
durationOfSymptoms=<1 hour
DateofFall=2023-03-22
LocationOfFall=Room
TimeOfFall=1300
NOTE: There were 1 observations read from the data set WORK.WANT
PaigeMiller
Diamond | Level 26

Just because you have requested a renaming solution does not mean that is the best way to proceed. If I had a vote, I would vote for the solution proposed by @ballardw which does not involve renaming at all, but involves re-arranging the data in a way that is much more amenable to additional programming. By adopting that solution, you can use things such as BY statements in additional programming to handle data from each data set; it is also in a form that makes using certain types of reporting in PROC REPORT very easy; and things like comparing means via PROC TTEST or PROC GLM are also possible with this re-arranging of data (but requires more work if you use the renaming solution). Plus many other benefits.

 

As always, the best approach depends on what happens next to this data, which you haven't explained. Please tell us what you intend to do with this data after you have renamed it; or what you intend to do with this data after you have applied @ballardw 's solution. 

--
Paige Miller
Quentin
Super User

I share other people's concerns about super-wide datasets with hundreds (thousands?) of variables.

 

But the idea of automating renaming of variables makes sense.  There are many ways to approach this sort of code generation task.  The basic idea is to build a list of variables in a dataset, and then build a list for a rename statement.  That is, you want :

data want; 
  set have (rename=(x=have_x y=have_y  z=have_z)) ;  *this a list: x=have_x y=have_y  z=have_z ;
run;

There are many code-generation techniques in SAS.  One way is to use the macro language.  Below macro uses PROC CONTENTS to generate the list of variables, then uses PROC SQL to generate the list of renames and it returns the list.  It assumes that the variable names in the data will be short enough to accommodate having the dataset name prepended to them.

 

%macro RenameList(data=);
  %local renamelist rc;

  %let rc=%sysfunc(dosubl(%nrstr(
  proc contents data=&data out=__varlist(keep=memname name) noprint;
  run;

  proc sql noprint;
    select cats(name,"=",memname,"_",name) into :renamelist separated by " "
    from __varlist
    ;
    drop table __varlist;
  quit;
 
  )));

&renamelist
%mend RenameList;

Test like:

243  %put %RenameList(data=sashelp.shoes) ;
Inventory=SHOES_Inventory Product=SHOES_Product Region=SHOES_Region Returns=SHOES_Returns Sales=SHOES_Sales Stores=SHOES_Stores Subsidiary=SHOES_Subsidiary

244  %put %RenameList(data=sashelp.class(drop=name)) ;
Age=CLASS_Age Height=CLASS_Height Sex=CLASS_Sex Weight=CLASS_Weight

When you have a function-style macro like that, you can use it in Tom's helpful MERGE statement, to rename variables.  When you call %renamelist, you would drop PatientAccountNum and any other variables don't want to be renamed.

 

data want;
  merge DemoGraphic (drop=FirstName LastName rename=(%RenameList(data=DemoGraphic(drop=PatientAccountNum FirstName LastName)))) 
        PressGaney  (drop=FirstName LastName rename=(%RenameList(data=PressGaney(drop=PatientAccountNum FirstName LastName)))) 
        RTA         (drop=FirstName LastName rename=(%RenameList(data=RTA(drop=PatientAccountNum FirstName LastName))))
        Falls       (drop=FirstName LastName rename=(%RenameList(data=Falls(drop=PatientAccountNum FirstName LastName))))
  ;
  by PatientAccountNum ;
run;

WANT has variables:

#    Variable

1    PatientAccountNum
2    DEMOGRAPHIC_Birthdate
3    DEMOGRAPHIC_County
4    DEMOGRAPHIC_AdmissionDate
5    PRESSGANEY_PTgoals
6    PRESSGANEY_PTconcerns
7    PRESSGANEY_OTgoals
8    PRESSGANEY_OTconcerns
9    RTA_Date_of_event
0    RTA_symptoms
1    RTA_durationOfSymptoms
2    FALLS_DateofFall
3    FALLS_LocationOfFall
4    FALLS_TimeOfFall

 

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Creating Custom Steps in SAS Studio

Check out this tutorial series to learn how to build your own steps in SAS Studio.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 6451 views
  • 3 likes
  • 6 in conversation