BookmarkSubscribeRSS Feed
Wolverine
Quartz | Level 8

This is what I have:

 FIRST_NAME  MIDDLE_NAME LAST_NAME  Sex  DOB  SSN  health_sys  MRN  hsys_MRN_ID  MIDBZip
John Q Public M 1/1/2010 987654321 Mercy 2222222 Mercy_2222222 34567
John Q Public M 1/1/2010 987654321 Mercy 2222222 Mercy_2222222 34577
John   Public M 1/1/2010 987654321 Saint 333333 Saint_333333 34568
John   Public M 1/1/2010 987654321 Saint 333333 Saint_333333 34577

 

And this is what I want:

 FIRST_NAME  MIDDLE_NAME LAST_NAME  Sex  DOB  SSN  health_sys_1  MRN_1  hsys_MRN_ID_1  health_sys_2  MRN_2  hsys_MRN_ID_2  health_sys_3  MRN_3  hsys_MRN_ID_3  MIDBZip_A  MIDBZip_B  MIDBZip_C
John Q Public M 1/1/2010 987654321 Mercy 2222222 Mercy_2222222 Saint 333333 Saint_333333       34567 34577 34568

 

Note that the zip code variables are A, B, C rather than 1, 2, 3 because I just need to know that the person was at that zip code at some point -- it doesn't have to be linked to a particular health system. In fact, zip code 34577 shows up for both health systems, but it should only appear once in the output record.

 

Also note that I want to retain Middle Name, but sometimes it's blank.  So I want to preferentially select it from records where it's present.

9 REPLIES 9
japelin
Rhodochrosite | Level 12

Since there are multiple issues involved, I thought they should be handled separately.
1. work to fill in the missing middle names
2. combining zip codes into 1obs per person
3. combining other information into 1obs
(I processed this as a SSN unique key)

Please forgive the abbreviations for variable names and types.

 

/* CREATE TESTDATA */
data original;
  attrib fn length=$20  label='FIRST_NAME';
  attrib mn length=$20  label='MIDDLE_NAME';
  attrib ln length=$20  label='LAST_NAME';
  attrib sex length=$1  label='Sex';
  attrib dob length=$10  label='DOB';
  attrib ssn length=$20  label='SSN';
  attrib hsys length=$20  label='health_sys';
  attrib mrn length=$20  label='MRN';
  attrib mrnid length=$20  label='hsys_MRN_ID';
  attrib zip length=$5  label='MIDBZip';
  infile cards missover dsd;
  input fn mn ln sex dob ssn hsys mrn mrnid zip $;
cards;
John,Q,Public,M,1/1/2010,987654321,Mercy,2222222,Mercy_2222222,34567
John,Q,Public,M,1/1/2010,987654321,Mercy,2222222,Mercy_2222222,34577
John, ,Public,M,1/1/2010,987654321,Saint,333333,Saint_333333,34568
John, ,Public,M,1/1/2010,987654321,Saint,333333,Saint_333333,34577
;
run;

/* BASE DATASET:START */
proc sort data=original out=sorted;
  by ssn descending mn;
run;
/* BASE DATASET:END */

/* ZIP DATASET:START */
proc sort data=sorted out=tmp1(keep=ssn zip) nodupkey;
  by ssn zip;
run;
data tmp2;
  set tmp1;
length abc $1; by ssn; cnt+1; abc=byte(64+cnt); run; proc transpose data=tmp2 out=zip(keep=ssn zip_:) prefix=zip_; var zip; by ssn; ID abc; run; /* ZIP DATASET:END */ /* PERSONALINFO DATASET:START */ data tmp3(drop=middlename zip cnt); set sorted; retain middlename; by ssn descending mn; if first.ssn then do middlename=mn; cnt=1; end; else do; cnt+1; end; if mn='' then mn=middlename; run; proc sort data=tmp3 out=personal_all nodup; by ssn; run; proc sort data=personal_all out=personal(keep=fn--ssn) force; by ssn; run; proc sort data=personal nodup; by ssn; run; /* PERSONALINFO DATASET:END */ /* TRANSPOSE BY VARIABLES:START */ proc sort data=personal_all out=trn(keep=ssn--mrnid) force; by ssn; run; proc transpose data=trn out=trn_hsys(keep=ssn hsys_:) prefix=hsys_; var hsys; by ssn; run; proc transpose data=trn out=trn_mrn(keep=ssn mrn_:) prefix=mrn_; var mrn; by ssn; run; proc transpose data=trn out=trn_mrnid(keep=ssn mrnid_:) prefix=mrnid_; var mrnid; by ssn; run; /* TRANSPOSE:END */ /* MERGE ALL */ data transposed; merge personal trn_: zip; by ssn; run;

 

Wolverine
Quartz | Level 8

Just to give some background info, I'm passing the final dataset onto another group of researchers after I'm done with my part of the process.  Unfortunately, they have made a change about how they want the output file to look.  They now want to retaining the link between zip and health system.  And since there may be multiple zip codes per health system, even if there are repeats.  I've renamed them 1A, 1B, 2A, 2B, etc:

BENEFICIARY_ID  FIRST_NAME  MIDDLE_NAME LAST_NAME  Sex  DOB  SSN  health_sys_1  MRN_1  hsys_MRN_ID_1  health_sys_2  MRN_2  hsys_MRN_ID_2  health_sys_3  MRN_3  hsys_MRN_ID_3  MIDBZip_1A  MIDBZip_1B  MIDBZip_2A  MIDBZip_2B
12345678 John Q Public M 1/1/2010 987654321 Mercy 2222222 Mercy_2222222 Saint 333333 Saint_333333       34567 34577 34568 34577

 

Also, there is at least one example I found where a person has a middle initial in one record and a full middle name in another.  So I created a ranking system for that.

 

I've never actually used PROC TRANSPOSE before, and this seems like a very complicated data file to start with!  Between that and the changes I mentioned above, I'm having a hard time trying to figure out how to modify the code you posted accordingly.  Below is what I have so far, and when I run it, I get notes in the log like this: "NOTE: Invalid numeric data, MIDDLE_NAME='H' , at line 119 column 20".  In other words, SAS seems to think that MIDDLE_NAME should be numeric but of course it's finding text instead (in this case, a person with the middle initial H).

/* Section 1.1 -- Preferentially select records with full middle name, then initial,
then blank. LENGTH returns a value of 1 for blank, whereas LENGHTN returns a value of 0.*/
DATA temp.MIDB_Medicaid_linked_servdatedd; SET temp.MIDB_Medicaid_linked_servdatedd;

IF LENGTHN(MIDDLE_NAME) > 1 THEN MNRank = 1;
IF LENGTHN(MIDDLE_NAME) = 1 THEN MNRank = 2;
IF LENGTHN(MIDDLE_NAME) = 0 THEN MNRank = 3;

RUN;

/* Section 1.2 -- Sort on appropriate variables. */
PROC SORT data=temp.MIDB_Medicaid_linked_servdatedd out=temp.MIDB_Medicaid_servdatedd_srt;
  BY SSN descending MNRank;
RUN;


/* Section 1.3 -- PERSONALINFO DATASET:START */
DATA temp.tmp3(drop=MIDDLE_NAME MIDBzip cnt);
  SET temp.MIDB_Medicaid_servdatedd_srt;
  RETAIN MIDDLE_NAME MIDBzip;
  BY SSN descending MNRank;
  if first.SSN then do
    MIDDLE_NAME=mn; 
    cnt=1;
  end; else
  do;
    cnt+1;
  end;
  if mn='' then mn=MIDDLE_NAME;
RUN;

 

 

 

japelin
Rhodochrosite | Level 12

First of all, I have to apologize for using an inappropriate name for the variable.

I can then write it with the correct variable names and the dataset names you have, but first let me know.

What are the keys 1 and 2 in the variable names MIDBZip_1A MIDBZip_1B MIDBZip_2A MIDBZip_2B based on: MRN or hsys_MRN_ID?
I assume that A and B are the order within that key, is that correct?

 

I'm not sure what the BENEFICIARY_ID variable is, but I fixed it anyway.

Please try this code.

Note that the transposed dataset generated below may not have health_sys_3, MRN_3, and hsys_MRN_ID_3.
This is because the original dataset contains only two types of values.

 


/* Section 1.1 -- Preferentially select records with full middle name, then initial,
then blank. LENGTH returns a value of 1 for blank, whereas LENGHTN returns a value of 0.*/
DATA temp.MIDB_Medicaid_linked_servdatedd; 
  SET temp.MIDB_Medicaid_linked_servdatedd;
  /* If there were multiple middle names, it would not be possible to determine if MNRank=1,2,3, so the value of lengthn was used as the sort key. */
  MNRank=LENGTHN(MIDDLE_NAME);
RUN;

/* Section 1.2 -- Sort on appropriate variables. */
PROC SORT data=temp.MIDB_Medicaid_linked_servdatedd out=temp.MIDB_Medicaid_servdatedd_srt;
  BY SSN descending MNRank;
RUN;
/* BASE DATASET:END */

/* AFTER Section 1.2 */

/* ZIP DATASET:START */
proc sort data=temp.MIDB_Medicaid_linked_servdatedd out=tmp1(keep=ssn hsys_MRN_ID MIDBZip);
  by ssn hsys_MRN_ID MIDBZip;
run;
data tmp2;
  set tmp1;
  length suffix $2;
  by ssn hsys_MRN_ID;
  if first.hsys_MRN_ID then do;
    cntABC=0;
    cntNUM+1;
  end;
  cntABC+1;
  suffix=cats(cntNUM,byte(64+cntABC));
run;
proc transpose data=tmp2 out=zip(keep=ssn MIDBZip_:) prefix=MIDBZip_;
  var MIDBZip;
  by ssn;
  ID suffix;
run;
/* ZIP DATASET:END */

/* PERSONALINFO DATASET:START */
data tmp3(drop=middlename MIDBZip cnt);
  set temp.MIDB_Medicaid_servdatedd_srt;
  retain middlename;
  by ssn descending MNRank;
  if first.ssn then do
    middlename=MIDDLE_NAME; 
    cnt=1;
  end; else
  do;
    cnt+1;
  end;
  MIDDLE_NAME=middlename;/* To overwrite with a long middle name */
run;
proc sort data=tmp3 out=personal(keep=FIRST_NAME--ssn) force nodupkey;
  by ssn;
run;
/* PERSONALINFO DATASET:END */

/* TRANSPOSE BY VARIABLES:START */
proc sort data=temp.MIDB_Medicaid_linked_servdatedd(keep=ssn--hsys_MRN_ID) out=trn;
  by ssn;
run;
proc sort data=trn force nodup;
  by ssn;
run;
proc transpose data=trn out=trn_hsys(keep=ssn health_sys_:) prefix=health_sys_;
  var health_sys;
  by ssn; 
run;
proc transpose data=trn out=trn_mrn(keep=ssn mrn_:) prefix=mrn_;
  var mrn;
  by ssn; 
run;
proc transpose data=trn out=trn_mrnid(keep=ssn hsys_MRN_ID_:) prefix=hsys_MRN_ID_;
  var hsys_MRN_ID;
  by ssn; 
run;
/* TRANSPOSE:END */

/* MERGE ALL */
data transposed;
  merge personal trn_: Zip;
  by ssn;
run;

 

 

 

Wolverine
Quartz | Level 8

Thank you for updating the syntax. But before we go any further, I should let you know that they asked me to stop working on this temporarily, as there may be more changes to the layout🙄  We're collaborating with multiple agencies, so it's hard to figure out a layout that will work for everyone.

 

But I did run your code, and I got this error message: 

 

141 /* TRANSPOSE BY VARIABLES:START */
142 proc sort data=temp.MIDB_Medicaid_linked_servdatedd(keep=ssn--hsys_MRN_ID) out=trn;
ERROR: Starting variable ssn of double-dash list has been defined after the ending variable
hsys_MRN_ID.
143 by ssn;
144 run;

japelin
Rhodochrosite | Level 12

how this change.

proc sort data=temp.MIDB_Medicaid_linked_servdatedd out=trn(keep=ssn--hsys_MRN_ID);
  by ssn;
run;

or

/* It's not a smart description. */
proc sort data=temp.MIDB_Medicaid_linked_servdatedd out=trn(keep=SSN health_sys MRN hsys_MRN_ID);
  by ssn;
run;

I know what you're saying.

Climb the boat with many captains...lol

Wolverine
Quartz | Level 8

They've finalized the layout, and it's really not much different than before.  So I ran this code and it's very close to what I need.  The only problem is that the zip code variables are only populated for the first record.  They are blank for everyone else.  Hopefully it's an easy fix!

 

/* Section 1.1 -- Preferentially select records with full middle name, then initial,
then blank. LENGTH returns a value of 1 for blank, whereas LENGHTN returns a value of 0.*/
DATA temp.MIDBall_Medicaid_linked_servdate; 
  SET temp.MIDBall_Medicaid_linked_servdate;
  /* If there were multiple middle names, it would not be possible to determine if MNRank=1,2,3, 
  	so the value of lengthn was used as the sort key. */
  MNRank=LENGTHN(MIDDLE_NAME);
RUN;

/* Section 1.2 -- Sort on appropriate variables. */
PROC SORT data=temp.MIDBall_Medicaid_linked_servdate out=temp.MIDB_Medicaid_servdatedd_srt;
  BY SSN descending MNRank;
RUN;
/* BASE DATASET:END */

/* AFTER Section 1.2 */

/* ZIP DATASET:START */
proc sort data=temp.MIDBall_Medicaid_linked_servdate out=tmp1(keep=ssn hsys_MRN_ID MIDBZip);
  by ssn hsys_MRN_ID MIDBZip;
run;
data tmp2;
  set tmp1;
  length suffix $2;
  by ssn hsys_MRN_ID;
  if first.hsys_MRN_ID then do;
    cntABC=0;
    cntNUM+1;
  end;
  cntABC+1;
  suffix=cats(cntNUM,byte(64+cntABC));
run;
proc transpose data=tmp2 out=zip(keep=ssn MIDBZip_:) prefix=MIDBZip_;
  var MIDBZip;
  by ssn;
  ID suffix;
run;
/* ZIP DATASET:END */

/* PERSONALINFO DATASET:START */
data tmp3(drop=middlename MIDBZip cnt);
  set temp.MIDB_Medicaid_servdatedd_srt;
  retain middlename;
  by ssn descending MNRank;
  if first.ssn then do
    middlename=MIDDLE_NAME; 
    cnt=1;
  end; else
  do;
    cnt+1;
  end;
  MIDDLE_NAME=middlename;/* To overwrite with a long middle name */
run;
proc sort data=tmp3 out=personal(keep=FIRST_NAME--ssn) force nodupkey;
  by ssn;
run;
/* PERSONALINFO DATASET:END */

/* TRANSPOSE BY VARIABLES:START */
proc sort data=temp.MIDB_Medicaid_linked_servdatedd out=trn(keep=SSN health_sys MRN hsys_MRN_ID);
  by ssn;
run;
proc sort data=trn force nodup;
  by ssn;
run;
proc transpose data=trn out=trn_hsys(keep=ssn health_sys_:) prefix=health_sys_;
  var health_sys;
  by ssn; 
run;
proc transpose data=trn out=trn_mrn(keep=ssn mrn_:) prefix=mrn_;
  var mrn;
  by ssn; 
run;
proc transpose data=trn out=trn_mrnid(keep=ssn hsys_MRN_ID_:) prefix=hsys_MRN_ID_;
  var hsys_MRN_ID;
  by ssn; 
run;
/* TRANSPOSE:END */

/* MERGE ALL */
data temp.transposed;
  merge personal trn_: Zip;
  by ssn;
run;
japelin
Rhodochrosite | Level 12

Can you please indicate what kind of data you want for the zip code?
I don't know what the situation is with "the zip code variables are only populated for the first record." because right now my program should be generating only one record with SSN as the key variable.

Wolverine
Quartz | Level 8

The zip code is a 5-digit numeric number.  And yes, the file has 1 record per person.  So when I say only the first record has zip codes, I mean that zip codes appear for only the very first person in the file.  For everyone else, the zip code variables are blank.

japelin
Rhodochrosite | Level 12

like this?

 

2021-02-19_07h47_12.png

If possible, I would like to see the actual (even dummy) data regarding the zip code.

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1212 views
  • 0 likes
  • 2 in conversation