BookmarkSubscribeRSS Feed
dwhitney
Calcite | Level 5

I have a dataset with over 4M individuals categorized into 2 groups: CP1=0 and CP1=1. The data is the in the wide format (each row is a single individual). Each row (person) has anywhere from 0 to 80 medications; the variable name is Gnrc_Nm1 to Gnrc_Nm80, and the data for these variables are the "distinct" drug name in character format.

 

I would like to do 2 things in order:

 

First, I would like to identify the number of individuals per group (CP1) that has each of the distinct drug names. I imagine I would have to do some procedure that looks across Gnrc_Nm1 to Gnrc_Nm80 per person and then summarizes per group for each drug name. Each person has a distinct drug name, so no worries about duplicates per person (i.e., if someone had FLUOCINONIDE 3x, they will only have it once in Gnrc_Nmx). How would you recommend I go about this?

 

Second, I would like to do a cluster analysis to identify the top 10 most common clusters for each group. I cannot figure out how to do this with the data in the wide format, the variables Gnrc_Nm1 to Gnrc_Nm80, and the data for these variables as character. Any idea on how to go about this?

 

Below is an example of the data using proc print:

 

Obs Patid CP1 _LABEL_ Gnrc_Nm1 Gnrc_Nm2 Gnrc_Nm3 Gnrc_Nm4 Gnrc_Nm5 Gnrc_Nm6 Gnrc_Nm7 Gnrc_Nm8 Gnrc_Nm9 Gnrc_Nm10 Gnrc_Nm11 Gnrc_Nm12 Gnrc_Nm13 Gnrc_Nm14 Gnrc_Nm15 Gnrc_Nm16 Gnrc_Nm17 Gnrc_Nm18 Gnrc_Nm19 Gnrc_Nm20 Gnrc_Nm21 Gnrc_Nm22 Gnrc_Nm23 Gnrc_Nm24 Gnrc_Nm25 Gnrc_Nm26 Gnrc_Nm27 Gnrc_Nm28 Gnrc_Nm29 Gnrc_Nm30 Gnrc_Nm31 Gnrc_Nm32 Gnrc_Nm33 Gnrc_Nm34 Gnrc_Nm35 Gnrc_Nm36 Gnrc_Nm37 Gnrc_Nm38 Gnrc_Nm39 Gnrc_Nm40 Gnrc_Nm41 Gnrc_Nm42 Gnrc_Nm43 Gnrc_Nm44 Gnrc_Nm45 Gnrc_Nm46 Gnrc_Nm47 Gnrc_Nm48 Gnrc_Nm49 Gnrc_Nm50 Gnrc_Nm51 Gnrc_Nm52 Gnrc_Nm53 Gnrc_Nm54 Gnrc_Nm55 Gnrc_Nm56 Gnrc_Nm57 Gnrc_Nm58 Gnrc_Nm59 Gnrc_Nm60 Gnrc_Nm61 Gnrc_Nm62 Gnrc_Nm63 Gnrc_Nm64 Gnrc_Nm65 Gnrc_Nm66 Gnrc_Nm67 Gnrc_Nm68 Gnrc_Nm69 Gnrc_Nm70 Gnrc_Nm71 Gnrc_Nm72 Gnrc_Nm73 Gnrc_Nm74 Gnrc_Nm75 Gnrc_Nm76 Gnrc_Nm77 Gnrc_Nm78 Gnrc_Nm79 Gnrc_Nm80 FUdate CP1 CP2 ID1 ID2 ASD1 ASD2 Epil1 Epil2 Yrdob sex Race USreg YRsd age12345678 
10Generic NameFLUOCINONIDEOMEPRAZOLE                                                                              2016-01-01000000001960FW1201656
20Generic NameHYDROCORTISONE                                                                               2016-12-31000000001957MW2201659
31Generic NameAMOXICILLIN/POTASSIUM CLAVLEVOTHYROXINE SODIUM                                                                              2015-01-01000000001957FH1201558
40Generic NameALBUTEROL SULFATEALENDRONATE SODIUMAMLODIPINE/VALSARTANAMOXICILLINATENOLOLATORVASTATIN CALCIUMCITALOPRAM HYDROBROMIDECLOPIDOGREL BISULFATEFENOFIBRATE NANOCRYSTALLIZEDFENOFIBRIC ACID (CHOLINE)FLUTICASONE PROPIONATEFLUTICASONE/SALMETEROLFUROSEMIDEGABAPENTINGLIPIZIDELORAZEPAMMONTELUKAST SODIUMNEO/POLYMYX B SULF/DEXAMETHNIFEDIPINEPANTOPRAZOLE SODIUMPENTOXIFYLLINEROFLUMILASTTIOTROPIUM BROMIDEZOSTER VACCINE LIVE/PF                                                        2014-01-01000000001938FW1201476
51Generic NameAMOXICILLIN/POTASSIUM CLAVTOBRAMYCIN                                                                              2015-05-01000000001969MU3201546
61Generic NameBROMFENAC SODIUMCIPROFLOXACIN HCLLATANOPROSTTIMOLOL MALEATETRIAMCINOLONE ACETONIDE                                                                           2014-01-01000000001928MH1201486
70Generic NameAZITHROMYCIN                                                                               2016-11-30000000001978FH3201638
80Generic NameADAPALENE/BENZOYL PEROXIDECLINDAMYCIN PHOSPHATEDEXTROAMPHETAMINE/AMPHETAMINEL-NORGEST-ETH ESTR/ETHIN ESTRA                                                                            2015-07-01000000001985FW4201530
    

 

 

 

 

7 REPLIES 7
PaigeMiller
Diamond | Level 26

Please provide a portion of your data according to these instructions:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

--
Paige Miller
dwhitney
Calcite | Level 5

Here is a copy and paste after following the first set of instructions from the link you sent.

 

data WORK.PHARM1117_PMDESCRIPTIVES1;
  infile datalines dsd truncover;

input Patid:19. _NAME_:$8. _LABEL_:$40. Gnrc_Nm1:$50. Gnrc_Nm2:$50. Gnrc_Nm3:$50. Gnrc_Nm4:$50. Gnrc_Nm5:$50. Gnrc_Nm6:$50. Gnrc_Nm7:$50. Gnrc_Nm8:$50. Gnrc_Nm9:$50. Gnrc_Nm10:$50. Gnrc_Nm11:$50. Gnrc_Nm12:$50. Gnrc_Nm13:$50. Gnrc_Nm14:$50. Gnrc_Nm15:$50.
Gnrc_Nm16:$50. Gnrc_Nm17:$50. Gnrc_Nm18:$50. Gnrc_Nm19:$50. Gnrc_Nm20:$50. Gnrc_Nm21:$50. Gnrc_Nm22:$50. Gnrc_Nm23:$50. Gnrc_Nm24:$50. Gnrc_Nm25:$50. Gnrc_Nm26:$50. Gnrc_Nm27:$50. Gnrc_Nm28:$50. Gnrc_Nm29:$50. Gnrc_Nm30:$50. Gnrc_Nm31:$50. Gnrc_Nm32:$50. G
nrc_Nm33:$50. Gnrc_Nm34:$50. Gnrc_Nm35:$50. Gnrc_Nm36:$50. Gnrc_Nm37:$50. Gnrc_Nm38:$50. Gnrc_Nm39:$50. Gnrc_Nm40:$50. Gnrc_Nm41:$50. Gnrc_Nm42:$50. Gnrc_Nm43:$50. Gnrc_Nm44:$50. Gnrc_Nm45:$50. Gnrc_Nm46:$50. Gnrc_Nm47:$50. Gnrc_Nm48:$50. Gnrc_Nm49:$50. Gn
rc_Nm50:$50. Gnrc_Nm51:$50. Gnrc_Nm52:$50. Gnrc_Nm53:$50. Gnrc_Nm54:$50. Gnrc_Nm55:$50. Gnrc_Nm56:$50. Gnrc_Nm57:$50. Gnrc_Nm58:$50. Gnrc_Nm59:$50. Gnrc_Nm60:$50. Gnrc_Nm61:$50. Gnrc_Nm62:$50. Gnrc_Nm63:$50. Gnrc_Nm64:$50. Gnrc_Nm65:$50. Gnrc_Nm66:$50. Gnr
c_Nm67:$50. Gnrc_Nm68:$50. Gnrc_Nm69:$50. Gnrc_Nm70:$50. Gnrc_Nm71:$50. Gnrc_Nm72:$50. Gnrc_Nm73:$50. Gnrc_Nm74:$50. Gnrc_Nm75:$50. Gnrc_Nm76:$50. Gnrc_Nm77:$50. Gnrc_Nm78:$50. Gnrc_Nm79:$50. Gnrc_Nm80:$50. FUdate:YYMMDD10. CP1:32. CP2:32. ID1:32. ID2:32.
ASD1:32. ASD2:32. Epil1:32. Epil2:32. Yrdob:5. sex:$1. Race:$1. USreg:32. YRsd:32. age:32.;
  format Patid 19. FUdate YYMMDD10. Yrdob 5.;
  label Patid="Patient Id" _NAME_="NAME OF FORMER VARIABLE" _LABEL_="LABEL OF FORMER VARIABLE" Yrdob="Year of Birth" sex="Gender Code" Race="Race Code";
datalines;
33003282022 Gnrc_Nm Generic Name FLUOCINONIDE OMEPRAZOLE                                                                                                                                                             2016-01-01 0 0 0 0 0 0 0 0 1960 F W 1 2016
56
33003282023 Gnrc_Nm Generic Name HYDROCORTISONE                                                                                                                                                               2016-12-31 0 0 0 0 0 0 0 0 1957 M W 2 2016 59
33003282043 Gnrc_Nm Generic Name AMOXICILLIN/POTASSIUM CLAV LEVOTHYROXINE SODIUM                                                                                                                                                             2015-01-01 0 0 0 0
0 0 0 0 1957 F H 1 2015 58
33003282049 Gnrc_Nm Generic Name ALBUTEROL SULFATE ALENDRONATE SODIUM AMLODIPINE/VALSARTAN AMOXICILLIN ATENOLOL ATORVASTATIN CALCIUM CITALOPRAM HYDROBROMIDE CLOPIDOGREL BISULFATE FENOFIBRATE NANOCRYSTALLIZED FENOFIBRIC ACID (CHOLINE) FLUTICASONE PROPIONATE
FLUTICASONE/SALMETEROL FUROSEMIDE GABAPENTIN GLIPIZIDE LORAZEPAM MONTELUKAST SODIUM NEO/POLYMYX B SULF/DEXAMETH NIFEDIPINE PANTOPRAZOLE SODIUM PENTOXIFYLLINE ROFLUMILAST TIOTROPIUM BROMIDE ZOSTER VACCINE LIVE/PF
                                                                    2014-01-01 0 0 0 0 0 0 0 0 1938 F W 1 2014 76
33003282054 Gnrc_Nm Generic Name AMOXICILLIN/POTASSIUM CLAV TOBRAMYCIN                                                                                                                                                             2015-05-01 0 0 0 0 0 0 0 0
1969 M U 3 2015 46
33003282072 Gnrc_Nm Generic Name BROMFENAC SODIUM CIPROFLOXACIN HCL LATANOPROST TIMOLOL MALEATE TRIAMCINOLONE ACETONIDE
              2014-01-01 0 0 0 0 0 0 0 0 1928 M H 1 2014 86
33003282074 Gnrc_Nm Generic Name AZITHROMYCIN                                                                                                                                                               2016-11-30 0 0 0 0 0 0 0 0 1978 F H 3 2016 38
33003282079 Gnrc_Nm Generic Name ADAPALENE/BENZOYL PEROXIDE CLINDAMYCIN PHOSPHATE DEXTROAMPHETAMINE/AMPHETAMINE L-NORGEST-ETH ESTR/ETHIN ESTRA
                                      2015-07-01 0 0 0 0 0 0 0 0 1985 F W 4 2015 30
33003282082 Gnrc_Nm Generic Name SIMVASTATIN                                                                                                                                                               2014-01-01 0 0 0 0 0 0 0 0 1949 F W 2 2014 65
33003282095 Gnrc_Nm Generic Name AMOXICILLIN AMPICILLIN TRIHYDRATE DICLOFENAC SODIUM METFORMIN HCL METOPROLOL TARTRATE OFLOXACIN PREDNISOLONE ACETATE
                                        2014-01-01 0 0 0 0 0 0 0 0 1930 F W 2 2014 84
;;;;
NOTE: There were 10 observations read from the data set WOR.PHARM1117_PMDESCRIPTIVES1.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

 

ballardw
Super User

Please double check whether the source of the code actually wraps lines in the middle of the datalines. I might suggest pasting the code into a box opened with {I} instead of the running man icon as that may be setting line length limit that is causing the data step to have invalid data in many places.

 

Another option would be to consider only using 10 of the Gnrc_nm varaibles. If the FUdate CP1 CP2 ID1 ID2

ASD1 ASD2 Epil1 Epil2 Yrdob sex Race USreg YRsd age are not needed for the cluster definitions drop them.

 

data export;

   set yourdata (keep = Patid Gnrc_nm1 - Gncrc_nm10);

run;

for example and turn that into the data step. The method for determining clusters for 10 variables would be the same just different results.

 

 

dwhitney
Calcite | Level 5
The data do not wrap around like this in SAS. It must be a copy and paste issue.

Below I copied and pasted using the {i} function as suggested. I also attached the same code in Word format that seems to be a little bit more organized.

Let me know if this helps. Thank you for your time!

data WORK.PHARM1117_PMDESCRIPTIVES1; infile datalines dsd truncover; input Patid:19. _NAME_:$8. _LABEL_:$40. Gnrc_Nm1:$50. Gnrc_Nm2:$50. Gnrc_Nm3:$50. Gnrc_Nm4:$50. Gnrc_Nm5:$50. Gnrc_Nm6:$50. Gnrc_Nm7:$50. Gnrc_Nm8:$50. Gnrc_Nm9:$50. Gnrc_Nm10:$50. Gnrc_Nm11:$50. Gnrc_Nm12:$50. Gnrc_Nm13:$50. Gnrc_Nm14:$50. Gnrc_Nm15:$50. Gnrc_Nm16:$50. Gnrc_Nm17:$50. Gnrc_Nm18:$50. Gnrc_Nm19:$50. Gnrc_Nm20:$50. Gnrc_Nm21:$50. Gnrc_Nm22:$50. Gnrc_Nm23:$50. Gnrc_Nm24:$50. Gnrc_Nm25:$50. Gnrc_Nm26:$50. Gnrc_Nm27:$50. Gnrc_Nm28:$50. Gnrc_Nm29:$50. Gnrc_Nm30:$50. Gnrc_Nm31:$50. Gnrc_Nm32:$50. G nrc_Nm33:$50. Gnrc_Nm34:$50. Gnrc_Nm35:$50. Gnrc_Nm36:$50. Gnrc_Nm37:$50. Gnrc_Nm38:$50. Gnrc_Nm39:$50. Gnrc_Nm40:$50. Gnrc_Nm41:$50. Gnrc_Nm42:$50. Gnrc_Nm43:$50. Gnrc_Nm44:$50. Gnrc_Nm45:$50. Gnrc_Nm46:$50. Gnrc_Nm47:$50. Gnrc_Nm48:$50. Gnrc_Nm49:$50. Gn rc_Nm50:$50. Gnrc_Nm51:$50. Gnrc_Nm52:$50. Gnrc_Nm53:$50. Gnrc_Nm54:$50. Gnrc_Nm55:$50. Gnrc_Nm56:$50. Gnrc_Nm57:$50. Gnrc_Nm58:$50. Gnrc_Nm59:$50. Gnrc_Nm60:$50. Gnrc_Nm61:$50. Gnrc_Nm62:$50. Gnrc_Nm63:$50. Gnrc_Nm64:$50. Gnrc_Nm65:$50. Gnrc_Nm66:$50. Gnr c_Nm67:$50. Gnrc_Nm68:$50. Gnrc_Nm69:$50. Gnrc_Nm70:$50. Gnrc_Nm71:$50. Gnrc_Nm72:$50. Gnrc_Nm73:$50. Gnrc_Nm74:$50. Gnrc_Nm75:$50. Gnrc_Nm76:$50. Gnrc_Nm77:$50. Gnrc_Nm78:$50. Gnrc_Nm79:$50. Gnrc_Nm80:$50. FUdate:YYMMDD10. CP1:32. CP2:32. ID1:32. ID2:32. ASD1:32. ASD2:32. Epil1:32. Epil2:32. Yrdob:5. sex:$1. Race:$1. USreg:32. YRsd:32. age:32.; format Patid 19. FUdate YYMMDD10. Yrdob 5.; label Patid="Patient Id" _NAME_="NAME OF FORMER VARIABLE" _LABEL_="LABEL OF FORMER VARIABLE" Yrdob="Year of Birth" sex="Gender Code" Race="Race Code"; datalines; 33003282022 Gnrc_Nm Generic Name FLUOCINONIDE OMEPRAZOLE 2016-01-01 0 0 0 0 0 0 0 0 1960 F W 1 2016 56 33003282023 Gnrc_Nm Generic Name HYDROCORTISONE 2016-12-31 0 0 0 0 0 0 0 0 1957 M W 2 2016 59 33003282043 Gnrc_Nm Generic Name AMOXICILLIN/POTASSIUM CLAV LEVOTHYROXINE SODIUM 2015-01-01 0 0 0 0 0 0 0 0 1957 F H 1 2015 58 33003282049 Gnrc_Nm Generic Name ALBUTEROL SULFATE ALENDRONATE SODIUM AMLODIPINE/VALSARTAN AMOXICILLIN ATENOLOL ATORVASTATIN CALCIUM CITALOPRAM HYDROBROMIDE CLOPIDOGREL BISULFATE FENOFIBRATE NANOCRYSTALLIZED FENOFIBRIC ACID (CHOLINE) FLUTICASONE PROPIONATE FLUTICASONE/SALMETEROL FUROSEMIDE GABAPENTIN GLIPIZIDE LORAZEPAM MONTELUKAST SODIUM NEO/POLYMYX B SULF/DEXAMETH NIFEDIPINE PANTOPRAZOLE SODIUM PENTOXIFYLLINE ROFLUMILAST TIOTROPIUM BROMIDE ZOSTER VACCINE LIVE/PF 2014-01-01 0 0 0 0 0 0 0 0 1938 F W 1 2014 76 33003282054 Gnrc_Nm Generic Name AMOXICILLIN/POTASSIUM CLAV TOBRAMYCIN 2015-05-01 0 0 0 0 0 0 0 0 1969 M U 3 2015 46 33003282072 Gnrc_Nm Generic Name BROMFENAC SODIUM CIPROFLOXACIN HCL LATANOPROST TIMOLOL MALEATE TRIAMCINOLONE ACETONIDE 2014-01-01 0 0 0 0 0 0 0 0 1928 M H 1 2014 86 33003282074 Gnrc_Nm Generic Name AZITHROMYCIN 2016-11-30 0 0 0 0 0 0 0 0 1978 F H 3 2016 38 33003282079 Gnrc_Nm Generic Name ADAPALENE/BENZOYL PEROXIDE CLINDAMYCIN PHOSPHATE DEXTROAMPHETAMINE/AMPHETAMINE L-NORGEST-ETH ESTR/ETHIN ESTRA 2015-07-01 0 0 0 0 0 0 0 0 1985 F W 4 2015 30 33003282082 Gnrc_Nm Generic Name SIMVASTATIN 2014-01-01 0 0 0 0 0 0 0 0 1949 F W 2 2014 65 33003282095 Gnrc_Nm Generic Name AMOXICILLIN AMPICILLIN TRIHYDRATE DICLOFENAC SODIUM METFORMIN HCL METOPROLOL TARTRATE OFLOXACIN PREDNISOLONE ACETATE 2014-01-01 0 0 0 0 0 0 0 0 1930 F W 2 2014 84 ;;;; NOTE: There were 10 observations read from the data set WOR.PHARM1117_PMDESCRIPTIVES1. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
Ksharp
Super User

You could try COSINE to measure the similarity between two string , and build a recommendation system via it.

@Rick_SAS  wrote a blog about it to measure the similarity between recipes .

dwhitney
Calcite | Level 5

Thank you for the suggestion. I am working on it now.

 

Looking back at my original post, I think I may have been misleading. I used the term "cluster" but I do not mean that I want to do PROC CLUSTER analysis. I would like to do principal component analysis to identify groups ("clusters") of medications common to CP1=0 and CP1=1.

 

I don't know if this changes how people may provide direction.

 

Sorry for the mix up.

Ksharp
Super User

PCA also need X variables which have numeric value,better continuous value .

The problem is how you get these X variables.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1379 views
  • 0 likes
  • 4 in conversation