I have a dataset with over 4M individuals categorized into 2 groups: CP1=0 and CP1=1. The data is the in the wide format (each row is a single individual). Each row (person) has anywhere from 0 to 80 medications; the variable name is Gnrc_Nm1 to Gnrc_Nm80, and the data for these variables are the "distinct" drug name in character format.
I would like to do 2 things in order:
First, I would like to identify the number of individuals per group (CP1) that has each of the distinct drug names. I imagine I would have to do some procedure that looks across Gnrc_Nm1 to Gnrc_Nm80 per person and then summarizes per group for each drug name. Each person has a distinct drug name, so no worries about duplicates per person (i.e., if someone had FLUOCINONIDE 3x, they will only have it once in Gnrc_Nmx). How would you recommend I go about this?
Second, I would like to do a cluster analysis to identify the top 10 most common clusters for each group. I cannot figure out how to do this with the data in the wide format, the variables Gnrc_Nm1 to Gnrc_Nm80, and the data for these variables as character. Any idea on how to go about this?
Below is an example of the data using proc print:
1 | 0 | Generic Name | FLUOCINONIDE | OMEPRAZOLE | 2016-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1960 | F | W | 1 | 2016 | 56 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 | 0 | Generic Name | HYDROCORTISONE | 2016-12-31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1957 | M | W | 2 | 2016 | 59 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3 | 1 | Generic Name | AMOXICILLIN/POTASSIUM CLAV | LEVOTHYROXINE SODIUM | 2015-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1957 | F | H | 1 | 2015 | 58 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4 | 0 | Generic Name | ALBUTEROL SULFATE | ALENDRONATE SODIUM | AMLODIPINE/VALSARTAN | AMOXICILLIN | ATENOLOL | ATORVASTATIN CALCIUM | CITALOPRAM HYDROBROMIDE | CLOPIDOGREL BISULFATE | FENOFIBRATE NANOCRYSTALLIZED | FENOFIBRIC ACID (CHOLINE) | FLUTICASONE PROPIONATE | FLUTICASONE/SALMETEROL | FUROSEMIDE | GABAPENTIN | GLIPIZIDE | LORAZEPAM | MONTELUKAST SODIUM | NEO/POLYMYX B SULF/DEXAMETH | NIFEDIPINE | PANTOPRAZOLE SODIUM | PENTOXIFYLLINE | ROFLUMILAST | TIOTROPIUM BROMIDE | ZOSTER VACCINE LIVE/PF | 2014-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1938 | F | W | 1 | 2014 | 76 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5 | 1 | Generic Name | AMOXICILLIN/POTASSIUM CLAV | TOBRAMYCIN | 2015-05-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1969 | M | U | 3 | 2015 | 46 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6 | 1 | Generic Name | BROMFENAC SODIUM | CIPROFLOXACIN HCL | LATANOPROST | TIMOLOL MALEATE | TRIAMCINOLONE ACETONIDE | 2014-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1928 | M | H | 1 | 2014 | 86 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7 | 0 | Generic Name | AZITHROMYCIN | 2016-11-30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1978 | F | H | 3 | 2016 | 38 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8 | 0 | Generic Name | ADAPALENE/BENZOYL PEROXIDE | CLINDAMYCIN PHOSPHATE | DEXTROAMPHETAMINE/AMPHETAMINE | L-NORGEST-ETH ESTR/ETHIN ESTRA | 2015-07-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1985 | F | W | 4 | 2015 | 30 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Please provide a portion of your data according to these instructions:
Here is a copy and paste after following the first set of instructions from the link you sent.
data WORK.PHARM1117_PMDESCRIPTIVES1;
infile datalines dsd truncover;
input Patid:19. _NAME_:$8. _LABEL_:$40. Gnrc_Nm1:$50. Gnrc_Nm2:$50. Gnrc_Nm3:$50. Gnrc_Nm4:$50. Gnrc_Nm5:$50. Gnrc_Nm6:$50. Gnrc_Nm7:$50. Gnrc_Nm8:$50. Gnrc_Nm9:$50. Gnrc_Nm10:$50. Gnrc_Nm11:$50. Gnrc_Nm12:$50. Gnrc_Nm13:$50. Gnrc_Nm14:$50. Gnrc_Nm15:$50.
Gnrc_Nm16:$50. Gnrc_Nm17:$50. Gnrc_Nm18:$50. Gnrc_Nm19:$50. Gnrc_Nm20:$50. Gnrc_Nm21:$50. Gnrc_Nm22:$50. Gnrc_Nm23:$50. Gnrc_Nm24:$50. Gnrc_Nm25:$50. Gnrc_Nm26:$50. Gnrc_Nm27:$50. Gnrc_Nm28:$50. Gnrc_Nm29:$50. Gnrc_Nm30:$50. Gnrc_Nm31:$50. Gnrc_Nm32:$50. G
nrc_Nm33:$50. Gnrc_Nm34:$50. Gnrc_Nm35:$50. Gnrc_Nm36:$50. Gnrc_Nm37:$50. Gnrc_Nm38:$50. Gnrc_Nm39:$50. Gnrc_Nm40:$50. Gnrc_Nm41:$50. Gnrc_Nm42:$50. Gnrc_Nm43:$50. Gnrc_Nm44:$50. Gnrc_Nm45:$50. Gnrc_Nm46:$50. Gnrc_Nm47:$50. Gnrc_Nm48:$50. Gnrc_Nm49:$50. Gn
rc_Nm50:$50. Gnrc_Nm51:$50. Gnrc_Nm52:$50. Gnrc_Nm53:$50. Gnrc_Nm54:$50. Gnrc_Nm55:$50. Gnrc_Nm56:$50. Gnrc_Nm57:$50. Gnrc_Nm58:$50. Gnrc_Nm59:$50. Gnrc_Nm60:$50. Gnrc_Nm61:$50. Gnrc_Nm62:$50. Gnrc_Nm63:$50. Gnrc_Nm64:$50. Gnrc_Nm65:$50. Gnrc_Nm66:$50. Gnr
c_Nm67:$50. Gnrc_Nm68:$50. Gnrc_Nm69:$50. Gnrc_Nm70:$50. Gnrc_Nm71:$50. Gnrc_Nm72:$50. Gnrc_Nm73:$50. Gnrc_Nm74:$50. Gnrc_Nm75:$50. Gnrc_Nm76:$50. Gnrc_Nm77:$50. Gnrc_Nm78:$50. Gnrc_Nm79:$50. Gnrc_Nm80:$50. FUdate:YYMMDD10. CP1:32. CP2:32. ID1:32. ID2:32.
ASD1:32. ASD2:32. Epil1:32. Epil2:32. Yrdob:5. sex:$1. Race:$1. USreg:32. YRsd:32. age:32.;
format Patid 19. FUdate YYMMDD10. Yrdob 5.;
label Patid="Patient Id" _NAME_="NAME OF FORMER VARIABLE" _LABEL_="LABEL OF FORMER VARIABLE" Yrdob="Year of Birth" sex="Gender Code" Race="Race Code";
datalines;
33003282022 Gnrc_Nm Generic Name FLUOCINONIDE OMEPRAZOLE 2016-01-01 0 0 0 0 0 0 0 0 1960 F W 1 2016
56
33003282023 Gnrc_Nm Generic Name HYDROCORTISONE 2016-12-31 0 0 0 0 0 0 0 0 1957 M W 2 2016 59
33003282043 Gnrc_Nm Generic Name AMOXICILLIN/POTASSIUM CLAV LEVOTHYROXINE SODIUM 2015-01-01 0 0 0 0
0 0 0 0 1957 F H 1 2015 58
33003282049 Gnrc_Nm Generic Name ALBUTEROL SULFATE ALENDRONATE SODIUM AMLODIPINE/VALSARTAN AMOXICILLIN ATENOLOL ATORVASTATIN CALCIUM CITALOPRAM HYDROBROMIDE CLOPIDOGREL BISULFATE FENOFIBRATE NANOCRYSTALLIZED FENOFIBRIC ACID (CHOLINE) FLUTICASONE PROPIONATE
FLUTICASONE/SALMETEROL FUROSEMIDE GABAPENTIN GLIPIZIDE LORAZEPAM MONTELUKAST SODIUM NEO/POLYMYX B SULF/DEXAMETH NIFEDIPINE PANTOPRAZOLE SODIUM PENTOXIFYLLINE ROFLUMILAST TIOTROPIUM BROMIDE ZOSTER VACCINE LIVE/PF
2014-01-01 0 0 0 0 0 0 0 0 1938 F W 1 2014 76
33003282054 Gnrc_Nm Generic Name AMOXICILLIN/POTASSIUM CLAV TOBRAMYCIN 2015-05-01 0 0 0 0 0 0 0 0
1969 M U 3 2015 46
33003282072 Gnrc_Nm Generic Name BROMFENAC SODIUM CIPROFLOXACIN HCL LATANOPROST TIMOLOL MALEATE TRIAMCINOLONE ACETONIDE
2014-01-01 0 0 0 0 0 0 0 0 1928 M H 1 2014 86
33003282074 Gnrc_Nm Generic Name AZITHROMYCIN 2016-11-30 0 0 0 0 0 0 0 0 1978 F H 3 2016 38
33003282079 Gnrc_Nm Generic Name ADAPALENE/BENZOYL PEROXIDE CLINDAMYCIN PHOSPHATE DEXTROAMPHETAMINE/AMPHETAMINE L-NORGEST-ETH ESTR/ETHIN ESTRA
2015-07-01 0 0 0 0 0 0 0 0 1985 F W 4 2015 30
33003282082 Gnrc_Nm Generic Name SIMVASTATIN 2014-01-01 0 0 0 0 0 0 0 0 1949 F W 2 2014 65
33003282095 Gnrc_Nm Generic Name AMOXICILLIN AMPICILLIN TRIHYDRATE DICLOFENAC SODIUM METFORMIN HCL METOPROLOL TARTRATE OFLOXACIN PREDNISOLONE ACETATE
2014-01-01 0 0 0 0 0 0 0 0 1930 F W 2 2014 84
;;;;
NOTE: There were 10 observations read from the data set WOR.PHARM1117_PMDESCRIPTIVES1.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Please double check whether the source of the code actually wraps lines in the middle of the datalines. I might suggest pasting the code into a box opened with {I} instead of the running man icon as that may be setting line length limit that is causing the data step to have invalid data in many places.
Another option would be to consider only using 10 of the Gnrc_nm varaibles. If the FUdate CP1 CP2 ID1 ID2
ASD1 ASD2 Epil1 Epil2 Yrdob sex Race USreg YRsd age are not needed for the cluster definitions drop them.
data export;
set yourdata (keep = Patid Gnrc_nm1 - Gncrc_nm10);
run;
for example and turn that into the data step. The method for determining clusters for 10 variables would be the same just different results.
The data do not wrap around like this in SAS. It must be a copy and paste issue.
Below I copied and pasted using the {i} function as suggested. I also attached the same code in Word format that seems to be a little bit more organized.
Let me know if this helps. Thank you for your time!
data WORK.PHARM1117_PMDESCRIPTIVES1; infile datalines dsd truncover; input Patid:19. _NAME_:$8. _LABEL_:$40. Gnrc_Nm1:$50. Gnrc_Nm2:$50. Gnrc_Nm3:$50. Gnrc_Nm4:$50. Gnrc_Nm5:$50. Gnrc_Nm6:$50. Gnrc_Nm7:$50. Gnrc_Nm8:$50. Gnrc_Nm9:$50. Gnrc_Nm10:$50. Gnrc_Nm11:$50. Gnrc_Nm12:$50. Gnrc_Nm13:$50. Gnrc_Nm14:$50. Gnrc_Nm15:$50. Gnrc_Nm16:$50. Gnrc_Nm17:$50. Gnrc_Nm18:$50. Gnrc_Nm19:$50. Gnrc_Nm20:$50. Gnrc_Nm21:$50. Gnrc_Nm22:$50. Gnrc_Nm23:$50. Gnrc_Nm24:$50. Gnrc_Nm25:$50. Gnrc_Nm26:$50. Gnrc_Nm27:$50. Gnrc_Nm28:$50. Gnrc_Nm29:$50. Gnrc_Nm30:$50. Gnrc_Nm31:$50. Gnrc_Nm32:$50. G nrc_Nm33:$50. Gnrc_Nm34:$50. Gnrc_Nm35:$50. Gnrc_Nm36:$50. Gnrc_Nm37:$50. Gnrc_Nm38:$50. Gnrc_Nm39:$50. Gnrc_Nm40:$50. Gnrc_Nm41:$50. Gnrc_Nm42:$50. Gnrc_Nm43:$50. Gnrc_Nm44:$50. Gnrc_Nm45:$50. Gnrc_Nm46:$50. Gnrc_Nm47:$50. Gnrc_Nm48:$50. Gnrc_Nm49:$50. Gn rc_Nm50:$50. Gnrc_Nm51:$50. Gnrc_Nm52:$50. Gnrc_Nm53:$50. Gnrc_Nm54:$50. Gnrc_Nm55:$50. Gnrc_Nm56:$50. Gnrc_Nm57:$50. Gnrc_Nm58:$50. Gnrc_Nm59:$50. Gnrc_Nm60:$50. Gnrc_Nm61:$50. Gnrc_Nm62:$50. Gnrc_Nm63:$50. Gnrc_Nm64:$50. Gnrc_Nm65:$50. Gnrc_Nm66:$50. Gnr c_Nm67:$50. Gnrc_Nm68:$50. Gnrc_Nm69:$50. Gnrc_Nm70:$50. Gnrc_Nm71:$50. Gnrc_Nm72:$50. Gnrc_Nm73:$50. Gnrc_Nm74:$50. Gnrc_Nm75:$50. Gnrc_Nm76:$50. Gnrc_Nm77:$50. Gnrc_Nm78:$50. Gnrc_Nm79:$50. Gnrc_Nm80:$50. FUdate:YYMMDD10. CP1:32. CP2:32. ID1:32. ID2:32. ASD1:32. ASD2:32. Epil1:32. Epil2:32. Yrdob:5. sex:$1. Race:$1. USreg:32. YRsd:32. age:32.; format Patid 19. FUdate YYMMDD10. Yrdob 5.; label Patid="Patient Id" _NAME_="NAME OF FORMER VARIABLE" _LABEL_="LABEL OF FORMER VARIABLE" Yrdob="Year of Birth" sex="Gender Code" Race="Race Code"; datalines; 33003282022 Gnrc_Nm Generic Name FLUOCINONIDE OMEPRAZOLE 2016-01-01 0 0 0 0 0 0 0 0 1960 F W 1 2016 56 33003282023 Gnrc_Nm Generic Name HYDROCORTISONE 2016-12-31 0 0 0 0 0 0 0 0 1957 M W 2 2016 59 33003282043 Gnrc_Nm Generic Name AMOXICILLIN/POTASSIUM CLAV LEVOTHYROXINE SODIUM 2015-01-01 0 0 0 0 0 0 0 0 1957 F H 1 2015 58 33003282049 Gnrc_Nm Generic Name ALBUTEROL SULFATE ALENDRONATE SODIUM AMLODIPINE/VALSARTAN AMOXICILLIN ATENOLOL ATORVASTATIN CALCIUM CITALOPRAM HYDROBROMIDE CLOPIDOGREL BISULFATE FENOFIBRATE NANOCRYSTALLIZED FENOFIBRIC ACID (CHOLINE) FLUTICASONE PROPIONATE FLUTICASONE/SALMETEROL FUROSEMIDE GABAPENTIN GLIPIZIDE LORAZEPAM MONTELUKAST SODIUM NEO/POLYMYX B SULF/DEXAMETH NIFEDIPINE PANTOPRAZOLE SODIUM PENTOXIFYLLINE ROFLUMILAST TIOTROPIUM BROMIDE ZOSTER VACCINE LIVE/PF 2014-01-01 0 0 0 0 0 0 0 0 1938 F W 1 2014 76 33003282054 Gnrc_Nm Generic Name AMOXICILLIN/POTASSIUM CLAV TOBRAMYCIN 2015-05-01 0 0 0 0 0 0 0 0 1969 M U 3 2015 46 33003282072 Gnrc_Nm Generic Name BROMFENAC SODIUM CIPROFLOXACIN HCL LATANOPROST TIMOLOL MALEATE TRIAMCINOLONE ACETONIDE 2014-01-01 0 0 0 0 0 0 0 0 1928 M H 1 2014 86 33003282074 Gnrc_Nm Generic Name AZITHROMYCIN 2016-11-30 0 0 0 0 0 0 0 0 1978 F H 3 2016 38 33003282079 Gnrc_Nm Generic Name ADAPALENE/BENZOYL PEROXIDE CLINDAMYCIN PHOSPHATE DEXTROAMPHETAMINE/AMPHETAMINE L-NORGEST-ETH ESTR/ETHIN ESTRA 2015-07-01 0 0 0 0 0 0 0 0 1985 F W 4 2015 30 33003282082 Gnrc_Nm Generic Name SIMVASTATIN 2014-01-01 0 0 0 0 0 0 0 0 1949 F W 2 2014 65 33003282095 Gnrc_Nm Generic Name AMOXICILLIN AMPICILLIN TRIHYDRATE DICLOFENAC SODIUM METFORMIN HCL METOPROLOL TARTRATE OFLOXACIN PREDNISOLONE ACETATE 2014-01-01 0 0 0 0 0 0 0 0 1930 F W 2 2014 84 ;;;; NOTE: There were 10 observations read from the data set WOR.PHARM1117_PMDESCRIPTIVES1. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
You could try COSINE to measure the similarity between two string , and build a recommendation system via it.
@Rick_SAS wrote a blog about it to measure the similarity between recipes .
Thank you for the suggestion. I am working on it now.
Looking back at my original post, I think I may have been misleading. I used the term "cluster" but I do not mean that I want to do PROC CLUSTER analysis. I would like to do principal component analysis to identify groups ("clusters") of medications common to CP1=0 and CP1=1.
I don't know if this changes how people may provide direction.
Sorry for the mix up.
PCA also need X variables which have numeric value,better continuous value .
The problem is how you get these X variables.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.