Solved: Converting data from wide to long

UcheOkoro · Posted 03-16-2021 03:50 PM

Please, I need help converting data from wide to long. I have a data set which has data in wide form. It has drug_class1 - drug_class25 , drug_name1-drug_name25 and the drug indication starts from drug_indication1_21-drug_indication25_40..

I would like to transpose the data to look like this

encounter_id drug_class drug_name indication21 .................................indication40.

My main problem is how to transpose the drug_indication. I tried the following code

proc transpose data=Data_new out=long_indication_meds prefix=indication_meds;
   by encounterid;
   var indication_meds1___21 - indication_meds25___40;run;

but got the following error message

ERROR: Either roots don't match or start suffix after end suffix.

Thank you in advance

ballardw · Posted 03-16-2021 07:44 PM

If your source data was a text file and you copy something like the first 5 lines and paste them into a text box (critical!) we may be able to show you how to read that file directly to create that data structure.

If your source data was in a different format tell us what type file.

Consider this example, which only has 3 sets of values consisting of two "words" and 5 numeric values:

data example;
   infile datalines dlm=',' dsd;
   informat worda wordb $10. value1-value5  best.;
   input id @;
   do i=1 to 3;
      input worda wordb value1-value5 @;
      output;
   end;
   input;
   drop i;
datalines;
1,abc,pdq,1,0,0,1,1,zz,top,0,0,0,0,1,john,doe,1,1,1,1,0
1,aaa,ggq,1,0,0,1,1,z3,bgt,0,0,0,0,1,mary,pit,1,1,1,1,0
;

The infile statement would point to a text file but here it shows some options if the original data were comma separated (CSV). The DLM option tells SAS to expect a comma between values. The DSD option means that if there are two successive delimiters to treat that variable as missing.

The Informat describes how to read each variable, worda and wordb as character, and then 5 numeric variables as numbers (not actually needed as a default but to make basis for when more complicated things come up). The ID could be numeric or character.

The first input reads just the ID value. The @ symbol at the end of an input means "hold the reading pointer on the current record where it is", which in this case would be right after the ID.

Since we KNOW we have 3 sets of variables we can use a loop to specify that number of times to read, still using the @ to hold the pointer. The OUTPUT statement says to write the current data to the output data set.

After the Loop we need an explicit Input to advance to the next row of data.

The I value isn't needed, unless you want a line or group indicator for some reason. If so pick a name that means more than the "i" and don't use the Drop statement.

If not all 3 groups are present you will get some blank lines currently. That could be fixed by using "If not missing(worda) then output;" instead of the simple output.

Another option might be to EXPORT your existing data to CSV and write a program similar to above to read it.

You did mention other variables. Where they appear is going to make a difference but as long as the pattern is regular it is quite doable, just somewhat cumbersome code.

For example if you had indication1 otherindication1 indication2 otherindication2 etc as the column order then the Input statement would just list those.

The headaches come if the "other" are mixed in after the the drug type and name group of values.

View solution in original post

ballardw · Posted 03-16-2021 04:14 PM

When you use variable list like Var_1 - Var15 SAS expects there to be 15 variables, sequentially numbered and the "root" which would be "Var_" to be the same.

Your use drug_indication1_21-drug_indication25_40 does not have a "root" that is the same. From that I have no idea if the "next" variable name after drug_indication1_21 would be drug_indication2_21,drug_indication1_22, drug_indication2_22 or what. Neither does SAS.

So either type out all of the names or use one of the other forms of lists.

If all of the drug_indication variables are adjacent columns (run proc contents and see if the variable # in the output is sequential for those variables). If so you could use the two dash list: drug_indication1_21 -- drug_indication25_40 , which would use the variables in column order.

If they are not sequential then possibly you could use the list form of : drug_indication: the : after the part of the name says to use ALL of the variables whose names start with the common characters. There could however be a an issue that the order variable names are added to the list is not the expected order.

Your code shows :prefix=indication_meds

This means the output variables would be named indication_meds1, indication_meds2, ... indication_medsXX

not the stated Indication21

However it looks more like what you want to do is RENAME variables. Which could be done in Proc datasets or a data step (inefficient in cpu time, about the same in coding time)

proc datasets library=work;
   modify data_new;
      rename 
         drug_indication1_21= indication21
         ...(you provide each pair)
         drug_indication25_40= indication40
      ;
   run;
quit;

If not Provide some example data, with only 3 drug class and names with maybe 5 or 6 "drug_indication" variables. Then show what you expect to see for output.

UcheOkoro · Posted 03-16-2021 06:41 PM

Thank you so much for your response. The indication_21.......indication_40 was an error. What I had in mind was drug_indication21.........drug_indication_40.

I did a proc contents and noticed that the variable # in the output is not sequential for those variables because there is another set of variables other_indication1.......other_indications25 which has character observations( eg. "History of allergy to iodine ".)

Renaming the variable is going to be a gruely task because they are over 500.

Sample data:

What I have is the following

id drug_class1 drug_name1 drug_indication1_21 ........ drug_indication1_40 ........drug_class25 drug_name25 drug_indication25_21 ...... drug_indication25_40

1 NSAID Aspirin 1 .............. 0 Antihistamine Cetirizine 0 1

2 Opioid hydrocodone 0 1 NSAID Aspirin 1 0

What I want to see is

What I expect to see is the following

s/n id drug_class drug_name drug_indication21................................drug_indication40

1 1 NSAID Aspirin 1 ......................................... ...................0

2 . . . .............................................................

. . . . .......................................................................

. . . . ..................................................................

. . . . ...................................................................

. . . . ..................................................................

25 1 Antihistamine Cetirizine 0................................................................ ..1

26 2 Opioid hydrocodone 1 ............................................ .....................0

. . . .............................................................

. . . .......................................................................

. . . ..................................................................

. . . ...................................................................

. . . ..................................................................

50 2 NSAID Aspirin 1......................................................................0

I hope this is clear?

Thank you.

ballardw · Posted 03-16-2021 07:44 PM

If your source data was a text file and you copy something like the first 5 lines and paste them into a text box (critical!) we may be able to show you how to read that file directly to create that data structure.

If your source data was in a different format tell us what type file.

Consider this example, which only has 3 sets of values consisting of two "words" and 5 numeric values:

data example;
   infile datalines dlm=',' dsd;
   informat worda wordb $10. value1-value5  best.;
   input id @;
   do i=1 to 3;
      input worda wordb value1-value5 @;
      output;
   end;
   input;
   drop i;
datalines;
1,abc,pdq,1,0,0,1,1,zz,top,0,0,0,0,1,john,doe,1,1,1,1,0
1,aaa,ggq,1,0,0,1,1,z3,bgt,0,0,0,0,1,mary,pit,1,1,1,1,0
;