BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
edison83
Obsidian | Level 7

Hello everyone, i am new in sas and i maybe have a dumb question, but i could not find any answer online.

 

The following table contains the diseases diagnosed in each person (ID). Some have only one disease and others have multiples.

I would like to reorganize the date putting each disease in one column as the second table shows.

 

Can you guys help me?

 

ID1 disease2 diseases3 disease4 diseases
1HIVMeningitis  
2Ebolapneumonia  
3FluHIVEbola 
4FluEbolaMeningitis 
5Flu   
6MeningitisEbolaFluHIV
7pneumoniaFluHIV 
8EbolaMeningitis  
9FluMeningitis  
10Ebola   
11Ebola   
12MeningitisFlu  
13MeningitisEbolaHIVFlu
14MeningitisHIV  

 

 

 

 

IDHIVFluMeningitisEbolaPneumonia
1HIV Meningitis  
2   EbolaPneumonia
3HIVFlu Ebola 
4 FluMeningitisEbola 
5 Flu   
6  Meningitis  
7HIVFlu  Pneumonia
8  MeningitisEbola 
9 FluMeningitis  
10   Ebola 
11   Ebola 
12 FluMeningitis  
13HIVFluMeningitisEbola 
14HIV    
1 ACCEPTED SOLUTION

Accepted Solutions
unison
Lapis Lazuli | Level 10

Hi @edison83. Welcome to SAS communities. Not at all a dumb question. Take a look at my code below!

 

What you want to do here is convert to long form and then transpose back:

 

data have;
	infile datalines dlm=',' missover;
	input ID (disease_1-disease_4) (:$15.);
	datalines;
1,HIV,Meningitis
2,Ebola,pneumonia,
3,Flu,HIV,Ebola,
4,Flu,Ebola,Meningitis,
5,Flu,
6,Meningitis,Ebola,Flu,HIV
7,pneumonia,Flu,HIV,
8,Ebola,Meningitis,
9,Flu,Meningitis,
10,Ebola,
11,Ebola,
12,Meningitis,Flu,
13,Meningitis,Ebola,HIV,Flu
14,Meningitis,HIV,
;
run;

data have_long;
	set have;
	array arr_dis disease:;

	do over arr_dis;

		if arr_dis ne '' then
			do;
				disease=arr_dis;
				output;
			end;
	end;
run;

proc sort data=have_long;
	by id disease;
run;

proc transpose data=have_long(keep=id disease) out=want(drop=_name_);
	by id;
	id disease;
	var disease;
run;

 

-unison

View solution in original post

4 REPLIES 4
unison
Lapis Lazuli | Level 10

Hi @edison83. Welcome to SAS communities. Not at all a dumb question. Take a look at my code below!

 

What you want to do here is convert to long form and then transpose back:

 

data have;
	infile datalines dlm=',' missover;
	input ID (disease_1-disease_4) (:$15.);
	datalines;
1,HIV,Meningitis
2,Ebola,pneumonia,
3,Flu,HIV,Ebola,
4,Flu,Ebola,Meningitis,
5,Flu,
6,Meningitis,Ebola,Flu,HIV
7,pneumonia,Flu,HIV,
8,Ebola,Meningitis,
9,Flu,Meningitis,
10,Ebola,
11,Ebola,
12,Meningitis,Flu,
13,Meningitis,Ebola,HIV,Flu
14,Meningitis,HIV,
;
run;

data have_long;
	set have;
	array arr_dis disease:;

	do over arr_dis;

		if arr_dis ne '' then
			do;
				disease=arr_dis;
				output;
			end;
	end;
run;

proc sort data=have_long;
	by id disease;
run;

proc transpose data=have_long(keep=id disease) out=want(drop=_name_);
	by id;
	id disease;
	var disease;
run;

 

-unison
edison83
Obsidian | Level 7
Thanks a lot, it worked perfectly!
novinosrin
Tourmaline | Level 20

Same idea, just a double PROC transpose as this syntactically seems easy, at least to me

 

proc transpose data=have out=temp;
by id;
var disease_1--disease_4;
run;

proc transpose data=temp out=want(drop=_name_);
where not missing(col1);
by id;
var col1;
id col1;
run;
mkeintz
PROC Star

There is a solution that can be implemented in one pass through the data.  It's admittedly more programming, but possibly more efficient with large data sets.  It requires use of a hash table to store the wide-form data, and a small hash table to track the variety of diseases:

 

data have;
	infile datalines dlm=',' missover;
	input ID (disease_1-disease_4) (:$15.);
	datalines;
1,HIV,Meningitis
2,Ebola,pneumonia,
3,Flu,HIV,Ebola,
4,Flu,Ebola,Meningitis,
5,Flu,
6,Meningitis,Ebola,Flu,HIV
7,pneumonia,Flu,HIV,
8,Ebola,Meningitis,
9,Flu,Meningitis,
10,Ebola,
11,Ebola,
12,Meningitis,Flu,
13,Meningitis,Ebola,HIV,Flu
14,Meningitis,HIV,
;
run;

data empty;
  length id 8 name1-name40 $32;
  call missing (of _all_);
  stop;
run;

data _null_;
  merge have empty  end=end_of_have;
  length var_renames $5000;
  retain var_renames;

  if _n_=1 then do;
    declare hash wide (dataset:'empty',ordered:'a');
      wide.definekey('id');
      wide.definedata(all:'Y');
      wide.definedone();
    declare hash dlookup();
      dlookup.definekey('disease');
      dlookup.definedata('disease','d');
      dlookup.definedone();
  end;
  array nam {*} $32 name: ;

  do disease=disease_1,disease_2,disease_3,disease_4;
    if disease=' ' then leave;
    if dlookup.find()^=0 then do;
      d=dlookup.num_items+1;;
      dlookup.add();
      var_renames=catx(' ',var_renames,cats('name',d,'=',disease));
    end;
    nam{d}=disease;
  end;

  wide.add();

  if end_of_have;
  nd=dlookup.num_items;
  wide.output (dataset:catx(' ',"want (keep=id",cats('name1-name',ND),'rename=(',var_renames,')'));
run;
  1. The EMPTY dataset is just a convenient an compact way to set up a list of dummy variable names.  Make sure you declare more names than there are distinct diseases in the original data.  Using EMPTY is also a compact way to declare the data content of hash object wide.
  2. I use a MERGE HAVE with EMPTY, even though EMPTY has no rows of data, just to force the data step to reset all the NAME variables to missing values at the top of each data step iteration.
  3. The first DO structure looks for new diseases, adds them to the DLOOKUP hash, and gives each disease an index value to co-ordinate with the array of NAME variables.  It also builds a rename list (e.g.  "name1=HIV name2=Meningitis …") to be used when the WIDE.OUTPUT method is used.  This is one of the great benefits of hash objects - you can dynamically generate exactly the list of needed variables, based on the data encountered.  No advance knowledge needed.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 604 views
  • 3 likes
  • 4 in conversation