Solved: lookup multiple columns by name and reorganize it

edison83 · Posted 01-05-2020 09:18 PM

Hello everyone, i am new in sas and i maybe have a dumb question, but i could not find any answer online.

The following table contains the diseases diagnosed in each person (ID). Some have only one disease and others have multiples.

I would like to reorganize the date putting each disease in one column as the second table shows.

Can you guys help me?

ID	1 disease	2 diseases	3 disease	4 diseases
1	HIV	Meningitis
2	Ebola	pneumonia
3	Flu	HIV	Ebola
4	Flu	Ebola	Meningitis
5	Flu
6	Meningitis	Ebola	Flu	HIV
7	pneumonia	Flu	HIV
8	Ebola	Meningitis
9	Flu	Meningitis
10	Ebola
11	Ebola
12	Meningitis	Flu
13	Meningitis	Ebola	HIV	Flu
14	Meningitis	HIV

ID	HIV	Flu	Meningitis	Ebola	Pneumonia
1	HIV		Meningitis
2				Ebola	Pneumonia
3	HIV	Flu		Ebola
4		Flu	Meningitis	Ebola
5		Flu
6			Meningitis
7	HIV	Flu			Pneumonia
8			Meningitis	Ebola
9		Flu	Meningitis
10				Ebola
11				Ebola
12		Flu	Meningitis
13	HIV	Flu	Meningitis	Ebola
14	HIV

unison · Posted 01-05-2020 09:56 PM

Hi @edison83. Welcome to SAS communities. Not at all a dumb question. Take a look at my code below!

What you want to do here is convert to long form and then transpose back:

data have;
	infile datalines dlm=',' missover;
	input ID (disease_1-disease_4) (:$15.);
	datalines;
1,HIV,Meningitis
2,Ebola,pneumonia,
3,Flu,HIV,Ebola,
4,Flu,Ebola,Meningitis,
5,Flu,
6,Meningitis,Ebola,Flu,HIV
7,pneumonia,Flu,HIV,
8,Ebola,Meningitis,
9,Flu,Meningitis,
10,Ebola,
11,Ebola,
12,Meningitis,Flu,
13,Meningitis,Ebola,HIV,Flu
14,Meningitis,HIV,
;
run;

data have_long;
	set have;
	array arr_dis disease:;

	do over arr_dis;

		if arr_dis ne '' then
			do;
				disease=arr_dis;
				output;
			end;
	end;
run;

proc sort data=have_long;
	by id disease;
run;

proc transpose data=have_long(keep=id disease) out=want(drop=_name_);
	by id;
	id disease;
	var disease;
run;

-unison

View solution in original post

unison · Posted 01-05-2020 09:56 PM

Hi @edison83. Welcome to SAS communities. Not at all a dumb question. Take a look at my code below!

What you want to do here is convert to long form and then transpose back:

data have;
	infile datalines dlm=',' missover;
	input ID (disease_1-disease_4) (:$15.);
	datalines;
1,HIV,Meningitis
2,Ebola,pneumonia,
3,Flu,HIV,Ebola,
4,Flu,Ebola,Meningitis,
5,Flu,
6,Meningitis,Ebola,Flu,HIV
7,pneumonia,Flu,HIV,
8,Ebola,Meningitis,
9,Flu,Meningitis,
10,Ebola,
11,Ebola,
12,Meningitis,Flu,
13,Meningitis,Ebola,HIV,Flu
14,Meningitis,HIV,
;
run;

data have_long;
	set have;
	array arr_dis disease:;

	do over arr_dis;

		if arr_dis ne '' then
			do;
				disease=arr_dis;
				output;
			end;
	end;
run;

proc sort data=have_long;
	by id disease;
run;

proc transpose data=have_long(keep=id disease) out=want(drop=_name_);
	by id;
	id disease;
	var disease;
run;

-unison

edison83 · Posted 01-06-2020 04:26 PM

Thanks a lot, it worked perfectly!

novinosrin · Posted 01-05-2020 10:26 PM

Same idea, just a double PROC transpose as this syntactically seems easy, at least to me

proc transpose data=have out=temp;
by id;
var disease_1--disease_4;
run;

proc transpose data=temp out=want(drop=_name_);
where not missing(col1);
by id;
var col1;
id col1;
run;

mkeintz · Posted 01-05-2020 11:30 PM

There is a solution that can be implemented in one pass through the data. It's admittedly more programming, but possibly more efficient with large data sets. It requires use of a hash table to store the wide-form data, and a small hash table to track the variety of diseases:

data have;
	infile datalines dlm=',' missover;
	input ID (disease_1-disease_4) (:$15.);
	datalines;
1,HIV,Meningitis
2,Ebola,pneumonia,
3,Flu,HIV,Ebola,
4,Flu,Ebola,Meningitis,
5,Flu,
6,Meningitis,Ebola,Flu,HIV
7,pneumonia,Flu,HIV,
8,Ebola,Meningitis,
9,Flu,Meningitis,
10,Ebola,
11,Ebola,
12,Meningitis,Flu,
13,Meningitis,Ebola,HIV,Flu
14,Meningitis,HIV,
;
run;

data empty;
  length id 8 name1-name40 $32;
  call missing (of _all_);
  stop;
run;

data _null_;
  merge have empty  end=end_of_have;
  length var_renames $5000;
  retain var_renames;

  if _n_=1 then do;
    declare hash wide (dataset:'empty',ordered:'a');
      wide.definekey('id');
      wide.definedata(all:'Y');
      wide.definedone();
    declare hash dlookup();
      dlookup.definekey('disease');
      dlookup.definedata('disease','d');
      dlookup.definedone();
  end;
  array nam {*} $32 name: ;

  do disease=disease_1,disease_2,disease_3,disease_4;
    if disease=' ' then leave;
    if dlookup.find()^=0 then do;
      d=dlookup.num_items+1;;
      dlookup.add();
      var_renames=catx(' ',var_renames,cats('name',d,'=',disease));
    end;
    nam{d}=disease;
  end;

  wide.add();

  if end_of_have;
  nd=dlookup.num_items;
  wide.output (dataset:catx(' ',"want (keep=id",cats('name1-name',ND),'rename=(',var_renames,')'));
run;

The EMPTY dataset is just a convenient an compact way to set up a list of dummy variable names. Make sure you declare more names than there are distinct diseases in the original data. Using EMPTY is also a compact way to declare the data content of hash object wide.
I use a MERGE HAVE with EMPTY, even though EMPTY has no rows of data, just to force the data step to reset all the NAME variables to missing values at the top of each data step iteration.
The first DO structure looks for new diseases, adds them to the DLOOKUP hash, and gives each disease an index value to co-ordinate with the array of NAME variables. It also builds a rename list (e.g. "name1=HIV name2=Meningitis …") to be used when the WIDE.OUTPUT method is used. This is one of the great benefits of hash objects - you can dynamically generate exactly the list of needed variables, based on the data encountered. No advance knowledge needed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

lookup multiple columns by name and reorganize it

Re: lookup multiple columns by name and reorganize it

Re: lookup multiple columns by name and reorganize it

Re: lookup multiple columns by name and reorganize it

Re: lookup multiple columns by name and reorganize it

Re: lookup multiple columns by name and reorganize it