10-20-2017 04:51 PM
I've come across this task that requires me to concatenate patient information from 30 different sites/30 different tables, each table has 20 columns, I've rearranged them so that all the column orders and names are the same across 30 dataset.
However, each table was submitted by different sites, so there were lots of messy formats going on. i.e., patient_Ids are charter values in some sites' data (they were unique identifier, but can be anything, like UDWGER15581DE2) but are numeric values in others (1,2,3....). Same case for other variables, like drug_ID. For the same reason, this also means the length of the values, if there were character, were different, some Patient_ID are $15, some are $49.....
To sum up, I was wondering if there is any easy way to systematically unify them? Instead of changing 20*30=600 variables by hand?
Any suggestion will be greatly appreciated.
10-20-2017 05:15 PM
If by this: I've rearranged them so that all the column orders and names are the same across 30 dataset. You mean that you have arranged some SAS datasets with the variables in a common order I would be very tempted to:
1) Proc Export all of the sets to CSV files, may want to use the option Putnames=no so it is easier in the next step
2) combine all the csv files into a single one
3) Proc import the large csv file with a large value for guessing rows.
4) either modify the data step generated by proc import to use your desired variable names and formats or use proc datasets to do so.
10-20-2017 07:07 PM
You'd never do it by hand, I like @ballardw solution the best, but if you want to keep it in SAS you should set up a master data set that has the formats you want and then use that to check variables one by one and convert. You'd design it for one data set and then automate but it's a pain. The writing it to a CSV file and then importing is a good idea.
10-21-2017 03:09 PM
If the problem is that some columns sometimes come up as numeric but you would like them to be character, you can convert the problematic columns with something like:
length tmp1 $nn;
tmp1 = cats(myProblematicColumn);
This doesn't affect the column if it is already character.
10-23-2017 08:51 AM
If you can define an output format where all variables are character, this method may work: use the VVALUEX function.
Given a couple of tables with different variable types, but same variable names:
data a; length x 8 y $30 z $14; input x y z; cards; 1 2 3 5 fdsfgh 2 3 c 31254 ;run; data b; length x $3 y 8 z $4; input x y z; cards; 1rr 2 3 25 2 gg 33 31254 f ;run;
You then define a model table where all variables are character, and long enough to hold all values, e.g.:
data want; length x $8 y $30 z $20; stop; run;
You can then submit the same datastep for each input to add your data to the model dataset - you may want to put that in a macro:
if 0 then modify want;
array out _character_;
set &table(rename=(x=_1 y=_2 z=_3));
do _N_=1 to dim(out);
Just make sure that the variables in the RENAME option do not collide with the real variable names.
Now you need just one line of code for each input table: