First, if there's an easier way to line up the data and make it readable in this window please tell me because it was lined up where I copied it from elsewhere and then I copied it into here and it was all scrunched up and I had to hit the space bar a bunch to line things up.
But onto the real question. Yesterday I posted this but it turns out I only posted part of the question. Here it is again.
Here's the data.
ID FormSet var_name var_value var_label var_fmt
1 50 x 1 Label of X xfmt
1 50 y 2 Label of Y
1 50 z 3 Label of Z zfmt
1 55 x 4 Label of X xfmt
1 55 y 5 Label of Y
1 55 z 6 Label of Z zfmt
2 60 x 7 Label of X xfmt
2 60 y 8 Label of Y
2 60 z 9 Label of Z zfmt
2 65 x 10 Label of X
2 65 y 11 Label of Y
2 65 z 12 Label of Z
Here is what I wanted to get yesterday in terms of a dataset and that someone here helped me get.
ID FormSet x y z
1 50 1 2 3
1 55 4 5 6
2 60 7 8 9
2 65 10 11 12
Okay, that part is fine. But as you've probably guessed by now, I also would like to use the var_label variable to generate labels for the newly created dataset immediately above as well as use the var_fmt variable to generate formats for it. I didn't put a format on the variable y to illustrate that some variables will have formats and others will not.
My best guess now is that making the formats will have something to do with proc contents cntlin= and making the labels will have something to do with data _null_ but that's as close as I've gotten. But I can imagine having the data like at the top would be a thing people would do for giant datasets since if would save a lot of hassle if you could then easily convert things, which there must be a way to do.
I think in the past I've already done this except that the labels and formats were already on the 2nd (wide) dataset I created rather than on the 1st (long) one.
Any help is appreciated.
Oops, I forgot to put xfmt and zfmt in the var_fmt column in the last few row. Pretend they are there.
Is the desired output a report? Or is the desired output a SAS data set?
Use IDLABEL option in PROC TRANSPOSE.
To attach the formats you will need a separate step. If the list is short enough put it into a macro variable just use PROC SQL to make the list of var/fmtspec pairs. Make sure what is in your VAR_FMT column is a valid format specification. If it is just the name of the format you will need to append a period to make it into a format specification.
proc sql noprint;
select distinct catx(' ',var_name,var_fmt) into :fmtlist separated by ' '
from have
where not missing(var_fmt)
;
quit;
proc transpose data=have out=want(drop=_name_) ;
by id formset ;
id var_name;
idlabel var_label;
var var_value;
format &fmtlist;
run;
proc contents data=want; run;
The CONTENTS Procedure Alphabetic List of Variables and Attributes # Variable Type Len Format Label 2 FormSet Num 8 1 ID Num 8 3 x Num 8 XFMT. Label of X 4 y Num 8 Label of Y 5 z Num 8 ZFMT. Label of Z
I'll try that but I can't yet because other things have come up. But stay tuned because I'll answer eventually. It would be very useful to me if the code you provides works.
Re. the other persons question about what this was for, what they do is collect all the data elsewhere and all the data attributes, etc, and they know its coming to me in SAS eventually so they have me provide the labels and format names. But eventually what we do is take all that data and bring it into SAS and the dataset looks like the first one in this thread (except way longer) and I have to convert the data to usable, rectangular SAS data and I have to convert the label and format variables into actual labels and formats.
Why do you need a "rectangular" data set? To make a report, or to do further analyses. In either case, you'd be better off leaving the data in the long form, as most reporting procedures and analysis procedures work better on long data sets.
It's more complicated than the usual long data and short data thing. I was simplifying a bit.
You may already know what a visit matrix is but in case not I'll explain. Supposed there is a study that asks people questions at various visits over time, say at Baseline, 3 6 and 12 months. And say there are various data collection forms that may or may not at each visit (demographics or lab values or whatever. Say there are 10 forms.
If you put the 4 times across the top and the 10 forms down the side you form a matrix that has 4 x 10 = 40 cells. You put an 'x' in each cell where a form is collected. That's a visit matrix. Perhaps demographics is collected only at baseline so the row for it only has an 'x' at Baseline. Perhaps lab values are collected only in 6 month intervals so the row for it only has an 'x' at Baseline, 6 and 12 months. Etc. You get the idea. That is how the study I'm working on now is organized, only much bigger.
Suppose 15 of the 40 cells in your visit matrix has an 'x.' That means you have 15 formsets. The forms go down the left hand side and the visits (or sets) go across the top, thus where there is an 'x' we call it a formset. Also, the formsets are unique across subjects. So Subject 1 has formsets numbered 1 thru 15, subject 2 has formsets numbers 16 thru 30, etc.
The data is collected in one giant column. I did not set this up. I don't know if it's genius or if it''s crazy, although I do know it's crazy that we had one person set it up and nobody else knew how it worked then he left and now we have to deal with it because we're halfway through a very long study.
Here is how the collected data looks.
ID FormSet Value Variable
1 1 M Sex
1 1 70 Ht_Inches
1 1 150 Wt_lbs
1 2 120 BP_Diastolic
1 2 80 BP_Systolic
etc
All the data we collect is in the Value column. The Variable column tells the variable but its values will be repeated. For example, Sex will be included for each subject. If BP_Diastolic is measured for each subject at each visit then BP_Diastolic will be there every time. As you can imagine, after awhile the columns become very long.
Meanwhile, in other datasets we have attributes of the variables, such as the field rules and formats, etc. I merge them all together to get them into one dataset.
I hope that's at least somewhat clear. It's a pretty complicated thing. But anyway, IDLabel statement in PROC TRANSPOSE worked like a charm so that's good and now all that's left is assigning the formats
I don't see how I can assign the formats when it's in one giant column so I'll try it after I do the PROC TRANSPOSE and we have a rectangular dataset. Perhaps at that point I can merge in the name of the formats so that it's on the dataset and then use the PROC SQL code mentioned above. I haven't tried it yet. I think at that point I could also do a big DATA _NULL_ to create but since there are a lot of variables and formats it would be cleaner to do it with just those handful of lines in the PROC SQL mentioned above.
I'll keep you updated.
I don't think the schedule of visits matters in how to convert the data. Instead the unit is the FORM. You want all of the variables from one form in one dataset, so an DEMOG dataset and a VITALS dataset. Visit just becomes one of the BY variables when deciding which sets of rows in the TALL structure combine to form one observation in the wide dataset.
How you describe it is how I would have done it but I wasn't in charge. And when you say make it by form, that is exactly what I'm doing right before I do the PROC TRANSPOSE. I subset out each form (Demography, for instance) from the giant column of data and then I have a long column that contains only demography data, and now I'm doing manipulations to convert that into a "long" SAS dataset, which is still rectangular but is no longer one column of data. The PROC TRANSPOSE made it rectangular and the IDLABEL option in it added the labels and now all that's left (in this part) is to add the formats, which I'll attempt later today. Another person and I are still cleaning up the data and much of my day will be spent on the phone with her doing that.
Given that the data are set up this way now though, there is no choice but to deal with it this way because the dataset with the long column of data doesn't have anything useful except the value of the variable collected, the formset and the variable name (and not even a useful variable name, since we use a key like q123 for it and then in another dataset we have the key and the useful variable name, such as Sex, that goes with it).
IOW, just using the dataset with the collected data, (the one with the giant column of data), we don't know the form it came from or the visit or the label or the format, etc. Thus, we're forced to do a bunch of merging. Stay tuned.
So if your source table just has a character variable and you want to produce numeric variables you will need to add a step to convert the character strings into numbers. You might do it after the PROC TRANSPOSE, but since that means making a new variable you will have to re-attach the label. So it might be easier to do the conversion in the TALL structure and use two PROC TRANSPOSE steps, one for the character variables and one for the numeric variables. Then merge the two. You can add the FORMAT statement to the merge step.
So assuming that FORMID=1 has three numeric variables named DOB, HEIGHT and WEIGHT you might use logic like below.
data tall_subset;
set tall;
where formid=1;
if type='N' then do;
select (var_name);
when ('DOB') num_value = input(value,anydtdte.);
when ('HEIGHT','WEIGHT') num_value = input(value,32.);
end;
end;
run;
proc transpose data=tall_subset out=char_vars;
by id visit ;
where var_type='C';
var value ;
id var_name;
idlabel var_label;
run;
proc transpose data=tall_subset out=num_vars;
by id visit ;
where var_type='N';
var num_value ;
id var_name;
idlabel var_label;
run;
data form1 ;
merge char_var num_vars;
by id visit;
format ..... ;
run;
It should actually be easy to generate that type of code from the metadata about the forms and variables, without really needing to merge the large data file with the metadata file.
Personally I find it easiest to setup a data step that reads the metadata and generates the code needed via writing lines to file. You can then review the file and make sure it will work. Once the generation program is working it can just use %INCLUDE to run the generated code.
While it was one big character column to begin with what we did was figure out what kind of variable it was going to be (free text, or a free ranging number like weight, or a date, or a numeric variable but one that had a format, like 1, 2, and 3 with a format of Yes, No and Don't Know), etc. and then deal with each of those separately. And actually it got more complicated than that for various reasons, one of which was that some dates required an exact date, and for other dates the month and year were enough and we could infer the 15th of the month and for yet other dates just the year was good enough and we could infer July 2 as the month and date.
Maybe there is a more efficient way of doing it but at this point there is no turning back since we have it all working (I think) except for the formats. I think we're at the point where we have all the variables in a nice SAS dataset and the format names are elsewhere, with the same SAS variable names, and we can merge them and at that point we'll have the variables and their formats in the same dataset. Of course, all variables will have a format of one kind or another but I'm talking about the formats like 1=Yes, 2=No, 3=Don't Know. The rest of the formats will already be assigned. If so I may be able to do that with a data _null_.
I've been busy with other stuff so I haven't tried it yet but I just wanted to post this so you know I haven't vanished.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.