Hello all,
I have 10 SAS datasets, which I want to combine together (to concatenate). Some variables appear in all files, some only in some. Each file represent a different visit in a clinical trial.
I had a problem with variables being numeric in one dataset and character in another. There weren't too many of them, so I dealt with it manually.
Now a new problem. Some character variables (usually the ones of the free text), have different length in differnt datasets, so I get lot's of warnings in the log, and my data is truncated.
Afte looking online, I tried the following thing, before using the SET statement, I used the length, like this:
data All;
length Var1 $ 160 Var2 $ 150;
Set DS1 DS2 DS3;
run;
So now there are no warnings anymore in the log, however, first of all, the data is still being truncated ! Secondly, when looking at the attributes of the specific variables, I see that the length was changed indeed, but the format wasn't, for example: If Var1 had length 109 in DS1 and 160 in DS2, than I do see 160 in the length, but in format I see $109, which is weird.
So my quesion is, assuming I have several dataset, with hundreds of variables in each, and assuming that the type is finally matching, how do I concatenate them all, if some character variables have different lengths ?
Thank you !
While you haven't shown the program/log, I suspect the issue is the positioning of the statement. Even though the LENGTH statement must come before the SET statement, the FORMAT statement must come after the SET statement. LENGTH uses the first setting found, but FORMAT uses the last setting found.
You can do it in a number of ways, the SQL posted below fixes the length in the first select and all other will then apply to that. You could also query the SASHELP.VCOLUMN to find the biggest if needed. The question is why do you have 10 datasets with different attributes? Is this something you have created, or do you get it from someone else? If its the former then fix it in your input datasets, if its the later check out the data import agreement to see what the data should be like.
Although after checking, this does exactly the same thing:
data want; length a $50 b c 8; set have1 have2 have3; run;
So I am now unsure of the question?
data have1; length a $5; a="ABC"; b=2; output; a="DEFGT"; b=1; output; run; data have2; length a $25; a="23"; b=1; output; run; data have3; c=1; b=1; output; run; proc sql; create table WANT as select A length=25, B, . as C from HAVE1 union all select *, . as C from HAVE2 union all select "" as A, * from HAVE3; quit;
Ah, so its not actually dataset we are talking about then, I am presuming Excel sheets yes? Common problem. Get the data in CSV, i.e. plain text comma separated variable, if you have to, do it yourself - Excel -> SaveAs and select CommaSeparated. Now for each of the files (or you may be able to get away with one) write a basic data step import:
data mydata; infile "c:\abc.csv" dlm=","; length a $10 b $5...; input a $ b; run;
What this does is read in the data correctly, proc import is a "guessing" procedure, it is trying to guess what you want to do which is never a good thing, and you end up with this length nonsense. Specify the lengths yourself, if you don't have an import agreement then fix lengths as 200.
You are right, Excel sheets, and I did convert to CSV. I cannot use your way, because I have hundreds of variables in each file, so I will need to define them all in the length and input statements. Or am I wrong, won't I have to do that?
That depends on the files themselves, i.e. what the variables are. Is there a fixed number, are they all the same type - things that should be in a data import agreement.
The problem is if you don't look at it at the import stage, your going to have to look at it somewhere - i.e. your having trouble now with lengths, what about datatypes later on, processing on the data, ...
Worth testing: I believe if you use SQL to combine the data sets it will automatically solve the length issues. Not sure about the formats, but it's definitely worth trying.
There isn't much reason to apply a format to a lengthy character variable. The best solution for conflicting formats in those cases is probably just to remove the format entirely:
format longvar;
Good luck.
Astounding, thank you. Your comment is interesting. What does format longvar does ? Will it work if my files contains both numerical variables AND characteristic variables with varying lenghts ?
Secondly, how do I append a few datasets (let's say 3 or 4 for the example) using SQL ?
The FORMAT statement is intended as a DATA step statement. When you don't actually supply a format name in a FORMAT statement, it's asking that the format be removed.
format var1 var2 var3;
This statement asks that any formats for VAR1, VAR2, and VAR3 be removed.
Others are better positioned than myself to show you how to concatenate data sets in SQL.
Good luck.
Unfortunatelly something is wrong, I tried adding the format statement with the variable name and not specifying a format, and again, the length is now 200, and the format is $109. The data is truncated.
While you haven't shown the program/log, I suspect the issue is the positioning of the statement. Even though the LENGTH statement must come before the SET statement, the FORMAT statement must come after the SET statement. LENGTH uses the first setting found, but FORMAT uses the last setting found.
It's unlikely to hurt, as you are unlikely to add observations at this point. Still, it would be safer to remove the informats. If you have a FORMAT statement that removes formats, it's should be easy to construct an INFORMAT statement as well.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.