Combining Data Sets Containing Character Variables of Different Lengths

Accepted Solution Solved
Reply
Regular Contributor
Posts: 180
Accepted Solution

Combining Data Sets Containing Character Variables of Different Lengths

[ Edited ]

Hello all,

 

I have 10 SAS datasets, which I want to combine together (to concatenate). Some variables appear in all files, some only in some. Each file represent a different visit in a clinical trial.

 

I had a problem with variables being numeric in one dataset and character in another. There weren't too many of them, so I dealt with it manually.

 

Now a new problem. Some character variables (usually the ones of the free text), have different length in differnt datasets, so I get lot's of warnings in the log, and my data is truncated.

 

Afte looking online, I tried the following thing, before using the SET statement, I used the length, like this:

 

 

data All;
length Var1 $ 160 Var2 $ 150;
Set DS1 DS2 DS3;
run;

 

 

So now there are no warnings anymore in the log, however, first of all, the data is still being truncated ! Secondly, when looking at the attributes of the specific variables, I see that the length was changed indeed, but the format wasn't, for example: If Var1 had length 109 in DS1 and 160 in DS2, than I do see 160 in the length, but in format I see $109, which is weird.

 

So my quesion is, assuming I have several dataset, with hundreds of variables in each, and assuming that the type is finally matching, how do I concatenate them all, if some character variables have different lengths ?

 

Thank you !

 


Accepted Solutions
Solution
‎01-13-2016 09:21 AM
Super User
Posts: 5,083

Re: Combining Data Sets Containing Character Variables of Different Lengths

While you haven't shown the program/log, I suspect the issue is the positioning of the statement.  Even though the LENGTH statement must come before the SET statement, the FORMAT statement must come after the SET statement.  LENGTH uses the first setting found, but FORMAT uses the last setting found.

View solution in original post


All Replies
Super User
Posts: 17,837

Re: Combining Data Sets Containing Character Variables of Different Lengths

Your doing it correctly, the format is making it appear as if the data is truncated.

Change the format, similar to how you specified the length

Another trick is to append the data and making sure the base data is the dataset with the longest length/format, which sets the attributes for the final data set.
Regular Contributor
Posts: 180

Re: Combining Data Sets Containing Character Variables of Different Lengths

Thank you. I thought about using the longest length/format. However, I have 10 datasets, and each variable can have the longest length in a different dataset, it's too many combinations to check.
How do I change the format, let's say, for a length of 142?
Super User
Super User
Posts: 7,401

Re: Combining Data Sets Containing Character Variables of Different Lengths

[ Edited ]

You can do it in a number of ways, the SQL posted below fixes the length in the first select and all other will then apply to that.  You could also query the SASHELP.VCOLUMN to find the biggest if needed.  The question is why do you have 10 datasets with different attributes?  Is this something you have created, or do you get it from someone else?  If its the former then fix it in your input datasets, if its the later check out the data import agreement to see what the data should be like.

Although after checking, this does exactly the same thing:

data want;
  length a $50 b c 8;
  set have1 have2 have3;
run;

So I am now unsure of the question?

 

 

data have1;
  length a $5;
  a="ABC"; b=2; output;
  a="DEFGT"; b=1; output;
run;

data have2;
  length a $25;
  a="23"; b=1; output;
run;

data have3;
  c=1; b=1; output;
run;

proc sql;
  create table WANT as
  select  A length=25,
          B,
          . as C
  from    HAVE1
  union all
  select  *,
          . as C
  from    HAVE2
  union all 
  select  "" as A,
          *
  from    HAVE3;
quit;

 

Regular Contributor
Posts: 180

Re: Combining Data Sets Containing Character Variables of Different Lengths

This is how I got the data, I can't control it, the problem comes from free text fields, sometimes over 100 characters. I have 10 datasets with hundreds variables in each, I can't control it. I used PROC Import and told SAS to use maximum variables to define format and length. I see no other way to read the data into SAS, so now I have to "clean" it.
Super User
Super User
Posts: 7,401

Re: Combining Data Sets Containing Character Variables of Different Lengths

Ah, so its not actually dataset we are talking about then, I am presuming Excel sheets yes?  Common problem.  Get the data in CSV, i.e. plain text comma separated variable, if you have to, do it yourself - Excel -> SaveAs and select CommaSeparated.  Now for each of the files (or you may be able to get away with one) write a basic data step import:

data mydata;
  infile "c:\abc.csv" dlm=",";
  length a $10 b $5...;
  input a $ b;
run;

What this does is read in the data correctly, proc import is a "guessing" procedure, it is trying to guess what you want to do which is never a good thing, and you end up with this length nonsense.  Specify the lengths yourself, if you don't have an import agreement then fix lengths as 200.

Regular Contributor
Posts: 180

Re: Combining Data Sets Containing Character Variables of Different Lengths

[ Edited ]

You are right, Excel sheets, and I did convert to CSV. I cannot use your way, because I have hundreds of variables in each file, so I will need to define them all in the length and input statements. Or am I wrong, won't I  have to do that?

Super User
Super User
Posts: 7,401

Re: Combining Data Sets Containing Character Variables of Different Lengths

That depends on the files themselves, i.e. what the variables are.  Is there a fixed number, are they all the same type - things that should be in a data import agreement.  

The problem is if you don't look at it at the import stage, your going to have to look at it somewhere - i.e. your having trouble now with lengths, what about datatypes later on, processing on the data, ...

Super User
Posts: 5,083

Re: Combining Data Sets Containing Character Variables of Different Lengths

Worth testing:  I believe if you use SQL to combine the data sets it will automatically solve the length issues.  Not sure about the formats, but it's definitely worth trying.

 

There isn't much reason to apply a format to a lengthy character variable.  The best solution for conflicting formats in those cases is probably just to remove the format entirely:

 

format longvar;

 

Good luck.

Regular Contributor
Posts: 180

Re: Combining Data Sets Containing Character Variables of Different Lengths

Astounding, thank you. Your comment is interesting. What does format longvar does ? Will it work if my files contains both numerical variables AND characteristic variables with varying lenghts ?

 

Secondly, how do I append a few datasets (let's say 3 or 4 for the example) using SQL ?

Super User
Posts: 5,083

Re: Combining Data Sets Containing Character Variables of Different Lengths

The FORMAT statement is intended as a DATA step statement.  When you don't actually supply a format name in a FORMAT statement, it's asking that the format be removed.

 

format var1 var2 var3;

 

This statement asks that any formats for VAR1, VAR2, and VAR3 be removed.

 

Others are better positioned than myself to show you how to concatenate data sets in SQL.

 

Good luck.

Regular Contributor
Posts: 180

Re: Combining Data Sets Containing Character Variables of Different Lengths

Unfortunatelly something is wrong, I tried adding the format statement with the variable name and not specifying a format, and again, the length is now 200, and the format is $109. The data is truncated.

Solution
‎01-13-2016 09:21 AM
Super User
Posts: 5,083

Re: Combining Data Sets Containing Character Variables of Different Lengths

While you haven't shown the program/log, I suspect the issue is the positioning of the statement.  Even though the LENGTH statement must come before the SET statement, the FORMAT statement must come after the SET statement.  LENGTH uses the first setting found, but FORMAT uses the last setting found.

Regular Contributor
Posts: 180

Re: Combining Data Sets Containing Character Variables of Different Lengths

You are right ! I moved the format to after the set, and now the format is $200, and my data is no longer truncated !

The informat is still 109, does it matter ?
Super User
Posts: 5,083

Re: Combining Data Sets Containing Character Variables of Different Lengths

It's unlikely to hurt, as you are unlikely to add observations at this point.  Still, it would be safer to remove the  informats.  If you have a FORMAT statement that removes formats, it's should be easy to construct an INFORMAT statement as well.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 609 views
  • 3 likes
  • 5 in conversation