BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
_maldini_
Barite | Level 11

I am stacking 5 datasets into 1 (Time periods 1-5 = Final dataset). Each of these 5 datasets was the result of a merge of 5 datasets (e.g. Physical activity, nutrition, personal health, etc.). That's 25 datasets total. There are 500 variables and 2000 observations.

 

I used PROC IMPORT for each of 25 Excel worksheets (5 files w/ 5 tabs each). I will never do this again! Smiley LOL

 

I am seeing truncated values in at least 1 character variable - an open text box - in the final dataset only.

 

Not surprisingly, according to PROC CONTENTS, the length of this variable is different in each of the 5 merged datasets (Time period 1, 2, 3, 4, 5). No truncation occurs in these datasets. The truncation only occurs in the final dataset, after these 5 datasets are joined into one.

 

Can I assign length, or other attributes, to certain variables in the DATA step before the SET statement for the final dataset? 

 

My code: 

DATA a1_a5_final;

  LENGTH var1 $45;
  SET 
			a1_final 
			a2_final 
			a3_final 
			a4_final 
			a5_final;	
RUN; 

 

There are only a handful of character variables that I would need to be worried about here. In a perfect world, I would go back and read-in the data for each worksheet and set the attributes. It's too late for that now. 

 

Is there a quicker way to solve this problem?

 

Thank you!

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

SAS has an extremely nasty habit when reading from Excel or external databases of permanently attaching formats to character variables.

format comment $100.;

Now if you use PROC IMPORT to let SAS guess at how to format your data from Excel of text files then you can end up with the variable in one file being of length 10 and in another file of length 20.  This causes two problems.  

 

The first problem is how to make sure that when you stack them that each variable is assigned the proper length so that no values are truncated.  PROC SQL can help with that.  Or you can just define the variables BEFORE the MERGE or SET statement.

data want ;
   length id $20 age 8 comment $200 ;
   set ds1-ds5 ;
run;

The second problem as those annoying permanently attached formats.  If the first dataset has COMMENT define as LENGTH=$200 and has no format attached and the second has it defined as LENGTH=100 and has $100. format attached if you set them together you get the right length but the wrong format is attached so when you print the value it is truncated.  You fix this by adding a format statement that lists variables but does not have a format specified. This will remove the format from those variables.

data want ;
   length id $20 age 8 comment $200 ;
   set ds1-ds5 ;
   format _character_ ;
run;

View solution in original post

11 REPLIES 11
FreelanceReinh
Jade | Level 19

@_maldini_ wrote:

Can I assign length, or other attributes, to certain variables in the DATA step before the MERGE statement for the final dataset?  


Yes, you can. Assigning lengths could look like this:

data want;
length first_chvar $260 second_chvar $300 ...;
merge ...
...
run;
_maldini_
Barite | Level 11

@FreelanceReinh Sorry. I should have tried this first. Given my struggles lately, I assumed that it couldn't be that easy...

 

It works, kinda. A new PROC CONTENTS says the length has been changed, however the values are still truncated. Obviously I need to do some more investigating. 

 

The values are not truncated in any of the datasets that go into the final dataset...Hmmm.

 

Btw, I should have made it clear that I am joining/stacking the 5 merged datasets into one final dataset (i.e. Not merging). I'm going to make the change in the original post.

FreelanceReinh
Jade | Level 19

You can use the LENGTH statement in the same way before a SET statement (if that's what you used).

 

The important point is that the LENGTH statement is executed before the SET (or MERGE, ...) statement, so that the specified lengths are set in the PDV (program data vector) and the values from the input datasets are read into the "wide fields" prepared in the PDV.

Astounding
PROC Star

Consider using PROC SQL to join the data sets instead of a DATA step.  SQL is pretty good at locating the longest length needed and using it, with no extra work on your part.

_maldini_
Barite | Level 11

@Astounding I've never used it. Would it be too much to ask for some example code? Smiley Surprised

 

Btw, I should have made it clear that was joining/stacking the 5 merged datasets into one final dataset (i.e. Not merging)...

Astounding
PROC Star

Here's a version worth trying:

 

proc sql;

create table a1_a5_final as

select * from a1_final union

select * from a2_final union

select * from a3_final union

select * from a4_final union

select * from a5_final;

quit;

 

Pay careful attention to the number of records coming in vs. going out.  SQL UNION has a habit of removing duplicates.

FreelanceReinh
Jade | Level 19

@Astounding wrote:

SQL UNION has a habit of removing duplicates.


Indeed, this is the standard behavior. But you can avoid the elimination of duplicates by adding the ALL keyword.

 

Another "habit" of the UNION operator which is very unfamiliar if you come from data step programming is to align columns by position, not by name. Again, there is a keyword, CORRESPONDING, to change this behavior to the more familiar alignment by variable name. So, you may want to try "union all corresponding" if you encounter unwanted effects with "union" alone. However, the CORRESPONDING keyword has a (maybe again unwanted) side effect: Non-matching columns are dropped. (So, in this case the UNION CORRESPONDING operator operates like an "intersection operator" on the sets of variable names ... Maybe the data step is somewhat easier to use.)

 

Edit: I forgot to mention that PROC SQL in general has a habit of delivering observations in (what can seem like) random order. (But you can add an ORDER BY clause ... if you can specify sort criteria.)

FreelanceReinh
Jade | Level 19

Also, please make sure that

  1. no permanent formats (of insufficient lengths) are associated with the character variables in question (see PROC CONTENTS output)
  2. the truncation is not just caused by displaying the values, e.g., with PROC PRINT (see the log for pertinent warnings).

Maybe your variables contain the correct values, but either of the above issues lets them appear truncated.

Tom
Super User Tom
Super User

SAS has an extremely nasty habit when reading from Excel or external databases of permanently attaching formats to character variables.

format comment $100.;

Now if you use PROC IMPORT to let SAS guess at how to format your data from Excel of text files then you can end up with the variable in one file being of length 10 and in another file of length 20.  This causes two problems.  

 

The first problem is how to make sure that when you stack them that each variable is assigned the proper length so that no values are truncated.  PROC SQL can help with that.  Or you can just define the variables BEFORE the MERGE or SET statement.

data want ;
   length id $20 age 8 comment $200 ;
   set ds1-ds5 ;
run;

The second problem as those annoying permanently attached formats.  If the first dataset has COMMENT define as LENGTH=$200 and has no format attached and the second has it defined as LENGTH=100 and has $100. format attached if you set them together you get the right length but the wrong format is attached so when you print the value it is truncated.  You fix this by adding a format statement that lists variables but does not have a format specified. This will remove the format from those variables.

data want ;
   length id $20 age 8 comment $200 ;
   set ds1-ds5 ;
   format _character_ ;
run;
_maldini_
Barite | Level 11

@Tom

 

This explanation is very helpful. Thank you.

 

After playing around w/ these ideas, I noticed that the lengths and formats for many of these variable conflicted, or there was no format assigned, as you mentioned. I also noticed that when the LENGTH and the FORMAT conflict, the LENGTH attribute "wins" in the dataset. 

 

<You fix this by adding a format statement that lists variables but does not have a format specified. This will remove the format from those variables.>

 

In your example you put "_character_" in the FORMAT statement. What is the effect of this? Will it simply strip the format of all character variables?

 

Final question, if I included an INFORMAT statement in a DATA step like this, where would it go relative to the LENGTH, SET and FORMAT statements? After LENGTH and before SET?

Tom
Super User Tom
Super User

_CHARACTER_ is an example of a variable list.  It means all character variables.  

 

There is almost no situations where it makes sense to have a $xxx. format permanently attached to a variable.  Formats instruct SAS how to display the data and SAS already knows how to display character strings without any special instructions.

 

The real solution is to avoid using CSV files or other formats for your data where the lengths of variables are not defined.  If you have a lot of CSV files then create your own data step to read them so that you can make sure to define the variables properly.  This will make combining them easier.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 15861 views
  • 14 likes
  • 4 in conversation