BookmarkSubscribeRSS Feed
sassy_seb
Obsidian | Level 7

Hello, I am getting the following error messages when trying to merge two datasets.

One of the datasets I am getting from a csv file, so maybe the issue could be there?

sassy_seb_0-1715715058795.png

I was trying to specify the length of the PID variable for the redcap_sort dataset from the redcap one, which is the one we got from the csv file. However, I keep getting messages that the variable has multiple lengths and it keeps truncating the data.

 

Any PID after 999 gets shortened. So 1000 and 1001 become 100, 1010 becomes 101, etc.

 

Any help or a nudge in the right direction would be greatly appreciated, thank you so much.

 

Edit: The programming with the csv file already has:

data work.redcap; %let _EFIERR_ = 0;
infile &csv_file  delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=1 ;

	informat pid $500. ;
	informat pid_ini $500. ;

and the code for format:

format pid $500. ;

It has this for all the variables. I thought the above code would make it so that the variables would have that limit of 500 characters? 

36 REPLIES 36
sassy_seb
Obsidian | Level 7

Both datsets do have the variable I am merging by in common. The only discrepancy seems to be the length in one of the datasets but I am not sure if it has something to do with the SAS code or the csv where I am pulling the data from. 

Kurt_Bremser
Super User

@sassy_seb wrote:

Both datsets do have the variable I am merging by in common. The only discrepancy seems to be the length in one of the datasets but I am not sure if it has something to do with the SAS code or the csv where I am pulling the data from. 


Read the log. You merge by PID only, but the complaint is about PID_INI and VIT_DTH_SP. One usually does not want a situation where there are variables not included in the BY in both datasets. It is quite hard to predict which values will end up in the result if there's not a strict 1:1 relationship.

 

It would help a lot if you post examples for the datasets and the expected result.

ballardw
Super User

In this code:

data work.redcap; %let _EFIERR_ = 0;
infile &csv_file  delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=1 ;

	informat pid $500. ;
	informat pid_ini $500. ;

The INFORMAT will set the length.

FORMAT would control how many characters are displayed.

 

When in doubt run Proc Contents on the data set. Then you will KNOW what the length and type of the variable may be.

You "example", not very clear, of the PID 1000 becoming 100 makes me think that you may have leading spaces in the value and that then gets truncated. You don't say anything about how you examine the values. If you are printing them them you may not be aware that the default HTML output will remove leading spaces and so you don't see them.

 

You may want to run this code and examine the different results.

data work.ex;
   length x $ 5;
   x='  1000';
run;
/* doesn't display leading spaces*/
proc print;run;

data work.ex2;
   length x $ 6;
   x='123456';
run;

/* will have length of x from work.ex
   and generate warning about truncating 
   data
*/
data work.combine1;
   set work.ex 
       work.ex2
   ;
run;
/* will have length of x from work.ex2
   and no warning
*/
data work.combine2;
   set work.ex2 
       work.ex
   ;
run;

/* will have truncation warning 
   because the Length statement has
   x shorter than in work.ex2
*/
data work.combine3;
   length x $ 5;
   set work.ex2 
       work.ex
   ;
run;
sassy_seb
Obsidian | Level 7

Yeah sorry so the variable that we use to identify has a prefix usually it's a two letter code like AR and a dash character . So for example, the full PID would be "AA-001."

 

When we start getting to values such as "AA-1000" this is where the value gets cut off and displays "AA-100" and so on.

 

 

ballardw
Super User

How did you read the data into SAS? 

I suspect that you used Proc Import and ran into one the cases where your data isn't read properly because you didn't realize that by default Proc Import examines a very small number of records before guessing the lengths and types of variables.

You mention CSV, so if you used Proc Import I would go back and set the option GUESSINGROWS=MAX; in the code or widget used to import the data. Or write a data step to read the data controlling the length and type.

 

When you combine data set and the variable may have different lengths in sets then use a Length statement prior to any SET or MERGE statements to have a length at least as long as the longest version of any contributing data set such as this:

data dummyexample;
   length somevar $ 15;
   set thisdataset thatdataset;
run;

 Proc Import be default only examines 20 rows of data so if you have an Id variable that is in a sort order like:

AA-100

... (incrementing 101, 102, etc the first hundreds of lines have 6 characters so Import uses 6 for the length)

AA-999

AA-1000  (import still uses 6 for the length of the variable so the result is AA-100).

sassy_seb
Obsidian | Level 7

Okay, there might be an issue in the way the import of the csv was handled. The sas code I was given to generate the dataset does not seem to have a proc import statement.

 

It just reads:

%let csv_file = 'file_path' 
OPTIONS nofmterr;

Then there are just numerous proc format statements for the variable informats and formats and labels and values.

 

ballardw
Super User

If the data originates in a CSV file then there is a Proc Import or a DATA step somewhere. If your code has %INCLUDE statements then the code might appear there.

 

If you are not reading the external file then someone else did with one of those and may not realize the possible issues.

sassy_seb
Obsidian | Level 7

Just got clarification on this.  I asked my colleagues and this is what they told me:

 

The software used to capture the data exports SAS code alongside a CSV file with the data. Running the SAS program in the same folder as the CSV file generates the dataset. 

 

 

ballardw
Super User

@sassy_seb wrote:

Just got clarification on this.  I asked my colleagues and this is what they told me:

 

The software used to capture the data exports SAS code alongside a CSV file with the data. Running the SAS program in the same folder as the CSV file generates the dataset. 

 

 


Unless the SAS code is using something like a very cumbersome Proc SQL with a stack of Insert statements then there are details in how this set is being created that is missing. And the likely the culprit for the problems encountered.

 

Run Proc Contents on both sets. That will provide information on how to fix the merge steps with Length statements. But you may have to do this every time if the "process" does not have any control on the lengths when these variables are created.

sassy_seb
Obsidian | Level 7

I did try this using both datasets just to make sure there was no further discrepancies but I got the same issue. Once it gets to AA-1000 it shortens it to AA-100

Tom
Super User Tom
Super User

Your example of how to adjust the length of PID looks fine. 

 

Did you do that for both datasets or just one of them?  If not then the length of PID will be set by the first dataset listed in the MERGE statement.  And the FORMAT attached to PID will be set by the first dataset that actually has a format attached to PID.  If you define a variable as length $10 but then attach the $6 format then the values will appear truncated even if they are not.

 

The warning on the first data step should not be a problem (assuming the length you set for PID is large enough).

The warning on the second data step is talking about TWO other variables.  You could use the same method you used for PID to adjust those also.

 

The sample code you posted looks like the gibberish generated by PROC IMPORT.  That is not the style I would use.  Instead I would define the variables first, using a LENGTH statement.  I would not attach informats like $500. to character variables because SAS does NOT NEED special informat instructions for reading character variables.  Only variables with values like DATE, TIME and DATETIME normally need special informats to read in values.  Normal numbers and character variables do not need them.

 

The code might have been generated by PROC IMPORT (it does that kind of stupid stuff).  Or it might have been generated by RCC.   I have not have good experience with companies generating good SAS code (even PROC IMPORT which was written by SAS generates ugly SAS code.).

 

sassy_seb
Obsidian | Level 7

I did it to both data sets to set the same length for both before merging them

Tom
Super User Tom
Super User

@sassy_seb wrote:

I did it to both data sets to set the same length for both before merging them


Then assuming you set PID long enough you should not be having any trouble caused by the MERGE step.

 

If the values are truncated they were already truncated in the original dataset(s). 

 

So you need to fix something earlier in the process to prevent the loss of data.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 36 replies
  • 594 views
  • 13 likes
  • 7 in conversation