BookmarkSubscribeRSS Feed
Astounding
PROC Star

@Tom  is right about the correct starting point.  Since you have steps that are obscure and difficult to explain. start by examining data values before the MERGE.  See if any truncated values occur before any merging takes place.

 

Eventually, you will need to correct the program for all the other variables that SAS mentions.  But truncation of PID is a separate issue.

sassy_seb
Obsidian | Level 7

Is MERGE case sensitive when it comes to variables? Will PID in one dataset match with pid from the other one?

ballardw
Super User

@sassy_seb wrote:

Is MERGE case sensitive when it comes to variables? Will PID in one dataset match with pid from the other one?


Questions like this are a place that you should create some small data set and test it. You also need to consider such things as presence of a BY statement as the behavior does change with Merge.

 

data one;
   input x $ y;
datalines;
a 1
b 2
;

data two;
   input x $ z;
datalines;
A 11
b 22
;

/* NO BY statement*/
data test1;
  merge one
        two
  ;
run;
/* merge with BY statement*/
data test2;
   merge one
         two
   ;
   by x;
run;
Patrick
Opal | Level 21

The log your shared complains not only about PID. Please run below code and share the result.

title 'Data: cht_all_sort';
proc contents data=cht_all_sort(keep=pid pid_id vit_oth_sp);
run;quit;
title 'Data: redcap';
proc contents data=redcap(keep=pid pid_id vit_oth_sp);
run;quit;
title;

 

sassy_seb
Obsidian | Level 7

Here's the first proc contents:

sassy_seb_0-1715792021671.png

And the second one:

sassy_seb_2-1715792086818.png

 

 

Tom
Super User Tom
Super User

If the code that created that first dataset actually used the $6. informat when making the dataset, as in a formatted mode INPUT statement like this:

input pid $6. ;

then even though the variable is 500 bytes long only 6 bytes will be stored into it. The rest will be blanks.

Put if the code used a LIST MODE input statement like one of these:

input pid ;
input pid :$6. ;

then the data would not have been truncated to just 6 bytes by the input statement.

sassy_seb
Obsidian | Level 7

Okay, update:

I got some of the PIDS from 999 to show up. Like for example, 1001, 1004, 1006, etc.

 

The issue wasn't really the informat and formats on the SAS code I believe. We are exporting from Access into the first database. This database has patients like 1002, 1005, 1007, 1011. 

 

When it comes to the export of these variables from Access, this is what causes the truncation of data. 1002 gets cut to 100, and etc. 

So it seems like there is an issue with either Access or the initial SAS program that imports all of the data from Access. Will work on it soon, thank you everyone for all of the help. While I don't have a solution yet, I feel like I am making small breakthroughs as a SAS rookie. 

SASKiwi
PROC Star

DId you know that SAS can read Access databases directly if you have the SAS product SAS/ACCESS Interface to PC Files? You can check if you have this product by running this:

proc setinit;
run;
sassy_seb
Obsidian | Level 7

I have it 😁

SASKiwi
PROC Star

In that case you can avoid truncation completely by reading the data directly.

 

PROC IMPORT is probably a good starting point. The link contains some example code you can try. 

Patrick
Opal | Level 21

@sassy_seb wrote:

I have it 😁


To be able to directly connect to your access database some config will be required (ODBC driver). You also need physical access to this Access database which is often not the case when running SAS on a server. 

 

You wrote earlier:

The software used to capture the data exports SAS code alongside a CSV file with the data. Running the SAS program in the same folder as the CSV file generates the dataset.

It sounds like there is some upstream application that not only creates the .csv but also the SAS code to read this .csv into SAS. If that's not working properly then the one maintaining this application needs to fix it. You should only have to demonstrate that what you get on the SAS side when running the script is not what's in the access table.

 

The Proc Contents report you shared shows that your tables contain A LOT of variables (columns) - at least 772

Patrick_1-1715831527546.png

Looking into the SAS docu under section Microsoft Access Database (.mdb and .accdb) Maximum Size Limits per Method this could cause issues for a direct connection via SAS Access to PC Files.

Patrick_2-1715831670702.png

 

For a "one-off" with quick turn-around you could consider below two options:

  1. - You could use Proc Import instead of the generated SAS code to read the .csv. Make sure you use guessingrows=max.
  2. - You could ask for a XML export of the access tables.

I've done a test for the XML approach:

Patrick_5-1715832643710.png

libname test xmlv2 "C:\temp\Table1.xml" xmlschema="C:\temp\Table1.xsd";
proc contents data=test.Table1 order=varnum;
run;quit;

Patrick_6-1715832708556.png

 

Also when using XML the variable lengths are based on the maximum string lengths found in the table which for the same variable name in more than one table can still lead to length discrepancies. Given in MS Access one doesn't define the column lengths I assume this challenge will remain for any import approach.

It looks like the generated import code you're currently using just defines some "insane" lengths to avoid string truncation - which in turn is not a great thing for table size and processing.

 

If required one can always write code that analysis the imported tables and then aligns the lengths for same named columns.

 

Unfortunately even when using a XML with a XSD schema the data type in SAS is still just based on the values found in the XML file which leads to variable char_digits_only to become numerical instead of char on the SAS side.

Patrick_7-1715833037650.png

<xsd:element name="char_digits_only" minOccurs="0" od:jetType="memo" od:sqlSType="ntext">

 

With above findings:

I'd ask for a fix of the upstream process and to proceed would eventually for now use Proc Import to create the tables in SAS. 

If you can directly connect to the access DB from SAS then I'd give this also a try. If you can't make this work then I'd also ask for an extension to the upstream process so it analysis the .csv and generates SAS import code with more reasonable variable lengths.

 

 

 

 

 

 

sassy_seb
Obsidian | Level 7

Update:

The first dataset is using a previous SAS code that imports data from Excel.

The excel file, has the PIDs of AA-1000, but in this code is where it gets truncated to AA-100 when we run the code. 

 

So now we believe the issue to come from this initial setup SAS code.

 

Or is this an issue with Excel? 

 

When we run the setup program we have a lot of import statements for multiple Excel files and sheets. When we merge the data, we get the same errors I had shown before, about various lengths being given to the PID variable as well as others.

 

Running PROC Contents after merging shows that PID still has a format and informat of $6. 

I also ran PROC Contents on the first dataset imported from Excel and it gives me the same results.

 

We are a little stumped because this is the very beginning of the data analysis. I even asked to be sure if there was nothing that came before that might have defined PID from a previous program but no. We have a format catalog inside one of the libraries we are referencing so I thought maybe the issue could be there but from what they told me there is nothing before this program that defined PID with something from the format catalog. 

 

Tom
Super User Tom
Super User

Importing from an actual EXCEL file (.XLSX file) should not cause values to be truncated.  The only reason I can think that might happen is if the hyphen is not a hyphen by some other character that might take more than one byte in UTF-8 encoding.  In that case (at least in theory) it is posisble SAS could GUESS to make the variable 7 bytes long thinking the hyphen only needed one byte,  But then when it read only the first 7 bytes of the UTF-8 string it would lose the last digit since the hyphen used 2 bytes instead of one.

 

But if instead you are importing from a CSV file (which is NOT an EXCEL but instead is just a plain old text file) with PROC IMPORT and you forgot to include the GUESSINGROWS=MAX; statement then PROC IMPORT could GUESS to make the variable only 6 bytes long since in the subset of the lines it checked that was the maximum length it found.

 

And if you do have a CSV file then ditch the PROC IMPORT step and instead write the data step to read from the CSV file yourself and you will have complete control over how the variables are defined and read.

sassy_seb
Obsidian | Level 7

I ran the other import statements, we had like 4 or 5

 

Using PROC contents on each one, I see that PID has a different format and informat from between $6. to $7. 

 

Is there anything I can do before the import statements to specify the length or the informat I want? 

Or actually, is there something I can do to prevent this from happening to further datasets going forward?

Like making sure for sure that PID is read universally across all of my programs as $10. or $15. for example. 

 

 

I am assuming the IMPORT is just getting the maximum length for each respective variable because the first import statement's Excel Sheet didn't have any PID's with more than 6 characters.

The second sheet did, since that one has AA-1000 properly show up and has PID reflect the format and informat as $7. 

 

 

sassy_seb
Obsidian | Level 7

And I tried the merge just using these two smaller datasets, and same issue. AA-1000 keeps getting shortened and I keep getting the same warning about multiple lengths. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 36 replies
  • 664 views
  • 13 likes
  • 7 in conversation