SasDewd
Obsidian | Level 7

Hello,

 

I need to import daily CSV files that can have up to 150 columns.

I have the code that does this and most of the time it works fine.

However, the files vary: one day's file could have 100 columns, another 120, another 20, or really any number.

So, every now and then a file gets imported with some of its columns misaligned.

I assumed that if a certain column is not present in the file, SAS would just skip it and leave it blank, but it seems that sometimes it pushes in values from a different column instead.

My code is way too long to post here, but this is roughly what it looks like:

 

DATA WORK.DATASET_DAILY;
LENGTH
Column1 8
Column2 

...

FORMAT
Column1 BEST2.
Column2 BEST2.

...

INFORMAT
Column1 BEST2.
Column2 BEST2.

...

INFILE "&path/&&File&i" 
DELIMITER = ','
DSD
LRECL=32767 
FIRSTOBS=2;
INPUT

Column1 : ?? COMMA4.
Column2 : ?? COMMA4.

...

All the columns are listed here, so no file being imported will have a column that is not defined, but most files will not have all the columns specified here.

 

Does anyone have any idea why occasionally I get my columns misaligned ?

 

Thanks,

10 REPLIES
ballardw
Super User

@SasDewd wrote:

Hello,

 

I need to import daily CSV files that can have up to 150 columns.

I have the code that does this and most of the time it works fine.

However, the files vary: one day's file could have 100 columns, another 120, another 20, or really any number.

So, every now and then a file gets imported with some of its columns misaligned.

 

Does anyone have any idea why occasionally I get my columns misaligned ?

 

Thanks,


Define "misaligned".

I would suspect that some column that should be character may occasionally contain a comma that is not within quotes.

Another possibility would be a field with a single quote, such as 5" for five inches. That messes with most delimited files.

 

Tom
Super User Tom
Super User

Are all of the columns the same? Numeric, like in your example?

Why are you not using the TRUNCOVER option to keep the INPUT statement from flowing to the next line if there are not enough values on the line?

Is it possible that some of the lines have embedded line breaks?

 

That code looks like the ugly stuff that PROC IMPORT generates.  Why not just do:

DATA WORK.DATASET_DAILY;
  INFILE "&path/&&File&i"  dsd firstobs=2 truncover ;
  input column1 - column200 ;
  informat column1 - column200 comma. ;
run;

 If suppressing the invalid data messages is important then perhaps:

DATA WORK.DATASET_DAILY;
  INFILE "&path/&&File&i"  dsd firstobs=2 truncover ;
  input (column1 - column200) (: ??comma.) ;
run;
 
Reeza
Super User
When the file is sent with 100 columns, does that mean it has 150 columns with 50 blank or only 100 columns? If it’s only 100, then SAS just reads them in order that you’ve specified and it will definitely be wrong.
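A minimal sketch of that positional shift, using made-up columns A through D (none of these names come from the actual files): the INPUT statement lists all four columns, but the file was dumped without column C, so C silently picks up D's values.

```sas
/* Hypothetical columns A-D. INPUT reads fields by position, not by    */
/* header name, so when this file omits column C, the variable C       */
/* receives D's values and D is left missing: the "misalignment".      */
data shifted;
  infile datalines dsd truncover firstobs=2;
  input A B C D;
datalines;
A,B,D
1,2,9
3,4,8
;
```

After this step, C holds 9 and 8 (the values that belong to D) and D is entirely missing, even though the header row named only A, B, and D.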
SasDewd
Obsidian | Level 7

@ballardw By misaligned I mean say, the first 40 columns everything looks good and then starting with column 41 I would start seeing truncated values in columns that don't belong there. For example, I'll start seeing part of customer's last name in the address field and so on.

As for the possibility of a comma being included in a field value, it's possible, but I thought the DSD option would keep SAS from treating it as the end of the value.

I'll look for single quotes to see if that's perhaps what's causing the misalignment.

I'm pretty new to SAS and I find it pretty hard to troubleshoot when something goes wrong.

In this case for example, I wish I could step through the code and watch what happens to each field as the code is being executed.

 

@Tom No, not all the columns are numeric; they are all kinds of different data types, but if a file contains a certain column, it should be the same type.

So, for example, in this code I have declared 150 columns, assigned data type and lengths.

Now, tomorrow's file may only have 75 columns but those columns should all be the same data type as specified in the code for those particular columns.

You're right, that is a code generated by import wizard. Again, I'm new to SAS so that's what I started with and it's working most of the time but not always.

These CSV files, by the way, are all daily data dumps from MongoDB so that's why I never know what columns will be in there.

 

@Reeza It will only be 100 columns, so it'll create a data set with 100 columns, and then I have code that appends that to the master data set based on the column names. So I don't think that would be an issue as long as the values in the newly created data set are correct, which they are not always.

 

Thank you all for your replies!

Reeza
Super User
Given that you’re specifying formats other than character, I’m not surprised you’re running into issues. If you try to read a character column with a comma informat, you’ll have problems. I would recommend reading it all as character. Have a master table with formats, informats, and lengths; use that to apply the correct informat, format, and length, then append it.
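A rough sketch of that approach. The control table WORK.MASTER (with columns NAME, RAW_NAME, INFORMAT) and the placeholder variables c1-c150 are assumptions for illustration, not names from the original post:

```sas
/* Step 1: read every field as a long character string, so nothing     */
/* can fail an informat or shift because of a type mismatch.           */
data work.raw;
  infile "&path/&&File&i" dsd truncover firstobs=2 lrecl=32767;
  length c1-c150 $200;
  input c1-c150;
run;

/* Step 2: generate conversion assignments from the control table      */
/* (one row per expected column) and apply them in a second step.      */
proc sql noprint;
  select catx(' ', name, '= input(', raw_name, ',', informat, ')')
    into :converts separated by '; '
    from work.master;
quit;

data work.dataset_daily;
  set work.raw;
  &converts;
run;
```

This sketch only handles numeric conversions via INPUT(); character columns would simply be renamed or length-adjusted rather than converted, which the control table could also drive.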
SasDewd
Obsidian | Level 7

@Reeza Thank you! That makes sense so I will definitely look into doing it that way if that would save me the trouble down the road. 

One thing I'm concerned about is that even if I figure it out and all my testing completes successfully, I have no way of knowing what kind of data will show up in tomorrow's file and whether something unexpected will throw everything off. So, if reading all the values as character would save me from any surprises, then that's definitely worth looking into. Thank you!

Reeza
Super User

@SasDewd wrote:

@Reeza Thank you! That makes sense so I will definitely look into doing it that way if that would save me the trouble down the road. 

One thing I'm concerned about is that even if I figure it out and all my testing completes successfully, I have no way of knowing what kind of data will show up in tomorrow's file and whether something unexpected will throw everything off. So, if reading all the values as character would save me from any surprises, then that's definitely worth looking into. Thank you!


You'd add one more check to your system: if a loaded variable is not in your master table, throw an error so you know something went wrong and can check the process to ensure it's correctly done. When I was in BI, we had a master table of jobs that a scheduler used to run them; at the end of each job, a record was written to a table telling us whether the load succeeded, so we could always refer to it to verify a load. But we had a very rigorous process with our data due to data audits; it's always fun to have an accountant auditor watching you run the code.
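That extra check could be sketched like this, again assuming a hand-maintained control table WORK.MASTER with expected column names in a NAME variable (both names are placeholders):

```sas
/* Flag any column in today's load that the master table doesn't       */
/* know about, and write an ERROR to the log so it can't be missed.    */
%let unexpected=;
proc sql noprint;
  select name into :unexpected separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'DATASET_DAILY'
      and upcase(name) not in
          (select upcase(name) from work.master);
quit;

%macro check_new_cols;
  %if %length(&unexpected) %then
    %put ERROR: Columns not in the master table: &unexpected;
%mend check_new_cols;
%check_new_cols
```

Pre-setting the macro variable to empty keeps a leftover value from a previous run from triggering a false alarm.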

Tom
Super User Tom
Super User

SAS knows how to read CSV files, but perhaps Mongo DB does not know how to create them.

 

Some database engines (including many that are very popular now like Redshift and Snowflake) seem to have decided to abandon the CSV standard method of using quotes to protect delimiters and quotes in the data and are instead using Unix inspired "escape" characters instead.  Perhaps Mongo DB can be convinced to make a readable file with the right options.

 

You might want to first test the file before trying to read it.  For example, a data step like this will check that each line has the same number of fields.

data _null_;
  infile 'myfile.csv' truncover ;
  input line $32767. ;
  nwords = countw(line,',','mq');
  previous=lag(nwords);
  if _n_>1 and nwords ne previous then do;
     put 'ERROR: Number of fields per line changes. ' _n_= nwords= previous= ;
     list;
     stop;
  end;
run;
        

This will work if the maximum line length is 32K or less.  You could still test if the lines could be longer, but you will need to modify the program.

 

 

SasDewd
Obsidian | Level 7

@Tom Thank you Tom, I will look into the approach you suggested. Unfortunately, I have no control over how the input files are created from MongoDB so I just need to figure it out and make sure it works. Thanks again!

ballardw
Super User

If the data is not too sensitive paste the header, if any, and five to ten lines of data into a code box opened using the forum's {I}. The main message window reformats text and so the code box is better for this.

If the data is sensitive and the number of fields that are sensitive is small then perhaps replace sensitive values with meaningless values.

 

Or make a text file of the few lines and attach that file.

 

 


