BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Kurt_Bremser
Super User

Rapidly changing file structures are a VERY BAD IDEA, as they only cause unnecessary work and are a sign for no design or a very idiotic one.

For one-shots, PROC IMPORT is  the way to go, but let me tell you that even for one-off flat files that I have to deal with, I always get the same type of documentation that I get for those that will be created and imported daily.

So you either learn to live with the shortcomings of PROC IMPORT, or go through a lot of work determining the file structure by trial-and-error, or tell the idiot that heaped that task on you to finally grow a brain.

ballardw
Super User

Point out to management how much "extra time" this process takes for one iteration and the costs. Then the next time the additional time and associated costs. And the next time. Repeat as needed.

 

It may sink into someone head that having a process with an order and controls may actually make financial sense to the organization.

 

I once worked as a contractor doing some data analysis for a client. The at one point asked about the typical $200 to $400 monthly fee we charged in programming. When they learned it was because we had to reprogram processes each month due to file column order and content changes the data stabilized pretty quickly.

 

Note that we did not have that issue with the client's software side of the house only the hardware folks.

evgenys
Calcite | Level 5
One of the reasons for such problem is (that's my problem too) if you integrate new software you'll run it 1000s times with 100s of different output files.
Moreover, I expected from software like sas if I've written separated by '','' it would read data between two'','' .
Reeza
Super User

@evgenys wrote:


Moreover, I expected from software like sas if I've written separated by '','' it would read data between two'','' .

That's what it does. How does SAS not do that? 

 

Honestly your question and what your actual issue is unclear. We're making standard guesses and comments but beyond you need to know your data what else can we say. I'm 99% sure any other language operates pretty much the same way with a file that is unspecified. This is also why people use databases. 

 

Kurt_Bremser
Super User

Moreover: if I had to make a guess, I'd say that 80% of my work now involves documentation, either writing or reading. The actual programming consumes very little time, partly owing to the tools I created to make my work easier, but mostly because of the documentation!

JBailey
Barite | Level 11

Hi @evgenys

 

First things, first... This is frustrating situation to find yourself in. Everyone who has participated in this thread has experienced it. We feel your pain.

 

TL;DR

Best advice: Get the official file format and data details from the producer/creator of the file.

 

Now, for a nonsensical philosophical discussion. 

 

When I have to do things like this I ask myself a series of questions. The first two questions are the most important. The answer to them will dictate whether I ask the other questions.

 

 

What's the cost of being wrong?

If this data is really important then figuring out the format on your own is a bad idea because being wrong is expensive. In other words: the creator/producer of the data has responsibility for ACCURATELY describing it.

 

If the data isn't critical you can use this as a learning exercise but it isn't a really great learning exericise.

 

Is this a one time exercise or is it going to be repeated? If it needs to be a repeatable process how often will it run?

If this is a one-time exercise then running guessrows on the entire file may not be a huge deal. Start the job before you go to lunch or leave for the day.

 

If this is for a production job, you want a real, stable, description of the file layout. 

 

Pro Tip: if it takes your machine a long time to run this job, use this as justification for a new, awesome, computer.

 

Where does the data originate?

If the data comes from a database a DBA can point you to the SQL (Data Definition Language (DDL) that was used to create the table). Using this, I know the longest possible length of the character strings.

 

If the data is something someone threw together it may be a good idea to ask them how they are doing it. Perhaps they can provide the length of the strings.

 

Others have pointed-out that we should always ask the source of the data (person or machine) for the file layout and column length information. It is solid advice. In fact, I believe this is the best advice you could be given. As you can tell from the comments, it is very frustrating to be asked to read in a file that you know nothing about.  

 

How big is this CSV file (KB, MB, GB)?

4 million rows is not a lot these days. For example, when I am experimenting with database loads I will typically use 10 to 20 million rows. My desktop PC handles this with no problem even when I am loading data into cloud databases. That being said, if the file is huge (many GB) it could take some time to process which leads to the next question.

 

Can I subset the file in so that my chances of finding the longest text fields increases?

If may be possible to sort the file so the longest records appear at the top. This could help you figure-out the max length of the character fields.

 

 

Does anyone know something that can help me?

This is a variant of the first question. You may have a co-worker who has read the file before and can help you.

 

I know this isn't what you are after but I think it is the only real answer. Anyone who has been through this can identify with it. It is frustrating.

 

Best wishes,

Jeff

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 20 replies
  • 18708 views
  • 8 likes
  • 8 in conversation