evgenys
Calcite | Level 5
Hi all,
I'm trying to import a CSV file of 4 million rows and ~150 columns, and I don't know the order of the columns.
Running GUESSINGROWS over a file that size is impractical.
And with a smaller value, my character columns get truncated.

Any ideas what I can do?
1 ACCEPTED SOLUTION
JBailey
Barite | Level 11

Hi @evgenys

 

First things first... This is a frustrating situation to find yourself in. Everyone who has participated in this thread has experienced it. We feel your pain.

 

TL;DR

Best advice: Get the official file format and data details from the producer/creator of the file.

 

Now, for a nonsensical philosophical discussion. 

 

When I have to do things like this I ask myself a series of questions. The first two questions are the most important. The answer to them will dictate whether I ask the other questions.

 

 

What's the cost of being wrong?

If this data is really important then figuring out the format on your own is a bad idea because being wrong is expensive. In other words: the creator/producer of the data has responsibility for ACCURATELY describing it.

 

If the data isn't critical you can use this as a learning exercise, but it isn't a really great learning exercise.

 

Is this a one-time exercise or is it going to be repeated? If it needs to be a repeatable process, how often will it run?

If this is a one-time exercise then running GUESSINGROWS on the entire file may not be a huge deal (a sketch follows below). Start the job before you go to lunch or leave for the day.
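For that one-off scan, a minimal sketch (the file path is a placeholder, and GUESSINGROWS=MAX needs SAS 9.4):

proc import datafile='/data/bigfile.csv'   /* placeholder path */
            out=work.bigfile
            dbms=csv
            replace;
   guessingrows=max;   /* scan every row before guessing types and lengths */
run;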

 

If this is for a production job, you want a real, stable, description of the file layout. 

 

Pro Tip: if it takes your machine a long time to run this job, use this as justification for a new, awesome, computer.

 

Where does the data originate?

If the data comes from a database, a DBA can point you to the SQL Data Definition Language (DDL) that was used to create the table. From that, I know the longest possible length of the character strings; see the sketch below.
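If you have read access yourself, a hedged sketch along these lines can pull the lengths straight from the database catalog (assumes SAS/ACCESS to ODBC; the DSN and table name are placeholders):

proc sql;
   connect to odbc (dsn=mydb);   /* placeholder connection */
   select * from connection to odbc
      ( select column_name, data_type, character_maximum_length
          from information_schema.columns
         where table_name = 'MYTABLE' );   /* placeholder table */
   disconnect from odbc;
quit;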

 

If the data is something someone threw together it may be a good idea to ask them how they are doing it. Perhaps they can provide the length of the strings.

 

Others have pointed out that we should always ask the source of the data (person or machine) for the file layout and column length information. It is solid advice. In fact, I believe this is the best advice you could be given. As you can tell from the comments, it is very frustrating to be asked to read in a file that you know nothing about.

 

How big is this CSV file (KB, MB, GB)?

4 million rows is not a lot these days. For example, when I am experimenting with database loads I will typically use 10 to 20 million rows. My desktop PC handles this with no problem even when I am loading data into cloud databases. That being said, if the file is huge (many GB) it could take some time to process, which leads to the next question.

 

Can I subset the file so that my chances of finding the longest text fields increase?

It may be possible to sort the file so the longest records appear at the top. This could help you figure out the max length of the character fields, as in the sketch below.
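A minimal sketch of that idea (untested; the path is a placeholder): record each line's length, then sort so the longest records come first.

data reclen;
   infile '/data/bigfile.csv' truncover;   /* placeholder path */
   input;                        /* load the raw line into _infile_ */
   line = _n_;                   /* line number within the file */
   len  = length(_infile_);      /* total record length */
run;

proc sort data=reclen;
   by descending len;            /* longest records first */
run;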

 

 

Does anyone know something that can help me?

This is a variant of the first question. You may have a co-worker who has read the file before and can help you.

 

I know this isn't what you are after but I think it is the only real answer. Anyone who has been through this can identify with it. It is frustrating.

 

Best wishes,

Jeff

 


20 REPLIES
evgenys
Calcite | Level 5
As I wrote, I don't know the order and sometimes the description of the columns... because of this I can't write a data step....
Kurt_Bremser
Super User

Then DEMAND the description from whoever gave you that file.

Your only other option is to run proc import with a large enough guessingrows value, and hope for the best.

Unless you want to inspect all rows with the good old eyeballs Mk 1.

Reeza
Super User

There isn't much of a choice then. I suggest the standard process of using proc import to generate code. Copy the code from the log and customize it according to your errors until they're gone.

 

I would recommend setting GUESSINGROWS to 1 million for the initial proc import.

You can limit the total amount read using the OBS system option:

option obs=1500000;

Then reset it after:

 

option obs=Max;
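Put together, a hedged sketch of that workflow (the path is a placeholder; proc import writes the generated DATA step to the log, so copy it from there and edit it as needed):

option obs=1500000;             /* only process the first 1.5 million rows */

proc import datafile='/data/bigfile.csv'   /* placeholder path */
            out=work.guess
            dbms=csv
            replace;
   guessingrows=1000000;        /* base the type/length guesses on 1 million rows */
run;

option obs=max;                 /* restore full processing afterwards */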

evgenys
Calcite | Level 5
I tried GUESSINGROWS 4 million ...it doesn't end.😃
Reeza
Super User

If 1 million doesn't get it that would be very surprising. 

 

I suggested 1.5 million.

Kurt_Bremser
Super User

@evgenys wrote:
I tried GUESSINGROWS 4 million ...it doesn't end.😃

Oh yeah, 4 million rows * 150 columns takes a lot of time to inspect. It is what happens when you try something stupid.

Get back to the data source and demand the description, and tell them that the data is useless without it.

evgenys
Calcite | Level 5
😃
I did that first... but until it's done I have to cope with it...
TomKari
Onyx | Level 15

This isn't an easy problem!

 

Do any of your fields have internal commas? If not, then you could use (untested):

 

data _null_;
   retain MaxCols 0;
   infile x end=LastRec;   /* x = fileref assigned to the CSV */
   input;
   ColCount = count(_infile_, ",") + 1;   /* commas + 1 = number of fields */
   if ColCount > MaxCols then MaxCols = ColCount;
   if LastRec then call symputx("ColCount", MaxCols);   /* symputx trims blanks */
run;

 

This should give you the highest number of columns in your file.

 

Then maybe something along the lines of:

 

%macro GetLen;
data _null_;
   retain MaxLen 0;
   infile x end=LastRec;
   input;
   %do i = 1 %to &ColCount;
      ColLen = length(scan(_infile_, &i, ","));   /* split on commas only */
      if ColLen > MaxLen then MaxLen = ColLen;
   %end;
   if LastRec then call symputx("ColLen", MaxLen);
run;
%mend;
%GetLen;


And finally:

 

%macro GetData;
data Want;
   length Col1-Col&ColCount. $ &ColLen.;
   infile x;
   input;
   %do i = 1 %to &ColCount;
      Col&i. = scan(_infile_, &i, ",");
   %end;
run;
%mend;
%GetData;

 

Good luck!

Tom

evgenys
Calcite | Level 5
Thanks Tom. I'll try it tomorrow.
Tom
Super User Tom
Super User

You don't need to know what is in a CSV file to read it in as character strings. If you don't even know how many columns there are just use a larger number (150 in the example below) than you expect.  If the last column is not empty then increase the number and read the file again.

 

data temp (compress=yes);
   infile 'myfile.csv' dsd truncover firstobs=2 ;
   length x1-x150 $200 ;
   input x1-x150;
run;

You can then analyze the character strings yourself and make your own decision on what is in it.  

  • Find the maximum length (if you find any with a maximum length close to 200 then you might want to use more than $200 in the step above).
  • Check if they can be converted to a number by using the INPUT function with the COMMA32. informat.
  • Check if they can be converted to a date, time, or datetime by using the ANYDTDTE, ANYDTTME, and ANYDTDTM informats, respectively.

If the first line has variable names then you could read that line in separately and use it to rename the variables.
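A hedged sketch of those checks (assumes the temp data set from the step above; shown for x1 and x2 only, extend to all 150 as needed):

proc sql;
   /* maximum observed length of each string column */
   select max(length(x1)) as max_x1,
          max(length(x2)) as max_x2
     from temp;
quit;

data check;
   set temp (obs=1000);                 /* spot-check a slice */
   num1  = input(x1, ?? comma32.);      /* ?? suppresses errors for non-numerics */
   date1 = input(x1, ?? anydtdte32.);   /* missing if x1 is not date-like */
   format date1 date9.;
run;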

PGStats
Opal | Level 21

Here is a quick way to take a 1% random sample of the records using a line pointer control in the input statement

 

data sample;
   length str $200;
   infile "&sasforum\datasets\frame.csv" truncover line=lineNo;
   linePt = ceil(2 * 100 * rand("uniform"));   /* random gap, 100 lines on average, ~1% sample */
   input #linePt str &;                        /* jump ahead linePt lines, read one */
   line + lineNo;                              /* running line counter */
   keep line str;
run;
PG
ballardw
Super User

My rough stab would be to run proc import with GUESSINGROWS in the 32000 range, then save and inspect the generated DATA step code.

I would likely increase the lengths of the character variables in case a generated length (73 or such) doesn't quite work for all of them; adding around 10 percent is typical.

 

But if you don't have a document that says what any of the columns are, then what are you going to do with this data file? Unless you have column headers long enough to be explanatory, it may be hard to tell what anything else is.

evgenys
Calcite | Level 5
Let's see... you have software whose output CSV file can change. You don't know what you'll get, and you can't change your import code every time. That's why you need a suitable generic approach.

