BookmarkSubscribeRSS Feed
France
Quartz | Level 8

I am SAS beginner and trying to import a large .csv file into SAS. Could any one give me an example of importing a csv file into SAS by Data step code please? Beside, do I need any code in data step to ensure the accuracy of imported date?

10 REPLIES 10
Reeza
Super User

We can't give you code because data step code requires knowledge of the data. You typically need to know the variables, variable type and length for starters, usually noted in the record layout or documentation on the data set. 

 

You can always use PROC IMPORT and then steal the data step code from the log. 

 

A bit of this is written up here:

 

https://stats.idre.ucla.edu/sas/faq/how-do-i-read-in-a-delimited-ascii-file-in-sas/

 


@France wrote:

I am SAS beginner and trying to import a large .csv file into SAS. Could any one give me an example of importing a csv file into SAS by Data step code please? Beside, do I need any code in data step to ensure the accuracy of imported date?


 

France
Quartz | Level 8

Thanks for your answer. but how could I know the maximum length of each column if the csv file is so large that I cannot check them one by one? Could you give me some codes to check it please?

 

For example, if the Domain of two variables ( separately APPLN_ID and APPLN_TITLE) is 'Number 0 ... 999 999 999' and 'Up to 3000 characters', is that mean the maximum length of two variables are separately 9 and 3000?  Could i import them into sas by following codes?

 

data XXX;

input APPLN_ID: 9. APPLN_TITLE: $3000. ;

format APPLN_ID: 9. APPLN_TITLE: $3000. ;

run;

 

or 

 

data XXX;

informat APPLN_ID 9.;

informat APPLN_TITLE $3000. ;

format APPLN_ID 9.;

format APPLN_TITLE $3000. ;

input 

APPLN_ID

APPLN_TITLE $;

run;

 

or

 

data XXX;

% let _EFIERR_ =0;

inflie 'XXX.csv' delimiter = ',' missover dsd lrecl=32767;

informat APPLN_ID 9.;

informat APPLN_TITLE $3000. ;

format APPLN_ID 9. ;

format APPLN_TITLE $3000. ;

input 

APPLN_ID

APPLN_TITLE $

;

if _ERROR_ then call symputx ('_EFIERR_',1);

run;

 

Which one is right?  what the difference between them? 

thanks in advance.

 

France
Quartz | Level 8
Thanks for your answer.
Reeza
Super User

If this is the question you also posted on SO, my answer there is still correct. 

 


@France wrote:

Thanks for your answer. but how could I know the maximum length of each column if the csv file is so large that I cannot check them one by one? Could you give me some codes to check it please?

 


You typically need to know the variables, variable type and length for starters, usually noted in the record layout or documentation on the data set. 

 

First rule of data analysis: Know thy data.

 

 'Number 0 ... 999 999 999' and 'Up to 3000 characters'

 

I would interpret that as the variable has values ranging from 0 to 999,999,999 in a character variable that has a length of 3000 characters. Why it would be 3000 characters long, I have no idea...seems too big to me by far.

 

I don't have the file or the record layout so there is no way I can say which is right. I can say those are different methods. 

I would suggest taking a  small subset of data, testing it on that and then using it on the full data set once you have it working. 

There are many powershell or other examples on how to retrieve a small subset of a text file. 

 

 

France
Quartz | Level 8
Thanks a lot, Reeza. the data is PATSTAT database. and I will try by using a small subset of data.
Reeza
Super User

@France wrote:
Thanks a lot, Reeza. the data is PATSTAT database. and I will try by using a small subset of data.

Then it has established documentation:

 

PATSTAT – EPO Worldwide Patent Statistical Database data catalog
Description of data model of PATSTAT Raw Data and PATSTAT Online. The data catalog contains detailed descriptions of the tables and all the attributes (fields). Be aware that the data catalog is version dependent. 

 

 

I would echo @Kurt_Bremser solution though, just use PROC IMPORT and set GUESSINGROWS = max. Go home. Check the code the next day and use that as your starting point. 

 

 

 

 

France
Quartz | Level 8
ok, I have download the data catalog.
Have a good night.
ballardw
Super User

@France wrote:

Thanks for your answer. but how could I know the maximum length of each column if the csv file is so large that I cannot check them one by one? Could you give me some codes to check it please?

 


 

 

In many organizations data files, especially large ones often have documentation that has details such as the column heading, the expected data type (character, numeric, date, time or something) and the maximum lengths expected of character variables. Sometimes there will even be a list of acceptable values such a "1 to 5, 7 or 9" or "Apple, Pear, Banana or blank".

Typically the more important the data the more likely such a document exists.

 

Your first two data steps would not work because you do not have an INFILE statement indicating the file name and options needed to describe the file for reading such as a delimiter or the maximum length of the input line to read (needed if the line exceeds the default line length.

 

The last one is more likely to work (and pretty obviously based on Proc Import code) because it does indicate the source file, indicates it is a delimited file and that comma will separate the values (delimiter=','), a maximum expected line length (Lrecl), if the character variables may have commas imbedded (DSD) but if the value enclosed in quotes the comma won't end the variable, and what to do if the data runs out before all of the variables on the input statement are read (missover which  would have any of the variables missing not found in the data, write the record to the output and then start reading the next data record from the first variable).

The lines

% let _EFIERR_ =0;

if _ERROR_ then call symputx ('_EFIERR_',1);

create a macro variable _EFIERR_ that is set to 0 by default and if any errors occur when reading the data then it is set to 1. Other steps could use this information to do other tasks if there is a problem. These lines are included in Proc Import generated code because the SAS developers do not know if the code used will be in a batch environment (no one looking over the results as steps are running) and so you could include additional instructions such as to stop any processing and write a note to the log or such to notify the user there was a problem with the data.

France
Quartz | Level 8
Many thanks for your answer, ballardw.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1703 views
  • 2 likes
  • 4 in conversation