Citrine10
Obsidian | Level 7

hi,

I have been trying for some time to get this right. I have various CSV files, some with numeric columns, some with character columns, and some with both. As CSV file formats are rather tricky, I am not sure what to do.

 

When I import the files, how can I ensure that the data keeps the same format as it was captured in the CSV file?

My import code:

proc import file="C:\File.csv"
  out=work.filecsv
  dbms=csv
  replace;
  guessingrows=100;
run;
9 REPLIES
andreas_lds
Jade | Level 19

The best way: write your own data steps using the file description provided by the creator of the file. Or use the code generated by proc import and make the necessary changes.

Citrine10
Obsidian | Level 7

Thank you. In that case, how do I check that a column whose name contains the word 'Text' is read as a character (string) variable?

Citrine10
Obsidian | Level 7
where is the documentation you refer to?
Kurt_Bremser
Super User

You got it with the file, or it is available in the place from where you retrieved the file.

If someone sent you just the file without any documentation, ask there for it.

If this is supposed to happen in a professional environment, print a page of the file, wrap the print around a baseball bat, go to the source of the file, and apply the baseball bat liberally. This should make it clear that data dictionaries are an essential ingredient of moving data.

Just kidding, but not much.

 

If none of this is possible, you have to make guesses. Inspect the file with a text editor (NOT with Excel!!!) and see if you can make sense out of what you see.
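If you would rather stay inside SAS for that first look, a data _null_ step can dump the first few raw lines of the file to the log (a minimal sketch; the path is just a placeholder):

data _null_;
  infile "C:\File.csv" obs=10;  /* read only the first 10 raw lines */
  input;
  put _infile_;                 /* write each raw line to the log */
run;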

andreas_lds
Jade | Level 19

@Citrine10 wrote:
where is the documentation you refer to?

Hardly possible to answer for anybody outside your organisation.

ballardw
Super User

One thing that may help is to increase the GUESSINGROWS value, but that is a long shot without an example of the CSV file that misbehaves.

If a column you think is numeric is imported as character, that means some of the values in that column are, in fact, character.

One common cause of this is more than one header row: the text in the second or subsequent header rows is read as data, and that sets the column to character under Proc Import's rules.

Examine the data immediately after reading it. If you see this, you can copy the generated code from the log, paste it into the editor, and change the INFORMAT (and associated FORMAT) statement to a numeric informat. You may also want to add a FIRSTOBS= option to the INFILE statement to skip the extra header rows, as in the sketch below.
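A sketch of what the edited code might look like; the variable names and informats are hypothetical, and the code PROC IMPORT writes to your log will of course list your own columns:

/* Sketch only: code copied from the log after PROC IMPORT, then edited */
data work.filecsv;
  infile "C:\File.csv" delimiter=',' missover dsd firstobs=3; /* firstobs=3 skips two header rows */
  informat id $20. ;
  informat amount best32. ;   /* changed from a character informat to a numeric one */
  format amount best12. ;
  input id $ amount ;
run;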

 

However, if the column is read as text because it contains actual values like "NA", "NULL", "MISSING", "<.0001", "Out of range", or similar non-numeric values, then copy the generated code the same way and change the informat, and be prepared to see a lot of invalid-data messages in the log.

 

If the files are supposed to have the same structure, then get ONE of these programs working: specify a custom informat to handle the text (best; see the sketch below) or use an INPUT modifier to suppress invalid-data messages for those problem values (which may hide issues you need to address).
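For the custom-informat route, a hedged sketch: an INVALUE informat that maps the known text codes to missing and reads everything else as a number. The code values and variable names here are made up for illustration:

proc format;
  invalue nanum (upcase)
    'NA', 'NULL', 'MISSING' = .          /* known text codes become missing */
    other                   = [best32.]; /* anything else is read as a number */
run;

data work.filecsv;
  infile "C:\File.csv" dsd truncover firstobs=2;
  length id $20 amount 8;
  informat amount nanum.;
  input id amount;
run;

The other option mentioned above is the ?? format modifier on the INPUT statement, which suppresses the invalid-data notes but, as noted, can hide real problems.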

Tom
Super User

You seem to mainly be interested in checking if PROC IMPORT created a numeric variable from a column whose header name makes it look like it should be a character variable.

 

Assuming you asked PROC IMPORT to inspect the whole file before deciding how to read it, this can only happen if all of the non-empty values in that column look like valid numbers, for example if you have an ID variable and the IDs only contain digits. Note that if the column is completely empty, then PROC IMPORT will create it as character with a length of $1, since that takes less space in the dataset than a numeric variable.

 

Once the value has been read into a number there is no fully reliable way to get back the original text that was read. You will probably want to re-read the file using the right definition for the variable(s).

 

PROC IMPORT does not provide any method for that, but it does show the data step it generated in the log. It is pretty simple to copy that code, modify it to define and read the variable correctly, and then run the modified code to make a new, clean dataset.

 

If you need to automate this process, you might want to skip using PROC IMPORT.

Either write your own code to read the CSV file(s): it is really trivial to read a CSV file. Just point the INFILE statement at it, define the variables, attach any NEEDED informats or formats, attach any wanted labels, and then read the variables with an INPUT statement. If you DEFINE the variables in the same order as they appear in the lines of the CSV file, then you can use a positional variable list in the INPUT statement to avoid having to retype all of the names.

data want;
  infile 'myfile.csv' dsd truncover firstobs=2;
  length id $20 age 8 sex $1 text $100 ;
  input id -- text;
run;

If you really need to GUESS how to read the file, then use a tool that gives you some flexibility to force the type of the variables that are created.

https://github.com/sasutils/macros/blob/master/csv2ds.sas

The OVERRIDES dataset can be used to set metadata used to generate the
code. Only the VARNUM variable is required and the dataset must be sorted
by VARNUM. The values of INFORMAT and FORMAT must include the period.
Since the file is read using list mode input there is no need to include a
width on the INFORMAT.  Do not include decimal places on INFORMAT unless
you want INPUT to divide values without periods by that power of 10. A sketch of such a dataset follows the field list below.
  NAME     LEN  Description
  -------- ---- -------------------------------------------------
  varnum   8    Column order (req)
  name     $32  Variable name
  length   $6   LENGTH as required for LENGTH statement
  informat $43  INFORMAT specification used to read from file
  format   $43  FORMAT to attach to variable in the dataset
  label    $256 LABEL to attach to variable
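For example, a hedged sketch of an OVERRIDES dataset that forces a hypothetical third column to be read as character; the column number and attributes are made up, and you would pass the dataset to the macro through its OVERRIDES parameter (see the macro's header comments for the exact call syntax):

/* Sketch: force column 3 to be read as character $100.
   Remember: INFORMAT and FORMAT values must include the period. */
data overrides;
  length varnum 8 name $32 length $6 informat $43 format $43 label $256;
  varnum=3; name='text'; length='$100'; informat='$100.'; format='$100.';
  label='Free text'; output;
run;

proc sort data=overrides; by varnum; run;  /* dataset must be sorted by VARNUM */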
L10_guri
Fluorite | Level 6

Hi, I had a similar issue with a .txt file with 700K+ rows. So my first question is: how many rows does your dataset have?
With only 100 guessing rows you are going to get wrong data if, after the first 100 rows, the data has different lengths.
I suggest increasing the guessing rows to something closer to the dataset size and then using PROC COMPARE to check whether the two methods give different results (see the sketch below).
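A sketch of that comparison, assuming a reasonably recent SAS release where GUESSINGROWS=MAX is available (file and dataset names are placeholders):

/* Re-import, letting PROC IMPORT scan the whole file before guessing types */
proc import file="C:\File.csv"
  out=work.filecsv_max
  dbms=csv
  replace;
  guessingrows=max;
run;

/* Compare against the 100-row guess; conflicting types and lengths show up in the report */
proc compare base=work.filecsv compare=work.filecsv_max;
run;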
