PROC IMPORT CSV Importing Numbers as Characters when need numbers

broomtc · Posted 06-24-2019 05:08 PM

I am running into a situation where proc import dbms=csv is failing to identify numbers. Every column comes back as type 'CHAR' despite the CSV passing validation with flying colors.

filename datain url 'https://data.cms.gov/resource/csi5-w7vr.csv';

proc import datafile=datain out=dataout dbms=csv replace;
getnames=yes;
guessingrows=max;
run;

Alphabetic List of Variables and Attributes# Variable Type Len Format Informat2321142592048182162215172426272829303132333435363738394041153101112137196

american_indian_or_alaska	Char	5	$5.	$5.
asian_pacific_islander	Char	5	$5.	$5.
average_age	Char	4	$4.	$4.
average_hcc_score	Char	6	$6.	$6.
average_length_of_stay_days	Char	6	$6.	$6.
black_beneficiaries	Char	5	$5.	$5.
city	Char	21	$21.	$21.
distinct_beneficiaries_per	Char	5	$5.	$5.
dual_beneficiaries	Char	5	$5.	$5.
facility_name	Char	52	$52.	$52.
female_beneficiaries	Char	5	$5.	$5.
hispanic_beneficiaries	Char	5	$5.	$5.
male_beneficiaries	Char	5	$5.	$5.
nondual_beneficiaries	Char	5	$5.	$5.
other_unknown_beneficiaries	Char	4	$4.	$4.
percent_of_beneficiaries	Char	6	$6.	$6.
percent_of_beneficiaries_1	Char	5	$5.	$5.
percent_of_beneficiaries_2	Char	5	$5.	$5.
percent_of_beneficiaries_3	Char	6	$6.	$6.
percent_of_beneficiaries_4	Char	5	$5.	$5.
percent_of_beneficiaries_5	Char	5	$5.	$5.
percent_of_beneficiaries_6	Char	5	$5.	$5.
percent_of_beneficiaries_7	Char	5	$5.	$5.
percent_of_beneficiaries_8	Char	5	$5.	$5.
percent_of_beneficiaries_9	Char	5	$5.	$5.
percent_of_beneficiaries_10	Char	5	$5.	$5.
percent_of_beneficiaries_11	Char	5	$5.	$5.
percent_of_beneficiaries_12	Char	6	$6.	$6.
percent_of_beneficiaries_13	Char	5	$5.	$5.
percent_of_beneficiaries_14	Char	5	$5.	$5.
percent_of_beneficiaries_15	Char	6	$6.	$6.
provider_id	Char	7	$7.	$7.
state	Char	4	$4.	$4.
street_address	Char	42	$42.	$42.
total_snf_charge_amount	Char	10	$10.	$10.
total_snf_medicare_allowed	Char	10	$10.	$10.
total_snf_medicare_payment	Char	10	$10.	$10.
total_snf_medicare_standard	Char	9	$9.	$9.
total_stays	Char	5	$5.	$5.
white_beneficiaries	Char	5	$5.	$5.
zip_code	Char	7	$7.	$7.

The file has specifications with several numbers. Documentation for file is here. https://dev.socrata.com/foundry/data.cms.gov/csi5-w7vr

ballardw · Posted 06-24-2019 05:27 PM

I suggest downloading that file and examining or reading from a local version.

There are a number of things that may be going on with the HTTPS connection. And we can't see any of them or the actual data because of security.

What does the generated data step code look like?

I would suggest, especially if you are going to read other files with the same layout, that you copy the generated code from the log to the editor, change the character informats for the variables that should be numeric or dates to an appropriate informat, save the code and rerun the data step instead of Proc Import.

Also look in the log for any messages related to "transcoding" or "encoding" issues.

broomtc · Posted 06-24-2019 05:40 PM

Took a look and they are putting a " " around everything.

ballardw · Posted 06-24-2019 06:46 PM

@broomtc wrote:

Took a look and they are putting a " " around everything.

You can modify the data step as mentioned then add something like:

numvar = input(charvar,6.);

drop charvar;

The 6 would come from the length of the informat associated with the character version.

If you have currency values then use a COMMA format.

If the variable should be a date then use an appropriated date informat instead.

Kurt_Bremser · Posted 06-24-2019 05:29 PM

It seems asterisks are being used to denote missing values. proc import can't deal with that.

Write your own data step, according to the file specification, and you may have to do extra logic to handle the asterisks in a graceful manner.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

broomtc · Posted 06-24-2019 05:41 PM

Looks like between the more complicated * inclusions and dealing with " " where there shouldn't be I am in for a much longer code base.

Kurt_Bremser · Posted 06-24-2019 06:27 PM

The quotes are handled by using the dsd option in the infile statement. Replacing the single asterisks with nothing can be done by applying the tranwrd() function to _infile_ (the input buffer), and then doing the actual read.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Re: PROC IMPORT CSV Importing Numbers as Characters when need numbers

Register Today!

SAS Training: Just a Click Away