I am running into a situation where proc import dbms=csv is failing to identify numbers. Every column comes back as type 'CHAR' despite the CSV passing validation with flying colors.
filename datain url 'https://data.cms.gov/resource/csi5-w7vr.csv';
proc import datafile=datain out=dataout dbms=csv replace;
getnames=yes;
guessingrows=max;
run;
Alphabetic List of Variables and Attributes# Variable Type Len Format Informat2321142592048182162215172426272829303132333435363738394041153101112137196
american_indian_or_alaska | Char | 5 | $5. | $5. |
asian_pacific_islander | Char | 5 | $5. | $5. |
average_age | Char | 4 | $4. | $4. |
average_hcc_score | Char | 6 | $6. | $6. |
average_length_of_stay_days | Char | 6 | $6. | $6. |
black_beneficiaries | Char | 5 | $5. | $5. |
city | Char | 21 | $21. | $21. |
distinct_beneficiaries_per | Char | 5 | $5. | $5. |
dual_beneficiaries | Char | 5 | $5. | $5. |
facility_name | Char | 52 | $52. | $52. |
female_beneficiaries | Char | 5 | $5. | $5. |
hispanic_beneficiaries | Char | 5 | $5. | $5. |
male_beneficiaries | Char | 5 | $5. | $5. |
nondual_beneficiaries | Char | 5 | $5. | $5. |
other_unknown_beneficiaries | Char | 4 | $4. | $4. |
percent_of_beneficiaries | Char | 6 | $6. | $6. |
percent_of_beneficiaries_1 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_2 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_3 | Char | 6 | $6. | $6. |
percent_of_beneficiaries_4 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_5 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_6 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_7 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_8 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_9 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_10 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_11 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_12 | Char | 6 | $6. | $6. |
percent_of_beneficiaries_13 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_14 | Char | 5 | $5. | $5. |
percent_of_beneficiaries_15 | Char | 6 | $6. | $6. |
provider_id | Char | 7 | $7. | $7. |
state | Char | 4 | $4. | $4. |
street_address | Char | 42 | $42. | $42. |
total_snf_charge_amount | Char | 10 | $10. | $10. |
total_snf_medicare_allowed | Char | 10 | $10. | $10. |
total_snf_medicare_payment | Char | 10 | $10. | $10. |
total_snf_medicare_standard | Char | 9 | $9. | $9. |
total_stays | Char | 5 | $5. | $5. |
white_beneficiaries | Char | 5 | $5. | $5. |
zip_code | Char | 7 | $7. | $7. |
The file has specifications with several numbers. Documentation for file is here. https://dev.socrata.com/foundry/data.cms.gov/csi5-w7vr
I suggest downloading that file and examining or reading from a local version.
There are a number of things that may be going on with the HTTPS connection. And we can't see any of them or the actual data because of security.
What does the generated data step code look like?
I would suggest, especially if you are going to read other files with the same layout, that you copy the generated code from the log to the editor, change the character informats for the variables that should be numeric or dates to an appropriate informat, save the code and rerun the data step instead of Proc Import.
Also look in the log for any messages related to "transcoding" or "encoding" issues.
Took a look and they are putting a " " around everything.
@broomtc wrote:
Took a look and they are putting a " " around everything.
You can modify the data step as mentioned then add something like:
numvar = input(charvar,6.);
drop charvar;
The 6 would come from the length of the informat associated with the character version.
If you have currency values then use a COMMA format.
If the variable should be a date then use an appropriated date informat instead.
It seems asterisks are being used to denote missing values. proc import can't deal with that.
Write your own data step, according to the file specification, and you may have to do extra logic to handle the asterisks in a graceful manner.
Looks like between the more complicated * inclusions and dealing with " " where there shouldn't be I am in for a much longer code base.
The quotes are handled by using the dsd option in the infile statement. Replacing the single asterisks with nothing can be done by applying the tranwrd() function to _infile_ (the input buffer), and then doing the actual read.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.