- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a dataset shared with me in csv format. Some NA values have been inexplicably set as '.' in this csv format. This causes PROC IMPORT to guess these variables as character, when they should be numeric. I have seen many posts that suggest the infile statement with a data step, and some formulation of NA handling using iteration and if-then statements.This would then require me to then manually change the columns from character to numeric by name. There are 1000+ columns.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can try the GUESSINGROWS statement of PROC IMPORT.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Appreciate the suggestion. Guessingrows still has the same issue, because it is the '.' NA symbol that turns the variable into a character. Not a factor of the number of rows SAS reads.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
1000+ columns? That screams "DESIGN FAIL" in the first place.
But since you've been given this and have to somehow beat it into usable shape, let's see:
How is that file structured? In particular, are all numeric columns situated to the right of a given column (e.g. numeric columns start at column 5, and all columns up to the end of a record are numeric)?
What character is used as delimiter?
Could you please post an example of the file (header and a few data lines) to illustrate this, and which we can use for testing our code ideas?
Please use the {i} button for posting such data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for sticking with this problem. Unfortunately, the large number of columns isn't a design flaw, just the number of variables needed in the dataset (there are also quite a few observations). So I have to work with them, and the column order has been created in consideration of survey order. I can't share any of the data, either, but I can post a representation of it - keeping in mind any solution that would require reformatting a column individually wouldn't scale from a few columns to 1,000. It's comma delimited, so the raw text of csv would look as below.
The two other languages I've used, R and Python, both have ways to indicate NULL/NA character strings included in basic operations for reading in data, so I'm kind of mystified this isn't available in SAS.
ISOUTSIDE,EATDRYFOOD,TYPE,OTHERTYPE,ISFRIENDLY 1,1,dog,,. 1,.,dog,,. 1,.,cat,,. 0,.,dog,,. 0,.,,,. 0,.,,My pet's name is Jeffrey,. .,.,unknown,,. .,1,dog,,. .,1,dog,,1 1,0,cat,,0 1,0,cat,,1 1,0,elf,,1
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I saved your data to a file and ran this code:
proc import
datafile='$HOME/sascommunity/utabikunda.csv'
dbms=csv
out=utabikunda
replace
;
run;
proc print data=utabikunda noobs;
run;
proc contents data=utabikunda;
run;
After which I got this output:
ISOUTSIDE EATDRYFOOD TYPE OTHERTYPE ISFRIENDLY 1 1 dog . 1 . dog . 1 . cat . 0 . dog . 0 . . 0 . My pet's name is Jeffrey . . . unknown . . 1 dog . . 1 dog 1 1 0 cat 0 1 0 cat 1 1 0 elf 1
Die Prozedur CONTENTS Dateiname WORK.UTABIKUNDA Beobachtungen 12 Membertyp DATA Variablen 5 Engine V9 Indizes 0 Erstellt 12.12.2019 13:18:02 Beobachtungslänge 56 Zuletzt geändert 12.12.2019 13:18:02 Gelöschte Beobachtungen 0 Schutz Komprimiert NEIN Dateityp Sortiert NEIN Etikett Datendarstellung HP_UX_64, RS_6000_AIX_64, SOLARIS_64, HP_IA64 Codierung latin9 European (ISO) Engine/Host-abhängige Informationen Dateiseitengröße 65536 Anzahl der Dateiseiten 1 Erste Datenseite 1 Max. Anz. Beob. pro Seite 1166 Anz. Beob. auf erster Datenseite 12 Anzahl der Dateireparaturen 0 Dateiname /wadaten/work2/saswork/SAS_work9750024C002C_as-dwh01/ SAS_workF399024C002C_as-dwh01/utabikunda.sas7bdat Erstellt mit Release 9.0401M5 Erstellt mit Betriebssystem AIX Inode-Nummer 140231 Zugriffsberechtigung rw-r--r-- Besitzername e9782 Dateigröße 128KB Dateigröße (Byte) 131072 Alphabetische Liste der Variablen und Attribute Ausg. Einl. # Variable Typ Länge Format Format 2 EATDRYFOOD Num 8 BEST12. BEST32. 5 ISFRIENDLY Num 8 BEST12. BEST32. 1 ISOUTSIDE Num 8 BEST12. BEST32. 4 OTHERTYPE Char 24 $24. $24. 3 TYPE Char 7 $7. $7.
As you can see, the numeric columns were correctly imported, with missing values where a dot was encountered.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Realizing it might actually be a problem of quotation marks around values. There is a DSD option for the infile statement of a data step, but I believe this method of importing csv data requires specifying each column. Is there a similar analogue with proc import to remove quotations around imported values?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When you inspect the log (Maxim 2) of the proc import code I ran, you can see that the data step created by proc import already uses the dsd option (to treat successive delimiters as missing values).
If you have specific problems with certain values, take the data step from the log and modify it (there's a lot of things you can optimize right from the start). Once you have code that fixes one column, fixing a lot of columns is "just" some macro-coding away.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Utabikunda wrote:
Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.
SAS will remove quotes around values. Having quotes around text that only contains digits will not confuse PROC IMPORT. I modified your example to add some quotes and PROC IMPORT still read the file fine. You need to dig deeper to figure out what exactly is causing your problem. Perhaps the quotes are not quotes, but are those "pretty" quotes that typesetters use?
ISOUTSIDE,EATDRYFOOD,TYPE,OTHERTYPE,ISFRIENDLY 1,1,dog,,. 1,'.',dog,,. 1,.,cat,,"." 0,.,dog,,. 0,.,,,. 0,.,,My pet's name is Jeffrey,. .,.,unknown,,. .,1,dog,,. .,1,dog,,1 1,0,cat,,0 1,0,cat,,1 1,0,elf,,1
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
One issue that can confuse SAS is when you have a pair of values where one starts with a quote and the other ends with a quote. That will look like a single value that contains commas but is enclosed in quotes and cause the columns to get shifted.
Consider this file with 6 fields.
A,B,C,X,Y.D S1,S2,S3,1,2,Yes 'twas,a,goin',3,4,No
That last row will look to SAS like it has only 4 values. So the text, No, that should be the value of D is read as the value of X and will either fail or if using PROC IMPORT cause it to guess that X should be a character variable.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Utabikunda wrote:
Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.
Don't be silly. The DSD option is such a basic thing that tens of thousands of SAS users around the world would be screaming at SAS if it didn't work.
Please post some REAL example input data you have, so we can take a look at the issue without having to guess.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'll happily sign up for the group of screamers. Your answer provides no help
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@ayaz1 wrote:
I'll happily sign up for the group of screamers. Your answer provides no help
If you are having trouble with a specific delimited text file start a new thread and provide the details there.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Normal input will read a period is missing. SAS is able to read that file as it is without problems. PROC IMPORT even works.
proc import datafile=test out=test replace
dbms=csv;
run;
proc contents data=test varnum; run;
The CONTENTS Procedure Variables in Creation Order # Variable Type Len Format Informat 1 ISOUTSIDE Num 8 BEST12. BEST32. 2 EATDRYFOOD Num 8 BEST12. BEST32. 3 TYPE Char 7 $7. $7. 4 OTHERTYPE Char 24 $24. $24. 5 ISFRIENDLY Num 8 BEST12. BEST32.
Your problem of PROC IMPORT converting numeric to text is caused by something else. Do some of the numeric values have text strings? Like the NA in your message?
A normal CSV file will indicate a missing value be not putting any characters between the commas. Can you have who ever created the file create it using that standard convention?