SAS Programming

Utabikunda · Posted 12-11-2019 04:27 PM

I have a dataset shared with me in csv format. Some NA values have been inexplicably set as '.' in this csv format. This causes PROC IMPORT to guess these variables as character, when they should be numeric. I have seen many posts that suggest the infile statement with a data step, and some formulation of NA handling using iteration and if-then statements.This would then require me to then manually change the columns from character to numeric by name. There are 1000+ columns.

PaigeMiller · Posted 12-11-2019 05:36 PM

You can try the GUESSINGROWS statement of PROC IMPORT.

https://go.documentation.sas.com/?cdcId=pgmmvacdc&cdcVersion=9.4&docsetId=proc&docsetTarget=p13kvtl8...

--
Paige Miller

Utabikunda · Posted 12-11-2019 07:53 PM

Appreciate the suggestion. Guessingrows still has the same issue, because it is the '.' NA symbol that turns the variable into a character. Not a factor of the number of rows SAS reads.

Kurt_Bremser · Posted 12-12-2019 01:35 AM

1000+ columns? That screams "DESIGN FAIL" in the first place.

But since you've been given this and have to somehow beat it into usable shape, let's see:

How is that file structured? In particular, are all numeric columns situated to the right of a given column (e.g. numeric columns start at column 5, and all columns up to the end of a record are numeric)?

What character is used as delimiter?

Could you please post an example of the file (header and a few data lines) to illustrate this, and which we can use for testing our code ideas?

Please use the {i} button for posting such data.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Utabikunda · Posted 12-12-2019 07:11 AM

Thanks for sticking with this problem. Unfortunately, the large number of columns isn't a design flaw, just the number of variables needed in the dataset (there are also quite a few observations). So I have to work with them, and the column order has been created in consideration of survey order. I can't share any of the data, either, but I can post a representation of it - keeping in mind any solution that would require reformatting a column individually wouldn't scale from a few columns to 1,000. It's comma delimited, so the raw text of csv would look as below.

The two other languages I've used, R and Python, both have ways to indicate NULL/NA character strings included in basic operations for reading in data, so I'm kind of mystified this isn't available in SAS.

ISOUTSIDE,EATDRYFOOD,TYPE,OTHERTYPE,ISFRIENDLY
1,1,dog,,.
1,.,dog,,.
1,.,cat,,.
0,.,dog,,.
0,.,,,.
0,.,,My pet's name is Jeffrey,.
.,.,unknown,,.
.,1,dog,,.
.,1,dog,,1
1,0,cat,,0
1,0,cat,,1
1,0,elf,,1

Kurt_Bremser · Posted 12-12-2019 07:20 AM

I saved your data to a file and ran this code:

proc import
  datafile='$HOME/sascommunity/utabikunda.csv'
  dbms=csv
  out=utabikunda
  replace
;
run;

proc print data=utabikunda noobs;
run;

proc contents data=utabikunda;
run;

After which I got this output:

ISOUTSIDE      EATDRYFOOD    TYPE       OTHERTYPE                     ISFRIENDLY

        1               1    dog                                               .
        1               .    dog                                               .
        1               .    cat                                               .
        0               .    dog                                               .
        0               .                                                      .
        0               .               My pet's name is Jeffrey               .
        .               .    unknown                                           .
        .               1    dog                                               .
        .               1    dog                                               1
        1               0    cat                                               0
        1               0    cat                                               1
        1               0    elf                                               1

                                                       Die Prozedur CONTENTS

                  Dateiname         WORK.UTABIKUNDA                                  Beobachtungen            12  
                  Membertyp         DATA                                             Variablen                5   
                  Engine            V9                                               Indizes                  0   
                  Erstellt          12.12.2019 13:18:02                              Beobachtungslänge        56  
                  Zuletzt geändert  12.12.2019 13:18:02                              Gelöschte Beobachtungen  0   
                  Schutz                                                             Komprimiert              NEIN
                  Dateityp                                                           Sortiert                 NEIN
                  Etikett                                                                                         
                  Datendarstellung  HP_UX_64, RS_6000_AIX_64, SOLARIS_64, HP_IA64                                 
                  Codierung         latin9  European (ISO)                                                        


                                                Engine/Host-abhängige Informationen

Dateiseitengröße                  65536                                                                                             
Anzahl der Dateiseiten            1                                                                                                 
Erste Datenseite                  1                                                                                                 
Max. Anz. Beob. pro Seite         1166                                                                                              
Anz. Beob. auf erster Datenseite  12                                                                                                
Anzahl der Dateireparaturen       0                                                                                                 
Dateiname                         /wadaten/work2/saswork/SAS_work9750024C002C_as-dwh01/                                             
                                  SAS_workF399024C002C_as-dwh01/utabikunda.sas7bdat                                                 
Erstellt mit Release              9.0401M5                                                                                          
Erstellt mit Betriebssystem       AIX                                                                                               
Inode-Nummer                      140231                                                                                            
Zugriffsberechtigung              rw-r--r--                                                                                         
Besitzername                      e9782                                                                                             
Dateigröße                        128KB                                                                                             
Dateigröße (Byte)                 131072                                                                                            


                                           Alphabetische Liste der Variablen und Attribute
 
                                                                           Ausg.      Einl.
                                       #    Variable      Typ     Länge    Format     Format

                                       2    EATDRYFOOD    Num         8    BEST12.    BEST32.
                                       5    ISFRIENDLY    Num         8    BEST12.    BEST32.
                                       1    ISOUTSIDE     Num         8    BEST12.    BEST32.
                                       4    OTHERTYPE     Char       24    $24.       $24.   
                                       3    TYPE          Char        7    $7.        $7.

As you can see, the numeric columns were correctly imported, with missing values where a dot was encountered.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Utabikunda · Posted 12-12-2019 08:50 AM

Realizing it might actually be a problem of quotation marks around values. There is a DSD option for the infile statement of a data step, but I believe this method of importing csv data requires specifying each column. Is there a similar analogue with proc import to remove quotations around imported values?

Kurt_Bremser · Posted 12-12-2019 08:59 AM

When you inspect the log (Maxim 2) of the proc import code I ran, you can see that the data step created by proc import already uses the dsd option (to treat successive delimiters as missing values).

If you have specific problems with certain values, take the data step from the log and modify it (there's a lot of things you can optimize right from the start). Once you have code that fixes one column, fixing a lot of columns is "just" some macro-coding away.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Utabikunda · Posted 12-15-2019 08:45 AM

Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.

Tom · Posted 12-15-2019 09:16 AM

@Utabikunda wrote:

Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.

SAS will remove quotes around values. Having quotes around text that only contains digits will not confuse PROC IMPORT. I modified your example to add some quotes and PROC IMPORT still read the file fine. You need to dig deeper to figure out what exactly is causing your problem. Perhaps the quotes are not quotes, but are those "pretty" quotes that typesetters use?

ISOUTSIDE,EATDRYFOOD,TYPE,OTHERTYPE,ISFRIENDLY
1,1,dog,,.
1,'.',dog,,.
1,.,cat,,"."
0,.,dog,,.
0,.,,,.
0,.,,My pet's name is Jeffrey,.
.,.,unknown,,.
.,1,dog,,.
.,1,dog,,1
1,0,cat,,0
1,0,cat,,1
1,0,elf,,1

Tom · Posted 12-15-2019 09:26 AM

One issue that can confuse SAS is when you have a pair of values where one starts with a quote and the other ends with a quote. That will look like a single value that contains commas but is enclosed in quotes and cause the columns to get shifted.

Consider this file with 6 fields.

A,B,C,X,Y.D
S1,S2,S3,1,2,Yes
'twas,a,goin',3,4,No

That last row will look to SAS like it has only 4 values. So the text, No, that should be the value of D is read as the value of X and will either fail or if using PROC IMPORT cause it to guess that X should be a character variable.

Kurt_Bremser · Posted 12-17-2019 01:46 AM

@Utabikunda wrote:

Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.

Don't be silly. The DSD option is such a basic thing that tens of thousands of SAS users around the world would be screaming at SAS if it didn't work.

Please post some REAL example input data you have, so we can take a look at the issue without having to guess.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ayaz1 · Posted 08-04-2021 12:38 PM

I'll happily sign up for the group of screamers. Your answer provides no help

Tom · Posted 08-04-2021 01:21 PM

@ayaz1 wrote:

I'll happily sign up for the group of screamers. Your answer provides no help

If you are having trouble with a specific delimited text file start a new thread and provide the details there.

Tom · Posted 12-15-2019 09:11 AM

Normal input will read a period is missing. SAS is able to read that file as it is without problems. PROC IMPORT even works.

proc import datafile=test out=test replace 
  dbms=csv;
run;

proc contents data=test varnum; run;

The CONTENTS Procedure

             Variables in Creation Order

#    Variable      Type    Len    Format     Informat

1    ISOUTSIDE     Num       8    BEST12.    BEST32.
2    EATDRYFOOD    Num       8    BEST12.    BEST32.
3    TYPE          Char      7    $7.        $7.
4    OTHERTYPE     Char     24    $24.       $24.
5    ISFRIENDLY    Num       8    BEST12.    BEST32.

Your problem of PROC IMPORT converting numeric to text is caused by something else. Do some of the numeric values have text strings? Like the NA in your message?

A normal CSV file will indicate a missing value be not putting any characters between the commas. Can you have who ever created the file create it using that standard convention?

SAS Programming

Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Re: Handle missing value characters in csv with proc import for 1000+ variables

Follow Us

What is...

SAS Programming

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...