BookmarkSubscribeRSS Feed
Utabikunda
Calcite | Level 5

I have a dataset shared with me in csv format. Some NA values have been inexplicably set as '.' in this csv format. This causes PROC IMPORT to guess these variables as character, when they should be numeric. I have seen many posts that suggest the infile statement with a data step, and some formulation of NA handling using iteration and if-then statements.This would then require me to then manually change the columns from character to numeric by name. There are 1000+ columns.

 

 

14 REPLIES 14
Utabikunda
Calcite | Level 5

Appreciate the suggestion. Guessingrows still has the same issue, because it is the '.' NA symbol that turns the variable into a character. Not a factor of the number of rows SAS reads.

Kurt_Bremser
Super User

1000+ columns? That screams "DESIGN FAIL" in the first place.

But since you've been given this and have to somehow beat it into usable shape, let's see:

How is that file structured? In particular, are all numeric columns situated to the right of a given column (e.g. numeric columns start at column 5, and all columns up to the end of a record are numeric)?

What character is used as delimiter?

Could you please post an example of the file (header and a few data lines) to illustrate this, and which we can use for testing our code ideas?

Please use the {i} button for posting such data.

 

Utabikunda
Calcite | Level 5

Thanks for sticking with this problem. Unfortunately, the large number of columns isn't a design flaw, just the number of variables needed in the dataset (there are also quite a few observations). So I have to work with them, and the column order has been created in consideration of survey order. I can't share any of the data, either, but I can post a representation of it - keeping in mind any solution that would require reformatting a column individually wouldn't scale from a few columns to 1,000. It's comma delimited, so the raw text of csv would look as below.

 

The two other languages I've used, R and Python, both have ways to indicate NULL/NA character strings included in basic operations for reading in data, so I'm kind of mystified this isn't available in SAS.

 

ISOUTSIDE,EATDRYFOOD,TYPE,OTHERTYPE,ISFRIENDLY
1,1,dog,,.
1,.,dog,,.
1,.,cat,,.
0,.,dog,,.
0,.,,,.
0,.,,My pet's name is Jeffrey,.
.,.,unknown,,.
.,1,dog,,.
.,1,dog,,1
1,0,cat,,0
1,0,cat,,1
1,0,elf,,1

 

Kurt_Bremser
Super User

I saved your data to a file and ran this code:

proc import
  datafile='$HOME/sascommunity/utabikunda.csv'
  dbms=csv
  out=utabikunda
  replace
;
run;

proc print data=utabikunda noobs;
run;

proc contents data=utabikunda;
run;

After which I got this output:

ISOUTSIDE      EATDRYFOOD    TYPE       OTHERTYPE                     ISFRIENDLY

        1               1    dog                                               .
        1               .    dog                                               .
        1               .    cat                                               .
        0               .    dog                                               .
        0               .                                                      .
        0               .               My pet's name is Jeffrey               .
        .               .    unknown                                           .
        .               1    dog                                               .
        .               1    dog                                               1
        1               0    cat                                               0
        1               0    cat                                               1
        1               0    elf                                               1
                                                       Die Prozedur CONTENTS

                  Dateiname         WORK.UTABIKUNDA                                  Beobachtungen            12  
                  Membertyp         DATA                                             Variablen                5   
                  Engine            V9                                               Indizes                  0   
                  Erstellt          12.12.2019 13:18:02                              Beobachtungslänge        56  
                  Zuletzt geändert  12.12.2019 13:18:02                              Gelöschte Beobachtungen  0   
                  Schutz                                                             Komprimiert              NEIN
                  Dateityp                                                           Sortiert                 NEIN
                  Etikett                                                                                         
                  Datendarstellung  HP_UX_64, RS_6000_AIX_64, SOLARIS_64, HP_IA64                                 
                  Codierung         latin9  European (ISO)                                                        


                                                Engine/Host-abhängige Informationen

Dateiseitengröße                  65536                                                                                             
Anzahl der Dateiseiten            1                                                                                                 
Erste Datenseite                  1                                                                                                 
Max. Anz. Beob. pro Seite         1166                                                                                              
Anz. Beob. auf erster Datenseite  12                                                                                                
Anzahl der Dateireparaturen       0                                                                                                 
Dateiname                         /wadaten/work2/saswork/SAS_work9750024C002C_as-dwh01/                                             
                                  SAS_workF399024C002C_as-dwh01/utabikunda.sas7bdat                                                 
Erstellt mit Release              9.0401M5                                                                                          
Erstellt mit Betriebssystem       AIX                                                                                               
Inode-Nummer                      140231                                                                                            
Zugriffsberechtigung              rw-r--r--                                                                                         
Besitzername                      e9782                                                                                             
Dateigröße                        128KB                                                                                             
Dateigröße (Byte)                 131072                                                                                            


                                           Alphabetische Liste der Variablen und Attribute
 
                                                                           Ausg.      Einl.
                                       #    Variable      Typ     Länge    Format     Format

                                       2    EATDRYFOOD    Num         8    BEST12.    BEST32.
                                       5    ISFRIENDLY    Num         8    BEST12.    BEST32.
                                       1    ISOUTSIDE     Num         8    BEST12.    BEST32.
                                       4    OTHERTYPE     Char       24    $24.       $24.   
                                       3    TYPE          Char        7    $7.        $7.    

As you can see, the numeric columns were correctly imported, with missing values where a dot was encountered.

Utabikunda
Calcite | Level 5

Realizing it might actually be a problem of quotation marks around values. There is a DSD option for the infile statement of a data step, but I believe this method of importing csv data requires specifying each column. Is there a similar analogue with proc import to remove quotations around imported values?

Kurt_Bremser
Super User

When you inspect the log (Maxim 2) of the proc import code I ran, you can see that the data step created by proc import already uses the dsd option (to treat successive delimiters as missing values).

If you have specific problems with certain values, take the data step from the log and modify it (there's a lot of things you can optimize right from the start). Once you have code that fixes one column, fixing a lot of columns is "just" some macro-coding away.

Utabikunda
Calcite | Level 5

Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.

Tom
Super User Tom
Super User

@Utabikunda wrote:

Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.


SAS will remove quotes around values. Having quotes around text that only contains digits will not confuse PROC IMPORT.  I modified your example to add some quotes and PROC IMPORT still read the file fine.  You need to dig deeper to figure out what exactly is causing your problem.  Perhaps the quotes are not quotes, but are those "pretty" quotes that typesetters use?

ISOUTSIDE,EATDRYFOOD,TYPE,OTHERTYPE,ISFRIENDLY
1,1,dog,,.
1,'.',dog,,.
1,.,cat,,"."
0,.,dog,,.
0,.,,,.
0,.,,My pet's name is Jeffrey,.
.,.,unknown,,.
.,1,dog,,.
.,1,dog,,1
1,0,cat,,0
1,0,cat,,1
1,0,elf,,1
Tom
Super User Tom
Super User

One issue that can confuse SAS is when you have a pair of values where one starts with a quote and the other ends with a quote. That will look like a single value that contains commas but is enclosed in quotes and cause the columns to get shifted.

Consider this file with 6 fields.

A,B,C,X,Y.D
S1,S2,S3,1,2,Yes
'twas,a,goin',3,4,No

That last row will look to SAS like it has only 4 values. So the text, No, that should be the value of D is read as the value of X and will either fail or if using PROC IMPORT cause it to guess that X should be a character variable.

Kurt_Bremser
Super User

@Utabikunda wrote:

Well, then the DSD option must not functioning properly, as it does not remove the quotation marks surrounding values in the csv file to read in. I also don't see how macro coding could solve the problem, if SAS can't guess the variable type, short of specifying the names of each character variable and each numeric variable, which I'm trying to avoid doing for 1,000+ variables. I was able to very easily handle this issue using read.csv() function in R, and make another dataset to read into SAS. I hope the DSD option is improved in future versions.


Don't be silly. The DSD option is such a basic thing that tens of thousands of SAS users around the world would be screaming at SAS if it didn't work.

Please post some REAL example input data you have, so we can take a look at the issue without having to guess.

ayaz1
Calcite | Level 5

I'll happily sign up for the group of screamers. Your answer provides no help

Tom
Super User Tom
Super User

@ayaz1 wrote:

I'll happily sign up for the group of screamers. Your answer provides no help


If you are having trouble with a specific delimited text file start a new thread and provide the details there.

Tom
Super User Tom
Super User

Normal input will read a period is missing. SAS is able to read that file as it is without problems. PROC IMPORT even works. 

proc import datafile=test out=test replace 
  dbms=csv;
run;

proc contents data=test varnum; run;
The CONTENTS Procedure

             Variables in Creation Order

#    Variable      Type    Len    Format     Informat

1    ISOUTSIDE     Num       8    BEST12.    BEST32.
2    EATDRYFOOD    Num       8    BEST12.    BEST32.
3    TYPE          Char      7    $7.        $7.
4    OTHERTYPE     Char     24    $24.       $24.
5    ISFRIENDLY    Num       8    BEST12.    BEST32.

Your problem of PROC IMPORT converting numeric to text is caused by something else.  Do some of the numeric values have text strings? Like the NA in your message?

 

A normal CSV file will indicate a missing value be not putting any characters between the commas.  Can you have who ever created the file create it using that standard convention?

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 16784 views
  • 0 likes
  • 5 in conversation