How to import data containing commas that are enclosed in parentheses?

MaryA_Marion · Posted 09-30-2019 03:59 PM

I am using SAS Studio. Code is not reading data with commas enclosed by "s. I tried two ways 1)format 2)informat.

I am reading a csv file with long column headings which I changed to ABCDEFGHIJK. I will deal with the long variable names next but first how should I read this in?

A=A B=. ....K=. Error message says I have invalid data for B--K.

How to do? Please see details below. Thank you.

Mary A. Marion

Data (first three lines):

A,B,C,D,E,F,G,H,I,J,K
Albany County,"294,565","289,565","245,060","32,624",605,"8,090",84,"3,102","5,000","9,079"
Allegany County,"49,927","49,487","48,444",361,139,358,2,183,440,454 ...

data ny;
infile '/folders/myshortcuts/611/data/Week 03/HWdata/nypopulation.mm.csv'  DSD;
input A B C D E F G H I J K;
format A $30. B comma8. C comma8. D comma8. E comma8. 
       F comma8. G comma8. H comma8. I comma8. 
       J comma8. K comma8.;
run;

data ny;
informat A $30. B comma8. C comma8. D comma8. E comma8. 
       F comma8. G comma8. H comma8. I comma8. 
       J comma8. K comma8.;
infile '/folders/myshortcuts/611/data/Week 03/HWdata/nypopulation.mm.csv'  DSD;
input A B C D E F G H I J K;
run;

Reeza · Posted 09-30-2019 04:02 PM

Can you show the actual log with one or two lines of the details?

MaryA_Marion · Posted 09-30-2019 04:19 PM

I added format statements as well as informat statements on my latest try. Please see log below. Thank you. MM

73 data ny;

74 informat A $30. B comma8. C comma8. D comma8. E comma8.

75 F comma8. G comma8. H comma8. I comma8.

76 J comma8. K comma8.;

77 infile '/folders/myshortcuts/611/data/Week

77 ! 03/HWdata/nypopulation.mm.csv' DSD;

78 input A B C D E F G H I J K;

79 format A $30. B comma8. C comma8. D comma8. E comma8.

80 F comma8. G comma8. H comma8. I comma8.

81 J comma8. K comma8.;

82 run;

NOTE: The infile '/folders/myshortcuts/611/data/Week

03/HWdata/nypopulation.mm.csv' is:

Filename=/folders/myshortcuts/611/data/Week

03/HWdata/nypopulation.mm.csv,

Owner Name=root,Group Name=vboxsf,

Access Permission=-rwxrwx---,

Last Modified=30Sep2019:14:54:11,

File Size (bytes)=5312

NOTE: Invalid data for B in line 1 3-3.

NOTE: Invalid data for C in line 1 5-5.

NOTE: Invalid data for D in line 1 7-7.

NOTE: Invalid data for E in line 1 9-9.

NOTE: Invalid data for F in line 1 11-11.

NOTE: Invalid data for G in line 1 13-13.

NOTE: Invalid data for H in line 1 15-15.

NOTE: Invalid data for I in line 1 17-17.

NOTE: Invalid data for J in line 1 19-19.

NOTE: Invalid data for K in line 1 21-21.

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----

1 A,B,C,D,E,F,G,H,I,J,K 21

A=A B=. C=. D=. E=. F=. G=. H=. I=. J=. K=. _ERROR_=1 _N_=1

NOTE: 63 records were read from the infile '/folders/myshortcuts/611/data/Week

03/HWdata/nypopulation.mm.csv'.

The minimum record length was 21.

The maximum record length was 115.

NOTE: The data set WORK.NY has 63 observations and 11 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.02 seconds

MaryA_Marion · Posted 09-30-2019 04:21 PM

NOTE: Invalid data for K in line 1 21-21.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----
1 A,B,C,D,E,F,G,H,I,J,K 21
A=A B=. C=. D=. E=. F=. G=. H=. I=. J=. K=. _ERROR_=1 _N_=1
NOTE: 63 records were read from the infile '/folders/myshortcuts/611/data/Week

Reeza · Posted 09-30-2019 04:25 PM

Do you get one error or multiples? If just one then you need to add FIRSTOBS=2 to your INFILE statement to tell SAS to not read the first line of data since its the variable names.

MaryA_Marion · Posted 10-01-2019 08:49 AM

Hello,

I successfully read in the data as A--K but since I have to merge this data set with 3 other datasets with different variables ( not B--K), I am not sure this is the best way to go. How to read in this data using long file names as opposed to using letters? I am using label statements. If I use this technique for other files there will be multiple B's ..K's whose contents differ. Current working code is given below.

Thank you. MM

data nypop ;
informat A $30. B C D E F G H I J K comma10.;
infile '/folders/myshortcuts/611/data/Week 03/HWdata/nypopulation.mm.csv' DSD firstobs=2;
input A B C D E F G H I J K ;
format A $30. B C D E F G H I J K comma10.;
label A='Geographical Area'
B='Total Population'
C='Total One Race'
D='White'
E='Black or African American'
F='American Indian and Alaska Native'
G='Asian'	
H='Native Hawaiian and Other Pacific Islander'
I='Some other race'
J='Two or more Races'	
K='Hispanic or Latino(of any race)';
run;
proc print data=nypop label; run;

novinosrin · Posted 09-30-2019 05:09 PM

Hi @MaryA_Marion The problem is simple as we request your attention to the LOG:

1. You are trying to read variable A as character, so no problem

2. You are trying to read from B--K as numeric with an informat commaw.

3. So when the informat is applied on the char record i.e firstobs=1 (in other word when firstobs is read, SAS is unable to apply the informat commaw. to convert char to num. Therefore it triggers _ERROR_=1 for _N_=1.. @Reeza pointed this too

4. Your 2nd record should have been fine. Did you check your output?

So

infile FIRSTOBS=2;
informat A $30. B C D E F G H I J K comma10.;
input A B C D E F G H I J K;
format A $30. B C C E F G H I J K comma8.;
run;

should do

MaryA_Marion · Posted 10-01-2019 01:45 AM

hi,

The following code ran:

data ny;
informat A $30. B C D E F G H I J K comma10.;
infile '/folders/myshortcuts/611/data/Week 03/HWdata/nypopulation.mm.csv' DSD firstobs=2;
input A B C D E F G H I J K ;
format A $30. B C D E F G H I J K comma10.;
run;

Thank you. I will be deleting this post tomorrow. MM

novinosrin · Posted 10-01-2019 08:52 AM

We are glad. All the best!

Reeza · Posted 10-01-2019 10:50 AM

@MaryA_Marion wrote:

Thank you. I will be deleting this post tomorrow. MM

Please don't delete your question. The point of the forum is to save posts and such for future users to search when they have the same problem. Deleting your questions solves your problem but doesn't contribute to a body of knowledge for future users.

MaryA_Marion · Posted 10-01-2019 07:50 PM

I am in a classroom and don't want to affect my professor's class. I need
the data sent to SAS communities to be deleted. Can you help with that? I
have great books on SAS but sometimes the syntax eludes me.

Mary A. Marion

Reeza · Posted 10-02-2019 11:11 AM

There's no data set attached that I see. If you're referring to the data in your posts you can click each post at the top right and delete what you need.

In the future I'd highly recommend not posting actual data and adding fake records instead of working with your real data.

@MaryA_Marion wrote:
I am in a classroom and don't want to affect my professor's class. I need
the data sent to SAS communities to be deleted. Can you help with that? I
have great books on SAS but sometimes the syntax eludes me.

Mary A. Marion

mkeintz · Posted 10-02-2019 03:16 PM

The problem with the data your present is that some of the CSV numeric fields (apparently those valued at 1,000 or more) have embedded commas and are in quotes. So, except for the first field with county names, you have to read the subsequent fields one segment (i.e. on comma-separated set of characters) at a time. If the segment has a leading quote, then you have to concatenate further segments to it (stored in character variable COMPLETE) until you encounter a segment with a trailing quote. Then just convert that complete value to a number. If the segment has no leading quote then you have a complete value that can be converted to a number.

To enable the approach above you need to parse the automatic variable _INFILE_, which has a copy of the raw data line in it. That way you can look at one segment at a time, and dynamically change how to read in each value:

data ny (drop=_:);
infile datalines dlm=',';
input A :$30. @;
array var {*} b c d e f g h i j k;
length _complete $12. ;

  do v=1 to dim(var);
    call missing(_complete);
    input _segment :$5. @;
    if (_segment=:'"') then do until(indexc(_segment,'"'));
      _complete=cats(_complete,_segment);
      input _segment :$5. @;
    end;
    _complete=cats(_complete,_segment);
	_complete=translate(_complete,'','"');
    var{v}=input(_complete,best12.);
  end;
  format b--k  comma12.0;
datalines;
Albany County,	"294,565",	"289,565",	"245,060",	"32,624",	605,	"8,090",	84,	"3,102",	"5,000",	"9,079"
Allegany County,"49,927",	"49,487",	"48,444",	361,		139,	358,		2,	183,		440,		454
run;

The "do until indexc(_segment,'"') is a loop that will continue until a newly-read _segment has its own quote (i.e. the trailing quote). BTW, this will accomodate quoted numbers with more than 2 segments (i.e. like "123,436,789")

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 10-02-2019 03:32 PM

Huh? The quotes are the solution to the embedded delimiter problem. Using the DSD option handles that for you. To handle the numeric values with commas use the COMMA informat.

data test;
 infile cards dsd truncover ;
 length name $20 var1-var10 8;
 informat var1-var10 comma.;
 input name var1-var10;
cards;
Albany County,	"294,565",	"289,565",	"245,060",	"32,624",	605,	"8,090",	84,	"3,102",	"5,000",	"9,079"
Allegany County,"49,927",	"49,487",	"48,444",	361,		139,	358,		2,	183,		440,		454
;

proc print;
run;

Obs        name          var1     var2     var3     var4   var5   var6   var7   var8   var9   var10

 1    Albany County     294565   289565   245060   32624    605   8090    84    3102   5000    9079
 2    Allegany County    49927    49487    48444     361    139    358     2     183    440     454

mkeintz · Posted 10-02-2019 03:54 PM

Cool. I didn't realize that.

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in quotes

Re: How to import data containing commas that are enclosed in quotes

Re: How to import data containing commas that are enclosed in quotes

Re: How to import data containing commas that are enclosed in quotes

Re: How to import data containing commas that are enclosed in quotes

Re: How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in parentheses?

Re: How to import data containing commas that are enclosed in parentheses?

Registration is open

SAS Training: Just a Click Away