JohnHoughton
Quartz | Level 8

Hi. As an alternative, you could start by reading all fields as character strings and then find a way to clean that up.

Tom
Super User

Try a little divide and conquer.

Read the records with the right number of commas and dump the others into a new file.

Then figure out how to read or fix them, and then read them.

 

So something like this:

filename badrows 'R:/Li/PATSTAT/Applications_bad.csv' ;

data SASDATA.Applications ;
infile 'R:/Li/PATSTAT/Applications.csv' DLM = ',' DSD missover lrecl=32767 firstobs = 3 ;
input @ ;
if countw(_infile_,',','m') ne 26 then do ;
  file badrows ;
  put _infile_;
  delete;
end;

input ....
Tom
Super User

Here is an example program that could be used to "fix" a CSV file where only one column could have extra commas.

* 'Fix' a CSV file that is missing quotes around a specific variable *;
%let outfile=fixed;
%let infile=sample;
%let nvars=26 ;
%let fixcol=8;
data _null_;
  infile &infile dsd truncover ;
  file &outfile dsd ;
  length x1-x&nvars next $200. ;
  input x1-x&fixcol. @;
  do i=1 to countw(_infile_,',','m')-&nvars ;
    input next @;
    x&fixcol=cats(x&fixcol,',',next); 
  end;
  input x%eval(&fixcol+1)-x&nvars;
  put x1-x&nvars;
run;

Basically, you tell it what file to read and where to write the fixed file. You tell it how many columns there should be and which column to assume all of the extra commas should be placed into.

It then reads the line into variables with a loop to read any extra values into that specific column and writes it back out.

Let's try it on your two example rows. Let's make a sample data file.

filename sample temp;
data _null_;
  file sample;
  put
 '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26'
/'3574782,BR,7600215,A ,1976-04-08,1976,BR19767600215,760215,,PI,0,N,N,Y,1975-04-14,1975,41149045,1976-10-05,1976,312739797,0,19727905,327504,29,56,1,2'
/'3897835,CA,182,A ,1967-09-15,1967,CA19670000182,000,182,PI,0,N,N,Y,1966-09-16,1966,22485460,1975-03-18,1975,315362359,1,10420177,1874780,14,0,1,1'
;
run;

Then run the program above.

And then look at the result.

data _null_;
  infile &outfile ;
  input ;
  list;
run;

NOTE: The infile FIXED is:
      Filename=xxx\#LN00054,
      RECFM=V,LRECL=32767,File Size (bytes)=370,
      Last Modified=22Mar2018:00:41:48,
      Create Time=22Mar2018:00:41:48

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26 68
2         3574782,BR,7600215,A,1976-04-08,1976,BR19767600215,"760215,",PI,0,N,N,Y,1975-04-14,1975,41149045,197
     101  6-10-05,1976,312739797,0,19727905,327504,29,56,1,2 150
3         3897835,CA,182,A,1967-09-15,1967,CA19670000182,"000,182",PI,0,N,N,Y,1966-09-16,1966,22485460,1975-03
     101  -18,1975,315362359,1,10420177,1874780,14,0,1,1 146
NOTE: 3 records were read from the infile FIXED.
      The minimum record length was 68.
      The maximum record length was 150.
France
Quartz | Level 8

Besides, could you explain the meaning of this code for me, please? Where should I write the data name, the path of the infile, and the variables?

* 'Fix' a CSV file that is missing quotes around a specific variable *;
%let outfile=fixed;
%let infile=sample;
%let nvars=26 ;
%let fixcol=8;
data _null_;
  infile &infile dsd truncover ;
  file &outfile dsd ;
  length x1-x&nvars next $200. ;
  input x1-x&fixcol. @;
  do i=1 to countw(_infile_,',','m')-&nvars ;
    input next @;
    x&fixcol=cats(x&fixcol,',',next); 
  end;
  input x%eval(&fixcol+1)-x&nvars;
  put x1-x&nvars;
run;

thanks in advance

 

Tom
Super User

@France wrote:

Besides, could you explain the meaning of this code for me, please? Where should I write the data name, the path of the infile, and the variables?

* 'Fix' a CSV file that is missing quotes around a specific variable *;
%let outfile=fixed;
%let infile=sample;
%let nvars=26 ;
%let fixcol=8;
data _null_;
  infile &infile dsd truncover ;
  file &outfile dsd ;
  length x1-x&nvars next $200. ;
  input x1-x&fixcol. @;
  do i=1 to countw(_infile_,',','m')-&nvars ;
    input next @;
    x&fixcol=cats(x&fixcol,',',next); 
  end;
  input x%eval(&fixcol+1)-x&nvars;
  put x1-x&nvars;
run;

thanks in advance

 


The easiest way to understand code is to take on the role of the compiler and the data step and see what the statements do.  Since this is using macro variables, make sure to first resolve the macro variables and macro functions, and then evaluate the resulting code.  The %LETs just set values.  Basically, you should be able to reuse the code and only change the values assigned to the macro variables in the %LET statements.
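For example, resolving the macro variables from the %LET statements (NVARS=26, FIXCOL=8, and the SAMPLE and FIXED filerefs) by hand, the step becomes ordinary SAS code:

```sas
* The fix-up step after macro resolution, with &nvars=26 and &fixcol=8 *;
data _null_;
  infile sample dsd truncover ;
  file fixed dsd ;
  length x1-x26 next $200. ;
  input x1-x8 @;
  do i=1 to countw(_infile_,',','m')-26 ;
    input next @;
    x8=cats(x8,',',next);
  end;
  input x9-x26;
  put x1-x26;
run;
```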

 

The first two macro variables define the input and output files. In this example I am using filerefs that were previously defined by FILENAME statements. You could also just use physical filenames, but you would then need to include quotes around the names so that when expanded they make sense in the INFILE and FILE statements.

 

The next macro variable defines how many fields SHOULD be in each line. For your file it is 26.

The next macro variable defines which field will "get" all of the extra commas.  For your two example lines it was number 8.

 

As I said before, the data step just reads the line into separate character variables and then writes it back out.  Let's look at some of the details.

 

data _null_;
  infile &infile dsd truncover ;
  file &outfile dsd ;
  length x1-x&nvars next $200. ;

We are going to use a DATA _NULL_ step since we are not creating any SAS dataset, just another text file.  The INFILE and FILE statements define what we want to read and write.  If you have really long lines (greater than 32K) then you can add LRECL= options to both of them.  The LENGTH statement defines the variables we are going to use: one for each field and one extra one we will use to read the extra values.  I have set them all to a max of 200 characters. If any of the fields in your data are longer than that, then change the number.  Notice that we use the macro variable NVARS to set the upper bound on the variable list.

  input x1-x&fixcol. @;

So first we read up to the column where the extra commas might appear. The trailing @ on the INPUT statement holds the line so we can use another INPUT statement to read more later.  Again, notice the use of the macro variable FIXCOL to define the upper bound on the variable list.

  do i=1 to countw(_infile_,',','m')-&nvars ;
    input next @;
    x&fixcol=cats(x&fixcol,',',next); 
  end;

This is the part that deals with the extra commas.  If the number of fields in the line (as counted by the COUNTW() function) is less than or equal to the number expected, then this loop never runs.  Otherwise it runs once for each extra comma, so if there are 2 extra commas it will run twice.  Each time, it reads the next word into NEXT and appends it, along with a comma, to the value of the target field.

  input x%eval(&fixcol+1)-x&nvars;
  put x1-x&nvars;
run;

Now all that is left to do is read the rest of the columns and write the values back out to the new file.

 

 

France
Quartz | Level 8

Thanks so much for your answer. It is a really detailed explanation, but I still have several questions.

The LENGTH statement defines the variables we are going to use. One for each field and one extra one we will use to read the extra values.  I have set them all to max of 200 characters. 

You defined the length of all variables as a max of 200 characters. Will that change the data type of the raw data? What should I do if I want to define the length and data type of each variable?

 

Besides, can I use this code to separately fix each variable by changing the number 8?

%let fixcol=8;

 

 

 

Tom
Super User

This DATA _NULL_ step is NOT creating a SAS dataset; it is just writing a new text file. That is why all of the fields are being read as character.  When you later read the fixed text file, you will make the decision about what type of data the text in each field represents.
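As a sketch of that second pass (the variable names, lengths, and informats here are illustrative, not the full 26-field layout of your file):

```sas
* Sketch only: read the fixed CSV, choosing a type and length per field *;
data work.applications_fixed;
  infile fixed dsd dlm=',' truncover lrecl=32767;
  length appln_auth $2 appln_nr_original $150;
  input appln_id            /* numeric */
        appln_auth $        /* 2-character country code */
        appln_nr_original $ /* the field that held the extra commas */
        /* ... remaining fields ... */ ;
run;
```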

 

This algorithm only works if all of the extra commas appear in the same column.  Basically, it assumes that all extra commas belong to the same raw value.

 

If you can use some other logic to split the source file into separate files where you know all of the extra commas are in only one field (for example, one file where the extra commas are in field 8 and another where they are in field 10), then you can process them separately with this program.

France
Quartz | Level 8

Many thanks for your help. I am running the dataset through your code.

 



 

France
Quartz | Level 8

I have obtained the result from running your code. Except for some rows at the bottom, all of them are recorded in the same column. Do you have any advice about it, please?

Tom
Super User

It depends on the volume, and on whether there is any way to systematically tell which column should have the embedded commas.

 

If the number of records is small enough, it is probably best to just create a fixed version of the messed-up records by hand. You can use any text editor, such as NOTEPAD on Windows, or even the SAS program editor, to fix the file. Basically, you need to add quotes around the values that contain the embedded comma(s).

 

Then you can read the fixed version and combine the two datasets.

 

If the number of records is still too large for that, then look for more patterns that can help.  Perhaps there is a value like a DATE or a constant that you can check to see which column it appears in, and then deduce from the location of that value where the extra commas are.  For example, if the date is supposed to be in the 10th field and you are seeing it in the 11th field, then one of the first 9 fields has an extra comma.
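One hedged sketch of that idea, assuming the filing date is a yyyy-mm-dd value that should land in field 5 (as in your layout), and using the BADROWS fileref from the earlier step:

```sas
* Sketch: find which field a yyyy-mm-dd date landed in, to deduce where the extra comma is *;
data _null_;
  infile badrows dsd truncover lrecl=32767;
  input @;
  datepos = 0;
  do i = 1 to countw(_infile_, ',', 'm') while (datepos = 0);
    if prxmatch('/^\d{4}-\d{2}-\d{2}$/', strip(scan(_infile_, i, ',', 'm'))) then datepos = i;
  end;
  * if datepos is 6 instead of 5, one of the first 4 fields holds an extra comma *;
  put _n_= datepos=;
run;
```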

 

France
Quartz | Level 8

Dear Tom, thanks for your answer. I have just received a reply from the supplier of the dataset. They said: "because your loading procedure did not take into account that text embedded in double quotes should be considered as 1 data item." Therefore, I deleted DSD from the INFILE statement, which now looks like the following:

data SASDATA.Applications ;
infile 'R:/Li/PATSTAT/Applications.csv' DLM = ',' missover lrecl=32767 firstobs = 3 ;
input
  appln_id :29.
  appln_auth :$29.
  appln_nr :$29.
  appln_kind :$29.
  appln_filing_date :YYMMDD10.
  appln_filing_year
  appln_nr_epodoc :$50.
  appln_nr_original :$150.
  ipr_type :$29.
  internat_appln_id :29.
  int_phase :$29.
  reg_phase :$29.
  nat_phase :$29.
  earliest_filing_date :YYMMDD10.
  earliest_filing_year
  earliest_filing_id :29.
  earliest_publn_date :YYMMDD10.
  earliest_publn_year
  earliest_pat_publn_id :29.
  granted :29.
  docdb_family_id :29.
  inpadoc_family_id :29.
  docdb_family_size :29.
  nb_citing_docdb_fam :29.
  nb_applicants :29.
  nb_inventors :29.
;
format
  appln_filing_date YYMMDDd10.
  appln_filing_year YEAR10.
  earliest_filing_date YYMMDDd10.
  earliest_filing_year YEAR10.
  earliest_publn_date YYMMDDd10.
  earliest_publn_year YEAR10.
;
run ;

and got the following result:

NOTE: Invalid data for internat_appln_id in line 5643 64-64.
NOTE: Invalid data for earliest_filing_date in line 5643 81-84.
NOTE: Invalid data for earliest_filing_id in line 5643 95-104.
NOTE: Invalid data for earliest_publn_date in line 5643 106-109.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9
5643 7402,EP,02795370,A ,2002-11-06,2002,EP20020795370,,PI,19037782,Y,Y,N,2002-01-24,2002,19032
91 584,2004-10-27,2004,287340067,0,27614728,4870,12,16,1,1 145
appln_id=7402 appln_auth=EP appln_nr=02795370 appln_kind=A appln_filing_date=2002-11-06
appln_filing_year=1965 appln_nr_epodoc=EP20020795370 appln_nr_original=PI ipr_type=19037782
internat_appln_id=. int_phase=Y reg_phase=N nat_phase=2002-01-24 earliest_filing_date=.
earliest_filing_year=********** earliest_filing_id=. earliest_publn_date=.
earliest_publn_year=********** earliest_pat_publn_id=0 granted=27614728 docdb_family_id=4870
inpadoc_family_id=12 docdb_family_size=16 nb_citing_docdb_fam=1 nb_applicants=1 nb_inventors=.
_ERROR_=1 _N_=5641

So, do you know what code I could use that treats two consecutive delimiters as a missing value but does not remove quotation marks from character values?

Tom
Super User

This response you received

 "because your loading procedure did not take into account that text embedded in double quotes should be considered as 1 data item."

is totally wrong and backwards.  SAS does consider values in quotes to represent a single value.

 

The problem is that the data file you are trying to load does NOT have the quotes.  

France
Quartz | Level 8

Is it because I used DSD, which removes quotation marks from character values?

Reeza
Super User

No, the text file does not have the qualifiers.

Tom
Super User

The DSD option will remove quotes from values that are totally enclosed in quotes. That is why, when you are writing the file, you need to add quotes when the value includes either quotes or delimiters.

 

So if the values are X=1, Y='a,b' and Z=2 then the line SHOULD look like this.

1,"a,b",2

And if the value was actually X=1, Y='"yes"' and Z=2, then the line SHOULD look like this.

1,"""yes""",2

So the outer quotes are there to protect the special characters (quote characters in this case). The embedded quotes need to be doubled so that you know they don't indicate the closing of the quotes around the value.
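You can see SAS apply this rule itself: with DSD on the FILE statement, PUT automatically quotes any value that contains the delimiter. A minimal demonstration:

```sas
* With DSD, PUT quotes values that contain the delimiter *;
data _null_;
  file log dsd;   /* write to the log so the result is easy to see */
  x='1'; y='a,b'; z='2';
  put x y z;      /* writes: 1,"a,b",2 */
run;
```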

 

So if you wrote a file by just dumping the values and adding delimiters between them (for example, you wrote it in SAS without using the DSD option), AND one of the values actually began and ended with quotes, AND you read it with the DSD option, then the quotes would not be read as part of the data.

 

You can force SAS to NOT remove the quotes by using the ~ modifier in the INPUT statement.

 

data test ;
  length x y z $10 ;
  infile cards dlm=',' dsd truncover ;
  input x y ~ z ;
  put (_all_) (=) ;
  put (_all_) (=:$quote.);
cards;
1,"2",3
;

x=1 y="2" z=3
x="1" y="""2""" z="3"
NOTE: The data set WORK.TEST has 1 observations and 3 variables.
