Solved: Re: Skipping Invalid Lines

Junyong · Posted 10-15-2020 03:22 PM

This MWE produces too many invalid data notes due to some occasional nulls as follows.

data irx;
	infile 'https://query1.finance.yahoo.com/v7/finance/download/^irx
?period1=-9999999999&period2=9999999999' url firstobs=2 dsd truncover;
	input date yymmdd10. +1 open hi lo close adj vol;
	if adj>. then output;
run;

and the file is

Date,Open,High,Low,Close,Adj Close,Volume
1960-01-04,4.520000,4.520000,4.520000,4.520000,4.520000,0
1960-01-05,4.550000,4.550000,4.550000,4.550000,4.550000,0
1960-01-06,4.680000,4.680000,4.680000,4.680000,4.680000,0
1960-01-07,4.630000,4.630000,4.630000,4.630000,4.630000,0
1960-01-08,4.590000,4.590000,4.590000,4.590000,4.590000,0
1960-01-10,null,null,null,null,null,null
1960-01-11,4.540000,4.540000,4.540000,4.540000,4.540000,0
1960-01-12,4.540000,4.540000,4.540000,4.540000,4.540000,0
1960-01-13,4.560000,4.560000,4.560000,4.560000,4.560000,0
1960-01-14,4.510000,4.510000,4.510000,4.510000,4.510000,0
1960-01-15,4.490000,4.490000,4.490000,4.490000,4.490000,0
1960-01-17,null,null,null,null,null,null
1960-01-18,4.370000,4.370000,4.370000,4.370000,4.370000,0
1960-01-19,4.310000,4.310000,4.310000,4.310000,4.310000,0
1960-01-20,4.300000,4.300000,4.300000,4.300000,4.300000,0
1960-01-21,4.270000,4.270000,4.270000,4.270000,4.270000,0
1960-01-22,4.120000,4.120000,4.120000,4.120000,4.120000,0
1960-01-24,null,null,null,null,null,null
1960-01-25,4.120000,4.120000,4.120000,4.120000,4.120000,0
1960-01-26,4.040000,4.040000,4.040000,4.040000,4.040000,0
1960-01-27,3.990000,3.990000,3.990000,3.990000,3.990000,0
1960-01-28,3.920000,3.920000,3.920000,3.920000,3.920000,0
1960-01-29,3.990000,3.990000,3.990000,3.990000,3.990000,0
1960-01-31,null,null,null,null,null,null

Can I skip these unnecessary observations, instead of reading all and delete them?

novinosrin · Posted 10-15-2020 03:44 PM

Ah I see what you mean. How about-

data have;
 input @;
 if index(_infile_,'null')=0;
 input var   :mmddyy10.;
  cards;
 null
 06/21/1982
 07/21/1982
 null
 ;

The above doesn't read records from buffer to PDV. Would that suffice?

Testing your sample-


data test;
infile cards dsd firstobs=2;
input @;
if index(_infile_,'null')=0;
input date yymmdd10. +1 open hi lo close adj vol;
cards;
Date,Open,High,Low,Close,Adj Close,Volume
1960-01-04,4.520000,4.520000,4.520000,4.520000,4.520000,0
1960-01-05,4.550000,4.550000,4.550000,4.550000,4.550000,0
1960-01-06,4.680000,4.680000,4.680000,4.680000,4.680000,0
1960-01-07,4.630000,4.630000,4.630000,4.630000,4.630000,0
1960-01-08,4.590000,4.590000,4.590000,4.590000,4.590000,0
1960-01-10,null,null,null,null,null,null
1960-01-11,4.540000,4.540000,4.540000,4.540000,4.540000,0
1960-01-12,4.540000,4.540000,4.540000,4.540000,4.540000,0
1960-01-13,4.560000,4.560000,4.560000,4.560000,4.560000,0
1960-01-14,4.510000,4.510000,4.510000,4.510000,4.510000,0
1960-01-15,4.490000,4.490000,4.490000,4.490000,4.490000,0
1960-01-17,null,null,null,null,null,null
1960-01-18,4.370000,4.370000,4.370000,4.370000,4.370000,0
1960-01-19,4.310000,4.310000,4.310000,4.310000,4.310000,0
1960-01-20,4.300000,4.300000,4.300000,4.300000,4.300000,0
1960-01-21,4.270000,4.270000,4.270000,4.270000,4.270000,0
1960-01-22,4.120000,4.120000,4.120000,4.120000,4.120000,0
1960-01-24,null,null,null,null,null,null
1960-01-25,4.120000,4.120000,4.120000,4.120000,4.120000,0
1960-01-26,4.040000,4.040000,4.040000,4.040000,4.040000,0
1960-01-27,3.990000,3.990000,3.990000,3.990000,3.990000,0
1960-01-28,3.920000,3.920000,3.920000,3.920000,3.920000,0
1960-01-29,3.990000,3.990000,3.990000,3.990000,3.990000,0
1960-01-31,null,null,null,null,null,null
;

View solution in original post

novinosrin · Posted 10-15-2020 03:28 PM

if _error_=0 and adj>.; /*please try*/

Junyong · Posted 10-15-2020 03:41 PM

This code also reads all the observations first and then removes the missing observations. Can I not read and completely skip the lines containing the nulls? For example, I tried the following but failed.

data irx;
	infile 'https://query1.finance.yahoo.com/v7/finance/download/^irx
?period1=-9999999999&period2=9999999999' url firstobs=2 dsd truncover;
	if find(_infile_,"null")=0 then input date yymmdd10. +1 open hi lo close adj vol;
run;

novinosrin · Posted 10-15-2020 03:44 PM

Ah I see what you mean. How about-

data have;
 input @;
 if index(_infile_,'null')=0;
 input var   :mmddyy10.;
  cards;
 null
 06/21/1982
 07/21/1982
 null
 ;

The above doesn't read records from buffer to PDV. Would that suffice?

Testing your sample-


data test;
infile cards dsd firstobs=2;
input @;
if index(_infile_,'null')=0;
input date yymmdd10. +1 open hi lo close adj vol;
cards;
Date,Open,High,Low,Close,Adj Close,Volume
1960-01-04,4.520000,4.520000,4.520000,4.520000,4.520000,0
1960-01-05,4.550000,4.550000,4.550000,4.550000,4.550000,0
1960-01-06,4.680000,4.680000,4.680000,4.680000,4.680000,0
1960-01-07,4.630000,4.630000,4.630000,4.630000,4.630000,0
1960-01-08,4.590000,4.590000,4.590000,4.590000,4.590000,0
1960-01-10,null,null,null,null,null,null
1960-01-11,4.540000,4.540000,4.540000,4.540000,4.540000,0
1960-01-12,4.540000,4.540000,4.540000,4.540000,4.540000,0
1960-01-13,4.560000,4.560000,4.560000,4.560000,4.560000,0
1960-01-14,4.510000,4.510000,4.510000,4.510000,4.510000,0
1960-01-15,4.490000,4.490000,4.490000,4.490000,4.490000,0
1960-01-17,null,null,null,null,null,null
1960-01-18,4.370000,4.370000,4.370000,4.370000,4.370000,0
1960-01-19,4.310000,4.310000,4.310000,4.310000,4.310000,0
1960-01-20,4.300000,4.300000,4.300000,4.300000,4.300000,0
1960-01-21,4.270000,4.270000,4.270000,4.270000,4.270000,0
1960-01-22,4.120000,4.120000,4.120000,4.120000,4.120000,0
1960-01-24,null,null,null,null,null,null
1960-01-25,4.120000,4.120000,4.120000,4.120000,4.120000,0
1960-01-26,4.040000,4.040000,4.040000,4.040000,4.040000,0
1960-01-27,3.990000,3.990000,3.990000,3.990000,3.990000,0
1960-01-28,3.920000,3.920000,3.920000,3.920000,3.920000,0
1960-01-29,3.990000,3.990000,3.990000,3.990000,3.990000,0
1960-01-31,null,null,null,null,null,null
;

novinosrin · Posted 10-15-2020 03:56 PM

Try the "subsetting if" rather than "If Then"

979  data test;
980  infile cards dsd firstobs=2;
981  input @;
982  if index(_infile_,'null')=0;
983  input date yymmdd10. +1 open hi lo close adj vol;
984  cards;

NOTE: The data set WORK.TEST has 20 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Junyong · Posted 10-15-2020 04:06 PM

Subsetting first was the exact answer. Thanks.

CurtisMackWSIPP · Posted 10-15-2020 05:03 PM

Keep in mind that that solution will drop the record if any of the values are null, not just the "adj" variable.

CurtisMackWSIPP · Posted 10-15-2020 03:56 PM

I like using custom informat to handle these conditions.

proc format;
  invalue myfmt "null" = .
                default = 10.5;
run;

data want;
  infile datalines firstobs=2 dsd truncover;
  informat open hi lo close adj vol myfmt.;
  input date yymmdd10. +1 open hi lo close adj vol;
	if adj>. then output;
datalines;
Date,Open,High,Low,Close,Adj Close,Volume
1960-01-04,4.520000,4.520000,4.520000,4.520000,4.520000,0
1960-01-05,4.550000,4.550000,4.550000,4.550000,4.550000,0
1960-01-06,4.680000,4.680000,4.680000,4.680000,4.680000,0
1960-01-07,4.630000,4.630000,4.630000,4.630000,4.630000,0
1960-01-08,4.590000,4.590000,4.590000,4.590000,4.590000,0
1960-01-10,null,null,null,null,null,null
1960-01-11,4.540000,4.540000,4.540000,4.540000,4.540000,0
1960-01-12,4.540000,4.540000,4.540000,4.540000,4.540000,0
1960-01-13,4.560000,4.560000,4.560000,4.560000,4.560000,0
1960-01-14,4.510000,4.510000,4.510000,4.510000,4.510000,0
1960-01-15,4.490000,4.490000,4.490000,4.490000,4.490000,0
1960-01-17,null,null,null,null,null,null
1960-01-18,4.370000,4.370000,4.370000,4.370000,4.370000,0
1960-01-19,4.310000,4.310000,4.310000,4.310000,4.310000,0
1960-01-20,4.300000,4.300000,4.300000,4.300000,4.300000,0
1960-01-21,4.270000,4.270000,4.270000,4.270000,4.270000,0
1960-01-22,4.120000,4.120000,4.120000,4.120000,4.120000,0
1960-01-24,null,null,null,null,null,null
1960-01-25,4.120000,4.120000,4.120000,4.120000,4.120000,0
1960-01-26,4.040000,4.040000,4.040000,4.040000,4.040000,0
1960-01-27,3.990000,3.990000,3.990000,3.990000,3.990000,0
1960-01-28,3.920000,3.920000,3.920000,3.920000,3.920000,0
1960-01-29,3.990000,3.990000,3.990000,3.990000,3.990000,0
1960-01-31,null,null,null,null,null,null
;
run;

Junyong · Posted 10-15-2020 04:08 PM

It seems this will be useful for R CSVs, which put NAs for missings. Thanks.

Tom · Posted 10-15-2020 05:03 PM

proc format;
  invalue myfmt "null" = .
                default = 10.5;
run;

Did you really mean to set the value to 10.5 when the value is the string 'default'?

If not then what did you think that second line meant?

CurtisMackWSIPP · Posted 10-15-2020 05:05 PM

Try it. You are thinking of the "OTHER" statement. "DEFAULT" says use this existing format of 10.5.

Tom · Posted 10-15-2020 07:52 PM

@CurtisMackWSIPP wrote:

Try it. You are thinking of the "OTHER" statement. "DEFAULT" says use this existing format of 10.5.

There is no reason to use a different informat than what SAS normally uses to read numbers. But if you did want to specify the use of an existing informat you need to enclose it square brackets. But you wouldn't want to use an informat like 10.5 because that imposes an implied decimal point before the last 5 digits for this problem. So any integer values would be divided by 10**5.

Registration is open

SAS Training: Just a Click Away