I felt this data set was worth engineering a suitable DATA step for, to show how well the INFILE and INPUT statements can support such a particular structure.
In a single pass the solution provides the columns co (company), pdate (price date), datatype, currency, and price.
The first INPUT statements collect the dates from the column headers into an array.
On the data rows (after the header), each price is output as a separate observation.
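For context, the shape of file this step expects looks something like the sketch below (hypothetical codes and values; the real Datastream export will differ): a header row carrying dd/mm/yyyy dates, then one row per series, with NA marking a missing price.

```
Code,Datatype,Currency,01/03/2016,02/03/2016,03/03/2016
ACME,P,U$,10.25,NA,10.40
WIDGT,P,U$,3.10,3.12,NA
```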
The code:
/*********************************************************
* datastream data import 1 *
*********************************************************/
filename dbsgf17 'C:\Users\PC User\Dropbox\SASGF2017' ; * directory holding the csv file ;
Data ds1 ;
attrib co length=$100 pdate format=date9. informat=ddmmyy10. ;
array dates(10000) _temporary_ ; * allows for up to 10,000 dates on a row.;
Infile dbsgf17("datastream-sample.csv") dsd lrecl= 1000000 truncover COLUMN=COL LENGTH= LEN ;
KEEP CO DATATYPE CURRENCY PRICE PDATE;
RETAIN CO DATATYPE CURRENCY PRICE PDATE;
If _n_ = 1 then do;
*-----------------------------------------------*
* PROCESS COLUMN HEADERS ON FIRST ROW *
* just to collect the date values (dd/mm/yyyy *
*-----------------------------------------------*;
Input co datatype :$33. currency : $11. @ ;
do c= 1 to dim(dates) WHILE( COL<LEN ) ;
* cols COL LEN are defined in the INFILE
* LEN indicates length of current input line
* COL is pointer to next column to be read
* so no more, once pointer passes end of line ;
Input dates(c) : DDMMYY10. @ ;
end ;
*----------------------------------------------------*
* C is an index into the dates array *
* T is a pointer to the top of the dates array *
*----------------------------------------------------;
t= C-1 ;
PUT (len col T c)(= +1) (dates(1) DATES(T))(= DATE9. +1) ;
Delete ;
RETAIN T ;
end ;
Input co datatype currency @ ;
do c= 1 to t ;
Input price ?? @ ; ***** ?? ignores invalid input value NA ;
Pdate = dates(c) ;
If price>0 then Output ; ** output only when price is present;
End ;
Run ;
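Once the step has run, the long result is easy to inspect. The following is just an illustrative sketch, assuming only the WORK.DS1 data set created above:

```sas
* peek at the first few long-format observations ;
proc print data= ds1(obs= 8) ;
  format pdate date9. ;
run ;

* one frequency count per company, to sanity-check the reshape ;
proc freq data= ds1 ;
  tables co / nocum ;
run ;
```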
The log reported useful info:
NOTE: 101 records were read from the infile DBSGF17("datastream-sample.csv").
The minimum record length was 4987.
The maximum record length was 17237.
NOTE: The data set WORK.DS1 has 145312 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
This indicates that the very large LRECL=1000000 value was a bit "over the top", since the widest row (the column headers) was only 17,237 bytes.
It also shows how little impact that has on performance: the 145,312 observations were extracted from 101 records in under a tenth of a second.
As challenging as this data structure is for the IMPORT procedure, the best alternative, I believe, is to spend the time learning how SAS can parse such data with the INPUT statement.