Re: datastream into SAS

rachelzhang · Posted 04-01-2017 12:20 PM

Dear SAS users!

I am hoping to get some help with the following problem and any advice/help is much appreciated!

I have retrieved price and return index data (so 2 variables) from datastream in excel format and I want to make it into a long form in SAS.

As you can see from my attachement, my data currently looks like this (also attached)

However, I want to make it into a long form with each row representing one firm date. So something like

Company Date Price Return Index

A 1 x x

A 2 x x

B 1 x x

An additional problem that I encounter some of the excel cells are marked "NA", would this cause a problem as the "NA"s will cause SAS to recognize variables as character instead of numeric? As there a way to mark the "NA"s just as missing data? I think I can do it in excel but hoping if I can solve all the problems in SAS too.

Any help or code is very much appreciated!

Thank you!

art297 · Posted 04-01-2017 12:54 PM

Many folks here won't download an Excel Workbook and, for your file, Google Chrome wasn't able to show a preview of it.

Save the first 10 or so records as a csv file and post that.

However, just looking at the pic you posted, I couldn't identify the two prices you said you have. I presume the ones under the dates represent one of them, but what represents the other?

Art, CEO, AnalystFinder.com

rachelzhang · Posted 04-01-2017 12:58 PM

hi @art297, yes you are right. If you look at datatype, those with a "P" mean price and with "RI" means return index. Also CSV file attached here. Thanks so much!

art297 · Posted 04-01-2017 02:31 PM

If you don't mind getting a long log, and if your first three records are actually like your example, I think the following will do what you want to accomplish:

proc import datafile='/folders/myfolders/sample code_csv.csv'
            out=have dbms=csv replace;
  guessingrows=3;
run;
data need1 (keep=name currency price date );
  set have (where=(datatype eq 'P'));
  format date date9.;
  array prices _:;
  do over prices;
    if not missing(prices) then do;
      price=prices;
      date=input(substr(vname(prices),2),anydtdte10.);
      output;
    end;
  end;
run;

data need2 (keep=currency return_index );
  set have (where=(datatype eq 'RI'));
  array ri _:;
  do over ri;
    if not missing(ri) then do;
      return_index=ri;
      output;
    end;
  end;
run;

data want;
  set need1;
  set need2;
run;

Of course, I didn't validate the results, so checking it would be up to you.

Art, CEO, AnalystFinder.com

rachelzhang · Posted 04-01-2017 03:32 PM

@art297 thanks so much for the code! and yes checking would be on me 🙂 appreciate this very much!

art297 · Posted 04-01-2017 06:09 PM

If the problem is solved don't forget to mark it as such.

Art, CEO, AnalystFinder.com

Tom · Posted 04-01-2017 06:16 PM

It is easy. First make it vertical with PROC TRANSPOSE.

proc transpose data=have out=middle ;
  by name datatype currency notsorted ;
  var _: ;
run;

Then convert values into numbers and the _name_ into date.

data want ;
  set middle;
  date = input(translate(substr(_name_,2),'/','_'),mmddyy10.);
  format date yymmdd10.;
  price = input(col1,??comma32.);
run;

art297 · Posted 04-01-2017 06:30 PM

@Tom: I tested the OP's data before submitting my suggestion. The code I suggested works and the problem isn't as simple as you described.

Since there are NA values throughout the data, unless one takes the approach I suggested (or writes the import manually) a scattered number of fields end up being character and the rest numeric.

Add that to the problem that the name variable is different for the two types of records, I don't see an easier way to solve the problem.

Fortunately, the files first two records (after the header record) are complete, so using guessingrows=3, forces proc import to only read the values (other than the grouping variables) as numeric.

Art, CEO, AnalystFinder.com

p.s. I'm bringing our non-submitted-or-accepted presentation on a stick to SGF. There have been a lot more cancellations so, either the number of presentations will be either fewer than advertised, or they might finally reach out for alternatives. Sure hope so, because our latest changes to the macro make it extremely (INHO) attractive to many of our colleagues .. if they know about it.

Tom · Posted 04-01-2017 07:08 PM

The XLSX libname doesn't know what to do with those date values for variable names so it converts them into the Excel numeric value for a date. So here is complete program. Note that the ?? modifier on the informat will allow it to silently convert the 'NA' strings to missing values.

libname x xlsx 'c:\downloads\sample1.xlsx';
proc transpose data=x.sheet1 out=middle ;
  by name datatype currency notsorted ;
  var _: ;
  format _character_;
run;
data want ;
  set middle ;
  date = input(_label_,32.) + '01JAN1900'd -2;
  format date yymmdd10.;
  price = input(col1,??comma32.);
  drop _name_ _label_ col1 ;
run;

proc print data=want (obs=20);
run;

                 Obs          Name           DATATYPE    CURRENCY          date    price

                   1    SINGAPORE TELECOM       P           S$       2009-01-01     2.55
                   2    SINGAPORE TELECOM       P           S$       2009-01-02     2.67
                   3    SINGAPORE TELECOM       P           S$       2009-01-05     2.61
                   4    SINGAPORE TELECOM       P           S$       2009-01-06     2.60
                   5    SINGAPORE TELECOM       P           S$       2009-01-07     2.52
                   6    SINGAPORE TELECOM       P           S$       2009-01-08     2.49
                   7    SINGAPORE TELECOM       P           S$       2009-01-09     2.52
                   8    SINGAPORE TELECOM       P           S$       2009-01-12     2.55
                   9    SINGAPORE TELECOM       P           S$       2009-01-13     2.53
                  10    SINGAPORE TELECOM       P           S$       2009-01-14     2.52
                  11    SINGAPORE TELECOM       P           S$       2009-01-15     2.43
                  12    SINGAPORE TELECOM       P           S$       2009-01-16     2.52
                  13    SINGAPORE TELECOM       P           S$       2009-01-19     2.54
                  14    SINGAPORE TELECOM       P           S$       2009-01-20     2.57
                  15    SINGAPORE TELECOM       P           S$       2009-01-21     2.55
                  16    SINGAPORE TELECOM       P           S$       2009-01-22     2.56
                  17    SINGAPORE TELECOM       P           S$       2009-01-23     2.53
                  18    SINGAPORE TELECOM       P           S$       2009-01-26     2.53
                  19    SINGAPORE TELECOM       P           S$       2009-01-27     2.53
                  20    SINGAPORE TELECOM       P           S$       2009-01-28     2.65

If you want you could add code like below to the last data step to allow you to split the NAME field into two variables.

return_flag = 0 < index(name,'- TOT RETURN IND');
name = tranwrd(name,'- TOT RETURN IND',' ');

rogerjdeangelis · Posted 04-02-2017 01:48 PM

WPS/SAS/R: Transposing a very fat (1,568 columns) excel sheet to 156,500 key/value pairs

Only WPS/R or IML/R could create the SAS dataset from the fat sheet?

proc import seemed failed (after column 4/30/2009 all data was missing (bad number conversion?)
Did not try to figure it out decided to switch to WPS-R interface

libname failed (255 col limit)

There are ways to do this with SAS but perhaps not as direct as R, NA is missing in R.

HAVE (An EXCEL sheet with 1,568 columns)
=========================================

Middle Observation(50 ) of she - Total Obs 100

Column      -- CHARACTER --

1          NAME                      C    57      SEMBCORP INDUSTRIES
2          DATATYPE                  C    2       RI
3          CURRENCY                  C    2       S$
4          2009:01:01:00:00:00       C    7       248.88
5          2009:01:02:00:00:00       C    7       260.74
6          2009:01:05:00:00:00       C    7       284.41
           ...
1566       X2014_12_29_00_00_00      C    8       577.87
1567       X2014_12_30_00_00_00      C    8       579.25
1568       X2014_12_31_00_00_00      C    8       581.92


WANT
===
check 100*(1568-3(dimension values)) = 156,500. we do get the right number of rows)

Up to 40 obs from want total obs=156,500

 Obs    DATATYPE          NAME           CURRENCY       DTE       VAL

    1       P        SINGAPORE TELECOM       S$       01JAN2009    2.55
    2       P        SINGAPORE TELECOM       S$       02JAN2009    2.67
    3       P        SINGAPORE TELECOM       S$       05JAN2009    2.61
    4       P        SINGAPORE TELECOM       S$       06JAN2009    2.6
    5       P        SINGAPORE TELECOM       S$       07JAN2009    2.52
    6       P        SINGAPORE TELECOM       S$       08JAN2009    2.49
    7       P        SINGAPORE TELECOM       S$       09JAN2009    2.52
    8       P        SINGAPORE TELECOM       S$       12JAN2009    2.55
    9       P        SINGAPORE TELECOM       S$       13JAN2009    2.53
   10       P        SINGAPORE TELECOM       S$       14JAN2009    2.52
....
78247       P        CHINA AVTN.OIL (SING.)  S$       26DEC2014    0.66
78248       P        CHINA AVTN.OIL (SING.)  S$       29DEC2014    0.655
78249       P        CHINA AVTN.OIL (SING.)  S$       30DEC2014    0.68
78250       P        CHINA AVTN.OIL (SING.)  S$       31DEC2014    0.67


Middle Observation(78250 ) of want - Total Obs 156,500

 -- CHARACTER --
DATATYPE                         C    2       RI                  DATATYPE
NAME                             C    57      SEMBCORP INDUSTR    NAME
CURRENCY                         C    2       S$                  CURRENCY
VAL                              C    8       581.92              VAL

 -- NUMERIC --
DTE                              N    8       20088               DTE

WORKING CODE

       wb <- loadWorkbook("d:/xls/sample.xlsx",create = FALSE);
       import r=she data=wrk.she;

       * key/value pairs;
       array vals X:;
       dte=input(substr(vname(vals),2,10),yymmdd10.);
       val=vals;


*     _                        _____ _         _       _
__  _| |___       ___  __ _ __|___  | |__   __| | __ _| |_
\ \/ / / __|_____/ __|/ _` / __| / /| '_ \ / _` |/ _` | __|
 >  <| \__ \_____\__ \ (_| \__ \/ / | |_) | (_| | (_| | |_
/_/\_\_|___/     |___/\__,_|___/_/  |_.__/ \__,_|\__,_|\__|

;

%utl_submit_r64('
library(XLConnect);
wb <- loadWorkbook("d:/xls/sample.xlsx",create = FALSE);
she<-readWorksheet(wb, "sheet1",colTypes="character", forceConversion=TRUE);
endsubmit;
import r=she data=wrk.she;
run;quit;
');

*_        ______  ____       ____    _    ____
\ \      / /  _ \/ ___|     / ___|  / \  / ___|
 \ \ /\ / /| |_) \___ \ ____\___ \ / _ \ \___ \
  \ V  V / |  __/ ___) |_____|__) / ___ \ ___) |
   \_/\_/  |_|   |____/     |____/_/   \_\____/

;

data want (keep=datatype name currency dte val);
  retain datatype name currency dte val ;
  format dte date9.;
  set she(where=(datatype in ('RI','P')));
      array vals X:;
      do over vals;
         dte=input(substr(vname(vals),2,10),yymmdd10.);
         val=vals;
         output;
      end;
  run;quit;
run;quit;

Peter_C · Posted 04-05-2017 03:57 PM

is that a header row of dates, followed by 1 row for each company with price under the corresponding dates?
If that is the data structure, a short data step could read a .csv version of the datastream data and create the requested data structure --- in one pass

except we would have poorer performance if the row width is unknown.

Data ds1 ;
attrib co length=$100   pdate format= date9. Informat= date11. ;
array dates(10000) _temporary_ ; * allows for up to 10,000 dates on a row.;
Infile "that .csv version of the input" dsd lrecl= 1000000 truncover ;
If _n_ = 1 then do;
Input co @;
   do c= 1 to dim(dates) ;
       Input dates(c) @ ;
   end ;
   Delete ;
end ;
Input co @ ;
do c= 1 to dim(dates) ;
   Input price @ ;
   Pdate = dates(c) ;
   If price>0 then Output ;
End ;
Run ;

That code assumes the dates conform to date. format like dd-mmm-yyyy or dd-mmm-yy
Or without those dashes.
If dates are in some other date layout for which SAS has a suitable informat that should replace the informat date11.
If no suitable informat is available conversion might still be possible. Other ways to define the price-date could be dealt with.
These alternatives need better definition of the input information structure .

Peter_C · Posted 04-19-2017 12:40 PM

I felt that this was a data set that was worth engineering a suitable data step, to show how reasonably the INFILE and INPUT statements can support such a particular structure.

In a single pass the solution provides columns :

co(company), pdate( price date), datatype, currency and price

The first input statements collect the date column header dates into an array

On the data rows (after the header) each price is output separately.

the code

/*********************************************************
 *  datastream data import 1                             *
 *********************************************************/
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
*filename dbsgf17 'C:\Users\PC User\Dropbox\SASGF2017' ;

Data ds1 ;
attrib co  length=$100   pdate format= date9. Informat= dDMMYY10.  ;
array dates(10000) _temporary_ ; * allows for up to  10,000 dates on a row.;
Infile dbsgf17("datastream-sample.csv")  dsd lrecl= 1000000 truncover COLUMN=COL LENGTH= LEN ;

KEEP    CO DATATYPE CURRENCY PRICE PDATE;
RETAIN  CO DATATYPE CURRENCY PRICE PDATE;
If _n_ = 1 then do; 
*-----------------------------------------------*
* PROCESS COLUMN HEADERS ON FIRST ROW           *
*  just to collect the date values (dd/mm/yyyy  *
*-----------------------------------------------*;
  Input co datatype :$33. currency : $11.  @ ;
   do c= 1 to dim(dates) WHILE( COL<LEN ) ; 
                         * cols COL LEN are defined in the INFILE
                         * LEN indicates length of current input line
                         * COL is pointer to next column to be read 
                         * so no more, once pointer passes end of line ;
       Input dates(c) : DDMMYY10. @ ;
   end ;
   *----------------------------------------------------*
   *  C is an index into the dates array                *
   *  T is a pointer to the top of the dates array      *
   *----------------------------------------------------;
   t= C-1 ;
   PUT (len col T c)(= +1) (dates(1) DATES(T))(= DATE9. +1) ;
   Delete ;
   RETAIN T ;
end ;
Input co datatype currency @ ;
do c= 1 to t ; 
   Input price ?? @ ; ***** ?? ignores invalid input value NA ;
   Pdate = dates(c) ;
   If price>0 then Output ; ** output only when price is present;
End ;
Run ;

The log reported useful info:

 NOTE: 101 records were read from the infile DBSGF17("datastream-sample.csv").
       The minimum record length was 4987.
       The maximum record length was 17237.
 NOTE: The data set WORK.DS1 has 145312 observations and 5 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.07 seconds

Indicating the v.large LRECL= 1000000 value was a bit "over the top" when the widest row(column headers) was 17237

and how little impact that has on performance- as the 145312 obs. were extracted from 101 records in less than a tenth of a sec.

As challenging as this data structure is for the IMPORT Procedure, the best alternative is, I believe, to spend the time to learn how SAS can parse data with INPUT

Registration is open

SAS Training: Just a Click Away