04-01-2017 12:20 PM
Dear SAS users!
I am hoping to get some help with the following problem and any advice/help is much appreciated!
I have retrieved price and return index data (so 2 variables) from datastream in excel format and I want to make it into a long form in SAS.
As you can see from my attachement, my data currently looks like this (also attached)
However, I want to make it into a long form with each row representing one firm date. So something like
Company Date Price Return Index
A 1 x x
A 2 x x
B 1 x x
An additional problem that I encounter some of the excel cells are marked "NA", would this cause a problem as the "NA"s will cause SAS to recognize variables as character instead of numeric? As there a way to mark the "NA"s just as missing data? I think I can do it in excel but hoping if I can solve all the problems in SAS too.
Any help or code is very much appreciated!
04-01-2017 12:54 PM
Many folks here won't download an Excel Workbook and, for your file, Google Chrome wasn't able to show a preview of it.
Save the first 10 or so records as a csv file and post that.
However, just looking at the pic you posted, I couldn't identify the two prices you said you have. I presume the ones under the dates represent one of them, but what represents the other?
Art, CEO, AnalystFinder.com
04-01-2017 12:58 PM
hi @art297, yes you are right. If you look at datatype, those with a "P" mean price and with "RI" means return index. Also CSV file attached here. Thanks so much!
04-01-2017 02:31 PM
If you don't mind getting a long log, and if your first three records are actually like your example, I think the following will do what you want to accomplish:
proc import datafile='/folders/myfolders/sample code_csv.csv' out=have dbms=csv replace; guessingrows=3; run; data need1 (keep=name currency price date ); set have (where=(datatype eq 'P')); format date date9.; array prices _:; do over prices; if not missing(prices) then do; price=prices; date=input(substr(vname(prices),2),anydtdte10.); output; end; end; run; data need2 (keep=currency return_index ); set have (where=(datatype eq 'RI')); array ri _:; do over ri; if not missing(ri) then do; return_index=ri; output; end; end; run; data want; set need1; set need2; run;
Of course, I didn't validate the results, so checking it would be up to you.
Art, CEO, AnalystFinder.com
04-01-2017 06:16 PM - edited 04-01-2017 06:17 PM
It is easy. First make it vertical with PROC TRANSPOSE.
proc transpose data=have out=middle ; by name datatype currency notsorted ; var _: ; run;
Then convert values into numbers and the _name_ into date.
data want ; set middle; date = input(translate(substr(_name_,2),'/','_'),mmddyy10.); format date yymmdd10.; price = input(col1,??comma32.); run;
04-01-2017 06:30 PM
@Tom: I tested the OP's data before submitting my suggestion. The code I suggested works and the problem isn't as simple as you described.
Since there are NA values throughout the data, unless one takes the approach I suggested (or writes the import manually) a scattered number of fields end up being character and the rest numeric.
Add that to the problem that the name variable is different for the two types of records, I don't see an easier way to solve the problem.
Fortunately, the files first two records (after the header record) are complete, so using guessingrows=3, forces proc import to only read the values (other than the grouping variables) as numeric.
Art, CEO, AnalystFinder.com
p.s. I'm bringing our non-submitted-or-accepted presentation on a stick to SGF. There have been a lot more cancellations so, either the number of presentations will be either fewer than advertised, or they might finally reach out for alternatives. Sure hope so, because our latest changes to the macro make it extremely (INHO) attractive to many of our colleagues .. if they know about it.
04-01-2017 07:08 PM - edited 04-01-2017 07:24 PM
The XLSX libname doesn't know what to do with those date values for variable names so it converts them into the Excel numeric value for a date. So here is complete program. Note that the ?? modifier on the informat will allow it to silently convert the 'NA' strings to missing values.
libname x xlsx 'c:\downloads\sample1.xlsx'; proc transpose data=x.sheet1 out=middle ; by name datatype currency notsorted ; var _: ; format _character_; run; data want ; set middle ; date = input(_label_,32.) + '01JAN1900'd -2; format date yymmdd10.; price = input(col1,??comma32.); drop _name_ _label_ col1 ; run; proc print data=want (obs=20); run;
Obs Name DATATYPE CURRENCY date price 1 SINGAPORE TELECOM P S$ 2009-01-01 2.55 2 SINGAPORE TELECOM P S$ 2009-01-02 2.67 3 SINGAPORE TELECOM P S$ 2009-01-05 2.61 4 SINGAPORE TELECOM P S$ 2009-01-06 2.60 5 SINGAPORE TELECOM P S$ 2009-01-07 2.52 6 SINGAPORE TELECOM P S$ 2009-01-08 2.49 7 SINGAPORE TELECOM P S$ 2009-01-09 2.52 8 SINGAPORE TELECOM P S$ 2009-01-12 2.55 9 SINGAPORE TELECOM P S$ 2009-01-13 2.53 10 SINGAPORE TELECOM P S$ 2009-01-14 2.52 11 SINGAPORE TELECOM P S$ 2009-01-15 2.43 12 SINGAPORE TELECOM P S$ 2009-01-16 2.52 13 SINGAPORE TELECOM P S$ 2009-01-19 2.54 14 SINGAPORE TELECOM P S$ 2009-01-20 2.57 15 SINGAPORE TELECOM P S$ 2009-01-21 2.55 16 SINGAPORE TELECOM P S$ 2009-01-22 2.56 17 SINGAPORE TELECOM P S$ 2009-01-23 2.53 18 SINGAPORE TELECOM P S$ 2009-01-26 2.53 19 SINGAPORE TELECOM P S$ 2009-01-27 2.53 20 SINGAPORE TELECOM P S$ 2009-01-28 2.65
If you want you could add code like below to the last data step to allow you to split the NAME field into two variables.
return_flag = 0 < index(name,'- TOT RETURN IND'); name = tranwrd(name,'- TOT RETURN IND',' ');
04-02-2017 01:48 PM
WPS/SAS/R: Transposing a very fat (1,568 columns) excel sheet to 156,500 key/value pairs Only WPS/R or IML/R could create the SAS dataset from the fat sheet? proc import seemed failed (after column 4/30/2009 all data was missing (bad number conversion?) Did not try to figure it out decided to switch to WPS-R interface libname failed (255 col limit) There are ways to do this with SAS but perhaps not as direct as R, NA is missing in R. HAVE (An EXCEL sheet with 1,568 columns) ========================================= Middle Observation(50 ) of she - Total Obs 100 Column -- CHARACTER -- 1 NAME C 57 SEMBCORP INDUSTRIES 2 DATATYPE C 2 RI 3 CURRENCY C 2 S$ 4 2009:01:01:00:00:00 C 7 248.88 5 2009:01:02:00:00:00 C 7 260.74 6 2009:01:05:00:00:00 C 7 284.41 ... 1566 X2014_12_29_00_00_00 C 8 577.87 1567 X2014_12_30_00_00_00 C 8 579.25 1568 X2014_12_31_00_00_00 C 8 581.92 WANT === check 100*(1568-3(dimension values)) = 156,500. we do get the right number of rows) Up to 40 obs from want total obs=156,500 Obs DATATYPE NAME CURRENCY DTE VAL 1 P SINGAPORE TELECOM S$ 01JAN2009 2.55 2 P SINGAPORE TELECOM S$ 02JAN2009 2.67 3 P SINGAPORE TELECOM S$ 05JAN2009 2.61 4 P SINGAPORE TELECOM S$ 06JAN2009 2.6 5 P SINGAPORE TELECOM S$ 07JAN2009 2.52 6 P SINGAPORE TELECOM S$ 08JAN2009 2.49 7 P SINGAPORE TELECOM S$ 09JAN2009 2.52 8 P SINGAPORE TELECOM S$ 12JAN2009 2.55 9 P SINGAPORE TELECOM S$ 13JAN2009 2.53 10 P SINGAPORE TELECOM S$ 14JAN2009 2.52 .... 78247 P CHINA AVTN.OIL (SING.) S$ 26DEC2014 0.66 78248 P CHINA AVTN.OIL (SING.) S$ 29DEC2014 0.655 78249 P CHINA AVTN.OIL (SING.) S$ 30DEC2014 0.68 78250 P CHINA AVTN.OIL (SING.) S$ 31DEC2014 0.67 Middle Observation(78250 ) of want - Total Obs 156,500 -- CHARACTER -- DATATYPE C 2 RI DATATYPE NAME C 57 SEMBCORP INDUSTR NAME CURRENCY C 2 S$ CURRENCY VAL C 8 581.92 VAL -- NUMERIC -- DTE N 8 20088 DTE WORKING CODE wb <- loadWorkbook("d:/xls/sample.xlsx",create = FALSE); import r=she data=wrk.she; * key/value pairs; array vals X:; dte=input(substr(vname(vals),2,10),yymmdd10.); val=vals; * _ _____ _ _ _ __ _| |___ ___ __ _ __|___ | |__ __| | __ _| |_ \ \/ / / __|_____/ __|/ _` / __| / /| '_ \ / _` |/ _` | __| > <| \__ \_____\__ \ (_| \__ \/ / | |_) | (_| | (_| | |_ /_/\_\_|___/ |___/\__,_|___/_/ |_.__/ \__,_|\__,_|\__| ; %utl_submit_r64(' library(XLConnect); wb <- loadWorkbook("d:/xls/sample.xlsx",create = FALSE); she<-readWorksheet(wb, "sheet1",colTypes="character", forceConversion=TRUE); endsubmit; import r=she data=wrk.she; run;quit; '); *_ ______ ____ ____ _ ____ \ \ / / _ \/ ___| / ___| / \ / ___| \ \ /\ / /| |_) \___ \ ____\___ \ / _ \ \___ \ \ V V / | __/ ___) |_____|__) / ___ \ ___) | \_/\_/ |_| |____/ |____/_/ \_\____/ ; data want (keep=datatype name currency dte val); retain datatype name currency dte val ; format dte date9.; set she(where=(datatype in ('RI','P'))); array vals X:; do over vals; dte=input(substr(vname(vals),2,10),yymmdd10.); val=vals; output; end; run;quit; run;quit;
04-05-2017 03:57 PM
04-19-2017 12:40 PM
I felt that this was a data set that was worth engineering a suitable data step, to show how reasonably the INFILE and INPUT statements can support such a particular structure.
In a single pass the solution provides columns :
co(company), pdate( price date), datatype, currency and price
The first input statements collect the date column header dates into an array
On the data rows (after the header) each price is output separately.
/********************************************************* * datastream data import 1 * *********************************************************/ ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; *filename dbsgf17 'C:\Users\PC User\Dropbox\SASGF2017' ; Data ds1 ; attrib co length=$100 pdate format= date9. Informat= dDMMYY10. ; array dates(10000) _temporary_ ; * allows for up to 10,000 dates on a row.; Infile dbsgf17("datastream-sample.csv") dsd lrecl= 1000000 truncover COLUMN=COL LENGTH= LEN ; KEEP CO DATATYPE CURRENCY PRICE PDATE; RETAIN CO DATATYPE CURRENCY PRICE PDATE; If _n_ = 1 then do; *-----------------------------------------------* * PROCESS COLUMN HEADERS ON FIRST ROW * * just to collect the date values (dd/mm/yyyy * *-----------------------------------------------*; Input co datatype :$33. currency : $11. @ ; do c= 1 to dim(dates) WHILE( COL<LEN ) ; * cols COL LEN are defined in the INFILE * LEN indicates length of current input line * COL is pointer to next column to be read * so no more, once pointer passes end of line ; Input dates(c) : DDMMYY10. @ ; end ; *----------------------------------------------------* * C is an index into the dates array * * T is a pointer to the top of the dates array * *----------------------------------------------------; t= C-1 ; PUT (len col T c)(= +1) (dates(1) DATES(T))(= DATE9. +1) ; Delete ; RETAIN T ; end ; Input co datatype currency @ ; do c= 1 to t ; Input price ?? @ ; ***** ?? ignores invalid input value NA ; Pdate = dates(c) ; If price>0 then Output ; ** output only when price is present; End ; Run ;
The log reported useful info:
NOTE: 101 records were read from the infile DBSGF17("datastream-sample.csv"). The minimum record length was 4987. The maximum record length was 17237. NOTE: The data set WORK.DS1 has 145312 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.07 seconds
Indicating the v.large LRECL= 1000000 value was a bit "over the top" when the widest row(column headers) was 17237
and how little impact that has on performance- as the 145312 obs. were extracted from 101 records in less than a tenth of a sec.
As challenging as this data structure is for the IMPORT Procedure, the best alternative is, I believe, to spend the time to learn how SAS can parse data with INPUT