Hi,
suppose I have a table with a somewhat complex data structure like the following:
n | line |
… | … |
34 | text jdjdjf stock IBM |
35 | djdff |
36 | 25$ |
… | … |
Whenever I have a line with the word "stock" in it, the actual price is always 2 lines below. So what I would like to have is the following structure:
n | line | price |
34 | text jdjdjf stock IBM | 25$ |
Like this the stock and its price are on the same observations and its possible to make an analysis.
Thank you.
Yes, your understanding is correct. Good.
Ok, still just a minor change
data want;
set have2 ;
retain _flag 0 _line _n ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _: ;
run;
1) Why on earth do you have data structure like this? 🙂
2) Where does "A" come from in your line variable instead of IBM?
Use the @ character feature of the INPUT statement:
data want;
infile datalines;
input @ 'stock' stock :$5. // price :$5.;
put (_all_) (=);
datalines;
xxx
stock AAA
aaa
11$
xxx
xxx
stock BBBB
bbb
22$
run;
The slashes in the input statement tells it to advance one line per slash.
Hi mkeintz,
since my data is generated externally, lets assume that I already imported data equivalent to:
data have;
input line $ 1-10;
datalines;
xxx
stock AAA
aaa
11$
xxx
xxx
stock BBBB
bbb
22$
run;
It is now that I need to do the process described in the question. I tried to guess based on your code by doing the following:
data want;
set have;
input @ 'stock' stock :$5. // price :$5.;
put (_all_) (=);
run;
But unsurprisingly it didn't work
Input is for raw data input, not sas data input. I'd suggest usings @PeterClemmensen's solultion.
data have;
input n line $50.;
datalines;
34 text jdjdjf stock IBM
35 djdff
36 25$
37 text jdjdjf stock IBM
38 djdff
39 25$
40 text jdjdjf stock IBM
41 djdff
42 25$
;
data want;
merge have have(firstobs=3 rename=(line=price));
if find(line, "stock")>0 then output;
run;
Hi draycut,
thanks for the code, worked perfectly!
If possible to expand the code, in my case I have more variables of interest. For example, the 4th line after the line that contains the word "stock" I have its industry:
data have2;
input n line $50.;
datalines;
30 doidn
31 fifeniofe
32 doewiij
33 fofugrugr
34 text jdjdjf stock IBM
35 djdff
36 25$
37 fg fgg
38 computers
39 text jdjdjf stock AAA
40 djdff
41 22$
42 fggh
43 food
44 text jdjdjf stock BBB
45 djdff
46 55$
47 duehuefhu
48 automotive
49 dfeoufeijer
50 ierhde43848
;
run;
Using your code I did the following:
data want2;
merge have2 have2(firstobs=3 rename=(line=price)) have2(firstobs=5 rename=(line=industry));
if find(line, "stock")>0 then output;
run;
And also got the result that I wanted!
I would like to ask for your opinion on the following:
1) In my data I have tens to hundreds of thousands of lines (most of them useless) and 5 to 7 variables that I need to extract, so this would mean merging 5 to 7 times the same data table just like above. Is this a good approach, i.e, could there be efficiency or some other problems (like the computer freezing for example)?
2) Recently I worked on some data where I also had to restructure it into a more analysis-friendly form using Excel VBA. Looking at my data I would perhaps do the following:
Loop through my original data until I find the word "stock". Once I find it, remember the cells row r and paste the cell's content into
a new spreadsheet in cell (1,2) (the first column in the new sheet is the stock, second column price etc).
In the original sheet the contents of the cell with row = r+2 are the stock's price and so they will be pasted into the new sheet's cell (2,2), and so on for the other variables...
And when I reach a new cell with the word "stock" I go to the next line in the new sheet.
It seems, at least to me, that using cells is easier conceptually (because I couldn't figure out how to do the above mentioned data restructuring using SAS), so is it possible to "cellulize" data in SAS? Is it even worth it?
Thanks
@ilikesas Good afternoon, IMO unless your sample pattern resembles exactly your real, I'd think the merge might not suffice. You prolly will need a sound method at some point to make it squeaky clean to handle varying patterns.
data have2;
input n line $50.;
datalines;
30 doidn
31 fifeniofe
32 doewiij
33 fofugrugr
34 text jdjdjf stock IBM
35 djdff
36 25$
37 fg fgg
38 computers
39 text jdjdjf stock AAA
40 djdff
41 22$
42 fggh
43 food
44 text jdjdjf stock BBB
45 djdff
46 55$
47 duehuefhu
48 automotive
49 dfeoufeijer
50 ierhde43848
;
run;
data _null_;
if _N_ = 1 then do;
if 0 then set have2;
length price $10;
declare hash h( );
h.defineKey('_n','_line');
h.defineData('_n','_line','price');
h.defineDone();
call missing(price);
end;
do until(last);
set have2 end=last;
if find(line, "stock")>0 then do;
_n=n;
_line=line;
h.add();
end;
else if index(line,'$')>0 and h.check()=0 then do;
price=line;
h.replace();
end;
end;
if last then h.output(dataset:'want');
run;
You've been thinking in Excel terms, not SAS terms.
You can use RETAIN here to get what you want as well. I'm assuming you're scraping data from either a webpage or copy/pasted out of a report/PDF. There's often ways to move that data to get it in a better format in the first place but I'll consider that irrelevant for the moment.
If you use RETAIN, you can trigger the write to your SAS data set once you find a new stock symbol.
I don't see how you know which is a new symbol yet, but it seems relatively straightforward to me.
data want;
set have;
retain stock price <other variables>;
if <stock> then do;
output; *output values currently stored;
call missing(stock, price, <other variables>; *reset all values to missing to prevent carry over of values;
stock=value; *set to new stock ticker;
end;
else if ... then price=value;
else if ... then <other_variables>=value; *since these are all retained, you get a single record for each 'stock' at the end;
run;
Reeza is right, yeah a simple retain does it like-
data want;
set have2 end=last;
retain _line _n;
if find(line, "stock")>0 then do;
_n=n;
_line=line;
end;
else if index(line,'$')>0 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _:;
run;
Hi novinosrin,
for the line in your code:
else if index(line,'$')>0 then do;
if I understand correctly the code looks for the symbol "$", but in my case I have many useless lines that contain the symbol "$", so it is really the relative position of the lines from the line that contains the word "stock" that are important
Yes, your understanding is correct. Good.
Ok, still just a minor change
data want;
set have2 ;
retain _flag 0 _line _n ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _: ;
run;
Thanks for the code.
I managed to include an additional "industry" variable:
data want;
set have2 ;
retain _flag 0 _line _n _line2 _n2 ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
_n2=n;
_line2=line;
end;
if __n=5 then do;
n=_n;
industry=line;
price=_line2;
line=_line;
output;
end;
drop _: ;
run;
so I guess that if I had more variables to create I would just repeat the process
@ilikesas Sorry mate!, I have been busy with school. Did it work? Are you able to progress and make the necessary changes?
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.