Solved: selecting observations based on their relative position to other obser...

ilikesas · Posted 01-07-2018 10:42 AM

Hi,

suppose I have a table with a somewhat complex data structure like the following:

n	line
…	…
34	text jdjdjf stock IBM
35	djdff
36	25$
…	…

Whenever I have a line with the word "stock" in it, the actual price is always 2 lines below. So what I would like to have is the following structure:

n	line	price
34	text jdjdjf stock IBM	25$

Like this the stock and its price are on the same observations and its possible to make an analysis.

Thank you.

novinosrin · Posted 01-07-2018 11:02 PM

Yes, your understanding is correct. Good.

Ok, still just a minor change

data want;
set have2 ;
retain _flag 0 _line _n ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _: ;
run;

View solution in original post

PeterClemmensen · Posted 01-07-2018 10:46 AM

1) Why on earth do you have data structure like this? 🙂

2) Where does "A" come from in your line variable instead of IBM?

The DATA to DATA Step Macro
Blog: SASnrd

mkeintz · Posted 01-07-2018 10:52 AM

Use the @ character feature of the INPUT statement:

data want;
  infile datalines;
  input @ 'stock'  stock :$5. // price :$5.;
  put (_all_) (=);
datalines;
xxx
stock AAA
aaa
11$
xxx
xxx
stock BBBB
bbb
22$
run;

The slashes in the input statement tells it to advance one line per slash.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ilikesas · Posted 01-07-2018 11:22 AM

Hi mkeintz,

since my data is generated externally, lets assume that I already imported data equivalent to:

data have;
input line $ 1-10;  
datalines;
xxx
stock AAA
aaa
11$
xxx
xxx
stock BBBB
bbb
22$
run;

It is now that I need to do the process described in the question. I tried to guess based on your code by doing the following:

data want;
set have;
input @ 'stock' stock :$5. // price :$5.;
put (_all_) (=);

run;

But unsurprisingly it didn't work

mkeintz · Posted 01-07-2018 11:24 AM

Input is for raw data input, not sas data input. I'd suggest usings @PeterClemmensen's solultion.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PeterClemmensen · Posted 01-07-2018 10:55 AM

data have;
input n line $50.;
datalines;
34	text jdjdjf stock IBM
35	djdff 
36	25$
37	text jdjdjf stock IBM
38	djdff 
39	25$
40	text jdjdjf stock IBM
41	djdff 
42	25$
;

data want;
   merge have have(firstobs=3 rename=(line=price));
   if find(line, "stock")>0 then output;
run;

The DATA to DATA Step Macro
Blog: SASnrd

ilikesas · Posted 01-07-2018 12:19 PM

Hi draycut,

thanks for the code, worked perfectly!

If possible to expand the code, in my case I have more variables of interest. For example, the 4th line after the line that contains the word "stock" I have its industry:

data have2;
input n line $50.;
datalines;
30  doidn
31  fifeniofe
32  doewiij
33  fofugrugr
34	text jdjdjf stock IBM
35	djdff 
36	25$
37  fg fgg
38	computers
39	text jdjdjf stock AAA
40	djdff 
41	22$
42  fggh
43	food
44	text jdjdjf stock BBB
45	djdff 
46	55$
47  duehuefhu
48	automotive
49  dfeoufeijer
50  ierhde43848
;
run;

Using your code I did the following:

data want2;
   merge have2 have2(firstobs=3 rename=(line=price)) have2(firstobs=5 rename=(line=industry));
   if find(line, "stock")>0 then output;
run;

And also got the result that I wanted!

I would like to ask for your opinion on the following:

1) In my data I have tens to hundreds of thousands of lines (most of them useless) and 5 to 7 variables that I need to extract, so this would mean merging 5 to 7 times the same data table just like above. Is this a good approach, i.e, could there be efficiency or some other problems (like the computer freezing for example)?

2) Recently I worked on some data where I also had to restructure it into a more analysis-friendly form using Excel VBA. Looking at my data I would perhaps do the following:

Loop through my original data until I find the word "stock". Once I find it, remember the cells row r and paste the cell's content into

a new spreadsheet in cell (1,2) (the first column in the new sheet is the stock, second column price etc).

In the original sheet the contents of the cell with row = r+2 are the stock's price and so they will be pasted into the new sheet's cell (2,2), and so on for the other variables...

And when I reach a new cell with the word "stock" I go to the next line in the new sheet.

It seems, at least to me, that using cells is easier conceptually (because I couldn't figure out how to do the above mentioned data restructuring using SAS), so is it possible to "cellulize" data in SAS? Is it even worth it?

Thanks

novinosrin · Posted 01-07-2018 02:46 PM

@ilikesas Good afternoon, IMO unless your sample pattern resembles exactly your real, I'd think the merge might not suffice. You prolly will need a sound method at some point to make it squeaky clean to handle varying patterns.

novinosrin · Posted 01-07-2018 03:42 PM

data have2;

input n line $50.;

datalines;

30 doidn

31 fifeniofe

32 doewiij

33 fofugrugr

34 text jdjdjf stock IBM

35 djdff

36 25$

37 fg fgg

38 computers

39 text jdjdjf stock AAA

40 djdff

41 22$

42 fggh

43 food

44 text jdjdjf stock BBB

45 djdff

46 55$

47 duehuefhu

48 automotive

49 dfeoufeijer

50 ierhde43848

;

run;

data _null_;

if _N_ = 1 then do;

if 0 then set have2;

length price $10;

declare hash h( );

h.defineKey('_n','_line');

h.defineData('_n','_line','price');

h.defineDone();

call missing(price);

end;

do until(last);

set have2 end=last;

if find(line, "stock")>0 then do;

_n=n;

_line=line;

h.add();

end;

else if index(line,'$')>0 and h.check()=0 then do;

price=line;

h.replace();

end;

if last then h.output(dataset:'want');

run;

Reeza · Posted 01-07-2018 04:02 PM

You've been thinking in Excel terms, not SAS terms.

You can use RETAIN here to get what you want as well. I'm assuming you're scraping data from either a webpage or copy/pasted out of a report/PDF. There's often ways to move that data to get it in a better format in the first place but I'll consider that irrelevant for the moment.

If you use RETAIN, you can trigger the write to your SAS data set once you find a new stock symbol.

I don't see how you know which is a new symbol yet, but it seems relatively straightforward to me.

data want; 
set have;
retain stock price <other variables>;

if <stock> then do; 
output;  *output values currently stored;
call missing(stock, price, <other variables>; *reset all values to missing to prevent carry over of values;
stock=value;   *set to new stock ticker;
end;

else if ... then price=value;
else if ... then <other_variables>=value; *since these are all retained, you get a single record for each 'stock' at the end;



run;

novinosrin · Posted 01-07-2018 04:10 PM

Reeza is right, yeah a simple retain does it like-

data want;
set have2 end=last;
retain _line _n;
if find(line, "stock")>0 then do;
_n=n;
_line=line;
end;
else if index(line,'$')>0 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _:;
run;

ilikesas · Posted 01-07-2018 10:06 PM

Hi novinosrin,

for the line in your code:

else if index(line,'$')>0 then do;

if I understand correctly the code looks for the symbol "$", but in my case I have many useless lines that contain the symbol "$", so it is really the relative position of the lines from the line that contains the word "stock" that are important

novinosrin · Posted 01-07-2018 11:02 PM

Yes, your understanding is correct. Good.

Ok, still just a minor change

data want;
set have2 ;
retain _flag 0 _line _n ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _: ;
run;

ilikesas · Posted 01-13-2018 07:45 PM

Thanks for the code.

I managed to include an additional "industry" variable:

data want;
set have2 ;
retain _flag 0 _line _n _line2 _n2 ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
_n2=n;
_line2=line;
end;
if __n=5 then do;
n=_n;
industry=line;
price=_line2;
line=_line;
output;
end;
drop _: ;
run;

so I guess that if I had more variables to create I would just repeat the process

novinosrin · Posted 01-16-2018 11:21 PM

@ilikesas Sorry mate!, I have been busy with school. Did it work? Are you able to progress and make the necessary changes?

selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Re: selecting observations based on their relative position to other observations

Registration is open

SAS Training: Just a Click Away