selecting observations based on their relative position to other observations

Accepted Solution Solved
Reply
Super Contributor
Posts: 459
Accepted Solution

selecting observations based on their relative position to other observations

[ Edited ]

Hi,

 

suppose I have a table with a somewhat complex data structure like the following:

 

n line
34 text jdjdjf stock IBM
35 djdff 
36 25$

 

Whenever I have a line with the word "stock" in it, the actual price is always 2 lines below. So what I would like to have is the following structure:

n line price
34 text jdjdjf stock IBM 25$

 

Like this the stock and its price are on the same observations and its possible to make an analysis.

 

Thank you.


Accepted Solutions
Solution
‎01-21-2018 12:35 PM
PROC Star
Posts: 1,595

Re: selecting observations based on their relative position to other observations

Yes, your understanding is correct. Good.

Ok, still just a  minor change

 

data want;
set have2 ;
retain _flag 0 _line _n ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _: ;
run;

View solution in original post


All Replies
PROC Star
Posts: 1,218

Re: selecting observations based on their relative position to other observations

1) Why on earth do you have data structure like this? Smiley Happy

 

2) Where does "A" come from in your line variable instead of IBM?

Trusted Advisor
Posts: 1,312

Re: selecting observations based on their relative position to other observations

Use the @ character feature of the INPUT statement:

 

data want;
  infile datalines;
  input @ 'stock'  stock :$5. // price :$5.;
  put (_all_) (=);
datalines;
xxx
stock AAA
aaa
11$
xxx
xxx
stock BBBB
bbb
22$
run;

The slashes in the input statement tells it to advance one  line per slash.

 

Super Contributor
Posts: 459

Re: selecting observations based on their relative position to other observations

Hi mkeintz,

 

since my data is generated externally, lets assume that I already imported data equivalent to:

data have;
input line $ 1-10;  
datalines;
xxx
stock AAA
aaa
11$
xxx
xxx
stock BBBB
bbb
22$
run;

It is now that I need to do the process described in the question. I tried to guess based on your code by doing the following:

 

data want;
set have;
input @ 'stock' stock :$5. // price :$5.;
put (_all_) (=);

run;

 

But unsurprisingly it didn't work

 

Trusted Advisor
Posts: 1,312

Re: selecting observations based on their relative position to other observations

Input is for raw data input, not sas data input.   I'd suggest usings @draycut's solultion.

PROC Star
Posts: 1,218

Re: selecting observations based on their relative position to other observations

data have;
input n line $50.;
datalines;
34	text jdjdjf stock IBM
35	djdff 
36	25$
37	text jdjdjf stock IBM
38	djdff 
39	25$
40	text jdjdjf stock IBM
41	djdff 
42	25$
;

data want;
   merge have have(firstobs=3 rename=(line=price));
   if find(line, "stock")>0 then output;
run;
Super Contributor
Posts: 459

Re: selecting observations based on their relative position to other observations

Hi draycut,

 

thanks for the code, worked perfectly!

 

If possible to expand the code, in my case I have more variables of interest. For example, the 4th line after the line that contains the word "stock" I have its industry:

 

data have2;
input n line $50.;
datalines;
30  doidn
31  fifeniofe
32  doewiij
33  fofugrugr
34	text jdjdjf stock IBM
35	djdff 
36	25$
37  fg fgg
38	computers
39	text jdjdjf stock AAA
40	djdff 
41	22$
42  fggh
43	food
44	text jdjdjf stock BBB
45	djdff 
46	55$
47  duehuefhu
48	automotive
49  dfeoufeijer
50  ierhde43848
;
run;

Using your code I did the following:

 

data want2;
   merge have2 have2(firstobs=3 rename=(line=price)) have2(firstobs=5 rename=(line=industry));
   if find(line, "stock")>0 then output;
run;

And also got the result that I wanted!

 

I would like to ask for your opinion on the following:

 

1) In my data I have tens to hundreds of thousands of lines (most of them useless) and 5 to 7 variables that I need to extract, so this would mean merging 5 to 7 times the same data table just like above. Is this a good approach, i.e, could there be efficiency or some other problems (like the computer freezing for example)?

 

2) Recently I worked on some data where I also had to restructure it into a more analysis-friendly form using Excel VBA. Looking at my data I would perhaps do the following:

 

Loop through my original data until I find the word "stock". Once I find it, remember the cells row r and paste the cell's content into

a new spreadsheet in cell (1,2) (the first column in the new sheet is the stock, second column price etc).

In the original sheet the contents of the cell with row = r+2 are the stock's price and so they will be pasted into the new sheet's cell (2,2), and so on for the other variables...

And when I reach a new cell with the word "stock" I go to the next line in the new sheet.

It seems, at least to me, that using cells is easier conceptually (because I couldn't figure out how to do the above mentioned data restructuring using SAS), so is it possible to "cellulize" data in SAS? Is it even worth it?

 

Thanks 

 

 

 

 

 

 

 

 

 

 

PROC Star
Posts: 1,595

Re: selecting observations based on their relative position to other observations

@ilikesas  Good afternoon, IMO unless your sample pattern resembles exactly your real, I'd think the merge might not suffice. You prolly will need a sound method at some point to make it squeaky clean to handle varying patterns. 

PROC Star
Posts: 1,595

Re: selecting observations based on their relative position to other observations

data have2;

input n line $50.;

datalines;

30  doidn

31  fifeniofe

32  doewiij

33  fofugrugr

34    text jdjdjf stock IBM

35    djdff

36    25$

37  fg fgg

38    computers

39    text jdjdjf stock AAA

40    djdff

41    22$

42  fggh

43    food

44    text jdjdjf stock BBB

45    djdff

46    55$

47  duehuefhu

48    automotive

49  dfeoufeijer

50  ierhde43848

;

run;

 

data _null_;

if _N_ = 1 then do;

        if 0 then set have2;

        length price $10;

      declare hash h( );

      h.defineKey('_n','_line');

      h.defineData('_n','_line','price');

      h.defineDone();

        call missing(price);

   end;

do until(last);

      set have2 end=last;

      if find(line, "stock")>0 then do;

                  _n=n;

                  _line=line;

                  h.add();

            end;

      else if index(line,'$')>0 and  h.check()=0 then do;

            price=line;

            h.replace();

      end;

end;

if last then h.output(dataset:'want');

run;

Super User
Posts: 23,337

Re: selecting observations based on their relative position to other observations

You've been thinking in Excel terms, not SAS terms. 

 

You can use RETAIN here to get what you want as well. I'm assuming you're scraping data from either a webpage or copy/pasted out of a report/PDF. There's often ways to move that data to get it in a better format in the first place but I'll consider that irrelevant for the moment. 

 

If you use RETAIN, you can trigger the write to your SAS data set once you find a new stock symbol. 

 

I don't see how you know which is a new symbol yet, but it seems relatively straightforward to me. 

 

data want; 
set have;
retain stock price <other variables>;

if <stock> then do; 
output;  *output values currently stored;
call missing(stock, price, <other variables>; *reset all values to missing to prevent carry over of values;
stock=value;   *set to new stock ticker;
end;

else if ... then price=value;
else if ... then <other_variables>=value; *since these are all retained, you get a single record for each 'stock' at the end;



run;

 

 

PROC Star
Posts: 1,595

Re: selecting observations based on their relative position to other observations

Reeza is right, yeah a simple retain does it like-

 


data want;
set have2 end=last;
retain _line _n;
if find(line, "stock")>0 then do;
_n=n;
_line=line;
end;
else if index(line,'$')>0 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _:;
run;

Super Contributor
Posts: 459

Re: selecting observations based on their relative position to other observations

Posted in reply to novinosrin

Hi novinosrin,

 

for the line in your code: 

 

else if index(line,'$')>0 then do;

 

if I understand correctly the code looks for the symbol "$", but in my case I have many useless lines that contain the symbol "$", so it is really the relative position of the lines from the line that contains the word "stock" that are important

Solution
‎01-21-2018 12:35 PM
PROC Star
Posts: 1,595

Re: selecting observations based on their relative position to other observations

Yes, your understanding is correct. Good.

Ok, still just a  minor change

 

data want;
set have2 ;
retain _flag 0 _line _n ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
n=_n;
price=line;
line=_line;
output;
end;
drop _: ;
run;

Super Contributor
Posts: 459

Re: selecting observations based on their relative position to other observations

Posted in reply to novinosrin

Thanks for the code.

I managed to include an additional "industry" variable:

 

data want;
set have2 ;
retain _flag 0 _line _n _line2 _n2 ;
if find(line, "stock")>0 then do;
__n=0;
_flag=1;
_n=n;
_line=line;
end;
if _flag then __n+1;
if __n=3 then do;
_n2=n;
_line2=line;
end;
if __n=5 then do;
n=_n;
industry=line;
price=_line2;
line=_line;
output;
end;
drop _: ;
run;

so I guess that if I had more variables to create I would just repeat the process

PROC Star
Posts: 1,595

Re: selecting observations based on their relative position to other observations

@ilikesas Sorry mate!, I have been busy with school. Did it work? Are you able to progress and make the necessary changes?

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 318 views
  • 8 likes
  • 5 in conversation