BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
KS99
Obsidian | Level 7

Hi, 

 

Thank you for your help in advance! 

 

I have a dataset of 6,263 obs. and 4,590 variables. 

Each variable has only a few actual observations on top, and remaining observations are all missing. 

I want to lag the values far down to their designated places, and that iteratively over all the variables. 

 

For example, this is what my original dataset looks like: 


data Original;
input ANALYS var1 var2 var3 ;
CARDS;
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
;;;;

 

But I want to make my dataset look like the below: 


data WANT;
input ANALYS var1 var2 var3 ;
CARDS;
1 . . .
2 . . .
3 . 3 .
4 . . .
5 5 . .
6 . . .
7 . 7 7
8 . . .
9 . 9 9
10 . . 10
;;;;

 

 

I really appreciate your help, and life-saving!! 

 

Sincerely, 

KS -, 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

This is not a LAG problem.  It's more of a rearrange problem.    Although all of your sample data shows values that are carried forward to subsequent observations, I bet you will have some instances that will have to be carried backward forward as well.

 

If the values are actually consecutive integers as you show, then the program is a straightforward population of a matrix (1 row observation in original, and 1 col per VAR variable in original), followed by retrieval and output of the matrix:

 

data Original;
input ANALYS var1 var2 var3 ;
CARDS;
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
;;;;

/* Get number of rows and columns needed for the matrix */
data _null_;
  set original nobs=n_analys;
  call symput('n_analysts',cats(n_analys));
  array cusip_ids {*} var: ;
  call symput('n_cusips',cats(dim(cusip_ids)));
  stop;
run;
%put &=n_analysts;
%put &=n_cusips;

data want (drop=col row);
  array anls {&n_analysts} _temporary_;         /* List of ANALYS values */
  array t {&n_analysts,&n_cusips} _temporary_; /*2-way array, i.e. a 2-dim matrix */

  set original end=end_of_first_pass;
  anls{_n_}=analys;
  array _var {&n_cusips} var: ;
  /* From each VAR value, fill a cell in the corresponding row */
  do col=1 to dim(_var);
    if _var{col}^=. then t{_var{col},col}=_var{col};
  end;

  if end_of_first_pass;       /* One the input data is exhausted ... */
  do row=1 to dim(anls);
    analys=anls{row};
    do col=1 to &n_cusips;
      _var{col}=t{row,col};
    end;
    output;
  end;
run;

 

But I bet your values of ANALYS are not consecutive integers, right? If not, then the 2-way array will have to be replaced by something like a hash object:

 

data want (drop=v rc _:);
  set original end=end_of_first_pass;

  array _anls {&n_analysts} _temporary_ ; /* Temp store for all the ANALYS ids in ORIGINAL order */
  _anls{_n_}=analys;

  array vars  {&n_cusips} var: ;
  array _tmps {&n_cusips} ;               /* Temp Store for the current set of VAR values */

  if _n_=1 then do;  /*H is a lookup table keyed on ANALYS */
    declare hash h (dataset:'original(obs=0)');
      h.definekey('analys');
      h.definedata(all:'Y');
      h.definedone();
  end;

  do v=1 to dim(vars);
    _tmps{v}=vars{v};
  end;
  call missing(of vars{*});

  if h.check(key:analys)^=0 then h.add();   /*If this ANALYS not yet in h then put it in*/

  do v=1 to dim(vars);
    call missing(of vars{*});
    if _tmps{v}=. then continue;
    analys=_tmps{v};
    rc=h.find(key:analys);
    vars{v}=_tmps{v};
    h.replace();
  end;

  if end_of_first_pass then do v=1 to &n_analysts;
    analys=_anls{v};
    h.find();
    output;
  end;
run;

In fact is ANALYS even a numeric variable?  If not then you will need to change the _ANLS numeric array to a character array.

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

23 REPLIES 23
ballardw
Super User

Lagging 4000 + variables is likely to be fraught with all sorts of issues. If nothing adding tens, if not hundreds of thousands of temporary values to keep track of.

 

How about: Providing values that are not the actual "rows" that you want to place data on (there are solutions for that sort of data that probably would not work in a generic solution) and provide the rules for how we know that var1 from observation 1 is supposed to get to observation 5. Even with a "lag" approach we need to know all the rules for which variable values go where.

 

By any chance is the largish number of variables related to different times of data collection or related to the value?

KS99
Obsidian | Level 7

Thank you for your quick reply! 

 

I forgot to tell you about the rules of placing values. 

The very first variable ANALYS is the benchmark, and the values of all other variables are supposed to align with the values of ANALYS. If the value of ANALYS is 9, then all 9's from other variables should line up there in that row.. 

 

This is actually WRDS global finance data, the analysts on the row, and company identifier (CUSIP) on the column. 

I was trying to make a matrix for a network study. 

 

Many thanks! 

 

KS -, 

 

Reeza
Super User
From your Original to Want, please explain the logic between ANALYS 4 to 10 and why rows 7, 9 and 10 are filled in but the remainders are not. There's no logic to that that I can see.
KS99
Obsidian | Level 7

Thank you for your reply! 

 

The story about this dataset is that I wanted to transform a panel data of forecasting analysts and their firms into a matrix with a row of analysts and a column of firm identifiers. 

From the SAS window, Original looks like this:

(It has 6,263 obs. and 4,589 variables) 

KS99_0-1611268430499.png

After I disperse the values to their appropriate rows for all the variables, I am going to change the missing to 0, the non-missing to 1. 

 

Many thanks! 

KS -, 

 

Reeza
Super User
I'm sorry but that doesn't answer the questions asked.
KS99
Obsidian | Level 7

Oh, sorry, maybe I am mistaken.. 

 

Why some values are filled, whereas others are missing, is simply by chance (from the database).  

 

ANALYS 1-10 are analyst ID. Var1-Var3 are stock1, stock2, and stock3. 

So, the dataset is reporting that stock1 is taken care of by analyst 5, stock2 by analysts 3,7,9, stock3 by analysts 7,9,10. 

What I want to do is to drag down the values to the right places corresponding to ANALYS. 

 

ANALYS var1 var2 var3
      1         5      3       7
      2          .      7       9
      3          .      9      10
      4          .      .        .
      5          .      .         .
      6          .      .         .
      7          .      .         .
      8          .      .         .
      9          .      .         .
      10        .      .         .

 

Am I answering you question correctly? 

 

Thanks, 

KS -, 

Kurt_Bremser
Super User

@KS99 wrote:

Oh, sorry, maybe I am mistaken.. 

 

Why some values are filled, whereas others are missing, is simply by chance (from the database).  

 

ANALYS 1-10 are analyst ID. Var1-Var3 are stock1, stock2, and stock3. 

So, the dataset is reporting that stock1 is taken care of by analyst 5, stock2 by analysts 3,7,9, stock3 by analysts 7,9,10. 

What I want to do is to drag down the values to the right places corresponding to ANALYS. 

 

ANALYS var1 var2 var3
      1         5      3       7
      2          .      7       9
      3          .      9      10
      4          .      .        .
      5          .      .         .
      6          .      .         .
      7          .      .         .
      8          .      .         .
      9          .      .         .
      10        .      .         .

 

Am I answering you question correctly? 

 

Thanks, 

KS -, 


The only thing you need to do to make such a dataset usable is a transpose:

data Original;
input ANALYS var1 var2 var3 ;
CARDS;
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
;

proc transpose
  data=original
  out=want (
    drop=analys
    rename=(_name_=company col1=analys)
    where=(analys ne .)
  )
;
var var:;
by analys;
run;

After that, you have a very streamlined matrix analyst vs. company, without wasting space for all those missing values. And it is easy to use, for whatever data you need to extract.

mkeintz
PROC Star

This is not a LAG problem.  It's more of a rearrange problem.    Although all of your sample data shows values that are carried forward to subsequent observations, I bet you will have some instances that will have to be carried backward forward as well.

 

If the values are actually consecutive integers as you show, then the program is a straightforward population of a matrix (1 row observation in original, and 1 col per VAR variable in original), followed by retrieval and output of the matrix:

 

data Original;
input ANALYS var1 var2 var3 ;
CARDS;
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
;;;;

/* Get number of rows and columns needed for the matrix */
data _null_;
  set original nobs=n_analys;
  call symput('n_analysts',cats(n_analys));
  array cusip_ids {*} var: ;
  call symput('n_cusips',cats(dim(cusip_ids)));
  stop;
run;
%put &=n_analysts;
%put &=n_cusips;

data want (drop=col row);
  array anls {&n_analysts} _temporary_;         /* List of ANALYS values */
  array t {&n_analysts,&n_cusips} _temporary_; /*2-way array, i.e. a 2-dim matrix */

  set original end=end_of_first_pass;
  anls{_n_}=analys;
  array _var {&n_cusips} var: ;
  /* From each VAR value, fill a cell in the corresponding row */
  do col=1 to dim(_var);
    if _var{col}^=. then t{_var{col},col}=_var{col};
  end;

  if end_of_first_pass;       /* One the input data is exhausted ... */
  do row=1 to dim(anls);
    analys=anls{row};
    do col=1 to &n_cusips;
      _var{col}=t{row,col};
    end;
    output;
  end;
run;

 

But I bet your values of ANALYS are not consecutive integers, right? If not, then the 2-way array will have to be replaced by something like a hash object:

 

data want (drop=v rc _:);
  set original end=end_of_first_pass;

  array _anls {&n_analysts} _temporary_ ; /* Temp store for all the ANALYS ids in ORIGINAL order */
  _anls{_n_}=analys;

  array vars  {&n_cusips} var: ;
  array _tmps {&n_cusips} ;               /* Temp Store for the current set of VAR values */

  if _n_=1 then do;  /*H is a lookup table keyed on ANALYS */
    declare hash h (dataset:'original(obs=0)');
      h.definekey('analys');
      h.definedata(all:'Y');
      h.definedone();
  end;

  do v=1 to dim(vars);
    _tmps{v}=vars{v};
  end;
  call missing(of vars{*});

  if h.check(key:analys)^=0 then h.add();   /*If this ANALYS not yet in h then put it in*/

  do v=1 to dim(vars);
    call missing(of vars{*});
    if _tmps{v}=. then continue;
    analys=_tmps{v};
    rc=h.find(key:analys);
    vars{v}=_tmps{v};
    h.replace();
  end;

  if end_of_first_pass then do v=1 to &n_analysts;
    analys=_anls{v};
    h.find();
    output;
  end;
run;

In fact is ANALYS even a numeric variable?  If not then you will need to change the _ANLS numeric array to a character array.

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
KS99
Obsidian | Level 7

Dear mkeintz, 

 

Thank you so much for your help! I really appreciate it. 

 

I ran the codes you provided, but there seems to be a small obstacle. 

I got the following warning (bold) in the first part of your codes, 

 

119 data _null_;
120 set Matrix_2016 nobs=n_analys;
121 call symput('n_analysts',cats(n_analys));
122 array cusip_ids {*} var: ;
WARNING: Defining an array with zero elements.
123 call symput('n_cusips',cats(dim(cusip_ids)));
124 stop;
125 run;

NOTE: There were 1 observations read from the data set WORK.MATRIX_2016.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

 

And running the latter part of your codes, I receives the following error message. 

ERROR: Invalid dimension specification for array vars. The upper bound of an array dimension is smaller than its
corresponding lower bound.
ERROR: Too few variables defined for the dimension(s) specified for the array vars.
ERROR: Invalid dimension specification for array _tmps. The upper bound of an array dimension is smaller than its
corresponding lower bound. 

 

Can you help me a little bit more with this? I would greatly appreciate it,

 

Thank you !

 

Sincerely, 

KS -, 

 

mkeintz
PROC Star

@KS99 wrote:

Dear mkeintz, 

 

Thank you so much for your help! I really appreciate it. 

 

I ran the codes you provided, but there seems to be a small obstacle. 

I got the following warning (bold) in the first part of your codes, 

 

119 data _null_;
120 set Matrix_2016 nobs=n_analys;
121 call symput('n_analysts',cats(n_analys));
122 array cusip_ids {*} var: ;
WARNING: Defining an array with zero elements.
123 call symput('n_cusips',cats(dim(cusip_ids)));
124 stop;
125 run;

NOTE: There were 1 observations read from the data set WORK.MATRIX_2016.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

 

And running the latter part of your codes, I receives the following error message. 

ERROR: Invalid dimension specification for array vars. The upper bound of an array dimension is smaller than its
corresponding lower bound.
ERROR: Too few variables defined for the dimension(s) specified for the array vars.
ERROR: Invalid dimension specification for array _tmps. The upper bound of an array dimension is smaller than its
corresponding lower bound. 

 

Can you help me a little bit more with this? I would greatly appreciate it,

 

Thank you !

 

Sincerely, 

KS -, 

 


You got the warning probably because you didn't provide us the actual names of your cusip variables.  You used "VAR1 VAR2 VAR3", so I defined the array CUSIP_IDS as all the variables whose name begins with VAR  (see line 122 where I used "var:" ). Apparently you have no variables whose name beging with "VAR", so SAS told you that the array has zero elements.

 

You have to provide the ARRAY statement with the list of names that you do want to rearrange.  If you're lucky and they all start with the same letters (like CUS...), then you can specify the list  as "cus:" - all vars whose name begins with "cus".    But there are other ways to generate a list of variables names if this is not the case.

 

For instance, if the first cusip variable name was, say "apple_cusip"  and the last was "xerox_cusip", and every variables between them were cusips of interest, you could generate the needed list using a double-dash, as in  array cusip_ids {*} apple_cusip--xerox_cusip;  But this assumes the cusip values are numeric, and cusip id's can have letters.

 

So what are the names, and variable types (character or numeric) of the 4,590 cusip variables?

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
KS99
Obsidian | Level 7

Dear mkeintz, 

 

Oh, thank you for pointing out what I am missing! 

So, today, I struggled to attach a prefix "Var_" to all the variables, using a set of codes provided by SAS support.  

 

/* Rename all the variables, adding prefix "Var_" */
proc sql noprint;
select cats(name,'=','Var_',name)
into :list
separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = 'Matrix_2016'; quit;
proc datasets library = work nolist;
modify Matrix_2016;
rename &list; run; quit; 

 

However, I ended up with another error message like this: 

 

156 proc datasets library = work nolist;
157 modify Matrix_2016;
WARNING: Apparent symbolic reference LIST not resolved.
158 rename &list; run; 

I am sorry to take your precious time, but I really appreciate your help! 

 

Sincerely, 

KS -, 

 

mkeintz
PROC Star

My email anticipated this possibility, but you missed it.

 

I repeat: 

 

For instance, if the first cusip variable name was, say "apple_cusip"  and the last was "xerox_cusip", and every variables between them were cusips of interest, you could generate the needed list using a double-dash, as in array cusip_ids {*} apple_cusip--xerox_cusip;  But this assumes the cusip values are numeric, and cusip id's can have letters.

This means you don't have to rename the variables.   If the cusip variables are all side-by-side in the datasets, then all you need to know is the name of the left-most and right-most variable, and then use a double-dash.

 

In your case, say (from another of your comments) the left-most variable is _00036020.  You don't provide the right-most, so let's call it    Z13456676.  And again assuming all the variables between them are of interest, the array can be trivially constructed as 

array cusip_ids {*}   _00036020 -- Z13456676 ;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
KS99
Obsidian | Level 7

Dear mkeintz, 

 

Now it works perfectly! 

Thank you so much , and Happy New Year! 

 

Sincerely, 

KS -, 

 

KS99
Obsidian | Level 7
Dear mkeintz

I have an additional question.

I ran your codes, and it worked well.
Except one thing: I discovered that in my end-results all variables starting with _(number) disappeared!
Maybe should I change the variable names? Or, is there is cure for this?

Many thanks,
KS Choi -,

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 23 replies
  • 1252 views
  • 6 likes
  • 8 in conversation