Hi,
Thank you for your help in advance!
I have a dataset of 6,263 obs. and 4,590 variables.
Each variable has only a few actual observations on top, and remaining observations are all missing.
I want to lag the values far down to their designated places, and that iteratively over all the variables.
For example, this is what my original dataset looks like:
data Original;
input ANALYS var1 var2 var3 ;
CARDS;
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
;;;;
But I want to make my dataset look like the below:
data WANT;
input ANALYS var1 var2 var3 ;
CARDS;
1 . . .
2 . . .
3 . 3 .
4 . . .
5 5 . .
6 . . .
7 . 7 7
8 . . .
9 . 9 9
10 . . 10
;;;;
I really appreciate your help, and life-saving!!
Sincerely,
KS -,
This is not a LAG problem.  It's more of a rearrange problem.    Although all of your sample data shows values that are carried forward to subsequent observations, I bet you will have some instances that will have to be carried backward forward as well.
If the values are actually consecutive integers as you show, then the program is a straightforward population of a matrix (1 row observation in original, and 1 col per VAR variable in original), followed by retrieval and output of the matrix:
data Original;
input ANALYS var1 var2 var3 ;
CARDS;
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
;;;;
/* Get number of rows and columns needed for the matrix */
data _null_;
  set original nobs=n_analys;
  call symput('n_analysts',cats(n_analys));
  array cusip_ids {*} var: ;
  call symput('n_cusips',cats(dim(cusip_ids)));
  stop;
run;
%put &=n_analysts;
%put &=n_cusips;
data want (drop=col row);
  array anls {&n_analysts} _temporary_;         /* List of ANALYS values */
  array t {&n_analysts,&n_cusips} _temporary_; /*2-way array, i.e. a 2-dim matrix */
  set original end=end_of_first_pass;
  anls{_n_}=analys;
  array _var {&n_cusips} var: ;
  /* From each VAR value, fill a cell in the corresponding row */
  do col=1 to dim(_var);
    if _var{col}^=. then t{_var{col},col}=_var{col};
  end;
  if end_of_first_pass;       /* One the input data is exhausted ... */
  do row=1 to dim(anls);
    analys=anls{row};
    do col=1 to &n_cusips;
      _var{col}=t{row,col};
    end;
    output;
  end;
run;
But I bet your values of ANALYS are not consecutive integers, right? If not, then the 2-way array will have to be replaced by something like a hash object:
data want (drop=v rc _:);
  set original end=end_of_first_pass;
  array _anls {&n_analysts} _temporary_ ; /* Temp store for all the ANALYS ids in ORIGINAL order */
  _anls{_n_}=analys;
  array vars  {&n_cusips} var: ;
  array _tmps {&n_cusips} ;               /* Temp Store for the current set of VAR values */
  if _n_=1 then do;  /*H is a lookup table keyed on ANALYS */
    declare hash h (dataset:'original(obs=0)');
      h.definekey('analys');
      h.definedata(all:'Y');
      h.definedone();
  end;
  do v=1 to dim(vars);
    _tmps{v}=vars{v};
  end;
  call missing(of vars{*});
  if h.check(key:analys)^=0 then h.add();   /*If this ANALYS not yet in h then put it in*/
  do v=1 to dim(vars);
    call missing(of vars{*});
    if _tmps{v}=. then continue;
    analys=_tmps{v};
    rc=h.find(key:analys);
    vars{v}=_tmps{v};
    h.replace();
  end;
  if end_of_first_pass then do v=1 to &n_analysts;
    analys=_anls{v};
    h.find();
    output;
  end;
run;
In fact is ANALYS even a numeric variable? If not then you will need to change the _ANLS numeric array to a character array.
Lagging 4000 + variables is likely to be fraught with all sorts of issues. If nothing adding tens, if not hundreds of thousands of temporary values to keep track of.
How about: Providing values that are not the actual "rows" that you want to place data on (there are solutions for that sort of data that probably would not work in a generic solution) and provide the rules for how we know that var1 from observation 1 is supposed to get to observation 5. Even with a "lag" approach we need to know all the rules for which variable values go where.
By any chance is the largish number of variables related to different times of data collection or related to the value?
Thank you for your quick reply!
I forgot to tell you about the rules of placing values.
The very first variable ANALYS is the benchmark, and the values of all other variables are supposed to align with the values of ANALYS. If the value of ANALYS is 9, then all 9's from other variables should line up there in that row..
This is actually WRDS global finance data, the analysts on the row, and company identifier (CUSIP) on the column.
I was trying to make a matrix for a network study.
Many thanks!
KS -,
Thank you for your reply!
The story about this dataset is that I wanted to transform a panel data of forecasting analysts and their firms into a matrix with a row of analysts and a column of firm identifiers.
From the SAS window, Original looks like this:
(It has 6,263 obs. and 4,589 variables)
After I disperse the values to their appropriate rows for all the variables, I am going to change the missing to 0, the non-missing to 1.
Many thanks!
KS -,
Oh, sorry, maybe I am mistaken..
Why some values are filled, whereas others are missing, is simply by chance (from the database).
ANALYS 1-10 are analyst ID. Var1-Var3 are stock1, stock2, and stock3.
So, the dataset is reporting that stock1 is taken care of by analyst 5, stock2 by analysts 3,7,9, stock3 by analysts 7,9,10.
What I want to do is to drag down the values to the right places corresponding to ANALYS.
ANALYS var1 var2 var3
      1         5      3       7
      2          .      7       9
      3          .      9      10
      4          .      .        .
      5          .      .         .
      6          .      .         .
      7          .      .         .
      8          .      .         .
      9          .      .         .
      10        .      .         .
Am I answering you question correctly?
Thanks,
KS -,
@KS99 wrote:
Oh, sorry, maybe I am mistaken..
Why some values are filled, whereas others are missing, is simply by chance (from the database).
ANALYS 1-10 are analyst ID. Var1-Var3 are stock1, stock2, and stock3.
So, the dataset is reporting that stock1 is taken care of by analyst 5, stock2 by analysts 3,7,9, stock3 by analysts 7,9,10.
What I want to do is to drag down the values to the right places corresponding to ANALYS.
ANALYS var1 var2 var3
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
Am I answering you question correctly?
Thanks,
KS -,
The only thing you need to do to make such a dataset usable is a transpose:
data Original;
input ANALYS var1 var2 var3 ;
CARDS;
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
;
proc transpose
  data=original
  out=want (
    drop=analys
    rename=(_name_=company col1=analys)
    where=(analys ne .)
  )
;
var var:;
by analys;
run;After that, you have a very streamlined matrix analyst vs. company, without wasting space for all those missing values. And it is easy to use, for whatever data you need to extract.
This is not a LAG problem.  It's more of a rearrange problem.    Although all of your sample data shows values that are carried forward to subsequent observations, I bet you will have some instances that will have to be carried backward forward as well.
If the values are actually consecutive integers as you show, then the program is a straightforward population of a matrix (1 row observation in original, and 1 col per VAR variable in original), followed by retrieval and output of the matrix:
data Original;
input ANALYS var1 var2 var3 ;
CARDS;
1 5 3 7
2 . 7 9
3 . 9 10
4 . . .
5 . . .
6 . . .
7 . . .
8 . . .
9 . . .
10 . . .
;;;;
/* Get number of rows and columns needed for the matrix */
data _null_;
  set original nobs=n_analys;
  call symput('n_analysts',cats(n_analys));
  array cusip_ids {*} var: ;
  call symput('n_cusips',cats(dim(cusip_ids)));
  stop;
run;
%put &=n_analysts;
%put &=n_cusips;
data want (drop=col row);
  array anls {&n_analysts} _temporary_;         /* List of ANALYS values */
  array t {&n_analysts,&n_cusips} _temporary_; /*2-way array, i.e. a 2-dim matrix */
  set original end=end_of_first_pass;
  anls{_n_}=analys;
  array _var {&n_cusips} var: ;
  /* From each VAR value, fill a cell in the corresponding row */
  do col=1 to dim(_var);
    if _var{col}^=. then t{_var{col},col}=_var{col};
  end;
  if end_of_first_pass;       /* One the input data is exhausted ... */
  do row=1 to dim(anls);
    analys=anls{row};
    do col=1 to &n_cusips;
      _var{col}=t{row,col};
    end;
    output;
  end;
run;
But I bet your values of ANALYS are not consecutive integers, right? If not, then the 2-way array will have to be replaced by something like a hash object:
data want (drop=v rc _:);
  set original end=end_of_first_pass;
  array _anls {&n_analysts} _temporary_ ; /* Temp store for all the ANALYS ids in ORIGINAL order */
  _anls{_n_}=analys;
  array vars  {&n_cusips} var: ;
  array _tmps {&n_cusips} ;               /* Temp Store for the current set of VAR values */
  if _n_=1 then do;  /*H is a lookup table keyed on ANALYS */
    declare hash h (dataset:'original(obs=0)');
      h.definekey('analys');
      h.definedata(all:'Y');
      h.definedone();
  end;
  do v=1 to dim(vars);
    _tmps{v}=vars{v};
  end;
  call missing(of vars{*});
  if h.check(key:analys)^=0 then h.add();   /*If this ANALYS not yet in h then put it in*/
  do v=1 to dim(vars);
    call missing(of vars{*});
    if _tmps{v}=. then continue;
    analys=_tmps{v};
    rc=h.find(key:analys);
    vars{v}=_tmps{v};
    h.replace();
  end;
  if end_of_first_pass then do v=1 to &n_analysts;
    analys=_anls{v};
    h.find();
    output;
  end;
run;
In fact is ANALYS even a numeric variable? If not then you will need to change the _ANLS numeric array to a character array.
Dear mkeintz,
Thank you so much for your help! I really appreciate it.
I ran the codes you provided, but there seems to be a small obstacle.
I got the following warning (bold) in the first part of your codes,
119 data _null_;
120 set Matrix_2016 nobs=n_analys;
121 call symput('n_analysts',cats(n_analys));
122 array cusip_ids {*} var: ;
WARNING: Defining an array with zero elements.
123 call symput('n_cusips',cats(dim(cusip_ids)));
124 stop;
125 run;
NOTE: There were 1 observations read from the data set WORK.MATRIX_2016.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
And running the latter part of your codes, I receives the following error message.
ERROR: Invalid dimension specification for array vars. The upper bound of an array dimension is smaller than its
corresponding lower bound.
ERROR: Too few variables defined for the dimension(s) specified for the array vars.
ERROR: Invalid dimension specification for array _tmps. The upper bound of an array dimension is smaller than its
corresponding lower bound. 
Can you help me a little bit more with this? I would greatly appreciate it,
Thank you !
Sincerely,
KS -,
@KS99 wrote:
Dear mkeintz,
Thank you so much for your help! I really appreciate it.
I ran the codes you provided, but there seems to be a small obstacle.
I got the following warning (bold) in the first part of your codes,
119 data _null_;
120 set Matrix_2016 nobs=n_analys;
121 call symput('n_analysts',cats(n_analys));
122 array cusip_ids {*} var: ;
WARNING: Defining an array with zero elements.
123 call symput('n_cusips',cats(dim(cusip_ids)));
124 stop;
125 run;NOTE: There were 1 observations read from the data set WORK.MATRIX_2016.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
And running the latter part of your codes, I receives the following error message.
ERROR: Invalid dimension specification for array vars. The upper bound of an array dimension is smaller than its
corresponding lower bound.
ERROR: Too few variables defined for the dimension(s) specified for the array vars.
ERROR: Invalid dimension specification for array _tmps. The upper bound of an array dimension is smaller than its
corresponding lower bound.
Can you help me a little bit more with this? I would greatly appreciate it,
Thank you !
Sincerely,
KS -,
You got the warning probably because you didn't provide us the actual names of your cusip variables. You used "VAR1 VAR2 VAR3", so I defined the array CUSIP_IDS as all the variables whose name begins with VAR (see line 122 where I used "var:" ). Apparently you have no variables whose name beging with "VAR", so SAS told you that the array has zero elements.
You have to provide the ARRAY statement with the list of names that you do want to rearrange. If you're lucky and they all start with the same letters (like CUS...), then you can specify the list as "cus:" - all vars whose name begins with "cus". But there are other ways to generate a list of variables names if this is not the case.
For instance, if the first cusip variable name was, say "apple_cusip" and the last was "xerox_cusip", and every variables between them were cusips of interest, you could generate the needed list using a double-dash, as in array cusip_ids {*} apple_cusip--xerox_cusip; But this assumes the cusip values are numeric, and cusip id's can have letters.
So what are the names, and variable types (character or numeric) of the 4,590 cusip variables?
Dear mkeintz,
Oh, thank you for pointing out what I am missing!
So, today, I struggled to attach a prefix "Var_" to all the variables, using a set of codes provided by SAS support.
/* Rename all the variables, adding prefix "Var_" */
proc sql noprint;
select cats(name,'=','Var_',name)
into :list
separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = 'Matrix_2016'; quit;
proc datasets library = work nolist;
modify Matrix_2016;
rename &list; run; quit; 
However, I ended up with another error message like this:
156 proc datasets library = work nolist;
157 modify Matrix_2016;
WARNING: Apparent symbolic reference LIST not resolved.
158 rename &list; run; 
I am sorry to take your precious time, but I really appreciate your help!
Sincerely,
KS -,
My email anticipated this possibility, but you missed it.
I repeat:
For instance, if the first cusip variable name was, say "apple_cusip" and the last was "xerox_cusip", and every variables between them were cusips of interest, you could generate the needed list using a double-dash, as in array cusip_ids {*} apple_cusip--xerox_cusip; But this assumes the cusip values are numeric, and cusip id's can have letters.
This means you don't have to rename the variables. If the cusip variables are all side-by-side in the datasets, then all you need to know is the name of the left-most and right-most variable, and then use a double-dash.
In your case, say (from another of your comments) the left-most variable is _00036020. You don't provide the right-most, so let's call it Z13456676. And again assuming all the variables between them are of interest, the array can be trivially constructed as
array cusip_ids {*}   _00036020 -- Z13456676 ;Dear mkeintz,
Now it works perfectly!
Thank you so much , and Happy New Year!
Sincerely,
KS -,
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
