Greetings all. I have become completely perplexed in trying to code a database of about 150,000 house sales into a repeat sales index. Each row has address, county, home age, number of beds and baths, and sold date. I have distilled the dataset into transaction pairs and need to create one line for each pair. For some homes, there are more than 1 pair e.g., a home sold in 2003, 2005, and 2009. The first pair is 2003 and 2005 and the second is 2005 and 2009. The data begins in the 3rd quarter of 2002 and ends with the 2nd quarter of 2013. Attached is a sample SAS dataset.
The final dataset should be, in each row: -1 in the column for the quarter and year the home first sold, 1 in the quarter/year column for the second sale, and 0 otherwise. I can use address (actually a compound key) as a unique identifier of the pairs. One other slight consideration is that for each pair I need to compute the natural log difference between selling price, which should be straightforward as I can subtract the lagged log(price). There is no need to retain any of the housing information in the final dataset b/c I will regress the price difference on the 48 new columns/variables.
Problems I have experienced include using an array but not each line will be used; lags but then brute force of 48 new columns twice; first.address does not pick up the second pair when a home sells more than twice; and I thought of SQL but it is not my strength.
Thank you in advance for any help!!
Darren
OK - with below tweak to the code you should get what you're after.
"is there a good method to compute the natural log price difference between the first and second sales".
Not really my area of expertise. And there is also no price data in your sample. It might be worth that you tell us what you actually try to achieve. There are a few experienced and skilful people here in this forum and it may well be that you don't have to create this variable array with start and stop flags to get to the outcome you're after.
/* determine earliest and latest solddate */
proc sql noprint;
  create table start_end_date as
    select 
      min(solddate) as start_date format=date9.
      , max(solddate) as end_date format=date9.
    from have;
quit;
/* create macro variable &var_list containing all quarter specific variable names */
data _null_;
  set start_end_date;
  format quarter date9.;
  length var_list $32767;
  _i=0;
  do while( quarter < intnx('quarter',end_date,0,'b') );
    quarter=intnx('quarter',start_date,_i,'b');
    var_list=strip(var_list)||' '||'quarter_'||put(quarter,yyq.);
    _i+1;
  end;
  call symputx('var_list',var_list);
  call symputx('start_date',start_date);
  call symputx('n_element',countw(var_list));
run;
/* populate quarter specific variables as required */
data want (drop=_:);
  retain property_id 1;
  set have;
  by age beds baths county reetnumber solddate;
  lag1_SOLDDATE=lag1(solddate);
  format lag1_SOLDDATE mmddyy10.;
  array qtrarr {*} &var_list (&n_element*0);
  retain qtrarr _lag2_n_arr;
  _n_arr=intck('quarter',"&start_date",solddate)+1;
  _lag1_n_arr=lag1(_n_arr);
  /* only output rows if there have been at least 2 sells for a property */
  if not first.reetnumber then
    do;
      qtrarr[_lag1_n_arr]=-1;
      qtrarr[_n_arr]=1;
if not missing(_lag2_n_arr) then qtrarr[_lag2_n_arr]=0;
      /* only output cases where property not sold twice in same quarter */
      if intck('quarter',lag1_SOLDDATE,solddate)>0 then output;
      _lag2_n_arr=_lag1_n_arr;
    end;
  /* prepare retained variables for iteration of next by group */
  if last.reetnumber then
    do;
      call missing(_lag2_n_arr);
      property_id+1;
      qtrarr[_lag1_n_arr]=0;
      qtrarr[_n_arr]=0;
    end;
run;
If this were my project the first thing I would do are to have a unique parcel or property identifier or barring that make sure that the address information is actually the same for each property.
You have addresses like 133 Jenny and 133 JENNY Lane, 12982 State Hwy 19 Highway, 12982 ST. Hwy 19 Highway and 12982 State Highway 19. So you can't actually use address as part of a compound key as the values differ too much for consistent matching. The case of letters can be taken care of easity enough but missing/present items suchs as E W N and S and other abbreviations are likely to take a bit of work to clean up.
Then sorting by property and sold date should let you use first. and last. processing with a counter and retain to id the first, second and subsequent sales to set the sale sequence code. And careful use of LAG or DIF functions should give you the differences in sale price.
I'm not sure what you mean by "48 new columns/variable". You may have left some of your logic out as I am not seeing a need for more than 2 additional variables and really can't tell where 48 would come from.
The address is not actually my key. The compound key is age, beds, baths, county, and reetnumber and it works. I only include the address for presentation. The 48 new columns come from the unique quarter (4) for 12 years
So what goes in those 48 columns? You mentioned a price differential and what I would call a sequence code but not for each quarter. It may be that you can use first/last processing and then transpose on the result to get the 48 added columns. Leave the Solddate alone and process with a format like YYQ.
The 48 new columns are the quarter/year from the 3rd quarter of 2002 to the 2nd quarter of 2012. The panel is going to be largely zeros. When the home sells the first time, the value in that row is "-1" and it must populate in the correct quarter/year column. On the same row, a value of "1" must be populated in the correct quarter and year column for the subsequent sale.
I know that address is not usable but for illustration purposes, say I can use address as the key
Address 1 Sold 02/16/2003: -1 in Q1_2003 and 0 in other 47 columns.
Address 1 Sold 05/17/2006: in the same row as the first sale, retain -1 in Q1_2003, insert 1 in Q2_2006, and 0 in other 46 columns.
This is relatively straightforward and can be accomplished with first.address and last.address and brute force copy and paste 48 if-then statements, lag them into 48 new columns, (lag log(price) at the same time), remove if lag(address) ne address and then use select-when to insert "1" using 48 new statements for the second occurrence. Some of this may be accomplished with a 2-dimensional array indexed as
do year = 2002 to 2013;
do quarter = 1 to 4;
I think the main issue with these methods is that they will do not account for a second pair for the same property (i.e., need to know the 2nd of the pair is also the 1st of the next pair). Continuing the example:
Address 1 Sold 10/24/2010: on a new row, -1 in Q2_2006, 1 in Q4_2010, and 0 in other 46 columns
Address 2 Sold mm/dd/yyyy: recognize this as a new home and repeat above procedure
Thanks!
Darren
If I understood your requirements correctly then below code should do what you're after:
proc sort data=sample(drop=y quarter) out=have;
  by age beds baths county reetnumber solddate;
run;
/* determine earliest and latest solddate */
proc sql noprint;
  create table start_end_date as
  select 
      min(solddate) as start_date format=date9.
    , max(solddate) as end_date format=date9.
    from have;
quit;
/* create macro variable &var_list containing all quarter specific variable names */
data _null_;
  set start_end_date;
  format quarter date9.;
  length var_list $32767;
  _i=0;
  do while(quarter<end_date);
    quarter=intnx('quarter',start_date,_i,'b');
    var_list=strip(var_list)||' '||'quarter_'||put(quarter,yyq.);
    _i+1;
  end;
  call symputx('var_list',var_list);
  call symputx('start_date',start_date);
run;
/* populate quarter specific variables as required */
data want (drop=_:);
  retain property_id;
  set have;
  by age beds baths county reetnumber solddate;
  array qtrarr {*} &var_list;
  retain qtrarr;
  if first.reetnumber then 
    do;
      property_id+1;
      do _i=1 to dim(qtrarr);
        qtrarr[_i]=0;
      end;
    end;
  _n_arr=intck('quarter',"&start_date",solddate)+1;
  if first.reetnumber then qtrarr[_n_arr]=-1;
  else qtrarr[_n_arr]=1;
run;
Thank you Patrick! Quite impressive and I don't think I would have gotten to this point any time soon. One quick question, is there a good method to compute the natural log price difference between the first and second sales?
Thanks again,
Darren
On further inspection, the results are not quite what I'm looking for and this is probably because I was unclear previously. Records 21, 22, and 23 in the sample file illustrate the issue. The sales occur 02/19/2004, 05/17/207, and 03/02/2010. I need 2 rows from these 3 observations. Using Patrick's syntax, the resulting table needs to have: first row has -1 in quarter_2004Q1, 1 in quarter_2007Q2, and 0 otherwise. The second row has -1 in quarter_2007Q2, 1 in quarter_2010Q1, and 0 otherwise.
The current code has in the first row: -1 in quarter_2004Q1 for that observation (record 21) and 0 otherwise.
Second row still has -1 in quarter_2004Q1 and then 1 in quarter_2007Q2 and 0 otherwise
/* this works and I could just remove the first instance, except */
Third row still has -1 in quarter_2004Q1, 1 in quarter_2007Q2 and now 1 in quarter_2010Q1 (0 otherwise).
I will be regressing the holding period return (log differences) on each row and one row can only have one occurrence of -1 and 1.
Thanks!
Darren
OK - with below tweak to the code you should get what you're after.
"is there a good method to compute the natural log price difference between the first and second sales".
Not really my area of expertise. And there is also no price data in your sample. It might be worth that you tell us what you actually try to achieve. There are a few experienced and skilful people here in this forum and it may well be that you don't have to create this variable array with start and stop flags to get to the outcome you're after.
/* determine earliest and latest solddate */
proc sql noprint;
  create table start_end_date as
    select 
      min(solddate) as start_date format=date9.
      , max(solddate) as end_date format=date9.
    from have;
quit;
/* create macro variable &var_list containing all quarter specific variable names */
data _null_;
  set start_end_date;
  format quarter date9.;
  length var_list $32767;
  _i=0;
  do while( quarter < intnx('quarter',end_date,0,'b') );
    quarter=intnx('quarter',start_date,_i,'b');
    var_list=strip(var_list)||' '||'quarter_'||put(quarter,yyq.);
    _i+1;
  end;
  call symputx('var_list',var_list);
  call symputx('start_date',start_date);
  call symputx('n_element',countw(var_list));
run;
/* populate quarter specific variables as required */
data want (drop=_:);
  retain property_id 1;
  set have;
  by age beds baths county reetnumber solddate;
  lag1_SOLDDATE=lag1(solddate);
  format lag1_SOLDDATE mmddyy10.;
  array qtrarr {*} &var_list (&n_element*0);
  retain qtrarr _lag2_n_arr;
  _n_arr=intck('quarter',"&start_date",solddate)+1;
  _lag1_n_arr=lag1(_n_arr);
  /* only output rows if there have been at least 2 sells for a property */
  if not first.reetnumber then
    do;
      qtrarr[_lag1_n_arr]=-1;
      qtrarr[_n_arr]=1;
if not missing(_lag2_n_arr) then qtrarr[_lag2_n_arr]=0;
      /* only output cases where property not sold twice in same quarter */
      if intck('quarter',lag1_SOLDDATE,solddate)>0 then output;
      _lag2_n_arr=_lag1_n_arr;
    end;
  /* prepare retained variables for iteration of next by group */
  if last.reetnumber then
    do;
      call missing(_lag2_n_arr);
      property_id+1;
      qtrarr[_lag1_n_arr]=0;
      qtrarr[_n_arr]=0;
    end;
run;
Thank you! It works and I took care of the log(price).
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
