I have a dataset that contains millions of records about the details of loans in China. I want to do a city-level panel data analysis where I need the daily increment of the number of loans in a city. My basic idea is to count the observations with the specified value of city names and dates. Hopefully, I want to generate a table that contains all the counts at one time where the city list is the first column, and the date list is the first row. I design some very fundamental codes like this:
proc sql; select count(*) as N_obs from rrd_come.combo_comprehensive_29 where borrower_officelocation in 'beijing' and project_opentime_cn=20200131; quit;
But it can only output the daily increment of one city at one data, which is far below my requirements.
I want to generate a table of like this:
So could any generous friends help me out?
Thanks a lot!
First, convert these mostly unusable numbers to SAS date values, which will aid greatly in further processing.
Next, create a table with number of loans per city and date by using GROUP BY:
proc sql;
create table want as
select
borrower_officelocation,
project_opentime_cn,
count(*) as N_obs
from rrd_come.combo_comprehensive_29
group by borrower_officelocation, project_opentime_cn
;
quit;
Then you can calculate the difference between dates in a data step with LAG().
First, convert these mostly unusable numbers to SAS date values, which will aid greatly in further processing.
Next, create a table with number of loans per city and date by using GROUP BY:
proc sql;
create table want as
select
borrower_officelocation,
project_opentime_cn,
count(*) as N_obs
from rrd_come.combo_comprehensive_29
group by borrower_officelocation, project_opentime_cn
;
quit;
Then you can calculate the difference between dates in a data step with LAG().
It is a very inspiring way to figure it out! Thank you so much.
After following your idea, I got a dataset with three columns: borrower_officelocation, project_opentime_cn, and N_obs;
(Please igonre the Chinese characters...)
But could you please shed more light on how to transfer such a dataset into a format I desire like below one using data step and lag() function?
Thank you so much!
So it seems you already have the number of new acquisitions per day, and we do not need to build the difference to the previous day.
Note that "wide" datasets are much harder to process than the "long" dataset you already have now.
Is this for reporting purposes?
What you are after is a transpose, but this is usually only good for reporting purposes.
Are you trying to export this for human consumption (e.g. to Excel)?
It's not used for report purpose. I want to do panel data analysis of the impact of COVID19 on the lending and borrowing activities in China. Panel data analysis typically studies multiple areas over multiple periods. I want to treat loan number, average loan size and other similar variables of a city as explained variables, and handle daily new COVID19 cases and other control variables as explaining variables. So I think it is instinctively and visually better to adopt such a dataset format. But obviously, I am not a data scientist yet, so I hope that you can give me some constructive suggestions.
Then you can run PROC TRANSPOSE on the "want" dataset from my previous post:
proc transpose
data=want
out=want_wide
prefix=date_ /* you can't start a variable name with a number */
;
by borrower_officelocation;
var n_obs;
id project_opentime_cn;
run;
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.