BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
applemonster
Fluorite | Level 6

I have a dataset that contains millions of records about the details of loans in China. I want to do a city-level panel data analysis where I need the daily increment of the number of loans in a city. My basic idea is to count the observations with the specified value of city names and dates. Hopefully, I want to generate a table that contains all the counts at one time where the city list is the first column, and the date list is the first row. I design some very fundamental codes like this:

proc sql;

select count(*) as N_obs
from rrd_come.combo_comprehensive_29
where borrower_officelocation in 'beijing' and project_opentime_cn=20200131;

quit;

But it can only output the daily increment of one city at one data, which is far below my requirements.

image.png

I want to generate a table of like this:

applemonster_0-1597556568217.png

So could any generous friends help me out?

Thanks a lot!

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

First, convert these mostly unusable numbers to SAS date values, which will aid greatly in further processing.

 

Next, create a table with number of loans per city and date by using GROUP BY:

proc sql;
create table want as
  select
    borrower_officelocation,
    project_opentime_cn,
    count(*) as N_obs
  from rrd_come.combo_comprehensive_29
  group by borrower_officelocation, project_opentime_cn
;
quit;

Then you can calculate the difference between dates in a data step with LAG().

View solution in original post

7 REPLIES 7
Kurt_Bremser
Super User

First, convert these mostly unusable numbers to SAS date values, which will aid greatly in further processing.

 

Next, create a table with number of loans per city and date by using GROUP BY:

proc sql;
create table want as
  select
    borrower_officelocation,
    project_opentime_cn,
    count(*) as N_obs
  from rrd_come.combo_comprehensive_29
  group by borrower_officelocation, project_opentime_cn
;
quit;

Then you can calculate the difference between dates in a data step with LAG().

applemonster
Fluorite | Level 6

It is a very inspiring way to figure it out! Thank you so much. 

After following your idea, I got a dataset with three columns: borrower_officelocation, project_opentime_cn, and N_obs;

applemonster_0-1597564138605.png

(Please igonre the Chinese characters...)

But could you please shed more light on how to transfer such a dataset into a format I desire like below one using data step and lag() function?

applemonster_1-1597564325838.png

Thank you so much!

Kurt_Bremser
Super User

So it seems you already have the number of new acquisitions per day, and we do not need to build the difference to the previous day.

Note that "wide" datasets are much harder to process than the "long" dataset you already have now.

Is this for reporting purposes?

applemonster
Fluorite | Level 6
I have known your meaning by using lag() function. I need to say that N_obs is already an incremental value instead of accumulative value, so it is unnecessary to do difference calculation again. I feel the next step is a brand-new issue and I'd better post another question. Thanks a lot!
applemonster
Fluorite | Level 6

It's not used for report purpose. I want to do panel data analysis of the impact of COVID19 on the lending and borrowing activities in China. Panel data analysis typically studies multiple areas over multiple periods. I want to treat loan number, average loan size and other similar variables of a city as explained variables, and handle daily new COVID19 cases and other control variables as explaining variables. So I think it is instinctively and visually better to adopt such a dataset format. But obviously, I am not a data scientist yet, so I hope that you can give me some constructive suggestions.

Kurt_Bremser
Super User

Then you can run PROC TRANSPOSE on the "want" dataset from my previous post:

proc transpose
  data=want
  out=want_wide
  prefix=date_ /* you can't start a variable name with a number */
;
by borrower_officelocation;
var n_obs;
id project_opentime_cn;
run;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 1001 views
  • 1 like
  • 2 in conversation