Solved: Re: Exporting a table of aggregate outcomes using sas proc sql

applemonster · Posted 08-16-2020 01:47 AM

I have a dataset that contains millions of records about the details of loans in China. I want to do a city-level panel data analysis where I need the daily increment of the number of loans in a city. My basic idea is to count the observations with the specified value of city names and dates. Hopefully, I want to generate a table that contains all the counts at one time where the city list is the first column, and the date list is the first row. I design some very fundamental codes like this:

proc sql;

select count(*) as N_obs
from rrd_come.combo_comprehensive_29
where borrower_officelocation in 'beijing' and project_opentime_cn=20200131;

quit;

But it can only output the daily increment of one city at one data, which is far below my requirements.

I want to generate a table of like this:

So could any generous friends help me out?

Thanks a lot!

Kurt_Bremser · Posted 08-16-2020 02:31 AM

First, convert these mostly unusable numbers to SAS date values, which will aid greatly in further processing.

Next, create a table with number of loans per city and date by using GROUP BY:

proc sql;
create table want as
  select
    borrower_officelocation,
    project_opentime_cn,
    count(*) as N_obs
  from rrd_come.combo_comprehensive_29
  group by borrower_officelocation, project_opentime_cn
;
quit;

Then you can calculate the difference between dates in a data step with LAG().

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 08-16-2020 02:31 AM

First, convert these mostly unusable numbers to SAS date values, which will aid greatly in further processing.

Next, create a table with number of loans per city and date by using GROUP BY:

proc sql;
create table want as
  select
    borrower_officelocation,
    project_opentime_cn,
    count(*) as N_obs
  from rrd_come.combo_comprehensive_29
  group by borrower_officelocation, project_opentime_cn
;
quit;

Then you can calculate the difference between dates in a data step with LAG().

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

applemonster · Posted 08-16-2020 03:52 AM

It is a very inspiring way to figure it out! Thank you so much.

After following your idea, I got a dataset with three columns: borrower_officelocation, project_opentime_cn, and N_obs;

(Please igonre the Chinese characters...)

But could you please shed more light on how to transfer such a dataset into a format I desire like below one using data step and lag() function?

Thank you so much!

Kurt_Bremser · Posted 08-16-2020 04:15 AM

So it seems you already have the number of new acquisitions per day, and we do not need to build the difference to the previous day.

Note that "wide" datasets are much harder to process than the "long" dataset you already have now.

Is this for reporting purposes?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

applemonster · Posted 08-16-2020 04:15 AM

I have known your meaning by using lag() function. I need to say that N_obs is already an incremental value instead of accumulative value, so it is unnecessary to do difference calculation again. I feel the next step is a brand-new issue and I'd better post another question. Thanks a lot!

Kurt_Bremser · Posted 08-16-2020 04:23 AM

What you are after is a transpose, but this is usually only good for reporting purposes.

Are you trying to export this for human consumption (e.g. to Excel)?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

applemonster · Posted 08-16-2020 06:06 AM

It's not used for report purpose. I want to do panel data analysis of the impact of COVID19 on the lending and borrowing activities in China. Panel data analysis typically studies multiple areas over multiple periods. I want to treat loan number, average loan size and other similar variables of a city as explained variables, and handle daily new COVID19 cases and other control variables as explaining variables. So I think it is instinctively and visually better to adopt such a dataset format. But obviously, I am not a data scientist yet, so I hope that you can give me some constructive suggestions.

Kurt_Bremser · Posted 08-16-2020 07:17 AM

Then you can run PROC TRANSPOSE on the "want" dataset from my previous post:

proc transpose
  data=want
  out=want_wide
  prefix=date_ /* you can't start a variable name with a number */
;
by borrower_officelocation;
var n_obs;
id project_opentime_cn;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS Innovate 2025: Call for Content