Investigating the Economics of Wiretapping with SAS

SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:

Access Now

We all value our privacy and the thought of someone listening into our phone calls or reading our personal emails fills us with horror. Consequently, the right to privacy is regarded very highly, but that right isn’t absolute. Many US states and the Federal Government have the right to request a judge to authorise the interception of wire, oral or electronic communications. The Administrative Office of the United States Courts is required to report details of these requests, which they publish annually, and this is the data which we will be exploring in this edition of Free Data Friday.

Get the Data

You can download the data from The Administrative Office of the United States Courts web site in XLSX format.

Get Started with SAS OnDemand for Academics

In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:

Get Started

Getting the Data Ready

Whenever possible I prefer to use the SAS XLSX engine to open these sorts of files. However in this case, the header rows made that difficult because of the invalid variable names which would have been generated. Therefore, I opened the downloaded file in Excel, removed one of the hearer rows and used Proc Import to convert the file to a SAS data set.

filename tapfile '/folders/myshortcuts/Dropbox/wiretap.xlsx'; 
      
proc import datafile=tapfile dbms=xlsx out=wiretap replace;
	getnames=yes;
run;

Having done that, I ran the following data step to convert some of the variables from character to numeric, convert the jurisdiction name to a state name (I know the District of Columbia, Puerto Rico and the Virgin Islands aren’t states, but they provide data included in the reports) and drop the variables I wouldn’t be using.

data wiretap;
	length state $30;
	set wiretap(keep=jurisdiction offense number_of_days_in_operation var18
		where=(var18 ne "0"));
		
	if jurisdiction in ("DISTRICT OF COLUMBIA" "PUERTO RICO" "RHODE ISLAND"
		"VIRGIN ISLANDS") then state=propcase(jurisdiction);
	else if scan(jurisdiction,1) in ("NEW" "NORTH" "SOUTH" "WEST") then 
		state=propcase(cat(scan(jurisdiction,1)," ",scan(jurisdiction,2)));
	else state=propcase(scan(jurisdiction,1));
	
	cost = input(var18, 8.);
	drop var18;
	
	temp=input(number_of_days_in_operation, 8.);
	drop number_of_days_in_operation;
	rename temp=number_of_days_in_operation;	
run;

I ended up with a file looking like this:

The Results

Initially I wanted to calculate the cost per conviction of wiretaps, but rejected that option after checking the report. Many investigations don’t result in convictions in the year the wiretap is carried out, which makes it difficult to attribute convictions to wiretaps until some years after the event.

Instead I decided to calculate the total cost of wiretaps by state and the cost per day by state (the length of time for which wiretaps run varies enormously – generally speaking a judge will issue an order authorising interception for a 30 day period, which can then be extended in 30 day increments).

To do this, I ran a Proc Means statement followed by a data step to calculate the cost per day. You will see that I used the SAS divide function to carry out the division. This was because of the small number of cases where wiretap orders were obtained but never implemented, giving a zero value for the total_days variable. The divide function avoids divide by zero errors in the log and writes a special missing value to the data set. Finally, I ran a Proc Rank to pick out the top ten observations in each category and Proc SGPlot to create bar charts. Here is the code followed by the bar charts:

proc means data=wiretap n sum noprint;
	class state offense;
	output out=tapstats sum(cost)=total_cost sum(number_of_days_in_operation)=total_days;
	var cost;
run;

data costings;
	set tapstats;
	cost_per_day=divide(total_cost,total_days);
run;

proc rank data=costings(where=(_type_=2)) out=rank_state(where=(rankcost<=10 or rankdays<=10)) descending ties=high;
	by _type_;
	var total_cost total_days;
	ranks rankcost rankdays;
run;
	
ods graphics / reset imagemap;
title1 'Federal & State Wiretaps 2018';
title2 "Total Cost of Wiretaps by State";
title3 "Top Ten States";
footnote j=l "Data From https://www.uscourts.gov";

proc sgplot data=rank_state(where=(rankcost<=10));
	format total_cost dollar11.0;
	hbar state /  response=total_cost
		categoryorder=respdesc		
		dataskin=pressed fillattrs=(color=vpab)
		datalabel;
	xaxis label="Total Cost in US$" grid valuesformat=comma11.0;
	yaxis label="State" fitpolicy=none;
run;

ods graphics / reset imagemap;
title1 'Federal & State Wiretaps 2018';
title2 "Total Cost per Day by State";
title3 "Top Ten States";
footnote j=l " Data From https://www.uscourts.gov";


proc sgplot data=rank_state(where=(rankdays<=10));
	format cost_per_day dollar11.0;
	hbar state /  response=cost_per_day
		categoryorder=respdesc		
		dataskin=pressed fillattrs=(color=vpab)
		datalabel;
	xaxis label="Cost per Day in US$" grid valuesformat=comma11.0;;
	yaxis label="State" fitpolicy=none;
run;

The total cost by state results shows no real surprises with the three big states of New York, California and New Jersey occupying the top three spots. In the second chart (cost per day by state), Massachusetts tops the list as having the most expensive wiretaps. A Google search gave me a clue as to why this may be. Massachusetts law makes it very difficult to obtain authorisation for a wiretap; they can only be issued for a small number of specific serious crimes where organized crime is involved and where there is no other way of proceeding with an investigation. All of this means that, in effect, in Massachusetts wiretaps will only be carried out for the most difficult (and probably expensive) cases.

For the next part of my investigation, I went through the same process from the Proc Rank onwards but this time using total cost per offence and cost per day by offence. At this point, I should mention that many investigations involve multiple offence categories but in the reported data only the most serious offence is recorded.

proc rank data=costings(where=(_type_=1)) out=rank_offense
		(where=(rankcost<=10 or rankdays<=10))descending ties=high;
	by _type_;
	var total_cost total_days;
	ranks rankcost rankdays;
run;

ods graphics /reset imagemap;
title1 'Federal & State Wiretaps 2018';
title2 "Total Cost By Offense Category";
title3 "Top Ten Offenses";
footnote j=l " Data From https://www.uscourts.gov";

proc sgplot data=rank_offense(where=(rankcost<=10));
	format total_cost dollar11.0;
	hbar offense /  response=total_cost
		categoryorder=respdesc		
		dataskin=pressed fillattrs=(color=vpab)
		datalabel;
	xaxis label="Total Cost in US$" grid valuesformat=comma11.0;
	yaxis label="Offense" fitpolicy=none;
run;


ods graphics /reset imagemap;
title1 'Federal & State Wiretaps 2018';
title2 "Cost per Day by Offense Category";
title3 "Top Ten Offenses";
footnote j=l " Data From https://www.uscourts.gov";

proc sgplot data=rank_offense(where=(rankdays<=10));
	format cost_per_day dollar11.0;
	hbar offense /  response=cost_per_day
		categoryorder=respdesc		
		dataskin=pressed fillattrs=(color=vpab)
		datalabel;
	xaxis label="Cost per Day in US$" grid valuesformat=comma11.0;
	yaxis label="Offense" fitpolicy=none;
run;

The first bar chart is quite striking – by far the most money spent on wiretaps is spent on investigating narcotics offences. This is perhaps not surprising given the nature of the drugs business, involving as it does manufacture, distribution and sales involving many different people who will need to communicate amongst themselves, leaving them open to detection by electronic eavesdropping. Murder, by contrast, is usually a solitary activity with often little to no organisation required. However, the second bar chart has murder as the most expensive offence per day with narcotics coming in at sixth place. I confess it’s not entirely clear to me why this should be the case so if you have any ideas please leave a comment below. I’d love to hear your thoughts!

Now it's your Turn!

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

Visit [[this link]] to see all the Free Data Friday articles.

VDD · ‎08-02-2019

Very interesting.

JackHamilton · ‎08-03-2019

Interesting article, and I think one of the things you did in the program is worthy of its own article someday.

You used the DIVIDE function to eliminate "divide by zero" messages in the log. It also eliminates "missing values were generated" and "mathematical operations could not be performed" messages.

DIVIDE is an underused function. There is great benefit to eliminating those messages from the log in cases when you know problematic values might appear in your data. A large number of unneeded messages in the log makes it too easy to overlook real errors. It's more work, but better programming practice, to create only messages that actually need attention.

DIVIDE is a relatively new function. You can create a similar result with old-fashioned code like

if y = 0 or nmiss(x, y) then
    z = .;
else 
    z = x / y;

The result is similar but not identical to the DIVIDE function because DIVIDE might also return the special missing values .I, .M, and ._

In the olden days, we hardly ever encountered missing values other than . , and the test for missing values in a programs was a simple

if x = . then 
    /* do stuff */;

But with the introduction of the DIVIDE function, that no longer suffices. It's better to code

if x is missing then 
    /* do stuff */;

or

if nmiss(x) then 
    /* do stuff */;

In PROC FORMAT, where those tests aren't available, you have to specify all the missing values explicitly:

proc format;
    value missb
        other        = [best4.]
        ., .a-.z, ._ = 'missing';
run;

I wouldn't know without looking it up whether .A comes before or after . or ._, so it's easier for me to list all three missing values ranges. Looking it up, I see that I could have specified

proc format;
    value missb
        other        = [best4.]
        ._ - .Z      = 'missing';
run;

That's not an obvious order. The SAS missing value sort order is

._
.
.A-.Z

but even knowing the ASCII sorting sequence doesn't help; it's

<space>
.
A-Z
_

and the EBCDIC sort sequence is

<space>
.
_
A-Z

ChrisBrooks · ‎08-03-2019

Thanks Jack - I remember many decades ago when I started PC programming after working on mainframes I was using a language (long since defunct) in which the only way of avoiding divide by zero errors was to POKE a value into a specific memory address. Things have certainly changed.....

VDD · ‎08-03-2019

@ChrisBrooks the old days of peak and poke, while pushing the stack. how things have changed and trying to keep pace.