Introducing Bedfellows
Under the Hood of a New Python Library for Detailing the Relationship between PAC Contributors and Recipients
Political Action Committees (PACs) in the US must report every donation made to another federal committee to the Federal Election Commission, yet the nature of the relationship between PAC contributors and recipients can be obscure. It is no easy feat to make the jump from the millions of entries in the FEC data to the story told by the contribution history associated with a contributor-recipient pair. A descriptive snapshot of the pair’s contribution history would go a long way towards improving accountability of political committee contributions. That’s where Bedfellows comes in.
To provide a measure of the dynamics of PAC contributions at the level of contributor-recipient pairs, The Upshot’s Derek Willis and I envisioned a score that models contributions at that relationship level. The model could be defined any number of ways, but we settled on a score between 0 and 1 assigned to every possible contributor-recipient pair, with 0 signifying that contributor has no association whatsoever with recipient, and a 1 signifying that contributor and recipient are more closely related than any other pair.
Bedfellows is a command-line tool that calculates scores for the donor-recipient relationship and provides a similarity score so users can see donors, recipients and pairs that are most like each other. It is meant to be run locally for data exploration; it is not currently optimized for use as a web application.
We cannot map all the information associated with a contributor-recipient pair into a decimal number between 0 and 1 without first defining how exactly to measure the strength of the affinity of contributor-recipient pairs. Now, we are not aware of existing research on the topic of how to measure the strength of the relationship between PAC contributors and recipients based on campaign finance data. (The work of Professor Adam Bonica from Stanford University, which establishes a model for measuring ideological affinity, is similar to but not equivalent to the Bedfellows model. We are not concerned with drawing ideological divides; rather, we seek to reveal allegiances evidenced by campaign donations, regardless of political ideology.)
Given the lack of existing scholarship on the topic, we had no option but to devise our own metrics for quantifying the relationship between contributors and recipients. These definitions are essentially editorial: As journalists, we rely on our knowledge of the beat to decide which metrics to focus on. What follows is an account of the decision-making process that led to the computation of relationship scores and similarity scores.
Data, Tools & Initial Setup
The campaign finance data we use is an enhanced version of three files made available by the FEC, listing committees, candidates, and committee-to-committee transactions (the “itoth” file, in FEC terminology).
Our tools of choice are the open-source relational database MySQL and the Python library MySQLdb. For the sake of convenience, we encapsulate all queries used to compute the scores into a Python script that connects with the database through MySQLdb. Code and starter files are available on GitHub along with usage instructions.
The Python scripts assume that the database we’re using already contains tables fec_committee_contributions, fec_committees and fec_candidates.
Before we start querying the database, we tailor the data to our needs in functions initial_setup and setup_initial_indexes. To do so, we first add indexes to table fec_committee_contributions and then subset the table based on a specific kind of donation. We are interested in committee-to-committee donations, where committees can be PACs, candidate committees, or party committees.
To narrow down the data to committee-to-committee donations, we adopt the following constraints:
- Donations of transaction type “24K,” i.e. contributions made to nonaffiliated committees. These transactions refer to contributions made by committees, which are the ones we are interested in focusing on.
- Donations where entity type is “PAC” or “CCM.” The codes refer to Political Action Committee or Candidate Committee, respectively.
- We limit the analysis to donations made from 2003 on, since the contribution limit regulations differed significantly before then.
- We also remove super PACs from consideration because we are interested in donations bound by contribution limits, and because super PACs do not make candidate contributions.
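To make the constraints concrete, here is a minimal Python sketch of the filter applied to toy rows. The field names (transaction_type, entity_type, year, is_super_pac) are simplified stand-ins rather than the actual FEC column names; the library itself applies these filters in SQL.

```python
# A minimal sketch of the four subsetting constraints, applied to toy rows
# in plain Python. Field names are simplified stand-ins for the FEC columns.

def is_committee_to_committee(row):
    """True if a contribution row survives the subsetting constraints."""
    return (
        row["transaction_type"] == "24K"          # contributions to nonaffiliated committees
        and row["entity_type"] in ("PAC", "CCM")  # PACs and candidate committees only
        and row["year"] >= 2003                   # donations made from 2003 on
        and not row["is_super_pac"]               # super PACs removed from consideration
    )

rows = [
    {"transaction_type": "24K", "entity_type": "PAC", "year": 2010, "is_super_pac": False},
    {"transaction_type": "24K", "entity_type": "CCM", "year": 2001, "is_super_pac": False},
    {"transaction_type": "15",  "entity_type": "PAC", "year": 2010, "is_super_pac": False},
]
kept = [r for r in rows if is_committee_to_committee(r)]  # only the first row passes
```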
This subset is stored in fec_contributions, which is the table all subsequent queries are primarily built on top of. We add indexes to tables fec_contributions, fec_candidates and fec_committees, as well as every table we create in the process, in order to speed up subsequent queries. Most indexes are added on attributes fec_committee_id and other_id. Attribute fec_committee_id uniquely identifies contributors, whereas other_id uniquely identifies recipients.
The bulk of our code is split into two scripts: overall.py and grouped-by-cycle.py. The former computes overall relationship scores, i.e., scores across all election cycles since 2003, whereas the latter computes scores for each election cycle separately. The main.py script invokes one of the two scripts according to the first parameter it receives (either “overall” or “cycle”). The second parameter required is the name of the database where the fec_committee_contributions, fec_candidates and fec_committees tables are stored.
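The dispatch logic can be sketched as follows; the function name and the mapping are illustrative only, not the library’s actual API.

```python
# Hypothetical sketch of the dispatch described above: the first parameter
# chooses which script runs. Names here are illustrative, not the real API.

def pick_script(mode):
    """Map the first command-line parameter to the script to invoke."""
    scripts = {"overall": "overall.py", "cycle": "grouped-by-cycle.py"}
    if mode not in scripts:
        raise ValueError("first parameter must be 'overall' or 'cycle'")
    return scripts[mode]
```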
The Model
Before we used the data we’d set up, though, we had to decide precisely what we wanted to accomplish with it.
What matters most in determining how invested contributors are in a campaign? The length of the relationship with donation recipients, or the amount donated? The number or the timing of donations? The absolute or relative number of donations? What exactly do we mean by timing anyway—are we talking about how often or how early donations occur? These questions raise a fundamental point: No single metric can fully describe the affinity between contributors and recipients.
Our method is to combine several metrics into the relationship score. This strategy encompasses several ways in which the strength of a relationship can manifest itself, but a core question remains: What exactly should the metrics be? The following is the list of metrics we decided to incorporate into the relationship score. Our hope is that most—if not all—are intuitive measures of affinity.
- Length of the relationship is an obvious first pick. The longer contributor has donated, the stronger a relationship it has with recipient. This metric is captured in the length score.
- Timing of donations also matters. The earlier in the election cycle a contributor donates to a campaign, the stronger a commitment to the campaign it displays. Uncertainty about a campaign’s prospects is higher early in the election cycle. The scores of early donations are bumped up through the report-type score.
- Periodicity of donations is next. Periodic donations made around the same time each year are an indicator of strength, since recipients can expect to count on these donations. This kind of periodic pattern in the timing of donations is rewarded in the computation of the periodicity score.
- Amount donated should also factor into the score. The more money a PAC gives, the more invested in the recipient it is. It is not enough to look at the absolute figure—much more telling is what percentage share of the contribution limit allowed by the FEC the donation represents. We want to measure how close contributor was to donating as much as it lawfully could. This is the rationale behind the maxed-out score.
- Exclusivity of the relationship is arguably relevant, too: The more selective contributors are in choosing recipients, the more invested they are in the respective campaigns. Contributors that donate to campaigns all across the country are less invested in specific recipients than contributors that donate exclusively to a given recipient. We capture this idea with the exclusivity score.
- Geography is the last metric. The more contributors donate to recipients associated with specific races, the more invested in the outcome of those specific races they are, which denotes a stronger relationship with the recipients associated with these races. This is the intuition behind race-focus scores.
One could argue that an obvious metric is missing from this list: a simple count of the number of donations associated with each contributor-recipient pair. We left this one out on purpose, as we’re more concerned about the patterns surrounding donations than about the number of donations per se. The count is embedded in the computation of several of these scores—notably periodicity scores and length scores, which are assigned a value of 0 in the event of one-time donations.
The choice of scores is of course an editorial decision, one of the several judgment calls that factor into the design of an algorithm of this kind. We hope to make this analysis accountable by disclosing the editorial decisions embedded in the algorithm design as well as providing Bedfellows users with full control over parameters of the scoring model.
And now, a closer look at how we compute each score.
Length Scores
The length score has an intuitive premise: The longer the relationship between contributor and recipient lasts, the stronger that relationship is. We want to reward pairs that exhibit a long-lasting relationship between contributors and recipients. We measure length of relationship by counting the number of days passed between the first and the last donation associated with a contributor-recipient pair. We then normalize these counts by assigning a score of 1 to the highest-scoring pair and scaling others accordingly.
Step by Step
We first compute unnormalized length scores as the difference between the first and the last date of donations on record, measured in days. This is readily accomplished with MySQL’s DATEDIFF function. The unnormalized scores are stored in unnormalized_length_scores. We then normalize scores by first storing the highest value found in table max_length_score and then dividing all unnormalized scores by the highest value. Normalized scores are stored in length_scores.
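The steps above can be mirrored in a short plain-Python sketch on toy data (the real implementation runs in SQL with DATEDIFF; only the logic is reproduced here):

```python
from datetime import date

def length_scores(donations):
    """donations maps (contributor, recipient) -> list of donation dates.
    Returns normalized length scores in [0, 1]."""
    # Unnormalized score: days between first and last donation (as DATEDIFF would compute).
    raw = {pair: (max(ds) - min(ds)).days for pair, ds in donations.items()}
    top = max(raw.values()) or 1  # guard: every pair could be a one-time donor
    return {pair: days / top for pair, days in raw.items()}

donations = {
    ("A", "X"): [date(2004, 1, 1), date(2008, 1, 1)],  # four-year relationship
    ("B", "Y"): [date(2006, 1, 1), date(2007, 1, 1)],  # one-year relationship
}
scores = length_scores(donations)
```

Note that one-time donations get an unnormalized score of zero days, which is why the length score of such pairs is 0.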
In Depth
The scoring model we developed doesn’t explicitly reward pairs in proportion to the absolute number of corresponding contributions. Rather, the model seeks to flesh out patterns surrounding these contributions, namely periodicity, exclusivity and length of relationship as well as the timing of donations in the context of election cycles, the relative donation value with respect to the limit allowed by FEC and contributor’s focus on specific races.
Report-Type Scores
Report-type scores are built on top of the following premise: The earlier in the election cycle donations are made, the stronger the relationship between contributor and recipient is. Our underlying assumption is that early donations indicate that either the recipient asks the contributor before others or the contributor wants to establish a tie by proactively donating early.
We seek to translate this premise into a quantitative measure that rewards contributions that occur further from Election Day with high report-type scores. We do so by looking at frequencies of different report types associated with donations, hence the name “report-type score.”
Each donation in the fec_contributions table is associated with a report type, which (as one would expect) indicates the type of report used to register donations with the FEC. Each report type is by definition associated with the period of the year when the donation was reported. This, in combination with the year, makes report types a convenient measure for determining how early in the election cycle a donation was made.
We compute report-type scores by assigning each report type a weight that indicates how early in the election cycle it falls, with higher weights awarded to earlier periods. We find the frequencies of each report type associated with a contributor-recipient pair and then compute unnormalized report-type scores as the sum of the products of each report type’s frequency and its corresponding weight. Finally, we normalize the scores by dividing all unnormalized scores by the maximum score found so as to ensure they fall in a [0,1] range.
Step by Step
The computation of report-type scores requires several steps, the first of which is to read into the database a CSV file detailing the report type weights to be used in computing the scores. These weights are stored in the report_type_weights table.
Next, we count how many times each report type occurs in the collection of donations associated with each pair. These counts are split by year parity. That is to say, we count how often each report type occurs for each contributor-recipient pair in odd years as well as how often they occur for each pair in even years. The goal here is to count the number of donations made at different points of the election cycle. The split by year parity is necessary because federal elections typically take place in even years only, meaning the correspondence between report types and periods of the election cycle differs according to year parity. These counts are stored in table report_type_count_by_pair.
We then compute how many times each contributor-recipient pair occurs in fec_contributions, that is, we get a count of donations made by each contributor to each recipient. These counts, which are stored in the pairs_count table, are nothing more than the number of occurrences of each pair in fec_contributions.
Once equipped with these two counts, we compute report type frequencies: how often each report type occurs in donations associated with each pair. These frequencies are simply the quotient between each report type count (from report_type_count_by_pair) and pair count (from pairs_count). We store the results in the report_type_frequency table.
The next step is to compute report type subscores for each combination of contributor-recipient pair and report type present in fec_contributions. These subscores are given by the product of the frequency of each report type (from report_type_frequency) and the corresponding weights (from report_type_weights).
Then, we find unnormalized report type scores by summing over all subscores associated with a given pair. In other words, the report type score is the sum of all subscores corresponding to all combinations of report type and year parity that occur for each pair. We say these are unnormalized scores because it is possible (and indeed inevitable) that some pairs will have a score higher than 1. The unnormalized_report_type_scores table stores these results.
The max_report_type_score table simply stores the maximum score found in table unnormalized_report_type_scores. We keep the maximum unnormalized score in a separate table in order to avoid querying the unnormalized table for the maximum value at each row in the unnormalized scores table.
At last, we arrive at the final report type scores by normalizing scores in unnormalized_report_type_scores. Normalization is achieved by dividing all scores by the maximum score stored in max_report_type_score. This way we ensure that scores fall in a scale from 0 to 1.
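The frequency-times-weight computation can be sketched in plain Python; the data shapes (dictionaries keyed by pair and by report type/parity) are illustrative stand-ins for the intermediate tables described above.

```python
# Sketch of the report-type score pipeline on toy data. Keys and data
# shapes are simplified stand-ins for the intermediate tables.

def report_type_scores(counts_by_pair, weights):
    """counts_by_pair maps pair -> {(report_type, parity): count};
    weights maps (report_type, parity) -> a weight from 1 (late) to 4 (early)."""
    raw = {}
    for pair, counts in counts_by_pair.items():
        total = sum(counts.values())  # the pair's donation count (pairs_count)
        # frequency of each report type times its weight, summed into an unnormalized score
        raw[pair] = sum(n / total * weights[key] for key, n in counts.items())
    top = max(raw.values())
    return {pair: score / top for pair, score in raw.items()}  # normalize to [0, 1]

weights = {("Q1", "even"): 4, ("12G", "even"): 1}  # illustrative weights only
counts = {
    ("A", "X"): {("Q1", "even"): 2},                        # always donates early
    ("B", "Y"): {("Q1", "even"): 1, ("12G", "even"): 1},    # half early, half late
}
scores = report_type_scores(counts, weights)
```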
In Depth
One could argue a more objective measure of time such as date is to be preferred over report type for the purposes of pinpointing how early in the election cycle a donation is made, especially seeing as date is an attribute readily available in table fec_contributions. We choose to go with report types because they provide for a convenient grouping of donations made around the same time but not quite on the same date. Had we used date for this analysis, we would have had to come up with a method for clustering dates. Report types, on the other hand, are FEC-sanctioned clusters already in place. We favored using an existing FEC-sanctioned clustering method over devising a new one.
As a matter of fact, even report types are too granular a measure of time for our purposes, so much so that we grouped similar ones together in the process of assigning them weights. For instance, a weight of 1 is assigned to report types 12C, 12G, 12P, 12R, 12S, 30G, 30R, 30S in both even and odd years, and MY, M7, M8, M9, M10, M11 and M12 in even years. These report types represent donations made within a few months of Election Day. The goal is to differentiate between donations made very early in the election cycle, which get a weight of 4, and donations made towards the end of the campaign, which get a weight of 1. Weights of 2 and 3 are assigned for donations in between.
File report_types.csv lists the weight assigned to each report type. These weight assignments are of course an editorial decision. We thus emphasize that Bedfellows users can easily change these weights by editing the CSV file.
Periodicity Scores
If a contributor regularly donates to the same recipient, their relationship is arguably strong, since said contributor is a source of funding the recipient can expect to count on. Periodicity scores seek to reward pairs for which donations are made around the same time of the year across the years. The goal is to quantify the temporal closeness of donations associated with a pair during the election cycle. Noting that closeness can be interpreted as the inverse of dispersion, we use the inverse of standard deviation—a measure of dispersion—as a means to compute periodicity score. The inverse of standard deviation correlates directly with periodicity, since data points with smaller standard deviation indicate that donations were made around the same time of the year. We favor standard deviation over variance due to its flatter curve, which leads to a less steep curve for periodicity scores.
To compute periodicity scores, we first map dates of donations associated with a pair into a “day of the year” data point (i.e. number of days passed since Jan 1st) and then compute the inverse of the standard deviation of the resulting data points. If standard deviation is found to be zero, we say periodicity score is 0 if the data is made up of a single distinct point and 1 if the data is made up of multiple distinct points. Otherwise, periodicity score is simply the value found for the inverse of standard deviation.
We now expand on the case when standard deviation is zero. We note that if the standard deviation of a collection of data points is zero, then either the data is made up of a single point or all points in the data are identical. In the context of donations, the former means that contributor makes a one-time donation to recipient, while the latter means that contributor makes several donations to recipient on the same day of the year (though not necessarily in the same year, since donation dates are mapped into a “day of the year” measure, which is independent of year). It is reasonable to say one-time donations are not periodic and therefore merit a periodicity score of zero. The case when several donations are made on the very same day across the years reflects a highly periodic pattern of donations, to which we assign a periodicity score of 1.
Step by Step
Unlike with report-type scores, there is no need to compute several tables before actually computing the unnormalized scores. A single query on table fec_contributions does the trick.

We compute unnormalized periodicity scores as follows: First, we group donations by contributor-recipient pairs; then, we map donation dates to a “day of year” measure through MySQL’s DAYOFYEAR function; finally, we evaluate the standard deviation of the resulting data points. If the standard deviation is zero, we look at the number of distinct data points used to compute it: If the data is made up of a single data point, we assign a periodicity score of 0; otherwise we assign a score of 1. If the standard deviation isn’t zero, then the periodicity score is the inverse of the standard deviation. Results are stored in unnormalized_periodicity_scores.
The same normalization strategy used before is applied here. We compute values in periodicity_scores as the quotient between unnormalized scores in unnormalized_periodicity_scores and the maximum periodicity score value stored in max_periodicity_score.
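The unnormalized computation, including both zero-standard-deviation cases, can be sketched in a few lines of Python (using the population standard deviation, as SQL’s STDDEV does):

```python
from statistics import pstdev

def periodicity_score(days_of_year):
    """days_of_year: DAYOFYEAR values for one pair's donations.
    Returns the unnormalized periodicity score."""
    sd = pstdev(days_of_year)
    if sd == 0:
        # a single donation is not periodic (score 0); several donations on
        # the very same day of the year are highly periodic (score 1)
        return 0.0 if len(days_of_year) == 1 else 1.0
    return 1.0 / sd  # inverse of dispersion: tighter clusters score higher
```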
In Depth
Leap years introduce a slight imprecision in our periodicity score calculation. For dates from March to December, the value returned by MySQL’s DAYOFYEAR function for dates in leap years exceeds by one the value returned for the same date in a non-leap year. As a result, a pair of donations made on the same day of the year in different years, only one of which is a leap year, is treated as donations made a day apart. They will be treated as distinct data points even though they refer to the same date. Because the standard deviation in this case is very low, and so the periodicity score is very close to 1 anyway, we let this slide.
On a separate note, we acknowledge that our method for computing periodicity scores may fail to adequately capture the periodical pattern of multimodal data points. If a contributor donates to a given recipient, say, every 4 months, the dataset that results from mapping donation dates into the dayofyear measure will be multimodal; as a result, standard deviation is difficult to interpret.
While there isn’t a score explicitly devoted to rewarding multiple donations over one-time ones, we do acknowledge that both periodicity and length scores indirectly produce this side effect, as one-time donations necessarily get a periodicity score and a length score of zero. This is meant to counterbalance the relative ease with which one-time donations can get high values for the other scores. One-time donations will get a high report-type score as long as the donation is made early in the election cycle; they will get a high maxed-out score as long as the donation is close to the contribution limit. Moreover, if contributor doesn’t donate to other recipients, exclusivity and race-focus scores will necessarily be 1.
Maxed-Out Scores
The rationale behind maxed-out scores is intuitive: Contributors have a stronger relationship with recipients to whom they donate the maximum amount allowed under FEC regulations than with recipients to whom they donate less than the contribution limit. Maxed-out scores reward maxed-out donations associated with a contributor-recipient pair.
To compute maxed-out scores, we first identify contributor and recipient types and then assign a contribution limit to each contributor-recipient pair according to FEC rules. We then compute the value of each donation as a percentage share of the contribution limit associated with each pair. Finally, we add up percentage shares from all donations associated with a pair to arrive at an unnormalized score. The usual normalization procedure then ensues.
Step by Step
We start by computing table contributor_types, which assigns a “contributor_type” value to each contributor in table fec_contributions. Contributor types are one of “national_party,” “other_party,” “multi_pac,” and “non_multi_pac.” Likewise, we compute table recipient_types to assign a “recipient_type” value to each recipient in table fec_contributions. Recipient types are one of “national_party,” “other_party,” “pac,” and “candidate.” See the in-depth discussion below for a detailed explanation of the assignment rules.
We then create table contribution_limits by reading file limits.csv into the database. This file contains FEC-regulated contribution limits for each possible combination of contributor and recipient types. Next, we join contributor_types, recipient_types and fec_contributions into joined_contr_recpt_types. This join associates each donation with a contributor type and a recipient type.
We’re now ready to associate each donation with a contribution limit based on contributor and recipient types from table joined_contr_recpt_types and contribution limits from contribution_limits. We do that in maxed_out_subscores, which computes the percentage share of the contribution limit represented by each donation. The maxed-out subscore is, in other words, the quotient between donation amount and contribution limit. Finally, we compute unnormalized maxed-out scores for each contributor-recipient pair by summing over all subscores associated with a pair. The table unnormalized_maxed_out_scores stores these results.
As per our standard normalization method, we store the highest score found in table max_maxed_out_score and then compute maxed-out scores as the quotient between unnormalized scores and that highest score. Results are stored in maxed_out_scores.
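The subscore-and-sum logic can be sketched as follows. The type labels match those described above, but the dollar limit in the example is purely illustrative, not an actual FEC figure.

```python
# Sketch of maxed-out scoring on toy data. The contribution limit used in
# the example is illustrative only, not a real FEC limit.

def maxed_out_scores(donations, limits):
    """donations maps pair -> list of (amount, contributor_type, recipient_type);
    limits maps (contributor_type, recipient_type) -> contribution limit."""
    raw = {}
    for pair, rows in donations.items():
        # each subscore is the donation's share of the applicable contribution limit
        raw[pair] = sum(amount / limits[(ctype, rtype)] for amount, ctype, rtype in rows)
    top = max(raw.values())
    return {pair: score / top for pair, score in raw.items()}  # normalize to [0, 1]

limits = {("non_multi_pac", "candidate"): 5000}  # illustrative limit
donations = {
    ("A", "X"): [(5000, "non_multi_pac", "candidate")],  # maxed out
    ("B", "Y"): [(2500, "non_multi_pac", "candidate")],  # half the limit
}
scores = maxed_out_scores(donations, limits)
```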
In Depth
Classification of contributor types is based on the following rules: If committee type is “X” or “Y,” then contributor is either a national party or other party committee. Other party here means a state- or local-level committee. We use national parties’ FEC IDs to make the distinction as follows. These IDs…
- ‘C00003418’ and ‘C00163022’ (REPUBLICAN NATIONAL COMMITTEE)
- ‘C00027466’ (NATIONAL REPUBLICAN SENATORIAL COMMITTEE)
- ‘C00075820’ (NATIONAL REPUBLICAN CONGRESSIONAL COMMITTEE)
- ‘C00000935’ (DEMOCRATIC CONGRESSIONAL CAMPAIGN COMMITTEE)
- ‘C00042366’ (DEMOCRATIC SENATORIAL CAMPAIGN COMMITTEE)
- ‘C00010603’ (DNC SERVICES CORPORATION/DEMOCRATIC NATIONAL COMMITTEE)
…are known to be national parties; all others are classified as “other_party.” If committee type is one of “N,” “Q,” “F,” then contributor is either a multicandidate PAC or a non-multicandidate PAC. We use attribute “multiqualify_date” from table fec_committees to distinguish between multicandidate and non-multicandidate PACs. We ignore all contributors associated with other committee types.
Classification of recipient types is based on the following rules: If committee type is one of “H,” “S,” “P,” “A,” or “B,” then recipient is a candidate committee. If committee type is “X” or “Y,” then the contributors’ rules also apply for recipients. If committee type is one of “N,” “Q,” “F,” or “G,” then recipient is a PAC. We ignore all recipients associated with other committee types.
Exclusivity Scores
The exclusivity score is intended to capture the share of the overall amount donated by a contributor that is assigned to each recipient. In other words, the score measures how “exclusive” donations are. If all the money donated by a contributor goes to a single recipient, that contributor and recipient are likely to have a stronger relationship than are pairs in which the contributor splits its donations among several recipients.
To compute exclusivity scores, we first find total amount donated by a given contributor across all recipients. Then, for each donation made by a contributor, we compute an exclusivity subscore as the quotient of the donation’s amount by the total amount donated across all recipients. We finally compute exclusivity scores by summing over all exclusivity subscores associated with a given pair.
Step by Step
The first step in calculating exclusivity scores is computing table total_donated_by_contributor, which stores total amounts donated by contributors. To compute these amounts, we simply group donations in the fec_contributions table by contributor and then sum up values in the amount column.
We then populate the exclusivity_scores table by first computing the percentage share associated with each donation and then summing over all donations associated with a given contributor-recipient pair. The percentage share associated with each donation is labeled the exclusivity subscore and computed as the quotient between the donation’s amount and the total amount donated by the contributor, from table total_donated_by_contributor. Finally, the exclusivity score is the sum over all exclusivity subscores associated with a given pair.
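This share-of-total computation reduces to a few lines of Python on toy rows; the tuple layout here is a simplified stand-in for the SQL tables.

```python
# Sketch of exclusivity scoring: each pair's score is the contributor's
# share of total giving that went to that recipient.

def exclusivity_scores(donations):
    """donations: list of (contributor, recipient, amount) rows."""
    totals = {}  # plays the role of total_donated_by_contributor
    for contributor, _, amount in donations:
        totals[contributor] = totals.get(contributor, 0) + amount
    scores = {}
    for contributor, recipient, amount in donations:
        # subscore: this donation's share of everything the contributor gave
        pair = (contributor, recipient)
        scores[pair] = scores.get(pair, 0) + amount / totals[contributor]
    return scores

scores = exclusivity_scores([("A", "X", 300), ("A", "Y", 100), ("B", "X", 50)])
```

With positive amounts only, each contributor’s scores across recipients sum to exactly 1, matching the in-depth discussion of this score.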
In Depth
One would expect all donation amounts stored in the fec_contributions table to be positive values, since it doesn’t make sense to speak of donations of negative amounts. If this were the case, all exclusivity scores would necessarily fall in a [0,1] range, and the sum of all exclusivity scores associated with a given contributor across all recipients would necessarily amount to exactly 1. (If this isn’t immediately apparent, think about how exclusivity scores are defined: They represent the percentage share of donations allocated to each recipient, so it only makes sense that the sum of all percentage shares equals 100% of the amount donated.)
However, the current instance of table fec_contributions contains over 3,400 donations whose values in the amount column are negative. These negative values refer to refunds made by recipients. The occurrence of negative donation amounts in the database slightly complicates the score computation, as it leads to a few instances in which exclusivity scores are larger than 1. We can’t simply ignore these refunds, as they convey meaning about the contributor-recipient relationship.
We found that there were 10 pairs in our dataset for which the exclusivity score evaluated to an amount higher than 1. As there are only 10 such pairs, we addressed this issue by simply capping the score at 1. It makes sense for these 10 pairs to be assigned a score of 1, because their score would be 1 if the negative amounts were removed from consideration.
Race-Focus Scores
The motivation for this score is that contributors that give to recipients within a single race should see a bump in their relationship scores. Contributors that donate to races all over the country are not as invested in each particular race as are contributors that focus on specific races. We compute race-focus scores as the inverse of the number of races a contributor donates to. Normalization is not necessary in this case.
Unlike the other five scores, race focus scores are assigned to each contributor only (as opposed to contributor-recipient pairs).
Step by Step
The first step is to compile a list of all races associated with donations in table fec_contributions. We define a race as a unique combination of the following attributes: district, office state, branch and cycle. (Think about it: No two races will map into the same combination of these four attributes.) We store the results in races_list.
Now that we have a list of races associated with donations in the data, we count how many races each contributor is affiliated with, where affiliation means the contributor donates to a recipient partaking in a race. We let race focus scores be the inverse of this count. Table race_focus_scores stores these results. This methodology necessarily constrains values within the [0,1] range, which removes the need to normalize values at the end as before.
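The count-and-invert logic can be sketched as follows; the race tuples stand in for the (district, office state, branch, cycle) combinations described above.

```python
# Sketch of race-focus scoring: the inverse of the number of distinct
# races a contributor gives to. Race tuples are illustrative.

def race_focus_scores(donations):
    """donations: list of (contributor, race) rows, where race is a
    (district, office_state, branch, cycle) tuple."""
    races = {}
    for contributor, race in donations:
        races.setdefault(contributor, set()).add(race)
    # the inverse of a positive count already falls within [0, 1]; no normalization needed
    return {contributor: 1.0 / len(r) for contributor, r in races.items()}

scores = race_focus_scores([
    ("A", ("14", "NY", "H", 2014)),
    ("A", ("12", "CA", "H", 2014)),  # A spreads over two races
    ("B", ("14", "NY", "H", 2014)),  # B focuses on one race
])
```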
In Depth
It is worth noting that the query used to compile a list of races relies on a regular expression (‘REGEXP ^[HPS]’). This regular expression restricts the race list to candidates for the House, Senate or presidency.
Relationship Scores
The final step is to combine the six scores computed (i.e. exclusivity scores, report-type scores, periodicity scores, maxed-out scores, length scores, and race-focus scores) into a single, final relationship score. We accomplish this by joining the various scores tables and computing a weighted average of the scores, where weights are arbitrarily predetermined.
We once again point out that Bedfellows allows users to easily change the parameters of the model if they so desire. Users can change the weights used in the computation of relationship scores simply by editing file score_weights.csv.
Step by Step
We start by reading the score weights to be attributed to each of the six scores from the CSV file score_weights.csv and storing them in table score_weights. We then join the first five scores (all except race focus scores) on attributes fec_committee_id and other_id. Recall that fec_committee_id uniquely identifies contributors and other_id uniquely identifies recipients. These five scores are attributed to contributor-recipient pairs. The weighted average of these five scores is stored in five_scores.
Race focus scores, on the other hand, are attributed to contributors only, so we separately join the partial scores from five_scores and race_focus_scores on attribute fec_committee_id. This means all contributor-recipient pairs associated with the same contributor are assigned the same race focus score in the computation of the relationship score. The final result is stored in the final_scores table.
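The combination step can be sketched as follows; the dictionary-based structures stand in for the SQL joins Bedfellows actually performs, and the score names are illustrative:

```python
def relationship_score(pair_scores, race_focus, weights):
    """Combine the component scores into one relationship score per
    pair via a weighted average. `pair_scores` maps (contributor,
    recipient) pairs to their pair-level scores; `race_focus` maps
    contributors to their race focus score; `weights` maps score
    names to weights (in Bedfellows these come from score_weights.csv)."""
    total_weight = sum(weights.values())
    final = {}
    for (contributor, recipient), scores in pair_scores.items():
        # Attach the contributor-level race focus score to the pair.
        scores = dict(scores, race_focus=race_focus.get(contributor, 0.0))
        final[(contributor, recipient)] = (
            sum(weights[name] * scores[name] for name in weights)
            / total_weight
        )
    return final
```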
Similarity Scores
Now that we have computed relationship scores for each contributor-recipient pair in table fec_contributions, the next step is to make sense of the results.
To interpret the scores, we look for similarities in the distribution of scores across all pairs. With all the scores in hand, we are empowered to determine, for instance, which contributors are most similar to each other in terms of campaign contribution patterns. The rationale is that contributors with similar score distributions exhibit similar campaign donation behavior. (The same rationale applies to recipients and contributor-recipient pairs.)
In order to perform this similarity analysis, we rely on a vector-based metric known as cosine similarity. We first associate each contributor, recipient and pair with a vector of scores. We then measure how similar two contributors are by computing the cosine of the angle between their vectors, which we take to be the similarity score associated with the two contributors. (Again, the recipient and contributor-recipient pair cases are analogous.)
Step by Step
To measure similarity between contributors and between recipients, we first need to represent each contributor, recipient, and contributor-recipient pair as a vector of scores.
In the case of contributors and recipients, we start off by computing the weighted adjacency matrix associated with all contributors and recipients. This is a matrix in which each row corresponds to a contributor and each column corresponds to a recipient. A corollary follows from this definition: Each matrix entry is associated with a contributor-recipient pair. We therefore compute the matrix by simply assigning to each entry the relationship score associated with the corresponding pair. Since each contributor corresponds to a row, a contributor’s vector representation is simply the corresponding row vector of scores. Likewise, each recipient’s vector representation comes from the corresponding column vector.
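A toy version of the weighted adjacency matrix, with made-up IDs and scores, shows how the row and column vectors fall out:

```python
import numpy as np

# Hypothetical relationship scores for three contributors, two recipients.
contributors = ["C1", "C2", "C3"]
recipients = ["R1", "R2"]
scores = {("C1", "R1"): 0.8, ("C2", "R1"): 0.1, ("C2", "R2"): 0.6}

# Weighted adjacency matrix: rows are contributors, columns recipients;
# pairs with no donations keep the default score of zero.
A = np.zeros((len(contributors), len(recipients)))
for (c, r), s in scores.items():
    A[contributors.index(c), recipients.index(r)] = s

c2_vector = A[1]     # contributor C2's scores across all recipients
r1_vector = A[:, 0]  # recipient R1's scores across all contributors
```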
In the case of contributor-recipient pairs, the vector representation is as follows: Each pair is described as a vector of the six scores associated with it (namely, exclusivity, report type, periodicity, maxed out, race focus, and length scores). We store these vectors in a dictionary named pair_score_map.
After computing vector representations of contributors, recipients and contributorrecipient pairs, we ask the user what kind of similarity analysis she is interested in performing. The options are:
1. Find contributors similar to a given contributor,
2. Find recipients similar to a given recipient,
3. Find pairs similar to a given pair.
Should the user select option 1, she is asked to input the ID of the contributor for which she would like to find the most similar contributors. Analogously, she is asked to input a recipient’s ID if she selects option 2. For option 3, she is required to enter the IDs of the contributor and recipient that make up the pair.
Finally, we compute cosine similarity scores between the vector representing the contributor entered as input and each of the vectors representing every other contributor. Cosine similarity is given by the normalized dot product between the two vectors. We then sort these cosine similarity scores in decreasing order. The top ten contributors most similar to the contributor entered as input are simply the contributors corresponding to the ten highest cosine similarity scores. The process for finding recipients most similar to a given recipient and pairs most similar to a given pair is analogous.
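The ranking procedure can be sketched like so; the vector structure is illustrative and does not mirror Bedfellows' internals:

```python
import numpy as np

def cosine_similarity(u, v):
    """Normalized dot product: the cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(target_id, vectors, top_n=10):
    """Rank every other vector by cosine similarity to the target.
    `vectors` maps IDs to score vectors (an assumed structure)."""
    target = vectors[target_id]
    ranked = sorted(
        ((cosine_similarity(target, vec), other)
         for other, vec in vectors.items() if other != target_id),
        reverse=True,
    )
    return [other for _, other in ranked[:top_n]]
```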
In Depth
If we were to represent each contributor and each recipient as a separate node in a bipartite graph whose edges denote the occurrence of the corresponding contributor-recipient pairs in the relationship scores tables, the relationship scores would be the weights associated with these edges (i.e. the strength of the relationship between contributor and recipient). The weighted adjacency matrix we compute is a matrix representation of this graph. It allows us to easily recover the vector of scores associated with a given contributor across all recipients, as well as with a given recipient across all contributors.
Note that the weighted adjacency matrix we compute is very sparse. The reason for this is that most contributors and recipients have no association whatsoever. (Think about it: Contributors donate only to a small subset of all recipients present in the database.) Since zero is the score given to contributor-recipient pairs in which the contributor and recipient have no association, it follows that the weighted adjacency matrix contains many zero values and is therefore sparse.
Also, note that cosine similarity is defined as the normalized dot product between two vectors, which equals the cosine of the angle between the two vectors. As a result, cosine similarity scores range between –1 and 1. An angle of 0° means the vectors are most similar, whereas an angle of 180° means they point in diametrically opposite directions. (Recall that cos 0° = 1, cos 180° = –1 and cosine’s range is [–1,1].)
Lastly, we highlight that the real power behind the relationship score is not the exact number associated with a pair, but rather the quick comparison between pairs enabled by these numbers. The relationship score is not meant to be an end in itself, but rather a means of finding similar patterns in the donation behavior of contributors, recipients, and contributor-recipient pairs.
Final Remarks
Bedfellows is a tool developed to facilitate exploration of campaign finance data. Its model condenses all information associated with donations made by a given contributor to a given recipient into a relationship score from 0 to 1. The relationship scores provide a snapshot of the affinity between contributor and recipient, as evidenced by campaign donations. Relationship scores are in turn the basis for computing similarity scores, which allow for comparison between donation patterns of different contributors, recipients and pairs. Similarity scores point to similarities in donation patterns observed in campaign finance data.
Neither score is intended to be interpreted as a standalone, objective measure. On the contrary, these scores only make sense in context—they are meant to allow for comparison between the various PAC contribution behaviors concealed in the millions of rows of FEC data.
Our hope is that reporters will use relationship and similarity scores as a guide for further data inspection, as well as traditional reporting. For instance, a high relationship score for a given pair may lead the reporter to inspect all donations in the original FEC table associated with that pair. Likewise, a high similarity score between donations made by two companies may lead the reporter to investigate the reasons behind the observed similarity.
Again, many editorial decisions are embedded in the Bedfellows scoring model. We strive to make Bedfellows accountable by explaining and documenting the model, as well as by giving users control over its parameters.
We hope that the quality of results will provide evidence that this model adequately describes affinity and similarity in the PAC contribution behavior of various political actors. To provide a quantitative measure of the internal consistency of our model, we have also computed Cronbach’s Alpha for the six individual scores, but have so far achieved only a value of 0.31. An adequate value for Cronbach’s Alpha is usually taken to be 0.7 or above.
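For reference, this is the textbook formula for Cronbach's Alpha (a generic implementation, not Bedfellows' own code): alpha = k/(k–1) × (1 – Σ item variances / variance of totals), for k items.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's Alpha for a (respondents x items) score matrix.
    Here a row would hold one pair's six component scores."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]                          # number of items (scores)
    item_variances = X.var(axis=0, ddof=1)  # variance of each item
    total_variance = X.sum(axis=1).var(ddof=1)  # variance of row sums
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```

When the items move together perfectly, the statistic reaches its maximum of 1; the more the items disagree, the lower it falls.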
A Note on Evaluation
We are unsure whether our model’s low alpha value is due to flaws in our model or to the way Cronbach’s Alpha is defined. Cronbach’s Alpha is good at evaluating whether all tests measure the same quantity, but the way we measure “strength of the relationship” varies sharply from score to score. What we believe to be a strength of the model (the diversity of ways in which it quantifies the contributor-recipient relationship) may be keeping it from achieving a high Cronbach’s Alpha. In this case, we would do best to employ a different measure of internal consistency. It is also possible, of course, that Cronbach’s Alpha is accurately identifying problems in our model, in which case we will update it accordingly.
Because Bedfellows is a data exploration tool and we deem its current results to be editorially satisfying, we feel confident to publish it in spite of its current low Cronbach’s Alpha value. We pledge to further examine the internal consistency of the scores and roll out updates if necessary.
The Upshot is planning to use Bedfellows in upcoming stories, and we would love to hear your feedback—please let us know about any criticism of the model you might have as well as suggestions for improvement. You can reach us at dwillis@nytimes.com and nfi2103@columbia.edu.
Credits

Nikolas Iubel
MBA candidate @HarvardHBS. Formerly @Rio2016, @McKinsey, @nytimes. B.Sci. @Stanford ‘12, dual M.Sci. @CUSEAS + @columbiajourn ‘14. From Curitiba, Brazil.