Abstract
State legislatures in the United States handle a number of important policy issues, but pose a challenge for researchers to observe because they are not organized by any central agency. We use a machine learning model based on the “transformer” architecture and contextual word-piece embeddings to code the universe of bills introduced in the states since 2009 (about 1.36 million bills) into 28 policy areas. Validation exercises show our method compares favorably with hand-coded estimates of bill policy areas while offering far greater coverage than legacy human-supervised “dictionary” methods. We explain how researchers can use these estimates to investigate sub-national governance in the United States.
Background & Summary
A legislative agenda reveals what a government is working on at the moment, and can provide a historical index of the priorities of a legislature over time. That is, if it is properly organized. The US Congress has been organized for public consumption by the Library of Congress (via congress.gov) and researchers1 for many years, leading to reams of scholarly work on the body. However, there is no such analog at the state level, despite the fact that state legislatures regulate sizable markets and handle consequential matters from election law to civil rights. Non-government organizations and private firms have been digitizing state legislative histories for a number of years, but those records have not been systematically coded.
This paper describes a machine learning method to code the universe of state legislation by policy area since about 2009, and provides its output. Significant advances in computational hardware have made it possible to use machine learning models to approach the reliability of hand-coders at a fraction of the cost. Using our estimates, researchers can track the levels and changes of bills introduced in 28 policy areas, from broad (macroeconomic policy) to specific (public lands and water management). Temporal change produces important insights. For example, the most common policy area in the states is health care: state legislators introduced a spate of bills on the topic following major federal action (the 2010 Affordable Care Act), but there was no such spike following the failed attempt to “repeal and replace” the Affordable Care Act in 2017.
Previous scholars have used a list of keywords that capture the essence of topics to measure pockets of the agenda, such as “health” being a keyword for the “Health” topic, or “highways,” “transit,” and “airports” mapping to “Transportation.” But there are limitations to dictionary-based approaches such as this, or other “bag of words” models based solely on the prevalence of words. For example, Garlick (2023)2 used a legacy dictionary method to code state legislation from 1991-2018 that would classify a bill about a “learning environment” as environmental policy, not education policy. Therefore, we instead take a “word embedding” approach, which is more nuanced than looking for the presence of words. The principal intuition behind word embedding is that “you shall know a word by the company it keeps”3, and the approach has become dominant in natural language processing (or NLP)4.
Specifically, we constructed a machine learning model based on the transformer architecture5, including the Bidirectional Encoder Representations from Transformers (BERT)6 and two other similar algorithms7,8. These models observe more than just keywords, utilizing contextual word-piece embeddings to build document embeddings and make sophisticated estimates of the category each bill best fits. We then extend the classification system’s coverage by forming predictions about the policy content of bills for which no keywords are present. We also train the model to be capable of “doubting” the keywords themselves, allowing it to question homonyms like “learning environment,” which might otherwise map to the “Environment” topic rather than “Education.” We iterate the model several times, overweighting model predictions with no keywords present so that the model does not simply emulate the dictionary method.
Our method produces policy sector estimates for 1.36 million bills, the universe of bills in the Legiscan database, which has collected bills from state legislative websites since about 2009. We allow for an individual piece of legislation to be measured as pertaining to more than one subject. This is necessary because of cases where the name of the bill obscures its policy content, like Wisconsin’s 2011 “Budget Repair Bill,” which gained national recognition for reducing the collective bargaining rights of state employees, or Minnesota’s 2016 “Omnibus Tax Bill,” which also concerned hosting a World’s Fair and the Super Bowl9. To account for the fact that bill titles can be misleading on their own, we also feed the model bill descriptions provided by Legiscan.
We validate our estimates against published datasets with favorable results. This is not a straightforward task, as we are using a codebook that is backwards compatible with many studies on American state politics published by Virginia Gray, David Lowery and coauthors10,11,12,13 and other recent investigations of American legislatures2,14, but for which there is no “ground truth” data. Evaluated against a hand-coded sample of 1,000 bills, the model achieves a “Top-K agreement” of 80 percent for K=3. This step is necessary for comparing one multi-label classification scheme to another, but 80 percent is notable because it is the same target that human coders are trained to achieve on subtopics of the Comparative Agendas Project15. For an external validation, we investigate how the estimates compare to the hand-coded Pennsylvania Policy Database Project (PPDP)16. The machine learning estimates produce superior results to the legacy dictionary method reported by Garlick (2023)2, matching the accuracy of the legacy method while offering far more coverage of the agenda. Given this reliability, our approach has clear advantages, especially its scalability: it is reproducible as well as cost- and time-effective, so future projects can use it with confidence.
We offer suggestions for how best to use these estimates to observe and analyze state agendas. The estimates can be used to create a panel dataset that facilitates clear inter-state and inter-temporal comparisons. However, there is still considerable variation in how states behave, so it is best to group legislative sessions into two-year periods, the vast majority of which begin in odd years following even-year elections. We detail the exceptions to this pattern, as well as which bills and sessions to include in a typical analysis.
Methods
Classifying state legislative text presents researchers with a number of tradeoffs. Researchers want data to be backwards compatible with extant studies, while also accounting for recent developments in society. Bills should be coded in a reliable way, while also acknowledging the reality that many bills do not concern a single subject area. This section summarizes our approach, which accounts for these tradeoffs. We start by running a legacy keywords version of the model, closely approximating the approach used by prior researchers2. Then we introduce our machine learning model, which (1) takes advantage of best-in-class algorithms to pre-process the text of bill descriptions made available by Legiscan into “word embeddings,” and (2) employs a novel “decision function” for determining whether a bill fits each policy category.
Our model’s decision function f(X) internalizes the input data X and generates a prediction \(\widehat{y}\). In short, the researcher who fits a classification system to legislative text to predict bills’ policy content applies the following workflow:
Read the input data X → Apply a classification decision function f( ⋅ ) → Generate \(\widehat{y}\)
Finding and sorting data
The first step for creating the new dataset is to collect a corpus of state legislation. Beginning in the early 2000s, states began publishing their legislative records on the internet. However, the states vary a great deal in how detailed that information is, and how it is presented. A number of private and quasi-public efforts were launched to collect and centralize this data, most notably Legiscan and the OpenStates project. Legiscan is a private offering launched in 2010 that publishes bill titles, descriptions, and a 5-step bill history. OpenStates began as an open-source project before being purchased and maintained by the private firm Plural in 2021. They differ slightly in their offerings; for example, Legiscan has a 5-step bill progress variable, while OpenStates tracks detailed bill histories, which have been used to establish 23-step bill histories. This study uses the bill descriptions and titles from Legiscan for the universe of bills from 2009-2020, for a total of 1,360,994 bills. The full text of legislation is available, but it is often not helpful, as it is full of boilerplate language and references to legal code17.
We then set the destination policy categories, using a codebook of broad policy areas in state politics that occupy intuitive conceptual domains such as “Healthcare,” “Education,” and “Tax Policy”10,18. We did this by expanding the 23-topic classification system used in the legacy model2 with 5 additional topics to improve its overlap with the Comparative Agendas Project spearheaded by Baumgartner and Jones, producing a multi-label system with ∣K∣ = 28 relatively broad policy areas.
We also append several additional keywords to the topics listed in the legacy model2 to improve the classification system’s overlap with the Comparative Agendas Project codebook. The additional topics are “Fiscal and Economic Issues”, “Immigration”, “Public Lands and Water Management”, “Foreign Trade”, and “International Affairs and Foreign Aid,” which make the scheme more complete and comparable to the Comparative Agendas Project. Table 1 presents this set of keywords and topics.
We then replicate the legacy dictionary model used by Garlick (2023)2 that places bills in a policy category if they contain any of the listed keywords. This method is tried and true, but it generates many false negatives: even with the extended codebook, the legacy dictionary approach only places approximately 41% of the universe of state legislation into a policy category. A bill which mentions “medical” and “doctor,” for example, is not classified as “Health” unless it also contains the precise keyword “health.” Words which strongly relate to topics but are not keywords themselves have no effect on the output of this decision function. We overcome these problems with a pre-processing approach that treats naturally occurring language more dynamically, and a three-pass decision function that emphasizes contextual clues, not just keywords, to produce better coverage of the agenda without sacrificing accuracy.
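To make the legacy approach concrete, the following is a minimal sketch of a keyword dictionary classifier of the kind described above; the dictionary shown is a tiny illustrative subset of Table 1, not the full keyword list.

```python
# Minimal sketch of the legacy dictionary method: assign a bill to every topic
# whose keyword list matches its title or description. The keywords below are a
# small illustrative subset of Table 1, not the full dictionary.
DICTIONARY = {
    "Health": ["health"],
    "Transportation": ["highway", "transit", "airport"],
    "Environment": ["environment"],
}

def dictionary_code(text: str) -> list[str]:
    text = text.lower()
    return [topic for topic, keywords in DICTIONARY.items()
            if any(keyword in text for keyword in keywords)]

print(dictionary_code("An act improving the learning environment in schools"))
# -> ['Environment']  (the homonym problem: no "Education" code, a spurious "Environment" code)
```

The example illustrates both shortcomings at once: a missed “Education” code (false negative) and a spurious “Environment” code (false positive).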
Transforming text to data
Machine learning models have moved beyond the unordered “bag-of-words” approach that was once common in political science19 but sufficient for dictionary models2. Meaning is lost when word order is not taken into account: a bill that reads, “this bill is about taxes - not education,” has an identical bag of words to “this bill is about education - not taxes.” Bag-of-words approaches are also vulnerable to homonyms, as our earlier example “the learning environment” might fall under the purview of an “Education” topic, but “learning about the environment” might better fit under “Environment” (or “Education” and “Environment” jointly).
Word embedding presents a solution, as words are represented as real-valued, N-dimensional vectors whose elements reflect latent semantic dimensions that, together, define the position the word occupies in an N-dimensional space - its “meaning.” For example, one might expect the words “teacher” and “student” to be more semantically similar than “teacher” and “warfare.” In this way, the word embedding approach can drastically reduce the dimensionality of the input data. Instead of representing every word as a unique token ID, every word, and the document itself, is represented in a common N-dimensional space shared by all documents, sentences within documents, and words within sentences. The elemental unit of analysis in a word embedding model is the “word-piece”20, where, e.g., the word-pieces “educat” and “##ion” each have their own embeddings, with the “##” tag indicating that “ion” should be attached to the word-piece that comes before it. Popular algorithms that generate static embeddings include Word2Vec21 and GloVe22.
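As an illustration of word-piece tokenization, the snippet below uses a pretrained BERT tokenizer from the Hugging Face transformers library; the library, checkpoint, and exact splits are assumptions for illustration, since the pieces produced depend on the tokenizer’s vocabulary.

```python
# Illustrative word-piece tokenization with a pretrained BERT tokenizer.
# Requires the Hugging Face `transformers` package; the exact "##" splits
# depend on the tokenizer's vocabulary, so the split shown in the text
# ("educat" + "##ion") is schematic.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pieces = tokenizer.tokenize("A bill concerning the learning environment in public schools")
print(pieces)  # uncommon or compound words are split into pieces prefixed with "##"
```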
A recent innovation from the static models has been greater focus on the context surrounding word embeddings. The transformer model5 utilizes a “self-attention” mechanism, which allows the model to weight the importance of words in the input sequence as they pertain to the focus word. For example, consider the sentence: “she went to the hardware store to buy a hammer.” If a human were asked why “hammer” is a “logical” word to end the sentence - and why not “car,” “house,” or “bagel” - they would likely point to “hardware store” as the leading contextual clue. The word “hardware” attends to “store,” dynamically modifying its word embedding. If the word “hardware” were removed from the input sequence, the prediction of “hammer” might be less likely (unless the model were trained on relevant data), and the prediction of “bagel” might be more plausible.
The model we train to classify legislation into policy areas utilizes the transformer models BERT6, RoBERTa (a robustly optimized BERT pretraining approach)7, and XLNet8. These models utilize contextual word-piece embeddings to build document embeddings, which are the “input data” X that our model is trained on.
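A minimal sketch of how contextual word-piece embeddings can be pooled into a document embedding is shown below; mean pooling and the roberta-base checkpoint are illustrative assumptions, not necessarily the exact pooling strategy or checkpoint used here.

```python
# Minimal sketch: pooling contextual word-piece embeddings into a document
# embedding with a pretrained encoder. Mean pooling and the `roberta-base`
# checkpoint are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

text = "Creating the Health Care Choice Act; authorizing out-of-state insurers."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
doc_embedding = token_embeddings.mean(dim=1)               # (1, hidden_dim) document vector
```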
Decision function
The first pass of our decision function is similar to using a dictionary method to assign bills to topics. It is a supervised multi-label classification task: for each bill we look to classify, the model matches word embeddings with the destination topics in the training data. For example, like a dictionary model, if the embedding equivalent of the term “transit” is present, it predicts the topic “Transportation” ∈ y. Dictionary methods are transparent, but their deterministic nature likely generates too many false negatives while also posing a risk of false positives.
Next we need to acknowledge the limitations of dictionary methods, and generate predictions for bills for which no keyword-equivalent embeddings are present. In pursuit of this end, the model is our direct antagonist: if we simply train the model on the dictionary method-coded bills, the model could perfectly reverse-engineer the dictionary method, imposing the same issues of false positives and false negatives that dictionary models face. To train the model to learn a more nuanced representation of topics, we must block the deterministic dictionary path such that learning about the context surrounding keywords is the only way for it to make progress on its training task.
The second pass breaks the model’s ability to memorize keywords, by artificially corrupting the input data in several ways: (1) randomly replace words in bill summaries with synonyms; (2) randomly delete words from bill summaries; (3) randomly “mask” words in bill summaries (i.e. the model knows a word is there, but does not know what it is); and (4) mask every occurrence of keywords in bill summaries, e.g. the phrase “about the learning environment” would be corrupted to be read “about the learning [mask].” It is crucial to our procedure that we do not solely mask keywords, but randomly mask other words as well; otherwise, our model could infer the missing keywords due to the context that surrounds them, thereby reverse-engineering the dictionary method. So the absence of keywords forces the model to learn how the context surrounding keywords can help in predicting the bills’ topics.
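A minimal sketch of this corruption step is below; the deletion and masking probabilities, and the omission of the synonym-replacement step, are simplifying assumptions for illustration.

```python
# Minimal sketch of the second-pass input corruption. Probabilities are
# illustrative, and synonym replacement (step 1) is omitted for brevity.
import random

MASK = "[MASK]"

def corrupt(summary: str, keywords: set[str], p_delete: float = 0.1, p_mask: float = 0.1) -> str:
    out = []
    for word in summary.split():
        if word.lower().strip(".,;:") in keywords:
            out.append(MASK)                    # (4) mask every keyword occurrence
        elif random.random() < p_delete:
            continue                            # (2) randomly delete non-keywords
        elif random.random() < p_mask:
            out.append(MASK)                    # (3) randomly mask non-keywords
        else:
            out.append(word)
    return " ".join(out)

print(corrupt("A bill about the learning environment", {"environment"}))
# e.g. "A bill about the learning [MASK]"
```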
The second pass of the model generates predictions for the bills for which no keyword-equivalent embeddings are present. From the standpoint of the original dictionary method, any bills which lack keywords but have topic predictions are technically “false positives,” but by-hand inspection shows that these model disagreements typically yield an accurate measure of the bill’s topic(s). We therefore treat these predictions as valuable, since they come from bills whose word embeddings sit apart from the dictionary-aligned approach. We take advantage of these bills in our final pass.
In the third pass, the model is then trained from scratch on the bills from the second pass without keyword-equivalent embeddings. With the second pass predictions as the ground truth labels, we generate predictions for the bills for which keyword-equivalent embeddings are present. In summary, the model learns how to form predictions for bills which lack keywords by learning how the context surrounding keywords can be used in their absence (second pass). The model forms said predictions, and then is re-trained on its own predictions for bills which lack keywords from scratch (third pass). This severs the link between keywords and topics because, during the third pass, it is never exposed to the keywords.
The model produces a set of probabilities it associates with observing each topic. Researchers should set a cutoff value τ, above which to predict the presence, and below, the absence, of a given topic. In the results section to follow, we set τ = 0.5, but researchers could utilize other values of τ if they are more concerned with the model’s precision (producing fewer false positives, for which a higher τ works better) or its recall (reducing false negatives, for which a lower τ works better).
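A minimal sketch of applying the cutoff τ to per-topic probabilities follows, reusing the Health/Insurance example discussed in the Data Records section below; the probability dictionary itself is illustrative.

```python
# Minimal sketch: turning per-topic probabilities into multi-label predictions
# with a cutoff tau. The probabilities mirror the Health Care Choice Act
# example discussed in the Data Records section; "Tax Policy" is a filler topic.
def predict_topics(probabilities: dict[str, float], tau: float = 0.5) -> list[str]:
    return [topic for topic, p in probabilities.items() if p >= tau]

bill_probs = {"Insurance": 0.999, "Health": 0.989, "Tax Policy": 0.04}
print(predict_topics(bill_probs))              # tau = 0.5  -> ['Insurance', 'Health']
print(predict_topics(bill_probs, tau=0.995))   # higher tau favors precision -> ['Insurance']
```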
Data Records
The estimates are publicly available at two levels of specificity. First, the “individual bill” estimates23 contain the model’s confidence that the bill applies to each of the 28 policy areas, for a total of 38.1 million estimates. The variables in this .csv file are an identifier that corresponds to the master dataset (dg_id), the model’s confidence that the bill suits each policy area (tau), and the policy codes in string format (policycode), which are described in Table 1.
As an example of the individual bill estimates, Table 2 provides the model’s output for West Virginia’s 2011 bill “Creating the ‘Health Care Choice Act.’” As the title would suggest, this bill has an extremely high policy prediction for Health (0.989 out of 1.0); however, it receives an even higher prediction for Insurance (0.999). The bill description indicates how this bill regulates insurers, which the model has used to determine that it is an insurance bill. No other policy area remotely approaches our default confidence level of τ = 0.50, so the model determines that this bill applies to only these two policy areas.
Second, there is a “master” dataset24 which reflects the policy areas that apply to each of the 1.36 million bills from approximately 2009-2020. Researchers can collapse this file to produce an estimate of state policy agendas. The variables in this .csv file include indicators (0,1) for each of the policy areas named in Table 1, as well as an identifier that corresponds to the individual estimates dataset (dg_id), the bill’s code (ajo_id), metadata on the bill drawn from Legiscan including its number (legiscan_bill_number), type (legiscan_bill_type_verbose), title (legiscan_title), how far the bill advanced on a five-point scale (legiscan_status), the session the bill was introduced in (legiscan_session), and the first (legiscan_session_year_start) and last (legiscan_session_year_end) years of the session. See the Usage Notes section for more details on these variables.
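A minimal sketch of collapsing the master dataset into per-state counts by policy area might look like the following; the file name and the indicator column names are placeholders, so consult the accompanying readme files for the actual variable names.

```python
# Minimal sketch: collapsing the master dataset into per-state bill counts by
# policy area and session start year. The file name and the policy indicator
# column names are placeholders; see the readme files for the actual names.
import pandas as pd

master = pd.read_csv("master_dataset.csv")
master["state"] = master["ajo_id"].str.split("_").str[0]   # AJO keys begin with the state

policy_cols = ["health", "education", "tax_policy"]         # stand-ins for the 28 indicators
agenda = (master
          .groupby(["state", "legiscan_session_year_start"])[policy_cols]
          .sum()
          .reset_index())
```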
The confidence level in the master dataset is set to τ = 0.50, and 51 percent of bills have a single policy code, while 26 percent have two policy codes. Only 95 bills have more than five codes. The bills that have unusually high numbers of policy codes, like Montana’s 2021 HB2 titled “General Appropriations Act,” are usually omnibus legislation or appropriations bills regarding many state agencies.
For a descriptive overview of the master dataset, Figs. 1–5 show the number of bills assigned to each of the 28 categories from 2009-2019, aggregated by biennium and separated by the primary author of the bill, where discernible. While the amount of attention paid to many policy areas has increased over time, as Fig. 1 shows for education and health, this has not been the case across the agenda: Fig. 3 shows declining attention to sectors like banking. This follows from a societal fascination with banking at the beginning of the decade, in the wake of the 2008 financial crisis, that has since faded. In addition to the changes over time, the levels are also important here. Our method shows that the military, including veterans issues, is a major focus of state legislation that previous efforts have missed.
There are also notable partisan patterns, though it is worth noting that the raw count of bills being introduced is affected by how many Republican, Democratic, or third-party/independent legislators are in office for a given session. With that caveat in mind, we see that bills on certain issues are primarily authored by Democrats, like Civil Rights in Fig. 4, or by Republicans, like Religion in Fig. 5.
Technical Validation
There are challenges to validating these data, as there is no “ground truth” dataset of state legislative bills coded under the Gray and Lowery codebook. This presents two issues. First, translating between disparate codebooks introduces measurement error, as bills could be “correctly” assigned according to each codebook but not translate to each other. For example, the Gray and Lowery codebook has a category for electric utilities (“Utilities”) that is separate from the PPDP’s “Energy” topic, so any bill assigned to utilities could be labeled as incorrect. Second, assigning bills to a single policy area is a demanding task on its own. Many state politics researchers have deemed the classification task complex enough to necessitate that a domain expert inspect documents by hand. For example, Reingold et al. (2021)25 first apply a dictionary method to identify legislation concerning abortion by searching for the “abortion” keyword, and their second stage of analysis then requires hand-coding bills to measure their abortion “sentiment.” The Congressional Bills Project1 and the codebook to which it adheres do away with keywords entirely, instead training hand-coders for several weeks to identify the leading policy area of each bill. This approach has its own issues. First, it is not 100 percent reliable: Hillard et al. (2008, p. 40)15 report that coders are trained until they code congressional bills into the 21 major topic codes at 90 percent reliability and the 200+ minor topic codes at 80 percent reliability. Second, using hand-coders comes at the expense of transparency; unless the project clearly annotates the “why” behind each hand-coding decision, it is difficult to diagnose disagreements between hand-coders and downstream researchers.
To address these concerns, we take the following steps to demonstrate the validity of these estimates. First, we compare the machine learning model’s estimates to the legacy dictionary method. This shows that the machine learning model offers similar levels of precision, which provides confidence that the model is finding the policy content it claims to, but far better recall, or the degree to which the model finds all possible bills in each policy area. Second, we compare the model to estimates generated by hand-coders16 to show that the model is reliable.
These exercises demonstrate the external validity of these estimates: the machine learning estimates track the hand-coded estimates of the Pennsylvania legislature closely over time. In particular, the machine learning model greatly outperforms the legacy dictionary method in its recall of potential bill codings.
Model Evaluation Against Legacy Dictionary Method
We first evaluate the machine learning estimates against the legacy dictionary method. This is not a traditional validation, as the dictionary method is less of a “ground truth” than an alternative approach. But by showing which policy areas these two models agree and disagree on, we can better understand how the machine learning model operates. In Table 3, a “false positive” denotes an instance where the model predicts the presence of a given topic despite the absence of all its keywords. A “false negative” denotes an instance where the model predicts the absence of a given topic despite the presence of one of its keywords. “True positives” and “true negatives” are cases where the model and dictionary method agree to assign or not assign the topic to the bill, respectively.
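A minimal sketch of how these cells, and the resulting precision and recall, can be tabulated for a single topic is below; the indicator column names and toy values are assumptions for illustration.

```python
# Minimal sketch: model-versus-dictionary agreement for a single topic, with the
# dictionary codes standing in for "truth" as in Table 3. Column names and
# values are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "model_health":      [1, 1, 0, 0, 1],   # model prediction for the topic
    "dictionary_health": [1, 0, 1, 0, 0],   # legacy keyword code for the topic
})
tp = ((df.model_health == 1) & (df.dictionary_health == 1)).sum()
fp = ((df.model_health == 1) & (df.dictionary_health == 0)).sum()
fn = ((df.model_health == 0) & (df.dictionary_health == 1)).sum()

precision = tp / (tp + fp)   # share of model positives that also contain a keyword
recall = tp / (tp + fn)      # share of keyword-coded bills the model also flags
```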
For almost all topics, false positives outnumber true positives, owing to the fact that the legacy dictionary method generates codes for a much smaller share of bills. A higher topic Precision denotes a larger share of true positives compared against all positives; as Precision increases, the model finds fewer instances where it predicts the presence of a topic despite the absence of keywords, meaning the keywords are more “necessary” for identifying the topic. For example, the most precisely-defined topic, by far, is “Tax Policy”, with Precision = 74%. One would be hard-pressed to think of a bill pertaining to “Tax Policy” which lacks all mentions of “tax,” “taxation,” “taxable,” and so on (or at least, such instances may be harder to find than counterparts in other topics), but there are “Tax Policy” bills that instead mention a “levy,” the “Department of Revenue,” “economic development,” or “bonds.”
The lowest precision topics are “International Affairs and Foreign Aid” (5%), “Law” (6%), and “Civil Rights” (13%), meaning the model frequently predicts these topics’ presence without keywords being present. To shed light on whether these overrides of the dictionary method’s keyword rules are valuable, consider what the keyword-free predictions contain. Bills predicted as “International Affairs and Foreign Aid” (abbreviated in the table as “Intl. Affairs”) often include “exchange programs” (for students), explicitly mention other countries, or include other nouns plausibly associated with international affairs. For example, the dictionary defined in Table 1 includes the keywords “diplomat” and “embassy,” but not “ambassador,” and yet “ambassador” takes on a similar semantic meaning with respect to the machine learning model’s classification task. Using the keywords associated with the “Civil Rights” topic as its conceptual “seeds,” the model has grown the topic to include a large number of resolutions, especially those which “celebrate the life” of an individual, mention the “slave trade,” “equal opportunity,” or “human rights.” The multi-label nature of the data proves especially useful in flagging, e.g., a bill which mentions “Female Veteran’s Day” as pertaining to both “Civil Rights” and “Military,” and a bill mentioning the “Equal Opportunity Scholarship Act” as pertaining to both “Civil Rights” and “Education.” False positives for “Law” often discuss “liability,” “unenforceable,” “covenant,” “contract,” “Juvenile Justice,” and “allegations.”
To provide a high-level perspective on the internal consistency of the model, Fig. 6 presents a heatmap of topic co-occurrences for bills for which no keywords are present. Using non-keyword-coded bills illustrates how the model performs without the strongest clues regarding policy areas. Each cell value denotes the percentage of bills coded as Topic = Row which were jointly coded as Topic = Column. For example, the “Civil Rights” topic most frequently co-occurs with “Religion,” “Law,” and “Labor and Employment,” even when the bills contain zero keywords. The other top co-occurrences are “Environment” and “Public Lands and Water Management” and “Sports” (often with references to fishing and hunting), “Health” and “Insurance,” “Foreign Trade” and “Manufacturing,” “Construction” and “Utilities,” as well as “Natural Resources” and “Environment.” The “Utilities” topic most often co-occurs with “Communication” (e.g. telecoms) and “Local Government.”
Correlation between topics based on model predictions, no keywords. Notes: Cells denote the frequency with which bills predicted as pertaining to Topic = Row are also coded as Topic = Column by the model, observing only bills for which there are no keywords present from either Topic = Row or Topic = Column. For example, 12.8% of bills which were predicted as pertaining to “Religion” (while lacking any “Religion” keywords) were also coded as pertaining to “Civil Rights” (while lacking any “Civil Rights” keywords).
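A minimal sketch of the co-occurrence calculation behind each heatmap cell is below; the prediction and keyword indicator column names, and the toy data, are assumptions for illustration.

```python
# Minimal sketch of one heatmap cell: among bills predicted as the row topic
# (with no row-topic or column-topic keywords present), the share also
# predicted as the column topic. Column names are illustrative assumptions.
import pandas as pd

def cooccurrence(df: pd.DataFrame, row: str, col: str) -> float:
    base = df[(df[f"pred_{row}"] == 1) & (df[f"kw_{row}"] == 0) & (df[f"kw_{col}"] == 0)]
    return float("nan") if base.empty else (base[f"pred_{col}"] == 1).mean()

bills = pd.DataFrame({
    "pred_religion": [1, 1, 0, 1], "kw_religion": [0, 0, 0, 0],
    "pred_civil_rights": [1, 0, 0, 1], "kw_civil_rights": [0, 0, 1, 0],
})
print(cooccurrence(bills, "religion", "civil_rights"))  # 2 of 3 eligible bills co-occur
```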
Internal Validation
To demonstrate the face validity of the data, we drill down into the estimates with an issue that has emerged in the years since the original dictionary codebooks were published: fracking. Fracking developed as a political issue in the US after the development of “hydraulic fracturing” or “hydrofracking” technology in the mid-2000s allowed energy companies to take advantage of shale deposits underneath many, but not all, American states to produce natural gas and oil. While this was a revolutionary technique for energy extraction, it also had concerning environmental consequences, as the chemicals used for fracking, and its byproducts, could be toxic to groundwater and other parts of the environment.
To test whether our model accounts for fracking, we check whether bills containing the many synonyms for this new technology are coded into the categories we would expect. We identify fracking bills using the following (case-insensitive) search terms in bill titles and descriptions: fracking, hydro fracturing, hydro-fracturing, hydrofracturing, hydro fracking, hydro-fracking, hydrofracking, hydraulic fracturing, hydraulic-fracturing, shale gas, shale oil, horizontal drill, horizontal gas, horizontal stimulation, horizontal well, fracturing fluid, fracturing wastewater, fracturing water, and fracturing chemical. At least one of these search terms appeared in 736 bills from 2009-2020. We find that the model assigns them to logical categories. Figure 7 illustrates that the fracking-related search terms are mostly concentrated in the “Energy or Natural Resource” and “Environment” topics, with some prevalence in “Public Lands and Water Management” as well. We take this exercise as further evidence of the model’s internal reliability.
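The case-insensitive search itself can be reproduced with a sketch like the one below; the file and column names are placeholders, and the published search also covered bill descriptions, which are omitted here.

```python
# Minimal sketch of the case-insensitive fracking search. The term list is
# copied from the text; the file and column names are placeholders, and the
# published search also covered bill descriptions.
import re
import pandas as pd

TERMS = [
    "fracking", "hydro fracturing", "hydro-fracturing", "hydrofracturing",
    "hydro fracking", "hydro-fracking", "hydrofracking", "hydraulic fracturing",
    "hydraulic-fracturing", "shale gas", "shale oil", "horizontal drill",
    "horizontal gas", "horizontal stimulation", "horizontal well",
    "fracturing fluid", "fracturing wastewater", "fracturing water",
    "fracturing chemical",
]
pattern = "|".join(re.escape(t) for t in TERMS)

bills = pd.read_csv("master_dataset.csv")
is_fracking = bills["legiscan_title"].fillna("").str.contains(pattern, case=False, regex=True)
print(is_fracking.sum())
```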
A human coder also evaluated a sample of 1,000 randomly selected bills drawn from a separate database (OpenStates) to 1) evaluate the completeness of the sample drawn from Legiscan, 2) allow for comparison with a multi-class system, and 3) provide feedback on the uncertainty of the process. This exercise produced a number of results. First, it showed that Legiscan contained 100 percent of the bills in the OpenStates archives, which reaffirms our decision to take the Legiscan data at face value. Second, the human coder revealed the difficulty of these coding decisions. The human coder was provided the same bill title from Legiscan that the model was fed, and was asked if this was sufficient to assign a bill or if it was necessary to look up the full text of the bill via a link to its PDF. For 25.4% of bills, they required the full text, data that the model did not have access to. Third, they were instructed to place the bill into a single category, if feasible, or, if the bill was too complex, to assign a second category. They found that 44% of bills required a second category.
We evaluate the accuracy of the model in two ways designed to evaluate multi-label prediction models26. First, we estimate the model’s “Ranking Loss”, which calculates the average number of incorrectly ordered label pairs. For example, the hand-coder assigned Alabama’s “Property Insurance and Energy Reduction Act” (SB 220 from 2015, titled: “Energy efficiency projects, financing by local governments authorized, non ad valorem tax assessments, liens, bonds authorized, Property Insurance and Energy Reduction Act of Alabama”) as “Natural Resources” and “Local Government”, but the model’s first two estimates were “Utilities” (τ = 0.98) and “Tax Policy” (τ = 0.42). Since “Local Government” (τ = 0.16) was the third highest estimate, the ranking loss for this bill was 2. A perfect classifier would be zero, and chance would be 0.5. The average ranking loss for this sample of bills was 0.043, an impressive score.
Another way to evaluate a multi-label classifier is through Top-K agreement, which indicates the share of bills for which one of the hand-coder’s assigned classes appears among the model’s top K estimates. The previous example is a bill that does not find agreement at K=1 or K=2, but does find agreement at K=3. Table 4 shows that if only the top option is included, there is agreement between the human coder and model estimates on 57 percent of bills. If the first three model estimates are considered, there is agreement on 80 percent of bills. This is similar to what the Policy Agendas Project considers suitable human coder performance on minor topic codes.
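The Top-K agreement calculation can be sketched as follows, reusing the Alabama SB 220 example; the extra low-probability “Health” topic and the exact probability values are illustrative assumptions.

```python
# Minimal sketch: a bill "agrees" at K if any hand-coded topic appears among the
# model's K highest-probability topics. Values reuse the Alabama SB 220 example;
# the "Health" entry is a filler topic added for illustration.
def top_k_agreement(hand_labels: set[str], model_probs: dict[str, float], k: int) -> bool:
    top_k = sorted(model_probs, key=model_probs.get, reverse=True)[:k]
    return bool(hand_labels & set(top_k))

probs = {"Utilities": 0.98, "Tax Policy": 0.42, "Local Government": 0.16, "Health": 0.05}
hand = {"Natural Resources", "Local Government"}
print([top_k_agreement(hand, probs, k) for k in (1, 2, 3)])  # [False, False, True]
```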
The accuracy and precision of the model can also be calibrated using different τ cutoff values. Table 5 shows the range, along with the default level (τ = 0.5). A higher τ would provide more confident estimates, at the expense of total coverage. The lower F1 scores can be considered an artifact of the multi-label to multi-label comparison.
External Validation
The PPDP provides an opportunity for an external validation of our estimates. The PPDP was hand-coded by researchers using a version of the Comparative Policy Agendas codebook, adjusted for the context of state politics. The PPDP coded the universe of Pennsylvania legislative bills from 1979-2016, and this manual process was considered the gold standard in the field before the implementation of automated methods. However, there are two barriers to comparing these two sets of estimates of the same bills before the Pennsylvania legislature. First, as shown in Table 6, there is misalignment on some topics. The Comparative Policy Agendas codebook was designed to account for the issues dealt with by national leaders of western democracies, including topics like “Macroeconomics” or “International Affairs and Foreign Aid” that are not relevant for a codebook optimized to study American state politics10,27. Therefore, we use 17 policy areas that overlap across the two datasets: civil rights, health, agriculture, labor, education, environment, energy, transportation, legal, welfare, construction, military, communications, public lands, local government. There are some difficult decisions to be made aligning these coding schemes, and we err on the side of caution by not assuming that the PPDP’s “Domestic Commerce” code would include the several of our codes that could fit there, including “insurance”, “manufacturing”, “bank”, or “small business.”
The second barrier to comparison is that the PPDP puts bills into only one major topic area, while our model estimates a number of policy labels for each bill. This puts a ceiling on the potential precision of our measure. For example, our model codes 2009’s House Bill 890, “Establishing a nursing and nursing educator loan forgiveness and scholarship program,” as both “Health” and “Education,” while the PPDP only considers it to be an “Education” bill. The “Health” label therefore counts as a “false positive,” even though we consider this bill, aimed at reducing the nursing shortage, to also be a health care bill. Interestingly, if we only use a dictionary model based on the presence of keywords, that bill is also only labeled as “Education”. As above, we compare both the keyword-only estimates and the modeled estimates in Table 7.
With those caveats in mind, Fig. 8 shows the machine learning estimates more closely approximate the PPDP coding of the state’s legislative agenda. A “perfect” relationship would align with the dark black 1:1 line. Getting into the specifics, Table 7 shows how the models compare. The micro-averaged F1 score is 0.50, a major improvement upon the legacy dictionary’s micro-averaged F1 score of 0.29. This difference is driven by a dramatic increase in recall, as the machine learning estimates retrieve about 57 percent of the PPDP bill codes, while the legacy keyword model only recovers a quarter of them. This recall distinction has practical use for researchers. The relatively poor recall of the dictionary model led to the original recommendation to use those estimates only for over-time comparisons within a state/policy area, which minimizes the blindspots of the keyword approach2. However, the results in Fig. 8 allow researchers to make claims across the agenda, such as “there are more bills introduced about transportation in Pennsylvania over the period 2009-2016 than energy,” because researchers can have confidence that the relative sizes of the policy agendas are being evaluated accurately.
There is a stronger correlation between the machine learning estimates and the Pennsylvania Policy Database Project than the legacy keyword model. Note: Estimates are labeled with the major topic number (see column 1 of Table 7) and the last two digits of the year.
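A minimal sketch of the micro-averaged F1 comparison reported in Table 7, using scikit-learn on multi-label indicator matrices, is shown below; the toy arrays are illustrative, not the Pennsylvania data.

```python
# Minimal sketch: micro-averaged F1 between PPDP hand codes and model codes over
# the overlapping topics, using multi-label indicator matrices. The toy arrays
# are illustrative; columns would correspond to the overlapping topic set.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

ppdp = np.array([[1, 0, 0],    # each row is a bill, each column an overlapping topic
                 [0, 1, 0],
                 [0, 0, 1]])
model = np.array([[1, 1, 0],
                  [0, 1, 0],
                  [1, 0, 0]])

print(f1_score(ppdp, model, average="micro"))
print(precision_score(ppdp, model, average="micro"))
print(recall_score(ppdp, model, average="micro"))
```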
In terms of reliability, methods designed to evaluate multi-label classification schemes26 show the machine learning estimates perform well. The average ranking loss is 0.11 (0.00 would be perfect, 0.50 would be near chance), which is slightly more ranking loss than in the human coder exercise, but that is to be expected because of the measurement error introduced by aligning the codebooks. The top-K agreement between the model estimates and the PPDP in Table 8 shows that at K = 3 there is 70 percent agreement. So by considering three guesses per bill, the machine learning method produces a reasonable facsimile of what a human coder can be expected to produce. There also appear to be diminishing returns, with Top-K agreement leveling off near 90 percent. This reveals just how difficult the task is: even with many guesses, it may not be reasonable to expect complete agreement.
Usage Notes
The data are prepared for applied research. The “individual bill” estimates23 dataset has the model predictions for every policy area in case researchers would like to change the τ level, and the “master” dataset24 indicates each policy area a bill qualifies for with τ = 0.50. Each bill has a unique identifier (dg_id) that aligns with the individual bill estimates and master dataset. The master dataset contains a primary identifier for each bill, which corresponds with the American Joint Operation code (AJO, which is linked to the OpenStates project)28. Linking to OpenStates provides more information on the bills, such as more detailed versions of bill legislative histories.
Table 9 explains each component of the AJO key, for example wv_2011_r1_HB_02801 or West Virginia’s “Creating the ‘Health Care Choice Act’”, which Table 2 showed was assigned to be a G0300 Health and G1510 Insurance bill. The components of the AJO key are connected with underscores that can be easily split.
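For instance, the key can be split as in the sketch below; the field names are descriptive guesses based on the example, so consult Table 9 for the authoritative component definitions.

```python
# Minimal sketch: splitting an AJO key into its components. The field names are
# descriptive guesses based on the example; Table 9 gives the authoritative
# definitions.
ajo_id = "wv_2011_r1_HB_02801"
state, year, session, bill_type, bill_number = ajo_id.split("_")
print(state, year, session, bill_type, bill_number)  # wv 2011 r1 HB 02801
```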
For cross-sectional research, we suggest aggregating bills by biennia. Most American states have two-year sessions that follow elections taking place in even-numbered years that align with the presidential election cycle (e.g. 2024, 2028, etc.). However, a few states hold their elections in odd-numbered years, following the presidential cycle. We align the variable tracking each biennium (Year2) with the first year of the typical two-year cycle, and with the second year for the four states with off-cycle legislative elections (NJ, LA, MS, VA). Another advantage of two-year cycles is that they capture the ebb and flow of an actual legislature. Trying to split a session, like the Massachusetts legislature’s roughly 18-month session, into two separate one-year cycles ignores the fact that many bills are introduced in the first few weeks of the session, only to be acted upon in the final few days. The two-year cycle also aligns with each Congress at the national level.
The data can also be aggregated by individual years, using the variable (Year1) that shows bills from sessions which began and ended in a single year. This variable is useful for aggregating bills to data that is measured on an annual basis, like lobbying registrations29 or Gross State Product.
We also include several qualitative descriptors of the bill directly from Legiscan as strings, including the session, the title of the bill, the years the session started and ended, a qualitative description of the type of bill (e.g. Bill, Resolution, Joint Memorial, etc.), and Legiscan’s 5-point status for how far a bill advanced (1: Introduced, 2: Engrossed, 3: Enrolled, 4: Passed, 5: Vetoed). We also provide a URL for where the original bill was scraped from the legislature’s website. Some of these links have since become broken, but they provide a useful trail to the archival material.
Legiscan provides the partisanship of bill sponsors, which we summarize in a single variable delineating the party of the primary sponsor (0 = unknown, 1 = Democrat, 2 = Republican). This is straightforward for states that have only a primary sponsor, which we observe with Legiscan’s personnel files; for states that allow multiple sponsors on a bill, we take the party of the median primary sponsor.
Code availability
Replication code to generate the policy estimates in Python, conduct the analysis, and reproduce the figures from this article is hosted by the Open Science Framework at https://osf.io/xv4w7. In addition to the individual23 and master24 datasets and replication code, there are also two “readme.txt” files to aid researchers in downloading and using these data.
References
Adler, E. S. & Wilkerson, J. Congressional bills project: 1998-2014. NSF 00880066 and 00880061 1 (2015).
Garlick, A. Laboratories of politics: There is bottom-up diffusion of policy attention in the American federal system. Political Research Quarterly 76, 29–43 (2023).
Firth, J. A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis (1957).
Rheault, L. & Cochrane, C. Word embeddings for the analysis of ideological placement in parliamentary corpora. Political Analysis 28, 112–133 (2020).
Vaswani, A. et al. Attention is all you need. CoRR abs/1706.03762 (2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2019).
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 (2019).
Yang, Z. et al. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237 (2020).
Brakst, B. House tax plan not just about tax cuts. MPR News (2017).
Fellowes, M., Gray, V. & Lowery, D. What’s on the table? The content of state policy agendas. Party Politics 12, 35–55 (2006).
Gray, V., Cluverius, J., Harden, J. J., Shor, B. & Lowery, D. Party competition, party polarization, and the changing demand for lobbying in the American states. American Politics Research 43, 175–204 (2015).
Gray, V., Lowery, D., Fellowes, M. & Anderson, J. L. Legislative agendas and interest advocacy: understanding the demand side of lobbying. American Politics Research 33, 404–434 (2005).
Kirkland, J. H., Gray, V. & Lowery, D. Policy agendas, party control, and PAC contributions in the American states. Business and Politics 12, 1–24 (2010).
Holyoke, T. T. Changing state interest group systems: Replicating and extending the ESA model. Interest Groups & Advocacy 10, 264–285 (2021).
Hillard, D., Purpura, S. & Wilkerson, J. Computer-assisted topic classification for mixed-methods social science research. Journal of Information Technology & Politics 4, 31–46 (2008).
McLaughlin, J. P. et al. The Pennsylvania Policy Database Project: A model for comparative analysis. State Politics and Policy Quarterly 10, 320–336 (2010).
Burgess, M. et al. The legislative influence detector: Finding text reuse in state legislation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 57–66 (ACM, 2016).
Baumgartner, F. R. & Jones, B. D. Agendas and instability in American politics. Chicago: University of Chicago Press (1993).
Grimmer, J. & Stewart, B. M. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21, 267–297 (2013).
Wu, Y. et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144 (2016).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013).
Pennington, J., Socher, R. & Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014).
Dee, E. & Garlick, A. State Legislative Policy Estimates: 2009-2020 (Individual Bills) https://doi.org/10.17605/OSF.IO/ZAFC6 (2025).
Dee, E. & Garlick, A. State Legislative Policy Estimates: 2009-2020 (Master Dataset) https://doi.org/10.17605/OSF.IO/UKNZ6 (2025).
Reingold, B., Kreitzer, R. J., Osborn, T. & Swers, M. L. Anti-abortion policymaking and women’s representation. Political Research Quarterly 1065912920903381 (2020).
Erlich, A., Dantas, S. G., Bagozzi, B. E., Berliner, D. & Palmer-Rubin, B. Multi-label prediction for political text-as-data. Political Analysis 30, 463–480 (2022).
Gray, V. & Lowery, D. The population ecology of interest representation: Lobbying communities in the American states (University of Michigan Press, 1996).
Brown, A. R. & Garlick, A. Bicameralism hinges on legislative professionalism. Legislative Studies Quarterly 49, 161–185 (2024).
Holyoke, T. T. Dynamic state interest group systems: a new look with new data. Interest Groups and Advocacy 8, 499–518 (2019).
Acknowledgements
We acknowledge Hattie Kahl for research assistance, Zac Peskowitz and audience members at the 2022 annual meeting of the American Political Science Association in Montreal, Quebec for comments and Dan Bernhardt, Stefan Krasa and Matt Winters for feedback throughout the project.
Author information
Contributions
The article was authored jointly. E.D. generated the original machine learning model and data, A.G. contributed analysis, and all authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dee, E., Garlick, A. Policy agendas of the American state legislatures. Sci Data 12, 1276 (2025). https://doi.org/10.1038/s41597-025-05621-5