Differences in Embeddings of U.S. Congressional Bill Data

Embeddings power a wide range of AI applications. At a high level, they capture the semantic essence of text by representing it as dense vectors in high-dimensional space. They’re fundamental to systems like product recommendations on e-commerce sites, customer segmentation engines, and semantic search tools.

Some embedding models are trained on broad, web-scale corpora and are versatile across tasks; others are trained on specialized domains like law, medicine, or finance, where language patterns and semantic structures can be unique. These training choices can lead different models to embed the same input in very different ways, which in turn can affect the performance of downstream tasks built on those embeddings.

I’m developing a side project inspired by AlphaXiv that focuses on U.S. legislation — making congressional bills understandable through simplified ELI5 explanations and presenting their potential impact on citizens. As a side quest, I ran a small experiment using embeddings to explore how different models represent the same legislative texts — specifically, to see how much they agree or disagree when it comes to “similarity.”

For this analysis, I used:

  • voyage/voyage-law-2 — a domain-specific model trained by Voyage AI on legal and policy texts.
  • openai/text-embedding-3-large — OpenAI’s latest general-purpose embedding model.
  • jina_ai/jina-embeddings-v3 — another general-purpose embedding model by Jina AI.

My dataset consists of a sample of roughly 390 U.S. House bills from the 118th Congress, pulled using the official U.S. Congress API. Here’s a snippet of what the raw data looks like:

print(house_bills_data)

[{'bill_id': 'hjres26',
  'congress': 118,
  'number': '26',
  'text': 'That the Congress disapproves of the action of the District of '
          'Columbia Council described as follows: The Revised Criminal Code '
          'Act of 2022...',
  'title': 'Disapproving the action of the District of Columbia Council in '
           'approving the Revised Criminal Code Act of 2022.',
  'type': 'hjres'},
 ...
]
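For context, here is a rough, hypothetical sketch of what pulling these bills might look like against the Congress.gov v3 API (api.congress.gov/v3). The endpoint paths, query parameters, and response field names below are assumptions based on the public API documentation, not the exact code I ran, so treat them as a starting point.

# Hypothetical sketch of pulling House bills via the Congress.gov v3 API.
# Endpoint paths and response field names are assumptions; check the official docs.
import os
import requests

API_KEY = os.environ["CONGRESS_API_KEY"]  # assumed: a free api.congress.gov key
BASE_URL = "https://api.congress.gov/v3"

def list_house_bills(congress=118, limit=250, offset=0):
    """List House bills for a given Congress (the API paginates results)."""
    resp = requests.get(
        f"{BASE_URL}/bill/{congress}/hr",
        params={"api_key": API_KEY, "format": "json", "limit": limit, "offset": offset},
    )
    resp.raise_for_status()
    return resp.json().get("bills", [])

def get_bill_text_versions(congress, bill_type, number):
    """Fetch text-version metadata for one bill; each version links to the full text."""
    resp = requests.get(
        f"{BASE_URL}/bill/{congress}/{bill_type}/{number}/text",
        params={"api_key": API_KEY, "format": "json"},
    )
    resp.raise_for_status()
    return resp.json().get("textVersions", [])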

To keep things simple, I did no additional cleaning — I used the full bill text as-is. Then, I extracted embeddings for each model using litellm, a wrapper library that standardizes access to different embedding APIs. Here’s a simplified version of the code I used:

from litellm import embedding
from tqdm import tqdm

models = ["voyage/voyage-law-2", "jina_ai/jina-embeddings-v3", "openai/text-embedding-3-large"]
for model in models:
    print(f"Embedding bills with {model}")
    for bill in tqdm(house_bills_data):
        # Skip bills already embedded with this model, so the loop can be re-run safely.
        if f'{model}-embeddings' not in bill:
            response = embedding(model=model, input=[bill['text']])
            bill[f'{model}-embeddings'] = response['data'][0]['embedding']

With these embeddings in hand, I computed the pairwise cosine similarity between all bills within each model. My goal here was simple: see how each model structurally “clusters” the legislative space. Here’s the distribution of pairwise similarity scores for each model:

Distributions of pairwise bill text embedding cosine similarities for different models
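
For reference, here is a minimal sketch of that pairwise-similarity step, assuming the models list and house_bills_data built above and using scikit-learn's cosine_similarity; the plotting code is illustrative rather than the exact styling of the figure.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

pairwise_scores = {}
for model in models:
    # Stack this model's embeddings into an (n_bills, dim) matrix.
    X = np.array([bill[f'{model}-embeddings'] for bill in house_bills_data])
    sim = cosine_similarity(X)              # (n_bills, n_bills) similarity matrix
    upper = np.triu_indices_from(sim, k=1)  # unique pairs only, excluding self-similarity
    pairwise_scores[model] = sim[upper]

# Overlay the three distributions for a quick visual comparison.
for model, scores in pairwise_scores.items():
    plt.hist(scores, bins=100, density=True, alpha=0.5, label=model)
plt.xlabel("Pairwise cosine similarity")
plt.ylabel("Density")
plt.legend()
plt.show()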

A few quick takeaways:

  • All three models produce distributions with a similar shape — but they’re centered around different values.
  • voyage/voyage-law-2 has the lowest average similarity between bills, likely reflecting its training on legal text and a more conservative notion of what counts as semantically “close.”
  • Jina AI, on the other hand, shows a higher baseline similarity, perhaps due to being more general-purpose and less sensitive to legislative nuance.

That’s all for now — I’ll keep this post short. I plan to explore more in future posts, including:

  • How similarity distributions change when using bill titles only, instead of complete bill text
  • Whether models agree on which bills are most similar (via top-k overlap)