Parsing DNS Query Logs to find CAs
I wrote a Julia program to parse 4,196,368 lines of DNS queries, look up the certificate authority (CA) tied to each, and dump the results into a CSV file. All of this runs just fine on a Pine64 RockPro64.
Why?
Previously, previously, and more recently previously. Also, previously. Also, this. Basically, because browsers, operating systems, and some applications ship with embedded CAs. From a risk management perspective, how many of these CAs do you actually use in the course of normal operations?
How?
First, we need to know which domains we're requesting as we go about our day. I've been logging all my DNS queries for the past few months; combined with NextDNS, I have a ton of data to get started with. Three months of DNS queries comes to 4,196,368 entries. NextDNS exports them as a comma-delimited CSV file, all 574MB of it. The fields within are: timestamp, domain, query_type, dnssec, protocol, client_ip, status, reasons, destination_country, root_domain, device_id, device_name, device_model, device_local_ip, matched_name, client_name. It's a pretty rich dataset. And because it's in CSV format, Apache Drill can easily query it using standard SQL. However, I skipped Drill to see how to handle the analysis in Julia on a RockPro64.
The total set is 4,196,368 queries, but only 42,196 domains are unique: about 1%. Not quite what I expected, but it seems the OS, apps, and I repeatedly query the same names. There's no point in querying all 4 million domains when 1% of them cover everything unique. I don't expect different subdomains to have different CAs, so the 1% unique are enough to learn what we want.
42k domains to query is manageable in a constrained system. The other benefit of a constrained system is I'm forced to look for optimizations all over.
The first optimization is to load the minimum amount of data into memory as a dataframe (using DataFrames, CSV):
df_rawlist = unique(DataFrame(CSV.File(csvfile; select=[:domain])))
Basically, select only the domain column (the second column) and load just its unique values into the dataframe, df_rawlist. This alone saves 1.7GB of RAM, and since the RockPro64 only has 4GB, every bit of memory saved helps. In fact, I opted not to wrap all of that in another function to convert the list of unique domains into a simple Vector. I did this for future readability. It could have looked something like this:
values(unique(DataFrame(CSV.File(csvfile; select=[:domain])))[:,1])
I started down the path of peering into the HTTP.jl and MbedTLS libraries in Julia to see if I could grab the CA from the connection to each domain. It wasn't easy, and ultimately I couldn't get it working reliably. Until I can figure that out, I did the simple thing: use curl. I started with libcurl wrapped in Julia, and then I just shelled out:
readchomp(pipeline(`curl --trace - -sIL -A "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0" --connect-timeout 10 https://$host`, `grep -i issuer`))
Basically, I wrapped that in threads so I can query six domains at once. It took 95 minutes to work through the list.
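The wrapper itself isn't shown above, so here's a rough sketch of how it might look: a fixed pool of worker tasks pulling hosts off a channel, each shelling out to curl as before. The helper names (extract_ca, fetch_issuer, query_all) are mine for illustration, not from the original script.

```julia
# Pull the organization (O=) field out of a curl/grep "issuer:" line.
function extract_ca(issuer_line::AbstractString)
    m = match(r"O=([^;,]+)", issuer_line)
    return m === nothing ? missing : String(strip(m.captures[1]))
end

# Shell out to curl exactly as in the post; any failure (timeout, TLS
# error, no issuer line) comes back as `missing`.
function fetch_issuer(host::AbstractString)
    ua = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
    try
        line = readchomp(pipeline(
            `curl --trace - -sIL -A $ua --connect-timeout 10 https://$host`,
            `grep -i issuer`))
        return extract_ca(line)
    catch
        return missing
    end
end

# Query the whole list with a pool of six concurrent workers.
function query_all(domains; nworkers = 6)
    jobs = Channel{String}(length(domains))
    foreach(d -> put!(jobs, d), domains); close(jobs)
    results = Channel{Tuple{String,Union{String,Missing}}}(length(domains))
    @sync for _ in 1:nworkers
        Threads.@spawn for host in jobs
            put!(results, (host, fetch_issuer(host)))
        end
    end
    close(results)
    return collect(results)
end
```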
What?
All of that was about 45 minutes of work: roughly 5 minutes of actual coding and 40 minutes of figuring out HTTP.jl, MbedTLS, and libcurl. What are the results?
Found 58 CAs in 118,377 site requests.
Combined CAs using Jaro-Winkler similarity:
┌──────────────────────────────────────────────────────────────┬───────────┐
│ CA │ total_sum │
│ String │ Int64 │
├──────────────────────────────────────────────────────────────┼───────────┤
│ CPanel │ 1 │
│ Apple │ 1 │
│ Max-Planck-Gesellschaft │ 1 │
│ QuoVadis Limited │ 1 │
│ TI Trust Technologies S.R.L. │ 1 │
│ DNEncrypt │ 1 │
│ The USERTRUST Network │ 1 │
│ Beijing │ 1 │
│ SSL Corp │ 1 │
│ QuoVadis Trustlink B.V. │ 1 │
│ Shenzhen │ 1 │
│ T-Systems International GmbH │ 1 │
│ TERENA │ 2 │
│ Google │ 2 │
│ GlobalSign │ 2 │
│ SECOM │ 2 │
│ Trust Provider B.V. │ 2 │
│ DHIMYOTIS │ 2 │
│ Sectigo │ 5 │
│ Yandex LLC │ 5 │
│ IdenTrust │ 6 │
│ HydrantID (Avalanche Cloud Corporation) │ 7 │
│ Starfield Technologies │ 8 │
│ Cybertrust │ 9 │
│ SwissSign AG │ 10 │
│ The Trustico Group Ltd │ 10 │
│ Corporation Service Company │ 10 │
│ Unizeto Technologies S.A. │ 11 │
│ TrustAsia │ 11 │
│ TAIWAN-CA │ 11 │
│ Verein zur Foerderung eines Deutschen Forschungsnetzes e. V. │ 12 │
│ Actalis S.p.A. │ 12 │
│ Network Solutions L.L.C. │ 19 │
│ Trustwave │ 20 │
│ Buypass AS-983163327 │ 25 │
│ GEANT Vereniging │ 28 │
│ GoGetSSL │ 48 │
│ Internet2 │ 375 │
│ Google Trust Services │ 410 │
│ Apple Inc. │ 410 │
│ Microsoft Corporation │ 465 │
│ COMODO CA Limited │ 766 │
│ cPanel │ 822 │
│ ZeroSSL │ 1080 │
│ GoDaddy │ 3463 │
│ Starfield │ 7378 │
│ Sectigo Limited │ 8492 │
│ Amazon │ 9206 │
│ Google Trust Services LLC │ 15844 │
│ DigiCert Inc │ 17685 │
│ Cloudflare │ 17720 │
│ Let's Encrypt │ 33970 │
└──────────────────────────────────────────────────────────────┴───────────┘
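The combining step above uses Jaro-Winkler similarity to fold near-duplicate CA names together. Packages such as StringDistances.jl provide the metric ready-made; for reference, here's a from-scratch sketch of the mechanics, with a naive greedy grouping whose 0.85 threshold is illustrative, not necessarily what my script used.

```julia
# Jaro similarity: count characters matching within a sliding window,
# then discount transpositions.
function jaro(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    la, lb = length(a), length(b)
    la == 0 && lb == 0 && return 1.0
    window = max(div(max(la, lb), 2) - 1, 0)
    amatch, bmatch = falses(la), falses(lb)
    m = 0
    for i in 1:la
        for j in max(1, i - window):min(lb, i + window)
            if !bmatch[j] && a[i] == b[j]
                amatch[i] = true; bmatch[j] = true; m += 1
                break
            end
        end
    end
    m == 0 && return 0.0
    t, j = 0, 1        # half the out-of-order matches count as transpositions
    for i in 1:la
        amatch[i] || continue
        while !bmatch[j]; j += 1; end
        a[i] != b[j] && (t += 1)
        j += 1
    end
    t = div(t, 2)
    return (m / la + m / lb + (m - t) / m) / 3
end

# Winkler's refinement: bonus for a shared prefix of up to 4 characters.
function jaro_winkler(s1, s2; p = 0.1)
    sim = jaro(s1, s2)
    l = 0
    for (c1, c2) in zip(s1, s2)
        (c1 == c2 && l < 4) || break
        l += 1
    end
    return sim + l * p * (1 - sim)
end

# Greedy grouping: fold each name into the first group whose representative
# is similar enough, otherwise start a new group.
function group_names(names; threshold = 0.85)
    groups = Vector{Vector{String}}()
    for n in names
        g = findfirst(grp -> jaro_winkler(grp[1], n) >= threshold, groups)
        g === nothing ? push!(groups, [n]) : push!(groups[g], n)
    end
    return groups
end
```

With this threshold, for example, "Sectigo" and "Sectigo Limited" land in the same group while "Amazon" starts its own.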
Surprising no one, Let's Encrypt and Cloudflare dominate. I assumed as much, since they're (for the most part) zero-cost providers.
Every domain queried in the past 3 months is already in the CSV, so what other questions can we answer?
What are the top 10 root domains queried?
Load the domains.csv into a dataframe, df_raw, and then run this (using StatsBase, PrettyTables):
@pt sort(countmap(values(df_raw[:, :root_domain])); byvalue=true, rev=true)
Which results in:
┌───────────────────────────┬─────────┐
│ Keys │ Values │
│ String? │ Int64 │
├───────────────────────────┼─────────┤
│ missing │ 1494844 │
│ apple.com │ 321694 │
│ google.com │ 191519 │
│ syncthing.net │ 118861 │
│ icloud.com │ 92321 │
│ gstatic.com │ 61119 │
│ mozilla.com │ 41234 │
│ naver.com │ 40466 │
│ backblaze.com │ 34208 │
│ duckduckgo.com │ 33266 │
│ dropbox.com │ 32922 │
│ ⋮ │ ⋮ │
└───────────────────────────┴─────────┘
10046 rows omitted
Every machine and network under my control queries the same set of DNS servers, so what you see here is the sum of everything. Apple and Google devices make a lot of queries. This also includes searx.is queries. The "missing" category means a root domain couldn't be determined, e.g. .arpa addresses or other non-fully-qualified domain names.
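As an aside, the same top-10 tally (minus the missing bucket) can be done without StatsBase at all. A plain-Julia sketch, where the toy vector stands in for the real root_domain column:

```julia
# Toy stand-in for df_raw[:, :root_domain]; the real column has ~4.2M rows
root_domains = ["apple.com", "google.com", "apple.com", missing, "apple.com"]

# Tally occurrences, skipping `missing` (root domains that couldn't be parsed)
counts = Dict{String,Int}()
for d in skipmissing(root_domains)
    counts[d] = get(counts, d, 0) + 1
end

# Sort descending by count and keep at most the top 10
top = first(sort(collect(counts); by = last, rev = true), 10)
```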
What are the CAs of the top 10 (minus missing)?
In order of frequency, DigiCert Inc, Apple Inc, Google Trust Services LLC, Let's Encrypt, and Cloudflare.
I could safely remove 90% of the CAs from my browser/system and never encounter an issue. At worst, if I did encounter an unknown CA, I'd get an overly scary warning and just click through it.
The goal was to learn which CAs to keep in the browser and operating system. With this small set of data, I can learn a lot more about my DNS queries: where my systems connect, how often, and so on. Besides, this post is already long enough.