How To Extract Information From User Agent Strings in BigQuery?

Like many other data warehouses, BigQuery has low storage costs, which makes it ideal for dumping logs that can be analyzed later. HTTP request logs are often stored in BigQuery for this reason. These logs can later be analyzed to create reports, diagnose failures, or find the origins of DDoS attacks.

However, raw HTTP logs can be difficult to analyze because the data lacks enrichment. A good example of this is User-Agent strings. They can look something like this:

Mozilla/5.0 (Linux; Android 12; SM-S906N Build/QP1A.190711.020; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/80.0.3987.119 Mobile Safari/537.36

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36

User Agent strings themselves have a lot of useful information—the device, the browser, and its version—but in the raw form, it’s difficult to parse them as these strings haven’t been governed by any standard.

For this reason, using a parsing library is often preferred. For eg, UAParser.js is a popular one for Javascript.

But how can you use this in BigQuery? Suppose you have an HTTP logs table in BigQuery. Now, you want to group all failing requests by the browser to understand if they are coming from the same device. How do you do that?

Luckily, BigQuery’s Javascript UDFs comes in real handy here. Yes, you can define functions with Javascript and they can process your rows just like a SQL UDF. And since BigQuery divides the job very efficiently between workers, the performance impact on row processing is negligible.

Let’s see how this can be done.

Step 1: Upload UA Parsing library in your Google Storage

Instead of writing a long JS code for defining our UA parsing UDF, we can directly use a JS library. BigQuery allows this with an OPTIONS statement. The syntax for that looks like this:

CREATE TEMP FUNCTION f(a FLOAT64, b STRING)
  RETURNS STRING
  LANGUAGE js
  OPTIONS (
    library=["gs://my-bucket/path/to/lib1.js"]
  )
  AS
r"""
    // 'doInteresting' is defined in one of the library files.
    return doInteresting(a, b);
""";

SELECT f(3.14, 'foo');

It seems though that the library to be used can only be accessed from Google Storage. So you’d need to upload the library in a Storage bucket. We will be uploading ua-parser.min.js.

Once uploaded, you can click the link to the file just uploaded and copy the gsutil URI.

Step 2: Define our User Agent Parsing UDF

There are two types of UDFs we can create:

Temp UDFs: They die as soon as the query finishes and our invoked by CREATE TEMP FUNCTION statement.
Persistent UDFs: They persist across queries. If you try to create the function again, an error will be thrown. They are created with CREATE FUNCTION statement.

With Persistent UDF, you’d likely want to create a separate dataset for storing all UDFs. The one I have is called: funcs.

Javascript UDFs map directly to BigQuery’s struct, so we can use it to get all the important information.

CREATE FUNCTION funcs.parse_useragent(ua STRING)
RETURNS STRUCT<
  browser STRUCT<name STRING, version STRING, major STRING>,
  engine  STRUCT<name STRING, version STRING>,
  os      STRUCT<name STRING, version STRING>,
  device  STRUCT<vendor STRING, model STRING, type STRING>,
  arch    STRING
>
LANGUAGE js AS """
  let a = UAParser(ua);
  a.arch = a.cpu.architecture;
  return a;
"""
OPTIONS (
  library="gs://shubham-playground.appspot.com/ua-parser.min.js"
);

Step 3: Parse User Agents

Now, we can finally put our UDF to the test.

SELECT
  funcs.parse_useragent(useragent) as ua
FROM `requests.reqs`
LIMIT 1000;

And it works.

We can also extract specific information by specifying the attribute for example:

SELECT
  funcs.parse_useragent(useragent).browser.name as browser
FROM `requests.reqs`
LIMIT 1000;

Going back to the original problem, we can group logs by browser name and see where the failures are coming from by:

SELECT
  funcs.parse_useragent(useragent).browser.name as browser,
  status, COUNT(1) as count
FROM `requests.reqs`
WHERE status != 200
GROUP BY 1, 2
ORDER BY 3 DESC;

Output

And there we go, we successfully created a UDF that can parse user agent strings and convert them to device and browser information.