How To Extract Information From User Agent Strings in BigQuery?
Like many other data warehouses, BigQuery has low storage costs, which makes it ideal for dumping logs that can be analyzed later. HTTP request logs are often stored in BigQuery for this reason. These logs can later be analyzed to create reports, diagnose failures, or find the origins of DDoS attacks.
However, raw HTTP logs can be difficult to analyze because the data lacks enrichment. A good example of this is User-Agent strings. They can look something like this:
Mozilla/5.0 (Linux; Android 12; SM-S906N Build/QP1A.190711.020; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/80.0.3987.119 Mobile Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36
User Agent strings themselves have a lot of useful information—the device, the browser, and its version—but in the raw form, it’s difficult to parse them as these strings haven’t been governed by any standard.
But how can you use this in BigQuery? Suppose you have an HTTP logs table in BigQuery. Now, you want to group all failing requests by the browser to understand if they are coming from the same device. How do you do that?
Let’s see how this can be done.
Step 1: Upload UA Parsing library in your Google Storage
Instead of writing a long JS code for defining our UA parsing UDF, we can directly use a JS library. BigQuery allows this with an
OPTIONS statement. The syntax for that looks like this:
CREATE TEMP FUNCTION f(a FLOAT64, b STRING) RETURNS STRING LANGUAGE js OPTIONS ( library=["gs://my-bucket/path/to/lib1.js"] ) AS r""" // 'doInteresting' is defined in one of the library files. return doInteresting(a, b); """; SELECT f(3.14, 'foo');
It seems though that the library to be used can only be accessed from Google Storage. So you’d need to upload the library in a Storage bucket. We will be uploading ua-parser.min.js.
Once uploaded, you can click the link to the file just uploaded and copy the
Step 2: Define our User Agent Parsing UDF
There are two types of UDFs we can create:
- Temp UDFs: They die as soon as the query finishes and our invoked by
CREATE TEMP FUNCTIONstatement.
- Persistent UDFs: They persist across queries. If you try to create the function again, an error will be thrown. They are created with
With Persistent UDF, you’d likely want to create a separate dataset for storing all UDFs. The one I have is called:
CREATE FUNCTION funcs.parse_useragent(ua STRING) RETURNS STRUCT< browser STRUCT<name STRING, version STRING, major STRING>, engine STRUCT<name STRING, version STRING>, os STRUCT<name STRING, version STRING>, device STRUCT<vendor STRING, model STRING, type STRING>, arch STRING > LANGUAGE js AS """ let a = UAParser(ua); a.arch = a.cpu.architecture; return a; """ OPTIONS ( library="gs://shubham-playground.appspot.com/ua-parser.min.js" );
Step 3: Parse User Agents
Now, we can finally put our UDF to the test.
SELECT funcs.parse_useragent(useragent) as ua FROM `requests.reqs` LIMIT 1000;
And it works.
We can also extract specific information by specifying the attribute for example:
SELECT funcs.parse_useragent(useragent).browser.name as browser FROM `requests.reqs` LIMIT 1000;
Going back to the original problem, we can group logs by browser name and see where the failures are coming from by:
SELECT funcs.parse_useragent(useragent).browser.name as browser, status, COUNT(1) as count FROM `requests.reqs` WHERE status != 200 GROUP BY 1, 2 ORDER BY 3 DESC;
And there we go, we successfully created a UDF that can parse user agent strings and convert them to device and browser information.