What are Grok Patterns?


Grok builds on the regular expression language, letting you name existing patterns and combine them into more complex Grok patterns. Because Grok is based on regular expressions, any valid regular expression (regexp) is also valid in Grok.
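
For example, the base patterns that ship with standard Grok libraries are simply named regular expressions. A few of them, as defined in the commonly used grok-patterns set:

INT (?:[+-]?(?:[0-9]+))
WORD \b\w+\b
NOTSPACE \S+
GREEDYDATA .*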

In StreamSets Data Collector Engine, a fast data ingestion engine running on the StreamSets DataOps Platform, you can use a single Grok pattern, compose several patterns, or create a custom pattern to define the structure of log data by configuring the following properties.

Grok Pattern Definition

Use this property to define one or more patterns, one per line:

<PATTERN NAME> <grok pattern>
<PATTERN NAME2> <grok pattern>
MYHOSTTIMESTAMP %{CISCOTIMESTAMP:timestamp} %{HOST:host} 
MYCUSTOMPATTERN %{MYHOSTTIMESTAMP} %{WORD:program}%{NOTSPACE} %{NOTSPACE} 
DURATIONLOG %{NUMBER:duration}%{NOTSPACE} %{GREEDYDATA:kernel_logs}

Grok Pattern

Use this property to define the pattern used to evaluate the data. You can specify:

  • A predefined pattern, such as %{COMMONAPACHELOG}
  • A custom pattern (see Custom Grok Patterns below)
  • Patterns defined in Grok Pattern Definition

For example, you can use the patterns defined in Grok Pattern Definition above and configure the Grok Pattern as follows:

%{MYCUSTOMPATTERN} %{DURATIONLOG}
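
As an illustration only (this line is hypothetical, not taken from a real log), a record shaped roughly like the following would be parsed by the composed pattern above into timestamp, host, program, duration, and kernel_logs fields, provided the whitespace lines up with the literal spaces in the definitions:

Jun 14 10:30:13 web01 sshd[4321]: started 0.5ms kernel: connection from 10.0.0.5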

Reusing Grok Patterns

Here is the syntax for reusability: %{SYNTAX:SEMANTIC}, %{SYNTAX}, %{SYNTAX:SEMANTIC:TYPE}.

Where:

  • SYNTAX is the name of the pattern that will match the text
  • SEMANTIC is the identifier given to the piece of text being matched
  • TYPE is the type to cast the named field
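
For example, using the standard NUMBER pattern (the float cast is shown purely to illustrate the TYPE element; check the documentation for the types supported by your engine):

%{NUMBER}                matches a number without naming the resulting field
%{NUMBER:duration}       stores the matched text in a field named duration
%{NUMBER:duration:float} additionally casts the duration field to a float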

Custom Grok Patterns

You can also write your own custom patterns to define the structure of your logs.
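
For example, suppose your application logs contain an order identifier such as ORD-123456. A hypothetical custom pattern (the ORDERID and ORDERLINE names here are illustrative) could be added to Grok Pattern Definition and then referenced from Grok Pattern:

ORDERID ORD-[0-9]{6}
ORDERLINE %{TIMESTAMP_ISO8601:timestamp} %{ORDERID:order_id} %{GREEDYDATA:message}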

Using Grok In Amazon S3

For instance, let’s take a look at how you can use Grok in the Amazon S3 origin to easily parse Apache web logs ingested in the following format.

79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] "GET /home HTTP/1.1" 200 1671 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
162.235.161.200 - - [14/Jun/2014:10:30:13 -0400] "GET /department/apparel/category/featured%20shops/product/adidas%20Kids'%20RG%20III%20Mid%20Football%20Cleat HTTP/1.1" 200 1175 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/7.0.4 Safari/537.76.4"
39.244.91.133 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fitness HTTP/1.1" 200 1435 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

On the Amazon S3 origin's Data Format tab:

  • Data Format is set to Log
  • Log Format is set to Grok Pattern
  • Grok Pattern is set to %{COMMONAPACHELOG}
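
With this configuration, each log line is parsed into a structured record. For the first sample line above, the fields produced by the standard COMMONAPACHELOG pattern look roughly like this (field names come from the pattern's definition; values shown as an informal sketch):

clientip    : 79.133.215.123
ident       : -
auth        : -
timestamp   : 14/Jun/2014:10:30:13 -0400
verb        : GET
request     : /home
httpversion : 1.1
response    : 200
bytes       : 1671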

In the pipeline:

  • A Field Type Converter processor converts fields such as response, timestamp, and httpversion from string to their respective data types
  • An Expression Evaluator processor decodes the request field (the HTTP URL) to UTF-8 and extracts the product name from the URL using regExCapture() (a sketch follows this list)
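
As a minimal sketch, an Expression Evaluator expression for the product name might look like the following, assuming the parsed URL is in a field named /request (the field path and regular expression are illustrative assumptions, not the original pipeline's exact configuration):

${str:regExCapture(record:value('/request'), '.*/product/(.*)', 1)}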

Snowflake

The parsed and enriched Apache web logs are stored in Snowflake for analytics in Snowsight.

Query: Top 10 most viewed products

SELECT product_name, count(*) as product_page_view_count
FROM web_logs
WHERE product_name is not null
GROUP BY product_name
ORDER BY product_page_view_count DESC 
LIMIT 10;

Query: Failed (403, 500, etc.) vs. successful (200) HTTP requests

SELECT response, count(*) as http_response_code_count
FROM web_logs
GROUP BY response
ORDER BY http_response_code_count DESC;

Log Data Format Support In StreamSets Data Collector

For a comprehensive list of all the origins, processors, and destinations in StreamSets Data Collector Engine that support the Log data format, where Grok patterns can be used, visit our documentation.

Try Grok in Your Data Pipelines

To take advantage of this and other features, get started today by deploying data engines in your favorite cloud to design, deploy, and operate data pipelines.

StreamSets enables data engineers to build end-to-end smart data pipelines. Spend your time building, enabling and innovating instead of maintaining, rewriting and fixing.