
Automating JSON Log Loading with Vector

In this tutorial, we'll simulate generating logs locally, collect them using Vector, store them in S3, and automate their ingestion into Databend Cloud using scheduled tasks.


Before You Start

Before you start, ensure you have the following prerequisites in place:

  • Amazon S3 Bucket: An S3 bucket where logs collected by Vector will be stored. Learn how to create an S3 bucket.
  • AWS Credentials: AWS Access Key ID and Secret Access Key with sufficient permissions for accessing your S3 bucket. Manage your AWS credentials.
  • AWS CLI: Ensure that the AWS CLI is installed and configured with the necessary permissions to access your S3 bucket.
  • Docker: Ensure that Docker is installed on your local machine, as it will be used to set up Vector.
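
As a quick sanity check, you can confirm that both CLIs are installed and that your AWS credentials are valid. These are standard AWS CLI and Docker commands:

aws --version
docker --version

# Confirm that the configured AWS credentials are valid
aws sts get-caller-identity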

Step 1: Create a Target Folder in S3 Bucket

To store the logs collected by Vector, create a folder named logs in your S3 bucket. In this tutorial, we use s3://databend-doc/logs/ as the target location.

This command creates an empty folder named logs in the databend-doc bucket:

aws s3api put-object --bucket databend-doc --key logs/
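
To verify that the folder was created, list the bucket contents; the logs/ prefix should appear in the output:

aws s3 ls s3://databend-doc/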

Step 2: Create a Local Log File

Simulate log generation by creating a local log file. In this tutorial, we use /Users/eric/Documents/logs/app.log as the file path.

Add the following JSON lines to the file to represent sample log events:

app.log
{"user_id": 1, "event": "login", "timestamp": "2024-12-08T10:00:00Z"}
{"user_id": 2, "event": "purchase", "timestamp": "2024-12-08T10:05:00Z"}

Step 3: Configure & Run Vector

  1. Create a Vector configuration file named vector.yaml on your local machine. In this tutorial, we create it at /Users/eric/Documents/vector.yaml with the following content:
vector.yaml
# Collect events from the local log file (mounted into the container at /logs)
sources:
  logs:
    type: file
    include:
      - "/logs/app.log"
    read_from: beginning

# Parse each collected line as JSON
transforms:
  extract_message:
    type: remap
    inputs:
      - "logs"
    source: |
      . = parse_json(.message) ?? {}

# Deliver the parsed events to the S3 bucket
sinks:
  s3:
    type: aws_s3
    inputs:
      - "extract_message"
    bucket: databend-doc
    region: us-east-2
    key_prefix: "logs/"
    content_type: "text/plain"
    encoding:
      codec: "native_json"
    auth:
      access_key_id: "<your-access-key-id>"
      secret_access_key: "<your-secret-access-key>"
  2. Start Vector using Docker, mapping the configuration file and local logs directory:
docker run \
-d \
-v /Users/eric/Documents/vector.yaml:/etc/vector/vector.yaml:ro \
-v /Users/eric/Documents/logs:/logs \
-p 8686:8686 \
--name vector \
timberio/vector:nightly-alpine
  3. Wait for a moment, then check whether any logs have been synced to the logs folder on S3 (if nothing shows up, see the troubleshooting command at the end of this step):
aws s3 ls s3://databend-doc/logs/

If the log file has been successfully synced to S3, you should see output similar to this:

2024-12-10 15:22:13          0
2024-12-10 17:52:42        112 1733871161-7b89e50a-6eb4-4531-8479-dd46981e4674.log.gz

You can now download the synced log file from your bucket:

aws s3 cp s3://databend-doc/logs/1733871161-7b89e50a-6eb4-4531-8479-dd46981e4674.log.gz ~/Documents/
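
The file is gzip-compressed. To inspect its contents without keeping a decompressed copy, you can use gunzip (or zcat):

gunzip -c ~/Documents/1733871161-7b89e50a-6eb4-4531-8479-dd46981e4674.log.gz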

Compared to the original log, the synced log is in NDJSON format, with each record wrapped in an outer log field:

{"log":{"event":"login","timestamp":"2024-12-08T10:00:00Z","user_id":1}}
{"log":{"event":"purchase","timestamp":"2024-12-08T10:05:00Z","user_id":2}}

Step 4: Create a Task in Databend Cloud

  1. Open a worksheet and create an external stage that links to the logs folder in your bucket:
CREATE STAGE mylog 's3://databend-doc/logs/' CONNECTION=(
    ACCESS_KEY_ID = '<your-access-key-id>',
    SECRET_ACCESS_KEY = '<your-secret-access-key>'
);

Once the stage is successfully created, you can list the files in it:

LIST @mylog;

┌────────────────────────────────────────────────────────┬──────┬────────────────────────────────────┬───────────────────────────────┬─────────┐
│ name                                                   │ size │ md5                                │ last_modified                 │ creator │
├────────────────────────────────────────────────────────┼──────┼────────────────────────────────────┼───────────────────────────────┼─────────┤
│ 1733871161-7b89e50a-6eb4-4531-8479-dd46981e4674.log.gz │  112 │ "231ddcc590222bfaabd296b151154844" │ 2024-12-10 22:52:42.000 +0000 │ NULL    │
└────────────────────────────────────────────────────────┴──────┴────────────────────────────────────┴───────────────────────────────┴─────────┘
  2. Create a table with columns mapped to the fields in the log:
CREATE TABLE logs (
    event String,
    timestamp Timestamp,
    user_id Int32
);
  3. Create a scheduled task to load logs from the external stage into the logs table:
CREATE TASK IF NOT EXISTS myvectortask
    WAREHOUSE = 'eric'
    SCHEDULE = 1 MINUTE
    SUSPEND_TASK_AFTER_NUM_FAILURES = 3
AS
COPY INTO logs
FROM (
    SELECT $1:log:event, $1:log:timestamp, $1:log:user_id
    FROM @mylog/
)
FILE_FORMAT = (TYPE = NDJSON, COMPRESSION = AUTO)
MAX_FILES = 10000
PURGE = TRUE;
  4. Start the task:
ALTER TASK myvectortask RESUME;
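
The task now runs once per minute. To confirm that it was created and is no longer suspended, you can list the tasks in your worksheet (assuming your Databend Cloud environment supports SHOW TASKS):

SHOW TASKS;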

Wait for a moment, then check if the logs have been loaded into the table:

SELECT * FROM logs;

┌──────────┬─────────────────────┬─────────┐
│ event    │ timestamp           │ user_id │
├──────────┼─────────────────────┼─────────┤
│ login    │ 2024-12-08 10:00:00 │       1 │
│ purchase │ 2024-12-08 10:05:00 │       2 │
└──────────┴─────────────────────┴─────────┘

If you run LIST @mylog; now, you will see no files listed. This is because the task is configured with PURGE = TRUE, which deletes the synced files from S3 after the logs are loaded.

Now, let's simulate generating two more logs in the local log file app.log:

echo '{"user_id": 3, "event": "logout", "timestamp": "2024-12-08T10:10:00Z"}' >> /Users/eric/Documents/logs/app.log
echo '{"user_id": 4, "event": "login", "timestamp": "2024-12-08T10:15:00Z"}' >> /Users/eric/Documents/logs/app.log

Wait for a moment for the log to sync to S3 (a new file should appear in the logs folder). The scheduled task will then load the new logs into the table. If you query the table again, you will find these logs:

SELECT * FROM logs;

┌──────────┬─────────────────────┬─────────┐
│ event    │ timestamp           │ user_id │
├──────────┼─────────────────────┼─────────┤
│ logout   │ 2024-12-08 10:10:00 │       3 │
│ login    │ 2024-12-08 10:15:00 │       4 │
│ login    │ 2024-12-08 10:00:00 │       1 │
│ purchase │ 2024-12-08 10:05:00 │       2 │
└──────────┴─────────────────────┴─────────┘
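
When you are done experimenting, you can pause the pipeline by suspending the task (the counterpart of the RESUME command above) and stopping the Vector container with docker stop vector:

ALTER TASK myvectortask SUSPEND;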