Various Other Tool Usages¶

DataKit ships with a number of small built-in tools for daily use. You can view DataKit's command-line help with the following command:

datakit --help

Note: Because of platform differences, the exact help content may vary.

If you want to see how a specific command is used (such as dql), you can use the following command:

$ datakit dql --help
DQL used to query data. If no option specified, query interactively.

Usage:
  datakit dql [flags]

Flags:
      --auto-json      pretty output string if field/tag value is JSON
      --csv string     Specify the directory
  -F, --force          overwrite csv if file exists
  -h, --help           help for dql
  -H, --host string    specify datakit host to query
  -J, --json           output in JSON format
      --log string     log path (default "/dev/null")
  -R, --run string     run single DQL
  -T, --token string   run query for specific token(workspace)
  -V, --verbose        verbosity mode

Debugging Commands¶

Debugging the Blacklist¶

Version-1.14.0

To debug whether a piece of data will be filtered by the centrally configured blacklist, you can use the following command:

Linux/macOS:

$ datakit debug --filter=/usr/local/datakit/data/.pull --data=/path/to/lineproto.data

Dropped

    ddtrace,http_url=/webproxy/api/online_status,service=web_front f1=1i 1691755988000000000

By 7th rule(cost 1.017708ms) from category "tracing":

    { service = 'web_front' and ( http_url in [ '/webproxy/api/online_status' ] )}

Windows:

PS > datakit.exe debug --filter 'C:\Program Files\datakit\data\.pull' --data '\path\to\lineproto.data'

Dropped

    ddtrace,http_url=/webproxy/api/online_status,service=web_front f1=1i 1691755988000000000

By 7th rule(cost 1.017708ms) from category "tracing":

    { service = 'web_front' and ( http_url in [ '/webproxy/api/online_status' ] )}

The above output indicates that the data in the file lineproto.data is matched by the 7th rule (counting from 1) in the tracing category in the .pull file. Once matched, this piece of data will be discarded.
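
To make the matching logic concrete, the following is a minimal Python sketch of how the rule shown above can be evaluated against the tags of a line-protocol point. It is an illustration only and not DataKit's filter engine: the helper names `parse_tags` and `rule_matches` are hypothetical, line-protocol escaping is ignored, and the rule is hand-translated rather than parsed from the .pull file.

```python
# Simplified illustration only: not DataKit's actual filter implementation.
# Escaped commas/spaces in line protocol are not handled.

def parse_tags(line: str) -> dict:
    """Extract the tag set from a line-protocol point."""
    head = line.split(" ", 1)[0]              # "measurement,tag1=v1,tag2=v2"
    _measurement, *pairs = head.split(",")
    return dict(pair.split("=", 1) for pair in pairs)


def rule_matches(tags: dict) -> bool:
    """Hand-written equivalent of the 7th rule in the example output."""
    return (
        tags.get("service") == "web_front"
        and tags.get("http_url") in ["/webproxy/api/online_status"]
    )


point = (
    "ddtrace,http_url=/webproxy/api/online_status,service=web_front "
    "f1=1i 1691755988000000000"
)
verdict = "Dropped" if rule_matches(parse_tags(point)) else "Kept"
print(verdict)
```

Both conditions of the rule must hold for the point to be dropped; a point whose service tag differs is kept.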

Obtaining File Paths Using glob Rules¶

Version-1.8.0

In log collection, log paths can be configured using glob rules.

You can debug glob rules with DataKit. Provide a configuration file in which each line is one glob pattern.

An example of the configuration file is as follows:

$ cat glob-config
/tmp/log-test/*.log
/tmp/log-test/**/*.log

A complete command example is as follows:

$ datakit debug --glob-conf glob-config
============= glob paths ============
/tmp/log-test/*.log
/tmp/log-test/**/*.log

========== found the files ==========
/tmp/log-test/1.log
/tmp/log-test/logfwd.log
/tmp/log-test/123/1.log
/tmp/log-test/123/2.log
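
For comparison, here is a rough Python sketch of the same kind of expansion using the standard `glob` module. It builds a throwaway directory instead of using /tmp/log-test, and the helper name `expand_globs` is hypothetical. Python's `**` semantics are assumed to be close to, but not necessarily identical to, DataKit's.

```python
import glob
import os
import tempfile


def expand_globs(patterns):
    """Expand each pattern and return the unique matching files, sorted."""
    found = set()
    for pattern in patterns:
        # recursive=True lets "**" span any number of directory levels
        for path in glob.glob(pattern, recursive=True):
            if os.path.isfile(path):
                found.add(path)
    return sorted(found)


# Build a temporary tree similar to the example above.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "123"))
for rel in ["1.log", "logfwd.log", "123/1.log", "123/2.log", "notes.txt"]:
    with open(os.path.join(root, *rel.split("/")), "w") as f:
        f.write("")

files = expand_globs([f"{root}/*.log", f"{root}/**/*.log"])
for path in files:
    print(os.path.relpath(path, root))
```

A file matched by more than one pattern is reported once, and notes.txt is excluded because it matches neither pattern.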

Matching Text with Regular Expressions¶

Version-1.8.0

In log collection, multiline log collection can be achieved by configuring regular expressions.

You can debug regular expression rules with DataKit. Provide a configuration file whose first line is the regular expression; the remaining lines are the text to match (this can span multiple lines).

An example of the configuration file is as follows:

$ cat regex-config
^\d{4}-\d{2}-\d{2}
2020-10-23 06:41:56,688 INFO demo.py 1.0
2020-10-23 06:54:20,164 ERROR /usr/local/lib/python3.6/dist-packages/flask/app.py Exception on /0 [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
ZeroDivisionError: division by zero
2020-10-23 06:41:56,688 INFO demo.py 5.0

A complete command example is as follows:

$ datakit debug --regex-conf regex-config
============= regex rule ============
^\d{4}-\d{2}-\d{2}

========== matching results ==========
  Ok:  2020-10-23 06:41:56,688 INFO demo.py 1.0
  Ok:  2020-10-23 06:54:20,164 ERROR /usr/local/lib/python3.6/dist-packages/flask/app.py Exception on /0 [GET]
Fail:  Traceback (most recent call last):
Fail:    File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app
Fail:      response = self.full_dispatch_request()
Fail:  ZeroDivisionError: division by zero
  Ok:  2020-10-23 06:41:56,688 INFO demo.py 5.0
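
The Ok/Fail column can be reproduced with a short Python sketch. The grouping step assumes that a matching line starts a new log record and non-matching lines are appended to the current one; that is an assumption about how the rule is applied, and the helper names `classify` and `group_records` are hypothetical. Python's `re` module is used here for illustration and may differ in details from the regex engine DataKit uses.

```python
import re


def classify(pattern: str, lines):
    """Return (matched, line) pairs, mirroring the Ok/Fail column."""
    regex = re.compile(pattern)
    return [(regex.search(line) is not None, line) for line in lines]


def group_records(pattern: str, lines):
    """Start a new record at each matching line; append other lines to it."""
    records = []
    for matched, line in classify(pattern, lines):
        if matched or not records:
            records.append([line])
        else:
            records[-1].append(line)
    return records


sample = [
    "2020-10-23 06:41:56,688 INFO demo.py 1.0",
    "2020-10-23 06:54:20,164 ERROR /usr/local/lib/python3.6/dist-packages/flask/app.py Exception on /0 [GET]",
    "Traceback (most recent call last):",
    '  File "/usr/local/lib/python3.6/dist-packages/flask/app.py", line 2447, in wsgi_app',
    "    response = self.full_dispatch_request()",
    "ZeroDivisionError: division by zero",
    "2020-10-23 06:41:56,688 INFO demo.py 5.0",
]

records = group_records(r"^\d{4}-\d{2}-\d{2}", sample)
print(len(records), [len(r) for r in records])
```

With the sample text, the traceback lines are folded into the ERROR record, giving three records in total.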

Viewing the Running Status of DataKit¶

For usage of the monitor command, refer to the DataKit monitor documentation.

Checking the Correctness of Collector Configuration¶

After editing the collector configuration file, there may be some configuration errors (such as incorrect configuration file format). You can check whether it is correct through the following command:

datakit check --config
------------------------
checked 13 conf, all passing, cost 22.27455ms

Viewing Workspace Information¶

DataKit provides the following command to view the server-side workspace information:

datakit tool --workspace-info
{
  "token": {
    "ws_uuid": "wksp_2dc431d6693711eb8ff97aeee04b54af",
    "bill_state": "normal",
    "ver_type": "pay",
    "token": "tkn_2dc438b6693711eb8ff97aeee04b54af",
    "db_uuid": "ifdb_c0fss9qc8kg4gj9bjjag",
    "status": 0,
    "creator": "",
    "expire_at": -1,
    "create_at": 0,
    "update_at": 0,
    "delete_at": 0
  },
  "data_usage": {
    "data_metric": 96966,
    "data_logging": 3253,
    "data_tracing": 2868,
    "data_rum": 0,
    "is_over_usage": false
  }
}

Debugging KV Files¶

If a collector configuration file uses a KV template, you can debug how it is rendered with the following command:

datakit tool --parse-kv-file conf.d/host/cpu.conf --kv-file data/.kv

[[inputs.cpu]]
  ## Collect interval, default is 10 seconds. (optional)
  interval = '10s'

  ## Collect CPU usage per core, default is false. (optional)
  percpu = false

  ## Setting disable_temperature_collect to false will collect cpu temperature stats for linux. (deprecated)
  # disable_temperature_collect = false

  ## Enable to collect core temperature data.
  enable_temperature = true

  ## Enable gets average load information every five seconds.
  enable_load5s = true

[inputs.cpu.tags]
  kv = "cpu_kv_value3"

Viewing Cloud Attribute Data¶

If DataKit is installed on a cloud server (aliyun, tencent, aws, hwcloud, and azure are currently supported), you can view some of its cloud attribute data with the following command. In the example below, a value of - means the field is not available:

datakit tool --show-cloud-info aws

           cloud_provider: aws
              description: -
     instance_charge_type: -
              instance_id: i-09b37dc1xxxxxxxxx
            instance_name: -
    instance_network_type: -
          instance_status: -
            instance_type: t2.nano
               private_ip: 172.31.22.123
                   region: cn-northwest-1
        security_group_id: launch-wizard-1
                  zone_id: cnnw1-az2

Parsing Line Protocol Data¶

Version-1.5.6

You can parse line protocol data through the following command:

datakit tool --parse-lp /path/to/file
Parse 201 points OK, with 2 measurements and 201 time series

It can be output in JSON format:

datakit tool --parse-lp /path/to/file --json
{
  "measurements": {  # List of metric sets
    "testing": {
      "points": 7,
      "time_series": 6
    },
    "testing_module": {
      "points": 195,
      "time_series": 195
    }
  },
  "point": 202,        # Total number of points
  "time_serial": 201   # Total number of timelines
}
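
To illustrate what the per-measurement counts mean, here is a small Python sketch that counts points and time series from line-protocol text. It assumes a time series is identified by the measurement name plus its tag set, which is an assumption and not taken from DataKit's source. The helper name `summarize` and the sample points are made up for illustration, and escaped spaces or commas are not handled.

```python
def summarize(lines):
    """Count points and distinct series per measurement.

    Assumption: a time series is identified by measurement plus tag set.
    """
    stats = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                              # skip blanks and comments
        head = line.split(" ", 1)[0]              # measurement and tags
        measurement, *tags = head.split(",")
        entry = stats.setdefault(measurement, {"points": 0, "series": set()})
        entry["points"] += 1
        entry["series"].add(tuple(sorted(tags)))  # tag order does not matter
    return {
        name: {"points": e["points"], "time_series": len(e["series"])}
        for name, e in stats.items()
    }


# Made-up sample points, for illustration only.
sample = [
    "testing,host=a usage=1.0 1691755988000000000",
    "testing,host=a usage=2.0 1691755989000000000",
    "testing,host=b usage=3.0 1691755988000000000",
    "testing_module,host=a count=1i 1691755988000000000",
]
summary = summarize(sample)
print(summary)
```

Two points that share a measurement and tag set count as two points but one time series, which is why a measurement can report fewer time series than points.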

Data Recording and Replay¶

Version-1.19.0

Data import replays data that was collected and recorded earlier, so demonstrations and tests do not require a fresh collection run.

Enabling Data Recording¶

In datakit.conf, you can enable the data recording function. After enabling, DataKit will record the data to the specified directory for subsequent import:

[recorder]
  enabled  = true
  path     = "/path/to/recorder"     # Absolute path, by default in the <DataKit installation directory>/recorder directory
  encoding = "v2"                    # Use protobuf-JSON format (xxx.pbjson), and you can also choose v1 (xxx.lp) in line protocol form (the former is more readable and supports more data types)
  duration = "10m"                   # Recording duration, starting from the startup of DataKit
  inputs   = ["cpu", "mem"]          # Record data of specified collectors (based on the names shown in the *Inputs Info* panel of monitor), and if empty, it means recording data of all collectors
  categories = ["logging", "metric"] # Recording types, and if empty, it means recording all data types

After the recording starts, the directory structure is roughly as follows (showing the pbjson format of time-series data here):

[ 416] /usr/local/datakit/recorder/
├── [  64]  custom_object
├── [  64]  dynamic_dw
├── [  64]  keyevent
├── [  64]  logging
├── [  64]  network
├── [  64]  object
├── [  64]  profiling
├── [  64]  rum
├── [  64]  security
├── [  64]  tracing
└── [1.9K]  metric
    ├── [1.2K]  cpu.1698217783322857000.pbjson
    ├── [1.2K]  cpu.1698217793321744000.pbjson
    ├── [1.2K]  cpu.1698217803322683000.pbjson
    ├── [1.2K]  cpu.1698217813322834000.pbjson
    └── [1.2K]  cpu.1698218363360258000.pbjson

12 directories, 59 files

Warning

After data recording is complete, remember to turn this function off (enabled = false). Otherwise, recording starts every time DataKit starts, which may consume a large amount of disk space.
The collector name here is not necessarily the name used in the collector configuration ([[inputs.some-name]]); it is the name shown in the first column of the Inputs Info panel of monitor. Some collector names look like logging/<some-pod-name>. In that case the data is stored at /usr/local/datakit/recorder/logging/logging-some-pod-name.1705636073033197000.pbjson: the / in the collector name is replaced with - to avoid an extra directory level.
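
The naming rule described above can be sketched in Python as follows. The function `recorder_path` is a hypothetical helper that only mirrors the layout shown in this section (root, category directory, then collector name, nanosecond timestamp, and extension); it is not DataKit code.

```python
import posixpath


def recorder_path(root: str, category: str, collector: str,
                  ts_ns: int, ext: str = "pbjson") -> str:
    """Build the path of one recorded file, following the layout above."""
    safe_name = collector.replace("/", "-")   # avoid an extra directory level
    return posixpath.join(root, category, f"{safe_name}.{ts_ns}.{ext}")


print(recorder_path("/usr/local/datakit/recorder", "logging",
                    "logging/some-pod-name", 1705636073033197000))
```

For a collector without a / in its name, such as cpu, the name is used unchanged.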

Data Replay¶

After DataKit has recorded the data, you can save the contents of this directory using Git or another method (make sure to keep the existing directory structure). You can then import the data into TrueWatch with the following command:

$ datakit import -P /usr/local/datakit/recorder -D "https://openway.truewatch.com?token=tkn_xxxxxxxxx"

> Uploading "/usr/local/datakit/recorder/metric/cpu.1698217783322857000.pbjson"(1 points) on metric...
+1h53m6.137855s ~ 2023-10-25 15:09:43.321559 +0800 CST
> Uploading "/usr/local/datakit/recorder/metric/cpu.1698217793321744000.pbjson"(1 points) on metric...
+1h52m56.137881s ~ 2023-10-25 15:09:53.321533 +0800 CST
> Uploading "/usr/local/datakit/recorder/metric/cpu.1698217803322683000.pbjson"(1 points) on metric...
+1h52m46.137991s ~ 2023-10-25 15:10:03.321423 +0800 CST
...
Total upload 75 kB bytes ok

Although the recorded data contains absolute timestamps (in nanoseconds), when playing back, DataKit will automatically shift these data to the current time (retaining the relative time intervals between data points), making it look like newly collected data.
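
The idea of shifting timestamps while keeping relative intervals can be sketched in Python. The source does not state which reference point DataKit uses, so mapping the newest recorded timestamp to the current time is an assumption, and `shift_to_now` is a hypothetical helper.

```python
import time


def shift_to_now(timestamps_ns, now_ns=None):
    """Shift timestamps so the newest one lands on now_ns, keeping all gaps."""
    if not timestamps_ns:
        return []
    if now_ns is None:
        now_ns = time.time_ns()
    offset = now_ns - max(timestamps_ns)      # one offset for every point
    return [ts + offset for ts in timestamps_ns]


# Timestamps taken from the recorded file names shown earlier.
recorded = [1698217783322857000, 1698217793321744000, 1698217803322683000]
shifted = shift_to_now(recorded, now_ns=1700000000000000000)
print(shifted)
```

Because a single offset is applied to every point, the gaps between consecutive points are unchanged after the shift.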

You can obtain more help information about data import through the following command:

$ datakit import --help
Import used to play recorded history data to TrueWatch.

Usage:
  datakit import [flags]

Flags:
  -D, --dataway strings   dataway list
  -h, --help              help for import
      --log string        log path (default "/dev/null")
  -P, --path string       point data path (default "/usr/local/datakit/recorder")

Warning

For RUM data, if the target workspace has no application with the corresponding APP ID, the replayed data cannot be written. You can either create a new application in the target workspace and set its APP ID to match the one in the recorded data, or replace the APP ID in the recorded data with the APP ID of an existing RUM application in the target workspace.

Others¶

Telegraf Integration¶

Note: Before using Telegraf, check whether DataKit already covers the data you need to collect. If it does, using Telegraf for the same collection is not recommended, because it may cause data conflicts and make operation more difficult.

Install the Telegraf integration

datakit install --telegraf

Start Telegraf

cd /etc/telegraf
cp telegraf.conf.sample telegraf.conf
telegraf --config telegraf.conf

For details on using Telegraf, refer to the Telegraf documentation.

Security Checker Integration¶

Install the Security Checker

datakit install --scheck

After a successful installation, it runs automatically. For specific usage of the Security Checker, refer to the Security Checker documentation.

eBPF Integration¶

Install the DataKit eBPF collector. Currently, only the linux/amd64 and linux/arm64 platforms are supported. For usage instructions, see DataKit eBPF Collector.

datakit install --ebpf

If the prompt open /usr/local/datakit/externals/datakit-ebpf: text file busy appears, execute this command after stopping the DataKit service.

Warning

This command has been removed in Version-1.5.6. The eBPF integration is built-in by default in the new version.

Update IP Database¶

The steps depend on the installation method: host installation, Kubernetes (yaml), or Kubernetes (Helm). They are covered in that order.

Host installation: You can install or update the IP geolocation database directly with the following command (to use the alternative geolite2 database, replace iploc with geolite2):

datakit install --ipdb iploc

After updating the IP geographic information database, modify the datakit.conf configuration:

[pipeline]
  ipdb_type = "iploc"

Restart DataKit for the change to take effect.
Test whether the IP database works:

datakit tool --ipinfo 1.2.3.4
        ip: 1.2.3.4
      city: Brisbane
  province: Queensland
   country: AU
       isp: unknown

If the installation fails, the output is as follows:

datakit tool --ipinfo 1.2.3.4
       isp: unknown
        ip: 1.2.3.4
      city: 
  province: 
   country:
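
If you want to check the result in a script, the two outputs above can be told apart by whether the location fields are filled in. The following Python sketch parses the `key: value` lines; `parse_ipinfo` and `ipdb_effective` are hypothetical helper names, and the rule "any of city, province, or country is non-empty" is an assumption based only on the sample outputs shown here.

```python
def parse_ipinfo(output: str) -> dict:
    """Parse `key: value` lines such as those printed by the ipinfo check."""
    info = {}
    for line in output.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)       # split on the first colon only
        info[key.strip()] = value.strip()
    return info


def ipdb_effective(info: dict) -> bool:
    """Treat the database as working if any location field is filled in."""
    return any(info.get(k) for k in ("city", "province", "country"))


ok_output = "\n".join([
    "        ip: 1.2.3.4",
    "      city: Brisbane",
    "  province: Queensland",
    "   country: AU",
    "       isp: unknown",
])
failed_output = "\n".join([
    "       isp: unknown",
    "        ip: 1.2.3.4",
    "      city: ",
    "  province: ",
    "   country:",
])

print(ipdb_effective(parse_ipinfo(ok_output)))
print(ipdb_effective(parse_ipinfo(failed_output)))
```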

Kubernetes (yaml): Modify datakit.yaml and uncomment the content in the 4 places enclosed by the ---iploc-start and ---iploc-end markers.
Reinstall DataKit:

kubectl apply -f datakit.yaml

# Ensure the DataKit container is started
kubectl get pod -n datakit

Enter the container and test whether the IP library takes effect

datakit tool --ipinfo 1.2.3.4
        ip: 1.2.3.4
      city: Brisbane
  province: Queensland
   country: AU
       isp: unknown

If the installation fails, the output is as follows:

datakit tool --ipinfo 1.2.3.4
       isp: unknown
        ip: 1.2.3.4
      city: 
  province:
   country:

Kubernetes (Helm): Add --set iploc.enable=true when deploying with Helm:

helm install datakit datakit/datakit -n datakit \
    --set datakit.dataway_url="https://openway.truewatch.com?token=<YOUR-TOKEN>" \
    --set iploc.enable=true \
    --create-namespace

For details on deploying with Helm, refer to the DataKit Helm deployment documentation.

Enter the container and test whether the IP library takes effect

datakit tool --ipinfo 1.2.3.4
        ip: 1.2.3.4
      city: Brisbane
  province: Queensland
   country: AU
       isp: unknown

If the installation fails, the output is as follows:

datakit tool --ipinfo 1.2.3.4
       isp: unknown
        ip: 1.2.3.4
      city: 
  province:
   country:

Automatic Command Completion¶

Because DataKit has many command-line options, it provides automatic command completion. The completion script is generated from the Cobra command tree and supports bash, zsh, fish, and powershell. Installing or upgrading DataKit does not enable shell completion automatically; to use it, run datakit completion <shell> after installation.

Note: datakit completion applies to DataKit Version-2.1.0 and later. For earlier versions, use the command syntax documented with that release.

Typical usage:

Force bash install: datakit completion bash --force
Force zsh install: datakit completion zsh --force
Force fish install: datakit completion fish --force
Force powershell install: datakit completion powershell --force
Auto-detect current shell and install: datakit completion --force
Print script only: datakit completion bash --print

Specifying the shell explicitly is recommended, especially when running through sudo. datakit completion --force detects the shell from the SHELL environment variable. If sudo or another restricted environment does not preserve that variable, auto-detection will fail.
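
The failure mode described above can be illustrated with a short Python sketch of SHELL-based detection. It is not DataKit's implementation: `detect_shell` is a hypothetical function, and the environment is passed in as a dictionary so that the behavior with and without SHELL can be compared.

```python
import os

SUPPORTED = ("bash", "zsh", "fish", "powershell")


def detect_shell(env=None):
    """Guess the shell name from SHELL; raise if it is unset or unsupported."""
    env = os.environ if env is None else env
    shell_path = env.get("SHELL", "")
    name = os.path.basename(shell_path)       # "/bin/zsh" -> "zsh"
    if name not in SUPPORTED:
        raise ValueError(
            f"cannot detect shell from SHELL={shell_path!r}; pass it explicitly"
        )
    return name


print(detect_shell({"SHELL": "/bin/zsh"}))
```

With an empty environment, as can happen under sudo, the sketch raises an error instead of guessing, which is why naming the shell explicitly is the safer choice.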

Most mainstream Linux environments support shell completion. For bash, if completion support is missing on the host or inside a container, you can install:

Ubuntu: apt install bash-completion
CentOS: yum install bash-completion bash-completion-extras

When a shell is specified, datakit completion <shell> will:

install the generated completion script to a standard path
print the actual install path and how to activate it immediately

For example:

$ datakit completion bash --force
completion for bash installed to /usr/share/bash-completion/completions/datakit
reload your shell or run: source /usr/share/bash-completion/completions/datakit

When DataKit is running inside a Docker container, completion is installed into the container filesystem, and the output will state that explicitly.

bash Setup¶

Run:

datakit completion bash --force

If the script is installed to a system completion directory, it usually takes effect after opening a new shell. To enable it in the current shell, run the source command printed by DataKit.

zsh Setup¶

Run:

datakit completion zsh --force

For zsh, DataKit installs the completion script to ~/.zfunc/_datakit by default. If your current zsh session has not loaded that directory, add it to fpath and run compinit again:

fpath=(~/.zfunc $fpath)
autoload -Uz compinit
compinit

If you want zsh to load it automatically on startup, add the same configuration to ~/.zshrc, and make sure the fpath line appears before compinit. You can also copy the command printed by datakit completion zsh --force to write the configuration and load it.

fish Setup¶

Run:

datakit completion fish --force

Fish completion is installed to ~/.config/fish/completions/datakit.fish by default. It usually takes effect after opening a new fish session.

PowerShell Setup¶

Run:

datakit completion powershell --force

For PowerShell, DataKit generates a standalone completion script by default and does not modify or overwrite the user's Microsoft.PowerShell_profile.ps1. To enable it in the current session, run the dot-source command printed by DataKit. If you want PowerShell to load it automatically on startup, add that dot-source command to your profile manually.

Completion usage example:

$ datakit <tab> # Press Tab to list the following commands
check       completion  debug       dql         import      install
monitor     pipeline    run         service     tool        version

$ datakit dql <tab> # Press Tab to list the following options
--auto-json   --csv         -F,--force    --host        -J,--json     --log         -R,--run      -T,--token    -V,--verbose

All the commands described in this document can be completed in this way.

Print the Completion Script Only¶

If you want to review the script first or install it manually, use --print:

# Export the zsh completion script
datakit completion zsh --print > _datakit

If you need a custom install path, use --path:

datakit completion fish --path ~/.config/fish/completions/datakit.fish --force