Find and count distinct domains in a file with grep

Regular expressions come in handy when you need to filter output from a file with grep or a similar Linux command.

The Apache access log is one of several log files produced by the Apache HTTP Server. It records data for every request the server processes, including, when the combined log format is used, the referrer sent by the client.
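
Assuming the combined log format is enabled, a single entry looks roughly like this (the IP address, path and referrer below are made-up placeholders); the referrer URL is the quoted field right after the status code and response size:

203.0.113.7 - - [10/Oct/2023:13:55:36 +0200] "GET /blog/post HTTP/1.1" 200 5316 "https://www.example.com/search?q=grep" "Mozilla/5.0"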

To quickly find all referrer domains and subdomains and count how many times each one appears, you can use the following command:

grep -Eo 'https?://[a-zA-Z0-9!@#$&()`.+,-]*' access.log | cut -d '/' -f3 | sort | uniq -c | sort -nr

Example output:

296 aljazvidmar.si
 60 www.apple.com
 47 www.semrush.com
 39 ahrefs.com
 38 webmaster.petalsearch.com
 22 www.google.com
  9 opensiteexplorer.org
  8 dataforseo.com
  7 www.bing.com
  5 www.google.com.hk
  2 napoveda.seznam.cz
  2 ecairn.com
  2 duckduckgo.com
  1 photo.adesignstudio.net
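
If you run this regularly, the whole pipeline can be wrapped in a small shell function; the name top_referrers and the default file name access.log are only examples:

top_referrers() {
    grep -Eo 'https?://[a-zA-Z0-9!@#$&()`.+,-]*' "${1:-access.log}" \
        | cut -d '/' -f3 \
        | sort | uniq -c | sort -nr
}

For instance, top_referrers /var/log/apache2/access.log runs the same pipeline against that specific file; adjust the path to wherever your server writes its access log.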

Command breakdown with comments

-E, --extended-regexp
Interpret PATTERNS as extended regular expressions (EREs).
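
For example, in extended mode the ? quantifier works without a backslash:

echo 'http:// and https://' | grep -Eo 'https?://'

This prints http:// and https:// on separate lines, because s? makes the s optional.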

-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
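
Without -o, grep would print each entire matching log line; with -o only the matched URL fragment is printed. A simplified illustration (www.example.com is a placeholder):

echo 'referrer was https://www.example.com/page' | grep -Eo 'https?://[a-zA-Z0-9.]*'

This prints just https://www.example.com rather than the whole input line.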

https? matches http with an optional trailing s, followed by the literal ://. The bracket expression [a-zA-Z0-9!@#$&()`.+,-]* then matches any run of alphanumeric characters and the listed special characters; the hyphen sits last so it is treated as a literal, and because / is not in the set the match stops right after the domain.
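
The next stage, cut -d '/' -f3, splits each matched URL on / and keeps the third field, which is the domain (field 1 is https: and field 2 is the empty string between the two slashes):

echo 'https://www.example.com' | cut -d '/' -f3

This prints www.example.com.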

sort puts the extracted domains in alphabetical order so that identical lines end up next to each other.

uniq -c collapses adjacent duplicate lines and prefixes each surviving line with its number of occurrences.
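
Note that uniq -c only collapses adjacent duplicates, which is why the list is sorted first. For example:

printf 'a.com\nb.com\na.com\n' | sort | uniq -c

This prints 2 a.com and 1 b.com; without the sort, the two a.com lines would be counted separately.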

sort -nr sorts the counted lines again, this time numerically and in reverse (descending) order, so the most frequent referrer domains appear first.
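
To keep only the busiest referrers, append head to the pipeline; the value 10 below is an arbitrary cutoff:

grep -Eo 'https?://[a-zA-Z0-9!@#$&()`.+,-]*' access.log | cut -d '/' -f3 | sort | uniq -c | sort -nr | head -n 10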