title>Design of Database

Design of Database

Bruce Ediger

2013-04-14

Database

Visualization of DB

database tables relationships

The above diagram represents tables by rectangular blocks, has the table name in bold. It lists column names in regular type in the blocks. Foreign key fields have arrows pointing to the primary keys.

seq_no

Any request of the Apache web server ends up putting a line in table access_log. That table has a column seq_no. I made every effort to get a coherent, matching value of seq_no for rows in table p0f (essentially for a TCP SYN packet) that make a request of TCP port 80.

SELECT seq_no FROM access_log
WHERE ip_addr = p0f from IP address
AND request_dt = p0f timestamp
AND seq_no IS NOT NULL
AND seq_no NOT IN (
    SELECT seq_no
    FROM p0f
    WHERE from_addr = p0f from IP address
    AND seq_no IS NOT NULL
)

When finding a seq_no value for a p0f record, the above SQL tries to find a seq_no value from an Apache access_log record that's already in the database, based on matching IP address of the source, and timestamp. Occasionally, the same IP address can make multiple HTTP requests per second, so the "NOT IN" clause keeps the code from assigning the same seq_no to two rows in table p0f.

Sometimes the p0f record and the access_log record have timestamps that differ. I ended up using this SQL as a second chance to find a value of seq_no:

SELECT seq_no FROM access_log 
WHERE ip_addr = p0f from IP address
AND (request_dt BETWEEN
   (CAST(p0f timestamp as timestamp with time zone) - INTERVAL '3 second') 
   AND
   (CAST(p0f timestamp as timestamp with time zone) + INTERVAL '3 second'))
AND seq_no not in (
   SELECT seq_no
   FROM p0f
   WHERE to_port = 80
   AND from_addr = p0f from IP address AND seq_no IS NOT NULL
)

For the 6-second interval code above to work best, the p0f records should present in chronological order. This maximizes the number of table p0f rows that get a value of seq_no, and minimizes the number of mismatches.

Rationale

Batch - revoke a set of inserts. Each file full of p0f records, or lines from access_log gets a single value of batch_no. Corresponding rows in table access_log or p0f have that value of batch_no. If a file gets processed incorrectly, I can delete all rows of that file based on value of batch_no.

Three "fact tables" exist, tables named p0f, access_log, scan Each line in a p0f.log becomes a row in table p0f, for example.

"_d" tables - dimension tables more or less. Some shared between fact tables.

Four "link tables", doing many-to-many relationships.

  1. Nmap OS guesses from PHP scan-backs.
  2. Keywords from in-the-clear search engine query referrals.
  3. Environment variable values from PHP programs
  4. Request variable values from PHP programs