-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathnotes.html
168 lines (144 loc) · 10.9 KB
/
notes.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
<!DOCTYPE html>
<head>
<style type="text/css">
body { width: 35em;
margin: 0 auto;
padding: 0 0 4em 0;
font-family: 'Droid Sans', sans-serif;
}
h1,h2 { margin: 1em 0 0 0; }
p,ul,ol { margin: 0 0 1em 0; }
table { table-layout: fixed;
width: 100%;
word-wrap: break-word;
}
</style>
<title>Extract Web Log Data and Load to Database with Genie</title>
<meta charset="UTF-8" />
</head>
<body>
<h1>Combined Log Format to SQLite</h1>
<h2>Contents</h2>
<ol>
<li><a href="#intro">Introduction</a></li>
<li><a href="#stages">Stages of development</a>:
<ol>
<li><a href="#getopts">Parse Command Line Arguments</a></li>
<li><a href="#openfile">Open Log File</a></li>
<li><a href="#opendb">Open Database</a></li>
<li><a href="#dbtable">Create Database Staging Table</a></li>
<li><a href="#parse_hits">Parse Each Hit</a></li>
<li><a href="#insert_hits">Insert Each Hit into Database</a></li>
<li><a href="#report">Increment Counter and Produce Report at End</a></li>
</ol>
</li>
<li><a href="#bugs">Bugs</a>:
<ol>
<li><a href="#escaped_hits">Escaped Characters in Log</a></li>
</ol>
</li>
</ol>
<h2 id="intro">Introduction</h2>
<p>Two file names should be passed as parameters to the program. The first points to the web log to be processed. The second points to the SQLite database to store the web server hits in a staging table.</p>
<p>A web server hit is stored using nine fields in the web log. This is the combined log format:</p>
<ol>
<li>IP address of client</li>
<li>TCP ident (not used)</li>
<li>User name from HTTP authentication</li>
<li>Time of request</li>
<li>HTTP request line, includes method, URI and HTTP version</li>
<li>Server response code, e.g. 200 for OK</li>
<li>Count of bytes sent in response body, can be zero which is indicated by "-"</li>
<li>HTTP referrer</li>
<li>User agent</li>
</ol>
<h2 id="stages">Stages of development</h2>
<h3 id="getopts">Parse Command Line Arguments</h3>
<p>The program takes two command line parameters:</p>
<p><kbd>stage_weblog log_filename sqlite_database_filename</kbd></p>
<h3 id="openfile">Open Log File</h3>
<p>This takes the first command line parameter and attempts to open the file for reading.</p>
<p>Only one test is needed for this section. That is to confirm an error message is given if the file can't be opened.</p>
<h3 id="opendb">Open Database</h3>
<p>This takes the second command line parameter and attempts to open the file as an SQLite database for access.</p>
<p>Only one test is needed for this section. That is to confirm an error message is given if the file can't be opened.</p>
<p>There was a problem with Genie:</p>
<p>At first the database was opened only for read/write access, this meant an error was produced if the database did not already exist. This was intended as a safeguard to stop databases being created if a typo was made and so only existing databases could be updated. This behaviour proved annoying when developing the program because a new SQLite database had to be created using a create table then drop table. To make the program create the database if it did not exist the <code>Sqlite.OPEN_READWRITE</code> flag was changed to <code>Sqlite.OPEN_CREATE</code> when the <code>open_v2</code> function was called. This change caused SQLite to create an "out of memory" error.</p>
<p>So the code <code>if Database.open_v2( filename, out database, Sqlite.OPEN_READWRITE ) == Sqlite.OK</code> was changed to <code>if Database.open_v2( filename, out database, Sqlite.OPEN_CREATE ) == Sqlite.OK</code>, but produced an "out of memory error" from SQLite. Instead <code>if Database.open( filename, out database ) == Sqlite.OK</code> had to be used.</p>
<h3 id="dbtable">Create Database Staging Table</h3>
<p>The program should exit if the table already exists. This is to ensure processing of an existing set of web server hits is not confused by the new data being loaded. When processing of the existing data is finished the database staging table should be dropped to signal a new load can happen.</p>
<table border="1">
<tr><th style="width: 4.5em;">Position</th><th style="width: 5em;">Field</th><th style="width: 6em;">Hit Object Property</th><th style="width: 6em;">Database column</th><th>Comments</th></tr>
<tr><td>1</td><td>Domain name</td><td>domain</td><td>domain_name varchar not null</td><td>Domain name the log file is associated with, the default is an empty string</td></tr>
<tr><td>2</td><td>IP address</td><td>ip_address</td><td>ip_address varchar not null</td><td>Needs to store both IPv4 and IPv6 addresses. Although IPv4 addresses could be stored as a 32bit unsigned integer using the <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/inet_pton.3.html">inet_pton</a> to translate the text to binary form, binary IPv6 addresses are too long to be stored as an integer in SQLite. The maximum integer size is eight bytes in SQLite, but an IPv6 address is sixteen bytes, so to cater for both IPv4 and IPv6 the text datatype is used.</td></tr>
<tr><td>3</td><td>TCP ident (not used)</td><td>tcp_ident</td><td></td><td></td></tr>
<tr><td>4</td><td>HTTP authenticated user name</td><td>http_user</td><td>http_authenticated_name varchar not null</td><td>This is not usually used, so the empty value of dash, '-', will appear in most files. This should be converted to an empty string, '', for storage in the staging table.</td></tr>
<tr><td>5</td><td>Time</td><td>time</td><td>time bigint not null</td><td>SQLite does not have a datetime type, but it is recommended to handle date/times using one of three other datatypes:
<ol>
<li>TEXT - Using ISO-8601 format makes date/times sortable</li>
<li>INT - Unix epoch time, i.e. seconds from midnight 01 Jan 1970 UTC</li>
<li>REAL - Julian calendar days, this is the number of days since noon January 1, 4713 BC in the proleptic Julian calendar. Time during the day is a decimal fraction.</li>
</ol>
<p>The date in the log file is of the format DD/MMM/YYYY:HH:MM:SS +/-TZ, where MMM is the localised name of the month in abbreviated form and +/-TZ is either plus or minus then the time zone offset as HHMM, e.g. +0100. The <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/strptime.3.html">strptime</a> function should be used to parse this, using the input descriptor: "%d/%b/%Y:%T". Then the <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/strftime.3.html">strftime</a> function should be used to produce a Unix epoch time. This is the %s output format in strftime. Finally the UTC offset, +/-TZ, needs to be subtracted to give the time in UTC ready for storage.</p>
<p>For Genie the functions are: <a href="http://valadoc.org/#!api=glib-2.0/GLib.Time.strptime">GLib.Time.strptime</a> and <a href="http://valadoc.org/#!api=glib-2.0/GLib.Time.strftime">GLib.Time.strftime</a></p>
<p>Unix time is represent as the <code>time_t</code> datatype in C. This is platform dependent, but must be either <a href="http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/types.h.html">an integer or float</a>. Linux and glibc seem to use a 64 bit integer so the non-standard SQL bigint datatype was chosen. The SQL standard only defines smallint and int, but many databases implement bigint. SQLite implements a variable length int, up to eight bytes (64 bits) long.</p>
</td></tr>
<tr><td rowspan="3">6</td><td rowspan="3">HTTP request line</td><td rowspan="3">request_line</td><td>http_method varchar not null</td><td></td></tr>
<tr><td>http_path varchar not null</td><td></td></tr>
<tr><td>http_version char(3) not null</td><td></td></tr>
<tr><td>7</td><td>Server response code</td><td>response_code</td><td>response_code char(3) not null</td><td></td></tr>
<tr><td>8</td><td>Body length</td><td>body_length</td><td>body_bytes int not null</td><td></td></tr>
<tr><td rowspan="2">9</td><td rowspan="2">Referrer</td><td rowspan="2">referrer</td><td>referring_host varchar not null</td><td></td></tr>
<tr><td>referring_uri_path varchar not null</td><td></td></tr>
<tr><td>10</td><td>User Agent</td><td>user_agent</td><td>user_agent varchar not null</td><td></td></tr>
</table>
<h3 id="parse_hits">Parse Each Hit</h3>
<p>Each line of the log file is treated as an object. The hit object takes a line from the log file as an argument when it is instantiated. The fields in the line are parsed and allocated as properties of the hit object, for example as hit.ip_address.</p>
<p>The line from the web log file is currently assumed to be in combined log file format, which has nine fields.</p>
<p>There were a number of problems with Genie:</p>
<dl>
<dt><kbd>prop private <em>array</em></kbd> caused segv</dt>
<dd>Work around was to use underscore instead of <kbd>prop private</kbd>, e.g. <kbd><em>_array</em></kbd></dd>
<dt>Switch with case statements as expressions</dt>
<dd>
<pre>
result : bool = true
switch result
case (state == states.START) and (character == ' ')
</pre>
</dd>
<dt>Multiple if statements</dt>
<dd>
<pre>
if character == ' '
do_stuff()
else if character == '['
do_other_stuff()
else if character == '"'
do_something()
</pre>
</dd>
</dl>
<h3 id="insert_hits">Insert Each Hit into Database</h3>
<p>The INSERT statement was created as a prepared statement using the SQLite prepare_v2() function.</p>
<p>Then during the web log reading loop each field from the log was bound to the relevant field in the prepared SQL statement using the SQLite bind_text() call. The INSERT statement is then run using the SQLite step() function and then the prepared statement is reset for new values with SQLite's reset() function. This method took just over a minute to read 3134 hits into the database. By wrapping the loop in a transaction with database.exec( "BEGIN TRANSACTION" ) and database.exec( "END TRANSACTION" ) the time was reduced to just under half a second, which was about 120 times faster.</p>
<h3 id="report">Increment Counter and Produce Report at End</h3>
<p>To do...</p>
<h2 id="bugs">Bugs</h2>
<h3 id="escaped_hits">Escaped Characters in Log</h3>
<p>The following lines from a web log are causing a Glib error ( <output>CRITICAL **: string_slice: assertion `self != NULL' failed</output> ):</p>
<pre>
80.152.182.125 - - [10/Apr/2013:23:43:08 +0100] "GET /\"+b.src+\" HTTP/1.1" 302 296 "-" "Java/1.6.0_37"
209.235.32.244 - - [12/Apr/2013:12:42:17 +0100] "GET /\"+b.src+\" HTTP/1.1" 302 296 "-" "Java/1.7.0_13"
66.129.74.150 - - [13/Apr/2013:11:19:57 +0100] "GET /\"+b.src+\" HTTP/1.1" 302 296 "-" "Java/1.7.0_17"
</pre>
<p>The error means the lines are not added to the database, but the processing of all other lines continues. So the result is the loss of three lines.</p>
<p>The problem is the GET request that has an escape double quote character in it.</p>
<p>Process to fix:</p>
<ol>
<li>Research whether Apache inserts escape sequence or it is from the user agent</li>
<li>Add fix to Genie parsing program</li>
</ol>
<p>This is a time consuming fix and because it only occurs with malformed requests from bots it is not a priority.</p>
<p><a href="http://www.ruby-forum.com/topic/138852">Escape characters in log files</a> - excellent forum post, including a patch for nginx to make it the same as Apache. Explains exactly what is happening.</p>
</body>