\section{Root element}
\label{sec:root-element}
All NAF documents have a root element \texttt{<NAF>} which has the following
attributes:
\begin{itemize}
\item \texttt{xml:lang} (\textbf{required}): language identifier.
\item \texttt{version} (\textbf{required}): the version of NAF. For
newsreader, we will use version \textbf{v1}.
\end{itemize}
Example:
\begin{Verbatim}[fontsize=\small]
<NAF xml:lang="en" version="v1">
<!-- ... -->
</NAF>
\end{Verbatim}
\section{NAF header}
\label{sec:naf-header}
NAF documents may have a header for describing information about the
document, such as its original name, URI or a list of the linguistic
processors which generated the NAF document. The NAF header is represented
within the \texttt{<nafHeader>} element, which is optional but highly
recommended. The header element has the following sub-elements:
\subsection{fileDesc element}
\label{sec:filedesc-element}
\texttt{<fileDesc>} is an empty element containing information about the
original document itself. It has the following attributes:
\begin{itemize}
\item \texttt{title} (optional): the title of the document.
\item \texttt{author} (optional): the author of the document.
\item \texttt{creationtime} (optional): a timestamp following the
\emph{xs:dateTime}\footnote{See
\texttt{http://www.w3.org/TR/xmlschema-2/\#isoformats}. In
summary, the date is specified in the form
``YYYY-MM-DDThh:mm:ss'' (all fields required). To specify a time
zone, either give the dateTime in UTC by appending a ``Z''
to the time (``2002-05-30T09:00:00Z''), or give an
offset from UTC by appending a positive or negative offset
to the time (``2002-05-30T09:00:00+06:00'').} format that
specifies the time when the document was created.
\item \texttt{filename} (optional): the original file name.
\item \texttt{filetype} (optional): the original format (PDF, HTML, DOC, etc.).
\item \texttt{pages} (optional): number of pages of the original document.
\item \texttt{publisher} (optional): the publisher of an article.
\item \texttt{section} (optional): the section (domain) in which an article appears.
\item \texttt{location} (optional): the location where an article is published.
\item \texttt{magazine} (optional): the magazine in which an article was published.
\end{itemize}
Example:
\begin{Verbatim}[fontsize=\small]
<fileDesc creationtime="2014-01-01T00:00:00Z"
title="The best residence in the world."
author="casa400"
filename="residence_hostal"
filetype="PDF" pages="19"/>
\end{Verbatim}
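The article-oriented attributes listed above can be combined in the same
way. The following sketch uses invented values (the publisher, section,
location and magazine names are purely illustrative) and gives
\texttt{creationtime} with an explicit UTC offset instead of ``Z'':
\begin{Verbatim}[fontsize=\small]
<fileDesc creationtime="2014-01-01T09:30:00+01:00"
          title="An example article"
          publisher="Example Press"
          section="economy"
          location="Amsterdam"
          magazine="Example Weekly"/>
\end{Verbatim}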
% \subsection{dcDesc element}
% \label{sec:dcdesc-element}
% \texttt{<dcDesc>} is an element containing sub-elements that mimic those of
% the Dublin Core \footnote{dublincore.org/documents/dcmi-terms/}. The
% elements which describe various aspects of the original
% document. Specifically, we include those sub-elements:
% \begin{itemize}
% % \item \texttt{<title>}: A title for the document.
% \item \texttt{<creator>}: An entity primarily responsible for making the document.
% \item \texttt{<subject>}: The topic of the document.
% \item \texttt{<description>}: An account of the document.
% \item \texttt{<publisher>}: An entity responsible for making the document available.
% \item \texttt{<contributor>}: An entity responsible for making contributions to the document.
% \item \texttt{<date>}: A point or period of time associated with an event
% in the lifecycle of the resource. The content of the element should be of
% \emph{xs:dateTyme} type (see below).
% \item \texttt{<type>}: The nature or genre of the resource.
% \item \texttt{<format>}: The file format, physical medium, or dimensions of the resource.
% %\item \texttt{dcidentifier}: ---
% % \item \texttt{<source>}: A related resource from which the described resource is derived.
% % \item \texttt{<language>}: A language of the resource.
% % \item \texttt{<relation>}: A related resource.
% % \item \texttt{<coverage>}: The spatial or temporal topic of the resource,
% % the spatial applicability of the resource, or the jurisdiction under which
% % the resource is relevant.
% \item \texttt{<rights>}: Information about rights held in and over the resource.
% \end{itemize}
% Please note:
% \begin{itemize}
% \item The contents of each of those sub-elements are textual. Only the
% \texttt{<date>} elements has a \emph{xs:datetime} type, so that date
% expressions are normalized. In \ref{sec:ling-proc} section we describe the
% format of this type.
% \item There can be any number of those subelements inside a
% \texttt{<dcDesc>} element.
% \end{itemize}
% Example:
% \begin{Verbatim}[fontsize=\small]
% <dcDesc>
% <description>An example document for explaining NAF.</description>
% <creator>Aitor</creator>
% <subject>Example</subject>
% <subject>Small</subject>
% </dcDesc>
% \end{Verbatim}
\subsection{public element}
\label{sec:public-element}
\texttt{<public>} is an empty element which stores public information about
the document, such as its URI. It has the following attributes:
\begin{itemize}
\item \texttt{publicId} (optional): a public identifier (for instance, the
number inserted by the capture server, or the MD5 hash of the
original document).
\item \texttt{uri} (optional): a public URI of the document.
\end{itemize}
Example:
\begin{Verbatim}[fontsize=\small]
<public publicId="50ee45d106f9caf2d1cf38f29419efa8"
uri="http://casa400.com/docs/residence.pdf"/>
\end{Verbatim}
\subsection{Linguistic Processors}
\label{sec:ling-proc}
The header also records which linguistic processors
produced the NAF document, described under \texttt{<linguisticProcessors>}
elements. There can be several \texttt{<linguisticProcessors>} elements, one
per NAF layer, each naming its layer in a \texttt{layer} attribute. NAF
layers correspond to the top-level elements of the
document, such as ``text'', ``terms'', ``deps'', etc. Each
\texttt{<linguisticProcessors>} element contains one or several
\texttt{<lp>} elements, each one describing one specific linguistic
processor.\\
Each \texttt{<lp>} element has the following attributes:
\begin{itemize}
\item \texttt{name} (\textbf{required}): the name of the processor.
\item \texttt{version} (optional): the version of the processor.
\item \texttt{timestamp} (optional): a timestamp, denoting the
date/time at which the processor was launched. It follows the XML
Schema \emph{xs:dateTime} format.
\item \texttt{beginTimestamp} (optional): a timestamp, denoting the
date/time at which the processor started the process. It follows the XML Schema
\emph{xs:dateTime} format.
\item \texttt{endTimestamp} (optional): a timestamp, denoting the date/time
at which the processor ended the process. It follows the XML Schema
\emph{xs:dateTime} format.
\item \texttt{hostname} (optional): the name of the machine where the
processor was run.
\end{itemize}
Example:
\begin{Verbatim}
<linguisticProcessors layer="text">
<lp name="Freeling" version="2.1"
timestamp="2012-06-25T10:05:00Z"
beginTimestamp="2012-06-25T10:05:00Z"
endTimestamp="2012-06-25T10:12:00Z"/>
</linguisticProcessors>
<linguisticProcessors layer="terms">
<lp name="Freeling" version="2.1"
timestamp="2012-06-25T10:10:19Z"
beginTimestamp="2012-06-25T10:10:19Z"
endTimestamp="2012-06-25T10:15:19Z"/>
<lp name="ukb" version="0.1.2"
timestamp="2012-06-25T16:10:19Z"
beginTimestamp="2012-06-25T16:10:19Z"
endTimestamp="2012-06-25T16:20:19Z"/>
</linguisticProcessors>
<linguisticProcessors layer="namedEntities">
<lp name="Standfort_NE"
version="0.1"
timestamp="2009-06-26T00:10:19Z"
beginTimestamp="2009-06-26T00:10:19Z"
endTimestamp="2009-06-26T00:14:19Z"/>
</linguisticProcessors>
\end{Verbatim}
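The optional \texttt{hostname} attribute can be added in the same way. A
minimal sketch, in which the machine name \texttt{node01.example.org} is
made up for illustration:
\begin{Verbatim}
<linguisticProcessors layer="text">
  <lp name="Freeling" version="2.1"
      beginTimestamp="2012-06-25T10:05:00Z"
      endTimestamp="2012-06-25T10:12:00Z"
      hostname="node01.example.org"/>
</linguisticProcessors>
\end{Verbatim}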
Here is a full example of a NAF header:
\begin{Verbatim}
<nafHeader>
<fileDesc creationtime="2014-01-01T00:00:00Z"
title="The best residence in the world."
author="casa400"
filename="residence_hostal"
filetype="PDF" pages="19"/>
<public publicId="3_3012"
uri="http://casa400.com/docs/residence.pdf" />
<linguisticProcessors layer="text">
<lp name="Freeling" version="2.1"
timestamp="2012-06-25T10:05:00Z"
beginTimestamp="2012-06-25T10:05:00Z"
endTimestamp="2012-06-25T10:12:00Z"/>
</linguisticProcessors>
<linguisticProcessors layer="terms">
<lp name="Freeling" version="2.1"
timestamp="2012-06-25T10:10:19Z"
beginTimestamp="2012-06-25T10:10:19Z"
endTimestamp="2012-06-25T10:15:19Z"/>
<lp name="ukb" version="0.1.2"
timestamp="2012-06-25T16:10:19Z"
beginTimestamp="2012-06-25T16:10:19Z"
endTimestamp="2012-06-25T16:20:19Z"/>
</linguisticProcessors>
<linguisticProcessors layer="namedEntities">
<lp name="Standfort_NE"
version="0.1"
timestamp="2009-06-26T00:10:19Z"
beginTimestamp="2009-06-26T00:10:19Z"
endTimestamp="2009-06-26T00:14:19Z"/>
</linguisticProcessors>
</nafHeader>
\end{Verbatim}
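To show how the pieces fit together, the sketch below places a reduced
header inside the root element described in
Section~\ref{sec:root-element}. It assumes that \texttt{<nafHeader>} is a
direct child of \texttt{<NAF>} and precedes the layer elements; this
placement is not stated explicitly above, so the skeleton should be read
as illustrative only.
\begin{Verbatim}
<NAF xml:lang="en" version="v1">
  <nafHeader>
    <fileDesc creationtime="2014-01-01T00:00:00Z"
              filename="residence_hostal"
              filetype="PDF"/>
    <linguisticProcessors layer="text">
      <lp name="Freeling" version="2.1"/>
    </linguisticProcessors>
  </nafHeader>
  <!-- layer elements such as text, terms and deps follow here -->
</NAF>
\end{Verbatim}
Only the \texttt{name} attribute of \texttt{<lp>} is required, so the
timestamps are left out of this reduced version.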
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "naf"
%%% End: