This repository has been archived by the owner on Dec 19, 2018. It is now read-only.

Counting

Markus M. Geipel edited this page Jun 12, 2013 · 1 revision

A Metamorph definition declares what is to be counted: for each literal emitted, a key is created by concatenating the literal's name and value. The data to be analyzed may be available either on HDFS or in the form of an HTable. In the following example the name is empty, so the keys are simply the values of 002@.0.

<?xml version="1.0" encoding="UTF-8"?>
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1" entityMarker=".">
	<rules>
		<data source="002@.0" name="" />
	</rules>
</metamorph>

Further examples can be found in src/main/resources/statistics.
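The key construction described above can be illustrated with a small sketch. It is not the code of the counting job, only a model of the idea in plain Python: each literal is a (name, value) pair, the key is their concatenation, and occurrences of each key are tallied. The function name `count_keys` and the sample values are hypothetical.

```python
from collections import Counter


def count_keys(literals):
    """Count keys built by concatenating each literal's name and value."""
    counts = Counter()
    for name, value in literals:
        counts[name + value] += 1
    return counts


# Hypothetical literals, as a definition with an empty name might emit them:
# because the name is "", each key is identical to the field value.
literals = [("", "Tp1"), ("", "Tb1"), ("", "Tp1")]

for key, count in sorted(count_keys(literals).items()):
    print(key, count)
```

Giving the literal a non-empty name in the Metamorph definition would prefix every key with that name, which helps keep counts from different fields apart.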

Data on HDFS

Use job_countInFile.sh INPUT_PATH FORMAT MORPH_DEF. The following restrictions apply to the input data: records must be separated by the newline character (MARCXML is thus not admissible). Data may be uncompressed or gzipped. If gzipped, the data should be split into files of >64 MB; otherwise the ingest cannot be distributed across the cluster.
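One way to prepare gzipped input that meets these restrictions is sketched below. It is not part of the project: the function name `split_into_gzip_parts`, the `part-NNNNN.gz` naming, and the choice to measure size in uncompressed bytes are all assumptions made for illustration. The sketch writes newline-terminated records into several gzip files and never splits a record across two files.

```python
import gzip
from pathlib import Path


def split_into_gzip_parts(lines, out_dir, max_bytes):
    """Write newline-terminated records (bytes) into numbered .gz parts.

    A new part is started once the current one has received at least
    max_bytes of uncompressed data. Records are never split.
    Returns the list of part paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    handle = None
    written = 0
    try:
        for line in lines:
            # Roll over to a new part only at a record boundary.
            if handle is None or written >= max_bytes:
                if handle is not None:
                    handle.close()
                path = out_dir / f"part-{len(parts):05d}.gz"
                handle = gzip.open(path, "wb")
                parts.append(path)
                written = 0
            handle.write(line)
            written += len(line)
    finally:
        if handle is not None:
            handle.close()
    return parts
```

A call might look like `split_into_gzip_parts(open("records.txt", "rb"), "parts", 64 * 1024 * 1024)`, where the file name, the output directory, and the threshold are placeholders. The appropriate part size depends on the cluster configuration.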

Data in HBase

Use job_countInHTable.sh TABLE_NAME MORPH_DEF.
