
Data import


Preparing the Data

  1. Data should be compressed with gzip.
  2. Data should be chunked, i.e., split into several files. This enables the cluster to distribute the data import.
  3. Records in the data must be separated by linebreaks. If linebreaks are missing (e.g., in MARC21 dumps), they can be added with the addLineBreak.sh script in the bin/ folder (see the sketch after this list).
  4. MarcXML is not supported yet.
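A minimal preparation sketch for these three steps, assuming a single MARC21 dump named dump.mrc; the file name, the chunk size, and the exact invocation of addLineBreak.sh are illustrative assumptions:

    # Add record separators if they are missing (common for MARC21 dumps);
    # the exact invocation of addLineBreak.sh may differ.
    ./bin/addLineBreak.sh dump.mrc > dump_lines.mrc

    # Chunk the dump so the cluster can distribute the import
    # (one million records per chunk is an arbitrary choice),
    # then compress each chunk with gzip.
    split -l 1000000 dump_lines.mrc chunk_
    gzip chunk_*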

Moving the Data to HDFS

  1. If Cygwin is used, you may need to change the directory (cd /cygdrive/<drive>) so that you are on the same drive as the data to be copied.
  2. Use the following command to move the data from your local path to the Hadoop file system (HDFS): hadoop fs -fs hdfs://<namenode:port> -moveFromLocal <localPathToData> <remotePathToData>
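For illustration, a concrete version of the two steps; the namenode address and all paths are placeholders to be replaced with your own values:

    # Under Cygwin only: switch to the drive holding the data.
    cd /cygdrive/d

    # Create the target folder in HDFS, then move the prepared
    # chunks there from the local file system.
    hadoop fs -fs hdfs://namenode.example.org:8020 -mkdir DNB
    hadoop fs -fs hdfs://namenode.example.org:8020 \
        -moveFromLocal /data/dnb/chunk_*.gz DNB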

Importing data to HBase

  1. Create an HTable. Metafacture uses two different table layouts: one with only the column family 'prop':

     {NAME => 'dnb', FAMILIES => [{NAME => 'prop', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '2147483647', MIN_VERSIONS => '0', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

     and one with an additional column family 'raw' for storing raw data. The first is created by the command hbase_create_simple.sh TABLE_NAME, the latter by hbase_create.sh TABLE_NAME.
  2. Use job_cgIngest.sh INPUT_PATH FORMAT TABLE_NAME to import data. For instance, job_cgIngest.sh DNB marc21 dnb_test imports the MARC21 dumps in the HDFS folder ./DNB into the table dnb_test; the ID of each record is prefixed with DNB. job_cgIngest.sh is only a wrapper for the Java class org.culturegraph.mf.cluster.job.ingest.BibIngest, which offers more configuration options than are exposed by job_cgIngest.sh. A combined example follows this list.
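Put together, a sketch of the import step, reusing the example table name dnb_test from above and assuming the helper scripts live in the bin/ folder:

    # Create the simple table layout (column family 'prop' only).
    ./bin/hbase_create_simple.sh dnb_test

    # Import the MARC21 chunks from the HDFS folder ./DNB into dnb_test;
    # every record ID is prefixed with "DNB".
    ./bin/job_cgIngest.sh DNB marc21 dnb_test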

Verifying the Result

  1. The counters of the MapReduce framework provide a first indication: any exceptions thrown during the ingest are counted, along with the total number of ingested records.
  2. Run summary statistics: job_countInHTable.sh TABLE_NAME allProperties.xml counts the occurrences of all properties in the records of the given table (see the example below).
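A usage example, reusing the table name from the import step and assuming the script lives in the bin/ folder:

    # Count how often each property occurs across all records in dnb_test.
    ./bin/job_countInHTable.sh dnb_test allProperties.xml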