This repository has been archived by the owner on Dec 19, 2018. It is now read-only.
Data import
Markus M. Geipel edited this page Jun 12, 2013 · 5 revisions
- Data should be compressed with gzip.
- Data should be chunked, meaning that the data is split into several files. This enables the cluster to distribute the data import.
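The two preparation steps above can be sketched with standard Unix tools; the file name `records.txt` and the chunk size of two lines are illustrative placeholders:

```shell
# Tiny sample input: one record per line
printf 'rec1\nrec2\nrec3\nrec4\n' > records.txt

# Split into chunks of two records each; split names them chunk_aa, chunk_ab, ...
split -l 2 records.txt chunk_

# Compress every chunk with gzip
gzip chunk_*

ls chunk_*.gz   # → chunk_aa.gz  chunk_ab.gz
```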
- Records in the data must be separated by line breaks. If line breaks are missing (e.g. in MARC21), they can be added with the `addLineBreak.sh` script in the `bin/` folder.
- MarcXML is not supported yet.
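The line-break fix can be illustrated with standard tools: MARC21 (ISO 2709) records end with the record terminator byte 0x1D, so appending a newline after each terminator puts one record per line. This is only a sketch of the idea; the actual implementation of `addLineBreak.sh` is not shown on this page and may differ:

```shell
# Tiny fake MARC21 dump: two "records", each ended by the 0x1D record
# terminator (octal 035), with no newlines in between.
printf 'rec1\035rec2\035' > records.mrc

# Append a newline after every record terminator (GNU sed \xHH escape)
sed 's/\x1d/\x1d\n/g' records.mrc > records_lf.mrc

wc -l < records_lf.mrc   # → 2
```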
- If Cygwin is used, you may need to change the directory first so that you are on the same drive as the data to be copied:

  ```
  cd /cygdrive/<drive>
  ```

- Use the following command to move the data from your local path to the Hadoop file system (HDFS):

  ```
  hadoop fs -fs hdfs://<namenode:port> -moveFromLocal <localPathToData> <remotePathToData>
  ```
- Create an HTable. Metafacture uses two different table layouts: one with only the column family 'prop':

  ```
  {NAME => 'dnb', FAMILIES => [{NAME => 'prop', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '2147483647', MIN_VERSIONS => '0', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
  ```

  and one with an additional family 'raw' for storing raw data. The first is created by `hbase_create_simple.sh TABLE_NAME`, the latter by `hbase_create.sh TABLE_NAME`.
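These helper scripts presumably drive the HBase shell; the simple 'prop' layout above corresponds to an HBase-shell `create` call along these lines (a sketch with an assumed table name, not the scripts' verbatim contents):

```
create 'dnb_test', {NAME => 'prop', BLOOMFILTER => 'NONE', COMPRESSION => 'SNAPPY', VERSIONS => '1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
```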
- Use `job_cgIngest.sh INPUT_PATH FORMAT TABLE_NAME` to import data. For instance, `job_cgIngest.sh DNB marc21 dnb_test` imports the MARC21 dumps in the HDFS folder `DNB` into the table `dnb_test`. The ID of each record is prefixed with `DNB`. `job_cgIngest.sh` is only a wrapper for the Java class `org.culturegraph.mf.cluster.job.ingest.BibIngest`, which offers more configuration options than are exposed by `job_cgIngest.sh`.
- The counters of the MapReduce framework provide first indicators: any exceptions thrown during the ingest are counted, along with the total number of ingested records.
- Run summary statistics: `job_countInHTable.sh TABLENAME allProperties.xml` counts the number of occurrences of all properties of the records in the given table.