
Data import


Preparing the Data

  1. Data should be compressed with gzip.
  2. Data should be chunked, i.e., split into several files. This enables the cluster to distribute the data import.
  3. Records in the data must be separated by linebreaks. If linebreaks are missing (e.g., in MARC21 dumps), they can be added with the addLineBreak.sh script in the bin/ folder (see the sketch after this list).
  4. MarcXML is not supported yet.
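A minimal preparation sketch for these three steps, assuming a single MARC21 dump named dump.mrc; the file name, the chunk size, and the exact invocation of addLineBreak.sh are illustrative assumptions:

    # Add record separators if they are missing (common for MARC21 dumps);
    # the exact invocation of addLineBreak.sh may differ.
    ./bin/addLineBreak.sh dump.mrc > dump_lines.mrc

    # Chunk the dump so the cluster can distribute the import
    # (one million records per chunk is an arbitrary choice),
    # then compress each chunk with gzip.
    split -l 1000000 dump_lines.mrc chunk_
    gzip chunk_*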

Moving the Data to HDFS

  1. If Cygwin is used, you may need to change the directory (cd /cygdrive/<drive>) so that you are on the same drive as the data to be copied.
  2. Use the following command to move the data from your local path to the Hadoop file system (HDFS): hadoop fs -fs hdfs://<namenode:port> -moveFromLocal <localPathToData> <remotePathToData>
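For illustration, a concrete version of the two steps; the namenode address and all paths are placeholders to be replaced with your own values:

    # Under Cygwin only: switch to the drive holding the data.
    cd /cygdrive/d

    # Create the target folder in HDFS, then move the prepared
    # chunks there from the local file system.
    hadoop fs -fs hdfs://namenode.example.org:8020 -mkdir DNB
    hadoop fs -fs hdfs://namenode.example.org:8020 \
        -moveFromLocal /data/dnb/chunk_*.gz DNB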

Importing data to HBase

  1. Create an HTable. Metafacture uses two different table layouts: one with only the column family 'prop':

     {NAME => 'dnb', FAMILIES => [{NAME => 'prop', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '2147483647', MIN_VERSIONS => '0', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

     and one with an additional column family 'raw' for storing raw data. The first is created by the command hbase_create_simple.sh TABLE_NAME, the latter by hbase_create.sh TABLE_NAME.
  2. Use job_cgIngest.sh INPUT_PATH FORMAT TABLE_NAME to import data. For instance, job_cgIngest.sh DNB marc21 dnb_test imports the MARC21 dumps in the HDFS folder ./DNB into the table dnb_test; the ID of each record is prefixed with DNB. job_cgIngest.sh is only a wrapper for the Java class org.culturegraph.mf.cluster.job.ingest.BibIngest, which offers more configuration options than are exposed by job_cgIngest.sh. A combined example follows this list.
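Put together, a sketch of the import step, reusing the example table name dnb_test from above and assuming the helper scripts live in the bin/ folder:

    # Create the simple table layout (column family 'prop' only).
    ./bin/hbase_create_simple.sh dnb_test

    # Import the MARC21 chunks from the HDFS folder ./DNB into dnb_test;
    # every record ID is prefixed with "DNB".
    ./bin/job_cgIngest.sh DNB marc21 dnb_test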

Verifying the Result

  1. The counters of the MapReduce framework provide a first indication: any exceptions thrown during the ingest are counted, along with the total number of ingested records.
  2. Run summary statistics: job_countInHTable.sh TABLE_NAME allProperties.xml counts the occurrences of all properties in the records of the given table (see the example below).
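A usage example, reusing the table name from the import step and assuming the script lives in the bin/ folder:

    # Count how often each property occurs across all records in dnb_test.
    ./bin/job_countInHTable.sh dnb_test allProperties.xml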