The option for customizing crawler default extraction columns #195

mrekh · 2022-04-13T04:28:18Z

It would be great if it were possible to customize the crawler's default columns in order to reduce the size of the output JSON file.

eliasdabbas · 2022-04-13T13:34:45Z

Maintainer

If the main reason is to reduce the size of the output file, then it should be easy to delete the unwanted columns right after you finish the crawl.

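import os
import pandas as pd

# Read the crawl output (JSON lines: one JSON object per line).
df = pd.read_json("output_file.jl", lines=True)
# Keep only the columns you need; the names below are placeholders.
columns = ["col_1", "col_3", "col_10", "col_14"]
df[columns].to_csv("output_file.csv", index=False)  # or df[columns].to_parquet(...)
# Delete the original file once the smaller copy has been written.
os.remove("output_file.jl")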

Also:

There are many columns that are dynamic, like response headers and JSON-LD, which are unknown, so it's difficult to tell which ones you want.
You can save the output file in .parquet format, which reduces the size massively. Keep in mind that there is a small bug in the JSON decoder when you have list objects, but these can easily be converted to strings and saved.
The biggest column is body_text, so deleting that would result in the biggest savings.
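As a rough sketch of the last two points (not part of advertools): the helper name `shrink_crawl_df`, the sample rows, and the "@@" separator are all assumptions chosen for illustration. It drops body_text if present and turns list values into plain strings so the result can be written to .parquet.

```python
import pandas as pd


def shrink_crawl_df(df, drop=("body_text",), sep="@@"):
    """Drop bulky columns and turn list values into strings (hypothetical helper)."""
    # errors="ignore" skips names that are not present in this crawl.
    out = df.drop(columns=list(drop), errors="ignore")
    for col in out.columns:
        # Only object columns can hold Python lists.
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: sep.join(map(str, v)) if isinstance(v, list) else v
            )
    return out


# Made-up sample rows standing in for a real crawl file.
sample = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "h2": [["First", "Second"], None],
    "body_text": ["long text ...", "more long text ..."],
})
small = shrink_crawl_df(sample)
# small.to_parquet("output_file.parquet")  # needs pyarrow or fastparquet installed
```

The separator is arbitrary; pick any string that does not occur in your data so the lists can be split again later.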

Would that work?

4 replies

mrekh Apr 23, 2022
Author

Thanks for your response, Elias.

It can be helpful in some situations. In my case, I'm crawling websites with more than 100K pages.
This can produce files larger than 5GB, and opening them with pandas and manipulating their columns on a VPS becomes a challenge because pandas raises memory-related errors.
I was thinking of something like Screaming Frog's spider configurations, which I create once and reuse whenever I need them.

eliasdabbas Apr 23, 2022
Maintainer

I don't doubt that it's useful, and I'm sure not everyone wants all columns every time.

If the issue is mainly handling large files, you can try this approach:

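import pandas as pd

columns = ['url', 'title', 'h1', 'status']
df_list = []
# Read the file 1,000 rows at a time instead of loading it all at once.
for chunk in pd.read_json('crawl_file.jl', lines=True, chunksize=1000):
    df_list.append(chunk[columns])  # keep only the wanted columns

final_df = pd.concat(df_list)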

This way you read chunksize rows at a time, so memory consumption stays low.
In each iteration, chunk is a DataFrame with chunksize rows.
You then take a subset of the columns you want and append them to df_list.
Then you create final_df, with which you can do whatever you want, save it, etc. The original file can be discarded.

On my machine it took 15 seconds to do this for a 100k crawl file of 2GB.
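If even the reduced final_df is too large to hold in memory, one possible variation is to write each chunk straight to disk instead of collecting the chunks in a list. This is only a sketch: the function name `jl_to_csv` and the tiny sample file are assumptions for illustration. It uses `reindex` so that every chunk is written with the same columns in the same order.

```python
import os
import tempfile

import pandas as pd


def jl_to_csv(jl_path, csv_path, columns, chunksize=1000):
    """Copy selected columns from a JSON-lines file to CSV, one chunk at a time."""
    reader = pd.read_json(jl_path, lines=True, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # reindex keeps the column order fixed and fills absent columns with NaN.
        subset = chunk.reindex(columns=columns)
        # Write the header once, then append the remaining chunks.
        subset.to_csv(csv_path, mode="w" if i == 0 else "a",
                      header=(i == 0), index=False)


# Tiny made-up crawl file for demonstration.
tmp_dir = tempfile.mkdtemp()
jl_file = os.path.join(tmp_dir, "crawl_file.jl")
csv_file = os.path.join(tmp_dir, "crawl_file.csv")
with open(jl_file, "w") as f:
    f.write('{"url": "https://example.com/a", "title": "A", "status": 200, "body_text": "x"}\n')
    f.write('{"url": "https://example.com/b", "title": "B", "status": 404, "body_text": "y"}\n')

jl_to_csv(jl_file, csv_file, ["url", "title", "h1", "status"], chunksize=1)
result = pd.read_csv(csv_file)
```

Only one chunk is held in memory at any time, at the cost of producing a CSV rather than an in-memory DataFrame.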

If you have even more massive crawls, it's always good to use the JOBDIR custom setting, so you can pause, resume, and keep track of crawled pages. If you have limited storage, you can pause, run the above code, and resume again.

Please let me know if this works for you.

mrekh Apr 28, 2022
Author

In some situations, this didn't work for me. For example, consider a page that doesn't have an h1 tag: when I try to select that column from the chunk, I get an error.

eliasdabbas May 9, 2022
Maintainer

That won't be solved by selecting the columns you want, because many pages will not have them even if you ask for them.
The solution is to find a way to get the columns you want without raising an error when some of them are missing.

You can use the pandas filter method, which takes a regex and extracts the columns that match it. It won't throw an error if it doesn't find them.
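A small sketch of that idea, using two made-up DataFrames in place of real chunks (the second one has no h1 column). The regex is anchored with ^ and $ because filter matches anywhere in the column name, so an unanchored "title" would also pick up names such as "og:title".

```python
import pandas as pd

# Two made-up chunks; the second has no h1 column at all.
chunk_a = pd.DataFrame({
    "url": ["https://example.com/a"],
    "title": ["A"],
    "h1": ["Heading"],
    "size": [10],
})
chunk_b = pd.DataFrame({
    "url": ["https://example.com/b"],
    "title": ["B"],
    "size": [20],
})

# Anchored pattern: match these exact column names only.
pattern = r"^(url|title|h1|status)$"
# filter() returns whichever matching columns exist and ignores the rest.
parts = [chunk.filter(regex=pattern) for chunk in (chunk_a, chunk_b)]
# concat aligns on column names and fills the gaps with NaN.
final_df = pd.concat(parts, ignore_index=True)
```

In the chunked loop, this would mean replacing chunk[columns] with chunk.filter(regex=pattern).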
